## NAMRIA Topo Map (1:50,000) Scraper

### About
This notebook scrapes and downloads the 1:50,000 topographic maps available on the NAMRIA [website](http://www.namria.gov.ph/download.php). It also serves as a basic exercise in web scraping with Beautiful Soup and Python 3.

### RUNNING THE TOOL

_**REQUIREMENTS**_
* Python 3.6.1
* jupyter
* Beautiful Soup

They are also found in the **requirements.txt** file.

You can install these requirements using **pip**: run **pip install -r requirements.txt** (prefix the command with **sudo** if your setup requires elevated permissions).

_**PROCEDURE**_
1. Add the proxy in PROXY (if any).
2. Select the directory to save the scraped files to (SAVEDIR).
3. Add a QUERY list to limit the download (e.g. ["3343"]). Only files whose links contain at least one of the strings in QUERY will be downloaded.
4. Run All.
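
The QUERY matching in step 3 can be sketched as below. This is a minimal illustration that mirrors the substring check in `get_links`; the `filter_links` helper and the sample hrefs are hypothetical and only stand in for the links found on the NAMRIA index page.

```python
# Hypothetical hrefs for illustration only; the real values come from the index page.
sample_links = ["sheet_3343-I.html", "sheet_3343-II.html", "sheet_3444-IV.html"]


def filter_links(links, query):
    """Keep the links that contain at least one query item; an empty query keeps everything."""
    if len(query) > 0:
        return [l for l in links if any(q in l for q in query)]
    return links


only_3343 = filter_links(sample_links, ["3343"])            # the two 3343 sheets
two_terms = filter_links(sample_links, ["3343-II", "3444"])  # a link matches if ANY term is found
everything = filter_links(sample_links, [""])                # "" is a substring of every link
```

Because the check is a plain substring test, the default `QUERY = [""]` matches every link, which is why leaving it unchanged downloads all maps.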

### LICENSE
_Copyright (C) 2017 Ben Hur S. Pintor (bhs.pintor@gmail.com)_ [[website](https://benhur07b.github.io)]

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

In [None]:
from bs4 import BeautifulSoup
import requests
import shutil
import re
import sys
import os
from datetime import datetime

sys.dont_write_bytecode = True

BASE_URL = "http://www.namria.gov.ph/"
TOPO_URL = "http://www.namria.gov.ph/topo50Index.aspx/"

PROXY = ""  # insert proxy here

PROXIES = {"http": PROXY} if PROXY else None  # pass None to requests when no proxy is set

SAVEDIR = ""  # add directory to save scraped files here

QUERY = [""]  # add query items (strings to limit downloads) here, leave empty to download ALL data

if SAVEDIR:
    if not os.path.exists(SAVEDIR):
        os.makedirs(SAVEDIR)
        
    os.chdir(SAVEDIR)

In [None]:
def get_links(r, query):
    """Returns a list of links in a page provided for by a requests object r.
    
    Keyword arguments:
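    r -- the requests response object for the page
    query -- list of strings; only links containing at least one of them are returned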
    """
    
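    links = list()
    soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly instead of letting bs4 guess
    for link in soup.find_all('area'):
        href = link.get('href')
        if href:  # skip <area> tags without an href so the substring check below never sees None
            links.append(href)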
    
    if len(query) > 0:
        # keep only the links that contain at least one of the query items
        return [l for l in links if any(q in l for q in query)]
    else:
        return links

    
def save_to_file(url, proxies=None):
    """Saves the topo map image(s) found in the page at url to the current directory.
    
    Keyword arguments:
    url -- the url of the topo map page
    proxies -- the proxy settings, if any (default = None)
    """
    
    # proxies=None is the default in requests, so no branching is needed
    r = requests.get(url, proxies=proxies)
    
    soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly instead of letting bs4 guess
    
    for img in soup.find_all('img'):
        img_src = img.get('src')
        img_name = img_src.split("/")[-1]
        # stream the image so it can be copied to disk without loading it fully into memory
        img_topo = requests.get("{}{}".format(BASE_URL, img_src), stream=True, proxies=proxies)
                
        with open(img_name, 'wb') as outfile:
            shutil.copyfileobj(img_topo.raw, outfile)

        del img_topo
        print("{} saved.".format(img_name))

    
r = requests.get(TOPO_URL, proxies=PROXIES)
links = get_links(r, QUERY)
for link in links:
    # skip a link only when a local path with the same name as its href already exists
    if not os.path.exists(link):
        save_to_file("{}{}".format(BASE_URL, link), PROXIES)