# SIATA AUTOMATION DOWNLOAD SYSTEM (EXTRACT)
<b>By: <em>David Serna</em></b><br>
<b><em>Data Scientist and Data Engineer</em></b>

To learn more about the project, what SIATA is, and the motivation behind this repository, please visit the [main repository](https://github.com/dsernag/siata_automation).

Also, here is my LinkedIn profile: [<img width="25px" src="..\imgs\linkedIn_PNG32.png" alt="LinkedinLogo">](https://www.linkedin.com/in/dsernag/)

This is open source code, so you are free to use it, but remember to be ethical and, if possible, cite it.©

## Configure your environment

I highly recommend creating an isolated Python environment, for sanity, for reproducibility, and as a learning exercise. Windows users also need to install wget.

### Conda environment

I love [miniconda](https://docs.conda.io/en/latest/miniconda.html): it is light and portable, and it makes managing environments easy. If you don't install conda, please at least create an independent virtual environment in Python; here is the [documentation](https://docs.python.org/3/library/venv.html).

After installing miniconda, create an environment like this (the command is valid for any operating system):

```bash
conda create -n siata python=3.9 selenium ipykernel beautifulsoup4 pandas numpy matplotlib seaborn mysql-connector-python
```

There is also a `requirements.txt` file if you prefer to install the dependencies with pip.

### Install wget

Wget is really handy for downloading files in bulk. Because this exercise involves a lot of files, wget is our first choice. We could also use Python's `urllib` library, but with this many files (which we need to download quickly, because the login session doesn't last forever and is required for the download) wget is the better option 😉

So! Windows users must download this [installer](https://sourceforge.net/projects/gnuwin32/files/wget/1.11.4-1/wget-1.11.4-1-setup.exe/download?use_mirror=excellmedia). After installation, the wget `.exe` file is located by default in:

```
C:\Program Files (x86)\GnuWin32\bin
```

Copy this path and add it to your `PATH` environment variable (if you don't know how, please refer to Google 🤓).
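
Before going further, it can help to confirm that wget is really reachable from Python. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_wget` is my own and not part of the repository.

```python
import shutil


def find_wget():
    """Return the full path to the wget executable, or None if it is not on PATH."""
    return shutil.which("wget")


wget_path = find_wget()
if wget_path is None:
    print("wget was not found; check your PATH environment variable")
else:
    print(f"wget found at: {wget_path}")
```

If the message says wget was not found, restart your terminal (or Jupyter) after editing the environment variables, since running processes keep the old `PATH`.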

### Download a WebDriver (Chrome)

Downloading the open data from SIATA requires interacting with a web interface, so we need to install a WebDriver. For this exercise I use the stable version `105.0.5195.52`. [Here](https://chromedriver.storage.googleapis.com/index.html?path=105.0.5195.52%2F) you can download the driver for your specific OS.

You also need Chrome installed so the WebDriver can open it. Remember the location of the driver; we will need it later.
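
As a rule of thumb, the driver and the installed Chrome should share the same major version (the first number). The sketch below compares two version strings; the function names are my own, and the Chrome version shown is only a placeholder that you would replace with the one reported by your browser.

```python
def major_version(version):
    """Return the leading number of a dotted version string as an int."""
    return int(version.split(".")[0])


def versions_match(chrome_version, driver_version):
    """True when Chrome and the driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


driver_version = "105.0.5195.52"    # the driver version used in this notebook
chrome_version = "105.0.5195.102"   # placeholder: read yours from chrome://version

if versions_match(chrome_version, driver_version):
    print("Chrome and the driver look compatible")
else:
    print("Version mismatch: download the driver that matches your Chrome")
```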

### Katalon Recorder (Chrome extension)

To make the automation process easier, I relied on the [Katalon Recorder](https://chrome.google.com/webstore/detail/katalon-recorder-selenium/ljdobmomdgdljniojadhoplhkpialdid) extension for Google Chrome to obtain the XPaths and the sequence of clicks needed to reach them. Take a look at the tool: you can export the Python code and adjust it to your needs (as we did here).

Now we are ready to begin!🚀🚀🚀

<b>NOTE:</b> This notebook is not intended to be run all at once. It is meant to be executed one cell at a time, with supervision and an understanding of each step.


## Web Scraping to SIATA

### Initialize the chrome driver and login

In [1]:
# Import the libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import warnings
import time
import os
from datetime import datetime
warnings.filterwarnings('ignore')

In [2]:
# Create the Chrome object and open the login URL
# Selenium 4 style: the driver path goes in a Service object and the options in `options`
from selenium.webdriver.chrome.service import Service
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
web_page = webdriver.Chrome(service = Service(executable_path = "../drivers/chromedriver.exe"), options = chrome_options)
web_page.get('https://siata.gov.co/descarga_siata/index.php/index2/login')

In [4]:
# Log in
# The credentials are read from environment variables, so they never appear in the notebook
# If you don't know how to create environment variables, search for it online!
username = os.getenv("userSIATA")
password = os.getenv("passwdSIATA")
# Click the user and send the credentials
web_page.find_element(By.ID, "usuario").click()
web_page.find_element(By.ID, "usuario").clear()
web_page.find_element(By.ID, "usuario").send_keys(username)
# Click the password and send the credentials
web_page.find_element(By.ID, "contrasena").click()
web_page.find_element(By.ID, "contrasena").clear()
web_page.find_element(By.ID, "contrasena").send_keys(password)
# Submit the login form
web_page.find_element(By.ID, "login_form").click()
web_page.find_element(By.ID, "Ingresar").click()
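
If one of the environment variables is missing, `os.getenv` returns `None` and the login fails with a confusing error when the keys are sent. A small guard makes the problem obvious. This is a sketch with a helper name of my own (`require_env`); the variable `demoSIATA` exists only for the demonstration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demonstration with a throwaway variable (hypothetical name, not used by the notebook)
os.environ["demoSIATA"] = "example"
print(require_env("demoSIATA"))
```

In the login cell you could then write `username = require_env("userSIATA")` and `password = require_env("passwdSIATA")`.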

### Side bar menu

After logging in, there is a menu bar on the left side. This panel is the website's navigation bar, so we need to know how to automate access to it.

<img src="..\imgs\home.PNG" width="800" />

After analyzing the HTML structure, we see that the elements of the left menu belong to an unordered list, and each `li` is tagged with the "panel" class:

<img src="..\imgs\leftpanelhtml.PNG" width="600" />

Let's obtain the `id` of each `li`:

In [5]:
# Extract the ids of the left menu bar
home_html = BeautifulSoup(web_page.page_source, "html.parser")
left_menu = home_html.find_all('li',class_= "panel")
buttons = [button.get('id') for button in left_menu]
print(buttons)

['menu_radar', 'menu_estaciones', 'menu_hidro', 'menu_calaire', 'menu_calaire_anual', 'menu_acelero', 'menu_graficador', 'menu_info_radar', 'menu_info_estac', 'menu_info_pluviomet', 'menu_info_nivel', 'menu_info_aire', 'menu_contactenos']


### Select the resource you need

For this exercise we're going to download the PM 2.5 air quality data:

In [6]:
# Select air quality
select_button = buttons[3]
web_page.find_element(By.XPATH, f"//li[@id='{select_button}']").click()
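
Picking the button by position (`buttons[3]`) works as long as the menu order never changes. A slightly safer option is to ask for the `id` by name and fail early if it is missing. This is only a sketch: `menu_xpath` is a helper of my own, and the list is copied from the output above.

```python
buttons = ['menu_radar', 'menu_estaciones', 'menu_hidro', 'menu_calaire',
           'menu_calaire_anual', 'menu_acelero', 'menu_graficador',
           'menu_info_radar', 'menu_info_estac', 'menu_info_pluviomet',
           'menu_info_nivel', 'menu_info_aire', 'menu_contactenos']


def menu_xpath(buttons, menu_id):
    """Build the XPath of a left-menu entry, checking first that the id exists."""
    if menu_id not in buttons:
        raise ValueError(f"{menu_id!r} is not in the left menu: {buttons}")
    return f"//li[@id='{menu_id}']"


print(menu_xpath(buttons, "menu_calaire"))
```

The resulting string can be passed to `web_page.find_element(By.XPATH, ...)` exactly as in the cell above.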

### Required fields

To complete a download, we must provide a reason (more than 10 words), the time interval, the variable we want (PM 2.5 in this case), and the stations. This is what the interface looks like:

<img src="..\imgs\input_download.PNG" width="800px">

#### Dates and reason

In [7]:
# Send the reason and the date interval to the WebDriver
reason = os.getenv("reason")
start_date = "2000-01-01 00:00:00"
end_date = "2022-09-26 23:59:59"

# Send the reason
web_page.find_element(By.ID, "motivo_descarga").click()
web_page.find_element(By.ID, "motivo_descarga").clear()
web_page.find_element(By.ID, "motivo_descarga").send_keys(reason)


# Send the dates
web_page.find_element(By.ID, "datetimepicker").click()
web_page.find_element(By.ID, "datetimepicker").clear()
web_page.find_element(By.ID, "datetimepicker").send_keys(start_date)
web_page.find_element(By.ID, "datetimepicker2").click()
web_page.find_element(By.ID, "datetimepicker2").clear()
web_page.find_element(By.ID, "datetimepicker2").send_keys(end_date)
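
Since the form asks for a reason of more than 10 words and the dates follow a fixed pattern, it is cheap to check the inputs before sending them to the browser. The sketch below is my own helper (`check_request`), not part of the repository, and the sample reason is invented for the demonstration.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # same pattern as start_date and end_date above


def check_request(reason, start_date, end_date):
    """Return a list of problems found in the inputs (empty list means OK)."""
    problems = []
    if len(reason.split()) <= 10:
        problems.append("the reason needs more than 10 words")
    try:
        start = datetime.strptime(start_date, DATE_FORMAT)
        end = datetime.strptime(end_date, DATE_FORMAT)
    except ValueError:
        problems.append("dates must look like YYYY-MM-DD HH:MM:SS")
    else:
        if start >= end:
            problems.append("the start date must be before the end date")
    return problems


# Invented reason, only to exercise the check
sample_reason = ("Academic study of historical PM 2.5 concentrations "
                 "for a data engineering learning project")
print(check_request(sample_reason, "2000-01-01 00:00:00", "2022-09-26 23:59:59"))
```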

#### Stations and kind of variable

To scrape the available kinds of variables, we search the HTML source code for `input` tags whose `type` is "radio":

<img src="..\imgs\kindvariable.png" width="1000px">

These are the `value` attributes of the `input` tags:

In [10]:
# Extract the kinds of variable from the input tags of type radio
air_html = BeautifulSoup(web_page.page_source, "html.parser")
output_kind_variable = air_html.find_all('input', type = "radio")
kindvariable = [datos.get("value") for datos in output_kind_variable[:-2]]
kindvariable

['todas', 'pm25', 'pm10', 'no', 'no2', 'nox', 'ozono', 'co', 'so2']

However, the page is navigated through a number (the position of the `input`), so we create a small dictionary. The "todas" `input` starts at position 2 (if that is not your case, check the page structure and adjust the range). Let's find the number and click it!

In [11]:
# Identify the number associated with each kind of variable
dictio_kind_variable = {key:value for key, value in zip(kindvariable, range(2, len(kindvariable) + 2))}
print(dictio_kind_variable)

{'todas': 2, 'pm25': 3, 'pm10': 4, 'no': 5, 'no2': 6, 'nox': 7, 'ozono': 8, 'co': 9, 'so2': 10}
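
The same mapping can be written with `enumerate`, which makes the starting position explicit. This is just an equivalent sketch (the name `positions` is my own) that also shows the XPath produced for `pm25`, with the list copied from the output above.

```python
kindvariable = ['todas', 'pm25', 'pm10', 'no', 'no2', 'nox', 'ozono', 'co', 'so2']

# Position of each radio button's label inside the form, counting from 2
positions = {name: pos for pos, name in enumerate(kindvariable, start=2)}

# XPath of the label to click for PM 2.5
xpath = f"//form[@id='CalidadAire_form']/div[2]/div/label[{positions['pm25']}]"
print(positions)
print(xpath)
```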


In [12]:
# Click pm25
web_page.find_element(By.XPATH, f"//form[@id='CalidadAire_form']/div[2]/div/label[{dictio_kind_variable['pm25']}]").click()

Now, to select the stations, we need to identify how many stations there are:

<img src="..\imgs\stations.png" width="1000px">

We search for them by class name: every checkbox belongs to the class "select-all-class". We also need a number to click each element later. The "Seleccionar Todas" box sits in the first table row (`tr[1]`, since XPath counts from 1), so the stations start at row 2.

In [13]:
# Extract each station number and its row position in the table
stations_html = web_page.find_elements(By.CLASS_NAME, "select-all-class")
stations = [i.get_attribute('value') for i in stations_html]
dictio_stations = {key:value for key, value in zip(stations, range(2, len(stations) + 2))}
dictio_stations

{'12': 2,
 '28': 3,
 '38': 4,
 '44': 5,
 '48': 6,
 '69': 7,
 '78': 8,
 '79': 9,
 '80': 10,
 '81': 11,
 '82': 12,
 '83': 13,
 '84': 14,
 '85': 15,
 '86': 16,
 '87': 17,
 '88': 18,
 '90': 19,
 '94': 20}

Here we have two choices: to download the data from all stations, just click "Seleccionar Todas"; if you have a predefined set of stations to download, you need to provide the list of stations to click:

In [14]:
# Click station by station
# for element in dictio_stations.values():
#     web_page.find_element(By.XPATH, f"//tbody[@id='body_estaciones']/tr[{element}]/td/input").click()

# Click to select all stations
web_page.find_element(By.XPATH, "//tbody[@id='body_estaciones']/tr[1]/td/input").click()
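
For the second choice (a predefined set of stations), the XPaths can be built from the dictionary before clicking anything. The sketch below uses a shortened copy of `dictio_stations` and a helper name of my own (`station_xpaths`); the wanted stations are only an example.

```python
# Shortened copy of the dictionary shown above (station id -> table row)
dictio_stations = {'12': 2, '28': 3, '38': 4, '44': 5, '48': 6}


def station_xpaths(dictio_stations, wanted):
    """Return the checkbox XPath of each wanted station, in the given order."""
    unknown = [station for station in wanted if station not in dictio_stations]
    if unknown:
        raise KeyError(f"Unknown stations: {unknown}")
    return [
        f"//tbody[@id='body_estaciones']/tr[{dictio_stations[station]}]/td/input"
        for station in wanted
    ]


for xpath in station_xpaths(dictio_stations, ['28', '44']):
    print(xpath)
```

Each string can then be clicked with `web_page.find_element(By.XPATH, xpath).click()`, as in the commented loop above.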

### Query request

OK, by default the download is processed as CSV files, so we only need to click to make the request.

<img src="..\imgs\request.PNG" width="300px">

However, this is the tricky part. SIATA produces a single CSV file for each month of each station. In this example there is not much data, so the query will be fast. But if you want to download the full history of, e.g., precipitation, the request could take a very long time and never load the CSV download links (because precipitation has a much finer granularity, and there are more stations involved, with older records).

So be cautious in your automation process; it may be better to split the request into two or three steps 😊
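
One way to split a big request is to cut the date interval into consecutive windows and run the query once per window. This is a minimal sketch with a helper of my own (`split_range`); it only computes the windows and assumes the same date format used earlier.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def split_range(start, end, parts):
    """Split [start, end] into `parts` consecutive, non-overlapping windows."""
    total_seconds = int((end - start).total_seconds())
    step = total_seconds // parts
    windows = []
    for i in range(parts):
        window_start = start + timedelta(seconds=step * i)
        if i == parts - 1:
            window_end = end  # the last window absorbs any remainder
        else:
            window_end = start + timedelta(seconds=step * (i + 1) - 1)
        windows.append((window_start, window_end))
    return windows


start = datetime.strptime("2000-01-01 00:00:00", DATE_FORMAT)
end = datetime.strptime("2022-09-26 23:59:59", DATE_FORMAT)
for window_start, window_end in split_range(start, end, 3):
    print(window_start.strftime(DATE_FORMAT), "->", window_end.strftime(DATE_FORMAT))
```

Each pair of strings can be sent to `datetimepicker` and `datetimepicker2` in turn, downloading the files of one window before requesting the next.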

In [15]:
%%time
# Click to make the query
web_page.set_page_load_timeout(1000)
web_page.find_element(By.ID, "realizarConsulta").click()

CPU times: total: 46.9 ms
Wall time: 2.33 s
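
In case the download links are not yet on the page when the click returns, a small polling loop can wait for them instead of a fixed `time.sleep`. This is a generic sketch written with the standard library only; `wait_until` is my own helper, and the condition in the demo is a stand-in for the real check. Selenium also ships its own `WebDriverWait` for this purpose.

```python
import time


def wait_until(condition, timeout=60.0, poll=1.0):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met after {timeout} seconds")
        time.sleep(poll)


# Stand-in condition for the demo: it becomes true on the third check.
# In the notebook it could be (assumption, not tested against the live site):
#   lambda: web_page.find_elements(By.CSS_SELECTOR, "a.btn.btn-info")
attempts = []


def links_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(links_ready, timeout=5, poll=0.1)
print(f"Condition met after {len(attempts)} checks")
```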


### Obtain and download data

To obtain the links, we analyze the HTML content and find that every download button has the class "btn btn-info" and is an anchor tag:

<img src="..\imgs\download.png" width="800px">

So, let's obtain those links!

In [17]:
# Collect the href of every download button
html_download = BeautifulSoup(web_page.page_source, "html.parser")
download_anchors = html_download.find_all('a',class_= "btn btn-info")
download_list = [i.get('href') for i in download_anchors]
print(f"There are a total of : {len(download_list)} datasets to be downloaded")

There are a total of : 1448 datasets to be downloaded


We are going to use `wget` inside the Jupyter cell. It is more reliable and consistent than `urllib`. Let's do it!

In [18]:
init = datetime.now()
for i in download_list:
    !wget $i --no-check-certificate -P ../data/air -q -nv
end = datetime.now()

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc

In [19]:
files_downloaded = os.listdir("../data/air")
print(f"The download process took {((end-init).seconds)/60:.2f} minutes and downloaded {len(files_downloaded)} files")

The download process took 27.77 minutes and downloaded 1447 files
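
The query found 1448 links but the folder holds 1447 files, so it is worth checking which link is unaccounted for. The sketch below compares the links against the file names on disk. It assumes wget saved each file under the last segment of its URL path, which I have not verified for SIATA; the helper names and the example URLs are my own.

```python
import posixpath
from urllib.parse import urlparse


def expected_name(url):
    """File name wget would normally use: the last segment of the URL path."""
    return posixpath.basename(urlparse(url).path)


def missing_files(links, files_on_disk):
    """Return the links whose expected file name is not among the files on disk."""
    on_disk = set(files_on_disk)
    return [url for url in links if expected_name(url) not in on_disk]


# Invented data, only to show the behavior
example_links = ["https://example.org/data/station_12_2020_01.csv",
                 "https://example.org/data/station_12_2020_02.csv"]
example_files = ["station_12_2020_01.csv"]
print(missing_files(example_links, example_files))
```

In the notebook the call would be `missing_files(download_list, files_downloaded)`, and any link it returns can be downloaded again with the same `wget` command.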


That's it, folks! I hope it was useful. If you want to [continue](Transform.ipynb) with the Transform process, you are welcome to.
