# Data Collection

The data for this weather prediction model is obtained from the APIs provided by the National Environment Agency (NEA), available on <a href="https://data.gov.sg/dataset/realtime-weather-readings">data.gov.sg</a>.

The aim of this project is to create a model that predicts the occurrence of rain in a particular hour. To build the dataset that will be analyzed, an API call is made for each hour of the years 2017-2021. This range was chosen because the data for 2016 and 2022 are incomplete.

The available weather-related parameters are:<br><br>
    1. Temperature<br>
    2. Humidity<br>
    3. Wind Direction<br>
    4. Wind Speed<br>
    5. Rainfall<br>

## Step 1: Load the required libraries

The most important library for collecting the data is the <code>requests</code> package. To avoid hitting any API rate limits, we will also use the <code>sleep</code> function of the <code>time</code> module.<br>
Next, <code>pandas</code>, <code>numpy</code> and <code>datetime.datetime</code> will be used to organise the data into DataFrames and arrays and to work with dates.<br>
Lastly, we will connect to a MySQL database using <code>pymysql</code> to store the collected data, as the API calls may be spread over a few days.

In [1]:
import pandas as pd
import numpy as np
import requests
import pymysql
import time
from datetime import datetime

## Step 2: Initialize variables

Set up the SQL connection and initialize the range of datetimes to call. <br>
As NEA's API expects datetimes in the format "%Y-%m-%dT%H:%M:%S+08:00" (e.g. 2018-01-01T00:00:00+08:00), the range should be converted into strings of this format using <code>.strftime()</code>.<br>
Pass the argument <code>freq='60T'</code> to <code>pd.date_range()</code> to create one datetime for every hour.

In [None]:
_CONN = pymysql.connect(host='localhost',
                            user='root',
                            password='********',
                            db='weather')
cursor = _CONN.cursor()

runtimes=list(pd.date_range('2017-01-01 00:00:00',
                            '2021-12-31 23:59:59',
                            freq='60T').strftime('%Y-%m-%dT%H:%M:%S+08:00'))
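
As a quick cross-check on the size of this list, the sketch below rebuilds the same hourly strings using only <code>datetime</code> and <code>timedelta</code> instead of <code>pd.date_range()</code>. The helper name <code>hourly_runtimes</code> and the variable <code>check</code> are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import datetime, timedelta

API_FORMAT = "%Y-%m-%dT%H:%M:%S+08:00"

def hourly_runtimes(start, end):
    """Return every whole hour from start up to end (inclusive) as API-formatted strings."""
    out = []
    current = start
    while current <= end:
        out.append(current.strftime(API_FORMAT))
        current += timedelta(hours=1)
    return out

check = hourly_runtimes(datetime(2017, 1, 1, 0, 0, 0),
                        datetime(2021, 12, 31, 23, 59, 59))

# 1,826 days (2020 is a leap year) x 24 hours = 43,824 calls per variable.
print(len(check))
print(check[0], check[-1])
```

The count gives a rough idea of how long the collection takes: each of the five variables needs one call per entry in the list.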

## Step 3: Collect the data

Iterate over the list of datetimes, passing each datetime to the API call as a parameter dictionary<br><code>params={"date_time":runtime}</code>.<br>
Pause the process every 500 calls using <code>time.sleep()</code> to reduce the chance of running into any API rate limits.<br><br>

Using the example of calling for temperature data:<br><br>
First, call the API using <code>requests.get("the correct api url",params=params).json()</code>. This parses the JSON response into a Python dictionary.<br>
Next, initialize an empty DataFrame with the columns <code>["timestamp","station_id","latitude","longitude","reading"]</code> to hold the temporary data before it is written to the SQL database.<br>
For every hour called, loop through the dictionary and collect all available readings.<br>
The important pieces of information are the latitude and longitude of the station that took the reading, as well as the reading itself.<br>

Lastly, store the temporary DataFrame in the SQL database by executing the appropriate SQL <code>INSERT</code> statements.<br><br>
Do the same for the other variables (humidity, wind direction, wind speed and rainfall) to collect them as well. You may want to split the collection of data over a few days as it may take some time.
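
To make the shape of the response more concrete, here is a minimal sketch that flattens one response into rows. The <code>sample</code> payload is hypothetical and heavily trimmed (the station ids, coordinates and values are made up), and the helper name <code>readings_to_rows</code> is not part of the notebook. It also assumes that each station in the metadata stores its id under an <code>"id"</code> key, which lets readings be matched to stations by id rather than by list position.

```python
from datetime import datetime

# Hypothetical, trimmed-down payload; ids, coordinates and values are made up.
sample = {
    "metadata": {"stations": [
        {"id": "S1", "location": {"latitude": 1.30, "longitude": 103.80}},
        {"id": "S2", "location": {"latitude": 1.40, "longitude": 103.90}},
    ]},
    "items": [{"readings": [
        {"station_id": "S2", "value": 27.5},
        {"station_id": "S1"},  # a reading without a value is skipped
    ]}],
}

def readings_to_rows(payload, runtime):
    """Flatten one response into (timestamp, station_id, latitude, longitude, reading) tuples."""
    timestamp = datetime.strptime(runtime, "%Y-%m-%dT%H:%M:%S+08:00")
    # Assumed layout: each metadata entry has an "id" and a "location".
    locations = {s["id"]: s["location"] for s in payload["metadata"]["stations"]}
    rows = []
    for r in payload["items"][0]["readings"]:
        location = locations.get(r["station_id"])
        if "value" not in r or location is None:
            continue
        rows.append((timestamp, r["station_id"],
                     location["latitude"], location["longitude"], r["value"]))
    return rows

rows = readings_to_rows(sample, "2018-01-01T00:00:00+08:00")
print(rows)
```

Matching on the station id is a defensive choice: it still picks the right coordinates if a station appears in the metadata but reports no value for that hour.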

In [None]:
for i,runtime in enumerate(runtimes):
    print("Scraping data for {}".format(runtime))
    params={"date_time":runtime}
    # Pause for 60 seconds every 500 calls to stay clear of possible API limits.
    if i>0 and i%500==0:
        time.sleep(60)
    a=requests.get('https://api.data.gov.sg/v1/environment/air-temperature', params=params).json()
    temperature=pd.DataFrame(columns=["timestamp","station_id","latitude","longitude","reading"])
    # Skip hours for which the API returned no readings.
    if not a.get("items") or "readings" not in a["items"][0]:
        continue
    # Look up station locations by id, as the metadata order may differ from the readings order
    # (assumes each metadata entry stores its id under the "id" key).
    locations={s["id"]:s["location"] for s in a["metadata"]["stations"]}
    timestamp=datetime.strptime(runtime,"%Y-%m-%dT%H:%M:%S+08:00")
    for r in a["items"][0]["readings"]:
        if "value" not in r or r["station_id"] not in locations:
            continue
        # DataFrame.append() was removed in pandas 2.0, so add each row by label instead.
        temperature.loc[len(temperature)]=[timestamp,r["station_id"],
                                           locations[r["station_id"]]["latitude"],
                                           locations[r["station_id"]]["longitude"],
                                           r["value"]]
    for index,row in temperature.iterrows():
        cursor.execute("INSERT INTO temperature(timestamp,station_id,latitude,longitude,reading)VALUES(%s,%s,%s,%s,%s)",
                       (row["timestamp"],row["station_id"],row["latitude"],row["longitude"],row["reading"]))
    # pymysql does not autocommit by default, so commit each hour's rows explicitly.
    _CONN.commit()
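
One possible way to reuse the same loop for the other four variables is to keep the endpoint and table names in a single mapping. Only the <code>air-temperature</code> endpoint and the <code>temperature</code> table appear in this notebook; the other endpoint and table names below are assumptions and should be checked against the data.gov.sg documentation and your own database schema. The helper names are hypothetical.

```python
BASE_URL = "https://api.data.gov.sg/v1/environment/"

# Table name -> endpoint name. Everything except "temperature" is an assumption.
ENDPOINTS = {
    "temperature": "air-temperature",
    "humidity": "relative-humidity",
    "wind_direction": "wind-direction",
    "wind_speed": "wind-speed",
    "rainfall": "rainfall",
}

def endpoint_url(table):
    """Return the full API URL for the variable stored in the given table."""
    return BASE_URL + ENDPOINTS[table]

def insert_statement(table):
    """Build the parameterised INSERT for one table; only known table names are accepted."""
    if table not in ENDPOINTS:
        raise KeyError(table)
    return ("INSERT INTO {}(timestamp,station_id,latitude,longitude,reading)"
            " VALUES(%s,%s,%s,%s,%s)").format(table)

for table in ENDPOINTS:
    print(table, endpoint_url(table))
```

Restricting the table name to the keys of <code>ENDPOINTS</code> keeps the string formatting in <code>insert_statement</code> safe, while the row values themselves are still passed to <code>cursor.execute()</code> as parameters.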