# Introduction

This notebook lays out my plans for the air quality index (AQI) prediction service. With any coding project larger than a simple script, things can quickly get out of hand without careful planning before any code is written.

This project was born from an idea I found on [Twitter](https://datamachines.xyz/2022/11/22/build-a-prediction-service-with-machine-learning-step-by-step/) by Pau Labarta Bajo. At the link, he gives a short outline of the steps needed to complete the project. The steps are vague enough to force some serious learning on anyone unfamiliar with deploying machine learning models (me), but detailed enough to show the rough path to take. This is a welcome departure from tutorials where you mindlessly watch someone code project X and then claim you've learned how to do X.

First, a quick outline of the project. The air quality index is a measure of the levels of specific pollutants in the air. Below is the Common Air Quality Index (CAQI) table, copied from the Wikipedia page for [AQI](https://en.wikipedia.org/wiki/Air_quality_index#CAQI).

<table>
<tbody><tr>
<th>Qualitative name</th>
<th>Index or sub-index</th>
<th colspan="4">Pollutant (hourly) concentration
</th></tr>
<tr>
<th colspan="2"></th>
<th>NO<sub>2</sub> μg/m<sup>3</sup></th>
<th>PM<sub>10</sub> μg/m<sup>3</sup></th>
<th>O<sub>3</sub> μg/m<sup>3</sup></th>
<th>PM<sub>2.5</sub> (optional) μg/m<sup>3</sup>
</th></tr>
<tr>
<td>Very low</td>
<td style="background:#79bc6a;">0–25</td>
<td>0–50</td>
<td>0–25</td>
<td>0–60</td>
<td>0–15
</td></tr>
<tr>
<td>Low</td>
<td style="background:#bbcf4c;">25–50</td>
<td>50–100</td>
<td>25–50</td>
<td>60–120</td>
<td>15–30
</td></tr>
<tr>
<td>Medium</td>
<td style="background:#eec20b;">50–75</td>
<td>100–200</td>
<td>50–90</td>
<td>120–180</td>
<td>30–55
</td></tr>
<tr>
<td>High</td>
<td style="background:#f29305;">75–100</td>
<td>200–400</td>
<td>90–180</td>
<td>180–240</td>
<td>55–110
</td></tr>
<tr>
<td>Very high</td>
<td style="background:#e8416f;">&gt;100</td>
<td>&gt;400</td>
<td>&gt;180</td>
<td>&gt;240</td>
<td>&gt;110
</td></tr>
</tbody>
</table>
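
To make the table concrete, here is a minimal sketch that maps an hourly pollutant concentration to its qualitative band. The names `CAQI_UPPER_EDGES`, `CAQI_BANDS` and `caqi_band` are hypothetical, and since the table's ranges share their endpoints (0–50, 50–100), the sketch assumes a value exactly on an edge belongs to the lower band.

```python
from bisect import bisect_left

# Upper edges of the "Very low", "Low", "Medium" and "High" bands,
# copied from the table above (hourly concentrations in μg/m³).
CAQI_UPPER_EDGES = {
    "no2": [50, 100, 200, 400],
    "pm10": [25, 50, 90, 180],
    "o3": [60, 120, 180, 240],
    "pm2_5": [15, 30, 55, 110],
}
CAQI_BANDS = ["Very low", "Low", "Medium", "High", "Very high"]


def caqi_band(pollutant: str, concentration: float) -> str:
    """Return the qualitative band for one pollutant reading."""
    edges = CAQI_UPPER_EDGES[pollutant]
    # Assumption: a value equal to an edge falls in the lower band,
    # which is what bisect_left gives us.
    return CAQI_BANDS[bisect_left(edges, concentration)]


print(caqi_band("no2", 34.96))
print(caqi_band("pm10", 30.22))
```

With the thresholds above, an NO2 reading of 34.96 lands in "Very low" and a PM10 reading of 30.22 lands in "Low".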

The objective is to train a machine learning model on historical AQI data for a given city and then produce a 3-day hourly forecast of the AQI for the city. New data will be downloaded frequently and this data, along with the historical data and the forecast, will be presented in a plot in the online app. The model will be retrained once a week. 
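
As a rough illustration of what "3-day hourly forecast" means in practice, the sketch below builds the 72 timestamps the model would have to predict. The function name and the choice to start one hour after the last observation are assumptions, not part of the plan.

```python
import datetime as dt
from typing import List

FORECAST_HOURS = 3 * 24  # 3-day horizon at hourly resolution


def forecast_timestamps(last_observed: dt.datetime) -> List[dt.datetime]:
    """Return the hourly timestamps covered by the forecast.

    Assumption: the forecast starts one hour after the last observed
    reading, truncated to the top of the hour.
    """
    start = last_observed.replace(minute=0, second=0, microsecond=0)
    return [start + dt.timedelta(hours=h) for h in range(1, FORECAST_HOURS + 1)]


horizon = forecast_timestamps(dt.datetime(2023, 1, 10, 9, 0))
print(len(horizon), horizon[0], horizon[-1])
```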

## Technologies Used
---
1. Python
2. OpenWeatherMap API for air quality index data
3. Hopsworks as a feature store
4. Streamlit to build a simple data web app that shows the historical data and forecasted AQI

## Step 1: Feature Generation

The OpenWeatherMap API provides historical AQI data going back to November 27, 2020. This data will be downloaded via a GET request and processed by a Python script.

The script needs to decide automatically whether to download all of the historical data for a given city or only the new data.

The script needs to verify data integrity, process the downloaded data and then upload the model inputs and outputs to Hopsworks.
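
What "verifying data integrity" involves is not pinned down yet. One plausible reading, sketched below, is checking that the hourly readings contain no duplicated timestamps and no missing hours. The function name and the returned structure are hypothetical.

```python
import datetime as dt
from typing import Dict, List


def find_integrity_issues(timestamps: List[dt.datetime]) -> Dict[str, list]:
    """Report duplicated timestamps and missing hours in hourly readings."""
    ordered = sorted(timestamps)
    # Equal neighbours in the sorted list are duplicates.
    duplicates = sorted({b for a, b in zip(ordered, ordered[1:]) if a == b})

    unique = sorted(set(ordered))
    step = dt.timedelta(hours=1)
    missing = []
    for prev, nxt in zip(unique, unique[1:]):
        # Walk hour by hour across any gap between consecutive readings.
        current = prev + step
        while current < nxt:
            missing.append(current)
            current += step
    return {"duplicates": duplicates, "missing": missing}
```

An empty result for both keys would mean the download is contiguous. Range checks on the pollutant values could be added in the same style.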

This will all be accomplished by doing the following:
1. Have a function which determines the date range for which to download data.
    * It queries Hopsworks for the feature group corresponding to a location. If the feature group doesn't exist, all historical data must be downloaded, so the start date is November 27, 2020. Otherwise, only new data is needed.
    * Input: feature group name, in the format zipcode_cityname_AQI
    * Output: datetime object representing the start date for an AQI GET request
2. Have a function whose only task is to submit a GET request for AQI data within a given time period.
    * Input: start_date, end_date, latitude, longitude
    * Output: Pandas DataFrame of the AQI data
3. Have a function which takes in the raw data and generates features (lag, dayofweek, month, etc.). A separate Jupyter notebook will show the thought process behind feature generation.
4. Store the features in Hopsworks.
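
The decision in step 1 can be sketched without touching Hopsworks at all. In the sketch below the feature-store lookup is left out on purpose: `latest_stored` stands for the newest timestamp already stored, with `None` as a hypothetical convention for "the feature group does not exist yet". Treating the history start as midnight UTC is also an assumption.

```python
import datetime as dt
from typing import Optional

# First day with historical data, per the plan above.
# Assumption: midnight UTC on that day.
HISTORY_START = dt.datetime(2020, 11, 27, tzinfo=dt.timezone.utc)


def get_start_date(latest_stored: Optional[dt.datetime]) -> dt.datetime:
    """Return the start of the next download window.

    latest_stored is the newest timestamp already in the feature group,
    or None when the feature group does not exist yet.
    """
    if latest_stored is None:
        # Nothing stored yet: download the full history.
        return HISTORY_START
    # Readings are hourly, so resume one hour after the last stored row.
    return latest_stored + dt.timedelta(hours=1)
```

Keeping this logic separate from the Hopsworks query makes it easy to test with plain datetimes.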

I'm working with time-series data here, which is quite different from simple tabular data about customers or flattened images. Properties such as seasonality can be exploited for feature engineering. For the sake of this project I'm not going to go into much detail on this part; I'm mostly going to focus on getting everything working. Perhaps in a later version I'll analyze the data more thoroughly.
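
As a starting point, the calendar and lag features mentioned in step 3 could look something like the sketch below. The function name, the choice of lags (1 and 24 hours) and the `aqi_lag_N` column names are assumptions. Because the lags are taken with a row shift, the sketch also assumes the readings are hourly with no gaps.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, lags=(1, 24)) -> pd.DataFrame:
    """Add calendar and lag features to a frame with 'datetime' and 'aqi' columns."""
    # sort_values returns a new frame, so the caller's data is left untouched.
    out = df.sort_values("datetime").reset_index(drop=True)
    out["hour"] = out["datetime"].dt.hour
    out["dayofweek"] = out["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = out["datetime"].dt.month
    for lag in lags:
        # Row shift equals an hour shift only if the data is gap-free.
        out[f"aqi_lag_{lag}"] = out["aqi"].shift(lag)
    return out
```

The first `lag` rows of each lag column are NaN and would need to be dropped or imputed before training.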

In [1]:
import requests
import pandas as pd
import private  # local module that holds the API key
import datetime

start_date = datetime.datetime(2023, 1, 10, 0, 0, 0)
end_date = datetime.datetime.now() - datetime.timedelta(hours=2)  # subtract two hours to add a lag

# The API takes Unix timestamps; timestamp() treats these naive datetimes as local time
start_unix = int(start_date.timestamp())
end_unix = int(end_date.timestamp())

lat = '41.8798'
lon = '-87.6285'

aqi_url = 'https://api.openweathermap.org/data/2.5/air_pollution/history'
params = {'lat': lat, 'lon': lon, 'start': start_unix, 'end': end_unix, 'appid': private.MY_API_KEY}

aqi_response = requests.get(aqi_url, params=params, timeout=30)
aqi_response.raise_for_status()  # fail early on HTTP errors instead of parsing an error payload
aqi_resp_json = aqi_response.json()
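
One detail worth keeping in mind for the cell above: calling `timestamp()` on a naive `datetime` interprets it in the machine's local timezone, so the same script can request a slightly different window depending on where it runs. A possible alternative, sketched here with a hypothetical helper name, is to treat naive datetimes as UTC explicitly.

```python
import datetime as dt


def to_unix_utc(moment: dt.datetime) -> int:
    """Convert a datetime to Unix seconds.

    Assumption: naive datetimes are meant as UTC. Timezone-aware
    datetimes are converted according to their own offset.
    """
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


print(to_unix_utc(dt.datetime(2023, 1, 10)))  # 1673308800
```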

In [7]:
"""
Extract data from AQI response.
"""
coord = aqi_resp_json['coord']  # latitude and longitude from aqi response
dates = [datetime.datetime.fromtimestamp(x['dt']) for x in aqi_resp_json['list']]
aqis = [x['main']['aqi'] for x in aqi_resp_json['list']]
pollutants = [x['components'] for x in aqi_resp_json['list']]

In [15]:
data = pd.DataFrame(dates, columns=['datetime'])
data['date'] = data['datetime'].dt.date
data['lat'] = coord['lat']
data['lon'] = coord['lon']
data['id'] = data.index
data = pd.concat([data, pd.DataFrame(pollutants)], axis=1)
data['aqi'] = aqis

data

|   | datetime | date | lat | lon | id | co | no | no2 | o3 | so2 | pm2_5 | pm10 | nh3 | aqi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-10 00:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 0 | 353.81 | 0.03 | 34.96 | 40.77 | 8.94 | 19.19 | 23.61 | 0.51 | 2 |
| 1 | 2023-01-10 01:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 1 | 327.11 | 0.01 | 27.76 | 47.21 | 7.87 | 16.98 | 20.52 | 0.38 | 2 |
| 2 | 2023-01-10 02:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 2 | 310.42 | 0.01 | 22.96 | 50.07 | 6.97 | 14.67 | 16.95 | 0.34 | 2 |
| 3 | 2023-01-10 03:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 3 | 303.75 | 0.01 | 21.25 | 50.07 | 6.5 | 13.04 | 14.59 | 0.35 | 2 |
| 4 | 2023-01-10 04:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 4 | 300.41 | 0.01 | 21.08 | 49.35 | 6.26 | 11.8 | 13.16 | 0.38 | 2 |
| 5 | 2023-01-10 05:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 5 | 307.08 | 0.01 | 22.28 | 46.49 | 5.84 | 11.35 | 12.87 | 0.41 | 2 |
| 6 | 2023-01-10 06:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 6 | 320.44 | 0.02 | 25.71 | 40.77 | 5.36 | 11.71 | 13.68 | 0.46 | 2 |
| 7 | 2023-01-10 07:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 7 | 343.8 | 0.09 | 31.87 | 32.54 | 5.13 | 12.8 | 15.4 | 0.57 | 2 |
| 8 | 2023-01-10 08:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 8 | 433.92 | 1.66 | 50.04 | 13.23 | 5.31 | 17.51 | 22.02 | 1.17 | 2 |
| 9 | 2023-01-10 09:00:00 | 2023-01-10 | 41.8798 | -87.6285 | 9 | 547.41 | 13.08 | 61.69 | 1.13 | 5.78 | 23.07 | 30.22 | 1.79 | 2 |
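
Note that the `aqi` column in this output is OpenWeatherMap's own index, which runs from 1 to 5, not the 0 to 100+ CAQI scale shown in the introduction. A small lookup, sketched below with hypothetical names, turns it into the labels OpenWeatherMap documents for those values.

```python
# Labels as documented by OpenWeatherMap for its 1-5 air quality index.
OWM_AQI_LABELS = {1: "Good", 2: "Fair", 3: "Moderate", 4: "Poor", 5: "Very Poor"}


def aqi_label(aqi: int) -> str:
    """Translate OpenWeatherMap's 1-5 index into its qualitative label."""
    return OWM_AQI_LABELS[aqi]


print(aqi_label(2))
```

Every row shown above has an index of 2, which corresponds to "Fair".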
