# Project 2 report

# Goal of the project

* The **goal** of this project is for you to present a **coherent** (logical) and **well-coded** investigation into some aspects of the dataset.

* **Ideas**: You don't need to answer them all, and you are strongly encouraged to answer questions that are not listed.

* When presenting your investigation of each problem, be sure to give a **coherent** discussion of each task (using Markdown, maths as appropriate, and code cells).


# Project ideas

**Projects involving certain cities**


* Investigate the influence of a city's size (population density, GDP?, other measures?) on air pollution. We might have to use other APIs to get this information.
    * For different pollutants

* Investigate the influence of weather for different cities.


* Distance from the coast: does it affect air pollution?

* Tourism

* Green space: does it affect air pollution?


* Fitting a **linear regression model** that predicts the average pollution for a city based on (population, GDP, weather, location, ....). We could do this if the other two above do not take a long time.


* **Simple prediction model** to warn people about pollution the next day (Based on weather conditions tomorrow and the mean pollution for a certain month for a certain city)
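The linear regression idea in the list above could be sketched as follows. This is only a minimal illustration on made-up data: the variable names (`population`, `mean_temp`, `pm25`), the numbers, and the assumed relationship are all hypothetical and do not come from the project data.

```python
import numpy as np

# Synthetic, made-up data purely for illustration: one row per city.
rng = np.random.default_rng(0)
n_cities = 50
population = rng.uniform(0.1, 10.0, n_cities)  # millions (hypothetical)
mean_temp = rng.uniform(0.0, 30.0, n_cities)   # degrees Celsius (hypothetical)

# Assumed relationship, used only to generate a toy response variable.
pm25 = 5.0 + 2.0 * population + 0.3 * mean_temp + rng.normal(0.0, 0.5, n_cities)

# Design matrix with an intercept column, fitted by ordinary least squares.
X = np.column_stack([np.ones(n_cities), population, mean_temp])
beta, *_ = np.linalg.lstsq(X, pm25, rcond=None)

# Fitted values for the same cities.
predicted = X @ beta
print("intercept, population and temperature coefficients:", beta)
```

With real data, each row would be one city (or one city-day) and the columns would be whichever predictors we end up collecting.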

**Projects involving certain events**

* Wildfire 
    * Produce certain pollutants
    
**Other Projects**

* How are countries doing on their climate change goals?
    * Hypothesis
    * Visualize
        * Over time
        
* Rich vs poor
    * Do richer 


## Map

* Color coding map
    * color code how bad the pollution is
    
    
* Time lapse over a day


### Different kind of measures:

* **pm25**: Fine particulate matter with a diameter of 2.5 micrometres or less
* **pm10**: Particulate matter with a diameter of 10 micrometres or less
* **so2**: Sulfur dioxide
* **co**: Carbon monoxide
* **no2**: Nitrogen dioxide
* **o3**: Ozone
* **bc**: Black carbon

# Assignments

### Setup

* Write an introduction to our problem
* Description of the data we use (what data, API, etc.)
* Maths (regression, computing averages, etc.)

### Data

* Fetch data for cities
    * Aggregate daily/monthly data for measurements (pm25 for example)
        * Average, max, min
    * Aggregate daily/monthly data for weather?
        * Average temp, wind, max temp, min
    * Data on population, GDP, ......?
        * More fixed values, yearly might be enough
        
### Prediction, calculations  and plots

* Plots of time series
* Comparisons between cities
* ??


### Conclusions

* ??



# Data

We need the data for each city arranged so that row $i$ corresponds to day $i$ and column $j$ holds one measure. That is, our table will be:

Date, $X_1$, $X_2$, ..., $X_k$, where each $X_j$ is a vector of size $n \times 1$ and $n$ is the number of rows (days). $X_1$ could be a weather measure (average temperature, for example) or a pollution measure such as pm25.

We have to decide how to aggregate the data: do we take the average over each day, or always use the measurement at 12:00 local time?
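To make the two aggregation options concrete, here is a small sketch on made-up hourly readings (the values are simply 0, 1, 2, ... and are not real measurements). The column names `timestamp`, `value` and `Date` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up hourly readings covering three full days (not real measurements).
start = pd.Timestamp("2021-01-01")
timestamps = start + pd.to_timedelta(np.arange(72), unit="h")
hourly = pd.DataFrame({"timestamp": timestamps, "value": np.arange(72, dtype=float)})
hourly["Date"] = hourly["timestamp"].dt.strftime("%Y-%m-%d")

# Option 1: average over all readings of each day.
daily_mean = hourly.groupby("Date", as_index=False)["value"].mean()

# Option 2: keep only the reading taken at 12:00 local time.
at_noon = hourly["timestamp"].dt.hour == 12
noon = hourly.loc[at_noon, ["Date", "value"]].reset_index(drop=True)

print(daily_mean)
print(noon)
```

The daily mean uses every reading but is sensitive to gaps in the day, while the noon value is simple but ignores everything else that happened that day.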

# Summary from meeting 17. Nov


We are going to sample **daily** data for cities in the format:

$Date_i$, $X_1$, $X_2$, ..., $X_k$, where each $X_j$ is a vector of size $n \times 1$ and $n$ is the number of rows (days). $X_1$ could be a weather measure (average temperature, for example) or a pollution measure such as pm25.

The dataset could look like:

Date, MaxPM25, MinPM25, MaxPM10, MinPM10, max Temp, average Temp, average wind, maximum wind, population

dd/mm/yyyy - 20 - 10 - 60 - 9 - 5 - 3 - .......- 100.000
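One possible way to build the pollution columns of such a table from long-format readings is sketched below. The input frame is made up, and its column names (`Date`, `parameter`, `value`) are assumptions for the example; only the max and min columns are produced here.

```python
import pandas as pd

# Made-up long-format readings (hypothetical values, two readings per pollutant per day).
readings = pd.DataFrame({
    "Date": ["2021-01-01"] * 4 + ["2021-01-02"] * 4,
    "parameter": ["pm25", "pm25", "pm10", "pm10"] * 2,
    "value": [20.0, 10.0, 60.0, 9.0, 15.0, 5.0, 40.0, 30.0],
})

# Max and min per day and pollutant, then one column per (statistic, pollutant) pair.
stats = (
    readings.groupby(["Date", "parameter"])["value"]
    .agg(["max", "min"])
    .unstack("parameter")
)

# Flatten the two-level column index into names such as MaxPM25.
stats.columns = [f"{stat.capitalize()}{param.upper()}" for stat, param in stats.columns]
daily_table = stats.reset_index()
print(daily_table)
```

Weather and population columns would then be joined onto this table by date.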


## Two different sources

### Air quality (response variable)

Here we use the OpenAQ air quality API to fetch daily data for the air quality measures we want to look at.

We need to choose which measure to use and how to summarize that pollution over one day.

**Responsible**:
Ava, Aevar



### Effects (Prediction/causal variables)

Here we use the weather API from the workshop (5, 6 or 7). We need to summarize the weather for each day and decide which variables we want to keep.

The weather API has output for **daily** weather data, I think. The README file also recommends some APIs.

Then we need to fetch data on population, GDP, ....., whatever we think of.

**Responsible**:
Ayush, Baldur, Karla
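Since the response and the predictors come from two different sources, they will have to be joined on the date at some point. A minimal sketch, using two made-up frames whose column names (`pm25`, `avg_temp`, `avg_wind`) are assumptions:

```python
import pandas as pd

# Made-up daily air quality data (response variable).
air = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "pm25": [12.0, 18.5, 9.3],
})

# Made-up daily weather data (predictors); note that 2021-01-03 is missing.
weather = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-04"],
    "avg_temp": [3.0, 5.5, 1.0],
    "avg_wind": [4.2, 7.1, 2.8],
})

# Left join keeps every air quality day; the indicator column shows unmatched days.
combined = air.merge(weather, on="Date", how="left", indicator=True)
missing_weather = combined[combined["_merge"] == "left_only"]

print(combined)
print("Days without weather data:", missing_weather["Date"].tolist())
```

Both groups agreeing on one date format (here `YYYY-MM-DD` strings) keeps this join simple.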


### Cities

* London
* Reykjavik
* Mexico

But all the code should have a variable called City so it's easy to reproduce.
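One way to keep the city configurable is to put the processing steps in a function and set the city once at the top. The sketch below is an assumption about how this could look: `daily_mean` is a hypothetical helper, and a tiny made-up frame stands in for the API response so that the example runs without network access.

```python
import pandas as pd

# The only place where the city is set.
City = "Reykjavík"


def daily_mean(measurements: pd.DataFrame) -> pd.DataFrame:
    """Average `value` per calendar day and pollutant.

    Expects a datetime column 'Date.local' plus 'parameter' and 'value' columns.
    """
    out = measurements.copy()
    out["Date"] = out["Date.local"].dt.strftime("%Y-%m-%d")
    out["value"] = out["value"].astype(float)
    return out.groupby(["Date", "parameter"], as_index=False)["value"].mean()


# With the real API, the frame would come from something like:
#   df = api.measurements(city=City, df=True, limit=10000, parameter="pm25")
# Here a made-up frame stands in for it.
toy = pd.DataFrame({
    "Date.local": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 20:00", "2021-01-02 08:00"]
    ),
    "parameter": ["pm25"] * 3,
    "value": ["4", "6", "10"],
})

result = daily_mean(toy)
print(City)
print(result)
```

Changing `City` (and rerunning) would then reproduce the whole analysis for another city.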



In [2]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import openaq
import warnings

warnings.simplefilter('ignore')

%matplotlib inline

# Set major seaborn aesthetics
sns.set("notebook", style='ticks', font_scale=1.0)

# Increase the quality of inline plots
mpl.rcParams['figure.dpi']= 500

print("pandas v{}".format(pd.__version__))
print("matplotlib v{}".format(mpl.__version__))
print("seaborn v{}".format(sns.__version__))
print("openaq v{}".format(openaq.__version__))

pandas v1.1.3
matplotlib v3.3.1
seaborn v0.11.0
openaq v1.1.0


In [3]:
# Create the OpenAQ API client
# http://dhhagan.github.io/py-openaq/api.html
api = openaq.OpenAQ()

In [26]:
# List the cities in Mexico (country code 'MX') whose names start with 'M'

dt_cities = api.cities(df = True,limit=10000,country = 'MX')
#dt_cities
dt_cities = dt_cities[dt_cities['city'].str.startswith('M')]

dt_cities.head(100)

Unnamed: 0,country,name,city,count,locations
41,MX,Mérida,Mérida,90968789,1
42,MX,Mexicali,Mexicali,107690278,3
43,MX,MEXICO STATE,MEXICO STATE,1213888,14
44,MX,Miguel Hidalgo,Miguel Hidalgo,122439480,1
45,MX,Minatitlán,Minatitlán,124255314,1
46,MX,Monclova,Monclova,320523,1
47,MX,Monterrey,Monterrey,349419530,3
48,MX,Morelia,Morelia,131,1


In [6]:
# Fetch the measurement data. df = True (return the result directly as a data frame),
# limit of rows to return (Max value: 10000, default = 100)

#dt = pd.DataFrame.from_dict(resp['results'])
#dt.groupby(['date']).mean("value")
#dt[dt["value"] <  0]
#dt[dt["date.local"] == "2021-01-03"]
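The commented-out check for values below zero suggests that negative readings may need handling. A minimal sketch of such a filter on a made-up frame follows; whether negative values in the real data are placeholders for missing readings is an assumption that would need to be verified against the data source.

```python
import pandas as pd

# Made-up readings; the negative entries are invented for illustration.
dt = pd.DataFrame({
    "date.local": ["2021-01-03", "2021-01-03", "2021-01-04", "2021-01-04"],
    "value": [5.2, -999.0, 8.1, -1.0],
})

# Inspect the suspicious rows first, then keep only non-negative readings.
negative = dt[dt["value"] < 0]
clean = dt[dt["value"] >= 0].reset_index(drop=True)

print("Dropped {} of {} rows".format(len(negative), len(dt)))
print(clean)
```

Reporting how many rows were dropped makes it easier to judge whether the filter distorts the daily averages.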

In [34]:
#pd.DataFrame.from_dict(resp['results']).groupby(['date']).count()
#dt = pd.DataFrame.from_dict(resp['results'])

#dt['Date'] = pd.to_datetime(dt['date.local']).dt.date
# Fetch pm25 measurements for Reykjavík as a data frame
df = api.measurements(city='Reykjavík', df=True, limit=55000, parameter="pm25")

# Name the index and move it into a regular column
df.index.name = 'Date.local'
df.reset_index(inplace=True)

# Calendar day as a string, used as the grouping key
df['Date'] = df['Date.local'].dt.strftime('%Y-%m-%d')
df['value'] = df['value'].astype(float, errors='raise')

# Daily mean of the measured values
Result = df.groupby(['Date'], as_index=False)['value'].mean()
Result.head(1000)

Unnamed: 0,Date,value
0,2020-02-28,5.228091
1,2020-02-29,30.230005
2,2020-03-01,313.376608
3,2020-03-02,230.364042
4,2020-03-03,8.264714
...,...,...
549,2021-09-26,11.607143
550,2021-09-27,19.107143
551,2021-09-28,2.940845
552,2021-09-29,4.497917


In [38]:
# Fetch pm10 measurements for MEXICO STATE within a fixed date range
df = api.measurements(city='MEXICO STATE', df=True, limit=55000, parameter="pm10",
                      date_from='2020-06-01', date_to='2021-10-01')

# Name the index and move it into a regular column
df.index.name = 'Date.local'
df.reset_index(inplace=True)

# Calendar day as a string, used as the grouping key
df['Date'] = df['Date.local'].dt.strftime('%Y-%m-%d')
df['value'] = df['value'].astype(float, errors='raise')

# Daily mean of the measured values
Result = df.groupby(['Date'], as_index=False)['value'].mean()
Result.head(1000)

Unnamed: 0,Date,value
0,2020-05-31,36.785714
1,2020-06-01,32.640288
2,2020-06-02,34.588235
3,2020-06-03,18.575472
4,2020-06-04,32.129412
...,...,...
304,2021-09-26,33.456522
305,2021-09-27,27.280899
306,2021-09-28,35.434146
307,2021-09-29,41.270936


In [61]:
# Specifying parameters
city = 'MEXICO STATE'
parameter = ["pm25", "pm10"]
date_from = '2020-06-01'
date_to = '2021-10-01'

df = api.measurements(city = city, df = True, limit = 55000, parameter = parameter,
                      date_from = date_from, date_to = date_to)

df.index.name = 'Date.local'
df.reset_index(inplace=True)
df['Date'] = df['Date.local'].dt.strftime('%Y-%m-%d')
df['value'] = df['value'].astype(float, errors = 'raise')



Result = df.groupby(['Date', 'parameter'],as_index=False)['value'].mean()
Result.head(1000)

ResultWide = Result.pivot_table(index='Date',columns='parameter', values='value')
ResultWide.head(1000)
#Result.to_csv(city + '.csv')

Date,pm10,pm25
2020-10-05,45.333333,19.062500
2020-10-06,40.351145,19.365854
2020-10-07,44.153153,19.342857
2020-10-08,40.590909,17.827586
2020-10-09,38.810606,18.533333
...,...,...
2021-09-26,33.456522,20.345679
2021-09-27,27.280899,10.531646
2021-09-28,35.434146,16.819149
2021-09-29,41.270936,23.946237
