<a href="https://colab.research.google.com/github/a-fdezv/Machine-Learning-for-Time-Series-with-Python/blob/main/BigQuery_Job_pull_EPA_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
from google.colab import files

project = 'eco-watch-369004' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()

## Reference SQL syntax from the original job
Use the ```jobs.query```
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL syntax from the job. This can be copied from the output cell
below to edit the query now or in the future. Alternatively, you can use
[this link](https://console.cloud.google.com/bigquery?j=eco-watch-369004:US:bquxjob_54035ec_1849ca481e8)
back to BigQuery to edit the query within the BigQuery user interface.

In [5]:
# Running this code will display the query used to generate your previous job

job = client.get_job('bquxjob_54035ec_1849ca481e8') # Job ID inserted based on the query results selected to explore
print(job.query)

SELECT
  date_local,
  county_code,
  site_num,
  latitude,
  longitude,
  parameter_name,
  arithmetic_mean,
  aqi
FROM
  `bigquery-public-data.epa_historical_air_quality.pm25_frm_daily_summary`
WHERE
  state_name = "Washington"
  AND EXTRACT(YEAR FROM date_local) > 2015
UNION ALL
SELECT
  date_local,
  county_code,
  site_num,
  latitude,
  longitude,
  parameter_name,
  arithmetic_mean,
  aqi
FROM
  `bigquery-public-data.epa_historical_air_quality.no2_daily_summary`
WHERE
  state_name = "Washington"
  AND EXTRACT(YEAR FROM date_local) > 2015
UNION ALL
SELECT
  date_local,
  county_code,
  site_num,
  latitude,
  longitude,
  parameter_name,
  arithmetic_mean,
  aqi
FROM
  `bigquery-public-data.epa_historical_air_quality.co_daily_summary`
WHERE
  state_name = "Washington"
  AND EXTRACT(YEAR FROM date_local) > 2015
UNION ALL
SELECT
  date_local,
  county_code,
  site_num,
  latitude,
  longitude,
  parameter_name,
  arithmetic_mean,
  aqi
FROM
  `bigquery-public-data.epa_historical_ai

# Result set loaded from BigQuery job as a DataFrame
Query results are referenced by the ID of the job that was run in BigQuery, so
the query does not need to be re-run to explore results. The ```to_dataframe```
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a Pandas DataFrame by using the BigQuery Storage API.

To edit the query syntax, use the BigQuery SQL editor or the
```Optional:``` sections below.

In [None]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_54035ec_1849ca481e8') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results

In [4]:
# Columns that together identify one measurement: site, day, location and parameter
key_cols = ['county_code','site_num','date_local','latitude','longitude','parameter_name']
# Among rows that share a key, drop those without an AQI value
results_duplicated = results[results.duplicated(key_cols, keep=False)]
drop_indexes = results_duplicated[results_duplicated.aqi.isnull()]
vals = results.drop(list(drop_indexes.index))
# Of the duplicates that remain, keep only the first occurrence
vals_duplicated_index = vals[vals.duplicated(key_cols, keep='first')].index
data = vals.drop(list(vals_duplicated_index))
data



Unnamed: 0,date_local,county_code,site_num,latitude,longitude,parameter_name,arithmetic_mean,aqi
0,2016-01-01,005,0002,46.21835,-119.204153,Wind Speed - Resultant,3.875000,
1,2016-01-01,005,0002,46.21835,-119.204153,Wind Direction - Resultant,217.000000,
2,2016-01-01,007,0011,47.43061,-120.341950,Wind Speed - Resultant,1.075000,
3,2016-01-01,007,0011,47.43061,-120.341950,Wind Direction - Resultant,184.625000,
4,2016-01-01,009,0013,48.29786,-124.624910,Wind Speed - Resultant,12.675000,
...,...,...,...,...,...,...,...,...
246558,2022-02-24,013,9991,46.20260,-117.953900,Ozone,0.033941,37
246559,2022-02-25,013,9991,46.20260,-117.953900,Ozone,0.034941,41
246560,2022-02-26,013,9991,46.20260,-117.953900,Ozone,0.026294,34
246561,2022-02-27,013,9991,46.20260,-117.953900,Ozone,0.024294,31
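
The de-duplication step above can be checked on a small toy frame. The sketch below is only an illustration: the function name `drop_duplicate_measurements`, the reduced key (latitude and longitude are left out for brevity) and the toy values are assumptions, not part of the original job.

```python
import pandas as pd

# Reduced key for the illustration; the notebook also includes latitude/longitude
KEY_COLS = ['county_code', 'site_num', 'date_local', 'parameter_name']

def drop_duplicate_measurements(frame, key_cols=KEY_COLS):
    """Apply the same two-pass logic as the cell above (hypothetical helper)."""
    # Pass 1: among rows sharing a key, drop the ones without an AQI value
    dup = frame[frame.duplicated(key_cols, keep=False)]
    no_aqi = dup[dup['aqi'].isnull()]
    kept = frame.drop(no_aqi.index)
    # Pass 2: of the duplicates that remain, keep the first occurrence
    return kept.drop(kept[kept.duplicated(key_cols, keep='first')].index)

toy = pd.DataFrame({
    'county_code': ['033', '033', '033', '005', '007', '007'],
    'site_num': ['0080', '0080', '0080', '0002', '0011', '0011'],
    'date_local': ['2016-01-01'] * 6,
    'parameter_name': ['Ozone', 'Ozone', 'Ozone',
                       'Wind Speed - Resultant',
                       'Wind Speed - Resultant', 'Wind Speed - Resultant'],
    'arithmetic_mean': [0.011, 0.011, 0.012, 3.875, 1.075, 1.100],
    'aqi': [None, 15, 15, None, None, None],
})

deduped = drop_duplicate_measurements(toy)
print(deduped)
```

In this toy frame only rows 1 and 3 survive. A key whose duplicates all lack an AQI value (rows 4 and 5) loses every row under this logic, which may be worth checking for the meteorological parameters.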


## Show descriptive statistics using describe()
Use the ```pandas DataFrame.describe()```
[method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
to generate descriptive statistics. Descriptive statistics include those that
summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding ```NaN``` values. You may also use other Python methods
to interact with your data.

In [5]:
data.describe()

Unnamed: 0,latitude,longitude,arithmetic_mean,aqi
count,195038.0,195038.0,195038.0,79333.0
mean,47.447662,-121.27968,136.493942,25.691528
std,0.805811,1.891789,1492.456699,22.068051
min,45.616667,-124.62491,-3.0,0.0
25%,47.1411,-122.46256,3.3375,12.0
50%,47.568236,-122.2144,12.8,22.0
75%,48.2469,-120.095556,66.666667,34.0
max,48.95074,-117.257654,35159.283333,908.0


In [6]:
import numpy as np
import pandas as pd
data['AQSID']=data['county_code'].astype(str) + data['site_num'].astype(str)
data.drop(['county_code','site_num'], axis=1, inplace=True)

In [7]:
data.parameter_name.unique()

array(['Wind Speed - Resultant', 'Wind Direction - Resultant',
       'PM10 Total 0-10um STP', 'Outdoor Temperature',
       'Relative Humidity ', 'Carbon monoxide',
       'PM2.5 - Local Conditions', 'Ozone', 'Barometric pressure',
       'Nitrogen dioxide (NO2)'], dtype=object)

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195038 entries, 0 to 246562
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   date_local       195038 non-null  dbdate 
 1   latitude         195038 non-null  float64
 2   longitude        195038 non-null  float64
 3   parameter_name   195038 non-null  object 
 4   arithmetic_mean  195038 non-null  float64
 5   aqi              79333 non-null   Int64  
 6   AQSID            195038 non-null  object 
dtypes: Int64(1), dbdate(1), float64(3), object(2)
memory usage: 12.1+ MB


In [9]:
aqsid_pm25=data[data['parameter_name']=='PM2.5 - Local Conditions']['AQSID']
aqsid_pm25=aqsid_pm25.unique()

In [10]:
# Keep every row (all parameters) that belongs to a site reporting PM2.5
df = data[data['AQSID'].isin(aqsid_pm25)]
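
The site filter keeps all parameters for the sites that report PM2.5 and discards the other sites entirely. A minimal sketch on made-up rows, using `Series.isin` as a vectorized membership test (the toy frame and the variable names are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'AQSID': ['0330080', '0330080', '0050002', '0770009'],
    'parameter_name': ['PM2.5 - Local Conditions', 'Ozone',
                       'Wind Speed - Resultant', 'PM2.5 - Local Conditions'],
})

# Sites that have at least one PM2.5 record
pm25_sites = toy.loc[toy['parameter_name'] == 'PM2.5 - Local Conditions', 'AQSID'].unique()

# Keep all rows (any parameter) from those sites
filtered = toy[toy['AQSID'].isin(pm25_sites)]
print(filtered)
```

Site `0050002` has no PM2.5 record in the toy data, so its wind row is dropped, while the ozone row of `0330080` is kept.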

In [11]:
df



Unnamed: 0,date_local,latitude,longitude,parameter_name,arithmetic_mean,aqi,AQSID
2,2016-01-01,47.430610,-120.341950,Wind Speed - Resultant,1.075000,,0070011
3,2016-01-01,47.430610,-120.341950,Wind Direction - Resultant,184.625000,,0070011
12,2016-01-01,47.597222,-122.319722,Wind Speed - Resultant,1.575000,,0330030
13,2016-01-01,47.597222,-122.319722,Wind Direction - Resultant,107.000000,,0330030
14,2016-01-01,47.568236,-122.308628,Wind Speed - Resultant,1.816667,,0330080
...,...,...,...,...,...,...,...
246526,2022-01-31,47.568236,-122.308628,Carbon monoxide,0.129167,2,0330080
246530,2022-01-31,47.597222,-122.319722,Carbon monoxide,0.379167,6,0330030
246532,2022-01-31,47.568236,-122.308628,Barometric pressure,1009.875000,,0330080
246534,2022-02-01,47.568236,-122.308628,Carbon monoxide,0.200000,2,0330080


In [71]:
df_index=df.set_index(['date_local','latitude','longitude','AQSID', 'parameter_name'])
df_unstack = df_index.unstack('parameter_name')

In [72]:
df_unstack

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,aqi,aqi,aqi,aqi,aqi,aqi,aqi,aqi,aqi,aqi
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,parameter_name,Barometric pressure,Carbon monoxide,Nitrogen dioxide (NO2),Outdoor Temperature,Ozone,PM10 Total 0-10um STP,PM2.5 - Local Conditions,Relative Humidity,Wind Direction - Resultant,Wind Speed - Resultant,Barometric pressure,Carbon monoxide,Nitrogen dioxide (NO2),Outdoor Temperature,Ozone,PM10 Total 0-10um STP,PM2.5 - Local Conditions,Relative Humidity,Wind Direction - Resultant,Wind Speed - Resultant
date_local,latitude,longitude,AQSID,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2
2016-01-01,45.643360,-122.587370,0110024,,,,,,,3.1,,,,,,,,,,13,,,
2016-01-01,46.319320,-119.999677,0770005,,,,,,,17.0,,,,,,,,,,61,,,
2016-01-01,46.380240,-120.332660,0770015,,,,14.041667,,,24.3,,214.208333,1.241667,,,,,,,77,,,
2016-01-01,46.598056,-120.499167,0770009,,,,,,14.0,15.8,,,,,,,,,13,59,,,
2016-01-01,47.186400,-122.451700,0530029,,,,,,,60.7,,,,,,,,,,154,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-01-31,48.399990,-119.518960,0470013,,,,23.458333,,,11.7,,171.916667,3.283333,,,,,,,49,,,
2022-01-31,48.520590,-122.614280,0570011,,,,,,,3.9,,,,,,,,,,16,,,
2022-01-31,48.544448,-117.903425,0650005,,,,29.416667,,17.0,10.4,,168.250000,1.404167,,,,,,16,43,,,
2022-02-01,47.568236,-122.308628,0330080,,0.2,,,,,,,,,,2,,,,,,,,


In [73]:
df_unstack.columns  = df_unstack.columns.map('_'.join).str.strip('_')
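
The long-to-wide reshape and the column flattening above can be reproduced on a few toy rows. This is a sketch with invented values and a shortened index (latitude and longitude omitted); it only illustrates the mechanics of `unstack` followed by joining the two column levels.

```python
import pandas as pd

long_df = pd.DataFrame({
    'date_local': ['2016-01-01', '2016-01-01', '2016-01-02'],
    'AQSID': ['0330080', '0330080', '0330080'],
    'parameter_name': ['Ozone', 'Carbon monoxide', 'Ozone'],
    'arithmetic_mean': [0.011, 0.356, 0.020],
    'aqi': [15, 5, 19],
})

# One row per (date, site); one column per (value column, parameter)
wide = (long_df
        .set_index(['date_local', 'AQSID', 'parameter_name'])
        .unstack('parameter_name'))

# Flatten the two-level column index into names like 'aqi_Ozone'
wide.columns = wide.columns.map('_'.join).str.strip('_')
print(wide)
```

Combinations that were never measured, such as carbon monoxide on 2016-01-02 here, become `NaN` in the wide table. `unstack` raises a `ValueError` when the index contains duplicate entries, which is why the duplicates had to be removed beforehand.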

In [74]:
df_unstack

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,arithmetic_mean_Barometric pressure,arithmetic_mean_Carbon monoxide,arithmetic_mean_Nitrogen dioxide (NO2),arithmetic_mean_Outdoor Temperature,arithmetic_mean_Ozone,arithmetic_mean_PM10 Total 0-10um STP,arithmetic_mean_PM2.5 - Local Conditions,arithmetic_mean_Relative Humidity,arithmetic_mean_Wind Direction - Resultant,arithmetic_mean_Wind Speed - Resultant,aqi_Barometric pressure,aqi_Carbon monoxide,aqi_Nitrogen dioxide (NO2),aqi_Outdoor Temperature,aqi_Ozone,aqi_PM10 Total 0-10um STP,aqi_PM2.5 - Local Conditions,aqi_Relative Humidity,aqi_Wind Direction - Resultant,aqi_Wind Speed - Resultant
date_local,latitude,longitude,AQSID,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
2016-01-01,45.643360,-122.587370,0110024,,,,,,,3.1,,,,,,,,,,13,,,
2016-01-01,46.319320,-119.999677,0770005,,,,,,,17.0,,,,,,,,,,61,,,
2016-01-01,46.380240,-120.332660,0770015,,,,14.041667,,,24.3,,214.208333,1.241667,,,,,,,77,,,
2016-01-01,46.598056,-120.499167,0770009,,,,,,14.0,15.8,,,,,,,,,13,59,,,
2016-01-01,47.186400,-122.451700,0530029,,,,,,,60.7,,,,,,,,,,154,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-01-31,48.399990,-119.518960,0470013,,,,23.458333,,,11.7,,171.916667,3.283333,,,,,,,49,,,
2022-01-31,48.520590,-122.614280,0570011,,,,,,,3.9,,,,,,,,,,16,,,
2022-01-31,48.544448,-117.903425,0650005,,,,29.416667,,17.0,10.4,,168.250000,1.404167,,,,,,16,43,,,
2022-02-01,47.568236,-122.308628,0330080,,0.2,,,,,,,,,,2,,,,,,,,


Drop the AQI columns of the meteorological parameters, which carry no AQI values, and rename the remaining columns to short names.

In [76]:
df_unstack.drop(['aqi_Barometric pressure','aqi_Outdoor Temperature','aqi_Relative Humidity ','aqi_Wind Direction - Resultant','aqi_Wind Speed - Resultant'], axis=1,inplace=True)
df_unstack.columns
col={'arithmetic_mean_Barometric pressure': 'Psfc', 'arithmetic_mean_Carbon monoxide': 'CO','arithmetic_mean_Nitrogen dioxide (NO2)': 'NO2','arithmetic_mean_PM10 Total 0-10um STP': 'PM10','arithmetic_mean_PM2.5 - Local Conditions': 'PM25','arithmetic_mean_Outdoor Temperature': 'Tmp','arithmetic_mean_Ozone': 'O3','arithmetic_mean_Relative Humidity ': 'RH','arithmetic_mean_Wind Direction - Resultant': 'wd','arithmetic_mean_Wind Speed - Resultant': 'wspd','aqi_Carbon monoxide':'AQI_CO','aqi_Nitrogen dioxide (NO2)':'AQI_NO2','aqi_Ozone':'AQI_O3','aqi_PM10 Total 0-10um STP':'AQI_PM10','aqi_PM2.5 - Local Conditions':'AQI_PM25'}
df_unstack.rename(columns=col,inplace=True)

In [88]:
df_nona=df_unstack.dropna(thresh=3) # keep only rows with at least 3 values that are **not** NaN (i.e. at least one more feature besides PM25 and AQI_PM25)
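
A small sketch of how `thresh` behaves, on invented values. It counts non-NaN cells across all columns, so a row can pass the threshold without having a PM25 value at all; the extra `subset` filter at the end is an assumption about what one might want, not something the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PM25':     [3.1, 24.3, np.nan, np.nan],
    'AQI_PM25': [13, 77, np.nan, np.nan],
    'Tmp':      [np.nan, 14.0, 50.0, 50.0],
    'wspd':     [np.nan, 1.2, np.nan, 2.0],
    'wd':       [np.nan, 214.0, np.nan, 180.0],
})

# Keep rows with at least 3 non-NaN values, whichever columns they are in
kept = toy.dropna(thresh=3)

# Optional stricter variant: additionally require the target itself
kept_with_target = kept.dropna(subset=['PM25'])
print(kept)
print(kept_with_target)
```

Row 3 has three meteorological values but no PM25, so it survives `thresh=3` and is only removed by the `subset` filter.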

In [89]:
df_nona

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Psfc,CO,NO2,Tmp,O3,PM10,PM25,RH,wd,wspd,AQI_CO,AQI_NO2,AQI_O3,AQI_PM10,AQI_PM25
date_local,latitude,longitude,AQSID,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2016-01-01,46.380240,-120.332660,0770015,,,,14.041667,,,24.3,,214.208333,1.241667,,,,,77
2016-01-01,46.598056,-120.499167,0770009,,,,,,14.0,15.8,,,,,,,13,59
2016-01-01,47.430610,-120.341950,0070011,,,,15.875000,,,8.9,,184.625000,1.075000,,,,,37
2016-01-01,47.568236,-122.308628,0330080,1015.583333,0.355556,23.395833,34.666667,0.011471,,11.3,63.083333,213.583333,1.816667,5,37,15,,47
2016-01-01,47.597222,-122.319722,0330030,,0.984211,33.529167,35.750000,,,16.4,,107.000000,1.575000,13,45,,,60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-01-31,47.568236,-122.308628,0330080,1009.875000,0.129167,8.086957,41.041667,0.025417,,3.2,87.000000,187.333333,2.554167,2,22,30,,13
2022-01-31,47.597222,-122.319722,0330030,,0.379167,19.773913,41.958333,,,5.3,,189.208333,2.575000,6,27,,,22
2022-01-31,47.663962,-117.257654,0630017,,,,,,10.0,7.8,,,,,,,9,33
2022-01-31,48.399990,-119.518960,0470013,,,,23.458333,,,11.7,,171.916667,3.283333,,,,,49


In [90]:
df_data = df_nona.reset_index() #reset the index

In [91]:
df_data



Unnamed: 0,date_local,latitude,longitude,AQSID,Psfc,CO,NO2,Tmp,O3,PM10,PM25,RH,wd,wspd,AQI_CO,AQI_NO2,AQI_O3,AQI_PM10,AQI_PM25
0,2016-01-01,46.380240,-120.332660,0770015,,,,14.041667,,,24.3,,214.208333,1.241667,,,,,77
1,2016-01-01,46.598056,-120.499167,0770009,,,,,,14.0,15.8,,,,,,,13,59
2,2016-01-01,47.430610,-120.341950,0070011,,,,15.875000,,,8.9,,184.625000,1.075000,,,,,37
3,2016-01-01,47.568236,-122.308628,0330080,1015.583333,0.355556,23.395833,34.666667,0.011471,,11.3,63.083333,213.583333,1.816667,5,37,15,,47
4,2016-01-01,47.597222,-122.319722,0330030,,0.984211,33.529167,35.750000,,,16.4,,107.000000,1.575000,13,45,,,60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20077,2022-01-31,47.568236,-122.308628,0330080,1009.875000,0.129167,8.086957,41.041667,0.025417,,3.2,87.000000,187.333333,2.554167,2,22,30,,13
20078,2022-01-31,47.597222,-122.319722,0330030,,0.379167,19.773913,41.958333,,,5.3,,189.208333,2.575000,6,27,,,22
20079,2022-01-31,47.663962,-117.257654,0630017,,,,,,10.0,7.8,,,,,,,9,33
20080,2022-01-31,48.399990,-119.518960,0470013,,,,23.458333,,,11.7,,171.916667,3.283333,,,,,49


In [92]:
my_dict={k: g[:] for k,g in df_data.groupby("AQSID")} # create a dictionary whose keys are the AQSID sites and whose values are the corresponding DataFrames with the feature values from 2016 to present.
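
The per-site dictionary can be illustrated on a toy frame. This sketch uses `.copy()` instead of the `g[:]` slice so that each value is an independent DataFrame; that substitution and the toy values are assumptions made for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'AQSID': ['0330080', '0770009', '0330080'],
    'date_local': ['2016-01-01', '2016-01-01', '2016-01-02'],
    'PM25': [11.3, 15.8, 9.0],
})

# One DataFrame per site, keyed by the site identifier
by_site = {site: group.copy() for site, group in toy.groupby('AQSID')}

for site, frame in by_site.items():
    print(site, len(frame))
```

Each group keeps the row labels of the original frame, which is why the per-site `info()` output further down shows non-contiguous index ranges.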

**Exploratory Data Analysis**

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

In [94]:
# Analyze each individual site to check for null values.
for site, df_aqs in my_dict.items():
  print(site)
  df_aqs.info()
  print(df_aqs.describe())  # printed explicitly: a bare expression inside a loop displays nothing



0070011
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1887 entries, 2 to 20076
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date_local  1887 non-null   object 
 1   latitude    1887 non-null   float64
 2   longitude   1887 non-null   float64
 3   AQSID       1887 non-null   object 
 4   Psfc        0 non-null      float64
 5   CO          0 non-null      float64
 6   NO2         0 non-null      float64
 7   Tmp         1887 non-null   float64
 8   O3          0 non-null      float64
 9   PM10        0 non-null      float64
 10  PM25        422 non-null    float64
 11  RH          0 non-null      float64
 12  wd          1887 non-null   float64
 13  wspd        1887 non-null   float64
 14  AQI_CO      0 non-null      Int64  
 15  AQI_NO2     0 non-null      Int64  
 16  AQI_O3      0 non-null      Int64  
 17  AQI_PM10    0 non-null      Int64  
 18  AQI_PM25    414 non-null    Int64  
dtypes: Int64(5), float

In [29]:
df_unstack.describe()

Unnamed: 0,Psfc,CO,NO2,Tmp,O3,PM10,PM25,RH,wd,wspd,AQI_CO,AQI_NO2,AQI_O3,AQI_PM10,AQI_PM25
count,2174.0,4006.0,6412.0,17058.0,2937.0,6882.0,39186.0,1927.0,15885.0,15905.0,4003.0,6412.0,2937.0,6882.0,38579.0
mean,1004.456574,0.309192,14.702258,52.079481,0.022291,20.065679,7.868245,77.744984,185.215487,3.355486,4.617537,24.466781,26.139598,17.841906,28.596283
std,6.524691,0.152276,6.742637,15.220422,0.008852,22.627932,13.573955,13.009318,46.983886,1.84559,2.604318,9.981702,8.884185,16.05274,24.989668
min,978.333333,0.1,1.566667,-1.26087,-0.000529,1.0,-3.0,22.916667,12.375,0.3125,1.0,2.0,0.0,1.0,0.0
25%,1000.84375,0.2,9.809091,41.125,0.016529,10.0,3.3,68.6875,157.75,2.066667,2.0,18.0,21.0,9.0,14.0
50%,1004.541667,0.279167,14.193334,51.666667,0.022588,15.0,5.3,78.458333,185.75,2.970833,5.0,24.0,27.0,14.0,22.0
75%,1008.489583,0.391667,18.777083,63.583333,0.028647,23.0,8.5,88.1875,212.541667,4.2,6.0,30.0,32.0,21.0,35.0
max,1027.75,1.595455,59.234783,95.875,0.053235,440.0,513.4,100.0,350.833333,17.3875,25.0,98.0,105.0,320.0,509.0


Correlation between features

In [34]:
corrM = df_unstack.corr()
corrM

Unnamed: 0,Psfc,CO,NO2,Tmp,O3,PM10,PM25,RH,wd,wspd,AQI_CO,AQI_NO2,AQI_O3,AQI_PM10,AQI_PM25
Psfc,1.0,0.046873,0.110509,-0.051162,-0.205405,0.060491,0.025235,-0.141013,-0.070313,-0.202098,0.046328,0.03488,-0.208845,0.064019,0.089979
CO,0.046873,1.0,0.808326,-0.084874,-0.393592,0.564138,0.561069,0.129582,-0.132898,-0.203849,0.942724,0.6182,-0.357195,0.569931,0.659718
NO2,0.110509,0.808326,1.0,-0.060763,-0.56143,0.526971,0.368263,0.011068,-0.06216,-0.268352,0.780515,0.851743,-0.431835,0.553196,0.55234
Tmp,-0.051162,-0.084874,-0.060763,1.0,0.165978,0.184349,0.049589,-0.517765,0.166123,0.131905,-0.078542,0.052242,0.25338,0.206638,-0.009453
O3,-0.205405,-0.393592,-0.56143,0.165978,1.0,-0.152648,-0.130158,-0.384963,0.148179,0.528737,-0.464473,-0.310098,0.929289,-0.163588,-0.22424
PM10,0.060491,0.564138,0.526971,0.184349,-0.152648,1.0,0.899883,-0.28311,-0.036249,-0.148809,0.524606,0.497767,-0.065521,0.980823,0.815843
PM25,0.025235,0.561069,0.368263,0.049589,-0.130158,0.899883,1.0,-0.078832,-0.017988,-0.201385,0.484615,0.321287,-0.071963,0.841857,0.886548
RH,-0.141013,0.129582,0.011068,-0.517765,-0.384963,-0.28311,-0.078832,1.0,-0.314849,-0.156855,0.151255,-0.18254,-0.443261,-0.296156,-0.141865
wd,-0.070313,-0.132898,-0.06216,0.166123,0.148179,-0.036249,-0.017988,-0.314849,1.0,0.191482,-0.176441,0.002173,0.222015,-0.037544,-0.045738
wspd,-0.202098,-0.203849,-0.268352,0.131905,0.528737,-0.148809,-0.201385,-0.156855,0.191482,1.0,-0.203846,-0.253906,0.474714,-0.164579,-0.348883
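
When reading the correlation matrix, it helps to remember that `DataFrame.corr()` works pairwise: each coefficient uses only the rows where both columns are non-null. Since the column counts in the `describe()` output differ widely, different cells of the matrix rest on different numbers of observations. The sketch below uses invented values, and the `min_periods=3` threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PM25': [3.0, 6.0, 9.0, 12.0, np.nan],
    'PM10': [6.0, 12.0, 18.0, np.nan, 40.0],
    'O3':   [np.nan, np.nan, 0.02, 0.03, 0.01],
})

# Pairwise Pearson correlation over rows where both columns are present
corr = toy.corr()

# Number of rows each coefficient is based on
present = toy.notna().astype(int)
overlap = present.T @ present

# Require a minimum number of shared observations per pair
strict = toy.corr(min_periods=3)

print(overlap)
print(strict)
```

Here PM25 and O3 share only two rows, so their coefficient is reported by the default call but becomes `NaN` once at least three shared observations are required.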


From the description above, we can conclude that this problem can be tackled from two different points of view:


1.   We can forecast each specific site separately, where the features used as independent variables to predict the target value may vary from site to site.
2.   We can forecast the entire WA region, taking all the available data as a whole.

Of these two points of view, creating one model per AQ site seemed the more accurate.
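
For the per-site approach, the available features differ from site to site; for example, site 0070011 in the `info()` output above has only Tmp, wd and wspd besides PM25. One possible way to pick the usable features for each site is sketched below. The helper `usable_features`, the `min_coverage` threshold and the toy values are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def usable_features(site_df, target='PM25', min_coverage=0.5):
    """Numeric columns observed on at least `min_coverage` of the rows that have a target value."""
    rows = site_df[site_df[target].notna()]
    candidates = rows.drop(columns=[target]).select_dtypes(include='number')
    coverage = candidates.notna().mean()  # fraction of non-null values per column
    return coverage[coverage >= min_coverage].index.tolist()

# Toy stand-in for the per-site dictionary built earlier
toy_sites = {
    '0070011': pd.DataFrame({'PM25': [8.9, 5.0, np.nan],
                             'Tmp': [15.9, 20.0, 25.0],
                             'wspd': [1.1, 2.0, 3.0],
                             'CO': [np.nan, np.nan, np.nan]}),
    '0330080': pd.DataFrame({'PM25': [11.3, 3.2],
                             'Tmp': [34.7, 41.0],
                             'wspd': [1.8, 2.6],
                             'CO': [0.36, 0.13]}),
}

features_by_site = {site: usable_features(frame) for site, frame in toy_sites.items()}
print(features_by_site)
```

With a selection like this, each site's model would be trained only on the columns that site actually measures, whereas the single regional model would have to cope with the missing columns in some other way.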

In [62]:
df_unstack.to_csv('/drive/My Drive/ML_HW1/data_EPA_WA_2016_daily.csv')

In [60]:
!cp data_EPA_WA_2016_daily.csv "drive/My Drive/ML_HW1/"