## Processing Input Data

### Importing Data

The two particulate matter datasets differ from each other: the PM2.5 data have fewer input variables than the PM10 data, and some of those variables are of a different type.

In [1]:
import pandas as pd
import numpy as np
from scipy.io import loadmat

In [2]:
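PM25data = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
# The PM10 measurements are stored in the MATLAB variable 'x'
PM10data = loadmat('data_Polish.mat')
PM10data = pd.DataFrame.from_dict(PM10data['x'])
# Column names follow the Polish source: PROM_SLON is solar radiation, WILGOT is humidity
PM10data.columns = ['PM_10', 'SO2', 'NO_2', 'OZON', 'WIND_SPEED', 'WIND_DIR', 'TEMP', 'PROM_SLON', 'WILGOT', 'WIND_X', 'WIND_Y']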

### NaN Values in the Datasets

The PM2.5 data contain 2067 NaN values, all in the `pm2.5` column, while the PM10 data contain none, as the following two cells show. One possible workaround is to fill the missing values from the known ones, so that every entry in the dataset is defined.

In [3]:
PM25data.isna().sum()

No          0
year        0
month       0
day         0
hour        0
pm2.5    2067
DEWP        0
TEMP        0
PRES        0
cbwd        0
Iws         0
Is          0
Ir          0
dtype: int64

In [4]:
PM10data.isna().sum()

PM_10         0
SO2           0
NO_2          0
OZON          0
WIND_SPEED    0
WIND_DIR      0
TEMP          0
PROM_SLON     0
WILGOT        0
WIND_X        0
WIND_Y        0
dtype: int64
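
Whether filling is reasonable depends on how long the gaps are, not only on how many values are missing. The following is a minimal sketch on a toy series with made-up values (not the real PM2.5 data) that measures the length of each run of consecutive NaNs; the names `run_id` and `gap_lengths` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy series with hypothetical values; it stands in for the pm2.5 column
s = pd.Series([np.nan, np.nan, 10.0, np.nan, 12.0, np.nan, np.nan, np.nan, 20.0])

is_missing = s.isna()
# A new run starts wherever the missing/present state changes;
# the cumulative sum gives every run its own label
run_id = (is_missing != is_missing.shift()).cumsum()
# Keep only the missing entries and count how many fall in each run
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()

n_missing = int(is_missing.sum())
longest_gap = int(gap_lengths.max())
print("missing values:", n_missing)
print("longest gap:", longest_gap)
```

Applied to `PM25data['pm2.5']`, the same steps would show whether the missing values are scattered or concentrated in a few long gaps.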

The missing values are then filled by carrying the last known observation forward. The first 24 data points are still missing afterwards, because no earlier value precedes them, so they are removed completely.

In [5]:
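# Forward-fill the gaps in pm2.5 (fillna(method='ffill') is deprecated in recent pandas)
PM25_interpolate = PM25data['pm2.5'].ffill()
# Work on a copy so the raw frame is not modified through an alias
PM25data_clean = PM25data.copy()
PM25data_clean['pm2.5'] = PM25_interpolate
# The first 24 rows have no earlier value to propagate, so drop them
PM25data_clean = PM25data_clean.iloc[24:]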
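
To illustrate why the leading values stay missing, here is a small sketch on a toy series with made-up values. It compares the forward fill used above with linear interpolation, which is shown only as an alternative and is not what the cell above does. With their default settings, neither method fills a gap at the start of the series.

```python
import numpy as np
import pandas as pd

# Toy series with hypothetical values: a leading gap and an interior gap
s = pd.Series([np.nan, np.nan, 10.0, np.nan, np.nan, 16.0])

# Forward fill repeats the last known value
filled = s.ffill()
# Linear interpolation draws a straight line between the known neighbours
interp = s.interpolate(method='linear')

print(filled.tolist())
print(interp.tolist())

# The leading NaNs survive both methods, so they have to be dropped
cleaned = filled.dropna()
print(len(cleaned))
```

Forward fill keeps the filled values flat at 10.0 until the next observation, whereas interpolation ramps from 10.0 to 16.0; which one is preferable depends on how the series is used later.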

In [6]:
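# Assign an hourly DatetimeIndex to the PM10 data, starting at 2006-01-01 00:00
i_PM10 = pd.date_range('2006-01-01', periods=len(PM10data), freq='H')
PM10data.index = i_PM10
# Build a DatetimeIndex for the PM2.5 data from its year/month/day/hour columns
PM25data_clean = PM25data_clean.set_index(['year', 'month', 'day', 'hour'])
PM25data = PM25data_clean
PM25data.drop(['No'], axis=1, inplace=True)
PM25data['new_idx'] = pd.to_datetime(PM25data.index.to_frame())
PM25data.set_index('new_idx', inplace=True)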
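
The same datetime index can be built more directly, because `pd.to_datetime` accepts a DataFrame whose columns are named `year`, `month`, `day` and `hour`. The sketch below uses a three-row toy frame (the `pm2.5` values are the first three rows of the output shown further down) and is an alternative to the cell above, not a replacement for it.

```python
import pandas as pd

# Toy frame with the same date-part columns as the PM2.5 file
df = pd.DataFrame({
    'year': [2010, 2010, 2010],
    'month': [1, 1, 1],
    'day': [2, 2, 2],
    'hour': [0, 1, 2],
    'pm2.5': [129.0, 148.0, 159.0],
})

date_parts = ['year', 'month', 'day', 'hour']
# to_datetime assembles one timestamp per row from the date-part columns
df.index = pd.to_datetime(df[date_parts])
# The date parts are now redundant
df = df.drop(columns=date_parts)
print(df)
```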

In [7]:
PM10data

,PM_10,SO2,NO_2,OZON,WIND_SPEED,WIND_DIR,TEMP,PROM_SLON,WILGOT,WIND_X,WIND_Y
2006-01-01 00:00:00,9.1,19.0,18.5,28.1,1.68,314.0,-3.4,0.0,95.0,-0.266436,1.658738
2006-01-01 01:00:00,9.1,30.3,19.9,27.1,1.58,297.0,-3.5,0.0,94.0,1.568733,-0.188352
2006-01-01 02:00:00,9.1,9.3,8.6,39.6,1.40,263.0,-3.7,0.0,93.0,-1.091226,0.877055
2006-01-01 03:00:00,9.1,19.1,12.5,33.5,1.36,273.0,-4.0,0.0,94.0,0.425952,-1.291575
2006-01-01 04:00:00,9.1,22.4,16.1,32.2,1.32,252.0,-4.2,0.0,94.0,0.822376,1.032520
...,...,...,...,...,...,...,...,...,...,...,...
2010-03-15 10:00:00,22.7,2.5,9.5,49.9,1.70,214.0,12.2,2.0,99.0,0.617439,1.583909
2010-03-15 11:00:00,22.8,3.8,13.6,38.8,1.46,171.0,11.9,24.0,99.0,1.425823,0.314054
2010-03-15 12:00:00,26.8,4.9,17.8,35.8,1.35,281.0,12.0,60.0,99.0,-1.329954,-0.231778
2010-03-15 13:00:00,23.4,3.6,20.2,39.8,1.43,232.0,12.6,168.0,93.0,-0.657624,1.269815


In [8]:
PM25data

new_idx,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
2010-01-02 00:00:00,129.0,-16,-4.0,1020.0,SE,1.79,0,0
2010-01-02 01:00:00,148.0,-15,-4.0,1020.0,SE,2.68,0,0
2010-01-02 02:00:00,159.0,-11,-5.0,1021.0,SE,3.57,0,0
2010-01-02 03:00:00,181.0,-7,-5.0,1022.0,SE,5.36,1,0
2010-01-02 04:00:00,138.0,-7,-5.0,1022.0,SE,6.25,2,0
...,...,...,...,...,...,...,...,...
2014-12-31 19:00:00,8.0,-23,-2.0,1034.0,NW,231.97,0,0
2014-12-31 20:00:00,10.0,-22,-3.0,1034.0,NW,237.78,0,0
2014-12-31 21:00:00,10.0,-22,-3.0,1034.0,NW,242.70,0,0
2014-12-31 22:00:00,8.0,-22,-4.0,1034.0,NW,246.72,0,0


### Creating Daily Regressors

In [27]:
PM10_daily = PM10data.resample('D').mean()

In [26]:
# cbwd holds text (wind direction), so average the numeric columns only
PM25_daily = PM25data.resample('D').mean(numeric_only=True)

In [28]:
PM10_daily.head()

,PM_10,SO2,NO_2,OZON,WIND_SPEED,WIND_DIR,TEMP,PROM_SLON,WILGOT,WIND_X,WIND_Y
2006-01-01,11.808333,22.766667,20.9875,28.041667,2.096667,290.25,-4.654167,5.125,90.75,0.03473,-0.246105
2006-01-02,14.679167,16.183333,20.675,41.1,2.219583,299.458333,-8.991667,14.625,80.583333,0.137797,-0.555077
2006-01-03,58.6625,22.695833,37.633333,27.333333,0.95625,253.791667,-11.520833,24.0,84.125,0.076456,-0.054662
2006-01-04,40.95,15.5,36.958333,28.7375,1.548333,321.875,-9.108333,12.416667,89.916667,0.071056,-0.330765
2006-01-05,54.4375,23.083333,43.033333,26.945833,1.476667,232.958333,-8.0,9.75,96.291667,0.139858,0.078001


In [29]:
PM25_daily.head()

new_idx,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir
2010-01-02,145.958333,-8.5,-5.125,1024.75,24.86,0.708333,0.0
2010-01-03,78.833333,-10.125,-8.541667,1022.791667,70.937917,14.166667,0.0
2010-01-04,31.333333,-20.875,-11.5,1029.291667,111.160833,0.0,0.0
2010-01-05,42.458333,-24.583333,-14.458333,1033.625,56.92,0.0,0.0
2010-01-06,56.416667,-23.708333,-12.541667,1033.75,18.511667,0.0,0.0
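
The daily regressors above are plain daily means of the hourly values, and the text column `cbwd` does not appear in the PM2.5 result. A minimal sketch of that behaviour on a toy frame with made-up values follows; it assumes a pandas version in which `mean` accepts `numeric_only`, and it builds the hourly index with `'60min'` to avoid relying on the spelling of the hourly alias.

```python
import pandas as pd

# Two days of hypothetical hourly data: one numeric and one text column
idx = pd.date_range('2010-01-02', periods=48, freq='60min')
toy = pd.DataFrame({
    'pm2.5': [float(v) for v in range(48)],
    'cbwd': ['SE'] * 24 + ['NW'] * 24,
}, index=idx)

# Daily mean of the numeric columns; the text column is left out
daily = toy.resample('D').mean(numeric_only=True)
print(daily)
```

Each daily row is the average of the 24 hourly rows that fall on that calendar day, so the toy result has two rows and a single `pm2.5` column.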
