## Processing Input Data

### Importing Data

The two particulate matter datasets differ from each other: compared with the PM10 data, the PM2.5 data have fewer input variables, and some of those variables are of a different type.

In [199]:
import pandas as pd
import numpy as np
from scipy.io import loadmat

In [200]:
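# PM2.5 data: plain CSV file
PM25data = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")

# PM10 data: MATLAB file; loadmat returns a dict and the measurements sit under the key 'x'
PM10data = loadmat('data_Polish.mat')
PM10data = pd.DataFrame(PM10data['x'])
# Column names are Polish abbreviations; PROM_SLON and WILGOT presumably denote solar radiation and humidity
PM10data.columns = ['PM_10', 'SO2', 'NO_2', 'OZON', 'WIND_SPEED', 'WIND_DIR', 'TEMP', 'PROM_SLON', 'WILGOT', 'WIND_X', 'WIND_Y']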

In [201]:
PM25data.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


In [202]:
PM10data.head()

Unnamed: 0,PM_10,SO2,NO_2,OZON,WIND_SPEED,WIND_DIR,TEMP,PROM_SLON,WILGOT,WIND_X,WIND_Y
0,9.1,19.0,18.5,28.1,1.68,314.0,-3.4,0.0,95.0,-0.266436,1.658738
1,9.1,30.3,19.9,27.1,1.58,297.0,-3.5,0.0,94.0,1.568733,-0.188352
2,9.1,9.3,8.6,39.6,1.4,263.0,-3.7,0.0,93.0,-1.091226,0.877055
3,9.1,19.1,12.5,33.5,1.36,273.0,-4.0,0.0,94.0,0.425952,-1.291575
4,9.1,22.4,16.1,32.2,1.32,252.0,-4.2,0.0,94.0,0.822376,1.03252


### NaN Values in the Datasets

The PM2.5 data contain some NaN values, as the following two cells show. One possible workaround is to interpolate the missing values from the known ones, so that every entry in the dataset is defined.

In [203]:
PM25data.isna().sum()

No          0
year        0
month       0
day         0
hour        0
pm2.5    2067
DEWP        0
TEMP        0
PRES        0
cbwd        0
Iws         0
Is          0
Ir          0
dtype: int64

In [204]:
PM10data.isna().sum()

PM_10         0
SO2           0
NO_2          0
OZON          0
WIND_SPEED    0
WIND_DIR      0
TEMP          0
PROM_SLON     0
WILGOT        0
WIND_X        0
WIND_Y        0
dtype: int64
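
As a minimal sketch of two common gap-filling options, the following compares linear interpolation with forward fill. The series `s` and its values are made up for illustration and are not taken from the PM2.5 data.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap and an interior gap (illustrative values only)
s = pd.Series([np.nan, np.nan, 10.0, np.nan, np.nan, 40.0])

# Linear interpolation: interior gaps are filled on a straight line between neighbours
linear = s.interpolate(method="linear")

# Forward fill: each gap repeats the last known value
ffilled = s.ffill()

print(pd.DataFrame({"raw": s, "linear": linear, "ffill": ffilled}))
```

With the default settings, neither method fills the leading gap, because there is no earlier observation to fill from.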

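The missing values are then filled. The code below uses forward fill (`ffill`), which repeats the last known value rather than interpolating between neighbours. The first 24 data points have no preceding value and therefore remain missing, so they are removed completely.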

In [205]:
PM25_interpolate = PM25data['pm2.5'].ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas

In [206]:
PM25_interpolate.isna()

0         True
1         True
2         True
3         True
4         True
         ...  
43819    False
43820    False
43821    False
43822    False
43823    False
Name: pm2.5, Length: 43824, dtype: bool
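
The leading `True` values above are the rows that forward fill cannot reach. Rather than hard-coding how many rows to drop, the first valid row can be looked up with `first_valid_index`. This is a minimal sketch on a toy series; the name `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy series with three leading NaNs (illustrative values only)
toy = pd.Series([np.nan, np.nan, np.nan, 129.0, np.nan, 159.0], name="pm2.5")

filled = toy.ffill()

# Index label of the first non-NaN entry after filling
first_valid = filled.first_valid_index()

# Keep everything from that label onwards
trimmed = filled.loc[first_valid:]

print(first_valid, int(trimmed.isna().sum()))
```

Applied to the PM2.5 column, the same lookup would replace the fixed offset of 24 used in the next cell, assuming the default integer index is still in place.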

In [207]:
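PM25data_clean = PM25data.copy()             # work on a copy so PM25data stays untouched
PM25data_clean['pm2.5'] = PM25_interpolate
PM25data_clean = PM25data_clean[24:].copy()  # drop the first 24 rows, which are still NaN
# drop() returns a new DataFrame; the result is only displayed here, so 'No' remains in PM25data_clean
PM25data_clean.drop(['No'], axis=1)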

Unnamed: 0,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
24,2010,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0
25,2010,1,2,1,148.0,-15,-4.0,1020.0,SE,2.68,0,0
26,2010,1,2,2,159.0,-11,-5.0,1021.0,SE,3.57,0,0
27,2010,1,2,3,181.0,-7,-5.0,1022.0,SE,5.36,1,0
28,2010,1,2,4,138.0,-7,-5.0,1022.0,SE,6.25,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...
43819,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,231.97,0,0
43820,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,237.78,0,0
43821,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,242.70,0,0
43822,2014,12,31,22,8.0,-22,-4.0,1034.0,NW,246.72,0,0


In [208]:
PM25data_clean.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
24,25,2010,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0
25,26,2010,1,2,1,148.0,-15,-4.0,1020.0,SE,2.68,0,0
26,27,2010,1,2,2,159.0,-11,-5.0,1021.0,SE,3.57,0,0
27,28,2010,1,2,3,181.0,-7,-5.0,1022.0,SE,5.36,1,0
28,29,2010,1,2,4,138.0,-7,-5.0,1022.0,SE,6.25,2,0


In [209]:
PM25data_clean.isna().sum()

No       0
year     0
month    0
day      0
hour     0
pm2.5    0
DEWP     0
TEMP     0
PRES     0
cbwd     0
Iws      0
Is       0
Ir       0
dtype: int64


In [211]:
PM10data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36831 entries, 0 to 36830
Data columns (total 11 columns):
PM_10         36831 non-null float64
SO2           36831 non-null float64
NO_2          36831 non-null float64
OZON          36831 non-null float64
WIND_SPEED    36831 non-null float64
WIND_DIR      36831 non-null float64
TEMP          36831 non-null float64
PROM_SLON     36831 non-null float64
WILGOT        36831 non-null float64
WIND_X        36831 non-null float64
WIND_Y        36831 non-null float64
dtypes: float64(11)
memory usage: 3.1 MB


In [212]:
# Attach calendar columns taken from an hourly range that starts at 2006-01-01 00:00
i = pd.date_range('2006-01-01', periods=len(PM10data), freq='H')
PM10data['year'] = i.year
PM10data['month'] = i.month
PM10data['day'] = i.day
PM10data['hour'] = i.hour
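
The cell above derives calendar columns from a generated hourly range. As a sketch, such columns can be recombined into a single timestamp with `pd.to_datetime`, which gives a quick consistency check. The frame `demo` is a hypothetical stand-in, and `pd.Timedelta(hours=1)` is used here instead of a frequency string only to avoid version-specific aliases.

```python
import pandas as pd

# Hypothetical stand-in for a data frame with 48 hourly rows
demo = pd.DataFrame({"value": range(48)})

# Hourly range of the same length as the frame
idx = pd.date_range("2006-01-01", periods=len(demo), freq=pd.Timedelta(hours=1))
demo["year"] = idx.year
demo["month"] = idx.month
demo["day"] = idx.day
demo["hour"] = idx.hour

# Recombine the calendar columns into one timestamp per row
rebuilt = pd.to_datetime(demo[["year", "month", "day", "hour"]])

print(rebuilt.iloc[0], rebuilt.iloc[-1])
```

The same recombination should work on the `year`, `month`, `day`, and `hour` columns of the PM2.5 data, since `pd.to_datetime` assembles timestamps from columns with those names.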

In [213]:
PM10data

Unnamed: 0,PM_10,SO2,NO_2,OZON,WIND_SPEED,WIND_DIR,TEMP,PROM_SLON,WILGOT,WIND_X,WIND_Y,year,month,day,hour
0,9.1,19.0,18.5,28.1,1.68,314.0,-3.4,0.0,95.0,-0.266436,1.658738,2006,1,1,0
1,9.1,30.3,19.9,27.1,1.58,297.0,-3.5,0.0,94.0,1.568733,-0.188352,2006,1,1,1
2,9.1,9.3,8.6,39.6,1.40,263.0,-3.7,0.0,93.0,-1.091226,0.877055,2006,1,1,2
3,9.1,19.1,12.5,33.5,1.36,273.0,-4.0,0.0,94.0,0.425952,-1.291575,2006,1,1,3
4,9.1,22.4,16.1,32.2,1.32,252.0,-4.2,0.0,94.0,0.822376,1.032520,2006,1,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36826,22.7,2.5,9.5,49.9,1.70,214.0,12.2,2.0,99.0,0.617439,1.583909,2010,3,15,10
36827,22.8,3.8,13.6,38.8,1.46,171.0,11.9,24.0,99.0,1.425823,0.314054,2010,3,15,11
36828,26.8,4.9,17.8,35.8,1.35,281.0,12.0,60.0,99.0,-1.329954,-0.231778,2010,3,15,12
36829,23.4,3.6,20.2,39.8,1.43,232.0,12.6,168.0,93.0,-0.657624,1.269815,2010,3,15,13
