## Processing Input Data

### Importing Data

The two particulate matter datasets differ from each other. Compared with the PM10 data, the PM2.5 data has fewer input variables, and some of its variables are of a different type (for example, `cbwd` holds text labels such as NW and SE).

In [1]:
import pandas as pd
import numpy as np
from scipy.io import loadmat

In [2]:
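# PM2.5 data comes from a CSV file; PM10 data from a MATLAB .mat file (matrix stored under key 'x')
PM25data = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
PM10data = loadmat('data_Polish.mat')
PM10data = pd.DataFrame.from_dict(PM10data['x'])
# PROM_SLON and WILGOT are presumably Polish abbreviations for solar radiation and humidity
PM10data.columns = ['PM_10', 'SO2', 'NO_2', 'OZON', 'WIND_SPEED', 'WIND_DIR', 'TEMP', 'PROM_SLON', 'WILGOT', 'WIND_X', 'WIND_Y']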

In [3]:
PM25data.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


In [4]:
PM10data.head()

Unnamed: 0,PM_10,SO2,NO_2,OZON,WIND_SPEED,WIND_DIR,TEMP,PROM_SLON,WILGOT,WIND_X,WIND_Y
0,9.1,19.0,18.5,28.1,1.68,314.0,-3.4,0.0,95.0,-0.266436,1.658738
1,9.1,30.3,19.9,27.1,1.58,297.0,-3.5,0.0,94.0,1.568733,-0.188352
2,9.1,9.3,8.6,39.6,1.4,263.0,-3.7,0.0,93.0,-1.091226,0.877055
3,9.1,19.1,12.5,33.5,1.36,273.0,-4.0,0.0,94.0,0.425952,-1.291575
4,9.1,22.4,16.1,32.2,1.32,252.0,-4.2,0.0,94.0,0.822376,1.03252


### NaN Values in the Datasets

The PM2.5 data contains some NaN values, as the following two cells show. One possible workaround is to fill the missing values from the known ones, so that every row of the dataset has a defined value.

In [5]:
PM25data.isna().sum()

No          0
year        0
month       0
day         0
hour        0
pm2.5    2067
DEWP        0
TEMP        0
PRES        0
cbwd        0
Iws         0
Is          0
Ir          0
dtype: int64

In [6]:
PM10data.isna().sum()

PM_10         0
SO2           0
NO_2          0
OZON          0
WIND_SPEED    0
WIND_DIR      0
TEMP          0
PROM_SLON     0
WILGOT        0
WIND_X        0
WIND_Y        0
dtype: int64
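
The idea of filling missing values from known ones can be sketched on a made-up series. The values and the name `toy` are hypothetical and are not taken from the PM2.5 file. The sketch contrasts two standard pandas options: forward fill, which repeats the last known value, and linear interpolation, which places the gap on a straight line between its known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a two-point gap in the middle
toy = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0])

# Forward fill: each missing point repeats the last known value
forward_filled = toy.ffill()

# Linear interpolation: missing points lie on a line between known neighbours
interpolated = toy.interpolate(method='linear')

print(forward_filled.tolist())
print(interpolated.tolist())
```

Which option suits the PM2.5 series better depends on how long the gaps are; forward fill keeps a stale reading constant, while interpolation assumes a smooth change across the gap.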

The missing values are then filled by forward fill, where each gap takes the last known value. The first 24 data points have no earlier observation to copy from, so they are still missing and are removed completely.

In [7]:
# Forward fill: propagate the last valid observation into each gap
PM25_interpolate = PM25data['pm2.5'].ffill()

In [8]:
PM25_interpolate.isna()

0         True
1         True
2         True
3         True
4         True
5         True
6         True
7         True
8         True
9         True
10        True
11        True
12        True
13        True
14        True
15        True
16        True
17        True
18        True
19        True
20        True
21        True
22        True
23        True
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
43794    False
43795    False
43796    False
43797    False
43798    False
43799    False
43800    False
43801    False
43802    False
43803    False
43804    False
43805    False
43806    False
43807    False
43808    False
43809    False
43810    False
43811    False
43812    False
43813    False
43814    False
43815    False
43816    False
43817    False
43818    False
43819    False
43820    False
43821    False
43822    False
43823    False
Name: pm2.5, Length: 43824, dtype: bool
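
The output above is `True` for rows 0 to 23 because forward fill can only copy a value from an earlier row, and the series starts with missing values. As a small sketch on made-up data (the series `toy` is hypothetical), `Series.first_valid_index()` locates the first usable row, so the cut-off does not have to be hard-coded:

```python
import numpy as np
import pandas as pd

# Hypothetical series: three leading gaps and one gap in the middle
toy = pd.Series([np.nan, np.nan, np.nan, 5.0, np.nan, 7.0])

filled = toy.ffill()
# Leading NaNs survive because there is no earlier value to carry forward
leading_gaps = int(filled.isna().sum())

# Label of the first non-missing row in the original series
first_valid = toy.first_valid_index()

# Keep everything from that row onwards
trimmed = filled.loc[first_valid:]
print(leading_gaps, first_valid, trimmed.tolist())
```

Applied to the PM2.5 column, this should point at row 24, which matches the slice used in the next cell.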

In [9]:
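# Work on a copy so the original PM25data is left unchanged
PM25data_clean = PM25data.copy()
PM25data_clean['pm2.5'] = PM25_interpolate
# Drop the first 24 rows, which are still NaN after the fill
PM25data_clean = PM25data_clean[24:]
# drop() returns a new frame; it is displayed here but not stored, so 'No' stays in PM25data_clean
PM25data_clean.drop(['No'], axis=1)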

Unnamed: 0,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
24,2010,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0
25,2010,1,2,1,148.0,-15,-4.0,1020.0,SE,2.68,0,0
26,2010,1,2,2,159.0,-11,-5.0,1021.0,SE,3.57,0,0
27,2010,1,2,3,181.0,-7,-5.0,1022.0,SE,5.36,1,0
28,2010,1,2,4,138.0,-7,-5.0,1022.0,SE,6.25,2,0
29,2010,1,2,5,109.0,-7,-6.0,1022.0,SE,7.14,3,0
30,2010,1,2,6,105.0,-7,-6.0,1023.0,SE,8.93,4,0
31,2010,1,2,7,124.0,-7,-5.0,1024.0,SE,10.72,0,0
32,2010,1,2,8,120.0,-8,-6.0,1024.0,SE,12.51,0,0
33,2010,1,2,9,132.0,-7,-5.0,1025.0,SE,14.30,0,0


In [10]:
PM25data_clean.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
24,25,2010,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0
25,26,2010,1,2,1,148.0,-15,-4.0,1020.0,SE,2.68,0,0
26,27,2010,1,2,2,159.0,-11,-5.0,1021.0,SE,3.57,0,0
27,28,2010,1,2,3,181.0,-7,-5.0,1022.0,SE,5.36,1,0
28,29,2010,1,2,4,138.0,-7,-5.0,1022.0,SE,6.25,2,0
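
The `No` column still appears in the output above because `DataFrame.drop` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch on a made-up frame (the names `toy` and `without_no` are hypothetical):

```python
import pandas as pd

# Hypothetical frame standing in for the PM2.5 data
toy = pd.DataFrame({'No': [1, 2, 3], 'pm2.5': [129.0, 148.0, 159.0]})

# Calling drop without assignment only produces a temporary result
toy.drop(columns=['No'])
still_there = 'No' in toy.columns

# Assigning the result keeps the change
without_no = toy.drop(columns=['No'])

print(still_there, list(without_no.columns))
```

If the `No` column is not needed later, assigning the result of `drop` back to the variable would be one way to remove it for good.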


In [11]:
PM25data_clean.isna().sum()

No       0
year     0
month    0
day      0
hour     0
pm2.5    0
DEWP     0
TEMP     0
PRES     0
cbwd     0
Iws      0
Is       0
Ir       0
dtype: int64

In [13]:
PM10data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36831 entries, 0 to 36830
Data columns (total 11 columns):
PM_10         36831 non-null float64
SO2           36831 non-null float64
NO_2          36831 non-null float64
OZON          36831 non-null float64
WIND_SPEED    36831 non-null float64
WIND_DIR      36831 non-null float64
TEMP          36831 non-null float64
PROM_SLON     36831 non-null float64
WILGOT        36831 non-null float64
WIND_X        36831 non-null float64
WIND_Y        36831 non-null float64
dtypes: float64(11)
memory usage: 3.1 MB


In [14]:
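# Build hourly timestamps, one per PM10 row, starting at 2006-01-01 00:00
# ('H' is the hourly alias; newer pandas versions prefer lowercase 'h')
i = pd.date_range('2006-01-01', periods=len(PM10data), freq='H')
# Add calendar columns so PM10data has the same time fields as the PM2.5 data
PM10data['year']=i.year
PM10data['month']=i.month
PM10data['day']=i.day
PM10data['hour']=i.hour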

In [15]:
PM10data

Unnamed: 0,PM_10,SO2,NO_2,OZON,WIND_SPEED,WIND_DIR,TEMP,PROM_SLON,WILGOT,WIND_X,WIND_Y,year,month,day,hour
0,9.1,19.0,18.5,28.1,1.68,314.0,-3.4,0.0,95.0,-0.266436,1.658738,2006,1,1,0
1,9.1,30.3,19.9,27.1,1.58,297.0,-3.5,0.0,94.0,1.568733,-0.188352,2006,1,1,1
2,9.1,9.3,8.6,39.6,1.40,263.0,-3.7,0.0,93.0,-1.091226,0.877055,2006,1,1,2
3,9.1,19.1,12.5,33.5,1.36,273.0,-4.0,0.0,94.0,0.425952,-1.291575,2006,1,1,3
4,9.1,22.4,16.1,32.2,1.32,252.0,-4.2,0.0,94.0,0.822376,1.032520,2006,1,1,4
5,11.7,29.3,18.0,27.7,1.49,226.0,-4.4,0.0,95.0,-0.288231,1.461856,2006,1,1,5
6,11.7,35.5,23.9,21.9,1.27,258.0,-4.4,0.0,95.0,0.482137,1.174923,2006,1,1,6
7,11.7,20.5,25.1,20.4,1.48,264.0,-4.5,0.0,95.0,0.156906,1.471659,2006,1,1,7
8,14.3,33.7,32.7,12.4,1.42,279.0,-4.7,3.0,95.0,0.803838,-1.170575,2006,1,1,8
9,18.2,42.8,29.9,15.4,1.34,239.0,-4.4,17.0,94.0,0.317166,1.301924,2006,1,1,9
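
The calendar columns above rest on the assumption that the PM10 rows are consecutive hourly readings starting at midnight on 2006-01-01. The sketch below repeats the idea on a hypothetical 72-hour range to show how the fields roll over from one day to the next. The frame `toy` and its length are illustrative only.

```python
import pandas as pd

# Hypothetical 72-hour range; a Timedelta step avoids version-specific
# frequency aliases ('H' in older pandas, 'h' in newer releases)
stamps = pd.date_range('2006-01-01', periods=72, freq=pd.Timedelta(hours=1))

toy = pd.DataFrame({
    'year': stamps.year,
    'month': stamps.month,
    'day': stamps.day,
    'hour': stamps.hour,
})

# Row 23 is the last hour of day 1; row 24 is midnight of day 2
print(toy.iloc[[23, 24]])

# The four columns can be reassembled into timestamps as a round-trip check
rebuilt = pd.to_datetime(toy[['year', 'month', 'day', 'hour']])
print(rebuilt.tolist() == stamps.tolist())
```

The same `pd.to_datetime` call could, in principle, turn the `year`, `month`, `day` and `hour` columns of either dataset into a single timestamp column.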
