Project: Predicting mean values and limit-exceedance days of fine Particulate Matter (PM2.5) in the air - Milan (Italy).

Student: **Alessandro Monolo** | 1790210

Lecturer: Jonas Moons

Fundamentals of Machine Learning - Master Data-Driven Design, Hogeschool Utrecht.

August 2021 - Block E

## Data cleaning and Pre-Processing of CO Dataset in Milan from 2014 to 2019

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Import, clean and merge the yearly CO DataFrames from 2014 to 2019

###  Harmful Element CO Dataset of 2014

In [3]:
# Importing dataset of 2014, skipping the first row

In [4]:
CO_2014 = pd.read_csv("2014.csv", encoding="utf-8", skiprows=1)

In [5]:
CO_2014.drop(index=CO_2014.index[0], axis=0, inplace=True)

In [6]:
CO_2014.reset_index(inplace=True)

In [7]:
# After resetting the df index I set the new column names

In [8]:
CO_2014.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "CO mg/m³"}, inplace=True)

In [9]:
# Collect the index labels of the rows whose value is -999 (missing or invalid reading) so they can be dropped from the df

In [10]:
indexNames = CO_2014[CO_2014['CO mg/m³'] == '-999' ].index

In [11]:
CO_2014.drop(indexNames , inplace=True)
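
The two cells above remove the rows flagged with the `-999` sentinel. The same filtering can be sketched with a boolean mask; the small `raw` frame below is made-up illustration data, not the real 2014 file, and it assumes the values are still stored as strings, as in the comparison above.

```python
import pandas as pd

# Hypothetical hourly readings stored as strings; "-999" marks a missing value
raw = pd.DataFrame({
    "TimeStamp": ["2014/01/01 01:00", "2014/01/01 02:00", "2014/01/01 03:00"],
    "CO mg/m³": ["2.2", "-999", "2.3"],
})

# Keep only the rows whose value is not the sentinel string
valid = raw["CO mg/m³"] != "-999"
cleaned = raw[valid]

print(cleaned)
```

The mask keeps the original frame untouched and returns a filtered view, which can be easier to inspect than an in-place drop.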

In [12]:
# Display the df without NaN rows (dropna() returns a new df; CO_2014 itself is not modified)

In [13]:
CO_2014.dropna()

Unnamed: 0,TimeStamp,CO mg/m³
0,2014/01/01 01:00,2.2
1,2014/01/01 02:00,2.2
2,2014/01/01 03:00,2.2
3,2014/01/01 04:00,2.3
4,2014/01/01 05:00,2.4
...,...,...
8755,2014/12/31 20:00,0.8
8756,2014/12/31 21:00,0.8
8757,2014/12/31 22:00,1.0
8758,2014/12/31 23:00,0.9
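
Note that `dropna()` called without assignment only displays a filtered copy. A minimal sketch on a made-up three-row frame (`demo` is a hypothetical name) shows the difference between calling it and keeping its result:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
demo = pd.DataFrame({"CO mg/m³": [2.2, np.nan, 2.3]})

demo.dropna()          # returns a new frame; demo still has 3 rows
kept = demo.dropna()   # assign the result to actually keep it

print(len(demo), len(kept))
```

In this notebook the remaining NaN values are harmless, because `mean()` skips them by default when the daily averages are computed later.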


In [14]:
# Convert the object (string) column to numeric values; unparseable entries become NaN

In [15]:
CO_2014["CO mg/m³"] = pd.to_numeric(CO_2014["CO mg/m³"], errors = 'coerce')

In [16]:
# Create a new DateTime column by parsing the timestamp strings (year/month/day hour:minute):

In [17]:
CO_2014['DateTime'] = pd.to_datetime(CO_2014['TimeStamp'], format='%Y/%m/%d %H:%M')
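
The `TimeStamp` values printed earlier have the form `2014/01/01 01:00`. As a minimal sketch, assuming the raw strings really match what is displayed, a format string that mirrors that layout parses them explicitly:

```python
import pandas as pd

# Two sample strings copied from the displayed 2014 output
stamps = pd.Series(["2014/01/01 01:00", "2014/12/31 23:00"])

# %Y/%m/%d matches the slash-separated date, %H:%M the hour and minute
parsed = pd.to_datetime(stamps, format="%Y/%m/%d %H:%M")

print(parsed.dt.hour.tolist())
```

A format that matches the strings exactly avoids relying on lenient parsing, which may behave differently across pandas versions.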

In [18]:
# Drop the TimeStamp column, which is no longer needed

In [19]:
CO_2014.drop('TimeStamp', axis=1, inplace=True)

In [20]:
# Set Datetime column as the new index of the df

In [21]:
CO_2014_index = CO_2014.set_index('DateTime')

In [22]:
# Compute the daily mean from the hourly values of each day in the df

In [23]:
df_CO_2014 = CO_2014_index.resample('D').mean()

In [24]:
df_CO_2014.reset_index(inplace=True)

In [25]:
# Drop the last row, since it belongs to 1 January of the following year

In [26]:
df_CO_2014 = df_CO_2014[:-1]
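
Why a trailing row can appear is easy to reproduce. The sketch below assumes, purely for illustration, an hourly series whose last reading is stamped at 00:00 of the following day; `resample('D')` then opens one extra daily bin for that single reading.

```python
import pandas as pd

# Hypothetical hourly index from 01:00 on 1 Jan to 00:00 on 3 Jan
idx = pd.date_range("2014-01-01 01:00", "2014-01-03 00:00",
                    freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"CO mg/m³": 1.0}, index=idx)

# One bin per calendar day; the lone 00:00 reading creates a bin for 3 Jan
daily = hourly.resample("D").mean()
print(daily.index.strftime("%Y-%m-%d").tolist())

# Slicing off the last row removes that extra bin
trimmed = daily[:-1]
print(len(trimmed))
```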

In [27]:
# Repeat the same process for the CO dfs of 2015, 2016, 2017, 2018 and 2019.
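
Since the steps are identical for every year, they could also be wrapped in a single function. The sketch below is not the notebook's code: `daily_co_means` is a hypothetical helper, it assumes a frame that already has string `TimeStamp` and `CO mg/m³` columns, and it assumes the `YYYY/MM/DD HH:MM` timestamp layout seen in the output. Trimming a trailing next-year row is left to the caller.

```python
import pandas as pd

def daily_co_means(hourly: pd.DataFrame, value_col: str = "CO mg/m³") -> pd.DataFrame:
    """Hypothetical helper: hourly string readings -> daily mean values."""
    # Remove rows flagged with the -999 sentinel
    df = hourly[hourly[value_col] != "-999"].copy()
    # Convert strings to numbers; anything unparseable becomes NaN
    df[value_col] = pd.to_numeric(df[value_col], errors="coerce")
    # Parse the timestamps (assumed layout: 2014/01/01 01:00)
    df["DateTime"] = pd.to_datetime(df["TimeStamp"], format="%Y/%m/%d %H:%M")
    # Average the hourly values of each calendar day
    daily = df.set_index("DateTime")[[value_col]].resample("D").mean()
    return daily.reset_index()

# Made-up sample: two readings on each of two days, one of them invalid
sample = pd.DataFrame({
    "TimeStamp": ["2014/01/01 01:00", "2014/01/01 02:00",
                  "2014/01/02 01:00", "2014/01/02 02:00"],
    "CO mg/m³": ["2.0", "4.0", "-999", "1.0"],
})
result = daily_co_means(sample)
print(result)
```

With such a helper, each yearly section would reduce to loading the file, renaming the columns and calling the function once.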

###  Harmful Element CO Dataset of 2015

In [28]:
CO_2015 = pd.read_csv("2015.csv", encoding="utf-8", skiprows=1)

In [29]:
CO_2015.drop(index=CO_2015.index[0], axis=0, inplace=True)

In [30]:
CO_2015.reset_index(inplace=True)

In [31]:
CO_2015.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "CO mg/m³"}, inplace=True)

In [32]:
indexNames = CO_2015[CO_2015['CO mg/m³'] == '-999' ].index

In [33]:
CO_2015.drop(indexNames , inplace=True)

In [34]:
CO_2015.dropna()

Unnamed: 0,TimeStamp,CO mg/m³
0,2015/01/01 01:00,1.1
1,2015/01/01 02:00,0.9
2,2015/01/01 03:00,0.9
3,2015/01/01 04:00,0.8
4,2015/01/01 05:00,0.9
...,...,...
8755,2015/12/31 20:00,0.9
8756,2015/12/31 21:00,1.0
8757,2015/12/31 22:00,0.9
8758,2015/12/31 23:00,0.9


In [35]:
CO_2015["CO mg/m³"] = pd.to_numeric(CO_2015["CO mg/m³"], errors = 'coerce')

In [36]:
CO_2015['DateTime'] = pd.to_datetime(CO_2015['TimeStamp'], format='%Y/%m/%d %H:%M')

In [37]:
CO_2015.drop('TimeStamp', axis=1, inplace=True)

In [38]:
CO_2015_index = CO_2015.set_index('DateTime')

In [39]:
df_CO_2015 = CO_2015_index.resample('D').mean()

In [40]:
df_CO_2015.reset_index(inplace=True)

In [41]:
df_CO_2015 = df_CO_2015[:-1]

###  Harmful Element CO Dataset of 2016

In [42]:
CO_2016 = pd.read_csv("2016.csv", encoding="utf-8", skiprows=1)

In [43]:
CO_2016.drop(index=CO_2016.index[0], axis=0, inplace=True)

In [44]:
CO_2016.reset_index(inplace=True)

In [45]:
CO_2016.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "CO mg/m³"}, inplace=True)

In [46]:
indexNames = CO_2016[CO_2016['CO mg/m³'] == '-999' ].index

In [47]:
CO_2016.drop(indexNames , inplace=True)

In [48]:
CO_2016.dropna()

Unnamed: 0,TimeStamp,CO mg/m³
0,2016/01/01 01:00,1.0
1,2016/01/01 02:00,1.0
2,2016/01/01 03:00,1.7
3,2016/01/01 04:00,1.0
4,2016/01/01 05:00,0.9
...,...,...
8779,2016/12/31 20:00,1.9
8780,2016/12/31 21:00,2.4
8781,2016/12/31 22:00,2.4
8782,2016/12/31 23:00,2.3


In [49]:
CO_2016["CO mg/m³"] = pd.to_numeric(CO_2016["CO mg/m³"], errors = 'coerce')

In [50]:
CO_2016['DateTime'] = pd.to_datetime(CO_2016['TimeStamp'], format='%Y/%m/%d %H:%M')

In [51]:
CO_2016.drop('TimeStamp', axis=1, inplace=True)

In [52]:
CO_2016_index = CO_2016.set_index('DateTime')

In [53]:
df_CO_2016 = CO_2016_index.resample('D').mean()

In [54]:
df_CO_2016.reset_index(inplace=True)

In [55]:
df_CO_2016 = df_CO_2016[:-1]

###  Harmful Element CO Dataset of 2017

In [56]:
CO_2017 = pd.read_csv("2017.csv", encoding="utf-8", skiprows=1)

In [57]:
CO_2017.drop(index=CO_2017.index[0], axis=0, inplace=True)

In [58]:
CO_2017.reset_index(inplace=True)

In [59]:
CO_2017.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "CO mg/m³"}, inplace=True)

In [60]:
indexNames = CO_2017[CO_2017['CO mg/m³'] == '-999' ].index

In [61]:
CO_2017.drop(indexNames , inplace=True)

In [62]:
CO_2017.dropna()

Unnamed: 0,TimeStamp,CO mg/m³
0,2017/01/01 01:00,2.6
1,2017/01/01 02:00,3.2
2,2017/01/01 03:00,3.3
3,2017/01/01 04:00,3.0
4,2017/01/01 05:00,2.6
...,...,...
8755,2017/12/31 20:00,1.2
8756,2017/12/31 21:00,1.1
8757,2017/12/31 22:00,1.1
8758,2017/12/31 23:00,1.0


In [63]:
CO_2017["CO mg/m³"] = pd.to_numeric(CO_2017["CO mg/m³"], errors = 'coerce')

In [64]:
CO_2017['DateTime'] = pd.to_datetime(CO_2017['TimeStamp'], format='%Y/%m/%d %H:%M')

In [65]:
CO_2017.drop('TimeStamp', axis=1, inplace=True)

In [66]:
CO_2017_index = CO_2017.set_index('DateTime')

In [67]:
df_CO_2017 = CO_2017_index.resample('D').mean()

In [68]:
df_CO_2017.reset_index(inplace=True)

In [69]:
df_CO_2017 = df_CO_2017[:-1]

###  Harmful Element CO Dataset of 2018

In [70]:
CO_2018 = pd.read_csv("2018.csv", encoding="utf-8", skiprows=1)

In [71]:
CO_2018.drop(index=CO_2018.index[0], axis=0, inplace=True)

In [72]:
CO_2018.reset_index(inplace=True)

In [73]:
CO_2018.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "CO mg/m³"}, inplace=True)

In [74]:
indexNames = CO_2018[CO_2018['CO mg/m³'] == '-999' ].index

In [75]:
CO_2018.drop(indexNames , inplace=True)

In [76]:
CO_2018.dropna()

Unnamed: 0,TimeStamp,CO mg/m³
0,2018/01/01 01:00,1.0
1,2018/01/01 02:00,0.9
2,2018/01/01 03:00,0.8
3,2018/01/01 04:00,0.8
4,2018/01/01 05:00,0.8
...,...,...
8755,2018/12/31 20:00,1.6
8756,2018/12/31 21:00,1.7
8757,2018/12/31 22:00,1.7
8758,2018/12/31 23:00,1.6


In [77]:
CO_2018["CO mg/m³"] = pd.to_numeric(CO_2018["CO mg/m³"], errors = 'coerce')

In [78]:
CO_2018['DateTime'] = pd.to_datetime(CO_2018['TimeStamp'], format='%Y/%m/%d %H:%M')

In [79]:
CO_2018.drop('TimeStamp', axis=1, inplace=True)

In [80]:
CO_2018_index = CO_2018.set_index('DateTime')

In [81]:
df_CO_2018 = CO_2018_index.resample('D').mean()

In [82]:
df_CO_2018.reset_index(inplace=True)

In [83]:
df_CO_2018 = df_CO_2018[:-1]

###  Harmful Element CO Dataset of 2019

In [84]:
CO_2019 = pd.read_csv("2019.csv", encoding="utf-8", skiprows=1)

In [85]:
CO_2019.drop(index=CO_2019.index[0], axis=0, inplace=True)

In [86]:
CO_2019.reset_index(inplace=True)

In [87]:
CO_2019.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "CO mg/m³"}, inplace=True)

In [88]:
indexNames = CO_2019[CO_2019['CO mg/m³'] == '-999' ].index

In [89]:
CO_2019.drop(indexNames , inplace=True)

In [90]:
CO_2019.dropna()

Unnamed: 0,TimeStamp,CO mg/m³
0,2019/01/01 01:00,1.6
1,2019/01/01 02:00,1.8
2,2019/01/01 03:00,1.7
3,2019/01/01 04:00,1.6
4,2019/01/01 05:00,1.7
...,...,...
8755,2019/12/31 20:00,1.4
8756,2019/12/31 21:00,1.4
8757,2019/12/31 22:00,1.5
8758,2019/12/31 23:00,1.5


In [91]:
CO_2019["CO mg/m³"] = pd.to_numeric(CO_2019["CO mg/m³"], errors = 'coerce')

In [92]:
CO_2019['DateTime'] = pd.to_datetime(CO_2019['TimeStamp'], format='%Y/%m/%d %H:%M')

In [93]:
CO_2019.drop('TimeStamp', axis=1, inplace=True)

In [94]:
CO_2019_index = CO_2019.set_index('DateTime')

In [95]:
df_CO_2019 = CO_2019_index.resample('D').mean()

In [96]:
df_CO_2019.reset_index(inplace=True)

In [97]:
df_CO_2019 = df_CO_2019[:-1]

**Last check**

In [98]:
print(df_CO_2014.shape)
print(df_CO_2015.shape)
print(df_CO_2016.shape) # Leap year: one extra day
print(df_CO_2017.shape)
print(df_CO_2018.shape)
print(df_CO_2019.shape)

(365, 2)
(365, 2)
(366, 2)
(365, 2)
(365, 2)
(365, 2)
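
The visual check above can also be expressed as assertions. This is a small sketch that reuses the row counts printed above and compares them with the number of days in each year; the `row_counts` dictionary is an illustrative stand-in for the real `shape[0]` values.

```python
import calendar

# Row counts copied from the printed shapes above
row_counts = {2014: 365, 2015: 365, 2016: 366, 2017: 365, 2018: 365, 2019: 365}

for year, rows in row_counts.items():
    # A leap year has 366 days, every other year 365
    expected = 366 if calendar.isleap(year) else 365
    assert rows == expected, f"{year}: expected {expected} rows, got {rows}"

total_days = sum(row_counts.values())
print(total_days)
```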


### Merging all the CO datasets into a CSV file:

In [99]:
# Create a list of dataframes:

In [100]:
data_frames = [df_CO_2014, df_CO_2015, df_CO_2016, df_CO_2017, df_CO_2018, df_CO_2019]

In [101]:
# Concatenate all the yearly dataframes into one complete dataframe, stacking them along the rows (axis=0):

In [102]:
Milan_CO_2014_2019 = pd.concat(data_frames, join='outer', axis=0)
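
Stacking along the rows keeps each frame's own index, so the labels restart at 0 for every year. That does not matter here because the index is not written to the CSV, but a sketch with two made-up two-row frames shows the effect and the optional `ignore_index=True` alternative:

```python
import pandas as pd

# Two hypothetical yearly fragments, each indexed 0..1
a = pd.DataFrame({"DateTime": pd.to_datetime(["2014-12-30", "2014-12-31"]),
                  "CO mg/m³": [0.9, 0.8]})
b = pd.DataFrame({"DateTime": pd.to_datetime(["2015-01-01", "2015-01-02"]),
                  "CO mg/m³": [1.0, 0.9]})

# Default: the original index labels are kept and therefore repeat
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True renumbers the rows consecutively
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```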

In [103]:
# Finally, save the resulting 2014-2019 dataframe of daily means to a csv file:

In [104]:
Milan_CO_2014_2019.to_csv("Milan_CO_2014_2019.csv", index=False)

In [105]:
df_CO = pd.read_csv("Milan_CO_2014_2019.csv")

In [106]:
df_CO

Unnamed: 0,DateTime,CO mg/m³
0,2014-01-01,2.043478
1,2014-01-02,1.433333
2,2014-01-03,1.425000
3,2014-01-04,1.491667
4,2014-01-05,1.541667
...,...,...
2186,2019-12-27,1.408333
2187,2019-12-28,1.387500
2188,2019-12-29,1.287500
2189,2019-12-30,0.945833
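
When the CSV is read back, the `DateTime` column comes in as plain text unless pandas is told to parse it. The sketch below uses an in-memory CSV with the first two rows shown above instead of the real file; `parse_dates` is a standard `read_csv` parameter.

```python
import io
import pandas as pd

# In-memory stand-in for Milan_CO_2014_2019.csv (first two rows only)
csv_text = "DateTime,CO mg/m³\n2014-01-01,2.043478\n2014-01-02,1.433333\n"

# Without parse_dates the dates stay as strings
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetimes on load
typed = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateTime"])

print(plain["DateTime"].dtype, typed["DateTime"].dtype)
```

Having a real datetime column makes later steps such as resampling or merging on the date with other pollutant datasets more straightforward.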
