Project: Predicting mean values and limit-exceedance days of fine Particulate Matter (PM2.5) in the air - Milan (Italy).

Student: **Alessandro Monolo** | 1790210

Lecturer: Jonas Moons

Fundamentals of Machine Learning - Master Data-Driven Design, Hogeschool Utrecht.

August 2021 - Block E

## Data cleaning and Pre-Processing of PM 10 Dataset in Milan from 2014 to 2019

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Import, clean and merge the yearly PM10 data frames from 2014 to 2019

### Particulate Matter PM 10 Dataset of 2014

In [3]:
# Importing dataset of 2014, skipping the first row

In [4]:
PM10_2014 = pd.read_csv("2014.csv", encoding="utf-8", skiprows=1)

In [5]:
PM10_2014.drop(index=PM10_2014.index[0], axis=0, inplace=True)

In [6]:
PM10_2014.reset_index(inplace=True)

In [7]:
# reset_index() moved the timestamps into a column named "index"; give both columns descriptive names

In [8]:
PM10_2014.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}, inplace=True)
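
The rename step relies on `reset_index()` turning unnamed row labels into a column called `index`. A minimal sketch on two made-up rows (the timestamps and readings are illustrative assumptions, not values from the source files):

```python
import pandas as pd

# Made-up rows: timestamps as row labels, readings stored as strings.
raw = pd.DataFrame(
    {"-999 Valore mancante o invalido": ["40", "-999"]},
    index=["2014-01-01 00:00", "2014-01-01 01:00"],
)

# reset_index() moves the unnamed row labels into a column named "index".
flat = raw.reset_index()

# Give both columns descriptive names, mirroring the rename cell.
flat = flat.rename(
    columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}
)
print(list(flat.columns))
```

The original column header is Italian for "-999 Missing or invalid value", which is why `-999` is treated as a missing-data marker in the following cells.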

In [9]:
# Collect the index labels of rows marked '-999' (missing or invalid value) so they can be dropped

In [10]:
indexNames = PM10_2014[PM10_2014['PM10 µg/m³'] == '-999' ].index

In [11]:
PM10_2014.drop(indexNames , inplace=True)

In [12]:
# Convert the object column to numeric; entries that cannot be parsed become NaN

In [13]:
PM10_2014["PM10 µg/m³"] = pd.to_numeric(PM10_2014["PM10 µg/m³"], errors = 'coerce')
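
An alternative to dropping the `-999` rows by string comparison is to convert to numbers first and then mask the sentinel, which would also catch variants such as `-999.0`. This is a sketch on a hypothetical `raw` series, not the notebook's own method:

```python
import pandas as pd

# Hypothetical raw readings, stored as strings like the object column above.
raw = pd.Series(["35", "-999", "41.5", "-999.0", "n/a"], name="PM10 µg/m³")

# Non-numeric entries become NaN instead of raising an error.
values = pd.to_numeric(raw, errors="coerce")

# Treat the -999 sentinel as missing as well.
values = values.mask(values == -999)

print(values.tolist())
```

Because `mean()` skips NaN by default, daily averages computed afterwards should come out the same whether the invalid rows are dropped or kept as NaN.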

In [14]:
PM10_2014['DateTime'] = pd.to_datetime(PM10_2014['TimeStamp'])

In [15]:
# Drop the TimeStamp column, which is no longer needed

In [16]:
PM10_2014.drop('TimeStamp', axis=1, inplace=True)

In [17]:
# Set Datetime column as the new index of the df

In [18]:
PM10_2014_index = PM10_2014.set_index('DateTime')

In [19]:
# Compute the daily mean from the hourly values of each day

In [20]:
df_PM10_2014 = PM10_2014_index.resample('D').mean()
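
To illustrate what `resample('D').mean()` does, here is a small sketch on made-up hourly readings (the numbers are assumptions chosen for easy checking). Each calendar day is reduced to one row, and a missing hour is simply left out of that day's average:

```python
import numpy as np
import pandas as pd

# Two days of made-up hourly readings: 40 on day one, 60 on day two.
hours = pd.Timestamp("2014-01-01") + pd.to_timedelta(np.arange(48), unit="h")
values = np.r_[np.full(24, 40.0), np.full(24, 60.0)]
values[3] = np.nan  # one missing hour on day one

hourly = pd.DataFrame({"PM10 µg/m³": values}, index=hours)

# One row per calendar day; NaN hours are skipped by mean().
daily = hourly.resample("D").mean()
print(daily)
```

A day with no valid readings at all would still appear in the result, with NaN as its mean.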

In [21]:
df_PM10_2014.reset_index(inplace=True)

In [22]:
df_PM10_2014.shape

(365, 2)

In [23]:
# Repeat the same process for the PM10 data frames of 2015, 2016, 2017, 2018 and 2019.
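
Since the steps are identical for every year, they could also be wrapped in a function and applied in a loop. The sketch below chains the same calls as the 2014 cells; the names `load_daily_pm10`, `load_years` and the `folder` parameter are hypothetical helpers introduced here, and the files are assumed to share the 2014 layout:

```python
from pathlib import Path

import pandas as pd

RAW_COL = "-999 Valore mancante o invalido"
VALUE_COL = "PM10 µg/m³"


def load_daily_pm10(source):
    """Return daily mean PM10 for one yearly file (same steps as the 2014 cells)."""
    df = pd.read_csv(source, encoding="utf-8", skiprows=1)
    df = df.drop(index=df.index[0])  # drop the first data row
    df = df.reset_index().rename(columns={"index": "TimeStamp", RAW_COL: VALUE_COL})
    df = df[df[VALUE_COL] != "-999"].copy()  # remove missing/invalid readings
    df[VALUE_COL] = pd.to_numeric(df[VALUE_COL], errors="coerce")
    df["DateTime"] = pd.to_datetime(df["TimeStamp"])
    daily = df.drop(columns="TimeStamp").set_index("DateTime").resample("D").mean()
    return daily.reset_index()


def load_years(years, folder="."):
    """Load several yearly files (assumed to be named <year>.csv) and stack them."""
    frames = [load_daily_pm10(Path(folder) / f"{year}.csv") for year in years]
    return pd.concat(frames, ignore_index=True)
```

With the six files in the working directory, `load_years(range(2014, 2020))` would be expected to produce the same combined table that the cells below build step by step.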

### Particulate Matter PM 10 Dataset of 2015

In [24]:
PM10_2015 = pd.read_csv("2015.csv", encoding="utf-8", skiprows=1)

In [25]:
PM10_2015.drop(index=PM10_2015.index[0], axis=0, inplace=True)

In [26]:
PM10_2015.reset_index(inplace=True)

In [27]:
PM10_2015.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}, inplace=True)

In [28]:
indexNames = PM10_2015[PM10_2015['PM10 µg/m³'] == '-999' ].index

In [29]:
PM10_2015.drop(indexNames , inplace=True)

In [30]:
PM10_2015["PM10 µg/m³"] = pd.to_numeric(PM10_2015["PM10 µg/m³"], errors = 'coerce')

In [31]:
PM10_2015['DateTime'] = pd.to_datetime(PM10_2015['TimeStamp'])

In [32]:
PM10_2015.drop('TimeStamp', axis=1, inplace=True)

In [33]:
PM10_2015_index = PM10_2015.set_index('DateTime')

In [34]:
df_PM10_2015 = PM10_2015_index.resample('D').mean()

In [35]:
df_PM10_2015.reset_index(inplace=True)

In [36]:
df_PM10_2015.shape

(365, 2)

### Particulate Matter PM 10 Dataset of 2016

In [37]:
PM10_2016 = pd.read_csv("2016.csv", encoding="utf-8", skiprows=1)

In [38]:
PM10_2016.drop(index=PM10_2016.index[0], axis=0, inplace=True)

In [39]:
PM10_2016.reset_index(inplace=True)

In [40]:
PM10_2016.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}, inplace=True)

In [41]:
indexNames = PM10_2016[PM10_2016['PM10 µg/m³'] == '-999' ].index

In [42]:
PM10_2016.drop(indexNames , inplace=True)

In [43]:
PM10_2016["PM10 µg/m³"] = pd.to_numeric(PM10_2016["PM10 µg/m³"], errors = 'coerce')

In [44]:
PM10_2016['DateTime'] = pd.to_datetime(PM10_2016['TimeStamp'])

In [45]:
PM10_2016.drop('TimeStamp', axis=1, inplace=True)

In [46]:
PM10_2016_index = PM10_2016.set_index('DateTime')

In [47]:
df_PM10_2016 = PM10_2016_index.resample('D').mean()

In [48]:
df_PM10_2016.reset_index(inplace=True)

In [49]:
df_PM10_2016.shape

(366, 2)

### Particulate Matter PM 10 Dataset of 2017

In [50]:
PM10_2017 = pd.read_csv("2017.csv", encoding="utf-8", skiprows=1)

In [51]:
PM10_2017.drop(index=PM10_2017.index[0], axis=0, inplace=True)

In [52]:
PM10_2017.reset_index(inplace=True)

In [53]:
PM10_2017.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}, inplace=True)

In [54]:
indexNames = PM10_2017[PM10_2017['PM10 µg/m³'] == '-999' ].index

In [55]:
PM10_2017.drop(indexNames , inplace=True)

In [56]:
PM10_2017["PM10 µg/m³"] = pd.to_numeric(PM10_2017["PM10 µg/m³"], errors = 'coerce')

In [57]:
PM10_2017['DateTime'] = pd.to_datetime(PM10_2017['TimeStamp'])

In [58]:
PM10_2017.drop('TimeStamp', axis=1, inplace=True)

In [59]:
PM10_2017_index = PM10_2017.set_index('DateTime')

In [60]:
df_PM10_2017 = PM10_2017_index.resample('D').mean()

In [61]:
df_PM10_2017.reset_index(inplace=True)

In [62]:
df_PM10_2017.shape

(365, 2)

### Particulate Matter PM 10 Dataset of 2018

In [63]:
PM10_2018 = pd.read_csv("2018.csv", encoding="utf-8", skiprows=1)

In [64]:
PM10_2018.drop(index=PM10_2018.index[0], axis=0, inplace=True)

In [65]:
PM10_2018.reset_index(inplace=True)

In [66]:
PM10_2018.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}, inplace=True)

In [67]:
indexNames = PM10_2018[PM10_2018['PM10 µg/m³'] == '-999' ].index

In [68]:
PM10_2018.drop(indexNames , inplace=True)

In [69]:
PM10_2018["PM10 µg/m³"] = pd.to_numeric(PM10_2018["PM10 µg/m³"], errors = 'coerce')

In [70]:
PM10_2018['DateTime'] = pd.to_datetime(PM10_2018['TimeStamp'])

In [71]:
PM10_2018.drop('TimeStamp', axis=1, inplace=True)

In [72]:
PM10_2018_index = PM10_2018.set_index('DateTime')

In [73]:
df_PM10_2018 = PM10_2018_index.resample('D').mean()

In [74]:
df_PM10_2018.reset_index(inplace=True)

In [75]:
df_PM10_2018.shape

(365, 2)

### Particulate Matter PM 10 Dataset of 2019

In [76]:
PM10_2019 = pd.read_csv("2019.csv", encoding="utf-8", skiprows=1)

In [77]:
PM10_2019.drop(index=PM10_2019.index[0], axis=0, inplace=True)

In [78]:
PM10_2019.reset_index(inplace=True)

In [79]:
PM10_2019.rename(columns={"index": "TimeStamp", "-999 Valore mancante o invalido": "PM10 µg/m³"}, inplace=True)

In [80]:
indexNames = PM10_2019[PM10_2019['PM10 µg/m³'] == '-999' ].index

In [81]:
PM10_2019.drop(indexNames , inplace=True)

In [82]:
PM10_2019["PM10 µg/m³"] = pd.to_numeric(PM10_2019["PM10 µg/m³"], errors = 'coerce')

In [83]:
PM10_2019['DateTime'] = pd.to_datetime(PM10_2019['TimeStamp'])

In [84]:
PM10_2019.drop('TimeStamp', axis=1, inplace=True)

In [85]:
PM10_2019_index = PM10_2019.set_index('DateTime')

In [86]:
df_PM10_2019 = PM10_2019_index.resample('D').mean()

In [87]:
df_PM10_2019.reset_index(inplace=True)

In [88]:
df_PM10_2019.shape

(365, 2)

**Last check**

In [89]:
print(df_PM10_2014.shape)
print(df_PM10_2015.shape)
print(df_PM10_2016.shape) # Leap year: one extra day
print(df_PM10_2017.shape)
print(df_PM10_2018.shape)
print(df_PM10_2019.shape)

(365, 2)
(365, 2)
(366, 2)
(365, 2)
(365, 2)
(365, 2)
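
The row counts confirm that each frame spans a full calendar year, but `resample('D')` also emits a row for any day without readings, so a matching shape does not guarantee that every day has a value. A hedged sketch of a slightly stricter check (the helper names `expected_days` and `check_year` are hypothetical, and the demo frame is made up):

```python
import calendar

import pandas as pd


def expected_days(year):
    """Number of calendar days in the given year."""
    return 366 if calendar.isleap(year) else 365


def check_year(daily, year, value_col="PM10 µg/m³"):
    """Return (row count matches the calendar, number of days without a value)."""
    rows_ok = len(daily) == expected_days(year)
    empty_days = int(daily[value_col].isna().sum())
    return rows_ok, empty_days


# Made-up example: a full leap year in which two days lack a reading.
demo = pd.DataFrame(
    {
        "DateTime": pd.date_range("2016-01-01", "2016-12-31", freq="D"),
        "PM10 µg/m³": 1.0,
    }
)
demo.loc[[10, 20], "PM10 µg/m³"] = float("nan")
print(check_year(demo, 2016))
```

Applied to the notebook's frames, a call such as `check_year(df_PM10_2016, 2016)` would report how many daily means are NaN.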


### Merging all the PM10 datasets into a CSV file:

In [90]:
# Create a list of dataframes:

In [91]:
data_frames = [df_PM10_2014, df_PM10_2015, df_PM10_2016, df_PM10_2017, df_PM10_2018, df_PM10_2019]

In [92]:
# Concatenate all the yearly dataframes into one complete dataframe, stacking them row-wise (axis=0):

In [93]:
Milan_PM10_2014_2019 = pd.concat(data_frames, join='outer', axis=0)
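
With `axis=0`, `pd.concat` stacks the frames on top of each other and keeps each frame's own row labels. The sketch below uses two made-up two-row frames to show the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two tiny made-up yearly frames with the same columns.
a = pd.DataFrame(
    {"DateTime": pd.to_datetime(["2014-12-30", "2014-12-31"]), "PM10 µg/m³": [30.0, 35.0]}
)
b = pd.DataFrame(
    {"DateTime": pd.to_datetime(["2015-01-01", "2015-01-02"]), "PM10 µg/m³": [50.0, 45.0]}
)

# Rows are stacked; the original labels repeat (0, 1, 0, 1).
stacked = pd.concat([a, b], axis=0)

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

In this notebook the repeated labels are harmless, because the result is written with `index=False` and gets a fresh index when the CSV is read back.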

In [94]:
# Finally, save the combined dataframe to a CSV file:

In [95]:
Milan_PM10_2014_2019.to_csv("Milan_PM10_2014_2019.csv", index=False)

In [96]:
df_PM10 = pd.read_csv("Milan_PM10_2014_2019.csv")
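
One thing to keep in mind when reloading: a CSV stores dates as text, so `DateTime` is no longer a datetime column after a plain `read_csv`. A sketch of the round trip using an in-memory buffer and made-up values, with `parse_dates` restoring the dtype:

```python
import io

import pandas as pd

# Made-up daily frame standing in for the merged data.
daily = pd.DataFrame(
    {"DateTime": pd.to_datetime(["2014-01-01", "2014-01-02"]), "PM10 µg/m³": [1.0, 2.0]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
daily.to_csv(buffer, index=False)

buffer.seek(0)
as_text = pd.read_csv(buffer)  # DateTime comes back as plain strings

buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["DateTime"])  # DateTime parsed as datetimes

print(as_text["DateTime"].dtype, as_dates["DateTime"].dtype)
```

For the file written above, the equivalent call would be `pd.read_csv("Milan_PM10_2014_2019.csv", parse_dates=["DateTime"])`.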

In [97]:
df_PM10

| | DateTime | PM10 µg/m³ |
|---|---|---|
| 0 | 2014-01-01 | 140.0 |
| 1 | 2014-01-02 | 46.0 |
| 2 | 2014-01-03 | 50.0 |
| 3 | 2014-01-04 | 32.0 |
| 4 | 2014-01-05 | 24.0 |
| ... | ... | ... |
| 2186 | 2019-12-27 | 56.0 |
| 2187 | 2019-12-28 | 61.0 |
| 2188 | 2019-12-29 | 69.0 |
| 2189 | 2019-12-30 | 59.0 |
