Project: Predicting mean values and limit-exceedance days of fine particulate matter (PM2.5) in the air - Milan (Italy).

Student: **Alessandro Monolo** | 1790210

Lecturer: Jonas Moons

Fundamentals of Machine Learning - Master Data-Driven Design, Hogeschool Utrecht.

August 2021 - Block E

## Data cleaning and Pre-Processing of Ozone Dataset in Milan from 2014 to 2019

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# I will import, clean and merge all the O (ozone) dataframes from 2014 to 2019

In [3]:
def sprint_info(df):
    print('This is df shape: ', df.shape, '\n')
    print(df.info(), '\n')
    print('This is df: \n', df)

### O - Ozone Dataset of 2014

In [4]:
# Import the 2014 dataset (skiprows=0 skips no rows; the first two rows are dropped in the next cell)

In [5]:
O_2014 = pd.read_csv("2014.csv", encoding="utf-8", skiprows=0)

In [6]:
O_2014.drop(index=O_2014.index[[0,1]], axis=0, inplace=True)

In [7]:
# After dropping the first two rows, set the new column names

In [8]:
O_2014.rename(columns={"Stazione": "TimeStamp", " Milano - Verziere": "O µg/m³"}, inplace=True)

In [9]:
# Collect the index labels of rows whose value is '-999' (missing-value marker), so they can be dropped from the df

In [10]:
indexNames = O_2014[O_2014['O µg/m³'] == '-999' ].index

In [11]:
O_2014.drop(indexNames, inplace=True)

In [12]:
# Transforming object column to a numeric values column

In [13]:
O_2014["O µg/m³"] = pd.to_numeric(O_2014["O µg/m³"], errors = 'coerce')
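The two cleaning steps above (dropping the `'-999'` rows, then converting the column to numeric) can be tried on a small frame first. This is only a minimal sketch: the `toy` dataframe and its three readings are made up for illustration and are not taken from `2014.csv`.

```python
import pandas as pd

# Hypothetical toy data (assumption): three hourly readings stored as strings,
# one of them being the '-999' missing-value marker
toy = pd.DataFrame({
    "TimeStamp": ["20140101 01", "20140101 02", "20140101 03"],
    "O µg/m³": ["7.5", "-999", "6.1"],
})

# Keep only the rows whose value is not the '-999' marker
toy = toy[toy["O µg/m³"] != "-999"].copy()

# Convert the remaining strings to floats; unparseable values would become NaN
toy["O µg/m³"] = pd.to_numeric(toy["O µg/m³"], errors="coerce")

print(toy)
print(toy.dtypes)
```

Note that the comparison is done against the string `'-999'`, as in the cells above, because the column is still of object type at that point.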

In [14]:
# Create a new DateTime column by parsing the timestamps in year-month-day hour format:

In [15]:
O_2014['DateTime'] = pd.to_datetime(O_2014['TimeStamp'], format='%Y%m%d %H')
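As a quick illustration of what the format string `'%Y%m%d %H'` does, here is a minimal sketch on two made-up timestamp strings (the values in `stamps` are an assumption about how the raw column looks, inferred from the format used above).

```python
import pandas as pd

# Hypothetical timestamp strings (assumption): date as YYYYMMDD, a space, then the hour
stamps = pd.Series(["20140101 01", "20140101 23"])

# %Y = 4-digit year, %m = month, %d = day, %H = hour on a 24-hour clock
parsed = pd.to_datetime(stamps, format="%Y%m%d %H")

print(parsed)
```

Passing an explicit `format` makes the parsing strict: a string that does not match the pattern raises an error instead of being guessed.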

In [16]:
# Drop the TimeStamp column, which is no longer needed

In [17]:
O_2014.drop('TimeStamp', axis=1, inplace=True)

In [18]:
# Set Datetime column as the new index of the df

In [19]:
O_2014_index = O_2014.set_index('DateTime')

In [20]:
# Compute the daily mean values from the hourly values in the df

In [21]:
df_O_2014 = O_2014_index.resample('D').mean()

In [22]:
df_O_2014.reset_index(inplace=True)

In [23]:
# Drop the last row, since it is 1 January of the next year

In [24]:
df_O_2014 = df_O_2014[:-1]
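The resampling and the removal of the trailing row can be checked on a tiny example. The sketch below uses three invented readings (an assumption, not real measurements): two on 31 December 2014 and one at midnight on 1 January 2015, which produces the extra last row that the slice `[:-1]` removes.

```python
import pandas as pd

# Hypothetical hourly readings (assumption): two on the last day of 2014,
# one at midnight of the first day of 2015
toy = pd.DataFrame(
    {"O µg/m³": [4.0, 8.0, 10.0]},
    index=pd.to_datetime(
        ["2014-12-31 00:00", "2014-12-31 12:00", "2015-01-01 00:00"]
    ),
)
toy.index.name = "DateTime"

# One row per calendar day, holding the mean of that day's readings
daily = toy.resample("D").mean().reset_index()
print(daily)

# Drop the last row, which belongs to the next year
trimmed = daily[:-1]
print(trimmed)
```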

In [25]:
#sprint_info(df_O_2014)

In [26]:
# Repeat the same process for the O dfs of 2015, 2016, 2017, 2018 and 2019.
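Since the same steps are repeated for every year, they could also be bundled into one function. The following is a minimal sketch, not the code used in this notebook: the name `load_daily_ozone` is hypothetical, and the small CSV text in `toy_csv` is invented to mimic the assumed file layout (a header line followed by two extra rows).

```python
import io
import pandas as pd

def load_daily_ozone(source):
    """Hypothetical helper that bundles the per-year cleaning steps."""
    df = pd.read_csv(source)
    # Drop the first two rows after the header
    df = df.drop(index=df.index[[0, 1]])
    df = df.rename(columns={"Stazione": "TimeStamp",
                            " Milano - Verziere": "O µg/m³"})
    # Remove the '-999' missing-value rows, then convert to numeric
    df = df[df["O µg/m³"] != "-999"].copy()
    df["O µg/m³"] = pd.to_numeric(df["O µg/m³"], errors="coerce")
    # Parse the timestamps and compute daily means
    df["DateTime"] = pd.to_datetime(df["TimeStamp"], format="%Y%m%d %H")
    daily = df.drop(columns="TimeStamp").set_index("DateTime").resample("D").mean()
    # Drop the last row (first day of the following year)
    return daily.reset_index()[:-1]

# Invented CSV content (assumption about the layout of the yearly files)
toy_csv = (
    "Stazione, Milano - Verziere\n"
    "meta1,meta2\n"
    "meta3,meta4\n"
    "20140101 01,7.0\n"
    "20140101 02,-999\n"
    "20140101 03,9.0\n"
    "20140102 00,5.0\n"
)

result = load_daily_ozone(io.StringIO(toy_csv))
print(result)
```

With such a helper, the yearly dataframes could be built in a loop over the file names, for example `[load_daily_ozone(f"{year}.csv") for year in range(2014, 2020)]`, assuming all six files share the same layout.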

### O - Ozone Dataset of 2015

In [27]:
O_2015 = pd.read_csv("2015.csv", encoding="utf-8", skiprows=0)

In [28]:
O_2015.drop(index=O_2015.index[[0,1]], axis=0, inplace=True)

In [29]:
O_2015.rename(columns={"Stazione": "TimeStamp", " Milano - Verziere": "O µg/m³"}, inplace=True)

In [30]:
indexNames = O_2015[O_2015['O µg/m³'] == '-999' ].index

In [31]:
O_2015.drop(indexNames, inplace=True)

In [32]:
# Transforming object column to a numeric values column

In [33]:
O_2015["O µg/m³"] = pd.to_numeric(O_2015["O µg/m³"], errors = 'coerce')

In [34]:
# Create a new DateTime column by parsing the timestamps in year-month-day hour format:

In [35]:
O_2015['DateTime'] = pd.to_datetime(O_2015['TimeStamp'], format='%Y%m%d %H')

In [36]:
# Drop the TimeStamp column, which is no longer needed

In [37]:
O_2015.drop('TimeStamp', axis=1, inplace=True)

In [38]:
# Set Datetime column as the new index of the df

In [39]:
O_2015_index = O_2015.set_index('DateTime')

In [40]:
df_O_2015 = O_2015_index.resample('D').mean()

In [41]:
df_O_2015.reset_index(inplace=True)

In [42]:
df_O_2015 = df_O_2015[:-1]

In [43]:
#sprint_info(df_O_2015)

### O - Ozone Dataset of 2016

In [44]:
O_2016 = pd.read_csv("2016.csv", encoding="utf-8", skiprows=0)

In [45]:
O_2016.drop(index=O_2016.index[[0,1]], axis=0, inplace=True)

In [46]:
O_2016.rename(columns={"Stazione": "TimeStamp", " Milano - Verziere": "O µg/m³"}, inplace=True)

In [47]:
indexNames = O_2016[O_2016['O µg/m³'] == '-999' ].index

In [48]:
O_2016.drop(indexNames, inplace=True)

In [49]:
O_2016["O µg/m³"] = pd.to_numeric(O_2016["O µg/m³"], errors = 'coerce')

In [50]:
O_2016['DateTime'] = pd.to_datetime(O_2016['TimeStamp'], format='%Y%m%d %H')

In [51]:
O_2016.drop('TimeStamp', axis=1, inplace=True)

In [52]:
O_2016_index = O_2016.set_index('DateTime')

In [53]:
df_O_2016 = O_2016_index.resample('D').mean()

In [54]:
df_O_2016.reset_index(inplace=True)

In [55]:
df_O_2016 = df_O_2016[:-1]

In [56]:
#sprint_info(df_O_2016)

### O - Ozone Dataset of 2017

In [57]:
O_2017 = pd.read_csv("2017.csv", encoding="utf-8", skiprows=0)

In [58]:
O_2017.drop(index=O_2017.index[[0,1]], axis=0, inplace=True)

In [59]:
O_2017.rename(columns={"Stazione": "TimeStamp", " Milano - Verziere": "O µg/m³"}, inplace=True)

In [60]:
indexNames = O_2017[O_2017['O µg/m³'] == '-999' ].index

In [61]:
O_2017.drop(indexNames, inplace=True)

In [62]:
O_2017["O µg/m³"] = pd.to_numeric(O_2017["O µg/m³"], errors = 'coerce')

In [63]:
O_2017['DateTime'] = pd.to_datetime(O_2017['TimeStamp'], format='%Y%m%d %H')

In [64]:
O_2017.drop('TimeStamp', axis=1, inplace=True)

In [65]:
O_2017_index = O_2017.set_index('DateTime')

In [66]:
df_O_2017 = O_2017_index.resample('D').mean()

In [67]:
df_O_2017.reset_index(inplace=True)

In [68]:
df_O_2017 = df_O_2017[:-1]

In [69]:
#sprint_info(df_O_2017)

### O - Ozone Dataset of 2018

In [70]:
O_2018 = pd.read_csv("2018.csv", encoding="utf-8", skiprows=0)

In [71]:
O_2018.drop(index=O_2018.index[[0,1]], axis=0, inplace=True)

In [72]:
O_2018.rename(columns={"Stazione": "TimeStamp", " Milano - Verziere": "O µg/m³"}, inplace=True)

In [73]:
indexNames = O_2018[O_2018['O µg/m³'] == '-999' ].index

In [74]:
O_2018.drop(indexNames, inplace=True)

In [75]:
O_2018["O µg/m³"] = pd.to_numeric(O_2018["O µg/m³"], errors = 'coerce')

In [76]:
O_2018['DateTime'] = pd.to_datetime(O_2018['TimeStamp'], format='%Y%m%d %H')

In [77]:
O_2018.drop('TimeStamp', axis=1, inplace=True)

In [78]:
O_2018_index = O_2018.set_index('DateTime')

In [79]:
df_O_2018 = O_2018_index.resample('D').mean()

In [80]:
df_O_2018.reset_index(inplace=True)

In [81]:
df_O_2018 = df_O_2018[:-1]

In [82]:
#sprint_info(df_O_2018)

### O - Ozone Dataset of 2019

In [83]:
O_2019 = pd.read_csv("2019.csv", encoding="utf-8", skiprows=0)

In [84]:
O_2019.drop(index=O_2019.index[[0,1]], axis=0, inplace=True)

In [85]:
O_2019.rename(columns={"Stazione": "TimeStamp", " Milano - Verziere": "O µg/m³"}, inplace=True)

In [86]:
indexNames = O_2019[O_2019['O µg/m³'] == '-999' ].index

In [87]:
O_2019.drop(indexNames, inplace=True)

In [88]:
O_2019["O µg/m³"] = pd.to_numeric(O_2019["O µg/m³"], errors = 'coerce')

In [89]:
O_2019['DateTime'] = pd.to_datetime(O_2019['TimeStamp'], format='%Y%m%d %H')

In [90]:
O_2019.drop('TimeStamp', axis=1, inplace=True)

In [91]:
O_2019_index = O_2019.set_index('DateTime')

In [92]:
df_O_2019 = O_2019_index.resample('D').mean()

In [93]:
df_O_2019.reset_index(inplace=True)

In [94]:
df_O_2019 = df_O_2019[:-1]

In [95]:
#sprint_info(df_O_2019)

**Last check**

In [96]:
print(df_O_2014.shape)
print(df_O_2015.shape)
print(df_O_2016.shape) # Leap year with one day more
print(df_O_2017.shape)
print(df_O_2018.shape)
print(df_O_2019.shape)

(365, 2)
(365, 2)
(366, 2)
(365, 2)
(365, 2)
(365, 2)
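The row counts printed above can also be compared with the number of days each year should have. This is a small sketch using the standard library; the helper name `expected_days` is hypothetical.

```python
import calendar

def expected_days(year):
    """Hypothetical helper: number of calendar days in the given year."""
    return 366 if calendar.isleap(year) else 365

# Print the expected number of daily rows for each year from 2014 to 2019
for year in range(2014, 2020):
    print(year, expected_days(year))
```

Each yearly dataframe should have as many rows as the matching value printed here; only 2016 is a leap year in this range.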


### Merging all the O datasets into a CSV file:

In [97]:
# Create a list of dataframes:

In [98]:
data_frames = [df_O_2014, df_O_2015, df_O_2016, df_O_2017, df_O_2018, df_O_2019]

In [99]:
# Concatenate all the yearly dataframes into one complete dataframe, stacking them row-wise (axis=0):

In [100]:
Milan_O_2014_2019 = pd.concat(data_frames, join='outer', axis=0)
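To show what `axis=0` does, here is a minimal sketch with two invented frames `a` and `b` (assumptions, not the real yearly data). With `axis=0` the rows are stacked and each frame keeps its own index labels; `ignore_index=True` is an optional alternative that renumbers the rows.

```python
import pandas as pd

# Two hypothetical yearly frames with the same columns (assumption)
a = pd.DataFrame({
    "DateTime": pd.to_datetime(["2014-12-30", "2014-12-31"]),
    "O µg/m³": [5.0, 6.0],
})
b = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-01-01"]),
    "O µg/m³": [7.0],
})

# axis=0 stacks the rows; the original index labels are kept
stacked = pd.concat([a, b], join="outer", axis=0)
print(stacked.index.tolist())

# ignore_index=True gives one continuous index instead
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

In this notebook the repeated labels do not matter, because the dataframe is saved with `index=False` and gets a fresh index when the CSV file is read back.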

In [101]:
# Finally, save the combined dataframe to a CSV file:

In [102]:
Milan_O_2014_2019.to_csv("Milan_O_2014_2019.csv", index=False)

In [103]:
df_O = pd.read_csv("Milan_O_2014_2019.csv")
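One detail when reading the file back: a CSV file stores dates as plain text, so the `DateTime` column is not a datetime column anymore unless `parse_dates` is used. The sketch below shows this on an invented two-row frame written to an in-memory buffer (an assumption made so the example runs without the real file).

```python
import io
import pandas as pd

# Hypothetical two-row frame (assumption) standing in for the merged data
frame = pd.DataFrame({
    "DateTime": pd.to_datetime(["2014-01-01", "2014-01-02"]),
    "O µg/m³": [7.25, 6.4],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Without parse_dates the dates come back as text
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes)

# With parse_dates the column is converted to datetime again
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["DateTime"])
print(parsed.dtypes)
```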

In [104]:
df_O

Unnamed: 0,DateTime,O µg/m³
0,2014-01-01,7.252174
1,2014-01-02,6.400000
2,2014-01-03,6.033333
3,2014-01-04,5.554167
4,2014-01-05,6.708333
...,...,...
2186,2019-12-27,5.116667
2187,2019-12-28,4.129167
2188,2019-12-29,4.270833
2189,2019-12-30,13.954167
