## Air Quality data, Paris 2020

Air Quality data monitoring, Visualisation and Machine Learning

All the data is available here:
https://github.com/antoinedme/aqi-data/tree/master/notebooks/data

This notebook is part of the AQI courses:
https://github.com/antoinedme/aqi-data


In [1]:
# Import classic pandas and numpy libraries
import pandas as pd
import numpy as np

In [2]:
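# Particulate matter 2.5 µm (PM2.5, also called fine particle pollution):
# fine inhalable particles, with diameters that are generally 2.5 micrometers and smaller
# NOTE: this path is the PM10 file, the same one loaded into pm10 below;
# it should probably point at the PM2.5 file instead
pm25 = pd.read_csv("data/PM10_2.csv", sep=';')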

# Nitrogen dioxide, NO2 is an intermediate in the industrial synthesis of nitric acid, 
# millions of tons of which are produced each year for use primarily in the production of fertilizers.
no2 = pd.read_csv("data/NO2_2.csv", sep=';')

# Ozone is an inorganic molecule with the chemical formula O3
o3 = pd.read_csv("data/O3_2.csv", sep=';')

# PM10 is particulate matter 10 micrometers or less in diameter
pm10 = pd.read_csv("data/PM10_2.csv", sep=';')

# Sulfur dioxide is the chemical compound with the formula SO2
# It is a toxic gas responsible for the smell of burnt matches 
so2 = pd.read_csv("data/SO2_2.csv", sep=';')

In [3]:
# Drop the first row of each table: it holds the units, not measurements
pm25.drop(0, axis=0, inplace=True)
no2.drop(0, axis=0, inplace=True)
o3.drop(0, axis=0, inplace=True)
pm10.drop(0, axis=0, inplace=True)
so2.drop(0, axis=0, inplace=True)
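As an alternative to dropping the units row after loading, the row can be skipped while reading the file. The sketch below is only an illustration on a toy CSV held in memory: the station names `A1` and `DEF` come from the data, but the values and the units text are made up, and the real files may have a different layout.

```python
import io
import pandas as pd

# Toy stand-in for one of the CSV files: a header row, a units row, then data.
# The values and the units text are invented for this example.
raw = io.StringIO(
    "date;heure;A1;DEF\n"
    ";;microg/m3;microg/m3\n"
    "2020-01-25;1;57;78\n"
    "2020-01-25;2;69;80\n"
)

# skiprows=[1] skips the second line of the file (the units row);
# the header on the first line is still used for the column names.
demo = pd.read_csv(raw, sep=";", skiprows=[1])
print(demo.dtypes)
```

Because the units text never reaches the table, columns without other non-numeric entries are parsed as numbers directly.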

### Check the data

In this section we will manipulate the data a bit, prepare it, and check a few statistics and correlations.

In [4]:
# Let's take the example of PM2.5 and print the first five rows
pm25.head()

Unnamed: 0,date,heure,PA01H,GEN,LOGNES,PA15L,CERGY,NOGENT,A1,DEF,...,RN2,HAUS,AUT,VITRY,RD934,BASCH,ELYS,RAMBO,BP_EST,TREMB
1,2020-01-25,1.0,75,55,31,77,46,49,57,78,...,55,82,90,47,71,83,69,33,62,30
2,2020-01-25,2.0,75,56,24,71,45,45,69,80,...,55,77,93,45,70,92,71,38,66,25
3,2020-01-25,3.0,75,64,26,70,42,38,70,77,...,44,66,66,37,63,67,73,35,59,25
4,2020-01-25,4.0,71,54,24,70,37,41,46,72,...,40,70,95,39,63,63,63,38,57,26
5,2020-01-25,5.0,55,42,23,78,26,42,40,61,...,39,41,80,41,63,60,44,43,42,23


In [5]:
# Let's see the core statistics of PM2.5 on A1 (Paris highway A1)
pm25['A1'].describe()

count     1704
unique     108
top        n/d
freq       147
Name: A1, dtype: object

In [6]:
# Oops, the values are stored as strings (dtype "object"), so we need to convert them to numbers;
# errors='coerce' turns entries that cannot be parsed, such as "n/d", into NaN
pm25['A1'] = pd.to_numeric(pm25['A1'], errors='coerce')
pm25['A1'].describe()
# Much more informative statistics now:

count    1557.000000
mean       37.207450
std        19.714772
min         4.000000
25%        23.000000
50%        33.000000
75%        46.000000
max       132.000000
Name: A1, dtype: float64
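The count went from 1704 to 1557, a difference of 147, which matches the frequency of `n/d` reported earlier: the coerced entries are the missing readings. The same conversion can be applied to every station column at once. The sketch below uses a toy frame with made-up values, and it assumes that every column other than `date` and `heure` is a station column (the real table also has an `Unnamed: 0` index column that would need to be excluded).

```python
import pandas as pd

# Toy frame mimicking the loaded data: every column is text,
# with "n/d" marking a missing reading. Values are invented.
demo = pd.DataFrame({
    "date": ["2020-01-25", "2020-01-25", "2020-01-25"],
    "heure": ["1", "2", "3"],
    "A1": ["57", "n/d", "70"],
    "DEF": ["78", "80", "n/d"],
})

# Assumption: everything except the two time columns is a station column.
station_cols = demo.columns.difference(["date", "heure"])

# Convert each station column; anything that cannot be parsed becomes NaN.
demo[station_cols] = demo[station_cols].apply(pd.to_numeric, errors="coerce")

# Count the missing readings per station after the conversion.
print(demo[station_cols].isna().sum())
```

Another option is to pass `na_values="n/d"` to `pd.read_csv`, so the marker is treated as missing at load time.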

In [7]:
# We want to work on the date and hour format
from datetime import datetime

In [8]:
# Let's print the first rows again (A1 is now numeric)
pm25.head()

Unnamed: 0,date,heure,PA01H,GEN,LOGNES,PA15L,CERGY,NOGENT,A1,DEF,...,RN2,HAUS,AUT,VITRY,RD934,BASCH,ELYS,RAMBO,BP_EST,TREMB
1,2020-01-25,1.0,75,55,31,77,46,49,57.0,78,...,55,82,90,47,71,83,69,33,62,30
2,2020-01-25,2.0,75,56,24,71,45,45,69.0,80,...,55,77,93,45,70,92,71,38,66,25
3,2020-01-25,3.0,75,64,26,70,42,38,70.0,77,...,44,66,66,37,63,67,73,35,59,25
4,2020-01-25,4.0,71,54,24,70,37,41,46.0,72,...,40,70,95,39,63,63,63,38,57,26
5,2020-01-25,5.0,55,42,23,78,26,42,40.0,61,...,39,41,80,41,63,60,44,43,42,23


In [17]:
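# We want to combine the first two columns (date and hour) into a standard datetime
# 'date' looks like "2020-01-25", so the matching format is '%Y-%m-%d';
# 'heure' is a number of hours such as 1.0, added here as a time offset
# The format we want in the end is: ("%b %d %Y %H:%M:%S")

# strptime is a method of the datetime class, not of str
date_object = datetime.strptime(str(pm25['date'][1]), '%Y-%m-%d') + pd.Timedelta(hours=float(pm25['heure'][1]))
print(date_object.strftime("%b %d %Y %H:%M:%S"))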

In [None]:


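# Standalone strptime example; the variable is not named `str`, which would shadow the built-in
date_str = '9-15-18'
date_object = datetime.strptime(date_str, '%m-%d-%y')
print(date_object)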
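To build a timestamp for every row rather than a single one, the two columns can be combined in a vectorised way with pandas. This is a sketch on a toy frame, not the notebook's final method: it assumes `date` is in `YYYY-MM-DD` form and `heure` counts hours from midnight, so a value of 24 (if present in the data) rolls over to midnight of the next day.

```python
import pandas as pd

# Toy frame with the same two time columns as the pollutant tables.
# The hour value 24.0 is included only to show the rollover behaviour.
demo = pd.DataFrame({
    "date": ["2020-01-25", "2020-01-25", "2020-01-25"],
    "heure": [1.0, 2.0, 24.0],
})

# Parse the date, then add the hour as a time offset.
# pd.to_numeric guards against the hour column having been read as text.
demo["timestamp"] = (
    pd.to_datetime(demo["date"], format="%Y-%m-%d")
    + pd.to_timedelta(pd.to_numeric(demo["heure"]), unit="h")
)
print(demo["timestamp"])
```

Adding the hour as an offset, instead of formatting it into the string, avoids a parse error on hour 24, which `%H` does not accept.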