# Lille Metropole Dataset - Air Quality

## 0. Setting-Up

#### Some context

https://fr.wikipedia.org/wiki/Indice_de_qualit%C3%A9_de_l%27air

The new ATMO air quality index is a daily indicator graded from 1 (good) to 6 (extremely poor) that gives a simple, overall picture of the air quality in an urban area.

It is made up of 5 sub-indices, each representing one air pollutant:
- nitrogen dioxide (NO2, or code_no2)
- sulphur dioxide (SO2, or code_so2)
- ozone (O3, or code_o3)
- fine particles smaller than 10 µm (PM10, or code_pm10)
- fine particles smaller than 2.5 µm (PM2.5, or code_pm25)

The highest sub-index determines the index of the day.

It is computed from measurements taken at stations representative of background pollution. It does not account for proximity effects (road traffic or industry).

| O3 (µg/m³) | SO2 (µg/m³) | NO2 (µg/m³) | PM10 (µg/m³) | PM2.5 (µg/m³) | Level (`lib_qual`)                   |
|------------|-------------|-------------|--------------|---------------|--------------------------------------|
| 0 to 50    | 0 to 100    | 0 to 40     | 0 to 20      | 0 to 10       | Good (Bon)                           |
| 50 to 100  | 100 to 200  | 40 to 90    | 20 to 40     | 10 to 20      | Moderate (Moyen)                     |
| 100 to 130 | 200 to 350  | 90 to 120   | 40 to 50     | 20 to 25      | Degraded (Dégradé)                   |
| 130 to 240 | 350 to 500  | 120 to 230  | 50 to 100    | 25 to 50      | Poor (Mauvais)                       |
| 240 to 380 | 500 to 750  | 230 to 340  | 100 to 150   | 50 to 75      | Very poor (Très mauvais)             |
| > 380      | > 750       | > 340       | > 150        | > 75          | Extremely poor (Extrêmement mauvais) |
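
As a rough illustration of how the table above turns into an index, the sketch below maps a concentration to its sub-index and takes the maximum across pollutants. The names `UPPER_BOUNDS`, `sub_index` and `atmo_index` are hypothetical, concentrations are assumed to be in µg/m³, and each upper bound is assumed to belong to the lower class (the table does not say on which side a boundary value falls).

```python
from bisect import bisect_left

# Upper bounds of classes 1 to 5 for each pollutant, copied from the table above.
# Anything above the last bound falls into class 6.
UPPER_BOUNDS = {
    "o3": [50, 100, 130, 240, 380],
    "so2": [100, 200, 350, 500, 750],
    "no2": [40, 90, 120, 230, 340],
    "pm10": [20, 40, 50, 100, 150],
    "pm25": [10, 20, 25, 50, 75],
}


def sub_index(pollutant: str, concentration: float) -> int:
    """Return the sub-index (1 to 6) for one pollutant concentration."""
    # Assumption: a value equal to an upper bound stays in the lower class.
    return bisect_left(UPPER_BOUNDS[pollutant], concentration) + 1


def atmo_index(concentrations: dict) -> int:
    """Return the overall index: the highest of the sub-indices."""
    return max(sub_index(p, c) for p, c in concentrations.items())


# Made-up example values, not taken from the dataset.
example = {"o3": 45, "no2": 95, "pm10": 30}
print(atmo_index(example))
```

In the dataset itself the sub-indices are already provided (`code_no2`, `code_o3`, ...), so only the "take the maximum" step would apply there.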

#### Importing Libraries & Modules

In [64]:
import pandas as pd
import requests
# from arcgis.features import FeatureLayer

## 1. Collecting the Data

#### Verifying api response

In [65]:
#alternative source: https://data-atmo-hdf.opendata.arcgis.com/search?collection=Dataset
#alternative source: https://services8.arcgis.com/rxZzohbySMKHTNcy/ArcGIS/rest/services/ind_hdf_2021/FeatureServer/0
dataset_id = "indice-qualite-de-lair"
export_format = "json"  # named export_format to avoid shadowing the built-in format()
limit = "10"
r = requests.get(f"https://opendata.lillemetropole.fr/api/v2/catalog/datasets/{dataset_id}/exports/{export_format}?limit={limit}", timeout=2)

print(f"URL: {r.url}")
print(f"HTTP Response Status Code: {r.status_code}") 
print(f"HTTP Error: {r.raise_for_status()}")
print(f"Encoding: {r.encoding}")
print(f"Header content type: {r.headers.get('content-type')}")
print(f"Cookies: {r.cookies}")

r.close()


URL: https://opendata.lillemetropole.fr/api/v2/catalog/datasets/indice-qualite-de-lair/exports/json?limit=10
HTTP Response Status Code: 200
HTTP Error: None
Encoding: utf-8
Header content type: application/json; charset=utf-8
Cookies: <RequestsCookieJar[]>
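
Since the same URL is assembled again for the full extract, a small helper can keep the two calls consistent. This is only a sketch: `BASE_URL` and `build_export_url` are hypothetical names, and the URL layout is simply the one used in the cell above.

```python
from urllib.parse import urlencode

# Base of the export endpoint, as used in the request above.
BASE_URL = "https://opendata.lillemetropole.fr/api/v2/catalog/datasets"


def build_export_url(dataset_id: str, export_format: str = "json", limit: int = 10) -> str:
    """Build the export URL for a dataset; limit=-1 requests the full dataset."""
    query = urlencode({"limit": limit})
    return f"{BASE_URL}/{dataset_id}/exports/{export_format}?{query}"


url = build_export_url("indice-qualite-de-lair", "json", 10)
print(url)
```

The resulting string can be passed to `requests.get(url, timeout=...)` or to `pd.read_json(url)`.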


#### Extracting full dataset into dataframe

In [66]:
dataset_id = "indice-qualite-de-lair"
export_format = "json"  # named export_format to avoid shadowing the built-in format()
limit = "-1"  # -1 returns the full dataset
df = pd.read_json(f"https://opendata.lillemetropole.fr/api/v2/catalog/datasets/{dataset_id}/exports/{export_format}?limit={limit}")

df.head()

Unnamed: 0,date_ech,code_qual,lib_qual,coul_qual,date_dif,source,type_zone,code_zone,lib_zone,code_no2,...,code_pm25,x_wgs84,y_wgs84,x_reg,y_reg,epsg_reg,objectid,geo_shape,geo_point_2d,code_posta
0,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59009,VILLENEUVE D ASCQ,2,...,2,3.15312,50.63246,710851,7059508,2154,809,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.153119999999999, 'lat': 50.63246}",59650
1,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59017,ARMENTIERES,1,...,2,2.88,50.69126,691505,7066057,2154,817,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.88, 'lat': 50.69125999999999}",59280
2,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59106,BOUVINES,2,...,2,3.19263,50.58194,713665,7053885,2154,902,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.19263, 'lat': 50.58194000000001}",59830
3,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59128,CAPINGHEM,2,...,2,2.96453,50.64707,697487,7061127,2154,924,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.96453, 'lat': 50.64706999999999}",59160
4,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59133,CARNIN,1,...,2,2.96055,50.5193,697198,7046891,2154,929,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.960549999999999, 'lat': 50.5193}",59112


In [67]:
df.tail()

Unnamed: 0,date_ech,code_qual,lib_qual,coul_qual,date_dif,source,type_zone,code_zone,lib_zone,code_no2,...,code_pm25,x_wgs84,y_wgs84,x_reg,y_reg,epsg_reg,objectid,geo_shape,geo_point_2d,code_posta
26367,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59346,LEZENNES,1,...,1,3.11762,50.61179,708339,7057201,2154,861350,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.117619999999999, 'lat': 50.611789999...",59260
26368,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59368,MADELEINE,1,...,1,3.06997,50.65441,704956,7061946,2154,861370,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.06997, 'lat': 50.65441}",59110
26369,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59437,NOYELLES LES SECLIN,1,...,2,3.02256,50.57303,701600,7052876,2154,861434,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.02256, 'lat': 50.573029999999996}",59139
26370,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59482,QUESNOY SUR DEULE,1,...,2,3.00715,50.71302,700506,7068475,2154,861478,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.00715, 'lat': 50.71301999999999}",59890
26371,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59487,RADINGHEM EN WEPPES,1,...,2,2.89797,50.62849,692768,7059060,2154,861483,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.89797, 'lat': 50.62849}",59320


## 2. Preparing Data for Analysis

#### First basic information about the DataFrame

In [68]:
# df.info() conveniently combines several checks in one call:
# - null values per column (instead of `df.isnull().sum()`)
# - dtype of each column (instead of `df.dtypes`)
# - shape of the DataFrame (instead of `df.shape`)
# - estimated memory usage (instead of `df.memory_usage()`)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26372 entries, 0 to 26371
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date_ech      26372 non-null  object 
 1   code_qual     26372 non-null  int64  
 2   lib_qual      26372 non-null  object 
 3   coul_qual     26372 non-null  object 
 4   date_dif      26372 non-null  object 
 5   source        26372 non-null  object 
 6   type_zone     26372 non-null  object 
 7   code_zone     26372 non-null  int64  
 8   lib_zone      26372 non-null  object 
 9   code_no2      26372 non-null  int64  
 10  code_so2      26372 non-null  int64  
 11  code_o3       26372 non-null  int64  
 12  code_pm10     26372 non-null  int64  
 13  code_pm25     26372 non-null  int64  
 14  x_wgs84       26372 non-null  float64
 15  y_wgs84       26372 non-null  float64
 16  x_reg         26372 non-null  int64  
 17  y_reg         26372 non-null  int64  
 18  epsg_reg      26372 non-nu

One issue is the number of `object` dtypes: with more specific types (string, datetime...) we could use more of pandas' type-specific functions.
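
A minimal sketch of what more specific dtypes could look like, on a small hand-made frame rather than the real `df`. The choice of `category` for `lib_qual` and `string` for `code_zone` (treated here as an identifier rather than a quantity) is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Hand-made sample with the same column names as the dataset.
sample = pd.DataFrame({
    "lib_qual": ["Moyen", "Moyen", "Mauvais"],
    "lib_zone": ["BOUVINES", "CARNIN", "LEZENNES"],
    "code_zone": [59106, 59133, 59346],
})

# Cast each column to a more specific dtype in one call.
typed = sample.astype({
    "lib_qual": "category",  # few repeated labels
    "lib_zone": "string",    # free text
    "code_zone": "string",   # identifier, not a number to compute with
})
print(typed.dtypes)
```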

In [69]:
df["date_ech"].value_counts(ascending=True)

2022-01-01T01:00:00+00:00     95
2022-05-13T02:00:00+00:00     95
2022-05-14T02:00:00+00:00     95
2022-05-15T02:00:00+00:00     95
2022-05-16T02:00:00+00:00     95
                            ... 
2022-01-05T01:00:00+00:00    190
2022-01-04T01:00:00+00:00    190
2022-07-29T02:00:00+00:00    190
2022-07-28T02:00:00+00:00    190
2022-01-06T01:00:00+00:00    223
Name: date_ech, Length: 269, dtype: int64

In [70]:
df["date_dif"].value_counts(ascending=True)

2022-01-02T14:10:06+00:00     95
2022-05-13T14:10:06+00:00     95
2022-05-14T14:10:06+00:00     95
2022-05-15T14:10:06+00:00     95
2022-05-16T14:10:07+00:00     95
                            ... 
2022-01-05T14:10:07+00:00    190
2022-07-29T14:10:07+00:00    190
2022-07-08T14:10:07+00:00    190
2022-01-07T14:10:06+00:00    223
2022-09-29T14:10:07+00:00    285
Name: date_dif, Length: 267, dtype: int64

Looking at the date columns, it seems that we should have 95 observations per day (i.e. one per city, for 95 cities), so any higher count is suspicious. Let's verify this assumption.
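
One way to check this assumption is to compare, for each date, the number of rows with the number of distinct zones. The sketch below runs on a tiny made-up frame, and the column names `rows`, `zones` and `extra_rows` are hypothetical.

```python
import pandas as pd

# Made-up sample: zone 59017 appears twice on the second day.
sample = pd.DataFrame({
    "date_ech": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02", "2022-01-02"],
    "code_zone": [59009, 59017, 59009, 59017, 59017],
})

# Rows per day versus distinct zones per day.
per_day = sample.groupby("date_ech")["code_zone"].agg(rows="size", zones="nunique")
per_day["extra_rows"] = per_day["rows"] - per_day["zones"]

# Days where at least one zone is reported more than once.
suspicious = per_day[per_day["extra_rows"] > 0]
print(suspicious)
```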

#### Dealing with Duplicates

Running `df.duplicated().sum()` raises *TypeError: unhashable type: 'dict'*. This is because the values in columns **geo_shape** and **geo_point_2d** are dictionaries, which cannot be hashed. I'll exclude them.

In [76]:
# check duplication of df excluding geo_shape and geo_point_2d
df.duplicated(df.columns.difference(['geo_shape', 'geo_point_2d'])).sum()

0

I found this result strange given my initial observations from the dates' **value_counts**, so I dug further by adding columns one by one to see when duplicates emerge.

In [78]:
df.duplicated(['date_ech', 'date_dif', 'code_zone', 'lib_zone', 'code_qual', 'lib_qual', 'coul_qual', 'source', 'type_zone', 'code_no2', 'code_so2', 'code_o3', 'code_pm10', 'code_pm25', 'x_wgs84', 'y_wgs84', 'x_reg', 'y_reg', 'epsg_reg']).sum()

817

In [79]:
df.duplicated(['date_ech', 'date_dif', 'code_zone', 'lib_zone', 'code_qual', 'lib_qual', 'coul_qual', 'source', 'type_zone', 'code_no2', 'code_so2', 'code_o3', 'code_pm10', 'code_pm25', 'x_wgs84', 'y_wgs84', 'x_reg', 'y_reg', 'epsg_reg', 'objectid']).sum()

0

The only difference between the two lines of code above is the presence of **objectid** in the second one. In other words, there are 817 duplicates if we ignore **objectid**. Since all other variables are identical and we expect one observation per day per city, I will remove **objectid**, then remove duplicates.
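
As an aside, duplicates can also be detected and removed while keeping **objectid**, by passing every other column as `subset`. This is a sketch on made-up rows, not the approach used in the cells that follow. `key_columns` is a hypothetical name.

```python
import pandas as pd

# Made-up sample: the first two rows differ only by objectid.
sample = pd.DataFrame({
    "date_ech": ["2022-01-06", "2022-01-06", "2022-01-06"],
    "code_zone": [59009, 59009, 59017],
    "code_qual": [2, 2, 3],
    "objectid": [101, 202, 303],
})

# Compare rows on every column except objectid.
key_columns = sample.columns.difference(["objectid"]).tolist()
n_duplicates = sample.duplicated(subset=key_columns).sum()

# Keep the first occurrence of each duplicated group.
deduplicated = sample.drop_duplicates(subset=key_columns, keep="first")
print(n_duplicates, len(deduplicated))
```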

In [80]:
# removing the objectid column
df.drop(columns=['objectid'], inplace=True)

# converting geo_shape and geo_point_2d to strings so that rows become hashable
df['geo_shape'] = df.geo_shape.astype(str)
df['geo_point_2d'] = df.geo_point_2d.astype(str)
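
Note that `astype(str)` relies on the Python representation of each dict, which depends on key order. A possible alternative, sketched here on made-up values, is to serialise with `json.dumps(..., sort_keys=True)`, which gives order-independent strings that can be parsed back later.

```python
import json

import pandas as pd

# Made-up sample: same point, but the dict keys are in a different order.
sample = pd.DataFrame({
    "code_zone": [59009, 59009],
    "geo_point_2d": [
        {"lon": 3.15312, "lat": 50.63246},
        {"lat": 50.63246, "lon": 3.15312},
    ],
})

# Serialise each dict with sorted keys so equal dicts give equal strings.
sample["geo_point_2d"] = sample["geo_point_2d"].apply(
    lambda d: json.dumps(d, sort_keys=True)
)
print(sample.duplicated().sum())
```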

In [81]:
df.duplicated().sum()

817

In [83]:
# drop_duplicates() returns a new DataFrame, so assign it back to keep the result
df = df.drop_duplicates()
df

Unnamed: 0,date_ech,code_qual,lib_qual,coul_qual,date_dif,source,type_zone,code_zone,lib_zone,code_no2,...,code_pm10,code_pm25,x_wgs84,y_wgs84,x_reg,y_reg,epsg_reg,geo_shape,geo_point_2d,code_posta
0,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59009,VILLENEUVE D ASCQ,2,...,2,2,3.15312,50.63246,710851,7059508,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.153119999999999, 'lat': 50.63246}",59650
1,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59017,ARMENTIERES,1,...,2,2,2.88000,50.69126,691505,7066057,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.88, 'lat': 50.69125999999999}",59280
2,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59106,BOUVINES,2,...,2,2,3.19263,50.58194,713665,7053885,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.19263, 'lat': 50.58194000000001}",59830
3,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59128,CAPINGHEM,2,...,2,2,2.96453,50.64707,697487,7061127,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.96453, 'lat': 50.64706999999999}",59160
4,2022-01-01T01:00:00+00:00,2,Moyen,#50CCAA,2022-01-02T14:10:06+00:00,Atmo HDF,commune,59133,CARNIN,1,...,2,2,2.96055,50.51930,697198,7046891,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 2.960549999999999, 'lat': 50.5193}",59112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26367,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59346,LEZENNES,1,...,2,1,3.11762,50.61179,708339,7057201,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.117619999999999, 'lat': 50.611789999...",59260
26368,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59368,MADELEINE,1,...,2,1,3.06997,50.65441,704956,7061946,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.06997, 'lat': 50.65441}",59110
26369,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59437,NOYELLES LES SECLIN,1,...,2,2,3.02256,50.57303,701600,7052876,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.02256, 'lat': 50.573029999999996}",59139
26370,2022-08-10T02:00:00+00:00,4,Mauvais,#FF5050,2022-08-11T14:10:07+00:00,Atmo HDF,commune,59482,QUESNOY SUR DEULE,1,...,2,2,3.00715,50.71302,700506,7068475,2154,"{'type': 'Feature', 'geometry': {'coordinates'...","{'lon': 3.00715, 'lat': 50.71301999999999}",59890


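The date columns follow an ISO 8601 pattern with a UTC offset (`yyyy-MM-dd'T'HH:mm:ssXXX`), e.g. `2022-01-01T01:00:00+00:00`; there is no millisecond part.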
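
Given that pattern, the two date columns could be parsed into timezone-aware datetimes with `pd.to_datetime`. The sketch below uses two values copied from the outputs above. The `delay` column is a hypothetical addition, shown only to illustrate what the datetime dtype enables.

```python
import pandas as pd

# Two date pairs copied from the head and tail of the dataset.
sample = pd.DataFrame({
    "date_ech": ["2022-01-01T01:00:00+00:00", "2022-08-10T02:00:00+00:00"],
    "date_dif": ["2022-01-02T14:10:06+00:00", "2022-08-11T14:10:07+00:00"],
})

# Parse the ISO 8601 strings into UTC datetimes.
for column in ["date_ech", "date_dif"]:
    sample[column] = pd.to_datetime(sample[column], utc=True)

# Datetime arithmetic now works, e.g. time between the two dates.
sample["delay"] = sample["date_dif"] - sample["date_ech"]
print(sample.dtypes)
```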

In [36]:
df.to_csv(r'da_data_raw\indice-qualite-de-lair.csv', index = False)

In [31]:
dfn = df.convert_dtypes()
dfn.dtypes

date_ech         string
code_qual         Int64
lib_qual         string
coul_qual        string
date_dif         string
source           string
type_zone        string
code_zone         Int64
lib_zone         string
code_no2          Int64
code_so2          Int64
code_o3           Int64
code_pm10         Int64
code_pm25         Int64
x_wgs84         Float64
y_wgs84         Float64
x_reg             Int64
y_reg             Int64
epsg_reg          Int64
objectid          Int64
geo_shape        object
geo_point_2d     object
code_posta        Int64
dtype: object