# Clean Data
## Introduction
To measure air quality we can use Sulfur Dioxide (SO2), Nitrogen Dioxide (NO2), Carbon Monoxide (CO), Ozone (O3) or Total Suspended Particles. The data is obtained from OpenAQ, which collects readings from sensors around the world. We focus, however, on our own country during the last year. We chose 2020 because it was a pandemic year and we want to see what happened to air quality during the lockdown.

## Get started

> Note: this notebook uses a Python 3 kernel.

This notebook assumes that the data is already downloaded and stored in ``../data/raw``.

If not, execute:

``python ../src/data/get_data.py``

## Read Files

To begin with, we will use the following Python modules:
- ``os``, a Python built-in package to deal with directories and files
- ``json``, a Python built-in package to encode and decode JSON data
- ``pandas`` to read and process the data
- ``pandas_profiling`` to generate profile reports of the data
- ``display`` (from ``IPython.display``) to visualize the data in the notebook
- ``pickle`` to store the data that has been read and processed as a byte stream, which saves loading time on later runs
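
The pickle caching mentioned in the list above follows a simple "load if cached, otherwise build and store" pattern. The sketch below is only illustrative: `load_or_build` and `build_example` are hypothetical names that do not appear in this notebook, and a temporary directory stands in for ``../data/interim/``.

```python
import os
import tempfile

import pandas as pd


def load_or_build(cache_file, build):
    """Return the cached DataFrame if it exists, otherwise build and cache it."""
    if os.path.exists(cache_file):
        return pd.read_pickle(cache_file)
    data = build()
    data.to_pickle(cache_file)
    return data


def build_example():
    # Hypothetical stand-in for parsing one of the raw JSON files.
    return pd.DataFrame({"name": ["o3", "no2"], "preferredUnit": ["µg/m³", "µg/m³"]})


# A temporary directory is used here instead of the real interim folder.
cache_file = os.path.join(tempfile.mkdtemp(), "example.pkl")
first = load_or_build(cache_file, build_example)   # builds the frame and writes the cache
second = load_or_build(cache_file, build_example)  # reads the cache written above
```

The second call never runs `build_example`, which is where the saving in loading time comes from once the raw files are large.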

In [1]:
import pandas as pd
import os
import json
from pandas_profiling import ProfileReport
from IPython.display import display
import pickle as pkl

# global vars
rel_path="../data/raw/"
clean_path="../data/clean/"
pkl_path="../data/interim/"
rep_path="../reports/"

pkl_parameters = pkl_path + "parameters.pkl"
pkl_countries = pkl_path + "countries.pkl"
pkl_measurements = pkl_path + "measurements.pkl"
pkl_locations = pkl_path + "locations.pkl"

os.makedirs(pkl_path, exist_ok=True)
os.makedirs(rep_path, exist_ok=True)

In [None]:
def read_parameters():
    # Parse the raw JSON and cache the resulting frame as a pickle
    with open(rel_path + 'parameters.json') as f:
        _json = json.load(f)
    data = pd.DataFrame(_json["results"]).set_index('id')
    data.to_pickle(pkl_parameters, compression='infer', protocol=-1)
    return data

# Reuse the cached pickle if it exists; otherwise read the raw JSON
parameters = pd.read_pickle(pkl_parameters) if os.path.exists(pkl_parameters) else read_parameters()
parameters.info()

# Profile report
profile = ProfileReport(parameters, title="Profiling Parameters")
profile.to_widgets()

profile.to_file(rep_path + 'parameters.html')
profile.to_file(rep_path + 'parameters.json')

# Display
parameters = parameters[['displayName','name','preferredUnit']]
display(parameters)

os.makedirs(clean_path, exist_ok=True)
parameters.to_json(clean_path + "parameters.json", orient='records')

``MaxColorValue`` has missing (null) values, but this field is not required for our analysis.
We can see the available parameters and the units applied to them.
This step only serves to learn about the API and the kind of data we will be working with.
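
The missing values noted above can also be checked without the profile report. The following is a minimal sketch on a hypothetical frame; the rows and the exact column spelling `maxColorValue` are assumptions made for illustration, not data taken from the API.

```python
import pandas as pd

# Hypothetical parameter rows; the values are invented for illustration.
params = pd.DataFrame({
    "displayName": ["O3", "NO2", "SO2"],
    "name": ["o3", "no2", "so2"],
    "preferredUnit": ["µg/m³", "µg/m³", "µg/m³"],
    "maxColorValue": [None, 100.0, None],
})

# Count the null values in each column.
missing = params.isna().sum()
print(missing)

# Keep only the columns used later; the column with nulls is dropped.
clean = params[["displayName", "name", "preferredUnit"]]
```

Because the incomplete column is not selected, no imputation step is needed before saving the cleaned file.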

In [None]:
def read_countries():
    # Parse the raw JSON and cache the resulting frame as a pickle
    with open(rel_path + 'countries.json') as f:
        _json = json.load(f)
    data = pd.DataFrame(_json['results'])
    data.to_pickle(pkl_countries, compression='infer', protocol=-1)
    return data

# Reuse the cached pickle if it exists; otherwise read the raw JSON
countries = pd.read_pickle(pkl_countries) if os.path.exists(pkl_countries) else read_countries()
countries.info()

# Profile report
# The 'count' column is excluded: with it, profiling fails with "cannot insert count, already exists"
profile = ProfileReport(countries.loc[:, countries.columns != 'count'], title="Profiling Countries")
profile.to_widgets()

profile.to_file(rep_path + 'countries.html')
profile.to_file(rep_path + 'countries.json')

# Display
countries = countries[['cities','code','name','parameters']]
display(countries)

countries.to_json(clean_path + "countries.json", orient='records')

In [None]:
def read_locations():
    # Parse the raw JSON and cache the resulting frame as a pickle
    with open(rel_path + 'locations_ES_Lleida.json') as f:
        _json = json.load(f)
    data = pd.DataFrame(_json['results'])
    data.to_pickle(pkl_locations, compression='infer', protocol=-1)
    return data

# Reuse the cached pickle if it exists; otherwise read the raw JSON
locations = pd.read_pickle(pkl_locations) if os.path.exists(pkl_locations) else read_locations()
locations.info()

# Profile report
profile = ProfileReport(locations, title="Profiling Locations")
profile.to_widgets()

profile.to_file(rep_path + 'locations.html')
profile.to_file(rep_path + 'locations.json')

locations = locations[['city','country','measurements','name','parameters']]
locations.to_json(clean_path + "locations.json", orient='records')
display(locations)
print(locations.name.values)

In [2]:
for (dirpath, dirnames, filenames) in os.walk(rel_path):
    for filename in filenames:
        if filename.startswith('measurements'):
            # Join with dirpath so files found in subdirectories open correctly
            with open(os.path.join(dirpath, filename)) as f:
                measurements = json.load(f)
            measurements = pd.DataFrame(measurements["results"])
            measurements.info()
            # Keep only the columns needed for the analysis
            measurements = measurements[['location','city','date','parameter','value','unit']]
            # Discard negative readings, which are not valid concentrations
            measurements = measurements[measurements.value >= 0]

            # Profile report (disabled)
            # profile = ProfileReport(measurements, title="Profiling Measurements")
            # profile.to_widgets()
            # profile.to_file(rep_path + 'measurements_day_0.html')
            # profile.to_file(rep_path + 'measurements_day_0.json')

            # Save and display
            measurements.to_json(clean_path + filename, orient='records')
            display(measurements)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19625 entries, 0 to 19624
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   city         19625 non-null  object
 1   coordinates  19625 non-null  object
 2   country      19625 non-null  object
 3   date         19625 non-null  object
 4   entity       19625 non-null  object
 5   isAnalysis   19625 non-null  bool  
 6   isMobile     19625 non-null  bool  
 7   location     19625 non-null  object
 8   locationId   19625 non-null  int64 
 9   parameter    19625 non-null  object
 10  sensorType   19625 non-null  object
 11  unit         19625 non-null  object
 12  value        19625 non-null  int64 
dtypes: bool(2), int64(2), object(9)
memory usage: 1.7+ MB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES2034A,Lleida,"{'local': '2020-12-30T18:00:00+01:00', 'utc': ...",o3,63,µg/m³
1,ES2034A,Lleida,"{'local': '2020-12-30T18:00:00+01:00', 'utc': ...",no2,4,µg/m³
2,ES2034A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",no2,2,µg/m³
3,ES2034A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",o3,65,µg/m³
4,ES2034A,Lleida,"{'local': '2020-12-30T16:00:00+01:00', 'utc': ...",no2,1,µg/m³
...,...,...,...,...,...,...
19620,ES2034A,Lleida,"{'local': '2019-01-01T13:00:00+01:00', 'utc': ...",o3,10,µg/m³
19621,ES2034A,Lleida,"{'local': '2019-01-01T12:00:00+01:00', 'utc': ...",no2,14,µg/m³
19622,ES2034A,Lleida,"{'local': '2019-01-01T12:00:00+01:00', 'utc': ...",o3,11,µg/m³
19623,ES2034A,Lleida,"{'local': '2019-01-01T11:00:00+01:00', 'utc': ...",no2,9,µg/m³


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19628 entries, 0 to 19627
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   city         19628 non-null  object
 1   coordinates  19628 non-null  object
 2   country      19628 non-null  object
 3   date         19628 non-null  object
 4   entity       19628 non-null  object
 5   isAnalysis   19628 non-null  bool  
 6   isMobile     19628 non-null  bool  
 7   location     19628 non-null  object
 8   locationId   19628 non-null  int64 
 9   parameter    19628 non-null  object
 10  sensorType   19628 non-null  object
 11  unit         19628 non-null  object
 12  value        19628 non-null  int64 
dtypes: bool(2), int64(2), object(9)
memory usage: 1.7+ MB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES1348A,Lleida,"{'local': '2020-12-30T16:00:00-01:00', 'utc': ...",o3,65,µg/m³
1,ES1348A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",no2,3,µg/m³
2,ES1348A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",o3,70,µg/m³
3,ES1348A,Lleida,"{'local': '2020-12-30T14:00:00-01:00', 'utc': ...",no2,4,µg/m³
4,ES1348A,Lleida,"{'local': '2020-12-30T14:00:00-01:00', 'utc': ...",o3,70,µg/m³
...,...,...,...,...,...,...
19623,ES1348A,Lleida,"{'local': '2019-01-01T11:00:00-01:00', 'utc': ...",no2,14,µg/m³
19624,ES1348A,Lleida,"{'local': '2019-01-01T10:00:00-01:00', 'utc': ...",no2,18,µg/m³
19625,ES1348A,Lleida,"{'local': '2019-01-01T10:00:00-01:00', 'utc': ...",o3,38,µg/m³
19626,ES1348A,Lleida,"{'local': '2019-01-01T09:00:00-01:00', 'utc': ...",no2,24,µg/m³


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9853 entries, 0 to 9852
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   city         9853 non-null   object
 1   coordinates  9853 non-null   object
 2   country      9853 non-null   object
 3   date         9853 non-null   object
 4   entity       9853 non-null   object
 5   isAnalysis   9853 non-null   bool  
 6   isMobile     9853 non-null   bool  
 7   location     9853 non-null   object
 8   locationId   9853 non-null   int64 
 9   parameter    9853 non-null   object
 10  sensorType   9853 non-null   object
 11  unit         9853 non-null   object
 12  value        9853 non-null   int64 
dtypes: bool(2), int64(2), object(9)
memory usage: 866.1+ KB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES1588A,Lleida,"{'local': '2020-12-30T18:00:00+01:00', 'utc': ...",o3,43,µg/m³
1,ES1588A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",o3,52,µg/m³
2,ES1588A,Lleida,"{'local': '2020-12-30T16:00:00+01:00', 'utc': ...",o3,52,µg/m³
3,ES1588A,Lleida,"{'local': '2020-12-30T15:00:00+01:00', 'utc': ...",o3,55,µg/m³
4,ES1588A,Lleida,"{'local': '2020-12-30T14:00:00+01:00', 'utc': ...",o3,51,µg/m³
...,...,...,...,...,...,...
9848,ES1588A,Lleida,"{'local': '2019-01-01T15:00:00+01:00', 'utc': ...",o3,18,µg/m³
9849,ES1588A,Lleida,"{'local': '2019-01-01T14:00:00+01:00', 'utc': ...",o3,19,µg/m³
9850,ES1588A,Lleida,"{'local': '2019-01-01T13:00:00+01:00', 'utc': ...",o3,20,µg/m³
9851,ES1588A,Lleida,"{'local': '2019-01-01T12:00:00+01:00', 'utc': ...",o3,12,µg/m³


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31163 entries, 0 to 31162
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   city         31163 non-null  object 
 1   coordinates  31163 non-null  object 
 2   country      31163 non-null  object 
 3   date         31163 non-null  object 
 4   entity       31163 non-null  object 
 5   isAnalysis   31163 non-null  bool   
 6   isMobile     31163 non-null  bool   
 7   location     31163 non-null  object 
 8   locationId   31163 non-null  int64  
 9   parameter    31163 non-null  object 
 10  sensorType   31163 non-null  object 
 11  unit         31163 non-null  object 
 12  value        31163 non-null  float64
dtypes: bool(2), float64(1), int64(1), object(9)
memory usage: 2.7+ MB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES0014R,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",no2,1.70,µg/m³
1,ES0014R,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",o3,75.34,µg/m³
2,ES0014R,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",so2,0.34,µg/m³
3,ES0014R,Lleida,"{'local': '2020-12-30T16:00:00+01:00', 'utc': ...",o3,72.62,µg/m³
4,ES0014R,Lleida,"{'local': '2020-12-30T16:00:00+01:00', 'utc': ...",so2,0.37,µg/m³
...,...,...,...,...,...,...
31158,ES0014R,Lleida,"{'local': '2019-01-02T06:00:00+01:00', 'utc': ...",o3,18.37,µg/m³
31159,ES0014R,Lleida,"{'local': '2019-01-02T06:00:00+01:00', 'utc': ...",no2,4.39,µg/m³
31160,ES0014R,Lleida,"{'local': '2019-01-02T05:00:00+01:00', 'utc': ...",so2,0.65,µg/m³
31161,ES0014R,Lleida,"{'local': '2019-01-02T05:00:00+01:00', 'utc': ...",no2,4.93,µg/m³


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38054 entries, 0 to 38053
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   city         38054 non-null  object 
 1   coordinates  38054 non-null  object 
 2   country      38054 non-null  object 
 3   date         38054 non-null  object 
 4   entity       38054 non-null  object 
 5   isAnalysis   38054 non-null  bool   
 6   isMobile     38054 non-null  bool   
 7   location     38054 non-null  object 
 8   locationId   38054 non-null  int64  
 9   parameter    38054 non-null  object 
 10  sensorType   38054 non-null  object 
 11  unit         38054 non-null  object 
 12  value        38054 non-null  float64
dtypes: bool(2), float64(1), int64(1), object(9)
memory usage: 3.3+ MB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES1982A,Lleida,"{'local': '2020-12-30T18:00:00+01:00', 'utc': ...",so2,1.0,µg/m³
1,ES1982A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",no2,1.0,µg/m³
2,ES1982A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",co,100.0,µg/m³
3,ES1982A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",so2,1.0,µg/m³
4,ES1982A,Lleida,"{'local': '2020-12-30T17:00:00+01:00', 'utc': ...",o3,69.0,µg/m³
...,...,...,...,...,...,...
38049,ES1982A,Lleida,"{'local': '2019-01-01T12:00:00+01:00', 'utc': ...",so2,0.1,µg/m³
38050,ES1982A,Lleida,"{'local': '2019-01-01T11:00:00+01:00', 'utc': ...",o3,91.0,µg/m³
38051,ES1982A,Lleida,"{'local': '2019-01-01T11:00:00+01:00', 'utc': ...",so2,0.1,µg/m³
38052,ES1982A,Lleida,"{'local': '2019-01-01T11:00:00+01:00', 'utc': ...",co,100.0,µg/m³


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9807 entries, 0 to 9806
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   city         9807 non-null   object
 1   coordinates  9807 non-null   object
 2   country      9807 non-null   object
 3   date         9807 non-null   object
 4   entity       9807 non-null   object
 5   isAnalysis   9807 non-null   bool  
 6   isMobile     9807 non-null   bool  
 7   location     9807 non-null   object
 8   locationId   9807 non-null   int64 
 9   parameter    9807 non-null   object
 10  sensorType   9807 non-null   object
 11  unit         9807 non-null   object
 12  value        9807 non-null   int64 
dtypes: bool(2), int64(2), object(9)
memory usage: 862.1+ KB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES1248A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",o3,58,µg/m³
1,ES1248A,Lleida,"{'local': '2020-12-30T14:00:00-01:00', 'utc': ...",o3,64,µg/m³
2,ES1248A,Lleida,"{'local': '2020-12-30T13:00:00-01:00', 'utc': ...",o3,66,µg/m³
3,ES1248A,Lleida,"{'local': '2020-12-30T12:00:00-01:00', 'utc': ...",o3,64,µg/m³
4,ES1248A,Lleida,"{'local': '2020-12-30T11:00:00-01:00', 'utc': ...",o3,41,µg/m³
...,...,...,...,...,...,...
9802,ES1248A,Lleida,"{'local': '2019-01-02T15:00:00-01:00', 'utc': ...",o3,90,µg/m³
9803,ES1248A,Lleida,"{'local': '2019-01-02T14:00:00-01:00', 'utc': ...",o3,91,µg/m³
9804,ES1248A,Lleida,"{'local': '2019-01-02T13:00:00-01:00', 'utc': ...",o3,90,µg/m³
9805,ES1248A,Lleida,"{'local': '2019-01-02T12:00:00-01:00', 'utc': ...",o3,88,µg/m³


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37883 entries, 0 to 37882
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   city         37883 non-null  object
 1   coordinates  37883 non-null  object
 2   country      37883 non-null  object
 3   date         37883 non-null  object
 4   entity       37883 non-null  object
 5   isAnalysis   37883 non-null  bool  
 6   isMobile     37883 non-null  bool  
 7   location     37883 non-null  object
 8   locationId   37883 non-null  int64 
 9   parameter    37883 non-null  object
 10  sensorType   37883 non-null  object
 11  unit         37883 non-null  object
 12  value        37883 non-null  int64 
dtypes: bool(2), int64(2), object(9)
memory usage: 3.3+ MB


Unnamed: 0,location,city,date,parameter,value,unit
0,ES1225A,Lleida,"{'local': '2020-12-30T16:00:00-01:00', 'utc': ...",no2,8,µg/m³
1,ES1225A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",so2,1,µg/m³
2,ES1225A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",co,200,µg/m³
3,ES1225A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",o3,66,µg/m³
4,ES1225A,Lleida,"{'local': '2020-12-30T15:00:00-01:00', 'utc': ...",no2,5,µg/m³
...,...,...,...,...,...,...
37878,ES1225A,Lleida,"{'local': '2019-01-01T10:00:00-01:00', 'utc': ...",co,200,µg/m³
37879,ES1225A,Lleida,"{'local': '2019-01-01T09:00:00-01:00', 'utc': ...",so2,2,µg/m³
37880,ES1225A,Lleida,"{'local': '2019-01-01T09:00:00-01:00', 'utc': ...",o3,9,µg/m³
37881,ES1225A,Lleida,"{'local': '2019-01-01T09:00:00-01:00', 'utc': ...",no2,14,µg/m³
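
In the outputs above, the ``date`` column holds a dictionary with ``local`` and ``utc`` keys rather than a timestamp. A possible next cleaning step, sketched here on hypothetical rows, is to flatten it into a single datetime column. The ``utc`` strings below are assumed to be ISO 8601 values, since the real ones are truncated in the display.

```python
import pandas as pd

# Hypothetical rows shaped like the records above; the 'utc' strings are assumed.
sample = pd.DataFrame({
    "location": ["ES2034A", "ES2034A", "ES2034A"],
    "parameter": ["o3", "no2", "no2"],
    "value": [63, 4, -1],
    "date": [
        {"local": "2020-12-30T18:00:00+01:00", "utc": "2020-12-30T17:00:00Z"},
        {"local": "2020-12-30T18:00:00+01:00", "utc": "2020-12-30T17:00:00Z"},
        {"local": "2020-12-30T17:00:00+01:00", "utc": "2020-12-30T16:00:00Z"},
    ],
})

# Same filter as in the notebook: drop negative readings.
sample = sample[sample.value >= 0].copy()

# Pull the 'utc' entry out of each dictionary and parse it as a timezone-aware timestamp.
sample["utc"] = pd.to_datetime(sample["date"].map(lambda d: d["utc"]), utc=True)
sample = sample.drop(columns="date")
print(sample)
```

A plain datetime column makes it straightforward to select the 2020 readings or to resample by day in later notebooks.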
