# Clean Data
## Introduction
Air quality can be measured through the concentration of sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3), or total suspended particulates. The data provided by OpenAQ comes from sensors around the world; here we focus on our country, Spain, during the last year. We chose 2020 because it was the year of the pandemic, which lets us see what happened to air quality during the lockdown.

## Get started

> Note: this notebook uses a Python 3 kernel.

This notebook assumes the data is already downloaded and stored at ``../data/raw``.

If not, execute:

``python ../src/data/get_data.py``
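
A minimal sketch of that check, assuming the three file names that the cells below read (``parameters.json``, ``countries.json``, ``measurements_day_0.json``). The helper ``missing_raw_files`` is hypothetical and not part of the project:

```python
import os

# File names taken from the cells below; adjust if get_data.py writes others.
EXPECTED_FILES = ("parameters.json", "countries.json", "measurements_day_0.json")

def missing_raw_files(raw_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(raw_dir, name))]

missing = missing_raw_files("../data/raw")
if missing:
    print("Missing raw files:", missing)
    print("Run: python ../src/data/get_data.py")
```

An empty list means every expected file is already in place and the download step can be skipped.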

## Read Files

We will initially use the following Python modules:
- ``os`` Python built-in package to work with directories and files
- ``json`` Python built-in package to decode/encode JSON data
- ``pandas`` to read and process the data
- ``pandas_profiling`` to generate a profile report of each dataset
- ``display`` (from ``IPython.display``) to render the data in the notebook
- ``pickle`` Python built-in package to store the processed data as a byte stream and save loading time
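
The caching idea behind ``pickle`` (load the stored copy if it exists, otherwise parse the source and store the result) can be sketched in isolation. ``load_or_build`` and ``build_toy`` are hypothetical names used only for illustration, and the toy data stands in for the real JSON files:

```python
import os
import tempfile
import pandas as pd

def load_or_build(pkl_file, build):
    """Return the cached DataFrame if pkl_file exists; otherwise build and cache it."""
    if os.path.exists(pkl_file):
        return pd.read_pickle(pkl_file)
    data = build()
    data.to_pickle(pkl_file)
    return data

# Demonstration with a throwaway directory and toy data
calls = []

def build_toy():
    calls.append(1)  # record each time the slow path runs
    return pd.DataFrame({"name": ["pm10", "pm25"],
                         "preferredUnit": ["µg/m³", "µg/m³"]})

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy.pkl")
    first = load_or_build(cache, build_toy)   # builds and writes the pickle
    second = load_or_build(cache, build_toy)  # reads the pickle; build_toy is not called again
```

Note that a stale pickle is never refreshed by this pattern, so the cached file has to be deleted by hand when the raw data changes.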

In [1]:
import pandas as pd
import os
import json
from pandas_profiling import ProfileReport
from IPython.display import display
import pickle as pkl

# global vars
rel_path="../data/raw/"
clean_path="../data/clean/"
pkl_path="../data/interim/"
rep_path="../reports/"

pkl_parameters = pkl_path + "parameters.pkl"
pkl_countries = pkl_path + "countries.pkl"
pkl_measurements = pkl_path + "measurements.pkl"

os.makedirs(pkl_path, exist_ok=True)
os.makedirs(rep_path, exist_ok=True)

def read_parameters():
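    # Use a context manager so the file is closed, and read it as UTF-8 (units such as µg/m³)
    with open(rel_path + 'parameters.json', encoding='utf-8') as f:
        _json = json.load(f)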
    data = pd.DataFrame(_json["results"]).set_index('id')
    data.to_pickle(pkl_parameters, compression='infer', protocol=-1)
    return data

parameters = pkl.load(open(pkl_parameters, 'rb')) if os.path.exists(pkl_parameters) else read_parameters()
parameters.info()

# Profile report
profile = ProfileReport(parameters, title="Profiling Parameters")
profile.to_widgets()

profile.to_file(rep_path + 'parameters.html')
profile.to_file(rep_path + 'parameters.json')

# Display
parameters = parameters[['displayName','name','preferredUnit']]
display(parameters)

os.makedirs(clean_path, exist_ok=True)
parameters.to_json(clean_path + "parameters.json", orient='records')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22 entries, 1 to 19843
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   description    22 non-null     object 
 1   displayName    22 non-null     object 
 2   isCore         22 non-null     bool   
 3   maxColorValue  11 non-null     float64
 4   name           22 non-null     object 
 5   preferredUnit  22 non-null     object 
dtypes: bool(1), float64(1), object(4)
memory usage: 1.1+ KB

| id | displayName | name | preferredUnit |
|---|---|---|---|
| 1 | PM10 | pm10 | µg/m³ |
| 2 | PM2.5 | pm25 | µg/m³ |
| 3 | O₃ mass | o3 | µg/m³ |
| 4 | CO mass | co | µg/m³ |
| 5 | NO₂ mass | no2 | µg/m³ |
| 6 | SO₂ mass | so2 | µg/m³ |
| 7 | NO₂ | no2 | ppm |
| 8 | CO | co | ppm |
| 9 | SO₂ | so2 | ppm |
| 10 | O₃ | o3 | ppm |
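
The table above lists several pollutants twice, once per unit (for example ``no2`` in µg/m³ and in ppm), so ``name`` alone does not identify a parameter. A small sketch, using a few rows copied from that table, that collects the units available for each name:

```python
import pandas as pd

# Rows copied from the parameters table above
params = pd.DataFrame(
    {
        "displayName": ["PM10", "O₃ mass", "NO₂ mass", "NO₂", "O₃"],
        "name": ["pm10", "o3", "no2", "no2", "o3"],
        "preferredUnit": ["µg/m³", "µg/m³", "µg/m³", "ppm", "ppm"],
    },
    index=pd.Index([1, 3, 5, 7, 10], name="id"),
)

# One list of units per pollutant name
units_per_name = params.groupby("name")["preferredUnit"].apply(list)
print(units_per_name)
```

When comparing values across sources, filtering on both ``name`` and unit avoids mixing mass concentrations with ppm readings.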


In [2]:
def read_countries():
    # Use a context manager so the file is closed, and read it as UTF-8
    with open(rel_path + 'countries.json', encoding='utf-8') as f:
        _json = json.load(f)
    data = pd.DataFrame(_json['results'])
    data.to_pickle(pkl_countries, compression='infer', protocol=-1)
    return data

countries = pkl.load(open(pkl_countries, 'rb')) if os.path.exists(pkl_countries) else read_countries()
countries.info()

# Profile report
profile = ProfileReport(countries.loc[:, countries.columns != 'count'], title="Profiling Countries") # cannot insert count, already exists
profile.to_widgets()

profile.to_file(rep_path + 'countries.html')
profile.to_file(rep_path + 'countries.json')

# Display
countries = countries[['cities','code','name','parameters']]
display(countries)

spain = countries[countries['name'] == 'Spain']
display(spain)

print(spain['parameters'].values[0])

countries.to_json(clean_path + "countries.json", orient='records')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   cities        130 non-null    int64 
 1   code          130 non-null    object
 2   count         130 non-null    int64 
 3   firstUpdated  130 non-null    object
 4   lastUpdated   130 non-null    object
 5   locations     130 non-null    int64 
 6   name          130 non-null    object
 7   parameters    130 non-null    object
 8   sources       130 non-null    int64 
dtypes: int64(4), object(5)
memory usage: 9.3+ KB

|  | cities | code | name | parameters |
|---|---|---|---|---|
| 0 | 2 | AD | Andorra | [co, no2, o3, pm10, so2] |
| 1 | 3 | AE | United Arab Emirates | [o3, pm1, pm10, pm25, um010, um025, um100] |
| 2 | 2 | AF | Afghanistan | [pm25] |
| 3 | 0 | AM | Armenia | [pm1, pm10, pm25, um010, um025, um100] |
| 4 | 0 | AO | Angola | [pm1, pm10, pm25, um010, um025, um100] |
| ... | ... | ... | ... | ... |
| 125 | 1 | UZ | Uzbekistan | [o3, pm25] |
| 126 | 1 | VM | VM | [pm25] |
| 127 | 2 | VN | Vietnam | [pm1, pm10, pm25, um010, um025, um100] |
| 128 | 8 | XK | Kosovo | [co, no2, o3, pm1, pm10, pm25, so2, um010, um0... |


|  | cities | code | name | parameters |
|---|---|---|---|---|
| 38 | 59 | ES | Spain | [co, no2, o3, pm1, pm10, pm25, so2, um010, um0... |


['co', 'no2', 'o3', 'pm1', 'pm10', 'pm25', 'so2', 'um010', 'um025', 'um100']
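
Because ``parameters`` holds a list per country, checking which countries report a given pollutant needs an element-wise membership test. A sketch with rows copied from the output above; ``countries_reporting`` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# Rows copied from the countries output above
countries_sample = pd.DataFrame(
    {
        "code": ["AD", "AF", "UZ"],
        "name": ["Andorra", "Afghanistan", "Uzbekistan"],
        "parameters": [["co", "no2", "o3", "pm10", "so2"], ["pm25"], ["o3", "pm25"]],
    }
)

def countries_reporting(df, parameter):
    """Return the rows whose 'parameters' list contains the given parameter."""
    mask = df["parameters"].apply(lambda params: parameter in params)
    return df[mask]

print(countries_reporting(countries_sample, "o3")["name"].tolist())
# → ['Andorra', 'Uzbekistan']
```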


In [3]:
def read_measurements():
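    # Use a context manager so the file is closed, and read it as UTF-8 (units such as µg/m³)
    with open(rel_path + 'measurements_day_0.json', encoding='utf-8') as f:
        _json = json.load(f)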
    data = pd.DataFrame(_json['results'])
    data.to_pickle(pkl_measurements, compression='infer', protocol=-1)
    return data

measurements = pkl.load(open(pkl_measurements, 'rb')) if os.path.exists(pkl_measurements) else read_measurements()
measurements.info()

# Profile report
profile = ProfileReport(measurements, title="Profiling Measurements")
profile.to_widgets()
#measurements.loc['2020-12-31','o3']

profile.to_file(rep_path + 'measurements_day_0.html')
profile.to_file(rep_path + 'measurements_day_0.json')

# Display
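measurements = measurements[['average', 'day', 'parameter', 'unit']].copy()
# to_datetime returns a new Series, so assign it back to the column
measurements['day'] = pd.to_datetime(measurements['day'], format='%Y-%m-%d')

# date_format='iso' writes the days as ISO 8601 strings instead of epoch milliseconds
measurements.to_json(clean_path + "measurements.json", orient='records', date_format='iso')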
display(measurements)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2182 entries, 0 to 2181
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   average            2182 non-null   float64
 1   day                2182 non-null   object 
 2   displayName        2182 non-null   object 
 3   id                 2182 non-null   int64  
 4   measurement_count  2182 non-null   int64  
 5   name               2182 non-null   object 
 6   parameter          2182 non-null   object 
 7   parameterId        2182 non-null   int64  
 8   subtitle           2182 non-null   object 
 9   unit               2182 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 170.6+ KB

|  | average | day | parameter | unit |
|---|---|---|---|---|
| 0 | 52.0092 | 2020-12-31 | o3 | µg/m³ |
| 1 | 11.5678 | 2020-12-31 | pm10 | µg/m³ |
| 2 | 14.7604 | 2020-12-31 | no2 | µg/m³ |
| 3 | 3.4066 | 2020-12-31 | so2 | µg/m³ |
| 4 | 322.8080 | 2020-12-31 | co | µg/m³ |
| ... | ... | ... | ... | ... |
| 2177 | 3.1374 | 2020-01-01 | so2 | µg/m³ |
| 2178 | 25.6587 | 2020-01-01 | pm10 | µg/m³ |
| 2179 | 412.7953 | 2020-01-01 | co | µg/m³ |
| 2180 | 22.7752 | 2020-01-01 | no2 | µg/m³ |
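
The commented-out lookup ``measurements.loc['2020-12-31','o3']`` in the cell above suggests indexing by day and parameter. One possible way to support it, shown here with rows copied from the output, is to pivot the long table into a wide one. This is a sketch that assumes a single row per day and parameter; ``pivot`` raises an error if that does not hold for the full dataset:

```python
import pandas as pd

# Rows copied from the measurements output above
sample = pd.DataFrame(
    {
        "average": [52.0092, 11.5678, 14.7604, 3.1374, 25.6587],
        "day": ["2020-12-31", "2020-12-31", "2020-12-31", "2020-01-01", "2020-01-01"],
        "parameter": ["o3", "pm10", "no2", "so2", "pm10"],
        "unit": ["µg/m³"] * 5,
    }
)
sample["day"] = pd.to_datetime(sample["day"], format="%Y-%m-%d")

# One row per day, one column per parameter; missing combinations become NaN
wide = sample.pivot(index="day", columns="parameter", values="average")
print(wide.loc[pd.Timestamp("2020-12-31"), "o3"])
# → 52.0092
```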
