# Air Quality Data (cleaning)
This Python code cleans and prepares the air quality data. The original data is stored in multiple CSV files. Each file contains daily data with measurements of different pollutants for each country. Because of the diversity of pollutants and the inconsistency in the number of data points per country, only the pollutant PM25 (particulate matter with a diameter of 2.5 micrometers or less, also called PM2.5) is considered. The yearly PM25 value is calculated as the mean, over the whole year, of the daily medians of PM25. The results are saved in a CSV file containing the following columns: 2-letter country code, year (format: YYYY) and yearly mean of PM25.
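
As a rough illustration of the "mean of daily medians" calculation described above, the sketch below uses a small made-up table. The values and dates are invented for the example; only the column names `Country`, `Date` and `median` come from the source files.

```python
import pandas as pd

# Hypothetical daily medians (made-up values, for illustration only)
daily = pd.DataFrame({
    "Country": ["AT", "AT", "AT", "BE", "BE"],
    "Date": ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-01", "2015-01-02"],
    "median": [50.0, 60.0, 40.0, 30.0, 50.0],
})

# Mean of the daily medians per country, renamed to PM25
yearly_pm25 = daily.groupby("Country")["median"].mean().rename("PM25")
print(yearly_pm25)
```

Here AT averages (50 + 60 + 40) / 3 = 50 and BE averages (30 + 50) / 2 = 40.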

## Dependencies

In [1]:
# Dependencies
import pandas as pd
from pathlib import Path

## Load data from CSV files
- Load only the data for PM25
- Calculate the mean of the daily medians over the whole year
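
The two steps above are applied identically to every file, so they could be wrapped in one function. The following is a minimal sketch, not the notebook's actual code: `load_pm25_means` is a hypothetical helper name, and the sample rows are made up and fed from an in-memory string so the example runs without the real CSV files.

```python
import io
import pandas as pd

def load_pm25_means(source, year, specie="pm25"):
    """Hypothetical helper: per-country mean of the daily medians for one file."""
    raw = pd.read_csv(source)
    # Keep only the rows for the requested pollutant
    rows = raw.loc[raw["Specie"] == specie, ["Country", "median"]]
    # Average the daily medians per country and rename the column
    means = rows.groupby("Country")[["median"]].mean()
    means = means.rename(columns={"median": "PM25"})
    # Put the year in front of the PM25 column
    means.insert(0, "Year", year)
    return means

# Made-up sample with the columns the notebook reads
sample_csv = io.StringIO(
    "Date,Country,City,Specie,median\n"
    "2015-01-01,AT,Vienna,pm25,40\n"
    "2015-01-02,AT,Vienna,pm25,60\n"
    "2015-01-01,AT,Vienna,pm10,99\n"
    "2015-01-01,BE,Brussels,pm25,30\n"
)
sample = load_pm25_means(sample_csv, "2015")
print(sample)
```

With such a helper, the per-file results could be collected in a list and combined with a single `pd.concat` call instead of concatenating inside the loop.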

In [2]:
# Look only at PM25 as a measure of air quality
aq_metrics = 'pm25'

# List the file name suffixes (year plus half-year or quarter) of all CSV files
exts = ['2015H1',
'2016H1',
'2017H1',
'2018H1',
'2019Q1',
'2019Q2',
'2019Q3',
'2019Q4',
'2020Q1',
'2020Q2',
'2020Q3',
'2020Q4',
'2021Q1',
'2021Q2',
'2021Q3',
'2021Q4']

# Load the first file and prepare the DataFrame
ext = exts[0]

# [1] Load CSV file to DataFrame
csvfile = Path(f"Data_Sources/AirQuality/waqi-covid-{ext}.csv")
data_01 = pd.read_csv(csvfile)

# [2] Filter for Specie = aq_metrics (should be PM25), get columns Date, Country, City and median
reduced_df = data_01.loc[(data_01['Specie'] == aq_metrics), ['Date', 'Country', 'City', 'median']]

# [3] Group by countries and calculate the mean of the medians
# (select the numeric 'median' column explicitly; Date and City cannot be averaged)
reduced_df = reduced_df.groupby('Country')[['median']].mean()

# [4] Rename column median to 'PM25'
reduced_df = reduced_df.rename(columns={'median': 'PM25'})

# [5] Use the year from the CSV file name to create a column with the year
reduced_df['Year'] = ext[0:4]

# [6] Rearrange the column Year and PM25
reduced_df = reduced_df[['Year', 'PM25']]

# Process the remaining CSV files and append each result to the DataFrame
# Perform the same operations as above [1-6], in the same order
for ext in exts[1:]:
    # Print current filename extension
    print(f"File: {ext}")

    # Operations [1-6]
    csvfile = Path(f"Data_Sources/AirQuality/waqi-covid-{ext}.csv")
    data_01 = pd.read_csv(csvfile)
    new_df = data_01.loc[(data_01['Specie'] == aq_metrics), ['Date', 'Country', 'City', 'median']]
    # Select the numeric 'median' column explicitly before averaging
    new_df = new_df.groupby('Country')[['median']].mean()
    new_df = new_df.rename(columns={'median': 'PM25'})
    new_df['Year'] = ext[0:4]
    new_df = new_df[['Year', 'PM25']]

    # [7] Add the DataFrame created from the CSV file to the existing DataFrame
    reduced_df = pd.concat([reduced_df, new_df])

# Once all the CSV files have been looked at, print a completion message and display the DataFrame head
print('Completed. Displaying DataFrame:')
reduced_df.head()

File: 2016H1
File: 2017H1
File: 2018H1
File: 2019Q1
File: 2019Q2
File: 2019Q3
File: 2019Q4
File: 2020Q1
File: 2020Q2
File: 2020Q3
File: 2020Q4
File: 2021Q1
File: 2021Q2
File: 2021Q3
File: 2021Q4
Completed. Displaying DataFrame:


Country,Year,PM25
AE,2015,118.714286
AT,2015,52.816
AU,2015,27.327829
BE,2015,51.058252
BG,2015,72.144737


## Save file to Cleaned Datasets directory
- Group data by country and by year
- Save data to a new CSV file
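
A minimal sketch of these two steps on made-up per-file means is shown below. The numbers are invented, and a temporary directory stands in for `Cleaned_Datasets` so the example does not touch the real output. Note that averaging per-file means gives each file the same weight, regardless of how many rows it contains.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Made-up per-file means: two 2019 quarters and one 2020 quarter for AE
per_file = pd.DataFrame(
    {"Year": ["2019", "2019", "2020"], "PM25": [100.0, 120.0, 80.0]},
    index=pd.Index(["AE", "AE", "AE"], name="Country"),
)

# Group by the Country index level and the Year column, then average
yearly = per_file.groupby(["Country", "Year"]).mean()

with tempfile.TemporaryDirectory() as tmp:
    # Create the output directory if it does not exist yet
    out_dir = Path(tmp) / "Cleaned_Datasets"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "cleaned_airquality.csv"
    yearly.to_csv(out_file)
    # Read the file back to check the saved column layout
    restored = pd.read_csv(out_file)

print(restored)
```

When reading the saved file back, be aware that `pd.read_csv` treats the string `NA` as a missing value by default, so a country code `NA`, if present in the data, would need `keep_default_na=False`.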

In [8]:
# Group by country and year
group_by_year = reduced_df.groupby(['Country','Year']).mean()

# Display the grouped DataFrame's head
display(group_by_year.head(10))

# Save DataFrame to CSV
group_by_year.to_csv('Cleaned_Datasets/cleaned_airquality.csv')

Country,Year,PM25
AE,2015,118.714286
AE,2016,92.0
AE,2018,122.683432
AE,2019,111.88194
AE,2020,82.900973
AE,2021,95.897881
AF,2019,187.457627
AF,2020,110.718417
AF,2021,116.374771
AR,2016,44.680473
