# Air Quality Data Cleaning

This code cleans and processes air quality data from multiple CSV files, each containing daily pollutant measurements for various countries. Because the pollutants reported and the completeness of the data vary across countries, this analysis focuses solely on PM2.5 (particulate matter with a diameter ≤ 2.5 micrometers).

The yearly exposure for each country is calculated from the daily median PM2.5 values: they are first averaged per country within each file, and the per-file averages are then averaged per year. The final output is a CSV file with three columns: 2-letter country code, year (YYYY format), and yearly mean PM2.5.

## Dependencies

In [1]:
# Dependencies
import pandas as pd
from pathlib import Path

## Load Data from the CSV files
Load only the PM2.5 rows from each file.
For each file, calculate the mean of the daily median values per country and tag the result with the file's year.

In [5]:
# Focus on PM2.5 as the air quality metric
aq_metric = 'pm25'

# File extensions to identify each dataset
exts = [
    '2015H1', '2016H1', '2017H1', '2018H1', 
    '2019Q1', '2019Q2', '2019Q3', '2019Q4', 
    '2020Q1', '2020Q2', '2020Q3', '2020Q4', 
    '2021Q1', '2021Q2', '2021Q3', '2021Q4'
]

# Initialize DataFrame with first file
ext = exts[0]
csvfile = Path(f"Raw_Data/Air_Quality_Raw/waqi-covid-{ext}.csv")
data = pd.read_csv(csvfile)

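# Filter for PM2.5 rows, select relevant columns, and average the daily medians by country.
# Selecting [['median']] before .mean() aggregates that column explicitly; the original
# .mean('median') passed the string as the numeric_only argument.
reduced_df = (
    data[data['Specie'] == aq_metric]
    .loc[:, ['Date', 'Country', 'City', 'median']]
    .groupby('Country')[['median']].mean()
    .rename(columns={'median': 'PM25'})
)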
reduced_df['Year'] = ext[:4]
reduced_df = reduced_df[['Year', 'PM25']]

# Process each subsequent file and append to the DataFrame
for ext in exts[1:]:
    print(f"Processing file: {ext}")
    csvfile = Path(f"Raw_Data/Air_Quality_Raw/waqi-covid-{ext}.csv")
    data = pd.read_csv(csvfile)
    
    # Filter for PM2.5 rows and average the daily medians by country for this file
    new_df = (
        data[data['Specie'] == aq_metric]
        .loc[:, ['Date', 'Country', 'City', 'median']]
        .groupby('Country')[['median']].mean()
        .rename(columns={'median': 'PM25'})
    )
    new_df['Year'] = ext[:4]
    new_df = new_df[['Year', 'PM25']]
    
    # Append to main DataFrame
    reduced_df = pd.concat([reduced_df, new_df])

# Completion message and preview
print('Data processing complete. Displaying DataFrame head:')
reduced_df.head()

Processing file: 2016H1
Processing file: 2017H1
Processing file: 2018H1
Processing file: 2019Q1
Processing file: 2019Q2
Processing file: 2019Q3
Processing file: 2019Q4
Processing file: 2020Q1
Processing file: 2020Q2
Processing file: 2020Q3
Processing file: 2020Q4
Processing file: 2021Q1
Processing file: 2021Q2
Processing file: 2021Q3
Processing file: 2021Q4
Data processing complete. Displaying DataFrame head:


| Country | Year | PM25 |
| --- | --- | --- |
| AE | 2015 | 118.714286 |
| AT | 2015 | 52.816 |
| AU | 2015 | 27.327829 |
| BE | 2015 | 51.058252 |
| BG | 2015 | 72.144737 |
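
The loading cell repeats the same filter-and-aggregate steps for the first file and for every later file. A minimal sketch of how that step could be factored into one function is shown here. The helper name `country_means`, the country codes `AA` and `BB`, and all values are hypothetical; the frames are synthetic stand-ins for what `pd.read_csv` would return in the notebook.

```python
import pandas as pd

def country_means(data, year, metric='pm25'):
    """Hypothetical helper: mean of the daily 'median' values per country for one file."""
    out = (
        data[data['Specie'] == metric]
        .groupby('Country')[['median']].mean()
        .rename(columns={'median': 'PM25'})
    )
    out['Year'] = year
    return out[['Year', 'PM25']]

# Synthetic stand-ins for two raw files (country codes and values are made up).
frames = {
    '2015H1': pd.DataFrame({
        'Date': ['2015-01-01', '2015-01-02', '2015-01-01'],
        'Country': ['AA', 'AA', 'BB'],
        'City': ['X', 'X', 'Y'],
        'Specie': ['pm25', 'pm25', 'no2'],
        'median': [10.0, 30.0, 5.0],
    }),
    '2016H1': pd.DataFrame({
        'Date': ['2016-01-01'],
        'Country': ['AA'],
        'City': ['X'],
        'Specie': ['pm25'],
        'median': [40.0],
    }),
}

# One pass over all files; the year is taken from the first four characters of the key.
reduced = pd.concat([country_means(df, ext[:4]) for ext, df in frames.items()])
print(reduced)
```

With a helper like this, the first file no longer needs separate handling, since every file goes through the same list comprehension. Country `BB` drops out of the result because it has no `pm25` rows.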


## Save to Cleaned_Data Directory


In [8]:
# Group by country and year, averaging the per-file means into one yearly PM2.5 value
grouped_data = reduced_df.groupby(['Country', 'Year']).mean()

# Preview the first 10 rows of the grouped DataFrame
display(grouped_data.head(10))

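# Save the cleaned data to CSV, creating the output directory if it does not exist
Path('Cleaned_Data').mkdir(parents=True, exist_ok=True)
grouped_data.to_csv('Cleaned_Data/cleaned_airquality.csv')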

| Country | Year | PM25 |
| --- | --- | --- |
| AE | 2015 | 118.714286 |
| AE | 2016 | 92.0 |
| AE | 2018 | 122.683432 |
| AE | 2019 | 111.88194 |
| AE | 2020 | 82.900973 |
| AE | 2021 | 95.897881 |
| AF | 2019 | 187.457627 |
| AF | 2020 | 110.718417 |
| AF | 2021 | 116.374771 |
| AR | 2016 | 44.680473 |
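
Because each file is averaged before the yearly `groupby`, years that span several files (2019 to 2021 here) yield a mean of per-file means, in which every file counts equally regardless of how many daily rows it holds. The sketch below contrasts that with pooling the daily rows first. It is an illustration on synthetic data, not a change to the results above; the helper `pm25_rows`, the country code `AA`, and all values are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins for two quarterly files of the same year (values are made up).
q1 = pd.DataFrame({
    'Date': ['2019-01-01', '2019-01-02', '2019-01-03'],
    'Country': ['AA', 'AA', 'AA'],
    'City': ['X', 'X', 'X'],
    'Specie': ['pm25', 'pm25', 'pm25'],
    'median': [10.0, 20.0, 30.0],
})
q2 = pd.DataFrame({
    'Date': ['2019-04-01'],
    'Country': ['AA'],
    'City': ['X'],
    'Specie': ['pm25'],
    'median': [100.0],
})

def pm25_rows(data, year, metric='pm25'):
    """Hypothetical helper: keep the rows for one pollutant and tag them with a year."""
    rows = data.loc[data['Specie'] == metric, ['Country', 'median']].copy()
    rows['Year'] = year
    return rows

keys = ['Country', 'Year']

# Approach used in the notebook: average each file first, then average the file means.
per_file = pd.concat([
    pm25_rows(q1, '2019').groupby(keys)[['median']].mean(),
    pm25_rows(q2, '2019').groupby(keys)[['median']].mean(),
])
mean_of_means = per_file.groupby(keys)['median'].mean()

# Alternative: pool the daily rows first, so every daily reading has equal weight.
pooled = pd.concat([pm25_rows(q1, '2019'), pm25_rows(q2, '2019')])
pooled_mean = pooled.groupby(keys)['median'].mean()

print(mean_of_means.loc[('AA', '2019')])  # (20 + 100) / 2 = 60.0
print(pooled_mean.loc[('AA', '2019')])    # (10 + 20 + 30 + 100) / 4 = 40.0
```

The two approaches agree only when every file contributes the same number of rows for a country. Which one is preferable depends on whether each reporting period or each daily reading should carry equal weight.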
