# Air Quality and COVID-19 Infection Rates
For this project, our goal is to investigate a possible relationship between Air Quality and COVID-19 Infection Rates. To that end, we will merge an Air Quality dataset with a COVID-19 dataset, train a linear regression model on the merged dataset, and perform predictions to test the accuracy of our model.

Some thoughts about the web app:
- Perhaps an app where Air Quality features and COVID-19 cases are the independent variables while COVID-19 infection rate (%) is the dependent variable.
- Sliders and input fields would be used for the independent variables.

**Important:** Because this is an exploratory project, we are not absolutely certain that there exists a significant relationship between Air Quality and COVID-19 datasets. Furthermore, our results, significant or not, can only lend support to hypotheses surrounding Air Quality and COVID-19.

For full project details, please see `CS180_G16_ProjectProposal.pdf`.

## Examining the Datasets
Let us look at the samples, features, and other details of the two datasets so that we know what we're dealing with.

In [None]:
%reset -f

def newline(): print("---------------------------\n")

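import pandas as pd
from IPython.display import display  # explicit import so mid-cell tables render

dataA = pd.read_csv('datasets/A-WHO-air-quality.csv')
dataB = pd.read_csv('datasets/B-WHO-covid-infections-deaths.csv')

# Only the last expression of a cell is rendered automatically,
# so intermediate tables are shown with display().
print("Dataset A: WHO Air Quality (2023)")
print("Dataset A Contents")
display(dataA.head())
print("Additional Details")
display(dataA.describe())
dataA.info()

newline()

print("Dataset B: WHO COVID-19 Cases and Deaths (2023)")
print("Dataset B Contents")
display(dataB.head())
print("Additional Details")
display(dataB.describe())
dataB.info()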

## First Dropping of Features
In this section, we are going to drop the features that are obviously unrelated to our purposes.

Looking at Dataset A's features (and samples for confirmation), we find that the following features are unnecessary in our context:

- who_region
- iso3
- city (because there's no city in Dataset B)
- version
- type_of_stations
- reference
- web_link
- population (also has to be dropped because this is by [country, city])
- population_source
- latitude (because analysis is on a per country basis)
- longitude (ditto)
- who_ms

Similarly, for Dataset B, the following are unnecessary:

- Country_code
- WHO_region

Note that explicitly printing the labels shows nothing abnormal in their names (thanks, WHO). We should, however, lowercase Dataset B's labels to match Dataset A, and rename Dataset A's country_name label to country.
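As a minimal sketch of that label check, the snippet below uses a made-up frame with deliberately messy labels (not the WHO data). Printing each label with `repr()` exposes stray whitespace, and the last line shows one way to normalize labels; the frame name `toy_labels` and its columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame with deliberately messy labels (not the WHO datasets).
toy_labels = pd.DataFrame(columns=["Country", " New_cases", "WHO_region "])

# repr() makes leading/trailing whitespace visible in each label.
print([repr(c) for c in toy_labels.columns])

# Strip whitespace and lowercase every label.
toy_labels.columns = toy_labels.columns.str.strip().str.lower()
print(list(toy_labels.columns))
```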

We shall now drop these features:

In [None]:
# Dropping columns from Dataset A
dataA.drop(['who_region', 'iso3', 'city', 'version', 'type_of_stations',
            'reference', 'web_link', 'population', 'population_source',
            'latitude', 'longitude', 'who_ms'], axis=1, inplace=True)
# Dropping columns from Dataset B
dataB.drop(['Country_code', 'WHO_region'], axis=1, inplace=True)

# Converting all of Dataset B's labels to lowercase
dataB.columns = dataB.columns.str.lower()

# Also rename Dataset A's country_name label to country
dataA.rename(columns={'country_name': 'country'}, inplace=True)

# Checking Dataset A and B state
# (display() is needed because only a cell's last expression renders on its own)
from IPython.display import display
print("Dataset A")
display(dataA.head())
print("Dataset B")
display(dataB.head())

## Temporal Restriction
As mentioned in our project proposal, we limit our project to records from the years 2020 to 2022, so that we capture only the records that the COVID-19 pandemic may have impacted.

*Remark:* We're dropping records from beyond 2022 from Dataset B because Dataset A only extends to 2022.

Because date_reported in Dataset B is a date in YYYY-MM-DD format, we cannot compare it directly with the year in Dataset A, which is in YYYY format. We must first extract the year from date_reported and replace that column with year.

In [None]:
# Replace date_reported with the year extracted from it, then move year to the front
dataB['date_reported'] = pd.to_datetime(dataB['date_reported'])
dataB['year'] = dataB['date_reported'].dt.year
del dataB['date_reported']
cols = dataB.columns.tolist()
cols = cols[-1:] + cols[:-1]
dataB = dataB[cols]
print('Dataset B')
dataB

In [None]:
# Aux: Sort dataset by country name
dataA.sort_values('country', ascending=True, inplace=True)
dataA.reset_index(drop=True, inplace=True)
print('Dataset A')
dataA.head()

In [None]:
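# Keep only records from 2020 to 2022 (inclusive) in both Dataset A and B.
print("dataA records before restriction:", dataA.shape)
dataA = dataA[(dataA['year'] >= 2020) & (dataA['year'] <= 2022)]
print("dataA records after restriction:", dataA.shape)
dataA.reset_index(drop=True, inplace=True)
# Dataset B gets the same restriction, as stated in the remark above.
print("dataB records before restriction:", dataB.shape)
dataB = dataB[(dataB['year'] >= 2020) & (dataB['year'] <= 2022)]
print("dataB records after restriction:", dataB.shape)
dataB.reset_index(drop=True, inplace=True)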

## Possible Imputation to estimate NA values

In [None]:
import seaborn as sns

# Deciding whether to drop or impute null values, so we check how many null values there are.
# Dataset A impute
print("A: Number of entries with null values:", dataA.isna().any(axis=1).sum())
print("A: Number of entries:", dataA.shape[0])

# Dataset B impute
print("B: Number of entries with null values:", dataB.isna().any(axis=1).sum())
print("B: Number of entries:", dataB.shape[0])

As we can see, 6741 of Dataset A's 6872 entries (about 98%) contain at least one null value. Dropping those entries would therefore discard almost the entire dataset, so imputation must be performed to preserve Dataset A.

Dataset B has 0 entries with null values, so we can work with it as it is.

**Decision:** We apply multivariate imputation to Dataset A, while Dataset B will remain the same.

However, further examination of the dataset reveals countries where almost an entire column is populated with NAs. Hence, for better reliability (that is, to avoid making up values from thin air), we will perform imputation per country, and then put the results back together.
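One way to carry out that examination is to compute, for each country, the share of missing values in every column. The sketch below does this on a small made-up frame; the name `toy_aq`, its column names, and its values are assumptions for illustration and are not taken from Dataset A.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Dataset A (hypothetical columns and values).
toy_aq = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "pm25_concentration": [10.0, np.nan, np.nan, np.nan],
    "no2_concentration": [5.0, 6.0, np.nan, 7.0],
})

# Fraction of missing values per column, computed separately for each country.
na_share = toy_aq.drop(columns="country").isna().groupby(toy_aq["country"]).mean()
print(na_share)

# Countries where at least one column has no observed values at all.
fully_missing = na_share.index[(na_share == 1.0).any(axis=1)].tolist()
print(fully_missing)
```

A country that shows up in `fully_missing` has nothing in that column for an imputer to learn from, which is the situation the per-country approach is meant to detect.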

We will now do **Per Country Imputation of Dataset A**.

In [None]:
# IterativeImputer is experimental: enable_iterative_imputer must be imported before it.

import numpy as np
from sklearn import preprocessing

from sklearn.impute import SimpleImputer

from sklearn.experimental import enable_iterative_imputer   # Important!
from sklearn.impute import IterativeImputer     # default estimator is BayesianRidge

from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor

from IPython.display import display

print("Before imputation:")
display(dataA)   # a bare `dataA` here would not render mid-cell

countries = dataA['country'].unique()

# Initialize imputer
imp = IterativeImputer(max_iter=100, random_state=1)

# Idea: for each country, impute a copy of its rows without country and year,
# collect the imputed rows, then restore the two dropped columns afterwards.
# Assumes country and year are the first 2 columns of Dataset A
accumulator = []    # List of rows
for country in countries:
    temp = dataA[dataA['country'] == country].drop(['country', 'year'], axis=1)
    imputed = imp.fit_transform(temp)
    if imputed.shape[1] != temp.shape[1]:
        # IterativeImputer drops columns that have no observed values, so a narrower result means at least one
        # of the 6 attributes was entirely NaN for this country. Such a country is dropped from the dataset,
        # as the failure implies too many NaN values for the records to be sensibly used
        print("{} ({})".format(country, "Dropped"))
        dataA.drop(dataA[dataA.country == country].index, inplace=True)
        continue
    else:
        print(country)
        accumulator.extend(imputed)

print("After imputation:")
cols = dataA[['country', 'year']].copy()
dataA.drop(['country', 'year'], axis=1, inplace=True)   # Dropping for shape compatibility
dataA[:] = accumulator
# Restoring dropped columns
dataA.insert(0, 'country', cols.country.values)
dataA.insert(1, 'year', cols.year.values)
dataA.reset_index(drop=True, inplace=True)
dataA

In [None]:
# Checking if there are still NAN values in dataset A
print("A: Number of entries with null values:", dataA.isna().any(axis=1).sum())
print("A: Number of entries:", dataA.shape[0])
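The shape check in the per-country loop relies on how `IterativeImputer` treats a column with no observed values: under its default settings, such a column is left out of the output. The sketch below illustrates this on a tiny made-up array; `toy_block` and its values are assumptions, and `max_iter=10` is chosen only to keep the example quick.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

# Toy rows for one hypothetical country: the middle column is entirely NaN.
toy_block = np.array([
    [1.0, np.nan, 2.0],
    [2.0, np.nan, np.nan],
    [3.0, np.nan, 6.0],
])

toy_imp = IterativeImputer(max_iter=10, random_state=1)
toy_out = toy_imp.fit_transform(toy_block)

# The all-NaN column is not returned, so the output is narrower than the input.
print(toy_block.shape, "->", toy_out.shape)
column_was_dropped = toy_out.shape[1] != toy_block.shape[1]
print("column dropped:", column_was_dropped)
```

Comparing the output width against the input width, rather than against a fixed number, keeps the check valid if the set of attributes changes.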

## Merging Dataset A with Dataset B
Now that Dataset A has been preprocessed, we can merge it with Dataset B.
We will then apply linear regression to the result.
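A rough sketch of how such a merge and regression might look is given below, using toy frames rather than the real datasets. It assumes that both datasets are first collapsed to one row per (country, year), that air-quality readings are averaged and case counts summed, and that the columns are called `pm25_concentration` and `new_cases`. All of these are illustrative assumptions, not decisions made in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the preprocessed datasets (all values are made up).
toy_a = pd.DataFrame({
    "country": ["X", "X", "Y", "Y", "Z"],
    "year": [2020, 2020, 2020, 2021, 2021],
    "pm25_concentration": [10.0, 14.0, 20.0, 30.0, 40.0],
})
toy_b = pd.DataFrame({
    "country": ["X", "X", "Y", "Y", "Z"],
    "year": [2020, 2020, 2020, 2021, 2021],
    "new_cases": [100, 140, 400, 600, 800],
})

# Collapse each dataset to one row per (country, year).
a_yearly = toy_a.groupby(["country", "year"], as_index=False).mean(numeric_only=True)
b_yearly = toy_b.groupby(["country", "year"], as_index=False).sum(numeric_only=True)

# Inner join keeps only the (country, year) pairs present in both datasets.
merged = pd.merge(a_yearly, b_yearly, on=["country", "year"], how="inner")
print(merged)

# Fit a linear regression of yearly cases on the air-quality feature.
model = LinearRegression()
model.fit(merged[["pm25_concentration"]], merged["new_cases"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```

The toy values were chosen to lie exactly on a line, so the fit here says nothing about the real data; the point is only the shape of the pipeline.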

In [None]:
# Show again what we're dealing with
