# CIND820 - Exploratory Data Analysis



I aim to predict the efficiency of COVID-19 vaccines around the world, using data classification and clustering to analyze vaccine efficiency relative to the infected population and the deaths reported.

First, I will explore the dataset to gain insights, better understand patterns, detect errors and outliers, and find relationships between variables. Then, I will identify the key factors that determine the efficiency of COVID-19 vaccines in relation to the number of cases and deaths.


# Preparation
Describing the working dataset and any imposed constraints

This dataset comes from the Our World in Data website and is collected by the Our World in Data team. It is synced daily. For more information:
https://www.kaggle.com/datasets/caesarmario/our-world-in-data-covid19-dataset

Import the following files:

https://covid.ourworldindata.org/data/owid-covid-data.csv

In [None]:
# Import libraries
import pandas as pd
from ydata_profiling import ProfileReport
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

from sklearn.feature_selection import VarianceThreshold

import numpy as np


## Load file and explore data

The dataset, provided by Our World in Data, contains COVID-19 vaccination information and is available to the Kaggle community (https://www.kaggle.com/datasets/caesarmario/our-world-in-data-covid19-dataset/download?datasetVersionNumber=418). The dataset is updated daily; for this study, I analyze the data with information up to October 7, 2023.

In [None]:
# Load file
covid_data = pd.read_csv('archive.zip', sep=',')  

#Explore data
covid_data.head()

### Check the data type and metadata of the attributes

In [None]:
covid_data.dtypes

In [None]:
# look at meta information about data, such as null values
covid_data.info()

In [None]:
# Summary statistics for the numeric columns; also useful for spotting extreme values
covid_data.describe()

### Removing data before COVID-19 vaccine availability

Multiple vaccines became available in the second half of 2020. By December, a number of countries had approved vaccines for domestic use.

To avoid ....., we will remove data before January 1, 2021, so that only the period with vaccine availability is considered.

In [None]:
covid_data = covid_data.drop(covid_data[covid_data.date < '2021-01-01'].index)
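The string comparison above works because the `date` column holds ISO `YYYY-MM-DD` text, which sorts chronologically. A minimal sketch of the same filter on parsed dates, using a small made-up frame (the `toy` data and the `cutoff` name are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy data standing in for covid_data; the values are invented for illustration
toy = pd.DataFrame({
    "date": ["2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02"],
    "new_cases": [10, 12, 15, 11],
})

# Parse the text dates so the comparison is made on real timestamps
toy["date"] = pd.to_datetime(toy["date"])

# Hypothetical cutoff matching the date used in this notebook
cutoff = pd.Timestamp("2021-01-01")

# Keep only rows on or after the cutoff and renumber the index
toy_filtered = toy[toy["date"] >= cutoff].reset_index(drop=True)
print(toy_filtered)
```

Parsing the column first also makes later date arithmetic and resampling possible, at the cost of one extra conversion step.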

In [None]:
# look at meta information about data, such as null values
covid_data.info()

# Dimensionality Reduction (CMTH642 - Module 9)

Because the dataset is large (255173 entries and 67 columns), I am going to apply dimensionality reduction to provide better features for statistical learning methods.

## 1. Removing data columns with too many NaN values

We can calculate the ratio of missing values with a simple formula: the number of missing values in each column divided by the total number of observations. Generally, variables with a missing-value ratio above 60% or 70% can be dropped. For my purpose, I use a threshold of 60% and remove the attributes whose ratio is at or above it.
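As a worked illustration of this ratio, here is a minimal sketch on a made-up frame (the `toy` data and the `max_missing` name are assumptions for illustration). It relies on the fact that the mean of a boolean column equals the share of `True` values:

```python
import numpy as np
import pandas as pd

# Toy frame with known shares of missing values per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],           # 1 of 5 missing -> 0.2
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 4 of 5 missing -> 0.8
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],              # none missing   -> 0.0
})

# isna() yields booleans; the column mean is missing count / row count
missing_ratio = toy.isna().mean()

# Keep the columns whose missing ratio is below the cutoff
max_missing = 0.60
kept_columns = missing_ratio[missing_ratio < max_missing].index.tolist()

print(missing_ratio)
print(kept_columns)
```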

In [None]:
# Defining a threshold of 60% missing values
threashold_NaN = 0.60

# Summarize the count and share of missing values for each column
def describe_nan(df):
    nan_counts = df.isna().sum()
    return pd.DataFrame({'column': df.columns, 'nan_counts': nan_counts.values, 'nan_rate': (nan_counts / df.shape[0]).values})

pd.options.display.max_rows = None

#icu=covid_data.icu_patients.value_counts(dropna=False)
#display ("NaN entries for the icu_patients column:", icu[icu.index.isnull()])

describe_nan(covid_data).sort_values(by="nan_rate", ascending=False).query("nan_rate >= %s"%threashold_NaN)

#((covid_data.isnull() | covid_data.isna()).sum() * 100 / covid_data.index.size).round(2)

In [None]:

my_columns = describe_nan(covid_data).sort_values(by="nan_rate", ascending=False).query("nan_rate < %s"%threashold_NaN)[["column"]]
my_columns = my_columns['column'].to_list() 

#dr1 -> Dimensionality Reduction - 1. Removing data columns with too many missing values
dr1_covid_data = covid_data[my_columns]
dr1_covid_data.info()

In [None]:
#covid_data.info()
describe_nan(dr1_covid_data).sort_values(by="nan_rate", ascending=False)
dr1_covid_data.shape

In [None]:
covid_data.size
dr1_covid_data.size

In [None]:
dr1_covid_data.info()

## 2. Low Variance Filter

Another way to measure how much information a data column carries is to measure its variance. In the limiting case where all cells in a column hold the same value, the variance is 0 and the column is of no help in discriminating between different groups of data.

A low variance filter calculates each column's variance and removes the columns whose variance falls below a given threshold. Variance can only be calculated for numerical columns, so this dimensionality reduction method applies only to them. The variance also depends on the column's numerical range, so column ranges need to be normalized to make variance values independent of the column's scale.

The usual procedure is therefore to first normalize all column ranges, for example to [0, 1], and then calculate the column variances and filter out the columns with a variance lower than a set threshold.
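The two-step procedure described above can be sketched with scikit-learn as follows. This is an illustration on made-up data, not the method applied to the COVID-19 dataset in this notebook; the `toy` frame, the column names, and the 0.05 threshold are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy frame: one well-spread column, one nearly flat column, one constant column
toy = pd.DataFrame({
    "spread": np.linspace(0.0, 1000.0, 20),
    "mostly_flat": [5.0] * 19 + [6.0],
    "constant": [3.0] * 20,
})

# Step 1: rescale every column to the [0, 1] range
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Step 2: keep only the columns whose scaled variance exceeds the threshold
selector = VarianceThreshold(threshold=0.05)
selector.fit(scaled)
kept = toy.columns[selector.get_support()].tolist()

print(scaled.var(ddof=0))  # population variance, as used by VarianceThreshold
print(kept)
```

After scaling, the variances are comparable across columns, so a single threshold can be applied to all of them regardless of their original units.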

In [None]:
# VarianceThreshold is initialized like any other scikit-learn estimator. Its default threshold is 0.
# The estimator only works with numeric data and raises an error if categorical features are present in the dataframe.
# For now, we therefore subset the numeric features into another dataframe:
vt = VarianceThreshold()

# dr2 -> Dimensionality Reduction - 2. Low variance filter
dr2_covid_data_num = dr1_covid_data.select_dtypes(include="number")
dr2_covid_data_num.shape


In [None]:
# First, replace missing values (NaN) in each numeric column of "dr2_covid_data_num" with that column's mean

dr2_covid_data_num = dr2_covid_data_num.fillna(dr2_covid_data_num.mean())

dr2_covid_data_num.info()

In [None]:
# First, we fit the estimator to data and call its get_support() method. It returns a boolean mask with True values for columns which are not dropped. 
# We can then use this mask to subset our DataFrame like so

_ = vt.fit(dr2_covid_data_num)
mask = vt.get_support()

dr2_covid_data_num = dr2_covid_data_num.loc[:, mask]

dr2_covid_data_num.shape

dr2_covid_data_num.info()

In [None]:
# We still have the same number of features. Now, let's drop features whose variance does not exceed a threshold of 1
vt = VarianceThreshold(threshold=1)

# Fit
_ = vt.fit(dr2_covid_data_num)

# # Get the boolean mask
mask = vt.get_support()

dr2_covid_data_reduced = dr2_covid_data_num.loc[:, mask]

dr2_covid_data_reduced.shape

# With a threshold of 1, 3 attributes were removed
# From: (255173, 32)
# To: (255173, 29)

In [None]:
dr2_covid_data_reduced.info()

# The attributes that were dropped are:
# - reproduction_rate
# - new_people_vaccinated_smoothed_per_hundred
# - human_development_index

In [None]:
covid_data.size
dr1_covid_data.size
dr2_covid_data_reduced.size

In [None]:
# Method of normalizing all features by dividing them by their mean

normalized_df = dr2_covid_data_num / dr2_covid_data_num.mean()
normalized_df.head()
normalized_df.var()
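Dividing each column by its mean makes the resulting variance equal to the squared coefficient of variation, (std / mean)², which no longer depends on the unit of measurement. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# The same quantity expressed in two units: the raw variances differ by a factor of 1e6
toy = pd.DataFrame({
    "metres": [1.0, 2.0, 3.0, 4.0],
    "millimetres": [1000.0, 2000.0, 3000.0, 4000.0],
})

raw_var = toy.var()

# Variance after dividing each column by its own mean
normalized_var = (toy / toy.mean()).var()

# Squared coefficient of variation, computed directly for comparison
cv_squared = (toy.std() / toy.mean()) ** 2

print(raw_var)
print(normalized_var)
print(cv_squared)
```

This normalization is only meaningful for columns whose mean is not close to zero; for such columns the division inflates the values and the resulting variance is unstable.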

In [None]:
# Now, we can use the estimator with a lower threshold like 0.005
vt = VarianceThreshold(threshold=0.005)

# Fit
_ = vt.fit(normalized_df)

# # Get the boolean mask
mask = vt.get_support()

dr2_covid_data_final = dr2_covid_data_num.loc[:, mask]

dr2_covid_data_final.shape

# With a threshold of 0.005, zero attributes were removed
# From: (255173, 32)
# To: (255173, 32)

In [None]:
dr2_covid_data_reduced.info()

In [None]:
# With the normalization method no attributes were removed, while the variance threshold of 1 on the unnormalized data removed 3 features.

# I will check whether it is right to remove these 3 attributes by training two RandomForestRegressor models to predict total_cases: the first on the reduced, feature-selected dataset
# and the second on the full, numeric-only dataset.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

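# Build feature, target arrays (the target is the column at position 18; y is kept 1-D for scikit-learn)
target_col = dr2_covid_data_reduced.columns[18]
X, y = dr2_covid_data_reduced.drop(columns=target_col), dr2_covid_data_reduced[target_col]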

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=1121218)

# Init, fit, score
forest = RandomForestRegressor(random_state=1121218)

_ = forest.fit(X_train, y_train)

# Training Score
print(f"Training Score: {forest.score(X_train, y_train)}")
#Training Score: 0.988528867222243

print(f"Test Score: {forest.score(X_test, y_test)}")
# Test Score: 0.9511616691995844

In [None]:
dr2_covid_data_num.info()

In [None]:
# Both the training and test scores suggest high performance without notable overfitting. Now, let's train the same model on the full numeric-only dataset

# Build feature, target arrays (the target is the column at position 21; y is kept 1-D for scikit-learn)
target_col = dr2_covid_data_num.columns[21]
X, y = dr2_covid_data_num.drop(columns=target_col), dr2_covid_data_num[target_col]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=1121218)

# Init, fit, score
forest = RandomForestRegressor(random_state=1121218)

_ = forest.fit(X_train, y_train)

# Training Score
print(f"Training Score: {forest.score(X_train, y_train)}")
#Training Score: 0.988528867222243

print(f"Test Score: {forest.score(X_test, y_test)}")

#I can confirm that there isn't any impact on the prediction by removing these 3 features
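The two experiments above repeat the same split, fit, and score steps. A minimal sketch of how the comparison could be wrapped in one helper, run here on synthetic data rather than the COVID-19 dataset (the function name `score_feature_set`, the `toy` frame, and its columns are all hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def score_feature_set(df, target, features, seed=0):
    """Fit a random forest on `features` and return (train R^2, test R^2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.3, random_state=seed
    )
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    forest.fit(X_train, y_train)
    return forest.score(X_train, y_train), forest.score(X_test, y_test)


# Synthetic data: the target depends on x1 and x2 only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=300),
    "x2": rng.normal(size=300),
    "noise": rng.normal(scale=0.01, size=300),  # low variance, unrelated to y
})
toy["y"] = 3 * toy["x1"] - 2 * toy["x2"] + rng.normal(scale=0.1, size=300)

full_scores = score_feature_set(toy, "y", ["x1", "x2", "noise"])
reduced_scores = score_feature_set(toy, "y", ["x1", "x2"])

print(f"Full feature set    (train, test): {full_scores}")
print(f"Reduced feature set (train, test): {reduced_scores}")
```

Using the same seed for both calls keeps the train/test rows identical, so any difference in the scores comes from the feature set rather than from the split.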

In [None]:
# Generate profiling report
#profile = ProfileReport(covid_data, title="Profiling Report")
#profile = ProfileReport(covid_data, title="Profiling Report", html={'style':{'full_width':True}})
#profile

## 3. High correlation with other data columns


* https://www.kaggle.com/code/bbloggsbott/feature-selection-correlation-and-p-value

In [None]:
# Compute the pairwise correlation matrix of the remaining numeric features and plot it as a heatmap

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

corr = dr2_covid_data_final.corr()
corr.head()

sns.heatmap(corr)

In [None]:
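# Flag the later column of every pair whose correlation is 0.9 or higher.
# Note: the comparison is signed, so strongly negative correlations are not flagged.
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            columns[j] = False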

selected_columns = dr2_covid_data_final.columns[columns]
selected_columns
selected_columns.shape
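The loop above only flags strong positive correlations. A variant that also treats strong negative correlations as redundant can be sketched with the absolute correlation matrix and its upper triangle. This is an alternative under stated assumptions, not the filter used in this notebook; the helper name `drop_highly_correlated` and the `toy` frame are hypothetical:

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of each pair with |correlation| >= threshold."""
    corr = df.corr().abs()
    # Keep only the entries above the diagonal so each pair is seen once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy frame with one near-duplicate and one perfectly anti-correlated column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],   # almost 2 * a
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],    # 6 - a, correlation of -1 with a
    "d": [1.0, -1.0, 2.0, -2.0, 0.0],  # weakly related to the others
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.9)
print(dropped)
print(reduced.columns.tolist())
```

Which column of a correlated pair is kept depends only on column order here; choosing by domain relevance or by missing-value rate may be preferable in practice.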

* https://towardsdatascience.com/statistics-in-python-collinearity-and-multicollinearity-4cc4dcd82b3f