# Prediction of COVID-19 around the world

Student: Angela Amador

TMU Student Number: 500259095

Supervisor: Tamer Abdou, PhD


I aim to demonstrate how Machine Learning (ML) models were able to predict the spread of COVID-19 around the world.

First, I will explore the dataset to gain insights, better understand patterns, detect errors and outliers, and find relationships between variables.


## Preparation
Describing the working dataset and any imposed constraints

This dataset is taken from the Our World in Data website, where it is officially collected by the Our World in Data team: https://covid.ourworldindata.org/data/owid-covid-data.csv.

The dataset is synced daily. For more info: https://www.kaggle.com/datasets/caesarmario/our-world-in-data-covid19-dataset

In [None]:
# Import libraries
import pandas as pd
from ydata_profiling import ProfileReport
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

from sklearn.feature_selection import VarianceThreshold

import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix



### Load file and explore data

The dataset contains COVID-19 information collected by Our World in Data and made available to the Kaggle community (https://www.kaggle.com/datasets/caesarmario/our-world-in-data-covid19-dataset/download?datasetVersionNumber=418). The dataset is updated daily; for the purpose of this study, I am analyzing the data with information up to Oct 7th, 2023.

In [None]:
# Load file
covid_data = pd.read_csv('archive.zip', sep=',')  

#Explore data
covid_data.head()

### Check the data type and metadata of the attributes

In [None]:
covid_data.dtypes

In [None]:
# look at meta information about data, such as null values
covid_data.info()

In [None]:
# Look at summary statistics for the numeric data; this also shows whether there are any extreme values
covid_data.describe()

### Removing data before COVID-19 vaccine availability

Multiple vaccines became available in the second half of 2020. By December, most countries had approved vaccines for their own populations.

To avoid ..xxxxxx... we will remove data before Jan 1st, 2021, so that only data with vaccine availability is considered.
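
The cutoff in the next cell compares the `date` column as strings, which works because the dates are in ISO format (`YYYY-MM-DD`). A minimal sketch of the same filter on parsed timestamps, assuming a toy frame (`toy`, with made-up values) in place of `covid_data`:

```python
import pandas as pd

# Toy frame standing in for covid_data; the values are made up for illustration
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": ["2020-12-30", "2021-01-01", "2020-06-15", "2021-03-02"],
    "new_cases": [10, 12, 3, 7],
})

# Parse the ISO-formatted strings into real timestamps
toy["date"] = pd.to_datetime(toy["date"])

# Keep only rows on or after the cutoff date
cutoff = pd.Timestamp("2021-01-01")
toy_recent = toy[toy["date"] >= cutoff].reset_index(drop=True)

print(toy_recent)
```

Comparing timestamps rather than strings keeps the filter correct even if the date format in the file ever changes.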

In [None]:
print("Original dataset:")
print("Total number of observations: ", covid_data.shape[0])
print("Total number of attributes: ", covid_data.shape[1])
print("Size: ", covid_data.size)


covid_data = covid_data.drop(covid_data[covid_data.date < '2021-01-01'].index)

print("\nAfter removing data before vaccines were available around the world (Jan 1st, 2021):")
print("Total number of observations: ", covid_data.shape[0])
print("Total number of attributes: ", covid_data.shape[1])
print("Size: ", covid_data.size)


## Categorical attributes

Based on the analysis of the attributes iso_code and location, I can tell that one can be derived from the other. For the purpose of this study, I am going to keep location and remove iso_code.
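
One way to back up this observation is to check that each code maps to exactly one location and vice versa. The sketch below does this on a small hypothetical frame (`toy`); it is an illustration of the check, not a result computed on the real dataset:

```python
import pandas as pd

# Hypothetical sample with the two categorical attributes
toy = pd.DataFrame({
    "iso_code": ["CAN", "CAN", "COL", "COL", "FRA"],
    "location": ["Canada", "Canada", "Colombia", "Colombia", "France"],
})

# Number of distinct locations seen for each iso_code, and vice versa
locations_per_code = toy.groupby("iso_code")["location"].nunique()
codes_per_location = toy.groupby("location")["iso_code"].nunique()

# The mapping is one-to-one when every group has exactly one counterpart
is_one_to_one = bool(
    (locations_per_code == 1).all() and (codes_per_location == 1).all()
)
print("iso_code <-> location is one-to-one:", is_one_to_one)
```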

In [None]:
# import pickle
# with open("data.pickle", "wb") as output:
#     pickle.dump(covid_data, output, pickle.HIGHEST_PROTOCOL)

# with open("data.pickle", "rb") as input:
#     data = pickle.load(input)

covid_data.groupby(["iso_code"])["iso_code"].count()
covid_data.groupby(["location"])["location"].count()

covid_data = covid_data.drop(['iso_code'], axis=1)

print("\nAfter removing attribute iso_code because it can be derived from location:")
print("Total number of observations: ", covid_data.shape[0])
print("Total number of attributes: ", covid_data.shape[1])
print("Size: ", covid_data.size)

# Dimensionality Reduction (CMTH642 - Module 9)

Due to the size of the dataset, with 255,173 entries and 67 columns, I am going to apply dimensionality reduction to provide better features for statistical learning methods.

## 1. Removing data columns with too many NaN values

We can calculate the ratio of missing values with a simple formula: the number of missing values in each column divided by the total number of observations. Generally, variables with a missing-value ratio above 60% or 70% can be dropped. For my purpose, I am going to use a threshold of 60% missing values and remove the attributes that reach it.
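
As a small worked example of this formula, the sketch below computes the missing-value ratio per column on a hypothetical frame (`toy`) and keeps only the columns under the 60% threshold. The column names and values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per column
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 4 of 5 missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 1 of 5 missing
    "complete": [1, 2, 3, 4, 5],                              # nothing missing
})

# isna() marks missing cells; the mean of a boolean column is the share of True values,
# i.e. missing values divided by the total number of observations
nan_rate = toy.isna().mean()

# Keep only the columns whose missing-value ratio is below the threshold
threshold = 0.60
kept_columns = nan_rate[nan_rate < threshold].index.tolist()
toy_reduced = toy[kept_columns]

print(nan_rate)
print(kept_columns)
```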

In [None]:
# Define a threshold of 60% missing values
threashold_NaN = 0.60

# Summarize, for every column, the count and rate of missing values
def describe_nan(df):
    nan_counts = df.isna().sum()
    return pd.DataFrame({
        'column': df.columns,
        'nan_counts': nan_counts.to_numpy(),
        'nan_rate': nan_counts.to_numpy() / df.shape[0],
    })

pd.options.display.max_rows = None

#icu=covid_data.icu_patients.value_counts(dropna=False)
#display ("NaN entries for the icu_patients column:", icu[icu.index.isnull()])

print("Attributes with 60 percent or more missing values:")

describe_nan(covid_data).sort_values(by="nan_rate", ascending=False).query("nan_rate >= %s"%threashold_NaN)

#((covid_data.isnull() | covid_data.isna()).sum() * 100 / covid_data.index.size).round(2)

In [None]:

my_columns = describe_nan(covid_data).sort_values(by="nan_rate", ascending=False).query("nan_rate < %s"%threashold_NaN)[["column"]]
my_columns = my_columns['column'].to_list() 

#dr1 -> Dimensionality Reduction - 1. Removing data columns with too many missing values
dr1_covid_data = covid_data[my_columns]

print("After removing columns with 60 percent or more missing values:\n")
print("Total number of observations: ", dr1_covid_data.shape[0])
print("Total number of attributes: ", dr1_covid_data.shape[1])
print("Size: ", dr1_covid_data.size)
print("\n")
dr1_covid_data.info()

In [None]:
print("Percentage of NaN values per attribute for the remaining columns:\n")
describe_nan(dr1_covid_data).sort_values(by="nan_rate", ascending=False)

# To manage memory, given the size of the dataset, I keep a single version of the dataset and remove any temporary copy
covid_data = dr1_covid_data
del(dr1_covid_data)

## 2. Low Variance Filter

Another way of measuring how much information a data column has is to measure its variance. In the limit case where the column cells assume a constant value, the variance would be 0 and the column would be of no help in the discrimination of different groups of data.

A low variance filter calculates the variance of each column and removes those columns with a variance below a given threshold. Notice that the variance can only be calculated for numerical columns, i.e. this dimensionality reduction method applies only to numerical columns. Note, too, that the variance value depends on the numerical range of the column. Therefore, column ranges need to be normalized to make variance values independent of the column domain range.

First, all column ranges are normalized to [0, 1]; next, the variance of each column is calculated and the columns with a variance lower than a set threshold are filtered out.
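
The two steps described above can be sketched with scikit-learn's `MinMaxScaler` and `VarianceThreshold`. The frame `toy`, its column names, and the threshold of 0.05 are assumptions chosen for illustration; the cells below use a different normalization (dividing by the mean):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

n = 50
toy = pd.DataFrame({
    "spread_out": np.arange(n, dtype=float),       # values spread evenly over the range
    "one_outlier": [1000.0] * (n - 1) + [2000.0],  # almost constant, but on a large raw scale
    "constant": [7.0] * n,                         # carries no information at all
})

# Step 1: normalize every column to the [0, 1] range
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Step 2: keep only the columns whose normalized variance exceeds the threshold
selector = VarianceThreshold(threshold=0.05)
selector.fit(scaled)
kept_columns = toy.columns[selector.get_support()].tolist()

print(scaled.var(ddof=0))
print(kept_columns)
```

On the raw values, `one_outlier` has a larger variance than `spread_out` only because of its scale; after normalization the ordering flips, which is why the ranges are normalized before filtering.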

In [None]:
# Initialization works like any other scikit-learn estimator. The default threshold is 0.
# The estimator only works with numeric data and raises an error if categorical features are present in the DataFrame.
# That is why, for now, I subset the numeric features into another DataFrame:

vt = VarianceThreshold()

# dr2 -> Dimensionality Reduction - 2. Low variance filter
dr2_covid_data_num = covid_data.select_dtypes(include="number")
#dr2_covid_data_num.shape
#dr2_covid_data_num.info()


In [None]:
# First, I need to take care of missing values (encoded natively as NaN) by replacing them with the column mean in the numeric dataset "dr2_covid_data_num"

print ("Before replacing NaN values with the mean:\n")
print("Total number of observations: ", dr2_covid_data_num.shape[0])
print("Total number of attributes: ", dr2_covid_data_num.shape[1])
print("Size: ", dr2_covid_data_num.size)
print("\n")
dr2_covid_data_num.info()

# Fill every NaN with the mean of its own column in a single vectorized call
dr2_covid_data_num = dr2_covid_data_num.fillna(dr2_covid_data_num.mean())

print ("\nAfter replacing NaN values with the mean:\n")
print("Total number of observations: ", dr2_covid_data_num.shape[0])
print("Total number of attributes: ", dr2_covid_data_num.shape[1])
print("Size: ", dr2_covid_data_num.size)
print("\n")
dr2_covid_data_num.info()

In [None]:
# First, we fit the estimator to data and call its get_support() method. It returns a boolean mask with True values for columns which are not dropped. 
# We can then use this mask to subset our DataFrame like so

_ = vt.fit(dr2_covid_data_num)
mask = vt.get_support()

dr2_covid_data_num = dr2_covid_data_num.loc[:, mask]

# dr2_covid_data_num.shape

# dr2_covid_data_num.info()


In [None]:
# We still have the same number of features. Now, let's drop low-variance features, using a threshold of 1 on the raw (unnormalized) values
vt = VarianceThreshold(threshold=1)

# Fit
_ = vt.fit(dr2_covid_data_num)

# # Get the boolean mask
mask = vt.get_support()

dr2_covid_data_reduced = dr2_covid_data_num.loc[:, mask]

print ("\nAfter dropping features with variances close to 0:\n")
print("Total number of observations: ", dr2_covid_data_reduced.shape[0])
print("Total number of attributes: ", dr2_covid_data_reduced.shape[1])
print("Size: ", dr2_covid_data_reduced.size)
print("\n")
dr2_covid_data_reduced.info()

# With a threshold of 1, 3 attributes were removed
# From: (255173, 32)
# To: (255173, 29)

In [None]:
# The attributes that were dropped are:
# - reproduction_rate
# - new_people_vaccinated_smoothed_per_hundred
# - human_development_index

In [None]:
# Method of normalizing all features by dividing them by their mean

normalized_df = dr2_covid_data_num / dr2_covid_data_num.mean()
normalized_df.head()

print("Variance of the normalized dataset:\n")
normalized_df.var()

In [None]:
# Now, we can use the estimator with a lower threshold like 0.005
vt = VarianceThreshold(threshold=0.005)

# Fit
_ = vt.fit(normalized_df)

# # Get the boolean mask
mask = vt.get_support()

dr2_covid_data_final = dr2_covid_data_num.loc[:, mask]

dr2_covid_data_final.shape

# With a threshold of 0.005, zero attributes were removed
# From: (255173, 32)
# To: (255173, 32)

In [None]:
# dr2_covid_data_reduced.columns.get_loc('total_cases')
# dr2_covid_data_reduced.shape[1]

In [None]:
# With the normalization method no attributes were removed, while with the threshold of 1 on the raw variances 3 features were removed:
# - reproduction_rate
# - new_people_vaccinated_smoothed_per_hundred
# - human_development_index

# I will check whether it is right to remove these 3 attributes. I will test this by training two RandomForestRegressor models to predict total_cases: the first one on the reduced, feature-selected dataset (dr2_covid_data_reduced)
# and the second one on the full, numeric-feature-only dataset (dr2_covid_data_num).

#from sklearn.ensemble import RandomForestRegressor
#from sklearn.model_selection import train_test_split

# Build feature and target arrays: every column except total_cases is a feature
X = dr2_covid_data_reduced.drop(columns='total_cases')

# Use a 1-D Series as the target, which is the shape scikit-learn regressors expect
y = dr2_covid_data_reduced['total_cases']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=1121218)

# Init, fit, score
forest = RandomForestRegressor(random_state=1121218)

_ = forest.fit(X_train, y_train)

# Training Score
print(f"Training Score: {forest.score(X_train, y_train)}")
#Training Score: 0.988528867222243

print(f"Test Score: {forest.score(X_test, y_test)}")
# Test Score: 0.9511616691995844

print("Both training and test score suggest a really high performance without overfitting.")

In [None]:
# dr2_covid_data_num.columns.get_loc('total_cases')
# dr2_covid_data_num.shape[1]

In [None]:
# Now, let’s train the same model on the full numeric-only dataset

# Build feature and target arrays: every column except total_cases is a feature
X = dr2_covid_data_num.drop(columns='total_cases')

# Use a 1-D Series as the target, which is the shape scikit-learn regressors expect
y = dr2_covid_data_num['total_cases']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=1121218)

# Init, fit, score
forest = RandomForestRegressor(random_state=1121218)

_ = forest.fit(X_train, y_train)

# Training Score
print(f"Training Score: {forest.score(X_train, y_train)}")
#Training Score: 0.988528867222243

print(f"Test Score: {forest.score(X_test, y_test)}")

print("I can confirm that there isn't any impact on the prediction by removing these 3 features")

#Freeing memory
del(X)
del(y)
del(X_train)
del(X_test)
del(y_train)
del(y_test)
del(dr2_covid_data_num)
del(dr2_covid_data_reduced)
del(dr2_covid_data_final)
del(normalized_df)

In [None]:
# Dropping the columns identified with variance close to 0
low_variance_columns = [
    'reproduction_rate',
    'new_people_vaccinated_smoothed_per_hundred',
    'human_development_index',
]

covid_data = covid_data.drop(columns=low_variance_columns)
print("After removing columns identified with variance close to 0:\n")
print("Total number of observations: ", covid_data.shape[0])
print("Total number of attributes: ", covid_data.shape[1])
print("Size: ", covid_data.size)
print("\n")
covid_data.info()

## 3. High correlation with other data columns


* https://www.kaggle.com/code/bbloggsbott/feature-selection-correlation-and-p-value
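
The idea behind this step is to drop one column out of every pair that is highly correlated. A minimal sketch on synthetic data follows; the frame `toy`, the 0.9 cutoff, and the use of the absolute correlation are assumptions for illustration, not results from the COVID-19 dataset:

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is a linear function of "a", "b" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + 1.0,
    "b": rng.normal(size=200),
})

# Absolute pairwise correlations
corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation reaches the cutoff
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
toy_reduced = toy.drop(columns=to_drop)

print(to_drop)
print(toy_reduced.columns.tolist())
```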

In [None]:
# 

#import seaborn as sns
#import matplotlib.pyplot as plt
#from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#import warnings
#warnings.filterwarnings("ignore")
#from sklearn.model_selection import train_test_split
#from sklearn.svm import SVC
#from sklearn.metrics import confusion_matrix

# corr = dr2_covid_data_final.corr()
# corr.head()

# sns.heatmap(corr)

In [None]:
# columns = np.full((corr.shape[0],), True, dtype=bool)
# for i in range(corr.shape[0]):
#     for j in range(i+1, corr.shape[0]):
#         if corr.iloc[i,j] >= 0.9:
#             if columns[j]:
#                 columns[j] = False

# selected_columns = dr2_covid_data_final.columns[columns]
# selected_columns
# selected_columns.shape

* https://towardsdatascience.com/statistics-in-python-collinearity-and-multicollinearity-4cc4dcd82b3f

# Generate Profiling Report

In [None]:
# Generate profiling report
#profile = ProfileReport(covid_data, title="Profiling Report")
#profile = ProfileReport(covid_data, title="Profiling Report", html={'style':{'full_width':True}})
#profile