# Prediction of COVID-19 Around the World

| Student | __Angela Amador__ |
| ------ | ----- |
| TMU Student Number | __500259095__ |
| Supervisor | __Tamer Abdou, PhD__  |

I aim to demonstrate how Machine Learning (ML) models were able to predict the spread of COVID-19 around the world.

First, I will explore the dataset to gain insights, better understand patterns, detect errors and outliers, and find relationships between variables.

The scope of this notebook is limited to running Pandas Data Profiling (the `ydata_profiling` package).

## Preparation

The dataset is taken from the Our World in Data website, where it is officially collected by the Our World in Data team: https://covid.ourworldindata.org/data/owid-covid-data.csv.

The dataset provides COVID-19 information collected by Our World in Data and is made available by the Kaggle community as the [Our World in Data COVID-19 dataset](https://www.kaggle.com/datasets/caesarmario/our-world-in-data-covid19-dataset); the version used here is https://www.kaggle.com/datasets/caesarmario/our-world-in-data-covid19-dataset/download?datasetVersionNumber=418. This dataset is updated daily.

In [1]:
import warnings

# Use jupyter_black to automatically format the code
import black
import jupyter_black
import matplotlib.pyplot as plt
import numpy as np

# Import libraries
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import Markdown, display
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC
from ydata_profiling import ProfileReport


jupyter_black.load(
    lab=False,
    line_length=79,
    verbosity="INFO",
    target_version=black.TargetVersion.PY310,
)

# Use pycodestyle to enforce coding standards.
%load_ext pycodestyle_magic
%pycodestyle_on

warnings.filterwarnings("ignore")
InteractiveShell.ast_node_interactivity = "all"


### Load file and run Data Profiling

For the purpose of this study, I am analyzing data with information up to October 7, 2023.

In [2]:
# Load file
raw_data = pd.read_csv("archive.zip", sep=",")

# Explore data
raw_data.head()

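| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AFG | Asia | Afghanistan | 2020-01-03 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 1 | AFG | Asia | Afghanistan | 2020-01-04 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 2 | AFG | Asia | Afghanistan | 2020-01-05 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 3 | AFG | Asia | Afghanistan | 2020-01-06 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 4 | AFG | Asia | Afghanistan | 2020-01-07 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |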
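The study period stated above ends on a fixed date, while the source dataset keeps growing. The following is a minimal sketch of how such a cutoff could be enforced, assuming the `date` column holds ISO-formatted strings as in the preview. The toy frame and the `CUTOFF` constant are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for raw_data (illustrative values only).
toy = pd.DataFrame(
    {
        "location": ["Afghanistan"] * 4,
        "date": ["2023-10-06", "2023-10-07", "2023-10-08", "2023-10-09"],
        "new_cases": [0.0, 1.0, 2.0, 3.0],
    }
)

# Hypothetical constant holding the last day of the study period.
CUTOFF = pd.Timestamp("2023-10-07")

# `date` is read as text, so parse it before comparing.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Keep only observations on or before the cutoff.
up_to_cutoff = toy[toy["date"] <= CUTOFF]
print(up_to_cutoff["date"].max())
```

Applying the same filter to `raw_data` would make the analysis reproducible even if a newer version of the file were downloaded later.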


### Check the data type and metadata of the attributes

In [3]:
raw_data.dtypes

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
population                                 float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object
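Because the truncated listing hides most of the 67 attributes, it can help to summarize the data types instead of printing them all. The sketch below, run on an illustrative toy frame rather than the real data, counts attributes per type and splits the column names into numeric and non-numeric groups.

```python
import pandas as pd

# Toy frame standing in for raw_data (illustrative values only).
toy = pd.DataFrame(
    {
        "iso_code": ["AFG", "AFG"],
        "continent": ["Asia", "Asia"],
        "date": ["2020-01-03", "2020-01-04"],
        "new_cases": [0.0, 0.0],
        "population": [41128772.0, 41128772.0],
    }
)

# Count how many attributes share each data type.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Split the column names into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print("numeric:", numeric_cols)
print("non-numeric:", text_cols)
```

Note that `date` lands in the non-numeric group here, which matches the `object` type reported above; it would need to be parsed before any time-based analysis.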

Look at summary statistics for the numeric attributes; they also show
whether there are any extreme values.

In [4]:
raw_data.describe()

| | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | total_cases_per_million | new_cases_per_million | new_cases_smoothed_per_million | total_deaths_per_million | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 308672.0 | 337028.0 | 335769.0 | 287169.0 | 337072.0 | 335842.0 | 308672.0 | 337028.0 | 335769.0 | 287169.0 | ... | 198833.0 | 131627.0 | 237221.0 | 318823.0 | 260466.0 | 346567.0 | 11953.0 | 11953.0 | 11953.0 | 11953.0 |
| mean | 6609069.0 | 9695.906 | 9732.069 | 85595.25 | 86.392889 | 86.704207 | 100634.394008 | 146.569024 | 147.113196 | 867.35464 | ... | 32.909864 | 50.789455 | 3.097109 | 73.714185 | 0.72246 | 128322500.0 | 51135.35 | 9.739424 | 11.461129 | 1646.844959 |
| std | 40325470.0 | 110832.4 | 94954.14 | 438049.3 | 616.815791 | 561.926045 | 150292.226515 | 1169.506821 | 602.840371 | 1096.750172 | ... | 13.574185 | 31.956355 | 2.548353 | 7.39556 | 0.148979 | 660311700.0 | 144279.6 | 12.380781 | 25.354695 | 1929.159161 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7.7 | 1.188 | 0.1 | 53.28 | 0.394 | 47.0 | -37726.1 | -44.23 | -95.92 | -2752.9248 |
| 25% | 7988.75 | 0.0 | 0.286 | 125.0 | 0.0 | 0.0 | 2573.7835 | 0.0 | 0.056 | 59.672 | ... | 22.6 | 20.859 | 1.3 | 69.59 | 0.602 | 449002.0 | 106.6 | 1.32 | -1.62 | 65.34572 |
| 50% | 69047.0 | 2.0 | 25.857 | 1313.0 | 0.0 | 0.143 | 27720.494 | 0.169 | 6.815 | 374.322 | ... | 33.1 | 49.839 | 2.5 | 75.05 | 0.74 | 5882259.0 | 5736.601 | 8.07 | 5.77 | 1072.4727 |
| 75% | 734550.2 | 273.0 | 510.714 | 11818.0 | 3.0 | 5.286 | 131483.602 | 36.566 | 84.076 | 1356.019 | ... | 41.3 | 82.502 | 4.2 | 79.46 | 0.829 | 28301700.0 | 36689.59 | 15.47 | 16.52 | 2704.9338 |
| max | 771150500.0 | 8401961.0 | 6402036.0 | 6960770.0 | 27939.0 | 14821.857 | 737554.506 | 228872.025 | 37241.781 | 6511.209 | ... | 78.1 | 100.0 | 13.8 | 86.75 | 0.957 | 7975105000.0 | 1289776.0 | 76.55 | 377.63 | 10292.916 |
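Reading extreme values off the summary by eye is error-prone when there are dozens of attributes. One common way to quantify them is the interquartile-range (IQR) rule, sketched below on an illustrative toy frame. The helper name `iqr_outlier_share` and the multiplier `k = 1.5` are assumptions made for this example, not part of the original analysis.

```python
import pandas as pd


def iqr_outlier_share(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Share of non-missing values outside the IQR fences, per column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    below = numeric.lt(lower, axis="columns")
    above = numeric.gt(upper, axis="columns")
    outside = below | above
    # Divide by the number of non-missing values in each column.
    shares = outside.sum() / numeric.count()
    return shares.sort_values(ascending=False)


# Toy frame standing in for raw_data (illustrative values only).
toy = pd.DataFrame(
    {
        "new_cases": [0.0, 0.0, 2.0, 3.0, 5.0, 1000.0],
        "life_expectancy": [64.83, 69.59, 73.7, 75.05, 79.46, 86.75],
    }
)

shares = iqr_outlier_share(toy)
print(shares)
```

For heavily skewed count data such as daily cases, this rule tends to flag many legitimate peaks, so the flagged share is better read as a prompt for inspection than as a list of errors.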


Helper function to generate a markdown table with details of a dataframe.

In [5]:
def df_details(label, df):
    """ Generate a markdown table with details of a dataframe """
    display(
        Markdown(
            rf"""
| {label} | |
| --- | ---: |
| Number of observations | {df.shape[0]} |
| Number of attributes | {df.shape[1]} |
| Size | {df.size} |
"""
        )
    )


df_details("Original dataset", raw_data)


| Original dataset | |
| --- | ---: |
| Number of observations | 346567 |
| Number of attributes | 67 |
| Size | 23219989 |
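The table above reports the overall shape but not how complete each attribute is, and the `count` row of the summary statistics suggests that completeness varies considerably between attributes. As a rough sketch on an illustrative toy frame, the percentage of missing values per attribute could be computed like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (illustrative values only).
toy = pd.DataFrame(
    {
        "total_cases": [np.nan, np.nan, 1.0, 2.0],
        "new_cases": [0.0, 0.0, 1.0, 1.0],
        "excess_mortality": [np.nan, np.nan, np.nan, 5.77],
    }
)

# Percentage of missing values per attribute, most incomplete first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```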


## Generate Profiling Report

Displaying the report inline is commented out because the output is too large for the notebook; the full report is available in the HTML file CIND820_EDA_DataProfiling.html.

In [6]:
# Generate profiling report
%pycodestyle_off
profile = ProfileReport(
    raw_data,
    title="Profiling Report",
    html={"style": {"full_width": True}},
)
# Displaying the report inline is disabled because the output is too large.
# profile
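If the full report is too heavy to build or display, one option is to profile a sample of the rows instead. The sketch below draws a reproducible sample per location from an illustrative toy frame; the `SAMPLE_FRAC` constant and the choice to sample within each location are assumptions for this example. The resulting frame could then be passed to `ProfileReport` in place of `raw_data`, and the report written to disk with its `to_file` method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (illustrative values only).
toy = pd.DataFrame(
    {
        "location": ["A"] * 500 + ["B"] * 500,
        "new_cases": np.arange(1000, dtype=float),
    }
)

# Hypothetical sampling fraction; tune it to the available memory.
SAMPLE_FRAC = 0.1

# Sample within each location so that every location stays represented;
# random_state fixes the draw for reproducibility.
sample = toy.groupby("location").sample(frac=SAMPLE_FRAC, random_state=42)
print(sample["location"].value_counts())
```

Sampling changes counts and can hide rare extreme values, so a report built this way is best treated as a quick overview rather than a replacement for the full HTML report.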