## Covid-19 Exploratory Data Analysis

### Goal of the Analysis

This analysis addresses key open questions about the global and regional dynamics of COVID-19: its impact on healthcare systems, the effectiveness of public health measures, and potential predictive patterns.

This project aims to answer the following questions: 


* What are the trends of total cases and deaths globally over time?
* Which continents/countries have the highest total cases and deaths?
* How do total cases and deaths per million population vary across different regions?
* How many ICU patients and hospitalised patients are there over time?
* How do ICU patients per million and hospital patients per million vary across different regions?
* How has the reproduction rate (R) of the virus varied over time and across different regions?
* How does excess mortality vary across different countries and continents?
* Can we predict future trends of COVID-19 cases and deaths using time series forecasting techniques?

### Importing Data

In [None]:
# Importing libraries for Data Analysis and Viz.
import pandas as pd                 # Library for data manipulation
import matplotlib.pyplot as plt     # Library for creating static, animated, and interactive visualizations
import numpy as np                  # Fundamental package for scientific computing 

print("All libraries imported successfully!")

### Initial Data Exploration and Data Cleaning

These are the steps that I will follow in this section: 
-  Initial Dataset evaluation 
-  Feature selection
-  General information: (data type, count of total values, number of unique values, etc.)
-  Handling missing values 
-  Data Cleaning

In [None]:
# Creating a data frame & reading data from the .csv file

df = pd.read_csv('Covid_Data.csv')

In [None]:
df.shape

In [None]:
# Retrieve the column names of the dataset and store them in a list
column_names = df.columns.tolist()
print("Column names:")

# Iterate through the list of column names and print each one
for x in column_names:
    print(x)

In [None]:
# View the first 5 rows

df.head()  

#### Initial Dataset Evaluation

In our preliminary evaluation, we encountered a large dataset with 67 columns and 409,839 rows.

We observed that many columns have a significant amount of missing values, which will be addressed during the Data Cleaning stage.

The next step will be to perform **feature selection**. Given the size of the dataset, I will focus on the attributes that best serve the questions I am aiming to answer for this specific project.

#### Feature selection 


In [20]:
# Select the features that will help answer the initial questions.

selected_col = [
    'date', 'continent', 'location', 'total_cases', 'total_deaths',
    'total_cases_per_million', 'total_deaths_per_million', 
    'icu_patients', 'hosp_patients', 'icu_patients_per_million', 
    'hosp_patients_per_million', 'reproduction_rate', 
    'excess_mortality', 'excess_mortality_cumulative', 
    'excess_mortality_cumulative_per_million'
]

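# Create a new DataFrame with the selected columns.
# .copy() makes it independent of df, so later in-place edits do not raise SettingWithCopyWarning.

data = df[selected_col].copy()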

# Display the new Data Frame
data.head()

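| | date | continent | location | total_cases | total_deaths | total_cases_per_million | total_deaths_per_million | icu_patients | hosp_patients | icu_patients_per_million | hosp_patients_per_million | reproduction_rate | excess_mortality | excess_mortality_cumulative | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-05 | Asia | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2020-01-06 | Asia | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2020-01-07 | Asia | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2020-01-08 | Asia | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2020-01-09 | Asia | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |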



#### General information: (data type, count of total values, number of unique values, etc.)

Here we are going to get a deeper understanding of the data we are dealing with. 


In [None]:
data.shape

In [None]:
# Checking the data types of all the columns in the Data Frame
print(data.dtypes)

In [None]:
# Counting total non-null values of each column
print(data.count())

In [None]:
# Counting the number of unique values in each column
print(data.nunique())

#### Handle Missing Values (Initial Handling)

Before I can continue with the analysis, I need to evaluate and handle the missing values. 

In [None]:
# Total missing values in each column
print("Total missing values in each column:")
print(data.isnull().sum())
print()  # Print a blank line for better readability

# Calculate the proportion of missing values in each column
missing_data = data.isnull().mean() * 100
print("Proportion of missing values in each column (%):")
print(missing_data)
print()  # Print a blank line for better readability

# Create a DataFrame to display both total missing values and proportion
total_missing = data.isnull().sum()
proportion_missing = data.isnull().mean() * 100

missing_data_table = pd.DataFrame({
    'Total Missing Values': total_missing,
    'Proportion of Missing Values (%)': proportion_missing
})

print("Missing Data Table:")
print(missing_data_table)

# Visualize missing values
# Calculate the number of missing values for each column
missing_values = data.isnull().sum()

# Plot a bar chart of the missing-value count per column
missing_values.plot(kind='bar')
plt.title('Missing Values Count per Column')
plt.xlabel('Columns')
plt.ylabel('Count of Missing Values')
plt.show()

As shown above, there are several things we can do about the missing values:
1. For the 'continent' column, we can categorize the NaN values as 'unknown'.
2. For the 'total_cases' and 'total_deaths' columns, we can handle missing values with an imputation method such as the mean or mode.
3. For 'total_cases_per_million' and 'total_deaths_per_million', we can also use an imputation method (TBC).
4. 7 of the 15 columns have more than 90% missing values.
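
Points 1 and 4 above can be sketched as follows. This is a minimal illustration, not the final cleaning step: the `toy` frame and its values are invented stand-ins for `data`, and the helper name `columns_above_missing_threshold` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; all values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", np.nan, "Europe", np.nan],
    "location": ["Afghanistan", "World", "France", "World"],
    "total_cases": [10.0, np.nan, 30.0, 40.0],
    "icu_patients": [np.nan, np.nan, np.nan, np.nan],
})


def columns_above_missing_threshold(frame, threshold=90.0):
    """Return the columns whose share of missing values (%) exceeds `threshold`."""
    pct_missing = frame.isnull().mean() * 100
    return pct_missing[pct_missing > threshold].index.tolist()


# Point 1: label missing continents explicitly instead of leaving them as NaN
cleaned = toy.copy()
cleaned["continent"] = cleaned["continent"].fillna("unknown")

# Point 4: list the columns that are almost entirely empty
sparse_cols = columns_above_missing_threshold(cleaned, threshold=90.0)
print(sparse_cols)
```

Whether the sparse columns should be dropped or kept for a country-level subset depends on the question being answered, so the sketch only identifies them.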

##### Data cleaning

### Exploratory Data Analysis 

#### Explore Range and central tendency: (min, max, mean, median, mode, quartiles)

In [None]:
# Set pandas display option to show numbers in standard notation
pd.options.display.float_format = '{:.2f}'.format

# Checking summary statistics for each of the columns of the Data frame.
data.describe()
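
`describe()` reports the median as the 50% row but does not include the mode. A small sketch of how both could be collected in one table, using an invented `toy` frame in place of `data`:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "total_cases": [1.0, 2.0, 2.0, 10.0],
    "total_deaths": [0.0, 0.0, 1.0, 3.0],
})

# Median of each numeric column (NaN values are skipped by default)
medians = toy.median(numeric_only=True)

# mode() can return several rows when values tie; keep the first row
modes = toy.mode(numeric_only=True).iloc[0]

summary = pd.DataFrame({"median": medians, "mode": modes})
print(summary)
```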

In [None]:
# box plots & histograms. 
# scatter plots
# barplots 
# world map
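
As a possible starting point for the first item in the list above, the sketch below draws a histogram and a box plot for a single numeric column. The function name `plot_distribution` and the sample values are hypothetical; in the notebook the function would be called with a column such as `data['reproduction_rate']`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_distribution(series, bins=30):
    """Draw a histogram and a box plot of one numeric column side by side."""
    values = series.dropna().to_numpy()  # plotting functions do not skip NaN
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(values, bins=bins)
    ax_hist.set_title(f"Histogram of {series.name}")
    ax_hist.set_xlabel(series.name)
    ax_hist.set_ylabel("Frequency")

    ax_box.boxplot(values)
    ax_box.set_title(f"Box plot of {series.name}")
    return fig, (ax_hist, ax_box)


# Invented sample values, used only to show the call
toy = pd.Series([0.5, 0.9, 1.0, 1.1, 1.3, 2.5, None], name="reproduction_rate")
fig, axes = plot_distribution(toy, bins=5)
```

Calling `plt.show()` afterwards displays the figure when running outside a notebook.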

#### Check Normality and spread: (standard deviation, skewness, kurtosis)

In [None]:
# Impute numeric columns with the column mean
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
data = data.fillna(data[numeric_cols].mean())

# Impute categorical columns with the most frequent value (mode)
categorical_cols = data.select_dtypes(include=['object']).columns
data = data.fillna(data[categorical_cols].mode().iloc[0])

from scipy.stats import skew, kurtosis

skewness = data[numeric_cols].apply(lambda x: skew(x.dropna()))
kurt = data[numeric_cols].apply(lambda x: kurtosis(x.dropna()))

print("Skewness:\n", skewness)
print()
print("Kurtosis:\n", kurt)
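
For comparison, the same spread measures can be computed with pandas alone, which skips NaN values by default and so does not require imputation first. This is a sketch on an invented `toy` frame. Note that pandas applies bias-corrected estimators, while `scipy.stats.skew` and `scipy.stats.kurtosis` are biased by default, so the numbers will differ slightly from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "total_cases": [1.0, 2.0, 3.0, np.nan, 50.0],
    "total_deaths": [0.0, 1.0, 1.0, 2.0, np.nan],
})

# Each method ignores NaN values column by column
spread = pd.DataFrame({
    "std": toy.std(),
    "skewness": toy.skew(),
    "kurtosis": toy.kurt(),  # excess kurtosis (a normal distribution gives 0)
})
print(spread)
```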



#### Further Handling of Missing Values:

### Further review - Leave this for a different section

In [None]:
# Count the unique values in the 'continent' and 'location' columns
unique_continents_count = df['continent'].nunique()
unique_locations_count = df['location'].nunique()

# Print the number of unique continents and locations
print("Number of unique continents:", unique_continents_count)
print("Number of unique locations (countries):", unique_locations_count)

# Get the unique values in the 'continent' column
unique_continents = df['continent'].unique()

# Print or inspect the unique continents
print(unique_continents)

In [None]:
# Count NaN values in the 'continent' column
nan_count = data['continent'].isna().sum()

# Print the count of NaN values
print("Number of NaN values in column 'continent':", nan_count)

In [None]:
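# Filter rows where 'continent' contains NaN values.
# The mask is built from df itself, because 'continent' in data may already have been imputed.
nan_rows = df[df['continent'].isna()]

# Print or inspect the rows with NaN values
print("Rows with NaN values in column 'continent':")
print(nan_rows)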
print(nan_rows)
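
Printing every row can be hard to read on a dataset this size. A more compact check, sketched below on an invented `toy` frame, counts the rows without a continent per location. One thing to verify with this (an assumption, not something confirmed here) is whether those rows are aggregate entries such as world or continent totals rather than individual countries.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; the values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", np.nan, "Europe", np.nan, np.nan],
    "location": ["Afghanistan", "World", "France", "World", "Europe"],
})

# Keep only the rows with no continent, then count them per location
missing_continent = toy[toy["continent"].isna()]
rows_per_location = missing_continent["location"].value_counts()
print(rows_per_location)
```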