# Exploratory Data Analysis Project
___

You will be working with the `covid19-can.csv` file located in the `Data` folder.

This dataset comes from the [Government of Canada Public Health Infobase](https://open.canada.ca/data/en/dataset/261c32ab-4cfd-4f81-9dea-7b64065690dc) and contains the reported COVID-19 cases and deaths, with their per capita rates, for every province and territory of Canada on each reporting date.

**Analyze the above dataset to answer the following questions:**

1. What is the total number of COVID-19 cases reported in each province?
2. Which province has the highest average rate of COVID-19 per capita?
3. What is the average rate of COVID-19 deaths per capita?
4. What is the overall mortality rate of COVID-19 in Canada?
5. What is the mortality rate per province?
6. What are the total reported cases per year?
7. For each year in the dataset, find the month with the highest total number of cases.
8. For each year, find the month with the lowest total number of cases.
9. Which year had the highest mortality rate?
10. Which year had the lowest total number of cases?
11. Which year had the highest total number of cases?
12. In 2020, on which day did Quebec have the highest number of COVID-19 deaths?

## Step 1: Imports

In [1]:
import pandas as pd
import numpy as np

## Step 2: Reading Data

In [2]:
# Load csv file into df
df = pd.read_csv("../Desktop/M1-P4-main/Data/covid19-can.csv")
df.head()

Unnamed: 0,prname,date,reporting_week,totalcases,ratecases_total,numdeaths,ratedeaths
0,British Columbia,2020-02-01,5,1,0.02,0,0.0
1,Alberta,2020-02-01,5,0,0.0,0,0.0
2,Saskatchewan,2020-02-01,5,0,0.0,0,0.0
3,Manitoba,2020-02-01,5,0,0.0,0,0.0
4,Ontario,2020-02-01,5,3,0.02,0,0.0


## Step 3: Data Exploration

Explore the dataset to better understand its characteristics, structure, content and data types.

In [3]:
# Exploration (information)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3180 entries, 0 to 3179
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   prname           3180 non-null   object 
 1   date             3180 non-null   object 
 2   reporting_week   3180 non-null   int64  
 3   totalcases       3180 non-null   int64  
 4   ratecases_total  2968 non-null   float64
 5   numdeaths        3180 non-null   int64  
 6   ratedeaths       2968 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 174.0+ KB


In [4]:
# Exploration (shape)
df.shape

(3180, 7)

In [5]:
# Exploration (types)
df.dtypes

prname              object
date                object
reporting_week       int64
totalcases           int64
ratecases_total    float64
numdeaths            int64
ratedeaths         float64
dtype: object

In [6]:
# Exploration (descriptive statistics for the numeric columns)
df.describe()

Unnamed: 0,reporting_week,totalcases,ratecases_total,numdeaths,ratedeaths
count,3180.0,3180.0,2968.0,3180.0,2968.0
mean,26.334906,351889.0,7405.393693,4425.555975,59.537823
std,15.176101,859664.5,7583.690076,10030.916779,56.302812
min,1.0,0.0,0.0,0.0,0.0
25%,13.0,795.5,289.2375,7.0,6.3675
50%,26.0,42589.0,6962.71,217.0,48.24
75%,39.25,161333.2,11386.88,3002.5,98.69
max,53.0,4933311.0,34224.43,58475.0,227.44


In [7]:
# Exploration (descriptive statistics for the object columns)
df.describe(include = ["object"])

Unnamed: 0,prname,date
count,3180,3180
unique,15,212
top,British Columbia,2020-02-01
freq,212,15


## Step 4: Data Preparation

In [8]:
# Check for missing values
print("Number of missing values for each column :")
print(df.isnull().sum())

Number of missing values for each column :
prname               0
date                 0
reporting_week       0
totalcases           0
ratecases_total    212
numdeaths            0
ratedeaths         212
dtype: int64
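Before dropping rows, it can help to see where the missing values sit. The count of 212 matches the number of unique dates reported earlier, which suggests (but does not prove) that a single `prname` category has no rates at all. The sketch below counts missing rates per category on a small made-up frame; the category name `Other` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Other" stands in for a category without rates
toy = pd.DataFrame({
    "prname": ["Ontario", "Other", "Ontario", "Other"],
    "ratecases_total": [0.02, np.nan, 0.05, np.nan],
})

# Count missing rates per category (True counts as 1 when summed)
missing_by_group = toy["ratecases_total"].isnull().groupby(toy["prname"]).sum()
print(missing_by_group)
```

If one category accounts for every missing value, `dropna` removes that category entirely rather than scattered rows.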


In [9]:
# Remove all missing values
df.dropna(inplace = True)

In [10]:
# Check for missing values (verification)
print("Number of missing values for each column :")
print(df.isnull().sum())

Number of missing values for each column :
prname             0
date               0
reporting_week     0
totalcases         0
ratecases_total    0
numdeaths          0
ratedeaths         0
dtype: int64


In [11]:
# Find all unique values in the 'prname' column

# Print the number of unique values in the 'prname' column
print("Number of unique values in the 'prname' column :", df["prname"].nunique())

# Iterate through each unique value in the 'prname' column and print it
for i in df["prname"].unique() :
    print(i)

Number of unique values in the 'prname' column : 14
British Columbia
Alberta
Saskatchewan
Manitoba
Ontario
Quebec
Newfoundland and Labrador
New Brunswick
Nova Scotia
Prince Edward Island
Yukon
Northwest Territories
Nunavut
Canada


In [12]:
# Delete all records with Canada in 'prname'

df.drop(df[df['prname'] == "Canada"].index, inplace = True)

In [13]:
# Check the number of rows left after removing the 'Canada' records

df.info(verbose = False)

<class 'pandas.core.frame.DataFrame'>
Index: 2756 entries, 0 to 3177
Columns: 7 entries, prname to ratedeaths
dtypes: float64(2), int64(3), object(2)
memory usage: 172.2+ KB


In [14]:
# Check the remaining unique values in 'prname'

df['prname'].unique()

array(['British Columbia', 'Alberta', 'Saskatchewan', 'Manitoba',
       'Ontario', 'Quebec', 'Newfoundland and Labrador', 'New Brunswick',
       'Nova Scotia', 'Prince Edward Island', 'Yukon',
       'Northwest Territories', 'Nunavut'], dtype=object)

In [15]:
# Check (There is no longer a 'prname' Canada)

df[df['prname']=="Canada"]

Unnamed: 0,prname,date,reporting_week,totalcases,ratecases_total,numdeaths,ratedeaths


In [16]:
# Convert the 'date' column to datetime format
df["date"] = pd.to_datetime(df["date"])

In [17]:
# Create a 'year' column
df["year"] = df["date"].dt.year

In [18]:
# Create month column
df['month'] = df["date"].dt.month

In [19]:
df.head()

Unnamed: 0,prname,date,reporting_week,totalcases,ratecases_total,numdeaths,ratedeaths,year,month
0,British Columbia,2020-02-01,5,1,0.02,0,0.0,2020,2
1,Alberta,2020-02-01,5,0,0.0,0,0.0,2020,2
2,Saskatchewan,2020-02-01,5,0,0.0,0,0.0,2020,2
3,Manitoba,2020-02-01,5,0,0.0,0,0.0,2020,2
4,Ontario,2020-02-01,5,3,0.02,0,0.0,2020,2


## Step 5: Data Analysis

In [20]:
# 1. Total number of COVID-19 cases reported in each province

# Group the data in the DataFrame by the "prname" column, which represents the names of the provinces
province_cases = df.groupby("prname")

# Calculate the sum of numeric values for each group (province)
df_cases = province_cases.sum(numeric_only = True)

print('Total number of COVID-19 cases reported in each province :')
print(df_cases['totalcases'])

Total number of COVID-19 cases reported in each province :
prname
Alberta                       80504411
British Columbia              51249692
Manitoba                      19485185
New Brunswick                  8912651
Newfoundland and Labrador      5629703
Northwest Territories          1256158
Nova Scotia                   13941547
Nunavut                         407810
Ontario                      193613879
Prince Edward Island           5385603
Quebec                       159348011
Saskatchewan                  19218685
Yukon                           547530
Name: totalcases, dtype: int64
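The totals above add up `totalcases` over every reporting date. The column name and the size of the sums (Ontario's figure exceeds the population of Canada) suggest that `totalcases` is a running, cumulative count. If that assumption holds, a province's total is its most recent value, not the sum over dates. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical toy data in the dataset's layout; values are made up
toy = pd.DataFrame({
    "prname": ["Ontario", "Ontario", "Ontario", "Quebec", "Quebec", "Quebec"],
    "date": pd.to_datetime(["2020-02-01", "2020-02-08", "2020-02-15"] * 2),
    "totalcases": [3, 5, 9, 0, 2, 4],  # assumed to be cumulative counts
})

# Summing a cumulative column counts earlier cases again on every later date
summed = toy.groupby("prname")["totalcases"].sum()

# The latest reported value per province is the running total to date
latest = toy.sort_values("date").groupby("prname")["totalcases"].last()

print(summed)
print(latest)
```

The same concern would apply to any later step that sums `totalcases` or `numdeaths` across dates.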


In [21]:
# 2. Province with the highest average rate of COVID-19 cases per capita

# Calculate the mean of 'ratecases_total' for each group (province)
average_rate_cases = province_cases['ratecases_total'].mean()

print("The province with the highest average rate of COVID-19 cases per capita is",average_rate_cases.idxmax())

The province with the highest average rate of COVID-19 cases per capita is Prince Edward Island


In [22]:
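# 3. Average rate of COVID-19 per capita in each province
# Note: average_rate_cases holds the mean of 'ratecases_total' (case rate);
# the death rate asked for in question 3 would use the 'ratedeaths' column instead
print('Average rate of COVID-19 cases per capita in each province :')
average_rate_cases

Average rate of COVID-19 cases per capita in each province :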


prname
Alberta                       8358.540613
British Columbia              4544.634387
Manitoba                      6522.122736
New Brunswick                 5177.050330
Newfoundland and Labrador     5048.786226
Northwest Territories        12992.594623
Nova Scotia                   6448.994434
Nunavut                       4746.663632
Ontario                       6044.396557
Prince Edward Island         14883.171745
Quebec                        8643.871038
Saskatchewan                  7587.374151
Yukon                         5898.030189
Name: ratecases_total, dtype: float64
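The series shown above is named `ratecases_total`, so it reports case rates. Question 3 asks about deaths, which would presumably come from the `ratedeaths` column. A minimal sketch on made-up values, assuming the same grouping by `prname`; the name `average_rate_deaths` is introduced here for illustration:

```python
import pandas as pd

# Hypothetical toy data; values are made up
toy = pd.DataFrame({
    "prname": ["Alberta", "Alberta", "Quebec", "Quebec"],
    "ratecases_total": [10.0, 30.0, 20.0, 60.0],
    "ratedeaths": [1.0, 3.0, 2.0, 6.0],
})

# Mean of the death rate per province (not the case rate)
average_rate_deaths = toy.groupby("prname")["ratedeaths"].mean()
print(average_rate_deaths)

# A single figure across all provinces would be the mean of that column
print(toy["ratedeaths"].mean())
```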

In [23]:
# 4. Overall mortality rate of COVID-19 in Canada

# Sum the total number of cases and deaths across all provinces

total_cases = df['totalcases'].sum()
total_deaths = df['numdeaths'].sum()

# Calculate the overall mortality rate

mortality_rate = (total_deaths / total_cases) * 100

print("Overall mortality rate of COVID-19 in Canada is :",np.round(mortality_rate,2))

Overall mortality rate of COVID-19 in Canada is : 1.26


In [24]:
# 5. Mortality rate per province

# Calculate the mortality rate per province as a percentage

province_rate = province_cases.sum(numeric_only = True)
province_rate['mortality rates province'] = (province_rate['numdeaths']/province_rate['totalcases'])*100

# Print the mortality rates per province from the 'province_rate' DataFrame

print('Mortality rate per province : \n')
print(province_rate['mortality rates province'])

Mortality rate per province : 

prname
Alberta                      0.895691
British Columbia             1.251715
Manitoba                     1.599862
New Brunswick                0.898285
Newfoundland and Labrador    0.549656
Northwest Territories        0.201487
Nova Scotia                  0.573315
Nunavut                      0.228783
Ontario                      1.196737
Prince Edward Island         0.156696
Quebec                       1.641893
Saskatchewan                 1.157972
Yukon                        0.652202
Name: mortality rates province, dtype: float64


In [25]:
# 6. Total reported cases per year

# Group the data by year and calculate the sum of numeric columns
cases_year = df.groupby('year').sum(numeric_only=True)

print('Total reported cases per year :')
cases_year['totalcases']

Total reported cases per year :


year
2020      6505264
2021     69105463
2022    204977700
2023    244473088
2024     34439350
Name: totalcases, dtype: int64

In [26]:
# 7. Month with highest total cases for each year

# Group the data by year and month, and calculate the sum of numeric columns

cases_month = df.groupby(['year','month']).sum(numeric_only = True)

# Retrieve dataframes for each specific year

df0 = cases_month.loc[2020]
df1 = cases_month.loc[2021]
df2 = cases_month.loc[2022]
df3 = cases_month.loc[2023]
df4 = cases_month.loc[2024]

# Find the month with the highest total cases for each year

highest_month_2020 = df0['totalcases'].idxmax()
highest_month_2021 = df1['totalcases'].idxmax()
highest_month_2022 = df2['totalcases'].idxmax()
highest_month_2023 = df3['totalcases'].idxmax()
highest_month_2024 = df4['totalcases'].idxmax()

# List of the months with the highest total cases for each year

highest_month = [highest_month_2020,highest_month_2021,highest_month_2022,highest_month_2023,highest_month_2024]

# Iterate through the list and print the results (the dataset starts in 2020)

j = 2020
for i in (highest_month) :
    print("The month with highest total cases for",j,"is :",i)
    j+=1

The month with highest total cases for 2020 is : 12
The month with highest total cases for 2021 is : 10
The month with highest total cases for 2022 is : 12
The month with highest total cases for 2023 is : 12
The month with highest total cases for 2024 is : 1
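The per-year lookup above can also be written without one variable per year. The sketch below groups by year and uses `idxmax` / `idxmin` to pick the row of the peak and lowest month; the toy values and the names `peak_rows` and `low_rows` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are made up
toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "month": [2, 3, 3, 1, 1, 2],
    "totalcases": [1, 4, 6, 20, 5, 8],
})

# One row per (year, month) with the summed cases, kept as ordinary columns
monthly = toy.groupby(["year", "month"], as_index=False)["totalcases"].sum()

# idxmax/idxmin return the row label of the extreme month within each year
peak_rows = monthly.loc[monthly.groupby("year")["totalcases"].idxmax()]
low_rows = monthly.loc[monthly.groupby("year")["totalcases"].idxmin()]

for year, month in zip(peak_rows["year"], peak_rows["month"]):
    print("The month with highest total cases for", year, "is :", month)
```

Because the years come from the data, the printed labels cannot drift away from the years actually present.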


In [27]:
# 8. Month with lowest total cases for each year

# Find the month with the lowest total cases for each year

lowest_month_2020 = df0['totalcases'].idxmin()
lowest_month_2021 = df1['totalcases'].idxmin()
lowest_month_2022 = df2['totalcases'].idxmin()
lowest_month_2023 = df3['totalcases'].idxmin()
lowest_month_2024 = df4['totalcases'].idxmin()

# List of the months with the lowest total cases for each year

lowest_month = [lowest_month_2020,lowest_month_2021,lowest_month_2022,lowest_month_2023,lowest_month_2024]

# Iterate through the list and print the results (the dataset starts in 2020)

j = 2020
for i in (lowest_month) :
    print("The month with lowest total cases for",j,"is :",i)
    j+=1

The month with lowest total cases for 2020 is : 2
The month with lowest total cases for 2021 is : 2
The month with lowest total cases for 2022 is : 2
The month with lowest total cases for 2023 is : 1
The month with lowest total cases for 2024 is : 2


In [28]:
# 9. Yearly mortality rate

# Calculate the yearly mortality rate as the number of deaths divided by the total number of cases, multiplied by 100

cases_year['Yearly_mortality_rate'] = (cases_year['numdeaths']/cases_year['totalcases'])*100

# Print the yearly mortality rate for each year

print("Yearly mortality rate :")
print(cases_year['Yearly_mortality_rate'])

Yearly mortality rate :
year
2020    5.118040
2021    1.875300
2022    1.074239
2023    1.144759
2024    1.182307
Name: Yearly_mortality_rate, dtype: float64


In [29]:
# 10. Year with lowest total cases
print("The year with lowest total cases is :",cases_year['totalcases'].idxmin())

The year with lowest total cases is : 2020


In [30]:
# 11. Year with highest total cases
print("The year with highest total cases is :",cases_year['totalcases'].idxmax())

The year with highest total cases is : 2023


In [31]:
# Create a 'day' column (needed for question 12)

df['day'] = df['date'].dt.day

In [32]:
# 12. Day with highest number of COVID-19 deaths for Quebec in 2020

# Group the data by province, year, month, and day, and calculate the sum of numeric columns

df_prov_year = df.groupby(["prname","year","month","day"]).sum(numeric_only=True)

# Select data for Quebec in 2020

df_quebec = df_prov_year.loc['Quebec'].loc[2020]

# Find the maximum number of deaths and index in Quebec in 2020

max_death = df_quebec['numdeaths'].max()
ind_max = df_quebec['numdeaths'].idxmax()

# Print the day with the highest number of COVID-19 deaths for Quebec in 2020

print("The day with highest number of COVID-19 deaths for Quebec in 2020 is :",ind_max[1],"/",ind_max[0],"with",max_death,"deaths")

The day with highest number of COVID-19 deaths for Quebec in 2020 is : 26 / 12 with 7662 deaths
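The result above falls on the last reporting date of the year, which is what a running total would produce. If `numdeaths` is indeed cumulative (an assumption, not confirmed by the data description here), the deaths newly reported on each date are the differences between consecutive values. A minimal sketch on made-up numbers for a single province:

```python
import pandas as pd

# Hypothetical toy data for one province; values are made up
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-07", "2020-03-14", "2020-03-21", "2020-03-28"]),
    "numdeaths": [0, 2, 9, 11],  # assumed to be cumulative counts
})

toy = toy.sort_values("date")

# Difference between consecutive reports; the first row has no predecessor,
# so it keeps its own cumulative value
toy["new_deaths"] = toy["numdeaths"].diff().fillna(toy["numdeaths"])

# Row with the largest number of newly reported deaths
peak = toy.loc[toy["new_deaths"].idxmax()]
print(peak["date"].date(), int(peak["new_deaths"]))
```

On the real data this would be applied to the Quebec rows of 2020 only, and each difference covers the whole interval between two reporting dates rather than a single calendar day.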


The End!