<a href="https://colab.research.google.com/github/dayananikol/CCDATSCL_EXERCISES_COM221ML/blob/main/Exercise4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 4

This exercise focuses on data visualization and interpretation using a real-world COVID-19 dataset. The dataset contains daily records of confirmed cases, deaths, recoveries, and active cases across countries and regions, along with temporal and geographic information.
The goal of this exercise is not only to create charts, but to choose appropriate visualizations, apply correct data aggregation, and draw meaningful insights from the data. You will work with time-based, categorical, numerical, and geographic variables, and you are expected to think critically about how design choices affect interpretation.

Your visualizations should follow good practices:
- Use clear titles, axis labels, and legends
- Choose chart types appropriate to the data and question
- Avoid misleading scales or cluttered designs
- Clearly explain patterns, trends, or anomalies you observe

Unless stated otherwise, you may filter, aggregate, or group the data as needed.

<img src="https://d3i6fh83elv35t.cloudfront.net/static/2020/03/Screen-Shot-2020-03-05-at-6.29.29-PM-1024x574.png"/>

In [None]:
import kagglehub
import os
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("imdevskp/corona-virus-report")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'corona-virus-report' dataset.
Path to dataset files: /kaggle/input/corona-virus-report


In [None]:
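# Confirm the download directory exists
if os.path.isdir(path):
  print(True)

# List the files shipped with the dataset
contents = os.listdir(path)

# Load the first file listed. os.listdir does not guarantee an order,
# so check df.info() below to confirm the expected columns were loaded.
mydataset = os.path.join(path, contents[0])
df = pd.read_csv(mydataset)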

True


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  14664 non-null  object 
 1   Country/Region  49068 non-null  object 
 2   Lat             49068 non-null  float64
 3   Long            49068 non-null  float64
 4   Date            49068 non-null  object 
 5   Confirmed       49068 non-null  int64  
 6   Deaths          49068 non-null  int64  
 7   Recovered       49068 non-null  int64  
 8   Active          49068 non-null  int64  
 9   WHO Region      49068 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 3.7+ MB


In [None]:
df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.1533,20.1683,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.0339,1.6596,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.5063,1.5218,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.2027,17.8739,2020-01-22,0,0,0,0,Africa


In [None]:
df.query("`Country/Region` == 'Philippines'")

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
180,,Philippines,12.879721,121.774017,2020-01-22,0,0,0,0,Western Pacific
441,,Philippines,12.879721,121.774017,2020-01-23,0,0,0,0,Western Pacific
702,,Philippines,12.879721,121.774017,2020-01-24,0,0,0,0,Western Pacific
963,,Philippines,12.879721,121.774017,2020-01-25,0,0,0,0,Western Pacific
1224,,Philippines,12.879721,121.774017,2020-01-26,0,0,0,0,Western Pacific
...,...,...,...,...,...,...,...,...,...,...
47943,,Philippines,12.879721,121.774017,2020-07-23,74390,1871,24383,48136,Western Pacific
48204,,Philippines,12.879721,121.774017,2020-07-24,76444,1879,24502,50063,Western Pacific
48465,,Philippines,12.879721,121.774017,2020-07-25,78412,1897,25752,50763,Western Pacific
48726,,Philippines,12.879721,121.774017,2020-07-26,80448,1932,26110,52406,Western Pacific
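
The Philippines rows above show `Confirmed` rising day after day, which suggests the case columns are running totals rather than daily counts. The sketch below uses a small synthetic frame (the country names and numbers are made up for illustration) to show why totals should be read from a single date instead of summed across all rows.

```python
import pandas as pd

# Synthetic toy data (not from the dataset): two countries, three days,
# with Confirmed stored as a running total.
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "Confirmed": [1, 4, 10, 0, 5, 20],
})

# Reasonable: pick one date and sum across countries.
latest = toy["Date"].max()
snapshot_total = int(toy.loc[toy["Date"] == latest, "Confirmed"].sum())

# Misleading: summing every row adds each day's running total again.
naive_total = int(toy["Confirmed"].sum())

print(snapshot_total, naive_total)
```

The same reasoning applies to any grouping that does not include `Date`: filter to one date first, then group.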


## A. Time-Based Visualizations

1. Global Trend `(5 pts)`

Aggregate the data by Date and create a line chart showing the global number of confirmed COVID-19 cases over time.

In [None]:
import plotly.express as px

# Ensure 'Date' is in datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Aggregate data by Date for global confirmed cases
global_cases = df.groupby('Date')['Confirmed'].sum().reset_index()

# Create a line chart
fig = px.line(
    global_cases,
    x='Date',
    y='Confirmed',
    title='Global Confirmed COVID-19 Cases Over Time'
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Total Confirmed Cases',
    hovermode='x unified'
)

fig.show()
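
A cumulative curve can hide changes in the pace of spread. One possible follow-up, sketched here on synthetic totals standing in for `global_cases`, is to derive daily new cases with `diff()` and smooth them with a short rolling mean. The column names `New Cases` and `New Cases (3-day avg)` and the 3-day window are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Synthetic cumulative totals, standing in for the result of
# df.groupby('Date')['Confirmed'].sum().reset_index()
global_cases_toy = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=5, freq="D"),
    "Confirmed": [100, 130, 190, 190, 260],
})

# Day-over-day difference; the first day has no predecessor, so fill with 0
global_cases_toy["New Cases"] = (
    global_cases_toy["Confirmed"].diff().fillna(0).astype(int)
)

# Rolling mean to smooth reporting noise (the window size is an arbitrary choice)
global_cases_toy["New Cases (3-day avg)"] = (
    global_cases_toy["New Cases"].rolling(window=3, min_periods=1).mean()
)

print(global_cases_toy)
```

Plotting `New Cases` with the same `px.line` call would show acceleration and plateaus that the cumulative line flattens out.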

2. Country-Level Trends `(5 pts)`

Select three countries and visualize their confirmed case counts over time on the same plot.

In [None]:
countries_to_compare = ['US', 'India', 'Brazil']

# Filter the DataFrame for the selected countries
filtered_df = df[df['Country/Region'].isin(countries_to_compare)]

# Aggregate data by Date and Country/Region for confirmed cases
country_cases = filtered_df.groupby(['Date', 'Country/Region'])['Confirmed'].sum().reset_index()

# Create a line chart
fig = px.line(
    country_cases,
    x='Date',
    y='Confirmed',
    color='Country/Region',
    title='Confirmed COVID-19 Cases Over Time for Selected Countries',
    labels={'Confirmed': 'Total Confirmed Cases'}
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Total Confirmed Cases',
    hovermode='x unified'
)

fig.show()

3. Active vs Recovered `(5 pts)`

For a selected country, create a line chart showing Active and Recovered cases over time.

In [None]:
selected_country = 'India'

# Filter data for the selected country
country_data = df[df['Country/Region'] == selected_country].copy()

# Ensure 'Date' is in datetime format
country_data['Date'] = pd.to_datetime(country_data['Date'])

# Sum across provinces/states so each date has a single row, then sort by date.
# Without this step, countries that report by province would draw a zigzag line.
country_data = (
    country_data.groupby('Date')[['Active', 'Recovered']].sum()
    .reset_index()
    .sort_values(by='Date')
)

# Melt to long format so each case type becomes its own line
cases_melted = country_data.melt(
    id_vars=['Date'],
    value_vars=['Active', 'Recovered'],
    var_name='Case Type',
    value_name='Count'
)

# Create a line chart
fig = px.line(
    cases_melted,
    x='Date',
    y='Count',
    color='Case Type',
    title=f'Active vs. Recovered COVID-19 Cases in {selected_country} Over Time',
    labels={'Count': 'Number of Cases'}
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Number of Cases',
    hovermode='x unified'
)

fig.show()
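
Before interpreting the Active and Recovered lines, it can help to check how the columns relate. The sketch below reuses the last four Philippines rows from the `df.query` output earlier in this notebook and tests whether `Active` equals `Confirmed - Deaths - Recovered`. It covers only those four rows, so treat it as a spot check rather than a guarantee for the whole dataset.

```python
import pandas as pd

# Last four Philippines rows, copied from the df.query output shown earlier
ph = pd.DataFrame({
    "Date": pd.to_datetime(["2020-07-23", "2020-07-24", "2020-07-25", "2020-07-26"]),
    "Confirmed": [74390, 76444, 78412, 80448],
    "Deaths": [1871, 1879, 1897, 1932],
    "Recovered": [24383, 24502, 25752, 26110],
    "Active": [48136, 50063, 50763, 52406],
})

# Active cases implied by the other three columns
expected_active = ph["Confirmed"] - ph["Deaths"] - ph["Recovered"]

# True only if every row matches the reported Active value
rows_consistent = bool((expected_active == ph["Active"]).all())
print(rows_consistent)
```

If the relationship holds, a rise in Recovered lowers Active by the same amount, which is worth keeping in mind when the two lines cross.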

## B. Comparative Visualizations

4. Country Comparison `(5 pts)`

Using data from a single date, create a bar chart showing the top 10 countries by confirmed cases.

In [None]:
# Select the latest date in the dataset
latest_date = df['Date'].max()

# Filter data for the latest date
latest_date_df = df[df['Date'] == latest_date]

# Aggregate confirmed cases by country for the latest date
country_confirmed_cases = latest_date_df.groupby('Country/Region')['Confirmed'].sum().reset_index()

# Get the top 10 countries by confirmed cases
top_10_countries = country_confirmed_cases.nlargest(10, 'Confirmed')

# Create a bar chart
fig = px.bar(
    top_10_countries,
    x='Country/Region',
    y='Confirmed',
    title=f"Top 10 Countries by Confirmed COVID-19 Cases on {latest_date.strftime('%Y-%m-%d')}",
    labels={'Confirmed': 'Total Confirmed Cases'}
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Country/Region',
    yaxis_title='Total Confirmed Cases',
    xaxis={'categoryorder':'total descending'}
)

fig.show()

5. WHO Region Comparison `(5 pts)`

Aggregate confirmed cases by WHO Region and visualize the result using a bar chart.

In [None]:
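# 'Confirmed' is a running total, so use a single date (the latest) rather than
# summing across all dates, which would count the same cases repeatedly
latest_date = df['Date'].max()
latest_df = df[df['Date'] == latest_date]

# Aggregate confirmed cases by WHO Region
who_region_cases = latest_df.groupby('WHO Region')['Confirmed'].sum().reset_index()

# Sort the regions by confirmed cases in descending order for better visualization
who_region_cases = who_region_cases.sort_values(by='Confirmed', ascending=False)

# Create a bar chart
fig = px.bar(
    who_region_cases,
    x='WHO Region',
    y='Confirmed',
    title=f"Confirmed COVID-19 Cases by WHO Region on {latest_date.strftime('%Y-%m-%d')}",
    labels={'Confirmed': 'Total Confirmed Cases'}
)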

# Update layout for better readability
fig.update_layout(
    xaxis_title='WHO Region',
    yaxis_title='Total Confirmed Cases',
    xaxis={'categoryorder':'total descending'}
)

fig.show()

## C. Geographic Visualization

6. Geographic Spread `(10 pts)`

Using Latitude and Longitude, create a map-based visualization showing confirmed cases for a selected date.

In [None]:
# Select the latest date in the dataset for consistency
selected_date = df['Date'].max()

# Filter data for the selected date
map_df = df[df['Date'] == selected_date].copy()

# Group by country and coordinates so each reporting location
# (a country, or a province/state where one is listed) gets one marker
map_data = map_df.groupby(['Country/Region', 'Lat', 'Long'])['Confirmed'].sum().reset_index()

# Create a geographic scatter plot
fig = px.scatter_geo(
    map_data,
    lat='Lat',
    lon='Long',
    size='Confirmed', # Marker size scales with confirmed cases
    color='Confirmed', # Needed for color_continuous_scale to take effect
    hover_name='Country/Region',
    hover_data={'Confirmed': True, 'Lat': False, 'Long': False},
    projection='natural earth', # A good projection for global maps
    title=f"Confirmed COVID-19 Cases Worldwide on {selected_date.strftime('%Y-%m-%d')}",
    color_continuous_scale=px.colors.sequential.Plasma, # A color scale for intensity
    labels={'Confirmed': 'Total Confirmed Cases'}
)

fig.update_geos(
    showcountries=True,
    countrycolor='Black',
    showcoastlines=True,
    coastlinecolor='RebeccaPurple'
)

fig.show()
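
Grouping on `Lat` and `Long` keeps each province as its own point. If one marker per country is preferred, a possible approach is to average the coordinates while summing the counts. The sketch below uses synthetic rows (the coordinates and counts are made up), and `per_country` is a hypothetical name. An unweighted mean of coordinates is only a rough placement and may land outside the country.

```python
import pandas as pd

# Synthetic rows: country "X" reports two provinces, country "Y" reports one row
toy_map = pd.DataFrame({
    "Country/Region": ["X", "X", "Y"],
    "Lat": [10.0, 20.0, -5.0],
    "Long": [100.0, 110.0, 30.0],
    "Confirmed": [300, 100, 50],
})

# Named aggregation: mean for coordinates, sum for case counts
per_country = (
    toy_map.groupby("Country/Region", as_index=False)
    .agg(
        Lat=("Lat", "mean"),
        Long=("Long", "mean"),
        Confirmed=("Confirmed", "sum"),
    )
)

print(per_country)
```

The resulting frame has the same columns as `map_data`, so it could be passed to `px.scatter_geo` unchanged.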






7. Regional Clustering `(15 pts)`

Create a visualization that shows how confirmed cases are distributed geographically within a single WHO Region.

In [None]:
# Select a WHO Region and the latest date
selected_who_region = 'Americas'
latest_date = df['Date'].max()

# Filter data for the selected WHO Region and latest date
regional_df = df[
    (df['WHO Region'] == selected_who_region) &
    (df['Date'] == latest_date)
].copy()

# Group by country and coordinates so each reporting location
# (a country, or a province/state where one is listed) gets one marker
regional_map_data = regional_df.groupby(['Country/Region', 'Lat', 'Long'])['Confirmed'].sum().reset_index()

# Create a geographic scatter plot for the region
fig = px.scatter_geo(
    regional_map_data,
    lat='Lat',
    lon='Long',
    size='Confirmed', # Marker size scales with confirmed cases
    color='Confirmed', # Needed for color_continuous_scale to take effect
    hover_name='Country/Region',
    hover_data={'Confirmed': True, 'Lat': False, 'Long': False},
    projection='natural earth', # Good for regional or global maps
    title=f"Confirmed COVID-19 Cases in {selected_who_region} on {latest_date.strftime('%Y-%m-%d')}",
    color_continuous_scale=px.colors.sequential.Plasma, # Color scale for intensity
    labels={'Confirmed': 'Total Confirmed Cases'}
)

fig.update_geos(
    showcountries=True,
    countrycolor='Black',
    showcoastlines=True,
    coastlinecolor='RebeccaPurple',
    # No built-in scope covers the whole Americas region ('north america' and
    # 'south america' are separate scopes), so zoom the map to the plotted points.
    fitbounds='locations'
)

fig.show()