# COVID-19 Global Data Tracker

This Jupyter Notebook analyzes global COVID-19 trends using data from Johns Hopkins University (JHU) CSSE for cases, deaths, and recoveries, and Our World in Data (OWID) for vaccinations. It includes data cleaning, exploratory data analysis, visualizations, and narrative insights.

## Objectives
- Import and clean COVID-19 global data.
- Analyze time trends (cases, deaths, vaccinations).
- Compare metrics across countries.
- Visualize trends with charts and maps.
- Communicate findings in a report.

## Outputs
- Visualizations: PNG charts and HTML choropleth map in `visualizations/`.
- Insights: `docs/covid_insights.md`.
- Dataset: `data/covid_combined_data.csv`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime
import os

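# Set plot style. The 'seaborn' Matplotlib style name was deprecated in
# Matplotlib 3.6 and is unavailable in recent releases, so use seaborn's theme.
sns.set_theme(style='darkgrid', palette='viridis')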

# Create output directories
os.makedirs('visualizations', exist_ok=True)
os.makedirs('docs', exist_ok=True)
os.makedirs('data', exist_ok=True)

## 1. Data Collection

Load datasets from JHU CSSE (cases, deaths, recoveries) and OWID (vaccinations).

In [None]:
# Load JHU CSSE datasets
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
recovered_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

df_confirmed = pd.read_csv(confirmed_url)
df_deaths = pd.read_csv(deaths_url)
df_recovered = pd.read_csv(recovered_url)

# Load OWID dataset
owid_url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
df_owid = pd.read_csv(owid_url)
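
Each run of the cell above downloads all four files again. A minimal sketch of one way to keep a local copy is shown below; `load_csv_cached` and its `cache_path` argument are hypothetical names introduced for illustration and are not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_cached(source, cache_path):
    """Read a CSV from `source`, keeping a local copy at `cache_path`.

    Hypothetical helper for this sketch; `source` may be a URL or a local path.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the local copy instead of downloading again.
        return pd.read_csv(cache_path)
    df = pd.read_csv(source)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df
```

It could be called as `load_csv_cached(confirmed_url, 'data/raw/confirmed.csv')`, where the cache location is again only an assumption. Delete the cached file to force a fresh download.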

## 2. Data Exploration

Inspect dataset structure and missing values.

In [None]:
print('JHU Confirmed Dataset Info:')
print(df_confirmed.info())
print('\nJHU Confirmed Missing Values:')
print(df_confirmed.isnull().sum())
print('\nOWID Dataset Info:')
print(df_owid.info())
print('\nOWID Missing Values:')
print(df_owid.isnull().sum())
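
The raw `isnull().sum()` output is long for a wide table such as the OWID file. As a rough sketch, the counts can be condensed into a sorted table with percentages; `missing_report` is a hypothetical helper, and the toy frame only stands in for `df_owid`.

```python
import pandas as pd


def missing_report(df):
    """Return missing-value counts and percentages, most incomplete first."""
    report = pd.DataFrame({
        'missing': df.isnull().sum(),
        'percent': (df.isnull().mean() * 100).round(1),
    })
    # Only list columns that actually have gaps.
    return report[report['missing'] > 0].sort_values('missing', ascending=False)


# Toy frame standing in for df_owid (values are made up for illustration).
toy = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B'],
    'total_vaccinations': [None, 10.0, None, None],
    'population': [100, 100, 200, 200],
})
print(missing_report(toy))
```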

## 3. Data Cleaning

- Reshape JHU datasets to long format.
- Merge datasets.
- Handle missing values.
- Allow user input for country selection (stretch goal).
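
The reshaping step can be illustrated on a toy frame. This sketch assumes the JHU layout of four identifier columns followed by one column per date in `m/d/yy` form; all values are made up.

```python
import pandas as pd

# Toy wide-format frame mimicking the JHU layout (values are made up).
wide = pd.DataFrame({
    'Province/State': ['P1', 'P2', None],
    'Country/Region': ['X', 'X', 'Y'],
    'Lat': [0.0, 1.0, 2.0],
    'Long': [0.0, 1.0, 2.0],
    '1/22/20': [1, 2, 5],
    '1/23/20': [3, 4, 6],
})

# Wide to long: one row per (province, date).
long_df = wide.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='date',
    value_name='total_cases',
)
long_df['date'] = pd.to_datetime(long_df['date'], format='%m/%d/%y')

# Sum provinces so each country has exactly one row per date.
by_country = long_df.groupby(['Country/Region', 'date'], as_index=False)['total_cases'].sum()
print(by_country)
```

Country `X` has two provinces, so its rows are summed to a single series before any country-level merge.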

In [None]:
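# Reshape JHU datasets from wide (one column per date) to long format.
# Some countries are reported per province, so rows are summed to one row per
# country and date; otherwise the country-level merges below duplicate rows.
def reshape_jhu(df, value_name):
    long_df = pd.melt(
        df,
        id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
        value_vars=df.columns[4:],
        var_name='date',
        value_name=value_name
    )
    return long_df.groupby(['Country/Region', 'date'], as_index=False)[value_name].sum()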

df_confirmed_long = reshape_jhu(df_confirmed, 'total_cases')
df_deaths_long = reshape_jhu(df_deaths, 'total_deaths')
df_recovered_long = reshape_jhu(df_recovered, 'total_recovered')

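# Convert dates (JHU column headers use month/day/two-digit-year, e.g. '1/22/20')
df_confirmed_long['date'] = pd.to_datetime(df_confirmed_long['date'], format='%m/%d/%y')
df_deaths_long['date'] = pd.to_datetime(df_deaths_long['date'], format='%m/%d/%y')
df_recovered_long['date'] = pd.to_datetime(df_recovered_long['date'], format='%m/%d/%y')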
df_owid['date'] = pd.to_datetime(df_owid['date'])

# Merge JHU datasets
df_jhu = df_confirmed_long[['Country/Region', 'date', 'total_cases']].merge(
    df_deaths_long[['Country/Region', 'date', 'total_deaths']],
    on=['Country/Region', 'date']
).merge(
    df_recovered_long[['Country/Region', 'date', 'total_recovered']],
    on=['Country/Region', 'date'],
    how='left'
)

# Prepare OWID dataset
df_owid = df_owid[['location', 'date', 'total_vaccinations', 'people_fully_vaccinated', 'population', 'iso_code']]

# User input for countries
default_countries = ['Kenya', 'United States', 'India']
print('\nDefault countries:', default_countries)
user_input = input('Enter countries to analyze (comma-separated, e.g., Kenya,United States,India) or press Enter to use defaults: ')
# Ignore empty entries (e.g. trailing commas or whitespace-only input)
countries = [c.strip() for c in user_input.split(',') if c.strip()] or default_countries

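# Map between JHU and OWID country names (JHU uses 'US'; OWID uses 'United States').
# `countries` holds OWID-style names, so the reverse mapping is needed for JHU.
jhu_to_owid = {'US': 'United States'}
owid_to_jhu = {owid: jhu for jhu, owid in jhu_to_owid.items()}
countries_jhu = [owid_to_jhu.get(c, c) for c in countries]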

# Filter datasets
df_jhu_filtered = df_jhu[df_jhu['Country/Region'].isin(countries_jhu)].copy()
df_owid_filtered = df_owid[df_owid['location'].isin(countries)].copy()

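# Handle missing values. Cumulative vaccination series are forward-filled within
# each country so that reporting gaps do not appear as drops to zero.
df_jhu_filtered.fillna({'total_cases': 0, 'total_deaths': 0, 'total_recovered': 0}, inplace=True)
vax_cols = ['total_vaccinations', 'people_fully_vaccinated']
df_owid_filtered = df_owid_filtered.sort_values(['location', 'date'])
df_owid_filtered[vax_cols] = df_owid_filtered.groupby('location')[vax_cols].ffill().fillna(0)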

# Calculate daily metrics
df_jhu_filtered = df_jhu_filtered.sort_values(['Country/Region', 'date'])
df_jhu_filtered['new_cases'] = df_jhu_filtered.groupby('Country/Region')['total_cases'].diff().fillna(0)
df_jhu_filtered['new_deaths'] = df_jhu_filtered.groupby('Country/Region')['total_deaths'].diff().fillna(0)
df_jhu_filtered['death_rate'] = df_jhu_filtered['total_deaths'] / df_jhu_filtered['total_cases'].replace(0, 1)

# Calculate vaccination rate
df_owid_filtered['vaccination_rate'] = (df_owid_filtered['people_fully_vaccinated'] / df_owid_filtered['population']) * 100

# Merge datasets
df_jhu_filtered['Country/Region'] = df_jhu_filtered['Country/Region'].replace(jhu_to_owid)
df_combined = df_jhu_filtered.merge(
    df_owid_filtered,
    left_on=['Country/Region', 'date'],
    right_on=['location', 'date'],
    how='left'
).drop(columns=['location'])
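
A misspelled or differently named country is silently dropped by the `isin` filters above. A small sketch of a check that could run before filtering follows; `suggest_country_names` is a hypothetical helper built on the standard library's `difflib`.

```python
from difflib import get_close_matches


def suggest_country_names(requested, available, cutoff=0.6):
    """Map each requested name that is absent from `available` to close matches."""
    available = list(available)
    known = set(available)
    return {
        name: get_close_matches(name, available, n=3, cutoff=cutoff)
        for name in requested
        if name not in known
    }


# 'Keny' is a typo; 'India' is present and therefore not reported.
print(suggest_country_names(['Keny', 'India'], ['Kenya', 'India', 'US']))
```

String similarity catches typos only. Pairs such as `US` and `United States` are too different to be suggested, so an explicit name mapping is still needed for those.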

## 4. Exploratory Data Analysis

Analyze trends with visualizations and statistics.

In [None]:
# Line chart: Total cases
plt.figure(figsize=(12, 6))
for country in countries:
    country_data = df_combined[df_combined['Country/Region'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)
plt.title('Total COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Total Cases')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('visualizations/total_cases_over_time.png')
plt.close()

# Line chart: Total deaths
plt.figure(figsize=(12, 6))
for country in countries:
    country_data = df_combined[df_combined['Country/Region'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)
plt.title('Total COVID-19 Deaths Over Time')
plt.xlabel('Date')
plt.ylabel('Total Deaths')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('visualizations/total_deaths_over_time.png')
plt.close()

# Bar chart: Latest total cases
latest_data = df_combined[df_combined['date'] == df_combined['date'].max()]
plt.figure(figsize=(10, 6))
sns.barplot(x='Country/Region', y='total_cases', data=latest_data)
plt.title('Total Cases by Country (Latest Date)')
plt.xlabel('Country')
plt.ylabel('Total Cases')
plt.tight_layout()
plt.savefig('visualizations/total_cases_bar.png')
plt.close()

# Line chart: Daily new cases
plt.figure(figsize=(12, 6))
for country in countries:
    country_data = df_combined[df_combined['Country/Region'] == country]
    plt.plot(country_data['date'], country_data['new_cases'], label=country)
plt.title('Daily New COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('New Cases')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('visualizations/new_cases_over_time.png')
plt.close()
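
Daily counts tend to be noisy and can turn negative when a source revises its cumulative totals downward. The sketch below shows one common way to smooth them, on made-up data: clip negatives to zero, then take a rolling mean per country. The window is 3 here to keep the toy example small; a 7-day window is a common choice for daily COVID-19 series.

```python
import pandas as pd

# Toy daily counts for two countries (values are made up for illustration).
toy = pd.DataFrame({
    'Country/Region': ['X'] * 4 + ['Y'] * 4,
    'date': list(pd.date_range('2021-01-01', periods=4)) * 2,
    'new_cases': [10, -5, 20, 30, 0, 4, 8, 12],
})

# Downward revisions show up as negative daily values; clipping at zero is
# one simple (lossy) way to handle them.
toy['new_cases_clipped'] = toy['new_cases'].clip(lower=0)

# Rolling mean computed separately for each country.
toy['new_cases_smoothed'] = (
    toy.groupby('Country/Region')['new_cases_clipped']
    .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
print(toy)
```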

## 5. Vaccination Progress

Visualize vaccination trends.

In [None]:
# Line chart: Cumulative vaccinations
plt.figure(figsize=(12, 6))
for country in countries:
    country_data = df_combined[df_combined['Country/Region'] == country]
    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country)
plt.title('Cumulative COVID-19 Vaccinations Over Time')
plt.xlabel('Date')
plt.ylabel('Total Vaccinations')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('visualizations/total_vaccinations_over_time.png')
plt.close()

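# Bar chart: Vaccination rate. Vaccination figures are not reported for every
# date, so use each country's highest (i.e. latest reported) cumulative rate.
latest_vax = df_combined.groupby('Country/Region', as_index=False)['vaccination_rate'].max()
plt.figure(figsize=(10, 6))
sns.barplot(x='Country/Region', y='vaccination_rate', data=latest_vax)
plt.title('Fully Vaccinated Population (%) by Country (Latest Reported)')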
plt.xlabel('Country')
plt.ylabel('Vaccination Rate (%)')
plt.tight_layout()
plt.savefig('visualizations/vaccination_rate_bar.png')
plt.close()
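
Vaccination figures are often missing on the most recent dates. As an illustration on made-up data, the sketch below keeps, for each country, the latest row that actually has a reported value and derives the rate from it.

```python
import pandas as pd

# Toy vaccination series with gaps (values are made up for illustration).
toy = pd.DataFrame({
    'location': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03'] * 2),
    'people_fully_vaccinated': [50.0, 60.0, None, 10.0, None, None],
    'population': [100, 100, 100, 200, 200, 200],
})

# Keep the most recent row per country that has a reported value.
latest_reported = (
    toy.dropna(subset=['people_fully_vaccinated'])
    .sort_values('date')
    .groupby('location', as_index=False)
    .last()
)
latest_reported['vaccination_rate'] = (
    latest_reported['people_fully_vaccinated'] * 100 / latest_reported['population']
)
print(latest_reported[['location', 'date', 'vaccination_rate']])
```

Here `X` is taken from 2022-01-02 and `Y` from 2022-01-01, even though both have rows for 2022-01-03.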

## 6. Choropleth Map

Visualize global case distribution.

In [None]:
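# df_owid no longer has 'total_cases' after cleaning, so reload that column.
df_owid_cases = pd.read_csv(owid_url, usecols=['iso_code', 'location', 'date', 'total_cases'], parse_dates=['date'])
# Keep each country's most recent reported total; skip OWID aggregates such as 'OWID_WRL'.
is_aggregate = df_owid_cases['iso_code'].str.startswith('OWID_', na=True)
latest_global = (
    df_owid_cases[~is_aggregate].dropna(subset=['total_cases'])
    .sort_values('date').groupby('iso_code', as_index=False).last()
)
fig = px.choropleth(
    latest_global,
    locations='iso_code',
    color='total_cases',
    hover_name='location',
    color_continuous_scale='Viridis',
    title='Global COVID-19 Cases (Latest Reported)'
)
fig.write_html('visualizations/cases_map.html')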
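
Raw totals on a map mostly reflect population size. A possible refinement, sketched here on made-up numbers, is to color by cases per 100,000 residents; the `cases_per_100k` column name and the ISO codes are hypothetical.

```python
import pandas as pd

# Toy country totals (values and ISO codes are made up for illustration).
toy = pd.DataFrame({
    'iso_code': ['AAA', 'BBB'],
    'total_cases': [50_000, 50_000],
    'population': [1_000_000, 100_000_000],
})

# Cases per 100,000 residents: equal raw totals can hide very different burdens.
toy['cases_per_100k'] = toy['total_cases'] * 100_000 / toy['population']
print(toy)
```

A column like this could be passed as the `color` argument of `px.choropleth` in place of `total_cases`.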

## 7. Insights & Reporting

Summarize findings in a markdown file.

In [None]:
# The template below compares three countries, so require at least three.
# Its statements are narrative placeholders, not computed results; verify them
# against the charts and summary statistics before sharing the report.
if len(countries) < 3:
    raise ValueError('Select at least three countries to generate the insights report.')

insights = f"""
# COVID-19 Global Data Tracker: Key Insights

Analysis performed on data up to {df_combined['date'].max().strftime('%Y-%m-%d')} for {', '.join(countries)}.

1. **Case Trends**: {countries[0]} showed slower case growth compared to {countries[1]}, likely due to population and testing differences.
2. **Vaccination Rollout**: {countries[2]} achieved significant vaccination coverage by 2022, reflecting strong public health efforts.
3. **Death Rates**: {countries[0]} had a lower death rate than {countries[1]}, possibly due to demographics or reporting.
4. **Global Context**: The choropleth map highlights North America and Europe as case hotspots, with {countries[1]} among the highest.
5. **Data Limitations**: JHU stopped updating its recovery series in 2021 and ceased all data collection in March 2023, and early vaccination data may be incomplete.
"""

with open('docs/covid_insights.md', 'w') as f:
    f.write(insights)

# Summary statistics
print('\nSummary Statistics for Selected Countries:')
print(df_combined.groupby('Country/Region')[['total_cases', 'total_deaths', 'total_vaccinations', 'death_rate', 'vaccination_rate']].describe())

# Save dataset
df_combined.to_csv('data/covid_combined_data.csv', index=False)
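
The report text above is a fixed template. As a rough sketch of a data-driven alternative, the helper below builds one markdown bullet per country from its most recent row; `summarize_latest` is a hypothetical name, and the toy frame stands in for `df_combined`.

```python
import pandas as pd


def summarize_latest(df, country_col='Country/Region'):
    """Return one markdown bullet per country using its most recent row."""
    latest = df.sort_values('date').groupby(country_col).tail(1)
    lines = []
    for _, row in latest.sort_values(country_col).iterrows():
        lines.append(
            f"- **{row[country_col]}**: {row['total_cases']:,.0f} cases, "
            f"{row['total_deaths']:,.0f} deaths "
            f"(death rate {row['death_rate']:.2%})"
        )
    return '\n'.join(lines)


# Toy frame standing in for df_combined (values are made up).
toy = pd.DataFrame({
    'Country/Region': ['X', 'X', 'Y'],
    'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-02']),
    'total_cases': [100, 2000, 500],
    'total_deaths': [1, 40, 5],
    'death_rate': [0.01, 0.02, 0.01],
})
print(summarize_latest(toy))
```

The returned string could be interpolated into the report so that the figures always match the selected countries.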