# COVID-19 Global Data Tracker

**Author:** Generated for you

**Description:** This notebook loads a local COVID-19 dataset, cleans it, performs exploratory data analysis (EDA), visualizes case, death, and vaccination trends, and prints template insights for you to refine. Run the cells from top to bottom, following the instructions in each one.

---


## 1. Data Loading & Exploration

**Goal:** Load the dataset and explore its structure.

**Tasks:**

- Load data using `pd.read_csv()`
- Check columns: `df.columns`
- Preview rows: `df.head()`
- Identify missing values: `df.isnull().sum()`

**Note:** The notebook expects a CSV at the path below. If your file is elsewhere, change `file_path` and re-run the cell.
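The inspection steps listed above can be tried without the real file. The following is a minimal sketch that reads a small in-memory CSV; the sample rows and the name `sample_csv` are made up for illustration and are not taken from the actual dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV; values are illustrative only.
sample_csv = io.StringIO(
    "date,location,total_cases,total_deaths\n"
    "2021-01-01,Kenya,100,2\n"
    "2021-01-02,Kenya,,3\n"
    "2021-01-01,India,500,\n"
)

# parse_dates converts the date column while loading
sample = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample.columns.tolist())  # column names
print(sample.head())            # first rows
print(sample.dtypes)            # data types

# Count missing values per column
missing = sample.isnull().sum()
print(missing)
```

Passing a file path instead of `sample_csv` gives the same inspection for a file on disk.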


In [None]:
# Step 1: Load and inspect the CSV file
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from IPython.display import display

file_path = r"C:\Users\user\Desktop\2025\PLP\Python\DataSet\Covid Data.csv"  # raw string: single backslashes
print("Loading file:", file_path)

try:
    df = pd.read_csv(file_path)
    print("✅ Data loaded successfully.")
except Exception as e:
    print("❌ Error loading file. Please check the path and file format. Error: ", e)
    raise

print("\nColumns:")
print(df.columns.tolist())

print("\nFirst 5 rows:")
display(df.head())

print("\nData types:")
print(df.dtypes)

print("\nMissing values per column:")
print(df.isnull().sum())


## 2. Data Cleaning

**Goal:** Prepare data for analysis.

**Tasks:**

- Filter countries of interest (default: Kenya, United States, India)
- Drop rows with missing dates/critical values
- Convert `date` column to `datetime`
- Handle missing numeric values with `fillna()` or `interpolate()`
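The choice between `fillna()` and `interpolate()` matters for cumulative columns, where a zero in the middle of a series looks like a sudden drop. The sketch below compares three options on a made-up frame (the name `toy` and its values are illustrative assumptions), filling within each country so values never leak from one country into another.

```python
import pandas as pd

# Hypothetical toy data with gaps in a cumulative column
toy = pd.DataFrame({
    "location": ["Kenya", "Kenya", "Kenya", "India", "India", "India"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "total_cases": [100.0, None, 130.0, None, 500.0, None],
})
toy = toy.sort_values(["location", "date"])

# Option 1: zero fill (simple, but creates artificial dips in cumulative series)
toy["total_cases_zero"] = toy["total_cases"].fillna(0)

# Option 2: carry the last known value forward within each country
toy["total_cases_ffill"] = toy.groupby("location")["total_cases"].ffill()

# Option 3: linear interpolation within each country, only between known values
toy["total_cases_interp"] = toy.groupby("location")["total_cases"].transform(
    lambda s: s.interpolate(limit_area="inside")
)

print(toy)
```

Forward fill leaves leading gaps empty and interpolation with `limit_area="inside"` leaves both leading and trailing gaps empty, so a final `fillna(0)` may still be needed depending on the analysis.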

**Important:** This cell attempts to detect common COVID column names (`date`, `location`, `total_cases`, `total_deaths`, `new_cases`, `new_deaths`, `total_vaccinations`). If your CSV uses different column names, update the column mapping below and re-run.
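If your CSV uses different headers, one possible way to extend the detection is an alias table. This is only a sketch: the `ALIASES` dictionary, the `detect_columns` helper, and the example headers are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical aliases; replace with the header names found in your own CSV.
ALIASES = {
    "date": ["date", "day", "report_date"],
    "location": ["location", "country", "country_name"],
    "total_cases": ["total_cases", "confirmed", "cases"],
    "total_deaths": ["total_deaths", "deaths"],
}


def detect_columns(columns, aliases=ALIASES):
    """Map each canonical name to the first matching column (case-insensitive), or None."""
    lookup = {c.strip().lower(): c for c in columns}
    mapping = {}
    for canonical, candidates in aliases.items():
        mapping[canonical] = next((lookup[a] for a in candidates if a in lookup), None)
    return mapping


# Example with made-up headers
raw = pd.DataFrame(columns=["Day", "Country", "Confirmed", "Recovered"])
mapping = detect_columns(raw.columns)

# Rename only the columns that were found
rename_map = {orig: canon for canon, orig in mapping.items() if orig is not None}
renamed = raw.rename(columns=rename_map)

print(mapping)
print(renamed.columns.tolist())
```

Columns that match no alias (here `total_deaths`) map to `None`, which mirrors how the cleaning cell reports missing critical columns.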


In [None]:
# Step 2: Clean the data

# Default countries to analyze - edit as needed
countries = ['Kenya', 'United States', 'India']

# Column mapping: adjust if your CSV uses different names
expected_cols = {
    'date': None,
    'location': None,
    'total_cases': None,
    'total_deaths': None,
    'new_cases': None,
    'new_deaths': None,
    'total_vaccinations': None
}

# Attempt to auto-detect columns by name (case-insensitive)
lower_cols = {c.lower(): c for c in df.columns}

for key in list(expected_cols.keys()):
    if key in lower_cols:
        expected_cols[key] = lower_cols[key]

print('Detected column mapping:')
for k,v in expected_cols.items():
    print(f"  {k}: {v}")

# If critical columns are missing, show a helpful message
critical = ['date','location','total_cases','total_deaths']
missing_crit = [c for c in critical if expected_cols[c] is None]
if missing_crit:
    print('\n⚠️ Missing critical columns detected:', missing_crit)
    print('If your dataset uses different column names, please update the `expected_cols` mapping and re-run this cell.')
else:
    # Rename columns locally for convenience
    rename_map = {}
    for orig_key, mapped in expected_cols.items():
        if mapped is not None:
            rename_map[mapped] = orig_key
    df = df.rename(columns=rename_map)
    # Keep only the columns used later; 'population' is retained when present so Step 4 can use it
    keep_cols = ['date','location','total_cases','total_deaths','new_cases','new_deaths','total_vaccinations','population']
    df = df[[c for c in keep_cols if c in df.columns]]

    # Convert date
    df['date'] = pd.to_datetime(df['date'], errors='coerce')

    # Drop rows missing date or total_cases
    df = df.dropna(subset=['date','total_cases'])

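    # Fill numeric NaNs. Cumulative totals are forward-filled within each country so gaps
    # do not appear as artificial drops to zero; anything still missing becomes 0.
    # (You can change this to interpolate() if preferred.)
    num_cols = [c for c in ['total_cases','total_deaths','new_cases','new_deaths','total_vaccinations'] if c in df.columns]
    cum_cols = [c for c in num_cols if c.startswith('total_')]
    df = df.sort_values(['location','date'])
    df[cum_cols] = df.groupby('location')[cum_cols].ffill()
    df[num_cols] = df[num_cols].fillna(0)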

    # Filter to the chosen countries if 'location' exists
    if 'location' in df.columns:
        available_countries = df['location'].unique().tolist()
        print('\nAvailable countries sample (first 30):', available_countries[:30])
        chosen = [c for c in countries if c in available_countries]
        if not chosen:
            print('\n⚠️ None of the default countries were found in your data. Please set `countries` to names present in your CSV.')
        else:
            countries = chosen
            # .copy() so later column assignments do not act on a view of the original frame
            df = df[df['location'].isin(countries)].copy()
            print('\nFiltering to countries:', countries)

    print('\nData after cleaning - preview:')
    display(df.head())

    print('\nDate range in the dataset:', df['date'].min().date(), 'to', df['date'].max().date())


## 3. Exploratory Data Analysis (EDA)

**Goal:** Generate descriptive statistics & explore trends.

**Tasks:**

- Plot total cases over time for selected countries
- Plot total deaths over time
- Compare daily new cases between countries
- Calculate the death rate: `total_deaths / total_cases`

**Visualizations:** Line charts, bar charts, optional heatmap
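The death rate formula `total_deaths / total_cases` divides by zero on rows where no cases have been recorded yet. A minimal sketch of one way to guard against that, using made-up numbers in a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data; the first row has zero cases
toy = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India"],
    "total_cases": [0, 200, 1000],
    "total_deaths": [0, 4, 15],
})

# Mask zero-case rows so the division yields NaN instead of inf or a 0/0 result
cases = toy["total_cases"].where(toy["total_cases"] > 0)
toy["death_rate"] = toy["total_deaths"] / cases * 100  # percent

print(toy)
```

NaN values are skipped by matplotlib line plots and by pandas aggregations such as `mean()`, so masked rows do not distort the charts.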


In [None]:
# Step 3: EDA & Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (12,6)

# Ensure we have data
if df.empty:
    print("No data available after cleaning. Check column mapping and selected countries.")
else:
    # Calculate death rate (%); rows with zero cases yield NaN instead of inf
    if 'total_deaths' in df.columns and 'total_cases' in df.columns:
        df['death_rate'] = df['total_deaths'] / df['total_cases'].where(df['total_cases'] > 0) * 100

    # Total cases over time
    plt.figure()
    for country in countries:
        cd = df[df['location'] == country].sort_values('date')
        if 'total_cases' in cd.columns:
            plt.plot(cd['date'], cd['total_cases'], label=country)
    plt.title('Total COVID-19 Cases Over Time')
    plt.xlabel('Date')
    plt.ylabel('Total Cases')
    plt.legend()
    plt.tight_layout()
    plt.show()

    # Total deaths over time
    if 'total_deaths' in df.columns:
        plt.figure()
        for country in countries:
            cd = df[df['location'] == country].sort_values('date')
            plt.plot(cd['date'], cd['total_deaths'], label=country)
        plt.title('Total COVID-19 Deaths Over Time')
        plt.xlabel('Date')
        plt.ylabel('Total Deaths')
        plt.legend()
        plt.tight_layout()
        plt.show()
    else:
        print('Column total_deaths not found - skipping death plots.')

    # Daily new cases
    if 'new_cases' in df.columns:
        plt.figure()
        for country in countries:
            cd = df[df['location'] == country].sort_values('date')
            plt.plot(cd['date'], cd['new_cases'], label=country)
        plt.title('Daily New COVID-19 Cases')
        plt.xlabel('Date')
        plt.ylabel('New Cases')
        plt.legend()
        plt.tight_layout()
        plt.show()
    else:
        print('Column new_cases not found - skipping daily new cases plot.')

    # Bar chart: Top countries by latest total_cases
    if 'total_cases' in df.columns and 'location' in df.columns:
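        # Latest row per country; idxmax avoids groupby.apply touching the grouping column, which newer pandas versions deprecate
        latest = df.loc[df.groupby('location')['date'].idxmax()].reset_index(drop=True)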
        top10 = latest.nlargest(10, 'total_cases')
        plt.figure(figsize=(10,6))
        sns.barplot(x='total_cases', y='location', data=top10)
        plt.title('Top countries by total cases (latest available date)')
        plt.xlabel('Total Cases')
        plt.ylabel('Country')
        plt.tight_layout()
        plt.show()

    # Optional: Correlation heatmap for numeric columns
    numeric = df.select_dtypes(include=['number']).columns.tolist()
    if len(numeric) >= 2:
        plt.figure(figsize=(8,6))
        sns.heatmap(df[numeric].corr(), annot=True, fmt='.2f', cmap='coolwarm')
        plt.title('Correlation between numeric variables')
        plt.tight_layout()
        plt.show()


## 4. Visualizing Vaccination Progress

**Goal:** Analyze vaccination rollouts.

**Tasks:**

- Plot cumulative vaccinations over time for selected countries
- Compare vaccinations relative to population (if a `population` column exists)
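When comparing vaccinations against population, keep in mind that a `total_vaccinations` column often counts doses administered rather than people (this is an assumption about your dataset; check its documentation). Under that assumption the ratio is better read as doses per 100 people and can exceed 100. A small sketch with made-up numbers; the `population` series here is hypothetical:

```python
import pandas as pd

# Hypothetical latest-day figures; not real statistics
latest = pd.DataFrame({
    "location": ["Kenya", "India"],
    "total_vaccinations": [1_000_000, 45_000_000],
})

# Hypothetical population lookup, used when the CSV has no population column
population = pd.Series({"Kenya": 50_000_000, "India": 1_500_000_000})

latest["population"] = latest["location"].map(population)
latest["doses_per_100"] = latest["total_vaccinations"] / latest["population"] * 100

print(latest.sort_values("doses_per_100", ascending=False))
```

If your dataset also has a column counting people with at least one dose, dividing that column by population gives a true percentage.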


In [None]:
# Step 4: Vaccination Progress
if 'total_vaccinations' not in df.columns:
    print("Column 'total_vaccinations' not found in dataset. If you have vaccination data under a different column name, update the mapping in the cleaning cell and re-run.")
else:
    plt.figure()
    for country in countries:
        cd = df[df['location'] == country].sort_values('date')
        plt.plot(cd['date'], cd['total_vaccinations'], label=country)
    plt.title('Cumulative Vaccinations Over Time')
    plt.xlabel('Date')
    plt.ylabel('Total Vaccinations')
    plt.legend()
    plt.tight_layout()
    plt.show()

    # If population column exists, compute vaccinations per 100 people
    if 'population' in df.columns:
        latest = df.loc[df.groupby('location')['date'].idxmax()].reset_index(drop=True)
        # If total_vaccinations counts doses rather than people, this value can exceed 100
        latest['pct_vaccinated'] = latest['total_vaccinations'] / latest['population'] * 100
        plt.figure(figsize=(8,6))
        sns.barplot(x='pct_vaccinated', y='location', data=latest.sort_values('pct_vaccinated', ascending=False).head(10))
        plt.xlabel('Vaccinations per 100 people')
        plt.ylabel('Country')
        plt.title('Top countries by vaccinations per 100 people (latest)')
        plt.tight_layout()
        plt.show()
    else:
        print('Population column not found - skipping % vaccinated computation.')


## 5. Optional: Choropleth Map

**Goal:** Visualize cases or vaccination rates by country on a world map using Plotly Express.

**Tasks:** Prepare a dataframe with `location` and `total_cases` for the latest date, then plot.
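One way to prepare the latest-date frame is to look up, for each country, the index of its most recent date. The sketch below uses a hypothetical `toy` frame with made-up values:

```python
import pandas as pd

# Hypothetical toy data; rows are deliberately out of date order
toy = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India", "India"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-01"]),
    "total_cases": [100, 120, 520, 500],
})

# Index label of the most recent date within each country
latest_idx = toy.groupby("location")["date"].idxmax()

# Select those rows and keep only the columns the map needs
latest = toy.loc[latest_idx, ["location", "date", "total_cases"]].reset_index(drop=True)

print(latest)
```

Because countries may report on different days, the "latest" date can differ from row to row; keeping the `date` column makes that visible.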

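**Note:** Plotly renders an interactive figure inside the notebook. If Plotly is not installed, run `pip install plotly`. Because the cleaning step filters `df` to the selected countries, only those countries are shaded on the map.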


In [None]:
# Step 5: Choropleth map (interactive)
try:
    import plotly.express as px
except ImportError:
    print("Plotly not available. Install plotly to view maps: pip install plotly")
    raise

if 'location' in df.columns and 'total_cases' in df.columns:
    # Latest row per country (idxmax keeps the 'location' column intact across pandas versions)
    latest = df.loc[df.groupby('location')['date'].idxmax()].reset_index(drop=True)
    # Some location names may not match Plotly country names; minor mismatches possible.
    fig = px.choropleth(latest,
                        locations='location',
                        locationmode='country names',
                        color='total_cases',
                        hover_name='location',
                        color_continuous_scale='Reds',
                        title='Global COVID-19 Cases (latest available)')
    fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
    fig.show()
else:
    print('Required columns for choropleth not found (location and total_cases).')


## 6. Insights & Reporting

**Goal:** Summarize findings and export the notebook to PDF or HTML for presentation.

**Tasks:**

- Write 3–5 key insights from the analysis
- Highlight anomalies or interesting patterns
- Use Markdown cells for narrative explanations
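To help spot anomalies worth mentioning, one simple check is to look for negative daily counts or cumulative totals that decrease, both of which typically indicate retrospective data corrections. This is a sketch on made-up data; the `toy` frame and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data containing one correction on 2021-01-03
toy = pd.DataFrame({
    "location": ["Kenya"] * 4,
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "total_cases": [100, 120, 115, 140],
    "new_cases": [10, 20, -5, 25],
})
toy = toy.sort_values(["location", "date"])

# Days reporting a negative number of new cases
negative_new = toy[toy["new_cases"] < 0]

# Days where the cumulative total went down compared with the previous day
step = toy.groupby("location")["total_cases"].diff()
decreasing_total = toy[step < 0]

print(negative_new[["location", "date", "new_cases"]])
print(decreasing_total[["location", "date", "total_cases"]])
```

Rows flagged this way are candidates for a sentence in the report, for example noting the date and size of a correction.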

**Export tips:** In Jupyter: `File → Download as → PDF via LaTeX` (requires LaTeX), or `Download as → HTML` for easy sharing. You can also take screenshots and assemble a PowerPoint.


In [None]:
# Step 6: Generate basic narrative insights (edit these to suit your findings)
if df.empty:
    print("No data to summarize.")
else:
    date_min = df['date'].min().date()
    date_max = df['date'].max().date()
    print(f"Data covers: {date_min} to {date_max}\n")
    # Auto insights - basic templates; update based on your visualizations
    latest = df.loc[df.groupby('location')['date'].idxmax()].reset_index(drop=True)
    top_cases = latest.nlargest(3, 'total_cases')['location'].tolist() if 'total_cases' in latest.columns else []
    print("Key insights (auto-generated - please review & refine):\n")
    if top_cases:
        print(f"1. The countries with the highest total cases in the dataset (latest) include: {', '.join(top_cases)}.")
    if 'total_vaccinations' in latest.columns:
        top_vax = latest.nlargest(3, 'total_vaccinations')['location'].tolist()
        print(f"2. The countries with the highest vaccination totals include: {', '.join(top_vax)}.")
    if 'death_rate' in df.columns:
        highest_dr = latest.nlargest(3, 'death_rate')['location'].tolist()
        print(f"3. Countries with relatively higher death rates include: {', '.join(highest_dr)}.")
    print("\nReview the charts above and replace these auto-insights with specific, contextual comments for your report.")


----

### Running the notebook

Run the cells sequentially from top to bottom. If your CSV uses different country or column names, edit `countries` or the `expected_cols` mapping in the cleaning cell.
