# ETL Lab Exercise: Global COVID-19 Time Series Data

## Problem Statement

In this lab, you will build a complete ETL (extract, transform, load) pipeline in Google Colab using the publicly available global COVID-19 confirmed cases dataset.

---

### Tasks

1. **Extract:**
   - Download the global confirmed COVID-19 time series CSV file directly from the Johns Hopkins University GitHub repository:
     https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
   - Load the CSV file into a pandas DataFrame.

2. **Transform:**
   - Clean the data by handling missing values or anomalous entries.
   - Reshape the data from wide format (many date columns) to long format: columns `[Country/Region, Date, Confirmed Cases]`.
   - Convert the date column to datetime format and aggregate confirmed cases monthly by country.
   - Create new features such as monthly case increase and growth rates.

3. **Load:**
   - Store the transformed data into a local SQLite database within Colab.
   - Write and run SQL queries to:
     - Retrieve the top 5 countries by confirmed cases for any selected month.
     - Compare monthly growth rates for specified countries.
     - Identify countries with zero reported cases for given time periods.

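As a minimal sketch of the reshape in the Transform step, the snippet below melts a tiny hand-made frame that mimics the wide layout (one column per date) into long format. The two countries, their values, and the helper name `wide_to_long` are made up for illustration; only the identifier column names and the `M/D/YY` date labels follow the real file.

```python
import pandas as pd

def wide_to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Melt one-column-per-date data into [Country/Region, Date, Confirmed Cases]."""
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    date_cols = [c for c in wide.columns if c not in id_cols]
    long_df = wide.melt(
        id_vars=["Country/Region"],
        value_vars=date_cols,
        var_name="Date",
        value_name="Confirmed Cases",
    )
    # Column labels look like 1/22/20, so parse them with an explicit format
    long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")
    # Sum provinces/states so each country has one row per date
    return (
        long_df.groupby(["Country/Region", "Date"], as_index=False)["Confirmed Cases"]
        .sum()
    )

# Hypothetical toy data: one country split into two provinces, one without
toy = pd.DataFrame({
    "Province/State": ["East", "West", None],
    "Country/Region": ["Freedonia", "Freedonia", "Sylvania"],
    "Lat": [0.0, 0.0, 1.0],
    "Long": [0.0, 1.0, 1.0],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})
long_toy = wide_to_long(toy)
print(long_toy)
```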
---

### Dataset Details

- Dataset: Global confirmed COVID-19 cases (cumulative counts, one column per day)  
- Source URL:  
  https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

---

### Learning Outcomes

- Practice ETL with real-world pandemic data.
- Gain experience reshaping and aggregating time series data in pandas.
- Learn to store and query complex datasets in SQLite locally.
- Understand temporal patterns through analytic queries.

---


In [6]:
import numpy as np
import pandas as pd
import sqlite3

# Step 1: Extract
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
df = pd.read_csv(url)

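# Step 2: Transform
# Drop columns that are not needed for country-level totals
df = df.drop(columns=["Lat", "Long", "Province/State"], errors='ignore')

# Sum provinces/states so each country has a single row
df = df.groupby("Country/Region").sum()

# Transpose so dates are rows, then parse the M/D/YY column labels.
# An explicit format avoids pandas having to guess the date format.
df = df.T
df.index = pd.to_datetime(df.index, format="%m/%d/%y")
df = df.reset_index().rename(columns={"index": "Date"})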

# Melt into long format
df_melt = df.melt(id_vars=["Date"], var_name="Country", value_name="Confirmed")

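# Create Month column (first day of each month)
df_melt["Month"] = df_melt["Date"].dt.to_period("M").dt.to_timestamp()

# Group by Country and Month.
# Note: the source counts are cumulative, so .sum() adds up the daily running
# totals of each month; .last() would give the month-end cumulative total.
monthly = (
    df_melt.groupby(["Country", "Month"])["Confirmed"]
    .sum()
    .reset_index()
)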

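# Feature engineering: month-over-month change within each country.
# The first month of a country has no predecessor, so its increase is set to 0.
monthly["Monthly_Increase"] = (
    monthly.groupby("Country")["Confirmed"]
    .diff()
    .fillna(0)
)
# Relative change versus the previous month. A previous value of 0 produces
# inf or NaN, which are both replaced by 0.
monthly["Growth_Rate"] = (
    monthly.groupby("Country")["Confirmed"]
    .pct_change()
    .replace([np.inf, -np.inf], np.nan)
    .fillna(0)
)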

# Step 3: Load to SQLite
conn = sqlite3.connect("covid_data.db")
monthly.to_sql("covid_monthly", conn, if_exists="replace", index=False)

# Step 4: SQL Queries

# Top 5 countries by confirmed cases in March 2021
query1 = """
SELECT Country, Confirmed
FROM covid_monthly
WHERE Month = '2021-03-01'
ORDER BY Confirmed DESC
LIMIT 5;
"""
print("Top 5 countries by confirmed cases in March 2021:")
print(pd.read_sql(query1, conn))

# Monthly growth rate for India
query2 = """
SELECT Month, Confirmed, Monthly_Increase, ROUND(Growth_Rate * 100, 2) AS Growth_Percent
FROM covid_monthly
WHERE Country = 'India'
ORDER BY Month;
"""
print("\nMonthly growth rate for India:")
print(pd.read_sql(query2, conn))

# Countries with 0 reported cases in March 2020
query3 = """
SELECT Country
FROM covid_monthly
WHERE Month = '2020-03-01' AND Confirmed = 0;
"""
print("\nCountries with 0 reported cases in March 2020:")
print(pd.read_sql(query3, conn))

# Close the database connection once all queries have run
conn.close()


Top 5 countries by confirmed cases in March 2021:
Empty DataFrame
Columns: [Country, Confirmed]
Index: []

Monthly growth rate for India:
                  Month   Confirmed  Monthly_Increase  Growth_Percent
0   2020-01-01 00:00:00           2               0.0            0.00
1   2020-02-01 00:00:00          84              82.0         4100.00
2   2020-03-01 00:00:00       10252           10168.0        12104.76
3   2020-04-01 00:00:00      447607          437355.0         4266.05
4   2020-05-01 00:00:00     3088494         2640887.0          590.00
5   2020-06-01 00:00:00    10951713         7863219.0          254.60
6   2020-07-01 00:00:00    32829678        21877965.0          199.77
7   2020-08-01 00:00:00    82734792        49905114.0          152.01
8   2020-09-01 00:00:00   151735176        69000384.0           83.40
9   2020-10-01 00:00:00   228641810        76906634.0           50.68
10  2020-11-01 00:00:00   265835139        37193329.0           16.27
11  2020-12-01 00:00:0

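The empty result for March 2021 is consistent with how the `Month` values appear in the India output: they are stored as text such as `2020-01-01 00:00:00`, which never equals the literal `'2021-03-01'`. One possible fix, sketched here on made-up rows, is to store `Month` as a `YYYY-MM-DD` string before calling `to_sql` and to pass the selected month as a query parameter. The table contents, the in-memory database, and the helper name `top_countries` are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical monthly rows; the numbers are invented for illustration
monthly = pd.DataFrame({
    "Country": ["Freedonia", "Sylvania", "Freedonia", "Sylvania"],
    "Month": pd.to_datetime(["2021-02-01", "2021-02-01", "2021-03-01", "2021-03-01"]),
    "Confirmed": [100, 80, 150, 200],
})

# Store the month as plain YYYY-MM-DD text so equality filters match
monthly["Month"] = monthly["Month"].dt.strftime("%Y-%m-%d")

conn = sqlite3.connect(":memory:")
monthly.to_sql("covid_monthly", conn, if_exists="replace", index=False)

def top_countries(conn, month, limit=5):
    """Return the countries with the most confirmed cases in one month."""
    query = """
    SELECT Country, Confirmed
    FROM covid_monthly
    WHERE Month = ?
    ORDER BY Confirmed DESC
    LIMIT ?;
    """
    # Parameters keep the month selectable without editing the SQL text
    return pd.read_sql(query, conn, params=(month, limit))

top_march = top_countries(conn, "2021-03-01")
print(top_march)
conn.close()
```

If the table already holds values with a time part, filtering on `date(Month) = ?` is another option, since SQLite's `date()` function returns only the `YYYY-MM-DD` portion.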

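The JHU series reports cumulative totals, so summing the daily values within a month adds up running totals, which would explain why the India figures in the output grow so quickly. A hedged alternative, shown here on invented numbers, keeps the last cumulative value of each month and derives the monthly increase and growth rate from it. The toy data and the helper name `monthly_features` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def monthly_features(long_df: pd.DataFrame) -> pd.DataFrame:
    """Month-end cumulative totals plus month-over-month change per country."""
    out = long_df.sort_values(["Country", "Date"]).copy()
    out["Month"] = out["Date"].dt.to_period("M").dt.to_timestamp()
    # For a cumulative series, the last value in a month is the month-end total
    monthly = out.groupby(["Country", "Month"], as_index=False)["Confirmed"].last()
    by_country = monthly.groupby("Country")["Confirmed"]
    # First month of each country has no predecessor, so fill with 0
    monthly["Monthly_Increase"] = by_country.diff().fillna(0)
    monthly["Growth_Rate"] = (
        by_country.pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
    )
    return monthly

# Hypothetical cumulative counts for one made-up country
toy = pd.DataFrame({
    "Country": ["Freedonia"] * 4,
    "Date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-28", "2020-02-29"]),
    "Confirmed": [10, 20, 50, 60],
})
result = monthly_features(toy)
print(result)
```

With this approach, `Monthly_Increase` is the number of new cases in the month and `Confirmed` stays comparable to the totals in the source file.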