# ETL Lab Exercise: Global COVID-19 Time Series Data

## Problem Statement

In this lab, you will build a complete ETL pipeline in Google Colab using the publicly available global dataset of confirmed COVID-19 cases.

---

### Tasks

1. **Extract:**
   - Download the global confirmed COVID-19 time series CSV file directly from the Johns Hopkins University GitHub repository:
     https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
   - Load the CSV file into a pandas DataFrame.

2. **Transform:**
   - Clean the data by handling missing values or anomalous entries.
   - Reshape the data from wide format (many date columns) to long format: columns `[Country/Region, Date, Confirmed Cases]`.
   - Convert the date column to datetime format and aggregate confirmed cases monthly by country.
   - Create new features such as monthly case increase and growth rates.

3. **Load:**
   - Store the transformed data into a local SQLite database within Colab.
   - Write and run SQL queries to:
     - Retrieve the top 5 countries by confirmed cases for any selected month.
     - Compare monthly growth rates for specified countries.
     - Identify countries with zero reported cases for given time periods.

---

### Dataset Details

- Dataset: Global daily confirmed COVID-19 cases  
- Source URL:  
  https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

---

### Learning Outcomes

- Practice ETL with real-world pandemic data.
- Experience reshaping and aggregating time series data in pandas.
- Learn to store and query complex datasets in SQLite locally.
- Understand temporal patterns through analytic queries.

---


In [1]:
import pandas as pd

In [2]:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
df_raw = pd.read_csv(url)
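
Before transforming, it can help to confirm that the downloaded frame has the layout the later cells assume: four identifier columns followed by one column per date. The following is a minimal sketch, not part of the original notebook. The helper name `check_wide_layout` is hypothetical, and a tiny hand-made frame stands in for the real download.

```python
import pandas as pd

ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]

def check_wide_layout(frame: pd.DataFrame) -> list[str]:
    """Return the date column names; raise ValueError if the layout is unexpected.

    Hypothetical helper, for illustration only.
    """
    missing = [c for c in ID_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"missing identifier columns: {missing}")
    date_cols = [c for c in frame.columns if c not in ID_COLUMNS]
    # Every remaining header should parse as month/day/two-digit year.
    pd.to_datetime(pd.Series(date_cols), format="%m/%d/%y")
    return date_cols

# Toy stand-in for df_raw (made-up values).
sample = pd.DataFrame({
    "Province/State": [None], "Country/Region": ["Testland"],
    "Lat": [0.0], "Long": [0.0], "1/22/20": [0], "1/23/20": [1],
})
date_columns = check_wide_layout(sample)
print(date_columns)
```

On the real data, the same check would be called as `check_wide_layout(df_raw)`.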

In [3]:
print(df_raw.head())

  Province/State Country/Region       Lat       Long  1/22/20  1/23/20  \
0            NaN    Afghanistan  33.93911  67.709953        0        0   
1            NaN        Albania  41.15330  20.168300        0        0   
2            NaN        Algeria  28.03390   1.659600        0        0   
3            NaN        Andorra  42.50630   1.521800        0        0   
4            NaN         Angola -11.20270  17.873900        0        0   

   1/24/20  1/25/20  1/26/20  1/27/20  ...  2/28/23  3/1/23  3/2/23  3/3/23  \
0        0        0        0        0  ...   209322  209340  209358  209362   
1        0        0        0        0  ...   334391  334408  334408  334427   
2        0        0        0        0  ...   271441  271448  271463  271469   
3        0        0        0        0  ...    47866   47875   47875   47875   
4        0        0        0        0  ...   105255  105277  105277  105277   

   3/4/23  3/5/23  3/6/23  3/7/23  3/8/23  3/9/23  
0  209369  209390  209406  2

In [4]:
# Drop irrelevant columns
df = df_raw.drop(columns=["Lat", "Long"])

# Fill missing Country/Region values, if any. Assign the result back instead of
# calling fillna(..., inplace=True) on a column selection, which pandas deprecates.
df["Country/Region"] = df["Country/Region"].fillna("Unknown")

# Check nulls
print(df.isnull().sum())

Province/State    198
Country/Region      0
1/22/20             0
1/23/20             0
1/24/20             0
                 ... 
3/5/23              0
3/6/23              0
3/7/23              0
3/8/23              0
3/9/23              0
Length: 1145, dtype: int64


In [5]:
df_long = df.melt(id_vars=["Province/State", "Country/Region"],
                  var_name="Date",
                  value_name="Confirmed")

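# Convert Date to datetime; the headers look like "1/22/20" (month/day/two-digit year),
# so pass an explicit format instead of letting pandas infer it per element
df_long["Date"] = pd.to_datetime(df_long["Date"], format="%m/%d/%y")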

# Group by Country and Date
df_grouped = df_long.groupby(["Country/Region", "Date"], as_index=False)["Confirmed"].sum()
df_grouped.rename(columns={"Country/Region": "Country"}, inplace=True)
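
The reshape above is easier to follow on a small frame. The sketch below uses made-up countries and counts (not taken from the dataset) to show how `melt` turns each date column into rows and how the `groupby` then collapses provinces into a single row per country and date.

```python
import pandas as pd

# Toy wide frame: two provinces of one country plus a country without provinces.
wide = pd.DataFrame({
    "Province/State": ["North", "South", None],
    "Country/Region": ["Aland", "Aland", "Bland"],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})

# Wide -> long: one row per (province, country, date).
long = wide.melt(id_vars=["Province/State", "Country/Region"],
                 var_name="Date", value_name="Confirmed")
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")

# Sum provinces so each country has exactly one row per date.
per_country = (long.groupby(["Country/Region", "Date"], as_index=False)["Confirmed"]
                   .sum())
print(per_country)
```

Three rows with two date columns become six long rows, and the two Aland provinces are summed into one value per date.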



In [6]:
# Extract month
df_grouped["YearMonth"] = df_grouped["Date"].dt.to_period("M")

# Monthly confirmed: counts are cumulative, so the monthly max stands in for the last-day-of-month value
df_monthly = df_grouped.groupby(["Country", "YearMonth"]).agg({
    "Confirmed": "max"
}).reset_index()

# Convert YearMonth back to datetime (last day of month)
df_monthly["YearMonth"] = df_monthly["YearMonth"].dt.to_timestamp("M")
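
Taking the monthly maximum matches the month-end value only while the cumulative series never decreases. If a source revises a count downward, the two differ. The sketch below uses invented numbers to contrast `max` with an explicit "last observation of the month" aggregation; which one is appropriate for the lab is a judgment call.

```python
import pandas as pd

# Toy daily cumulative counts with a downward revision on 2020-01-31.
daily = pd.DataFrame({
    "Country": ["Aland"] * 4,
    "Date": pd.to_datetime(["2020-01-29", "2020-01-30", "2020-01-31", "2020-02-01"]),
    "Confirmed": [10, 12, 11, 13],
})
daily["YearMonth"] = daily["Date"].dt.to_period("M")

# Highest value seen in each month (what the cell above computes).
by_max = daily.groupby(["Country", "YearMonth"])["Confirmed"].max()

# Value on the last available day of each month.
by_last = (daily.sort_values("Date")
                .groupby(["Country", "YearMonth"])["Confirmed"].last())

print(by_max.tolist(), by_last.tolist())
```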

In [7]:
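# diff() and pct_change() are undefined for each country's first month (NaN), so fill those with 0.
# Note: a month that follows a zero count has an infinite growth rate (inf), which fillna(0) leaves in place.
df_monthly["Monthly_Increase"] = df_monthly.groupby("Country")["Confirmed"].diff().fillna(0)
df_monthly["Growth_Rate_%"] = df_monthly.groupby("Country")["Confirmed"].pct_change().fillna(0) * 100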
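
Growth from a zero baseline is mathematically undefined, and `pct_change` reports it as `inf`. One possible way to handle this, shown as a sketch on made-up numbers, is to convert infinite values to `NaN` so they are stored as missing rather than as a number. This is an optional variation, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy monthly cumulative counts for one country.
monthly = pd.DataFrame({
    "Country": ["Aland"] * 4,
    "Confirmed": [0, 0, 4, 6],
})

growth = monthly.groupby("Country")["Confirmed"].pct_change() * 100
# First month: NaN (no predecessor); 0 -> 0: NaN (0/0); 0 -> 4: inf (4/0); 4 -> 6: 50.0.
growth_clean = growth.replace([np.inf, -np.inf], np.nan)
print(growth_clean.tolist())
```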

In [8]:
import sqlite3

# Create connection and load
conn = sqlite3.connect("covid19_etl.db")
df_monthly.to_sql("covid_monthly", conn, if_exists="replace", index=False)

7839
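
The bare number printed above appears to be the row count that `to_sql` returns in recent pandas versions. A quick round-trip check can confirm that the load worked. The sketch below runs against a throwaway in-memory database with toy rows rather than `covid19_etl.db`; the `demo_conn` and `frame` names are illustrative only.

```python
import sqlite3
import pandas as pd

# Toy stand-in for df_monthly (made-up values).
frame = pd.DataFrame({
    "Country": ["Aland", "Bland"],
    "YearMonth": pd.to_datetime(["2020-01-31", "2020-01-31"]),
    "Confirmed": [3, 0],
})

demo_conn = sqlite3.connect(":memory:")  # throwaway database
try:
    frame.to_sql("covid_monthly", demo_conn, if_exists="replace", index=False)
    # Count the rows that actually landed in the table.
    row_count = demo_conn.execute("SELECT COUNT(*) FROM covid_monthly").fetchone()[0]
    # SQLite has no native datetime type, so ask pandas to parse the column on read.
    round_trip = pd.read_sql("SELECT * FROM covid_monthly", demo_conn,
                             parse_dates=["YearMonth"])
finally:
    demo_conn.close()

print(row_count)
print(round_trip.dtypes)
```

Because dates come back as text unless parsed, the queries that follow compare `YearMonth` against strings.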

In [14]:
df.columns

Index(['Province/State', 'Country/Region', '1/22/20', '1/23/20', '1/24/20',
       '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       ...
       '2/28/23', '3/1/23', '3/2/23', '3/3/23', '3/4/23', '3/5/23', '3/6/23',
       '3/7/23', '3/8/23', '3/9/23'],
      dtype='object', length=1145)

### Top 5 Countries by Cases in a Given Month

In [19]:
month = '2021-05'
query = f"""
SELECT Country, Confirmed
FROM covid_monthly
WHERE YearMonth LIKE '{month}%'
ORDER BY Confirmed DESC
LIMIT 5;
"""
print(pd.read_sql(query, conn))


  Country  Confirmed
0      US   33386447
1   India   28175044
2  Brazil   16557888
3  France    5978761
4  Turkey    5249404
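
Interpolating values into SQL with an f-string works for a fixed literal, but it becomes fragile once the month comes from user input. A sketch of the same query with `?` placeholders follows, run against an in-memory table of made-up rows; `demo_conn` and `top_query` are illustrative names.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
# Toy rows; YearMonth is stored as text, as in the lab database.
pd.DataFrame({
    "Country": ["Aland", "Bland", "Cland"],
    "YearMonth": ["2021-05-31 00:00:00"] * 3,
    "Confirmed": [5, 9, 7],
}).to_sql("covid_monthly", demo_conn, index=False)

top_query = """
SELECT Country, Confirmed
FROM covid_monthly
WHERE YearMonth LIKE ? || '%'
ORDER BY Confirmed DESC
LIMIT ?;
"""
# The month prefix and the row limit are bound as parameters, not pasted into the SQL.
top = pd.read_sql(top_query, demo_conn, params=("2021-05", 2))
demo_conn.close()
print(top)
```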


### Monthly Growth Rate for Specific Countries

In [17]:
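# Note: the dataset labels the United States as "US" (see the top-5 output above),
# so "United States" matches no rows and the result below covers only Brazil and India.
# Use "US" to include it.
countries = ("India", "Brazil", "United States")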
query = f"""
SELECT Country, YearMonth, "Growth_Rate_%"
FROM covid_monthly
WHERE Country IN {countries}
ORDER BY Country, YearMonth;
"""
print(pd.read_sql(query, conn))


   Country            YearMonth  Growth_Rate_%
0   Brazil  2020-01-31 00:00:00   0.000000e+00
1   Brazil  2020-02-29 00:00:00            inf
2   Brazil  2020-03-31 00:00:00   2.857500e+05
3   Brazil  2020-04-30 00:00:00   1.425048e+03
4   Brazil  2020-05-31 00:00:00   4.919885e+02
..     ...                  ...            ...
73   India  2022-11-30 00:00:00   4.305264e-02
74   India  2022-12-31 00:00:00   1.345306e-02
75   India  2023-01-31 00:00:00   9.505399e-03
76   India  2023-02-28 00:00:00   8.318391e-03
77   India  2023-03-31 00:00:00   6.491699e-03

[78 rows x 3 columns]
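
A silent mismatch between a requested name and the stored name is easy to miss in a long result. The sketch below, on made-up rows, builds the `IN` list from `?` placeholders and then reports any requested country that returned nothing. The variable names (`requested`, `unmatched`) are illustrative.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
# Toy rows with invented growth rates.
pd.DataFrame({
    "Country": ["Brazil", "India", "US"],
    "YearMonth": ["2021-05-31 00:00:00"] * 3,
    "Growth_Rate_%": [12.5, 46.9, 3.0],
}).to_sql("covid_monthly", demo_conn, index=False)

requested = ["India", "Brazil", "United States"]
# One "?" per requested country; the values themselves are bound as parameters.
placeholders = ", ".join("?" for _ in requested)
growth_query = f"""
SELECT Country, YearMonth, "Growth_Rate_%"
FROM covid_monthly
WHERE Country IN ({placeholders})
ORDER BY Country, YearMonth;
"""
result = pd.read_sql(growth_query, demo_conn, params=requested)
demo_conn.close()

# Requested names that matched no rows, e.g. because of a different spelling.
unmatched = sorted(set(requested) - set(result["Country"]))
print(unmatched)
```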


### Countries with Zero Cases in a Given Month

In [23]:
month = '2020-03-31 00:00:00'
query = f"""
SELECT Country, Confirmed
FROM covid_monthly
WHERE YearMonth = '{month}' AND Confirmed = 0;
"""
print(pd.read_sql(query, conn))

                  Country  Confirmed
0              Antarctica          0
1                 Comoros          0
2                Kiribati          0
3            Korea, North          0
4                 Lesotho          0
5                  Malawi          0
6        Marshall Islands          0
7              Micronesia          0
8                   Nauru          0
9                   Palau          0
10                  Samoa          0
11  Sao Tome and Principe          0
12        Solomon Islands          0
13            South Sudan          0
14   Summer Olympics 2020          0
15             Tajikistan          0
16                  Tonga          0
17                 Tuvalu          0
18                Vanuatu          0
19   Winter Olympics 2022          0
20                  Yemen          0
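
The task statement asks about zero cases over time periods, while the query above checks a single month. One way to extend it to a range of months, sketched here on made-up rows, is to group by country and keep those whose maximum over the range is still 0. This relies on `YearMonth` being stored as `YYYY-MM-DD HH:MM:SS` text, which sorts chronologically.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
months = ["2020-01-31 00:00:00", "2020-02-29 00:00:00", "2020-03-31 00:00:00"]
# Toy rows: Aland reports cases in March, Bland never does, Cland always does.
pd.DataFrame({
    "Country": ["Aland"] * 3 + ["Bland"] * 3 + ["Cland"] * 3,
    "YearMonth": months * 3,
    "Confirmed": [0, 0, 2, 0, 0, 0, 1, 1, 1],
}).to_sql("covid_monthly", demo_conn, index=False)

zero_query = """
SELECT Country
FROM covid_monthly
WHERE YearMonth BETWEEN ? AND ?
GROUP BY Country
HAVING MAX(Confirmed) = 0
ORDER BY Country;
"""
jan_feb = pd.read_sql(zero_query, demo_conn, params=(months[0], months[1]))
jan_mar = pd.read_sql(zero_query, demo_conn, params=(months[0], months[2]))
demo_conn.close()
print(jan_feb["Country"].tolist(), jan_mar["Country"].tolist())
```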
