# ETL Lab Exercise: Global COVID-19 Time Series Data

## Problem Statement

In this lab, you will build a complete ETL (extract, transform, load) pipeline in Google Colab using the publicly available global COVID-19 confirmed cases dataset.

---

### Tasks

1. **Extract:**
   - Download the global confirmed COVID-19 time series CSV file directly from the Johns Hopkins University GitHub repository:
     https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
   - Load the CSV file into a pandas DataFrame.

2. **Transform:**
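   - Clean the data by handling missing values and anomalous entries, for example blank `Province/State` fields and any cumulative counts that drop from one day to the next.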
   - Reshape the data from wide format (many date columns) to long format: columns `[Country/Region, Date, Confirmed Cases]`.
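   - Convert the date column to datetime format and aggregate confirmed cases by country and month. The source counts are cumulative, so sum the `Province/State` rows of each country per date and then take the month-end value instead of summing daily values.
   - Derive new features such as the monthly case increase and the month-over-month growth rate.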

3. **Load:**
   - Store the transformed data into a local SQLite database within Colab.
   - Write and run SQL queries to:
     - Retrieve the top 5 countries by confirmed cases for any selected month.
     - Compare monthly growth rates for specified countries.
     - Identify countries with zero reported cases for given time periods.
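
The three queries in the Load task can be sketched as follows. This is a minimal illustration, not a reference solution: the table name `monthly_cases`, the snake_case column names, and the tiny `monthly` frame with made-up values are all assumptions standing in for the real output of the Transform step. An in-memory database keeps the sketch side-effect free; passing a file name such as the hypothetical `covid_etl.db` to `sqlite3.connect` would persist it in the Colab session instead.

```python
import sqlite3

import pandas as pd

# Made-up monthly table standing in for the Transform output (assumption).
monthly = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "month": ["2020-01", "2020-02"] * 3,
    "confirmed": [10, 30, 0, 5, 0, 0],
    "monthly_increase": [10, 20, 0, 5, 0, 0],
    "growth_rate": [None, 2.0, None, None, None, None],
})

# In-memory database; use a file path to keep the data between cells.
conn = sqlite3.connect(":memory:")
monthly.to_sql("monthly_cases", conn, if_exists="replace", index=False)

# Query 1: top 5 countries by confirmed cases for a selected month.
top5 = pd.read_sql_query(
    "SELECT country, confirmed FROM monthly_cases "
    "WHERE month = ? ORDER BY confirmed DESC LIMIT 5",
    conn, params=("2020-02",),
)

# Query 2: monthly growth rates side by side for specified countries.
growth = pd.read_sql_query(
    "SELECT month, country, growth_rate FROM monthly_cases "
    "WHERE country IN (?, ?) ORDER BY month, country",
    conn, params=("A", "B"),
)

# Query 3: countries whose count stays at zero over a given period.
zero = pd.read_sql_query(
    "SELECT country FROM monthly_cases "
    "WHERE month BETWEEN ? AND ? "
    "GROUP BY country HAVING MAX(confirmed) = 0 ORDER BY country",
    conn, params=("2020-01", "2020-02"),
)

print(top5, growth, zero, sep="\n\n")
```

Storing `month` as a `YYYY-MM` string is a design choice of this sketch: such strings sort chronologically, so `ORDER BY` and `BETWEEN` work without date functions.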

---

### Dataset Details

- Dataset: Global confirmed COVID-19 cases, reported daily as cumulative counts  
- Source URL:  
  https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
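
A minimal sketch of the Extract step, assuming the file keeps its usual layout of four identifier columns followed by one column per date. The helper name `extract`, the `ID_COLS` list, and the two-row sample with made-up values are illustrative assumptions; the sample only exists so the layout can be checked without a network connection.

```python
import io

import pandas as pd

URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)

# Identifier columns assumed to precede the date columns in the file.
ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]


def extract(source=URL):
    """Read the wide-format CSV from a URL, path, or file-like object."""
    df = pd.read_csv(source)
    missing = [c for c in ID_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"unexpected layout, missing columns: {missing}")
    return df


# Made-up rows in the same layout, for an offline check (assumption).
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Country A,10.0,20.0,0,1
Province X,Country B,30.0,40.0,5,7
"""

sample = extract(io.StringIO(SAMPLE_CSV))
date_cols = [c for c in sample.columns if c not in ID_COLS]
print(sample.shape, date_cols)
```

In Colab, calling `extract()` with no argument would download the full file from the source URL.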

---

### Learning Outcomes

- Practice ETL with real-world pandemic data.
- Experience reshaping and aggregating time series data in pandas.
- Learn to store and query complex datasets in SQLite locally.
- Understand temporal patterns through analytic queries.
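
The reshaping and aggregation named in these outcomes can be sketched as below. It is one possible approach under stated assumptions, not the required solution: the function name `transform`, the output column names `Month`, `Monthly Increase`, and `Growth Rate`, and the small `wide` frame with made-up values are illustrative. The sketch treats the counts as cumulative, so it takes the month-end value per country, and it forward-fills gaps along the date axis as a simple cleaning choice.

```python
import pandas as pd

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]

# Made-up wide-format sample; country B is split into two provinces.
wide = pd.DataFrame({
    "Province/State": [None, "P1", "P2"],
    "Country/Region": ["A", "B", "B"],
    "Lat": [0.0, 1.0, 2.0],
    "Long": [0.0, 1.0, 2.0],
    "1/30/20": [0, 1, 2],
    "1/31/20": [2, 1, 3],
    "2/28/20": [5, 4, 6],
    "2/29/20": [6, 4, 8],
})


def transform(wide):
    wide = wide.copy()
    date_cols = [c for c in wide.columns if c not in ID_COLS]
    # Cleaning choice (assumption): coerce to numbers, carry the last
    # known cumulative value forward, and treat leading gaps as zero.
    counts = wide[date_cols].apply(pd.to_numeric, errors="coerce")
    wide[date_cols] = counts.ffill(axis=1).fillna(0)

    # Wide to long: one row per region and date.
    long = wide.melt(id_vars=ID_COLS, value_vars=date_cols,
                     var_name="Date", value_name="Confirmed Cases")
    long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")

    # Sum provinces so each country has one cumulative value per date.
    daily = long.groupby(["Country/Region", "Date"],
                         as_index=False)["Confirmed Cases"].sum()
    daily["Month"] = daily["Date"].dt.to_period("M").astype(str)
    daily = daily.sort_values(["Country/Region", "Date"])

    # Cumulative data: the monthly figure is the month-end value.
    monthly = daily.groupby(["Country/Region", "Month"],
                            as_index=False)["Confirmed Cases"].last()

    # Features: the first month's increase equals its month-end total;
    # growth is left as NaN when there is no positive previous value.
    prev = monthly.groupby("Country/Region")["Confirmed Cases"].shift(1)
    monthly["Monthly Increase"] = monthly["Confirmed Cases"] - prev.fillna(0)
    monthly["Growth Rate"] = monthly["Monthly Increase"] / prev.where(prev > 0)
    return monthly


monthly = transform(wide)
print(monthly)
```

Defining the growth rate as the monthly increase divided by the previous month-end total is one common convention; other definitions, such as the ratio of consecutive monthly increases, would need a different last step.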

