# ETL Lab Exercise: Global Earthquake Data Analysis

## Problem Statement

In this lab, you will build a complete ETL (Extract-Transform-Load) pipeline in Google Colab using publicly available global earthquake data from the US Geological Survey (USGS).

---

## Tasks

### 1. Extract
- **Download:**  
  Download the latest global earthquake data directly from the USGS public API or their CSV archive.  
  - Example URL (last 30 days earthquake data):  
    https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv
- **Load:**  
  Load the CSV file into a pandas DataFrame.

### 2. Transform
- **Clean:**  
  - Handle missing or inconsistent data entries.
  - Rename columns for clarity if necessary.
- **Feature Engineering:**  
  - Create a new column `DepthCategory` classifying earthquake depth as "Shallow" (<70 km), "Intermediate" (70-300 km), or "Deep" (>300 km).
  - Calculate time-based features such as extracting the month and weekday from the timestamp.
  - Filter earthquakes by magnitude threshold (e.g., magnitude ≥ 4.0).
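
The time-based features and the magnitude filter described above could be sketched as follows. This is a minimal illustration, not a reference solution: the `sample` DataFrame is hand-made stand-in data, and the helper names `add_time_features` and `filter_by_magnitude` as well as the output column names `Month` and `Weekday` are assumptions chosen for this example. Only the `time` and `mag` column names come from the USGS CSV.

```python
import pandas as pd

# Hand-made sample standing in for the USGS feed (illustrative values only).
sample = pd.DataFrame({
    "time": [
        "2025-08-06T19:20:57.320Z",
        "2025-08-04T03:15:00.000Z",
        "2025-07-20T12:00:00.000Z",
    ],
    "depth": [1.47, 120.0, 450.0],
    "mag": [1.19, 4.6, 5.2],
})

def add_time_features(df):
    """Return a copy with a parsed UTC timestamp plus Month and Weekday columns."""
    out = df.copy()
    # The feed stores ISO 8601 strings; unparseable values become NaT.
    out["time"] = pd.to_datetime(out["time"], utc=True, errors="coerce")
    out["Month"] = out["time"].dt.month          # 1-12
    out["Weekday"] = out["time"].dt.day_name()   # e.g. "Monday"
    return out

def filter_by_magnitude(df, min_mag=4.0):
    """Keep rows whose magnitude is at least min_mag (rows with NaN mag are dropped)."""
    return df[df["mag"] >= min_mag].reset_index(drop=True)

features = add_time_features(sample)
strong = filter_by_magnitude(features, min_mag=4.0)
print(strong[["time", "mag", "Month", "Weekday"]])
```

In the notebook itself, the same two helpers would be applied to the `df` loaded from the USGS URL rather than to `sample`.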

### 3. Load
- **Local Storage:**  
  Store the transformed DataFrame into a local SQLite database inside Colab.
- **Query & Analyze:**  
  - Write SQL queries to:
    - Find the number of earthquakes per depth category.
    - List the top 5 strongest earthquakes in the last 30 days.
    - Find average magnitude by weekday.
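
One possible shape for the Load step and the three queries is sketched below. It assumes the transformed DataFrame already has `DepthCategory` and `Weekday` columns. The table name `earthquakes`, the sample rows, and the in-memory database are assumptions made to keep the example self-contained; in Colab a file path such as `"earthquakes.db"` could be passed to `sqlite3.connect` instead.

```python
import sqlite3
import pandas as pd

# Illustrative rows only; in the lab this would be the transformed USGS DataFrame.
quakes = pd.DataFrame({
    "place": ["A", "B", "C", "D"],
    "mag": [4.1, 5.6, 6.2, 4.8],
    "DepthCategory": ["Shallow", "Shallow", "Deep", "Intermediate"],
    "Weekday": ["Monday", "Monday", "Friday", "Friday"],
})

# ":memory:" keeps the sketch self-contained; use a file path to persist the database.
conn = sqlite3.connect(":memory:")
quakes.to_sql("earthquakes", conn, if_exists="replace", index=False)

# Number of earthquakes per depth category.
per_depth = pd.read_sql_query(
    "SELECT DepthCategory, COUNT(*) AS n FROM earthquakes "
    "GROUP BY DepthCategory ORDER BY n DESC, DepthCategory",
    conn,
)

# Top 5 strongest earthquakes in the loaded window.
top5 = pd.read_sql_query(
    "SELECT place, mag FROM earthquakes ORDER BY mag DESC LIMIT 5",
    conn,
)

# Average magnitude by weekday.
avg_by_weekday = pd.read_sql_query(
    "SELECT Weekday, AVG(mag) AS avg_mag FROM earthquakes "
    "GROUP BY Weekday ORDER BY Weekday",
    conn,
)

conn.close()
print(per_depth, top5, avg_by_weekday, sep="\n\n")
```

Because the feed already covers only the last 30 days, the top-5 query does not need an additional date filter.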

---

## Constraints

- Perform the entire ETL process only with pandas, SQLite, and standard Python within Google Colab.
- Do not use any external databases or proprietary cloud services.

---

## Dataset Details

- **Dataset:** Global earthquakes, last 30 days  
- **Source URL:** [USGS Earthquake Data - All Month CSV](https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv)

---

## Example Challenge Questions

- Which depth category experiences the most earthquakes?
- On which weekday do the strongest earthquakes most frequently occur?
- What is the distribution of earthquake magnitudes in your filtered data?

---

**Expected Outcome:**  
You will gain practical ETL experience on a real-world geoscience dataset, including downloading live data, cleaning and feature engineering, and storing/querying the results locally—all within a Colab notebook.


In [1]:
# Load CSV file
import pandas as pd

url = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"

df = pd.read_csv(url)
print(df.head())
print(df.columns)


                       time   latitude   longitude  depth   mag magType   nst  \
0  2025-08-06T19:20:57.320Z  34.285500 -116.851997   1.47  1.19      ml  25.0   
1  2025-08-06T19:08:53.900Z  34.854000 -118.788333   0.33  1.43      ml  24.0   
2  2025-08-06T19:00:23.688Z  60.374000 -150.665800  32.90  1.60      ml   NaN   
3  2025-08-06T18:53:27.360Z  33.271333 -116.417333   4.47  0.77      ml  23.0   
4  2025-08-06T18:48:50.940Z  17.929167  -66.912667  11.61  1.71      md   7.0   

     gap      dmin   rms  ...                   updated  \
0   49.0  0.061460  0.18  ...  2025-08-06T19:22:59.839Z   
1   83.0  0.077850  0.17  ...  2025-08-06T19:16:41.531Z   
2    NaN       NaN  0.66  ...  2025-08-06T19:03:34.329Z   
3   75.0  0.003128  0.18  ...  2025-08-06T18:56:48.391Z   
4  244.0  0.042970  0.09  ...  2025-08-06T19:11:56.960Z   

                              place        type horizontalError depthError  \
0     3 km NNW of Big Bear City, CA  earthquake            0.29       0.42   
1 

In [2]:
# Detect Missing Values and Data Types
print(df.info())
print("\n")
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12199 entries, 0 to 12198
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   time             12199 non-null  object 
 1   latitude         12199 non-null  float64
 2   longitude        12199 non-null  float64
 3   depth            12199 non-null  float64
 4   mag              12198 non-null  float64
 5   magType          12198 non-null  object 
 6   nst              10379 non-null  float64
 7   gap              10378 non-null  float64
 8   dmin             10372 non-null  float64
 9   rms              12197 non-null  float64
 10  net              12199 non-null  object 
 11  id               12199 non-null  object 
 12  updated          12199 non-null  object 
 13  place            12199 non-null  object 
 14  type             12199 non-null  object 
 15  horizontalError  9816 non-null   float64
 16  depthError       12197 non-null  float64
 17  magError    

In [4]:
# Drop columns that are not required for the analysis.
# errors='ignore' keeps this cell re-runnable: without it, running the cell a second time
# raises a KeyError ("... not found in axis") because the columns have already been removed.
df.drop(columns=['magType', 'nst', 'gap', 'dmin', 'rms', 'net', 'updated', 'horizontalError', 'depthError', 'magError', 'magNst', 'status', 'locationSource', 'magSource'], inplace=True, errors='ignore')

In [5]:
# Fill missing magnitudes with the median.
# Assign the result back to the column: calling fillna(..., inplace=True) on df["mag"]
# triggers a pandas FutureWarning and will no longer update df in pandas 3.0.
df["mag"] = df["mag"].fillna(df["mag"].median())


In [None]:
# Optional: data profiling with ydata-profiling (this cell can be skipped).
!pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate profile report
profile = ProfileReport(df, title='Earthquake Data Profiling', explorative=True)
profile.to_notebook_iframe()  # For Jupyter/Colab interactive view

In [6]:
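# Classify each earthquake by depth (km).
def depth_category(depth):
    """Return "Shallow" (<70 km), "Intermediate" (70-300 km), or "Deep" (>300 km)."""
    if pd.isna(depth):
        return None  # leave missing depths unclassified instead of labelling them "Deep"
    if depth < 70:
        return "Shallow"
    elif depth <= 300:
        return "Intermediate"
    else:
        return "Deep"

df["DepthCategory"] = df["depth"].apply(depth_category)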
