# ETL Extract: Raw Data Notebook

## Contents

1. Import Required Libraries  
2. Extract Dataset from KaggleHub  
3. Load and Inspect Dataset  
4. Save Raw Dataset to Repository  
5. Summary

## 1. Import Required Libraries
Load essential packages for data access, manipulation, and file handling.

In [None]:
# Import required libraries for data extraction and manipulation
import kagglehub
import pandas as pd
import os

## 2. Extract Dataset from KaggleHub
Use `kagglehub` to download the latest version of the earthquake-tsunami dataset from Kaggle. If that version is already in the local cache, `kagglehub` reuses the cached copy and returns its path.

In [None]:
# Download the latest dataset version with kagglehub (reuses the local cache when present)
path = kagglehub.dataset_download("ahmeduzaki/global-earthquake-tsunami-risk-assessment-dataset")
print("Path to dataset files:", path)

Path to dataset files: C:\Users\Daniel\.cache\kagglehub\datasets\ahmeduzaki\global-earthquake-tsunami-risk-assessment-dataset\versions\1
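
The next section reads a specific file name from this directory, so it can help to confirm what the download actually contains first. A minimal sketch using only the standard library; `list_dataset_files` is a hypothetical helper written for illustration and is not part of `kagglehub`:

```python
from pathlib import Path


def list_dataset_files(root):
    """Return the relative paths of all files under root, sorted alphabetically."""
    root = Path(root)
    # rglob("*") walks subdirectories too; keep files only and
    # normalise separators so the output looks the same on every OS.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# In this notebook (assumes `path` from the cell above):
# for name in list_dataset_files(path):
#     print(name)
```

If the expected CSV is missing from the listing, the file name passed to `pd.read_csv` needs to be adjusted before loading.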


## 3. Load and Inspect Dataset
Read the CSV file into a DataFrame and preview its structure.

In [None]:
# Load the earthquake-tsunami dataset into a pandas DataFrame
df = pd.read_csv(os.path.join(path, "earthquake_data_tsunami.csv"))

In [None]:
# Display dataset information (column types, non-null counts)
df.info()

In [None]:
# Display dataset overview
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()
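
Beyond `df.info()` and `df.head()`, a per-column summary of missing and distinct values can be useful before handing the data to the Transform stage. The sketch below is one possible way to do that; `summarize_raw` is a hypothetical helper, and the demo frame uses made-up column names and values rather than the real dataset:

```python
import pandas as pd


def summarize_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # NaN is not counted as a distinct value
        }
    )


# Stand-in data for illustration only (hypothetical columns and values)
demo = pd.DataFrame({"magnitude": [6.5, 7.1, None], "tsunami": [0, 1, 1]})
summary = summarize_raw(demo)
print(summary)
print("Duplicate rows:", int(demo.duplicated().sum()))
```

In the notebook, calling `summarize_raw(df)` on the loaded DataFrame would give the same table for the real columns. The raw data itself is left unchanged here, since cleaning belongs to the Transform stage.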

## 4. Save Raw Dataset to Repository
Store the extracted dataset in the `data/raw/` folder for downstream ETL stages.

In [None]:
# Save raw dataset to repository for downstream ETL stages
save_path = "../data/raw/earthquake_data_tsunami.csv"

# Create directory if it doesn't exist
os.makedirs(os.path.dirname(save_path), exist_ok=True)

# Save to CSV
df.to_csv(save_path, index=False)
print(f"✓ Saved raw dataset to: {save_path}")
print(f"✓ Total records saved: {len(df)}")

Saved raw dataset to: ../data/raw/earthquake_data_tsunami.csv
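
As an optional check that the file on disk matches the DataFrame, the saved CSV can be re-read independently of pandas and its rows counted. This is a sketch under the assumption that the file has a single header row and UTF-8 encoding (the `to_csv` defaults); `count_csv_records` is a hypothetical helper:

```python
import csv


def count_csv_records(csv_path):
    """Count the data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if the file has one
        # csv.reader treats quoted fields containing newlines as one record
        return sum(1 for _ in reader)


# In this notebook (assumes `save_path` and `df` from the cell above):
# assert count_csv_records(save_path) == len(df)
```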


---
## 5. Summary

This notebook completes the **Extract** phase of the ETL pipeline:
- ✓ Downloaded the latest earthquake-tsunami dataset from Kaggle
- ✓ Loaded and inspected the raw data
- ✓ Saved the raw dataset to `data/raw/` for transformation

**Next Steps:** Proceed to the Transform notebook (`02_01_etl_transform.ipynb`) for data cleaning and preprocessing.