# Notebook 1: Data Collection

This notebook extracts the earthquake-tsunami dataset from Kaggle, performs initial inspection, and saves the raw data for downstream processing.

## 1.1 Import Required Libraries

Load essential packages for data access, manipulation, and file handling.

## Contents

- [1.1 Import Required Libraries](#11-import-required-libraries)
- [1.2 Extract Dataset from KaggleHub](#12-extract-dataset-from-kagglehub)
- [1.3 Load and Inspect Dataset](#13-load-and-inspect-dataset)
- [1.4 Save Raw Dataset to Repository](#14-save-raw-dataset-to-repository)
- [1.5 Statistical Foundations](#15-statistical-foundations)
- [1.6 Summary](#16-summary)

In [None]:
# Import required libraries for data extraction and manipulation
import kagglehub
import pandas as pd
import os

## 1.2 Extract Dataset from KaggleHub

Use `kagglehub` to download the latest version of the earthquake-tsunami dataset from Kaggle. The files are stored in a local cache, and the call returns the path to that cached copy.

[↑ Back to Contents](#contents)

In [None]:
# Download the dataset with kagglehub (returns the path to the locally cached files)
path = kagglehub.dataset_download("ahmeduzaki/global-earthquake-tsunami-risk-assessment-dataset")
print("Path to dataset files:", path)

Path to dataset files: C:\Users\Daniel\.cache\kagglehub\datasets\ahmeduzaki\global-earthquake-tsunami-risk-assessment-dataset\versions\1
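
The returned path is a directory, so it can help to confirm which CSV files it contains before hard-coding a filename. The following is a minimal sketch: `list_csv_files` is a hypothetical helper, not part of `kagglehub`, and the temporary directory merely stands in for the cache path printed above.

```python
import os
import tempfile

def list_csv_files(directory):
    """Return the sorted names of the .csv files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(".csv")
    )

# Throwaway directory standing in for the kagglehub cache path (assumption for the demo)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("earthquake_data_tsunami.csv", "README.md"):
        open(os.path.join(demo_dir, name), "w").close()
    found = list_csv_files(demo_dir)

print(found)
```

In the notebook itself, the same helper could be called as `list_csv_files(path)`.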


## 1.3 Load and Inspect Dataset

Read the CSV file into a DataFrame and preview its structure.

[↑ Back to Contents](#contents)

In [None]:
# Load the earthquake-tsunami dataset into a pandas DataFrame
df = pd.read_csv(os.path.join(path, "earthquake_data_tsunami.csv"))

In [None]:
# Display dataset information (column types, non-null counts)
df.info()

In [None]:
# Display dataset overview
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()
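
Beyond `info()` and `head()`, an initial inspection often counts missing values and duplicate rows. The sketch below shows one way to do that; `quick_quality_report` is a hypothetical helper, and the toy frame with its `magnitude` and `tsunami` columns is illustrative data, not the real dataset.

```python
import pandas as pd

def quick_quality_report(frame):
    """Summarise row/column counts, missing values per column, and duplicate rows."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Toy data for illustration only (column names are assumptions)
toy = pd.DataFrame({
    "magnitude": [6.5, 7.0, None, 6.5],
    "tsunami": [0, 1, 0, 0],
})
report = quick_quality_report(toy)
print(report)
```

Applied to the notebook's data, the call would be `quick_quality_report(df)`.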

## 1.4 Save Raw Dataset to Repository

Store the extracted dataset in the `data/raw/` folder for downstream ETL stages.

[↑ Back to Contents](#contents)

In [None]:
# Save raw dataset to repository for downstream ETL stages
save_path = "../data/raw/earthquake_data_tsunami.csv"

# Create directory if it doesn't exist
os.makedirs(os.path.dirname(save_path), exist_ok=True)

# Save to CSV
df.to_csv(save_path, index=False)
print(f"✓ Saved raw dataset to: {save_path}")
print(f"✓ Total records saved: {len(df)}")

Saved raw dataset to: ../data/raw/earthquake_data_tsunami.csv
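
A quick round-trip check can confirm that the file on disk matches what was in memory. This is a sketch under stated assumptions: `save_and_verify` is a hypothetical helper, and it compares only shape and column names, not individual values.

```python
import os
import tempfile
import pandas as pd

def save_and_verify(frame, csv_path):
    """Write `frame` to `csv_path`, reload it, and compare shape and column names."""
    os.makedirs(os.path.dirname(csv_path), exist_ok=True)
    frame.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns

# Toy data and a temporary folder, so the real data/raw/ directory is untouched
toy = pd.DataFrame({"magnitude": [6.5, 7.0], "tsunami": [0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(toy, os.path.join(tmp, "data", "raw", "toy.csv"))

print("Round trip consistent:", ok)
```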


## 1.6 Summary

This notebook successfully completed the **Extract** phase of the ETL pipeline:
- ✓ Downloaded the latest earthquake-tsunami dataset from Kaggle
- ✓ Loaded and inspected the raw data
- ✓ Saved the raw dataset to `data/raw/` for transformation

**Next Steps:** Proceed to the Transform notebook (`02_etl_transform.ipynb`) for data cleaning and preprocessing.

[↑ Back to Contents](#contents)

## 1.5 Statistical Foundations

This section documents core statistical and probability concepts applied in this project.

### Descriptive Statistics
We summarise numeric columns to understand central tendency and dispersion:
- Mean (average): sensitive to extreme values.
- Median (50th percentile): robust to outliers.
- Standard Deviation (std): typical spread around the mean.
- Min/Max & Quartiles: range and distribution shape.
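
To illustrate why the mean is described as sensitive to extreme values while the median is robust, here is a small sketch using made-up values (not taken from the dataset). Adding a single large value moves the mean much more than the median.

```python
import statistics

# Made-up magnitudes for illustration only
typical = [5.0, 5.2, 5.4, 5.6, 5.8]
with_outlier = typical + [9.5]  # one extreme value appended

# How far each statistic moves when the outlier is added
mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)

print(f"Mean shift:   {mean_shift:.3f}")
print(f"Median shift: {median_shift:.3f}")
```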

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

RAW_PATH = Path('../data/raw/earthquake_data_tsunami.csv').resolve()

# Load raw dataset
raw_df = pd.read_csv(RAW_PATH)
print('Loaded raw shape:', raw_df.shape)

# Identify numeric columns by dtype
numeric_cols = [c for c in raw_df.columns if pd.api.types.is_numeric_dtype(raw_df[c])]
print('Numeric columns (first 10):', numeric_cols[:10])

# Basic descriptive statistics
summary = raw_df[numeric_cols].describe().T
summary['variance'] = raw_df[numeric_cols].var().values
summary[['mean','50%','std','variance','min','max']].head()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Choose a numeric column to visualise (prefer a column named 'mag' or 'magnitude' if present)
col_candidates = [c for c in numeric_cols if c.lower() in ['mag','magnitude']]
col = col_candidates[0] if col_candidates else (numeric_cols[0] if numeric_cols else None)
if col is None:
    print('No numeric columns available for histogram.')
else:
    fig, ax = plt.subplots(figsize=(6,4))
    sns.histplot(raw_df[col].dropna(), bins=30, kde=True, stat='density', ax=ax, color='#4e79a7')
    mu, sigma = np.nanmean(raw_df[col]), np.nanstd(raw_df[col])
    xs = np.linspace(raw_df[col].min(), raw_df[col].max(), 200)
    normal_pdf = 1.0/(sigma*np.sqrt(2*np.pi)) * np.exp(-0.5*((xs-mu)/sigma)**2)
    ax.plot(xs, normal_pdf, color='#e15759', lw=2, label=f'Normal fit (μ={mu:.2f}, σ={sigma:.2f})')
    ax.set_title(f'Distribution of {col} with normal overlay')
    ax.legend()
    plt.show()

In [None]:
# One-sample t-test (manual) for illustrative hypothesis: mean magnitude == 5.0
mu0 = 5.0
if col is None:
    print('Skipping t-test; no suitable numeric column.')
else:
    x = raw_df[col].dropna().values
    n = len(x)
    x_bar = x.mean()
    s = x.std(ddof=1)
    t_stat = (x_bar - mu0) / (s / np.sqrt(n))
    # Approximate two-tailed p-value using the normal approximation (reasonable when n is large)
    # p ≈ 2 * (1 - Φ(|t|)) where Φ is the CDF of the standard normal
    # Φ(z) = 0.5 * (1 + erf(z / sqrt(2))) ; erf comes from Python's math module
    from math import erf, sqrt
    def phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))
    p_approx = 2 * (1 - phi(abs(t_stat)))
    print(f"One-sample t-test vs μ0={mu0:.2f} on column '{col}':")
    print(f"n={n}, mean={x_bar:.3f}, std={s:.3f}, t_stat={t_stat:.3f}, approx p-value={p_approx:.4f}")
    if p_approx < 0.05:
        print('Reject H0 at 5% level (illustrative).')
    else:
        print('Fail to reject H0 at 5% level (illustrative).')
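
For reference, the quantities computed in the cell above can be written out as follows. This restates the code rather than adding a new method; the symbols mirror the variable names (`x_bar`, `s`, `n`, `mu0`).

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
$$

$$
p \approx 2\left(1 - \Phi(|t|)\right),
\qquad
\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right)
$$

The p-value here is only an approximation: an exact one-sample t-test uses the Student t distribution with $n-1$ degrees of freedom, which the standard normal approaches as $n$ grows.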

[↑ Back to Contents](#contents)