# Phase 1: Exploratory Data Analysis & Preprocessing
In this notebook we will:
1. Load the raw CSVs.  
2. Inspect schema and summary statistics.  
3. Visualize missing values.  
4. Explore distributions & outliers (boxplots, histograms, KDEs).  
5. Generate a correlation heatmap.  
6. Plot time-series trends (price, market cap, sentiment).  
7. Save cleaned/interim datasets for Phase 2.

### 1. Set Up File Paths
Define the base directories for raw input data (`DATA_RAW`) and cleaned interim output (`DATA_INTERIM`), so every file‐IO call uses these constants.

In [None]:
# Parameters cell for Papermill
DATA_RAW     = "data/raw"
DATA_INTERIM = "data/interim"
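
As a minimal sketch of how later cells could build every path from these two constants, the block below wraps them in `pathlib.Path`. The helpers `raw_path` and `interim_path` are hypothetical names, not part of the notebook; the parameters cell itself can keep plain strings so Papermill can override them.

```python
from pathlib import Path

# Same values as the parameters cell, wrapped in Path for safe joining
DATA_RAW = Path("data/raw")
DATA_INTERIM = Path("data/interim")

def raw_path(filename: str) -> Path:
    """Hypothetical helper: path of a file under the raw-data directory."""
    return DATA_RAW / filename

def interim_path(filename: str) -> Path:
    """Hypothetical helper: path of a file under the interim directory."""
    return DATA_INTERIM / filename

print(raw_path("coins.csv"))
print(interim_path("coins_clean.csv"))
```

`pd.read_csv` and `DataFrame.to_csv` both accept `Path` objects, so these helpers could replace the f-string paths without other changes.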

### 2. Imports & Plot Configuration
Load standard data libs (`pandas`, `numpy`), visualization tools (`matplotlib`, `seaborn`, `missingno`), and set inline plotting style.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

%matplotlib inline
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10,6)

### 3. Load & Shape‐Check Raw CSVs
Read each raw CSV into a DataFrame and print its name and shape for quick validation.

In [None]:
# 3) Load raw CSVs (pandas is already imported as pd in the imports cell)

coins      = pd.read_csv(f"{DATA_RAW}/coins.csv")
historical = pd.read_csv(f"{DATA_RAW}/historical.csv")
eth_df     = pd.read_csv(f"{DATA_RAW}/eth_df.csv")
ada_df     = pd.read_csv(f"{DATA_RAW}/ada_df.csv")
bnb_df     = pd.read_csv(f"{DATA_RAW}/bnb_df.csv")
btc_df     = pd.read_csv(f"{DATA_RAW}/btc_df.csv")

# Quick sanity check
for name, df in [("coins", coins), ("historical", historical),
                 ("ETH", eth_df), ("ADA", ada_df),
                 ("BNB", bnb_df), ("BTC", btc_df)]:
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} cols")
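
The six `read_csv` calls follow one pattern, so they could also be driven by a list of file stems. The sketch below assumes a hypothetical helper `load_csvs` and demonstrates it on two made-up files in a temporary directory rather than on the real data.

```python
import os
import tempfile
import pandas as pd

def load_csvs(base_dir, names):
    """Hypothetical helper: read <base_dir>/<name>.csv for each name into a dict."""
    frames = {}
    for name in names:
        frames[name] = pd.read_csv(os.path.join(base_dir, f"{name}.csv"))
    return frames

# Demo on throwaway files with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}).to_csv(
        os.path.join(tmp, "coins.csv"), index=False)
    pd.DataFrame({"date": ["2021-01-01"], "price": [1.0]}).to_csv(
        os.path.join(tmp, "historical.csv"), index=False)
    frames = load_csvs(tmp, ["coins", "historical"])

# Same sanity check as the notebook cell, driven by the dict
for name, df in frames.items():
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} cols")
```

Keeping the frames in a dict means the sanity-check loop and the later `info()`/`head()` loop can share one source of names.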


### 4. DataFrame Info & Head
Loop over `coins`, `historical`, and the four sentiment DataFrames (`eth_df`, `ada_df`, `bnb_df`, `btc_df`), printing each one's `.info()` summary and first rows.

In [None]:
# 4) Display info() and head() for each DataFrame
for name, df in [
    ("coins",      coins),
    ("historical", historical),
    ("ETH",        eth_df),
    ("ADA",        ada_df),
    ("BNB",        bnb_df),
    ("BTC",        btc_df)
]:
    print(f"\n### {name} INFO")
    df.info()  # info() prints its report and returns None, so it is not wrapped in display()
    display(df.head())

### 5. Statistical Summary
Show descriptive statistics (mean, std, min/max, quartiles) for numeric columns in `historical`.

In [None]:
print("### HISTORICAL DESCRIBE")
display(historical.describe().T)
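
Heavy-tailed columns are often easier to read with tail percentiles and skewness next to the default quartiles. The sketch below shows one way to extend `describe()`; it runs on synthetic data, and the column names `price` and `volume` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one heavy-tailed column, one roughly symmetric column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price": rng.lognormal(mean=0.0, sigma=1.0, size=1_000),
    "volume": rng.normal(loc=100.0, scale=10.0, size=1_000),
})

# Add 1st/99th percentiles to the default summary, one row per column
summary = demo.describe(percentiles=[0.01, 0.5, 0.99]).T
# Sample skewness; values far from 0 hint at a long tail
summary["skew"] = demo.skew()

print(summary[["mean", "1%", "50%", "99%", "skew"]])
```

A large gap between `mean` and `50%`, or a skew well above zero, suggests a column whose histogram may benefit from a log scale.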

### 6. Missingness Visualization
Use `missingno` to plot:
1. A matrix showing where data is missing in `historical`.
2. A bar chart of non-missing value counts per column in `coins`.

In [None]:
msno.matrix(historical)
plt.title("historical.csv Missingness")
plt.show()

msno.bar(coins)
plt.title("coins.csv Missingness")
plt.show()
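
The plots give a visual impression; a small table of counts and percentages makes the same information exact. This is a sketch on made-up data, not output from the real files.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns
demo = pd.DataFrame({
    "price": [1.0, np.nan, 3.0, 4.0],
    "market_cap": [np.nan, np.nan, 30.0, 40.0],
    "symbol": ["a", "b", "c", "d"],
})

# Count and percentage of missing values per column, worst first
missing = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "pct_missing": demo.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

The same two expressions could be applied to `historical` and `coins` to decide which columns need imputation or dropping.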

### 7. Distribution & KDE Plots
For each numeric column in `historical`:
1. Drop NAs.
2. Subsample to ≤100 000 points for performance.
3. Plot a 50‐bin histogram with a KDE overlay, alongside a standalone KDE.

In [None]:
num_cols = historical.select_dtypes(include="number").columns
for col in num_cols:
    data = historical[col].dropna()
    if len(data) > 100_000:
        data = data.sample(100_000, random_state=42)
    fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,4))
    sns.histplot(data, bins=50, kde=True, ax=ax1).set_title(f"{col} Distribution")
    sns.kdeplot(data, bw_adjust=1, ax=ax2).set_title(f"{col} KDE")
    plt.tight_layout()
    plt.show()
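
The first two steps above (drop NAs, cap at 100 000 points) can be isolated in a small function so they are easy to check on their own. The name `subsample` is hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def subsample(series, max_points=100_000, seed=42):
    """Drop NAs, then randomly keep at most max_points values (reproducibly)."""
    clean = series.dropna()
    if len(clean) > max_points:
        clean = clean.sample(max_points, random_state=seed)
    return clean

# Made-up data: 250,000 values with every 10th one missing
values = np.arange(250_000, dtype=float)
values[::10] = np.nan
big = pd.Series(values)
small = pd.Series([1.0, np.nan, 3.0])

print(len(subsample(big)), len(subsample(small)))
```

Fixing `random_state` keeps the plots identical across reruns, which matters when the notebook is executed non-interactively.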

### 8. Outlier Detection via Boxplots
Draw a boxplot for each numeric column to visualize outliers.

In [None]:
for col in num_cols:
    plt.figure(figsize=(6,2))
    sns.boxplot(x=historical[col])
    plt.title(f"{col} Boxplot")
    plt.show()
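
Boxplots show outliers but do not count them. Assuming the usual 1.5 × IQR whisker rule (seaborn's default `whis=1.5`), the sketch below flags the same points numerically. The helper name `iqr_outlier_mask` is hypothetical and the data is made up.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up column with one obvious outlier
demo = pd.Series([10, 11, 12, 13, 14, 15, 16, 100])
mask = iqr_outlier_mask(demo)

print(f"{mask.sum()} outlier(s): {demo[mask].tolist()}")
```

Applying the mask per numeric column gives an outlier count that can be tracked across runs instead of being read off each plot.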

### 9. Correlation Heatmap
Compute Pearson correlations among numeric features in `historical` and plot a heatmap.

In [None]:
corr = historical[num_cols].corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("historical.csv Correlation Matrix")
plt.show()
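
With many columns the heatmap gets hard to scan, so it can help to list the strongest pairs directly. The sketch below keeps the upper triangle of the correlation matrix and sorts pairs by absolute correlation; it runs on synthetic data, and the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic data: market_cap is built to track price closely, noise is independent
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "price": x,
    "market_cap": 2 * x + rng.normal(scale=0.1, size=500),
    "noise": rng.normal(size=500),
})

corr = demo.corr()  # Pearson by default
# Keep each pair once: mask out the diagonal and the lower triangle
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
```

Near-perfect correlations found this way flag redundant features worth reviewing before modeling.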

### 10. Time Series of Price & Market Cap
Convert `date` to datetime, then plot `price` and `market_cap` over time in two stacked subplots.

In [None]:
historical['date'] = pd.to_datetime(historical['date'])
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(12,8), sharex=True)
historical.set_index('date')['price'].plot(ax=ax1, title="Price over Time")
historical.set_index('date')['market_cap'].plot(ax=ax2, title="Market Cap over Time")
plt.tight_layout()
plt.show()
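
If `historical` holds rows for several coins (the source does not say), plotting the whole `price` column as one line would mix them together. The sketch below assumes a hypothetical `coin_id` column and pivots to one column per coin; the data is made up.

```python
import pandas as pd

# Made-up long-format data; `coin_id` is an assumed column name
demo = pd.DataFrame({
    "date": ["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-02"],
    "coin_id": [1, 1, 2, 2],
    "price": [11.0, 10.0, 100.0, 90.0],
})
demo["date"] = pd.to_datetime(demo["date"])

# One row per date, one column per coin, sorted chronologically
wide = demo.pivot_table(index="date", columns="coin_id", values="price").sort_index()

print(wide)
```

Calling `wide.plot()` on the result would then draw one line per coin on a shared date axis.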

### 11. Per‐Coin Sentiment Time Series

For each of our sentiment DataFrames (`eth_df`, `ada_df`, `bnb_df`, `btc_df`), convert `date` to datetime, then plot count and normalized sentiment on two stacked axes.

In [None]:
for name, df in [
    ("ETH", eth_df),
    ("ADA", ada_df),
    ("BNB", bnb_df),
    ("BTC", btc_df)
]:
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    fig, (ax1, ax2) = plt.subplots(2,1,figsize=(12,6), sharex=True)
    ts = df.set_index('date')  # index by date once, reuse for both plots
    ts['count'].plot(ax=ax1, title=f"{name} Sentiment Count")
    ts['normalized'].plot(ax=ax2, title=f"{name} Sentiment Normalized")
    plt.tight_layout()
    plt.show()
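
For comparisons across coins, the per-coin frames could also be stacked into one long table. This sketch uses the `date`, `count`, and `normalized` columns referenced above; the `coin` label column and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for two of the sentiment DataFrames
eth_demo = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                         "count": [5, 7], "normalized": [0.1, 0.3]})
btc_demo = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                         "count": [50, 70], "normalized": [0.2, -0.1]})

frames = []
for name, df in [("ETH", eth_demo), ("BTC", btc_demo)]:
    df = df.copy()                            # leave the original untouched
    df["date"] = pd.to_datetime(df["date"])
    df["coin"] = name                         # assumed label column
    frames.append(df)

sentiment = pd.concat(frames, ignore_index=True)
per_coin = sentiment.groupby("coin")["normalized"].mean()

print(per_coin)
```

A single long frame also makes it straightforward to join sentiment onto price data by date and coin in a later phase.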

### 12. Save Cleaned Interim Data
Write copies of `historical` and `coins` to the interim folder for downstream notebooks. No cleaning steps are applied yet, so both copies are pass-through.

In [None]:
# (any cleaning steps would go here; currently pass-through)
historical_clean = historical.copy()
coins_clean      = coins.copy()

os.makedirs(DATA_INTERIM, exist_ok=True)
historical_clean.to_csv(f"{DATA_INTERIM}/historical_clean.csv", index=False)
coins_clean.to_csv(f"{DATA_INTERIM}/coins_clean.csv", index=False)
print(f"Saved historical_clean.csv & coins_clean.csv to {DATA_INTERIM}/")
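
CSV files do not store dtypes, so a `date` column converted with `pd.to_datetime` comes back as text unless the reader asks for parsing. The sketch below shows the round trip on a made-up frame in a temporary directory.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with a real datetime column
demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "price": [10.0, 11.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "historical_clean.csv")
    demo.to_csv(path, index=False)
    plain = pd.read_csv(path)                          # date read back as text
    parsed = pd.read_csv(path, parse_dates=["date"])   # date restored as datetime

print(plain["date"].dtype, parsed["date"].dtype)
```

Downstream notebooks that read the interim files would likely want `parse_dates=["date"]` for `historical_clean.csv`.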