# Data Exploration Template
Use this notebook to perform initial exploratory data analysis (EDA) on new datasets. Fill in each section with project-specific details.

## Notebook Goals
- Document data sources and assumptions
- Inspect schema, data quality, and summary statistics
- Visualize key distributions and relationships
- Capture actionable follow-ups for feature engineering

In [None]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
pd.set_option("display.max_columns", 100)

DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))
print(f"Using data directory: {DATA_DIR.resolve()}")

## Load Raw Dataset
Update the path, file format, and loader logic to match the dataset you want to explore.
Capture data source details (owner, refresh cadence, quirks) in a markdown cell once confirmed.

In [None]:
raw_path = DATA_DIR / "comp_data.csv"  # TODO: replace with the correct file
if not raw_path.exists():
    raise FileNotFoundError(f"Update raw_path; {raw_path} not found.")

df = pd.read_csv(raw_path)
print(f"Loaded {df.shape[0]:,} rows and {df.shape[1]} columns from {raw_path.name}")
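If the dataset is not a CSV, the loader logic can be chosen from the file extension. The following is a minimal sketch; `LOADERS` and `load_table` are hypothetical names introduced here for illustration, and the set of supported formats is an assumption to adjust per project.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file suffix to a pandas reader; extend as needed.
LOADERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,  # requires an optional engine such as pyarrow
}


def load_table(path):
    """Load a tabular file with the pandas reader that matches its suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"No loader registered for '{suffix}' files: {path}")
    return LOADERS[suffix](path)
```

In the notebook this would replace the `pd.read_csv(raw_path)` call with `load_table(raw_path)`.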

## Quick Schema & Sample Rows

In [None]:
df.head()

In [None]:
df.sample(5, random_state=42)

In [None]:
df.info()

In [None]:
df.describe(include='all').transpose()

## Data Quality Checks
Track columns with high missingness or unusual values. Add commentary below.

In [None]:
missing = df.isna().mean().sort_values(ascending=False)
missing[missing > 0].to_frame('missing_rate')

In [None]:
cardinality = df.nunique().sort_values(ascending=False)
cardinality.to_frame('unique_values')
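Missingness and cardinality can be complemented with a few coarse flags. The sketch below counts duplicate rows and lists constant and mostly-empty columns; `quality_flags`, the 0.5 threshold, and the toy frame are assumptions made for illustration, not part of the template.

```python
import pandas as pd


def quality_flags(frame, missing_threshold=0.5):
    """Return coarse data quality flags for a DataFrame."""
    missing_rate = frame.isna().mean()
    return {
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns holding a single distinct value (NaN counts as a value)
        "constant_columns": [
            col for col in frame.columns if frame[col].nunique(dropna=False) <= 1
        ],
        # Columns whose missing rate exceeds the threshold
        "high_missing_columns": missing_rate[missing_rate > missing_threshold].index.tolist(),
    }


# Toy frame for illustration only; in the notebook, pass df instead.
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 4, 5],
        "flag": ["y", "y", "y", "y", "y"],
        "score": [None, 0.7, 0.7, None, None],
    }
)
flags = quality_flags(toy)
print(flags)
```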

## Numeric Feature Distributions
Visualize key numeric features. Replace the selection logic as needed.

In [None]:
numeric_cols = df.select_dtypes(include='number').columns.tolist()
selected_numeric = numeric_cols[:6]  # TODO: curate this list

n_plots = max(len(selected_numeric), 1)  # plt.subplots raises if nrows is 0
fig, axes = plt.subplots(nrows=n_plots, ncols=1, figsize=(10, 4 * n_plots), squeeze=False)
# squeeze=False always returns a 2-D array of axes, so flatten it for iteration
for ax, col in zip(axes.ravel(), selected_numeric):
    sns.histplot(df[col].dropna(), ax=ax, kde=True)
    ax.set_title(col)
plt.tight_layout()
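Histograms show outliers visually; a numeric summary can help when there are many columns. The sketch below uses the common 1.5 × IQR rule, which is a swapped-in convention rather than something this template prescribes. `iqr_outlier_share` and the toy series are hypothetical.

```python
import pandas as pd


def iqr_outlier_share(series, k=1.5):
    """Share of non-null values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(is_outlier.mean())


# Toy series for illustration only: one extreme value among ten.
toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
share = iqr_outlier_share(toy_values)
print(f"Outlier share: {share:.2f}")
```

In the notebook, something like `df[selected_numeric].apply(iqr_outlier_share)` would give one value per column.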

## Correlations & Relationships
Use pairplots, heatmaps, or custom charts to highlight important relationships.

In [None]:
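corr = df[numeric_cols].corr()
plt.figure(figsize=(12, 8))
# Fix the color scale to [-1, 1] so heatmaps are comparable across datasets
sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Numeric Feature Correlation Heatmap')
plt.show()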
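A heatmap gets hard to read with many features, so it can help to list the strongest pairs as a table. This is a minimal sketch; `top_correlated_pairs` and the toy frame are hypothetical, and it assumes Pearson correlation (the pandas default).

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=10):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr(numeric_only=True)
    cols = corr.columns
    # Indices of the strict upper triangle: each pair once, no self-pairs
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame(
        {
            "feature_a": cols[i],
            "feature_b": cols[j],
            "corr": corr.to_numpy()[i, j],
        }
    ).dropna(subset=["corr"])
    # Rank by magnitude so strong negative correlations are not buried
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return pairs.head(n).reset_index(drop=True)


# Toy frame for illustration only; in the notebook, pass df[numeric_cols].
toy = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10], "z": [5, 3, 4, 1, 2]}
)
pairs = top_correlated_pairs(toy)
print(pairs)
```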

## Categorical Feature Summary
Inspect top categories and their frequencies for categorical columns.

In [None]:
# Include pandas 'category' columns as well as plain object (string) columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
summary = {}
for col in categorical_cols[:5]:  # TODO: curate list
    summary[col] = df[col].value_counts(dropna=False).head(10)
summary
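Raw counts are easier to compare across columns when paired with their share of all rows. The sketch below is one way to tabulate that; `top_categories` and the toy series are hypothetical names used only for illustration.

```python
import pandas as pd


def top_categories(series, n=10):
    """Top-n categories with their counts and share of all rows (NaN included)."""
    counts = series.value_counts(dropna=False).head(n)
    return pd.DataFrame({"count": counts, "share": counts / len(series)})


# Toy series for illustration only; in the notebook, pass df[col].
toy_colors = pd.Series(["a", "a", "b", None])
table = top_categories(toy_colors)
print(table)
```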

## Time-Based Checks (Optional)
If the dataset has a temporal dimension, convert the timestamp column to datetime and inspect recency, seasonality, and missing spans.

In [None]:
# Example: df['timestamp'] = pd.to_datetime(df['timestamp'])
# df.set_index('timestamp').resample('1D').size().plot(figsize=(12, 4))
# plt.title('Daily Record Volume')
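Missing spans can also be listed explicitly instead of read off a plot. The sketch below assumes daily granularity and treats unparseable timestamps as missing; `missing_days` and the toy data are hypothetical.

```python
import pandas as pd


def missing_days(timestamps):
    """Calendar days between the first and last observation that have no records."""
    # Unparseable values become NaT and are dropped
    parsed = pd.to_datetime(timestamps, errors="coerce").dropna()
    if parsed.empty:
        return pd.DatetimeIndex([])
    # Truncate each timestamp to midnight and keep the distinct days
    observed = pd.DatetimeIndex(parsed.dt.normalize().unique())
    full_range = pd.date_range(observed.min(), observed.max(), freq="D")
    return full_range.difference(observed)


# Toy data for illustration only; in the notebook, pass df['timestamp'].
toy_ts = pd.Series(["2024-01-01", "2024-01-02", "2024-01-05", "not a date"])
gaps = missing_days(toy_ts)
print(gaps)
```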

## Findings & Next Actions
Summarize key takeaways and follow-ups for modeling or data engineering.
- **Data quality:** TODO
- **Feature ideas:** TODO
- **Risks/Gaps:** TODO