# P3 · Forecasting / Predictive Baseline

> **Case study notebook** — replace placeholders with your dataset and analysis.

**Checklist**
- [ ] Link to dataset and license
- [ ] Problem statement & business context
- [ ] Data dictionary & assumptions
- [ ] Methodology overview (steps & scope)
- [ ] Results (KPIs, visuals, tables)
- [ ] Executive summary (non-technical)


In [None]:
# Environment & imports
import os, sys, json, math, pathlib, sqlite3, itertools, statistics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Local utils
sys.path.append(str(pathlib.Path('../src').resolve()))
from utils import ensure_dirs, load_csv, save_df, memory_report

# Paths
RAW = pathlib.Path('../data/raw')
PROC = pathlib.Path('../data/processed')
REPORTS = pathlib.Path('../reports')
ensure_dirs(RAW, PROC, REPORTS)

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)

## 1) Data Ingestion & Provenance

- Describe the data source (URL, owner, update frequency).
- Record the date/time pulled and version.
- Save the raw snapshot under `data/raw/`.
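
One way to capture the provenance items above is a small JSON sidecar written next to the raw file. The sketch below is illustrative only: `write_provenance`, the sidecar naming, and the example URL are assumptions, not part of the project's `utils` module.

```python
import hashlib
import json
import pathlib
import tempfile
from datetime import datetime, timezone


def write_provenance(raw_file, source_url, version="unknown"):
    """Write a JSON sidecar describing a raw snapshot (hypothetical helper)."""
    raw_file = pathlib.Path(raw_file)
    record = {
        "file": raw_file.name,
        "source_url": source_url,
        "version": version,
        # Time the record is written; call this right after the download
        "pulled_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "size_bytes": raw_file.stat().st_size,
        # Content hash lets you detect a silently changed snapshot later
        "sha256": hashlib.sha256(raw_file.read_bytes()).hexdigest(),
    }
    sidecar = raw_file.with_name(raw_file.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record


# Demo on a throwaway file; in the notebook you would pass a path under data/raw
demo_dir = pathlib.Path(tempfile.mkdtemp())
demo_file = demo_dir / "sample.csv"
demo_file.write_text("a,b\n1,2\n", encoding="utf-8")
record = write_provenance(demo_file, "https://example.com/data.csv", version="v1")
print(record["file"], record["sha256"][:12])
```

Keeping the hash alongside the timestamp makes it possible to confirm later that the analysis ran on the same bytes that were originally pulled.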


In [None]:
# Example: load a CSV placed manually in data/raw
# df_raw = load_csv(RAW / 'your_raw_file.csv')
# display(df_raw.head())

## 2) Data Quality & Cleaning

Assess and fix:
- Missingness, duplicates, types, outliers
- Standardize categories, units, and date parsing
- Create a data quality report (shape, null %, duplicates)
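
The data quality report mentioned above could look roughly like the following. This is a minimal sketch on a made-up frame; `quality_report` and the demo columns are hypothetical names chosen for illustration.

```python
import pandas as pd


def quality_report(df):
    """Return (summary dict, per-column frame) describing basic data quality."""
    per_column = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "null_pct": df.isna().mean() * 100,  # share of missing values, in percent
            "n_unique": df.nunique(dropna=True),
        }
    ).sort_values("null_pct", ascending=False)
    summary = {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),  # exact full-row duplicates
    }
    return summary, per_column


# Made-up data: one duplicated row, one missing value in each of two columns
demo = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "value": [10.0, 5.0, 5.0, None],
        "cat": ["a", "b", "b", None],
    }
)
summary, per_column = quality_report(demo)
print(summary)
print(per_column)
```

Saving `per_column` to `reports/` alongside the figures gives reviewers a quick record of the state of the raw data before cleaning.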


In [None]:
# if 'df_raw' in globals():
#     # Basic profile
#     print('Shape:', df_raw.shape)
#     nulls = (df_raw.isna().mean() * 100).sort_values(ascending=False).to_frame('null_pct')
#     display(nulls.head(20))
#     dup_count = df_raw.duplicated().sum()
#     print('Duplicates:', dup_count)
#     display(df_raw.describe(include='all').T.head(20))
#     memory_report(df_raw)
# else:
#     print('Place a dataset in data/raw and uncomment.')

## 3) Feature Engineering / Transformations

- Create canonical keys, date parts, normalized metrics
- Save cleaned dataset to `data/processed/`
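
The transformations listed above might be sketched as three small helpers. All function and column names here (`canonical_key`, `add_date_parts`, `zscore`, `customer`, `date`, `metric`) are placeholders for illustration; adapt them to your dataset.

```python
import pandas as pd


def canonical_key(series):
    """Normalize a text key: trim whitespace and lowercase."""
    return series.str.strip().str.lower()


def add_date_parts(df, date_col="date"):
    """Parse a date column and add year, month, and day-of-week (Monday=0)."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    out["year"] = out[date_col].dt.year
    out["month"] = out[date_col].dt.month
    out["day_of_week"] = out[date_col].dt.dayofweek
    return out


def zscore(series):
    """Standardize to mean 0 and population standard deviation 1."""
    std = series.std(ddof=0)
    if pd.isna(std) or std == 0:
        return series * 0.0  # constant column: no variation to scale
    return (series - series.mean()) / std


demo = pd.DataFrame(
    {
        "customer": ["  ACME ", "acme", "Globex"],
        "date": ["2024-01-01", "2024-02-15", "2024-02-20"],
        "metric": [1.0, 2.0, 3.0],
    }
)
clean = add_date_parts(demo)
clean["customer_key"] = canonical_key(clean["customer"])
clean["metric_z"] = zscore(clean["metric"])
print(clean)
```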


In [None]:
# Example transformation
# df = df_raw.copy()
# # df['date'] = pd.to_datetime(df['date'])
# # df['month'] = df['date'].dt.to_period('M')
# # Save processed
# save_df(df, PROC / 'clean.csv', index=False)

## 4) EDA & KPIs

- Univariate/bivariate analysis
- Trend lines, distributions, categorical vs numeric
- Define and compute KPIs relevant to the problem
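
As an illustration of a time-based KPI, the sketch below aggregates a metric by calendar month and computes month-over-month growth. The column names `date` and `metric` and the helper `monthly_kpis` are assumptions; the right KPIs depend on your problem statement.

```python
import pandas as pd


def monthly_kpis(df, date_col="date", value_col="metric"):
    """Aggregate a metric per calendar month and add month-over-month growth."""
    monthly = (
        df.assign(month=pd.to_datetime(df[date_col]).dt.to_period("M"))
        .groupby("month")[value_col]
        .agg(total="sum", mean="mean", n="count")
    )
    # Percentage change of the monthly total versus the previous month
    monthly["mom_growth_pct"] = monthly["total"].pct_change() * 100
    return monthly


# Made-up data: two January rows and one February row
demo = pd.DataFrame(
    {
        "date": ["2024-01-05", "2024-01-20", "2024-02-10"],
        "metric": [10, 30, 60],
    }
)
monthly = monthly_kpis(demo)
print(monthly)
```

The first month has no predecessor, so its growth value is missing by construction.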


In [None]:
# if 'df' in globals():
#     # Example: histogram of a numeric column
#     # df['some_numeric'].plot(kind='hist')
#     # plt.show()
#     pass

## 5) (Optional) SQL Layer (for P2)

- Create a local SQLite DB from the processed data
- Write and document analytical queries/views
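
A documented view keeps an analytical query reusable. The following sketch uses an in-memory SQLite database and a made-up `fact` table so it runs anywhere; the view name `v_category_summary` is hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for the processed dataset
demo = pd.DataFrame(
    {"category": ["a", "a", "b"], "metric": [1.0, 3.0, 10.0]}
)

conn = sqlite3.connect(":memory:")  # swap for a file path to persist the DB
try:
    demo.to_sql("fact", conn, if_exists="replace", index=False)
    # View: row count and average metric per category
    conn.execute(
        """
        CREATE VIEW IF NOT EXISTS v_category_summary AS
        SELECT category, COUNT(*) AS n, AVG(metric) AS avg_metric
        FROM fact
        GROUP BY category
        """
    )
    summary = pd.read_sql(
        "SELECT * FROM v_category_summary ORDER BY n DESC, category", conn
    )
finally:
    conn.close()  # release the connection even if a query fails

print(summary)
```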


In [None]:
# if 'df' in globals():
#     conn = sqlite3.connect(PROC / 'project.db')
#     df.to_sql('fact', conn, if_exists='replace', index=False)
#     # Example analytical question:
#     # q = '''
#     # SELECT category, COUNT(*) as n, AVG(metric) as avg_metric
#     # FROM fact GROUP BY 1 ORDER BY 2 DESC;
#     # '''
#     # print(pd.read_sql(q, conn).head())
#     conn.close()  # close the connection opened above, even if the example query stays commented out

## 6) (P3) Baseline Modeling

- Choose a simple model (e.g., LinearRegression or LogisticRegression)
- Split data (chronologically when rows are time-ordered, so the test set lies after the training period), evaluate baseline metrics, analyze errors
- Emphasize interpretability: coefficients/feature importance
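
For a forecasting problem, a useful reference point is a persistence (last-value) forecast evaluated on a chronological holdout. The sketch below compares it with a straight-line trend fitted by `numpy.polyfit`, which is used here instead of scikit-learn's `LinearRegression` only to keep the example dependency-light. The data are synthetic and the helper names are hypothetical.

```python
import numpy as np


def chronological_split(values, test_fraction=0.2):
    """Hold out the most recent observations as the test set (no shuffling)."""
    n_test = max(1, int(round(len(values) * test_fraction)))
    return values[:-n_test], values[-n_test:]


def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


# Synthetic series with a perfect linear trend: y = 2t + 1
t = np.arange(10)
y = 2.0 * t + 1.0

t_train, t_test = chronological_split(t)
y_train, y_test = chronological_split(y)

# Persistence baseline: repeat the last observed training value
naive_pred = np.full(len(y_test), y_train[-1])

# Linear trend: slope and intercept are directly interpretable coefficients
slope, intercept = np.polyfit(t_train, y_train, deg=1)
trend_pred = slope * t_test + intercept

mae_naive = mae(y_test, naive_pred)
mae_trend = mae(y_test, trend_pred)
print(f"MAE naive: {mae_naive:.3f}  MAE trend: {mae_trend:.3f}  slope: {slope:.3f}")
```

A candidate model is only worth keeping if it beats the persistence baseline on the holdout; on real, noisy data the gap is usually far smaller than in this synthetic case.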


In [None]:
# Example baseline (uncomment and adapt)
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression, LogisticRegression
# from sklearn.metrics import mean_absolute_error, r2_score, accuracy_score, roc_auc_score

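# # y = df['target']; X = df.drop(columns=['target'])
# # For time-ordered data, sort rows by date and add shuffle=False so the test set is the most recent 20%
# # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)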
# # model = LinearRegression()
# # model.fit(X_train, y_train)
# # pred = model.predict(X_test)
# # print('MAE:', mean_absolute_error(y_test, pred))
# # print('R2:', r2_score(y_test, pred))

## 7) (P4) Visualization Storytelling

- What story are we telling? Who is the audience?
- Build a few key visuals that track directly to the KPIs
- Export images to `reports/` for the executive summary
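
Exporting a labeled figure could be done along these lines. This is a sketch with made-up values written to a temporary directory; `export_histogram` is a hypothetical helper, and in the notebook you would point `out_path` at the `reports/` folder instead.

```python
import pathlib
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; omit this line inside a notebook
import matplotlib.pyplot as plt


def export_histogram(values, out_path, title, xlabel):
    """Draw a labeled histogram and save it as an image file."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=10)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Count")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # free the figure once it is on disk
    return pathlib.Path(out_path)


out_dir = pathlib.Path(tempfile.mkdtemp())
saved = export_histogram(
    [1, 2, 2, 3, 3, 3, 4],
    out_dir / "figure_01.png",
    title="Demo distribution",
    xlabel="Value",
)
print("Saved:", saved.name)
```

Titles and axis labels matter more than styling here, since the exported image will be read out of context in the executive summary.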


In [None]:
# Example export
# fig, ax = plt.subplots()
# df['some_numeric'].plot(kind='hist', ax=ax)
# fig.savefig(REPORTS / 'figure_01.png', bbox_inches='tight')

## 8) Executive Summary (Non-Technical)

- Context & objective (2–3 lines)
- What we did (bulleted steps)
- 3–5 key findings with impact
- Next actions (clear, prioritized)


## Appendix

- Data dictionary
- Assumptions & limitations
- Ethical & privacy considerations
