# Python + ML Basics for Architects

A fast, friendly introduction to Python, data handling, and simple machine learning — using fake but architecture-flavored data.

Goals:
- Learn Python + pandas basics
- Load and explore a dataset about windows/daylight/energy
- Split into train/test, fit simple models, and evaluate
- Understand cross-validation and overfitting with visuals
- Get a small taste of neural nets via a multilayer perceptron (`MLPClassifier`), for a "deep learning vibe"


## Setup
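If you are in a fresh environment, you may need to install the packages first. In a notebook cell, uncomment the line below and run it; in a terminal, run the same command without the leading `# !`.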

```bash
# !pip install pandas numpy matplotlib seaborn scikit-learn scipy wordcloud
```

This notebook expects a repository with `data/` and `scripts/` folders at its root, and is meant to be run from that root or from `notebooks/`.

In [None]:
from pathlib import Path
import sys

# Add repo root and scripts to import path when running from notebooks/
if '__file__' in globals():
    ROOT = Path(__file__).resolve().parents[1]
else:
    candidate = Path.cwd()
    if (candidate / 'scripts').exists():
        ROOT = candidate
    elif candidate.name == 'notebooks' and (candidate.parent / 'scripts').exists():
        ROOT = candidate.parent
    else:
        ROOT = candidate
if str(ROOT / 'scripts') not in sys.path:
    sys.path.append(str(ROOT / 'scripts'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

from plotting_utils import set_style, plot_cv_folds, confusion_matrix_plot, learning_curve_plot, validation_curve_plot
import data_simulation as sim

set_style()
np.random.seed(42)


## Python crashlet
Python's core built-in types include numbers, strings, lists, and dictionaries. In practice you'll mostly use lists and dicts to hold small pieces of data, then pandas DataFrames for anything tabular.

In [None]:
# Lists and dicts
sizes = [10, 20, 30]
room = {"name": "Studio A", "area_m2": 120, "seats": 40}
sizes.append(40)
{"sizes": sizes, "room": room}

In [None]:
# Numpy and pandas basics
import numpy as np, pandas as pd
a = np.array([1,2,3])
df = pd.DataFrame({"x": [1,2,3], "y": [3,2,1]})
df.describe()
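
To connect the two ideas above, here is a minimal sketch of turning a list of dicts into a DataFrame and then doing column math and filtering. The room names and numbers are invented for illustration, and `m2_per_seat` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical rooms, invented for illustration
rooms = [
    {"name": "Studio A", "area_m2": 120, "seats": 40},
    {"name": "Studio B", "area_m2": 90, "seats": 36},
    {"name": "Seminar 1", "area_m2": 45, "seats": 20},
]

rooms_df = pd.DataFrame(rooms)  # one row per dict, one column per key

# Column math is vectorized: no explicit loop over rows
rooms_df["m2_per_seat"] = rooms_df["area_m2"] / rooms_df["seats"]

# A boolean mask keeps only the rows where the condition holds
spacious = rooms_df[rooms_df["m2_per_seat"] > 2.4]
print(spacious[["name", "m2_per_seat"]])
```

The same pattern (build a table, derive a column, filter by a condition) is what the rest of this notebook does on the larger dataset.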

## Load (or generate) architecture-flavored dataset
We use fake data that mimics window-to-wall ratio (`wwr`), shading depth, orientation, and energy/daylight outcomes. If the CSV is not on disk yet, the cell below generates it with the `data_simulation` script.

In [None]:
data_path = ROOT / 'data' / 'building_performance_fake.csv'
if not data_path.exists():
    # Generate defaults if missing
    sim.save_default_fake_data(ROOT)

df = pd.read_csv(data_path)
df.head()
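
The generator inside `data_simulation` is not shown in this notebook, so the sketch below is only an illustrative stand-in for how such a table could be built. The column names come from this notebook; the value ranges, the orientation labels (`N`, `E`, `S`, `W`), and the linear relationship with noise are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the table is reproducible
n = 200

toy = pd.DataFrame({
    "wwr": rng.uniform(0.2, 0.8, n),               # window-to-wall ratio (assumed range)
    "shading_depth_m": rng.uniform(0.0, 1.5, n),   # assumed range
    "orientation": rng.choice(["N", "E", "S", "W"], n),  # assumed labels
})

# Invented relationship: more glass raises cooling load, shading lowers it
noise = rng.normal(0.0, 3.0, n)
toy["cooling_kwh_m2"] = 30 + 40 * toy["wwr"] - 8 * toy["shading_depth_m"] + noise

print(toy.head())
print(toy[["wwr", "shading_depth_m", "cooling_kwh_m2"]].corr().round(2))
```

Because the relationship is written down explicitly, you know the "right answer" a model should recover, which is what makes fake data handy for learning.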

In [None]:
df.describe(include='all')

In [None]:
sns.pairplot(df[['wwr','shading_depth_m','daylit_area','glare_probability','cooling_kwh_m2']])
plt.suptitle('Quick relationships', y=1.02);

## Regression: predict cooling energy from design inputs
We fit a linear regression on 75% of the rows and evaluate it on the held-out 25% using R² and mean absolute error (MAE).

In [None]:
features = ['wwr','shading_depth_m','orientation','glazing_u_w_m2k']
target = 'cooling_kwh_m2'
X = df[features]
y = df[target]

cat = ['orientation']
num = ['wwr','shading_depth_m','glazing_u_w_m2k']
pre = ColumnTransformer([
    ("cat", OneHotEncoder(drop='first'), cat),
    ("num", 'passthrough', num)
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
reg = Pipeline([('pre', pre), ('model', LinearRegression())])
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
r2, mae
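
To make the two metrics less of a black box, here is a small worked sketch that computes MAE and R² by hand and compares them with scikit-learn. The four values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values, for illustration only
y_true = np.array([40.0, 50.0, 60.0, 70.0])
y_hat = np.array([42.0, 48.0, 63.0, 67.0])

# MAE: the average size of the miss, in the target's own units
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (error left after the model) / (error of always guessing the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

MAE reads directly as "off by about this many kWh/m² on average", while R² is unitless: 1 is a perfect fit and 0 is no better than predicting the mean.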

In [None]:
plt.scatter(y_test, pred, alpha=0.8)
plt.xlabel('Actual cooling kWh/m²')
plt.ylabel('Predicted')
plt.title(f'Linear Regression: R²={r2:.2f}, MAE={mae:.1f}')
m = np.array([y_test.min(), y_test.max()])
plt.plot(m, m, 'k--', alpha=0.5)
plt.tight_layout()
plt.show()

## Classification: comfortable daylight?
Define a simple label: comfortable if `glare_probability < 0.35` and `daylit_area > 0.5`.

In [None]:
df['comfortable'] = ((df['glare_probability'] < 0.35) & (df['daylit_area'] > 0.5)).astype(int)
Xc = df[['wwr','shading_depth_m','orientation']]
yc = df['comfortable']
pre_c = ColumnTransformer([
    ("cat", OneHotEncoder(drop='first'), ['orientation']),
    ("num", 'passthrough', ['wwr','shading_depth_m'])
])
Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.25, random_state=42, stratify=yc)
clf = Pipeline([('pre', pre_c), ('model', LogisticRegression(max_iter=1000))])
clf.fit(Xtr, ytr)
predc = clf.predict(Xte)
acc = accuracy_score(yte, predc)
cm = confusion_matrix(yte, predc)
acc, cm

In [None]:
confusion_matrix_plot(cm, class_names=("Not Comfortable","Comfortable"), title=f'Logistic Regression (Accuracy={acc:.2f})');
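
Accuracy alone can hide which kind of mistake a classifier makes. The sketch below, using ten made-up labels rather than the notebook's data, shows how the four cells of a confusion matrix turn into accuracy, precision, and recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = comfortable, 0 = not comfortable
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

cm_demo = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are right
precision = tp / (tp + fp)             # of the rooms called comfortable, how many really are
recall = tp / (tp + fn)                # of the truly comfortable rooms, how many were found

print(cm_demo)
print(accuracy, precision, recall)
```

Here the model is right 70% of the time overall but finds only half of the comfortable cases, which is the kind of detail the confusion matrix plot makes visible.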


## Cross-validation and overfitting
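We vary model complexity (polynomial degree 1 to 6, with ridge regularization) and plot the mean 5-fold CV R² for each degree, with its standard deviation as error bars.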

In [None]:
deg_scores = []
cv = KFold(n_splits=5, shuffle=True, random_state=42)
for deg in range(1,7):
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    pipe = Pipeline([('pre', pre), ('poly', poly), ('ridge', Ridge(alpha=1.0))])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='r2')
    deg_scores.append((deg, scores.mean(), scores.std()))

ds = pd.DataFrame(deg_scores, columns=['degree','cv_mean','cv_std'])
plt.errorbar(ds['degree'], ds['cv_mean'], yerr=ds['cv_std'], fmt='o-')
plt.xlabel('Polynomial degree (complexity)')
plt.ylabel('CV R²')
plt.title('Overfitting demo: watch CV score peak then drop')
plt.tight_layout()
plt.show()
ds
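
One way to see overfitting directly is to compare the training score with the CV score at each complexity level. The sketch below does this on a small synthetic curve (a noisy sine, invented for illustration) with plain `LinearRegression` instead of the ridge pipeline above; the degrees 1, 3, and 9 are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40).reshape(-1, 1)                 # one synthetic input feature
y_demo = np.sin(3 * x[:, 0]) + rng.normal(0, 0.3, 40)     # noisy synthetic target

folds = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    out = cross_validate(model, x, y_demo, cv=folds, scoring="r2", return_train_score=True)
    # Store (mean train R², mean CV R²) for this degree
    results[degree] = (out["train_score"].mean(), out["test_score"].mean())
    print(f"degree={degree}: train R2={results[degree][0]:.2f}, CV R2={results[degree][1]:.2f}")
```

The training score can only go up as the degree increases, because a higher-degree polynomial can always reproduce a lower-degree one. A CV score that stops following it is the usual sign that the extra flexibility is fitting noise.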

### Visualizing CV folds ("crazily simple")
Each row is a fold; green cells are train, red cells are test.

In [None]:
cv2 = KFold(n_splits=5, shuffle=True, random_state=7)
_ = plot_cv_folds(n_samples=len(X), cv_splits=cv2.split(np.arange(len(X))), title='5-fold CV: train vs test membership')
plt.show()
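
Behind the picture, `KFold.split` simply yields pairs of index arrays. A minimal sketch with 10 stand-in rows (not the real dataset) prints them and checks that every row lands in the test set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # stand-in for 10 rows of data
kf = KFold(n_splits=5, shuffle=True, random_state=7)

test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    test_counts[test_idx] += 1  # tally how often each row is used for testing
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

print("times each row was tested:", test_counts.tolist())
```

Each row is tested once and trained on four times, so every data point contributes to the final averaged score.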

## A tiny taste of "deep learning": MLP
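A small neural network classifier with two hidden layers (16 and 8 units), included only to build intuition: it's not magic, and simple models are often enough. The cell below prints the logistic regression accuracy next to the MLP accuracy for comparison.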

In [None]:
mlp = Pipeline([('pre', pre_c), ('mlp', MLPClassifier(hidden_layer_sizes=(16,8), max_iter=1000, random_state=42))])
mlp.fit(Xtr, ytr)
acc_mlp = accuracy_score(yte, mlp.predict(Xte))
acc, acc_mlp
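
To see what `hidden_layer_sizes=(16, 8)` actually builds, the sketch below fits the same architecture on synthetic data from `make_classification` (not the notebook's dataset) and prints the shape of each weight matrix. The `StandardScaler` step is an added precaution, not part of the pipeline above: neural networks tend to train more reliably when the inputs are on similar scales.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class problem with 4 input features
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

scaled_mlp = make_pipeline(
    StandardScaler(),  # added assumption: standardize inputs before the network
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=42),
)
scaled_mlp.fit(X_tr, y_tr)

# One weight matrix per layer transition: inputs -> 16 -> 8 -> 1 output
layer_shapes = [w.shape for w in scaled_mlp[-1].coefs_]
acc_demo = scaled_mlp.score(X_te, y_te)
print(layer_shapes, round(acc_demo, 2))
```

Even this tiny network has a few hundred weights to fit, which is one reason neural nets need more data than a linear model with a handful of coefficients.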

### Learning curve (optional)
A learning curve shows how training and validation accuracy change as the model is given more training data.

In [None]:
_ = learning_curve_plot(clf, Xc, yc, cv=5, scoring='accuracy')
plt.show()
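
The `learning_curve_plot` helper lives in the repository's scripts and is not shown here. As a rough idea of the numbers behind such a plot, the sketch below calls scikit-learn's `learning_curve` directly on synthetic data; whether the helper works the same way is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic two-class data, for illustration only
X_syn, y_syn = make_classification(n_samples=400, n_features=5, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_syn,
    y_syn,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% to 100% of each training fold
    cv=5,
    scoring="accuracy",
)

# Each row is one training-set size; each column is one CV fold
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train accuracy={tr:.2f}, validation accuracy={va:.2f}")
```

If the two curves have already met and flattened, more data is unlikely to help much; a persistent gap suggests the model could still benefit from more rows or from less complexity.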

## Key ideas
- Always hold out test data or use cross-validation
- Start simple; only add complexity if it improves validated performance
- Document random seeds and settings for reproducibility
- "Deep" models need more data and care; don’t overfit

Next: a richer notebook focused on visualizing pilot results and making figures for stakeholders.