# 01 — Multiple Linear Regression (Fill-in-the-Blanks)

## Objective

Learn how to train and evaluate a multiple linear regression model, which predicts a numeric target from several input features.
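
As a rough sketch, and using the three feature names that appear later in this notebook, the model being fit can be written as follows. The $\beta$ symbols are notation introduced here for illustration only.

$$
\widehat{\text{Price}} = \beta_0 + \beta_1\,\text{Area} + \beta_2\,\text{Bedroom} + \beta_3\,\text{Age}
$$

Here $\beta_0$ corresponds to `model.intercept_` and $\beta_1, \beta_2, \beta_3$ to the entries of `model.coef_`. `LinearRegression` chooses these values by ordinary least squares, that is, by minimizing the sum of squared differences between actual and predicted prices on the training rows.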

## Install libraries

Use `pip install -r requirements.txt` from the repo root if needed. This notebook does not run pip commands.

## Imports

In [None]:
# Concept: import the core data + model libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Load / Create dataset

In [None]:
# Concept: create a small housing dataset (based on multi_reg1.ipynb)
mydata = {
    "Area": [1500, 1800, 2400, 3000, 3500, 1200, 2100, 2800, 3200, 1700],
    "Bedroom": [3, 4, 3, 4, 5, 2, 3, 4, 5, 3],
    "Age": [10, 5, 8, 3, 2, 20, 12, 7, 4, 15],
    "Price": [300000, 360000, 400000, 500000, 580000, 250000, 390000, 470000, 540000, 330000],
}

df = pd.DataFrame(mydata)
df.head()

## Separate features and target

In [None]:
# TODO: create X and y
X = df[["Area", "Bedroom", "Age"]]
y = df["Price"]

X.head(), y.head()

## Check yourself

In [None]:
print("X shape:", X.shape)
print("y shape:", y.shape)
print("X type:", type(X))
print("y type:", type(y))
print("NaNs in X:", int(X.isna().sum().sum()))
print("NaNs in y:", int(y.isna().sum()))

## Train-test split

In [None]:
# TODO: split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## Create model

In [None]:
# TODO: instantiate model
model = LinearRegression()
model

## Train model

In [None]:
# TODO: fit model
model.fit(X_train, y_train)
print("Intercept:", round(float(model.intercept_), 2))
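# Convert to plain floats so the dict prints as readable numbers
print("Coefficients:", {name: round(float(c), 2) for name, c in zip(X.columns, model.coef_)})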

## Make predictions

In [None]:
# TODO: generate predictions on the test set
y_pred = model.predict(X_test)
pd.DataFrame({"Actual": y_test.values, "Predicted": y_pred.round(2)})

## Evaluate model

In [None]:
# TODO: compute regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R^2: {r2:.4f}")

## Predict new data

In [None]:
# TODO: predict a new house price
new_house = pd.DataFrame([{"Area": 2200, "Bedroom": 3, "Age": 4}])
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: ${predicted_price:,.2f}")

## Visualization

In [None]:
# Concept: compare actual vs predicted prices
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred, color="royalblue")
min_v = min(y_test.min(), y_pred.min())
max_v = max(y_test.max(), y_pred.max())
plt.plot([min_v, max_v], [min_v, max_v], "r--")
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Prices")
plt.tight_layout()
plt.show()

## Core Concepts

- **Multiple features**: the model uses Area, Bedroom count, and Age together.
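- **Coefficient interpretation**: each coefficient estimates the change in predicted price for a one-unit increase in that feature, holding the other features fixed.
- **Train/test split**: holding out rows the model never saw during fitting gives an estimate of how well it generalizes. With only 10 rows and `test_size=0.2`, the test set has 2 samples, so that estimate is noisy.
- **MSE and R²**: MSE is the mean of the squared prediction errors, in squared price units; R² is the fraction of the target's variance that the model explains. R² is at most 1 and can be negative on held-out data when the model predicts worse than the mean of the test targets.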
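
The concepts above can be checked numerically. The following is a minimal sketch, not part of the original exercise: it refits the model on all 10 rows (no split, purely for illustration), confirms that a one-unit increase in `Area` shifts the prediction by the `Area` coefficient, and recomputes MSE and R² by hand. The names `base`, `bumped`, `mse_manual`, and `r2_manual` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Same housing data as in the notebook.
df = pd.DataFrame({
    "Area": [1500, 1800, 2400, 3000, 3500, 1200, 2100, 2800, 3200, 1700],
    "Bedroom": [3, 4, 3, 4, 5, 2, 3, 4, 5, 3],
    "Age": [10, 5, 8, 3, 2, 20, 12, 7, 4, 15],
    "Price": [300000, 360000, 400000, 500000, 580000,
              250000, 390000, 470000, 540000, 330000],
})
X = df[["Area", "Bedroom", "Age"]]
y = df["Price"]

# Assumption: fit on every row, only to illustrate the concepts.
model = LinearRegression().fit(X, y)

# Coefficient interpretation: raise Area by one unit, keep Bedroom and Age fixed.
base = pd.DataFrame([{"Area": 2200, "Bedroom": 3, "Age": 4}])
bumped = base.assign(Area=base["Area"] + 1)
delta = model.predict(bumped)[0] - model.predict(base)[0]
area_coef = model.coef_[list(X.columns).index("Area")]
print("Change in prediction:", delta)
print("Area coefficient:    ", area_coef)

# MSE and R^2 computed from their definitions.
y_true = y.to_numpy(dtype=float)
y_hat = model.predict(X)
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
mse_manual = ss_res / len(y_true)
r2_manual = 1 - ss_res / ss_tot

print("MSE matches sklearn:", np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print("R^2 matches sklearn:", np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because the metrics here are computed on the same rows used for fitting, they describe in-sample fit and will generally look better than the held-out values from the notebook's test split.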

## Common Pitfalls

- Feature-name mismatch: the model was fit on a DataFrame, so pass `.predict()` a DataFrame with the same column names in the same order.
- Shape mismatch: `fit` and `predict` expect 2D features of shape `(n_samples, n_features)`, even for a single sample.
- Plotting gotcha: scatter inputs should be 1D arrays of equal length.

## Smoke Tests (must pass)

In [None]:
assert X.shape[1] == 3
assert len(y_pred) == len(y_test)
assert mse >= 0
assert r2 <= 1.0  # R^2 has no lower bound; it can fall below -1 on a tiny test set
print("✅ Smoke tests passed")

## Further Reading

- [LinearRegression (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [KNeighborsClassifier (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [DecisionTreeClassifier (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- [RandomForestClassifier (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [train_test_split (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [accuracy_score (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
- [confusion_matrix (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
- [classification_report (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
- [mean_squared_error (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
- [r2_score (scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
- [Pandas indexing guide](https://pandas.pydata.org/docs/user_guide/indexing.html)
- [Matplotlib scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)
