# California Housing — Linear Regression Walkthrough
This notebook demonstrates a complete regression workflow using the **California Housing dataset**, available through scikit-learn. We walk through data exploration, model training, evaluation, and checks of the regression assumptions.

**Steps covered:**
1. Load dataset & inspect structure
2. Exploratory Data Analysis (EDA)
3. Train/test split
4. Fit Linear Regression model
5. Evaluate with metrics (MAE, MSE, RMSE, R²)
6. Check regression assumptions with residual plots
7. (Optional) Improve with Ridge & Lasso


## 1. Setup & Imports

In [None]:
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Reproducibility
np.random.seed(42)

print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

## 2. Load Dataset
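We use scikit-learn's California Housing dataset, derived from the 1990 U.S. census. Each of its 20,640 rows describes one census block group with eight numeric features (such as median income, house age, and population). The target, `MedHouseVal`, is the median house value in units of 100,000 US dollars. `fetch_california_housing` downloads the data on first use and caches it locally.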

In [None]:
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()

## 3. Exploratory Data Analysis (EDA)
Check shape, summary stats, missing values, and correlations.

In [None]:
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("\nMissing values per column:\n", df.isnull().sum())

df.describe()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=False, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Features and Target)")
plt.show()
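
A heatmap gives the overall picture, but for regression it is often handy to rank features by their correlation with the target. The sketch below does this on a small synthetic frame. The column names `feat_a`, `feat_b`, `target` and the helper `rank_by_target_corr` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def rank_by_target_corr(frame, target_col):
    """Return feature-target Pearson correlations, ordered by absolute value (largest first)."""
    corr = frame.corr()[target_col].drop(target_col)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in data (illustrative only, not the housing dataset)
rng = np.random.default_rng(0)
n = 200
feat_a = rng.normal(size=n)
feat_b = rng.normal(size=n)
demo = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_b,
    "target": 3.0 * feat_a - 0.5 * feat_b + rng.normal(scale=0.5, size=n),
})

ranking = rank_by_target_corr(demo, "target")
print(ranking)
```

In this notebook the equivalent call would be `rank_by_target_corr(df, "MedHouseVal")`.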

## 4. Split Data
Separate features (X) and target (y), then split into training and testing sets.

In [None]:
X = df.drop(columns=["MedHouseVal"])  # features
y = df["MedHouseVal"]  # target (median house value)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
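
To illustrate what `random_state` provides, the following sketch splits a toy array twice with the same seed and checks that both calls produce the same partition. The toy arrays are made up for the example and are not part of the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

Xa_train, Xa_test, ya_train, ya_test = split_a
print("Train/test sizes:", Xa_train.shape, Xa_test.shape)

# Same seed, same partition
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
```

With `test_size=0.2`, 10 samples split into 8 for training and 2 for testing, and the rows of `X` and `y` stay aligned.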

## 5. Fit Linear Regression Model

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef:.4f}")

## 6. Predictions & Evaluation
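We report four standard regression metrics on both the training and test sets: MAE, MSE, RMSE, and R². MAE and RMSE are in the units of the target (here, 100,000 US dollars), MSE is in squared units, and R² is unitless.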

In [None]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

def regression_report(y_true, y_pred, prefix=""):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"{prefix}MAE : {mae:.4f}")
    print(f"{prefix}MSE : {mse:.4f}")
    print(f"{prefix}RMSE: {rmse:.4f}")
    print(f"{prefix}R²  : {r2:.4f}")

print("Train set:")
regression_report(y_train, y_train_pred, prefix="  ")
print("\nTest set:")
regression_report(y_test, y_test_pred, prefix="  ")
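
To make the four metrics concrete, here is a minimal hand computation with NumPy on a toy pair of arrays. The values are invented for illustration, and the variable names are not part of the notebook's pipeline.

```python
import numpy as np

# Toy values (illustrative only)
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_pred = np.array([2.5, 1.5, 2.0, 3.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
# MAE=0.5000 MSE=0.3750 RMSE=0.6124 R2=0.7000
```

These are the same quantities that `mean_absolute_error`, `mean_squared_error`, and `r2_score` return for the same inputs.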

## 7. Residual Analysis
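Use the training residuals to check the main assumptions of linear regression. Linearity and homoscedasticity: in the residuals-vs-fitted plot, points should scatter evenly around zero with no visible trend or funnel shape. Normality: the histogram of residuals should be roughly symmetric and bell-shaped.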

In [None]:
train_residuals = y_train - y_train_pred

# Residuals vs Fitted
plt.figure(figsize=(6,4))
plt.scatter(y_train_pred, train_residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted (Train)")
plt.show()

# Histogram of residuals
plt.figure(figsize=(6,4))
plt.hist(train_residuals, bins=30, edgecolor='k')
plt.title("Histogram of Residuals (Train)")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()
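
A funnel shape in the residuals-vs-fitted plot can be hard to judge by eye. One rough numeric companion, sketched here under the assumption that equal-count bins are adequate, is to compare the residual standard deviation across bins of the fitted values. The helper `residual_spread_by_bin` is hypothetical and the data are synthetic.

```python
import numpy as np

def residual_spread_by_bin(fitted, residuals, n_bins=4):
    """Standard deviation of residuals within equal-count bins of the fitted values."""
    order = np.argsort(np.asarray(fitted))
    chunks = np.array_split(np.asarray(residuals)[order], n_bins)
    return np.array([chunk.std() for chunk in chunks])

# Synthetic example (illustrative only): noise grows with the fitted value
rng = np.random.default_rng(0)
fitted_demo = rng.uniform(1.0, 5.0, size=2000)
resid_demo = rng.normal(scale=0.2 * fitted_demo)

spread = residual_spread_by_bin(fitted_demo, resid_demo)
print("Residual std per bin (low to high fitted):", np.round(spread, 3))
```

Roughly equal values across bins are consistent with homoscedasticity; a steady increase or decrease suggests the variance depends on the fitted value. In this notebook the call would be `residual_spread_by_bin(y_train_pred, train_residuals)`.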

## 8. Optional: Ridge & Lasso Regression
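Compare the performance of regularized linear models. Ridge (L2 penalty) and Lasso (L1 penalty) both shrink coefficients, so their results depend on the scale of each feature. The features are used unscaled here, as in the baseline model.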

In [None]:
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.01, max_iter=10000)

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

print(f"Ridge (train R², test R²): {ridge.score(X_train, y_train):.4f}, {ridge.score(X_test, y_test):.4f}")
print(f"Lasso (train R², test R²): {lasso.score(X_train, y_train):.4f}, {lasso.score(X_test, y_test):.4f}")
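
Because the Ridge and Lasso penalties act on coefficient magnitudes, a common variation is to standardize the features first. The sketch below wraps each model in a pipeline with `StandardScaler`. It uses synthetic data with deliberately mismatched feature scales, so the names `X_demo`, `y_demo`, and `scaled_models` are assumptions for illustration and the scores say nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only): 8 features on very different scales
rng = np.random.default_rng(42)
scales = np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 10.0, 100.0, 1000.0])
X_demo = rng.normal(size=(500, 8)) * scales
true_coef = np.array([2.0, -1.0, 0.5, 0.0, 1.5, 0.0, -0.5, 1.0]) / scales
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so it is fit on the training data only
scaled_models = {
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=10000)),
}

demo_scores = {}
for name, pipe in scaled_models.items():
    pipe.fit(Xd_train, yd_train)
    demo_scores[name] = (pipe.score(Xd_train, yd_train), pipe.score(Xd_test, yd_test))
    print(f"{name} (train R², test R²): {demo_scores[name][0]:.4f}, {demo_scores[name][1]:.4f}")
```

To try the same idea on the housing data, fit the pipelines on `X_train`, `y_train` and score them on `X_test`, `y_test`.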