## Problem Statement
Load and explore the California Housing dataset, then train and test a linear regression model and evaluate its performance.

Objective:
- Train the model using multiple features
- Clean up the data for better modeling




In [None]:
# Load dataset (from the sklearn sample datasets)
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(type(housing))

<class 'sklearn.utils._bunch.Bunch'>


In [None]:
#Convert bunch into dataframe
import pandas as pd
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


## Preprocessing

In [None]:
# FINDING MISSING VALUES
print(df.isnull().sum())

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Price         0
dtype: int64


In [None]:
# REMOVE OR REPLACE NULL VALUES
### No action needed since there are no null values

In [None]:
# FIX OUTLIERS
## Here, we assume the data contains no outliers that need treatment, so nothing is removed.
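
The no-outlier assumption in this step could be checked before moving on. A minimal sketch, assuming the common 1.5 × IQR rule is an acceptable definition of an outlier; the helper `iqr_outlier_counts` and the small `demo` frame are hypothetical and only illustrate the idea. On the real data the same helper could be called on `X`.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values per column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (frame < lower) | (frame > upper)
    return outside.sum()

# Hypothetical toy data: one extreme value in "rooms", none in "age".
demo = pd.DataFrame({
    "rooms": [5.0, 6.0, 5.5, 6.2, 5.8, 40.0],
    "age": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

A non-zero count does not by itself mean a row should be dropped; it only flags columns worth inspecting.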

In [None]:
# SEGREGATE INDEPENDENT and DEPENDENT VARIABLES

## All fields except Price are the INDEPENDENT (input) features
X = df.drop('Price', axis=1)
X.head()

## Price is the DEPENDENT (target) variable
Y = df['Price']
Y.head()

Unnamed: 0,Price
0,4.526
1,3.585
2,3.521
3,3.413
4,3.422


In [None]:
# NORMALIZATION or STANDARDIZATION

## Here, the features are standardized using Z-SCORE scaling
### Note: only the input features are scaled, not the output. Hence, Y is not scaled.
###      This is why the independent and dependent variables were separated above
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) #calculates the z-score
X_scaled


array([[ 2.34476576,  0.98214266,  0.62855945, ..., -0.04959654,
         1.05254828, -1.32783522],
       [ 2.33223796, -0.60701891,  0.32704136, ..., -0.09251223,
         1.04318455, -1.32284391],
       [ 1.7826994 ,  1.85618152,  1.15562047, ..., -0.02584253,
         1.03850269, -1.33282653],
       ...,
       [-1.14259331, -0.92485123, -0.09031802, ..., -0.0717345 ,
         1.77823747, -0.8237132 ],
       [-1.05458292, -0.84539315, -0.04021111, ..., -0.09122515,
         1.77823747, -0.87362627],
       [-0.78012947, -1.00430931, -0.07044252, ..., -0.04368215,
         1.75014627, -0.83369581]])

### Check Multicollinearity

#### Assumptions in Linear Regression

1. Linear relation between Independent and Dependent variables

2. Multicollinearity: There should be <font color=red>NO or MINIMAL collinearity (correlation) between the independent variables (x1, x2, x3, etc.)</font>

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon
$$

**VIF (Variance Inflation Factor)** is a metric that helps detect multicollinearity in regression models, that is, cases where two or more independent variables are highly correlated.

$$
\text{VIF}_i = \frac{1}{1 - R_i^2}
$$

Where:
- $\text{VIF}_i$ is the Variance Inflation Factor for the $i^{th}$ feature
- $R_i^2$ is the coefficient of determination from regressing the $i^{th}$ feature on all other independent features
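
The formula can be reproduced by hand. The sketch below is one possible implementation using only NumPy: each column is regressed on the remaining columns (with an intercept), $R_i^2$ is computed, and then $1 / (1 - R_i^2)$. The function name `vif_scores` and the synthetic columns `a`, `b`, `c` are assumptions made for illustration; they are not part of this notebook's data.

```python
import numpy as np

def vif_scores(X):
    """Return VIF_i = 1 / (1 - R_i^2) for every column of X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    scores = []
    for i in range(n_features):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Intercept column, so R_i^2 is the usual centered R^2.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

# Synthetic example: `c` is almost a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
demo_vif = vif_scores(np.column_stack([a, b, c]))
print(demo_vif)
```

In this toy setup the two nearly identical columns should receive large VIFs while the independent column stays close to 1, which is the pattern the metric is meant to expose.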

**Interpretation:**

| VIF Value | Meaning                                   |
|-----------|-------------------------------------------|
| 1         | No multicollinearity                      |
| 1–5       | Moderate multicollinearity (acceptable)   |
| 5–10      | High multicollinearity (check closely)    |
| > 10      | Very high multicollinearity (problematic) |

If the VIF of a feature is above 10, consider removing that feature from the feature list and recomputing the VIFs for the remaining features.

In [None]:
# CHECK MULTICOLLINEARITY
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
vif

Unnamed: 0,Features,VIF
0,MedInc,2.501295
1,HouseAge,1.241254
2,AveRooms,8.342786
3,AveBedrms,6.994995
4,Population,1.138125
5,AveOccup,1.008324
6,Latitude,9.297624
7,Longitude,8.962263


### Split the Dataset

Split the dataset into two subsets.
- The training data is used to fit the algorithm and create the model.
- The test data is held out and used only to evaluate the model.

In [None]:
# SPLIT DATA into TRAINING and TESTING
## Here, the data is split in an 80:20 ratio
from sklearn.model_selection import train_test_split
# test_size=0.2 reserves 20% of the rows for the test set; random_state=42 makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)
print("Train Shape: ", X_train.shape[0] * 100 / X_scaled.shape[0])
print("Test Shape", X_test.shape[0] * 100 / X_scaled.shape[0])

Train Shape:  80.0
Test Shape 20.0
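
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the scaling statistics. One common alternative, sketched here on synthetic data rather than on the housing data, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The names `X_demo`, `y_demo` and `pipeline` are hypothetical, and the effect on this notebook's metrics has not been measured.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Split first, so the test rows never reach the scaler's fit step.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on X_tr only, then fits the regression.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)

# score() applies the training-set scaling to X_te before predicting.
test_r2 = pipeline.score(X_te, y_te)
scaler_means = pipeline.named_steps["standardscaler"].mean_
print(test_r2)
```

Keeping the scaler inside the pipeline also means the same object can be passed to cross-validation utilities without refitting the scaler on held-out folds.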


## Model Training and Evaluation

In [None]:
# TRAIN the MODEL (using a specific algorithm)
## here, we have selected LinearRegression Algorithm
from sklearn.linear_model import LinearRegression
regressor_model = LinearRegression()
regressor_model.fit(X_train, Y_train)

In [None]:
# PREDICT using TEST DATA
Y_pred = regressor_model.predict(X_test)
Y_pred

array([0.71912284, 1.76401657, 2.70965883, ..., 4.46877017, 1.18751119,
       2.00940251])

In [None]:
# EVALUATE MODEL using TEST DATA
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
print("Mean of target: ", Y.mean())
print("Std of target: ", Y.std())

print("MAE: ", mean_absolute_error(Y_test, Y_pred))
print("MSE: ", mean_squared_error(Y_test, Y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(Y_test, Y_pred)))
print("R2 Score: ", r2_score(Y_test, Y_pred)) # the model explains about 58% of the variance in the test targets

# An RMSE below the target's standard deviation means the model beats a predict-the-mean baseline
print("RMSE is less than Std, so the model beats a mean-only baseline.")


Mean of target:  2.068558169089147
Std of target:  1.1539561587441483
MAE:  0.5332001304956565
MSE:  0.555891598695244
RMSE:  0.7455813830127761
R2 Score:  0.5757877060324511
RMSE is less than Std, so the model beats a mean-only baseline.


# 📊 Model Performance Summary Report

## 🎯 Target Overview
- **Mean of target**: `2.07`
- **Standard deviation**: `1.15`
- One standard deviation around the mean spans roughly **0.9 to 3.2** (the target is the median house value in units of USD 100,000).

---

## 📈 Error Metrics

- **Mean Absolute Error (MAE)**: `0.53`
  - On average, predictions are off by ~0.53 units, i.e. roughly USD 53,000.
  - This is about 26% of the mean target value.

- **Root Mean Squared Error (RMSE)**: `0.75`
  - Higher than MAE because squaring penalizes large errors more heavily.
  - The gap between RMSE and MAE points to some occasional larger errors.

- **Mean Squared Error (MSE)**: `0.56`
  - The average of squared errors; it is in squared units, so it is harder to interpret directly, but it is the quantity least-squares regression minimizes on the training data.

---

## 🧠 Explained Variance

- **R² Score**: `0.576`
  - The model explains **57.6% of the variance** in the target variable.
  - Indicates a **moderate** level of model fit.
  - Significantly better than baseline (e.g., mean predictor), but with room for improvement.
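
To make the definitions concrete, the metrics above can be computed by hand on a tiny example. The four values in `y_true` and `y_pred` are made up for illustration and are unrelated to the housing results; the formulas are the standard ones, with $R^2 = 1 - SS_{res} / SS_{tot}$.

```python
import numpy as np

# Hypothetical toy values, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_pred = np.array([2.5, 1.5, 2.0, 3.0])

errors = y_true - y_pred                 # [0.5, -0.5, 0.0, 1.0]

mae = np.mean(np.abs(errors))            # mean absolute error
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # back in the target's units

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
r2 = 1.0 - ss_res / ss_tot                       # share of variance explained

print(mae, mse, rmse, r2)
```

A model that always predicts the mean of `y_true` would have `ss_res == ss_tot` and therefore an R² of 0, which is the baseline the R² score is measured against.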

---

## ✅ Interpretation

- The model performs **moderately well overall**:
  - Its average error (MAE of 0.53) is well below the target's standard deviation (1.15).
  - Captures a **meaningful amount of variance** in the target.
- However, **~42% of the variance remains unexplained**, indicating:
  - Potential missing features
  - Model complexity limits
  - Or noisy data

---

## 🔍 Recommendations for Improvement

- Enhance **feature engineering** to expose more signal in the data.
- Try **more complex models** (e.g., ensemble methods, neural networks).
- Perform **residual analysis** to detect systematic under/over-predictions.
- Apply **cross-validation** to confirm the model generalizes well.
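
As an illustration of the cross-validation recommendation, the following sketch runs 5-fold cross-validation with scikit-learn's `cross_val_score`. It uses synthetic data from `make_regression` as a stand-in, so the scores say nothing about the housing model; `X_demo`, `y_demo` and `model` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Scaling lives inside the pipeline, so it is refitted on each training fold.
model = make_pipeline(StandardScaler(), LinearRegression())

# One R^2 score per fold; the spread hints at how stable the fit is.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print(cv_scores.mean(), cv_scores.std())
```

If the per-fold scores differ widely, a single 80:20 split such as the one used above may give an optimistic or pessimistic picture of the model.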

---

