# **California Housing Price Prediction – Regression Assignment**

**Objective**: Apply and evaluate five regression algorithms on the California Housing dataset using supervised learning.

**Dataset Source**: `fetch_california_housing` from `sklearn.datasets`

**Workflow**:

- Preprocessing and scaling
- Implementing 5 regression models
- Evaluating using MSE, MAE, and R²
- Identifying the best and worst models


## 1. Load & Preprocess the Dataset

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset into a DataFrame and attach the target column
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df["Target"] = california.target

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Feature scaling
# Note: the scaler is fit on all rows before the split, so test-set statistics
# influence the transform. Fitting on the training rows only avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop("Target", axis=1))
y = df["Target"]

# Split the data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("✅ Preprocessing complete.")
```


```text
Missing values:
 MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64
✅ Preprocessing complete.
```
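
The cell above fits the scaler on every row before splitting, so the test rows contribute to the scaling statistics. A minimal sketch of the alternative, splitting first and fitting `StandardScaler` on the training rows only, might look like the following. It uses small synthetic data so it runs without downloading anything; the helper name `split_then_scale` and the demo arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(X, y, test_size=0.2, random_state=42):
    """Split first, then fit the scaler on the training rows only (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
    X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


# Synthetic stand-in for the housing features (assumption: 100 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

X_tr, X_te, y_tr, y_te, fitted_scaler = split_then_scale(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
print(np.abs(X_tr.mean(axis=0)).max())  # close to zero: train columns are centered
```

With the real data, `X` would be `df.drop("Target", axis=1)` and `y` would be `df["Target"]`. The change matters most for scale-sensitive models such as SVR; tree-based models are largely unaffected by feature scaling.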


## 2. Train Regression Models

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Five regressors with default hyperparameters (seeds fixed where supported)
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR()  # default RBF kernel
}

results = []

# Fit each model on the training split and score it on the test split
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results.append([name, mse, mae, r2])
    print(f"{name} - Done")

# Rank the models from best to worst by R²
results_df = pd.DataFrame(results, columns=["Model", "MSE", "MAE", "R²"]).sort_values("R²", ascending=False)
results_df
```


```text
Linear Regression - Done
Decision Tree - Done
Random Forest - Done
Gradient Boosting - Done
SVR - Done
```


|   | Model | MSE | MAE | R² |
|---|---|---|---|---|
| 2 | Random Forest | 0.255498 | 0.327613 | 0.805024 |
| 3 | Gradient Boosting | 0.293999 | 0.37165 | 0.775643 |
| 4 | SVR | 0.355198 | 0.397763 | 0.728941 |
| 1 | Decision Tree | 0.494272 | 0.453784 | 0.622811 |
| 0 | Linear Regression | 0.555892 | 0.5332 | 0.575788 |
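
The target in this dataset is the median house value in units of $100,000, so the square root of each MSE (the RMSE) is an error in those same units and is easier to interpret. A small sketch using the MSE values reported above; the names `reported_mse` and `rmse_by_model` are introduced here for illustration only.

```python
import math

# MSE values copied from the results table above
reported_mse = {
    "Random Forest": 0.255498,
    "Gradient Boosting": 0.293999,
    "SVR": 0.355198,
    "Decision Tree": 0.494272,
    "Linear Regression": 0.555892,
}

# RMSE is in target units ($100,000s); multiply by 100,000 to express it in dollars
rmse_by_model = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.3f} (about ${rmse * 100_000:,.0f})")
```

Because RMSE squares the residuals before averaging, it is at least as large as the MAE for the same predictions, and a wide gap between the two suggests a few large errors.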


## 3. Model Evaluation & Comparison

### Metrics Used:
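- **MSE (Mean Squared Error)**: Mean of the squared residuals; squaring penalizes large errors more heavily than small ones.
- **MAE (Mean Absolute Error)**: Mean of the absolute residuals, in the same units as the target (here, hundreds of thousands of dollars).
- **R² Score**: Proportion of the target's variance explained by the model; 1.0 is a perfect fit, and 0 is no better than always predicting the mean.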
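
To make the three definitions concrete, the sketch below computes them directly from their formulas in plain Python. The function name `regression_metrics` and the toy values are assumptions for illustration; in the notebook itself the scikit-learn functions `mean_squared_error`, `mean_absolute_error`, and `r2_score` do this work.

```python
def regression_metrics(y_true, y_pred):
    """Return (mse, mae, r2) computed directly from the definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(r ** 2 for r in residuals) / n   # mean of squared residuals
    mae = sum(abs(r) for r in residuals) / n   # mean of absolute residuals

    mean_true = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)               # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


# Toy values chosen for easy hand-checking (not taken from the housing data)
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 5.0]

mse_demo, mae_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(f"MSE={mse_demo:.3f} MAE={mae_demo:.3f} R2={r2_demo:.3f}")
```

In the toy data, the single residual of size 1.0 contributes 1.0 of the 1.5 total squared error but only half of the total absolute error, which is the sense in which MSE penalizes larger errors.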

### Observations:
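- **Best performing model**: Random Forest (R² = 0.805, MSE = 0.255, MAE = 0.328), followed by Gradient Boosting (R² = 0.776).
- **Worst performing model**: Linear Regression (R² = 0.576, MSE = 0.556, MAE = 0.533). SVR ranks third (R² = 0.729), ahead of the single Decision Tree (R² = 0.623).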
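
The best and worst models can also be picked out programmatically rather than by eye. The sketch below does this in plain Python with the rows from the results table; `reported_results` and `rank_by` are hypothetical names used only for this illustration.

```python
# (model, MSE, MAE, R²) rows copied from the results table above
reported_results = [
    ("Random Forest", 0.255498, 0.327613, 0.805024),
    ("Gradient Boosting", 0.293999, 0.37165, 0.775643),
    ("SVR", 0.355198, 0.397763, 0.728941),
    ("Decision Tree", 0.494272, 0.453784, 0.622811),
    ("Linear Regression", 0.555892, 0.5332, 0.575788),
]


def rank_by(rows, column, higher_is_better):
    """Return model names ordered from best to worst on one metric column."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=higher_is_better)
    return [row[0] for row in ordered]


ranking_r2 = rank_by(reported_results, column=3, higher_is_better=True)
ranking_mse = rank_by(reported_results, column=1, higher_is_better=False)
ranking_mae = rank_by(reported_results, column=2, higher_is_better=False)

print("Best: ", ranking_r2[0])
print("Worst:", ranking_r2[-1])
print("All three metrics agree on the ranking:", ranking_r2 == ranking_mse == ranking_mae)
```

In the notebook itself, `results_df.iloc[0]` and `results_df.iloc[-1]` give the same two rows, because the frame is already sorted by R² in descending order.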


## 4. Conclusion

This notebook demonstrates regression modeling on the California Housing dataset using:
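- A linear baseline (Linear Regression) and a non-linear single model (Decision Tree)
- Ensemble methods (Random Forest and Gradient Boosting)
- A Support Vector Regressor (SVR)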

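On this train/test split, the ensemble methods (Random Forest and Gradient Boosting) perform best. Both capture non-linear relationships between the features and house prices, and combining many trees reduces the overfitting that a single Decision Tree is prone to.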
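
One way to look at the overfitting point is to compare training and test R² for a single tree and a forest. The sketch below does this on synthetic data from `make_friedman1` (a stand-in chosen so the example runs offline, not the housing data); the sample size, noise level, and the `gap_report` name are assumptions for illustration, and the exact scores will differ from the notebook's.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem with added noise
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=42)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

candidates = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

gap_report = {}
for label, estimator in candidates.items():
    estimator.fit(X_syn_train, y_syn_train)
    train_r2 = estimator.score(X_syn_train, y_syn_train)  # R² on data the model has seen
    test_r2 = estimator.score(X_syn_test, y_syn_test)     # R² on held-out data
    gap_report[label] = (train_r2, test_r2)
    print(f"{label}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

An unpruned decision tree typically fits its training data almost perfectly, so a large drop from training to test R² is the signature of overfitting. Running the same comparison on the notebook's `X_train` and `X_test` would show whether the same pattern holds for the housing data.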
