1. Loading and Preprocessing

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the dataset
housing = fetch_california_housing()

# Convert to DataFrame
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

# View the first few rows
df.head()


,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [3]:
# Check Missing Values and Info


In [4]:
# Check for missing values
print(df.isnull().sum())

# Check data types and structure
# (df.info() prints its report directly and returns None, so it is not wrapped in print())
df.info()


MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [5]:
# Feature Scaling

In [6]:
from sklearn.preprocessing import StandardScaler

# Separating features and target
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Apply standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
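
As a rough illustration of what the scaling step does to each column, the sketch below standardizes a single column by hand: subtract the column mean, then divide by the population standard deviation (the `ddof=0` form that `StandardScaler` uses). The helper name `standardize` and the five-value sample are assumptions made for illustration; the real scaler works on all 20,640 rows at once.

```python
import statistics

def standardize(column):
    """Scale a list of numbers to zero mean and unit variance."""
    mean = statistics.fmean(column)
    # Population standard deviation (ddof=0), matching StandardScaler
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# First five MedInc values from the df.head() output above
med_inc = [8.3252, 8.3014, 7.2574, 5.6431, 3.8462]
scaled = standardize(med_inc)

print([round(z, 3) for z in scaled])
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

After scaling, the column has a mean of roughly 0 and a standard deviation of roughly 1, which keeps features with large ranges such as Population from dominating scale-sensitive models such as SVR.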


2. Regression Algorithms

In [7]:
# Train-Test Split

In [8]:
from sklearn.model_selection import train_test_split

# 80-20 split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
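
One caveat with the order used here: the scaler was fitted on every row before the split, so the test rows contributed to the mean and standard deviation used for scaling. A common way to avoid that is to split the raw features first and let a pipeline fit the scaler on the training rows only. The sketch below shows the idea on synthetic data from `make_regression`; the `_demo`, `_tr` and `_te` names are hypothetical, and this is one possible arrangement rather than the code that produced the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical) so the sketch runs without any download
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

# Split the raw, unscaled features first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics when it transforms the test rows
pipe = make_pipeline(StandardScaler(), SVR())
pipe.fit(X_tr, y_tr)

print(pipe.score(X_te, y_te))  # R² on the held-out rows
```

With a dataset as large as this one the effect on the metrics may well be small, but the pipeline form keeps the test set fully unseen during fitting.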


In [9]:
# Train Multiple Models

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'SVR': SVR()
}

# Train and evaluate
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'MSE': mse,
        'MAE': mae,
        'R2 Score': r2
    })

results_df = pd.DataFrame(results)
results_df.sort_values(by='R2 Score', ascending=False)


,Model,MSE,MAE,R2 Score
2,Random Forest,0.255498,0.327613,0.805024
3,Gradient Boosting,0.293999,0.37165,0.775643
4,SVR,0.355198,0.397763,0.728941
1,Decision Tree,0.494272,0.453784,0.622811
0,Linear Regression,0.555892,0.5332,0.575788


## Step 3: Model Evaluation and Comparison

To see how each model performed, I checked three evaluation metrics:  
- **Mean Squared Error (MSE)**  
- **Mean Absolute Error (MAE)**  
- **R-squared Score (R²)**
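
To make the three metrics concrete, here is a small by-hand sketch of how each one is computed. The function names and the four-point toy data are assumptions for illustration only; the notebook itself uses the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean of the squared errors; penalizes large misses heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute errors; in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical), not taken from the housing data
y_true_demo = [3.0, 2.0, 4.0, 5.0]
y_pred_demo = [2.5, 2.0, 4.5, 4.0]

print(mse(y_true_demo, y_pred_demo))
print(mae(y_true_demo, y_pred_demo))
print(r2(y_true_demo, y_pred_demo))
```

Lower is better for MSE and MAE, while R² is better the closer it gets to 1; an R² of 0 means the model does no better than always predicting the mean.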

Here are the results I got:

- **Random Forest Regressor**:  
  MSE = 0.2555, MAE = 0.3276, R² = 0.8050  

- **Gradient Boosting Regressor**:  
  MSE = 0.2940, MAE = 0.3717, R² = 0.7756  

- **Support Vector Regressor (SVR)**:  
  MSE = 0.3552, MAE = 0.3978, R² = 0.7289  

- **Decision Tree Regressor**:  
  MSE = 0.4943, MAE = 0.4538, R² = 0.6228  

- **Linear Regression**:  
  MSE = 0.5559, MAE = 0.5332, R² = 0.5758  
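
Because MSE is in squared units, it can be easier to read as a root mean squared error (RMSE). The sketch below takes the square root of the MSE values listed above; the target in this dataset is expressed in units of $100,000, so multiplying by 100,000 gives a rough dollar figure. The dictionary name `mse_by_model` is just an illustrative choice.

```python
import math

# MSE values copied from the results above
mse_by_model = {
    "Random Forest": 0.2555,
    "Gradient Boosting": 0.2940,
    "SVR": 0.3552,
    "Decision Tree": 0.4943,
    "Linear Regression": 0.5559,
}

rmse_by_model = {name: math.sqrt(value) for name, value in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    # The target is in units of $100,000, so scale up to dollars
    print(f"{name}: RMSE = {rmse:.4f} (about ${rmse * 100_000:,.0f})")
```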

### My Analysis:

**Best Model:** Random Forest  
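It gave the best performance, with the highest R² score (0.805) and the lowest MSE and MAE. A likely reason is that it averages many decision trees, each trained on a bootstrap sample of the data, which reduces the overfitting a single tree is prone to; the lone Decision Tree reached an R² of only 0.623.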
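
The averaging idea can be checked directly: in scikit-learn, a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and its prediction is the mean of their predictions. The sketch below uses synthetic data and a small forest (both assumptions, chosen so it runs quickly) rather than the housing model trained above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Predict the first five rows with every individual tree, then with the forest
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
forest_pred = forest.predict(X_demo[:5])

# The forest's output matches the mean over its trees
print(np.allclose(forest_pred, tree_preds.mean(axis=0)))
```

Individual trees disagree with each other because each sees a different bootstrap sample; averaging smooths out those disagreements.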

**Worst Model:** Linear Regression  
This model had the lowest R² score (0.576) and the highest errors. It probably didn't do well because it models the target as a weighted sum of the features, which can't capture non-linear effects or interactions between features that this dataset likely contains.
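
A toy example shows the limitation. The sketch below fits `LinearRegression` to a purely quadratic relationship, first with the raw feature and then with a squared feature added by hand. The data is synthetic and the variable names are hypothetical; it illustrates the general effect and is not a claim about the exact shape of the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curve: y = x^2 on a range symmetric around zero
x = np.linspace(-1.0, 1.0, 101).reshape(-1, 1)
y_curve = (x ** 2).ravel()

# A straight line cannot follow the curve, so R² is close to 0
linear_r2 = LinearRegression().fit(x, y_curve).score(x, y_curve)

# Adding x^2 as an extra feature lets the same model fit it exactly
x_with_square = np.hstack([x, x ** 2])
curved_r2 = LinearRegression().fit(x_with_square, y_curve).score(x_with_square, y_curve)

print(f"raw feature R²: {linear_r2:.4f}")
print(f"with squared feature R²: {curved_r2:.4f}")
```

Tree-based models pick up this kind of curvature without manual feature engineering, which is consistent with their higher scores in the results above.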
