
1. Loading and Preprocessing


In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop('MedHouseVal', axis=1))
y = df['MedHouseVal']

# Split into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


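# Missing values: checked with df.isnull().sum(); none found.

# Standardization: StandardScaler rescales each feature to zero mean and unit variance. This matters for SVR, which is sensitive to feature scales; the tree-based models (Decision Tree, Random Forest, Gradient Boosting) are largely unaffected by it.
# Note: the scaler here is fit on the full dataset before the split; fitting it on the training split only would keep test-set statistics out of preprocessing.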

Missing values:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
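
The cell above fits `StandardScaler` on the full dataset before splitting. One common alternative is to split first and let a `Pipeline` fit the scaler on the training rows only. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for the housing features, and the names `X_demo`, `y_demo`, and `pipe` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing features (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te
pipe = make_pipeline(StandardScaler(), SVR())
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
fitted_scaler = pipe.named_steps['standardscaler']
print("Scaler means match training means:", np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print("Test R2:", pipe.score(X_te, y_te))
```

For models that ignore feature scale (the tree-based ones), the difference is negligible; it matters mainly for scale-sensitive models such as SVR.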


In [None]:
# DataFrame view of the standardized features from the cell above.
# Reuses df, housing, and X_scaled instead of reloading the dataset and refitting the scaler.
X = pd.DataFrame(X_scaled, columns=housing.feature_names)
y = df['MedHouseVal']



2. Regression Algorithms Implementation

In [None]:
from sklearn.linear_model import LinearRegression

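lr = LinearRegression()          # linear regression
lr.fit(X_train, y_train)         # fit on the training split only
y_pred_lr = lr.predict(X_test)   # predictions used in the evaluation cell

# Explanation: Linear Regression assumes a linear relationship between the features and the target. It is fast to train, and its coefficients are easy to interpret.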
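
To illustrate the interpretability point, the sketch below fits a linear model to synthetic data with a known relationship and reads the recovered coefficients back. The data and the names `X_demo`, `y_demo`, and `demo_lr` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))

# Known linear relationship: y = 3*x0 - 2*x1 + 1, plus a little noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0 + rng.normal(scale=0.01, size=200)

demo_lr = LinearRegression().fit(X_demo, y_demo)

# Each coefficient is the expected change in y per unit change in that feature
print("coefficients:", demo_lr.coef_)      # close to [3, -2]
print("intercept:", demo_lr.intercept_)    # close to 1
```

On standardized features, as in this notebook, a coefficient reads as the change in the target per one standard deviation of the feature.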

In [None]:
from sklearn.tree import DecisionTreeRegressor

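dt = DecisionTreeRegressor(random_state=42)   # decision tree regressor
dt.fit(X_train, y_train)                      # fit on the training split only
y_pred_dt = dt.predict(X_test)                # predictions used in the evaluation cell

# Explanation: A Decision Tree splits the feature space into regions and predicts a constant value in each one. It captures non-linear patterns, but an unconstrained tree tends to overfit.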
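
The overfitting tendency can be seen by comparing training and test scores for an unconstrained tree and a depth-limited one. This is a sketch on synthetic noisy data; `deep_tree`, `shallow_tree`, and the `max_depth=3` setting are illustrative choices, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)   # noisy sine curve

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# An unconstrained tree can grow until every training point sits in its own leaf
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Limiting the depth forces the tree to average over neighbouring points
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unlimited depth", deep_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: train R2 = {tree.score(X_tr, y_tr):.3f}, test R2 = {tree.score(X_te, y_te):.3f}")
```

A large gap between training and test R2 for the unconstrained tree is the usual sign that it has memorized noise.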

In [None]:
from sklearn.ensemble import RandomForestRegressor

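rf = RandomForestRegressor(random_state=42)    # random forest regressor
rf.fit(X_train, y_train)                       # fit on the training split only
y_pred_rf = rf.predict(X_test)                 # predictions used in the evaluation cell

# Explanation: A Random Forest is an ensemble of Decision Trees trained on bootstrap samples. Averaging their predictions improves accuracy and reduces overfitting.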
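
The averaging step can be checked directly: a fitted `RandomForestRegressor` exposes its trees through `estimators_`, and its prediction is the mean of their individual predictions. The sketch below uses synthetic data, and the names `demo_rf`, `per_tree`, and `manual_average` are introduced for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
demo_rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Predictions of each individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo) for tree in demo_rf.estimators_])
manual_average = per_tree.mean(axis=0)

print("forest == mean of trees:", np.allclose(manual_average, demo_rf.predict(X_demo)))
# The trees disagree with each other; averaging smooths that disagreement out
print("mean spread across trees:", per_tree.std(axis=0).mean())
```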


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

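gbr = GradientBoostingRegressor(random_state=42)    # gradient boosting regressor
gbr.fit(X_train, y_train)                           # fit on the training split only
y_pred_gb = gbr.predict(X_test)                     # predictions used in the evaluation cell

# Explanation: Gradient Boosting builds trees sequentially, each one fitted to the errors of the ensemble so far. It handles non-linear relationships well but is sensitive to hyperparameters such as the learning rate and the number of trees.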
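
The sequential behaviour can be observed with `staged_predict`, which yields the ensemble's prediction after each added tree. This is a sketch on synthetic data; `n_estimators=50` and `learning_rate=0.1` are illustrative settings, not the notebook's tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

demo_gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=0)
demo_gbr.fit(X_demo, y_demo)

# One training-set MSE per boosting stage
stage_mse = [mean_squared_error(y_demo, pred) for pred in demo_gbr.staged_predict(X_demo)]

for stage in (1, 10, 50):
    print(f"trees = {stage:2d}, training MSE = {stage_mse[stage - 1]:.1f}")
```

The training error falls as trees are added; tracking the same curve on held-out data is one way to choose the number of trees.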



In [None]:
from sklearn.svm import SVR

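svr = SVR()                        # support vector regressor (RBF kernel by default)
svr.fit(X_train, y_train)          # fit on the training split only
y_pred_svr = svr.predict(X_test)   # predictions used in the evaluation cell

# Explanation: SVR uses kernel functions to model non-linear patterns. It is sensitive to feature scales, so the features should be standardized first.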
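
The scale sensitivity can be demonstrated by fitting the same SVR with and without standardization. The sketch below builds synthetic data in which the informative feature has a small scale and an unrelated feature has a large one; the data and all names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
informative = rng.normal(size=400)                  # small-scale feature that drives the target
large_scale = rng.normal(scale=1000.0, size=400)    # large-scale feature, unrelated to the target
X_demo = np.column_stack([informative, large_scale])
y_demo = 2.0 * np.sin(informative) + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Without scaling, distances in the RBF kernel are dominated by the large-scale feature
r2_unscaled = SVR().fit(X_tr, y_tr).score(X_te, y_te)
# With scaling, both features contribute on comparable terms
r2_scaled = make_pipeline(StandardScaler(), SVR()).fit(X_tr, y_tr).score(X_te, y_te)

print(f"R2 without scaling: {r2_unscaled:.3f}")
print(f"R2 with scaling:    {r2_scaled:.3f}")
```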


3. Model Evaluation and Comparison


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Test-set predictions from each fitted model, keyed by display name
predictions = {
    'Linear Regression': y_pred_lr,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
    'Gradient Boosting': y_pred_gb,
    'SVR': y_pred_svr
}

# Score every model on the same held-out test set
results = {}
for name, preds in predictions.items():
    results[name] = {
        'MSE': mean_squared_error(y_test, preds),
        'MAE': mean_absolute_error(y_test, preds),
        'R2': r2_score(y_test, preds)
    }

results_df = pd.DataFrame(results).T.sort_values(by='R2', ascending=False)
print("\nModel Performance Comparison:\n")
print(results_df)

best_model = results_df['R2'].idxmax()
worst_model = results_df['R2'].idxmin()

print(f"\n✅ Best Performing Model: {best_model} with R2 = {results_df.loc[best_model, 'R2']:.4f}")
print(f"❌ Worst Performing Model: {worst_model} with R2 = {results_df.loc[worst_model, 'R2']:.4f}")




Model Performance Comparison:

                        MSE       MAE        R2
Random Forest      0.255498  0.327613  0.805024
Gradient Boosting  0.293999  0.371650  0.775643
SVR                0.355198  0.397763  0.728941
Decision Tree      0.494272  0.453784  0.622811
Linear Regression  0.555892  0.533200  0.575788

✅ Best Performing Model: Random Forest with R2 = 0.8050
❌ Worst Performing Model: Linear Regression with R2 = 0.5758
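
For reference, the three metrics in the table can be computed by hand. The sketch below does so on a small made-up example (the arrays `y_true` and `y_hat` are illustrative, not notebook data) and checks the results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_hat = np.array([2.5, 3.0, 4.0, 5.0])

errors = y_true - y_hat
mse = np.mean(errors ** 2)        # mean squared error: penalizes large errors more
mae = np.mean(np.abs(errors))     # mean absolute error: same units as the target
# R2: 1 minus the ratio of residual variation to total variation around the mean
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mae, round(r2, 4))     # 0.1875 0.375 0.8571

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_hat)),
      np.isclose(mae, mean_absolute_error(y_true, y_hat)),
      np.isclose(r2, r2_score(y_true, y_hat)))
```

Lower MSE and MAE are better, while R2 is better when closer to 1, which is why the table is sorted by R2 in descending order.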


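1. Best Performing Model: Random Forest (R2 = 0.8050), followed by Gradient Boosting (R2 = 0.7756). Both are tree ensembles that capture non-linear patterns, and combining many trees keeps overfitting in check compared with the single Decision Tree (R2 = 0.6228).

2. Worst Performing Model: Linear Regression (R2 = 0.5758). It assumes a linear relationship between the features and the target, which limits how much of the variation it can explain here. SVR ranked third (R2 = 0.7289); it depends on feature scaling and its training time grows quickly on large datasets.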