### Exercise: Predicting House Prices
Objective: Build a linear regression model that predicts house prices from several features, and evaluate its performance using 5-fold cross-validation.
1. Dataset preparation: Use the California housing dataset available in scikit-learn (`fetch_california_housing`).
2. Exploratory data analysis (EDA): Check for null values and visualize some features.
3. Train a linear regression model.
4. Evaluate the model using 5-fold cross-validation.
5. Further exploration: Try other regression models. Do any of them improve performance?
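
To make step 4 concrete, here is a minimal sketch of what 5-fold cross-validation does under the hood: split the rows into five contiguous folds, fit on four, score on the fifth, and repeat. The helper name `manual_cv_rmse` and the synthetic data from `make_regression` are assumptions for illustration only (the synthetic data lets the sketch run without downloading anything); they are not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def manual_cv_rmse(model, X, y, n_splits=5):
    """Return one RMSE per fold: fit on k-1 folds, score on the held-out fold."""
    # Unshuffled KFold, which is what cv=5 resolves to for a regressor.
    kfold = KFold(n_splits=n_splits)
    fold_rmse = []
    for train_idx, val_idx in kfold.split(X):
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], predictions)))
    return np.array(fold_rmse)


# Synthetic stand-in data (hypothetical, not the California housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
demo_rmse = manual_cv_rmse(LinearRegression(), X_demo, y_demo)
print("Per-fold RMSE:", demo_rmse)
print("Mean RMSE:", demo_rmse.mean())
```

The helper indexes NumPy arrays directly; with a pandas DataFrame, `X.iloc[train_idx]` would be needed instead. In practice `cross_val_score` does the same job in one call and also clones the estimator for each fold.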


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Step 1: Dataset Preparation
california_housing = fetch_california_housing()
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
target = california_housing.target

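# Step 2: Exploratory Data Analysis (EDA)
# info() prints the non-null count per column itself and returns None,
# so wrapping it in print() also prints a trailing "None".
print(data.info())
print(data.describe())  # Summary statistics for each feature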

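# Step 3: Train a Linear Regression Model
# Hold out 20% of the rows as a test set; the remaining 80% is used for training and cross-validation.
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
# Scaling lives inside the pipeline, so the scaler is fit only on the data each fit() call sees.
linear_reg_model = make_pipeline(StandardScaler(), LinearRegression())
linear_reg_model.fit(X_train, y_train)

# Step 4: Evaluate the model using 5-fold cross-validation
# cross_val_score clones and refits the pipeline on each fold.
# The scorer returns negated MSE, so flip the sign before taking the square root.
cv_scores = cross_val_score(linear_reg_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-cv_scores)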

print("Cross-Validation RMSE Scores:", rmse_scores)
print("Mean RMSE:", np.mean(rmse_scores))
print("Standard Deviation of RMSE:", np.std(rmse_scores))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB
None
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.4
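
Step 2 of the exercise also asks for a direct null check and a visualization of some features. A minimal sketch of both follows. The helper name `summarize_and_plot` and the small synthetic DataFrame are assumptions for illustration (the synthetic values only borrow three column names so the block runs without a download); in the notebook, the `data` DataFrame would be passed instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def summarize_and_plot(df, bins=50):
    """Return per-column null counts and the histogram axes for every column."""
    null_counts = df.isnull().sum()
    # DataFrame.hist draws one histogram per numeric column.
    axes = df.hist(bins=bins, figsize=(10, 6))
    plt.tight_layout()
    return null_counts, axes


# Synthetic stand-in values (hypothetical, not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=500),
    "HouseAge": rng.integers(1, 53, size=500).astype(float),
    "AveRooms": rng.normal(loc=5.4, scale=1.0, size=500),
})
null_counts, axes = summarize_and_plot(demo)
print(null_counts)
```

Calling `summarize_and_plot(data)` in the notebook should produce one histogram per feature; outside a notebook, `plt.show()` displays the figure.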

In [2]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

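# Models to try (all with scikit-learn defaults)
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),  # default alpha=1.0
    "Lasso Regression": Lasso(),  # default alpha=1.0; likely too strong for standardized features
    "Random Forest Regression": RandomForestRegressor()  # no random_state, so scores vary slightly between runs
}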

# Evaluate each model using 5-fold cross-validation
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-scores)
    print(f"Model: {name}")
    print("Cross-Validation RMSE Scores:", rmse_scores)
    print("Mean RMSE:", np.mean(rmse_scores))
    print("-----------------------------------")


Model: Linear Regression
Cross-Validation RMSE Scores: [0.72115555 0.70872616 0.7214877  0.71266905 0.73859747]
Mean RMSE: 0.7205271873526421
-----------------------------------
Model: Ridge Regression
Cross-Validation RMSE Scores: [0.72116065 0.70872775 0.72149607 0.71266899 0.73858449]
Mean RMSE: 0.7205275913581648
-----------------------------------
Model: Lasso Regression
Cross-Validation RMSE Scores: [1.17026369 1.13925369 1.16043756 1.142854   1.16811888]
Mean RMSE: 1.1561855660646667
-----------------------------------
Model: Random Forest Regression
Cross-Validation RMSE Scores: [0.51484897 0.51764831 0.50417031 0.50431305 0.51506059]
Mean RMSE: 0.5112082470413293
-----------------------------------
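
The cells above compare models by cross-validation on the training split only; the held-out `X_test` and `y_test` are never used. One possible last step, sketched here, is to refit the chosen model on the full training split and score it once on the test split. The helper name `evaluate_on_test`, the synthetic data, and the `n_estimators=50` setting are assumptions for illustration, chosen so the block runs quickly without a download.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_on_test(model, X_train, y_train, X_test, y_test):
    """Fit on the training split and return (RMSE, R^2) on the held-out split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    return rmse, r2_score(y_test, predictions)


# Synthetic stand-in data (hypothetical, not the California housing data).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A fixed random_state makes the forest reproducible between runs.
forest = RandomForestRegressor(n_estimators=50, random_state=42)
test_rmse, test_r2 = evaluate_on_test(forest, X_tr, y_tr, X_te, y_te)
print("Test RMSE:", test_rmse)
print("Test R^2:", test_r2)
```

In the notebook, the same helper could be called with the existing `X_train, y_train, X_test, y_test` and whichever pipeline had the lowest mean cross-validation RMSE. Scoring on the test split only once keeps it an honest estimate of performance on unseen data.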
