# Python Implementation of K-Fold
Let's write Python code using the scikit-learn library to perform K-fold cross-validation on a house price dataset.\
We'll use the California housing dataset. scikit-learn provides a version of it through `fetch_california_housing`, but the evaluation below runs on a local `housing.csv` copy.\
**Objective**: Evaluate the performance of different regression models (Linear Regression, Multiple Linear Regression, Random Forest Regression, and Decision Tree Regression) on the California housing dataset using K-fold cross-validation.

## Step 1: Importing the necessary libraries

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

## Step 2: Load the housing dataset

In [3]:
df = pd.read_csv('housing.csv')

In [6]:
# Built-in scikit-learn version of the dataset; X and y are reassigned from df in a later cell
california_housing = fetch_california_housing()
X = california_housing.data
y = california_housing.target

In [14]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [18]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [19]:
# Drop the 207 rows whose total_bedrooms value is missing
df.dropna(inplace=True)

In [20]:
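# Features: housing_median_age through median_income (longitude and latitude are skipped)
X = df.iloc[:,2:-2].values
# Target: median_house_value (second-to-last column)
y = df.iloc[:,-2].values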
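
Positional slices such as `iloc[:, 2:-2]` are easy to misread, so the sketch below checks which columns they pick. It rebuilds a two-row frame from the `df.head()` output above (the name `df_demo` is hypothetical and stands in for `df`) and compares the positional selection with a label-based one.

```python
import pandas as pd

# Two rows copied from the df.head() output; df_demo is a hypothetical stand-in for df
df_demo = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
    "housing_median_age": [41.0, 21.0],
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
    "median_income": [8.3252, 8.3014],
    "median_house_value": [452600.0, 358500.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})

# Positional selection, as in the cell above
feature_cols = list(df_demo.columns[2:-2])
target_col = df_demo.columns[-2]
X_by_position = df_demo.iloc[:, 2:-2].values

# Label-based selection (label slices include both endpoints)
X_by_label = df_demo.loc[:, "housing_median_age":"median_income"].values

print("Features:", feature_cols)
print("Target:", target_col)
print("Same matrix:", (X_by_label == X_by_position).all())
```

With this column order, the features run from `housing_median_age` to `median_income`, so `longitude`, `latitude`, and `ocean_proximity` are not seen by any of the models.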

## Step 3: Initialize KFold with 5 folds

In [21]:
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
# Initialize lists to store MSE for each fold
mse_scores_lr = []
mse_scores_mlr = []
mse_scores_rf = []
mse_scores_dt = []
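
To see what `KFold` actually produces, here is a minimal sketch on ten toy samples. The array `X_toy` and the name `kf_toy` are hypothetical; only the `KFold` settings mirror the cell above. Each fold holds out a different fifth of the row indices for testing and trains on the rest.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (hypothetical data, used only to show the index bookkeeping)
X_toy = np.arange(20).reshape(10, 2)

# Same settings as above: 5 folds, shuffled once with a fixed seed
kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

test_indices_seen = []
for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_indices_seen.extend(test_idx.tolist())

# Every sample lands in exactly one test fold
assert sorted(test_indices_seen) == list(range(10))
```

Because `random_state` is fixed, repeated calls to `split` yield the same folds, which keeps the per-fold scores of different models comparable.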

## Step 4: Model Training and Evaluation

### Iterate over each fold

In [22]:
for train_index, test_index in kf.split(X):
    # Split data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Linear Regression
    model_lr = LinearRegression()
    model_lr.fit(X_train, y_train)
    y_pred_lr = model_lr.predict(X_test)
    mse_lr = mean_squared_error(y_test, y_pred_lr)
    mse_scores_lr.append(mse_lr)

    # Multiple Linear Regression (same estimator and feature matrix as the
    # block above, so its scores are identical)
    model_mlr = LinearRegression()
    model_mlr.fit(X_train, y_train)
    y_pred_mlr = model_mlr.predict(X_test)
    mse_mlr = mean_squared_error(y_test, y_pred_mlr)
    mse_scores_mlr.append(mse_mlr)
    
    # Random Forest Regression
    # You can adjust n_estimators and other hyperparameters
    model_rf = RandomForestRegressor(n_estimators=100, random_state=42)
    model_rf.fit(X_train, y_train)
    y_pred_rf = model_rf.predict(X_test)
    mse_rf = mean_squared_error(y_test, y_pred_rf)
    mse_scores_rf.append(mse_rf)
    
    # Decision Tree Regression
    model_dt = DecisionTreeRegressor(random_state=42) # You can adjust other hyperparameters
    model_dt.fit(X_train, y_train)
    y_pred_dt = model_dt.predict(X_test)
    mse_dt = mean_squared_error(y_test, y_pred_dt)
    mse_scores_dt.append(mse_dt)
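
The manual loop above can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below is an alternative formulation, not the code that produced the results in this notebook. It runs on synthetic data from `make_regression` (an assumption, so that the snippet is self-contained), and the names `X_demo`, `y_demo`, and `kf_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration; not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error"
)
mse_per_fold = -neg_mse

# Manual loop for comparison, in the same style as the cell above
manual_mse = []
for train_index, test_index in kf_demo.split(X_demo):
    model = LinearRegression().fit(X_demo[train_index], y_demo[train_index])
    y_pred = model.predict(X_demo[test_index])
    manual_mse.append(mean_squared_error(y_demo[test_index], y_pred))

print("cross_val_score MSE per fold:", mse_per_fold)
print("Matches manual loop:", np.allclose(mse_per_fold, manual_mse))
```

The explicit loop remains useful when you want to keep the fitted models or predictions from each fold; `cross_val_score` returns only the scores.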
    

## Step 5: Calculate the average MSE across all folds for each model

In [23]:
average_mse_lr = np.mean(mse_scores_lr)
average_mse_mlr = np.mean(mse_scores_mlr)
average_mse_rf = np.mean(mse_scores_rf)
average_mse_dt = np.mean(mse_scores_dt)
print("Linear Regression - Mean Squared Error (MSE) for each fold:", mse_scores_lr)
print("Linear Regression - Average MSE:", average_mse_lr)
print("Multiple Linear Regression - Mean Squared Error (MSE) for each fold:", mse_scores_mlr)
print("Multiple Linear Regression - Average MSE:", average_mse_mlr)
print("Random Forest Regression - Mean Squared Error (MSE) for each fold:", mse_scores_rf)
print("Random Forest Regression - Average MSE:", average_mse_rf)
print("Decision Tree Regression - Mean Squared Error (MSE) for each fold:", mse_scores_dt)
print("Decision Tree Regression - Average MSE:", average_mse_dt)

Linear Regression - Mean Squared Error (MSE) for each fold: [5865619646.959269, 5240470425.3381195, 5877061000.424654, 6192954321.708551, 5794433167.31188]
Linear Regression - Average MSE: 5794107712.3484955
Multiple Linear Regression - Mean Squared Error (MSE) for each fold: [5865619646.959269, 5240470425.3381195, 5877061000.424654, 6192954321.708551, 5794433167.31188]
Multiple Linear Regression - Average MSE: 5794107712.3484955
Random Forest Regression - Mean Squared Error (MSE) for each fold: [4484119550.753096, 4260088426.88528, 4569829795.922909, 4470597016.153349, 4523660660.54193]
Random Forest Regression - Average MSE: 4461659090.051313
Decision Tree Regression - Mean Squared Error (MSE) for each fold: [8466798573.477367, 8542955554.78297, 8900706471.199902, 8511012157.081987, 8561037635.880078]
Decision Tree Regression - Average MSE: 8596502078.484463
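
MSE values in the billions are hard to interpret because they are in squared units of the target. As a rough aid, the sketch below takes the square root of the average MSE values printed above to obtain an RMSE in the target's own units, assuming `median_house_value` is recorded in dollars. The dictionary names are hypothetical.

```python
import math

# Average MSE values copied from the output above
average_mse = {
    "Linear Regression": 5794107712.3484955,
    "Random Forest Regression": 4461659090.051313,
    "Decision Tree Regression": 8596502078.484463,
}

# RMSE is in the same units as the target, which makes the error easier to read
average_rmse = {name: math.sqrt(mse) for name, mse in average_mse.items()}

# Print models from lowest to highest error
for name, rmse in sorted(average_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:,.0f}")
```

Note that the square root of the average MSE is not the same quantity as the average of the per-fold RMSE values, although the two usually lie close together.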


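**Conclusion**\
Based on these results, Random Forest Regression is the most suitable of the models tested for predicting housing prices in the California housing dataset. It achieves the lowest average MSE (roughly 4.46 billion), ahead of Linear Regression (roughly 5.79 billion) and Decision Tree Regression (roughly 8.60 billion). The Linear Regression and Multiple Linear Regression scores are identical because both fit the same `LinearRegression` estimator on the same features.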
