# 📌 Cross-Validation Methods & Evaluation Metrics in Regression

## 🔹 Cross-Validation Methods
Cross-validation estimates how well a regression model generalizes by splitting the dataset into several parts (folds).  
The model is trained on some folds and tested on the held-out fold, and the procedure is repeated so that the estimate is averaged over several splits instead of depending on a single one.

### 1. **K-Fold Cross-Validation**
- Splits the dataset into `k` folds of (nearly) equal size.  
- The model is trained on `k-1` folds and tested on the remaining fold.  
- Repeats this process `k` times, each time with a different test fold.  
- **Pros:** Simple, and every observation is used for both training and testing.  
- **Cons:** The estimate can vary with how the folds are drawn, especially on small or noisy datasets, and the model must be fit `k` times.  
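
The splitting described above can be inspected directly. The following is a minimal sketch on a toy array (the names `X_toy`, `kf_toy`, `test_counts`, and `fold_sizes` are illustrative and not part of the notebook's dataset). It prints the train/test indices of each fold and counts how often each sample appears in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption): 10 samples, so each of the 5 folds holds 2 samples.
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

test_counts = np.zeros(len(X_toy), dtype=int)  # times each sample is tested
fold_sizes = []                                # (train size, test size) per fold

for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy)):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={train_idx}, test={test_idx}")

# Across the k iterations, every sample is used for testing exactly once.
print("test appearances per sample:", test_counts)
```
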

### 2. **Leave-One-Out Cross-Validation (LOOCV)**
- Each observation is used once as a test set, while the rest form the training set.  
- Suitable for very small datasets.  
- **Pros:** Makes full use of the data for training.  
- **Cons:** Computationally expensive for large datasets and may have high variance.  

### 3. **Group K-Fold Cross-Validation**
- Ensures that all samples from the same group (e.g., neighborhoods, patients) are kept in either the training or test set.  
- **Pros:** Prevents data leakage when groups contain correlated samples.  
- **Cons:** Requires a meaningful grouping variable.  

---

## 🔹 Evaluation Metrics in Regression

Unlike classification, regression does not use accuracy or confusion matrices. Instead, it focuses on how close the predicted values are to the actual values.

### 1. **Mean Squared Error (MSE)**
\[
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]  
- Penalizes larger errors more heavily.  
- Lower values indicate better model performance.  

### 2. **Root Mean Squared Error (RMSE)**
\[
RMSE = \sqrt{MSE}
\]  
- Square root of MSE, making it interpretable in the same units as the target variable.  
- Useful for comparing models when target values are in real-world units (e.g., prices, temperatures).  

### 3. **Mean Absolute Error (MAE)**
\[
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\]  
- Average of absolute errors.  
- Less sensitive to outliers than MSE/RMSE.  
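
To make the three error metrics above concrete, here is a small sketch that computes them by hand with NumPy and checks MSE and MAE against scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are made-up values chosen so that one prediction is far off.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values (assumption): the last prediction misses by 100.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 300.0])

errors = y_true_demo - y_pred_demo        # [-10, 10, -10, 100]

mse_manual = np.mean(errors ** 2)         # (100 + 100 + 100 + 10000) / 4
rmse_manual = np.sqrt(mse_manual)         # back in the units of the target
mae_manual = np.mean(np.abs(errors))      # (10 + 10 + 10 + 100) / 4

print(f"MSE:  {mse_manual}")
print(f"RMSE: {rmse_manual:.2f}")
print(f"MAE:  {mae_manual}")

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo)))
print(np.isclose(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo)))
```

The single large error dominates the squared metrics: RMSE ends up well above MAE, which is the outlier sensitivity noted above.
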

### 4. **R² Score (Coefficient of Determination)**
\[
R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
\]  
- Measures the proportion of variance in the target variable explained by the model.  
- `R² = 1`: Perfect fit.  
- `R² = 0`: Model performs no better than predicting the mean.  
- Can be negative when the model fits worse than always predicting the mean of `y`.  
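
The three cases listed above can be reproduced with a short sketch. `r2_manual` is a hypothetical helper that follows the formula term by term, and the arrays are made-up values.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the formula above."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up target values (assumption).
y_true_r2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect_pred = y_true_r2.copy()                        # exact predictions
mean_pred = np.full_like(y_true_r2, y_true_r2.mean())  # always predict the mean
reversed_pred = y_true_r2[::-1].copy()                 # worse than the mean

r2_perfect = r2_manual(y_true_r2, perfect_pred)
r2_mean = r2_manual(y_true_r2, mean_pred)
r2_reversed = r2_manual(y_true_r2, reversed_pred)

print(r2_perfect, r2_mean, r2_reversed)

# Cross-check the negative case against scikit-learn.
print(np.isclose(r2_reversed, r2_score(y_true_r2, reversed_pred)))
```
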

### 5. **Adjusted R²**
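- Adjusts R² for the number of predictors used.  
- Penalizes predictors that do not improve the fit, so unlike R² it does not automatically increase when features are added.  
- Especially useful when comparing multiple regression models with different numbers of features.  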
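
The section above gives no formula for Adjusted R². The standard definition, with `n` samples and `p` predictors, is:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\]

A minimal sketch follows. `adjusted_r2` is a hypothetical helper written for illustration, and the numbers passed to it are made up rather than taken from the notebook's results.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical values (assumption): R^2 = 0.90 on 100 samples.
adj_few = adjusted_r2(0.90, n_samples=100, n_features=7)
adj_many = adjusted_r2(0.90, n_samples=100, n_features=30)

# With the same R^2, using more predictors gives a lower adjusted value.
print(f"7 features:  {adj_few:.4f}")
print(f"30 features: {adj_many:.4f}")
```
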




In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
# train_test_split: hold-out validation; the splitters below: cross-validation
from sklearn.model_selection import (
    train_test_split, KFold, LeaveOneOut, GroupKFold, cross_val_score
)

# Load your dataset
data = pd.read_csv('house_price_regression_dataset.csv')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Square_Footage        1000 non-null   int64  
 1   Num_Bedrooms          1000 non-null   int64  
 2   Num_Bathrooms         1000 non-null   int64  
 3   Year_Built            1000 non-null   int64  
 4   Lot_Size              1000 non-null   float64
 5   Garage_Size           1000 non-null   int64  
 6   Neighborhood_Quality  1000 non-null   int64  
 7   House_Price           1000 non-null   float64
dtypes: float64(2), int64(6)
memory usage: 62.6 KB
None


In [2]:
print(data.describe())

       Square_Footage  Num_Bedrooms  Num_Bathrooms   Year_Built     Lot_Size  \
count     1000.000000   1000.000000    1000.000000  1000.000000  1000.000000   
mean      2815.422000      2.990000       1.973000  1986.550000     2.778087   
std       1255.514921      1.427564       0.820332    20.632916     1.297903   
min        503.000000      1.000000       1.000000  1950.000000     0.506058   
25%       1749.500000      2.000000       1.000000  1969.000000     1.665946   
50%       2862.500000      3.000000       2.000000  1986.000000     2.809740   
75%       3849.500000      4.000000       3.000000  2004.250000     3.923317   
max       4999.000000      5.000000       3.000000  2022.000000     4.989303   

       Garage_Size  Neighborhood_Quality   House_Price  
count  1000.000000           1000.000000  1.000000e+03  
mean      1.022000              5.615000  6.188610e+05  
std       0.814973              2.887059  2.535681e+05  
min       0.000000              1.000000  1.116269e

In [None]:
# Define features and target variable
X = data.drop('House_Price', axis=1)
y = data['House_Price']

In [None]:
# Initialize the model
model = LinearRegression()

## Holdout Validation

# 📌 Hold-Out Cross-Validation (Train/Test Split)

This snippet uses the **hold-out method** to evaluate a model:

- `train_test_split(X, y, test_size=0.25, random_state=100)`  
  - Splits the dataset into training and testing sets.  
  - `test_size=0.25`: 25% of the data is reserved for testing, and 75% is used for training.  
  - `random_state=100`: Ensures reproducibility of the split.  

### 🔹 How it works
- The model is trained on **X_train, y_train**.  
- The model is evaluated on **X_test, y_test**.  

### 🔹 Pros & Cons
- **Pros:** Simple, fast, and widely used.  
- **Cons:** Model performance depends heavily on the particular split; less reliable than k-fold cross-validation for small datasets.
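
The dependence on the particular split can be seen by repeating the hold-out procedure with different seeds. This is a sketch on synthetic data from `make_regression` (an assumption, used so the block runs without the CSV file); the names `X_syn`, `y_syn`, and `holdout_mse` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption).
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

holdout_mse = []
for seed in range(5):
    # Same data and model; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=0.25, random_state=seed
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    holdout_mse.append(mean_squared_error(y_te, fitted.predict(X_te)))

print("hold-out MSE per seed:", np.round(holdout_mse, 1))
print(f"spread (max - min): {max(holdout_mse) - min(holdout_mse):.1f}")
```
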


In [None]:
# Hold-out cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100
)

In [None]:
# Train the model
model.fit(X_train, y_train)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [None]:
# Make predictions on your test set
y_pred = model.predict(X_test)

In [None]:
# Evaluate the model using Mean Squared Error (MSE)
mse_score = mean_squared_error(y_test, y_pred)

In [None]:
# Print the MSE score
print(f"MSE: {mse_score}")

MSE: 94878275.76188166


In [10]:
# Compute RMSE as the square root of the hold-out MSE
rmse_score = np.sqrt(mse_score)

In [11]:
print(f"RMSE: {rmse_score}")

RMSE: 9740.548021640347


In [12]:
# Evaluate the model using Mean Absolute Error (MAE)
mae_score = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae_score}")

MAE: 7631.283403104009


## K-Fold Cross Validation

# 📌 K-Fold Cross-Validation with MSE Scoring

This snippet applies **K-Fold Cross-Validation** to evaluate a regression model:

- `KFold(n_splits=5, shuffle=True, random_state=42)`  
  - Splits the dataset into 5 folds.  
  - Shuffles the data before splitting.  
  - Uses a fixed random seed for reproducibility.  

- `cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')`  
  - Trains and tests the model across the 5 folds.  
  - Uses **Mean Squared Error (MSE)** as the evaluation metric.  
  - scikit-learn scorers follow a "higher is better" convention, so the error is returned negated (`neg_mean_squared_error`).  
  - To interpret the results, negate the scores to recover the positive MSE.  

This approach provides a more reliable estimate of the model’s error compared to a single train/test split.
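
The sign handling can be sketched on synthetic data as follows (the data from `make_regression` and the names `kf_demo`, `neg_mse`, `fold_mse`, and `fold_rmse` are assumptions for illustration). The block negates the scores to get per-fold MSE and also shows that the square root of the mean MSE is not the same quantity as the mean of the per-fold RMSE values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption).
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    LinearRegression(), X_syn, y_syn, cv=kf_demo,
    scoring="neg_mean_squared_error",
)

fold_mse = -neg_mse            # flip the sign to get the usual positive MSE
fold_rmse = np.sqrt(fold_mse)  # per-fold RMSE

print("raw scores (all <= 0):", np.round(neg_mse, 1))
print("per-fold MSE:         ", np.round(fold_mse, 1))
print(f"sqrt of mean MSE:      {np.sqrt(fold_mse.mean()):.2f}")
print(f"mean of per-fold RMSE: {fold_rmse.mean():.2f}")
```

The two RMSE summaries differ slightly; the mean of per-fold RMSE values is never larger than the square root of the mean MSE, so it helps to report which one is used.
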


In [24]:
# K-Fold Cross-Validation using MSE as the evaluation metric
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

In [None]:
# Scores are negative MSE, so negate the mean to get the positive MSE, then take the square root for RMSE
print(f"Cross-validated MSE: {-scores.mean()}")
print(f"Cross-validated RMSE: {np.sqrt(-scores.mean())}")

Cross-validated MSE: 96389323.55483347
Cross-validated RMSE: 9817.806453319065


## Leave-One-out Validation

# 📌 Leave-One-Out Cross-Validation (LOOCV)

This snippet applies **Leave-One-Out Cross-Validation (LOOCV)** to evaluate a regression model:

- `LeaveOneOut()`  
  - Splits the dataset into as many folds as there are samples.  
  - In each iteration, **1 sample is used as the test set** and the remaining samples are used for training.  

- `cross_val_score(model, X, y, cv=loov, scoring='neg_mean_squared_error')`  
  - Trains and evaluates the model once per sample.  
  - Uses **Mean Squared Error (MSE)** as the evaluation metric; with a single test sample, each fold's score is just that sample's squared error.  
  - Returns **negative values** (`neg_mean_squared_error`) because scikit-learn scorers follow a "higher is better" convention.  
  - Negate the scores to recover the positive errors for interpretation.  

### 🔹 Pros & Cons
- **Pros:**  
  - Makes the most of limited data (training on `n-1` samples each time).  
  - Provides an almost unbiased estimate of generalization error.  
- **Cons:**  
  - Computationally expensive for large datasets.  
  - High variance, since each test set contains only one sample.
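
A small sketch of these points, assuming synthetic data from `make_regression` (the names `X_small`, `y_small`, `loo_demo`, and `neg_sq_err` are illustrative). It shows that LOOCV produces one score per sample, and that each score is the negated squared error of a model trained on the other `n-1` samples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption) so the n model fits stay cheap.
X_small, y_small = make_regression(
    n_samples=30, n_features=3, noise=10.0, random_state=1
)

loo_demo = LeaveOneOut()
neg_sq_err = cross_val_score(
    LinearRegression(), X_small, y_small, cv=loo_demo,
    scoring="neg_mean_squared_error",
)
print("number of fits:", len(neg_sq_err))  # one per sample

# Reproduce the first score by hand: train on n-1 samples, test on the one left out.
train_idx, test_idx = next(iter(loo_demo.split(X_small)))
manual_model = LinearRegression().fit(X_small[train_idx], y_small[train_idx])
manual_sq_err = (y_small[test_idx] - manual_model.predict(X_small[test_idx]))[0] ** 2

print(np.isclose(-neg_sq_err[0], manual_sq_err))
print(f"LOOCV MSE: {-neg_sq_err.mean():.2f}")
```
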


In [26]:
# Leave-One-Out Cross-Validation (LOOCV) using MSE as the evaluation metric
loov = LeaveOneOut()
scores_loo = cross_val_score(model, X, y, cv=loov, scoring='neg_mean_squared_error')

In [17]:
scores_loo

array([-1.90813344e+07, -6.98783138e+07, -1.71111519e+08, -1.09221529e+08,
       -1.14674213e+08, -6.01889586e+06, -4.66251791e+07, -4.62927712e+07,
       -1.29209742e+06, -2.61313089e+08, -1.77660007e+07, -2.63162094e+07,
       -6.69060015e+07, -4.92125237e+05, -1.70675160e+07, -7.26370217e+07,
       -2.80061974e+07, -2.01922144e+07, -6.64149684e+07, -3.32835746e+08,
       -7.00202568e+07, -1.31710779e+08, -4.68577878e+08, -1.29526691e+08,
       -9.55989540e+07, -1.76970986e+08, -2.50241101e+08, -1.46370069e+07,
       -5.50468473e+06, -1.35388570e+07, -5.06642132e+07, -9.85151763e+06,
       -5.83308605e+07, -7.45473794e+04, -1.17997613e+08, -1.10751043e+08,
       -3.56961648e+07, -8.10533610e+07, -2.67521104e+06, -8.83593348e+07,
       -1.70288038e+07, -1.71923270e+07, -1.26514451e+07, -9.05138194e+07,
       -2.45444722e+07, -1.32671164e+08, -2.28496437e+07, -4.78198390e+08,
       -1.41482010e+08, -1.40344552e+08, -2.01128547e+08, -4.83758085e+07,
       -1.57248611e+04, -

In [27]:
# Mean of the LOOCV scores (this is the negative MSE; negate it to get the positive MSE)
scores_loo.mean()

np.float64(-96790986.17648485)

## GroupKFold Validation

# 📌 Group K-Fold Cross-Validation with MSE Scoring

This snippet applies **Group K-Fold Cross-Validation** to evaluate a regression model:

- `GroupKFold(n_splits=5, shuffle=True, random_state=42)`  
  - Splits the dataset into 5 folds **based on group membership**.  
  - Ensures that the same group is **not split across training and test sets**.  
  - Useful when samples within the same group are correlated (e.g., patients, schools, neighborhoods).  
  - `shuffle` and `random_state` control how whole groups are assigned to folds. They are accepted only by recent scikit-learn releases; on older versions, use `GroupKFold(n_splits=5)`, where the groups alone define the split.  

- `cross_val_score(model, X, y, cv=gkf, groups=data['Neighborhood_Quality'], scoring='neg_mean_squared_error')`  
  - Uses the column `'Neighborhood_Quality'` to define groups.  
  - Evaluates the model with **Mean Squared Error (MSE)** as the performance metric.  
  - Returns **negative MSE values** because scikit-learn scorers follow a "higher is better" convention.  
  - Negate them to get positive MSE values (and take the square root for RMSE) for interpretation.  

### 🔹 Pros & Cons
- **Pros:** Prevents data leakage when observations within a group are not independent.  
- **Cons:** Requires a meaningful grouping variable; the number of folds cannot exceed the number of distinct groups, and folds can be unevenly sized when group sizes differ.
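
The no-overlap guarantee can be checked directly from the split indices. Below is a minimal sketch with made-up group labels (the arrays `X_grp` and `groups_demo` are assumptions and do not come from the housing data); it uses `GroupKFold(n_splits=5)` without shuffling so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up data (assumption): 10 groups labelled 1..10, 10 samples each.
groups_demo = np.repeat(np.arange(1, 11), 10)
X_grp = np.random.default_rng(0).normal(size=(len(groups_demo), 3))

gkf_demo = GroupKFold(n_splits=5)

overlaps = []    # groups that appear in both train and test, per fold
test_sizes = []
for fold, (train_idx, test_idx) in enumerate(
    gkf_demo.split(X_grp, groups=groups_demo)
):
    train_groups = set(groups_demo[train_idx])
    test_groups = set(groups_demo[test_idx])
    overlaps.append(train_groups & test_groups)
    test_sizes.append(len(test_idx))
    print(f"fold {fold}: test groups={sorted(test_groups)}, n_test={len(test_idx)}")

# Each fold's test groups never appear in its training set.
print("any overlap:", any(len(shared) > 0 for shared in overlaps))
```
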


In [28]:
# Group K-Fold Cross-Validation using MSE as the evaluation metric
# Assuming 'Neighborhood_Quality' is a column in your dataset that defines the groups
gkf = GroupKFold(n_splits=5, shuffle=True, random_state=42)
scores_gkf = cross_val_score(model, X, y, cv=gkf, groups=data['Neighborhood_Quality'], scoring='neg_mean_squared_error')

In [21]:
scores_gkf

array([-1.06244722e+08, -1.03998344e+08, -9.16591308e+07, -1.00277266e+08,
       -8.48172686e+07])

In [22]:
scores_gkf.mean()

np.float64(-97399346.3575216)