# TODAY
* Regression
* Evaluation metrics for regression


# Regression with Scikit-learn
Scikit-learn is a widely used machine learning library that provides simple and efficient tools for data analysis and modeling, including various regression algorithms.
* Key features for regression: linear regression, Ridge regression, Lasso regression, polynomial regression (via polynomial feature expansion), and more.

**Example**: *We want to predict house prices from features such as the number of bedrooms, size in square feet, and location.*

## Evaluation Metrics for Regression

When evaluating regression models, several metrics can help you understand how well the model is performing. Here are some commonly used metrics for regression evaluation:

### 1. Mean Absolute Error (MAE)
- **Formula**:
  $
  \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
  $
- **Description**: MAE measures the average absolute difference between the actual and predicted values. It is expressed in the same units as the target, weights every error equally, and is less sensitive to outliers than MSE.

### 2. Mean Squared Error (MSE)
- **Formula**:
  $
  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  $
- **Description**: MSE calculates the average of the squares of the errors, giving higher weight to larger errors. This metric is useful when you want to penalize larger errors more severely.

### 3. Root Mean Squared Error (RMSE)
- **Formula**:
  $
  \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
  $
- **Description**: RMSE is the square root of MSE, so it is expressed in the same units as the target variable. Like MSE, it penalizes large errors more heavily than MAE does and is therefore sensitive to outliers.
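
To make the three formulas concrete, the sketch below computes MAE, MSE, and RMSE by hand with NumPy. The arrays `y_true` and `y_pred` are made-up illustrative values, not taken from the housing dataset.

```python
import numpy as np

# Hypothetical ground-truth and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true - y_pred        # [1, 0, -2, 3]

mae = np.mean(np.abs(errors))   # (1 + 0 + 2 + 3) / 4
mse = np.mean(errors ** 2)      # (1 + 0 + 4 + 9) / 4
rmse = np.sqrt(mse)             # square root of the MSE

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

The single error of 3 contributes 9 of the 14 squared-error units, which is why MSE and RMSE react more strongly to large errors than MAE does.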



In [1]:
# Upgrade scikit-learn to the latest version
!pip install --upgrade scikit-learn



In [2]:
import sklearn
from sklearn import datasets  # Import a built-in dataset from scikit-learn
import numpy as np

# Load the California housing dataset
housing_dataset = datasets.fetch_california_housing()
print(housing_dataset.keys())  # Print the keys of a "housing_dataset" object


dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [None]:
print(housing_dataset["DESCR"]) #Print the description of the dataset

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [4]:
# Access the data, targets, and other information in the dataset.
X = housing_dataset.data
y = housing_dataset.target
feature_names = housing_dataset.feature_names

In [5]:
print(X.shape) # Print the shape of the array X

(20640, 8)


The output `(20640, 8)` means that `X` has 20640 rows and 8 columns: 20640 is the number of samples in the housing dataset, and 8 is the number of features per sample.

In [6]:
# Important: split the entire dataset into (a) a training set and (b) a test set.
from sklearn.model_selection import train_test_split

# Here, 25% of the data is assigned to the test set.
# The "random_state" argument makes the random split reproducible.
# The value 24 is arbitrary and could be replaced with any other integer.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=24)

In [7]:
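# Important: split the training set further into (a) a smaller training set and (b) a validation set.
# 25% of the remaining training data is held out for validation (18.75% of the full dataset).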
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=32)
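
The two consecutive splits above leave three disjoint subsets. The sketch below reproduces the same calls on placeholder arrays (the names `X_demo`, `y_demo`, and the `_tr`/`_va`/`_te` suffixes are introduced here only for illustration) to show how many rows end up in each subset without downloading the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the housing data (values are irrelevant here)
n_samples = 20640
X_demo = np.zeros((n_samples, 8))
y_demo = np.zeros(n_samples)

# First split: 75% train / 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=24)
# Second split: 25% of the remaining training data becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.25, random_state=32)

for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(f"{name}: {part.shape[0]} rows ({part.shape[0] / n_samples:.2%})")
```

The model is fitted on the training rows only; the validation rows are used to compare models, and the test rows are kept aside for a final check.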

In [8]:
# Import the "LinearRegression" class from the "linear_model" module of the scikit-learn library.
# Linear Regression is a machine learning algorithm used for regression problems.
# This model predicts a continuous target variable by finding the linear relationship between the input features and the target.

from sklearn.linear_model import LinearRegression


In [9]:
# Specify the desired parameters for Linear Regression
model = LinearRegression()  # The model is initialized for regression without any specific parameters.

In [10]:
# Fit model to data: the model will be trained using the training data X_train and the target values y_train.
model.fit(X_train, y_train)  # The Linear Regression model is trained on the provided training dataset.
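
After fitting, the learned linear relationship is stored in the `coef_` and `intercept_` attributes. As a minimal sketch on synthetic data (the arrays and the name `demo_model` are assumptions made for this illustration), a noise-free target built as `3*x1 - 2*x2 + 1` should be recovered almost exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0  # no noise added

demo_model = LinearRegression().fit(X_demo, y_demo)

print("Coefficients:", demo_model.coef_)    # one weight per feature
print("Intercept:", demo_model.intercept_)  # constant offset
```

For the housing model trained above, `model.coef_` holds one weight per entry in `feature_names`.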

In [None]:
# Validation
# "model" represents the Linear Regression model that we trained earlier.
# predict() is a method to make predictions using the trained model.
# The array pred_val will contain the model's predictions for the validation data (X_val).

pred_val = model.predict(X_val)  # The model predicts the target values for the validation dataset.


In [12]:
pred_val

array([2.31117235, 1.82298517, 1.81575386, ..., 2.94968042, 0.40382384,
       2.55711476])

In [13]:
# Ground Truth Comparison
# Here, we will compare the predicted values from the Linear Regression model
# with the actual target values (ground truth) from the validation set.

# Assume y_val contains the actual target values for the validation data.
# Let's print both the predicted values and the actual ground truth values.

# Print the predicted values
print("Predicted Values:")
print(pred_val)

# Print the actual ground truth values
print("\nGround Truth Values:")
print(y_val)

# Optionally, we can also list both in a more structured way
import pandas as pd

# Create a DataFrame for better visualization
comparison_df = pd.DataFrame({
    'Predicted': pred_val,
    'Ground Truth': y_val
})

# Display the comparison DataFrame
print("\nComparison of Predicted vs Ground Truth:")
print(comparison_df.head(20))  # Display the first few rows of the comparison (20 in this case)


Predicted Values:
[2.31117235 1.82298517 1.81575386 ... 2.94968042 0.40382384 2.55711476]

Ground Truth Values:
[2.75  2.265 1.639 ... 2.172 0.922 4.714]

Comparison of Predicted vs Ground Truth:
    Predicted  Ground Truth
0    2.311172       2.75000
1    1.822985       2.26500
2    1.815754       1.63900
3    1.472728       0.87100
4    2.206530       2.86300
5    2.128606       2.28500
6    1.429184       0.47500
7    6.220995       5.00001
8    1.152889       0.90500
9    3.141569       2.02300
10   1.020152       1.00000
11   2.454536       1.82300
12   2.837636       2.39700
13   2.159156       2.31400
14   2.497055       5.00001
15   0.474425       0.73400
16   2.155573       2.83500
17   1.220352       0.90800
18   1.256383       0.83900
19   2.735125       3.55300


In [14]:
# Import the regression metrics from scikit-learn
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score

# y_val contains the ground truth values and pred_val contains the predictions
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_val, pred_val)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_val, pred_val)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Calculate Root Mean Squared Error (RMSE) as the square root of MSE.
# The "squared=False" argument of mean_squared_error is deprecated in recent
# scikit-learn releases, so taking the square root works across versions.
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")

# Calculate R-squared (R2) Score
r2 = r2_score(y_val, pred_val)
print(f"R-squared (R2) Score: {r2:.4f}")

# Calculate Explained Variance Score
explained_variance = explained_variance_score(y_val, pred_val)
print(f"Explained Variance Score: {explained_variance:.4f}")


Mean Absolute Error (MAE): 0.5316
Mean Squared Error (MSE): 0.5201
Root Mean Squared Error (RMSE): 0.7212
R-squared (R2) Score: 0.6091
Explained Variance Score: 0.6092
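
R-squared and explained variance are not defined in the metrics section above, so here is a short sketch of how they can be computed by hand. R² is one minus the ratio of the residual sum of squares to the total sum of squares; explained variance is one minus the ratio of the residual variance to the target variance. The two differ only when the residuals have a non-zero mean, which is consistent with the nearly identical values in the output above. The arrays below are made-up illustrative values.

```python
import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

# Hypothetical ground-truth and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])
residuals = y_true - y_pred

# R^2 = 1 - SS_res / SS_tot
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance = 1 - Var(residuals) / Var(y_true)
ev_manual = 1 - np.var(residuals) / np.var(y_true)

print(f"R2 (manual):  {r2_manual:.4f}  | sklearn: {r2_score(y_true, y_pred):.4f}")
print(f"EV (manual):  {ev_manual:.4f}  | sklearn: {explained_variance_score(y_true, y_pred):.4f}")
```

In this toy example the residuals average 0.5 rather than 0, so explained variance comes out higher than R².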




## Exercise: Comparing Linear Regression, Polynomial Regression, and Ridge Regression

### Objective:
In this exercise, you will compare the performance of three different regression models on the housing dataset: Linear Regression, Polynomial Regression, and Ridge Regression. You will evaluate each model using several regression metrics and analyze the results.

### Instructions:

1. **Load the Housing Dataset:**
   - Import the necessary libraries and load the housing dataset using `sklearn.datasets`.

2. **Data Splitting:**
   - Split the dataset into training and testing sets using a 50%-50% ratio. You can use `train_test_split` from `sklearn.model_selection`.
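   - Fix the random state so that each split is reproducible. Because the steps below are repeated 5 times, use a different (but fixed) random state for each repetition; with an identical split, these deterministic models would give identical results every time.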

3. **Model Training and Evaluation:**
   - Repeat the following steps 5 times:
     - **Train the Models:**
       - Train a Linear Regression model on the training set.
       - Train a Polynomial Regression model (you may need to use `PolynomialFeatures` from `sklearn.preprocessing`).
       - Train a Ridge Regression model (use `Ridge` from `sklearn.linear_model`).
     - **Make Predictions:**
       - Use the trained models to make predictions on the test set.
     - **Evaluate the Models:**
       - Calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R²) Score, and Explained Variance Score for each model.
       - Store the results in a structured format (e.g., a list or a DataFrame).

4. **Results Comparison:**
   - After completing the evaluations for all 5 iterations, calculate the average values of the evaluation metrics for each model.
   - Create a summary table comparing the average performance of Linear Regression, Polynomial Regression, and Ridge Regression.

5. **Analysis:**
   - Analyze and discuss the results:
     - Which model performed the best based on the evaluation metrics?
     - How does polynomial regression compare to linear regression?
     - What effect does Ridge regression have on the model performance?
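
As a starting hint for the polynomial model, one common pattern is to chain `PolynomialFeatures` and `LinearRegression` in a pipeline. The sketch below uses synthetic one-dimensional data with a quadratic relationship, not the housing dataset; the degree of 2, the variable names, and the data are assumptions chosen for illustration, and this is not a solution to the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y depends on x squared, plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# Plain linear model versus linear model on degree-2 polynomial features
linear = LinearRegression().fit(X_tr, y_tr)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

r2_linear = r2_score(y_te, linear.predict(X_te))
r2_poly = r2_score(y_te, poly.predict(X_te))
print(f"Linear R2:     {r2_linear:.4f}")
print(f"Polynomial R2: {r2_poly:.4f}")
```

A `Ridge` estimator can be substituted for `LinearRegression` in the same way; its `alpha` parameter controls the regularization strength.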
