<a href="https://colab.research.google.com/github/cedamusk/AI-N-ML/blob/main/Ridge_and_Lasso_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Python Libraries
1. **Pandas**: A library for data manipulation and analysis. It provides data structures like `DataFrame` and functions to clean, aggregate, and process data.
2. **numpy**: A library for numerical computations. It supports large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.
3. **scikit-learn**: A machine learning library that provides simple and efficient tools for data mining, analysis and modelling, including classification, regression, clustering and dimensionality reduction.
4. **matplotlib**: A plotting library for creating static, animated, and interactive visualizations in Python.
5. **seaborn**: A statistical data visualization library built on top of Matplotlib, providing a high-level interface for creating attractive and informative graphics.

In [None]:
!pip install pandas numpy scikit-learn matplotlib seaborn

## Import Libraries
This cell imports the libraries and modules used for data manipulation, machine learning, preprocessing, evaluation, and visualization.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns



## Prepare Data
This function is designed to preprocess a dataset for machine learning tasks. It performs feature-target separation, data splitting, and feature scaling.

**Step-by-step Breakdown**:
1. **Input**: `df`, a pandas DataFrame containing the dataset.
2. **Feature-target separation**: `X=df.drop(['Y', 'Year'], axis=1)` drops the columns `Y` (the target variable) and `Year` (likely irrelevant or categorical data) to create the feature matrix `X`. `y=df['Y']` selects the column `Y` as the target variable.

3. **Data Splitting**: `train_test_split` splits the data into training and testing sets: `X_train` and `y_train` for training, `X_test` and `y_test` for testing. `test_size=0.2` allocates 20% of the data for testing. `random_state=42` ensures reproducibility of the split.

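4. **Feature Scaling**: `scaler=StandardScaler()` creates an instance of `StandardScaler`, which standardizes features by removing the mean and scaling to unit variance. `fit_transform` learns the mean and standard deviation from the training set only, and `transform` applies those same statistics to the test set, so no information from the test set influences the scaling.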

5. **Output**: Returns `X_train_scaled` (scaled training feature set), `X_test_scaled` (scaled testing feature set), `y_train` (training target values), `y_test` (testing target values), and `scaler` (the fitted scaler instance, useful for scaling new data later).
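
As a quick check of the scaling step, the sketch below fits `StandardScaler` on a small made-up training array and reuses it on a test array. The arrays and the `_demo` names are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from training rows
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses the training mean/std

print(X_train_demo_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_demo_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_demo_scaled[0])             # this row equals the training mean, so it maps to 0
```

The test rows are not guaranteed to have zero mean or unit variance after scaling, because the statistics come from the training rows alone.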

In [None]:
def prepare_data(df):
  X=df.drop(['Y', 'Year'], axis=1)
  y=df['Y']

  X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)

  scaler=StandardScaler()
  X_train_scaled=scaler.fit_transform(X_train)
  X_test_scaled=scaler.transform(X_test)

  return X_train_scaled, X_test_scaled, y_train, y_test, scaler

## Train and Evaluate models
This code defines a Python function, `train_and_evaluate_models`, to train and evaluate three regression models: **Linear Regression**, **Ridge Regression**, and **Lasso Regression**.
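
For reference, the objectives that scikit-learn's `Ridge` and `Lasso` minimize can be written as follows, using the notation of the scikit-learn documentation. Here $w$ is the coefficient vector, $n$ is the number of training samples, and $\alpha$ is the regularization strength; the symbols are introduced here for illustration and do not appear elsewhere in this notebook.

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Linear Regression corresponds to the unpenalized least-squares problem. The squared penalty in Ridge shrinks coefficients toward zero, while the absolute-value penalty in Lasso can set some coefficients exactly to zero.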

**Input Parameters**:

`X_train`: Training features

`X_test`: Testing features

`y_train`: Training target values

`y_test`: Testing target values

**Steps in the Function**:
1. **Model Initialization**: A dictionary `models` is created containing three regression models: Linear Regression, Ridge Regression (with `alpha=1.0` as the regularization parameter), and Lasso Regression (with `alpha=1.0` as the regularization parameter).

2. **Evaluation results storage**: An empty dictionary `results` is initialized to store the performance metrics for each model.

3. **Model Training and Evaluation**: The function iterates through each model in the `models` dictionary. Each model is trained using `model.fit(X_train, y_train)`, and predictions are made for both the training data (`y_train_pred`) and the testing data (`y_test_pred`). The following performance metrics are computed: **Mean Squared Error (MSE)** for training and testing; **Root Mean Squared Error (RMSE)** for training (the square root of the training MSE); **R^2 score** for training and testing; and **cross-validation** R^2 scores from 5-fold cross-validation on the training data, summarized by their mean and standard deviation. The results for each model are stored in the `results` dictionary, keyed by the model's name.

4. **Return value**: The function returns the `results` dictionary containing, for each model, the trained model, predictions for the training and testing datasets, the training RMSE, R^2 scores, and cross-validation statistics (mean and standard deviation of scores).
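
To make the metrics in step 3 concrete, here is a minimal sketch that computes MSE, RMSE, and R^2 by hand and compares them with the scikit-learn helpers. The `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

mse = np.mean((y_true - y_pred) ** 2)           # mean of squared errors
rmse = np.sqrt(mse)                             # same units as the target
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)  # 0.25 0.5 0.95, up to floating-point rounding

# The hand-computed values agree with the library functions
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

RMSE is often easier to interpret than MSE because it is expressed in the units of the target variable.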


In [None]:
def train_and_evaluate_models(X_train, X_test, y_train, y_test):
  models={
      'Linear Regression': LinearRegression(),
      'Ridge Regression': Ridge(alpha=1.0),
      "Lasso Regression": Lasso(alpha=1.0)
  }

  results={}

  for name, model in models.items():
    model.fit(X_train, y_train)

    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)

    train_mse=mean_squared_error(y_train, y_train_pred)
    test_mse=mean_squared_error(y_test, y_test_pred)
    train_r2=r2_score(y_train, y_train_pred)
    test_r2=r2_score(y_test, y_test_pred)

    cv_scores=cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

    results[name]={
        'model':model,
        'train_predictions': y_train_pred,
        'test_predictions': y_test_pred,
        'train_rmse': np.sqrt(train_mse),
        'train_r2': train_r2,
        'test_r2': test_r2,
        'cv_scores_mean':cv_scores.mean(),
        'cv_scores_std': cv_scores.std()
    }

  return results

## Feature Importance
The `plot_feature_importance` function is used to visualize the importance of features in a regression model.

### Functionality
1. **Input Parameters**

  `model`: A trained regression model with a `coef_` attribute (e.g., Linear Regression, Ridge, or Lasso).

  `feature_names`: A list or array containing the names of the features corresponding to the model's coefficients.

2. **Feature importance DataFrame**:
  The function creates a pandas DataFrame named `importance` with:
    
    `feature`: Names of the features.
    `coefficient`: Absolute values of the coefficients (`np.abs(model.coef_)`), representing the magnitude of each feature's contribution to the prediction.

3. **Sorting by Importance**: The features are sorted by the absolute value of their coefficients in descending order (`ascending=False`).

4. **Bar Plot**: A bar plot is created using `matplotlib` to visualize the feature importance:

    X-axis: Feature names.
    Y-axis: Absolute coefficient values.
    The plot is formatted with axis labels, X-axis tick labels rotated for readability, and a title.

5. **Display**: The `plt.show()` function is used to display the plot.
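
The ranking by absolute coefficient can also be inspected without a plot. The sketch below fits `Ridge` and `Lasso` on synthetic data and prints the features sorted by coefficient magnitude. The data-generating process, the `x0`..`x4` feature names, and the `_demo` variables are assumptions made for this example; the features are drawn on a common scale so the magnitudes are comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
# Assumed data-generating process: only the first two features affect the target
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + rng.standard_normal(200)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

feature_names_demo = [f"x{i}" for i in range(5)]
for label, fitted in [("Ridge", ridge_demo), ("Lasso", lasso_demo)]:
    magnitudes = np.abs(fitted.coef_)
    order = np.argsort(magnitudes)[::-1]  # largest magnitude first
    ranking = [(feature_names_demo[i], round(float(magnitudes[i]), 3)) for i in order]
    print(label, ranking)
```

In a setup like this one, Lasso typically drives the coefficients of the uninformative features to exactly zero, while Ridge keeps them small but nonzero.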

In [None]:
def plot_feature_importance(model, feature_names):
  importance=pd.DataFrame({
      'feature': feature_names,
      'coefficient': np.abs(model.coef_)
  })

  importance=importance.sort_values('coefficient', ascending=False)

  plt.figure(figsize=(12, 6))
  plt.bar(importance['feature'], importance['coefficient'])
  plt.xticks(rotation=45)
  plt.title('Feature Importance')
  plt.xlabel('Features')
  plt.ylabel("Absolute Coefficient Value")
  plt.tight_layout()
  plt.show()

In [None]:
def plot_predictions_vs_actual(results, y_train, y_test):
  fig, axes=plt.subplots(len(results), 2, figsize=(15, 5*len(results)))
  fig.suptitle('Predicted Vs Actual Values for All Models', y=1.02, fontsize=16)

  for i, (name, metrics) in enumerate(results.items()):
    axes[i, 0].scatter(y_train, metrics['train_predictions'], alpha=0.5)
    axes[i, 0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
    axes[i, 0].set_title(f'{name}-Training Set')
    axes[i, 0].set_xlabel("Actual Values")
    axes[i, 0].set_ylabel('Predicted Values')

    axes[i,1].scatter(y_test, metrics['test_predictions'], alpha=0.5)
    axes[i,1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    axes[i,1].set_title(f'{name}-Test Set')
    axes[i,1].set_xlabel("Actual Values")
    axes[i, 1].set_ylabel("Predicted Values")

  plt.tight_layout()
  plt.show()


## Predictions Vs Actual
The `plot_predictions_vs_actual` function generates visualizations to compare actual vs. predicted values for both the training and testing datasets across multiple regression models.

### Functionality
1. **Input parameters**:

    `results`: A dictionary containing model evaluation results. Each entry corresponds to a model and includes `train_predictions` and `test_predictions`.

    `y_train`: The actual target values for the training dataset.

    `y_test`: The actual target values for the testing dataset.

2. **Figure and Subplots**:

    A `matplotlib` figure with subplots is created. The number of rows equals the number of models in `results`. Each model has two columns. Column 1: training set (actual vs. predicted values). Column 2: testing set (actual vs. predicted values).

3. **Scatter plots**: For each model, two scatter plots are created:

  **Training set**: `y_train` (actual) vs. `metrics['train_predictions']` (predicted).

  **Testing set**: `y_test` (actual) vs. `metrics['test_predictions']` (predicted).

  A reference line (`r--`, dashed red) is plotted on both scatter plots to represent the ideal case where predicted values perfectly match actual values.

4. **Title, labels and layout**: Each plot is titled with the model's name and dataset type (Training/Test). Axes are labelled for clarity. The layout is adjusted using `plt.tight_layout()` to ensure proper spacing.

5. **Display**: The plots are displayed using `plt.show()`.

In [None]:
def plot_residuals(results, y_train, y_test):
  fig, axes=plt.subplots(len(results), 2, figsize=(15, 5*len(results)))
  fig.suptitle('Residual Plots for All Models', y=1.02, fontsize=16)

  for i, (name, metrics) in enumerate(results.items()):

    train_residuals=y_train-metrics['train_predictions']
    axes[i, 0].scatter(metrics['train_predictions'], train_residuals, alpha=0.5)
    axes[i, 0].axhline(y=0, color='r', linestyle='--')
    axes[i, 0].set_title(f'{name}-Training Set Residuals')
    axes[i, 0].set_xlabel('Predicted Values')
    axes[i, 0].set_ylabel("Residuals")

    test_residuals=y_test-metrics['test_predictions']
    axes[i, 1].scatter(metrics['test_predictions'], test_residuals, alpha=0.5)
    axes[i, 1].axhline(y=0, color='r', linestyle='--')
    axes[i, 1].set_title(f'{name}-Test Set Residuals')
    axes[i, 1].set_xlabel('Predicted Values')
    axes[i, 1].set_ylabel('Residuals')

  plt.tight_layout()
  plt.show()

## Error Distribution
The `plot_error_distribution` function generates histograms of prediction errors for both the training and testing datasets across multiple regression models. This visualization shows how well the models are predicting and whether there are systematic biases or outliers in the predictions.

### Functionality
1. **Input Parameters**:

    `results`: A dictionary containing evaluation results for multiple models. Each entry includes `train_predictions` and `test_predictions`.

    `y_train`: The actual target values for the training dataset.

    `y_test`: The actual target values for the testing dataset.

2. **Figure and Subplots**:

  A `matplotlib` figure with subplots is created:
    
      The number of rows equals the number of models in `results`. Each model has two columns:
        Column 1: Training set error distribution
        Column 2: Testing set error distribution

3. **Error Calculation**:

  For each model:

      Training errors: Error = y_train - train_predictions

      Testing errors: Error = y_test - test_predictions


4. **Error distribution plots**:

Histograms are plotted for both training and testing errors using `sns.histplot` (from the Seaborn library).

Kernel Density Estimation (KDE) is enabled (`kde=True`) to visualize the underlying distribution smoothly.

5. **Titles, labels and layout**:
  Each plot is titled with the model's name and dataset type (Training/Test). The X-axis is labelled "Prediction Error". The layout is adjusted using `plt.tight_layout()` for better spacing.

6. **Display**:
The plots are displayed using `plt.show()`.
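
The error calculation in step 3 can also be summarized numerically before plotting. The following sketch uses made-up targets and predictions, with a deliberate bias added to the predictions, to show how a shifted error distribution reveals systematic over-prediction. All names and values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical targets and predictions; the predictions carry a deliberate +0.5 bias
y_actual = rng.normal(loc=10.0, scale=2.0, size=500)
y_predicted = y_actual + 0.5 + rng.normal(scale=1.0, size=500)

errors = y_actual - y_predicted            # same sign convention as the notebook
counts, bin_edges = np.histogram(errors, bins=10)

print(f"mean error: {errors.mean():.3f}")  # a negative mean signals over-prediction
print(f"std of error: {errors.std():.3f}")
print(counts)                              # bar heights a histogram would draw
```

An error distribution centered near zero suggests no systematic bias, while a long tail on one side points to outliers in that direction.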

In [None]:
def plot_error_distribution(results, y_train, y_test):
  fig, axes=plt.subplots(len(results), 2, figsize=(15, 5*len(results)))
  fig.suptitle('Error Distribution for All Models', y=1.02, fontsize=16)

  for i, (name, metrics) in enumerate(results.items()):
    train_errors=y_train-metrics['train_predictions']
    sns.histplot(train_errors, kde=True, ax=axes[i, 0])
    axes[i, 0].set_title(f'{name}- Training Error Distribution')
    axes[i, 0].set_xlabel('Prediction Error')

    test_errors=y_test-metrics['test_predictions']
    sns.histplot(test_errors, kde=True, ax=axes[i, 1])
    axes[i, 1].set_title(f'{name}- Test Error Distribution')
    axes[i, 1].set_xlabel('Prediction Error')

  plt.tight_layout()
  plt.show()

## Ridge and Lasso Regression
This script combines the previous functions to perform end-to-end data preparation, model training, evaluation, and visualization for a regression task.
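
If the CSV file is not at hand, a stand-in DataFrame with the `Y` and `Year` columns that `prepare_data` expects can be generated for experimentation. This is only a sketch: the feature names `X1` to `X3`, the coefficients, the year range, and the `df_demo` variable are all hypothetical and say nothing about the contents of the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100

# Hypothetical stand-in data: feature names and coefficients are made up
df_demo = pd.DataFrame({
    "Year": np.arange(1920, 1920 + n_rows),
    "X1": rng.normal(size=n_rows),
    "X2": rng.normal(size=n_rows),
    "X3": rng.normal(size=n_rows),
})
# Assumed relationship: Y depends on X1 and X2 only, plus noise
df_demo["Y"] = 4.0 * df_demo["X1"] - 2.0 * df_demo["X2"] + rng.normal(scale=0.5, size=n_rows)

print(df_demo.head())
print(df_demo.drop(["Y", "Year"], axis=1).columns.tolist())  # the feature columns
```

Passing `df_demo` to `prepare_data` in place of the loaded CSV should then exercise the rest of the pipeline.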

In [None]:
df=pd.read_csv('synthetic_ridge_lasso_data.csv')
X_train, X_test, y_train, y_test, scaler=prepare_data(df)

results=train_and_evaluate_models(X_train, X_test, y_train, y_test)

for name, metrics in results.items():
  print(f"\n{name} Results:")
  print(f"Training RMSE: {metrics['train_rmse']:.4f}")
  print(f"Training R2 Score: {metrics['train_r2']:.4f}")
  print(f"Test R2 Score: {metrics['test_r2']:.4f}")
  print(f"Cross-validation R2(mean +/- std): {metrics['cv_scores_mean']:.4f}+/-{metrics['cv_scores_std']:.4f}")

# Plot once, after all metrics have been printed (not once per model)
plot_feature_importance(results['Linear Regression']['model'], df.drop(["Y", "Year"], axis=1).columns)
plot_predictions_vs_actual(results, y_train, y_test)
plot_residuals(results, y_train, y_test)
plot_error_distribution(results, y_train, y_test)