# Notebook Description: Model Comparison for Regression Techniques

## Project Overview

This notebook compares regression techniques for predicting student performance from demographic and academic factors. The goal is to evaluate each model with the same error metric and identify the most suitable one for the task.

## Context

The notebook begins by importing the libraries needed for data manipulation (pandas, NumPy) and visualization (matplotlib). It also imports one local module per regression technique: linear regression, regularized linear regression, regularized linear regression with a bias term, and Bayesian linear regression.

## Data Preprocessing

The dataset is loaded with pandas from a CSV file (`train.csv`) containing student performance data. Preprocessing partitions the data into an 80% training set and a 20% testing set and normalizes the values using mean and standard deviation.
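
The notebook's `preprocessing` module is not shown here, so the following is only a minimal sketch of what an 80/20 split followed by mean and standard deviation normalization could look like. The function names `split_80_20` and `standardize`, the synthetic data, and the choice to compute the statistics on the training set only are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Shuffle the rows and split them into 80% train / 20% test (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def standardize(train_X, test_X):
    """Scale both sets with the training set's mean and standard deviation."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train_X - mean) / std, (test_X - mean) / std

# Synthetic stand-in data; the real notebook reads 'train.csv' instead.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])

train_X, train_y, test_X, test_y = split_80_20(X, y)
train_Xn, test_Xn = standardize(train_X, test_X)
print(train_X.shape, test_X.shape)
```

Using the training statistics for both sets keeps information from the test set out of the fitted models; whether the notebook does the same cannot be confirmed from the code shown.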

## Model Comparison

Four different regression techniques are compared:
1. **Linear Regression (1b)**: Basic linear regression model without regularization.
2. **Regularized Linear Regression (1c)**: Linear regression model with regularization (e.g., Lasso or Ridge) to prevent overfitting.
3. **Regularized Biased Linear Regression (1d)**: Linear regression model with regularization and an added bias (intercept) term, so the fitted function is not forced through the origin.
4. **Bayesian Linear Regression (1e)**: Probabilistic approach to linear regression using Bayesian inference.
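
The modules `1b` to `1e` are not included in this notebook, so the sketch below only illustrates how the four techniques above are commonly written in closed form with NumPy. It assumes an L2 (ridge) penalty, an unpenalized intercept, and a zero-mean Gaussian prior with precision `alpha` and noise precision `beta`; the function names `fit_linear` and `bayesian_posterior` and all numeric values are hypothetical and may differ from the notebook's own implementations.

```python
import numpy as np

def fit_linear(X, y, lam=0.0, bias=False):
    """Closed-form least squares; lam > 0 adds an L2 (ridge) penalty."""
    if bias:
        # Prepend a column of ones so the first weight acts as the intercept.
        X = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = lam * np.eye(X.shape[1])
    if bias:
        penalty[0, 0] = 0.0  # assumption: the intercept is not regularized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def bayesian_posterior(X, y, alpha=1.0, beta=1.0):
    """Posterior mean and covariance of the weights under a N(0, I/alpha) prior."""
    precision = alpha * np.eye(X.shape[1]) + beta * (X.T @ X)
    cov = np.linalg.inv(precision)
    mean = beta * cov @ X.T @ y
    return mean, cov

# Synthetic data with a true intercept of 2.0 (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + 2.0 + rng.normal(scale=0.1, size=100)

w_b = fit_linear(X, y)                                    # 1b: plain least squares
w_c = fit_linear(X, y, lam=1.0)                           # 1c: ridge penalty
w_d = fit_linear(X, y, lam=1.0, bias=True)                # 1d: ridge penalty + intercept
m_e, S_e = bayesian_posterior(X, y, alpha=2.0, beta=4.0)  # 1e: Bayesian posterior
```

Under these assumptions the Bayesian posterior mean coincides with the ridge solution for `lam = alpha / beta`, which is one reason the regularized and Bayesian models can produce very similar predictions.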

For each technique, the notebook computes the Root Mean Squared Error (RMSE) and generates predictions (`yhat`) for the testing set. These predictions are then plotted against the ground truth values to visually compare the performance of each model.
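
As a reference for the metric, RMSE is the square root of the mean squared difference between the ground truth and the predictions. The helper below is a minimal sketch; the name `rmse` is hypothetical, and each of the notebook's modules computes its own value internally.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors are 1, 0 and -2, so the mean squared error is 5/3.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

A lower RMSE means the predictions are closer to the ground truth on average, in the same units as the target.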

## Results Visualization

A comparison plot (`plot_1f`) shows the predictions of all four models alongside the ground truth values for the testing set. Each legend entry includes the model's RMSE rounded to two decimals, so the best-performing model can be identified directly from the plot.

## Coded by

Gaddisa Olani (gaddisaolex@gmail.com)


In [4]:
import pandas as pd
from matplotlib import pyplot
import numpy as np
import LinearRegression_1b as one_b
import RegularizedLinearRegression_1c as one_c
import RegularizedBiasedLinearRegression_1d as one_d
import BayesianLinearRegression_1e as one_e
# Preprocessing helpers, including normalization using mean and standard deviation
from preprocessing import *

# Plot the comparison of all models from 1b, 1c, 1d and 1e
def plot_1f(yhat_b,yhat_c,yhat_d,yhat_e):
    # test_set_y and rmse_b..rmse_e are read from the notebook's global scope
    x=np.arange(len(test_set_y))
    pyplot.style.use('fivethirtyeight')
    pyplot.figure(figsize=(12,9))
    #pyplot.ylim(-50,50)
    pyplot.plot(x, test_set_y,label='Ground Truth')
    pyplot.plot(x, yhat_b, label='('+str(rmse_b.round(decimals=2))+') Linear Regression')
    pyplot.plot(x, yhat_c,label='('+str(rmse_c.round(decimals=2))+') Linear Regression (with reg)')
    pyplot.plot(x, yhat_d, label='('+str(rmse_d.round(decimals=2))+') Linear Regression (r/b)')
    pyplot.plot(x, yhat_e, label='('+str(rmse_e.round(decimals=2))+') Bayesian Linear Regression (r/b)')

    pyplot.xlabel('Sample Index')
    pyplot.ylabel('Values')
    pyplot.title('Comparison of Linear Regression Models (1f)')
    pyplot.legend(loc="best")


In [None]:
if __name__ == '__main__':
    # Load the data; it is partitioned into an 80% training set and a 20% test set
    # (total parameters = 24)
    datasets = pd.read_csv('train.csv')
    # Uncomment to set a seed and get the same random allocation on each run
    #np.random.seed(7)
    preprocessed=preprocessing(datasets)
    train_set_x,train_set_y,test_set_x,test_set_y=train_test_split_preprocessed(preprocessed)
    
    #solution1: Normal Linear Regression (1b)
    rmse_b,yhat_b=one_b.linear_regression_1b(train_set_x,train_set_y,test_set_x,test_set_y)

    #solution2: Regularized Linear Regression (1c)
    rmse_c,yhat_c=one_c.Regularized_LinearRegression_1c(train_set_x,train_set_y,test_set_x,test_set_y)
    
    #solution3: Regularized Linear Regression with bias term (1d)
    rmse_d,yhat_d=one_d.Regularized_Biased_LinearRegression_1d(train_set_x,train_set_y,test_set_x,test_set_y)
    
    #solution4: Bayesian Linear Regression (1e)
    rmse_e,yhat_e=one_e.Bayesian_LinearRegression_1e(train_set_x,train_set_y,test_set_x,test_set_y)
    #print(rmse_e)
    
    #plot the comparison 1f
    plot_1f(yhat_b,yhat_c,yhat_d,yhat_e)