In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Import clean data 
path = 'https://raw.githubusercontent.com/andresmorenoviteri/ML-models/main/CarPrice_Assignment.csv'
df = pd.read_csv(path)
df.head()

First, let's only use numeric data:

In [None]:
# Keep only the numeric columns (select_dtypes is the public pandas API for this)
df = df.select_dtypes(include='number')
df.head()

## Part 1: Training and Testing

The first step in training and testing a model is to split the data into a training set and a testing set. Since our target is to predict the 'price', we will name it **y_data**, and the independent variables (the features) will be named **x_data**.

Since we want all the other parameters in df except 'price', we can drop it from the dataframe using:

    df.drop('parameter', axis=1)
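
As a minimal sketch of this target/feature separation, the toy frame below uses made-up values; only the column names echo the car data, and `toy`, `x_toy` and `y_toy` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy frame: the values are invented for illustration
toy = pd.DataFrame({
    "enginesize": [97, 130, 152, 109],
    "horsepower": [69, 111, 154, 102],
    "price": [6500.0, 13500.0, 16500.0, 13950.0],
})

y_toy = toy["price"]                # target: a single column (Series)
x_toy = toy.drop("price", axis=1)   # features: every column except the target

print(x_toy.columns.tolist())
print(y_toy.shape)
```

Note that `drop` returns a new dataframe by default, so the original frame still contains the target column.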


In [None]:
y_data = ____

x_data = ____

Now, we randomly split our data into training and testing data using the function **train_test_split**. 

In [None]:
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = ____

print("training samples:",x_train.shape[0])
print("test samples:", x_test.shape[0])

The **test_size** parameter sets the proportion of the data that is held out for testing.
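
To see how **test_size** affects the split, here is a small sketch on synthetic arrays. The array names, the 25% split and the `random_state` value are illustrative assumptions, not the settings expected in the exercise:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Hold out 25% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print("training samples:", X_tr.shape[0])
print("test samples:", X_te.shape[0])
```

With 20 rows and `test_size=0.25`, 5 rows go to the test set and 15 to the training set.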

Let's import **LinearRegression** from the module **linear_model**.


In [None]:
from sklearn.linear_model import LinearRegression

We want to start with a **simple linear regression**, therefore we check which of our features has the highest correlation with the 'price'

In [None]:
df.____

We can see that the parameter 'enginesize' has the highest correlation with 'price'.
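
One possible way to rank features by their correlation with the target is sketched below on synthetic data. The columns `size` and `noise` are invented for illustration; sorting by absolute value is a design choice, since a strongly negative correlation is as informative as a positive one:

```python
import numpy as np
import pandas as pd

# Synthetic data: 'price' depends strongly on 'size' and not at all on 'noise'
rng = np.random.default_rng(0)
n = 200
size = rng.normal(120, 30, n)
toy = pd.DataFrame({
    "size": size,
    "noise": rng.normal(0, 1, n),
    "price": 100 * size + rng.normal(0, 500, n),
})

# Correlation of every feature with the target, strongest first
corr_with_price = (
    toy.corr()["price"].drop("price").abs().sort_values(ascending=False)
)
print(corr_with_price)
```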

We create a Linear Regression object:

In [None]:
lre=____

We fit the model using the feature "enginesize":

In [None]:
lre.____

Let's calculate the R^2 on the test data:

In [None]:
lre.____

Let's calculate the R^2 on the train data:

In [None]:
lre.____

R^2 for the training data is bigger than for the test data.
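
The fit-and-score workflow can be sketched on synthetic data as follows. The data, the 30% split and all variable names are assumptions made for illustration. For a regressor, `score` returns R^2, which the last line cross-checks against `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic single-feature data: a linear trend plus noise
rng = np.random.default_rng(0)
size_demo = rng.uniform(60, 300, size=(200, 1))
price_demo = 150 * size_demo[:, 0] + rng.normal(0, 3000, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    size_demo, price_demo, test_size=0.3, random_state=0
)

demo_model = LinearRegression().fit(X_tr, y_tr)
r2_train = demo_model.score(X_tr, y_tr)   # R^2 on the data the model has seen
r2_test = demo_model.score(X_te, y_te)    # R^2 on held-out data

print("train R^2:", round(r2_train, 3))
print("test R^2:", round(r2_test, 3))
print("same as r2_score:", np.isclose(r2_test, r2_score(y_te, demo_model.predict(X_te))))
```

The training R^2 is usually, though not always, the higher of the two; with a small test set the ordering can flip by chance.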

## Cross-Validation Score

Let's import **cross_val_score** from the module **model_selection**.


In [None]:
from sklearn.model_selection import cross_val_score

We input the object, the feature 'enginesize', and the target data 'y_data'. The parameter 'cv' determines the number of folds. Let's test 4.

In [None]:
Rcross = ____

The default scoring for a regressor is R^2. Each element in the array is the R^2 value for one fold:

In [None]:
Rcross

We can calculate the average and standard deviation of our estimate:

In [None]:
print(f"The mean of the folds is {round(Rcross.mean(), 2)} and the standard deviation is {round(Rcross.std(), 2)}")
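
The cross-validation steps above can be sketched end to end on synthetic data. The data and the names `X_demo`, `y_demo` and `scores` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic single-feature regression problem
rng = np.random.default_rng(1)
X_demo = rng.uniform(60, 300, size=(120, 1))
y_demo = 150 * X_demo[:, 0] + rng.normal(0, 3000, 120)

# cv=4 trains four models, each scored on a different held-out quarter
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)

print(scores.shape)                  # one R^2 value per fold
print(scores.mean(), scores.std())
```

For a regressor, an integer `cv` uses consecutive, unshuffled folds, so data that is sorted in some meaningful order can produce unrepresentative folds.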

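We can use the negative mean squared error as a score by setting the 'scoring' parameter to 'neg_mean_squared_error'. Scikit-learn negates the error so that a higher score is always better; multiplying by -1 recovers the usual mean squared error.

In [None]:
-1 * cross_val_score(lre, x_data[['enginesize']], y_data, cv=4, scoring='neg_mean_squared_error')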

You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds, using one fold for testing and the remaining folds for training. First, import the function:

In [None]:
from sklearn.model_selection import cross_val_predict

We input the object, the feature **"enginesize"**, and the target data **y_data**. The parameter 'cv' determines the number of folds. In this case, it is 4. We can produce an output:

In [None]:
yhat = ____
yhat[0:5]
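
As a rough sketch of what `cross_val_predict` returns, the example below runs it on synthetic data (all names and values are assumptions). Each row receives exactly one prediction, made by the model that did not see that row during training:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic single-feature regression problem
rng = np.random.default_rng(2)
X_demo = rng.uniform(60, 300, size=(100, 1))
y_demo = 150 * X_demo[:, 0] + rng.normal(0, 3000, 100)

# One out-of-fold prediction per sample
y_oof = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=4)

print(y_oof.shape)
print(r2_score(y_demo, y_oof))
```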

In [None]:
df.corr()['price']

## Part 2: Overfitting, Underfitting and Model Selection

The test data, often called **out-of-sample data**, gives a more accurate picture of how your model will perform on real-world data. This is because it helps reveal issues like overfitting, where a model fits the training data too closely and fails to generalize.

We’ll look at some examples to illustrate this. The effects of **overfitting** are especially noticeable in **Multiple Linear Regression** and **Polynomial Regression**, so we’ll focus on those cases.

Let's create Multiple Linear Regression objects and train the model using **'curbweight'**, **'enginesize'** and **'horsepower'** as features.

In [None]:
lr = ____
lr.____

Prediction using training data:

In [None]:
yhat_train = ____
yhat_train[0:5]

R^2 on the test set:

In [None]:
lr.____

Let's examine the distribution of the predicted values of the training data.

In [None]:
def distributionPlot(ytarget, ypred, title, ytargetLabel):
    sns.kdeplot(ytarget, label=ytargetLabel)
    sns.kdeplot(ypred, label='predicted data')
    plt.title(title)
    plt.ylabel('Proportion of cars')
    plt.legend()
    plt.show()


In [None]:
distributionPlot(ytarget=y_train, ypred=yhat_train, title='Distribution Plot of Predicted Value Using Training Data and its Target Value', ytargetLabel='training data')

The model seems to be doing well in learning from the training dataset. But we are interested in seeing how the model performs with never before seen data, therefore we do predictions on the test data and compare with its actual values.

Prediction using test data: 

In [None]:
yhat_test = lr.predict(x_test[['curbweight', 'enginesize', 'horsepower']])
yhat_test[0:5]

In [None]:
distributionPlot(ytarget=y_test, ypred=yhat_test, title='Distribution Plot of Predicted Value Using Test Data and its Target Value', ytargetLabel='test data')

Comparing the training plot and the test plot, it is evident that the predicted distribution matches the actual distribution much more closely on the training data. The difference in the test plot is most apparent in the range of 5,000 to 15,000, where the shapes of the two distributions differ markedly. Let's see if polynomial regression also exhibits a drop in prediction accuracy on the test dataset.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

#### Overfitting
Overfitting occurs when the model fits the noise, but not the underlying process. Therefore, when testing your model using the test set, your model does not perform as well since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 2 polynomial model.
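
To make the idea concrete, here is a small sketch on synthetic data (a noisy sine curve, chosen purely for illustration). It fits polynomials of increasing degree to a deliberately small training set and reports R^2 on both the training data and fresh data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def noisy_curve(x):
    # Underlying process (sine) plus random noise
    return np.sin(3 * x[:, 0]) + rng.normal(0, 0.3, len(x))

x_small = rng.uniform(-1, 1, size=(15, 1))    # small training set
x_new = rng.uniform(-1, 1, size=(200, 1))     # unseen data
y_small, y_new = noisy_curve(x_small), noisy_curve(x_new)

train_r2, test_r2 = [], []
for degree in [1, 3, 12]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_small, y_small)
    train_r2.append(r2_score(y_small, model.predict(x_small)))
    test_r2.append(r2_score(y_new, model.predict(x_new)))
    print(degree, round(train_r2[-1], 3), round(test_r2[-1], 3))
```

The training R^2 can only go up as the degree grows, because each higher-degree model contains the lower-degree ones. The score on unseen data has no such guarantee, and a gap between the two is the typical symptom of overfitting.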

Let's use 55 percent of the data for training and the rest for testing:

In [None]:
x_train, x_test, y_train, y_test = ____

We will perform a degree 2 polynomial transformation on the feature 'enginesize'. 

In [None]:
pr = ____
x_train_pr = ____
x_test_pr = _____
pr

Now, let's create a Linear Regression model "poly" and train it.

In [None]:
poly = LinearRegression()
poly.____

We can see the output of our model using the method "predict." We assign the values to "yhat".

In [None]:
yhat = poly.____

Let's take the first five predicted values and compare them to the actual targets.

In [None]:
print("Predicted values:", yhat[0:5])
print("True values:", y_test[0:5].values)

We define a function "polyplot" to display the training data, testing data, and the predicted function.

In [None]:
def polyplot(xtrain, ytrain, xtest, ytest, poly_feat, poly_reg):
    # Evenly spaced grid over the range of the training feature (xtrain is a Series)
    x_range = np.linspace(xtrain.min(), xtrain.max(), 200)
    # Use the transformer and model passed as arguments instead of the globals pr and poly,
    # and transform without refitting
    x_poly_range = poly_feat.transform(pd.DataFrame({xtrain.name: x_range}))
    y_range_pred = poly_reg.predict(x_poly_range)
    sns.scatterplot(x=xtrain, y=ytrain, label='Train data')
    sns.scatterplot(x=xtest, y=ytest, label='Test data')
    sns.lineplot(x=x_range, y=y_range_pred, color='r', label='Prediction Function')
    plt.legend()
    plt.show()

In [None]:
polyplot(xtrain=x_train['enginesize'], ytrain=y_train, xtest=x_test['enginesize'], ytest=y_test, poly_feat=pr, poly_reg=poly)

The plot shows a polynomial regression model: the blue dots are the training data, the orange dots are the test data, and the red line is the model's prediction.

We see that the estimated function appears to track the data but around an enginesize of 270, the function begins to diverge from the data points. 

R^2 of the training data:


In [None]:
poly.____

 R^2 of the test data:


In [None]:
poly.____

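We see that the R^2 for the training data is 0.775, while the R^2 on the test data is 0.734. The lower the R^2, the worse the model fits. A negative R^2 means the model predicts worse than simply using the mean of the target; on test data this is often a sign of overfitting.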
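
The range of R^2 can be checked on a few hand-made numbers. This sketch uses `r2_score` directly, and the arrays are invented for illustration: perfect predictions give 1, always predicting the mean gives 0, and predictions that are worse than the mean give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_score(y_true, y_true)                             # exact predictions
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))   # always predict the mean
r2_bad = r2_score(y_true, y_true[::-1])                           # predictions in reverse order

print(r2_perfect, r2_mean, r2_bad)
```

For the reversed predictions, the residual sum of squares is 20 and the total sum of squares is 5, so R^2 = 1 - 20/5 = -3.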


Let's see how the R^2 changes on the test data for different order polynomials and then plot the results:

In [None]:
Rsqu_test = []

order = [1, 2, 3, 4, 5]
for n in order:
    pr = PolynomialFeatures(degree=n)
    # Fit the transformer on the training data only, then reuse it on the test data
    x_train_pr = pr.fit_transform(x_train[['enginesize']])
    x_test_pr = pr.transform(x_test[['enginesize']])
    # Refit the linear model on the transformed features and score it on the test set
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.show()

We see that the R^2 gradually increases until an order three polynomial is used. After that, the R^2 continuously decreases.

We can perform polynomial transformations with more than one feature. Create a **PolynomialFeatures** object **pf** of degree three using
`carwidth`, `curbweight`, `enginesize`, `horsepower`

In [None]:
# Write your code below and press Shift+Enter to execute 
pf = ____
x_train_pf = ____
x_test_pf = ____

How many dimensions does the new feature have? Hint: use the attribute "shape".

In [None]:
# Write your code below and press Shift+Enter to execute 
x_train_pf.shape
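
The number of output columns can also be reasoned about directly. With n input features and degree d, and with the default bias column included, the transform produces C(n + d, d) columns. The sketch below checks this on a deliberately different, smaller case than the exercise (2 features, degree 2); the names are illustrative:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 2, 2
X_demo = np.zeros((1, n_features))   # the values do not matter, only the shape

n_out = PolynomialFeatures(degree=degree).fit_transform(X_demo).shape[1]

# Columns for features a, b: 1, a, b, a^2, a*b, b^2
print(n_out, comb(n_features + degree, degree))
```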

Create a linear regression model "polyreg1". Train the object using the method "fit" using the polynomial features.

In [None]:
# Write your code below and press Shift+Enter to execute 
polyreg1 = ____
polyreg1.____

Use the method  "predict" to predict an output on the polynomial features, then use the function "distributionPlot" to display the distribution of the predicted test output vs. the actual test data.

In [None]:
# Write your code below and press Shift+Enter to execute 
yhat_test = polyreg1.____

distributionPlot(ytarget=y_test, ypred=yhat_test, title='Distribution Plot of Predicted Value Using Test Data vs Test Data Value of polyreg1', ytargetLabel='test data')

Using the distribution plot above, describe (in words) the two regions where the distribution of the predicted prices deviates most from the distribution of the actual prices.

In [None]:
# Write your code below and press Shift+Enter to execute 


## Part 3: Ridge Regression

In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Note that here our test data will be used as validation data.
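
As a quick illustration of what alpha does, the sketch below fits Ridge models with increasing alpha on synthetic data and records the size of the coefficient vector. The data, the alpha values and the names are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 5 features with known true coefficients, plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 0.5, 50)

alphas_demo = [0.01, 1, 100, 10000]
norms = []
for alpha in alphas_demo:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(model.coef_))   # overall size of the coefficients

for alpha, norm in zip(alphas_demo, norms):
    print(alpha, round(norm, 4))
```

A larger alpha penalizes large coefficients more heavily, so the coefficient vector shrinks toward zero. This reduces variance at the cost of some bias.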

Let's perform a degree two polynomial transformation on the parameters:
`curbweight`, `enginesize`, `boreratio`, `horsepower`, `highwaympg`

In [None]:
pf = ____
x_train_pr = pf.____
x_test_pr = pf.____

Let's import **Ridge** from the module **linear_model**.

In [None]:
from sklearn.linear_model import Ridge

Let's create a Ridge regression object, setting the regularization parameter (alpha) to 1 


In [None]:
ridgeModel=____

Like regular regression, you can fit the model using the method fit.

In [None]:
ridgeModel.____

Similarly, you can obtain a prediction: 

In [None]:
yhat = _____

Let's compare the first four predicted samples to our test set: 

In [None]:
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)

We select the value of alpha that minimizes the test error. To do so, we can use a for loop. We have also created a progress bar to see how many iterations we have completed so far.

In [None]:
from tqdm import tqdm

Rsqu_test = []
Rsqu_train = []
Alpha = np.arange(0, 1000)
pbar = tqdm(Alpha)

for alpha in pbar:
    ridgeModel = Ridge(alpha=alpha) 
    ridgeModel.fit(x_train_pr, y_train)
    test_score, train_score = ridgeModel.score(x_test_pr, y_test), ridgeModel.score(x_train_pr, y_train)
    
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    Rsqu_test.append(test_score)
    Rsqu_train.append(train_score)

We can plot out the value of R^2 for different alphas:

In [None]:
width = 6
height = 5
plt.figure(figsize=(width, height))

plt.plot(Alpha, Rsqu_test, label='validation data')
plt.plot(Alpha, Rsqu_train, 'r', label='training data')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()



In [None]:
# Pick the alpha with the highest validation R^2
# (argmax also works when every score is negative, unlike a search that starts from 0)
max_id = Alpha[int(np.argmax(Rsqu_test))]

print(f"max alpha: {max_id}")

**Figure 4**: The blue line represents the R^2 of the validation data, and the red line represents the R^2 of the training data. The x-axis represents the different values of Alpha. 

Here the model is fit on the training data and scored on both the training data and the test data, which serves as the validation set.

The red line in Figure 4 represents the R^2 of the training data. As alpha increases, the R^2 decreases. Therefore, as alpha increases, the model performs worse on the training data.

The blue line represents the R^2 on the validation data. As the value for alpha increases, the R^2 increases and converges at a point.


Perform Ridge regression and calculate the R^2 using the polynomial features: use the training data to train the model and the test data to test it. Set the parameter alpha to the best value found above (`max_id`).

In [None]:
# Write your code below and press Shift+Enter to execute 
ridgeModel = ____
ridgeModel.____
ridgeModel.____

## Part 4: Grid Search

The term alpha is a hyperparameter. Sklearn has the class **GridSearchCV** to make the process of finding the best hyperparameter simpler.


Let's import **GridSearchCV** from  the module **model_selection**.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

We create a pipeline that chains the polynomial transformation and the Ridge model:

In [None]:
# Define the pipeline
pipeline = Pipeline([
    ('poly', PolynomialFeatures()),
    ('ridge', Ridge())
])


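Define the parameter grid for the pipeline. Parameters of a pipeline step are addressed as `<step name>__<parameter>`, so the keys here are `poly__degree` and `ridge__alpha`: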

In [None]:
# Define the parameter grid
param_grid = ____

# Set up GridSearchCV
grid_search = GridSearchCV(____)
grid_search

In [None]:
# Fit the grid search to the data
grid_search.____

In [None]:
# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best CV R² score:", grid_search.best_score_)
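
For reference, the whole grid search workflow can be sketched on synthetic data as follows. The data, the grid values, `cv=4` and the names `demo_pipe`, `demo_grid` and `search` are illustrative assumptions, not the settings expected in the exercise:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic term, so degree 1 should underfit
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(150, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(0, 0.3, 150)

demo_pipe = Pipeline([("poly", PolynomialFeatures()), ("ridge", Ridge())])

# Keys follow the "<step name>__<parameter>" convention; the values are illustrative
demo_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.1, 1, 10]}

# Every combination in the grid is evaluated with 4-fold cross-validation
search = GridSearchCV(demo_pipe, demo_grid, cv=4, scoring="r2")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV R^2 score:", search.best_score_)
```

Because the transformer sits inside the pipeline, it is refit on the training portion of every fold, which keeps information from the held-out fold out of the preprocessing step.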

In [None]:
# Use the best model to predict
best_model = grid_search.best_estimator_
best_model.score(x_test[['enginesize', 'horsepower']], y_test)
#y_pred = best_model.predict(x_test)
