**Before Starting**: First, fill out the below code cell with your first name, last name, and student ID.

**Before Submission**: Make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).


**During Lab Tips**:
1. DO NOT write your written responses in the same markdown cell as the question. If you do this, your written response will be lost!


2. If possible, please try to use your local Jupyter Notebook to complete the lab. Online notebook editors like Colab can edit notebook source code and cause our auto-grader to break, making grading your lab more difficult for us!

**<font color='red'>WARNING: Some TODOs have `todo_check()` functions which will give you a rough estimate of whether you will receive points or not. <u>These checks are there simply to make sure you are on the right track and they DO NOT determine your final grade for the lab</u>. They are only here to provide you with real-time feedback.</font>**

In [None]:
FIRST_NAME = "Claude"
LAST_NAME = "Kouakou"
STUDENT_ID = "801438848"



# Linear Regression Lab


In [None]:
# Extra imports for this lab that are beyond the scope of discussion
import os
import gc
import traceback
import warnings
from pdb import set_trace

# Set this to True if you DO NOT want to run the
# garbage_collect() functions throughout the notebook
turn_off_garbage_collect = False

def garbage_collect(vars_):
    if not turn_off_garbage_collect:
        for v in vars_:
            if v in globals():
                del globals()[v]
        collected = gc.collect()


class TodoCheckFailed(Exception):
    pass

def todo_check(asserts):
    failed_err = "You passed {}/{} and FAILED the following code checks:\n{}"
    failed = ""
    n_failed = 0
    for check, (condi, err) in enumerate(asserts):
        exc_failed = False
        if isinstance(condi, str):
            try:
                passed = eval(condi)
            except Exception:
                exc_failed = True
                n_failed += 1
                failed += f"\nCheck [{check+1}]: Failed to execute check [{check+1}] due to the following error...\n{traceback.format_exc()}"
        elif isinstance(condi, bool):
            passed = condi
        else:
            raise ValueError("asserts must be a list of strings or bools")

        if not exc_failed and not passed:
            n_failed += 1
            failed += f"\nCheck [{check+1}]: Failed\n\tTip: {err}\n"

    if len(failed) != 0:
        passed = len(asserts) - n_failed
        err = failed_err.format(passed, len(asserts), failed)
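        # err is already fully formatted; formatting it again would break on
        # tracebacks or check strings that contain curly braces
        raise TodoCheckFailed(err)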
    print("Your code PASSED the code check!")

# Goal

The goal of this activity is to introduce the popular ML tool scikit-learn and to practice linear regression models with it. You will apply ordinary least squares (LS), regularized linear regression models (i.e., Ridge, Lasso, and Elastic Net), and an online regression model (stochastic gradient descent) to real data. We will prepare the data as we did in last week's practice and then apply these linear models. Follow the TODO titles and comments to finish the activity!

# Agenda

* Scikit-learn Basics
* Data Preparation
  * Data Preprocessing
  * Data Visualization
  * Data Partitioning
* Regression with  
  * Least Squares
  * Ridge Regression
  * Lasso Regression
  * Elastic Net
  * Stochastic Gradient Descent
  

# Table of TODOs


1. [TODO1 (5 points)](#TODO1)
2. [TODO2 (5 points)](#TODO2)
3. [TODO3 (5 points)](#TODO3)
4. [TODO4 (10 points)](#TODO4)  
5. [TODO5 (5 points)](#TODO5)
6. [TODO6 (5 points)](#TODO6)
7. [TODO7 (8 points)](#TODO7)
8. [TODO8 (10 points)](#TODO8)
9. [TODO9 (5 points)](#TODO9)
10. [TODO10 (5 points)](#TODO10)
11. [TODO11 (20 points)](#TODO11)
12. [TODO12 (5 points)](#TODO12)
13. [TODO13 (10 points)](#TODO13)
14. [Feedback (2 points)](#feedback)


* Total: 100 Points

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from copy import deepcopy as copy

%matplotlib inline

# Scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/index.html) is one of the most popular machine learning tools. It is well maintained by a large group of contributors and a good organizing team. The software is well designed, so it is easy to learn and to apply to many different data analysis applications.

If running the code cell below produces an import error for sklearn, or the version is outdated, please follow [the instructions](https://scikit-learn.org/stable/install.html) to install or upgrade scikit-learn.

In [None]:
import sklearn
sklearn.__version__

Using scikit-learn is very simple. You can just follow five steps to use it in general:  <br/>
&nbsp;&nbsp;1) import, <br/>
&nbsp;&nbsp;2) prepare data, <br/>
&nbsp;&nbsp;3) initialize (create an object),  <br/>
&nbsp;&nbsp;4) train (fit), <br/>
&nbsp;&nbsp;5) predict (or/and evaluate).  <br/>
  
Although you have already seen examples in the slides, here is another one. For now, you don't need to know what KNN is.

In [None]:
from sklearn import datasets, neighbors           # 1) import

X, t = datasets.load_digits(return_X_y=True)      # 2) prepare data (from a dataset embedded in the library)
X = X / X.max()                                   #    - rescale X to between 0 and 1

# number of data samples
N = X.shape[0]

# prepare train and test data
i_split = int(0.8 * N)
X_train = X[:i_split]
t_train = t[:i_split]
X_test = X[i_split:]
t_test = t[i_split:]

knn = neighbors.KNeighborsClassifier()            # 3) initialize - create a KNN classifier using the default settings

knn.fit(X_train, t_train)                         # 4) train with training data (input, X_train and target, t_train)

knn.score(X_test, t_test)                         # 5) evaluate the model on the entire test data and compute the accuracy


The test accuracy seems good, at about 96 percent. However, we do not yet know what the data is, what the task is, or what ML model we used here. Let us find out by looking at the data and at the predictions the trained model made.

Here, the data is the famous hand-written digit recognition data. The task is therefore to classify each image (8x8, 64 values after flattening) as the right digit (from 0 to 9). Here is example code to predict and plot the results.

In [None]:
k = 8  # the number of data samples to check

y = knn.predict(X_test[0:k])                     # 5) predict - for the first k data samples in X_test

# plot the digits along with "label (prediction)"
_, axes = plt.subplots(1, k, figsize=(15,3))
for ax, image, label, pred in zip(axes, X_test[:k], t_test[:k], y):

    ax.set_axis_off()
    ax.imshow(image.reshape((8,8)), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Digit {}({})'.format(label, pred))


# Data Preparation

Well... It is a bit awkward to introduce a classification example in the regression module. Let us come back to the regression problem.

This week, we will play with the [1985 Automobile Data Set](https://archive.ics.uci.edu/dataset/10/automobile) in [UCI Data Repository](https://archive.ics.uci.edu/ml/index.html).
You can get the csv file directly from [data.csv](https://archive.ics.uci.edu/static/public/10/data.csv).
You do not need to download the names file but you can read it to get informed about the data.

### Goal

What we want to do with this data is develop a linear regression model that predicts the symboling, a variable used by actuaries to determine the risk of buying an automobile, given the other input variables.

Let us first import all the libraries that we need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model

%matplotlib inline

<div id='TODO1'></div>

## Quick checking the content of the file


Before starting this TODO, make sure to download the data file into the working directory.

### TODO 1-1 (1 point)

You can use `%pfile` (all) or `!head` (Linux or Mac) or `!type` (Windows) to quickly check the content of the file. See how the file is formatted so you can get an idea of how to read.

In [None]:
# TODO 1-1: add pfile/head/type command here
!powershell Get-Content -Head 30 imports-85.data

## Reading automobile data


### TODO 1-2 (1 point)

We are repeating what we did last week, loading data using pandas.

1. Load the data file into variable `df` using pandas library.
2. Print out the dataframe `df`.

In [None]:
# TODO 1-2:
DATA_URL = (
    "https://archive.ics.uci.edu/static/public/10/data.csv"
)

df = pd.read_csv(DATA_URL, nrows=1000)
df

<div id='TODO1-2'></div>

### TODO 1-3 (1 point)

1. Let us look at the summary of data using `describe` again.
2. Check the min, max, mean and standard deviation to get some idea of the value distributions.

In [None]:
# TODO 1-3:
df_describe = df.describe()
df_describe

### TODO 1-4 (2 points)
1. Check for any null data in the dataset.
2. Print the rows with null data.


Hint: Refer to last week's lab exercise.

In [None]:
# TODO 1-4.1: 
null_df = df.isnull()
null_df

In [None]:
# TODO 1-4.2:
rows_with_null = df[df.isnull().any(axis=1)]
rows_with_null

<div id='TODO2'></div>

### TODO 2 (5 points)

1. Using `SimpleImputer` in Scikit-Learn, replace the missing values (NaN) with the most frequent values in the data. Store the cleaned data into `df_freq`.

Hint: Refer to last week's lab exercise where you used the same to preprocess the garment workers productivity dataset.

In [None]:
from sklearn.impute import SimpleImputer
# TODO 2:
imputer = SimpleImputer(strategy='most_frequent')
df_freq = pd.DataFrame(imputer.fit_transform(df))

df_freq.head()



Notice that the name of each column has been changed to an integer instead of a string. Let's get the names back using the code below, and store the final DataFrame in the variable `df`.

In [None]:
df_freq.columns = df.columns
df_freq.head()

In [None]:
df = df_freq
df.head()


Let's check the type of data in each column.

In [None]:
df.dtypes


Notice that the type of data in each column has changed to `object`. Let's convert the numeric columns back to numbers, which makes our work a lot easier.

In [None]:
numeric_columns = ['normalized-losses', 'num-of-doors', 'wheel-base', 'length', 'width', 'height',
                   'curb-weight', 'num-of-cylinders', 'engine-size', 'bore', 'stroke', 'compression-ratio',
                   'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price', 'symboling']

df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

df.dtypes

The column names in `numeric_columns` are the ones that hold numbers in the original data. The data in these columns is simply converted from type `object` to `int`/`float` using the pandas function [to_numeric()](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html).
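As a small illustration of what `to_numeric()` does with `errors='coerce'`, here is a minimal sketch on a made-up series (the values below are hypothetical and not taken from the automobile data):

```python
import pandas as pd

# Hypothetical toy column stored as strings (object dtype)
s = pd.Series(["88.6", "94.5", "?", "99.8"], dtype=object)

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(s, errors="coerce")

print(s.dtype)                  # object
print(converted.dtype)          # float64
print(converted.isna().sum())   # 1 (the "?" entry became NaN)
```

Note that coercion silently introduces NaN values, so it is worth re-checking for missing data after the conversion.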

But, what about the remaining data that is still present in the object form, to be specific, in the form of strings?

Remember that in last week's lab we converted each of them by manually assigning an integer to each string, and then modified the dataset using the pandas `apply()` method and `lambda` functions. This task can be done easily with the `LabelEncoder` class in the scikit-learn library.
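To see what this kind of encoding produces, here is a minimal sketch on a made-up list of strings (the `fuel` values are hypothetical toy data). It only illustrates the mapping from strings to integers:

```python
from sklearn.preprocessing import LabelEncoder

fuel = ["gas", "diesel", "gas", "gas", "diesel"]   # hypothetical toy values

encoder = LabelEncoder()
codes = encoder.fit_transform(fuel)

print(codes)                              # [1 0 1 1 0]
print(encoder.classes_)                   # ['diesel' 'gas']
print(encoder.inverse_transform(codes))   # back to the original strings
```

The integer codes follow the sorted order of the unique strings, so they carry no meaning beyond identifying the category. The scikit-learn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` for input features; this lab uses `LabelEncoder` column by column for simplicity.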

<div id='TODO3'></div>

### TODO 3 (5 points)

1. Convert the string data in `df` to integer format using the `LabelEncoder` class. The columns with string data are given to you as a list stored in the `strings_list` variable.
2. Hint: [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html),  [Reference](https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/)

In [None]:
strings_list = ['make', 'fuel-type', 'aspiration', 'body-style',
                'drive-wheels', 'engine-location', 'engine-type', 'fuel-system']

from sklearn import preprocessing
# TODO 3:
label_encoder = preprocessing.LabelEncoder()
for col in strings_list:
    df[col] = label_encoder.fit_transform(df[col])




In [None]:
df.dtypes


Notice that our whole data has now been changed into numerical format, either in the form of int or float. This makes our work a lot easier when we go to advanced stages like training and testing a model. The preprocessed dataset is shown in the below cell.

In [None]:
df


## Visualize the data

When working on a linear regression problem, it helps to know how the target value varies with the input variables. So, using scatter plots, let us see if there is a positive or negative correlation between any input feature and the target value.

<div id='TODO4'></div>

### TODO 4-1 (3 points)

1. In last week's lab, we created 3 by 4 scatter plots. Similarly, create 5 by 5 scatter plots, each plotting the target value (symboling) against an individual input variable.

In [None]:
# TODO 4-1:
fig = plt.figure(figsize=(15,15))
plt.clf() # Clear previous plt figure

Target = df.loc[:, 'symboling']
X_f = df.iloc[:, :-1]

for i in range(25):
    plt.subplot(5, 5, i+1) # Selects which subplot to plot to
    plt.scatter(x = X_f.iloc[:, i], y=Target, marker= '.') # Plots a given column
    plt.xlabel(X_f.columns.values[i]) # Sets x label
    plt.ylabel("symboling") # Sets Y label
fig.tight_layout()

Often, rather than just looking at how each feature interacts with the target, observing how each feature interacts with other features can be useful as well. One reason is that we begin to get a glimpse of the dependency between features. If two features are highly correlated, we can often ignore one of them, as using both would be redundant. Let's say you have two input variables $x_1$ and $x_2$ with $x_2 = 2 \times x_1$. Then your linear model $y = w_1 x_1 + w_2 x_2 + b$ can be rewritten as $y = (w_1 + 2 w_2) x_1 + b$, which has the same form as a model that uses $x_1$ alone.
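The redundancy argument can be checked numerically. The sketch below uses made-up data and hypothetical weights (`w1`, `w2`, `b` are arbitrary illustrative values): substituting $x_2 = 2 x_1$ folds both weights into the single combined weight $w_1 + 2 w_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2 * x1                      # perfectly redundant feature
w1, w2, b = 0.5, -1.5, 3.0       # hypothetical weights, chosen arbitrarily

y_two = w1 * x1 + w2 * x2 + b    # model that uses both features
y_one = (w1 + 2 * w2) * x1 + b   # equivalent model that uses x1 only

print(np.corrcoef(x1, x2)[0, 1])   # correlation of 1 (up to rounding)
print(np.allclose(y_two, y_one))   # True
```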

### TODO 4-2 (2 points)

1. Now, create a scatter plot which plots all the features against one another by using the Pandas `scatter_matrix()` function with our DataFrame `df`.

In [None]:
# TODO 4-2:
from pandas.plotting import scatter_matrix
# Create a scatter matrix
scatter_matrix(df, figsize=(10, 10), diagonal='kde', alpha=0.8, marker = ".")
fig.tight_layout()


From the above figures, we can see the target 'symboling' is a categorical value ranging from -2 to 3. Let us verify this using the following code.

In [None]:
df['symboling'].unique()

Yes, it is one of the six values, which means we can use classification algorithms. We will talk about classification later, but we can see how the samples group together using Andrews curves. Andrews curves map high-dimensional data into a frequency plot using a finite Fourier series. Therefore, you should expect to observe similar frequency patterns in the graph for similar data. You can check David Andrews's paper ([Plots of High-Dimensional Data](https://www.jstor.org/stable/2528964), 1972).
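For intuition, the textbook definition maps a sample $x = (x_1, x_2, x_3, \dots)$ to the curve $f_x(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots$ for $-\pi \le t \le \pi$. The sketch below evaluates that formula for a single made-up sample. The helper name `andrews_curve` is our own illustration, not a pandas function, and the plotting routine's internal details may differ.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve of one sample x at the angles t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    curve = np.full_like(t, x[0] / np.sqrt(2.0))   # constant term
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                    # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            curve += coeff * np.sin(harmonic * t)  # odd positions use sine
        else:
            curve += coeff * np.cos(harmonic * t)  # even positions use cosine
    return curve

t = np.linspace(-np.pi, np.pi, 5)
sample = [1.0, 2.0, 3.0]                           # hypothetical 3-feature sample
print(andrews_curve(sample, t))
```

Each feature becomes the coefficient of one Fourier term, so samples with similar feature values trace similar curves.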

### TODO 4-3 (3 points)

1. Use the Pandas `andrews_curves()` function with our DataFrame `df` to produce the Andrews curves. Take some time to observe the plot.
    1. Try using different colours for each target value.

In [None]:
# TODO 4-3:
from pandas.plotting import andrews_curves
# Andrews curve plot
plt.figure(figsize=(6, 6))
andrews_curves(df, 'symboling', colormap='viridis')
plt.tight_layout()  # apply to the current figure; `fig` still refers to the TODO 4-1 figure

What can you see from the plot?

Well, a large number of data samples overlap, so for symboling between -2 and 0 it is hard to see anything in particular. Some samples with symboling 1 and 3, however, stick out a little; it will be interesting to see how this impacts ML model performance.

Now, let us look at the target values.

### TODO 4-4 (2 points)

1. Create the histogram using our target represented by the 'symboling' column in our `df`. Take some time to observe the plot.
    1. Hint: You can use Pandas or Matplotlib to  generate this plot.

In [None]:
# TODO 4-4:
# Create the histogram
df.loc[:, 'symboling'].plot.hist()
plt.tight_layout()  # apply to the current figure; `fig` still refers to the TODO 4-1 figure

You can see the figure now shows the majority of automobiles are with symboling 0 and 1. Let us see how this sample imbalance will impact the model performance.

## Splitting data into input features and targets

Let us have two separate variables for input features and output targets.
`X` will be used for the input features, `T` for target labels, and `N` for the total number of samples in our data.


<div id='TODO5'></div>

### TODO 5-1 (3 points)

1. Store the target column 'symboling' into `T`.
2. Store all the input features into `X`, excluding the target column `symboling`.
3. Store the total number of samples in our dataset (e.g., rows) into `N`.

In [None]:
# TODO 5-1:

T = df.loc[:, 'symboling'].copy() # copy the target into T
X = df.iloc[:, :-1].copy()         # copy the feature into X
N = df.shape[0]


In [None]:
todo_check([
    ('"X" in globals()', 'X is not defined'),
    ('np.all(np.isclose(X.iloc[10, 1:5].values, np.array([2.0, 1.0, 0.0, 2.0])))', 'X has incorrect values'),
    ('"T" in globals()', 'T is not defined'),
    ('np.all(np.isclose(T.iloc[5:10].values, np.array([2, 1, 1, 1, 0])))', 'T has incorrect values'),
    ('"N" in globals()', 'N is not defined'),
    ("N == 205", "N has the wrong value")
])

### TODO 5-2 (2 points)

1. Split the data `X` and targets `T` using Sklearn's `train_test_split()` function. Store the values into `X_train`, `X_test`, `t_train`, and `t_test`. Be sure to pass the arguments that correspond to the following descriptions:
    1. Split the data/targets using a 80/20 split (80% for training and 20% for testing).
    1. Use a seed of 0 for the `random_state` argument.

In [None]:
# TODO 5-2:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets (80/20 split)
X_train, X_test, t_train, t_test = train_test_split(X, T, test_size=0.2, random_state=0)
print(X.shape, X_train.shape, X_test.shape, t_train.shape, t_test.shape)


In [None]:
todo_check([
    ("X_train.shape == (164, 25)", "X_train has the wrong shape"),
    ("X_test.shape == (41, 25)", "X_test has the wrong shape"),
    ("t_train.shape == (164,)", "t_train has the wrong shape"),
    ("t_test.shape == (41,)", "t_test has the wrong shape"),
])

# Applying Least Squares
Now it's time to apply least squares to our data in an attempt to develop a model for predicting our target variable `symboling`! Recall that the least squares formula is used to generate weights $w$ which can then be used for making predictions given new data. The least squares formula is given below where the symbol $\cdot$ corresponds to the dot product.

$$
w = (X^T \cdot X)^{-1} \cdot X^T \cdot T
$$
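As a rough illustration of the formula, the sketch below applies it to synthetic data with NumPy. The arrays `X_toy`, `T_toy`, and `true_w` are made up for this example, and a column of ones is appended so that the last weight plays the role of the bias. It solves the linear system instead of forming the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X_toy = rng.normal(size=(N, 3))                  # hypothetical inputs
true_w = np.array([1.5, -2.0, 0.5])              # hypothetical true weights
T_toy = X_toy @ true_w + 4.0 + 0.01 * rng.normal(size=N)

# Append a column of ones so the last weight acts as the bias
X1 = np.hstack([X_toy, np.ones((N, 1))])

# Solve (X^T X) w = X^T T rather than computing the inverse explicitly
w = np.linalg.solve(X1.T @ X1, X1.T @ T_toy)

# Cross-check against NumPy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X1, T_toy, rcond=None)

print(w)                          # close to [1.5, -2.0, 0.5, 4.0]
print(np.allclose(w, w_lstsq))    # True
```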

**References**

If you want to gain a better understanding of what least squares is doing check out the following references.
* [Khan Academy](https://www.youtube.com/watch?v=MC7l96tW8V8)
* [Geometric view](https://medium.com/@andrew.chamberlain/the-linear-algebra-view-of-least-squares-regression-f67044b7f39b)

<div id='TODO6'></div>

### TODO 6 (5 points)

Well, the model is implemented in `sklearn.linear_model`.

1. Create an instance using the proper Sklearn class for conducting least squares. Store the output into `model`.
    1. Hint: Make sure to import and use the right class. You can find the correct class to import and use by referring to the slides.
2. Train the least squares model using the `X_train` and `t_train`.
3. Evaluate the model by computing the test scores using the `score()` method with the `X_test` data and `t_test` targets. Store the output into `test_score`.


In [None]:
from sklearn.linear_model import LinearRegression

np.random.seed(0)

# TODO 6:

# 1) initialize
# create a linear regression object
model = LinearRegression()

# 2) train the model
model.fit(X_train, t_train)

# 3) evaluate

test_score = model.score(X_test, t_test)

test_score

In [None]:
todo_check([
    ('np.isclose(test_score,  0.387996, rtol=.1)', '`score` potentially has the wrong value.')
])

The `score` function calculates the coefficient of determination $R^2$:
$$
 R^2 = 1 - \frac{\sum_i (t_i - y_i)^2}{\sum_i (t_i - \bar{t})^2}
$$
where $t_i$ is the target label of sample $i$, $y_i$ is its predicted value, and $\bar{t}$ is the mean of the target labels.
When the model is a perfect fit, $R^2 = 1$. Knowing that, the model seems to be a bit weak. Let us check this with the following plots.
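Written out with explicit sums over the samples, $R^2$ can be computed by hand. The helper `r2_score_manual` below is our own illustrative function on toy values, not part of scikit-learn:

```python
import numpy as np

def r2_score_manual(t, y):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_res = np.sum((t - y) ** 2)          # squared errors of the predictions
    ss_tot = np.sum((t - t.mean()) ** 2)   # squared deviations from the mean target
    return 1.0 - ss_res / ss_tot

t_toy = np.array([0, 1, 2, 3])                      # hypothetical targets
print(r2_score_manual(t_toy, t_toy))                # 1.0 (perfect predictions)
print(r2_score_manual(t_toy, np.full(4, 1.5)))      # 0.0 (always predicting the mean)
```

A model that always predicts the mean target scores 0, and a model that does worse than that gets a negative score.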

<div id='TODO7'></div>

First, let us see how close the prediction is to the actual label by comparing the value against the diagonal line after plotting a scatter plot between them (See the figure below). If the predictions are accurate, all the points should lie on the diagonal line.

### TODO 7-1 (3 points)
1. Call the `predict()` method to make predictions for the `X_test` input and store the results in `y`.
2. Plot `t_test` (x-axis) against `y` (y-axis).
3. Based on your plot and results, briefly state what you think/observe.

In [None]:
# TODO 7-1:

## TODO 1. make a prediction
y = model.predict(X_test)


print(y.shape)
print("Predicted values y[1:5]:", y[1:5])
print("Expected values:", np.array([-0.51629084, 1.938767, 1.35099362, 0.77757691]))
print("Test score: ", test_score)

## TODO 2. plot t_test vs y
plt.figure(figsize=(6, 6))
plt.scatter(t_test, y, alpha=0.7, marker = '.')

# dashed diagonal line
plt.plot([-3,3], [-3,3], 'r--')
# x and y labels
plt.xlabel("target")
plt.ylabel("predicted")


In [None]:
todo_check([
    ("y.shape == (41,)", "y shape is incorrect"),
    ("np.all(np.isclose(y[1:5], np.array([-0.51629084, 1.938767, 1.35099362, 0.77757691])))", "y values are incorrect")
])

Here the red diagonal line represents where the blue dots would fall if our model's predictions matched the targets exactly.

### TODO 7-2 (5 points)

Write your thoughts about the results and plots in the below cell.

**DO NOT WRITE YOUR ANSWER IN THIS CELL!**

`ANSWER: We can observe that the plot of predicted vs. actual symboling (target) roughly follows the diagonal, but many of the predictions lie far from the red diagonal line. `

If you want to look at all the data samples plotted *with* their corresponding predictions, you can plot all the samples as in the code below.

In [None]:
plt.plot(t_test.to_numpy(), '.')
plt.plot(y, 'x')
plt.xlabel("samples")
plt.ylabel("symboling")

Well, this above plot is a bit messy when we look at the results here. Let us plot each data sample in its own plot based on the  target value.

<div id='TODO8'></div>

### TODO 8 (10 points)

1. Create a sub-plot where each plot, plots the actual symboling target `t_test` (blue dot) against its predicted target `y` (orange x).
  1. Hint: You will need to use sub-plots similar to TODO 4-1.
  1. Hint: Notice, there are 5 unique values in `t_test`, thus you will need 5 sub-plots, one for each unique value in `t_test`. For example, let's say we have 2 data samples with the symboling '-1' in `t_test`. The plot for symboling '-1' should then plot the actual targets against the predicted targets (a total of 4 points should be plotted, 2 blue dots, 2 orange x's).
2. Write your thoughts about what you observe from the figures.

Recall, `t_test` is a Pandas Series (1D DataFrame) and `y` is a numpy array!

In [None]:
t_test.unique()

In [None]:
print(f"t_test type {type(t_test)}")
print(f"y type {type(y)}")

Further, recall that the unique values inside `t_test` are as follows! Notice the symboling '-1' appears 2 times while the symboling '2' appears 10 times.

In [None]:
uniques, counts = np.unique(t_test, return_counts=True)

print(f"Unique values in t_test: {uniques}")
print(f"Number of times each unique value appears in t_test: {counts}")

Plotting the actual vs predicted for target '-1' might look as follows:

In [None]:
plt.plot(t_test.values[t_test==-1], '.', label="Actual")
plt.plot(y[t_test==-1], 'x', label="Predicted")
plt.ylabel("Symboling")
plt.xlabel("Sample Number")
plt.legend()

Finish the code for TODO 8 in the cell below.

In [None]:
# TODO 8:
unique_values = sorted(dict.fromkeys(t_test.values)) 
fig , axes = plt.subplots(2, 3, figsize=(10,6)) 
axes = axes.ravel()  # Flatten the axes for easier iteration

for i, val in enumerate(unique_values):
    axes[i].plot(t_test.values[t_test==val], '.', label="Actual")
    axes[i].plot(y[t_test==val], 'x', label="Predicted")
    axes[i].set_ylabel(f"Actual Symboling = {val}")
    axes[i].grid(alpha=0.3)
# Hide any unused subplots
for j in range(len(unique_values), len(axes)):
    axes[j].axis("off")
fig.tight_layout()

Write your thoughts about the results and plots in the below cell.

**DO NOT WRITE YOUR ANSWER IN THIS CELL!**

`ANSWER: I see that the actual symboling is distant from the predicted values. When the t_test values are low (-1 to 0), the prediction is well above the actual value, and from 2 to 3 the prediction is well below it. I can deduce that my model is not very accurate. `


As we'll be using other linear models throughout this notebook it will be nice to condense all the plots for examining our targets and predictions into one easy to call function. So, let's make a function that does this. We'll call it `evaluate`.

<div id='TODO9'></div>

### TODO 9 (5 points)

Finish coding the `evaluate()` function given below.

1. Plot `t_test` (x-axis) against `y` (y-axis) by reusing the code we finished in TODO 7-1.
2. Plot all the data samples `t_test` *with* their corresponding predictions `y` using the function we defined for you a few code cells earlier.
3.  Create a sub-plot where each plot, plots the actual target `t_test` (blue dot) against its predicted target `y` (orange x) as we did in TODO 8.
4. Maintaining the subplot structure as it is, fill in the blank to finish the function.

In [None]:
def evaluate(y, t):
    # Use the arguments rather than the global t_test so the function
    # works for any pair of predictions and targets
    t = np.asarray(t)
    plt.figure(figsize=(10,10))

    # TODO 9:
    # t vs y plot
    plt.subplot(3,3, 1)
    # TODO: add the first plot
    plt.scatter(t, y, alpha=0.7, marker = '.')
    # dashed diagonal line
    plt.plot([-3,3], [-3,3], 'r--')
    plt.xlabel("target")
    plt.ylabel("Predicted")
    plt.title("Plot-1")

    # all value comparison
    plt.subplot(3,2, 2)
    # TODO: add the second one
    plt.plot(t, '.')
    plt.plot(y, 'x')
    plt.xlabel("samples")
    plt.ylabel("symboling")
    plt.title("Plot-2")


    # subplots of individual symboling comparison
    # TODO: add the third subplots
    unique_values = np.unique(t)  # sorted unique target values
    fig , axes = plt.subplots(2, 3, figsize=(10,6))
    axes = axes.ravel()  # Flatten the axes for easier iteration

    for i, val in enumerate(unique_values):
        axes[i].plot(t[t==val], '.', label="Actual")
        axes[i].plot(y[t==val], 'x', label="Predicted")
        axes[i].set_ylabel(f"Actual Symboling = {val}")
        axes[i].grid(alpha=0.3)
    # Hide any unused subplots
    for j in range(len(unique_values), len(axes)):
        axes[j].axis("off")

    plt.tight_layout()

When finishing the function, you can call the function as below:

In [None]:
evaluate(y, t_test.to_numpy())

Can you guess what the values we access from our `model` below correspond to? These values are the weights and the intercept.

In [None]:
model.coef_

In [None]:
model.intercept_

## Weight Observation

The previous cells print out some parameters from our model, specifically the weights and y-intercept (bias value) for our linear model. That's cool, but what do these parameters tell us?

Often, the weights contain meaningful information for understanding the data and the machine learning model. For instance, as we'll see in the following figure as well, the weights can inform us about how much each feature/variable contributes to the model's predictions.

<div id='TODO10'></div>

### TODO 10 (5 points)
Let's make a bar plot that plots the values of our weights and bias. To make presentation as informative as possible, we'll also add the values on top of the bar chart.

1. Read the code below and find the TODO comment. Add one line of code to create a bar plot using matplotlib `plt.bar()` function to plot the values of the weights/bias stored in `w`. Store the output into `rects`.

2. Call the `autolabel()` function passing the correct arguments to plot the values of the weights/bias above the bars in the bar plot.

In [None]:

# print the value text over the bar
# https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/barchart.html
def autolabel(ax, rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:0.3f}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom', color='blue')

def show_weights(model, names):

    # combine both the coefficients and intercept to present
    w = np.append(model.coef_, model.intercept_)

    names = list(names) + ['bias/intercept']

    plt.figure(figsize=(12,3))

    # TODO: create bar chart to present the weights
    rects = plt.bar(range(len(w)), w, color='skyblue')

    ax = plt.gca()
    ax.set_xticks(range(len(w)))
    ax.set_xticklabels(names, rotation = 90)

    # TODO: call the autolabel function
    autolabel(ax, rects)
    plt.title("Model Weights and Bias")
    plt.ylabel("Value")
    plt.tight_layout()

In [None]:
show_weights(model, df.columns.values[:-1]) # We dropped the last column in TODO 5, so here we drop the name of the last column.

It looks like `fuel-type` has the highest contribution, along with a large bias. But let us hold off on this interpretation. Remember that the scales of the input variables are not similar, which means the current interpretation is likely **misleading**.

For now, let us observe/analyze without considering input scaling.
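To see why unscaled inputs make raw weights hard to compare, consider the following sketch on synthetic data (the features `small` and `large` and the helper `fit_weights` are made up for illustration). Both features contribute about equally to the target, yet their raw weights differ by a factor of roughly 1000 until the inputs are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
small = rng.normal(size=N)               # feature on a unit scale
large = 1000.0 * rng.normal(size=N)      # feature on a scale about 1000x larger
t_toy = 1.0 * small + 0.001 * large      # both terms vary by a similar amount

def fit_weights(X, t):
    """Least-squares weights (bias fitted but not returned)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(X1, t, rcond=None)
    return w[:-1]

X_raw = np.column_stack([small, large])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # zero mean, unit variance

w_raw = fit_weights(X_raw, t_toy)
w_std = fit_weights(X_std, t_toy)
print(w_raw)   # roughly [1.0, 0.001]: the second feature looks unimportant
print(w_std)   # two weights of comparable size after standardization
```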

# Regularized Linear Models

## Ridge Regression

Now let's repeat everything we just did, but this time using linear regression with regularization. To start off, let's try ridge regression and see if the results differ at all!
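For intuition about what the regularization strength does, here is a minimal sketch of the standard closed-form ridge solution $w = (X^T X + \alpha I)^{-1} X^T T$ on synthetic data. The helper `ridge_weights` and the toy arrays are our own illustration. Unlike scikit-learn's `Ridge`, this sketch fits no intercept, so it is not a drop-in replacement.

```python
import numpy as np

def ridge_weights(X, t, alpha):
    """Closed-form ridge solution without a bias term."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)   # alpha * I penalizes large weights
    return np.linalg.solve(A, X.T @ t)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 4))               # hypothetical inputs
t_toy = X_toy @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=30)

norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_weights(X_toy, t_toy, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, norms[-1])                    # the weight norm shrinks as alpha grows
```

With `alpha = 0` the solution reduces to ordinary least squares; larger values pull the weights toward zero.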

<div id='TODO11'></div>

### TODO 11-1 (5 points)

1. Create an instance using the proper Sklearn class for conducting ridge regression. Store the output into `model`.
    1. Hint: Make sure to import and use the right class. You can find the correct class to import and use by referring to the slides.
2. Train the ridge regression model using the `X_train` and `t_train`.
3. Evaluate the model by computing the test scores using the `score()` method with the `X_test` data and `t_test` targets. Store this value into `test_score`.
4. Using `model`, make a prediction using `predict()` method taking `X_test` as input. Store the output inside `y`
5. Call the `evaluate()` function and observe the results.
6. Call the `show_weights()` function and analyze the weights.

In [None]:
from sklearn.linear_model import Ridge

# 1) initialize
model = Ridge(alpha=0.5) 

# 2) train the model
model.fit(X_train, t_train)

# 3) evaluate
test_score = model.score(X_test, t_test)

print("Test score: ", test_score)

In [None]:
todo_check([
    ('test_score > 0.30', '`test_score` is below .30, try adjusting the `alpha` value.')
])

In [None]:
# 4) Predict the values
y = model.predict(X_test)

In [None]:
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())

In [None]:
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

## Lasso Regression

Now let's try lasso regression!


### TODO 11-2 (5 points)

1. Combine all the steps and modify the code to use Lasso. You can simply copy the four code cells above and modify them to repeat the same process, but with Lasso this time.

    1. Create an instance using the proper Sklearn class for conducting lasso regression. Store the output into `model`.
        1. Hint: Make sure to import and use the right class. You can find the correct class to import and use by referring to the slides.
    2. Train the lasso regression model using the `X_train` and `t_train`.
    3. Evaluate the model by computing the test scores using the `score()` method with the `X_test` data and `t_test` targets. Store this value into `test_score`.
    4. Using `model`, make a prediction using `predict()` method taking `X_test` as input. Store the output inside `y`
    5. Call the `evaluate()` function and observe the results.
    6. Call the `show_weights()` function and analyze the weights.

In [None]:
from sklearn.linear_model import Lasso

# 1) initialize
model = Lasso(alpha=0.1)
# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
print("Test score: ", test_score)

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

In [None]:
todo_check([
    ('test_score > 0.40', '`test_score` is below .40, try adjusting the `alpha` value.')
])

## Elastic Net

### TODO 11-3 (5 points)

1. Combine the whole process and modify the code to use Elastic Net. You can simply copy the above cell and modify it to repeat the same process, this time with Elastic Net.

    1. Create an instance using the proper Sklearn class for elastic net. Store the output into `model`.
        1. Hint: Make sure to import and use the right class. You can find the correct class to import and use by referring to the slides.
    2. Train the elastic net model using `X_train` and `t_train`.
    3. Evaluate the model by computing the test score using the `score()` method with the `X_test` data and `t_test` targets. Store this value into `test_score`.
    4. Using `model`, make a prediction using the `predict()` method with `X_test` as input. Store the output inside `y`.
    5. Call the `evaluate()` function and observe the results.
    6. Call the `show_weights()` function and analyze the weights.
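
Elastic Net combines an L1 (lasso-style) and an L2 (ridge-style) penalty. In scikit-learn's `ElasticNet`, `alpha` sets the overall strength and `l1_ratio` sets the mix; the documented penalty term is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The sketch below computes only this penalty term in plain Python; `elastic_net_penalty` and the example weights are hypothetical.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term of the elastic net objective (data-fit term omitted)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


w = [1.0, -2.0, 0.0]  # hypothetical weight vector
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5))  # even mix
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0))  # pure L1 (lasso-like)
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0))  # pure L2 (ridge-like)
```

Varying `l1_ratio` between 0 and 1 is therefore a way to move between ridge-like and lasso-like behavior while keeping `alpha` fixed.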

In [None]:
from sklearn.linear_model import ElasticNet

# 1) initialize
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
print("Test score: ", test_score)

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

In [None]:
todo_check([
    ('test_score > 0.40', '`test_score` is below .40, try adjusting the `alpha` value.')
])

## Stochastic Gradient Descent

### TODO 11-4 (5 points)

1. Combine the whole process and modify the code to use SGD. You can simply copy the above cell and modify it to repeat the same process, this time with SGD.

    1. Create an instance using the proper Sklearn class for conducting SGD. Store the output into `model`.
        1. Hint: Make sure to import and use the right class. You can find the correct class to import and use by referring to the slides.
    2. Train the SGD model using `X_train` and `t_train`.
    3. Evaluate the model by computing the test score using the `score()` method with the `X_test` data and `t_test` targets. Store this value into `test_score`.
    4. Using `model`, make a prediction using the `predict()` method with `X_test` as input. Store the output inside `y`.
    5. Call the `evaluate()` function and observe the results.
    6. Call the `show_weights()` function and analyze the weights.
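
Gradient-based training such as SGD is sensitive to feature scale: with a fixed learning rate, a feature with large values can make every update overshoot, so the weights grow instead of converging. The sketch below shows the effect with plain full-batch gradient descent on one feature and no intercept, which is a simplification of what `SGDRegressor` does. The data, the learning rate, and the helper names are assumptions for illustration. Standardizing the feature (scikit-learn offers `StandardScaler` for this) lets the same learning rate converge.

```python
def gradient_descent_1d(xs, ts, lr=0.1, steps=20):
    """Fit t ~ w * x (no intercept) by gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / n
        w -= lr * grad
    return w


def standardize(xs):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]


# Toy data (hypothetical): one large-valued feature and small targets
xs = [2000.0, 2500.0, 3000.0, 3500.0]
ts = [-1.0, 0.0, 1.0, 2.0]

w_raw = gradient_descent_1d(xs, ts)                  # overshoots on every step
w_scaled = gradient_descent_1d(standardize(xs), ts)  # settles near a fixed value

print("raw feature weight:   ", w_raw)
print("scaled feature weight:", w_scaled)
```

A weight that has blown up like `w_raw` produces predictions far from the targets, which is one plausible way to end up with an extremely negative test score.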

In [None]:
from sklearn.linear_model import SGDRegressor

# 1) initialize
model = SGDRegressor(max_iter=1000, tol=1e-3, penalty='l2', alpha=0.03, random_state=42)

# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
print("Test score: ", test_score)

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

In [None]:
todo_check([
    ('test_score < 0', '`test_score` should be an extreme negative value')
])

<div id="TODO12"></div>

### TODO 12 (5 points)

Write your thoughts about the plots above. Pay attention to the values of the weights, the predictions, and the test score we got.

**DO NOT WRITE YOUR ANSWER IN THIS CELL!**

`ANSWER: General observations
- For Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net, the points in the predicted vs. actual symboling (target) plot lie roughly along the diagonal.
- Their test scores are poor, all below 0.5.
- For Stochastic Gradient Descent, the red line appears horizontal and the test score is negative.
Test scores and model comparison
- Linear Regression: test score = 0.3879966145620023
- Ridge: test score = 0.3888197852975577
- Lasso Regression: test score = 0.4277618447604027
- Elastic Net: test score = 0.45726770065887723
- Stochastic Gradient Descent: test score = -6.261884166999523e+34
Linear Regression and Ridge perform similarly, and Lasso and Elastic Net have close test scores. SGD has an extremely negative test score.
Model weights and bias: in Linear Regression, fuel-type seems to be the most significant of the features.
The bias/intercept seems to be higher for Lasso Regression.
`

# Streamlit

<div id='TODO13'> </div>

### TODO 13 (10 points)

Let us use Streamlit to compare the models with `evaluate` and `show_weights`.
Your final interface does not need to match the expected output exactly, but to receive credit, you need to:
1. Include all the models we practiced.
2. Make the parameters controllable so you can observe the effects of choosing different values interactively.
3. Use both `evaluate` and `show_weights` at minimum.
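
One possible way to approach requirement 2 is to let sidebar widgets choose the model and its parameters, then build the model from those choices. The sketch below shows only the mapping from UI selections to constructor keyword arguments, in plain Python so it can be run on its own. The function `model_kwargs`, the model labels, and the slider ranges are assumptions for illustration, and the Streamlit wiring is given as comments.

```python
def model_kwargs(name, alpha, l1_ratio):
    """Map the UI selections to constructor keyword arguments.

    The model names are labels chosen for this sketch.
    """
    if name == "Linear Regression":
        return {}  # no regularization parameter to tune
    if name in ("Ridge", "Lasso"):
        return {"alpha": alpha}
    if name == "Elastic Net":
        return {"alpha": alpha, "l1_ratio": l1_ratio}
    if name == "SGD":
        return {"alpha": alpha, "penalty": "l2", "max_iter": 1000,
                "tol": 1e-3, "random_state": 42}
    raise ValueError(f"unknown model: {name}")


# Inside the Streamlit script, the wiring could look like this (assumed layout):
#   MODELS = {"Linear Regression": LinearRegression, "Ridge": Ridge,
#             "Lasso": Lasso, "Elastic Net": ElasticNet, "SGD": SGDRegressor}
#   name = st.sidebar.selectbox("Model", list(MODELS))
#   alpha = st.sidebar.slider("alpha", 0.01, 2.0, 0.1)
#   l1_ratio = st.sidebar.slider("l1_ratio", 0.0, 1.0, 0.5)
#   model = MODELS[name](**model_kwargs(name, alpha, l1_ratio))
#   model.fit(X_train, t_train)

print(model_kwargs("Elastic Net", alpha=0.1, l1_ratio=0.5))
```

Because Streamlit reruns the script whenever a widget changes, the plots from `evaluate` and `show_weights` would refresh with each new parameter value.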

In [None]:
%%writefile automobile_linearReg.py

import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
# TODO: Import all the linear regression models that you used here
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor


# SETTING PAGE CONFIG TO WIDE MODE
st.set_page_config(layout="wide")

# LOADING DATA
DATA_URL = (
    "https://archive.ics.uci.edu/static/public/10/data.csv"
)

"""
# 1985 Auto Imports Database

Abstract: This data set consists of three types of entities:
        (a) the specification of an auto in terms of various characteristics,
        (b) its assigned insurance risk rating,
        (c) its normalized losses in use as compared to other cars. The second rating
        corresponds to the degree to which the auto is more risky than its price indicates.
        Cars are initially assigned a risk factor symbol associated with its
        price.   Then, if it is more risky (or less), this symbol is
        adjusted by moving it up (or down) the scale.  Actuarians call this
        process "symboling".  A value of +3 indicates that the auto is
        risky, -3 that it is probably pretty safe.
"""

@st.cache_data
def load_data(nrows):
    # TODO: Import the data, preprocess it, and separate it into training and testing data
    df = pd.read_csv(DATA_URL, nrows=nrows)

    # Replace the missing values (NaN) with the most frequent value in each column
    imputer = SimpleImputer(strategy='most_frequent')
    df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    "Dataframe", df.head()

    # The imputer returns object columns, so convert the numeric ones back to numbers
    numeric_columns = ['normalized-losses', 'num-of-doors', 'wheel-base', 'length', 'width', 'height',
                    'curb-weight', 'num-of-cylinders', 'engine-size', 'bore', 'stroke', 'compression-ratio',
                    'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price', 'symboling']

    df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

    # Convert the string columns in df to integer codes using LabelEncoder.
    # The columns with string data are listed in strings_list.
    strings_list = ['make', 'fuel-type', 'aspiration', 'body-style',
                'drive-wheels', 'engine-location', 'engine-type', 'fuel-system']

    label_encoder = preprocessing.LabelEncoder()
    for col in strings_list:
        df[col] = label_encoder.fit_transform(df[col])

    # Let's look at df
    "DataFrame df now: ", df


    T = df.loc[:, 'symboling'].copy() # copy the target into T
    X = df.iloc[:, :-1].copy()         # copy the feature into X
    N = df.shape[0]
    "X.shape", X.shape

    # Split the data into training and testing sets (80/20 split)
    X_train, X_test, t_train, t_test = train_test_split(X, T, test_size=0.2, random_state=0)
    X.shape, X_train.shape, X_test.shape, t_train.shape, t_test.shape

    return df, X_train, X_test, t_train, t_test
    
df, X_train, X_test, t_train, t_test = load_data(100000)


"## Summary"
st.dataframe(df.describe())


#################### functions

def evaluate(y, t):
    fig = plt.figure(figsize=(10,10))

    # Paste the corresponding part of your evaluate() function
    # t vs y plot
    plt.subplot(3,3, 1)
    # TODO: add the first plot
    plt.scatter(t, y, alpha=0.7, marker = '.')  # use the t argument, not the global t_test
    # dashed diagonal line
    plt.plot([-3,3], [-3,3], 'r--')
    plt.xlabel("target")
    plt.ylabel("Predicted")
    plt.title("Plot-1")

    # all value comparison
    plt.subplot(3,2, 2)
    # TODO: add the second one
    plt.plot(t, '.')  # t is already a NumPy array
    plt.plot(y, 'x')
    plt.xlabel("samples")
    plt.ylabel("symboling")
    plt.title("Plot-2")
    st.pyplot(fig)  

    # subplots comparing actual and predicted values for each symboling value
    # TODO: add the third subplots
    unique_values = np.unique(t)  # sorted unique target values
    fig , axes = plt.subplots(2, 3, figsize=(10,6)) 
    axes = axes.ravel()  # Flatten the axes for easier iteration

    for i, val in enumerate(unique_values):
        axes[i].plot(t[t==val], '.', label="Actual")
        axes[i].plot(y[t==val], 'x', label="Predicted")
        axes[i].set_ylabel(f"Actual Symboling = {val}")
        axes[i].grid(alpha=0.3)
    # Hide any unused subplots
    for j in range(len(unique_values), len(axes)):
        axes[j].axis("off")
    fig.tight_layout()
    st.pyplot(fig)

# print the value text over the bar
# https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/barchart.html
def autolabel(ax, rects):
    # Paste the corresponding part of your autolabel() function
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:0.3f}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom', color='blue')
            

def show_weights(model, names):

    # combine both the coefficients and intercept to present
    w = np.append(model.coef_, model.intercept_)

    fig = plt.figure(figsize=(12,3))

    # Paste the corresponding part of your show_weights() function
    names = list(names) + ['bias/intercept']

    # create bar chart to present the weights
    rects = plt.bar(range(len(w)), w, color='skyblue')

    ax = plt.gca()
    ax.set_xticks(range(len(w)))
    ax.set_xticklabels(names, rotation = 90)

    # TODO: call the autolabel function
    autolabel(ax, rects)
    plt.title("Model Weights and Bias")
    plt.ylabel("Value")
    #plt.tight_layout()

    st.pyplot(fig)
####################


st.divider()
# TODO: Add your code to observe different models
'''
# Linear Regression
'''
np.random.seed(0)

# 1) initialize
# create a linear regression object
model = LinearRegression()
# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
"Test score: ", test_score
# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

st.divider()

'''
# Ridge Regression
Now let's repeat everything we just did, this time using linear regression with regularization.
To start off, let's try ridge regression and see whether the results differ at all!
'''
# 1) initialize
model = Ridge(alpha=1.0) 

# 2) train the model
model.fit(X_train, t_train)

# 3) evaluate
test_score = model.score(X_test, t_test)
"Test score: ", test_score

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

st.divider()

'''
# Lasso Regression
'''
# 1) initialize
model = Lasso(alpha=0.1)
# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
"Test score: ", test_score

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

st.divider()

'''
# Elastic Net
'''
# 1) initialize
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
"Test score: ", test_score

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])
st.divider()

'''
# Stochastic Gradient Descent
'''
# 1) initialize
model = SGDRegressor(max_iter=1000, tol=1e-3, penalty='l2', alpha=0.03, random_state=42)

# 2) train the model
model.fit(X_train, t_train)
# 3) evaluate
test_score = model.score(X_test, t_test)
"Test score: ", test_score

# 4) Predict the values
y = model.predict(X_test)
# 5) Plot actual vs predicted graphs
evaluate(y, t_test.to_numpy())
# 6) Plot the weights
show_weights(model, df.columns.values[:-1])

st.divider()



### Expected Output:


![image.png](https://webpages.charlotte.edu/mlee173/teach/ml/images/class/lab3-streamlit.png)

<div id="feedback"></div>

## Feedback (2 points)

Did you enjoy the lab?

Please take time to answer the following feedback questions to help us further improve these labs! Your feedback is crucial to making these labs more useful!
    


* How do you rate the overall experience in this lab? (5-point Likert scale, i.e., 1 - poor ... 5 - amazing)  
Why do you think so? What was most/least useful?



`ANSWER: On a Likert scale of 1 to 5, I think this lab is a 5 (amazing) because it gives step-by-step instructions. It builds on the previous lab on processing the data, then takes each linear regression model and shows how it is used in a practical way with sklearn.`

* What did you find difficult about the lab? Were there any TODOs that were unclear? If so, what specifically did not make sense about them?



`ANSWER: The loading of the data was misleading. I first downloaded the data by clicking the link that was given, and I realized that the format of the data I had was not the same as the results presented in the lab. It was not until I got to the Streamlit section that I saw I could load it from the URL.`

* Which concepts, if any, within the lab do you feel could use more explanation?

`ANSWER: Stochastic Gradient Descent was a little unclear to me. I also had trouble interpreting the plots.`