**Name:** Fahim Shahriar Eram

**ID:** 2022523

#### **Polynomial Regression**

In this assignment, you will implement polynomial regression and apply it to the [Assignment 4 Dataset](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/mirsazzathossain/CSE317-Lab-Numerical-Methods/blob/main/datasets/data.csv).

The dataset contains two columns: the first column is the feature and the second column is the label. The goal is to find the best fit line for the data.

You will need to perform the following regression tasks and find the best one for the dataset.

1.    **Linear Regression:**

     The equation we are trying to fit is:
     $$y = \theta_0 + \theta_1 x$$
     where $x$ is the feature and $y$ is the label.

     We can rewrite the equation in vector form as:
     $$Y = X\theta$$
     where $X$ is a matrix with two columns (the first is all 1s and the second is the feature), $Y$ is a vector of the labels, and $\theta$ is a vector with two elements, $\theta_0$ and $\theta_1$. The $X$ matrix will look like this:
     $$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
2. **Quadratic Regression:**

     The equation we are trying to fit is:
     $$y = \theta_0 + \theta_1 x + \theta_2 x^2$$
     where $x$ is the feature and $y$ is the label.

     We can rewrite the equation in vector form as:
     $$Y = X\theta$$
     where $X$ is a matrix with three columns (the first is all 1s, the second is the feature, and the third is the feature squared), $Y$ is a vector of the labels, and $\theta$ is a vector with three elements, $\theta_0$, $\theta_1$, and $\theta_2$. The $X$ matrix will look like this:
     $$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}$$
3. **Cubic Regression:**

     The equation we are trying to fit is:
     $$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$
     where $x$ is the feature and $y$ is the label.

     We can rewrite the equation in vector form as:
     $$Y = X\theta$$
     where $X$ is a matrix with four columns (the first is all 1s, the second is the feature, the third is the feature squared, and the fourth is the feature cubed), $Y$ is a vector of the labels, and $\theta$ is a vector with four elements, $\theta_0$, $\theta_1$, $\theta_2$, and $\theta_3$. The $X$ matrix will look like this:
     $$X = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 \end{bmatrix}$$
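
The three $X$ matrices above differ only in how many powers of the feature they contain, so they can all be built by one helper. The following is a minimal sketch; the function name `design_matrix` and the values in `x_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np

def design_matrix(x, degree):
    """Return the matrix whose columns are x**0, x**1, ..., x**degree."""
    x = np.asarray(x, dtype=float).ravel()
    return np.column_stack([x ** k for k in range(degree + 1)])

# Hypothetical feature values, used only to show the shape of each matrix.
x_demo = np.array([1.0, 2.0, 3.0])

X_linear = design_matrix(x_demo, 1)      # columns: 1, x
X_quadratic = design_matrix(x_demo, 2)   # columns: 1, x, x^2
X_cubic = design_matrix(x_demo, 3)       # columns: 1, x, x^2, x^3

print(X_cubic)
```

Each row corresponds to one data point, and the first column is always 1 so that $\theta_0$ acts as the intercept.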

Take 15 data points from the dataset and use them as the training set. Use the remaining data points as the test set. For each regression task, find the best $\theta$ vector using the training set. Then, calculate the mean squared error (MSE) on the test set. Plot the training set, the test set (in a different color), and the best fit line for each regression task. Which regression task gives the best fit line? Which regression task gives the lowest MSE on the test set? Report your answers in a Markdown cell.

**Note:** Do not use any built-in functions like `np.polyfit` or `sklearn.linear_model.LinearRegression` or any other built-in functions that perform polynomial regression. You must implement the regression tasks yourself.
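
One way to read the task above is: fit $\theta$ on the training rows only, then measure the squared prediction error on the held-out rows. The sketch below follows that reading using the least-squares normal equations $(X^T X)\theta = X^T Y$. It is an illustration under stated assumptions, not the notebook's own code: the data is synthetic (a stand-in for `data.csv`), the helper names are hypothetical, and `np.linalg.solve` is used here as a general linear solver where the cells below use `np.linalg.inv`.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns are x**0, x**1, ..., x**degree."""
    x = np.asarray(x, dtype=float).ravel()
    return np.column_stack([x ** k for k in range(degree + 1)])

def fit_theta(X, y):
    """Least-squares theta from the normal equations (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def mse(X, y, theta):
    """Mean squared error of the predictions X @ theta against y."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2))

# Synthetic stand-in for the dataset (assumption: 20 rows, one feature).
rng = np.random.default_rng(0)
x_all = np.linspace(-2.0, 2.0, 20)
y_all = 1.0 + 2.0 * x_all - 0.5 * x_all ** 3 + rng.normal(0.0, 0.1, size=20)

# 15 training points, the remaining 5 as the test set.
order = rng.permutation(20)
train_idx, test_idx = order[:15], order[15:]

results = {}
for degree in (1, 2, 3):
    X_train = design_matrix(x_all[train_idx], degree)
    X_test = design_matrix(x_all[test_idx], degree)
    theta = fit_theta(X_train, y_all[train_idx])      # fit on training data only
    results[degree] = mse(X_test, y_all[test_idx], theta)
    print(f"degree {degree}: test MSE = {results[degree]:.4f}")
```

The key point is that the test set is never used to estimate $\theta$; it only enters through the predictions `X_test @ theta`.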

**Importing necessary packages**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

*Uploading Dataset*

In [None]:
!unzip /content/data.csv.zip

Archive:  /content/data.csv.zip
replace data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


*Loading Dataset*

In [None]:
myData = pd.read_csv("data.csv", header=None)
myData.columns = ["Feature", "Label"]

*Separating Data*

In [None]:
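features = myData["Feature"]
label = myData["Label"]

# Design matrix: a column of ones (intercept) next to the feature column
oneMatrix = np.ones((len(features), 1))
X = features.to_numpy().reshape(-1, 1)
X = np.append(oneMatrix, X, axis=1)
# Labels as a column vector
Y = label.to_numpy().reshape(-1, 1)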

*Splitting Data*

In [None]:
train_X, test_X = np.split(X, [int(0.75 * len(X))], axis = 0)
train_Y, test_Y = np.split(Y, [int(0.75 * len(Y))], axis = 0)
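
The split above takes the first 75% of the rows in file order. If the rows happen to be sorted by feature value (an assumption that has not been checked here), the test set would cover only the upper end of the range. A shuffled split is one possible alternative; the sketch below uses a hypothetical helper `shuffled_split` and made-up demo arrays.

```python
import numpy as np

def shuffled_split(X, Y, n_train, seed=0):
    """Shuffle row indices once, then split X and Y with the same indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# Hypothetical stand-ins for X and Y with 20 rows each.
X_demo = np.column_stack([np.ones(20), np.arange(20.0)])
Y_demo = np.arange(20.0).reshape(-1, 1)

tr_X, te_X, tr_Y, te_Y = shuffled_split(X_demo, Y_demo, n_train=15)
print(tr_X.shape, te_X.shape)
```

Using one index permutation for both arrays keeps each feature row paired with its label, and the fixed seed makes the split reproducible.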

*Accuracy of Linear Regression*

In [None]:
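# Normal-equation fit on the training set: beta1 = (X^T X)^{-1} X^T Y
R = np.matmul(np.transpose(train_X), train_X)
R_INV = np.linalg.inv(R)
Q = train_X
Q_T = np.transpose(Q)
beta1 = np.matmul(R_INV, Q_T).dot(train_Y)

# The same fit repeated on the test set
R = np.matmul(np.transpose(test_X), test_X)
R_INV = np.linalg.inv(R)
Q = test_X
Q_T = np.transpose(Q)
beta2 = np.matmul(R_INV, Q_T).dot(test_Y)

# Mean squared difference between the two coefficient vectors
# (this is not the prediction error of beta1 on the test set)
MSE_Linear = round(np.square(np.subtract(beta1,beta2)).mean(), 3)

print("{0}%".format(MSE_Linear))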

38.287%


*Plotting Linear Regression*

In [None]:
# x_axis1 = train_X[:,1]
# x_axis2 = test_X[:,1]

# plt.scatter(x_axis1, train_Y, color = "purple")
# plt.scatter(x_axis2, test_Y, color = "green")

# plt.show()
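
The assignment asks for the training set, the test set in a different color, and the fitted line on one plot. The sketch below shows one way to draw that for any polynomial degree. It is a hedged illustration: the helper `plot_fit` and all demo values (including `theta_demo`) are hypothetical, and the `Agg` backend line is only there so the snippet runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend; this line can be dropped in a notebook
import matplotlib.pyplot as plt

def plot_fit(train_x, train_y, test_x, test_y, theta, title):
    """Scatter train and test points and draw the polynomial defined by theta."""
    theta = np.ravel(theta)  # accepts a column vector such as beta1
    fig, ax = plt.subplots()
    ax.scatter(train_x, train_y, color="purple", label="train")
    ax.scatter(test_x, test_y, color="green", label="test")
    # Evaluate the fitted polynomial on a dense grid over the whole feature range.
    lo = min(np.min(train_x), np.min(test_x))
    hi = max(np.max(train_x), np.max(test_x))
    grid = np.linspace(lo, hi, 200)
    powers = np.column_stack([grid ** k for k in range(len(theta))])
    ax.plot(grid, powers @ theta, color="black", label="fit")
    ax.set_xlabel("Feature")
    ax.set_ylabel("Label")
    ax.set_title(title)
    ax.legend()
    return fig

# Hypothetical demo values, not taken from data.csv.
train_x = np.array([0.0, 1.0, 2.0, 3.0])
train_y = np.array([1.0, 3.1, 4.9, 7.2])
test_x = np.array([4.0, 5.0])
test_y = np.array([9.0, 11.1])
theta_demo = np.array([1.0, 2.0])  # intercept 1, slope 2

fig = plot_fit(train_x, train_y, test_x, test_y, theta_demo, "Linear Regression")
```

In the notebook, the same helper could be called with `train_X[:, 1]`, `train_Y`, `test_X[:, 1]`, `test_Y`, and a fitted coefficient vector; the number of entries in `theta` determines the degree of the curve that is drawn.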

*Accuracy of Quadratic Regression*

In [None]:
# Append the squared feature as a third column
quadratic_train = np.append(train_X, np.vstack(train_X[:,1]**2), axis=1)
quadratic_test =  np.append(test_X, np.vstack(test_X[:,1]**2), axis=1)

# Normal-equation fit on the training set
R = np.matmul(np.transpose(quadratic_train), quadratic_train)
R_INV = np.linalg.inv(R)
Q = quadratic_train
Q_T = np.transpose(Q)
beta3 = np.matmul(R_INV, Q_T).dot(train_Y)

# The same fit repeated on the test set
R = np.matmul(np.transpose(quadratic_test), quadratic_test)
R_INV = np.linalg.inv(R)
Q = quadratic_test
Q_T = np.transpose(Q)
beta4 = np.matmul(R_INV, Q_T).dot(test_Y)

# Mean squared difference between the two coefficient vectors
# (this is not the prediction error of beta3 on the test set)
MSE_Quadratic = round(np.square(np.subtract(beta3,beta4)).mean(), 3)

print("{0}%".format(MSE_Quadratic))

11.206%


*Plotting of Quadratic Regression*

*Accuracy of Cubic Regression*

In [None]:
# Append the cubed feature as a fourth column
cubic_train = np.append(quadratic_train, np.vstack(train_X[:,1]**3), axis=1)
cubic_test =  np.append(quadratic_test, np.vstack(test_X[:,1]**3), axis=1)

# Normal-equation fit on the training set
R = np.matmul(np.transpose(cubic_train), cubic_train)
R_INV = np.linalg.inv(R)
Q = cubic_train
Q_T = np.transpose(Q)
beta5 = np.matmul(R_INV, Q_T).dot(train_Y)

# The same fit repeated on the test set
R = np.matmul(np.transpose(cubic_test), cubic_test)
R_INV = np.linalg.inv(R)
Q = cubic_test
Q_T = np.transpose(Q)
beta6 = np.matmul(R_INV, Q_T).dot(test_Y)

# Mean squared difference between the two coefficient vectors
# (this is not the prediction error of beta5 on the test set)
MSE_Cubic = round(np.square(np.subtract(beta5,beta6)).mean(), 3)

print("{0}%".format(MSE_Cubic))

4.206%


*Plotting of Cubic Regression*

**Conclusion:**

-

- The cubic regression task gives the lowest MSE (4.206%) on the test set, which is far lower than the MSE of the quadratic regression task (11.206%) and the MSE of the linear regression task (38.287%).