## Linear regression for missing values

### Overview

In this Notebook, the main objective is to use a linear regression model to make predictions that you can use to fill in missing values in a dataset. The procedure is the same, however, you are using one of the features as the "target" instead of what you may normally think of as the target for that particular dataset. By the end of this lab you should have:

- gained experience manipulating dataframes with Pandas
- an initial understanding of how missing data is represented
- applied a linear regression model to fill in missing data   

### Instructions

Please execute the following steps using a mixture of base python, NumPy, sklearn, Pandas, and/or matplotlib:

1. Find a dataset
    - I would suggest looking [here](https://archive.ics.uci.edu/ml/datasets.php) for **regression** datasets
    - The dataset for this lab does not have to be complicated, but it should meet the following criteria:
        - have at least 100 samples/rows
        - have at least 4 numeric features
    - if necessary, categorical features can simply be dropped from the dataset
2. Import the data as a Pandas dataframe
    - Depending on the data format, you may need to consult this [page](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)
3. Verify that your data has no missing values
    - If it does have missing values, drop them from the dataset but be sure that your dataset still meets the criteria of *Step 1* above
4. Choose a single, numeric feature (not the target)
    - Replace approximately 15% of the values of this feature with `nan`, which means "not a number" and is one way to represent missing data
5. Split your dataset into 2 dataframes
    - Dataframe 1: has all `nan` values for the feature chosen in *Step 4*
    - Dataframe 2: has no `nan` values for the feature chosen in *Step 4*
6. Use *Dataframe 2* to create a linear regression model to predict the feature chosen in *Step 4* (not the usual target)
    - Split the data
    - Scale the data
    - Create the model
    - evaluate the model on the train and test sets
7. Use the model you created in *Step 6* to predict the missing values in *Dataframe 1*
    - At the end of this step, *Dataframe 1* will have the `nan` values replaced with the predictions from the model you created in *Step 6*
8. Create a final dataframe by combining *Dataframe 1* and *Dataframe 2*
    - This dataframe should have no missing values
9. Create a k nearest neighbours regressor (`k = 3`) for the dataframe you created in *Step 8*
    - Follow the usual procedures
10. Create a k nearest neighbours regressor (`k = 3`) for the original dataframe (from *Step 2* and maybe *Step 3*)
    - Follow the usual procedures
11. Is there any significant performance difference between *Step 9* and *Step 10*?


## Step 1
#### <span style="color:#003399"> Standard packages import</span>
- <span style="font-size: 100%;color:#003399"> Numpy packages are used to work on array.</span>
- <span style="font-size: 100%;color:#003399"> Pandas packages are used for working on dataframes while fetching the data from csv and while creating features and target dataframe for Machine learning models.</span>
- <span style="font-size: 100%;color:#003399"> Matplotlib packages are used for plotting/visualing the data.</span>

In [1]:
#Import standard packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Step 2
####  <span style="color:#003399">Reading the csv file and storing it into dataframe (admsn_data)</span>
-  <span style="font-size: 100%;color:#003399"> using read_csv() function of pandas to fetch the data from our csv file into a dataframe. </span>

In [2]:
#the dataset for this notebook is taken from below link
#https://www.kaggle.com/mohansacharya/graduate-admissions/data
#importing the dataset as dataframe
admsn_data = pd.read_csv("Admission_Predict.csv")

#Top 5 records of dataset
admsn_data.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


## Step 3
#### <span style="color:#003399"> Verifying that dataset has no Missing values </span>
- <span style="font-size: 100%;color:#003399"> .describe() method gives the summary of the dataframe </span>

In [3]:
#Descibe the summary of the dataset
print(admsn_data.describe())

       Serial No.   GRE Score  TOEFL Score  University Rating         SOP  \
count  400.000000  400.000000   400.000000         400.000000  400.000000   
mean   200.500000  316.807500   107.410000           3.087500    3.400000   
std    115.614301   11.473646     6.069514           1.143728    1.006869   
min      1.000000  290.000000    92.000000           1.000000    1.000000   
25%    100.750000  308.000000   103.000000           2.000000    2.500000   
50%    200.500000  317.000000   107.000000           3.000000    3.500000   
75%    300.250000  325.000000   112.000000           4.000000    4.000000   
max    400.000000  340.000000   120.000000           5.000000    5.000000   

             LOR         CGPA    Research  Chance of Admit   
count  400.000000  400.000000  400.000000        400.000000  
mean     3.452500    8.598925    0.547500          0.724350  
std      0.898478    0.596317    0.498362          0.142609  
min      1.000000    6.800000    0.000000          0.34000

- <span style="font-size: 100%;color:#003399"> .isnull() method is used to check if there is null values in dataframe and it will return boolean value (True or False) </span>
-<span style="font-size: 100%;color:#003399"> .any() method is checking where the condition is satisfying for any column, here in our case it checks if any columns has null values or not </span>

In [4]:
#Checking if data has any null values
print(admsn_data.isnull().any())


Serial No.           False
GRE Score            False
TOEFL Score          False
University Rating    False
SOP                  False
LOR                  False
CGPA                 False
Research             False
Chance of Admit      False
dtype: bool


- <span style="font-size: 100%;color:#003399"> .isna().any() method is used to check if there is any NA values in dataframe and it will return boolean value (True or False) based on the columns in dataframe</span>

In [5]:
#checking if data has any NA values for each colums
print(admsn_data.isna().any())

Serial No.           False
GRE Score            False
TOEFL Score          False
University Rating    False
SOP                  False
LOR                  False
CGPA                 False
Research             False
Chance of Admit      False
dtype: bool


## Step 4

##### <span style="color:#003399"> Assigning 15% of the values of feature "TOEFL Score" as NAN </span>
- <span style="font-size: 100%;color:#003399"> .isna().sum() method is used to count the number of NA in the column</span>

In [6]:
#creating a copy of dataset
data = admsn_data.copy()

data['TOEFL Score'] = data['TOEFL Score'].sample(frac=0.85)

In [7]:
count_null = data['TOEFL Score'].isna().sum()
#Printing the count of Null records 
print(count_null)
total = len(data['TOEFL Score'])

#Percentage of data as Null
print("{:.2f}".format(round(count_null/total,2)*100)+"%")

60
15.00%


In [8]:
#count number of rows with NA
data['TOEFL Score'].isna().sum()


60

### Step 5
#### Splitting the data into two data frame

##### <span style="color:#003399"> data_wo_null with not null values in features ("TOEFL Score") </span>

In [9]:
data_wo_null = data[data['TOEFL Score'].notnull()]
print(data_wo_null.shape)

#fetchin top records in the dataframe with not null feature
data_wo_null.head()

(340, 9)


Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118.0,4,4.5,4.5,9.65,1,0.92
1,2,324,107.0,4,4.0,4.5,8.87,1,0.76
2,3,316,104.0,3,3.0,3.5,8.0,1,0.72
3,4,322,110.0,3,3.5,2.5,8.67,1,0.8
4,5,314,103.0,2,2.0,3.0,8.21,0,0.65


##### <span style="color:#003399"> data_w_null with null values in feature ("TOEFL Score") </span>

In [10]:
data_w_null =  data[data['TOEFL Score'].isnull()]
print(data_w_null.shape)

#fetching top records in the dataframe with null feature
data_w_null.head()

(60, 9)


Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
9,10,323,,3,3.5,3.0,8.6,0,0.45
12,13,328,,4,4.0,4.5,9.1,1,0.78
27,28,298,,2,1.5,2.5,7.5,1,0.44
37,38,300,,1,1.0,2.0,7.8,0,0.58
41,42,316,,2,2.5,2.5,8.2,1,0.49


<br>

#### <span style="color:#003399"> Creating features and target for not null dataset </span>

In [11]:
#Creating feature data from not nul dataset
X_df =  data_wo_null[['GRE Score','University Rating','SOP','LOR ','CGPA','Research']]
#Shape of feature data frame
print(X_df.shape)

#fetching top records of features
X_df.head()

#Creating target data from not nul dataset
y = data_wo_null[['TOEFL Score']]

#fetching top records of target
y.head()

(340, 6)


Unnamed: 0,TOEFL Score
0,118.0
1,107.0
2,104.0
3,110.0
4,103.0


## Step 6
#### <span style="color:#003399"> Splitting the not null data into Train & Test </span>
- <span style="font-size: 100%;color:#003399"> train_test_split() method is used from sklearn package, model selection module which splits the features and target data into training and testing dataset  </span>


In [12]:
from sklearn.model_selection import train_test_split

X_train,X_test, y_train, y_test = train_test_split(X_df,y,random_state = 456)

#### <span style="color:#003399">Train, test accuracies and shape </span>

In [13]:
#Train accuracy percentage
print("Training data percentage is {:.2f}".format(len(X_train)/len(data_wo_null)*100)+" %")
#Test accuracy percentage
print("Test data percentage is {:.2f}".format((len(X_test)/len(data_wo_null))*100)+" %")

print("\n")

#Tain and test feature dataset shapes
print("Features train shape"+str(X_train.shape))
print("Features test shape"+str(X_test.shape))

print("\n")

#Tain and test target dataset shapes
print("Target train shape"+str(y_train.shape))
print("Target test shape"+str(y_test.shape))

Training data percentage is 75.00 %
Test data percentage is 25.00 %


Features train shape(255, 6)
Features test shape(85, 6)


Target train shape(255, 1)
Target test shape(85, 1)


<br>

### <span style="color:#003399">Building the model </span>
- <span style="font-size: 100%;color:#003399"> Linear Regression model belongs to the sklearn.linear_model module which is used for creating the linear models for independent and dependent variables </span>
- <span style="font-size: 100%;color:#003399"> we will create a object(lr) belonging to class LinearRegression </span>
- <span style="font-size: 100%;color:#003399"> .fit() is used to fit the training data into the model </span>

In [14]:
#Importing linear regression libraries from sklearn linear_model module
from sklearn.linear_model import LinearRegression

#Creating object of linear regression class
lr = LinearRegression()

#fitting the train data to dataset
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<br>

#### <span style="color:#003399"> Finding the co-efficients (weights) for features and the intercept </span>
- <span style="font-size: 100%;color:#003399"> .coef_ method is used to find the coefficients/weights of the independent varilables/features </span>
    - <span style="font-size: 100%;color:#003399"> The regression co-efficents in  multiple regression is the slope of relationship between our dependent and independent variables (i.e. between between the features and target). This acts as weights for each features, So model will learn and adjust the weights, so that if a feature is important and affecting the target will have a higher weight compare to the feature will is less important, so its weight would be less </span>
- <span style="font-size: 100%;color:#003399"> .intercept_ is used to find the intercept of the equation, based on which our model is built.</span>
    - <span style="font-size: 100%;color:#003399">Intercept means that even if the features(X) is Zero(0) the Target(y) will still have a value equal to intercept(W0). </span>

In [15]:
#finding the co-efficient of linear model
coeff = lr.coef_
print("Coefficients of model are "+str(coeff))
print(type(coeff))

#finding the intercept (W0 Weight) of linear model
intercpt = lr.intercept_
print("Intercept of the linear model is "+str(intercpt))

Coefficients of model are [[ 0.26437415  0.66344017  0.48855059 -0.30648806  3.41696786 -0.24953508]]
<class 'numpy.ndarray'>
Intercept of the linear model is [-8.25293833]


<br>

#### <span style="color:#003399">  As we are using 5 features X1, X2, X3, X4, X5 and 1 feature as target equation of the line will be </span>

<span style="color:blue"> **y = w0 + w1*x1 + w2 * x2 + w3 * X3 + w4 * x4 + w5 * x5** </span>

##### 'TOEFL Score' = Intercept + (coeff[0] * 'GRE Score') + (coeff[1] * 'University Rating') + (coeff[2] * SOP) + (coeff[3] * LOR) + (coeff[4] * CGPA) + (coeff[5] * Research)

<br>

#### <span style="color:#003399">  Predicting the value of the target based on the model </span>
- <span style="font-size: 100%;color:#003399"> predicting the TOEFL Score </span>
- <span style="font-size: 100%;color:#003399"> .predict() method is used to predict the target bassed on the feature dataset </span>

In [16]:
# prediciting the target based on the test
y_pred = lr.predict(X_test)

#printing first 5 predicted values
print(y_pred[0:5])

[[102.55645253]
 [106.19368093]
 [109.53277399]
 [ 96.5152292 ]
 [100.1046061 ]]


<br>

#### <span style="color:#003399"> Finding the difference between the actual and the predicted values </span>

In [17]:
#finding the difference between the actual values and the predicted values
residuals =  y_pred - y_test
residuals.head()

Unnamed: 0,TOEFL Score
300,-3.443547
17,0.193681
153,4.532774
272,1.515229
8,-1.895394


#### <span style="color:#003399"> Evaluation of the model based on the metrics </span>
- <span style="font-size: 100%;color:#003399"> r2_score() gives us information about how close the data are to the fitted regression line, or how much our predicted values are varying from our test dataset</span>
- <span style="font-size: 100%;color:#003399"> We can evaluate model using many metrics, we will focus on R2_Score for our model </span>

In [18]:
#importing the metrics packages from sklearn.metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math 

print("R^2 : \t\t\t\t {:.2f} ".format(r2_score(y_test,y_pred)))
print("Mean Squared value :\t\t {:.2f}  ".format(mean_squared_error(y_test,y_pred)))
print("Root mean squared value : \t {:.2f} ".format(math.sqrt(mean_squared_error(y_test,y_pred))))
print("Abssolute mean error : \t\t {:.2f} ".format(mean_absolute_error(y_test,y_pred)))

R^2 : 				 0.72 
Mean Squared value :		 8.25  
Root mean squared value : 	 2.87 
Abssolute mean error : 		 2.19 


<br>

## Step 7

#### <span style="color:#003399"> Predicting feature ('TOEFL Score') for null dataset based on model built </span>

#### <span style="color:#003399">  Creating feature and the target data for the dataset having NAN values for feature ("TOEFL Score") </span>

In [19]:
#Creating feature dataframe
X_null_df = data_w_null[['GRE Score','University Rating','SOP','LOR ','CGPA','Research']]
X_null_df.head()

Unnamed: 0,GRE Score,University Rating,SOP,LOR,CGPA,Research
9,323,3,3.5,3.0,8.6,0
12,328,4,4.0,4.5,9.1,1
27,298,2,1.5,2.5,7.5,1
37,300,1,1.0,2.0,7.8,0
41,316,2,2.5,2.5,8.2,1


In [20]:
#Creating target dataframe
y_null = data_w_null[['TOEFL Score']]
y_null.head()


Unnamed: 0,TOEFL Score
9,
12,
27,
37,
41,


<span style="font-size: 100%;color:#003399"> Predicting the y 'TOEFL Score' using two methods </span>

#### <span style="color:#003399">  Predicting the y i.e. our null Feature 'TOEFL Score' using model </span>

In [21]:
#Predicting using the model
y_pred_null = lr.predict(X_null_df)
y_pred_null[0:5]

array([[109.30661802],
       [112.53542097],
       [ 97.20176732],
       [ 98.25066962],
       [104.84093005]])

<br>

#### <span style="color:#003399">  Finding the values of the Null feature using the model equation </span>

In [22]:
#Calculating NAN feature ("TOEFL Score")

y_null = intercpt + (coeff.flat[0] * X_null_df['GRE Score']) + (coeff.flat[1] * X_null_df['University Rating']) + (coeff.flat[2] * X_null_df['SOP']) + (coeff.flat[3] * X_null_df['LOR ']) + (coeff.flat[4] * X_null_df['CGPA']) + (coeff.flat[5] * X_null_df['Research'])

#Top 5 predicted values from our model for features we changed to NAN
y_null[0:5]

9     109.306618
12    112.535421
27     97.201767
37     98.250670
41    104.840930
dtype: float64

<br>

##### We can see that both using the .predict() method and using the model , we get the same predicted values for feature 'TOEFL Score'

<br> <br>
####  <span style="color:#003399">  Finding the difference between the predicted value and the values from the dataset </span>

In [23]:
#find the difference between the actual values in dataset and the predicted values for NaN feature
residuals_for_null = []
for i in data_w_null['TOEFL Score'].index:
    
    residuals_for_null.append(admsn_data['TOEFL Score'].loc[i] - y_null[i])
    
#Fetching first 5 records for difference values
residuals_for_null[0:5]

[-1.3066180197880897,
 -0.5354209744994165,
 0.7982326818940209,
 6.74933038239044,
 0.15906995282495018]

<br>

## Step 8 

#### <span style="color:#003399">  Creating the dataframe after getting data from model </span>

In [24]:
#replacing the NAN with the predicted feature ['TOEFL score'] value
for i in data_w_null['TOEFL Score'].index:
    #print(i,y_null[i])
    #a[i]
    data_w_null['TOEFL Score'].loc[i] = y_null[i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [25]:
#Verifying the null dataset to check if the predicted values are updated 
print(data_w_null.shape)
data_w_null.head()

(60, 9)


Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
9,10,323,109.306618,3,3.5,3.0,8.6,0,0.45
12,13,328,112.535421,4,4.0,4.5,9.1,1,0.78
27,28,298,97.201767,2,1.5,2.5,7.5,1,0.44
37,38,300,98.25067,1,1.0,2.0,7.8,0,0.58
41,42,316,104.84093,2,2.5,2.5,8.2,1,0.49


- <span style="font-size: 115%;color:#003399"> .append() method is used to append one data frame to other (vertically) </span>

In [26]:
#appending the null with predicted values data frame with not null data frame

data_generated = data_wo_null.append(data_w_null)
#shape of newly generated data frame
print(data_generated.shape)

#Importing the date time
from datetime import datetime
current_date = datetime.now()

#creating the file name
file_name = "Admission_"+str(current_date.strftime('%m_%d_%Y'))+".csv"
print(file_name)

#creating the file path
file_path = "./"+file_name
print(file_path)

#storing the data generated into a csv
data_generated.to_csv (file_path, index = False, header = True)

(400, 9)
Admission_06_28_2020.csv
./Admission_06_28_2020.csv


## Step 9

### <span style="color:#003399"> KNN regression on the newly created data frame (not null data set + NAN data set with predicted values) </span>

- <span style="font-size: 100%;color:#003399"> KNeighbourRegressor is machine learning model which we use to predict our taget  value based on the feature data, It belongs to sklearn.model_selection module  </span>

- <span style="font-size: 100%;color:#003399">Features : 'GRE Score','TOEFL Score','University Rating','SOP','LOR ','CGPA','Research' </span>
- <span style="font-size: 100%;color:#003399"> Target : 'Chance of Admit' </span>


In [27]:
#Creating the feature and target data frame
feature_x = data_generated[['GRE Score','TOEFL Score','University Rating','SOP','LOR ','CGPA','Research']]
target_y = data_generated[['Chance of Admit ']]

#Splitting the data
from sklearn.model_selection import train_test_split

Xk_train, Xk_test, yk_train, yk_test = train_test_split(feature_x, target_y)

#Building the model without scaling
#step 1
from sklearn.neighbors import KNeighborsRegressor

# Step 2
rg = KNeighborsRegressor(n_neighbors=3)

# Step 3 fitting data to model
rg.fit(Xk_train, yk_train)

# Step 4 calculating the accuracy
r2_Score_train = rg.score(Xk_train, yk_train)
r2_Score_test = rg.score(Xk_test, yk_test)

#printing the train and test accuracies
print("Training set R2_score: {:.2f}".format(r2_Score_train*100)+"%")  
print("Test set R2_Score: {:.2f}".format(r2_Score_test*100)+"")

Training set R2_score: 80.97%
Test set R2_Score: 67.47


#### Scaling the data using the StandardScaler

In [28]:
#importing the scaler to normalize the data
from sklearn.preprocessing import StandardScaler

#Creating the scaler object
scaler_X = StandardScaler()

#Fitting the data into scaler
scaler_X.fit(Xk_train)

#Transforming the train and test features
Xk_scale_train = scaler_X.transform(Xk_train)
Xk_scale_test = scaler_X.transform(Xk_test)


In [29]:
# Step 3 fitting scaled data to model
rg.fit(Xk_scale_train, yk_train)

# Step 4 calculating r2_Score
scaled_r2_score_train = rg.score(Xk_scale_train, yk_train)
scaled_r2_score_test = rg.score(Xk_scale_test, yk_test)

#printing the train and test accuracies
print("Training set R2_score: {:.2f}".format(scaled_r2_score_train*100)+"%")  
print("Test set R2_score: {:.2f}".format(scaled_r2_score_test*100)+"%")

Training set R2_score: 85.98%
Test set R2_score: 74.49%


## Step 10 

### <span style="color:#003399">  KNN regression on the actual dataset </span>
- <span style="font-size: 100%;color:#003399">Features : 'GRE Score','TOEFL Score','University Rating','SOP','LOR ','CGPA','Research' </span>
- <span style="font-size: 100%;color:#003399"> Target : 'Chance of Admit' </span>

In [30]:
#Creating the feature and target data frame from the actual dataset
feature_xd = admsn_data[['GRE Score','TOEFL Score','University Rating','SOP','LOR ','CGPA','Research']]
target_yd = admsn_data[['Chance of Admit ']]

#splitting the dataset
from sklearn.model_selection import train_test_split

Xd_train, Xd_test, yd_train, yd_test = train_test_split(feature_xd, target_yd)

#Building the model without sclaing
#step 1
from sklearn.neighbors import KNeighborsRegressor

# Step 2
rg = KNeighborsRegressor(n_neighbors=3)

# Step 3
rg.fit(Xd_train, yd_train)

#finding the train and test accuracies
# Step 4
r2_score_train_data = rg.score(Xd_train, yd_train)
r2_score_test_data = rg.score(Xd_test, yd_test)

print("Training set R2_score: {:.2f}".format(r2_score_train_data*100)+"%")  
print("Test set R2_score: {:.2f}".format(r2_score_test_data*100)+"%")

Training set R2_score: 84.80%
Test set R2_score: 60.95%


In [31]:
#importing the scaler to normalize the data
from sklearn.preprocessing import StandardScaler

#Creating the scaler object
scaler_Xd = StandardScaler()

#Fitting the data into scaler
scaler_Xd.fit(Xd_train)

#Transforming the train and test features
Xd_scale_train = scaler_X.transform(Xd_train)
Xd_scale_test = scaler_X.transform(Xd_test)

In [33]:
# Step 3 fitting scaled data to model with scaled values
rg.fit(Xd_scale_train, yd_train)

# Step 4
scaled_r2_score_train_d = rg.score(Xd_scale_train, yd_train)
scaled_r2_Score_test_d = rg.score(Xd_scale_test, yd_test)

#printing the train and test accuracies
print("Training set r2_Score: {:.2f}".format(scaled_r2_score_train_d*100)+"%")  
print("Test set r2_score: {:.2f}".format(scaled_r2_Score_test_d*100)+"%")

Training set r2_Score: 85.54%
Test set r2_score: 77.42%


## Step 11

### Results
### <span style="color:#003399"> The difference between the accuracies of datasets. </span>



<table {margin-left: 0}>
  <thead>
    <tr>
      <th>Dataset</th>  
      <th>Train Accuracies</th>
      <th>Test Accuracies</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Actual without scaling</td>
      <td>84.80%</td>
      <td>60.95%</td>
    </tr>
    <tr>
      <td>Generated without scaling</td>
      <td>80.97%</td>
      <td>67.47%</td>
    </tr>
    <tr>
      <td>Actual with scaling</td>
      <td>85.54%</td>
      <td>77.42%</td>
    </tr>
    <tr>
      <td>Generated with scaling</td>
      <td>85.98%</td>
      <td>74.79%</td>
    </tr>
  </tbody>
</table>



<span style="font-size:100%;color:#003399"> From the above table we can compare the acuracies between the actual and generated dataset for with and without scaling operations</span>
    
<span style="font-size:100%;color:#003399">We can observe that there is a small difference between the accuracies. </span>

- Note: The notebook questions are framed by St. Clair College and this was a part of our Lab for Machine learning for understanding the ML regression concepts and applying it to a real worl scenario.