#**Supervised Machine Learning by Simple Linear Regression**

Linear regression is:



*   a basic and commonly used type of predictive analysis
*   an attempt to model the relationship between two variables by **fitting a linear equation to observed data**



#**Step 1: Data Gathering**


---


Loading the dataset

In [3]:
import pandas as pd
from google.colab import files
upload = files.upload()


Saving Grade_Set_1.csv to Grade_Set_1.csv


In [4]:
!ls

Grade_Set_1.csv  sample_data


In [7]:
data = pd.read_csv('Grade_Set_1.csv')
data

Unnamed: 0,Hours_Studied,Test_Grade,Status,Result
0,2,57,fail,D
1,3,66,fail,D
2,4,73,pass,C
3,5,76,pass,C
4,6,79,pass,C
5,7,81,pass,B
6,8,90,pass,B
7,9,96,pass,A
8,10,100,pass,A


##**Performing Basic Data Exploration**


---


Trying to understand the data

**What are the columns? How many columns are there?**

In [8]:
data.columns

# So, we are dealing with 4 columns:
# Hours_Studied, Test_Grade, Status, Result

Index(['Hours_Studied', 'Test_Grade', 'Status', 'Result'], dtype='object')

**What is the dimension of this dataset?**

In [9]:
data.shape

# We are dealing with 9 records (rows) and 4 columns

(9, 4)

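Some other basic exploratory methods include

.describe() (summary statistics for the numeric columns; pandas has no .summary() method)

.info() (column types and non-null counts)

and many more, but these are enough for now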
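As a minimal sketch of those two exploratory calls, the cell below rebuilds the nine rows shown earlier by hand (an assumption made only so the example runs without the CSV file) and calls describe() and info() on them.

```python
import pandas as pd

# Rebuild the nine rows shown above so the example runs on its own
# (in the notebook, `data` comes from pd.read_csv instead).
data = pd.DataFrame({
    "Hours_Studied": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Test_Grade": [57, 66, 73, 76, 79, 81, 90, 96, 100],
    "Status": ["fail", "fail", "pass", "pass", "pass", "pass", "pass", "pass", "pass"],
    "Result": ["D", "D", "C", "C", "C", "B", "B", "A", "A"],
})

# count, mean, std, min, quartiles and max of the numeric columns
summary = data.describe()
print(summary)

# column names, non-null counts and dtypes
data.info()
```

With mixed column types, describe() only summarises the numeric columns by default, so Status and Result do not appear in its output.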

#**Step 2 : Data Preparation**


---


We cannot pass our dataset directly into an ML algorithm.

In this stage, we are:

1.   Checking for missing values --> **removing inconsistencies in our data**
2.   Converting categorical values into numerical features --> **in this data, the values of Status (fail/pass) and Result (A/B/C/D)** are text, so they need a numerical encoding before a model can use them. Non-numerical features can carry important information, so they should be converted rather than discarded.
3.   Normalization --> bringing the features onto a common scale
4.   Selecting the dependent and independent variables






**Understanding dependent and independent variables**


---


Dependent variable --> the variable that we need to predict.

In this data, **Test_Grade** is our dependent variable, since it is the value to be predicted.

**Test_Grade** depends on **Hours_Studied**: in this data, more hours of study go with a higher grade.

Independent variable --> a variable whose value is chosen freely rather than predicted by the model.

E.g. **Hours_Studied** is up to the person: they may study for 1, 2 or 6 hours, or not study at all.
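One quick way to check the "more study, higher grade" relationship is the Pearson correlation between the two columns. The sketch below retypes the values from the table above (an assumption, so that it runs without the CSV file) and uses pandas' Series.corr.

```python
import pandas as pd

# Values copied from the dataset shown above.
hours = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10], name="Hours_Studied")
grades = pd.Series([57, 66, 73, 76, 79, 81, 90, 96, 100], name="Test_Grade")

# Pearson correlation: close to +1 means a strong positive linear relationship.
corr = hours.corr(grades)
print("Correlation between hours studied and test grade:", round(corr, 3))
```

A value close to 1 supports using Hours_Studied as the independent variable for a linear model of Test_Grade.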

##**Checking for missing values**


---


Usually .isnull().sum() is used because, with large datasets, a count of missing values per column is much easier to read than a full table of True/False values

In [11]:
data.isnull().sum()

# .isnull() marks each missing cell as True; .sum() then counts the True values per column
# A count above 0 means that column has missing values
# 0 missing values in every column, good news

Hours_Studied    0
Test_Grade       0
Status           0
Result           0
dtype: int64
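Since this dataset has no gaps, here is a small sketch of what the same check looks like when a value is missing. The `demo` frame is hypothetical: it copies the first four rows and deliberately removes one grade.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: first four rows of the data with one grade removed on purpose.
demo = pd.DataFrame({
    "Hours_Studied": [2, 3, 4, 5],
    "Test_Grade": [57, np.nan, 73, 76],
})

# Count the missing cells in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# One simple way to handle them: drop every row that has a missing value.
cleaned = demo.dropna()
print(cleaned)
```

Dropping rows is only one option; filling the gap (for example with fillna) is another, and which is appropriate depends on the data.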

##**Converting non-numerical features**


---


Machine learning models work on numbers, not text, so text columns need to be converted

Status is a categorical feature with only 2 possible values, fail and pass, so it can be encoded as 0 and 1

using the LabelBinarizer class

[usable commands for label binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)



In [13]:
from sklearn.preprocessing import LabelBinarizer

# Then, create an object for LabelBinarizer class

lb = LabelBinarizer()
data.Status = lb.fit_transform(data.Status) 
# the classes are sorted alphabetically, so 'fail' becomes 0 and 'pass' becomes 1



In [14]:
data

Unnamed: 0,Hours_Studied,Test_Grade,Status,Result
0,2,57,0,D
1,3,66,0,D
2,4,73,1,C
3,5,76,1,C
4,6,79,1,C
5,7,81,1,B
6,8,90,1,B
7,9,96,1,A
8,10,100,1,A
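To see which label maps to which number, the fitted binarizer's classes_ attribute can be inspected. The sketch below also applies a second LabelBinarizer to the Result column, which the notebook itself does not do; with more than two classes it returns one 0/1 column per class. The lists are retyped from the table above.

```python
from sklearn.preprocessing import LabelBinarizer

# Values copied from the Status and Result columns shown above.
status = ["fail", "fail", "pass", "pass", "pass", "pass", "pass", "pass", "pass"]
result = ["D", "D", "C", "C", "C", "B", "B", "A", "A"]

# Two classes: a single 0/1 column.
status_lb = LabelBinarizer()
status_encoded = status_lb.fit_transform(status)
print(status_lb.classes_)        # position in this array = encoded value
print(status_encoded.ravel())

# Four classes: one 0/1 column per class (one-hot style).
result_lb = LabelBinarizer()
result_encoded = result_lb.fit_transform(result)
print(result_lb.classes_)
print(result_encoded.shape)
```

status_lb.inverse_transform(status_encoded) would map the numbers back to the original labels.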


##**Normalization**

---

Changing the scale of multiple variables to the same range, typically [0, 1]

**IV Question : Why is normalization necessary for ML?**

To bring features with different ranges onto a common scale. Otherwise, in models that are sensitive to feature magnitude, a feature with a large range can dominate the others

But, **not every dataset requires normalization for ML. It is only required when features have different ranges**.

(For this data, normalization is not needed, since the model uses only one feature!)

[Learn more about Normalization](https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029)





**MIN-MAX scaling for normalization**

Xnorm = (X - XMIN) / (XMAX - XMIN)

1. Assume X = 57, **Normalising the 57**

XMIN = 57, XMAX = 100

Xnorm = (57-57)/(100-57) 

Xnorm = 0

2. Assume X = 100, **Normalising the 100**

Xnorm = (100-57)/ (100-57)

Xnorm = 1
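The same min-max formula can be applied to the whole Test_Grade column at once. This is a minimal NumPy sketch; `min_max_scale` is a hypothetical helper name, and the values are retyped from the dataset above.

```python
import numpy as np

# Test_Grade values copied from the dataset shown above.
grades = np.array([57, 66, 73, 76, 79, 81, 90, 96, 100], dtype=float)

def min_max_scale(values):
    """Hypothetical helper: Xnorm = (X - XMIN) / (XMAX - XMIN)."""
    return (values - values.min()) / (values.max() - values.min())

scaled = min_max_scale(grades)
print(scaled.round(3))   # smallest grade maps to 0, largest to 1
```

scikit-learn's MinMaxScaler applies the same formula column by column, which is usually more convenient when there are several features.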



---





##**Selecting Dependent and Independent Variable**

Simple linear regression fits the line y = mx + c, where

y = y-coordinate (dependent variable)

x = x-coordinate (independent variable)

m = slope

c = y-intercept

Focus on the columns that can play these roles;

in this case they are **Hours_Studied** and **Test_Grade**

In [17]:
import numpy as np

X = data.Hours_Studied.values # X is the independent variable
X = X.reshape(-1, 1)

# scikit-learn expects the independent variable as a 2D array of shape
# (n_samples, n_features), but data.Hours_Studied.values is a 1D array of shape (9,).
# reshape(-1, 1) turns it into a 2D array of shape (9, 1);
# -1 lets NumPy infer the number of rows.



In [18]:
X.shape
X

array([[ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

In [19]:
Y = data.Test_Grade.values
Y # the target can stay a 1D array, so no reshape is needed

array([ 57,  66,  73,  76,  79,  81,  90,  96, 100])
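As a side note on shapes, the sketch below compares the 1D array, the reshaped 2D array, and a third option: selecting the column with double brackets, which gives a one-column DataFrame that is already 2D. The frame is rebuilt by hand here (an assumption, so the example runs without the CSV file).

```python
import pandas as pd

# Rebuild the two numeric columns shown above.
data = pd.DataFrame({
    "Hours_Studied": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Test_Grade": [57, 66, 73, 76, 79, 81, 90, 96, 100],
})

X_1d = data["Hours_Studied"].values   # 1D array
X_2d = X_1d.reshape(-1, 1)            # 2D array, row count inferred by NumPy
X_df = data[["Hours_Studied"]]        # double brackets keep a 2D, one-column DataFrame

print(X_1d.shape, X_2d.shape, X_df.shape)
```

Either 2D form can be passed to scikit-learn as the feature input.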



---



#**Step 3 : Training the Dataset**

Choose a model to train on the dataset

--> Model chosen for this data is : **Simple Linear Regression**

Y' = mX + c


*   Y' --> predicted value
*   m --> slope/coefficient (the change in Y' for a one-unit increase in X)
*   X --> independent variable
*   c --> intercept (value of Y' when X is zero)

Create an object of the **LinearRegression** class and call its **fit** method; scikit-learn then estimates m and c from the data for us





In [23]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

**using fit**

In [24]:
lin_reg.fit(X,Y)

LinearRegression()
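After fitting, the estimated m and c are stored on the model as coef_ and intercept_. The sketch below refits the model on values retyped from the dataset (so that it runs on its own) and, as a cross-check, computes the same slope and intercept with the usual least-squares formulas for a single feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Values copied from the dataset shown above.
X = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
Y = np.array([57, 66, 73, 76, 79, 81, 90, 96, 100])

lin_reg = LinearRegression()
lin_reg.fit(X, Y)

slope = lin_reg.coef_[0]         # m in Y' = mX + c
intercept = lin_reg.intercept_   # c in Y' = mX + c
print("m =", slope, "c =", intercept)

# Cross-check with the closed-form least-squares solution for one feature.
x = X.ravel()
slope_manual = ((x - x.mean()) * (Y - Y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept_manual = Y.mean() - slope_manual * x.mean()
print("manual m =", slope_manual, "manual c =", intercept_manual)
```

The slope is the predicted gain in Test_Grade for each extra hour studied.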

##**Printing the predicted values**

In [25]:
predicted_values = lin_reg.predict(X)
predicted_values

array([59.71111111, 64.72777778, 69.74444444, 74.76111111, 79.77777778,
       84.79444444, 89.81111111, 94.82777778, 99.84444444])

##**Checking the accuracy of our model**

In [26]:
# Comparing the predicted data with the original data

data['predicted_values'] = predicted_values
data[['predicted_values','Test_Grade']]

Unnamed: 0,predicted_values,Test_Grade
0,59.711111,57
1,64.727778,66
2,69.744444,73
3,74.761111,76
4,79.777778,79
5,84.794444,81
6,89.811111,90
7,94.827778,96
8,99.844444,100


#**Evaluating model performance**

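**How well does our model fit the data?**

We use the R2 score (coefficient of determination): the proportion of the variance in the original values that the model explains. A score of 1.0 is a perfect fit.

As a rule of thumb here: score > 0.8 (80%), good model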

r2_score(OriginalValue,PredictedValue)

In [27]:
from sklearn.metrics import r2_score
accuracy = r2_score(Y,predicted_values) 
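# r2_score compares the original values with the predicted values
# and returns the R2 score (1.0 is a perfect fit)

print("Accuracy :",accuracy) 

# The score is about 0.976: the model explains about 97.6% of the variance in Test_Grade.
# Note that this is measured on the same data the model was trained on.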

Accuracy : 0.9757431074095347
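To make the score less of a black box, the sketch below recomputes R2 by hand as 1 - SS_res / SS_tot and compares it with r2_score. It also prints the mean absolute error, which is not used in the notebook but gives the typical error in grade points. The data is retyped from the dataset above so that the cell runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Values copied from the dataset shown above.
X = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
Y = np.array([57, 66, 73, 76, 79, 81, 90, 96, 100])

predicted = LinearRegression().fit(X, Y).predict(X)

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = ((Y - predicted) ** 2).sum()
ss_tot = ((Y - Y.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print("R2 (manual)  :", r2_manual)
print("R2 (sklearn) :", r2_score(Y, predicted))

# Average absolute difference between real and predicted grades, in grade points.
mae = mean_absolute_error(Y, predicted)
print("Mean absolute error:", mae)
```

With only nine rows, all of them used for training, these numbers describe how well the line fits the known data rather than how it would perform on new students.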


#**Making final predictions**

Remember, we are designing an application that **predicts our marks based on our hours of study**

Ideally we want the code to be as simple as this, but if you run it **there will be an error**

Why?

Because scikit-learn's predict method expects a **2d array** as input, but we are passing a **scalar** input

In [None]:
#hrs = int(input('How many hours have you studied?: '))
#marks = lin_reg.predict(hrs)
#print('So you studied for,',hrs,'I think you are going to score',marks,'marks.')

#Uncomment to see the error in action!

In scikit-learn, the input to predict has to be a **2d array**, even for simple linear regression

lin_reg.predict**([[hrs]])** solves the problem: the nested list [[hrs]] is already a 2d input of shape (1,1), one row with one feature

In [31]:
hrs = int(input('How many hours have you studied?: '))
marks = lin_reg.predict([[hrs]])
print('So you studied for,',hrs,'hours, I think you are going to score',marks,'marks.')

How many hours have you studied?: 6
So you studied for, 6 hours, I think you are going to score [79.77777778] marks.


To predict for float/decimal values, the process is the same,

just change **int(input(...)) to float(input(...))**

In [32]:
hrs = float(input('How many hours have you studied?: '))
marks = lin_reg.predict([[hrs]])
print('So you studied for,',hrs,'hours, I think you are going to score',marks,'marks.')

How many hours have you studied?: 6.5
So you studied for, 6.5 hours, I think you are going to score [82.28611111] marks.
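The prediction step can also be wrapped in a small function, so the 2D wrapping and the unpacking of the one-element result array happen in one place. `predict_grade` is a hypothetical helper name, and the model is refitted here on values retyped from the dataset so that the sketch runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Values copied from the dataset shown above.
X = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
Y = np.array([57, 66, 73, 76, 79, 81, 90, 96, 100])
lin_reg = LinearRegression().fit(X, Y)

def predict_grade(hours):
    """Hypothetical helper: predicted grade for one number of study hours."""
    # [[hours]] is a 2D input with one row and one feature;
    # [0] takes the single prediction out of the returned array.
    return float(lin_reg.predict([[hours]])[0])

for hrs in (6, 6.5):
    print(f"So you studied for {hrs} hours, I think you are going to score {predict_grade(hrs):.2f} marks.")
```

Returning a plain float avoids the square brackets that appear when the whole result array is printed.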
