# Implementing Simple Linear Regression from Scratch

### Importing the libraries

We will be using NumPy for setting up the data, scikit-learn's model_selection module for splitting the dataset into training and testing datasets, and pandas for creating a dataset.

In [13]:
import numpy as np
from sklearn import model_selection
import pandas as pd

### Loading the data
We will be using NumPy's loadtxt() function to load the data. The file consists of 2 columns, with the first column taken as 'X' and the second column as 'Y'.

In [14]:
data = np.loadtxt("data.csv", delimiter=",")
X = data[:, 0]
Y = data[:, 1]

### Dividing the data set for training and testing
We divide the data set into training and testing data using the train_test_split() function from the model_selection module.<br>
The test_size parameter sets the fraction of the data held out for testing: with 0.3 in the code below, 70% of the data is used for training and 30% for testing.<br>

In [15]:
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.3)
X_train.shape

(70,)
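
Because train_test_split() shuffles the data, each run of the cell above produces a different split and therefore slightly different scores. The sketch below shows one way to make the split repeatable by passing random_state. The arrays X_demo and Y_demo are hypothetical stand-ins, not the contents of data.csv.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical stand-in data (100 points), not the notebook's data.csv
X_demo = np.arange(100, dtype=float)
Y_demo = 2.0 * X_demo + 1.0

# random_state fixes the shuffle so repeated runs give the same split
X_tr, X_te, Y_tr, Y_te = model_selection.train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape)  # (70,) (30,)
```

With 100 rows and test_size = 0.3, 70 rows go to training and 30 to testing, which matches the (70,) shape shown above.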

### Fit() function:
This returns the slope m and the intercept c of the least-squares line, calculated from the values of x and y.<br>
fit() is called on the training data only.<br>

In [16]:
def fit(x_train, y_train):
    num = (x_train * y_train).mean() - x_train.mean() * y_train.mean()
    den = (x_train ** 2).mean() - x_train.mean() ** 2
    m = num / den
    c = y_train.mean() - m * x_train.mean()
    return m, c    
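
As a quick sanity check, the closed-form fit() can be compared with NumPy's own degree-1 least-squares fit, np.polyfit. This is a minimal sketch on synthetic data; the names rng, x_demo and y_demo and the chosen slope and intercept are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def fit(x_train, y_train):
    # Slope: covariance of x and y divided by the variance of x
    num = (x_train * y_train).mean() - x_train.mean() * y_train.mean()
    den = (x_train ** 2).mean() - x_train.mean() ** 2
    m = num / den
    # Intercept: the fitted line passes through the point of means
    c = y_train.mean() - m * x_train.mean()
    return m, c

# Synthetic data (assumed for this sketch): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 3.0 * x_demo + 2.0 + rng.normal(0, 1, size=50)

m_demo, c_demo = fit(x_demo, y_demo)

# np.polyfit returns coefficients from highest degree down: [slope, intercept]
m_ref, c_ref = np.polyfit(x_demo, y_demo, deg=1)
print(np.isclose(m_demo, m_ref), np.isclose(c_demo, c_ref))
```

Both approaches solve the same least-squares problem, so the two slopes and the two intercepts should agree up to floating-point error.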

### Predict() function:
This function returns the predicted values of y, using the m and c calculated by the fit() function.<br>
The input is an array of x values, from either the training or the testing data, together with m and c.<br>
The prediction is computed as y = m * x + c.<br>

In [17]:
# This function predicts the value of 'y' corresponding to each 'x'
def predict(x, m, c):
    return m * x + c
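
Because x is a NumPy array, m * x + c is evaluated element-wise, so a single call predicts every point at once. A small illustration, with x_demo and the values m = 2 and c = 1 chosen arbitrarily for this sketch:

```python
import numpy as np

def predict(x, m, c):
    # Element-wise: each entry of x is mapped to m * x + c
    return m * x + c

x_demo = np.array([0.0, 1.0, 2.0])   # hypothetical inputs
y_demo_pred = predict(x_demo, 2.0, 1.0)
print(y_demo_pred)  # [1. 3. 5.]
```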

### Coefficient of determination:
To find out whether our model is good or bad we can plot graphs, but that becomes impractical when the model has many features.<br>
The coefficient of determination R2 measures how much of the variation in one variable can be explained by variation in the second variable.<br>
The best possible value is 1. A value of 0 means the model does no better than always predicting the mean of y, and the value can be negative for a model that does worse than that.<br>
The inputs of the function are the actual and the predicted values, for example Y_test and y_test_pred.<br>

In [18]:
# This function returns the coefficient of determination (R2) from Y (actual) and Y (predicted).
def score(y_truth, y_pred):
    u = ((y_truth - y_pred)**2).sum()
    v = ((y_truth - y_truth.mean())**2).sum()
    return 1 - u/v
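
To get a feel for the range of the score, the sketch below evaluates it on three hand-made predictions: a perfect one, one that always predicts the mean, and one that is worse than the mean. The array y_true and the three predictions are made up for this illustration.

```python
import numpy as np

def score(y_truth, y_pred):
    u = ((y_truth - y_pred) ** 2).sum()          # residual sum of squares
    v = ((y_truth - y_truth.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical actual values

perfect = score(y_true, y_true)                                 # 1.0
baseline = score(y_true, np.full_like(y_true, y_true.mean()))   # 0.0
worse = score(y_true, y_true[::-1])                             # -3.0
print(perfect, baseline, worse)
```

The last case shows that the score is not bounded below by 0: predictions that are further from the truth than the mean give a negative value.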

### Cost() function:
The cost function measures how wrong the model is in capturing the relationship between the input and output.<br>
Here it is the mean squared error: the average of the squared differences between the actual y and the predicted m * x + c, so a lower value means a better fit.<br>
The inputs of this function are X_train, Y_train, m and c.<br>

In [19]:
def cost(x, y, m, c):
    return ((y - m * x - c)**2).mean()
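
A tiny worked example may help: for points that lie exactly on a line, the cost is zero at the true m and c and grows as the line moves away. The data below are made up for this sketch and lie exactly on y = 2x + 1.

```python
import numpy as np

def cost(x, y, m, c):
    # Mean of the squared residuals between y and the line m * x + c
    return ((y - m * x - c) ** 2).mean()

x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([3.0, 5.0, 7.0])  # exactly y = 2x + 1

exact_cost = cost(x_demo, y_demo, 2.0, 1.0)    # line through every point
shifted_cost = cost(x_demo, y_demo, 2.0, 0.0)  # same slope, intercept off by 1
print(exact_cost, shifted_cost)  # 0.0 1.0
```

Shifting the intercept by 1 leaves a residual of 1 at every point, so the mean squared error is 1.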

### Implementation of the algorithm

In [20]:
m, c = fit(X_train, Y_train)
# Test data
y_test_pred = predict(X_test, m, c)
print("Test Score: ",score(Y_test, y_test_pred))

# Train data
y_train_pred = predict(X_train, m, c)
print("Train Score:", score(Y_train, y_train_pred))
print("M:", m)
print("C:", c)
print("Cost on training data:", cost(X_train, Y_train, m, c))


Test Score:  0.6134351655250116
Train Score: 0.5924432957527764
M: 1.3744751003971205
C: 5.855174700938079
Cost on training data: 125.36091637594052
