# Simple Linear Regression

In this notebook, we'll build a linear regression model to predict the `percentage of marks` that a student is expected to score based upon the number of hours they studied.

### Author: **Akash Negi**

## Step 1: Reading and Understanding the Data

Let's start with the following steps:

1. Importing data 
2. Understanding the structure of the data

In [7]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [10]:
# Importing all the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import numpy as np

In [11]:
# Reading data from remote url
url='https://raw.githubusercontent.com/AdiPersonalWorks/Random/master/student_scores%20-%20student_scores.csv'
data=pd.read_csv(url)
data.head()

,Hours,Scores
0,2.5,21
1,5.1,47
2,3.2,27
3,8.5,75
4,3.5,30


In [12]:
#looking at the shape of the data
data.shape

(25, 2)
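Beyond the shape, it can help to check column types, missing values, and summary statistics. The sketch below is only illustrative: `sample` is a hypothetical stand-in built from the five rows shown by `data.head()` above, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: only the five rows shown by data.head() above
sample = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})

# Data type of each column
print(sample.dtypes)

# Number of missing values per column
missing = sample.isnull().sum()
print(missing)

# Summary statistics (count, mean, std, min, quartiles, max)
summary = sample.describe()
print(summary)
```

The same three calls can be run on `data` itself to inspect all 25 rows.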

## Step 2: Plotting the distribution of scores
Let's plot our data points on a 2-D graph to eyeball the dataset and see whether we can spot any relationship between hours studied and scores. We can create the plot with the following script:

In [None]:
plt.scatter(data=data,x='Hours',y='Scores')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show()
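A visual impression of a linear trend can be backed by a number. As a rough sketch, the Pearson correlation coefficient measures how closely two columns follow a straight line. Here `sample` is a hypothetical stand-in containing only the five rows from `data.head()`, so the value is illustrative and will differ from the one for the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: only the five rows shown by data.head()
sample = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})

# Pearson correlation between the two columns (ranges from -1 to 1)
r = sample["Hours"].corr(sample["Scores"])
print(f"Pearson correlation = {r:.3f}")
```

A value close to 1 suggests a strong positive linear relationship, which is the situation where simple linear regression is a reasonable first model.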

## Step 3: Performing Simple Linear Regression

### Simple Linear Regression using `statsmodels`

We first assign the feature variable, `Hours`, in this case, to the variable `X` and the response variable, `Scores`, to the variable `y`.

In [None]:
X=data.iloc[:,0]
y=data.iloc[:,1]

#### Train-Test Split

You now need to split the data into training and testing sets. You'll do this with the `train_test_split` function from the `sklearn.model_selection` module. A common choice is to keep 70% of the data in the train dataset and the remaining 30% in the test dataset.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=100)
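To see what a 70/30 split means for a dataset of this size, here is a small sketch on made-up data. `X_demo` and `y_demo` are hypothetical arrays with 25 entries, chosen only to match the row count reported by `data.shape`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the dataset (25)
X_demo = np.arange(25)
y_demo = X_demo * 2

# Same arguments as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=100
)
print(len(X_tr), len(X_te))

# Fixing random_state makes the split reproducible across runs
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=100
)
print(np.array_equal(X_te, X_te2))
```

Because 30% of 25 is not a whole number, the test set size is rounded up, so the split is 17 training rows and 8 test rows.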

In [None]:
# Let's now take a look at the train dataset
print(X_train)


In [None]:
print(y_train)

#### Building a Linear Model

You first need to import the `statsmodels.api` module, which you'll use to perform the linear regression.

In [None]:
import statsmodels.api as sm

By default, `statsmodels` fits a line that passes through the origin, because `OLS` does not add an intercept term on its own. To include an intercept, you need to add a constant column manually with the `add_constant` function. Once you've added the constant to `X_train`, you can fit a regression line using the `OLS` (Ordinary Least Squares) class of `statsmodels`, as shown below.

In [None]:
# Add a constant to get an intercept and fit the regression line using OLS
X_train_sm=sm.add_constant(X_train)
lr=sm.OLS(y_train,X_train_sm).fit()
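To illustrate why the constant matters, the sketch below solves the same least-squares problem with plain NumPy instead of `statsmodels`. It is not the notebook's method, only a comparison: adding a constant amounts to adding a column of ones to the design matrix. The arrays hold the five rows from `data.head()`, so the coefficients are illustrative and will not match the fitted model.

```python
import numpy as np

# Hypothetical stand-in: the five rows shown by data.head()
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21, 47, 27, 75, 30], dtype=float)

# Design matrix with a leading column of ones (the "constant")
X_design = np.column_stack([np.ones_like(hours), hours])

# Least-squares solution of X_design @ beta ~ scores
beta, *_ = np.linalg.lstsq(X_design, scores, rcond=None)
intercept, slope = beta

# Without the column of ones, the line is forced through the origin
beta_origin, *_ = np.linalg.lstsq(hours.reshape(-1, 1), scores, rcond=None)
slope_origin = beta_origin[0]

print(f"with constant:  intercept={intercept:.3f}, slope={slope:.3f}")
print(f"through origin: slope={slope_origin:.3f}")
```

The two fits give different slopes because the second one has no intercept to absorb the offset in the data.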

In [None]:
# Print the parameters, i.e. the intercept and the slope of the regression line fitted
lr.params

In [None]:
# Performing a summary operation lists out all the different parameters of the regression line fitted
lr.summary()

In [None]:
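# Plotting the regression line on the training data
plt.scatter(X_train,y_train)
# lr.predict uses the fitted intercept and slope, instead of hard-coding 1.8709 and 9.8542
plt.plot(X_train,lr.predict(X_train_sm),'r')
plt.show()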

## Step 4: Predictions on the Test Set

Now that you have fitted a regression line on the train dataset, it's time to make some predictions on the test data. To do this, first add a constant to `X_test`, as you did for `X_train`, and then predict the y values corresponding to `X_test` using the `predict` method of the fitted model.

In [None]:
# Add a constant to X_test
X_test_sm=sm.add_constant(X_test)
# Predict the y values corresponding to X_test_sm
y_pred=lr.predict(X_test_sm)

In [None]:
# Comparing Actual vs Predicted
df=pd.DataFrame({'Actual score':y_test,'Predicted score':y_pred,'difference':(y_test-y_pred)})
df

In [None]:
# You can also test with given data
hours = 9.25
own_pred = lr.predict([[1,hours]])
print(f'No of Hours = {hours}')
print(f"Predicted Score = {own_pred[0]}")

#### So the predicted score for 9.25 hours of study is 92.80850057353507

## Step 5: Evaluation of the model
The final step is to evaluate the performance of the model. This step is particularly important when comparing how well different algorithms perform on a particular dataset. For simplicity, we use the root mean squared error (RMSE) and the R² score here; many other metrics exist.

In [None]:
#importing metrics for the evaluation of the model
from sklearn.metrics import mean_squared_error , r2_score

In [None]:
# Printing out the evaluation metrics
print('Root Mean Squared Error =' ,np.sqrt(mean_squared_error(y_test,y_pred)))
# r2_score expects the true values first, then the predictions
print('R2 Score = ',r2_score(y_test,y_pred))
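As a rough illustration of what these two metrics compute, the sketch below implements them by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values, not results from this notebook, and the helper functions `rmse` and `r2` are hypothetical names introduced only for this example.

```python
import numpy as np

# Made-up actual and predicted scores, for illustration only
y_true = np.array([20.0, 30.0, 60.0, 70.0])
y_hat = np.array([25.0, 30.0, 55.0, 75.0])

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r2(actual, predicted):
    """R squared: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

print("RMSE =", rmse(y_true, y_hat))
print("R2 (actual, predicted) =", r2(y_true, y_hat))
print("R2 (predicted, actual) =", r2(y_hat, y_true))
```

RMSE is symmetric in its arguments, but R² is not, because the total sum of squares is computed around the mean of the first argument. This is why the actual values should be passed first.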

Thank You