# Student's percentage prediction

In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied.
This is a simple linear regression task as it involves just two variables.

What score would we predict for a student who studies 8 hours a day?

**Data set** : http://bit.ly/w-data

# Importing the required libraries

In [None]:
# importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

**Reading the data**

In [None]:
# Reading data from remote link
url = "http://bit.ly/w-data"
data = pd.read_csv(url)
print("Data imported successfully")
data

**Cleaning the data**
   
   Here, we scan the dataset for null values. If there are none, no data cleaning is required.

In [None]:
#checking for null values
data.isnull().sum()

**Plotting the score distribution**

   Now, we plot the dataset on a 2-D graph to eyeball it and see whether there is any relationship between Hours and Scores.

In [None]:
# Plotting the distribution of scores
data.plot(x='Hours', y='Scores', style='o')  
plt.title('Study Hours vs Percentage Scores')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show()

In [None]:
# Regression plot to visualise the relationship between feature and target
sns.regplot(x='Hours', y='Scores', data=data)
plt.title('Study Hours vs Percentage Scores')
plt.xlabel('Study Hours')
plt.ylabel('Percentage')
plt.show()

*The graph suggests a positive linear relationship between the number of hours studied and the percentage score.*
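
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below computes Pearson's r with `np.corrcoef` on a small made-up sample. The `hours` and `scores` arrays are illustrative values, not rows from the dataset; on the notebook's own data, `data.corr()` would give the same kind of figure.

```python
import numpy as np

# Illustrative values only (assumed, not taken from the dataset)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([12.0, 25.0, 29.0, 44.0, 50.0])

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is Pearson's r between the two arrays
r = np.corrcoef(hours, scores)[0, 1]
print("Pearson r:", round(r, 3))
```

A value close to +1 indicates a strong positive linear relationship, close to -1 a strong negative one, and close to 0 little linear relationship.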

**Preparing the data**

   Now, we define our "attributes" (input) and "labels" (output) variables.

In [None]:
X = data.iloc[:, :-1].values  #Attribute
y = data.iloc[:, 1].values    #Labels

Now that we have the attributes and labels defined, the next step is to split this data into training and test sets.

This is done using *Scikit-Learn's built-in train-test-split method*.

In [None]:
# Using Scikit-Learn's built-in train_test_split() method:

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Now, the training and testing sets are ready for training the model.
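
To make the idea of a split concrete, here is a minimal NumPy sketch of a shuffled 80/20 split. It is not Scikit-Learn's implementation and will not select the same rows as `train_test_split`; the `*_demo` arrays and the seed are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-in data (assumed): 10 samples, 1 feature
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

rng = np.random.default_rng(42)             # fixed seed, so the split is reproducible
indices = rng.permutation(len(X_demo))      # shuffled row indices
n_test = int(round(0.2 * len(X_demo)))      # hold out 20% of the rows for testing

test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]
print(X_train_demo.shape, X_test_demo.shape)
```

The key property is that the two index sets do not overlap, so the model is later evaluated on rows it never saw during training.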

# Training the Algorithm

Now, the linear regression algorithm is built from scratch and then compared with Scikit-Learn's built-in linear regression.

**Making the Linear Regression Algorithm**

In [None]:
y_train_new = y_train.reshape(-1,1)  # labels as a column vector
ones = np.ones([X_train.shape[0], 1]) # create an array containing only ones
X_train_new = np.concatenate([ones, X_train],1) # prepend the ones (bias column) to the X matrix

In [None]:
# Hyper-parameters: learning rate (note the small value) and number of iterations
alpha = 0.01
iters = 5000

# Initial theta as a 1x2 row vector: [intercept, coefficient]
theta = np.array([[1.0, 1.0]])
print(theta)

In [None]:
# Cost Function
def findCost(X, y, theta):
    inner = np.power(((X @ theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

In [None]:
findCost(X_train_new, y_train_new, theta)

*This is the initial cost. The aim is to make it as small as possible.*
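
In symbols, for $m$ training samples, `findCost` computes the halved mean squared error

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2$$

where $\theta_0$ is the intercept and $\theta_1$ the coefficient. Batch gradient descent lowers $J$ by repeatedly applying, for $j \in \{0, 1\}$ and with $x_0^{(i)} = 1$,

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)x_j^{(i)}$$

where $\alpha$ is the learning rate.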

In [None]:
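# Gradient Descent
def gradientDescent(X, y, theta, alpha, iters):
    m = len(X)  # number of training samples
    cost = findCost(X, y, theta)  # defined even if iters == 0
    for i in range(iters):
        # step against the gradient of the cost, averaged over all samples
        theta = theta - (alpha/m) * np.sum(((X @ theta.T) - y) * X, axis=0)
        cost = findCost(X, y, theta)
        #if i % 10 == 0:
            #print(cost)
    return (theta, cost)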

In [None]:
g, cost = gradientDescent(X_train_new, y_train_new, theta, alpha, iters)  
print("Intercept -", g[0][0])
print("Coefficient -", g[0][1])
print("The final cost obtained after optimisation -", cost)

*The cost after optimisation is far lower than the initial cost.*

**Plotting the result**

In [None]:
# Plotting scatter points
plt.scatter(X, y, label='Scatter Plot')
axes = plt.gca()

# Plotting the Line
x_vals = np.array(axes.get_xlim()) 
y_vals = g[0][0] + g[0][1]* x_vals #the line equation

plt.plot(x_vals, y_vals, color='red', label='Regression Line')
plt.legend()
plt.show()

So, the method above builds the regression algorithm from scratch and applies it to the dataset. However, as you can see, it is in crude form; a more elegant approach would be to wrap it in a class.
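
As a rough idea of what such a class could look like, here is a minimal sketch. The name `GradientDescentRegressor` and its `fit`/`predict` methods are hypothetical, chosen to resemble Scikit-Learn's interface; the demo data is synthetic and noise-free so the expected answer is known in advance.

```python
import numpy as np


class GradientDescentRegressor:
    """Hypothetical wrapper around batch gradient descent (illustrative only)."""

    def __init__(self, alpha=0.01, iters=5000):
        self.alpha = alpha    # learning rate
        self.iters = iters    # number of gradient descent steps
        self.theta = None     # [intercept, coefficients...] after fit

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        m = X.shape[0]
        Xb = np.hstack([np.ones((m, 1)), X])   # prepend the bias column
        theta = np.zeros((Xb.shape[1], 1))
        for _ in range(self.iters):
            gradient = Xb.T @ (Xb @ theta - y) / m
            theta -= self.alpha * gradient
        self.theta = theta.ravel()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return self.theta[0] + X @ self.theta[1:]


# Usage on synthetic, noise-free data that follows y = 2 + 3x
X_demo = np.arange(0, 10, dtype=float).reshape(-1, 1)
y_demo = 2.0 + 3.0 * X_demo.ravel()
model = GradientDescentRegressor(alpha=0.01, iters=20000).fit(X_demo, y_demo)
print("theta:", model.theta)
print("prediction for x = 8:", model.predict(np.array([[8.0]])))
```

Keeping the hyper-parameters on the object and returning `self` from `fit` lets the model be created, trained and queried in one short chain, and keeps the training state out of global variables.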

Hyper-parameters such as "alpha", also known as the learning rate, were chosen here by trial and error, which is not good practice.
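
One simple alternative is a small systematic sweep: run gradient descent for each candidate learning rate and compare the resulting costs. The sketch below does this on synthetic data; the helper `final_cost`, the candidate list and the data are all assumptions made for illustration, not part of the notebook.

```python
import numpy as np


def final_cost(X, y, alpha, iters=1000):
    """Run batch gradient descent from theta = 0 and return the final cost."""
    m = len(X)
    theta = np.zeros((X.shape[1], 1))
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return float(np.sum((X @ theta - y) ** 2) / (2 * m))


# Synthetic data (assumed for illustration): y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 9, size=50)
X_demo = np.column_stack([np.ones_like(x), x])   # bias column + feature
y_demo = (2 + 3 * x + rng.normal(0, 1, size=50)).reshape(-1, 1)

# Try several candidate learning rates and keep the one with the lowest final cost
candidates = [0.0001, 0.001, 0.01, 0.03]
costs = {alpha: final_cost(X_demo, y_demo, alpha) for alpha in candidates}
best_alpha = min(costs, key=costs.get)
print(costs)
print("Best alpha:", best_alpha)
```

Learning rates that are too large can make gradient descent diverge, so the candidates here are deliberately kept small. In a real project the comparison would more usually be made on a held-out validation set rather than on the training cost.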

So, instead of writing this long code, we can use Scikit-Learn, a Python library that provides linear regression out of the box. This is implemented below.

**Using Scikit-Learn Library**

In [None]:
from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(X_train, y_train) 

print("Training complete.")

In [None]:
print("Coefficient -", regressor.coef_)
print("Intercept -", regressor.intercept_)

In [None]:
# Plotting the regression line
line = regressor.coef_*X + regressor.intercept_

# Plotting the full dataset together with the fitted line
plt.scatter(X, y)
plt.plot(X, line,color='red', label='Regression Line')
plt.legend()
plt.show()

Both graphs, as well as the intercepts and coefficients of the two lines, are nearly identical. This suggests that my linear regression algorithm is correct.
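
The agreement is expected: ordinary least squares has an exact solution, and gradient descent approaches that same solution step by step. The sketch below illustrates this on synthetic data (all names and values are assumed for illustration), using `np.linalg.lstsq` for the exact answer.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 3x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(1, 9, size=40)
y_demo = 2 + 3 * x + rng.normal(0, 1, size=40)
X_demo = np.column_stack([np.ones_like(x), x])   # bias column + feature

# Exact least-squares solution
theta_exact, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Batch gradient descent on the same data
theta_gd = np.zeros(2)
alpha, iters = 0.01, 20000
for _ in range(iters):
    theta_gd -= (alpha / len(x)) * (X_demo.T @ (X_demo @ theta_gd - y_demo))

print("Exact:           ", theta_exact)
print("Gradient descent:", theta_gd)
```

With enough iterations and a suitable learning rate, the two parameter vectors should match to several decimal places; a visible gap usually means gradient descent has not yet converged.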

But as you can see, it's a lot easier to use the built-in function.

# Making the Predictions

   Now that we have trained our algorithm, the next step is to make some predictions.

In [None]:
print(X_test) # Testing data - In Hours
y_pred = regressor.predict(X_test) # Predicting the scores

In [None]:
# Comparing Actual vs Predicted
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df

In [None]:
# R^2 score on the training and test sets
print("Training Score:",regressor.score(X_train,y_train))
print("Test Score:",regressor.score(X_test,y_test))

In [None]:
# Bar chart comparing the actual and predicted values, with a grid
df.plot(kind='bar',figsize=(7,7))
plt.grid(which='major', linewidth=0.5, color='green')
plt.grid(which='minor', linewidth=0.5, color='black')
plt.show()

In [None]:
# Testing the model with new data
hours = 8
test = np.array([hours])
test = test.reshape(-1, 1)
own_pred = regressor.predict(test)
print("No of Hours = {}".format(hours))
print("Predicted Score = {}".format(own_pred[0]))

# Evaluating the model

The final step is to evaluate the performance of the model. This step is important for comparing how well different algorithms perform on a given dataset.

In [None]:
from sklearn import metrics  
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test, y_pred)) 
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-2:', metrics.r2_score(y_test, y_pred))

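*The **R^2 value (the coefficient of determination) is 96.78%**, meaning the model explains about 96.78% of the variance in the test scores. R^2 is not an accuracy in the classification sense, but a value this high suggests that the model fits the given dataset well.*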
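
To show what this number measures, here is a minimal sketch that computes R^2 by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The arrays `y_true` and `y_hat` are made-up illustrative values, not the notebook's test set.

```python
import numpy as np

# Illustrative values only (assumed, not the notebook's test set)
y_true = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
y_hat = np.array([22.0, 33.0, 51.0, 63.0, 82.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot
print("R^2:", r2)
```

An R^2 of 1 means the predictions match the actual values exactly, while 0 means the model does no better than always predicting the mean of the actual values.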