# Prediction using Supervised ML

## -By Kowsik Nandagopan D, Data Science and Business Analyst Intern at TSF

In this section we use the Python Scikit-Learn library to implement a regression model. We start with simple linear regression, which involves just two variables.

**Simple Linear Regression**
In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables. <br><br>
`y = mX + c`<br>
where `y` is the predicted value (target), `X` is the feature, `m` is the slope, and `c` is the y-intercept.
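
To make the formula concrete, here is a minimal sketch of how the slope and intercept can be estimated by ordinary least squares in plain Python. The `hours` and `scores` values are made up for illustration and are not taken from the dataset used below, and `fit_line` is a hypothetical helper name, not part of any library.

```python
def fit_line(xs, ys):
    """Return (m, c) that minimise the squared error of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c


# Made-up example values (assumption, not the real dataset)
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [12.0, 22.0, 32.0, 42.0, 52.0]

m, c = fit_line(hours, scores)
print("slope:", m, "intercept:", c)
```

Scikit-Learn's `LinearRegression`, used later in this notebook, solves the same least-squares problem, so for a single feature it should arrive at the same line.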

## Data Preparation

In [None]:
# Import the required libraries for data preparation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Import Dataset
url = "http://bit.ly/w-data"
df = pd.read_csv(url)
print("Loaded")

In [None]:
# First 5 rows of the dataset
df.head()

In [None]:
# Shape of dataset
print("Total data-points: ", df.shape[0])
print("Number of columns: ", df.shape[1])

In [None]:
# Check data types of columns
df.dtypes

In [None]:
# Check for duplicated rows in the dataset
dup = df[df.duplicated()]
print("Number of duplicated data points:", dup.shape[0])

In [None]:
# Check for NaN or Null Data-points
df.isnull().sum()

### Visualizing Dataset

In [None]:
plt.figure(figsize=(5, 5))
plt.scatter(df.Hours, df.Scores)
plt.xlabel("Hours")
plt.ylabel("Scores")
plt.title("Hours vs. Scores")
plt.show()

In [None]:
# To plot the correlation matrix
import seaborn as sns
plt.figure(figsize=(8, 5))
sns.heatmap(df.corr(), annot=True)
plt.show()

## Training the Model
### Import required `sklearn` libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics  

### Split the dataset into feature and target variables

In [None]:
X = df[["Hours"]]
y = df[["Scores"]]

### Split dataset into train and test sets
Here we hold out `20%` of the dataset for testing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
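
Because `train_test_split` shuffles the rows randomly, the rows that land in the test set (and therefore the metrics computed later) can differ from run to run unless a seed is passed through its `random_state` argument. The idea of a seeded shuffle-and-slice split can be sketched in plain Python. Here `simple_split` is a hypothetical helper for illustration only, not Scikit-Learn's implementation, and the row indices are placeholders standing in for the dataset.

```python
import math
import random


def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle rows with a fixed seed, then slice off the test portion."""
    rng = random.Random(seed)      # seeded generator gives a repeatable shuffle
    shuffled = list(rows)          # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(25))  # placeholder row indices (assumed size)
train, test = simple_split(rows, test_size=0.2, seed=42)
print("train rows:", len(train), "test rows:", len(test))
```

Calling `simple_split` again with the same `seed` returns the same partition, which is the property a fixed `random_state` gives you when comparing results across runs.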

### Initialize Model

In [None]:
# Ordinary least squares with default settings
model = LinearRegression()

#### Fit the model on the training data

In [None]:
model.fit(X_train, y_train)

In [None]:
print("So our formula for prediction is y = {}X + {}".format(model.coef_[0][0], model.intercept_[0]))

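## Evaluate the Model
We evaluate the model on the held-out `20%` test set. For a regressor, `model.score` returns the coefficient of determination (R²) rather than a classification accuracy.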

In [None]:
print("R^2 score on test set:", model.score(X_test, y_test))
y_test_pred = model.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_test_pred)) 
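
For reference, the two numbers printed above can be computed by hand: the mean absolute error is the average of the absolute differences between actual and predicted values, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch in plain Python follows. The `y_true` and `y_pred` values are made up rather than taken from this model, and `mae` and `r_squared` are hypothetical helper names.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration (assumption, not model output)
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 41.0, 49.0]

print("MAE:", mae(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```

An R² close to 1 means the line explains most of the variation in the scores, while the MAE reports the typical error in the same units as the target.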

## Visualize Prediction

In [None]:
pred = model.predict(df[["Hours"]])

In [None]:
plt.figure(figsize=(5, 5))
plt.scatter(df.Hours, df.Scores, label="Actual")
plt.plot(df.Hours, pred, color="green", label="Predicted")
plt.xlabel("Hours")
plt.ylabel("Scores")
plt.legend()
plt.title("Hours vs. Scores")
plt.show()

## Predicting Score 

In [None]:
x = 9.25
# Pass a DataFrame with the same column name used in training
prd = model.predict(pd.DataFrame({"Hours": [x]}))
print("If student studies for", x, "hrs/day, then predicted score is", prd[0][0])