<h1 align = "center">Prediction of a student's mark.</h1>

## 1. Data Preparation

First, we will import the dataset and prepare it for the *Machine Learning* task ahead.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
stud = pd.read_csv('../source-files/student-math.csv', sep = ';')
stud.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


### Creating feature variable <mark>X</mark> and response variable <mark>y</mark>
First we have to create the response variable *final_grade* and separate the column *G3* from the feature variables.

In [3]:
stud['final_grade'] = np.sum(stud[['G1', 'G2', 'G3']], axis = 1)

Dataset <mark>X</mark> contains every column except *G3* and *final_grade* (note that *G1* and *G2* stay among the features), while <mark>y</mark> contains the *final_grade* column of the **stud** dataframe.

In [4]:
X_non_encoded = stud[stud.columns[: - 2]]
y = np.array(stud['final_grade'])
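To make the slicing concrete, here is a minimal sketch on a toy frame. The frame `toy`, its values, and the names `X_toy` and `y_toy` are hypothetical stand-ins, not part of the student dataset. It shows that `columns[:-2]` drops only the last two columns, so *G1* and *G2* remain among the features.

```python
import pandas as pd

# Toy stand-in for the student data (hypothetical values, only a few columns).
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "G1": [5, 5, 7],
    "G2": [6, 5, 8],
    "G3": [6, 6, 10],
})

# Same idea as the notebook: the response is the row-wise sum of the three grades.
toy["final_grade"] = toy[["G1", "G2", "G3"]].sum(axis=1)

# Dropping the last two columns removes G3 and final_grade only.
X_toy = toy[toy.columns[:-2]]
y_toy = toy["final_grade"].to_numpy()

print(list(X_toy.columns))  # G1 and G2 are still present
print(y_toy)
```

Because the response is built from *G1*, *G2* and *G3*, keeping *G1* and *G2* in the features gives the model direct access to two of the three terms of the sum.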

### Encoding all text values in the feature variable <mark>X</mark>
- Since the *nominal columns* carry no inherent ordering, we can use the `sklearn.preprocessing.OneHotEncoder` class to encode them in the *One Hot Format*, where each category becomes its own column holding 1 for rows that belong to that category and 0 otherwise.


- We will also build a **Pipeline** with the `sklearn.compose.ColumnTransformer` class, which applies the chosen transformation only to the selected columns.

The following list collects the names of all the columns to be encoded.

In [5]:
nominal_col = [col for col in stud.columns if stud[col].dtype == 'O']

In [6]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [7]:
pipeline = ColumnTransformer([
    ('one_hot_enc', OneHotEncoder(), nominal_col)
], remainder = 'passthrough')

<div class = "alert alert-info">
    The above <strong>Pipeline</strong> one-hot encodes the listed text columns and passes all the other features through unchanged.
</div>
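As an illustration of that behaviour, the sketch below runs the same kind of transformer on a tiny frame. The frame `toy` and the name `toy_pipeline` are assumptions made for the example. The encoded columns come first, followed by the passed-through ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-column frame: one nominal column, one numeric column.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "age": [18, 17, 15],
})

toy_pipeline = ColumnTransformer(
    [("one_hot_enc", OneHotEncoder(), ["sex"])],
    remainder="passthrough",  # keep 'age' as it is
)

encoded = toy_pipeline.fit_transform(toy)
# The result can be a SciPy sparse matrix when most entries are zero;
# convert to a dense array for display in that case.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded)
```

Each row now holds one indicator column per category of `sex`, followed by the untouched `age` value.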

In [8]:
X = pipeline.fit_transform(X_non_encoded)

### Creating Train and Test sets of <mark>X</mark> and <mark>y</mark>.

We import the `sklearn.model_selection.train_test_split()` function to create randomly shuffled train and test sets.

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
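The call above relies on the defaults, so the split changes on every run. A minimal sketch of a reproducible split follows, using synthetic arrays `X_demo` and `y_demo` (assumptions for the example) and the `test_size` and `random_state` parameters of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Fixing random_state makes the shuffle repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, y_te2))  # same seed, same split
```

With a fixed seed the reported score can be reproduced. Without one, it varies from run to run.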

## 2. Modelling and Evaluation

To create a **linear regression** model we have to import `sklearn.linear_model.LinearRegression` class.

In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
lin_reg = LinearRegression()

Now we fit the **Linear Regression** model on the *X_train* and *y_train* values.

In [26]:
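# A single call is enough: LinearRegression.fit is deterministic,
# so repeating it in a loop would produce the same model every time.
model = lin_reg.fit(X_train, y_train)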

In [27]:
# model.score returns the coefficient of determination (R²) for a regressor.
print('The R² score of the model on the test data is {:.4f}'.format(model.score(X_test, y_test)))

The R² score of the model on the test data is 0.9621
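For reference, the value returned by `score` on a regressor is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below checks this by hand on synthetic data. The arrays, the coefficients, and the name `demo_model` are assumptions for the example and are unrelated to the student dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, demo_model.score(X_demo, y_demo)))
```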
