# Supervised Learning - Building a Student Performance Prediction System


# Classification vs. Regression
The aim of this project is to predict how likely a student is to pass. Which type of supervised learning problem is this, classification or regression? Why?
Answer:
This is a classification problem, because the variable to predict (whether a student passes or fails) is categorical. In this case it is a dichotomous (binary) categorical variable whose only two possible values are "pass" and "fail".

### Overview:

1. Read the problem statement.

2. Get the dataset.

3. Explore the dataset.

4. Pre-process the dataset.

5. Transform the dataset for building the machine learning model.

6. Split the data into train and test sets.

7. Build the model.

8. Apply the model.

9. Evaluate the model.

10. Provide insights.

## Problem Statement 

Use logistic regression to **predict the performance of students**. The classification goal is to predict whether a student will pass or fail.

## Dataset 

This data describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided; the one used here covers performance in Mathematics.

**Source:** https://archive.ics.uci.edu/ml/datasets/Student+Performance

# Question 1 - Exploring the Data (0.5 points)
*Read the dataset file using pandas. Take care with the delimiter.*

#### Answer:

In [3]:
import numpy as np
import pandas as pd
stdf = pd.read_csv('students-data.csv',sep = ";")
stdf.head(10)

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
5,GP,M,16,U,LE3,T,4,3,services,other,...,5,4,2,1,2,5,10,15,15,15
6,GP,M,16,U,LE3,T,2,2,other,other,...,4,4,4,1,1,3,0,12,12,11
7,GP,F,17,U,GT3,A,4,4,other,teacher,...,4,1,4,1,1,1,6,6,5,6
8,GP,M,15,U,LE3,A,3,2,services,other,...,4,2,2,1,1,1,0,16,18,19
9,GP,M,15,U,GT3,T,3,4,other,other,...,5,5,1,1,1,5,0,14,15,15
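
The delimiter matters because the file is semicolon-separated. The sketch below uses a hypothetical three-column excerpt held in memory (not the real `students-data.csv`) to show what pandas does with the default separator and with `sep=";"`.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt in the same semicolon-separated layout.
sample = "school;sex;age\nGP;F;18\nGP;M;16\n"

# Default comma separator: every line is parsed as one single column.
wrong = pd.read_csv(io.StringIO(sample))

# Explicit semicolon separator: three columns, as intended.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A quick look at `.shape` right after loading is usually enough to catch a wrong delimiter.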


# Question 2 - drop missing values (0.5 points)
*Set the index name of the dataframe to **"number"**. Check a sample of the data and drop any missing values.*
*Use the .dropna() function to drop the NAs.*

#### Answer:

In [5]:
stdf.index.name = 'number' #setting the name of the index column of the dataset
stdf.head(10)

number,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
5,GP,M,16,U,LE3,T,4,3,services,other,...,5,4,2,1,2,5,10,15,15,15
6,GP,M,16,U,LE3,T,2,2,other,other,...,4,4,4,1,1,3,0,12,12,11
7,GP,F,17,U,GT3,A,4,4,other,teacher,...,4,1,4,1,1,1,6,6,5,6
8,GP,M,15,U,LE3,A,3,2,services,other,...,4,2,2,1,1,1,0,16,18,19
9,GP,M,15,U,GT3,T,3,4,other,other,...,5,5,1,1,1,5,0,14,15,15


In [6]:
stdf.isnull().values.any() #to check if any values are NaN

False
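
Since the check returns `False`, there is nothing to drop in this dataset. For a dataset that does have gaps, the check and the drop could look roughly like this; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing grade.
toy = pd.DataFrame({"age": [18, 17, 15], "G3": [6.0, np.nan, 10.0]})

missing_per_column = toy.isnull().sum()  # number of NaNs in each column
cleaned = toy.dropna()                   # drop every row that contains a NaN

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

Counting NaNs per column shows where the gaps are before any rows are discarded.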

# Transform Data

## Question 3 (0.5 points)

*Print all the attribute names which are not numerical.*

**Hint:** check **select_dtypes()** and its **include** and **exclude** parameters.

#### Answer:

In [9]:
#stdf.dtypes
obj_stdf = stdf.select_dtypes(include=['object']).copy()
obj_stdf.head(10)

number,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
0,GP,F,U,GT3,A,at_home,teacher,course,mother,yes,no,no,no,yes,yes,no,no
1,GP,F,U,GT3,T,at_home,other,course,father,no,yes,no,no,no,yes,yes,no
2,GP,F,U,LE3,T,at_home,other,other,mother,yes,no,yes,no,yes,yes,yes,no
3,GP,F,U,GT3,T,health,services,home,mother,no,yes,yes,yes,yes,yes,yes,yes
4,GP,F,U,GT3,T,other,other,home,father,no,yes,yes,no,yes,yes,no,no
5,GP,M,U,LE3,T,services,other,reputation,mother,no,yes,yes,yes,yes,yes,yes,no
6,GP,M,U,LE3,T,other,other,home,mother,no,no,no,no,yes,yes,yes,no
7,GP,F,U,GT3,A,other,teacher,home,mother,yes,yes,no,no,yes,yes,no,no
8,GP,M,U,LE3,A,services,other,home,mother,no,yes,yes,no,yes,yes,yes,no
9,GP,M,U,GT3,T,other,other,home,mother,no,yes,yes,yes,yes,yes,yes,no


In [11]:
int_stdf = stdf.select_dtypes(include=['int64']).copy()
int_stdf.head(10)

number,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,18,4,4,2,2,0,4,3,4,1,1,3,6,5,6,6
1,17,1,1,1,2,0,5,3,3,1,1,3,4,5,5,6
2,15,1,1,1,2,3,4,3,2,2,3,3,10,7,8,10
3,15,4,2,1,3,0,3,2,2,1,1,5,2,15,14,15
4,16,3,3,1,2,0,4,3,2,1,2,5,4,6,10,10
5,16,4,3,1,2,0,5,4,2,1,2,5,10,15,15,15
6,16,2,2,1,2,0,4,4,4,1,1,3,0,12,12,11
7,17,4,4,2,2,0,4,1,4,1,1,1,6,6,5,6
8,15,3,2,1,2,0,4,2,2,1,1,1,0,16,18,19
9,15,3,4,1,2,0,5,5,1,1,1,5,0,14,15,15
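
The question asks for the attribute names rather than the rows. One way to get just the names is to take `.columns` of the `select_dtypes()` result, sketched here on a made-up frame. Using `exclude="number"` instead of `include=['object']` is an assumption of this sketch; it selects everything that is not numeric, whatever dtype pandas uses for text.

```python
import pandas as pd

# Hypothetical toy frame mixing text and integer columns.
toy = pd.DataFrame({"school": ["GP", "MS"], "age": [18, 17], "sex": ["F", "M"]})

# Names of the non-numerical attributes.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
# Names of the numerical attributes.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(non_numeric_cols)
print(numeric_cols)
```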


# Question 4 - Drop variables with low variance (0.5 points)

*Find the variance of each numerical independent variable and drop those whose variance is less than 1. Use the .var() function to check the variance.*

In [13]:
int_stdf.var(ddof=0) < 1.0  # numeric columns only; True marks a column with variance < 1 that needs to be dropped

age           False
Medu          False
Fedu          False
traveltime     True
studytime      True
failures       True
famrel         True
freetime       True
goout         False
Dalc           True
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool

In [16]:
int_stdf.drop(['traveltime','studytime','famrel','failures','freetime','Dalc'],axis=1, inplace=True)
int_stdf.head(10)

number,age,Medu,Fedu,goout,Walc,health,absences,G1,G2,G3
0,18,4,4,4,1,3,6,5,6,6
1,17,1,1,3,1,3,4,5,5,6
2,15,1,1,2,3,3,10,7,8,10
3,15,4,2,2,1,5,2,15,14,15
4,16,3,3,2,2,5,4,6,10,10
5,16,4,3,2,2,5,10,15,15,15
6,16,2,2,4,1,3,0,12,12,11
7,17,4,4,4,1,1,6,6,5,6
8,15,3,2,2,1,1,0,16,18,19
9,15,3,4,1,1,5,0,14,15,15


In [17]:
i1 = int_stdf[['G1','G2','G3']]
i1.head(10)

number,G1,G2,G3
0,5,6,6
1,5,5,6
2,7,8,10
3,15,14,15
4,6,10,10
5,15,15,15
6,12,12,11
7,6,5,6
8,16,18,19
9,14,15,15


In [33]:
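int_stdf.drop(['G1','G2','G3'], axis=1, inplace=True, errors='ignore')  # remove the raw grades (i1 keeps a copy); errors='ignore' makes the cell safe to re-run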
int_stdf.head(10)

number,age,Medu,Fedu,goout,Walc,health,absences
0,18,4,4,4,1,3,6
1,17,1,1,3,1,3,4
2,15,1,1,2,3,3,10
3,15,4,2,2,1,5,2
4,16,3,3,2,2,5,4
5,16,4,3,2,2,5,10
6,16,2,2,4,1,3,0
7,17,4,4,4,1,1,6
8,15,3,2,2,1,1,0
9,15,3,4,1,1,5,0


#### Variables with low variance have almost the same value for all records. Hence, they do not contribute much to classification.
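
The column names passed to `drop()` above were typed by hand. As a rough alternative, the list can be derived from the variances themselves. The sketch below uses a made-up two-column frame; in the real notebook the grade columns G1 to G3 would need to be kept out of this filter, since they are not independent variables.

```python
import pandas as pd

# Hypothetical toy frame: one nearly constant column, one widely spread column.
toy = pd.DataFrame({
    "traveltime": [1, 1, 2, 1],
    "absences": [0, 6, 10, 2],
})

variances = toy.var(ddof=0)  # population variance of each column
low_var_cols = variances[variances < 1.0].index.tolist()
reduced = toy.drop(columns=low_var_cols)

print(variances)
print(low_var_cols)
```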

# Question 6 - Encode all categorical variables to numerical (0.5 points)

Take the list of categorical attributes (from the above result) and convert them into numerical variables. After that, print the head of the dataframe and check the values.

**Hint:** check **sklearn LabelEncoder()**

#### Answer:

In [19]:
from sklearn.preprocessing import LabelEncoder
o4 = obj_stdf.apply(LabelEncoder().fit_transform)
o4.head(10)

number,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
0,0,0,1,0,0,0,4,0,1,1,0,0,0,1,1,0,0
1,0,0,1,0,1,0,2,0,0,0,1,0,0,0,1,1,0
2,0,0,1,1,1,0,2,2,1,1,0,1,0,1,1,1,0
3,0,0,1,0,1,1,3,1,1,0,1,1,1,1,1,1,1
4,0,0,1,0,1,2,2,1,0,0,1,1,0,1,1,0,0
5,0,1,1,1,1,3,2,3,1,0,1,1,1,1,1,1,0
6,0,1,1,1,1,2,2,1,1,0,0,0,0,1,1,1,0
7,0,0,1,0,0,2,4,1,1,1,1,0,0,1,1,0,0
8,0,1,1,1,0,3,2,1,1,0,1,1,0,1,1,1,0
9,0,1,1,0,1,2,2,1,1,0,1,1,1,1,1,1,0
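
`LabelEncoder` assigns codes in sorted order of the labels, which is why `at_home` became 0 above. Because the cell above applies a throw-away encoder to each column, the label-to-code mapping is not kept. A possible variation, sketched on made-up data, stores one fitted encoder per column so the mapping can be inspected later. Note that integer codes impose an arbitrary order on nominal features such as `Mjob`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({"Mjob": ["at_home", "health", "other", "at_home"],
                    "sex": ["F", "M", "F", "F"]})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Label -> integer code for every column, read from the fitted encoders.
mapping = {col: dict(zip(enc.classes_.tolist(), range(len(enc.classes_))))
           for col, enc in encoders.items()}
print(mapping)
```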


# Question 7 - Convert the continuous values of grades into classes (1 point)

*Treat values in G1, G2 and G3 that are >= 10 as pass (1) and values < 10 as fail (0), and encode them as binary values. Print the head of the dataframe to check the values.*

#### Answer:

In [34]:
#i1 = int_stdf['G1'] >=10
i2 = pd.DataFrame(np.where(i1 >= 10, 1, 0),columns = i1.columns)
i2.head(10)

,G1,G2,G3
0,0,0,0
1,0,0,0
2,0,0,1
3,1,1,1
4,0,1,1
5,1,1,1
6,1,1,1
7,0,0,0
8,1,1,1
9,1,1,1
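
An equivalent way to do the thresholding, shown here on a made-up two-row frame, is a boolean comparison followed by `astype(int)`. Unlike wrapping `np.where` in a new DataFrame, this keeps the original index, including its name.

```python
import pandas as pd

# Hypothetical grades with a named index, as in the notebook.
grades = pd.DataFrame(
    {"G1": [5, 15], "G2": [10, 9], "G3": [6, 15]},
    index=pd.Index([0, 1], name="number"),
)

# True/False for "grade >= 10", then cast to 1/0.
passed = (grades >= 10).astype(int)
print(passed)
```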


# Question 8 (0.5 points)

*Consider G3 as the target attribute and all remaining attributes as features to predict G3. Now, separate the feature and target attributes into separate dataframes named X and y.*

In [38]:
i3 = pd.concat([int_stdf, i2], axis=1)
i3.head(10)

number,age,Medu,Fedu,goout,Walc,health,absences,G1,G2,G3
0,18,4,4,4,1,3,6,0,0,0
1,17,1,1,3,1,3,4,0,0,0
2,15,1,1,2,3,3,10,0,0,1
3,15,4,2,2,1,5,2,1,1,1
4,16,3,3,2,2,5,4,0,1,1
5,16,4,3,2,2,5,10,1,1,1
6,16,2,2,4,1,3,0,1,1,1
7,17,4,4,4,1,1,6,0,0,0
8,15,3,2,2,1,1,0,1,1,1
9,15,3,4,1,1,5,0,1,1,1


In [39]:
st4 = pd.concat([o4, i3], axis=1)
st4.head(10)

number,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,...,age,Medu,Fedu,goout,Walc,health,absences,G1,G2,G3
0,0,0,1,0,0,0,4,0,1,1,...,18,4,4,4,1,3,6,0,0,0
1,0,0,1,0,1,0,2,0,0,0,...,17,1,1,3,1,3,4,0,0,0
2,0,0,1,1,1,0,2,2,1,1,...,15,1,1,2,3,3,10,0,0,1
3,0,0,1,0,1,1,3,1,1,0,...,15,4,2,2,1,5,2,1,1,1
4,0,0,1,0,1,2,2,1,0,0,...,16,3,3,2,2,5,4,0,1,1
5,0,1,1,1,1,3,2,3,1,0,...,16,4,3,2,2,5,10,1,1,1
6,0,1,1,1,1,2,2,1,1,0,...,16,2,2,4,1,3,0,1,1,1
7,0,0,1,0,0,2,4,1,1,1,...,17,4,4,4,1,1,6,0,0,0
8,0,1,1,1,0,3,2,1,1,0,...,15,3,2,2,1,1,0,1,1,1
9,0,1,1,0,1,2,2,1,1,0,...,15,3,4,1,1,5,0,1,1,1


In [41]:
X = st4.drop("G3", axis=1)
y = st4[["G3"]]

# Question 9 - Training and testing data split (0.5 points)

*So far, you have converted all categorical features into numeric values. Now, split the data into training and test sets with training size of 300 records. Print the number of train and test records.*

**Hint:** check **train_test_split()** from **sklearn**

#### Answer:

In [42]:
from sklearn.model_selection import train_test_split

test_size = 0.24  # 24% of the 395 records: 95 for the test set, leaving 300 for training
seed = 5  # random seed, for repeatability of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(300, 26)
(95, 26)
(300, 1)
(95, 1)
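
Instead of working out a fraction, `train_test_split` also accepts an integer `train_size`, which asks for exactly 300 training records. The sketch below uses synthetic arrays with 395 rows as a stand-in for the real X and y. The `stratify` argument is an optional extra that is not used in the notebook; it keeps the pass/fail ratio similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 395 rows, the same size as the dataset used here.
X_demo = np.arange(395 * 2).reshape(395, 2)
y_demo = np.array([0, 1, 1] * 131 + [0, 1])  # 395 made-up binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    train_size=300,      # an integer means an absolute number of training rows
    random_state=5,
    stratify=y_demo,     # keep the class proportions similar in train and test
)
print(X_tr.shape, X_te.shape)
```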


# Question 10 - Model Implementation and Testing the Accuracy (0.5 points)

*Build a **LogisticRegression** classifier using the **fit()** function in sklearn.*
*You need to import both LogisticRegression and accuracy_score from sklearn.*
#### Answer:

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array; a one-column DataFrame triggers a DataConversionWarning
y_predict = model.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.8947368421052632



# Question 11 - Print the intercept of the Logistic regression model (0.5 points)

The value of the intercept is stored in the model itself. You can read it from the .intercept_ attribute.

In [48]:
print('intercept', model.intercept_)

intercept [0.28323339]
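
To see where the intercept enters the prediction: for a binary logistic regression, the predicted probability of the positive class is the sigmoid of `intercept_ + X · coef_`. The sketch below checks this on synthetic data from `make_classification`, not on the student dataset; `X_demo`, `y_demo` and `clf` are names made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score: intercept plus the weighted sum of the features.
z = clf.intercept_ + X_demo @ clf.coef_.ravel()
# The sigmoid turns the score into a probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))

sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

When all features are zero, the score reduces to the intercept alone, so the intercept sets the baseline log-odds of passing.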


# Question 12 - Print the coefficients of the model (0.5 points) and name the coefficient which has the highest impact on the dependent variable (0.5 points)

Hint: Use .coef_ to get the coefficients and use pd.DataFrame to store them in a dataframe with the same column names as the independent-variable dataframe

In [49]:
print('coefficients', model.coef_)

coefficients [[-3.84024988e-01  7.16906740e-02  1.37283314e-01 -3.61491738e-01
  -2.91410412e-01 -6.99772336e-02  1.63725620e-01  1.80199515e-01
   7.57622741e-02 -9.64333672e-02 -2.19269458e-01  3.46865659e-01
  -1.08278828e-01 -2.98145331e-01  3.92957889e-01 -3.51984580e-01
  -3.83193548e-01 -5.22820764e-02 -2.79845556e-03 -4.21354310e-01
  -1.91204892e-01  3.78879987e-01 -1.25370930e-01 -3.12224704e-02
   1.70371987e+00  3.92061923e+00]]


In [51]:
print(pd.DataFrame(np.transpose(model.coef_),X.columns)) 
# G2 has the largest coefficient (3.92), so it seems to have the highest impact on the dependent variable

                   0
school     -0.384025
sex         0.071691
address     0.137283
famsize    -0.361492
Pstatus    -0.291410
Mjob       -0.069977
Fjob        0.163726
reason      0.180200
guardian    0.075762
schoolsup  -0.096433
famsup     -0.219269
paid        0.346866
activities -0.108279
nursery    -0.298145
higher      0.392958
internet   -0.351985
romantic   -0.383194
age        -0.052282
Medu       -0.002798
Fedu       -0.421354
goout      -0.191205
Walc        0.378880
health     -0.125371
absences   -0.031222
G1          1.703720
G2          3.920619
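
Rather than reading the largest value off the table, the coefficients can be ranked by absolute value. The sketch below uses four of the coefficients printed above as hard-coded inputs, so it runs without the fitted model. Because the features were not scaled before fitting, comparing magnitudes gives only a rough indication of impact.

```python
import numpy as np
import pandas as pd

# Four coefficients copied from the table above, shaped like model.coef_.
feature_names = ["age", "Walc", "G1", "G2"]
coef = np.array([[-0.052282, 0.378880, 1.703720, 3.920619]])

coef_series = pd.Series(coef.ravel(), index=feature_names)
# Order the features by coefficient magnitude, largest first.
order = coef_series.abs().sort_values(ascending=False).index
ranked = coef_series.reindex(order)
top_feature = ranked.index[0]

print(ranked)
print(top_feature)
```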


# Question 13 - Predict the dependent variable for both training and test dataset (0.5 points)

accuracy_score() should help you print the accuracies.

In [52]:
accuracy_test = accuracy_score(y_test, y_predict)
print(accuracy_test)

0.8947368421052632


In [53]:
y_predict_train = model.predict(X_train)
accuracy_train = accuracy_score(y_train, y_predict_train)
print(accuracy_train)

0.93
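
Accuracy alone can be hard to judge when one class is more common than the other. As a possible follow-up, the sketch below compares accuracy against a baseline that always predicts the majority class, and prints a confusion matrix. The labels and predictions are made up, not taken from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for ten students.
y_true = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])

acc = accuracy_score(y_true, y_pred)

# Baseline: always predict the most frequent class in y_true.
majority = np.bincount(y_true).argmax()
baseline_acc = accuracy_score(y_true, np.full_like(y_true, majority))

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(acc, baseline_acc)
print(cm)
```

A model is only informative to the extent that it beats the majority-class baseline.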
