
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a classification problem: each student must be assigned to one of two discrete classes, pass or fail.

The outcome can be encoded as a binary label:
- 1 (yes) for students who need early intervention.
- 0 (no) for students who do not need early intervention.

Because the outcome we are trying to predict is a discrete category rather than a continuous value, this is not a regression problem.
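
As a minimal sketch of this binary encoding, the snippet below derives an intervention flag from a few made-up `yes`/`no` labels. The `passed` series here is hypothetical toy data, not the project dataset, and `needs_intervention` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy labels in the same 'yes'/'no' format as the 'passed' column
passed = pd.Series(["no", "no", "yes", "yes", "yes"], name="passed")

# A student who did not pass is flagged for intervention (1), otherwise 0
needs_intervention = (passed == "no").astype(int)

print(needs_intervention.tolist())
```

The target takes only two values, which is what makes this a binary classification task.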

### Question-2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, is our target label (whether or not the student graduated). All other columns are features describing each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

In [2]:
# Read student data
df = pd.read_csv('Data/student-data.csv')
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [3]:
df.shape

(395, 31)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 31 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [5]:
df.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
passed        0
dtype: int64

### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [6]:
# Calculate number of students
n_students = df.shape[0]

In [7]:
# Calculate number of features (all columns except the target 'passed')
n_features = df.shape[1] - 1

In [8]:
# Calculate passing students
n_passed = df['passed'].loc[df['passed'] == 'yes'].count()

In [9]:
# Calculate failing students
n_failed = df['passed'].loc[df['passed'] == 'no'].count()

In [10]:
# Calculate graduation rate
grad_rate = (n_passed/n_students) * 100

In [11]:
# Print the results
print("\nTotal number of students in the dataset: ", n_students)
print("\nTotal number of features for each student in the dataset: ", n_features)
print("\nTotal number of students who passed: ", n_passed)
print("\nTotal number of students who failed: ", n_failed)
print("\nGraduation rate of the class is: ", round(grad_rate, 3))


Total number of students in the dataset:  395

Total number of features for each student in the dataset:  30

Total number of students who passed:  265

Total number of students who failed:  130

Graduation rate of the class is:  67.089
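
The same class counts could also be obtained in one step with `value_counts`. The sketch below assumes a stand-in series built from the counts reported above (265 passed, 130 failed) rather than the real `df['passed']` column.

```python
import pandas as pd

# Hypothetical stand-in for df['passed'], built from the counts reported above
passed = pd.Series(["yes"] * 265 + ["no"] * 130, name="passed")

counts = passed.value_counts()                # absolute count per class
rates = passed.value_counts(normalize=True)   # fraction per class

n_passed = int(counts["yes"])
n_failed = int(counts["no"])
grad_rate = rates["yes"] * 100

print(n_passed, n_failed, round(grad_rate, 3))
```

Roughly two thirds of the students passed, so the classes are moderately imbalanced, which is worth keeping in mind when reading accuracy scores.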


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns, then check whether any features are non-numeric.

In [12]:
# Extract feature columns

In [13]:
df.drop('passed', axis = 1)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,yes,no,no,4,3,4,1,1,3,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,yes,no,5,3,3,1,1,3,4
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,yes,no,4,3,2,2,3,3,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,yes,3,2,2,1,1,5,2
4,GP,F,16,U,GT3,T,3,3,other,other,...,yes,no,no,4,3,2,1,2,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,yes,no,no,5,5,4,4,5,4,11
391,MS,M,17,U,LE3,T,3,1,services,services,...,yes,yes,no,2,4,5,3,4,2,3
392,MS,M,21,R,GT3,T,1,1,other,other,...,yes,no,no,5,5,3,3,3,3,3
393,MS,M,18,R,LE3,T,3,2,services,other,...,yes,yes,no,4,4,1,3,4,5,0


In [14]:
# Extract target column 'passed'

In [15]:
df['passed']

0       no
1       no
2      yes
3      yes
4      yes
      ... 
390     no
391    yes
392     no
393    yes
394     no
Name: passed, Length: 395, dtype: object

In [16]:
# Separate the data into feature data and target data (X and y, respectively)

In [17]:
X = df.drop('passed', axis = 1)
y = df['passed']

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
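
To illustrate the dummy-variable idea, here is a small sketch on a made-up frame. The `toy` DataFrame is hypothetical and only borrows the `Fjob` column name for readability.

```python
import pandas as pd

# Hypothetical toy frame with one multi-valued categorical column
toy = pd.DataFrame({"Fjob": ["teacher", "other", "services", "other"]})

# Each distinct value becomes its own 0/1 indicator column
dummies = pd.get_dummies(toy, columns=["Fjob"], dtype=int)

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())  # exactly one indicator is set per row
```

Passing `dtype=int` keeps the indicators as `0`/`1` integers; depending on the pandas version, the default may be boolean instead.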

In [18]:
obj_cols = X.select_dtypes(include='object').columns

In [19]:
for each in obj_cols:
    print(X[each].value_counts())

GP    349
MS     46
Name: school, dtype: int64
F    208
M    187
Name: sex, dtype: int64
U    307
R     88
Name: address, dtype: int64
GT3    281
LE3    114
Name: famsize, dtype: int64
T    354
A     41
Name: Pstatus, dtype: int64
other       141
services    103
at_home      59
teacher      58
health       34
Name: Mjob, dtype: int64
other       217
services    111
teacher      29
at_home      20
health       18
Name: Fjob, dtype: int64
course        145
home          109
reputation    105
other          36
Name: reason, dtype: int64
mother    273
father     90
other      32
Name: guardian, dtype: int64
no     344
yes     51
Name: schoolsup, dtype: int64
yes    242
no     153
Name: famsup, dtype: int64
no     214
yes    181
Name: paid, dtype: int64
yes    201
no     194
Name: activities, dtype: int64
yes    314
no      81
Name: nursery, dtype: int64
yes    375
no      20
Name: higher, dtype: int64
yes    329
no      66
Name: internet, dtype: int64
no     263
yes    132
Name: romantic, 

One-hot encode the columns that have more than two categories:

In [20]:
X = pd.get_dummies(data=X, columns=['Mjob', 'Fjob', 'reason', 'guardian'])
X.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,...,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,GP,F,18,U,GT3,A,4,4,2,2,...,0,0,1,1,0,0,0,0,1,0
1,GP,F,17,U,GT3,T,1,1,1,2,...,1,0,0,1,0,0,0,1,0,0
2,GP,F,15,U,LE3,T,1,1,1,2,...,1,0,0,0,0,1,0,0,1,0
3,GP,F,15,U,GT3,T,4,2,1,3,...,0,1,0,0,1,0,0,0,1,0
4,GP,F,16,U,GT3,T,3,3,1,2,...,1,0,0,0,1,0,0,1,0,0


Label-encode the binary categorical columns:

In [21]:
obj_cols = X.select_dtypes(include='object').columns
obj_cols

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'schoolsup', 'famsup',
       'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic'],
      dtype='object')

In [22]:
from sklearn.preprocessing import LabelEncoder

In [23]:
le = LabelEncoder()
for each in obj_cols:
    X[each] = le.fit_transform(X[each])

In [24]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 43 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   school             395 non-null    int32
 1   sex                395 non-null    int32
 2   age                395 non-null    int64
 3   address            395 non-null    int32
 4   famsize            395 non-null    int32
 5   Pstatus            395 non-null    int32
 6   Medu               395 non-null    int64
 7   Fedu               395 non-null    int64
 8   traveltime         395 non-null    int64
 9   studytime          395 non-null    int64
 10  failures           395 non-null    int64
 11  schoolsup          395 non-null    int32
 12  famsup             395 non-null    int32
 13  paid               395 non-null    int32
 14  activities         395 non-null    int32
 15  nursery            395 non-null    int32
 16  higher             395 non-null    int32
 17  internet        

Label-encode the target variable:

In [25]:
y = le.fit_transform(y)
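
To see which integer each label receives, the encoder's `classes_` attribute can be inspected. The sketch below uses a hypothetical toy target rather than the real `y`, and a separate `encoder` object so the mapping stays available afterwards.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target in the same format as the 'passed' column
toy_target = ["no", "no", "yes", "yes", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_target)

# classes_ lists the original labels in the order of their integer codes
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(encoded)))
```

Labels are sorted before encoding, so `no` maps to `0` and `yes` maps to `1`.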

### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.
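
One way to request exactly 300 training and 95 testing points is to pass an integer `test_size`. The sketch below assumes random stand-in arrays (`X_demo`, `y_demo`) with the same shape as the encoded student data. The `stratify` argument is an optional extra that the steps above do not require.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the encoded student data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(395, 43))
y_demo = np.array([1] * 265 + [0] * 130)

# An integer test_size requests an exact number of test rows;
# stratify=y_demo keeps the pass/fail ratio similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=2, stratify=y_demo
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing `random_state` makes the shuffle reproducible across runs.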

In [27]:
# splitting the data into train and test

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, test_size=0.24)

In [29]:
# Show the results of the split
print("\nTraining feature set size: ", X_train.shape)
print("\nTraining target set size: ", y_train.shape)
print("\nTesting feature set size: ", X_test.shape)
print("\nTesting target set size: ", y_test.shape)


Training feature set size:  (300, 43)

Training target set size:  (300,)

Testing feature set size:  (95, 43)

Testing target set size:  (95,)


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.
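
Fitting a model on varying training-set sizes could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the student data, and the sizes 100, 200, and 300 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (395 rows, 43 features)
X_demo, y_demo = make_classification(
    n_samples=395, n_features=43, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=2
)

scores = {}
for n in (100, 200, 300):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[:n], y_tr[:n])  # train on the first n rows only
    scores[n] = accuracy_score(y_te, model.predict(X_te))

for n, acc in scores.items():
    print(f"train size {n}: test accuracy {acc:.3f}")
```

The test set stays fixed while the training set grows, so the scores are directly comparable.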

#### Three models:
- Logistic Regression
- Support Vector Machine
- Decision tree

###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

#### Explanation:

We selected the models below for their simplicity and modest computational requirements.
##### 1. Logistic Regression:
It applies the logistic (sigmoid) function to a linear combination of the features to model a binary outcome. The output is a probability p (0 ≤ p ≤ 1), which is thresholded to obtain a class label.

Advantages:
- Easy, fast and simple classification method.
- Can be used for multiclass classifications also.

Disadvantages:
- Learns a linear decision boundary, so it handles non-linear classification problems poorly unless the features are transformed.
- Collinearity and outliers can reduce the accuracy of a logistic regression model.
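
The logistic function behind this model can be sketched in a few lines. The `scores` array holds hypothetical linear scores, and the 0.5 threshold is the conventional default, assumed here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, 0.0, 4.0])   # hypothetical linear scores w·x + b
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 for a class label

print(probs.round(3), labels)
```

A score of zero corresponds to a probability of exactly 0.5, which is where the decision boundary lies.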

##### 2. Support Vector Machine
The goal of the SVM algorithm is to find the decision boundary that best separates the classes in n-dimensional feature space, so that new data points can be assigned to the correct category. This boundary is a hyperplane, chosen to maximize the margin to the nearest training points of each class.

Advantages:
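- SVM is effective in high-dimensional spaces.
- SVM is relatively memory efficient, because the decision function uses only a subset of the training points (the support vectors).

Disadvantages:
- SVM training scales poorly with the number of samples, so it is not well suited to large data sets.
- When the number of features exceeds the number of training samples, the SVM is prone to overfitting unless the kernel and regularization are chosen carefully.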
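
For a linear kernel, the fitted hyperplane can be read directly from the model. The sketch below assumes a handful of made-up, linearly separable 2-D points (`X_demo`, `y_demo`) purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points
X_demo = np.array(
    [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
)
y_demo = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X_demo, y_demo)

# With a linear kernel the boundary is the hyperplane w·x + b = 0
w, b = svm.coef_[0], svm.intercept_[0]
preds = svm.predict([[0.5, 0.5], [3.5, 3.5]])

print("w =", w, "b =", b)
print("support vectors:", len(svm.support_), "predictions:", preds)
```

Only the support vectors determine the boundary, which is why the fitted model can stay compact.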

##### 3. Decision Tree
A decision tree is a tree-based algorithm used to solve both regression and classification problems.

Advantages:
- Little data preprocessing is needed (for example, no feature scaling).
- Decision trees can provide an understandable explanation of each prediction.

Disadvantages:
- Prone to overfitting if the tree keeps growing to achieve high purity.
- Unstable: small changes in the training data, including outliers, can produce a very different tree.
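
The overfitting risk can be illustrated by comparing an unrestricted tree with a depth-limited one. This is a sketch on synthetic, deliberately noisy data from `make_classification`, not the student data, and `max_depth=3` is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% of the labels flipped at random
X_demo, y_demo = make_classification(
    n_samples=395, n_features=20, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=2
)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in (("unrestricted", deep), ("max_depth=3", shallow)):
    print(
        name,
        "| depth:", tree.get_depth(),
        "| train acc:", round(tree.score(X_tr, y_tr), 3),
        "| test acc:", round(tree.score(X_te, y_te), 3),
    )
```

The unrestricted tree fits the training rows perfectly, noise included, so its training accuracy says little about how it will generalize.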

In [30]:
# Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [31]:
from sklearn.metrics import accuracy_score

In [32]:
# Fit model 1 on the training data

#### Logistic Regression Model

In [36]:
log_reg = LogisticRegression(max_iter=300)
model_1 = log_reg.fit(X_train, y_train)

In [37]:
# predict on the test data 

In [38]:
lr_predictions = model_1.predict(X_test)

In [39]:
# calculate the accuracy score

In [41]:
print("The accuracy score of Logistic Regression model is: ", round((accuracy_score(y_test, lr_predictions)*100), 2), '%')

The accuracy score of Logistic Regression model is:  75.79 %


In [42]:
# Fit model 2 on the training data, predict on the test data, and measure the accuracy

#### Support Vector Machine Model

In [43]:
supp_vec = SVC(kernel='linear')
model_2 = supp_vec.fit(X_train, y_train)

In [44]:
#predictions

In [45]:
svm_predictions = model_2.predict(X_test)

In [46]:
#accuracy score

In [51]:
print("The accuracy score of SVM model is: ", round((accuracy_score(y_test, svm_predictions)*100), 2), '%')

The accuracy score of SVM model is:  72.63 %


#### Decision Tree model

In [48]:
# Fit model 3 on the training data, predict on the test data, and measure the accuracy

In [49]:
dtree = DecisionTreeClassifier()
model_3 = dtree.fit(X_train, y_train)
dt_predictions = model_3.predict(X_test)

In [52]:
print("The accuracy score of Decision tree model is: ", round((accuracy_score(y_test, dt_predictions)*100), 2), '%')

The accuracy score of Decision tree model is:  68.42 %


#### Conclusion:
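- Logistic Regression achieved the highest test accuracy (75.79 %), ahead of the linear SVM (72.63 %) and the Decision Tree (68.42 %), so it appears to be the best of the three models for this data.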
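
One way to put these accuracy figures in context is a majority-class baseline. The sketch below assumes stand-in labels built from the class counts reported earlier (265 passed, 130 failed). It scores the baseline on the full label set, not on the test split used above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in labels with the class counts reported earlier (265 pass, 130 fail)
y_demo = np.array([1] * 265 + [0] * 130)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(f"majority-class accuracy: {baseline_acc * 100:.2f} %")
```

A model is only useful here if it clearly beats this "always predict pass" score.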