
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a supervised classification problem: the target, `passed`, takes one of two discrete values (`yes` or `no`), so we predict a class label rather than a continuous quantity as regression would.

### Question-2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, is the target label (whether the student graduated or didn't graduate). All other columns are features describing each student.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Read student data
df=pd.read_csv('student-data.csv')

In [3]:
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [4]:
df.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'passed'],
      dtype='object')

### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [5]:
# Calculate number of students
n_students=df.shape[0]
n_students

395

In [6]:
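# Calculate number of features
# Note: df.shape[1] counts all 31 columns, including the target 'passed';
# excluding the target leaves 30 feature columns (df.shape[1] - 1).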
n_features=df.shape[1]
n_features

31

In [7]:
# Calculate passing students
n_passed=len(df[df['passed']=='yes'])
n_passed


265

In [8]:
# Calculate failing students
n_failed=len(df[df['passed']=='no'])
n_failed

130

In [9]:
# Calculate graduation rate
grad_rate=(n_passed/n_students)*100
grad_rate

67.08860759493672

In [10]:
# Print the results
print('The total number of students in a school is: ',n_students)
print('The total features of a student is: ',n_features)
print('The total number of students who have passed :',n_passed)
print('The total number of students who have failed: ',n_failed)
print('Graduation rate of a school is: ',grad_rate)

The total number of students in a school is:  395
The total features of a student is:  31
The total number of students who have passed : 265
The total number of students who have failed:  130
Graduation rate of a school is:  67.08860759493672
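The same counts can also be read off in one step with `value_counts`. The following is a minimal sketch on a small made-up column (the `toy` frame and the `*_toy` names are hypothetical stand-ins for `df['passed']`, not the real data):

```python
import pandas as pd

# Hypothetical toy target column standing in for df['passed'].
toy = pd.DataFrame({"passed": ["yes", "no", "yes", "yes"]})

# Absolute counts per class, and the same counts as percentages.
counts = toy["passed"].value_counts()
rates = toy["passed"].value_counts(normalize=True) * 100

n_passed_toy = int(counts.get("yes", 0))
n_failed_toy = int(counts.get("no", 0))
grad_rate_toy = float(rates.get("yes", 0.0))

print(f"passed={n_passed_toy}, failed={n_failed_toy}, rate={grad_rate_toy:.2f}%")
```

Formatting the rate with `:.2f` also avoids printing the full floating-point expansion.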


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns, and check whether any features are non-numeric.

In [11]:
# Extract feature columns

In [12]:
x=df.drop(['passed'],axis=1)
x

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,yes,no,no,4,3,4,1,1,3,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,yes,no,5,3,3,1,1,3,4
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,yes,no,4,3,2,2,3,3,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,yes,3,2,2,1,1,5,2
4,GP,F,16,U,GT3,T,3,3,other,other,...,yes,no,no,4,3,2,1,2,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,yes,no,no,5,5,4,4,5,4,11
391,MS,M,17,U,LE3,T,3,1,services,services,...,yes,yes,no,2,4,5,3,4,2,3
392,MS,M,21,R,GT3,T,1,1,other,other,...,yes,no,no,5,5,3,3,3,3,3
393,MS,M,18,R,LE3,T,3,2,services,other,...,yes,yes,no,4,4,1,3,4,5,0


In [13]:
# Extract target column 'passed'

In [14]:
y=pd.DataFrame(df['passed'])
y

Unnamed: 0,passed
0,no
1,no
2,yes
3,yes
4,yes
...,...
390,no
391,yes
392,no
393,yes


In [15]:
# Separate the data into feature data and target data (X and y, respectively)

In [16]:
x=df.drop(['passed'],axis=1)
y=pd.DataFrame(df['passed'])

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [17]:
x.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences'],
      dtype='object')

In [18]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 30 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [19]:
x.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,yes,no,no,4,3,4,1,1,3,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,yes,no,5,3,3,1,1,3,4
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,yes,no,4,3,2,2,3,3,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,yes,3,2,2,1,1,5,2
4,GP,F,16,U,GT3,T,3,3,other,other,...,yes,no,no,4,3,2,1,2,5,4


In [20]:
# Encode the binary (two-valued) columns with LabelEncoder
from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
x['school']=label_encoder.fit_transform(x['school'])

In [21]:
for i in x[['sex','address','famsize','Pstatus','schoolsup','famsup','paid','activities','nursery','higher','internet','romantic',]]:
    x[i]=label_encoder.fit_transform(x[i])
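`LabelEncoder` assigns integer codes in sorted order of the class labels, so `no` becomes `0` and `yes` becomes `1`. Because refitting a single encoder overwrites its `classes_`, one option is to keep a separate encoder per column so the mapping can be inspected later. A minimal sketch on a made-up frame (the `toy` data and the `encoders` dictionary are illustrative assumptions, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column frame standing in for the binary student features.
toy = pd.DataFrame({"sex": ["F", "M", "F"], "internet": ["yes", "no", "yes"]})

encoders = {}
for col in ["sex", "internet"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # codes follow sorted label order
    encoders[col] = enc                     # keep the fitted encoder per column

# Recover which original label each integer code stands for.
mapping = {col: enc.classes_.tolist() for col, enc in encoders.items()}
print(mapping)
```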

In [22]:
x

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,0,0,18,1,0,0,4,4,at_home,teacher,...,1,0,0,4,3,4,1,1,3,6
1,0,0,17,1,0,1,1,1,at_home,other,...,1,1,0,5,3,3,1,1,3,4
2,0,0,15,1,1,1,1,1,at_home,other,...,1,1,0,4,3,2,2,3,3,10
3,0,0,15,1,0,1,4,2,health,services,...,1,1,1,3,2,2,1,1,5,2
4,0,0,16,1,0,1,3,3,other,other,...,1,0,0,4,3,2,1,2,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,1,1,20,1,1,0,2,2,services,services,...,1,0,0,5,5,4,4,5,4,11
391,1,1,17,1,1,1,3,1,services,services,...,1,1,0,2,4,5,3,4,2,3
392,1,1,21,0,0,1,1,1,other,other,...,1,0,0,5,5,3,3,3,3,3
393,1,1,18,0,1,1,3,2,services,other,...,1,1,0,4,4,1,3,4,5,0


In [23]:
x.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences'],
      dtype='object')

In [24]:
# One-hot encode the remaining nominal (multi-valued) columns
x=pd.get_dummies(x)
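To illustrate what `get_dummies` does to a multi-valued column, here is a small sketch on made-up rows (the `toy` frame is hypothetical; passing `columns=` and `dtype=int` is a choice made for this example, not what the cell above does):

```python
import pandas as pd

# Hypothetical rows with one yes/no column and one multi-valued column.
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Binary column: map directly to 1/0.
toy["internet"] = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column: one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["Fjob"], dtype=int)
print(encoded)
```

Each row has exactly one `1` among the generated `Fjob_*` columns.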

In [25]:
x.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid',
       'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences',
       'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other',
       'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home',
       'reason_other', 'reason_reputation', 'guardian_father',
       'guardian_mother', 'guardian_other'],
      dtype='object')

In [26]:
y['passed']=label_encoder.fit_transform(y['passed'])
y

Unnamed: 0,passed
0,0
1,0
2,1
3,1
4,1
...,...
390,0
391,1
392,0
393,1


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.
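The split described above can be sketched as follows. This is a minimal example on synthetic data with the same row count and class balance as the student data (the `X_demo` and `y_demo` arrays are made up, and `stratify` is an optional addition that keeps the pass/fail ratio similar in both subsets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 395 rows, 265 positive and 130 negative labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(395, 5))
y_demo = np.array([1] * 265 + [0] * 130)

# An integer test_size is an absolute number of rows, not a fraction.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```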

In [27]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split

In [28]:
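# Note: an integer test_size is an absolute row count, so this holds out 20 rows
# (375 train / 20 test); test_size=95 would give the 300/95 split described above.
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=20,random_state=42)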

In [29]:
# Show the results of the split
print('X-train:',x_train.shape,'Y-train:',y_train.shape,'X-test:',x_test.shape,'y_test:',y_test.shape,end='\n',sep=' ')

X-train: (375, 43) Y-train: (375, 1) X-test: (20, 43) y_test: (20, 1)


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.
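Fitting a model to varying sizes of training data could look like the sketch below. It uses synthetic data from `make_classification` rather than the student data, and the sizes 100, 200, and 300 are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student data (395 rows), not the real dataset.
X_demo, y_demo = make_classification(n_samples=395, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=42
)

size_scores = {}
for n in (100, 200, 300):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_tr[:n], y_tr[:n])  # train on the first n training rows only
    size_scores[n] = accuracy_score(y_te, clf.predict(X_te))  # fixed test set

print(size_scores)
```

Evaluating every training size against the same test set keeps the scores comparable.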

In [30]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
from sklearn.tree import DecisionTreeClassifier
dr=DecisionTreeClassifier()
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()


Because the goal is to predict whether a student passes, which is a categorical outcome, we need classification models. Here we use logistic regression, a decision tree, and a random forest.

We'll fit each model to our data and compare how well they perform.

###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

Explanation:

Logistic Regression = A linear classification model well suited to a binary target variable. We need to predict whether a student passes or fails, and logistic regression handles two-class problems well. Multinomial variants exist for more than two classes, but ordinary binary logistic regression is sufficient here. Its main weakness is that it assumes a linear decision boundary in the features.

Decision Tree = Decision trees can be used for classification (and regression): the model learns a tree of feature-based splits and predicts the outcome at the leaf a sample reaches. A single tree can overfit the training data; a random forest is one way to mitigate this.

Random Forest = A random forest is an ensemble of decision trees: it trains many trees on random subsets of the data and features and combines their votes to predict the output. Averaging across trees reduces the overfitting of any single tree.

In [31]:
# Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
from sklearn.tree import DecisionTreeClassifier
dr=DecisionTreeClassifier()
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()

In [32]:
# Fit model 1 on the training data
# Logistic regression

In [33]:
lr.fit(x_train,y_train)

  return f(*args, **kwargs)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
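The warning above suggests either raising `max_iter` or scaling the data. One way to do both is a pipeline that standardizes the features before fitting. This is a minimal sketch on synthetic data with deliberately mismatched feature scales (the data and the value `max_iter=1000` are assumptions, not settings from the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, standing in for unscaled columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 > 0).astype(int)

# Standardize, then fit logistic regression with a higher iteration cap.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)  # y is 1-D here, which also avoids the column-vector warning

train_acc = model.score(X_demo, y_demo)
print(f"training accuracy: {train_acc:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only and reused unchanged at prediction time.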

In [34]:
# predict on the test data 

In [35]:
lr_pred=lr.predict(x_test)

In [36]:
# calculate the accuracy score

In [37]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix,classification_report

In [38]:
print('Accuracy score is: ',accuracy_score(y_test,lr_pred),'Recall Score is: ',recall_score(y_test,lr_pred),'Precision Score is:',precision_score(y_test,lr_pred),'F1 Score is: ',f1_score(y_test,lr_pred),end='\n')

Accuracy score is:  0.6 Recall Score is:  0.8333333333333334 Precision Score is: 0.625 F1 Score is:  0.7142857142857143


In [39]:
confusion_matrix(y_test,lr_pred)

array([[ 2,  6],
       [ 2, 10]], dtype=int64)
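The scores printed earlier follow directly from this confusion matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the entries are TN, FP, FN, TP in row-major order. A short sketch recomputing the metrics by hand:

```python
import numpy as np

# Confusion matrix reported above for logistic regression.
cm = np.array([[2, 6],
               [2, 10]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # (10 + 2) / 20
precision = tp / (tp + fp)           # 10 / 16
recall = tp / (tp + fn)              # 10 / 12
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The high recall and low precision reflect that the model predicts "pass" for 16 of the 20 test students.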

In [40]:
print(classification_report(y_test,lr_pred))

              precision    recall  f1-score   support

           0       0.50      0.25      0.33         8
           1       0.62      0.83      0.71        12

    accuracy                           0.60        20
   macro avg       0.56      0.54      0.52        20
weighted avg       0.57      0.60      0.56        20

In [41]:
# Fit model 2 on the training data, predict on the test data, and measure the accuracy
# Decision tree

In [42]:
dr.fit(x_train,y_train)

DecisionTreeClassifier()

In [43]:
# predict on the test data 
dr_pred=dr.predict(x_test)

In [44]:
# calculate the accuracy score
print('Accuracy score is: ',accuracy_score(y_test,dr_pred),'Recall Score is: ',recall_score(y_test,dr_pred),'Precision Score is:',precision_score(y_test,dr_pred),'F1 Score is: ',f1_score(y_test,dr_pred),end='\n')

Accuracy score is:  0.45 Recall Score is:  0.5833333333333334 Precision Score is: 0.5384615384615384 F1 Score is:  0.5599999999999999


In [45]:
confusion_matrix(y_test,dr_pred)

array([[2, 6],
       [5, 7]], dtype=int64)

In [46]:
print(classification_report(y_test,dr_pred))

              precision    recall  f1-score   support

           0       0.29      0.25      0.27         8
           1       0.54      0.58      0.56        12

    accuracy                           0.45        20
   macro avg       0.41      0.42      0.41        20
weighted avg       0.44      0.45      0.44        20

In [47]:
# Fit model 3 on the training data, predict on the test data, and measure the accuracy
# Random forest

In [48]:
rf.fit(x_train,y_train)


RandomForestClassifier()

In [49]:
# predict on the test data 
rf_pred=rf.predict(x_test)

In [50]:
# calculate the accuracy score
print('Accuracy score is: ',accuracy_score(y_test,rf_pred),'Recall Score is: ',recall_score(y_test,rf_pred),'Precision Score is:',precision_score(y_test,rf_pred),'F1 Score is: ',f1_score(y_test,rf_pred),end='\n')

Accuracy score is:  0.65 Recall Score is:  1.0 Precision Score is: 0.631578947368421 F1 Score is:  0.7741935483870968


In [51]:
confusion_matrix(y_test,rf_pred)

array([[ 1,  7],
       [ 0, 12]], dtype=int64)

In [52]:
print(classification_report(y_test,rf_pred))

              precision    recall  f1-score   support

           0       1.00      0.12      0.22         8
           1       0.63      1.00      0.77        12

    accuracy                           0.65        20
   macro avg       0.82      0.56      0.50        20
weighted avg       0.78      0.65      0.55        20
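With only 20 test rows, each misclassified student shifts accuracy by 5 points, so the ranking of the three models may not be stable. Cross-validation is one way to get a steadier comparison. The sketch below runs on synthetic data from `make_classification`, not the student data, and the choice of 5 folds and F1 scoring is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the student data.
X_demo, y_demo = make_classification(n_samples=395, n_features=10, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1").mean()
    for name, model in models.items()
}

for name, score in cv_f1.items():
    print(f"{name}: mean F1 = {score:.3f}")
```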