
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:** 

 ***It is a classification problem because we are predicting a discrete output: in this case, whether a student will pass or fail.***

### Question-2
Load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [46]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Read student data
df = pd.read_csv("student-data.csv")
df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,no,no,5,5,4,4,5,4,11,no
391,MS,M,17,U,LE3,T,3,1,services,services,...,yes,no,2,4,5,3,4,2,3,yes
392,MS,M,21,R,GT3,T,1,1,other,other,...,no,no,5,5,3,3,3,3,3,no
393,MS,M,18,R,LE3,T,3,2,services,other,...,yes,no,4,4,1,3,4,5,0,yes


### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [5]:
# Calculate number of students
n_students = len(df)
n_students

395

In [14]:
# Calculate number of features
n_features = len(df.columns[:-1])
n_features

30

In [22]:
# Calculate passing students
n_passed = len(df[df.passed=='yes'])
n_passed

265

In [29]:
# Calculate failing students
n_failed = len(df[df.passed=="no"])
n_failed

130

In [30]:
# Calculate graduation rate
grad_rate = (n_passed/n_students)*100
grad_rate

67.08860759493672

In [34]:
# Print the results
print("Total no. of students:",n_students)
print("Total no. of features:",n_features)
print("Total no. of students passed:",n_passed)
print("Total no. of students failed:",n_failed)
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total no. of students: 395
Total no. of features: 30
Total no. of students passed: 265
Total no. of students failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns to see if any features are non-numeric.

In [9]:
# Extract feature columns

In [43]:
feature_cols = list(df.columns[:-1])
print("Feature columns:\n\n",feature_cols)

Feature columns:

 ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


In [37]:
# Extract target column 'passed'

In [44]:
target_col = df.columns[-1]
print("\nTarget column: ",target_col)


Target column:  passed


In [13]:
# Separate the data into feature data and target data (X and y, respectively)

In [48]:
x = df[feature_cols]
y = df[target_col]

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.
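To make the `yes`/`no` conversion concrete, here is a minimal sketch on a toy frame. The frame `toy` and the dictionary `binary_map` are illustrative names, not part of the notebook, and the rows are made up. An explicit mapping gives the same codes that `LabelEncoder` would assign to these two labels, since it sorts classes alphabetically (`no` becomes 0, `yes` becomes 1).

```python
import pandas as pd

# Toy stand-in for two yes/no columns; the values are illustrative, not real rows.
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "romantic": ["no", "no", "yes"],
})

# Explicit mapping, so it is obvious which label becomes 1 and which becomes 0.
binary_map = {"yes": 1, "no": 0}
encoded = toy.apply(lambda col: col.map(binary_map))

print(encoded)
```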

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [55]:
# Label-encode the categorical columns that hold only two distinct values.
from sklearn.preprocessing import LabelEncoder
label_en = LabelEncoder()
x = x.copy()  # work on a copy so the assignments below do not target a slice of df
col_list = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
            'higher', 'internet', 'romantic']
for col in col_list:
    x[col] = label_en.fit_transform(x[col])

In [60]:
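# One-hot encode the remaining categorical columns, which have more than two values.
dummy_cols = ['Mjob', 'Fjob', 'reason', 'guardian']
X = pd.get_dummies(x, columns=dummy_cols)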
X.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,Mjob_at_home,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_at_home,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,0,0,18,1,0,0,4,4,2,2,0,1,0,0,0,1,1,0,0,4,3,4,1,1,3,6,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0
1,0,0,17,1,0,1,1,1,1,2,0,0,1,0,0,0,1,1,0,5,3,3,1,1,3,4,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0
2,0,0,15,1,1,1,1,1,1,2,3,1,0,1,0,1,1,1,0,4,3,2,2,3,3,10,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0
3,0,0,15,1,0,1,4,2,1,3,0,0,1,1,1,1,1,1,1,3,2,2,1,1,5,2,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0
4,0,0,16,1,0,1,3,3,1,2,0,0,1,1,0,1,1,0,0,4,3,2,1,2,5,4,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0
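The dummy-variable idea can also be seen in isolation on a tiny example. This is a sketch on a made-up frame (`toy` is a hypothetical name); `dtype=int` is passed only so that the result prints as `0`/`1` regardless of the pandas version.

```python
import pandas as pd

# Made-up rows with one numeric and one multi-valued categorical column.
toy = pd.DataFrame({
    "age": [18, 17, 15],
    "Fjob": ["teacher", "other", "services"],
})

# Each distinct Fjob value becomes its own 0/1 column; 'age' is left untouched.
dummies = pd.get_dummies(toy, columns=["Fjob"], dtype=int)

print(dummies)
```

Each row has exactly one `1` among the generated `Fjob_*` columns, which is the "assign a `1` to one of them and `0` to all others" rule described above.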


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [61]:
# Split the encoded features (X) and labels (y) into training and test sets
from sklearn.model_selection import train_test_split
num_train = 300
num_test = X.shape[0] - num_train
x_train, x_test, y_train, y_test = train_test_split(X,y,train_size=num_train,test_size=num_test,random_state=42)

In [62]:
# Show the results of the split
print("Training set has ",x_train.shape[0], " samples.")
print("Testing set has ",x_test.shape[0], " samples.")

Training set has  300  samples.
Testing set has  95  samples.


In [63]:
x_train.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,Mjob_at_home,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_at_home,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
210,0,0,19,1,0,1,3,3,1,4,0,0,1,1,1,1,1,1,0,4,3,3,1,2,3,10,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1
75,0,1,15,1,0,1,4,3,1,2,0,0,1,1,1,1,1,1,0,4,3,3,2,3,5,6,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0
104,0,1,15,1,0,0,3,4,1,2,0,0,1,1,1,1,1,1,0,5,4,4,1,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0
374,1,0,18,0,1,1,4,4,2,3,0,0,0,0,0,1,1,1,0,5,4,4,1,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0
16,0,0,16,1,0,1,4,4,1,3,0,0,1,1,1,1,1,1,0,3,2,3,1,2,2,6,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.

Supervised learning models are:

1. Logistic Regression
2. K-Nearest Neighbors (KNeighbors)
3. Support Vector Machines (SVM)
4. Decision Trees
5. Gaussian Naive Bayes (GaussianNB)
6. Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)


###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

#### 1. Logistic Regression:

 It is a classification algorithm used to predict a binary outcome from a set of independent variables. The method is based on the logistic function, also called the sigmoid function, which has an S-shaped curve and maps any real value to a value between 0 and 1.
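As a small illustration of that mapping, the sketch below implements the sigmoid, $\sigma(z) = 1 / (1 + e^{-z})$, in plain Python. The function name `sigmoid` is chosen here for illustration; the two branches are only a guard against overflow for very negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form to avoid overflow in exp(-z).
    e = math.exp(z)
    return e / (1.0 + e)

for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

Logistic regression applies this function to a weighted sum of the features and reads the result as the probability of the positive class.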


Strengths:
* Fast to train
* Outputs have a straightforward probabilistic interpretation
* Can be regularized to avoid overfitting
* Performs well with a small number of observations

Weaknesses:
* Handles noisy data poorly
* Tends to underperform when there are multiple or non-linear decision boundaries
* Not flexible enough to capture more complex relationships


Applications:

* Credit Card Fraud
* Medicine
* Gaming
* Spam Detection


#### 2. Decision Tree:

This model works well with a small amount of training data. It is simple to interpret and its results are easy to explain. It handles categorical features well. A decision tree can be used for both classification and regression problems.

Strengths:
* Does not require normalization of the data
* Simple to interpret and explain
* Works well with missing values and qualitative features
* Fast for a small number of training samples
* Simple to tune


Weaknesses:
* A small change in the data can cause a large change in the structure of the tree, making it unstable
* Building the tree can sometimes be more computationally complex than training other algorithms
* Not well suited to very large datasets
* Overfits easily without tuning

Applications:
* Cryptocurrency
* Medical diagnosis
* Control of nonlinear dynamical systems
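The overfitting weakness is usually addressed by limiting tree growth. The following is a minimal sketch on synthetic data generated with `make_classification`, not on the student dataset; the names `X_demo`, `y_demo`, `full_tree`, and `shallow_tree` and the choice of `max_depth=3` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 400 samples, 20 features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training set.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Capping the depth is one simple way to regularize the tree.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "| depth:", tree.get_depth(),
          "| train acc:", round(tree.score(X_tr, y_tr), 3),
          "| test acc:", round(tree.score(X_te, y_te), 3))
```

A large gap between training and test accuracy for the unrestricted tree is the typical symptom of overfitting; whether the shallower tree generalizes better depends on the data.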


#### 3. Support Vector Machines

Strengths:
* Works relatively well when there is a clear margin of separation between classes
* Effective in high-dimensional spaces, including cases where the number of dimensions is greater than the number of samples
* Fast to train on small training sets
* Predictions are fast
* Relatively memory efficient

Weaknesses:
* Not well suited to large datasets, because training is computationally costly
* Does not perform well when the data is noisy, i.e. when the target classes overlap
* When the number of features is much greater than the number of training samples, the kernel and regularization term must be chosen carefully to avoid overfitting
* Outputs are hard to interpret

Applications:
* Text and hypertext categorization
* Bioinformatics
* Handwriting recognition
* Face detection
* Image classification
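For reference, fitting an SVM in `scikit-learn` might look like the sketch below. It uses synthetic data rather than the student dataset, and the pipeline with `StandardScaler`, the RBF kernel, and `C=1.0` are illustrative assumptions, not settings taken from this notebook. Scaling is included because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), split 75/25 as in the notebook.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Standardize the features, then fit an RBF-kernel support vector classifier.
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm_model.fit(X_tr, y_tr)

svm_pred = svm_model.predict(X_te)
svm_acc = svm_model.score(X_te, y_te)
print("Test accuracy:", round(svm_acc, 3))
```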







#### 1. Logistic Regression

In [67]:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression(solver='lbfgs', max_iter=1000)
logit_model.fit(x_train,y_train)
y_pred = logit_model.predict(x_test)

In [68]:
from sklearn.metrics import f1_score,confusion_matrix,accuracy_score,precision_score,recall_score
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.7052631578947368
Precision is: 0.7105263157894737
recall is: 0.9
f1 is: 0.7941176470588235


In [69]:
confusion_matrix(y_test,y_pred)

array([[13, 22],
       [ 6, 54]], dtype=int64)
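The four scores printed above can be recovered by hand from this confusion matrix, which is a useful sanity check. In `scikit-learn`, rows are true labels and columns are predicted labels, in sorted label order (`'no'`, then `'yes'`), so with `'yes'` as the positive class the cells read as TN, FP, FN, TP. The variable names below are chosen for illustration.

```python
import numpy as np

# Confusion matrix reported above for logistic regression.
# Rows: true 'no', true 'yes'. Columns: predicted 'no', predicted 'yes'.
cm = np.array([[13, 22],
               [6, 54]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # share of all predictions that are correct
precision = tp / (tp + fp)               # of predicted 'yes', how many truly passed
recall = tp / (tp + fn)                  # of true 'yes', how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The 22 false positives are students predicted to pass who actually failed, which is the costly error for an intervention system.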

#### 2. KNN

In [70]:
from sklearn.neighbors import KNeighborsClassifier
acc_values = []
neighbors = np.arange(3,15)
for k in neighbors:
    classifier = KNeighborsClassifier(n_neighbors=k,metric = 'minkowski')
    classifier.fit(x_train,y_train)
    y_pred=classifier.predict(x_test)
    acc = accuracy_score(y_test,y_pred)
    acc_values.append(acc)

In [71]:
acc_values

[0.631578947368421,
 0.6421052631578947,
 0.6736842105263158,
 0.6210526315789474,
 0.6210526315789474,
 0.631578947368421,
 0.6631578947368421,
 0.6421052631578947,
 0.631578947368421,
 0.6421052631578947,
 0.6631578947368421,
 0.6421052631578947]
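The list above is ordered by `k` from 3 to 14, so the best value can be read off programmatically rather than by eye. The sketch below copies the reported accuracies; `best_idx` and `best_k` are names introduced here.

```python
import numpy as np

neighbors = np.arange(3, 15)
# Accuracies copied from the output above, one per value of k.
acc_values = [0.631578947368421, 0.6421052631578947, 0.6736842105263158,
              0.6210526315789474, 0.6210526315789474, 0.631578947368421,
              0.6631578947368421, 0.6421052631578947, 0.631578947368421,
              0.6421052631578947, 0.6631578947368421, 0.6421052631578947]

# Index of the highest accuracy, then the k that produced it.
best_idx = int(np.argmax(acc_values))
best_k = int(neighbors[best_idx])

print("Best k:", best_k, "with accuracy", acc_values[best_idx])
```

This selects `k = 5`. Note that choosing `k` by test-set accuracy makes the reported score somewhat optimistic; cross-validation on the training set would be a more reliable way to tune it.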

In [72]:
classifier = KNeighborsClassifier(n_neighbors=5,metric = 'minkowski')
classifier.fit(x_train,y_train)
y_pred=classifier.predict(x_test)

In [73]:
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.6736842105263158
Precision is: 0.6835443037974683
recall is: 0.9
f1 is: 0.776978417266187


In [74]:
confusion_matrix(y_test,y_pred)

array([[10, 25],
       [ 6, 54]], dtype=int64)

#### 3. Decision Tree

In [86]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train,y_train)
y_pred = dt_model.predict(x_test)

In [87]:
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 score is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.5368421052631579
Precision is: 0.6290322580645161
recall is: 0.65
f1 score is: 0.639344262295082


In [85]:
y_test.value_counts()

yes    60
no     35
Name: passed, dtype: int64

In [46]:
confusion_matrix(y_test,y_pred)

array([[13, 22],
       [16, 44]], dtype=int64)

#### 4. Random Forest

In [96]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)

In [97]:
from sklearn.metrics import f1_score,confusion_matrix
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 score is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.6631578947368421
Precision is: 0.6627906976744186
recall is: 0.95
f1 score is: 0.780821917808219


In [98]:
confusion_matrix(y_test,y_pred)

array([[ 6, 29],
       [ 3, 57]], dtype=int64)

**CONCLUSION:** 

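**Comparing the test-set scores of the models used here, logistic regression performs best, with an accuracy of 0.71 and an F1 score of 0.79, ahead of KNN (accuracy 0.67), random forest (0.66), and the decision tree (0.54). We can therefore infer that logistic regression is the best of these models for this data.**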
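For a side-by-side view, the reported scores can be collected into one table. This sketch only re-enters the numbers printed in the cells above; the name `summary` is introduced here.

```python
import pandas as pd

# Test-set scores copied from the outputs reported above.
summary = pd.DataFrame(
    {
        "accuracy": [0.7052631578947368, 0.6736842105263158,
                     0.5368421052631579, 0.6631578947368421],
        "f1": [0.7941176470588235, 0.776978417266187,
               0.639344262295082, 0.780821917808219],
    },
    index=["Logistic Regression", "KNN (k=5)", "Decision Tree", "Random Forest"],
).sort_values("accuracy", ascending=False)

print(summary.round(3))
```

Logistic regression leads on both metrics. The order of the runners-up depends on the metric: KNN is second by accuracy, while random forest is second by F1.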