
## Supervised Learning
## Building a Student Intervention System

###### The goal of this project is to identify students who might need early intervention before they fail to graduate. This is a typical classification problem because we have to predict the class each student belongs to: likely to graduate, or in need of early intervention.

## Exploring the Data
The code cell below loads the necessary Python libraries and the student data. The last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [3]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

Student data read successfully!


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, we will compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [4]:
# Calculate number of students
n_students = student_data.shape[0]

# Calculate number of features
n_features = student_data.shape[1]-1

# Calculate passing students (count the rows where 'passed' is 'yes')
n_passed = (student_data['passed'] == 'yes').sum()

# Calculate failing students
n_failed = n_students-n_passed

# Calculate graduation rate
grad_rate = float(100*n_passed)/float(n_students)


# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

In the code cell below we will separate the student data into feature and target columns to see if any features are non-numeric.

In [5]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

There are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
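
As a small illustration of the dummy-variable idea, the sketch below one-hot encodes a toy `Fjob` column. The three-row DataFrame is made up for this example and is not part of the student data. `dtype=int` is passed so the output shows `1`/`0`, which assumes a pandas release that supports the `dtype` argument of `get_dummies`.

```python
import pandas as pd

# Toy data for illustration only (not taken from student-data.csv)
toy = pd.DataFrame({'Fjob': ['teacher', 'other', 'services']})

# One new column per distinct value; exactly one of them is 1 in each row
dummies = pd.get_dummies(toy['Fjob'], prefix='Fjob', dtype=int)
print(dummies)
```

Each row of `dummies` contains a single `1`, in the column that matches the original value.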

In [6]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
We have converted all _categorical_ features into numeric values. Next, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, we do the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 76%) and 95 testing points (approximately 24%).
  - Set a `random_state` for the function(s) used, where one is available.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.
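
A minimal sketch of this kind of split on made-up data is shown below. It relies on the fact that `train_test_split` also accepts an integer `test_size`, which requests that many test samples directly. The arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 3 features, 14 'yes' and 6 'no' labels
X_toy = np.arange(60).reshape(20, 3)
y_toy = np.array(['yes'] * 14 + ['no'] * 6)

# An integer test_size requests exactly that many test samples;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=5, random_state=0)

print("Training samples: {}".format(len(X_tr)))
print("Testing samples: {}".format(len(X_te)))
```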

In [7]:
from sklearn.model_selection import train_test_split
# Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# Shuffle and split the dataset into the number of training and testing points above

X_train, X_test, y_train, y_test = train_test_split(X_all,y_all, test_size=float(num_test)/float(num_train+num_test), random_state=0)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])

Training set has 300 samples.
Testing set has 95 samples.


## Training and Evaluating Models
In this section, we choose three supervised learning models that are appropriate for this problem and available in `scikit-learn`. We list the strengths and weaknesses of each model and evaluate them later.

The three models chosen are:
1. Ensemble Methods (Random Forest)
2. SVM
3. Logistic Regression

**1. Ensemble Methods (Random Forest)**

Real-world application: Random forests are useful for predicting fault-prone parts in the design and code of software.

Strengths and weaknesses of Random Forest:

1(S) Random forests tend to be very accurate across a large number of problems.

2(W) Random forests take significantly more time to train than simpler models because they build many trees. Training can be parallelized to offset this cost.

3(S) Random forests are flexible and are less prone to overfitting than a single decision tree.

Since this problem has a relatively large number of features and random forests handle high-dimensional spaces well, a random forest is a good candidate.

**2. SVM**

Real-world application: SVMs are used in text classification problems.

Strengths and weaknesses of SVM:

1(S) SVMs can handle cases that are not linearly separable by using kernels (for example, RBF or polynomial).

2(S) SVMs work well in high-dimensional spaces.

3(W) SVMs are slow to train on a large number of examples, because training time grows faster than linearly with the number of samples.

Since the number of examples is small and the number of features is relatively large, an SVM is a good choice here.

**3. Logistic Regression (LR)**

Real-world application: It is used in credit scoring.

Strengths and weaknesses of LR:

1(S) Logistic regression is simple, fast, and efficient.

2(S) Its output can be interpreted as a probability, not only as a class label.

3(W) It assumes a linear decision boundary, so when the data is not linearly separable we are better off using kernel SVMs or other methods.

4(W) As a linear model it is a high-bias/low-variance classifier, so it can underfit complex patterns. It can also perform poorly on very small training sets; as the data set grows, its performance improves.

LR provides a simple baseline for many machine learning problems. It is quite possible that the problem is simple (linear) enough to be captured by logistic regression itself. Moreover, it is very fast, so it is included here.

### Setup
The code cell below defines three helper functions that we use for training and testing the three supervised learning models chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling, makes predictions, and returns the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_classifier` and `predict_labels`.
  - This function will report the F<sub>1</sub> score for both the training and testing data separately.
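
For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on a made-up set of labels, treating `'yes'` as the positive label in the same way the helper functions pass `pos_label='yes'` to `f1_score`. The label lists and the function name `f1_by_hand` are invented for illustration.

```python
def f1_by_hand(y_true, y_pred, pos_label='yes'):
    """Compute F1 = 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = ['yes', 'yes', 'yes', 'no', 'no']
y_pred = ['yes', 'yes', 'no', 'yes', 'no']
score = f1_by_hand(y_true, y_pred)
print("F1 score: {:.4f}".format(score))
```

Here precision and recall are both 2/3, so the F<sub>1</sub> score is also 2/3.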

In [8]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier and returns the F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Trains a classifier and reports F1 scores for the training and test sets. '''
    
    # Indicate the classifier and the training set size
    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))

### Implementation: Model Performance Metrics
With the predefined functions above, we will now import the three supervised learning models chosen.

In [9]:
# Import the three supervised learning models from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn import linear_model
# Initialize the three models
clf_A = RandomForestClassifier(random_state=0)

clf_B = svm.SVC(random_state=0)

clf_C = linear_model.LogisticRegression()


# Set up the training set sizes
X_train_100 = X_train[0:100]
y_train_100 = y_train[0:100]

X_train_200 = X_train[0:200]
y_train_200 = y_train[0:200]

X_train_300 = X_train[0:300]
y_train_300 = y_train[0:300]

# Execute the 'train_predict' function for each classifier and each training set size
train_predict(clf_A, X_train_100, y_train_100, X_test, y_test)
train_predict(clf_B, X_train_100, y_train_100, X_test, y_test)
train_predict(clf_C, X_train_100, y_train_100, X_test, y_test)

train_predict(clf_A, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_B, X_train_200, y_train_200, X_test, y_test)
train_predict(clf_C, X_train_200, y_train_200, X_test, y_test)

train_predict(clf_A, X_train_300, y_train_300, X_test, y_test)
train_predict(clf_B, X_train_300, y_train_300, X_test, y_test)
train_predict(clf_C, X_train_300, y_train_300, X_test, y_test)



Training a RandomForestClassifier using a training set size of 100. . .
Trained model in 0.1058 seconds
Made predictions in 0.0186 seconds.
F1 score for training set: 0.9841.
Made predictions in 0.0158 seconds.
F1 score for test set: 0.7119.
Training a SVC using a training set size of 100. . .
Trained model in 0.0036 seconds
Made predictions in 0.0029 seconds.
F1 score for training set: 0.8591.
Made predictions in 0.0028 seconds.
F1 score for test set: 0.7838.
Training a LogisticRegression using a training set size of 100. . .
Trained model in 0.0735 seconds
Made predictions in 0.0660 seconds.
F1 score for training set: 0.8571.
Made predictions in 0.0006 seconds.
F1 score for test set: 0.7612.
Training a RandomForestClassifier using a training set size of 200. . .
Trained model in 0.2174 seconds
Made predictions in 0.0232 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0167 seconds.
F1 score for test set: 0.7761.
Training a SVC using a training set size of 200. . .
Tr

### Tabular Results
The tables below record, for each classifier and training set size, the training time, the prediction time on the test set, and the F<sub>1</sub> scores. Times are in seconds and vary from run to run.

**Classifier 1 - Random Forest**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               | 0.0580            | 0.008                     | 0.9841           | 0.7119          |
| 200               | 0.0480            | 0.000                     | 1.0000           | 0.7761          |
| 300               | 0.064             | 0.008                     | 0.9976           | 0.7344          |

**Classifier 2 - SVM**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               | 0.002             | 0.001                     | 0.8591           | 0.7838          |
| 200               | 0.015             | 0.000                     | 0.8693           | 0.7755          |
| 300               | 0.009             | 0.000                     | 0.8692           | 0.7586          |

**Classifier 3 - Logistic Regression**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------: | :-------------: |
| 100               | 0.002             | 0.001                     | 0.8571           | 0.7612          |
| 200               | 0.000             | 0.000                     | 0.8380           | 0.7794          |
| 300               | 0.000             | 0.000                     | 0.8381           | 0.7910          |

## Choosing the Best Model
 

Logistic regression is the best model here because it needs very little training and prediction time while also giving, on average, the highest test F<sub>1</sub> score across the three training set sizes. Given the amount of data, it therefore uses the fewest resources, costs the least, and gives the best performance.
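
The comparison of average test scores can be checked directly from the test F<sub>1</sub> values recorded in the tables above. The short sketch below only averages those nine reported values; the dictionary layout is just a convenience for this example.

```python
# Test F1 scores copied from the tables above (training sizes 100, 200, 300)
test_f1 = {
    'Random Forest': [0.7119, 0.7761, 0.7344],
    'SVM': [0.7838, 0.7755, 0.7586],
    'Logistic Regression': [0.7612, 0.7794, 0.7910],
}

# Mean test F1 per classifier
mean_f1 = {name: sum(scores) / len(scores) for name, scores in test_f1.items()}

# Print from highest to lowest mean
for name, value in sorted(mean_f1.items(), key=lambda item: item[1], reverse=True):
    print("{}: {:.4f}".format(name, value))
```

Logistic regression's lead over the SVM on this measure is small (roughly 0.777 versus 0.773), so the timing comparison matters as well.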

### Model in Layman's Terms

Logistic regression is used when the outcome we want to predict is binary (here, whether a student passes or fails) and there are one or more input features that help determine it. During training, the algorithm looks at the features of previous students, such as their age and gender, and assigns a "weight" to each feature that reflects how important that feature is in predicting the final outcome.

When we want to predict the outcome for a new student, we multiply that student's features by the weights learned during training and add the results together. The summed value is then passed through a function called a "sigmoid", which converts it into a probability between 0 and 1. If the probability is high enough, the model predicts that the student will pass; otherwise it predicts that the student will fail.
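
The weighted-sum-then-sigmoid step described above can be sketched in a few lines. The feature values, weights, and bias below are made-up numbers chosen only to illustrate the arithmetic; they are not the weights learned by the model in this notebook.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical student: [studytime, failures, absences]
features = [2.0, 0.0, 4.0]
# Hypothetical weights and bias (illustrative, not learned from the data)
weights = [0.5, -1.2, -0.05]
bias = 0.3

# Weighted sum of the features, then the sigmoid turns it into a probability
z = bias + sum(w * x for w, x in zip(weights, features))
prob_pass = sigmoid(z)
prediction = 'yes' if prob_pass >= 0.5 else 'no'
print("P(pass) = {:.3f}, predicted: {}".format(prob_pass, prediction))
```

A weighted sum of exactly zero maps to a probability of 0.5, which is why 0.5 is the usual cut-off between the two classes.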

### Implementation: Model Tuning
To fine-tune the chosen model, we use grid search (`GridSearchCV`) with at least one important parameter tuned over at least 3 different values.

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from sklearn import linear_model

# Create the parameters list to tune
parameters = {'C':[0.5,0.7,1.0]}

# Initialize the classifier
clf = linear_model.LogisticRegression()

# Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score, pos_label='yes')

# Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=f1_scorer)

# Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))

Made predictions in 0.0011 seconds.
Tuned model has a training F1 score of 0.8363.
Made predictions in 0.0004 seconds.
Tuned model has a testing F1 score of 0.8000.
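
After fitting, a `GridSearchCV` object also exposes which parameter value won and the cross-validated score it achieved. The sketch below shows this on synthetic data from `make_classification` with 0/1 labels (hence `pos_label=1`), since the student data file is not reproduced here. It assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Same kind of grid as above: a few values of the regularization strength C
param_grid = {'C': [0.5, 0.7, 1.0]}
scorer = make_scorer(f1_score, pos_label=1)

grid = GridSearchCV(LogisticRegression(), param_grid=param_grid, scoring=scorer)
grid.fit(X_demo, y_demo)

# best_params_ holds the winning setting, best_score_ its mean cross-validated F1
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validated F1: {:.4f}".format(grid.best_score_))
```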


### Final F<sub>1</sub> Score (Tuned vs. Untuned Model)


**Answer:** Final F1 Training Score: 0.8363

Final F1 Testing Score: 0.8000

Initial F1 Training Score: 0.8381

Initial F1 Testing Score: 0.7910

Therefore the testing score has gone up (from 0.7910 to 0.8000), while the training score has dropped slightly (from 0.8381 to 0.8363).