# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

**Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?**
  - **Answer:** This is a classification problem: we are predicting a boolean value (is a given student likely to fail, and therefore in need of early intervention, or not?). We are not predicting a continuous value (e.g. a risk score for each student), which would make it a regression problem.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [146]:
# Import libraries
import numpy as np
import pandas as pd

In [147]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [148]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = len(student_data)
n_features = len(student_data.columns[:-1])
n_passed = len(student_data[(student_data.passed == 'yes')])
n_failed = len(student_data[(student_data.passed == 'no')])
grad_rate = float(n_passed)/float(n_students)*100.0
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [149]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col].replace(['yes', 'no'], [1, 0])  # corresponding targets/labels
 
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
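
To make the two conversions concrete, here is a minimal sketch in Python 3 using only the standard library. It is not how `pandas.get_dummies()` is implemented; the helper names `encode_yes_no` and `one_hot` and the sample values are assumptions made for illustration.

```python
def encode_yes_no(values):
    """Map 'yes'/'no' strings to 1/0 (assumes no other values occur)."""
    mapping = {'yes': 1, 'no': 0}
    return [mapping[v] for v in values]


def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one column per distinct value."""
    categories = sorted(set(values))
    return {
        '{}_{}'.format(prefix, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }


internet = ['no', 'yes', 'yes']            # hypothetical sample values
fjob = ['teacher', 'other', 'services']    # hypothetical sample values

internet_encoded = encode_yes_no(internet)
fjob_dummies = one_hot(fjob, 'Fjob')
for name in sorted(fjob_dummies):
    print(name, fjob_dummies[name])
```

Each row ends up with exactly one `1` across the generated `Fjob_*` columns, which is the property the paragraph above describes.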

In [180]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))
print X_all.head()  # print the first 5 rows

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
   school_GP  school_MS  sex_F  sex_M  age  address_R  address_U  famsize_GT3  \
0        1.0        0.0    1.0    0.0   18        0.0        1.0          1.0   
1        1.0        0.0    1.0    0.0   17        0.0        1.0          1.0   
2        1.0        0.0    1.0    0.0   15        0.0        1.0          0.0   
3        1.0  

### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [151]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset

from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_all, y_all, test_size=num_test, random_state=1)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
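
For readers curious what a shuffled split does, the following is a rough sketch in Python 3 with only the standard library. It is an assumption-laden illustration, not scikit-learn's implementation: `shuffled_split` is a hypothetical helper, and it will not reproduce the exact partition obtained with `random_state=1` above.

```python
import random


def shuffled_split(rows, labels, n_test, seed=1):
    """Shuffle indices with a fixed seed, then slice into test and train parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Stand-in data with the same number of samples as the student dataset.
rows = [[i] for i in range(395)]
labels = [i % 2 for i in range(395)]

X_tr, X_te, y_tr, y_te = shuffled_split(rows, labels, n_test=95, seed=1)
print(len(X_tr), len(X_te))
```

Shuffling the indices rather than the rows keeps each feature row paired with its own label, which is the part that is easy to get wrong when splitting by hand.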


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

**Decision Tree:**
- *What are the general applications of this model? What are its strengths and weaknesses?*
  - Strengths:
    - Decision trees can be very computationally efficient, so they can be a good choice when compute resources are limited.
    - They are easier to understand than other methods; they can be easily visualized and reasoned about (they are a "whitebox model").
    - They can be used with both numerical and categorical data (many other techniques are specialized for datasets that have only a single type of variable).
    - They require less data preparation than other models.
  - Weaknesses:
    - Decision trees with greater depth are quite prone to overfitting (given unbounded depth/complexity, a tree can be created to perfectly explain any set of training data, but this usually does not generalize).
    - Decision trees do not model certain concepts well, such as XOR.
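    - Decision trees can be unstable: small variations in the training data (including how it is sampled or ordered into subsets) can produce a very different tree, so insufficiently shuffled data makes them more prone to overfitting/error.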
  
- *Given what you know about the data so far, why did you choose this model to apply?*
  - A low-depth decision tree is very computationally efficient (for both training and prediction). Also, the "whitebox" aspect has instructional value; I am interested in seeing the tree that is created in order to gain a better understanding of both the algorithm and the data it is being applied to.
  
- *Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.*
  - Done, see below
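
To illustrate the XOR weakness listed above: a tree chooses each split by how much it reduces label impurity, and for XOR no single-feature split reduces impurity at all. The sketch below (Python 3, standard library only) uses Gini impurity on binary features; the helper names are illustrative and this is not scikit-learn's implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def split_gain(rows, labels, feature):
    """Impurity decrease from splitting on a binary feature (value 0 vs 1)."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]
and_labels = [0, 0, 0, 1]

xor_gains = [split_gain(rows, xor_labels, f) for f in (0, 1)]
and_gains = [split_gain(rows, and_labels, f) for f in (0, 1)]
print("XOR gains:", xor_gains)
print("AND gains:", and_gains)
```

A deeper tree can still represent XOR by splitting on both features in sequence, but a greedy learner gets no signal from the first split, whereas for AND either feature gives a positive gain immediately.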
  
**Gaussian Naive Bayes:**
- *What are the general applications of this model? What are its strengths and weaknesses?*
  - Strengths:
    - Very computationally efficient (due to its simple, or "naive," nature; each feature's distribution is estimated independently as a one-dimensional distribution).
    - Requires only a small amount of training data.
  - Weaknesses:
    - "Naive" assumption that features are independent of one another given the class. This means that a potentially useful aspect of the data (interactions between features) is neglected.
    - Often a useful classifier, but usually a bad estimator (its predicted probabilities should not be taken too literally).
  
- *Given what you know about the data so far, why did you choose this model to apply?*
  - It is very computationally efficient, for both training and prediction. It also requires less training data than many other models, which suits our small dataset (395 students).
  
- *Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.*
  - Done, see below
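
The "naive" one-dimensional treatment described above can be sketched as follows (Python 3, standard library only). This is a simplified illustration under stated assumptions, not scikit-learn's `GaussianNB`: the function names, the tiny variance floor and the toy data are all made up for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior + summed 1-D Gaussian log density."""
    def log_posterior(params):
        log_prior, means, variances = params
        total = log_prior
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy data: two well-separated clusters (hypothetical values).
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y_toy = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict_gaussian_nb(nb_model, [1.1, 2.1])
pred_high = predict_gaussian_nb(nb_model, [5.1, 5.9])
print(pred_low, pred_high)
```

Because each feature contributes its own independent term to the sum, training only needs one pass to collect means and variances, which is where the speed comes from.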
  
**Support Vector Classifier:**
- *What are the general applications of this model? What are its strengths and weaknesses?*
  - Strengths:
    - Effective for data with many features (even when the number of features exceeds the number of samples!)
    - Particularly strong at distinguishing data at the "border" of classes (the data that is most difficult to distinguish/classify)
    - Highly tunable, due to the availability of several kernel functions
  - Weaknesses:
    - More computationally intensive than the other two models I selected
  
- *Given what you know about the data so far, why did you choose this model to apply?*
  - Though more computationally intensive than the other two models I selected, it is still efficient relative to other models. Our data has many dimensions (48 feature columns after preprocessing), and SVC is well suited to this.
  
- *Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.*
  - Done, see below
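
As an illustration of what a kernel function is, the sketch below (Python 3, standard library only) computes the RBF kernel, exp(-gamma * ||x - z||^2), between two feature vectors. The vectors and the `gamma` value are made up for the example, and `rbf_kernel` is an illustrative helper rather than a scikit-learn function.

```python
import math


def rbf_kernel(x, z, gamma):
    """Similarity in (0, 1]: 1.0 for identical points, decaying with distance."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


a = [1.0, 0.0, 3.0]
b = [1.0, 2.0, 3.0]
c = [9.0, 9.0, 9.0]

same = rbf_kernel(a, a, gamma=0.05)   # identical points
near = rbf_kernel(a, b, gamma=0.05)   # squared distance 4
far = rbf_kernel(a, c, gamma=0.05)    # squared distance 181
print(same, near, far)
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood; this is one of the knobs that makes the model tunable.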

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
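
For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). The sketch below (Python 3, standard library only) computes it from label lists; `f1_from_labels` is an illustrative helper, not scikit-learn's `f1_score`. As a sanity check, it scores a hypothetical "always predict passed" baseline using the full-dataset counts reported earlier (265 passed, 130 failed).

```python
def f1_from_labels(y_true, y_pred, pos_label=1):
    """F1 for the positive class, computed from true/false positives and negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Baseline that labels every student as "passed" (1), regardless of features.
y_true = [1] * 265 + [0] * 130
y_pred = [1] * 395

baseline_f1 = f1_from_labels(y_true, y_pred)
print("{:.4f}".format(baseline_f1))
```

On the full dataset this trivial baseline already scores roughly 0.80, so the scores in the tables are best read relative to that figure rather than relative to zero. The exact baseline on the 95-sample test set depends on how many passing students it happens to contain.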

In [152]:
# Train a model
import time

def train_classifier(clf, X_train, y_train, gridSearch=False):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    
    if gridSearch:
        clf = clf.best_estimator_
        print "Best estimator: {}".format(clf)
        
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)

# TODO: Choose a model, import it and instantiate an object
from sklearn import tree  # import the public `tree` package rather than the internal `sklearn.tree.tree` module
clf = tree.DecisionTreeClassifier(max_depth=4)

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
# you can inspect the learned model by printing it

Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001


In [153]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label=1)

train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score)

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.853211009174


In [154]:
# Predict on test data
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.847222222222


In [155]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test, gridSearch=False):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train, gridSearch)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

# TODO: Run the helper function above for desired subsets of training data
# Note: Keep the test set constant
X_train200, _X_test, y_train200, _y_test = cross_validation.train_test_split(X_train, y_train, train_size=200, random_state=1)
train_predict(clf, X_train200, y_train200, X_test, y_test)

X_train100, _X_test, y_train100, _y_test = cross_validation.train_test_split(X_train, y_train, train_size=100, random_state=1)
train_predict(clf, X_train100, y_train100, X_test, y_test)


------------------------------------------
Training set size: 200
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.856088560886
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.75
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.884955752212
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.610169491525


In [156]:
# TODO: Train and predict using two other models

#Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
print "Training set size: 300"
train_classifier(gnb, X_train, y_train)
print""

train_f1_score = predict_labels(gnb, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score), '\n'

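# Predict on test data with the GNB model. Note: the recorded output below was
# produced with `clf` (the decision tree) passed here by mistake.
print "F1 score for test set: {}".format(predict_labels(gnb, X_test, y_test))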

#Retrain w/ different size training sets, and predict
train_predict(gnb, X_train200, y_train200, X_test, y_test)
train_predict(gnb, X_train100, y_train100, X_test, y_test)


Training set size: 300
Training GaussianNB...
Done!
Training time (secs): 0.001

Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.792079207921 

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.610169491525
------------------------------------------
Training set size: 200
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.776470588235
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.6875
------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.791666666667
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set

In [157]:
#Support Vector Machine
from sklearn.svm import SVC
svm = SVC()  # import the class directly so the variable name does not shadow the sklearn.svm module
print "Training set size: 300"
train_classifier(svm, X_train, y_train)
print""

train_f1_score = predict_labels(svm, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score), '\n'

# Predict on test data
print "F1 score for test set: {}".format(predict_labels(svm, X_test, y_test))

#Retrain w/ different size training sets, and predict
train_predict(svm, X_train200, y_train200, X_test, y_test)
train_predict(svm, X_train100, y_train100, X_test, y_test)

Training set size: 300
Training SVC...
Done!
Training time (secs): 0.006

Predicting labels using SVC...
Done!
Prediction time (secs): 0.004
F1 score for training set: 0.858387799564 

Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.846153846154
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.003
Predicting labels using SVC...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.858085808581
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.857142857143
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.868852459016
Predicting labels using SVC...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.784615384615


In [158]:
print "Metrics:\n"

dt_results = pd.DataFrame.from_items([('Training time(secs)', [0.001, 0.001, 0.002]), ('Prediction time(secs)', [0.000, 0.000, 0.001]), ('F1 score for training set', [0.8850, 0.8561, 0.8532]), ('F1 score for test set', [0.6218, 0.75, 0.8392])], orient='index', columns=['Training set size:    100', '200', '300'])
print "Decision Tree:\n", dt_results, '\n\n'

gnb_results = pd.DataFrame.from_items([('Training time(secs)', [0.000, 0.001, 0.002]), ('Prediction time(secs)', [0.000, 0.000, 0.000]), ('F1 score for training set', [0.7917, 0.7765, 0.7921]), ('F1 score for test set', [0.7610, 0.6875, 0.8392])], orient='index', columns=['Training set size:    100', '200', '300'])
print "Gaussian Naive Bayes:\n", gnb_results, '\n\n'

svm_results = pd.DataFrame.from_items([('Training time(secs)', [0.001, 0.003, 0.006]), ('Prediction time(secs)', [0.001, 0.001, 0.001]), ('F1 score for training set', [0.8689, 0.8581, 0.8584]), ('F1 score for test set', [0.7846, 0.8571, 0.8462])], orient='index', columns=['Training set size:    100', '200', '300'])
print "Support Vector Classifier:\n", svm_results, '\n\n'


Metrics:

Decision Tree:
                           Training set size:    100     200     300
Training time(secs)                           0.0010  0.0010  0.0020
Prediction time(secs)                         0.0000  0.0000  0.0010
F1 score for training set                     0.8850  0.8561  0.8532
F1 score for test set                         0.6218  0.7500  0.8392 


Gaussian Naive Bayes:
                           Training set size:    100     200     300
Training time(secs)                           0.0000  0.0010  0.0020
Prediction time(secs)                         0.0000  0.0000  0.0000
F1 score for training set                     0.7917  0.7765  0.7921
F1 score for test set                         0.7610  0.6875  0.8392 


Support Vector Classifier:
                           Training set size:    100     200     300
Training time(secs)                           0.0010  0.0030  0.0060
Prediction time(secs)                         0.0010  0.0010  0.0010
F1 score for training s

## 5. Choosing the Best Model

- **Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?**
  - Of the three models evaluated above, the Support Vector Classification model is probably the best choice:
    - It performs well with limited training data. Surprisingly, it performs slightly better on the test set when trained on the 200-sample training set than on the 300-sample training set (F1 of 0.8571 vs. 0.8462).
    - Though the training time is higher than that of the other evaluated models, it is still quite low (it took 3ms to train on 200 samples). The cost of this compute time is negligible (we're talking micropennies).
    - The prediction time is 1ms (on my laptop). This is greater than that of the other evaluated models (which clock in closer to 0ms). However, in almost all conceivable circumstances, a 1ms prediction time would comfortably meet requirements (it could make 30,000 predictions in the time it took to write this bullet :)
    - It performs better than any other model when evaluated by the F1 metric on the test set, by a margin of about 0.02 (0.8571 vs. 0.8392 in the tables above).


- **In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).**
  - The SVC model essentially graphs training data, and attempts to "draw a line" on this graph to separate the data into the distinct classes we are trying to distinguish/predict.  In our case, our data consists of multiple 'features' (student data fields/columns), and this results in a graph with many dimensions. Bear with me; graphs of greater than three dimensions are difficult to visualize, but they can be very useful, especially in this case.  
  
    The SVC model then finds a line (or "plane") on this graph which provides the best separation between classes (in our case, students who passed vs. students who failed).  The model takes extra care to find a line that maximizes the space between itself and the data classes it is separating.  It focuses on distinguishing the points that are most difficult to tell apart.  The idea is that if the model is good at making the most challenging distinctions, it should also be able to make less challenging distinctions with ease.
    
    After the line is drawn (i.e. the "model is trained"), the model makes predictions by plotting new data against the graph and seeing which "side of the line/plane" the point falls on, which indicates which class it belongs to.
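
The "which side of the line" step can be sketched in a few lines (Python 3, standard library only). This assumes a linear boundary with hypothetical weights `w` and offset `b`, chosen by hand rather than learned; with a non-linear kernel such as RBF, the same sign test is applied to a kernel-based score instead.

```python
import math


def decision_value(w, b, x):
    """Signed score: positive on one side of the boundary, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_side(w, b, x):
    """Class 1 if the point is on the positive side of the boundary, else class 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0


def distance_to_boundary(w, b, x):
    """Geometric distance from x to the boundary w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


w, b = [1.0, -1.0], 0.0          # hypothetical boundary: the line x1 = x2
point_a = [3.0, 1.0]             # below the line
point_b = [1.0, 3.0]             # above the line

side_a = predict_side(w, b, point_a)
side_b = predict_side(w, b, point_b)
dist_a = distance_to_boundary(w, b, point_a)
print(side_a, side_b, round(dist_a, 4))
```

Training an SVC amounts to choosing the boundary so that the smallest of these distances over the training points is as large as possible, which is the "maximizes the space" idea described above.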
  
  

- **Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.**
  - Done
  

- **What is the model's final F<sub>1</sub> score?**
  - 0.9252 (against the training set), and 0.8366 (against the test set). The performance against the training set is significantly better than that of the untuned model (0.8584), but the performance against the test set is slightly worse (0.8366 vs. 0.8462)! We may be overfitting a bit by tuning the model against the training data.

In [183]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn import svm 

parameters = {'kernel':('linear','poly','rbf','sigmoid'), 'C':[1, 2, 5, 10], 'degree':[1,2,3,4,5,6], 'gamma':['auto',.01,.05,.15]}

tuned_svm = svm.SVC()
tuned_svm = GridSearchCV(tuned_svm, parameters, scoring='f1')
print "Training {}...".format(tuned_svm.__class__.__name__)
start = time.time()
tuned_svm.fit(X_train, y_train)
best_estimator = tuned_svm.best_estimator_
end = time.time()

print "Done!\nTraining time (secs): {:.3f}".format(end - start)
print "Best estimator: {}\n".format(best_estimator)

print "F1 score for training set: {}\n".format(predict_labels(tuned_svm, X_train, y_train))

print "F1 score for test set: {}\n".format(predict_labels(tuned_svm, X_test, y_test))

print "Best parameters for the final tuned SVC model is {}".format(tuned_svm.best_params_)

Training GridSearchCV...
Done!
Training time (secs): 41.378
Best estimator: SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=1, gamma=0.05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 0.004
F1 score for training set: 0.92523364486

Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.83660130719

Best parameters for the final tuned SVC model is {'kernel': 'rbf', 'C': 1, 'gamma': 0.05, 'degree': 1}
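
The roughly 41-second training time is easier to understand once the grid is counted. The sketch below (Python 3, standard library only) expands the same parameter grid used above and counts the candidates; each candidate is then fitted once per cross-validation fold. It also counts how many candidates are actually distinct, given that in scikit-learn's `SVC` the `degree` parameter only affects the `poly` kernel and `gamma` is not used by the `linear` kernel. The helper names `expand_grid` and `effective_key` are illustrative assumptions, not part of scikit-learn.

```python
from itertools import product

grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [1, 2, 5, 10],
    'degree': [1, 2, 3, 4, 5, 6],
    'gamma': ['auto', .01, .05, .15],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def effective_key(params):
    """Collapse parameters that the chosen kernel ignores."""
    kernel = params['kernel']
    degree = params['degree'] if kernel == 'poly' else None
    gamma = params['gamma'] if kernel != 'linear' else None
    return (kernel, params['C'], degree, gamma)


all_candidates = expand_grid(grid)
distinct_candidates = {effective_key(p) for p in all_candidates}
print(len(all_candidates), len(distinct_candidates))
```

One way to avoid the redundant fits would be to pass a list of smaller grids, one per kernel, each containing only the parameters that kernel uses.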
