# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project 2: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a classification problem. The target is categorical: each student either passes or fails, so the model predicts one of two discrete labels rather than a continuous value.

## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print ("Student data read successfully!")

Student data read successfully!


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students.

In [2]:
student_data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'passed'],
      dtype='object')

In [3]:
student_data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [4]:
n_students = len(student_data)

n_features = student_data.shape[1]-1
#we subtract one, since 'passed' is our target

n_passed = (student_data['passed']=='yes').sum()

n_failed = (student_data['passed']=='no').sum()

grad_rate = float(n_passed)/(n_students)*100

# Print the results
print ("Total number of students: {}".format(n_students))
print ("Number of features: {}".format(n_features))
print ("Number of students who passed: {}".format(n_passed))
print ("Number of students who failed: {}".format(n_failed))
print ("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, I prepare the data for modeling, training and testing.

### Identify feature and target columns

The code below separates the student data into feature and target columns to see if any features are non-numeric.

In [5]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print ("Feature columns:\n{}".format(feature_cols))
print ("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print ("\nFeature values:")
print (X_all.head())

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

There are several non-numeric columns that need to be converted. Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. A way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.
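To make the two conversions concrete, here is a minimal sketch on a toy frame. The three rows are made up for illustration and are not taken from the student data; only the column names `internet` and `Fjob` come from the dataset.

```python
import pandas as pd

# Toy frame with one yes/no column and one multi-valued column (illustrative values only)
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Binary column: map yes/no to 1/0
toy["internet"] = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column: one indicator column per distinct value
dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob").astype(int)
encoded = toy.drop(columns="Fjob").join(dummies)

print(encoded)
```

Each row ends up with exactly one `1` among the `Fjob_*` columns, which is the "assign a `1` to one of them and `0` to all others" rule described above.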

In [6]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.items():  # DataFrame.iteritems() was removed in pandas 2.0
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print ("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. Next, we split the data (both features and corresponding labels) into training and test sets. The code cell below does the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [7]:
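from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# Shuffle and split the data; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=10)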

# Show the results of the split
print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 300 samples.
Testing set has 95 samples.
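The split above shuffles at random without regard to class. Since roughly 67% of the students passed, one option (not used in the cell above) is to pass `stratify=` to `train_test_split` so that both subsets keep about the same pass rate. The sketch below uses synthetic stand-in features with the same class counts as the student data; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the student data (265 'yes', 130 'no')
rng = np.random.RandomState(10)
X_demo = rng.rand(395, 3)
y_demo = np.array(["yes"] * 265 + ["no"] * 130)

# stratify=y_demo keeps the yes/no proportions roughly equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, stratify=y_demo, random_state=10
)

print("Train pass rate: {:.3f}".format(np.mean(y_tr == "yes")))
print("Test pass rate:  {:.3f}".format(np.mean(y_te == "yes")))
```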


## Training and Evaluating Models
I choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`.

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. For each model chosen*
- Describe one real-world application in industry where the model can be applied. *(You may need to do a small bit of research for this — give references!)* 
- What are the strengths of the model; when does it perform well? 
- What are the weaknesses of the model; when does it perform poorly?
- What makes this model a good candidate for the problem, given what you know about the data?

**Answer:**

*Model 1: Decision Trees*

A simple everyday analogue of a decision tree is a decision flowchart, which almost any organization uses. This makes decision trees easy to explain to a non-technical audience, because they mirror how people naturally reason through processes and problems. As a result, decision trees (or decision charts) are widely used in business to model processes. A more technical example is their use as the building blocks of random forests, which can be applied to tasks such as classifying junk mail.

It is also straightforward to visualize what happens at each branch. However, a decision tree loses interpretability as it gets deeper and can quickly become too complex to explain. Decision trees are also prone to overfitting, which can be mitigated by pruning the tree (or limiting its depth) and by using cross-validation to choose how much to prune. They perform very well when the classes can be separated cleanly by boundaries parallel to the feature axes, since each split thresholds a single feature.

I chose a decision tree as my first classifier because it is one of the simplest choices and serves as a good benchmark for the other methods. For example, if the classes are not separated along the feature axes, other methods such as SVM may compensate for this.
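The overfitting point can be illustrated with a small sketch on synthetic data (not the student data). It compares an unrestricted tree with one whose depth is capped at 3; the cap of 3 and the dataset settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic binary problem (flip_y randomly flips a fraction of the labels)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=10)

# Unrestricted tree: grows until every training point is classified correctly
deep = DecisionTreeClassifier(random_state=10).fit(X_tr, y_tr)
# Depth-limited tree: a simple form of pre-pruning
shallow = DecisionTreeClassifier(max_depth=3, random_state=10).fit(X_tr, y_tr)

for name, clf in [("deep", deep), ("shallow", shallow)]:
    print("{}: depth={}, train acc={:.3f}, test acc={:.3f}".format(
        name, clf.get_depth(), clf.score(X_tr, y_tr), clf.score(X_te, y_te)))
```

A perfect training score next to a clearly lower test score is the same pattern the decision tree shows on the student data later in this notebook.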

*Model 2: Logistic Regression*

Logistic regression is used in a wide variety of fields, particularly the social and medical sciences. The Trauma and Injury Severity Score (TRISS), which is used to predict mortality in injured patients, is based on logistic regression. It can also be used to predict whether a given patient has a certain disease.

Logistic regression is based on the logistic function, which is also commonly used to model population growth. In logistic regression, this function models the probability of an outcome. One industry application is predicting the odds that a bank customer will default. Its main strengths are that it is fast to train and to evaluate, and that its output can be read as a probability. It performs poorly when the relationship between the predictors and the response does not follow the logistic function, for example when the classes cannot be separated by a roughly linear boundary.

Logistic regression seems likely to be a good fit, since this is a simple binary classification problem. I also chose it because it complements the decision tree: logistic regression searches for a single linear boundary in the data, whereas a decision tree combines multiple axis-aligned boundaries (so the overall effect is that of a non-linear classifier).
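As a rough sketch of the mechanics, the logistic function squashes a weighted sum of the features into a probability between 0 and 1, and the class is chosen by thresholding that probability at 0.5. The weights and bias below are hypothetical values chosen for illustration, not coefficients fitted to the student data.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical weights and bias (illustration only)
weights = [0.8, -1.2]
bias = 0.1

p = predict_proba([1.0, 0.5], weights, bias)  # z = 0.1 + 0.8 - 0.6 = 0.3
label = "yes" if p >= 0.5 else "no"
print("P(yes) = {:.4f}, predicted label: {}".format(p, label))
```

The decision boundary is the set of points where the weighted sum is zero, which is why the classifier is linear even though the probability curve is S-shaped.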

*Model 3: SVM*

SVMs have a variety of use cases and generally work best when the data is not too noisy and the classes overlap minimally. They have been used to classify proteins, recognize handwriting, and classify images.

SVM is a strong algorithm when the classes can be separated by a linear boundary with a clear margin. Unlike decision trees, it can separate data even when the boundary is not parallel to a feature axis. With a kernel, an SVM implicitly maps the data into a higher-dimensional space, and the linear boundary found there can appear non-linear when projected back onto the original features. The model is more computationally expensive than the other two and is difficult to interpret and explain to a layperson, unlike decision trees.

SVM helps compensate for one weakness of decision trees, namely the possibility that the data cannot be split parallel to the axes. Adding SVM to the list of algorithms to test helps cover more ground.
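The effect of the kernel can be sketched on a synthetic dataset of two concentric rings, which no straight line can separate. This is an illustration only: the ring data comes from `make_circles`, not from the student dataset, and the scores are measured on the training points just to show what each kernel can fit.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: the inner ring is one class, the outer ring the other
X_demo, y_demo = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=10)

# Same algorithm, two kernels
linear_acc = SVC(kernel="linear").fit(X_demo, y_demo).score(X_demo, y_demo)
rbf_acc = SVC(kernel="rbf").fit(X_demo, y_demo).score(X_demo, y_demo)

print("Linear kernel accuracy: {:.3f}".format(linear_acc))
print("RBF kernel accuracy:    {:.3f}".format(rbf_acc))
```

`SVC` uses the RBF kernel by default, so the untuned SVM trained in this notebook is already a non-linear classifier.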

### Setup
The code cell below is run to initialize three helper functions which are used for training and testing the models.

In [8]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print ("Trained model in {:.4f} seconds".format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print ("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print ("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print ("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print ("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
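
The helpers above report the F<sub>1</sub> score with `'yes'` as the positive label. As a reference point, the sketch below computes F<sub>1</sub> directly from true-positive, false-positive, and false-negative counts, and applies it to a trivial predictor that answers `'yes'` for every student. The counts (265 passed, 130 failed) are the full-dataset figures printed earlier; the class ratio in the 95-point test set may differ slightly, so treat the result as approximate.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Always predicting 'yes' on the full dataset:
#   every passing student is a true positive (265),
#   every failing student is a false positive (130),
#   and there are no false negatives.
baseline_f1 = f1_from_counts(tp=265, fp=130, fn=0)
print("F1 of an always-'yes' predictor: {:.4f}".format(baseline_f1))
```

Because the positive class is the majority, F<sub>1</sub> is high even for this trivial predictor, so scores are most informative when compared against that baseline.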

### Implementation: Model Performance Metrics
I now import the three supervised learning models of my choice and run the `train_predict` function for each one.

In [9]:
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.svm import SVC

clf_A = DTC(random_state=10)
clf_B = LR(random_state=10)
clf_C = SVC(random_state=10)

X_train_100 = X_train[:100]
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

X_train_300 = X_train[:300]
y_train_300 = y_train[:300]

clf_arr = [clf_A, clf_B, clf_C]
training_arr = [[X_train_100, y_train_100],[X_train_200, y_train_200],[X_train_300, y_train_300]]

for clf in clf_arr:
    print('')
    print("#"*80)
    for train_size in training_arr:
        train_predict(clf, train_size[0], train_size[1], X_test, y_test)
        print()


################################################################################
Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0010 seconds
Made predictions in 0.0010 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.6870.

Training a DecisionTreeClassifier using a training set size of 200. . .
Trained model in 0.0020 seconds
Made predictions in 0.0000 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0000 seconds.
F1 score for test set: 0.7059.

Training a DecisionTreeClassifier using a training set size of 300. . .
Trained model in 0.0020 seconds
Made predictions in 0.0000 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0000 seconds.
F1 score for test set: 0.6720.


################################################################################
Training a LogisticRegression using a training set size of 100. . .
Trained model in 0.0020 seconds
Made pr

### Tabular Results


**Classifier 1 - Decision Trees**  

| Training Set Size | Training Time | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |            1.0 ms       |  Approx 1 ms           |       1.0        |    0.6870       |
| 200               |            2.0 ms       |  Approx 0 ms           |       1.0        |    0.7059       |
| 300               |            2.0 ms       |  Approx 0 ms           |       1.0        |    0.6720       |

**Classifier 2 - Logistic Regression**  

| Training Set Size | Training Time | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |              0 ms       |  Approx 0 ms           |     0.8593       |     0.7612      |
| 200               |            2.0 ms       |  Approx 0 ms           |     0.8444       |     0.7591      |
| 300               |            3.0 ms       |  Approx 0 ms           |     0.8263       |     0.8169      |

**Classifier 3 - Support Vector Machines**  

| Training Set Size | Training Time | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |           1.0 ms        |       1.0 ms           |     0.8366       |    0.8228       |
| 200               |           4.0 ms        |       2.0 ms           |     0.8552       |    0.7947       |
| 300               |           6.0 ms        |       5.0 ms           |     0.8615       |    0.8079       |

## Choosing the Best Model
In this final section, I choose from the three supervised learning models the *best* model to use on the student data. I then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning the parameters to improve upon the untuned model's F<sub>1</sub> score. 

### Question 3 - Choosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

**Answer:**

I believe logistic regression is the best all-around model, based on several factors: it has the highest test $F_1$ score when trained on all 300 samples (0.8169, versus 0.8079 for the SVM and 0.6720 for the decision tree), a very reasonable training time, and near-instantaneous prediction times. One possible concern is that its test $F_1$ score looks inconsistent, with a large jump between the 200-sample and 300-sample results. I'm inclined to believe this reflects overfitting being reduced as the training set grows, since the training score drops steadily (0.8593, 0.8444, 0.8263).

Another reason I believe logistic regression to be superior is its computational complexity. Training logistic regression generally scales linearly with the number of training samples, whereas SVM training scales polynomially, generally between $O(n^2)$ and $O(n^3)$ depending on the implementation. I compare SVM and logistic regression because they achieve similar $F_1$ scores here. Should the school board decide to scale up the size of its datasets, logistic regression may deliver superior speed with comparable (if not better) performance, and so appears to be the best choice.

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. Be sure that you are describing the major qualities of the model, such as how the model is trained and how the model makes a prediction. Avoid using advanced mathematical or technical jargon, such as describing equations or discussing the algorithm implementation.*

**Answer:**

The model I've chosen is called logistic regression. It takes in information about each student (such as age, gender, and study time) and returns the chance that the student will fail.

Here is some of the intuition behind it. Suppose you have a series of red and blue dots, as pictured below (ignore the green line and dotted orange line for now). If we were given a new dot somewhere within the range of the graph and asked to predict whether it is blue or red, we could reason as follows. If it is on the far left, it will most likely be red: the density of reds is much higher there, and there is not a blue dot to be seen. Conversely, if it is on the far right, it will most likely be blue. If the dot is in the middle, however, the answer is not as clear-cut.

![Logistic Regression](logistic_regression.jpg "Logistic Regression")
[Image Source: plot.ly](https://plot.ly/~florianh/140.png)

So we know that starting on the far left, there's basically no chance that the dot is blue, and as we go farther to the right the chance that the dot is blue increases. We can visualize the chance that the dot is blue by looking at the green line. Logistic regression assumes that the relationship follows something called a *logistic function*, which you can remember as a soft S-shaped curve. This is essentially how logistic regression works: it gives you the probability of something falling within a certain class.

How does logistic regression find this curve? A simple heuristic you can use is to think about the "density" of the dots in the area. For example, to the far left it seems as if all of them are red, and similarly for blue on the right. In the middle, there seems to be something closer to a mix of the two, so we can't be as sure. If we repeat this for all the intermediate positions, and assume the relation is clearly separated and smooth, then we will get something like the logistic function.

### Implementation: Model Tuning
I use grid search (`GridSearchCV`), tuning one important parameter (`C`) over 50 values. I also use a stratified shuffle split for cross-validation, because the class proportions may have been uneven across the training subsets used earlier. Specifically, the test $F_1$ scores at 100, 200, and 300 training samples were 0.76, 0.76, and 0.82, which suggests that the last third of the training data may have had a disproportionate effect on the predictions.

In [14]:
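from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit  # replaces sklearn.grid_search / sklearn.cross_validation
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer

# Candidate values for the regularization parameter C, log-spaced from 1e-4 to 1e10
parameters = {'C':np.logspace(-4, 10, 50)}
# Score on F1 with 'yes' as the positive label
f1_scorer = make_scorer(f1_score, pos_label="yes")
# 10 stratified shuffle splits, each holding out 10% of the training data
ssscv = StratifiedShuffleSplit(n_splits=10, test_size=0.1)
# The estimator searched over in this cell is an SVC
grid_obj = GridSearchCV( SVC(), parameters, cv = ssscv , scoring=f1_scorer)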
grid_obj.fit( X_train, y_train )
best = grid_obj.best_estimator_ 
y_pred = best.predict( X_test )

print ("F1 score: {}".format( f1_score( y_test, y_pred, pos_label = 'yes' )))
print ("Best params: {}".format( grid_obj.best_params_ ))

F1 score: 0.8025477707006369
Best params: {'C': 0.51794746792312074}
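
Since the model selected in Question 3 is logistic regression, the same search could also be pointed at `LogisticRegression`. The sketch below shows one way to do that, assuming a current scikit-learn release. It runs on synthetic data from `make_classification` rather than the student data, so it reports no scores for this project; the grid of `C` values, `max_iter=1000`, and the `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in data with 'yes'/'no' string labels like the student data
X_demo, y_num = make_classification(n_samples=300, n_features=10, random_state=10)
y_demo = np.where(y_num == 1, "yes", "no")

# Smaller C means stronger regularization
param_grid = {"C": np.logspace(-3, 3, 7)}
f1_scorer = make_scorer(f1_score, pos_label="yes")
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=10)

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv, scoring=f1_scorer)
grid.fit(X_demo, y_demo)

print("Best params: {}".format(grid.best_params_))
print("Mean cross-validated F1 of best model: {:.4f}".format(grid.best_score_))
```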


### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?*

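**Answer:**

The final tuned model has a testing $F_1$ score of 0.8025. This is slightly lower than the untuned logistic regression score of 0.8169, which suggests that the earlier, higher result may have been partly due to chance. The grid search above was run on an `SVC`, whose untuned test score on the full training set was 0.8079, so the tuned and untuned scores are close in either comparison.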