# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

#### ANSWER:
  Classification, as our target value is just a binary of Pass or Fail. Classification is suited to situations where you have discrete categorical values you are trying to predict, while Regression is designed for working with continuous numbers, such as the pricing of houses from our last project.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [24]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1  # Subtract target column
n_passed = student_data[student_data['passed'] == 'yes'].shape[0]
n_failed = student_data[student_data['passed'] == 'no'].shape[0]
grad_rate = float(n_passed) / float(n_students) * 100
# print(...) with a single argument behaves the same in Python 2 and Python 3
print("Total number of students: {}".format(n_students))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Number of features: {}".format(n_features))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


### Preprocess feature columns


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

### - What are the general applications of this model? What are its strengths and weaknesses?
#### Answer:
##### Logistic Regression:
  Used to predict categorical outcomes, most commonly a binary result, this is a common and simple algorithm that is fast to train and predict, which matters when computing power is limited or costly. Its downsides are that it can be less accurate than more complex models, and it may need a larger dataset than other models before it reaches sufficient accuracy.
    
##### Decision Trees:
  Another way of predicting an outcome. A trained tree is very easy for a human to understand: plot it and observe which traits lead down which paths. It is also relatively quick to train and predict (on this project it trained faster than logistic regression at the 200- and 300-sample sizes).
  But decision trees can be prone to overfitting. Each split has to choose path A or path B, so if path A is only slightly better (but both work), the tree commits the results to path A, as it lacks the probabilistic abilities of other models. Some of this can be mitigated by pruning the tree or by building forests of trees.
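The overfitting risk can be illustrated with a minimal sketch. It assumes synthetic data from `make_classification` rather than the student dataset, and it uses `max_depth` as a simple stand-in for pruning; the depth of 3 is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): same row and feature counts as the student set.
X, y = make_classification(n_samples=395, n_features=30, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=95, random_state=0)

# An unrestricted tree keeps splitting until every training point is classified correctly.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Capping the depth is a simple form of pre-pruning.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

full_train_f1 = f1_score(y_train, full_tree.predict(X_train))
for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    train_f1 = f1_score(y_train, tree.predict(X_train))
    test_f1 = f1_score(y_test, tree.predict(X_test))
    print("{:12s} train F1: {:.3f}  test F1: {:.3f}".format(name, train_f1, test_f1))
```

A large gap between the training and test scores of the unrestricted tree is the usual sign of overfitting; whether the shallow tree does better on the test set depends on the data.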

##### Gradient Boosting:
  This is an extension of decision trees. It builds an ensemble of trees one after another, with each subsequent tree fitted to correct the errors of the trees before it, so the ensemble's strengths accumulate while its weaknesses are downplayed. Though it is a bit more complex than the two preceding models, the many sequential trees can increase the number of correct predictions.
  Unfortunately, it can be hard to set the model up correctly for each problem, as there are multiple parameters to tune before you begin to train. A grid search can help select the best parameters. It can also be computationally expensive: as my results below show, it took the longest to train.
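To make the sequential nature of boosting concrete, here is a minimal sketch using scikit-learn's `GradientBoostingClassifier` and its `staged_predict` method, which yields the ensemble's predictions after each added tree. The data is synthetic (an assumption, not the student dataset) and the parameter values are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split 300 train / 95 test.
X, y = make_classification(n_samples=395, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=95, random_state=0)

# Illustrative, untuned parameters: 50 shallow trees added one at a time.
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbc.fit(X_train, y_train)

# One F1 score per boosting stage, i.e. after 1 tree, 2 trees, ..., 50 trees.
staged_f1 = [f1_score(y_test, y_pred) for y_pred in gbc.staged_predict(X_test)]
print("Trees in ensemble: {}".format(len(staged_f1)))
print("Test F1 after first tree: {:.3f}".format(staged_f1[0]))
print("Test F1 after all trees:  {:.3f}".format(staged_f1[-1]))
```

The parameters `n_estimators`, `learning_rate`, and `max_depth` are among those that typically need tuning.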


### - Given what you know about the data so far, why did you choose this model to apply?

#### Answer:
##### Logistic Regression:
  Given its ease of use and simplicity, it made sense to start with this model to get a performance benchmark against the other models. As we only need to predict Pass/Fail, it seems to be a good match for this type of problem.
    
##### Decision Trees:
  These are very simple to explain and understand, and are also computationally cheap, which is important for this project. A decision tree can produce a simple model for predicting whether or not a student will pass, though I do understand it can be prone to overfitting unless some extra work is done, such as ensembling or pruning.

##### Gradient Boosting:
  I decided to include a slightly more complex model to see if it would have clear benefits over the others. It has the potential to get the most correct predictions as judged by F<sub>1</sub> score, though it takes more computational work to train and has many parameters you may need to tune for it to perform optimally.

### TODO:
Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
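One way to produce such a table is sketched below. The helper `timing_table` is hypothetical (it is not the notebook's `train_predict` function), and the data is synthetic, so the numbers it prints will not match the tables in this report.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def timing_table(clf, X_train, y_train, X_test, y_test, sizes=(100, 200, 300)):
    """Hypothetical helper: one column of timings and F1 scores per training set size."""
    columns = {}
    for n in sizes:
        start = time.time()
        clf.fit(X_train[:n], y_train[:n])
        train_time = time.time() - start

        start = time.time()
        y_pred_test = clf.predict(X_test)
        predict_time = time.time() - start

        columns[n] = [
            round(train_time, 3),
            round(predict_time, 3),
            f1_score(y_train[:n], clf.predict(X_train[:n])),
            f1_score(y_test, y_pred_test),
        ]
    rows = ["Training time (secs)", "Prediction time (secs)",
            "F1 score for training set", "F1 score for test set"]
    return pd.DataFrame(columns, index=rows, columns=list(sizes))


# Synthetic stand-in data (assumption), split 300 train / 95 test.
X, y = make_classification(n_samples=395, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=95, random_state=0)

table = timing_table(LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test)
print(table)
```

Calling the helper once per classifier gives the three tables the task asks for.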

In [22]:
LogRegTable

| Training set size | 100 | 200 | 300 |
|---|---|---|---|
| Training time (secs) | 0.003 | 0.003 | 0.005 |
| Prediction time (secs) | 0.0 | 0.001 | 0.001 |
| F1 score for training set | 0.90683 | 0.86598 | 0.83105 |
| F1 score for test set | 0.75912 | 0.788321 | 0.8 |


In [23]:
DecTreeTable

| Training set size | 100 | 200 | 300 |
|---|---|---|---|
| Training time (secs) | 0.01 | 0.002 | 0.003 |
| Prediction time (secs) | 0.0 | 0.0 | 0.0 |
| F1 score for training set | 1.0 | 1.0 | 1.0 |
| F1 score for test set | 0.74419 | 0.70967 | 0.650407 |


In [453]:
GradBoostTable

| Training set size | 100 | 200 | 300 |
|---|---|---|---|
| Training time (secs) | 0.078 | 0.109 | 0.136 |
| Prediction time (secs) | 0.001 | 0.001 | 0.0 |
| F1 score for training set | 1.0 | 0.99281 | 0.975728 |
| F1 score for test set | 0.78519 | 0.761194 | 0.821439 |


## 5. Choosing the Best Model

### Question:
#### - Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
### Answer:
  As the data shows, *Logistic Regression* hovers right around an 80% F<sub>1</sub> score on the test set, depending on the training size and other parameters. *Gradient Boosting* scored slightly higher, at 82% with a training size of 300, but that gain does not seem large enough to be worth the extra computing cost. In the tables above, Gradient Boosting took 0.136 s to train on 300 samples versus 0.005 s for Logistic Regression. The current training set is small and both models run relatively fast on today's machines, but if we were ever to expand the program, training speed could become a bigger issue.

  That is why my current recommendation is to stay with regular Logistic Regression. It is a simple-to-understand way of modeling, and it performs very quickly even if we scale up to larger datasets in the future.


### Question:
#### - In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
### Answer:
  Logistic Regression is fairly easy to understand when you break it down. It is a derivative of the **Linear Regression** model that you commonly encounter in introductory statistics or finance courses. As a refresher, Linear Regression models the relationship between two variables: one independent (X-axis) and one dependent (Y-axis), whose value depends on the first. An example could be the price of a home *(dependent, Y-axis)* being predicted by its square footage *(independent, X-axis).*
  
  You begin by plotting the datapoints of our current known information *(such as the square footage and selling price of previous homes)*, and then draw a best-fit line through the datapoints that minimizes the differences in y-values from the line to the points themselves. This line is created by the regression formula.
    
  Taking this a step further to **logistic regression**, the y-axis values now range from 0 to 1, and the outcome itself is binary. We are now **classifying the output** rather than predicting a number. All the *Pass* students sit at the very top of the y-axis (1), and all the *Fail* students sit at the bottom (0). We then fit an S-shaped curve to this training data, using the information about previous students and whether they passed or failed. The height of the curve can be read as the estimated probability of passing, so we can predict whether a future student is more likely to pass or fail depending on which side of the curve's midpoint their data point falls.
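The S-shaped curve can be sketched in a few lines of Python. The feature names, weights, and intercept below are made up purely for illustration; a trained model would learn its own values from the student data.

```python
import math


def sigmoid(z):
    """The S-shaped (logistic) curve: maps any number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and intercept (assumption); not taken from the trained model.
weights = {"absences": -0.1, "studytime": 0.5}
intercept = 0.2


def predict_pass_probability(student):
    # Weighted sum of the student's traits, squashed through the S-curve.
    z = intercept + sum(weights[name] * student[name] for name in weights)
    return sigmoid(z)


student = {"absences": 4, "studytime": 2}
p = predict_pass_probability(student)
# Predict "yes" when the estimated probability of passing is at least 0.5.
label = "yes" if p >= 0.5 else "no"
print("Estimated probability of passing: {:.2f} -> predicted: {}".format(p, label))
```

The 0.5 cutoff is the conventional default; a school that prefers to flag more students for intervention could raise it.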
    

### TODO:
#### - Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.

In [42]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
# GridSearchCV now lives in sklearn.model_selection, which replaced the deprecated sklearn.grid_search module
from sklearn.model_selection import GridSearchCV

# Score candidates by F1, treating "yes" (passed) as the positive class
f1_scorer = make_scorer(f1_score, pos_label="yes")

# Set the parameters to search; Logistic Regression is relatively simple, with few parameters to tune
myparameters = {'C': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100, 500, 1000, 10000]}
clf = GridSearchCV(LogisticRegression(penalty='l2'), scoring=f1_scorer, param_grid=myparameters)

train_predict(clf, X_train_300, y_train_300, X_test, y_test)


------------------------------------------
Training set size: 300
Training GridSearchCV...
Done!
Training time (secs): 0.369
Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.802395209581
Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.805031446541


### - What is the model's final F<sub>1</sub> score?

#### Answer:

  After tuning over the candidate parameter values, I was only able to obtain an F<sub>1</sub> score of 80.5%, which is just slightly higher than the 80% the model achieved before the grid search.
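To see which setting the search actually selected, the fitted `GridSearchCV` object exposes `best_params_` and `best_score_`. The sketch below is a minimal, self-contained illustration on synthetic data with integer labels (an assumption), so it uses a shorter grid and the default positive label rather than `pos_label="yes"`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 300-sample training set (assumption).
X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)

# Smaller C means stronger regularization.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=param_grid,
                      scoring=make_scorer(f1_score), cv=3)
search.fit(X, y)

print("Best C: {}".format(search.best_params_["C"]))
print("Best cross-validated F1: {:.3f}".format(search.best_score_))
```

Note that `best_score_` is the mean cross-validated score on the training data, so it is not directly comparable to the test-set F<sub>1</sub> reported above.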