In [37]:
# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

As we want to predict weather a given student will pass or fail, given information about his life and habits, this
indicates a classification problem with two classes, pass and fail. 

In [38]:
# Import libraries
import numpy as np
import pandas as pd

In [39]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [40]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call

n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1
n_passed = sum([1 for y in student_data['passed'] if y == 'yes'])
n_failed = sum([1 for n in student_data['passed'] if n == 'no'])
grad_rate = 100.*n_passed/(n_passed + n_failed)

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [41]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.

In [42]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [89]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size = .24, random_state = 0)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.

* * *
**Model consideration**  

Before we manipulated it, the data was strongly categorical. Even age, though ordered, basically could be thought of as buckets between 15 and 22. So initially I thought that maybe the DecisionTree would be the most apt to deal with this problem. Even though we saw in the lectures that decision trees handle continuous data as well, the motivation for the algorithm presented in class, made me think that it would handle this "categorical" data well (even though we made it numeric). In class we had a boolean function with several weather related categories as input, and an output of play/no play tenis. Here we have data that is closely categorical, with a pass/no pass output. To get my bearings and not to impose any pre-judgement, I wanted to try most of the classifiers seen in class out of the box. I excluded neural net, since its recommended use requires the features to be scaled (one of the disadvantages of that algorithm). The result of this can be seen in the table below.

In [90]:
# Helper Functions
import time
from sklearn.metrics import f1_score

# Return the classifier's training time
def timeTraining(clf, X_train, y_train):
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    return (end - start)

# Return the classifier's predictions and prediction time
def predictAndTime(clf, features):
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    return y_pred, (end - start)

# Return the f1 score for the target values and predictions
def F1(target, prediction):
    return f1_score(target.values, prediction, pos_label='yes')



In [82]:
# To get my bearings I wanted to try most of the classifiers seen in class out of the box.
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier

#Adding Kfold validation to try out the Pro Tip from CodeReview2
from sklearn.cross_validation import KFold  

# Setting up KFold cross_validation object
kf = KFold(X_train.shape[0], 10)

# Array of classifiers
clfs = [DecisionTreeClassifier(criterion = "entropy"),
        SVC(C = 1.0, kernel="rbf"),
        GaussianNB(),
        AdaBoostClassifier(),
        KNeighborsClassifier(n_neighbors = 3)]
 
#Gathering Table column and index labels
classifier_names = [clf.__class__.__name__ for clf in clfs]
benchmarks = ["Training time",  "F1 score training set","Prediction time", "F1 score test set"]
table = pd.DataFrame(columns = classifier_names, index = benchmarks)

# Fit Classifiers and average the times and f1 scores resulting from KFold (10 folds)
for clf in clfs: 
    classifier   = clf.__class__.__name__
    t_test  = 0.0 
    t_train = 0.0
    F1_test = 0.0
    F1_train =0.0
    
    #Averaging scores and seconds accross the folds
    for tr_i ,t_i in kf:
        #Train (k-1 buckets)
        t_train += timeTraining(clf, X_train.iloc[tr_i], y_train.iloc[tr_i])
        pred_train_set = predictAndTime(clf, X_train.iloc[tr_i])[0]
        F1_train += F1(y_train.iloc[tr_i], pred_train_set)
        #Test (kth bucket)
        pred_test_set, t_t = predictAndTime(clf,X_train.iloc[t_i])
        t_test += t_t
        F1_test += F1(y_train.iloc[t_i], pred_test_set)
        
    #Filling table 
    table[classifier]['Training time']         = "{:10.4f} s".format(t_train/10)
    table[classifier]['F1 score training set'] = F1_train/10
    table[classifier]['Prediction time']       = "{:10.4f} s".format(t_test/10)
    table[classifier]['F1 score test set']     = F1_test/10
    
from IPython.display import display, HTML  
display(table)

Unnamed: 0,DecisionTreeClassifier,SVC,GaussianNB,AdaBoostClassifier,KNeighborsClassifier
Training time,0.0035 s,0.0069 s,0.0009 s,0.1022 s,0.0007 s
F1 score training set,1,0.874614,0.799737,0.870058,0.889301
Prediction time,0.0002 s,0.0007 s,0.0003 s,0.0053 s,0.0011 s
F1 score test set,0.702098,0.807571,0.767603,0.76675,0.780717


At CodeReview2 my reviewer reminded me of how Kfold cross validation can be used to get the maximum use of a small data set like the one here. After separating the data into a training set (size 300) and test set (size 95), I used Kfold cross-validation on the training set with 10 folds to have a basic handle on how the models were performing on average. It is clearer now how they differ in performance. The decision tree trails all others in its f1 test score, while scoring perfectly on the training set (which no other does) so it seems to be an over-fitter. This is a known possible problem with trees. The SVC tops all others with an average test f1 of 0.808. KNN doesn't trail far with 0.781. Naive Bayes and AdaBoost are pretty much tied with test f1's of 0.768 and 0.767 respectively. 

As to narrowing the field to 3 models, I will discount AdaBoostClassifier and GaussianNB. Though I did not find this explicitly stated in the documentation, from the above table, AdaBoost is clearly expensive to train as its average training time of 0.1022s is approximately 15 times slower than the next slowest time of 0.069s by SVC. Since computational cost is a constraint, other algorithms may better serve this problem. GaussianNB was one of the fastest to be trained and had a decent average f1 test score. Two of Naive Bayes' advantages that are relevant to this problem are, its capacity to train on small amounts of data and its training speed. In this sense it is perfectly suited for the situation at hand. The main reason I passe on it, is the lack of tune-able parameters in sklearn's GaussianNB implementation. So I don't have a real chance of improving its performance from here, what I see now is what I get (I maybe wrong on this). 

As touched on in my previous comments decision tree is easy to conceptualize. It is also fast to use, as predicting data has logarithmic cost on number of  features (it had the fastest prediction time above too). A possible problem that I might encounter is its instability in generating a tree consistently upon data variation. Upon closer reading of section 1.6.4 of Sklearn's documentation, "Nearest Neighbor Algorithms", KNN as, I used it here, makes a choice of algorithm based on the data passed to fit(). The choice being between brute force, K-D tree, and Ball tree. This data set is too large for brute force (n = 300 >> 30) so we can consider KNN here as either K-D or Ball tree. Now, since the size of the feature space, is for computational purposes, D = 48 > 20 (after adding all the dummy classes) it is likely it is choosing Ball Tree. In any case its time complexity at prediction is O[Dlog(N)] which explains why its slower than the decision tree. Since the amount of student features is not likely to change much if KNN is put to practice in our scenario (suppose we choose it in the end for use of the school board), this can be consider a pro for this algorithm, since it is essentially logarithmic, just as the decision tree. Also from sklearn;  

“Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a smaller intrinsic dimensionality leads to faster query times.”  

If I'm understanding correctly (may very well not be) since sparcity of the data set “refers to the degree to which the data fills the parameter space” ,  then it seems to me that the data set here is somewhat sparse. The reason I say this is because when we added the dummies, essentially we added a lot of zero components to each student “vector”. Since the student's mother, say, cannot be partially between being a Teacher and Healthcare worker, in regard to those categorical variables, there are regions in the feature space that are empty. Because no vectors will ever have components that are non-zero there. If this is indeed the case, then it would also be a pro for KNN. The advantages of SVC are that it is memory efficient because it only uses a subset of the training points in the decision function. This fits our computational cost constraint. It is also very customizable at the tuning phase (4 kernel options plus parameters). Since we are not concerned with probability estimates here, which SVC does expensively, it has little downsides. Next I try each of these models with varying training set sizes, from 50 students to 300 in increments of 50, for a total of 6 training set sizes. 


In [96]:
# Helper function makeTable
def makeTable(clf, training_sizes, X_tr, X_t, y_tr, y_t):
    
    #Gathering column and row labels for the table
    benchmarks = ["Training time",  "F1 score training set","Prediction time", "F1 score test set"]
    size_labels = ["Training samples: {}".format(s) for s in training_sizes]
    table = pd.DataFrame(columns = benchmarks, index = size_labels)
    
    for i, size in enumerate(training_sizes):
        #Use only the first "size" number of samples
        X_train, X_test, y_train, y_test = [df.iloc[:size] for df in [X_tr, X_t, y_tr, y_t]]
        
        #Compute benchmarks
        t_train    = timeTraining(clf, X_train, y_train)
        pred_train_set = predictAndTime(clf, X_train)[0]
        pred_test_set, t_test = predictAndTime(clf,X_test)  
        
        #fill table
        table['Training time'][i]    = t_train
        table['F1 score training set'][i] = F1(y_train, pred_train_set)
        table['Prediction time'][i]  = t_test
        table['F1 score test set'][i] = F1(y_test, pred_test_set)
        
    return table


In [43]:
# Chosen Classifiers
chosen_clfs = [DecisionTreeClassifier(criterion = "entropy"),
        SVC(C = 1.0, kernel="rbf"),
        KNeighborsClassifier(n_neighbors = 3)]

# Test Classifiers with increasing data set size
training_sizes = [50,100,150,200,250,300]
for clf in chosen_clfs:
    print clf.__class__.__name__
    table = makeTable(clf, training_sizes, X_train, X_test, y_train, y_test)
    display(table)



DecisionTreeClassifier


Unnamed: 0,Training time,F1 score training set,Prediction time,F1 score test set
Training samples: 50,0.00178289,1,0.000478983,0.730159
Training samples: 100,0.00202084,1,0.000292063,0.713043
Training samples: 150,0.0025692,1,0.00028491,0.68254
Training samples: 200,0.00457883,1,0.000495911,0.728682
Training samples: 250,0.00521111,1,0.000490189,0.721311
Training samples: 300,0.00376606,1,0.000236988,0.75


SVC


Unnamed: 0,Training time,F1 score training set,Prediction time,F1 score test set
Training samples: 50,0.00140691,0.90625,0.000607014,0.738462
Training samples: 100,0.00197101,0.85906,0.00135183,0.783784
Training samples: 150,0.00450397,0.870813,0.00194883,0.771429
Training samples: 200,0.0061152,0.869281,0.00191188,0.77551
Training samples: 250,0.00635195,0.879177,0.00191903,0.758621
Training samples: 300,0.00846219,0.869198,0.00212193,0.758621


KNeighborsClassifier


Unnamed: 0,Training time,F1 score training set,Prediction time,F1 score test set
Training samples: 50,0.000960112,0.8,0.000761986,0.761905
Training samples: 100,0.000619888,0.823529,0.00151801,0.666667
Training samples: 150,0.000834942,0.816327,0.00256419,0.677419
Training samples: 200,0.00111508,0.86121,0.00196004,0.666667
Training samples: 250,0.000755072,0.889503,0.00311494,0.711111
Training samples: 300,0.000849962,0.886878,0.00293112,0.721805


The SVC seemed the most stable across different training data set sizes. KNN had a strange u-shaped behavior, starting with a highish score decreasing, and then rising again. I decided to further explore them with grid search.

## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

In [100]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn import grid_search
from sklearn.metrics import make_scorer

r = np.arange
scorer = make_scorer(F1)

tree_param = { 'max_features':["sqrt", "log2"], 'max_depth': range(2,11), 'min_samples_split':range(2,9),
               'min_samples_leaf':range(1,9) }
    
svc_param  = [{'C':[100,150,200], 'kernel':['linear']}, 
              {'C':[100,150,200], 'gamma':[0.01, 0.001, 0.0001], 'kernel':['rbf']},
              {'C':[100,150,200], 'degree':[2,3,4,5], 'coef0':[1,10,100],'kernel':['poly']}]

neigh_param = {'n_neighbors' : [10,20,25,30,40], 'weights' : ['uniform', 'distance'], 'p':[1,2,3,5,10]}

#Perform grid Search
def gridIt(clf, params):
    grid_clf = grid_search.GridSearchCV(clf, params, scorer)
    print clf.__class__.__name__
    print "Grid search time:", timeTraining(grid_clf, X_train, y_train)
    print "Parameters of tuned model: ", grid_clf.best_params_
    y_pred, predict_t = predictAndTime(grid_clf, X_test)
    print "f1_score and prediction time on X_test, y_test: "
    print F1(y_test, y_pred), predict_t
    print '------------------\n'
    
gridIt(chosen_clfs[0], tree_param)
gridIt(chosen_clfs[1], ada_param)
gridIt(chosen_clfs[2], svc_param)
gridIt(chosen_clfs[3], neigh_param)


DecisionTreeClassifier
Grid search time: 15.498 s
Parameters of tuned model:  {'max_features': 'log2', 'min_samples_split': 3, 'max_depth': 2, 'min_samples_leaf': 8}
f1_score and prediction time on X_test, y_test: 
0.802919708029 0.000 s
------------------

AdaBoostClassifier
Grid search time: 10.084 s
Parameters of tuned model:  {'n_estimators': 50, 'learning_rate': 0.6}
f1_score and prediction time on X_test, y_test: 
0.779411764706 0.007 s
------------------

SVC
Grid search time: 45.106 s
Parameters of tuned model:  {'kernel': 'rbf', 'C': 150, 'gamma': 0.0001}
f1_score and prediction time on X_test, y_test: 
0.786206896552 0.002 s
------------------

KNeighborsClassifier
Grid search time: 2.923 s
Parameters of tuned model:  {'n_neighbors': 30, 'weights': 'uniform', 'p': 2}
f1_score and prediction time on X_test, y_test: 
0.77027027027 0.003 s
------------------



In [115]:
##Confirm successfull DecisionTree and KNN parameters aquired from GridSearchCV

dTree = DecisionTreeClassifier(max_features = 'log2', min_samples_split= 3, max_depth= 2, min_samples_leaf= 8)
nn = KNeighborsClassifier(n_neighbors= 30, weights = 'uniform', p= 2)  

rand_states = [0,5,10,20,30,40]

def averageF1(clf, X_tr, y_tr, X_t):
    sf1 = 0
    for i in range(100):
        clf.fit(X_tr,y_tr)
        y_pred , p_time = predictAndTime(clf, X_t)
        sf1 += F1(y_test, y_pred)
        
    return sf1/100
    
for s in rand_states:
    X_tr, X_t, y_tr, y_t = train_test_split(X_all, y_all, 
                                                        test_size = .24, random_state = s)
    print "Average Tree F1 for random state: " + str(s) + " " + str(averageF1(dTree, X_tr, y_tr, X_t))
    print "Average KNN F1 for random state: "  + str(s) + " " + str(averageF1(nn, X_tr, y_tr, X_t))
    print "------------------------------------"

Average Tree F1 for random state: 0 0.808176628012
Average KNN F1 for random state: 0 0.787096774194
------------------------------------
Average Tree F1 for random state: 5 0.779762642075
Average KNN F1 for random state: 5 0.786666666667
------------------------------------
Average Tree F1 for random state: 10 0.791981446416
Average KNN F1 for random state: 10 0.810126582278
------------------------------------
Average Tree F1 for random state: 20 0.794643739379
Average KNN F1 for random state: 20 0.802547770701
------------------------------------
Average Tree F1 for random state: 30 0.793069562794
Average KNN F1 for random state: 30 0.782051282051
------------------------------------
Average Tree F1 for random state: 40 0.805694255864
Average KNN F1 for random state: 40 0.815789473684
------------------------------------


## Conclusions
**Algorithm Selection:**  
* * *
My code reviewer pointed out that I had used the whole training set while I was still in model selection phase.
After correcting this, the results from grid search in cell [100] became more comprehensible. Based on the computations performed there, although all algorithms (even if only marginally) benefited from the tuning process, two of them in my opinion are good candidates for consideration for final selection. These are DecisionTreeClassifier and KNeighborsClassifier. The SVM was very costly to tune with grid search, taking 45 seconds, with the next longest tuning being DecisionTree's at 15.5 seconds. Adaboost had a relatively fast training time of 10 s, but grid search failed to find a combination of parameters that significantly improved its final f1 score. For the the full training set size of 300, the DecisionTree had a marked f1 score improvement from 0.766 (cell [97]'s output) to 0.803. KNeighborsClassifier similarly benefitted from the grid search to improve its f1 score from 0.722 to 0.770. In previous attempts, when I found a good combination of parameters, performance seemed dependent on the particular run I had just observed. As hinted to me after the code review, the train/test spliting proccess can have marked effects on the algorithm's final performance, especially on small data sets like this one. To check that I had found consistently good parameters with grid search, I did a final test by averaging 100 f1_scores for each of 6 random_state seed values in train_test_split. As can be seen from the output of cell [115], both KNeighbors and DecisionTree sustained the good performance exhibited after grid search in cell [100], and also differed little from one another in f1_score. However KNeighbors grid_search training time was a mere 2.9 seconds. Therefore under the rubric of: available data, limited resources, cost, and performance, KNeighborsClassifier is the best model for use in this student intervention system. 

**Layman Explanation: KNeighborsClassifier**  
* * *
The mechanism with which our model decides whether a current student will pass or fail is very intuitive. Each student has 30 attributes associated with him/her. These range from basic descriptors like their age, sex, and health, to behavioral descriptors like whether they are in a romantic relationship, how much time they devote to study, if they have any extra curricular activities. Some of the attributes are of things out of there control like whether they have internet access, the size of their family, and what neighborhood they live in. What are model process does, is that it assigns a relevant number to every one of these attributes. Just like for example you can take a house and assign it a longitude, a latitude, and maybe if it sits on a hill an altitude. In the same way that information like longitude, latitude and altitude, alows us to decide how far away two houses are from each other, we can decide how "far away" two particular students are from each other. Based on all those attributes like free time and age and so forth. Since we have information about which students have failed and which have passed, our model basically answers the question; how "close" is this student to other students who have passed. Maybe he is "closer" to students who fail. We can choose to  compare him/her to the closest _single_ student to determine how likely he is to pass or fail. Usually, however we tune the model to find a _group_ of students closest to him/her, maybe the closest 4 students, or maybe 10 closest students. The exact number is determined while tuning. Since we use students that we know either passed or failed in this group, we can determine if our student in question is closer to others who pass or those who fail. 

***

**Final F1 score**
* * *
Given the way I selected the KNeighborsClassifier, I present the average and standard deviation of f1 scores across
the different train/test splits of data.

In [116]:
knn_f1_scores = [0.787096774194, 0.786666666667, 0.810126582278, 0.802547770701, 0.782051282051, 0.815789473684]
print "Average f1 score: {}".format(np.mean(knn_f1_scores))
print "Std for the f1 scores {}".format(np.std(knn_f1_scores))

Average f1 score: 0.797379758263
Std for the f1 scores 0.0128035136636
