# An "Early Warning System" for Student Drop-out Intervention using a Feature-Weighted Nearest Neighbor Model
---------

The following notebook walks through the development of a supervised machine learning tool for early intervention with students who are at risk of failing.  A sample dataset is included as *student_data.csv* in the same directory.  The accompanying Python module *dropout_ews.py* contains the custom helper functions used by this notebook.

In [1]:
# LIBRARIES

import numpy as np
import pandas as pd  # developed against pandas 0.19.2
import time

from sklearn.metrics import f1_score, make_scorer
# sklearn >= 0.18: model_selection replaces the deprecated cross_validation and grid_search modules
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC

import dropout_ews as dews

## Data Exploration and Data Preparation
-----------

The sample dataset used in this project is included as student_data.csv. The last column, 'passed', is the target/label; all other columns are features.  The CSV contains a header with the following 30 feature attributes:

- __school__ : student's school (binary: "GP" or "MS")
- __sex__ : student's sex (binary: "F" - female or "M" - male)
- __age__ : student's age (numeric: from 15 to 22)
- __address__ : student's home address type (binary: "U" - urban or "R" - rural)
- __famsize__ : family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
- __Pstatus__ : parent's cohabitation status (binary: "T" - living together or "A" - apart)
- __Medu__ : mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- __Fedu__ : father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- __Mjob__ : mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- __Fjob__ : father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- __reason__ : reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- __guardian__ : student's guardian (nominal: "mother", "father" or "other")
- __traveltime__ : home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- __studytime__ : weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- __failures__ : number of past class failures (numeric: n if 1<=n<3, else 4)
- __schoolsup__ : extra educational support (binary: yes or no)
- __famsup__ : family educational support (binary: yes or no)
- __paid__ : extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- __activities__ : extra-curricular activities (binary: yes or no)
- __nursery__ : attended nursery school (binary: yes or no)
- __higher__ : wants to take higher education (binary: yes or no)
- __internet__ : Internet access at home (binary: yes or no)
- __romantic__ : with a romantic relationship (binary: yes or no)
- __famrel__ : quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- __freetime__ : free time after school (numeric: from 1 - very low to 5 - very high)
- __goout__ : going out with friends (numeric: from 1 - very low to 5 - very high)
- __Dalc__ : workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- __Walc__ : weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- __health__ : current health status (numeric: from 1 - very bad to 5 - very good)
- __absences__ : number of school absences (numeric: from 0 to 93)

Each student has a target that takes two discrete labels:

- __passed__ : did the student pass the final exam (binary: yes or no)

In [2]:
# READ-IN STUDENT DATA
student_data = pd.read_csv("student_data.csv")
print "Student data read successfully!"

Student data read successfully!


In [3]:
# EXPLORE THE DATA
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1
# Count by label rather than by position, so the result does not depend on value_counts() ordering
n_passed = (student_data["passed"] == "yes").sum()
n_failed = (student_data["passed"] == "no").sum()
grad_rate = float(n_passed)/n_students*100

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of student features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of student features: 30
Graduation rate of the class: 67.09%
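
Because roughly two thirds of the students pass, the classes are imbalanced, and that matters when reading the F<sub>1</sub> scores reported later in this notebook (which use "yes" as the positive label). As a point of reference, the short calculation below works out the F<sub>1</sub> score of a trivial classifier that predicts "yes" for every student in the full dataset. The counts are copied from the output above; the always-"yes" classifier is an illustrative assumption, not one of the notebook's models.

```python
# Counts taken from the exploration output above.
n_passed = 265
n_failed = 130

# A trivial classifier that predicts "yes" for every student:
true_pos = n_passed    # every passing student is correctly predicted "yes"
false_pos = n_failed   # every failing student is wrongly predicted "yes"
false_neg = 0          # nobody is ever predicted "no"

precision = true_pos / float(true_pos + false_pos)
recall = true_pos / float(true_pos + false_neg)
baseline_f1 = 2 * precision * recall / (precision + recall)

print("Always-'yes' baseline F1: {:.4f}".format(baseline_f1))  # 0.8030
```

Any model whose test F<sub>1</sub> score is not clearly above this value is doing little better than ignoring the features altogether.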


In [4]:
# EXTRACT FEATURE AND TARGET DATA
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels

print "Feature values:"
print X_all.head()  # print the first 5 rows

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       yes      yes        no       5         3     3    1    1      3   
2   ...       yes      yes        no       4         3     2    2    3      3   
3   ...       yes      yes       yes       3         2     2    1    1      5   
4   ...       yes       no        no       4         3     2    1    2      5   

  absences  
0        6  
1        4  
2

In [5]:
# CREATE DUMMY BINARY VARS FOR ALL CATEGORICAL FEATURES
X_all = dews.preprocess_features(X_all)

print "Original feature columns: {}".format(n_features)
print "Final feature columns: {}\n\nList of features: {}".format(len(X_all.columns), list(X_all.columns))

Original feature columns: 30
Final feature columns: 48

List of features: ['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
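
The source of `dews.preprocess_features` is not shown in this notebook. Judging only from the column list above, it appears to map yes/no columns to 1/0 and to expand the other categorical columns into one dummy column per value. The sketch below is one way this could be done; `preprocess_features_sketch` and the toy frame are hypothetical stand-ins, not the module's actual code.

```python
import pandas as pd


def preprocess_features_sketch(X):
    """Hypothetical stand-in for dews.preprocess_features (an assumption)."""
    output = pd.DataFrame(index=X.index)
    for col in X.columns:
        col_data = X[col]
        if not pd.api.types.is_numeric_dtype(col_data):
            if set(col_data.unique()) <= {"yes", "no"}:
                # Binary yes/no column: keep a single 1/0 column
                col_data = col_data.map({"yes": 1, "no": 0})
            else:
                # Other categorical column: one dummy column per value
                col_data = pd.get_dummies(col_data, prefix=col)
        output = output.join(col_data)
    return output


# Toy data for illustration only (not the student dataset)
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [18, 17, 15],
    "internet": ["yes", "no", "yes"],
})
encoded = preprocess_features_sketch(toy[["school", "age", "internet"]])
print(list(encoded.columns))
```

Numeric columns such as `age` pass through unchanged, which is consistent with the 30 original columns growing to 48.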


In [6]:
# TEST/TRAIN SPLIT DATA
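num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train  # the remaining 95 samples

# stratify=y_all keeps the pass/fail ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, train_size=num_train, test_size=num_test, stratify=y_all)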

print "Data successfully split into Training set and Test set."
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])

Data successfully split into Training set and Test set.
Training set: 300 samples
Test set: 95 samples


## Evaluating Other Supervised Learning Models
---------

We choose several supervised learning models that are available in scikit-learn and evaluate their effectiveness on the sample dataset for the sake of comparison.  They are:

- Support Vector Machine (SVM): a linear SVM (`LinearSVC`) and an SVM with an RBF kernel (`SVC`)
- Decision Tree: an unconstrained tree and a depth-limited tree (`max_depth=5`)
- Gaussian Naive Bayes

For each model:
- fit the model to the training data
- predict labels (for both training and test sets)
- measure the F<sub>1</sub> score
- repeat this process with different training set sizes (100, 200, 300) while keeping the test set constant

We produce a table showing training time, prediction time, F<sub>1</sub> score on the training set and F<sub>1</sub> score on the test set, for each training set size.
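
The procedure above is carried out by `dews.train_predict`, whose source is not included here. The following is a minimal sketch of what such a helper might do for a single classifier, run on synthetic data rather than the student dataset. The names `timed` and `train_predict_sketch`, and the choice to take the first `size` rows as the training subset, are assumptions made for illustration.

```python
import time

import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start


def train_predict_sketch(name, clf, X_train, y_train, X_test, y_test,
                         sizes=(100, 200, 300)):
    """Hypothetical stand-in for dews.train_predict (one classifier only)."""
    rows = []
    for size in sizes:
        # Assumption: the training subset is simply the first `size` rows
        X_sub, y_sub = X_train[:size], y_train[:size]
        _, tr_time = timed(clf.fit, X_sub, y_sub)
        tr_pred, tr_ptime = timed(clf.predict, X_sub)
        tst_pred, tst_ptime = timed(clf.predict, X_test)
        rows.append({
            "model": name,
            "tr_size": size,
            "tr_time": tr_time,
            "tr_ptime": tr_ptime,
            "tr_f1": f1_score(y_sub, tr_pred, pos_label="yes"),
            "tst_ptime": tst_ptime,
            "tst_f1": f1_score(y_test, tst_pred, pos_label="yes"),
        })
    return rows


# Synthetic stand-in data (NOT the student dataset): 395 rows, 5 features
rng = np.random.RandomState(0)
X = rng.rand(395, 5)
y = np.where(X[:, 0] + 0.3 * rng.randn(395) > 0.4, "yes", "no")

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
rows = train_predict_sketch("DT_tuned", clf, X[:300], y[:300], X[300:], y[300:])

fmt = "{model:>12} {tr_size:>8} {tr_time:>10.4f} {tr_ptime:>10.4f} {tr_f1:>10.4f} {tst_ptime:>10.4f} {tst_f1:>10.4f}"
for row in rows:
    print(fmt.format(**row))
```

The test set is held fixed across the three training sizes, so differences in `tst_f1` reflect only the amount of training data.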

In [7]:
# WE VARY THE SIZE OF THE TRAINING SETS FOR THE FOLLOWING MODELS:

# Other Model 1: SUPPORT VECTOR MACHINES
clf_default = LinearSVC()
clf_tuned = SVC(kernel='rbf')

dews.train_predict("SVC", X_train, y_train, X_test, y_test, 100, 200, 300, clf_default, clf_tuned)

# Other Model 2: DECISION TREE
clf_default = DecisionTreeClassifier()
clf_tuned = DecisionTreeClassifier(max_depth=5)

dews.train_predict("DT", X_train, y_train, X_test, y_test, 100, 200, 300, clf_default, clf_tuned)

# Other Model 3: BAYESIAN CLASSIFICATION
clf_default = GaussianNB()

dews.train_predict("Bayes", X_train, y_train, X_test, y_test, 100, 200, 300, clf_default)



       model     tr_size     tr_time    tr_ptime       tr_f1   tst_ptime      tst_f1
 SVC_default         100      0.0145      0.0008      0.8784      0.0008      0.7794
 SVC_default         200      0.0217      0.0008      0.7615      0.0004      0.6038
 SVC_default         300      0.1148      0.0010      0.7241      0.0006      0.7350
   SVC_tuned         100      0.0105      0.0036      0.8375      0.0018      0.8101
   SVC_tuned         200      0.0084      0.0044      0.8536      0.0029      0.8025
   SVC_tuned         300      0.0165      0.0120      0.8517      0.0026      0.8158


       model     tr_size     tr_time    tr_ptime       tr_f1   tst_ptime      tst_f1
  DT_default         100      0.0058      0.0073      1.0000      0.0004      0.7402
  DT_default         200      0.0053      0.0060      1.0000      0.0007      0.7302
  DT_default         300      0.0067      0.0010      1.0000      0.0004      0.7231
    DT_tuned         100      0.0016      0.0012      0.8571 

## The Feature-Weighted Nearest Neighbor Model
--------

We "feature-weight" a Nearest Neighbors model by parameter-optimizing a decision tree on the full feature set using `GridSearchCV` from scikit-learn.  The fitted tree assigns an importance to each feature; dropping the features whose importance is zero lets us "subset" the full feature set to the features deemed "important" in determining student success.

For academic background on feature weighting in nearest-neighbor models, see "The Utility of Feature Weighting in Nearest-Neighbor Algorithms", available at http://www.isle.org/~langley/papers/diet.ecml97.pdf.
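
To make the idea concrete, the sketch below implements a nearest-neighbor vote with a weighted Euclidean distance. Setting a feature's weight to 0 removes it from the distance entirely, which corresponds to the keep-or-drop subsetting used in this notebook; fractional weights would be a generalization that the notebook does not use. The function names and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def weighted_distance(a, b, weights):
    """Euclidean distance with a non-negative weight per feature."""
    a, b, weights = (np.asarray(v, dtype=float) for v in (a, b, weights))
    return np.sqrt(np.sum(weights * (a - b) ** 2))


def weighted_knn_predict(X_train, y_train, x, weights, k=3):
    """Majority vote among the k training rows closest to x."""
    dists = [weighted_distance(row, x, weights) for row in X_train]
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: feature 0 carries the signal, feature 1 is large-scale noise
X_toy = [[0, 0], [0, 10], [1, 90], [1, 100]]
y_toy = ["yes", "yes", "no", "no"]
query = [0.1, 90]

unweighted = weighted_knn_predict(X_toy, y_toy, query, weights=[1, 1])
weighted = weighted_knn_predict(X_toy, y_toy, query, weights=[1, 0])
print(unweighted)  # "no": the noisy feature dominates the distance
print(weighted)    # "yes": only the informative feature is compared
```

With equal weights the noisy second feature decides which rows count as neighbors; zeroing its weight lets the informative feature drive the vote.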

### Identifying the weighted features with a decision tree

In [8]:
# make scorer
f1_scorer = make_scorer(f1_score, pos_label="yes")

# Fit a tuned decision tree to training set with ALL features
parameters = {'max_depth': range(1,15)}
dt = DecisionTreeClassifier()
grid_search = GridSearchCV(dt,parameters,scoring=f1_scorer)
grid_search.fit(X_train,y_train)

# NOTE: with some test/train splits, the grid search picks max_depth=1 as the best parameter.
# To retain the generality of the subsetting method, we hardcode max_depth to 3 in the dt_tuned classifier.
# dt_tuned = DecisionTreeClassifier(max_depth=grid_search.best_params_['max_depth'])
dt_tuned = DecisionTreeClassifier(max_depth=3)
dt_tuned.fit(X_train,y_train)

# Subset the dataset by keeping only the features whose 'importance' is nonzero,
# according to the depth-limited decision tree fitted above.
# A boolean column mask selects by position, independent of the column labels.
important = dt_tuned.feature_importances_ > 0
X_train_subset = X_train.loc[:, important]
X_test_subset = X_test.loc[:, important]

print "Weighted Features Identified."
print "Most-important features are {}".format(list(X_train_subset.columns))

Weighted Features Identified.
Most-important features are ['reason_course', 'studytime', 'failures', 'schoolsup', 'internet', 'absences']
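
The subsetting step can be checked in isolation on data where the answer is known in advance. In the sketch below, which uses synthetic data and hypothetical column names, the label depends only on the `signal` column, so a depth-limited tree should give the two noise columns zero importance and the mask should keep `signal` alone.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration: only "signal" determines the label
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    "signal": rng.rand(200),
    "noise_a": rng.rand(200),
    "noise_b": rng.rand(200),
}, columns=["signal", "noise_a", "noise_b"])
y_demo = np.where(X_demo["signal"] > 0.5, "yes", "no")

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_demo, y_demo)

# Rank the features by importance, then keep those with nonzero importance
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))

X_demo_subset = X_demo.loc[:, tree.feature_importances_ > 0]
print(list(X_demo_subset.columns))
```

On the real student data the selected features can change from one train/test split to another, since the split above is not seeded.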


### Final KNN Model with Feature Weighting, and comparison to KNN Model without Feature Weighting

In [9]:
# out-of-the-box KNN
clf_default = KNeighborsClassifier()

# Determine the number of nearest neighbors that maximizes the cross-validated F1 score on the feature subset
parameters = {'n_neighbors': range(1,30)}
knn = KNeighborsClassifier()
knn_tuned = GridSearchCV(knn,parameters,scoring=f1_scorer)
knn_tuned.fit(X_train_subset,y_train)
clf_tuned = KNeighborsClassifier(n_neighbors=knn_tuned.best_params_['n_neighbors'])

# print performance metrics on ALL features: default KNN vs. KNN with n_neighbors fixed at 10
# (the "tuned" model here is not grid-searched on the full feature set)
dews.train_predict("Full_KNN", X_train, y_train, X_test, y_test, 100, 200, 300, clf_default, KNeighborsClassifier(n_neighbors=10))

# print performance metrics on the feature SUBSET: default KNN vs. grid-searched KNN
dews.train_predict("Subset_KNN", X_train_subset, y_train, X_test_subset, y_test, 100, 200, 300, clf_default, clf_tuned)



       model     tr_size     tr_time    tr_ptime       tr_f1   tst_ptime      tst_f1
Full_KNN_default         100      0.0010      0.0013      0.8816      0.0010      0.7746
Full_KNN_default         200      0.0008      0.0027      0.8505      0.0014      0.7500
Full_KNN_default         300      0.0010      0.0053      0.8636      0.0022      0.7518
Full_KNN_tuned         100      0.0008      0.0016      0.7947      0.0013      0.7862
Full_KNN_tuned         200      0.0009      0.0028      0.8185      0.0015      0.7801
Full_KNN_tuned         300      0.0010      0.0056      0.8345      0.0024      0.7518


       model     tr_size     tr_time    tr_ptime       tr_f1   tst_ptime      tst_f1
Subset_KNN_default         100      0.0008      0.0008      0.8054      0.0006      0.7465
Subset_KNN_default         200      0.0007      0.0009      0.8350      0.0006      0.7692
Subset_KNN_default         300      0.0008      0.0012      0.8413      0.0006      0.7368
Subset_KNN_tuned         