# Project 2: Supervised Learning 
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

**Answer:** Identifying students who might need early intervention is a classification problem. Classification predicts a discrete output variable, while regression estimates a continuous one. Here the output variable (whether a student needs intervention) is discrete, taking the value "Yes" or "No", so classification is the appropriate choice.
Ref: https://en.wikipedia.org/wiki/Regression_analysis

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students: **Answer:** 395
- Number of students who passed: **Answer:** 265
- Number of students who failed: **Answer:** 130
- Graduation rate of the class (%): **Answer:** 67.09%
- Number of features: **Answer:** 30

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [3]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1  # subtract 1 because the last column is the target variable
n_passed = (student_data[student_data.passed=='yes']).shape[0]
n_failed = (student_data[student_data.passed=='no']).shape[0]
grad_rate = n_passed * 100.0 / n_students
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%
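As a cross-check on the counts above, the same figures can be derived from a single `value_counts()` call. The sketch below is illustrative only: `toy` is a hypothetical four-row stand-in for `student_data`, not the real dataset.

```python
import pandas as pd

# Hypothetical toy stand-in for student_data; only the 'passed' column matters here.
toy = pd.DataFrame({"age": [15, 16, 17, 18],
                    "passed": ["yes", "no", "yes", "yes"]})

counts = toy["passed"].value_counts()      # number of rows per label
n_passed = int(counts.get("yes", 0))       # 0 if the label never occurs
n_failed = int(counts.get("no", 0))
grad_rate = 100.0 * n_passed / len(toy)    # percentage of students who passed

print("passed={}, failed={}, rate={:.2f}%".format(n_passed, n_failed, grad_rate))
```

Using `counts.get(label, 0)` keeps the code from failing if one of the two labels happens to be absent.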


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [4]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
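To make the two conversions concrete, here is a small sketch on a hypothetical three-row frame (`toy` is invented for illustration; the column names mirror the dataset). It maps a `yes`/`no` column to `1`/`0` and expands a multi-valued column into dummy variables.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the student dataset.
toy = pd.DataFrame({"internet": ["yes", "no", "yes"],
                    "Fjob": ["teacher", "other", "services"]})

# yes/no column -> 1/0
internet = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column -> one indicator column per observed value.
# astype(int) normalises the dtype, which differs between pandas versions.
fjob = pd.get_dummies(toy["Fjob"], prefix="Fjob").astype(int)

encoded = pd.concat([internet, fjob], axis=1)
print(encoded)
```

Each row has exactly one `1` among the `Fjob_*` columns, which is what makes them dummy variables.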

In [5]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


In [6]:
print X_all.head()

   school_GP  school_MS  sex_F  sex_M  age  address_R  address_U  famsize_GT3  \
0          1          0      1      0   18          0          1            1   
1          1          0      1      0   17          0          1            1   
2          1          0      1      0   15          0          1            0   
3          1          0      1      0   15          0          1            1   
4          1          0      1      0   16          0          1            1   

   famsize_LE3  Pstatus_A    ...     higher  internet  romantic  famrel  \
0            0          1    ...          1         0         0       4   
1            0          0    ...          1         1         0       5   
2            1          0    ...          1         1         0       4   
3            0          0    ...          1         1         1       3   
4            0          0    ...          1         0         0       4   

   freetime  goout  Dalc  Walc  health  absences  
0         3

### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

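**Key takeaway: remove bias from the data before splitting. Shuffle to remove ordering effects, and consider stratified sampling so that the training and test sets keep a similar class balance.**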

In [7]:
# Note: in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset


# train_test_split shuffles by default; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=float(num_test) / num_all, random_state=40)

print "Full set: {} samples".format(X_all.shape[0])
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Full set: 395 samples
Training set: 300 samples
Test set: 95 samples
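The split in this notebook is shuffled but not stratified. The following sketch shows what a stratified split could look like in plain Python, so that both sets keep roughly the same pass/fail ratio. The helper `stratified_split` is hypothetical and written for illustration; recent scikit-learn releases also accept a `stratify` argument on `train_test_split`, which would be the more usual route.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=40):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)                              # remove ordering bias within the class
        n_test = int(round(len(idx) * test_fraction)) # per-class share of the test set
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    rng.shuffle(train_idx)                            # mix the classes again
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Class counts taken from the dataset summary (265 passed, 130 failed).
labels = ["yes"] * 265 + ["no"] * 130
train_idx, test_idx = stratified_split(labels, test_fraction=95.0 / 395)
print("train={}, test={}".format(len(train_idx), len(test_idx)))
```

The returned index lists could then be used to select rows from the feature and label frames.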


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
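Since every comparison below relies on the F<sub>1</sub> score, a short worked sketch may help. It computes F<sub>1</sub> as the harmonic mean of precision and recall, treating `'yes'` as the positive label (as the notebook's `f1_score` call does). The helper names are hypothetical. As a reference point, it also evaluates a naive baseline that predicts `'yes'` for every student, using the full-dataset counts (265 passed, 130 failed); that baseline is an illustration, not a result reported in this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0                       # no true positives: precision and recall are 0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_for_labels(y_true, y_pred, pos_label="yes"):
    """Count tp/fp/fn for pos_label and return the F1 score."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    return f1_from_counts(tp, fp, fn)

# Baseline: predict 'yes' for all 395 students (265 passed, 130 failed).
baseline_f1 = f1_from_counts(tp=265, fp=130, fn=0)
print("All-'yes' baseline F1 on the full dataset: {:.3f}".format(baseline_f1))
```

Because about two thirds of the students passed, a trivial classifier already scores fairly high on F<sub>1</sub>, which is useful context when reading the model scores.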

**ANSWER: Analysis of the available supervised learning models, used to choose 3 of them**


**1. Decision Tree** [Ref 1]

General application: Operations research, especially decision analysis, i.e. identifying the strategy most likely to reach a desired goal [15]. Decision trees can solve both classification and regression problems.

Strengths
  - Easy to understand and interpret
  - Robust: requires little data preparation compared with other techniques, which often need data normalisation, dummy variables, and outlier or missing-value handling; it also performs reasonably well when the assumptions behind the model are somewhat violated by the data
  - Prediction cost is logarithmic in the number of training points, so it scales well
  - Supports both categorical and numerical values
  - Handles multi-class problems

Weaknesses
  - Produces biased trees when classes are unbalanced
  - Prone to overfitting (mitigate by pruning, requiring a minimum number of records per split, or limiting the maximum depth)
  - Sensitive to small variations in the data (e.g. remodelling, adding, or removing a variable)
  - No guarantee of a globally optimal model, because splits are chosen by locally optimal decisions
  - Some concepts are hard to learn, e.g. XOR or multiplexer problems

In our case, the likely reasons for choosing it are understandability and support for both categorical and numerical values.
    
**2. Random Forest**

General application: Used in genetics and neuroscience [16], for both classification (`RandomForestClassifier`) and regression (`RandomForestRegressor`).

Strengths
  - Reduced variance compared with a single decision tree, which usually gives a more accurate model
  - Scalable, because the trees can be trained in parallel

Weakness
  - Loses the understandability of a single tree
           

**3. Naive Bayes**

General application: Classification, e.g. text categorisation (document classification, spam filtering). It is also used in automatic medical diagnosis.

Strengths
  - Handles irrelevant attributes (curse of dimensionality) and noise well [2]
  - Works with small data sets: the parameters can be estimated even from a small amount of training data [3]
  - Highly scalable, with extremely fast model generation [3]

Weaknesses
  - Predicted probabilities are poorly estimated and should not be taken too seriously
  - Outperformed by other approaches such as boosted trees or random forests, according to a 2006 study [4]


**4. Logistic Regression**

General application: Widely used in the medical and social sciences [6] (Trauma and Injury Severity Score, patient severity assessment, disease presence prediction), in engineering (predicting the failure of a machine, system, or process), and in marketing (predicting customer preferences). It is used for classification (binomial, ordinal, and multinomial); in the binary case the outcome variable is encoded as 0 or 1.

Strengths
  - New data can be added to the trained model incrementally and efficiently [5]

Weaknesses
  - Limited expressive power [5]
  - Cannot model S-shaped or other non-linear class-separating boundaries between classes
  - A plain linear regression makes nonsensical predictions for a binary dependent variable (values outside the 0 to 1 range); logistic regression addresses this with the logit transformation, which maps the probability of the event to a continuous value
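The logit transformation mentioned for logistic regression can be sketched in a few lines. The functions below are a minimal illustration, and the weights and inputs are made-up numbers, not values learned from the student data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of sigmoid: map a probability in (0, 1) to a real number (log-odds)."""
    return math.log(p / (1.0 - p))

def predict_proba(weights, bias, x):
    """Linear score passed through the sigmoid, as in binary logistic regression."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical weights and feature values, for illustration only.
p = predict_proba(weights=[0.8, -1.2], bias=0.1, x=[1.0, 0.5])
label = 1 if p >= 0.5 else 0
print("probability={:.3f}, predicted label={}".format(p, label))
```

The linear part can take any value, while the sigmoid keeps the output a valid probability; thresholding at 0.5 gives the 0/1 class.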
                         

**5. k-NN**

General application: Often used in pattern recognition, for both classification and regression (`sklearn.neighbors.KNeighborsClassifier` and `KNeighborsRegressor`).

Strengths
  - New training examples can be added to the model easily and efficiently [5]
  - Simple and powerful; no complex tuning required [8]

Weaknesses
  - Expensive and slow, so it does not scale well: finding the nearest neighbours requires computing the distance to all m training examples [8]
  - Needs a meaningful distance function for good accuracy
  - Suffers from the curse of dimensionality: the more dimensions, the less effective it is [7]
  - The optimal choice of k is highly data dependent
  - Performance suffers under a skewed class distribution: examples of the more frequent class tend to dominate the prediction for a new example
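The scaling weakness of k-NN is easy to see in a brute-force sketch: every prediction computes the distance to all m training examples before voting. The code below is a minimal illustration with made-up two-dimensional points; it is not how scikit-learn implements its neighbour search.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(X_train, y_train, x, k=3):
    """Brute-force k-NN: measure the distance to every training point, then vote."""
    distances = sorted((euclidean(row, x), label)
                       for row, label in zip(X_train, y_train))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy points forming two well-separated groups.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(X_toy, y_toy, [0.5, 0.5]))
print(knn_predict(X_toy, y_toy, [5.5, 5.5]))
```

With a skewed class distribution, the majority vote over the k neighbours is also where the more frequent class starts to dominate.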
  
**6. SVM**

General application: Data classification and many industrial-scale applications. Supports classification, regression, and outlier detection [9].

Strengths
  - Usually works very well [11]
  - Effective in high-dimensional spaces [9]
  - Can model complex separating hyperplanes
  - Memory efficient, as it uses only a subset of the training points to define the hyperplane [9]
  - Different kernel functions can model different separating hyperplanes [9]
  - Still effective when the number of dimensions is greater than the number of training examples [9]

Weaknesses
  - Needs a well-chosen kernel function [11]
  - Requires a lot of memory and CPU time [11]
  - Numerical stability problems in some cases [11]
  - Does not directly provide probability estimates; when needed, they are computed with an expensive five-fold cross-validation [9]
  - Expensive for multi-class problems, as it applies directly to two-class tasks only; a multi-class task is decomposed into several binary tasks [10]
  - Performance suffers if the number of dimensions is much greater than the number of training samples [9]
  
  
**7. Neural Network**

General application: Classification, regression, and a wide variety of complex tasks that are hard to solve otherwise, such as computer vision, speech recognition, time series prediction, fitness approximation, sequence recognition, novelty detection, and sequential decision making. Other applications include system identification and control (vehicle control, trajectory prediction, natural resource management), quantum chemistry, pattern recognition (radar systems, face identification, object recognition), sequence recognition (gesture, speech, handwriting recognition), medical diagnosis, and financial applications such as automated trading [12].

Strengths
  - Can solve a large number of complex problems
  - Fast to apply once trained [14]
  - Can handle a large number of features / dimensions [14]

Weaknesses
  - Slow training time
  - Black-box model
  - Scalability: efficient, large neural networks require considerable processing and storage resources [11]
  - Requires a lot of tuning across a number of hyperparameters [13]
  - Sensitive to feature scaling / data normalisation [13]
  

Ref:
1. http://scikit-learn.org/stable/modules/tree.html
2. https://www.uni-ulm.de/fileadmin/website_uni_ulm/mawi.inst.110/lehre/ss08/StatMeth/DM_NaiveBayes.pdf
3. http://scikit-learn.org/stable/modules/naive_bayes.html#
4. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=2B17989F2A96894F41613D2363CFED45?doi=10.1.1.122.5901&rep=rep1&type=pdf via https://en.wikipedia.org/wiki/Naive_Bayes_classifier#cite_note-6
5. https://www.quora.com/What-are-the-advantages-of-different-classification-algorithms
6. https://en.wikipedia.org/wiki/Logistic_regression
7. http://scikit-learn.org/stable/modules/neighbors.html#classification
8. http://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec06.pdf
9. http://scikit-learn.org/stable/modules/svm.html
10. https://en.wikipedia.org/wiki/Support_vector_machine
11. http://u.cs.biu.ac.il/~haimga/Teaching/AI/saritLectures/svm.pdf
12. https://en.wikipedia.org/wiki/Artificial_neural_network
13. http://scikit-learn.org/dev/modules/neural_networks_supervised.html#
14. https://www.coursehero.com/file/p3od0ug/Pros-and-Cons-of-Neural-Network-Cons-Slow-training-time-Hard-to-interpret-Hard/
15. https://en.wikipedia.org/wiki/Decision_tree
16. http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf




**ANSWERS**

**Given what you know about the data so far, why did you choose this model to apply?**
1. Naive Bayes: It copes with the curse of dimensionality (irrelevant attributes have not been filtered out), and we do not need probability estimates, which makes Naive Bayes a good choice for us.

2. Support Vector Machine: This is not a multi-class problem (for which SVMs are expensive), the number of dimensions is smaller than the number of training samples (performance suffers otherwise), no probability estimates are required, and SVMs can model complex separating hyperplanes. Together these make SVM a feasible choice.

3. Random Forest: Robustness in the face of noise and irrelevant attributes, good performance, and the fact that model understandability has a lower priority than performance make Random Forest a good choice.

**Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F1 score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.**           
                 
Answer: F1 scores by training set size

| Model | Train F1 (100) | Test F1 (100) | Train F1 (200) | Test F1 (200) | Train F1 (300) | Test F1 (300) |
|-----------------------------------|-----------|---------------|-----------|-----------|-----------|-----------|
| 1. Logistic Regression classifier | 1 | 0.72 | 0.85 | 0.75 | 0.82 | 0.81 |
| 2. Decision tree classifier | 0.90 | 0.77 | 0.91 | 0.76 | 0.92 | 0.76 |
| 3. SVM classifier (SVC) | 0.88 | 0.83 | 0.85 | 0.81 | 0.87 | 0.83 |
| 4. Naive Bayes | 0.768 | 0.772 | 0.769 | 0.765 | 0.77 | 0.79 |
| 5. Random forest | 0.875 | 0.786 | 0.881 | 0.797 | 0.886 | 0.863 |

"Train F1 (n)" is the F1 score on the training data and "Test F1 (n)" is the F1 score on the test data, in both cases for the model trained on n training examples.
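A table like this can be produced with one loop over the training set sizes. The sketch below shows the idea under stated assumptions: `MajorityClassifier`, `f1`, and `benchmark` are hypothetical helpers, the data is synthetic, and the majority-class model merely stands in for any estimator with `fit` and `predict`. The scores it prints are not the ones reported above.

```python
import time

class MajorityClassifier(object):
    """Hypothetical stand-in with the fit/predict interface of a scikit-learn estimator."""
    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(sorted(set(labels)), key=labels.count)  # most frequent label
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def f1(y_true, y_pred, pos_label="yes"):
    """F1 = 2*tp / (2*tp + fp + fn) for the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    return 0.0 if tp == 0 else 2.0 * tp / (2 * tp + fp + fn)

def benchmark(clf, X_train, y_train, X_test, y_test, sizes=(100, 200, 300)):
    """Train on the first n rows for each n; the test set stays constant."""
    rows = []
    for n in sizes:
        start = time.time()
        clf.fit(X_train[:n], y_train[:n])
        train_time = time.time() - start

        start = time.time()
        test_pred = clf.predict(X_test)
        pred_time = time.time() - start

        rows.append({"size": n, "train_time": train_time, "pred_time": pred_time,
                     "f1_train": f1(y_train[:n], clf.predict(X_train[:n])),
                     "f1_test": f1(y_test, test_pred)})
    return rows

# Synthetic data: every third example is a 'no'.
X_tr = [[i] for i in range(300)]
y_tr = ["yes" if i % 3 else "no" for i in range(300)]
X_te = [[i] for i in range(95)]
y_te = ["yes" if i % 3 else "no" for i in range(95)]

rows = benchmark(MajorityClassifier(), X_tr, y_tr, X_te, y_te)
for row in rows:
    print("{size:>5} {train_time:.3f} {pred_time:.3f} {f1_train:.3f} {f1_test:.3f}".format(**row))
```

Taking the first n rows assumes the training data has already been shuffled, as `train_test_split` does by default.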

  




In [8]:
print(X_train[1:1])
print(student_data[1:1])


Empty DataFrame
Columns: [school_GP, school_MS, sex_F, sex_M, age, address_R, address_U, famsize_GT3, famsize_LE3, Pstatus_A, Pstatus_T, Medu, Fedu, Mjob_at_home, Mjob_health, Mjob_other, Mjob_services, Mjob_teacher, Fjob_at_home, Fjob_health, Fjob_other, Fjob_services, Fjob_teacher, reason_course, reason_home, reason_other, reason_reputation, guardian_father, guardian_mother, guardian_other, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences]
Index: []

[0 rows x 48 columns]
Empty DataFrame
Columns: [school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, passed]
Index: []

[0 rows x 31 columns]


In [9]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)

# TODO: Choose a model, import it and instantiate an object
from sklearn.tree import  DecisionTreeClassifier
from sklearn import linear_model
# 5% of the training set size: a candidate value for min_samples_split
print(X_train.shape[0]*0.05)

15.0


In [10]:
# Train a model
import time

def train_classifier(clf, X_train, y_train, print_info=1):
    # Fit clf and return the training time in seconds as a formatted string
    if print_info == 1:
        print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    train_time = '{:.3f}'.format(end - start)
    if print_info == 1:
        print "Done!\nTraining time (secs): {}".format(train_time)
    return train_time

# TODO: Choose a model, import it and instantiate an object
from sklearn.tree import  DecisionTreeClassifier
from sklearn import linear_model
print(X_train.shape[0]*0.05)
clf_logit = linear_model.LogisticRegression(C=1e5)

# Fit model to training data
clf_logit_train_time = train_classifier(clf_logit, X_train, y_train, 0)  # note: using entire training set here
#print('clf_logit_train_time is ',clf_logit_train_time)
print clf_logit  # you can inspect the learned model by printing it


15.0
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)


In [11]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target, print_info = 1):
    if print_info == 1:
        print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    pred_time = '{:.3f}'.format(end -start)
    if print_info == 1:
        print "Done!\nPrediction time (secs): {}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes'), pred_time 


clf_logit_train_f1_score, clf_logit_train_pred_time  = predict_labels(clf_logit, X_train, y_train)
print "F1 score for training set / Training time / Prediction time : {} / {} / {}". \
        format(clf_logit_train_f1_score, clf_logit_train_time,clf_logit_train_pred_time)

Predicting labels using LogisticRegression...
Done!
Prediction time (secs): 0.269999980927
F1 score for training set / Training time / Prediction time : 0.820276497696 / 0.009 / 0.270


In [12]:
# Predict on test data
clf_logit_test_f1_score, clf_logit_test_pred_time  = predict_labels(clf_logit, X_test, y_test)
print "F1 score for test set: {}".format(predict_labels(clf_logit, X_test, y_test))

Predicting labels using LogisticRegression...
Done!
Prediction time (secs): 0.000999927520752
Predicting labels using LogisticRegression...
Done!
Prediction time (secs): 0.0
F1 score for test set: (0.81428571428571428, '0.000')
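Several prediction times above are reported as `0.0`. That may simply mean the call finished faster than the clock behind `time.time()` can resolve on this machine. One possible workaround, sketched below, is to use `timeit.default_timer` and average over repeated calls. The helper `time_call` and the workload `slow_square_sum` are hypothetical and only illustrate the pattern.

```python
from timeit import default_timer

def time_call(func, *args, **kwargs):
    """Call func repeatedly; return (last result, mean seconds per call)."""
    repeats = kwargs.pop("repeats", 20)   # number of timed repetitions
    start = default_timer()
    for _ in range(repeats):
        result = func(*args, **kwargs)
    elapsed = (default_timer() - start) / repeats
    return result, elapsed

def slow_square_sum(n):
    """Stand-in workload; in the notebook this would be clf.predict(features)."""
    return sum(i * i for i in range(n))

value, secs = time_call(slow_square_sum, 10000, repeats=50)
print("mean time per call (secs): {:.6f}".format(secs))
```

Averaging over many repetitions gives a usable estimate even when a single call is too fast to measure.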


In [13]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test, verbose = 1):
    if verbose == 1 :
        print "------------------------------------------"
        print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train, verbose)
    clf_train_f1_score, clf_train_predict_time =  predict_labels(clf, X_train, y_train,verbose)
    clf_test_f1_score, clf_test_predict_time = predict_labels(clf, X_test, y_test,verbose)
    if verbose == 1:
        print "F1 score for training set: {}".format(clf_train_f1_score)
        print "F1 score for test set: {}".format(clf_test_f1_score)
    return clf_train_f1_score, clf_train_predict_time,  clf_test_f1_score, clf_test_predict_time

print(X_train.shape)
print(X_train[0:100].shape)

X_train100, X_test100, y_train100, y_test100 = train_test_split(X_train,y_train, test_size = 200.0/300, random_state = 40)
X_train200, X_test200, y_train200, y_test200 = train_test_split(X_train,y_train, test_size = 100.0/300, random_state = 40)
train_predict(clf_logit, X_train100, y_train100, X_test, y_test,0 )
train_predict(clf_logit, X_train200, y_train200, X_test, y_test,0 )

# TODO: Run the helper function above for desired subsets of training data
# Note: Keep the test set constant

(300, 48)
(100, 48)


(0.84827586206896555, '0.000', 0.84057971014492761, '0.000')

In [14]:
# TODO: Train and predict using two other models
# Models: decision tree, SVM, Gaussian naive Bayes, random forest
# Decision tree: min_samples_split is set to 3% of the training set size
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

models = ['tree','svm', 'naive','randomforest']
for model in models:
    if model == 'tree':        
        clf = DecisionTreeClassifier(random_state=0, min_samples_split = X_train.shape[0]*0.03)       
    if model == 'svm':
        clf = svm.SVC()
    if model == 'naive':
        clf = GaussianNB()
    if model == 'randomforest':
        clf = RandomForestClassifier(min_samples_split=15 , n_estimators=10)  # 0.90 & 0.85       
        
    clf_train_time = train_classifier(clf, X_train, y_train,0)  # note: using entire training set here
    #print clf  # you can inspect the learned model by printing it
    clf_train_f1_score, clf_train_pred_time = predict_labels(clf, X_train, y_train, 0)
    clf_test_f1_score, clf_test_pred_time = predict_labels(clf, X_test, y_test, 0)
    #print "F1 score for test set: {}"

    X_train100, X_test100, y_train100, y_test100 = train_test_split(X_train,y_train, test_size = 200.0/300, random_state = 40)
    X_train200, X_test200, y_train200, y_test200 = train_test_split(X_train,y_train, test_size = 100.0/300, random_state = 40)
    clf_train_f1_score100, clf_train_pred_time100, clf_test_f1_score100, clf_test_pred_time100  =  \
                    train_predict(clf, X_train100, y_train100, X_test, y_test,0 )
    clf_train_f1_score200, clf_train_pred_time200, clf_test_f1_score200, clf_test_pred_time200 = \
                train_predict(clf, X_train200, y_train200, X_test, y_test, 0 )
    #print "F1 score - train set / F1 - Test Set / Model Train time / Pred Time on Train set / Pred Time on Test set : \n";
    print  "{:>25} F1-Train {:.3f}  F1-Test {:.3f}   Train Time {}  Pred Time_AllTrain {}   Pred Time AllTest {} ".\
        format(clf.__class__.__name__, clf_train_f1_score,clf_test_f1_score, clf_train_time, \
                                                         clf_train_pred_time, clf_test_pred_time   )



   DecisionTreeClassifier F1-Train 0.928  F1-Test 0.767   Train Time 0.002  Pred Time_AllTrain 0.000   Pred Time AllTest 0.000 
                      SVC F1-Train 0.879  F1-Test 0.837   Train Time 0.008  Pred Time_AllTrain 0.005   Pred Time AllTest 0.002 
               GaussianNB F1-Train 0.778  F1-Test 0.797   Train Time 0.001  Pred Time_AllTrain 0.000   Pred Time AllTest 0.000 
   RandomForestClassifier F1-Train 0.876  F1-Test 0.805   Train Time 0.025  Pred Time_AllTrain 0.001   Pred Time AllTest 0.001 


In [15]:
# TODO: Train and predict using two other models
# Model: support vector machine
from sklearn import svm
clf_svm = svm.SVC()

train_classifier(clf_svm, X_train, y_train)  # note: using entire training set here
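# Note: clf still refers to the last classifier from the previous cell (RandomForestClassifier);
# print clf_svm instead to inspect the SVM fitted here
print clf  # you can inspect the learned model by printing it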
predict_labels(clf_svm, X_train, y_train)
predict_labels(clf_svm, X_test, y_test)
print "F1 score for training set: {}".format(predict_labels(clf_svm, X_train, y_train))
print "F1 score for test set: {}".format(predict_labels(clf_svm, X_test, y_test))

X_train100, X_test100, y_train100, y_test100 = train_test_split (X_train, y_train, test_size = 200.0/300, random_state = 40)
X_train200, X_test200, y_train200, y_test200 = train_test_split (X_train, y_train, test_size = 100.0/300, random_state = 40)
train_predict(clf_svm, X_train100, y_train100, X_test, y_test )
train_predict(clf_svm, X_train200, y_train200, X_test, y_test )



Training SVC...
Done!
Training time (secs): 0.009
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=15,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Predicting labels using SVC...
Done!
Prediction time (secs): 0.00699996948242
Predicting labels using SVC...
Done!
Prediction time (secs): 0.0019998550415
Predicting labels using SVC...
Done!
Prediction time (secs): 0.00500011444092
F1 score for training set: (0.87858719646799122, '0.005')
Predicting labels using SVC...
Done!
Prediction time (secs): 0.00200009346008
F1 score for test set: (0.83660130718954251, '0.002')
------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.001
Predicting labels using SVC...
Done!
Prediction time 

(0.87947882736156346, '0.003', 0.82580645161290311, '0.002')

In [16]:
# TODO: Train and predict using two other models
#from sknn.mlp  import Classifier, Layer
#clf = Classifier(layers = [Layer("Rectifier", units = 100), Layer("Linear")], learning_rate = 0.02, n_iter=10)
#y_valid = nn.predict(X_valid)
#score = nn.score(X_test, y_test)
#print('score is', score)
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
#clf = MLPCLassifier()

train_classifier(clf, X_train, y_train)  # note: using entire training set here
print clf  # you can inspect the learned model by printing it
print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

X_train100, X_test100, y_train100, y_test100 = train_test_split(X_train,y_train, test_size = 200.0/300, random_state = 40)
X_train200, X_test200, y_train200, y_test200 = train_test_split(X_train,y_train, test_size = 100.0/300, random_state = 40)
train_predict(clf, X_train100, y_train100, X_test, y_test )
train_predict(clf, X_train200, y_train200, X_test, y_test )


Training GaussianNB...
Done!
Training time (secs): 0.002
GaussianNB()
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.00100016593933
F1 score for training set: (0.77750611246943768, '0.001')
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.0
F1 score for test set: (0.79710144927536231, '0.000')
------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.000
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.00100016593933
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.0
F1 score for training set: 0.547368421053
F1 score for test set: 0.451612903226
------------------------------------------
Training set size: 200
Training GaussianNB...
Done!
Training time (secs): 0.001
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.0
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.0
F1 score for training set: 0.79710

(0.79710144927536219, '0.000', 0.78518518518518499, '0.000')

In [17]:
# TODO: Train and predict using two other models
# Model: Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(min_samples_split=15 , n_estimators=10)  # 0.90 & 0.85

train_classifier(clf, X_train, y_train)  # note: using entire training set here
print clf  # you can inspect the learned model by printing it
print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

X_train100, X_test100, y_train100, y_test100 = train_test_split(X_train,y_train, test_size = 200.0/300, random_state = 40)
X_train200, X_test200, y_train200, y_test200 = train_test_split(X_train,y_train, test_size = 100.0/300, random_state = 40)
train_predict(clf, X_train100, y_train100, X_test, y_test )
train_predict(clf, X_train200, y_train200, X_test, y_test )


Training RandomForestClassifier...
Done!
Training time (secs): 0.019
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=15,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Predicting labels using RandomForestClassifier...
Done!
Prediction time (secs): 0.00200009346008
F1 score for training set: (0.86681715575620766, '0.002')
Predicting labels using RandomForestClassifier...
Done!
Prediction time (secs): 0.000999927520752
F1 score for test set: (0.84137931034482749, '0.001')
------------------------------------------
Training set size: 100
Training RandomForestClassifier...
Done!
Training time (secs): 0.017
Predicting labels using RandomForestClassifier...
Done!
Prediction time (secs): 0.00100016593933
Predicting labels using RandomFore

(0.88294314381270911, '0.001', 0.83116883116883122, '0.001')

## 5. Choosing the Best Model

**5.1 Question : Based on the experiments you performed earlier, in 2-3 paragraphs explain to the board of supervisors what single model you choose as the best model. Which model has the best test F1 score and time efficiency? Which model is generally the most appropriate based on the available data, limited resources, cost, and performance? Please directly compare and contrast the numerical values recorded to make your case.**
                     
**Answer :** The Support Vector Machine (SVM) was selected as the best model, based on its F1 score and training time.

The F1 scores on the test set for the Decision Tree classifier, SVM, Gaussian Naive Bayes and Random Forest classifier were 0.767, 0.837, 0.797 and 0.821 respectively. The SVM (0.837) and the Random Forest (0.821) scored higher than the other two (0.767 and 0.797), so the choice was narrowed down to those two models.

Despite the short training times of the Decision Tree and Naive Bayes models (0.009 s and 0.003 s respectively), their comparatively lower F1 scores led us to drop them from consideration. They would have been likely choices had the dataset been very large, since building a model with more complex algorithms such as SVM and Random Forest would then become expensive in terms of time.

However, because the dataset is very small, a higher F1 score took priority over training time. Between the two remaining models, the SVM's lower training time (0.029 s versus 0.086 s for the Random Forest), along with its ability to model complex separating hyperplanes in the data, led us to choose the SVM as our first choice.

Please refer to the table below, which lists the F1 scores and the training and prediction times recorded for each model.

| Model                  | F1 (train) | F1 (test) | Training time (s) | Prediction time, training set (s) | Prediction time, test set (s) |
|------------------------|------------|-----------|-------------------|-----------------------------------|-------------------------------|
| DecisionTreeClassifier | 0.928      | 0.767     | 0.009             | 0.001                             | 0.000                         |
| SVM                    | 0.879      | 0.837     | 0.029             | 0.019                             | 0.006                         |
| GaussianNB             | 0.778      | 0.797     | 0.003             | 0.002                             | 0.001                         |
| RandomForestClassifier | 0.877      | 0.821     | 0.086             | 0.005                             | 0.003                         |
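
The selection logic described in the answer can be sketched in a few lines of plain Python. The numbers are copied from the table above; the variable names and the choice to shortlist exactly two models are assumptions made for illustration, mirroring the two-step reasoning (best test F1 first, then training time).

```python
# Values copied from the comparison table above (times in seconds).
results = [
    {"model": "DecisionTreeClassifier", "f1_test": 0.767, "train_time": 0.009},
    {"model": "SVM", "f1_test": 0.837, "train_time": 0.029},
    {"model": "GaussianNB", "f1_test": 0.797, "train_time": 0.003},
    {"model": "RandomForestClassifier", "f1_test": 0.821, "train_time": 0.086},
]

# Step 1: keep the two models with the highest test F1 score.
shortlist = sorted(results, key=lambda r: r["f1_test"], reverse=True)[:2]

# Step 2: among those, prefer the one that trains fastest.
best = min(shortlist, key=lambda r: r["train_time"])

print("Shortlist: {}".format([r["model"] for r in shortlist]))
print("Chosen model: {}".format(best["model"]))
```

With a much larger dataset the priority could reasonably be reversed, filtering on training time first and on F1 score second.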

                           
                                

**5.2 Question: In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).**          
                 
**Answer :**
A Support Vector Machine (SVM) is a supervised learning model. Given a set of training examples that belong to different categories, it represents the examples as points in a space and looks for a boundary that separates the categories with the widest possible gap. [5]

The process of building the model is best explained with an example. Imagine a board with balls of two colors on it: the colors correspond to the categories and the balls correspond to the training examples. Suppose, in the simplest case, that the red balls are on the left side of the board and the blue balls are on the right. A single straight line down the middle of the board then separates the two colors. Training an SVM means finding this line, placed so that the gap between the line and the nearest balls on either side is as wide as possible.

To complicate the example, suppose that after some time the balls have drifted onto each other's side, so that no straight line can separate them any more. We then need another technique to obtain a boundary between the two classes. Imagine throwing the balls up in the air: if the red balls are lighter than the blue ones, they rise higher, and we can place a flat plane at some height that separates the lighter balls from the heavier ones. This "throwing the balls in the air" corresponds to what the SVM does in complex cases, using techniques called kernels: it transforms the data into another space in which a separating plane exists. The result of the training process is the SVM model, defined by the plane that separates the two classes.

For prediction, a new example is projected into the same space as the training examples (the equivalent of throwing the new ball in the air) and is classified according to the side of the separating plane on which it falls. In the ball example, a new ball that stays below the plane (a heavier ball) is classified as blue, and a ball that rises above it is classified as red.


Ref 
1. https://en.wikipedia.org/wiki/Support_vector_machine
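
The "throwing the balls in the air" idea can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the ball positions are invented, the lift (height = position squared) is an explicit stand-in for what a kernel does implicitly, and the cut is placed by hand rather than by an SVM solver. In this toy the blue balls end up higher than the red ones; which color rises is immaterial to the idea.

```python
# Toy 1-D data (hypothetical): red balls sit in the middle, blue balls on both sides.
positions = [-3.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, 3.0]
colors = ["blue", "blue", "red", "red", "red", "red", "red", "blue", "blue"]

def separable_by_one_cut(values, labels):
    """Return True if a single threshold on `values` splits the two labels."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

def lift(x):
    """Explicit stand-in for a kernel: give each ball a 'height' of x squared."""
    return x * x

heights = [lift(x) for x in positions]

# Place the cut halfway between the highest red ball and the lowest blue ball,
# i.e. in the middle of the gap, which is the widest-margin choice here.
highest_red = max(h for h, c in zip(heights, colors) if c == "red")
lowest_blue = min(h for h, c in zip(heights, colors) if c == "blue")
cut = (highest_red + lowest_blue) / 2.0

def predict(x):
    """Classify a new ball by which side of the cut its height falls on."""
    return "blue" if lift(x) > cut else "red"

print("Separable on the board: {}".format(separable_by_one_cut(positions, colors)))
print("Separable after lifting: {}".format(separable_by_one_cut(heights, colors)))
print("New ball at 0.8 -> {}".format(predict(0.8)))
print("New ball at -2.8 -> {}".format(predict(-2.8)))
```

On the board no single cut separates the colors, but after lifting one cut does, and new balls are classified by the side of that cut on which they land.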

**5.3 Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.**

**Answer: code below**

In [20]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn import svm
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

def performance_metric(y_true, y_predicted):
    # F1 score with 'yes' (student passed) as the positive class
    return f1_score(y_true, y_predicted, pos_label='yes')

svm_clf = svm.SVC()
# One sub-grid per kernel, so that kernel-specific parameters (gamma, degree)
# are only searched where they apply
param_grid = [{'C':[1, 10, 100], 'kernel':['linear'],  'class_weight':['balanced',None]  },
              {'C':[1, 10, 100], 'kernel':['poly'],    'class_weight':['balanced',None], 'gamma':['auto',0.001,0.0001], 
               'degree':[2,3,4,5,6]  },
              {'C':[1, 10, 100], 'kernel':['sigmoid'], 'class_weight':['balanced',None],'gamma':['auto',0.001,0.0001] },
              {'C':[1, 10, 100], 'kernel':['rbf'],     'class_weight':['balanced',None],'gamma':['auto',0.001,0.0001] }
             ]
# 3 stratified random splits of the training set, with 30% held out for validation in each
cv = StratifiedShuffleSplit(y_train, n_iter=3, test_size=0.3, random_state=40)
gs_clf = GridSearchCV(svm_clf, param_grid = param_grid, scoring= make_scorer(performance_metric, greater_is_better=True), 
                      verbose = 100, cv = cv, refit = True)
gs_clf.fit(X_train, y_train)
print 'best model is',gs_clf.best_estimator_
gs_clf = gs_clf.best_estimator_

# best_estimator_ was already refit on X_train (refit=True); training it again reports the training time
train_classifier(gs_clf , X_train, y_train)  # note: using entire training set here
print gs_clf  # you can inspect the learned model by printing it
print "F1 score for the entire training set (test set excluded): {}"\
         .format(predict_labels(gs_clf, X_train, y_train))
print "F1 score for test set: {}".format(predict_labels(gs_clf, X_test, y_test))

#####
#print('################################  Original untweaked model of the SVM')
####
# from sklearn import svm
# clf = svm.SVC()
# train_classifier(clf, X_train, y_train)  # note: using entire training set here
# print clf  # you can inspect the learned model by printing it
# print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
# print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))



Fitting 3 folds for each of 132 candidates, totalling 396 fits
[CV] kernel=linear, C=1, class_weight=balanced .......................
[CV]  kernel=linear, C=1, class_weight=balanced, score=0.627451 -   0.0s
[Parallel(n_jobs=1)]: Done   1 tasks       | elapsed:    0.0s
[CV] kernel=linear, C=1, class_weight=balanced .......................
[CV]  kernel=linear, C=1, class_weight=balanced, score=0.666667 -   0.0s
[Parallel(n_jobs=1)]: Done   2 tasks       | elapsed:    0.0s
[CV] kernel=linear, C=1, class_weight=balanced .......................
[CV]  kernel=linear, C=1, class_weight=balanced, score=0.724138 -   0.0s
[Parallel(n_jobs=1)]: Done   3 tasks       | elapsed:    0.0s
[CV] kernel=linear, C=1, class_weight=None ...........................
[CV] .. kernel=linear, C=1, class_weight=None, score=0.775194 -   0.0s
[Parallel(n_jobs=1)]: Done   4 tasks       | elapsed:    0.0s
[CV] kernel=linear, C=1, class_weight=None ...........................
[CV] .. kernel=linear, C=1, class_weight=Non

  'precision', 'predicted', average, warn_for)


[CV]  kernel=poly, C=1, gamma=auto, degree=3, class_weight=balanced, score=0.743802 -   0.0s
[Parallel(n_jobs=1)]: Done  29 tasks       | elapsed:   25.1s
[CV] kernel=poly, C=1, gamma=auto, degree=3, class_weight=balanced ...
[CV]  kernel=poly, C=1, gamma=auto, degree=3, class_weight=balanced, score=0.725806 -   0.0s
[Parallel(n_jobs=1)]: Done  30 tasks       | elapsed:   25.1s
[CV] kernel=poly, C=1, gamma=0.001, degree=3, class_weight=balanced ..
[CV]  kernel=poly, C=1, gamma=0.001, degree=3, class_weight=balanced, score=0.800000 -   0.0s
[Parallel(n_jobs=1)]: Done  31 tasks       | elapsed:   25.1s
[CV] kernel=poly, C=1, gamma=0.001, degree=3, class_weight=balanced ..
[CV]  kernel=poly, C=1, gamma=0.001, degree=3, class_weight=balanced, score=0.802721 -   0.0s
[Parallel(n_jobs=1)]: Done  32 tasks       | elapsed:   25.1s
[CV] kernel=poly, C=1, gamma=0.001, degree=3, class_weight=balanced ..
[CV]  kernel=poly, C=1, gamma=0.001, degree=3, class_weight=balanced, score=0.802721 -   0.0s
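
The count reported at the top of this output ("132 candidates, totalling 396 fits") follows directly from the parameter grid. The sketch below reproduces it in plain Python, without scikit-learn; the helper `expand` and the other variable names are hypothetical, while the grid values are copied from the cell above.

```python
from itertools import product

C_values = [1, 10, 100]
class_weights = ['balanced', None]
gammas = ['auto', 0.001, 0.0001]

# Same structure as the param_grid passed to GridSearchCV: one sub-grid per kernel.
param_grid = [
    {'C': C_values, 'kernel': ['linear'], 'class_weight': class_weights},
    {'C': C_values, 'kernel': ['poly'], 'class_weight': class_weights,
     'gamma': gammas, 'degree': [2, 3, 4, 5, 6]},
    {'C': C_values, 'kernel': ['sigmoid'], 'class_weight': class_weights,
     'gamma': gammas},
    {'C': C_values, 'kernel': ['rbf'], 'class_weight': class_weights,
     'gamma': gammas},
]

def expand(grid):
    """List every parameter combination in one sub-grid."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

candidates = [params for grid in param_grid for params in expand(grid)]
n_splits = 3  # n_iter=3 in the StratifiedShuffleSplit above

for grid in param_grid:
    print("{}: {} candidates".format(grid['kernel'][0], len(expand(grid))))
print("Total candidates: {}".format(len(candidates)))
print("Total fits: {}".format(len(candidates) * n_splits))
```

Most of the search budget goes to the polynomial kernel (90 of the 132 candidates), because `degree` multiplies the size of that sub-grid by five.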


**What is the model's final F<sub>1</sub> score?**   
**Answer :** 0.84
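
For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall for the positive class (here `'yes'`). The following is a minimal pure-Python sketch of that definition, not the scikit-learn implementation; the function name `f1_for_label` and the toy labels are made up for illustration.

```python
def f1_for_label(y_true, y_pred, pos_label="yes"):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        # No correct positive predictions: report 0.0 rather than divide by zero.
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): 2 true positives, 1 false positive, 1 false negative.
y_true = ["yes", "yes", "no", "yes", "no"]
y_pred = ["yes", "no", "no", "yes", "yes"]
score = f1_for_label(y_true, y_pred)
print("F1 for 'yes': {:.3f}".format(score))
```

Because precision and recall are both 2/3 in this toy case, their harmonic mean is also 2/3.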