# How to implement a learning algorithm which is sklearn-compatible 
(Actually, this is more of a demonstration that it works; see my implementation for details on how to actually do it.)

First off, you need to implement your algorithm as a class and provide named parameters with default values in the
constructor.

For example:

**def \__init\__(self, max_depth=5, min_samples_split=20)**

Then you will need to provide the following methods at a bare minimum:

**def fit(self, X, y)**   - to train your model

**def predict(self, X)**  - to make use of your model

**def score(self, X, y)**  - for evaluating your model's accuracy

However, this will only get you so far, and primarily provides the same "feel" as a scikit-learn estimator. To make it compatible with more of sklearn's tooling, you should also implement:

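**def get_params(self, deep=True)**  - simply returns a dict of the current parameter values (sklearn's `clone` calls it with `deep=False`, so the method needs to accept that argument)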

**def set_params(self, \*\*params)** - allows you to override parameter values passed to the constructor

Anyway, that should get you pretty far, as I'll show below. There is a link at the end of this notebook to the
scikit-learn documentation which provides more details (such as what to subclass, and mixins you can use), but as we'll see soon, you can get pretty far doing the bare minimum. I simply subclassed 'object' and provided the public interface described above.
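
To make the interface above concrete, here is a minimal sketch of a class with that shape. It is not my CART implementation: `MajorityClassifier` is a hypothetical stand-in that always predicts the most common training label, and only the method names and the two constructor parameters come from the description above. The `deep` argument is accepted (and ignored) purely so that callers passing `deep=False` don't fail.

```python
from collections import Counter


class MajorityClassifier(object):
    """Hypothetical toy estimator: always predicts the most common training label."""

    def __init__(self, max_depth=5, min_samples_split=20):
        # Store constructor arguments unchanged and under the same names;
        # get_params/set_params rely on that.
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split

    def get_params(self, deep=True):
        # 'deep' is unused because this toy has no nested estimators.
        return {'max_depth': self.max_depth,
                'min_samples_split': self.min_samples_split}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError('unknown parameter: %s' % name)
            setattr(self, name, value)
        return self  # returning self allows chained calls

    def fit(self, X, y):
        # Attributes learned from the data get a trailing underscore.
        self.classes_ = sorted(set(y))
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Fraction of correct predictions, between 0 and 1.
        predictions = self.predict(X)
        hits = sum(1 for p, t in zip(predictions, y) if p == t)
        return hits / float(len(y))


clf = MajorityClassifier().set_params(max_depth=3)
clf.fit([[0.1], [0.2], [0.3]], [0, 1, 1])
print(clf.get_params())
print(clf.predict([[0.5]]))
print(clf.score([[0.1], [0.2], [0.3]], [0, 1, 1]))
```

Note that this sketch returns accuracy as a fraction, whereas the scores printed later in this notebook are percentages.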

## Using reusable code in an IPython notebook
Jupyter's IPython kernel (at least currently) doesn't allow you to import Python
packages/modules from other directories unless they are in sys.path.

I've added a **config_notebook.py** in the 'notebooks' subdir. You can
**import setup_pgh_ml_path** from it and call it, and then you'll be able
to import Python code from any directory below PghML.
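
For anyone who wants to do something similar in their own project, here is a rough sketch of the idea. This is not the actual contents of config_notebook.py: `add_project_root_to_path` is a hypothetical helper, and it assumes the project root can be found by walking upward until a directory with the given name turns up. The demo at the bottom uses a throwaway temp directory rather than a real checkout.

```python
import os
import sys
import tempfile


def add_project_root_to_path(root_name, start_dir=None):
    """Walk upward from start_dir to a directory named root_name and
    put it on sys.path (hypothetical helper, for illustration only)."""
    origin = os.path.abspath(start_dir or os.getcwd())
    current = origin
    while True:
        if os.path.basename(current) == root_name:
            if current not in sys.path:
                sys.path.insert(0, current)
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise RuntimeError('no %r directory above %s' % (root_name, origin))
        current = parent


# Demo with a temporary directory tree: <tmp>/PghML/notebooks
demo_root = os.path.join(tempfile.mkdtemp(), 'PghML')
demo_notebooks = os.path.join(demo_root, 'notebooks')
os.makedirs(demo_notebooks)

found = add_project_root_to_path('PghML', start_dir=demo_notebooks)
print(found in sys.path)
```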

In [1]:
# install 'PghML' in sys.path (if it isn't there already)
from config_notebook import setup_pgh_ml_path
setup_pgh_ml_path()

# loader function for my dataset
from datasets.loaders import load_banknote_authentication
# my decision tree implementation
from pgh_ml_py.sklearn_compat.tree.cart_decision_tree import CartDecisionTreeClassifier, display_tree

# useful sklearn functions/Classes which we wish to be able to leverage (the point of making our code compatible)
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
# other dependencies
import numpy as np
import pandas as pd

In [2]:
# my custom dataset loader function. typically you'll have it return some fields using the conventions
# data -> the features matrix
# target -> the labels vector
# in addition, I'm also returning 'dataframe' which is the original pandas dataframe I loaded, so we can analyze
# the data as well
dataset = load_banknote_authentication()

In [3]:
df = dataset.dataframe
X = dataset.data
y = dataset.target

In [4]:
train_X, test_X, train_y, test_y = train_test_split(X, y)

In [5]:
print(train_X.shape)
print(train_y.shape)
print(test_X.shape)
print(test_y.shape)

(1029, 4)
(1029,)
(343, 4)
(343,)


In [6]:
clf = CartDecisionTreeClassifier()

In [7]:
clf.fit(train_X, train_y)

CartDecisionTreeClassifier(max_depth=5, min_samples_split=20)

### Following sklearn's conventions for decision trees, my implementation's fit method sets the following 2 attributes:

clf.tree_  - the underlying representation of the decision tree

clf.classes_ - the set of unique classes in y

My ad hoc function **display_tree()**, which understands my decision tree's representation, makes use of these.

In [8]:
display_tree(clf.tree_, clf.classes_)

if feat[0] <= 1.594: (gini: 0.990 samples: 1029 [566, 463])
T-> if feat[1] <= 9.748: (gini: 0.826 samples: 650 [194, 456])
  T-> if feat[0] <= -7.042: (gini: 0.754 samples: 609 [153, 456])
    T-> 1
    F-> 1
  F-> if feat[0] <= -2.226: (gini: 0.000 samples: 41 [41, 0])
    T-> 0
    F-> if feat[0] <= -2.226: (gini: 0.000 samples: 29 [29, 0])
      T-> 0
      F-> 0
F-> if feat[2] <= -4.929: (gini: 0.066 samples: 379 [372, 7])
  T-> 1
  F-> if feat[0] <= 1.594: (gini: 0.040 samples: 376 [372, 4])
    T-> 0
    F-> 0


## Ok, great, but let's actually try and do something with my tree.  Let's call predict() on it

In [9]:
clf.predict(test_X[0])



[1]

**Yikes!** What's happening here? As you can see, it *works*, but it spits out an ugly deprecation warning.

sklearn classifiers' predict methods expect an **array** of rows, so if we're passing in a single row of data we simply need to pass it as [row].


In [10]:
clf.predict([test_X[0]])

[1]

### And let's see how accurate my tree is by passing in the full test data set

In [11]:
clf.score(test_X, test_y)

82.79883381924198

### Ok. That's fine for demonstrating how to fit/predict/score a single tree, but let's do a cross validation with 5 folds


In [12]:
cross_val_score(clf, dataset.data, dataset.target, cv=5)

array([ 70.54545455,  69.09090909,  76.64233577,  37.59124088,  28.46715328])

### Hmm. Something looks **very** wrong here: the original code scored ~80%.  Did I break something in my refactoring?  Let's take a look at the data (like I **should** have prior to doing anything)

In [13]:
df.head()

Unnamed: 0,variance,skewness,curtosis,entropy,label
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [14]:
df.tail()

Unnamed: 0,variance,skewness,curtosis,entropy,label
1367,0.40614,1.3492,-1.4501,-0.55949,1
1368,-1.3887,-4.8773,6.4774,0.34179,1
1369,-3.7503,-13.4586,17.5932,-2.7771,1
1370,-3.5637,-8.3827,12.393,-1.2823,1
1371,-2.5419,-0.65804,2.6842,1.1952,1


## Ok,  I *think* I'm seeing a pattern. Let's print out some more of the dataset to make sure I'm not hallucinating

In [15]:
df

Unnamed: 0,variance,skewness,curtosis,entropy,label
0,3.621600,8.66610,-2.807300,-0.446990,0
1,4.545900,8.16740,-2.458600,-1.462100,0
2,3.866000,-2.63830,1.924200,0.106450,0
3,3.456600,9.52280,-4.011200,-3.594400,0
4,0.329240,-4.45520,4.571800,-0.988800,0
5,4.368400,9.67180,-3.960600,-3.162500,0
6,3.591200,3.01290,0.728880,0.564210,0
7,2.092200,-6.81000,8.463600,-0.602160,0
8,3.203200,5.75880,-0.753450,-0.612510,0
9,1.535600,9.17720,-2.271800,-0.735350,0


### It turns out that my original function which created k-folds was randomizing (shuffling) the order of records, but that's not happening here

As you can see, **all** of the rows labeled **0** are in the **1st half** of the dataset while all the rows labeled **1** are in the **2nd half** of the dataset.  

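By default, if you simply pass in an int for the cv param and sklearn doesn't recognize the estimator as a classifier (presumably the case here, since mine just subclasses 'object'), it uses KFold without shuffling, which doesn't deal with this.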
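
To see why this hurts, here is a small sketch of how unshuffled k-fold splitting carves up a label vector sorted the way ours is. `contiguous_folds` is a hypothetical helper written for this illustration (it is not sklearn code); it follows the usual convention of cutting consecutive blocks, with the first few folds taking one extra sample when the data doesn't divide evenly. The class counts (762 and 610) are the ones reported for the full dataset at the root of the tree near the end of this notebook.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (start, stop) test ranges for unshuffled k-fold splitting."""
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop


# Same layout as our dataset: all class 0 rows first, then all class 1 rows.
labels = [0] * 762 + [1] * 610

folds = list(contiguous_folds(len(labels), 5))
for start, stop in folds:
    test = labels[start:stop]
    print('rows %4d-%4d  class 0: %3d  class 1: %3d'
          % (start, stop - 1, test.count(0), test.count(1)))
```

Four of the five test folds contain a single class, so each of those models is evaluated on a class distribution very different from the one it was trained on.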

### Let's make use of StratifiedKFold instead to make sure that all of our folds have the classes balanced

In [16]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=np.random.RandomState(1))

In [17]:
cross_val_score(clf, dataset.data, dataset.target, cv=cv)

array([ 84.        ,  81.81818182,  83.21167883,  83.21167883,  83.21167883])

### Ok, great!  These values are pretty much in sync with the original blog post I based this off of.  That's a relief - I didn't break anything in all of my refactoring.

So far, I've simply made use of my class using its default values of max_depth=5 and min_samples_split=20.

### Let's make use of sklearn's GridSearchCV to try to automatically optimize values for these parameters.

Of course, this can take a while, so I'll keep the ranges of values to a reasonable size.

In [18]:
parameters = {'max_depth': range(3, 6), 'min_samples_split': range(10, 26, 5)}
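
As a quick sanity check on how much work this grid implies, the combinations can be enumerated by hand. This is only a sketch of the idea using `itertools.product`; it is not how GridSearchCV is implemented internally, and `candidates` and `n_splits` are names I made up for the illustration.

```python
from itertools import product

parameters = {'max_depth': range(3, 6), 'min_samples_split': range(10, 26, 5)}

# Every combination of one value per parameter is one candidate to evaluate.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_splits = 5  # number of cross-validation folds used in this notebook
print(len(candidates))             # number of parameter combinations
print(len(candidates) * n_splits)  # number of model fits during the search
for candidate in candidates[:3]:
    print(candidate)
```

With 3 depths and 4 split sizes this gives 12 candidates and, at 5 folds each, 60 fits, which is why even a small grid takes a while with a pure-Python tree.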

In [19]:
dt = CartDecisionTreeClassifier()

In [20]:
clf = GridSearchCV(dt, parameters, cv=cv, verbose=True)
print(clf)

GridSearchCV(cv=StratifiedKFold(n_splits=5,
        random_state=<mtrand.RandomState object at 0x112ae2c80>,
        shuffle=True),
       error_score='raise',
       estimator=CartDecisionTreeClassifier(max_depth=5, min_samples_split=20),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [10, 15, 20, 25], 'max_depth': [3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)


In [21]:
clf.fit(X, y)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   47.2s finished


GridSearchCV(cv=StratifiedKFold(n_splits=5,
        random_state=<mtrand.RandomState object at 0x112ae2c80>,
        shuffle=True),
       error_score='raise',
       estimator=CartDecisionTreeClassifier(max_depth=5, min_samples_split=20),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [10, 15, 20, 25], 'max_depth': [3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)

### After running fit(), the GridSearchCV has a bunch of attributes set.  The ones I found most useful were:
* cv\_results\_      - lots of details which can be imported into pandas as a dataframe
* best\_score\_      - score of the best result
* best\_params\_     - dict of the best parameter values discovered
* best\_estimator\_  - the best estimator object (useful as you can inspect it)

In [22]:
df2 = pd.DataFrame(clf.cv_results_)

In [23]:
df2

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_max_depth,param_min_samples_split,params,rank_test_score,split0_test_score,split0_train_score,...,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.840534,0.000676,83.163265,83.546035,3,10,"{u'min_samples_split': 10, u'max_depth': 3}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.009733,3.4e-05,1.426715,0.347811
1,0.779057,0.000609,83.163265,83.546035,3,15,"{u'min_samples_split': 15, u'max_depth': 3}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.011207,6e-06,1.426715,0.347811
2,0.773153,0.000608,83.163265,83.546035,3,20,"{u'min_samples_split': 20, u'max_depth': 3}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.001926,6e-06,1.426715,0.347811
3,0.771235,0.000603,83.163265,83.546035,3,25,"{u'min_samples_split': 25, u'max_depth': 3}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.002325,7e-06,1.426715,0.347811
4,0.775854,0.000615,83.163265,83.546035,4,10,"{u'min_samples_split': 10, u'max_depth': 4}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.003135,7e-06,1.426715,0.347811
5,0.776714,0.000663,83.163265,83.546035,4,15,"{u'min_samples_split': 15, u'max_depth': 4}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.003052,9.4e-05,1.426715,0.347811
6,0.77732,0.000605,83.163265,83.546035,4,20,"{u'min_samples_split': 20, u'max_depth': 4}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.002686,5e-06,1.426715,0.347811
7,0.772857,0.000607,83.163265,83.546035,4,25,"{u'min_samples_split': 25, u'max_depth': 4}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.002586,7e-06,1.426715,0.347811
8,0.777278,0.000606,83.163265,83.546035,5,10,"{u'min_samples_split': 10, u'max_depth': 5}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.002679,1e-05,1.426715,0.347811
9,0.775073,0.000616,83.163265,83.546035,5,15,"{u'min_samples_split': 15, u'max_depth': 5}",1,81.454545,83.682771,...,85.036496,83.515483,82.116788,83.515483,84.671533,82.969035,0.0021,1.3e-05,1.426715,0.347811


In [24]:
print("""
best score: %f
best params: %s
""" % (clf.best_score_, clf.best_params_))


best score: 83.163265
best params: {'min_samples_split': 10, 'max_depth': 3}



In [25]:
best_tree = clf.best_estimator_

In [26]:
display_tree(best_tree.tree_, best_tree.classes_)

if feat[0] <= 1.794: (gini: 0.961 samples: 1372 [762, 610])
T-> if feat[1] <= 9.659: (gini: 0.607 samples: 890 [285, 605])
  T-> if feat[0] <= -7.042: (gini: 0.568 samples: 831 [226, 605])
    T-> 1
    F-> 1
  F-> if feat[0] <= -1.577: (gini: 0.000 samples: 59 [59, 0])
    T-> 0
    F-> 0
F-> if feat[2] <= -4.942: (gini: 0.057 samples: 482 [477, 5])
  T-> 1
  F-> if feat[0] <= 1.794: (gini: 0.015 samples: 480 [477, 3])
    T-> 0
    F-> 0


## References

Original blog post I got the algorithm from: http://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/

sklearn documentation on rolling your own estimator: http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator

**NOTE** I did only the minimal work I found necessary to get this working. You'll probably want to do something like the TemplateClassifier to make things **completely** compatible.