## Tree Classifiers

### Decision Tree Construction Algorithm

#### We build a tree top-down in a greedy manner.

    function BuildDT(X, Y, threshold)
        if (all the labels (Y) are the same)
            assign as leaf node with label set to class of Y
            return tree
        elseif (number of examples is at or below threshold, the minimum number of examples to keep splitting)
            assign as leaf node with label set to most common class
            return tree
        else
            let f be the feature that is best to split on
            let leftChildBranch = BuildDT(data (X) where f=True, labels (Y) with f=True, threshold)
            let rightChildBranch = BuildDT(data (X) where f=False, labels (Y) with f=False, threshold)
            assign as internal node that tests f, with leftChildBranch and rightChildBranch as children
            return tree
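
The pseudocode above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's own implementation: features are boolean, "best" is taken to mean lowest weighted Gini impurity (the pseudocode does not name a measure), and the names `build_dt`, `best_feature`, and `predict` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_feature(X, Y):
    """Index of the feature with the lowest weighted Gini, or None if nothing splits the data."""
    best, best_score = None, float("inf")
    for f in range(len(X[0])):
        left = [y for x, y in zip(X, Y) if x[f]]
        right = [y for x, y in zip(X, Y) if not x[f]]
        if not left or not right:
            continue  # this feature does not separate the examples
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(Y)
        if score < best_score:
            best, best_score = f, score
    return best

def build_dt(X, Y, threshold=1):
    """Greedy top-down construction; the tree is a nested dict (an assumed representation)."""
    majority = Counter(Y).most_common(1)[0][0]
    if len(set(Y)) == 1 or len(Y) <= threshold:
        return {"leaf": majority}
    f = best_feature(X, Y)
    if f is None:
        return {"leaf": majority}
    left = [(x, y) for x, y in zip(X, Y) if x[f]]
    right = [(x, y) for x, y in zip(X, Y) if not x[f]]
    return {
        "feature": f,
        "true": build_dt([x for x, _ in left], [y for _, y in left], threshold),
        "false": build_dt([x for x, _ in right], [y for _, y in right], threshold),
    }

def predict(tree, x):
    """Walk from the root to a leaf, following the branch that matches each tested feature."""
    while "leaf" not in tree:
        tree = tree["true"] if x[tree["feature"]] else tree["false"]
    return tree["leaf"]

# Toy data: the label is 1 only when both features are True.
X = [(True, True), (True, False), (False, True), (False, False)]
Y = [1, 0, 0, 0]
tree = build_dt(X, Y)
print([predict(tree, x) for x in X])
```

Because the split is chosen greedily one node at a time, the resulting tree is not guaranteed to be the smallest tree that fits the data.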
            
### Determining best split feature by evaluating node purity

<img style="float: l;" src="Purity.png">
 
$\text{Entropy}(t) = - \displaystyle \sum_{i=0}^{c-1} p(i \mid t) \log_2 p(i \mid t) \qquad \text{Gini}(t) = 1 - \displaystyle \sum_{i=0}^{c-1} [p(i \mid t)]^2 \qquad \text{Classification error}(t) = 1 - \max_i [p(i \mid t)]$
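
As a rough illustration, the three impurity measures can be written directly from the formulas. The function names are hypothetical, and each function takes the class proportions $p(i \mid t)$ at a node as a list.

```python
from math import log2

def entropy(p):
    """Entropy in bits; zero-probability classes contribute nothing."""
    return sum(-q * log2(q) for q in p if q > 0)

def gini(p):
    """Gini impurity: one minus the sum of squared class proportions."""
    return 1.0 - sum(q * q for q in p)

def classification_error(p):
    """One minus the proportion of the majority class."""
    return 1.0 - max(p)

pure = [1.0, 0.0]    # every example at the node has the same class
mixed = [0.5, 0.5]   # classes evenly split, the least pure two-class node
for name, fn in [("entropy", entropy), ("gini", gini), ("error", classification_error)]:
    print(name, fn(pure), fn(mixed))
```

All three measures are zero for a pure node and reach their maximum when the classes are evenly mixed.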

#### 1) Determine which feature is better to split on?
<img style="float: left;" src="Gini.png">  $A.N1 = 1 - [(4/7)^2 + (3/7)^2] = 0.4898$

$A.N2 = 1 - [(2/5)^2 + (3/5)^2] = 0.4800$

$A = (7/12) \times 0.4898 + (5/12) \times 0.4800 = 0.486$
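
The arithmetic for feature A can be checked with a short script. The class counts (4 and 3 in node N1, 2 and 3 in node N2) are read off the fractions in the calculation above; the helper names are hypothetical.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given its per-class example counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """Impurity of a split: each child's Gini weighted by its share of the examples."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini_from_counts(c) for c in children)

n1, n2 = (4, 3), (2, 3)   # class counts in the two child nodes of feature A
split_a = weighted_gini([n1, n2])
print(round(gini_from_counts(n1), 4), round(gini_from_counts(n2), 4), round(split_a, 3))
```

Repeating the same calculation for the other candidate feature and picking the lower weighted Gini answers the question.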


### Boosting Weak Classifiers (Decision Stumps)

#### AdaBoost takes a weak classifier and maintains a distribution (a set of weights) over the training observations. The key is to update that distribution every round so that the next “weak learner” concentrates on the observations the previous ones misclassified, and the ensemble learns to correct its own mistakes. It runs for $T$ rounds and in each round learns a weak hypothesis $h_t$ together with a coefficient $\alpha_t$ for that hypothesis. The final prediction for an input $x$ is:

$\hat{y} = \text{sign} \Big[ \displaystyle \sum_{t=1}^T \alpha_t h_t(x) \Big]$

#### We define a classification algorithm (the weak learner) $A \big( \langle x_n, y_n, D_n \rangle_{n=1}^N \big)$ that takes as input the training data $x_n$, the training labels $y_n$, and a weight $D_n$ for each observation.
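
A minimal sketch of the procedure, assuming the standard AdaBoost formulation with labels in $\{-1, +1\}$, coefficient $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ for weighted error $\epsilon_t$, and one-dimensional threshold stumps as the weak learner. None of these details are spelled out in the text above, and all function names are hypothetical.

```python
from math import exp, log

def stump_predict(x, thr, polarity):
    """A decision stump on one numeric feature: predicts polarity when x >= thr."""
    return polarity if x >= thr else -polarity

def train_stump(xs, ys, D):
    """Weak learner A: choose the threshold and polarity with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(d for d, x, y in zip(D, xs, ys) if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ys, T):
    N = len(xs)
    D = [1.0 / N] * N                          # start with uniform weights
    ensemble = []
    for _ in range(T):
        err, thr, pol = train_stump(xs, ys, D)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Increase the weight of misclassified points, decrease the rest, then renormalize.
        D = [d * exp(-alpha * y * stump_predict(x, thr, pol)) for d, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# No single threshold separates this pattern, but three boosted stumps do.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, T=3)
print([predict(ensemble, x) for x in xs])
```

The toy labels are chosen so that every individual stump makes at least two mistakes, which makes the benefit of reweighting visible after only a few rounds.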


                                           
### Environment Setup

In [None]:
import re
import nltk
import numpy as np
import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve
from nltk.corpus import stopwords
from sklearn.svm import SVC
%matplotlib inline 

### Syntactic NLP Processing

#### We will define some Python functions that perform syntactic processing on our corpus.

In [None]:
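def tokenize(text):
    # Split the text into sentences, then each sentence into word tokens
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Keep only purely alphabetic tokens (drops numbers, punctuation, and mixed tokens)
    filtered_tokens = [ token for token in tokens if re.search('(^[a-zA-Z]+$)', token) ]
    return filtered_tokens

# NLTK's English stop words plus additional corpus-specific terms to exclude
cachedStopWords = stopwords.words("english") + ['year', 'old', 'man', 'woman', 'ap', 'am', 'pm', 'portable', 'pa', 'lat', 'admitting', 'diagnosis', 'lateral']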
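
To see what the regular-expression filter in `tokenize` keeps, here is a small sketch that applies the same pattern to a hand-written token list. The tokens are made up for illustration and stand in for the output of `nltk.word_tokenize`.

```python
import re

# Hypothetical tokens, as a word tokenizer might produce them
tokens = ["Chest", "x-ray", "2", "views", ".", "PNA", "q6h", "lungs"]

# Same pattern as in tokenize(): the whole token must consist of ASCII letters
alpha_only = [t for t in tokens if re.search(r"(^[a-zA-Z]+$)", t)]
print(alpha_only)
```

Hyphenated words and tokens that mix letters and digits are dropped along with numbers and punctuation.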


### Retrieving our Corpus

#### Let's load the corpus that we previously serialized to disk.

In [None]:
with open('classification-corpus.pkl', 'rb') as file:
    corpus = pkl.load(file)
corpusList = list(corpus.values())
labels = list(corpus.keys())

### Generate Document-Term Frequency Counts

#### In this step we tokenize our text and remove stop words in addition to generating our frequency counts.

#### 1) How many documents are we working with and how many features (unigrams & bigrams)?

#### 2) Can you figure out what max_df and min_df are doing to our feature count?

In [None]:
cv = CountVectorizer(lowercase=True, max_df=0.80, max_features=None, min_df=0.033,
                     ngram_range=(1, 2), preprocessor=None, stop_words=cachedStopWords,
                     strip_accents=None, tokenizer=tokenize, vocabulary=None)
X = cv.fit_transform(corpusList)
print(X.shape)
print()
lexicon = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(lexicon)
print()
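
For intuition about `min_df` and `max_df`, the following is a rough pure-Python analogue of the filtering that `CountVectorizer` applies when both are given as proportions: terms found in too few or too many documents are discarded. The helper name and the toy documents are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df] as fractions of the corpus."""
    n_docs = len(docs)
    # Count each term once per document, however often it occurs there
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(t for t, c in df.items() if min_df * n_docs <= c <= max_df * n_docs)

docs = [
    "chest pain after fall",
    "chest film shows pneumonia",
    "chest trauma with rib fracture",
    "fever cough chest pneumonia",
]
vocab = filter_by_document_frequency(docs, min_df=0.5, max_df=0.8)
print(vocab)
```

Here "chest" appears in every document and exceeds `max_df`, while terms seen in a single document fall below `min_df`, so only a term of intermediate frequency survives.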

### Construct our Classes

#### We need to assign a class label to each document. We typically use numeric values for the classes.

In [None]:
Y = []
for key in corpus:        
    if (key.startswith('Trauma')):
        Y.append(0)
    elif (key.startswith('PNA')):
        Y.append(1)
Y = np.array(Y)

### Let's Run It!

#### We will generate models and evaluate the models using bootstrapping.

In [None]:
truth = []
dt_prediction = []
bdt_prediction = []
rf_prediction = []

for i in range(0,10):
    print('Iteration: ' + str(i+1))
    N, D = X.shape
    # Draw N row indices with replacement: the in-the-bag (ITB) training sample
    ITB = np.random.choice(N, N, replace=True)
    X_ITB = X[ITB, :]
    Y_ITB = Y[ITB]
    # Rows that were never drawn form the out-of-bag (OOB) test sample
    X_OOB = np.delete(X.toarray(), list(set(ITB)), 0)
    Y_OOB = np.delete(Y, list(set(ITB)), 0)
    N_OOB, D_OOB = X_OOB.shape
    truth.append(Y_OOB)
    # Single decision tree
    dt = DecisionTreeClassifier(random_state=0)
    dt.fit(X_ITB, Y_ITB)
    Y_hat = dt.predict(X_OOB)
    dt_prediction.append(Y_hat)
    # AdaBoost over decision stumps (trees of depth 1)
    bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=201)
    bdt.fit(X_ITB, Y_ITB)
    Y_hat = bdt.predict(X_OOB)
    bdt_prediction.append(Y_hat)
    # Random forest; "sqrt" is what "auto" meant for classifiers before it was removed from scikit-learn
    rf = RandomForestClassifier(n_estimators = 101, oob_score = True, n_jobs = -1, max_features = "sqrt")
    rf.fit(X_ITB, Y_ITB)
    Y_hat = rf.predict(X_OOB)
    rf_prediction.append(Y_hat)

truth = np.concatenate(truth, axis=0)    
dt_prediction = np.concatenate(dt_prediction, axis=0)
bdt_prediction = np.concatenate(bdt_prediction, axis=0)
rf_prediction = np.concatenate(rf_prediction, axis=0)
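
The in-the-bag and out-of-bag split used in the loop can be illustrated on its own. This sketch uses only the standard library instead of NumPy, and the function name `bootstrap_split` is hypothetical.

```python
import random

def bootstrap_split(n, rng):
    """Sample n indices with replacement; indices never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(0)   # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_fraction = len(out_of_bag) / 1000
print(len(in_bag), len(out_of_bag), oob_fraction)
```

Each index is missed by all $n$ draws with probability $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$, so roughly a third of the observations are held out for testing in every iteration.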

### Contingency Tables

#### Let's look at the contingency tables

#### 1) Can you calculate the Sensitivity, Specificity, PPV, NPV? 

In [None]:
dt_ct = pd.crosstab(dt_prediction, truth, margins=True)
dt_ct.columns = ["Trauma", "PNA", "Total"]
dt_ct.index = ["Trauma", "PNA", "Total"]
print("Decision Tree")
print(dt_ct)
print()

bdt_ct = pd.crosstab(bdt_prediction, truth, margins=True)
bdt_ct.columns = ["Trauma", "PNA", "Total"]
bdt_ct.index = ["Trauma", "PNA", "Total"]
print("Boosting Decision Stumps")
print(bdt_ct)
print()

rf_ct = pd.crosstab(rf_prediction, truth, margins=True)
rf_ct.columns = ["Trauma", "PNA", "Total"]
rf_ct.index = ["Trauma", "PNA", "Total"]
print("Random Forest")
print(rf_ct)
print()

### Evaluation Statistics

#### We will calculate some evaluation statistics for our classifiers

In [None]:
def report(name, ct):
    # Rows are predictions, columns are truth; position 0 = Trauma, 1 = PNA, 2 = Total
    tn, tp = ct.iloc[0, 0], ct.iloc[1, 1]
    Sens = tp / ct.iloc[2, 1]
    Spec = tn / ct.iloc[2, 0]
    PPV = tp / ct.iloc[1, 2]
    NPV = tn / ct.iloc[0, 2]
    ACC = (tn + tp) / ct.iloc[2, 2]
    print("%s: Sensitivity: %.5f Specificity: %.5f PPV: %.5f NPV: %.5f Accuracy: %.5f" % (name, Sens, Spec, PPV, NPV, ACC))

report("Decision Tree", dt_ct)
report("Boosting Decision Stumps", bdt_ct)
report("Random Forest", rf_ct)
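
As a worked example of the definitions, the sketch below computes the same statistics from the four cells of a 2x2 table, treating PNA (class 1) as the positive class. The counts are hypothetical and are not results from this corpus; the function name is also an assumption.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 diagnostic statistics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # positives correctly identified
        "specificity": tn / (tn + fp),   # negatives correctly identified
        "ppv": tp / (tp + fp),           # predicted positives that are truly positive
        "npv": tn / (tn + fn),           # predicted negatives that are truly negative
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts for illustration only
m = diagnostic_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print("%s: %.5f" % (name, value))
```

In the contingency tables above, rows are predictions and columns are truth, so true positives sit in the PNA row and PNA column, false positives in the PNA row and Trauma column, and false negatives in the Trauma row and PNA column.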

### ROC Curve

#### 1) So which classifier do you think is better?

In [None]:
dt_fpr, dt_tpr, dt_thresholds = roc_curve(truth, dt_prediction, pos_label=1)
bdt_fpr, bdt_tpr, bdt_thresholds = roc_curve(truth, bdt_prediction, pos_label=1)
rf_fpr, rf_tpr, rf_thresholds = roc_curve(truth, rf_prediction, pos_label=1)
plt.figure(1)
#plt.xlim(0, 0.2)
#plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(dt_fpr, dt_tpr, label='DecisionTree')
plt.plot(bdt_fpr, bdt_tpr, label='BoostingDecisionStumps')
plt.plot(rf_fpr, rf_tpr, label='RandomForest')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
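
One thing to keep in mind when reading the plot: the cell above passes hard 0/1 predictions to `roc_curve`, so each classifier contributes a single operating point between the corners. The sketch below shows why, by sweeping a threshold over hard labels and over continuous scores. The data are made up, and `roc_points` is a hypothetical helper rather than scikit-learn's implementation.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by thresholding the scores from high to low."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

y_true = [0, 0, 1, 1]
hard = [0, 1, 1, 1]            # class predictions: only two distinct values
soft = [0.1, 0.4, 0.35, 0.8]   # hypothetical class-1 probabilities
hard_curve = roc_points(y_true, hard)
soft_curve = roc_points(y_true, soft)
print(hard_curve)
print(soft_curve)
```

If a fuller curve is wanted, one option is to pass the class-1 column of each model's `predict_proba` output to `roc_curve` instead of the output of `predict`.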