# SageMaker Example with Decision Tree

### Summary
Train a decision tree classifier and extract its decision rules, so that the model lays out the same step-by-step checks an analyst would follow to diagnose the root cause of a problem.

### Steps
1. Import data from S3
1. Transform the data using LabelEncoder
1. Fit a simple decision tree
1. Use an inner property of the decision tree model (`tree_`) to extract rules
1. Construct a recursive visitor for the decision tree

In [None]:
import boto3
import io
import numpy as np
import pandas as pd

### Get Data from S3

In [None]:
session = boto3.Session()
# Create the client from the session so the session object is actually used
s3 = session.client('s3')

file_name = 'example-file.csv'
bucket_name = "example-bucket"

obj = s3.get_object(Bucket = bucket_name, Key = file_name)
data = pd.read_csv(obj['Body'], low_memory=False)


In [None]:
data.head(2)

In [None]:
df = data.copy()
df.columns

In [None]:
df.shape

### Transforming the Data

In [None]:
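drop_columns = ['column_1', 'column_2']
target_column = 'target_column_name'

# Feature columns: everything except the columns listed in drop_columns.
# Include target_column in drop_columns too, otherwise the target is encoded and kept as a feature.
inputs = df.drop(drop_columns, axis='columns')

# Target variable: index by the name stored in target_column
# (df.target_column would look for a column literally named "target_column")
target = df[target_column]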

In [None]:
inputs.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Find the object (string or mixed-type) columns; they are cast to str first,
# because mixed types would break the encoder
objList = inputs.select_dtypes(include="object").columns

# Add an integer-encoded copy of every column, suffixed with "_n"
for column in inputs.columns:
    if column in objList:
        inputs[column + "_n"] = le.fit_transform(inputs[column].astype(str))
    else:
        inputs[column + "_n"] = le.fit_transform(inputs[column])


In [None]:
inputs.head()

In [None]:
# drop all original columns, keep the newly formatted columns
cols = list(set(df.columns) - set(drop_columns))

inputs_n = inputs.drop(cols, axis='columns')

In [None]:
inputs_n.head()
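
The loop above reuses a single `LabelEncoder`, so the mapping for each column is overwritten on the next `fit_transform` call and cannot be inverted later. As a hedged alternative, the sketch below uses scikit-learn's `OrdinalEncoder`, which is designed for feature columns and keeps one category list per column. The `toy` frame and its column names are hypothetical stand-ins for `inputs`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for `inputs`
toy = pd.DataFrame({
    "region": ["us", "uk", "us", "de"],
    "status": ["ok", "error", "ok", "ok"],
})

# Encode every column at once; categories are sorted per column
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy.astype(str)),
    columns=[c + "_n" for c in toy.columns],
    index=toy.index,
)

# encoder.categories_ holds one array of categories per column,
# so the integer codes can be mapped back to the original values
original = encoder.inverse_transform(encoded.to_numpy())

print(encoded)
print(encoder.categories_)
```

Keeping the fitted encoder around makes it possible to translate a threshold in an extracted rule back to the categories on either side of it.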

## Working with DecisionTreeClassifier

##### Step 1: Import the model you want to use
* `from sklearn.tree import DecisionTreeClassifier`
* The code cell below imports the `tree` module instead and uses `tree.DecisionTreeClassifier`

##### Step 2: Make an instance of the model
* `clf = DecisionTreeClassifier(max_depth=2, random_state=0)`

##### Step 3: Train the model on the data
* `clf.fit(X_train, Y_train)`

##### Step 4: Predict labels of unseen (test) data
* `clf.predict(X_test)`
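
The four steps can be sketched end to end on a public dataset. This is a minimal illustration, assuming the iris data in place of the S3 file; the `_demo` names are hypothetical and chosen so they do not clash with the notebook's own variables. It also scores on the held-out split, since accuracy on the training rows says little about unseen data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # Step 1: import the model

# Stand-in data (assumption): iris instead of the notebook's CSV
X_demo, y_demo = load_iris(return_X_y=True)
X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# Step 2: make an instance of the model
clf_demo = DecisionTreeClassifier(max_depth=2, random_state=0)

# Step 3: train the model on the training split
clf_demo.fit(X_train_demo, Y_train_demo)

# Step 4: predict labels of unseen (test) data
predictions_demo = clf_demo.predict(X_test_demo)

# Compare accuracy on seen and unseen rows
train_accuracy = clf_demo.score(X_train_demo, Y_train_demo)
test_accuracy = clf_demo.score(X_test_demo, Y_test_demo)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```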

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
# Baseline: fit on every row, then score on the same rows (training accuracy)
clf.fit(inputs_n, target)

clf.score(inputs_n, target)

In [None]:
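from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(inputs_n, target, random_state=0)
print(len(X_train))

clf.fit(X_train, Y_train)
# Held-out accuracy, for comparison with the training accuracy below
print(clf.score(X_test, Y_test))
clf.score(X_train, Y_train)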

In [None]:
clf.predict(X_test)

In [None]:
tree.plot_tree(clf)

In [None]:
import matplotlib.pyplot as plt

fn = list(inputs_n.columns)
# cn = ['setosa', 'versicolor', 'virginica']  # class names from the iris example; replace with your own labels
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=300)
tree.plot_tree(clf,
               feature_names=fn,
               # class_names=cn,
               filled=True)

# fig.savefig('imagename.png')

In [None]:
clf2 = tree.DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)
clf2.fit(X_train, Y_train)
clf2.score(X_train, Y_train)

In [None]:
clf.score(X_train, Y_train)

# Get information about the tree that was constructed

* [Extract Decision Tree Rules](https://stackoverflow.com/questions/56334210/how-to-extract-sklearn-decision-tree-rules-to-pandas-boolean-conditions)
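
As a quick cross-check on hand-written rule extraction, scikit-learn also provides `sklearn.tree.export_text`, which prints the fitted tree as indented if/else rules. The sketch below is only an illustration: it fits a small tree on the iris data (an assumption, not the notebook's dataset), and the `_demo` and `rules_text` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): iris instead of the notebook's CSV
iris = load_iris()
clf_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
clf_demo.fit(iris.data, iris.target)

# One line per test or leaf, indented by depth
rules_text = export_text(clf_demo, feature_names=list(iris.feature_names))
print(rules_text)
```

With the notebook's own model, the equivalent call would be `export_text(clf, feature_names=list(inputs_n.columns))`.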

In [None]:
n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold

### Define two helper functions
* The first one recursively finds the path from the tree's root to a specific node (all the leaves in our case).
* The second one writes out the rules along that path, that is, the conditions a sample must satisfy to reach the node

In [None]:
def find_path(node_numb, path, x):
    # Depth-first search from node_numb for node x; path collects the node ids on the way
    path.append(node_numb)
    if node_numb == x:
        return True
    left = False
    right = False
    # A child id of -1 means there is no child on that side
    if children_left[node_numb] != -1:
        left = find_path(children_left[node_numb], path, x)
    if children_right[node_numb] != -1:
        right = find_path(children_right[node_numb], path, x)
    if left or right:
        return True
    # x is not in this subtree: backtrack
    path.remove(node_numb)
    return False


def get_rule(path, column_names):
    mask = ''
    for index, node in enumerate(path):
        # The last node in the path is the leaf itself and carries no test
        if index != len(path) - 1:
            # Moving to the left child means "<= threshold", to the right child "> threshold"
            if children_left[node] == path[index + 1]:
                mask += "(df['{}']<= {}) \t ".format(column_names[feature[node]], threshold[node])
            else:
                mask += "(df['{}']> {}) \t ".format(column_names[feature[node]], threshold[node])
    # Turn every tab separator except the last into '&', then drop the trailing one
    mask = mask.replace("\t", "&", mask.count("\t") - 1)
    mask = mask.replace("\t", "")
    return mask

### Use those two functions to store the path to each leaf
* Then store the rules that lead to each leaf

In [None]:
# Leaf id reached by each test sample; only leaves that X_test actually reaches get a rule
leave_id = clf.apply(X_test)

paths ={}
for leaf in np.unique(leave_id):
    path_leaf = []
    find_path(0, path_leaf, leaf)
    paths[leaf] = np.unique(np.sort(path_leaf))

rules = {}
for key in paths:
    rules[key] = get_rule(paths[key], inputs_n.columns)

In [None]:
rules

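Since the rules are strings, you can't index with them directly (`df[rules[3]]`). Evaluate the string first, for example `df[eval(rules[3])]`, where `3` is a leaf id. The name `df` must then refer to a DataFrame that contains the encoded `_n` columns named in the rule, for example `df = inputs_n`.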
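
A minimal sketch of applying one rule string, assuming a toy DataFrame and a hand-written rule in the same format that `get_rule` produces. The column names `a_n` and `b_n` are hypothetical. Passing an explicit namespace to `eval` makes it clear which DataFrame the name `df` inside the rule refers to.

```python
import pandas as pd

# Hypothetical encoded data; stands in for inputs_n
toy_df = pd.DataFrame({"a_n": [0, 1, 2, 3], "b_n": [0, 1, 1, 0]})

# Hand-written example of a rule string in the get_rule format
rule = "(df['a_n']<= 1.5) & (df['b_n']> 0.5) "

# Evaluate the string into a boolean mask; `df` in the rule resolves to toy_df
mask = eval(rule, {"df": toy_df})

# Rows that satisfy the rule, i.e. rows that would land in that leaf
matching_rows = toy_df[mask]
print(matching_rows)
```

Only evaluate rule strings that your own code generated, since `eval` runs whatever expression it is given.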

### Example: Understanding the decision tree structure
* [scikit learn](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html)

In [None]:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold


# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

In [None]:
# First, retrieve the decision path of each sample. The decision_path method
# returns a node indicator matrix: a nonzero element at position (i, j) means
# that sample i passes through node j.

node_indicator = clf.decision_path(X_test)

# Similarly, apply returns the id of the leaf reached by each sample.

leave_id = clf.apply(X_test)

# With these, it is possible to recover the tests used to predict a sample or
# a group of samples. Here, find the nodes that a group of samples has in common.
sample_ids = [0, 1]
common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                len(sample_ids))

common_node_id = np.arange(n_nodes)[common_nodes]

print("\nThe following samples %s share the node %s in the tree"
      % (sample_ids, common_node_id))
print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))
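
The cell above covers a group of samples. For a single sample, the same indicator matrix can be sliced row by row. The sketch below follows the approach of the linked scikit-learn example, but it is a self-contained illustration fitted on the iris data (an assumption), with hypothetical `_demo` names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model (assumption): iris instead of the notebook's CSV
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

feature_demo = demo.tree_.feature
threshold_demo = demo.tree_.threshold
indicator = demo.decision_path(X_te)  # sparse CSR matrix, one row per sample
leaf_ids = demo.apply(X_te)

# Node ids visited by one sample: the nonzero column indices of its row
sample_id = 0
node_index = indicator.indices[
    indicator.indptr[sample_id]:indicator.indptr[sample_id + 1]
]

tests = []
for node_id in node_index:
    # The leaf has no test attached, so skip it
    if node_id == leaf_ids[sample_id]:
        continue
    value = X_te[sample_id, feature_demo[node_id]]
    sign = "<=" if value <= threshold_demo[node_id] else ">"
    tests.append(
        f"node {node_id}: X[{sample_id}, {feature_demo[node_id]}] = {value} "
        f"{sign} {threshold_demo[node_id]}"
    )

print("\n".join(tests))
```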

# What can be done better? What is hard?

* Multiple layers to coupon errors
    * Need to process those multiple layers and causes of errors
* Time zones: half the team was in the UK
* ALWAYS MORE DATA PROCESSING
    * Measure the importance of different targets
* Extract more specific rules from decision tree models as action items

# Next Steps?

1. Check to see if the model is creating rule sets
    * Maybe compare to current methods and see if we can perform regression testing
2. Did we succeed?
    * Yes? We successfully produced decision trees and rule sets
    * Can produce rules and not just a "black box"
    * People are not manually creating the rules
    * This specific decision tree code can produce decision trees for any DataFrame


# Is a Product Possible?
* YES? Cloud native application for processing custody vs Prometheus coupons to recommend the changes to analysts
* We have used coupon comparisons to determine the decisions impacting the occurrence of errors
* We need to use datasets with different targets to evaluate the best possible decisions for coupons
* We need to expand on the rule outputs from the decision trees to measure the success and accuracy of the specified targets