# Interpreting Anomalies from Isolation Forest


## Isolation Forest

The idea behind Isolation Forest is that anomalies are easier to separate from the rest of the data than typical points. The algorithm partitions the data with a forest of randomly built trees: each split picks a feature and a split value at random. The number of splits it takes to isolate a record indicates how anomalous the record is. Records that consistently get short path lengths across the forest are likely to be anomalies.
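
To make the intuition concrete, here is a minimal pure-Python sketch of random partitioning. It is not H2O's implementation: the helpers `isolation_depth` and `mean_depth` and the toy data are assumptions made for illustration, and the sketch skips the subsampling that real isolation forests use.

```python
import random

def isolation_depth(points, target, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `target` in one random tree."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(target))      # pick a feature at random
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                          # chosen feature is constant: stop early (a simplification)
        return depth
    split = rng.uniform(lo, hi)           # pick a split value at random
    # Follow only the branch that contains the target point.
    if target[dim] < split:
        branch = [p for p in points if p[dim] < split]
    else:
        branch = [p for p in points if p[dim] >= split]
    return isolation_depth(branch, target, rng, depth + 1, max_depth)

def mean_depth(points, target, n_trees=200, seed=0):
    """Average the isolation depth of `target` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees

# Toy data (hypothetical): a Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the center
data = cluster + [outlier]

outlier_depth = mean_depth(data, outlier)
inlier_depth = mean_depth(data, inlier)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The far-away point should need noticeably fewer splits on average than the point in the middle of the cluster, which is the signal the forest relies on.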

In this demo, we will use the Isolation Forest technique to find employees who may be anomalous.


## Loading the Data

Before we dive into anomaly detection, let's initialize the H2O cluster and load our data. We will be using the [synthetic employee attrition dataset](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/). It contains one record per employee, with information about their employment history and whether they left the company (attrition).

In [None]:
import h2o
h2o.init()

In [None]:
employee_data = h2o.import_file("../../data/topics/isolation_forest/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
employee_data.head()

## Training the Isolation Forest

To find our anomalous employees, let's train our isolation forest and see how the predictions look. We will only use a subset of columns for demo purposes.

In [None]:
from h2o.estimators import H2OIsolationForestEstimator
myX = ['Age', 'BusinessTravel', 'DistanceFromHome', 'Education', 'Gender', 'JobInvolvement', 'JobLevel', 
       'MaritalStatus', 'MonthlyIncome', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

isolation_model = H2OIsolationForestEstimator(model_id = "isolation_forest.hex", seed = 1234)
isolation_model.train(training_frame = employee_data, x = myX)

The predictions from the isolation forest include a `mean_length` column. This is the average number of splits it took to isolate the record across all the decision trees in the forest. Records with a smaller `mean_length` are more likely to be anomalous since it takes fewer partitions of the data to isolate them.

In [None]:
predictions = isolation_model.predict(employee_data)
predictions.head()

The histogram of `mean_length` shows that most employees have a `mean_length` greater than 6.5. This means that it takes more than 6.5 splits on average to isolate them.

In [None]:
predictions["mean_length"].hist()

## Defining Anomalies

We will define an anomaly as an employee whose `mean_length` is less than 5.5. These are the employees who were easier to isolate from the rest of the data.

With this threshold, there are 34 anomalous employees.

In [None]:
anomalies = employee_data[predictions["mean_length"] < 5.5]
print("Number of Anomalies: " + str(anomalies.nrow))
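
The 5.5 cutoff is a judgment call read off the histogram. As a minimal sketch of the same filtering logic on plain Python lists, the snippet below flags records under a fixed cutoff and shows one possible alternative: deriving the cutoff from a target fraction of records. The `mean_length` values are made up, and the helpers `flag_anomalies` and `quantile_threshold` are hypothetical, not part of H2O.

```python
def flag_anomalies(mean_lengths, threshold):
    """Return the indices of records whose mean_length is below the threshold."""
    return [i for i, m in enumerate(mean_lengths) if m < threshold]

def quantile_threshold(mean_lengths, fraction):
    """Return a cutoff that roughly `fraction` of the records fall below."""
    ordered = sorted(mean_lengths)
    k = min(max(1, int(len(ordered) * fraction)), len(ordered) - 1)
    return ordered[k]

# Hypothetical mean_length values for ten records.
mean_lengths = [6.9, 7.2, 5.1, 6.6, 7.0, 4.8, 6.8, 7.4, 6.7, 7.1]

fixed = flag_anomalies(mean_lengths, 5.5)
by_fraction = flag_anomalies(mean_lengths, quantile_threshold(mean_lengths, 0.2))
print(fixed, by_fraction)
```

A fixed cutoff is easy to explain, while a fraction-based cutoff keeps the number of flagged records stable if the forest is retrained with different settings.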

In [None]:
isolation_model.predict(anomalies)["mean_length"].cbind(anomalies[myX])

## Interpreting Anomalies

Now that we have found anomalous employees, we are interested in why they are considered anomalies.  Let's examine the first anomaly.

In [None]:
anomalies[0, myX]

In [None]:
isolation_model.predict(anomalies[0, :])

To determine why this employee is considered anomalous, we will build a surrogate decision tree.  The goal of the decision tree is to separate this employee from all other employees.

The structure of the decision tree will tell us why the employee is different from others.
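
Before training the real surrogate, the idea can be sketched on toy data: label one record as the anomaly, then look for a split that puts only that record on one side. This is a simplified, single-split stand-in for a decision tree, assuming made-up rows; `isolating_split` is a hypothetical helper, not an H2O function.

```python
def isolating_split(rows, target_index, features):
    """Find a single-feature threshold that leaves only the target on one side.

    Returns (feature, threshold, direction), or None if no single split works.
    """
    target = rows[target_index]
    others = [r for i, r in enumerate(rows) if i != target_index]
    for feature in features:
        values = [r[feature] for r in others]
        if target[feature] > max(values):
            # Target is the largest value: split halfway to the next largest.
            return feature, (target[feature] + max(values)) / 2, ">="
        if target[feature] < min(values):
            # Target is the smallest value: split halfway to the next smallest.
            return feature, (target[feature] + min(values)) / 2, "<"
    return None

# Hypothetical records; index 3 plays the role of the anomaly.
rows = [
    {"Age": 34, "YearsInCurrentRole": 2},
    {"Age": 45, "YearsInCurrentRole": 7},
    {"Age": 29, "YearsInCurrentRole": 1},
    {"Age": 58, "YearsInCurrentRole": 16},
    {"Age": 51, "YearsInCurrentRole": 9},
]

split = isolating_split(rows, 3, ["Age", "YearsInCurrentRole"])
feature, threshold, direction = split
# Check which records land on the target's side of the split.
selected = [i for i, r in enumerate(rows)
            if (r[feature] >= threshold) == (direction == ">=")]
print(split, selected)
```

A real tree can chain several such splits, which is why the surrogate below is allowed a depth of 3.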

In [None]:
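# Copy the frame so the original data is left unchanged
surrogate_data = employee_data[:, :]
# Label the anomaly under examination (EmployeeNumber 81) against all other employees
surrogate_data["AnomalyRecord"] = (surrogate_data["EmployeeNumber"] == 81).ifelse("Anomaly", "NotAnomaly")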

In [None]:
surrogate_data["AnomalyRecord"].table()

In [None]:
from h2o.estimators import H2ORandomForestEstimator

decision_tree = H2ORandomForestEstimator(model_id = "surrogate_decision_tree.hex", ntrees = 1, max_depth = 3,
                                         sample_rate = 1, mtries = len(myX))
decision_tree.train(training_frame = surrogate_data, x = myX, y = "AnomalyRecord")

We can visualize this decision tree to see how it split to isolate our anomaly record.

In [None]:
import os
import subprocess
from IPython.display import Image

def generateTreeImage(decision_tree, image_file_path):
    # Download the MOJO together with h2o-genmodel.jar (saved alongside it)
    mojo_path = decision_tree.download_mojo(get_genmodel_jar=True)
    directory = os.path.dirname(mojo_path)
    h2o_jar_path = os.path.join(directory, "h2o-genmodel.jar")
    # Export the first tree (index 0) as a Graphviz file.
    # Arguments are passed as a list so paths containing spaces are handled correctly.
    gv_file_path = os.path.join(directory, "decision_tree.gv")
    subprocess.run(["java", "-cp", h2o_jar_path, "hex.genmodel.tools.PrintMojo",
                    "--tree", "0", "-i", mojo_path, "-o", gv_file_path], check=True)
    # Render the Graphviz file as a PNG (requires the Graphviz `dot` binary)
    subprocess.run(["dot", "-Tpng", gv_file_path, "-o", image_file_path], check=True)

    return Image(image_file_path)

In [None]:
generateTreeImage(decision_tree, "./decision_tree.png")

The anomalous employee falls in the bucket of employees who are both older and have spent many years in their current role: `YearsInCurrentRole >= 15.5` and `Age >= 57.5`.

We can see that our simple decision tree separates the anomaly from the other employees perfectly because it has an AUC of 1 on the training data. This means that this employee is the only one in the data who has been in their current role for more than 15.5 years and is older than 57.

In [None]:
decision_tree.model_performance(surrogate_data).auc()
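
To see why an AUC of 1 implies perfect separation here, recall that AUC can be read as the probability that a randomly chosen positive record scores higher than a randomly chosen negative one. The sketch below computes it that way on made-up scores; the `auc` helper is a hypothetical illustration, not H2O's metric implementation.

```python
def auc(labels, scores):
    """AUC as the chance a positive outscores a negative (ties count as half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# One anomaly (label 1) among four other records, with hypothetical scores.
labels = [0, 0, 0, 1, 0]
alone_in_leaf = [0.0, 0.0, 0.0, 1.0, 0.0]   # the anomaly has its own leaf
shared_leaf = [0.0, 0.5, 0.0, 0.5, 0.0]     # another record shares the anomaly's leaf

auc_alone = auc(labels, alone_in_leaf)
auc_shared = auc(labels, shared_leaf)
print(auc_alone, auc_shared)
```

With a single positive record, any other record that lands in the same leaf receives the same score and pulls the AUC below 1, so an AUC of exactly 1 means the anomaly's leaf contains no one else.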

In [None]:
anomalies[0, ["Age", "YearsInCurrentRole"]]

If we examine the distribution of these two features, we can see that the employee falls at the far right of the distribution for both.

In [None]:
employee_data["Age"].hist()

In [None]:
employee_data["YearsInCurrentRole"].hist()
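
The histograms give a visual answer; a percentile rank puts a number on how far to the right a value sits. The following is a minimal sketch on made-up ages, with `percentile_rank` as a hypothetical helper rather than an H2O function.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100.0 * sum(v <= x for v in values) / len(values)

# Hypothetical ages; 58 stands in for the anomalous employee.
ages = [29, 34, 45, 51, 58, 38, 41, 33, 47, 36]

rank_anomaly = percentile_rank(ages, 58)
rank_typical = percentile_rank(ages, 41)
print(rank_anomaly, rank_typical)
```

A rank close to 100 for both `Age` and `YearsInCurrentRole` would match what the surrogate tree found.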