# Interpreting Anomalies from Isolation Forest


## Isolation Forest

The idea behind Isolation Forest is that anomalies are easier to separate from the rest of the data than typical points.  The Isolation Forest algorithm partitions the data through a forest of randomly built trees: at each node, both the feature and the split value are chosen at random.  The number of splits it takes to isolate a record indicates how likely the record is to be an anomaly.  Samples for which the forest collectively produces short path lengths are highly likely to be anomalies.
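
The intuition can be sketched in a few lines of plain Python. This is a simplified, one-dimensional illustration and not H2O's implementation: the helper names `isolation_depth` and `mean_depth` and the toy values are assumptions made for this sketch, and a real isolation forest also picks a random feature at each node and works on subsamples of the data.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    same_side = [x for x in data if (x < split) == (point < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over several random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Toy data (hypothetical): 200 values clustered around 50, plus one extreme value.
data_rng = random.Random(42)
values = [data_rng.gauss(50, 5) for _ in range(200)] + [120.0]

outlier_depth = mean_depth(120.0, values)
typical_depth = mean_depth(values[0], values)
print("outlier:", outlier_depth, "typical:", typical_depth)
```

The extreme value should be isolated after far fewer splits on average than a value from the middle of the cluster, which is the signal the `mean_length` column captures later in this demo.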

In this demo, we will use the Isolation Forest technique to find employees that may be anomalies.  

## Start H2O-3 cluster
_**Note**: The `os.system` command below is used solely for the H2O Aquarium training platform._

In [None]:
import os
os.system('/home/h2o/bin/startup')
!sleep 10

Start by importing `h2o` and creating a connection to the server. The parameters used in `h2o.init` will depend on your specific environment.

In [None]:
import h2o
h2o.init(url='http://localhost:54321/h2o')

## Loading the Data

We will be using the [synthetic employee attrition dataset](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/). It contains one record per employee with information about their employment history and whether they left the company.

In [None]:
employee_data = h2o.import_file("/home/h2o/data/employee_attrition/HR-Employee-Attrition.csv")

In [None]:
employee_data.head()

## Training the Isolation Forest

To find our anomalous employees, let's train our isolation forest and see how the predictions look. We will only use a subset of columns for demo purposes.

In [None]:
from h2o.estimators import H2OIsolationForestEstimator
myX = ['Age', 'BusinessTravel', 'DistanceFromHome', 'Education', 'Gender', 'JobInvolvement', 'JobLevel', 
       'MaritalStatus', 'MonthlyIncome', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

isolation_model = H2OIsolationForestEstimator(model_id = "isolation_forest.hex", seed = 1234)
isolation_model.train(training_frame = employee_data, x = myX)

The predictions from the isolation forest include the `mean_length`.  This is the average number of splits it took to isolate the record across all the trees in the forest.  Records with a smaller `mean_length` are more likely to be anomalous since it takes fewer partitions of the data to isolate them.

In [None]:
predictions = isolation_model.predict(employee_data)
predictions.head()
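
For reference, the original Isolation Forest paper turns the mean path length into a score between 0 and 1 by dividing it by the expected path length for a sample of the same size. The sketch below implements that formula only. The function names are chosen for this example, the sample size of 256 is an assumption, and H2O's own score column may be scaled differently.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_length, sample_size):
    """Paper-style score: close to 1 for anomalies, around 0.5 or lower otherwise."""
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    return 2.0 ** (-mean_length / average_path_length(sample_size))

# Assumed sample size per tree; adjust to match the model's settings.
sample_size = 256
short_path_score = anomaly_score(3.0, sample_size)
long_path_score = anomaly_score(7.0, sample_size)
print(short_path_score, long_path_score)
```

A shorter mean path length always yields a higher score, so ranking records by `mean_length` in ascending order gives the same ordering as ranking by this score in descending order.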

The histogram of the `mean_length` shows that most employees have a `mean_length` greater than 6.5.  This means that it takes more than 6 splits on average to isolate them.

In [None]:
predictions["mean_length"].hist()

## Defining Anomalies

We will define an anomaly as an employee whose `mean_length` is less than 5.5.  These are the employees who were easiest to isolate from the rest of the data.

There are 34 anomalous employees.

In [None]:
anomalies = employee_data[predictions["mean_length"] < 5.5]
print("Number of Anomalies: " + str(anomalies.nrow))
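
The cutoff of 5.5 is a judgment call based on the histogram, so it can help to see how the number of flagged records changes with the threshold. The sketch below uses plain Python lists; the helper names and the toy path lengths are hypothetical and are not taken from the employee data.

```python
def count_below(mean_lengths, threshold):
    """Number of records whose mean path length is strictly below the threshold."""
    return sum(1 for value in mean_lengths if value < threshold)

def threshold_for_fraction(mean_lengths, fraction):
    """Cutoff below which roughly `fraction` of the records fall (ties aside)."""
    ordered = sorted(mean_lengths)
    index = min(int(len(ordered) * fraction), len(ordered) - 1)
    return ordered[index]

# Hypothetical mean path lengths for ten records.
toy_lengths = [4.8, 5.2, 6.1, 6.6, 6.7, 6.9, 7.0, 7.1, 7.2, 7.4]

flagged_at_fixed_cutoff = count_below(toy_lengths, 5.5)
cutoff_for_20_percent = threshold_for_fraction(toy_lengths, 0.2)
flagged_at_quantile_cutoff = count_below(toy_lengths, cutoff_for_20_percent)
print(flagged_at_fixed_cutoff, cutoff_for_20_percent, flagged_at_quantile_cutoff)
```

Choosing the cutoff from a target fraction instead of a fixed value is one alternative when the expected share of anomalies is roughly known in advance.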

In [None]:
isolation_model.predict(anomalies)["mean_length"].cbind(anomalies[myX])

## Interpreting Anomalies

There are two levels of interpretation:

* global level: high level understanding of what segments of data are considered anomalous
* local level: understanding of why an individual record is considered anomalous

We will start with the global level.  Our goal is to gain an understanding of what segments of data are considered anomalous.

### Global Level

Now that we have found anomalous employees, we are interested in why they are considered anomalies.  To do this, we will train a surrogate decision tree.  The purpose of the surrogate decision tree is to predict the anomaly flag.  In doing so, it finds segments of similar anomalies and learns how to separate them from records that are not anomalies.  We can then use this decision tree to describe the anomalous segments of the data.

The steps of interpreting anomalies on a global level are:

1. Create a frame with a column that indicates whether the record was considered an anomaly.
2. Train a decision tree to predict the anomaly flag.
3. Visualize the decision tree to determine which segments of the data are considered anomalous.

In our first step, we will add a column called `anomaly`.  This is a flag that indicates whether the isolation forest considered the record an anomaly.

In [None]:
global_surrogate_data = employee_data[:, :]
global_surrogate_data["anomaly"] = (predictions["mean_length"] < 5.5).ifelse("Yes", "No")
global_surrogate_data["anomaly"].table()

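Now that we have the surrogate data, we can train a single decision tree to predict the anomaly flag.  We will keep it simple (one tree with a maximum depth of 3) because the purpose of the decision tree is to be completely interpretable.  We build it with the random forest estimator, using one tree, all rows (`sample_rate = 1`), and all columns at every split (`mtries = len(myX)`), so that it behaves like a plain decision tree.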

In [None]:
from h2o.estimators import H2ORandomForestEstimator

global_surrogate_dt = H2ORandomForestEstimator(model_id = "global_surrogate_decision_tree.hex", 
                                               ntrees = 1, max_depth = 3,
                                               sample_rate = 1, mtries = len(myX))
global_surrogate_dt.train(training_frame = global_surrogate_data, x = myX, y = "anomaly")
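
To make concrete what each node of such a surrogate tree looks for, here is a minimal sketch of a single split search on one numeric feature using Gini impurity. It is a generic CART-style illustration under assumed toy values; H2O's tree builder uses its own split search, and the function names here are hypothetical.

```python
def gini(flags):
    """Gini impurity of a list of boolean labels."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return 2.0 * p * (1.0 - p)

def best_split(values, flags):
    """Return (threshold, weighted_gini) for the best rule `value < threshold`."""
    pairs = sorted(zip(values, flags))
    best = (None, gini(flags))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [flag for value, flag in pairs if value < threshold]
        right = [flag for value, flag in pairs if value >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy example (hypothetical values): anomalies all have a long time since promotion.
years_since_promotion = [0, 1, 1, 2, 3, 4, 13, 14, 15]
is_anomaly = [False, False, False, False, False, False, True, True, True]

threshold, impurity = best_split(years_since_promotion, is_anomaly)
print(threshold, impurity)
```

A depth-3 tree repeats this kind of search up to three times along each path, which is why every leaf can be read as a conjunction of at most three simple conditions.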

We can now visualize the decision tree to find segments of the data that are anomalous.

In [None]:
import os
import subprocess
from IPython.display import Image

def generateTreeImage(decision_tree, image_file_path):
    # Download the MOJO together with h2o-genmodel.jar, which provides PrintMojo
    mojo_path = decision_tree.download_mojo(get_genmodel_jar=True)
    directory = os.path.dirname(mojo_path)
    h2o_jar_path = os.path.join(directory, "h2o-genmodel.jar")
    # Export the first tree (index 0) as a Graphviz file
    gv_file_path = os.path.join(directory, "decision_tree.gv")
    # Passing arguments as a list avoids shell quoting problems with paths
    subprocess.check_call(["java", "-cp", h2o_jar_path, "hex.genmodel.tools.PrintMojo",
                           "--tree", "0", "-i", mojo_path, "-o", gv_file_path])
    # Render the Graphviz file as a PNG (requires the Graphviz `dot` executable)
    subprocess.check_call(["dot", "-Tpng", gv_file_path, "-o", image_file_path])

    return Image(image_file_path)

In [None]:
generateTreeImage(global_surrogate_dt, "./global_surrogate_decision_tree.png")

The visualization shows our global surrogate decision tree.  The values in the leaf nodes represent the probability of the record not being an anomaly.  We are, therefore, interested in leaf nodes with low values, since these indicate segments of the data that are anomalous.

We can see that there are three leaf nodes that contain only anomalies.  One leaf node is defined as: 

* Total Working Years < 32.5
* Years Since Last Promotion >= 12.5
* Percent Salary Hike >= 18.0

This segment seems strange for two reasons: 

* the employees have been working 32 years or less but have not received a promotion for more than 12 years
    * most employees have a promotion every 4 years of working 
* the employees have not had a promotion recently but have received a large salary hike
    * years since last promotion is negatively correlated with salary hike

In [None]:
promotion_per_working_years = employee_data[employee_data["YearsSinceLastPromotion"] > 0]
promotion_per_working_years = promotion_per_working_years["TotalWorkingYears"]/promotion_per_working_years["YearsSinceLastPromotion"]
promotion_per_working_years.median()

In [None]:
employee_data[["YearsSinceLastPromotion", "PercentSalaryHike"]].cor()

### Local Level

Now we will perform a local level interpretation.  The goal of this interpretation is to determine why a specific employee is considered an anomaly.

The steps of interpreting anomalies on a local level are:

1. Create a frame with a column that indicates whether the record is our selected anomaly.
2. Train a decision tree to predict the anomaly flag.
3. Visualize the decision tree to determine how the selected anomaly separates from the rest of the data.

Let's begin by examining our first anomaly.

In [None]:
anomalies[0, myX]

In [None]:
isolation_model.predict(anomalies[0, :])

To determine why this employee is considered anomalous, we will build a surrogate decision tree.  The goal of the decision tree is to separate this employee from all other employees.

The structure of the decision tree will tell us why the employee is different from others.

In [None]:
local_surrogate_data = employee_data[:, :]
local_surrogate_data["anomaly_record"] = (local_surrogate_data["EmployeeNumber"] == 81).ifelse("Anomaly", "NotAnomaly")

In [None]:
local_surrogate_data["anomaly_record"].table()

In [None]:
local_surrogate_dt = H2ORandomForestEstimator(model_id = "local_level_surrogate_decision_tree.hex", 
                                              ntrees = 1, max_depth = 3,
                                              sample_rate = 1, mtries = len(myX))
local_surrogate_dt.train(training_frame = local_surrogate_data, x = myX, y = "anomaly_record")

We can visualize this decision tree to see how it split to isolate our anomaly record.

In [None]:
generateTreeImage(local_surrogate_dt, "./local_surrogate_decision_tree.png")

The anomalous employee is both older and has spent more years in the current role than most other employees. The employee falls in the bucket: `YearsInCurrentRole >= 15.5` and `Age >= 57.5`.  

Our simple decision tree has an AUC of 1, which shows that it separates the anomaly from the other employees perfectly.  This means that this employee is the only one in the data who has been in their current role for more than 15 years and is older than 57.

In [None]:
local_surrogate_dt.model_performance(local_surrogate_data).auc()
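
To see why an AUC of 1 implies perfect separation, recall that AUC equals the fraction of (positive, negative) pairs in which the positive record receives the higher score. The sketch below computes it that way for a single positive record. The function name and the toy scores are hypothetical, and this is not how H2O computes the metric internally.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# One anomaly (first record) and four other records; scores are made up.
labels = [True, False, False, False, False]

separated_scores = [0.9, 0.1, 0.0, 0.2, 0.0]
auc_separated = pairwise_auc(separated_scores, labels)

# Here another record shares the anomaly's leaf and therefore its score.
tied_scores = [0.2, 0.1, 0.0, 0.2, 0.0]
auc_tied = pairwise_auc(tied_scores, labels)
print(auc_separated, auc_tied)
```

With only one positive record, the AUC reaches 1 only when no other record ends up with a score as high as the anomaly's, which is the case when the anomaly sits alone in its leaf.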

In [None]:
anomalies[0, ["Age", "YearsInCurrentRole"]]
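
The uniqueness of the rule can also be checked directly by counting how many records satisfy both conditions. The sketch below does this on a handful of made-up records; only the two thresholds come from the tree above, and the helper name `matches_rule` is hypothetical.

```python
def matches_rule(record, rule):
    """True if the record meets every `feature >= minimum` condition in the rule."""
    return all(record[feature] >= minimum for feature, minimum in rule.items())

# Thresholds taken from the local surrogate tree.
rule = {"YearsInCurrentRole": 15.5, "Age": 57.5}

# Toy records (hypothetical), not rows from the employee data.
toy_records = [
    {"Age": 58, "YearsInCurrentRole": 16},
    {"Age": 59, "YearsInCurrentRole": 4},
    {"Age": 35, "YearsInCurrentRole": 17},
    {"Age": 41, "YearsInCurrentRole": 7},
]

matching = [record for record in toy_records if matches_rule(record, rule)]
print(len(matching), matching)
```

The same count could be taken on the H2O frame with a boolean row filter like the one used earlier to select `anomalies`; a result of one row would confirm that the rule singles out this employee.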

If we examine the distribution of these two features, we can see that the employee falls in the right tail of both.

In [None]:
employee_data["Age"].hist()

In [None]:
employee_data["YearsInCurrentRole"].hist()

In [None]:
h2o.cluster().shutdown()