In [1]:
%matplotlib inline


# A wild IsolationForest has appeared(!)

This document shows the results of my experiments with an IsolationForest tutorial. IsolationForest is an algorithm presented in the paper cited below [1], which is linked from the tutorial at [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html).

It builds each tree by randomly selecting a feature and then a random split value between the minimum and maximum of that feature, repeating the process on the resulting partitions. Points that end up with shorter path lengths, averaged over all the trees, are easier to isolate and can be considered anomalies.
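To make the path-length idea concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: it skips the subsampling and score normalization that the real algorithm uses, and the names `isolation_depth` and `average_depth`, as well as the toy data, are made up for illustration.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # pick a random feature, then a random split between its min and max
    feature = rng.randrange(len(point))
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        return depth  # all values identical, no split possible on this feature
    split = rng.uniform(lo, hi)
    # keep only the rows that fall on the same side of the split as `point`
    if point[feature] < split:
        same_side = [row for row in data if row[feature] < split]
    else:
        same_side = [row for row in data if row[feature] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# toy data (hypothetical): a tight 2-D cluster plus one far-away point
rng = random.Random(1)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

# the cluster point closest to the origin serves as a typical inlier
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)

outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth(inlier, data)
print("outlier:", outlier_depth, "inlier:", inlier_depth)
```

The far-away point should need noticeably fewer splits than the point in the middle of the cluster, which is the signal IsolationForest turns into an anomaly score.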

I don't really have a deep understanding of what is going on here, but hopefully I will next quarter after I take machine learning! 

[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on.

Potentially interesting readings (for myself!):

[Intro blog post to anomaly detection](https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/)

[LSTM Neural network for anomaly detection](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2015-56.pdf)

[RNN for anomaly detection](https://arxiv.org/abs/1702.00833)


In [12]:
print(__doc__)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from math import log

DEBUG = False

def printResults(yPred, cont, loss, X_train):

    num = 0
    value = 0

    for i in range(len(yPred)):

        if yPred[i] == -1:
            if DEBUG:
                print(i+1)
        else:
            num += 1
            value += X_train[i,1]

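    # NOTE: "Loss" is the log of the summed decision_function scores, not a
    # training loss; log() requires that sum to be positive.
    print("Proportion of contamination: " + str(float(cont)/100))
    print("Loss: " + str(log(sum(loss))))
    print("Number of samples labeled non-anomalies: " + str(num))
    print("Final AOV based on anomaly detection: " + str(float(value)/num) + "\n")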

    
def preprocess():
    dF = None

    # get data into a pandas dataframe
    with open("shop.csv", 'rb') as data:
        dF = pd.read_csv(data)

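    # turn payment_method into categorical integer codes; pd.factorize numbers the
    # categories in order of first appearance (e.g. cash = 0, credit = 1, debit = 2)
    # DataFrame.as_matrix was removed in pandas 1.0; to_numpy() is the replacement
    X_train = dF[['user_id','order_amount','total_items','payment_method']].to_numpy()
    payment_enc = pd.factorize(X_train[:,3])
    X_train[:,3] = payment_enc[0]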
    
    return X_train


def trainAndOutput(X_train):
    
    # find the best value for contamination parameter
    for cont in range(1,8,1):
        
        # fit the model
        clf = IsolationForest(contamination=float(cont)/100, n_estimators=1000, max_features=4)
        clf.fit(X_train)

        yPred = clf.predict(X_train)
        printResults(yPred, cont, clf.decision_function(X_train), X_train)



Automatically created module for IPython interactive environment


In [13]:
def main():
    X_train = preprocess()
    trainAndOutput(X_train)

main()
    

Proportion of contamination: 0.01
Loss: 4.838689427573604
Number of samples labeled non-anomalies: 4950
Final AOV based on anomaly detection: 389.72545454545457

Proportion of contamination: 0.02
Loss: 4.814770417469775
Number of samples labeled non-anomalies: 4900
Final AOV based on anomaly detection: 303.43795918367346

Proportion of contamination: 0.03
Loss: 4.795452841153745
Number of samples labeled non-anomalies: 4850
Final AOV based on anomaly detection: 293.65051546391754

Proportion of contamination: 0.04
Loss: 4.791997923126564
Number of samples labeled non-anomalies: 4800
Final AOV based on anomaly detection: 289.71208333333334

Proportion of contamination: 0.05
Loss: 4.851792297869462
Number of samples labeled non-anomalies: 4750
Final AOV based on anomaly detection: 286.14736842105265

Proportion of contamination: 0.06
Loss: 4.80017472865933
Number of samples labeled non-anomalies: 4700
Final AOV based on anomaly detection: 282.67936170212766

Proportion of contamination: 

As the output above shows, if we assume that about 70-80 outliers need to be removed from the dataset (as per the analysis in the other notebook), the AOV here is close to what we calculated in that notebook: 70-80 outliers falls between the run with contamination 0.01 (50 points removed, AOV of about 389.73) and the run with contamination 0.02 (100 points removed, AOV of about 303.44).

The method is randomized, so it would be a stretch to say that the output proves anything. What it does seem to support is that our AOV calculation makes sense: in the second run (contamination of 0.02), eliminating about 100 suspicious data points from the purchases leaves 4900 purchases and gives an AOV of around 303.
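The relationship between the contamination value, the number of points kept, and the resulting AOV can be sketched as below. This is only an illustration: `expected_inliers` and `aov_of_inliers` are hypothetical helpers, the 5000-row total is inferred from the counts printed above, and the four order amounts are invented toy values rather than rows from `shop.csv`.

```python
import numpy as np

def expected_inliers(n_samples, contamination):
    """Approximate number of points kept when a `contamination` fraction is flagged."""
    return round(n_samples * (1 - contamination))

def aov_of_inliers(order_amounts, labels):
    """Mean order amount over the points labeled +1 (non-anomalies), plus their count."""
    order_amounts = np.asarray(order_amounts, dtype=float)
    inliers = np.asarray(labels) == 1  # predict() returns +1 for inliers, -1 for anomalies
    return order_amounts[inliers].mean(), int(inliers.sum())

# counts for the first two runs, assuming 5000 rows in total
kept_1pct = expected_inliers(5000, 0.01)
kept_2pct = expected_inliers(5000, 0.02)
print(kept_1pct, kept_2pct)

# toy example (invented values): one huge order is flagged and dropped from the average
toy_amounts = [100, 120, 110, 50000]
toy_labels = [1, 1, 1, -1]
toy_aov, toy_kept = aov_of_inliers(toy_amounts, toy_labels)
print(toy_aov, toy_kept)
```

This is the same calculation `printResults` performs with its loop, written with a boolean mask instead.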