**Isolation Forest** is based on the simple idea that outliers are easier to isolate from the rest of the samples than normal data points. Normal points tend to lie inside a cluster, so it takes more splits to isolate them. The algorithm recursively partitions the data by randomly selecting a feature and then a random split value for that feature, until each data point is isolated in its own leaf node. The depth at which a point is isolated is then averaged across a large number of trees (the forest) and converted to an anomaly score. In scikit-learn, `predict` and `fit_predict` threshold this score and return -1 for an anomaly and 1 for a normal point.
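
To make the isolation idea concrete, here is a minimal sketch that counts how many random splits it takes to isolate a single point. It is a simplified illustration, not scikit-learn's implementation (which builds full trees on subsamples); the function name `isolation_depth` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def isolation_depth(X, point, rng, max_depth=50):
    """Number of random axis-aligned splits needed to isolate `point` within X."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        # Only features that still vary among the remaining points can be split.
        candidates = np.flatnonzero(X.max(axis=0) > X.min(axis=0))
        if candidates.size == 0:
            break  # remaining points are identical
        feature = rng.choice(candidates)
        split = rng.uniform(X[:, feature].min(), X[:, feature].max())
        # Keep only the side of the split that contains the point.
        if point[feature] < split:
            X = X[X[:, feature] < split]
        else:
            X = X[X[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(256, 2))  # synthetic "normal" points
outlier = np.array([8.0, 8.0])                 # synthetic anomaly far from the cluster
X_toy = np.vstack([cluster, outlier])
inlier = cluster[np.argmin(np.linalg.norm(cluster, axis=1))]  # point nearest the centre

n_trials = 200
outlier_depth = np.mean([isolation_depth(X_toy, outlier, rng) for _ in range(n_trials)])
inlier_depth = np.mean([isolation_depth(X_toy, inlier, rng) for _ in range(n_trials)])
print(f"mean depth: outlier={outlier_depth:.1f}, inlier={inlier_depth:.1f}")
```

The far-away point should need noticeably fewer splits on average than the point in the middle of the cluster, which is the property the anomaly score is built on.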

In [1]:
import pandas as pd
import numpy as np
from sklearn.covariance import EllipticEnvelope
import matplotlib.pyplot as plt

In [2]:
## We will use Kaggle ECG data for this example

import zipfile

# Download data
!wget https://github.com/asreddyIITB/ml/raw/main/ECG_Classification/Kaggle_ECG_dataset/MIT-BIH/mitbih_train.csv.zip

# Extract the CSV into the working directory; the context manager closes the archive
with zipfile.ZipFile("mitbih_train.csv.zip", "r") as zip_ref:
    zip_ref.extractall()

In [4]:
df_mitbih = pd.read_csv('mitbih_train.csv', header=None)
df_mitbih.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.977941 | 0.926471 | 0.681373 | 0.245098 | 0.154412 | 0.191176 | 0.151961 | 0.085784 | 0.058824 | 0.04902 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.960114 | 0.863248 | 0.461538 | 0.196581 | 0.094017 | 0.125356 | 0.099715 | 0.088319 | 0.074074 | 0.082621 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.659459 | 0.186486 | 0.07027 | 0.07027 | 0.059459 | 0.056757 | 0.043243 | 0.054054 | 0.045946 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.925414 | 0.665746 | 0.541436 | 0.276243 | 0.196133 | 0.077348 | 0.071823 | 0.060773 | 0.066298 | 0.058011 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.967136 | 1.0 | 0.830986 | 0.586854 | 0.356808 | 0.248826 | 0.14554 | 0.089202 | 0.117371 | 0.150235 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
## class labels are in last column
df_mitbih[187].value_counts()

0.0    72471
4.0     6431
2.0     5788
1.0     2223
3.0      641
Name: 187, dtype: int64

In [9]:
df_mitbih = df_mitbih.rename(columns = {187 : 'labels'})

In [10]:
df_mitbih.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.977941 | 0.926471 | 0.681373 | 0.245098 | 0.154412 | 0.191176 | 0.151961 | 0.085784 | 0.058824 | 0.04902 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.960114 | 0.863248 | 0.461538 | 0.196581 | 0.094017 | 0.125356 | 0.099715 | 0.088319 | 0.074074 | 0.082621 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.659459 | 0.186486 | 0.07027 | 0.07027 | 0.059459 | 0.056757 | 0.043243 | 0.054054 | 0.045946 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.925414 | 0.665746 | 0.541436 | 0.276243 | 0.196133 | 0.077348 | 0.071823 | 0.060773 | 0.066298 | 0.058011 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.967136 | 1.0 | 0.830986 | 0.586854 | 0.356808 | 0.248826 | 0.14554 | 0.089202 | 0.117371 | 0.150235 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [15]:
# `null_counts` was deprecated in pandas 1.2 in favour of `show_counts`
df_mitbih.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 87554 entries, 0 to 87553
Data columns (total 188 columns):
 #    Column  Non-Null Count  Dtype  
---   ------  --------------  -----  
 0    0       87554 non-null  float64
 1    1       87554 non-null  float64
 2    2       87554 non-null  float64
 3    3       87554 non-null  float64
 4    4       87554 non-null  float64
 5    5       87554 non-null  float64
 6    6       87554 non-null  float64
 7    7       87554 non-null  float64
 8    8       87554 non-null  float64
 9    9       87554 non-null  float64
 10   10      87554 non-null  float64
 11   11      87554 non-null  float64
 12   12      87554 non-null  float64
 13   13      87554 non-null  float64
 14   14      87554 non-null  float64
 15   15      87554 non-null  float64
 16   16      87554 non-null  float64
 17   17      87554 non-null  float64
 18   18      87554 non-null  float64
 19   19      87554 non-null  float64
 20   20      87554 non-null  float64
 21   21      87554 n


In [20]:
df_mitbih.isna().sum()

0         0
1         0
2         0
3         0
4         0
         ..
183       0
184       0
185       0
186       0
labels    0
Length: 188, dtype: int64
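
The label counts reported next contain only two classes (72471 and 15083), and 15083 equals the sum of the four non-zero classes seen earlier (6431 + 5788 + 2223 + 641). The cell that merged them is not shown, so the following is only a plausible sketch of that step; the helper name `binarize_labels` is hypothetical.

```python
import pandas as pd

def binarize_labels(labels: pd.Series) -> pd.Series:
    """Map class 0 to 0 (normal) and every other class to 1 (anomalous)."""
    return (labels > 0).astype(int)

# Toy example using the same five class codes as the dataset
toy_labels = pd.Series([0.0, 4.0, 2.0, 0.0, 1.0, 3.0])
print(binarize_labels(toy_labels).tolist())

# In the notebook this would be applied as (assumed, not shown in the original):
# df_mitbih['labels'] = binarize_labels(df_mitbih['labels'])
```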

In [22]:
df_mitbih['labels'].value_counts()

0    72471
1    15083
Name: labels, dtype: int64

In [29]:
feature_cols = [*range(187)]

In [27]:
# Labels are 0 (normal) and 1 (anomalous), as shown by value_counts() above
X_normal = df_mitbih[df_mitbih['labels']==0][feature_cols].values
X_anomalous = df_mitbih[df_mitbih['labels']==1][feature_cols].values

In [37]:
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100, max_samples='auto', 
                          contamination=0.20, max_features=1.0, 
                          bootstrap=False, n_jobs=-1, random_state=1)


# fit_predict returns 1 for inliers and -1 for outliers
pred = iforest.fit_predict(df_mitbih[feature_cols])

df_mitbih['pred_Label'] = np.where(pred == 1, 0, 1)
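
The hard -1/1 predictions come from thresholding a continuous score. The sketch below, run on synthetic stand-in data rather than the ECG set, shows how `score_samples`, `decision_function`, and `offset_` relate to each other in scikit-learn; the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Synthetic stand-in data: a dense cluster plus scattered points
X_demo = np.vstack([rng.normal(0, 1, size=(400, 5)),
                    rng.uniform(-6, 6, size=(100, 5))])

demo_forest = IsolationForest(n_estimators=100, contamination=0.20, random_state=1)
demo_pred = demo_forest.fit_predict(X_demo)        # 1 = inlier, -1 = outlier

raw_scores = demo_forest.score_samples(X_demo)     # lower = more anomalous
decision = demo_forest.decision_function(X_demo)   # raw_scores shifted by offset_

print("raw score range:", raw_scores.min(), raw_scores.max())
print("fraction flagged:", np.mean(demo_pred == -1))
```

A point is flagged as an outlier exactly when its `decision_function` value is negative, and `contamination` sets the offset so that roughly that fraction of the training data falls below zero.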

## *Metrics*

In [50]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns

from sklearn.metrics import classification_report
print(classification_report(df_mitbih['labels'], df_mitbih['pred_Label']))

              precision    recall  f1-score   support

           0       0.88      0.85      0.87     72471
           1       0.40      0.46      0.43     15083

    accuracy                           0.79     87554
   macro avg       0.64      0.66      0.65     87554
weighted avg       0.80      0.79      0.79     87554
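
The report above depends on the threshold implied by `contamination`. As a complement, threshold-free metrics can be computed from the continuous scores. This is a sketch on synthetic data, so its numbers say nothing about the ECG results; the names `X_eval`, `y_eval`, and `eval_forest` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(2)
# Synthetic stand-in data: 800 normal points (label 0) and 200 anomalies (label 1)
X_eval = np.vstack([rng.normal(0, 1, size=(800, 5)),
                    rng.uniform(-6, 6, size=(200, 5))])
y_eval = np.r_[np.zeros(800, dtype=int), np.ones(200, dtype=int)]

eval_forest = IsolationForest(n_estimators=100, random_state=1).fit(X_eval)
# Negate so that a higher value means "more anomalous", matching label 1
anomaly_score = -eval_forest.score_samples(X_eval)

auc = roc_auc_score(y_eval, anomaly_score)
ap = average_precision_score(y_eval, anomaly_score)
print("ROC AUC:", auc)
print("Average precision:", ap)
```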




## Notes

1.   We can improve the numbers above by fine-tuning the hyperparameters.
2.   In our notebook on LightGBM we revisit this problem to improve our results.
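
One way to approach the tuning mentioned in note 1 is a simple sweep over `contamination`, scored with the F1 of the anomalous class. The sketch below uses synthetic data and an arbitrary grid of values, both assumptions for illustration; on real data the selection should be done on a held-out split.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(3)
# Synthetic stand-in data: 800 normal points (label 0) and 200 anomalies (label 1)
X_tune = np.vstack([rng.normal(0, 1, size=(800, 5)),
                    rng.uniform(-6, 6, size=(200, 5))])
y_tune = np.r_[np.zeros(800, dtype=int), np.ones(200, dtype=int)]

results = {}
for contamination in (0.05, 0.10, 0.20, 0.30):
    forest = IsolationForest(n_estimators=100, contamination=contamination,
                             random_state=1)
    # Map scikit-learn's 1/-1 output to the notebook's 0/1 label convention
    pred_label = np.where(forest.fit_predict(X_tune) == 1, 0, 1)
    results[contamination] = f1_score(y_tune, pred_label)

best = max(results, key=results.get)
print(results)
print("best contamination on this toy data:", best)
```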

