Last updated: 15 Feb 2023

# PyCaret Anomaly Detection

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that shortens the experiment cycle considerably and makes you more productive.


In [None]:
pip install pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret)
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-2.0.2.tar.gz (165 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting deprecation>=2.1.0 (from pycaret)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from pycaret

In [None]:
pip install pycaret[full]

Collecting shap~=0.44.0 (from pycaret[full])
  Downloading shap-0.44.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Collecting interpret>=0.2.7 (from pycaret[full])
  Downloading interpret-0.6.3-py3-none-any.whl.metadata (1.1 kB)
Collecting umap-learn>=0.5.2 (from pycaret[full])
  Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)
Collecting ydata-profiling>=4.3.1 (from pycaret[full])
  Downloading ydata_profiling-4.10.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting explainerdashboard>=0.3.8 (from pycaret[full])
  Downloading explainerdashboard-0.4.7-py3-none-any.whl.metadata (3.8 kB)
Collecting fairlearn==0.7.0 (from pycaret[full])
  Downloading fairlearn-0.7.0-py3-none-any.whl.metadata (7.3 kB)
Collecting kmodes>=0.11.1 (from pycaret[full])
  Downloading kmodes-0.12.2-py2.py3-none-any.whl.metadata (8.1 kB)
Collecting statsforecast<1.6.0,>=0.5.5 (from pycaret[full])
  Downloading statsforecast-1.5

In [None]:
# check installed version
import pycaret
from pycaret.datasets import get_data
print(pycaret.__version__)
# list the sample datasets that ship with pycaret
print(get_data('index'))

| | Dataset | Data Types | Default Task | Target Variable 1 | Target Variable 2 | # Instances | # Attributes | Missing Values |
|---|---|---|---|---|---|---|---|---|
| 0 | anomaly | Multivariate | Anomaly Detection | | | 1000 | 10 | N |
| 1 | france | Multivariate | Association Rule Mining | InvoiceNo | Description | 8557 | 8 | N |
| 2 | germany | Multivariate | Association Rule Mining | InvoiceNo | Description | 9495 | 8 | N |
| 3 | bank | Multivariate | Classification (Binary) | deposit | | 45211 | 17 | N |
| 4 | blood | Multivariate | Classification (Binary) | Class | | 748 | 5 | N |
| 5 | cancer | Multivariate | Classification (Binary) | Class | | 683 | 10 | N |
| 6 | credit | Multivariate | Classification (Binary) | default | | 24000 | 24 | N |
| 7 | diabetes | Multivariate | Classification (Binary) | Class variable | | 768 | 9 | N |
| 8 | electrical_grid | Multivariate | Classification (Binary) | stabf | | 10000 | 14 | N |
| 9 | employee | Multivariate | Classification (Binary) | left | | 14999 | 10 | N |


                             Dataset    Data Types  \
0                            anomaly  Multivariate   
1                             france  Multivariate   
2                            germany  Multivariate   
3                               bank  Multivariate   
4                              blood  Multivariate   
5                             cancer  Multivariate   
6                             credit  Multivariate   
7                           diabetes  Multivariate   
8                    electrical_grid  Multivariate   
9                           employee  Multivariate   
10                             heart  Multivariate   
11                     heart_disease  Multivariate   
12                         hepatitis  Multivariate   
13                            income  Multivariate   
14                             juice  Multivariate   
15                               nba  Multivariate   
16                              wine  Multivariate   
17                         t

# Quick start

PyCaret’s Anomaly Detection Module is an unsupervised machine learning module that is used for identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.

Typically, the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, a medical problem, or an error.

PyCaret's Anomaly Detection module provides several pre-processing features to prepare the data for modeling through the `setup` function. It has over 10 ready-to-use algorithms and a few plots to analyze the performance of trained models.

A typical workflow in PyCaret's unsupervised module consists of the following 6 steps, in this order:

**Setup** ➡️ **Create Model** ➡️ **Assign Labels** ➡️ **Analyze Model** ➡️ **Prediction** ➡️ **Save Model**

In [None]:
# loading sample dataset from pycaret dataset module
from pycaret.datasets import get_data
data = get_data('hepatitis')
print(len(data))

Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY
0,0,30,2,1.0,2,2,2,2,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,0,50,1,1.0,2,1,2,2,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,0,78,1,2.0,2,1,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,0,31,1,,1,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,0,34,1,2.0,2,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


154


## Setup
This function initializes the training environment and creates the transformation pipeline. It must be called before any other function. Its only mandatory parameter is `data`; all other parameters are optional.

In [None]:
# import pycaret anomaly and init setup
from pycaret.anomaly import *
s = setup(data, session_id = 123)

| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Original data shape | (154, 20) |
| 2 | Transformed data shape | (154, 20) |
| 3 | Numeric features | 20 |
| 4 | Rows with missing values | 48.1% |
| 5 | Preprocess | True |
| 6 | Imputation type | simple |
| 7 | Numeric imputation | mean |
| 8 | Categorical imputation | mode |
| 9 | CPU Jobs | -1 |


Once `setup` has executed successfully, it displays an information grid with experiment-level details.

- **Session id:** A pseudo-random number distributed as a seed in all functions for later reproducibility. If no `session_id` is passed, a random number is automatically generated that is distributed to all functions.
- **Original data shape:** Shape of the original data prior to any transformations.
- **Transformed data shape:** Shape of the data after transformations.
- **Numeric features:** The number of features considered as numerical.
- **Categorical features:** The number of features considered as categorical.
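
The `Numeric imputation: mean` row in the grid means that missing numeric values are replaced by the column mean before modeling. A minimal plain-Python sketch of that idea follows. The helper `impute_mean` and the sample values are made up for illustration; PyCaret's pipeline uses a scikit-learn `SimpleImputer` rather than this code.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values (illustrative helper)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

# Toy column with two missing entries (values are invented for this example)
protime = [80.0, None, 50.0, None, 50.0]
imputed = impute_mean(protime)
print(imputed)  # [80.0, 60.0, 50.0, 60.0, 50.0]
```

Every missing entry receives the same fill value, which is why imputed columns often show one repeated number.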

PyCaret has two sets of APIs that you can work with: (1) the Functional API (as seen above) and (2) the Object Oriented API.

With the Object Oriented API, instead of executing functions directly, you import a class and call its methods.

In [None]:
# import AnomalyExperiment and init the class
from pycaret.anomaly import AnomalyExperiment
exp = AnomalyExperiment()

In [None]:
# check the type of exp
type(exp)

In [None]:
# init setup on exp
exp.setup(data, session_id = 123)

| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Original data shape | (154, 20) |
| 2 | Transformed data shape | (154, 20) |
| 3 | Numeric features | 20 |
| 4 | Rows with missing values | 48.1% |
| 5 | Preprocess | True |
| 6 | Imputation type | simple |
| 7 | Numeric imputation | mean |
| 8 | Categorical imputation | mode |
| 9 | CPU Jobs | -1 |


<pycaret.anomaly.oop.AnomalyExperiment at 0x7e5cf30b1fc0>

You can use either of the two methods, Functional or OOP, and even switch back and forth between the two sets of APIs. The choice of method does not affect the results and has been tested for consistency.

## Create Model

This function trains an unsupervised anomaly detection model. All the available models can be listed using the `models` function.

In [None]:
# train iforest model
iforest = create_model('iforest')
iforest


IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)
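
The `contamination=0.05` value in the printed model is the expected share of outliers in the data. In PyOD-style detectors this fraction is typically turned into a score threshold at the `100 * (1 - contamination)` percentile of the training scores, and rows scoring above it are labeled `1`. The sketch below illustrates that rule in plain Python with made-up scores; the `percentile` and `label_by_contamination` helpers are illustrative and are not part of PyCaret or PyOD.

```python
def percentile(values, q):
    """Linear-interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def label_by_contamination(scores, contamination):
    """Label as 1 the rows whose score exceeds the (1 - contamination) percentile."""
    threshold = percentile(scores, 100 * (1 - contamination))
    return threshold, [1 if s > threshold else 0 for s in scores]

# 20 made-up anomaly scores: higher means more anomalous
scores = [float(i) for i in range(20)]
threshold, labels = label_by_contamination(scores, contamination=0.05)
print(sum(labels))  # 1, i.e. 5% of the 20 rows
```

Raising `contamination` lowers the threshold and flags more rows, so the value should reflect how many outliers you actually expect.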

In [None]:
# to check all the available models
models()

| ID | Name | Reference |
|---|---|---|
| abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
| cluster | Clustering-Based Local Outlier | pycaret.internal.patches.pyod.CBLOFForceToDouble |
| cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
| iforest | Isolation Forest | pyod.models.iforest.IForest |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
| knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
| lof | Local Outlier Factor | pyod.models.lof.LOF |
| svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
| pca | Principal Component Analysis | pyod.models.pca.PCA |
| mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |


## Assign Model
This function assigns anomaly labels to the training data, given a trained model.

In [None]:
iforest_anomalies = assign_model(iforest)
iforest_anomalies

Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,...,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY,Anomaly,Anomaly_Score
0,0,30,2,1.0,2,2,2,2,1.0,2.0,...,2.0,2.0,1.0,85.0,18.0,4.0,,1,0,-0.055963
1,0,50,1,1.0,2,1,2,2,1.0,2.0,...,2.0,2.0,0.9,135.0,42.0,3.5,,1,0,-0.115683
2,0,78,1,2.0,2,1,2,2,2.0,2.0,...,2.0,2.0,0.7,96.0,32.0,4.0,,1,0,-0.161088
3,0,31,1,,1,2,2,2,2.0,2.0,...,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1,0,-0.117720
4,0,34,1,2.0,2,2,2,2,2.0,2.0,...,2.0,2.0,1.0,,200.0,4.0,,1,0,-0.187318
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,1,46,1,2.0,2,1,1,1,2.0,2.0,...,1.0,1.0,7.6,,242.0,3.3,50.0,2,1,0.017026
150,0,44,1,2.0,2,1,2,2,2.0,1.0,...,2.0,2.0,0.9,126.0,142.0,4.3,,2,0,-0.162179
151,0,61,1,1.0,2,1,1,2,1.0,1.0,...,2.0,2.0,0.8,75.0,20.0,4.1,,2,0,-0.100460
152,0,53,2,1.0,2,1,2,2,2.0,2.0,...,2.0,1.0,1.5,81.0,19.0,4.1,48.0,2,0,-0.016313
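
Once labels are assigned, a typical next step is to pull out the flagged rows and rank all rows by score. The sketch below does this in plain Python on a handful of (index, `Anomaly`, `Anomaly_Score`) values copied from the output above. With the real data frame, the equivalent pandas filter would be `iforest_anomalies[iforest_anomalies['Anomaly'] == 1]`.

```python
# (row index, Anomaly, Anomaly_Score) taken from the assign_model output above
rows = [
    (0, 0, -0.055963),
    (1, 0, -0.115683),
    (2, 0, -0.161088),
    (149, 1, 0.017026),
    (150, 0, -0.162179),
    (152, 0, -0.016313),
]

# Keep only the rows flagged as anomalies
flagged = [r for r in rows if r[1] == 1]

# Rank all rows by score, highest first (the flagged row has the highest score here)
ranked = sorted(rows, key=lambda r: r[2], reverse=True)

print(flagged)                  # [(149, 1, 0.017026)]
print([r[0] for r in ranked])   # [149, 152, 0, 1, 2, 150]
```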


## Analyze Model

You can use the `plot_model` function to analyze a trained model. Because anomaly detection is unsupervised, there is no held-out test set; the plots are drawn from the data passed to `setup`. It may require re-training the model in certain cases.

In [None]:
# tsne plot anomalies
plot_model(iforest, plot = 'tsne')

In [None]:
# check docstring to see available plots
# help(plot_model)

An alternative to the `plot_model` function is `evaluate_model`. It can only be used in a notebook, since it relies on ipywidgets.

In [None]:
evaluate_model(iforest)


## Prediction
The `predict_model` function returns the `Anomaly` label and the `Anomaly_Score` as new columns in the input dataframe. This step may or may not be needed, depending on the use case. Sometimes anomaly detection models are trained for analysis only, and the user is interested only in the labels assigned to the training dataset; that can be done with the `assign_model` function. `predict_model` is useful when you want to obtain anomaly labels on unseen data (i.e. data that was not used to train the model).

In [None]:
# predict on the same data that was used for training
iforest_pred = predict_model(iforest, data=data)
iforest_pred

Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,...,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY,Anomaly,Anomaly_Score
0,0.0,30.0,2.0,1.000000,2.0,2.0,2.0,2.0,1.0,2.0,...,2.0,2.0,1.0,85.000000,18.0,4.0,61.852273,1.0,0,-0.055963
1,0.0,50.0,1.0,1.000000,2.0,1.0,2.0,2.0,1.0,2.0,...,2.0,2.0,0.9,135.000000,42.0,3.5,61.852273,1.0,0,-0.115683
2,0.0,78.0,1.0,2.000000,2.0,1.0,2.0,2.0,2.0,2.0,...,2.0,2.0,0.7,96.000000,32.0,4.0,61.852273,1.0,0,-0.161088
3,0.0,31.0,1.0,1.509804,1.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,0.7,46.000000,52.0,4.0,80.000000,1.0,0,-0.117720
4,0.0,34.0,1.0,2.000000,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,1.0,105.325397,200.0,4.0,61.852273,1.0,0,-0.187318
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,1.0,46.0,1.0,2.000000,2.0,1.0,1.0,1.0,2.0,2.0,...,1.0,1.0,7.6,105.325397,242.0,3.3,50.000000,2.0,1,0.017026
150,0.0,44.0,1.0,2.000000,2.0,1.0,2.0,2.0,2.0,1.0,...,2.0,2.0,0.9,126.000000,142.0,4.3,61.852273,2.0,0,-0.162179
151,0.0,61.0,1.0,1.000000,2.0,1.0,1.0,2.0,1.0,1.0,...,2.0,2.0,0.8,75.000000,20.0,4.1,61.852273,2.0,0,-0.100460
152,0.0,53.0,2.0,1.000000,2.0,1.0,2.0,2.0,2.0,2.0,...,2.0,1.0,1.5,81.000000,19.0,4.1,48.000000,2.0,0,-0.016313


The same function works for predicting labels on an unseen dataset. The new data must contain the same columns that were passed to `setup`; in this walkthrough that includes `Class`, which `setup` treated as a numeric feature because the full data frame was passed in.
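
When scoring unseen data, the key idea is that the decision rule is learned once during training and then reused for new rows. The sketch below illustrates this with a deliberately simple z-score detector, which is a stand-in and not the Isolation Forest used above. The class `ZScoreDetector` and all sample values are hypothetical.

```python
from statistics import mean, pstdev

class ZScoreDetector:
    """Toy univariate detector: flags values far from the training mean."""

    def __init__(self, max_z=3.0):
        self.max_z = max_z

    def fit(self, values):
        # Statistics are learned from the training data only
        self.mu = mean(values)
        self.sigma = pstdev(values)
        return self

    def predict(self, values):
        # New rows are scored against the stored training statistics
        scores = [abs(v - self.mu) / self.sigma for v in values]
        labels = [1 if s > self.max_z else 0 for s in scores]
        return labels, scores

train = [48.0, 50.0, 52.0, 49.0, 51.0]
detector = ZScoreDetector().fit(train)

# Unseen values that were not part of training
unseen = [50.5, 80.0]
labels, scores = detector.predict(unseen)
print(labels)  # [0, 1]
```

Nothing is re-fitted in `predict`, so the same unseen row always receives the same label for a given trained model.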

## Save Model

Finally, you can save the entire pipeline to disk for later use with PyCaret's `save_model` function.

In [None]:
# save pipeline
save_model(iforest, 'iforest_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['Class', 'AGE', 'SEX', 'STEROID',
                                              'ANTIVIRALS', 'FATIGUE', 'MALAISE',
                                              'ANOREXIA', 'LIVER BIG',
                                              'LIVER FIRM', 'SPLEEN PALPABLE',
                                              'SPIDERS', 'ASCITES', 'VARICES',
                                              'BILIRUBIN', 'ALK PHOSPHATE',
                                              'SGOT', 'ALBUMIN', 'PROTIME',
                                              'HISTOLOGY'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('trained_model',
                  IForest(behavio

In [None]:
# load pipeline
loaded_iforest_pipeline = load_model('iforest_pipeline')
loaded_iforest_pipeline

Transformation Pipeline and Model Successfully Loaded
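
`save_model` writes the pipeline to a `.pkl` file on disk, and `load_model` reads it back. The general save-and-reload round trip can be sketched with the standard-library `pickle` module. The `toy_pipeline` dict and the file name below are hypothetical, and this is an illustration of the idea rather than PyCaret's own serialization code.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted pipeline: a plain dict of learned parameters (hypothetical)
toy_pipeline = {"imputer_fill": 60.0, "threshold": 18.05}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipeline.pkl"

    # Save: serialize the object to disk
    with open(path, "wb") as f:
        pickle.dump(toy_pipeline, f)

    # Load: read it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_pipeline)  # True
```

Pickle files can execute code when loaded, so only load pipelines from sources you trust.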
