<a href="https://colab.research.google.com/github/aditipatil0711/SJSU_Masters_Assignments/blob/main/CMPE255_Data_Mining/Assignment%202/Pycaret/Anomaly_Detection/Anomaly_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **E. Anomaly Detection Using PyCaret**

Anomaly detection in machine learning identifies unusual patterns that do not conform to expected behavior. It's often used in fraud detection, system health monitoring, and fault detection. These outliers or anomalies can indicate significant events or issues.

In this example, we try to detect anomalies in a dataset from the KDD-19 paper.

Reference Used: https://pycaret.gitbook.io/docs/get-started/tutorials

Note: This notebook uses PyCaret's functional API.

# Step 1: Installation

We install PyCaret in the environment. It is recommended to restart the runtime once installation is done.

In [1]:
!pip install pycaret

Collecting pycaret
  Downloading pycaret-3.0.4-py3-none-any.whl (484 kB)
Collecting pyod>=1.0.8 (from pycaret)
  Downloading pyod-1.1.0.tar.gz (153 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.6.2-py2.py3-none-any.whl (81 kB)

# Step 2: Uploading Dataset and Reading Data

Now we import the required libraries and upload the dataset file to Colab, after which we read the data into a pandas DataFrame.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()

Saving bank-additional-full_normalised.csv to bank-additional-full_normalised.csv


In [3]:
data = pd.read_csv('bank-additional-full_normalised.csv')
data.head()

Unnamed: 0,age,job=housemaid,job=services,job=admin.,job=blue-collar,job=technician,job=retired,job=management,job=unemployed,job=self-employed,...,previous,poutcome=nonexistent,poutcome=failure,poutcome=success,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,class
0,0.209877,0,0,0,0,0,0,0,0,0,...,0.0,1,0,0,1.0,0.882307,0.376569,0.98073,1.0,0
1,0.296296,0,0,1,0,0,0,0,0,0,...,0.0,1,0,0,1.0,0.484412,0.615063,0.981183,1.0,0
2,0.246914,1,0,0,0,0,0,0,0,0,...,0.0,1,0,0,0.9375,0.698753,0.60251,0.957379,0.859735,0
3,0.160494,0,1,0,0,0,0,0,0,0,...,0.142857,0,1,0,0.333333,0.26968,0.192469,0.150759,0.512287,0
4,0.530864,0,0,0,1,0,0,0,0,0,...,0.0,1,0,0,0.333333,0.340608,0.154812,0.17479,0.512287,1


# Step 3: Initialize PyCaret

Here, I import the anomaly detection module from PyCaret. The setup() function initializes the experiment and builds the preprocessing pipeline.

In [4]:
from pycaret.anomaly import *
data_anomaly = setup(data, session_id=123, use_gpu=True)

Unnamed: 0,Description,Value
0,Session id,123
1,Original data shape,"(41188, 63)"
2,Transformed data shape,"(41188, 63)"
3,Numeric features,63
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,CPU Jobs,-1
9,Use GPU,True


# Create Model

The create_model function trains an unsupervised anomaly detection model, here an Isolation Forest ('iforest'). All available models can be listed with the models() function.

In [7]:
iforest = create_model('iforest')
iforest

Processing:   0%|          | 0/3 [00:00<?, ?it/s]

IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)
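
For intuition about why an Isolation Forest flags outliers, the following is a simplified one-dimensional sketch, not the PyOD implementation that PyCaret wraps. It repeatedly picks a random split between the current minimum and maximum and keeps the side containing the point of interest; points far from the rest tend to be isolated after fewer splits. The helper names (`isolation_depth`, `mean_depth`) and the toy values are assumptions made for illustration.

```python
import random

def isolation_depth(point, values, rng, max_depth=50):
    """Count random splits needed to isolate `point` within 1-D `values`."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates left, cannot split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point.
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_depth(point, values, n_trees=200, seed=123):
    """Average isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, values, rng) for _ in range(n_trees))
    return total / n_trees

# Toy data (hypothetical): eight clustered values and one obvious outlier.
values = [0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 5.0]
outlier_depth = mean_depth(5.0, values)
inlier_depth = mean_depth(0.50, values)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The real algorithm works on random subsamples and random features across many trees, but the same idea applies: a shorter average path suggests a more anomalous point.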

## Assign Model
The assign_model function assigns anomaly labels to the training data, given a trained model. It appends two columns: Anomaly (1 for outliers, 0 for inliers) and Anomaly_Score.

In [9]:
results = assign_model(iforest)
results

Unnamed: 0,age,job=housemaid,job=services,job=admin.,job=blue-collar,job=technician,job=retired,job=management,job=unemployed,job=self-employed,...,poutcome=failure,poutcome=success,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,class,Anomaly,Anomaly_Score
0,0.209877,0,0,0,0,0,0,0,0,0,...,0,0,1.000000,0.882307,0.376569,0.980730,1.000000,0,0,-0.075632
1,0.296296,0,0,1,0,0,0,0,0,0,...,0,0,1.000000,0.484412,0.615063,0.981183,1.000000,0,0,-0.112675
2,0.246914,1,0,0,0,0,0,0,0,0,...,0,0,0.937500,0.698753,0.602510,0.957379,0.859735,0,0,-0.101088
3,0.160494,0,1,0,0,0,0,0,0,0,...,1,0,0.333333,0.269680,0.192469,0.150759,0.512287,0,0,-0.046748
4,0.530864,0,0,0,1,0,0,0,0,0,...,0,0,0.333333,0.340608,0.154812,0.174790,0.512287,1,0,-0.055720
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,0.271605,0,0,0,1,0,0,0,0,0,...,0,0,0.333333,0.269680,0.192469,0.158694,0.512287,0,0,-0.048235
41184,0.333333,0,0,0,0,0,0,1,0,0,...,0,0,0.687500,0.389322,0.368201,0.767853,0.877883,1,0,-0.027155
41185,0.172840,0,0,0,0,1,0,0,0,0,...,0,0,0.937500,0.698753,0.602510,0.956926,0.859735,0,0,-0.086666
41186,0.148148,0,0,1,0,0,0,0,0,0,...,0,0,1.000000,0.882307,0.376569,0.980503,1.000000,0,0,-0.098253
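
The `contamination=0.05` value in the model sets the expected share of outliers. Below is a minimal sketch of how such a fraction could turn scores into labels, assuming higher scores mean more anomalous (consistent with the negative scores on the rows labeled 0 above) and that the highest-scoring share of rows is flagged. PyOD's actual thresholding uses a percentile cut-off and may treat ties differently; the function name and the scores here are hypothetical.

```python
def label_by_contamination(scores, contamination=0.05):
    """Flag the top `contamination` share of scores as anomalies (1), rest 0."""
    n_flagged = round(len(scores) * contamination)
    # Row indices sorted from highest (most anomalous) to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flagged])
    return [1 if i in flagged else 0 for i in range(len(scores))]

# Hypothetical scores for 20 rows; only one is clearly positive.
scores = [-0.08, -0.11, -0.10, -0.05, -0.06, -0.05, -0.03, -0.09, -0.10, 0.12,
          -0.07, -0.04, -0.09, -0.06, -0.08, -0.02, -0.11, -0.05, -0.07, -0.10]
labels = label_by_contamination(scores, contamination=0.05)
print(sum(labels), labels.index(1))  # number flagged, position of flagged row

# Rough number of rows a 5% share corresponds to in a 41,188-row dataset.
expected_flagged = round(41188 * 0.05)
print(expected_flagged)
```

Under these assumptions, roughly 5% of the rows in `results` would carry `Anomaly == 1`, regardless of how many rows are anomalous in reality; contamination is a modelling choice, not something learned from the data.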


# Evaluate Model

In [10]:
evaluate_model(iforest)



# Analyse Model

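The `plot_model` function visualizes a trained model; here we request a t-SNE plot. The run below was interrupted (`KeyboardInterrupt`), most likely because t-SNE is slow on a dataset of 41,188 rows.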

In [12]:
plot_model(iforest,plot='tsne')

KeyboardInterrupt: ignored

# Model Prediction

The predict_model function scores new data with the trained model. It adds two columns to the dataframe: Anomaly (1 for outliers, 0 for inliers) and Anomaly_Score. Because anomaly detection is unsupervised, there is no held-out test set, so the data must be passed explicitly; here a copy of the training data stands in for new data.

In [13]:
new_data = data.copy()
new_data.head()

Unnamed: 0,age,job=housemaid,job=services,job=admin.,job=blue-collar,job=technician,job=retired,job=management,job=unemployed,job=self-employed,...,previous,poutcome=nonexistent,poutcome=failure,poutcome=success,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,class
0,0.209877,0,0,0,0,0,0,0,0,0,...,0.0,1,0,0,1.0,0.882307,0.376569,0.98073,1.0,0
1,0.296296,0,0,1,0,0,0,0,0,0,...,0.0,1,0,0,1.0,0.484412,0.615063,0.981183,1.0,0
2,0.246914,1,0,0,0,0,0,0,0,0,...,0.0,1,0,0,0.9375,0.698753,0.60251,0.957379,0.859735,0
3,0.160494,0,1,0,0,0,0,0,0,0,...,0.142857,0,1,0,0.333333,0.26968,0.192469,0.150759,0.512287,0
4,0.530864,0,0,0,1,0,0,0,0,0,...,0.0,1,0,0,0.333333,0.340608,0.154812,0.17479,0.512287,1


In [14]:
predict = predict_model(iforest,data=new_data)


# Saving a Model

The save_model function saves the trained model together with its transformation pipeline, so both can be reused later.

In [15]:
save_model(iforest,'iforest_model')


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=/tmp/joblib),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['age', 'job=housemaid',
                                              'job=services', 'job=admin.',
                                              'job=blue-collar',
                                              'job=technician', 'job=retired',
                                              'job=management', 'job=unemployed',
                                              'job=self-employed', 'job=unknown',
                                              'job=entrepreneur', 'job=student',
                                              'marital=married',
                                              'marital=single',
                                              'marital=d...
                                              'default=0', 'default=unknown',
                                              'default=1', 'housing=0',
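
To reuse the saved model later, for example in a fresh session, a sketch along these lines could work. It assumes the `iforest_model.pkl` file written by `save_model` is in the working directory. The wrapper name `score_with_saved_model` is hypothetical, and PyCaret is imported inside the function so that defining it does not require PyCaret to be loaded.

```python
def score_with_saved_model(new_data, model_name="iforest_model"):
    """Load a saved PyCaret anomaly pipeline and score `new_data` (a DataFrame)."""
    # Imported lazily: PyCaret is only needed when the function is called.
    from pycaret.anomaly import load_model, predict_model

    pipeline = load_model(model_name)  # reads '<model_name>.pkl'
    # Returns new_data with anomaly label and score columns added.
    return predict_model(pipeline, data=new_data)
```

For example, `scored = score_with_saved_model(new_data)` would score the copy of the dataset created in the prediction step.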
                                    


# **To summarize:**

PyCaret streamlines the ML workflow by automating preprocessing, offering unified syntax for various ML tasks, and facilitating easy model comparison and deployment. It simplifies complex processes, making rapid prototyping and model development more accessible.