# Anomaly Detection in Machine Data with CrateDB and PyCaret

In this Jupyter Notebook, we combine CrateDB and PyCaret to detect anomalies in machine data, which helps to identify potential failures or inefficiencies in technical systems early. CrateDB handles storage and retrieval of large time-series datasets, while PyCaret provides a low-code interface to machine learning models.

The tutorial shows how to retrieve data from CrateDB and how to apply PyCaret's anomaly detection algorithms to it.

## Step 1: Install required dependencies

If they are not already available, install the CrateDB [SQLAlchemy] dialect, [PyCaret], and pandas.

[SQLAlchemy]: https://cratedb.com/docs/python/en/latest/sqlalchemy.html
[Pycaret]: https://github.com/pycaret/pycaret

In [4]:
%pip install 'crate[sqlalchemy]' pycaret pandas

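Before moving on, it can help to confirm that the packages are visible to the notebook kernel. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example, and the distribution names are the ones passed to `pip` above plus `sqlalchemy`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as used in the pip command above.
for name, ver in installed_versions(["crate", "sqlalchemy", "pycaret", "pandas"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a package is reported as not installed, re-run the install cell and restart the kernel.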

## Step 2: Importing Libraries
In the first cell of your Jupyter Notebook, import the required libraries:

In [1]:
# Data manipulation
import pandas as pd

# CrateDB
import sqlalchemy as sa

# Environment variables
import os

# PyCaret for anomaly detection
from pycaret.anomaly import assign_model, create_model, models, setup

# Graph plotting
import plotly.graph_objects as go
import plotly.express as px

## Step 3: Import Machine Data into CrateDB

In this step, we create the table and populate it with the dataset available at <https://media.githubusercontent.com/media/crate/cratedb-datasets/main/timeseries/nab-machine-failure.csv>. If you are using a CrateDB Cloud cluster, you can use the [URL import] available in the console; otherwise, use the `COPY FROM` statement as demonstrated below. You can run it in the console of the Admin UI or with [Crash].

[URL import]: https://cratedb.com/docs/cloud/en/latest/reference/overview.html#import-from-url
[Crash]: https://cratedb.com/docs/crate/crash/en/latest/getting-started.html

In [2]:
CREATE TABLE machine_data (
   "timestamp" TIMESTAMP,
   "value" DOUBLE PRECISION
);

COPY machine_data FROM 'https://media.githubusercontent.com/media/crate/cratedb-datasets/main/timeseries/nab-machine-failure.csv';

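If you prefer to stay inside the notebook, the same statements can be sent from Python instead of the console. The sketch below assumes the SQLAlchemy dialect from Step 1 and a reachable CrateDB instance; `split_statements` and `load_machine_data` are hypothetical helper names, and the naive splitting only works as long as no semicolon appears inside a string literal.

```python
SETUP_SQL = """
CREATE TABLE machine_data (
   "timestamp" TIMESTAMP,
   "value" DOUBLE PRECISION
);
COPY machine_data FROM 'https://media.githubusercontent.com/media/crate/cratedb-datasets/main/timeseries/nab-machine-failure.csv';
"""


def split_statements(script):
    """Naively split a SQL script on ';' (assumes no ';' inside literals)."""
    return [part.strip() for part in script.split(";") if part.strip()]


def load_machine_data(connection_string="crate://localhost:4200"):
    """Run each statement separately; needs a running CrateDB instance."""
    # Imported here so the splitting helper works without a database driver.
    import sqlalchemy as sa

    engine = sa.create_engine(connection_string)
    with engine.connect() as conn:
        for statement in split_statements(SETUP_SQL):
            conn.execute(sa.text(statement))
        # Make the imported rows visible to subsequent queries.
        conn.execute(sa.text("REFRESH TABLE machine_data"))
```

Calling `load_machine_data()` once creates and fills the table; calling it a second time fails at `CREATE TABLE` because the table already exists.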

## Step 4: Query Data into a DataFrame

Once the data is loaded into CrateDB, we can query it and store the result in a DataFrame.
Instead of selecting all rows from the `machine_data` table as is, we use the `DATE_BIN` function to group the readings into 5-minute buckets and calculate the average value per bucket. The models we are going to use expect evenly spaced observation times, and this aggregation in CrateDB is a convenient way to get them.

In [14]:
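# Connection string for a remote cluster (for example, CrateDB Cloud):
#   crate://<USER>:<PASSWORD>@<HOST>
# Set CRATEDB_CONNECTION_STRING to override the local default below.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://localhost:4200",
)

# Enable SQL logging by setting the DEBUG environment variable to any non-empty value.
engine = sa.create_engine(CONNECTION_STRING, echo=bool(os.environ.get("DEBUG")))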

query = """
    SELECT
        DATE_BIN('5 min'::INTERVAL, "timestamp", 0) AS timestamp,
        AVG(value) AS avg_value
    FROM machine_data
    GROUP BY timestamp
    ORDER BY timestamp ASC;
"""
with engine.connect() as conn:
    result = conn.execute(sa.text(query))
    columns = result.keys() # Extract column names
    df = pd.DataFrame(result.fetchall(), columns=columns)

df = df.set_index('timestamp')
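
To make the bucketing concrete: with an origin of `0`, `DATE_BIN` floors each timestamp to the start of its 5-minute interval counted from the Unix epoch, and `AVG` then collapses each bucket to one value. The following pure-Python sketch mimics that on epoch-millisecond timestamps; `bin_and_average` is an illustrative name, not part of CrateDB or pandas, and the readings are made up.

```python
from collections import defaultdict
from statistics import mean

BUCKET_MS = 5 * 60 * 1000  # 5 minutes in milliseconds


def bin_and_average(readings, bucket_ms=BUCKET_MS):
    """Average (epoch_ms, value) readings per bucket, keyed by bucket start."""
    buckets = defaultdict(list)
    for epoch_ms, value in readings:
        # Floor the timestamp to the start of its bucket (origin = 0).
        buckets[epoch_ms - epoch_ms % bucket_ms].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Two readings fall into the first bucket, one into the second.
readings = [(0, 1.0), (60_000, 3.0), (300_000, 5.0)]
print(bin_and_average(readings))
```

Doing this aggregation in the database means only one row per bucket is transferred to the notebook.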

In [16]:
df.describe()

|       | temperature |
|-------|-------------|
| count | 22683.0     |
| mean  | 85.922461   |
| std   | 13.749422   |
| min   | 2.084721    |
| 25%   | 83.074742   |
| 50%   | 89.403336   |
| 75%   | 94.016255   |
| max   | 108.510544  |
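
One caveat worth checking: `GROUP BY` only returns buckets that contain at least one reading, so the aggregated series can still have holes. The following is a small sketch for spotting them, assuming the index holds bucket start times as epoch milliseconds; `find_gaps` is a hypothetical helper.

```python
def find_gaps(bucket_starts, bucket_ms=5 * 60 * 1000):
    """Return (previous, next) pairs that are more than one bucket apart."""
    ordered = sorted(bucket_starts)
    return [
        (prev, nxt)
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt - prev > bucket_ms
    ]


# The bucket starting at 600000 is missing in this made-up example.
print(find_gaps([0, 300_000, 900_000]))
```

Under that assumption, `find_gaps(df.index.tolist())` returns an empty list when the series is evenly spaced.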


## Step 5: Defining the model

In this step, we use the PyCaret library to identify anomalies in the dataset. First, call the `setup()` function to initialize the training environment and create the transformation pipeline. Pass a `session_id` so that the experiment can be reproduced later with the same value. Then, call `models()` to list all available models. In this exercise, we use Isolation Forest (`iforest`).

In [15]:
s = setup(df, session_id = 123)
models()

|   | Description            | Value      |
|---|------------------------|------------|
| 0 | Session id             | 123        |
| 1 | Original data shape    | (22683, 1) |
| 2 | Transformed data shape | (22683, 1) |
| 3 | Numeric features       | 1          |
| 4 | Preprocess             | True       |
| 5 | Imputation type        | simple     |
| 6 | Numeric imputation     | mean       |
| 7 | Categorical imputation | mode       |
| 8 | CPU Jobs               | -1         |
| 9 | Use GPU                | False      |


| ID        | Name                              | Reference                                        |
|-----------|-----------------------------------|--------------------------------------------------|
| abod      | Angle-base Outlier Detection      | pyod.models.abod.ABOD                            |
| cluster   | Clustering-Based Local Outlier    | pycaret.internal.patches.pyod.CBLOFForceToDouble |
| cof       | Connectivity-Based Local Outlier  | pyod.models.cof.COF                              |
| iforest   | Isolation Forest                  | pyod.models.iforest.IForest                      |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS                            |
| knn       | K-Nearest Neighbors Detector      | pyod.models.knn.KNN                              |
| lof       | Local Outlier Factor              | pyod.models.lof.LOF                              |
| svm       | One-class SVM detector            | pyod.models.ocsvm.OCSVM                          |
| pca       | Principal Component Analysis      | pyod.models.pca.PCA                              |
| mcd       | Minimum Covariance Determinant    | pyod.models.mcd.MCD                              |
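
The idea behind Isolation Forest is that unusual readings can be separated from the rest with fewer random splits than typical ones. The toy sketch below illustrates this on a handful of made-up one-dimensional values; it is a deliberately simplified stand-in, not the `pyod` implementation that PyCaret wraps, and all names and numbers in it are invented for illustration.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        low, high = min(current), max(current)
        if low == high:
            break  # only duplicates left, cannot split further
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def average_depth(values, target, n_trees=200, seed=123):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees


# Made-up readings: a cluster of typical values plus one extreme value.
values = [85.0, 86.5, 88.0, 89.4, 90.1, 91.3, 92.7, 94.0, 2.1]
print("outlier:", average_depth(values, 2.1))
print("typical:", average_depth(values, 89.4))
```

The extreme value needs clearly fewer splits on average, which is the signal an isolation-based model turns into an anomaly score.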


## Step 6: Running the model
The `create_model()` function trains an unsupervised anomaly detection model. The `assign_model()` function then uses the trained model to assign anomaly labels to the training data. Below is a sample of the readings that the model flagged as anomalies (`Anomaly == 1`).

In [20]:
iforest = create_model('iforest')
iforest_results = assign_model(iforest)
iforest_results[iforest_results['Anomaly'] == 1].head()

| timestamp     | avg_value | Anomaly | Anomaly_Score |
|---------------|-----------|---------|---------------|
| 1386661800000 | 51.768349 | 1       | 0.003257      |
| 1386662100000 | 51.950832 | 1       | 0.002692      |
| 1386662400000 | 51.854042 | 1       | 0.004543      |
| 1386662700000 | 51.425903 | 1       | 0.007701      |
| 1386663000000 | 51.459923 | 1       | 0.00729       |
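
The `Anomaly` column is a thresholded version of `Anomaly_Score`: the readings with the highest scores are labeled `1`. `create_model()` takes a `fraction` parameter (0.05 by default) for the expected share of outliers. The sketch below shows the general idea with a simple top-share rule on made-up scores; the exact thresholding inside PyCaret and `pyod` may differ in details such as tie handling, and `label_top_fraction` is a hypothetical helper.

```python
def label_top_fraction(scores, fraction=0.05):
    """Label the highest-scoring `fraction` of readings as 1, the rest as 0."""
    n_outliers = round(len(scores) * fraction)
    # Indices of the readings, highest score first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]


# Made-up scores: higher means more anomalous.
scores = [-0.12, -0.08, 0.004, -0.10, -0.09, 0.007, -0.11, -0.07, -0.13, -0.06]
print(label_top_fraction(scores, fraction=0.2))
```

Raising `fraction` flags more readings and lowering it flags fewer, so it is worth adjusting when the expected anomaly rate is known.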


## Step 7: Plotting the results
A better way to see the anomaly readings is to plot all the readings and highlight the anomalies. Below, we use the Plotly library to do that. The red spots correspond to the anomalies flagged by the model.

In [21]:
# plot value on y-axis and date on x-axis
fig = px.line(iforest_results, x=iforest_results.index, y="avg_value", title='MACHINE DATA - UNSUPERVISED ANOMALY DETECTION', template = 'plotly_dark')

# create list of outlier_dates
outlier_dates = iforest_results[iforest_results['Anomaly'] == 1].index

# obtain y values of the anomalies to plot
y_values = iforest_results.loc[outlier_dates, 'avg_value']

fig.add_trace(go.Scatter(x=outlier_dates, y=y_values, mode = 'markers',
                name = 'Anomaly',
                marker=dict(color='red',size=10)))

fig.show()
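
The index values shown in Step 6, such as `1386661800000`, are Unix timestamps in milliseconds, so the x-axis may show large integers rather than dates. The following is a minimal sketch of the conversion with the standard library, assuming the values are UTC-based epoch milliseconds; `epoch_ms_to_datetime` is an illustrative helper name.

```python
from datetime import datetime, timezone


def epoch_ms_to_datetime(epoch_ms):
    """Convert Unix epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


print(epoch_ms_to_datetime(1386661800000).isoformat())
# 2013-12-10T07:50:00+00:00
```

With pandas, `pd.to_datetime(iforest_results.index, unit="ms")` performs the same conversion for the whole index, which can be assigned back to the DataFrame before plotting.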