# Anomaly Detection Basic Tutorial

This Jupyter notebook will give a basic tutorial on how to use PyDeequ's Anomaly Detection module.

With large datasets, it is often difficult to know which constraints the data should satisfy. However, we usually have a better sense of how much change to expect in certain metrics of the data. We can therefore use anomaly detection to monitor data quality in large datasets.

The idea is that we regularly store the metrics in a MetricsRepository. Once we do that, we can run anomaly checks using the Verification Suite to compare the current values of the metric to its past values in order to detect anomalous changes. 
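To make the store-then-compare idea concrete, here is a minimal pure-Python sketch of a metric history keyed by timestamp. The `ToyMetricsHistory` class and its methods are hypothetical stand-ins written for illustration; they are not part of PyDeequ and do not mirror how `MetricsRepository` is implemented.

```python
from typing import Dict, Optional


class ToyMetricsHistory:
    """Hypothetical stand-in for a metrics repository: timestamp (ms) -> metric value."""

    def __init__(self) -> None:
        self._values: Dict[int, float] = {}

    def save(self, timestamp_ms: int, value: float) -> None:
        # Store (or overwrite) the metric value recorded at this timestamp.
        self._values[timestamp_ms] = value

    def latest_before(self, timestamp_ms: int) -> Optional[float]:
        # Return the most recent value recorded strictly before the given time.
        earlier = [t for t in self._values if t < timestamp_ms]
        return self._values[max(earlier)] if earlier else None


DAY_MS = 24 * 60 * 60 * 1000
today = 1_700_000_000_000  # arbitrary example timestamp in milliseconds

history = ToyMetricsHistory()
history.save(today - DAY_MS, 2.0)  # yesterday's dataset size
history.save(today, 5.0)           # today's dataset size

previous = history.latest_before(today)
print(previous)  # 2.0
```

An anomaly check then only needs the current value and the value returned by `latest_before` to decide whether the change is acceptable.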

In this simple example, we compute the size of a dataset every day to ensure that it does not change drastically: the number of rows on a given day should not be more than double what we have seen before.

First, import the required modules, including `pydeequ`, and start a SparkSession for the test.

In [1]:
from pyspark.sql import SparkSession, Row, DataFrame
import pandas as pd
import sagemaker_pyspark

import pydeequ

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
sc = spark.sparkContext


### Initialize Metrics Repository

Let us import `pydeequ.repository` to access the `InMemoryMetricsRepository` class needed for anomaly detection, and create a metrics repository to store the metrics.

In [3]:
from pydeequ.repository import *
metricsRepository = InMemoryMetricsRepository(spark)

## Let us do an anomaly detection check using Metrics Repository!

We will use a fictitious dataset from yesterday that contains only two rows.

In [4]:
yesterdaysDataset = sc.parallelize([
            Row(a=3, b=0,),
            Row(a=3, b=5,)]).toDF()


We will demonstrate the `RelativeRateOfChangeStrategy` for detecting anomalies with the `Size()` analyzer. With the `maxRateIncrease` parameter set to 2.0, the check tests the size of the data to ensure that it does not grow to more than 2x the previously saved value. Other strategies or analyzers may be better suited to other data.
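The rule described above can be sketched in plain Python as a ratio test. This is an illustration of the arithmetic only, not PyDeequ's implementation: the function name, the snake_case parameter names, and the zero-value handling are assumptions made for this example, and the optional lower bound is included only to show how a limit on decreases could work alongside the limit on increases.

```python
from typing import Optional


def is_relative_change_anomalous(
    previous: float,
    current: float,
    max_rate_increase: Optional[float] = None,
    max_rate_decrease: Optional[float] = None,
) -> bool:
    """Flag a change whose ratio current / previous falls outside the allowed bounds."""
    if previous == 0:
        # Assumption for this sketch: refuse to form a ratio against zero.
        raise ValueError("previous value must be non-zero to form a ratio")
    ratio = current / previous
    if max_rate_increase is not None and ratio > max_rate_increase:
        return True
    if max_rate_decrease is not None and ratio < max_rate_decrease:
        return True
    return False


# Two rows yesterday, five rows today: ratio 2.5 exceeds the 2.0 limit.
print(is_relative_change_anomalous(2, 5, max_rate_increase=2.0))  # True
# Exactly doubling gives ratio 2.0, which this sketch treats as within the limit.
print(is_relative_change_anomalous(2, 4, max_rate_increase=2.0))  # False
```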

The resulting metrics are stored using `useRepository` and `saveOrAppendResult` under the result key `yesterdaysKey`, which carries yesterday's timestamp. To execute the `VerificationSuite`, call the `run()` method.

In [5]:
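from pydeequ.verification import *
# Import the strategy and analyzer explicitly so the cell does not rely on transitive star imports.
from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
from pydeequ.analyzers import Size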

yesterdaysKey = ResultKey(spark, ResultKey.current_milli_time() - 24 * 60 * 60 * 1000)

prev_Result = VerificationSuite(spark).onData(yesterdaysDataset) \
    .useRepository(metricsRepository) \
    .saveOrAppendResult(yesterdaysKey) \
    .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=2.0), Size()) \
    .run()

Next, today's fictitious data has five rows. The data size has more than doubled, so the check should flag an anomaly.

In [6]:
todaysDataset = sc.parallelize([
            Row(a=3,  b=0,),
            Row(a=3,  b=5,),
            Row(a=100,b=5,),
            Row(a=2,  b=30,),
            Row(a=10, b=5,)]).toDF()

Repeat the anomaly check using our metrics repository and verification suite.

In [7]:
todaysKey = ResultKey(spark, ResultKey.current_milli_time())

currResult = VerificationSuite(spark).onData(todaysDataset) \
    .useRepository(metricsRepository) \
    .saveOrAppendResult(todaysKey) \
    .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=2.0), Size()) \
    .run()


### Let us detect any anomalies!

We can now check the status of the verification to see whether an anomaly was detected. If one was (as it should be here), the contents of our metrics repository are printed.

In [8]:
if currResult.status != "Success":
    print("Anomaly detected in the Size() metric!")
    metricsRepository.load().forAnalyzers([Size()]).getSuccessMetricsAsDataFrame().show()


Anomaly detected in the Size() metric!
+-------+--------+----+-----+-------------+
| entity|instance|name|value| dataset_date|
+-------+--------+----+-----+-------------+
|Dataset|       *|Size|  5.0|1594942992829|
|Dataset|       *|Size|  2.0|1594856587188|
+-------+--------+----+-----+-------------+
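The `dataset_date` values in the output are easier to read once converted to dates. The sketch below assumes they are milliseconds since the Unix epoch, which is consistent with the keys being built from `ResultKey.current_milli_time()` and an offset of `24 * 60 * 60 * 1000`. The helper name `millis_to_utc` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime (exact, no float rounding)."""
    return EPOCH + timedelta(milliseconds=ms)


# The two dataset_date values from the table above.
today = millis_to_utc(1594942992829)
yesterday = millis_to_utc(1594856587188)

print(today.isoformat())
print(yesterday.isoformat())
print(today - yesterday)  # 1 day, 0:00:05.641000
```

The two keys are about one day apart, with the few extra seconds reflecting the time that passed between running the two cells.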



### An anomaly has been detected in the dataset: the data size increased from 2 rows to 5, a 2.5x increase that exceeds the 2.0 limit!

### For more information, see the full list of strategies available for anomaly detection in `docs/anomaly_detection.md`