Commit
1 parent 4ddd083, commit 0893c68
Showing 5 changed files with 178 additions and 3 deletions.
94 changes: 94 additions & 0 deletions
src/main/scala/com/amazon/deequ/examples/AnomalyDetectionExample.scala
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
/** | ||
* Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"). You may not | ||
* use this file except in compliance with the License. A copy of the License | ||
* is located at | ||
* | ||
* http://aws.amazon.com/apache2.0/ | ||
* | ||
* or in the "license" file accompanying this file. This file is distributed on | ||
* an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | ||
* express or implied. See the License for the specific language governing | ||
* permissions and limitations under the License. | ||
* | ||
*/ | ||
|
||
package com.amazon.deequ.examples | ||
|
||
import com.amazon.deequ.VerificationSuite | ||
import com.amazon.deequ.analyzers.Size | ||
import com.amazon.deequ.anomalydetection.RateOfChangeStrategy | ||
import com.amazon.deequ.examples.ExampleUtils.{itemsAsDataframe, withSpark} | ||
import com.amazon.deequ.repository.ResultKey | ||
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository | ||
import com.amazon.deequ.checks.CheckStatus._ | ||
|
||
private[examples] object AnomalyDetectionExample extends App { | ||
|
||
withSpark { session => | ||
|
||
/* In this simple example, we assume that we compute metrics on a dataset every day and we want | ||
to ensure that they don't change drastically. For the sake of simplicity, we just look at the | ||
size of the data */ | ||
|
||
/* Anomaly detection operates on metrics stored in a metrics repository, so let's create one */ | ||
val metricsRepository = new InMemoryMetricsRepository() | ||
|
||
/* This is the key under which we store the metrics for yesterday's dataset | ||
(24 hours ago, in milliseconds) */ | ||
val yesterdaysKey = ResultKey(System.currentTimeMillis() - 24 * 60 * 60 * 1000) | ||
|
||
/* Yesterday, the data had only two rows */ | ||
val yesterdaysDataset = itemsAsDataframe(session, | ||
Item(1, "Thingy A", "awesome thing.", "high", 0), | ||
Item(2, "Thingy B", "available at http://thingb.com", null, 0)) | ||
|
||
/* We test for anomalies in the size of the data: it should not increase by more than 2x. Note | ||
that we store the resulting metrics in our repository */ | ||
VerificationSuite() | ||
.onData(yesterdaysDataset) | ||
.useRepository(metricsRepository) | ||
.saveOrAppendResult(yesterdaysKey) | ||
.addAnomalyCheck( | ||
RateOfChangeStrategy(maxRateIncrease = Some(2.0)), | ||
Size() | ||
) | ||
.run() | ||
|
||
/* Today's data has five rows, so the data size more than doubled and our anomaly check should | ||
catch this */ | ||
val todaysDataset = itemsAsDataframe(session, | ||
Item(1, "Thingy A", "awesome thing.", "high", 0), | ||
Item(2, "Thingy B", "available at http://thingb.com", null, 0), | ||
Item(3, null, null, "low", 5), | ||
Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10), | ||
Item(5, "Thingy E", null, "high", 12)) | ||
|
||
/* The key for today's result */ | ||
val todaysKey = ResultKey(System.currentTimeMillis()) | ||
|
||
/* Repeat the anomaly check for today's data */ | ||
val verificationResult = VerificationSuite() | ||
.onData(todaysDataset) | ||
.useRepository(metricsRepository) | ||
.saveOrAppendResult(todaysKey) | ||
.addAnomalyCheck( | ||
RateOfChangeStrategy(maxRateIncrease = Some(2.0)), | ||
Size() | ||
) | ||
.run() | ||
|
||
/* Did we find an anomaly? */ | ||
if (verificationResult.status != Success) { | ||
println("Anomaly detected in the Size() metric!") | ||
|
||
/* Let's have a look at the actual metrics. */ | ||
metricsRepository | ||
.load() | ||
.forAnalyzers(Seq(Size())) | ||
.getSuccessMetricsAsDataFrame(session) | ||
.show() | ||
} | ||
} | ||
|
||
} |
80 changes: 80 additions & 0 deletions
src/main/scala/com/amazon/deequ/examples/anomaly_detection_example.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# Anomaly detection | ||
|
||
Very often, it is hard to exactly define what constraints we want to evaluate on our data. However, we often have a better understanding of how much change we expect in certain metrics of our data. Therefore, **deequ** supports anomaly detection for data quality metrics. The idea is that we regularly store the metrics of our data in a [MetricsRepository](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/metrics_repository_example.md). Once we do that, we can run anomaly checks that compare the current value of the metric to its values in the past and allow us to detect anomalous changes. | ||
|
||
In this simple example, we assume that we compute the size of a dataset every day and we want to ensure that it does not change drastically: the number of rows on a given day should not be more than double what we saw the day before. | ||
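To make the rule concrete before bringing in the library, here is a minimal sketch of the "not more than double" check in plain Scala. The object `SizeRatioSketch` and its method `exceedsRatio` are hypothetical names used only for illustration; they are not part of deequ and do not claim to mirror how its strategies are implemented.

```scala
// Hypothetical helper, not part of deequ: illustrates the rule stated in the text.
object SizeRatioSketch {

  /** Returns true when `current` is more than `maxRatio` times `previous`. */
  def exceedsRatio(previous: Double, current: Double, maxRatio: Double): Boolean = {
    // A ratio is only meaningful for a positive previous value.
    require(previous > 0, "previous value must be positive")
    current / previous > maxRatio
  }
}
```

With the sizes used in this example, `SizeRatioSketch.exceedsRatio(2.0, 5.0, 2.0)` evaluates to `true`, while a growth from 2 to 4 rows would not exceed the limit.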
|
||
Anomaly detection operates on metrics stored in a metrics repository, so let's create one. | ||
```scala | ||
val metricsRepository = new InMemoryMetricsRepository() | ||
``` | ||
|
||
This is our fictitious data from yesterday, which has only two rows. | ||
```scala | ||
val yesterdaysDataset = itemsAsDataframe(session, | ||
Item(1, "Thingy A", "awesome thing.", "high", 0), | ||
Item(2, "Thingy B", "available at http://thingb.com", null, 0)) | ||
``` | ||
|
||
We test for anomalies in the size of the data, and want to enforce that it should not increase by more than 2x. We define a check for this by using the [RateOfChangeStrategy](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/anomalydetection/RateOfChangeStrategy.scala) for detecting anomalies. Note that we store the resulting metrics in our repository via `useRepository` and `saveOrAppendResult` under a result key `yesterdaysKey` with yesterday's timestamp. | ||
```scala | ||
// 24 hours ago, in milliseconds | ||
val yesterdaysKey = ResultKey(System.currentTimeMillis() - 24 * 60 * 60 * 1000) | ||
|
||
VerificationSuite() | ||
.onData(yesterdaysDataset) | ||
.useRepository(metricsRepository) | ||
.saveOrAppendResult(yesterdaysKey) | ||
.addAnomalyCheck( | ||
RateOfChangeStrategy(maxRateIncrease = Some(2.0)), | ||
Size()) | ||
.run() | ||
``` | ||
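The result key is built from a timestamp in milliseconds, and one day corresponds to 24 * 60 * 60 * 1000 = 86,400,000 ms. Hand-written millisecond arithmetic is easy to get wrong, so the sketch below derives the offset from the JDK's `TimeUnit` instead. The object `ResultKeyTimestamps` and its members are hypothetical names for illustration and are not part of deequ.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper, not part of deequ: computes timestamps for result keys.
object ResultKeyTimestamps {

  /** Number of milliseconds in one day, taken from the JDK rather than typed by hand. */
  val oneDayMillis: Long = TimeUnit.DAYS.toMillis(1)

  /** Timestamp exactly one day before `nowMillis`. */
  def dayBefore(nowMillis: Long): Long = nowMillis - oneDayMillis
}
```

Under this assumption, yesterday's key could be created as `ResultKey(ResultKeyTimestamps.dayBefore(System.currentTimeMillis()))`.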
|
||
Today's fictitious data has five rows, so the data size more than doubled and our anomaly check should | ||
catch this. | ||
```scala | ||
val todaysDataset = itemsAsDataframe(session, | ||
Item(1, "Thingy A", "awesome thing.", "high", 0), | ||
Item(2, "Thingy B", "available at http://thingb.com", null, 0), | ||
Item(3, null, null, "low", 5), | ||
Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10), | ||
Item(5, "Thingy E", null, "high", 12)) | ||
``` | ||
We repeat the anomaly check using our metrics repository. | ||
```scala | ||
val todaysKey = ResultKey(System.currentTimeMillis()) | ||
|
||
val verificationResult = VerificationSuite() | ||
.onData(todaysDataset) | ||
.useRepository(metricsRepository) | ||
.saveOrAppendResult(todaysKey) | ||
.addAnomalyCheck( | ||
RateOfChangeStrategy(maxRateIncrease = Some(2.0)), | ||
Size()) | ||
.run() | ||
``` | ||
|
||
We can now look at the `status` of the verification result to see whether our check caught an anomaly (it should have). If it did, we print the contents of our metrics repository. | ||
```scala | ||
if (verificationResult.status != Success) { | ||
println("Anomaly detected in the Size() metric!") | ||
|
||
metricsRepository | ||
.load() | ||
.forAnalyzers(Seq(Size())) | ||
.getSuccessMetricsAsDataFrame(session) | ||
.show() | ||
} | ||
``` | ||
|
||
We see that the following metrics are stored in the repository, which shows us the reason for the anomaly: the data size increased from 2 to 5! | ||
``` | ||
+-------+--------+----+-----+-------------+ | ||
| entity|instance|name|value| dataset_date| | ||
+-------+--------+----+-----+-------------+ | ||
|Dataset| *|Size| 2.0|1538384009558| | ||
|Dataset| *|Size| 5.0|1538385453983| | ||
+-------+--------+----+-----+-------------+ | ||
``` | ||
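The same conclusion can be reached by replaying the stored values by hand. The following is a rough sketch in plain Scala that orders a metric history by timestamp and reports every point that is more than `maxRatio` times its predecessor. `MetricHistorySketch`, `Point`, and `anomalies` are hypothetical names, and the logic only illustrates the rule described in this example; it is not deequ's implementation.

```scala
// Hypothetical sketch, not part of deequ: scans a metric history for sharp increases.
object MetricHistorySketch {

  /** One stored metric value, mirroring the dataset_date and value columns above. */
  final case class Point(timestampMillis: Long, value: Double)

  /** Points whose value is more than `maxRatio` times the value of the preceding point. */
  def anomalies(history: Seq[Point], maxRatio: Double): Seq[Point] =
    history
      .sortBy(_.timestampMillis) // compare each point with its predecessor in time
      .sliding(2)                // consecutive pairs; shorter histories yield no pair
      .collect {
        case Seq(prev, curr) if prev.value > 0 && curr.value / prev.value > maxRatio => curr
      }
      .toList
}
```

Fed with the two rows shown above, the sketch flags the second point, because 5.0 / 2.0 = 2.5 exceeds the limit of 2.0.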
|
||
An [executable version of this example](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/AnomalyDetectionExample.scala) is available as part of our code base. We also provide more [anomaly detection strategies](https://github.com/awslabs/deequ/tree/master/src/main/scala/com/amazon/deequ/anomalydetection). |