# Storing Computed Metrics in a MetricsRepository

PyDeequ lets us persist the metrics we compute on DataFrames in a `MetricsRepository`. The following example stores metrics on a filesystem and queries them later.

In [2]:
from pyspark.sql import SparkSession, Row, DataFrame
import json
import pandas as pd
import sagemaker_pyspark

import pydeequ

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

### We will be using the Amazon Product Reviews dataset

Specifically, the Electronics and Books subsets.

In [3]:
df_electronics = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

df_books = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Books/")

# printSchema() prints the schema itself and returns None, so call it directly
df_electronics.printSchema()
df_books.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable 

### Initialize Metrics Repository

We will demo the `FileSystemMetricsRepository` class, but you can use `InMemoryMetricsRepository` in exactly the same way, without creating a `metrics_file`: `repository = InMemoryMetricsRepository(spark)`.

**The Metrics Repository lets us store the metrics in JSON format on the local disk (it also supports HDFS and S3).**

In [4]:
from pydeequ.repository import *

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
print(f'metrics_file path: {metrics_file}')
repository = FileSystemMetricsRepository(spark, metrics_file)

metrics_file path: /tmp/1595457441222-0/metrics.json


**Each set of metrics that we compute needs to be indexed by a so-called `ResultKey`, which contains a timestamp and supports arbitrary tags in the form of key-value pairs. Let's set one up for this example:**

In [5]:
key_tags = {'tag': 'electronics'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)
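To make the idea of a result key concrete, here is a minimal plain-Python sketch of "a timestamp plus tags". `ToyResultKey` and this `current_milli_time` helper are hypothetical stand-ins written for illustration; they are not PyDeequ's classes, and the sketch assumes the timestamp is milliseconds since the Unix epoch.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


def current_milli_time() -> int:
    """Milliseconds since the Unix epoch, as an integer (assumed convention)."""
    return int(round(time.time() * 1000))


@dataclass
class ToyResultKey:
    """Hypothetical stand-in for a result key: a timestamp plus free-form tags."""
    dataset_date: int
    tags: Dict[str, str] = field(default_factory=dict)


key = ToyResultKey(current_milli_time(), {"tag": "electronics"})
print(key)
```

The timestamp orders runs in time, while the tags let you label a run with whatever is useful to filter on later, such as the dataset name.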

### We will build on the basic Analyzers tutorial and add a Metrics Repository to it

Now we can run checks or analyzers on our data as usual. However, we make Deequ store the resulting metrics in our repository by adding the `useRepository` and `saveOrAppendResult` methods to our invocation:

In [6]:
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df_electronics) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
                    .addAnalyzer(Correlation("total_votes", "star_rating")) \
                    .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()

+-----------+--------------------+-------------------+--------------------+
|     entity|            instance|               name|               value|
+-----------+--------------------+-------------------+--------------------+
|     Column|           review_id|       Completeness|                 1.0|
|     Column|           review_id|ApproxCountDistinct|           3010972.0|
|Mutlicolumn|total_votes,star_...|        Correlation|-0.03451097996538765|
|    Dataset|                   *|               Size|           3120938.0|
|     Column|         star_rating|               Mean|   4.036143941340712|
|     Column|     top star_rating|         Compliance|  0.7494070692849394|
|Mutlicolumn|total_votes,helpf...|        Correlation|  0.9936463809903863|
+-----------+--------------------+-------------------+--------------------+
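
As a rough guide to reading the table above, the sketch below computes what `Size`, `Completeness`, `Mean`, and `Compliance` mean on a tiny hand-made list of rows. It is plain Python for illustration only: the rows are invented, and this is not how Deequ computes the metrics on Spark.

```python
# Four made-up rows; one review_id is missing on purpose.
rows = [
    {"review_id": "R1", "star_rating": 5},
    {"review_id": "R2", "star_rating": 3},
    {"review_id": None, "star_rating": 4},
    {"review_id": "R4", "star_rating": 4},
]

# Size: number of rows in the dataset.
size = len(rows)

# Completeness: fraction of rows where the column is not null.
completeness = sum(r["review_id"] is not None for r in rows) / size

# Mean: arithmetic mean of a numeric column (no nulls in this toy column).
mean_rating = sum(r["star_rating"] for r in rows) / size

# Compliance: fraction of rows that satisfy a predicate, here star_rating >= 4.0.
compliance = sum(r["star_rating"] >= 4.0 for r in rows) / size

print(size, completeness, mean_rating, compliance)
```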



### We can now load it back from the Metrics Repository

PyDeequ executes the analysis as usual and additionally stores the metrics under our specified key. Afterwards, we can retrieve the metrics from the repository in different ways. For example, we can load every metric stored up to the current time as a DataFrame:

In [7]:
analysisResult_metRep = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep.show()

+-----------+--------------------+-------------------+--------------------+-------------+-----------+
|     entity|            instance|               name|               value| dataset_date|        tag|
+-----------+--------------------+-------------------+--------------------+-------------+-----------+
|     Column|           review_id|       Completeness|                 1.0|1595457441235|electronics|
|     Column|           review_id|ApproxCountDistinct|           3010972.0|1595457441235|electronics|
|Mutlicolumn|total_votes,star_...|        Correlation|-0.03451097996538765|1595457441235|electronics|
|    Dataset|                   *|               Size|           3120938.0|1595457441235|electronics|
|     Column|         star_rating|               Mean|   4.036143941340712|1595457441235|electronics|
|     Column|     top star_rating|         Compliance|  0.7494070692849394|1595457441235|electronics|
|Mutlicolumn|total_votes,helpf...|        Correlation|  0.9936463809903863|1595457
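
The `dataset_date` column holds the timestamp from the result key. Assuming it is milliseconds since the Unix epoch (which is what a helper named `current_milli_time` suggests), a small helper can make it human-readable. `millis_to_utc` is a hypothetical name used only in this sketch.

```python
from datetime import datetime, timedelta, timezone


def millis_to_utc(ms: int) -> datetime:
    """Convert integer epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer arithmetic avoids float rounding in the millisecond part.
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


electronics_run = millis_to_utc(1595457441235)
print(electronics_run.isoformat())  # 2020-07-22T22:37:21.235000+00:00
```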

### But that's not very interesting... Let's run another Analysis on the books dataset! 

In [8]:
key_tags_2 = {'tag': 'books'}
resultKey_2 = ResultKey(spark, ResultKey.current_milli_time(), key_tags_2)

analysisResult_2 = AnalysisRunner(spark) \
                    .onData(df_books) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
                    .addAnalyzer(Correlation("total_votes", "star_rating")) \
                    .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey_2) \
                    .run()

analysisResult_2_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult_2)
analysisResult_2_df.show()

+-----------+--------------------+-------------------+--------------------+
|     entity|            instance|               name|               value|
+-----------+--------------------+-------------------+--------------------+
|     Column|           review_id|       Completeness|                 1.0|
|     Column|           review_id|ApproxCountDistinct|          2.005151E7|
|Mutlicolumn|total_votes,star_...|        Correlation|-0.13092955077624202|
|    Dataset|                   *|               Size|          2.072616E7|
|     Column|         star_rating|               Mean|   4.340540167594962|
|     Column|     top star_rating|         Compliance|  0.8302768095971468|
|Mutlicolumn|total_votes,helpf...|        Correlation|  0.9613189372804929|
+-----------+--------------------+-------------------+--------------------+
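
Both runs now live in the same repository under different keys. The toy model below sketches one way "save or append" semantics could work for a JSON file: metrics for an existing key are merged into that entry, while a new key adds a new entry. The function name, file layout, and field names are invented for illustration and are not Deequ's actual on-disk format.

```python
import json
import os
import tempfile


def save_or_append(path, dataset_date, tags, metrics):
    """Toy model: merge `metrics` into the entry with the same key, else add a new entry."""
    entries = []
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    for entry in entries:
        if entry["dataset_date"] == dataset_date and entry["tags"] == tags:
            entry["metrics"].update(metrics)  # same key: append to the existing result
            break
    else:
        # No entry with this key yet: store a new one.
        entries.append({"dataset_date": dataset_date, "tags": tags, "metrics": dict(metrics)})
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)
    return entries


path = os.path.join(tempfile.mkdtemp(), "toy_metrics.json")
save_or_append(path, 1, {"tag": "electronics"}, {"Size": 3.0})
save_or_append(path, 1, {"tag": "electronics"}, {"Mean": 4.0})
entries = save_or_append(path, 2, {"tag": "books"}, {"Size": 5.0})
print(json.dumps(entries, indent=2))
```

After these three calls the file holds two entries: one for the first key with both `Size` and `Mean`, and one for the second key.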



### Now we should see two different tags when we load it back from the Metrics Repository

In [9]:
analysisResult_metRep_2 = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep_2.show(analysisResult_metRep_2.count(), False)

+-----------+-------------------------+-------------------+--------------------+-------------+-----------+
|entity     |instance                 |name               |value               |dataset_date |tag        |
+-----------+-------------------------+-------------------+--------------------+-------------+-----------+
|Column     |review_id                |Completeness       |1.0                 |1595457441235|electronics|
|Column     |review_id                |ApproxCountDistinct|3010972.0           |1595457441235|electronics|
|Mutlicolumn|total_votes,star_rating  |Correlation        |-0.03451097996538765|1595457441235|electronics|
|Dataset    |*                        |Size               |3120938.0           |1595457441235|electronics|
|Column     |star_rating              |Mean               |4.036143941340712   |1595457441235|electronics|
|Column     |top star_rating          |Compliance         |0.7494070692849394  |1595457441235|electronics|
|Mutlicolumn|total_votes,helpful_vote

### We can see the differences in the `dataset_date` and `tag` columns and filter our results like so

In [10]:
filtered_tags = repository.load() \
        .withTagValues(key_tags_2) \
        .getSuccessMetricsAsDataFrame()

filtered_tags.show(filtered_tags.count(), False)

+-----------+-------------------------+-------------------+--------------------+-------------+-----+
|entity     |instance                 |name               |value               |dataset_date |tag  |
+-----------+-------------------------+-------------------+--------------------+-------------+-----+
|Column     |review_id                |Completeness       |1.0                 |1595457494596|books|
|Column     |review_id                |ApproxCountDistinct|2.005151E7          |1595457494596|books|
|Mutlicolumn|total_votes,star_rating  |Correlation        |-0.13092955077624202|1595457494596|books|
|Dataset    |*                        |Size               |2.072616E7          |1595457494596|books|
|Column     |star_rating              |Mean               |4.340540167594962   |1595457494596|books|
|Column     |top star_rating          |Compliance         |0.8302768095971468  |1595457494596|books|
|Mutlicolumn|total_votes,helpful_votes|Correlation        |0.9613189372804929  |15954574945

In [15]:
# Keep only results newer than the electronics run, whose dataset_date was 1595457441235
filtered_time = repository.load() \
        .after(1595457441235+1) \
        .getSuccessMetricsAsDataFrame()

filtered_time.show(filtered_time.count(), False)

+-----------+-------------------------+-------------------+--------------------+-------------+-----+
|entity     |instance                 |name               |value               |dataset_date |tag  |
+-----------+-------------------------+-------------------+--------------------+-------------+-----+
|Column     |review_id                |Completeness       |1.0                 |1595457494596|books|
|Column     |review_id                |ApproxCountDistinct|2.005151E7          |1595457494596|books|
|Mutlicolumn|total_votes,star_rating  |Correlation        |-0.13092955077624202|1595457494596|books|
|Dataset    |*                        |Size               |2.072616E7          |1595457494596|books|
|Column     |star_rating              |Mean               |4.340540167594962   |1595457494596|books|
|Column     |top star_rating          |Compliance         |0.8302768095971468  |1595457494596|books|
|Mutlicolumn|total_votes,helpful_votes|Correlation        |0.9613189372804929  |15954574945
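
The two filters above select the same rows for different reasons: one matches the tag, the other the timestamp. The plain-Python sketch below mimics that logic on a few values copied from the tables. `filter_records` is a hypothetical helper, and treating the time bounds as inclusive is an assumption (the `+1` in the cell above suggests it, but the source does not state it).

```python
# A handful of rows taken from the outputs above.
records = [
    {"name": "Size", "value": 3120938.0, "dataset_date": 1595457441235, "tag": "electronics"},
    {"name": "Mean", "value": 4.036143941340712, "dataset_date": 1595457441235, "tag": "electronics"},
    {"name": "Size", "value": 2.072616e7, "dataset_date": 1595457494596, "tag": "books"},
    {"name": "Mean", "value": 4.340540167594962, "dataset_date": 1595457494596, "tag": "books"},
]


def filter_records(rows, tags=None, after=None, before=None):
    """Keep rows matching every tag and falling inside [after, before] (inclusive here)."""
    kept = []
    for row in rows:
        if tags and any(row.get(k) != v for k, v in tags.items()):
            continue  # at least one requested tag does not match
        if after is not None and row["dataset_date"] < after:
            continue  # older than the lower bound
        if before is not None and row["dataset_date"] > before:
            continue  # newer than the upper bound
        kept.append(row)
    return kept


by_tag = filter_records(records, tags={"tag": "books"})
by_time = filter_records(records, after=1595457441235 + 1)
print(by_tag == by_time)
```

Under the inclusive-bound assumption, dropping the `+1` would also return the electronics rows, which is why the notebook adds it.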

### For more info, see the Metrics Repository documentation in `docs/repository.md`