# Storing Computed Metrics in a MetricsRepository

__Updated June 2024 to use a new dataset__

PyDeequ lets us persist the metrics computed on DataFrames in a so-called `MetricsRepository`. The following example stores metrics on a filesystem and queries them later on.

In [1]:
import os
# indicate your Spark version, here we use Spark 3.5 with pydeequ 1.4.0
os.environ["SPARK_VERSION"] = '3.5'

In [2]:
from pyspark.sql import SparkSession, Row, DataFrame
import json
import pandas as pd
import sagemaker_pyspark

import pydeequ

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

:: loading settings :: url = jar:file:/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ec2-user/.ivy2/cache
The jars for the packages stored in: /home/ec2-user/.ivy2/jars
com.amazon.deequ#deequ added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3fd29e82-4619-4f88-ba49-669eee4ba096;1.0
	confs: [default]
	found com.amazon.deequ#deequ;2.0.3-spark-3.3 in central
	found org.scala-lang#scala-reflect;2.12.10 in central
	found org.scalanlp#breeze_2.12;0.13.2 in central
	found org.scalanlp#breeze-macros_2.12;0.13.2 in central
	found com.github.fommil.netlib#core;1.1.2 in central
	found net.sf.opencsv#opencsv;2.3 in central
	found com.github.rwl#jtransforms;2.4.0 in central
	found junit#junit;4.8.2 in central
	found org.apache.commons#commons-math3;3.2 in central
	found org.spire-math#spire_2.12;0.13.0 in central
	found org.spire-math#spire-macros_2.12;0.13.0 in central
	found org.typelevel#machinist_2.12;0.6.1 in central
	found com.chuusai#shapeless_2.12;2.3.2 in central
	found org.typelevel#macro-compat_2.12;1.

24/06/14 23:36:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/06/14 23:36:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/06/14 23:36:10 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/06/14 23:36:10 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


### We will be using the synthetic reviews dataset

Specifically, the Electronics and Books subsets.

In [3]:
df_electronics = spark.read.parquet("s3a://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Electronics/")

df_books = spark.read.parquet("s3a://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/")

# printSchema() prints the schema itself and returns None, so call it directly
df_electronics.printSchema()
df_books.printSchema()

24/06/14 23:36:38 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


                                                                                

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: long (nullable = true)
 |-- helpful_votes: long (nullable = true)
 |-- total_votes: long (nullable = true)
 |-- insight: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: timestamp (nullable = true)
 |-- review_year: long (nullable = true)

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: long (nullable = true)
 |-- helpful_votes: long (nullable = true)
 |-- total_votes: long (nullable = true)
 |-- insight: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- revi

### Initialize Metrics Repository

This demo uses the `FileSystemMetricsRepository` class. `InMemoryMetricsRepository` can be used in exactly the same way and needs no `metrics_file`: `repository = InMemoryMetricsRepository(spark)`.

**A Metrics Repository allows us to store the metrics in JSON format on the local disk (note that it also supports HDFS and S3).**

In [4]:
from pydeequ.repository import *

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
print(f'metrics_file path: {metrics_file}')
repository = FileSystemMetricsRepository(spark, metrics_file)

metrics_file path: /tmp/1718408214845-0/metrics.json


**Each set of metrics that we compute needs to be indexed by a so-called `ResultKey`, which contains a timestamp and supports arbitrary tags in the form of key-value pairs. Let's set one up for this example:**

In [5]:
key_tags = {'tag': 'electronics'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)
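
To make the role of the key concrete, here is a minimal plain-Python sketch of the idea: a timestamp plus tags, which is what later allows runs to be selected by time or by tag. `ToyResultKey` and `select` are hypothetical names invented for this illustration and are not part of PyDeequ; the two timestamps are the `dataset_date` values that appear in this notebook's output. Treating the time bounds as inclusive is an assumption of the sketch, not a statement about PyDeequ's behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyResultKey:
    """Hypothetical stand-in for a result key: a timestamp plus free-form tags."""
    dataset_date: int                                   # milliseconds since the Unix epoch
    tags: Dict[str, str] = field(default_factory=dict)  # arbitrary key-value pairs


def select(keys: List[ToyResultKey],
           after: Optional[int] = None,
           before: Optional[int] = None,
           tag_values: Optional[Dict[str, str]] = None) -> List[ToyResultKey]:
    """Keep the keys that fall inside the time window and carry all requested tags.

    Bounds are inclusive here; that is an assumption of this sketch.
    """
    selected = []
    for key in keys:
        if after is not None and key.dataset_date < after:
            continue
        if before is not None and key.dataset_date > before:
            continue
        if tag_values and any(key.tags.get(k) != v for k, v in tag_values.items()):
            continue
        selected.append(key)
    return selected


# Two keys mirroring the two runs in this notebook
keys = [
    ToyResultKey(1718408220742, {"tag": "electronics"}),
    ToyResultKey(1718408257243, {"tag": "books"}),
]

books_only = select(keys, tag_values={"tag": "books"})
later_runs = select(keys, after=1718408220742 + 1)

print([k.tags["tag"] for k in books_only])
print([k.tags["tag"] for k in later_runs])
```

Both selections return only the books run: the first because of its tag, the second because its timestamp is the only one past the lower bound.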

### We will be building on the basic Analyzers tutorial and adding a Metrics Repository to it!

Now we can run checks or analyzers on our data as usual. However, we make Deequ store the resulting metrics in our repository by adding the `useRepository` and `saveOrAppendResult` methods to our invocation:

In [6]:
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df_electronics) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
                    .addAnalyzer(Correlation("total_votes", "star_rating")) \
                    .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()

24/06/14 23:37:18 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


                                                                                

+-----------+--------------------+-------------------+--------------------+
|     entity|            instance|               name|               value|
+-----------+--------------------+-------------------+--------------------+
|     Column|           review_id|       Completeness|                 1.0|
|     Column|           review_id|ApproxCountDistinct|           3160409.0|
|Mutlicolumn|total_votes,star_...|        Correlation|-7.38808965018615...|
|    Dataset|                   *|               Size|           3010972.0|
|     Column|         star_rating|               Mean|  3.9999973430506826|
|     Column|     top star_rating|         Compliance|  0.7499993357626706|
|Mutlicolumn|total_votes,helpf...|        Correlation|  0.9817922803462663|
+-----------+--------------------+-------------------+--------------------+





### We can now load it back from the Metrics Repository

PyDeequ runs the analysis as usual and additionally stores the metrics under our specified key. Afterwards, we can retrieve the metrics from the repository in different ways. For example, we can load every metric recorded up to the current time as a DataFrame:

In [7]:
analysisResult_metRep = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep.show()

+-----------+--------------------+-------------------+--------------------+-------------+-----------+
|     entity|            instance|               name|               value| dataset_date|        tag|
+-----------+--------------------+-------------------+--------------------+-------------+-----------+
|     Column|           review_id|       Completeness|                 1.0|1718408220742|electronics|
|     Column|           review_id|ApproxCountDistinct|           3160409.0|1718408220742|electronics|
|Mutlicolumn|total_votes,star_...|        Correlation|-7.38808965018615...|1718408220742|electronics|
|    Dataset|                   *|               Size|           3010972.0|1718408220742|electronics|
|     Column|         star_rating|               Mean|  3.9999973430506826|1718408220742|electronics|
|     Column|     top star_rating|         Compliance|  0.7499993357626706|1718408220742|electronics|
|Mutlicolumn|total_votes,helpf...|        Correlation|  0.9817922803462663|1718408
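
The `dataset_date` column holds the timestamp from the `ResultKey`, expressed as milliseconds since the Unix epoch. A small sketch of turning such a value into a readable UTC time follows; `millis_to_utc` is a helper name made up for this example, and the input is the value shown in the table above.

```python
from datetime import datetime, timedelta, timezone


def millis_to_utc(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Adding a timedelta keeps the arithmetic in integers and avoids float rounding
    return epoch + timedelta(milliseconds=millis)


run_time = millis_to_utc(1718408220742)
print(run_time.isoformat())
```

The result falls on 14 June 2024, in line with the timestamps in the Spark log output of this notebook.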

### But that's not very interesting... Let's run another Analysis on the books dataset! 

In [8]:
key_tags_2 = {'tag': 'books'}
resultKey_2 = ResultKey(spark, ResultKey.current_milli_time(), key_tags_2)

analysisResult_2 = AnalysisRunner(spark) \
                    .onData(df_books) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
                    .addAnalyzer(Correlation("total_votes", "star_rating")) \
                    .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey_2) \
                    .run()

analysisResult_2_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult_2)
analysisResult_2_df.show()



+-----------+--------------------+-------------------+--------------------+
|     entity|            instance|               name|               value|
+-----------+--------------------+-------------------+--------------------+
|     Column|           review_id|       Completeness|                 1.0|
|     Column|           review_id|ApproxCountDistinct|         1.0865041E7|
|Mutlicolumn|total_votes,star_...|        Correlation|1.747345622996871...|
|    Dataset|                   *|               Size|           9672664.0|
|     Column|         star_rating|               Mean|  2.9938504015026264|
|     Column|     top star_rating|         Compliance| 0.33738967878962817|
|Mutlicolumn|total_votes,helpf...|        Correlation|8.085328839629536E-5|
+-----------+--------------------+-------------------+--------------------+



                                                                                

### Now we should see two different tags when we load it back from Metrics Repository

In [9]:
analysisResult_metRep_2 = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep_2.show(analysisResult_metRep_2.count(), False)

+-----------+-------------------------+-------------------+---------------------+-------------+-----------+
|entity     |instance                 |name               |value                |dataset_date |tag        |
+-----------+-------------------------+-------------------+---------------------+-------------+-----------+
|Column     |review_id                |Completeness       |1.0                  |1718408220742|electronics|
|Column     |review_id                |ApproxCountDistinct|3160409.0            |1718408220742|electronics|
|Mutlicolumn|total_votes,star_rating  |Correlation        |-7.388089650186156E-4|1718408220742|electronics|
|Dataset    |*                        |Size               |3010972.0            |1718408220742|electronics|
|Column     |star_rating              |Mean               |3.9999973430506826   |1718408220742|electronics|
|Column     |top star_rating          |Compliance         |0.7499993357626706   |1718408220742|electronics|
|Mutlicolumn|total_votes,hel
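
With both runs in one result set, a natural next step is to put the same metric side by side for each tag. The sketch below does this in plain Python on a few rows copied by hand from the outputs in this notebook; in a live session the rows could instead come from the loaded DataFrame (for example via `collect()`), which is left out here so that the snippet runs without Spark. The variable names are illustrative only.

```python
from collections import defaultdict

# (name, instance, value, tag) rows copied from the notebook output
rows = [
    ("Size", "*", 3010972.0, "electronics"),
    ("Mean", "star_rating", 3.9999973430506826, "electronics"),
    ("Compliance", "top star_rating", 0.7499993357626706, "electronics"),
    ("Size", "*", 9672664.0, "books"),
    ("Mean", "star_rating", 2.9938504015026264, "books"),
    ("Compliance", "top star_rating", 0.33738967878962817, "books"),
]

# Group values by metric so that each metric maps to {tag: value}
by_metric = defaultdict(dict)
for name, instance, value, tag in rows:
    by_metric[(name, instance)][tag] = value

for (name, instance), values in by_metric.items():
    print(f"{name}({instance}): "
          f"electronics={values['electronics']:.4f}, books={values['books']:.4f}")
```

Keying on the pair `(name, instance)` matters because the same analyzer can appear more than once, as `Correlation` does in this notebook.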

### We can see the differences in the `dataset_date` and `tag` columns and filter our results like so

In [10]:
filtered_tags = repository.load() \
        .withTagValues(key_tags_2) \
        .getSuccessMetricsAsDataFrame()

filtered_tags.show(filtered_tags.count(), False)

+-----------+-------------------------+-------------------+---------------------+-------------+-----+
|entity     |instance                 |name               |value                |dataset_date |tag  |
+-----------+-------------------------+-------------------+---------------------+-------------+-----+
|Column     |review_id                |Completeness       |1.0                  |1718408257243|books|
|Column     |review_id                |ApproxCountDistinct|1.0865041E7          |1718408257243|books|
|Mutlicolumn|total_votes,star_rating  |Correlation        |1.7473456229968713E-4|1718408257243|books|
|Dataset    |*                        |Size               |9672664.0            |1718408257243|books|
|Column     |star_rating              |Mean               |2.9938504015026264   |1718408257243|books|
|Column     |top star_rating          |Compliance         |0.33738967878962817  |1718408257243|books|
|Mutlicolumn|total_votes,helpful_votes|Correlation        |8.085328839629536E-5 |1

In [11]:
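# after() takes a timestamp in epoch milliseconds. This hard-coded value falls in
# July 2020, before both runs above, so neither run is filtered out.
filtered_time = repository.load() \
        .after(1595457441235+1) \
        .getSuccessMetricsAsDataFrame()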

filtered_time.show(filtered_time.count(), False)

+-----------+-------------------------+-------------------+---------------------+-------------+-----------+
|entity     |instance                 |name               |value                |dataset_date |tag        |
+-----------+-------------------------+-------------------+---------------------+-------------+-----------+
|Column     |review_id                |Completeness       |1.0                  |1718408220742|electronics|
|Column     |review_id                |ApproxCountDistinct|3160409.0            |1718408220742|electronics|
|Mutlicolumn|total_votes,star_rating  |Correlation        |-7.388089650186156E-4|1718408220742|electronics|
|Dataset    |*                        |Size               |3010972.0            |1718408220742|electronics|
|Column     |star_rating              |Mean               |3.9999973430506826   |1718408220742|electronics|
|Column     |top star_rating          |Compliance         |0.7499993357626706   |1718408220742|electronics|
|Mutlicolumn|total_votes,hel

### For more information, see the Metrics Repository documentation in `docs/repository.md`