# Storing Computed Metrics in S3 with AWS Glue

PyDeequ lets us persist the metrics we compute on DataFrames in a so-called `MetricsRepository`. In the following example, which runs on AWS Glue, we show how to store metrics in S3 and query them later.

In [1]:
import sys
from awsglue.utils import getResolvedOptions

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
session = glueContext.spark_session


Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
99,application_1595892420059_0100,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


### We will be using the Amazon Product Reviews dataset
Specifically, we use the Electronics category.

In [2]:
df_electronics = session.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

df_electronics.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)

### Initialize Metrics Repository

We will demo the `FileSystemMetricsRepository` class, but you can use `InMemoryMetricsRepository` in exactly the same way, without specifying a file path: `repository = InMemoryMetricsRepository(session)`.

**The Metrics Repository allows us to store the metrics in JSON format on S3.**

In [3]:
s3_write_path = "s3://joanpydeequ/tmp/simple_metrics_tutorial.json"

import pydeequ
from pydeequ.repository import *

repository = FileSystemMetricsRepository(session, s3_write_path)

Each set of metrics that we compute needs to be indexed by a so-called `ResultKey`, which contains a timestamp and supports arbitrary tags in the form of key-value pairs. Let's set one up for this example:

In [4]:
key_tags = {'tag': 'general_electronics'}
resultKey = ResultKey(session, ResultKey.current_milli_time(), key_tags)

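
The timestamp in the key is what appears as `dataset_date` when the metrics are loaded back. As a rough illustration in plain Python, independent of PyDeequ, such a value can be produced and turned back into a readable date as shown below. The helper names `current_milli_time` and `millis_to_utc` are hypothetical stand-ins, and the sketch assumes the timestamp counts milliseconds since the Unix epoch.

```python
import time
from datetime import datetime, timedelta, timezone


def current_milli_time() -> int:
    """Hypothetical stand-in: milliseconds elapsed since the Unix epoch."""
    return int(round(time.time() * 1000))


def millis_to_utc(millis: int) -> datetime:
    """Convert an epoch-millisecond timestamp to a timezone-aware UTC datetime."""
    seconds, remainder_ms = divmod(millis, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=remainder_ms)


# A dataset_date value of the kind stored alongside each metric
print(millis_to_utc(1597344080757).isoformat())
```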

This tutorial builds upon the Analyzer and Metrics Repository Tutorial. We make Deequ write and store our metrics in S3 by adding the `useRepository` and `saveOrAppendResult` methods to the `AnalysisRunner` call chain.

In [5]:
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(session) \
                    .onData(df_electronics) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Distinctness("customer_id")) \
                    .addAnalyzer(Correlation("helpful_votes","total_votes")) \
                    .addAnalyzer(ApproxQuantile("star_rating",.5)) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(session, analysisResult)
analysisResult_df.show()

+-----------+--------------------+-------------------+------------------+
|     entity|            instance|               name|             value|
+-----------+--------------------+-------------------+------------------+
|     Column|           review_id|       Completeness|               1.0|
|     Column|           review_id|ApproxCountDistinct|         3010972.0|
|     Column|         customer_id|       Distinctness|0.6951804233214501|
|Mutlicolumn|helpful_votes,tot...|        Correlation|0.9936463809903863|
|    Dataset|                   *|               Size|         3120938.0|
|     Column|         star_rating|               Mean| 4.036143941340712|
|     Column|         star_rating| ApproxQuantile-0.5|               5.0|
+-----------+--------------------+-------------------+------------------+
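
To make the metric names in this table concrete, here is a small plain-Python sketch of what `Completeness`, `Distinctness`, `Mean`, and `Correlation` measure. It is not how Deequ computes them (Deequ evaluates them as Spark aggregations), the toy values are made up, and the function names are only illustrative.

```python
from math import sqrt


def completeness(values):
    """Fraction of entries that are not None."""
    return sum(v is not None for v in values) / len(values)


def distinctness(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy columns, not taken from the reviews dataset
review_ids = ["R1", "R2", None, "R4"]
customer_ids = ["C1", "C2", "C1", "C3"]
star_ratings = [5, 4, 1, 5]
helpful_votes = [1, 2, 3, 4]
total_votes = [2, 4, 6, 8]

print(completeness(review_ids))                         # 3 of 4 entries are present
print(distinctness(customer_ids))                       # 3 distinct customers in 4 rows
print(mean(star_ratings))
print(pearson_correlation(helpful_votes, total_votes))  # perfectly linear toy data
```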

### We can now load it back from the Metrics Repository

PyDeequ runs the analysis as usual and additionally stores the metrics under our specified key. Afterwards, we can retrieve the metrics from the repository in different ways. For example, we can load all metrics stored before the current time as a DataFrame:

In [6]:
analysisResult_metRep = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep.show()

+-----------+--------------------+-------------------+------------------+-------------+-------------------+
|     entity|            instance|               name|             value| dataset_date|                tag|
+-----------+--------------------+-------------------+------------------+-------------+-------------------+
|     Column|           review_id|       Completeness|               1.0|1597344080757|general_electronics|
|     Column|           review_id|ApproxCountDistinct|         3010972.0|1597344080757|general_electronics|
|     Column|         customer_id|       Distinctness|0.6951804233214501|1597344080757|general_electronics|
|Mutlicolumn|helpful_votes,tot...|        Correlation|0.9936463809903863|1597344080757|general_electronics|
|    Dataset|                   *|               Size|         3120938.0|1597344080757|general_electronics|
|     Column|         star_rating|               Mean| 4.036143941340712|1597344080757|general_electronics|
|     Column|         star_r
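
Another way to query the repository is to filter by tag. The sketch below wraps that query in a small helper. `load_metrics_for_tag` is a hypothetical name, and the sketch assumes the repository loader exposes a `withTagValues` method alongside `before` and `getSuccessMetricsAsDataFrame`, which is worth confirming against your installed PyDeequ version.

```python
def load_metrics_for_tag(repository, tag_value, before_millis=None):
    """Load stored success metrics whose 'tag' equals tag_value.

    Assumes `repository` is a PyDeequ metrics repository whose loader
    supports withTagValues(); check this against your PyDeequ release.
    """
    loader = repository.load().withTagValues({"tag": tag_value})
    if before_millis is not None:
        # Optionally restrict to results recorded before a given timestamp
        loader = loader.before(before_millis)
    return loader.getSuccessMetricsAsDataFrame()
```

With the repository from this tutorial, `load_metrics_for_tag(repository, "general_electronics").show()` would be expected to print only the rows carrying that tag.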

## Great, we got our results!

Let us take a closer look at the distribution of the star rating column. The overall mean rating is about 4.04, so we use the `filter` method to split the table in two: one table with the below-average ratings (1 to 3 stars) and one with the higher ratings (4 and 5 stars).

In [7]:
lower_rating = df_electronics.filter("star_rating < 4")
higher_rating = df_electronics.filter("star_rating >= 4")


**We can now compute the correlation between helpful and total votes separately for the lower and the higher ratings.**

In [8]:
key_tags_2 = {'tag': 'star_rating[1-3]'}
resultKey = ResultKey(session, ResultKey.current_milli_time(), key_tags_2)

In [9]:
analysisResult = AnalysisRunner(session) \
                    .onData(lower_rating) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Correlation("helpful_votes","total_votes")) \
                    .addAnalyzer(Compliance('range','star_rating > 0 AND star_rating < 4')) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(session, analysisResult)
analysisResult_df.show()

+-----------+--------------------+-------------------+------------------+
|     entity|            instance|               name|             value|
+-----------+--------------------+-------------------+------------------+
|     Column|           review_id|ApproxCountDistinct|          779797.0|
|     Column|               range|         Compliance|               1.0|
|Mutlicolumn|helpful_votes,tot...|        Correlation|0.9870764816013522|
|    Dataset|                   *|               Size|          782085.0|
|     Column|         star_rating|               Mean| 1.846948861057302|
+-----------+--------------------+-------------------+------------------+
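
The `Compliance` row reports the fraction of rows that satisfy the given predicate, so a value of 1.0 confirms that every row in `lower_rating` has a rating between 1 and 3. A plain-Python sketch of that idea follows, using made-up ratings and a hypothetical `compliance` helper rather than Deequ's Spark-based implementation.

```python
def compliance(rows, predicate):
    """Fraction of rows for which predicate(row) is true."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


star_ratings = [1, 2, 3, 3, 5]  # made-up sample, not taken from the dataset


def in_lower_range(rating):
    """Same condition as the SQL predicate 'star_rating > 0 AND star_rating < 4'."""
    return 0 < rating < 4


print(compliance(star_ratings, in_lower_range))  # 4 of 5 rows match
print(compliance([1, 2, 3], in_lower_range))     # every row matches
```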

In [10]:
key_tags_3 = {'tag': 'star_rating[4-5]'}
resultKey = ResultKey(session, ResultKey.current_milli_time(), key_tags_3)

In [11]:
analysisResult = AnalysisRunner(session) \
                    .onData(higher_rating) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Correlation("helpful_votes","total_votes")) \
                    .addAnalyzer(Compliance('range','star_rating >=4')) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(session, analysisResult)
analysisResult_df.show()

+-----------+--------------------+-------------------+------------------+
|     entity|            instance|               name|             value|
+-----------+--------------------+-------------------+------------------+
|     Column|           review_id|ApproxCountDistinct|         2348579.0|
|Mutlicolumn|helpful_votes,tot...|        Correlation|0.9976826161406824|
|    Dataset|                   *|               Size|         2338853.0|
|     Column|         star_rating|               Mean| 4.768185089015855|
|     Column|               range|         Compliance|               1.0|
+-----------+--------------------+-------------------+------------------+

### Now we should see three different tags when we load the metrics back from the Metrics Repository

In [12]:
analysisResult_metRep = repository.load() \
                            .before(ResultKey.current_milli_time()) \
                            .getSuccessMetricsAsDataFrame()

analysisResult_metRep.show()

+-----------+--------------------+-------------------+------------------+-------------+-------------------+
|     entity|            instance|               name|             value| dataset_date|                tag|
+-----------+--------------------+-------------------+------------------+-------------+-------------------+
|     Column|           review_id|       Completeness|               1.0|1597344080757|general_electronics|
|     Column|           review_id|ApproxCountDistinct|         3010972.0|1597344080757|general_electronics|
|     Column|         customer_id|       Distinctness|0.6951804233214501|1597344080757|general_electronics|
|Mutlicolumn|helpful_votes,tot...|        Correlation|0.9936463809903863|1597344080757|general_electronics|
|    Dataset|                   *|               Size|         3120938.0|1597344080757|general_electronics|
|     Column|         star_rating|               Mean| 4.036143941340712|1597344080757|general_electronics|
|     Column|         star_r

The correlation between helpful and total votes is slightly higher for the higher-rated reviews (about 0.998) than for the lower-rated ones (about 0.987)!
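
As a quick cross-check of the numbers reported above, the two partitions should add up to the full dataset, their size-weighted mean rating should reproduce the overall mean, and the gap between the two correlations can be read off directly. The values below are copied by hand from the three `AnalysisRunner` outputs, and the variable names are only illustrative.

```python
# Metrics copied from the three AnalysisRunner outputs in this tutorial
overall = {"size": 3120938, "mean": 4.036143941340712, "corr": 0.9936463809903863}
lower = {"size": 782085, "mean": 1.846948861057302, "corr": 0.9870764816013522}
higher = {"size": 2338853, "mean": 4.768185089015855, "corr": 0.9976826161406824}

# The two filtered tables should together cover every row
combined_size = lower["size"] + higher["size"]

# Size-weighted average of the two partition means
combined_mean = (
    lower["size"] * lower["mean"] + higher["size"] * higher["mean"]
) / combined_size

# Difference between the two partition correlations
corr_gap = higher["corr"] - lower["corr"]

print(combined_size == overall["size"])
print(abs(combined_mean - overall["mean"]) < 1e-9)
print(round(corr_gap, 4))
```

Both checks pass for these figures, and the correlation gap comes out at roughly 0.01.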

Because we used a file-based metrics repository, all of these analysis results are now saved in your S3 bucket for future reference!