# NOTE:  THIS NOTEBOOK WILL TAKE 5-10 MINUTES TO COMPLETE.

# PLEASE BE PATIENT.


# Analyze Data Quality with SageMaker Processing Jobs and Spark

A machine learning (ML) process typically consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, distributed data processing frameworks such as Spark are used to process and analyze data sets in order to detect data quality issues and prepare them for model training.  

In this notebook we'll use Amazon SageMaker Processing with a library called [**Deequ**](https://github.com/awslabs/deequ), and leverage the power of Spark with a managed SageMaker Processing Job to run our data processing workloads.

Here are some great resources on Deequ: 
* Blog Post:  https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
* Research Paper:  https://assets.amazon.science/4a/75/57047bd343fabc46ec14b34cdb3b/towards-automated-data-quality-management-for-machine-learning.pdf

![Deequ](img/deequ.png)

![](img/processing.jpg)
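To make "data quality" concrete, here is a minimal plain-Python sketch of two column-level metrics of the kind Deequ computes at scale on Spark: completeness (share of non-null values) and distinctness (distinct non-null values over the number of rows). This is an illustration only, not the Deequ or PyDeequ API; the function names and the sample values are made up.

```python
from typing import Optional, Sequence


def completeness(values: Sequence[Optional[str]]) -> float:
    """Fraction of entries that are not None."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def distinctness(values: Sequence[Optional[str]]) -> float:
    """Number of distinct non-null values divided by the number of rows."""
    if not values:
        return 0.0
    return len({v for v in values if v is not None}) / len(values)


# Hypothetical sample of a review_id column: one missing value, one duplicate.
review_ids = ["R1", "R2", None, "R2"]

print(completeness(review_ids))  # 0.75
print(distinctness(review_ids))  # 0.5
```

On a real dataset these would be computed by Spark across all rows rather than over an in-memory list.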

# Amazon Customer Reviews Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

### Dataset Columns:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital software and digital video games).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
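As a quick way to see how these columns fit together, the following sketch parses one tab-separated row with Python's standard `csv` module. The row is made up for illustration and is not taken from the real dataset; disabling quote handling is an assumption about how the files are formatted.

```python
import csv
import io

# Column names in the order listed above.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

# A made-up row for illustration only.
sample_tsv = "\t".join(COLUMNS) + "\n" + "\t".join([
    "US", "12345", "R0000000000001", "B000000001", "111111",
    "Example Game", "Digital_Video_Games", "5", "2",
    "3", "N", "Y", "Great", "Works as described.", "2015-08-31",
]) + "\n"

# QUOTE_NONE: treat quote characters in review text as ordinary characters
# (assumption: the TSV files do not use CSV-style quoting).
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Every field arrives as a string, so cast numeric columns explicitly.
star_rating = int(rows[0]["star_rating"])
print(rows[0]["product_category"], star_rating)
```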

In [1]:
%store -r ingest_create_athena_table_tsv_passed

In [2]:
try:
    ingest_create_athena_table_tsv_passed
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] YOU HAVE TO RUN THE NOTEBOOKS IN THE INGEST FOLDER FIRST. You did not register the TSV Data.')
    print('++++++++++++++++++++++++++++++++++++++++++++++')

In [3]:
print(ingest_create_athena_table_tsv_passed)

True


In [4]:
if not ingest_create_athena_table_tsv_passed:
    print('++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] YOU HAVE TO RUN THE NOTEBOOKS IN THE INGEST FOLDER FIRST. You did not register the TSV Data.')
    print('++++++++++++++++++++++++++++++++++++++++++++++')
else:
    print('[OK]')

[OK]


# Run the Analysis Job using a SageMaker Processing Job with Spark
Next, use the Amazon SageMaker Python SDK to submit a processing job that runs our Spark script in SageMaker's managed Spark container.

In [5]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

# Review the Spark preprocessing script.

In [6]:
!pygmentize preprocess-deequ-pyspark.py

from __future__ import print_function
from __future__ import unicode_literals

import time
import sys
import os
import shutil
import csv
import subprocess
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--no-deps', 'pydeequ==0.1.5'])
subprocess.check_call([sys.executable, '-m', 'pip', 'in

In [7]:
from sagemaker.spark.processing import PySparkProcessor

processor = PySparkProcessor(base_job_name='spark-amazon-reviews-analyzer',
                            role=role,
#                            py_version='py37',
#                            container_version='v1.0',
                            instance_count=1,
                            instance_type='ml.r5.2xlarge',
                            max_runtime_in_seconds=300)

In [8]:
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/


In [9]:
!aws s3 ls $s3_input_data

2020-11-12 03:50:07   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-11-12 03:50:09   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz


## Setup Output Data

In [10]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-analyzer-{}'.format(timestamp_prefix)
processing_job_name = 'amazon-reviews-spark-analyzer-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(processing_job_name))

Processing job name:  amazon-reviews-spark-analyzer-2020-11-28-03-36-09
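Job names built from a prefix plus a timestamp can get long. The sketch below wraps the same naming scheme in a helper that also validates the result. The 63-character limit and the allowed character pattern are stated here as an assumption based on the CreateProcessingJob API reference, so verify them against the current documentation; `make_job_name` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re
from time import gmtime, strftime

# Assumed constraint (check the CreateProcessingJob API reference):
# 1-63 characters, alphanumerics and hyphens, no leading or trailing hyphen.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix, now=None):
    """Append a UTC timestamp to `prefix` and validate the resulting name."""
    moment = gmtime() if now is None else now
    name = "{}-{}".format(prefix, strftime("%Y-%m-%d-%H-%M-%S", moment))
    if len(name) > 63 or not JOB_NAME_PATTERN.match(name):
        raise ValueError("Invalid processing job name: {!r}".format(name))
    return name


print(make_job_name("amazon-reviews-spark-analyzer"))
```

Failing early in the notebook is cheaper than waiting for the service to reject the request.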


In [11]:
s3_output_analyze_data = 's3://{}/{}/output'.format(bucket, output_prefix)

print(s3_output_analyze_data)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-analyzer-2020-11-28-03-36-09/output


## Start the Spark Processing Job

_Notes on Invoking from Lambda:_
* If we use the boto3 SDK instead (i.e., from a Lambda function), we need to copy the `preprocess.py` file to S3 ourselves and specify everything explicitly, including `--py-files`, etc.
* We would need to do the following before invoking the Lambda:
     !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/code/preprocess.py
     !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/py_files/preprocess.py
* Then reference the `s3://<location>` paths above in `--py-files`, etc.
* See Lambda example code in this same project for more details.

_Notes on not using ProcessingInput and Output:_
* Since Spark natively reads from and writes to S3 using `s3a://`, we can avoid the copy required by ProcessingInput and ProcessingOutput (FullyReplicated or ShardedByS3Key) and just specify the S3 input and output buckets/prefixes.
* See https://github.com/awslabs/amazon-sagemaker-examples/issues/994 for issues related to using `/opt/ml/processing/input/` and `output/`.
* If we use ProcessingInput, the data will be copied to each node, which we don't want in this case since Spark already handles data distribution.
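The notebook passes `s3://` URIs to the job, while the job log further down prints the same paths with an `s3a://` prefix, which suggests the script rewrites the scheme before handing paths to Spark. A minimal sketch of such a helper follows; the name `to_s3a` is hypothetical and the code is not taken from `preprocess-deequ-pyspark.py`.

```python
def to_s3a(uri):
    """Rewrite an s3:// URI to the s3a:// scheme used by Hadoop's S3A connector."""
    if uri.startswith("s3a://"):
        return uri
    if uri.startswith("s3://"):
        return "s3a://" + uri[len("s3://"):]
    raise ValueError("Not an S3 URI: {!r}".format(uri))


# Hypothetical bucket name for illustration.
print(to_s3a("s3://example-bucket/amazon-reviews-pds/tsv/"))
# s3a://example-bucket/amazon-reviews-pds/tsv/
```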

In [12]:
from sagemaker.processing import ProcessingOutput

processor.run(submit_app='preprocess-deequ-pyspark.py',
              submit_jars=['deequ-1.0.3-rc2.jar'],
              arguments=['s3_input_data', s3_input_data,
                         's3_output_analyze_data', s3_output_analyze_data,
              ],
              logs=True,
              wait=False
)


Job Name:  spark-amazon-reviews-analyzer-2020-11-28-03-36-09-337
Inputs:  [{'InputName': 'jars', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-analyzer-2020-11-28-03-36-09-337/input/jars', 'LocalPath': '/opt/ml/processing/input/jars', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-analyzer-2020-11-28-03-36-09-337/input/code/preprocess-deequ-pyspark.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  []
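The `arguments` list above alternates names and values (`'s3_input_data', <uri>, ...`). One way a script could read such a list is sketched below. This is an assumption about the argument handling, since the script listing above is truncated; `parse_named_args` is a hypothetical helper.

```python
def parse_named_args(argv):
    """Turn ['name1', 'value1', 'name2', 'value2', ...] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError("Expected name/value pairs, got: {!r}".format(argv))
    return dict(zip(argv[0::2], argv[1::2]))


# Inside the Spark script this would be parse_named_args(sys.argv[1:]).
# The bucket name below is a placeholder.
args = parse_named_args([
    "s3_input_data", "s3://example-bucket/amazon-reviews-pds/tsv/",
    "s3_output_analyze_data", "s3://example-bucket/output",
])
print(args["s3_input_data"])
```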


In [13]:
from IPython.display import display, HTML

processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(region, processing_job_name)))


In [14]:
from IPython.display import display, HTML

processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After a Few Minutes</b>'.format(region, processing_job_name)))


In [15]:
from IPython.display import display, HTML

s3_job_output_prefix = output_prefix

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, s3_job_output_prefix, region)))


# Monitor the Processing Job

In [16]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

print(processing_job_description)

{'ProcessingInputs': [{'InputName': 'jars', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-analyzer-2020-11-28-03-36-09-337/input/jars', 'LocalPath': '/opt/ml/processing/input/jars', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-analyzer-2020-11-28-03-36-09-337/input/code/preprocess-deequ-pyspark.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingJobName': 'spark-amazon-reviews-analyzer-2020-11-28-03-36-09-337', 'ProcessingResources': {'ClusterConfig': {'InstanceCount': 1, 'InstanceType': 'ml.r5.2xlarge', 'VolumeSizeInGB': 30}}, 'StoppingCondition': {'MaxRuntimeInSeconds': 300}, 'AppSpecification': {'ImageUri': '173754725891.dkr.ecr.us-eas
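The dictionary returned by `describe()` is long, and usually only a few fields matter while waiting for the job. The sketch below pulls out the job name and status from a description-shaped dict. It assumes the `ProcessingJobStatus` and `FailureReason` keys of the DescribeProcessingJob response; the helper names and the example dict are hypothetical.

```python
# Statuses after which the job will not change state again (assumed set).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def summarize_job(description):
    """Return a one-line summary from a DescribeProcessingJob-style dict."""
    name = description.get("ProcessingJobName", "<unknown>")
    status = description.get("ProcessingJobStatus", "<unknown>")
    line = "{}: {}".format(name, status)
    # FailureReason is only expected when the job did not succeed.
    if status == "Failed" and description.get("FailureReason"):
        line += " ({})".format(description["FailureReason"])
    return line


def is_finished(description):
    """True once the job has reached a terminal status."""
    return description.get("ProcessingJobStatus") in TERMINAL_STATUSES


# Hand-written example dict, not real output.
example = {"ProcessingJobName": "example-job", "ProcessingJobStatus": "InProgress"}
print(summarize_job(example), is_finished(example))
```

In the notebook, the same functions could be applied to `processing_job_description`.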

In [None]:
running_processor.wait()

.........................11-28 03:40 smspark.cli  INFO     Parsing arguments. argv: ['/usr/local/bin/smspark-submit', '--jars', '/opt/ml/processing/input/jars', '/opt/ml/processing/input/code/preprocess-deequ-pyspark.py', 's3_input_data', 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/', 's3_output_analyze_data', 's3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-analyzer-2020-11-28-03-36-09/output']
11-28 03:40 smspark.cli  INFO     Raw spark options before processing: {'jars': '/opt/ml/processing/input/jars', 'class_': None, 'py_files': None, 'files': None, 'verbose': False}
11-28 03:40 smspark.cli  INFO     App and app arguments: ['/opt/ml/processing/input/code/preprocess-deequ-pyspark.py', 's3_input_data', 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/', 's3_output_analyze_data', 's3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-analyzer-2020-11-28-03-36-09/output']
11-28 03:40 smspark.cli  INFO     R

20/11/28 03:40:18 INFO namenode.NameNode: NameNode RPC up at: algo-1/10.0.75.4:8020
20/11/28 03:40:18 INFO namenode.FSNamesystem: Starting services required for active state
20/11/28 03:40:18 INFO namenode.FSDirectory: Initializing quota with 4 thread(s)
20/11/28 03:40:18 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
20/11/28 03:40:18 INFO namenode.FSDirectory: Quota initialization completed in 5 milliseconds
name space=1
storage space=0
storage types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0
20/11/28 03:40:18 INFO server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
20/11/28 03:40:18 INFO blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
20/11/28 03:40:18 INFO http.HttpRequestLog: Http request log for http.requests.resou

  Downloading https://files.pythonhosted.org/packages/ef/d5/c0c33ca15e31062220ac5964f3492409eaf90a5cf5399503cd8264f2f8e9/botocore-1.19.25-py2.py3-none-any.whl (6.9MB)
Installing collected packages: botocore, boto3
  Found existing installation: botocore 1.17.60
    Uninstalling botocore-1.17.60:
      Successfully uninstalled botocore-1.17.60
  Found existing installation: boto3 1.14.58
    Uninstalling boto3-1.14.58:
      Successfully uninstalled boto3-1.14.58
Successfully installed boto3-1.16.17 botocore-1.19.25
s3a://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/
s3a://sagemaker-us-east-1-835319576252/amazon-reviews-spark-analyzer-2020-11-28-03-36-09/output
20/11/28 03:40:33 INFO spark.SparkContext: Running Spark version 2.4.5-amzn-0
20/11/28 03:40:33 INFO spark.SparkContext: Submitted application: PySparkAmazonReviewsAnalyzer
20/11/28 03:40:33 INFO spark.SecurityManager: Changing view a

20/11/28 03:40:40 INFO hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=10.0.75.4:50010 for /user/root/.sparkStaging/application_1606534817904_0001/__spark_libs__7385830466042033714.zip
20/11/28 03:40:40 INFO datanode.DataNode: Receiving BP-84689111-10.0.75.4-1606534815776:blk_1073741825_1001 src: /10.0.75.4:58714 dest: /10.0.75.4:50010
20/11/28 03:40:41 INFO DataNode.clienttrace: src: /10.0.75.4:58714, dest: /10.0.75.4:50010, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_135285079_19, offset: 0, srvID: 3530ccd7-d051-4021-9716-7b5582e3e0a3, blockid: BP-84689111-10.0.75.4-1606534815776:blk_1073741825_1001, duration: 235742840
20/11/28 03:40:41 INFO datanode.DataNode: PacketResponder: BP-84689111-10.0.75.4-1606534815776:blk_1073741825_1001, type=LAST_IN_PIPELINE terminating
20/11/28 03:40:41 INFO hdfs.StateChange: BLOCK* allocate blk_1073741826_1002, replicas=10.0.75.4:50010 for /user/root/.sparkStaging/application

20/11/28 03:40:49 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32813.
20/11/28 03:40:49 INFO netty.NettyBlockTransferService: Server created on 10.0.75.4:32813
20/11/28 03:40:49 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/11/28 03:40:49 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.75.4, 32813, None)
20/11/28 03:40:49 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.0.75.4:32813 with 912.3 MB RAM, BlockManagerId(driver, 10.0.75.4, 32813, None)
20/11/28 03:40:49 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.75.4, 32813, None)
20/11/28 03:40:49 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.75.4, 32813, None)
20/11/28 03:40:49 INFO cluster.YarnSched

20/11/28 03:41:00 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3824 ms on algo-1 (executor 1) (1/2)
20/11/28 03:41:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4061 ms on algo-1 (executor 1) (2/2)
20/11/28 03:41:00 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/11/28 03:41:00 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (collect at AnalysisRunner.scala:323) finished in 4.111 s
20/11/28 03:41:00 INFO scheduler.DAGScheduler: looking for newly runnable stages
20/11/28 03:41:00 INFO scheduler.DAGScheduler: running: Set()
20/11/28 03:41:00 INFO scheduler.DAGScheduler: waiting: Set()
20/11/28 03:41:00 INFO scheduler.DAGScheduler: failed: Set()
20/11/28 03:41:00 INFO adaptive.CoalesceShufflePartitions: advisoryTargetPostShuffleInputSize: 67108864, targetPostShuffleInputSize 48.
20/11/28 03:41:00 INFO

[/var/log/yarn/userlogs/application_1606534817904_0001/container_1606534817904_0001_01_000002/stderr] 20/11/28 03:41:09 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 2 local blocks and 0 remote blocks
[/var/log/yarn/userlogs/application_1606534817904_0001/container_1606534817904_0001_01_000002/stderr] 20/11/28 03:41:09 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
[/var/log/yarn/userlogs/application_1606534817904_0001/container_1606534817904_0001_01_000002/stderr] 20/11/28 03:41:09 INFO codegen.CodeGenerator: Code generated in 10.3398 ms
[/var/log/yarn/userlogs/application_1606534817904_0001/container_1606534817904_0001_01_000002/stderr] 20/11/28 03:41:09 INFO executor.Executor: Finished task 0.0 in stage 11.0 (TID 16). 1937 bytes result sent to driver
[/var/log/yarn/userlogs/application_1606534817904_0001/container_1606534817904_0001_01_000002/stderr] 20/11/28 03:41:09 INFO executor.

20/11/28 03:41:21 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 24.0 (TID 63) in 3500 ms on algo-1 (executor 1) (1/2)
20/11/28 03:41:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 24.0 (TID 62) in 4323 ms on algo-1 (executor 1) (2/2)
20/11/28 03:41:22 INFO cluster.YarnScheduler: Removed TaskSet 24.0, whose tasks have all completed, from pool
20/11/28 03:41:22 INFO scheduler.DAGScheduler: ShuffleMapStage 24 (collect at AnalysisRunner.scala:323) finished in 4.335 s
20/11/28 03:41:22 INFO scheduler.DAGScheduler: looking for newly runnable stages
20/11/28 03:41:22 INFO scheduler.DAGScheduler: running: Set()
20/11/28 03:41:22 INFO scheduler.DAGScheduler: waiting: Set()
20/11/28 03:41:22 INFO scheduler.DAGScheduler: failed: Set()
20/11/28 03:41:22 INFO adaptive.CoalesceShufflePartitions: advisoryTargetPostShuffleInputSize: 67108864, targetPostShuffleInputSize 566.
20/11/28 03:41:2

20/11/28 03:41:27 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 30.0 (TID 69) in 2914 ms on algo-1 (executor 1) (1/2)
20/11/28 03:41:28 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 30.0 (TID 68) in 3422 ms on algo-1 (executor 1) (2/2)
20/11/28 03:41:28 INFO cluster.YarnScheduler: Removed TaskSet 30.0, whose tasks have all completed, from pool
20/11/28 03:41:28 INFO scheduler.DAGScheduler: ShuffleMapStage 30 (countByKey at ColumnProfiler.scala:605) finished in 3.439 s
20/11/28 03:41:28 INFO scheduler.DAGScheduler: looking for newly runnable stages
20/11/28 03:41:28 INFO scheduler.DAGScheduler: running: Set()
20/11/28 03:41:28 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 31)
20/11/28 03:41:28 INFO scheduler.DAGScheduler: failed: Set()
20/11/28 03:41:28 INFO scheduler.DAGScheduler: Submitting ResultStage 31 (ShuffledRDD[66] at countByKey at ColumnProfiler.scala:605), which has n

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._

# Inspect the Processed Output 

## These are the quality checks on our dataset.

## _The next cells will not work properly until the job completes above._

In [None]:
!aws s3 ls --recursive $s3_output_analyze_data/

## Copy the Output from S3 to Local
* dataset-metrics/
* constraint-checks/
* success-metrics/
* constraint-suggestions/


In [None]:
!aws s3 cp --recursive $s3_output_analyze_data ./amazon-reviews-spark-analyzer/ --exclude="*" --include="*.csv"

## Analyze Constraint Checks

In [None]:
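import glob
import pandas as pd
import os

def load_dataset(path, sep, header):
    # Read every CSV part file under `path` and stack them into one DataFrame.
    files = glob.glob(os.path.join(path, '*.csv'))
    if not files:
        # pd.concat() fails with an unhelpful error on an empty list, so fail early.
        raise FileNotFoundError('No CSV files found in {}. Has the processing job completed?'.format(path))
    data = pd.concat([pd.read_csv(f, sep=sep, header=header) for f in files], ignore_index=True)

    return data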

In [None]:
df_constraint_checks = load_dataset(path='./amazon-reviews-spark-analyzer/constraint-checks/', sep='\t', header=0)
df_constraint_checks[['check', 'constraint', 'constraint_status', 'constraint_message']]
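When there are many checks, it helps to look only at the constraints that did not pass. The sketch below filters row dictionaries by `constraint_status`. It assumes the status values are the strings `Success` and `Failure`; the sample rows, including the constraint names and the message, are hand-written for illustration and are not output from this job.

```python
def failed_constraints(records):
    """Keep only the rows whose constraint did not succeed."""
    return [r for r in records if r.get("constraint_status") != "Success"]


# Hand-written rows for illustration; real rows come from the job output.
sample = [
    {"check": "Review Check", "constraint": "example size constraint",
     "constraint_status": "Success", "constraint_message": ""},
    {"check": "Review Check", "constraint": "example completeness constraint",
     "constraint_status": "Failure", "constraint_message": "example failure message"},
]

for row in failed_constraints(sample):
    print(row["constraint"], "->", row["constraint_message"])
```

With the DataFrame loaded above, the input rows could come from `df_constraint_checks.to_dict("records")`.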

## Analyze Dataset Metrics

In [None]:
df_dataset_metrics = load_dataset(path='./amazon-reviews-spark-analyzer/dataset-metrics/', sep='\t', header=0)
df_dataset_metrics

## Analyze Success Metrics

In [None]:
df_success_metrics = load_dataset(path='./amazon-reviews-spark-analyzer/success-metrics/', sep='\t', header=0)
df_success_metrics

## Analyze Constraint Suggestions

In [None]:
# df_constraint_suggestions = load_dataset(path='./amazon-reviews-spark-analyzer/constraint-suggestions/', sep='\t', header=0)
# df_constraint_suggestions.columns=['column_name', 'description', 'code']
# df_constraint_suggestions

# Save for the Next Notebook(s)

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();