# Constraint Suggestions Basic Tutorial

This Jupyter notebook is a basic tutorial on using PyDeequ's Constraint Suggestions module.

In [1]:
from pyspark.sql import SparkSession, Row, DataFrame
import json
import pandas as pd
import sagemaker_pyspark

import pydeequ

# Put the SageMaker Spark jars on the driver classpath.
classpath = ":".join(sagemaker_pyspark.classpath_jars())

# Pull the Deequ jar from Maven and exclude the f2j dependency.
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

### We will use the Electronics subset of the Amazon Product Reviews dataset.

In [2]:
df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

df.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)



In [3]:
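from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

# Profile the data and apply the default set of suggestion rules.
suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()

# The result is a JSON-serializable dict of suggested constraints.
print(json.dumps(suggestionResult, indent=2))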

{
  "constraint_suggestions": [
    {
      "constraint_name": "CompletenessConstraint(Completeness(review_id,None))",
      "column_name": "review_id",
      "current_value": "Completeness: 1.0",
      "description": "'review_id' is not null",
      "suggesting_rule": "CompleteIfCompleteRule()",
      "rule_description": "If a column is complete in the sample, we suggest a NOT NULL constraint",
      "code_for_constraint": ".isComplete(\"review_id\")"
    },
    {
      "constraint_name": "UniquenessConstraint(Uniqueness(List(review_id),None))",
      "column_name": "review_id",
      "current_value": "ApproxDistinctness: 0.9647650802419017",
      "description": "'review_id' is unique",
      "suggesting_rule": "UniqueIfApproximatelyUniqueRule()",
      "rule_description": "If the ratio of approximate num distinct values in a column is close to the number of records (within the error of the HLL sketch), we suggest a UNIQUE constraint",
      "code_for_constraint": ".isUnique(\"review

### For more information, see the full list of suggestions in `docs/suggestions.md`.
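
The suggestions printed above can also be processed in plain Python rather than read as raw JSON. The following is a minimal sketch, assuming the result is a dict with a `constraint_suggestions` list whose entries carry the fields shown in the output. The `sample_result` dict is copied by hand from the first entry of that output so the snippet runs without Spark, and `suggestions_by_column` is a hypothetical helper, not part of PyDeequ.

```python
from collections import defaultdict

# Hand-copied from the first entry of the output above (assumption: the
# real suggestionResult has this same shape).
sample_result = {
    "constraint_suggestions": [
        {
            "constraint_name": "CompletenessConstraint(Completeness(review_id,None))",
            "column_name": "review_id",
            "current_value": "Completeness: 1.0",
            "description": "'review_id' is not null",
            "suggesting_rule": "CompleteIfCompleteRule()",
            "rule_description": "If a column is complete in the sample, we suggest a NOT NULL constraint",
            "code_for_constraint": ".isComplete(\"review_id\")",
        }
    ]
}


def suggestions_by_column(result):
    """Group (description, code) pairs by column name (hypothetical helper)."""
    grouped = defaultdict(list)
    for suggestion in result.get("constraint_suggestions", []):
        grouped[suggestion["column_name"]].append(
            (suggestion["description"], suggestion["code_for_constraint"])
        )
    return dict(grouped)


grouped = suggestions_by_column(sample_result)

# Print one line per suggestion: column, description, and the check snippet.
for column, items in grouped.items():
    for description, code in items:
        print(f"{column}: {description} -> {code}")
```

In the notebook, `suggestionResult` could be passed in place of `sample_result`. Grouping by column makes it easier to review several suggestions for the same column (such as `review_id` above) together before deciding which ones to turn into checks.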