# Profiles Basic Tutorial

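This Jupyter notebook is a basic tutorial on PyDeequ's Profiles module, which computes per-column summary statistics for a Spark DataFrame: completeness, approximate distinct counts, data types, and, for numeric columns, minimum, maximum, mean, and standard deviation.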

In [1]:
from pyspark.sql import SparkSession, Row, DataFrame
import json
import pandas as pd
import sagemaker_pyspark

import pydeequ

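# Build the driver classpath from the JARs bundled with sagemaker_pyspark
classpath = ":".join(sagemaker_pyspark.classpath_jars())

# Start a Spark session that pulls in Deequ through its Maven coordinates
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())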

### We will be using the Amazon Product Reviews dataset, specifically the Electronics subset.

In [2]:
df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

df.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)



In [4]:
from pydeequ.profiles import ColumnProfilerRunner

result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()
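
To make the metrics the profiler reports for each column more concrete (completeness, distinct count, minimum, maximum, mean, standard deviation), here is a minimal pure-Python sketch that computes them on a small in-memory column. It is an illustration only, not how Deequ computes them: Deequ runs over Spark and reports an approximate distinct count, whereas this sketch counts distinct values exactly. The name `profile_column`, the sample data, and the choice of the population standard deviation are all assumptions made for the example.

```python
import math
from typing import Optional, Sequence


def profile_column(values: Sequence[Optional[float]]) -> dict:
    """Toy, exact version of the per-column metrics (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    # Completeness: fraction of rows whose value is not null
    completeness = len(non_null) / len(values) if values else 0.0
    stats = {
        "completeness": completeness,
        # Exact distinct count; the real profiler reports an approximation
        "distinct": len(set(non_null)),
    }
    if non_null:
        mean = sum(non_null) / len(non_null)
        # Population variance (an assumption for this sketch)
        variance = sum((v - mean) ** 2 for v in non_null) / len(non_null)
        stats.update(
            min=min(non_null),
            max=max(non_null),
            mean=mean,
            std_dev=math.sqrt(variance),
        )
    return stats


# Made-up ratings with one missing value
star_ratings = [5, 4, None, 1, 5]
stats = profile_column(star_ratings)
print(stats)
```

In this made-up column, one of five values is missing, so the completeness is 0.8, and the numeric statistics are computed over the four remaining values.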

In [5]:
for col, profile in result.profiles.items():
    print(profile)

Column review_id:
completeness: 1.0
approx distinct: 3010972
datatype: String

Statistics of customer_id
Min:10005.0
Max:53096582.0
Mean:28806032.68895954
StandardDeviation:15415072.111267326

Column review_date:
completeness: 1.0
approx distinct: 5898
datatype: Unknown

Statistics of helpful_votes
Min:0.0
Max:12786.0
Mean:1.865194053838942
StandardDeviation:21.296393520562624

Statistics of star_rating
Min:1.0
Max:5.0
Mean:4.036143941340712
StandardDeviation:1.3866747032700206

Statistics of year
Min:1999.0
Max:2015.0
Mean:2012.8595236432125
StandardDeviation:2.464162689284542

Column product_title:
completeness: 1.0
approx distinct: 164112
datatype: String

Column review_headline:
completeness: 0.9999987183340393
approx distinct: 1694860
datatype: String

Column product_id:
completeness: 1.0
approx distinct: 169835
datatype: String

Statistics of total_votes
Min:0.0
Max:12944.0
Mean:2.3798239503636407
StandardDeviation:22.457108543167916

Statistics of product_parent
Min:6478.0
Max:9
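
The printed profiles can be hard to scan across many columns. One option is to flatten them into rows for a table. The sketch below assumes each profile object exposes `completeness`, `approximateNumDistinctValues`, and `dataType` attributes; that is an assumption about PyDeequ's profile classes and should be checked against the installed version. `profiles_to_rows` is a hypothetical helper, and `SimpleNamespace` objects (filled with two values copied from the output above) stand in for real profiles so the example runs without Spark.

```python
from types import SimpleNamespace


def profiles_to_rows(profiles):
    """Flatten a {column name: profile} mapping into a list of dicts."""
    rows = []
    for column, profile in profiles.items():
        rows.append({
            "column": column,
            # getattr with a default tolerates profiles that lack an attribute
            "completeness": getattr(profile, "completeness", None),
            "approx_distinct": getattr(profile, "approximateNumDistinctValues", None),
            "datatype": getattr(profile, "dataType", None),
        })
    # Least complete columns first; rows without a completeness value go last
    return sorted(
        rows,
        key=lambda r: (r["completeness"] is None, r["completeness"] or 0.0),
    )


# Stand-ins for real profile objects (assumed attribute names)
stand_ins = {
    "review_id": SimpleNamespace(
        completeness=1.0,
        approximateNumDistinctValues=3010972,
        dataType="String",
    ),
    "review_headline": SimpleNamespace(
        completeness=0.9999987183340393,
        approximateNumDistinctValues=1694860,
        dataType="String",
    ),
}

rows = profiles_to_rows(stand_ins)
for row in rows:
    print(row)
```

If the attribute names hold for the installed version, `profiles_to_rows(result.profiles)` could be passed to `pd.DataFrame(...)` to view the summary as a table in the notebook.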

### For more information, see the full list of profiles in `docs/profiles.md`.