# Single Column Profiling Example

We are often faced with large, raw datasets and struggle to make sense of the data. A common example is being handed a huge CSV file whose contents we want to understand and clean. PyDeequ supports single-column profiling of such data, and its implementation scales to large datasets with billions of rows. The following example showcases the basic usage of this profiling functionality.

Assume we have raw data that is string-typed (such as the data you would get from a CSV file). For the sake of simplicity, we use the following toy data in this example:

In [1]:
import pydeequ

import sagemaker_pyspark
from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

In [2]:
df = spark.sparkContext.parallelize([
    Row(productName="thingA", totalNumber="13.0", status="IN_TRANSIT", valuable="true"),
    Row(productName="thingA", totalNumber="5", status="DELAYED", valuable="false"),
    Row(productName="thingB", totalNumber=None, status="DELAYED", valuable=None),
    Row(productName="thingC", totalNumber=None, status="IN_TRANSIT", valuable="false"),
    Row(productName="thingD", totalNumber="1.0", status="DELAYED", valuable="true"),
    Row(productName="thingC", totalNumber="7.0", status="UNKNOWN", valuable=None),
    Row(productName="thingC", totalNumber="20", status="UNKNOWN", valuable=None),
    Row(productName="thingE", totalNumber="20", status="DELAYED", valuable="false")]).toDF()

It only takes a single method invocation to make **PyDeequ** profile this data. Note that it executes three passes over the data and avoids any shuffles in order to scale easily to large data.

In [6]:
from pydeequ.profiles import ColumnProfilerRunner

result = ColumnProfilerRunner(spark) \
            .onData(df) \
            .run()

As a result, we get a profile for each column in the data, which allows us to inspect the completeness of the column, the approximate number of distinct values and the inferred datatype.

For our toy data, we get the following profiling results. Note that **PyDeequ** detected that `totalNumber` is a fractional column (and could be cast to a float or double type) and that `valuable` is a boolean column.

In [12]:
for col, profile in result.profiles.items():
    print(f'Column \'{col}\'')
    print('\t',f'completeness: {profile.completeness}')
    print('\t',f'approximate number of distinct values: {profile.approximateNumDistinctValues}')
    print('\t',f'datatype: {profile.dataType}')

Column 'productName'
	 completeness: 1.0
	 approximate number of distinct values: 5
	 datatype: String
Column 'status'
	 completeness: 1.0
	 approximate number of distinct values: 3
	 datatype: String
Column 'totalNumber'
	 completeness: 0.75
	 approximate number of distinct values: 5
	 datatype: Fractional
Column 'valuable'
	 completeness: 0.625
	 approximate number of distinct values: 2
	 datatype: Boolean
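
To make the completeness and distinct-value figures above concrete, here is a minimal plain-Python sketch that recomputes them for the toy data without Spark. It is only an illustration of what the metrics mean, not how PyDeequ computes them: the helper name `summarize` is hypothetical, and the distinct count here is exact, whereas PyDeequ reports an approximation that happens to agree on such a small dataset.

```python
# Toy data from above as plain tuples; None marks a missing value.
columns = ("productName", "totalNumber", "status", "valuable")
rows = [
    ("thingA", "13.0", "IN_TRANSIT", "true"),
    ("thingA", "5", "DELAYED", "false"),
    ("thingB", None, "DELAYED", None),
    ("thingC", None, "IN_TRANSIT", "false"),
    ("thingD", "1.0", "DELAYED", "true"),
    ("thingC", "7.0", "UNKNOWN", None),
    ("thingC", "20", "UNKNOWN", None),
    ("thingE", "20", "DELAYED", "false"),
]


def summarize(rows, columns):
    """Hypothetical helper: map each column to (completeness, exact distinct count).

    Completeness is the fraction of rows with a non-null value; the distinct
    count considers non-null values only.
    """
    summary = {}
    for index, name in enumerate(columns):
        present = [row[index] for row in rows if row[index] is not None]
        summary[name] = (len(present) / len(rows), len(set(present)))
    return summary


summary = summarize(rows, columns)
for name, (completeness, distinct) in summary.items():
    print(f"{name}: completeness={completeness}, distinct values={distinct}")
```

For example, `totalNumber` has six non-null entries out of eight rows, which gives the completeness of 0.75 shown in the profile.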


For numeric columns, we get an extended profile which also contains descriptive statistics.

For the `totalNumber` column we can inspect its minimum, maximum, mean and standard deviation:

In [25]:
totalNumber_profile = result.profiles['totalNumber']

print("Statistics of 'totalNumber':")
print('\t',f"minimum: {totalNumber_profile.minimum}")
print('\t',f"maximum: {totalNumber_profile.maximum}")
print('\t',f"mean: {totalNumber_profile.mean}")
print('\t',f"standard deviation: {totalNumber_profile.stdDev}")

Statistics of 'totalNumber':
	 minimum: 1.0
	 maximum: 20.0
	 mean: 11.0
	 standard deviation: 7.280109889280518
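
As a sanity check, the statistics above can be reproduced with Python's standard `statistics` module. This sketch assumes the six non-null `totalNumber` strings are simply parsed as floats. The reported standard deviation matches the population formula (dividing by n) rather than the sample formula (dividing by n - 1).

```python
import statistics

# Non-null totalNumber values from the toy data, parsed from their string form.
values = [float(v) for v in ["13.0", "5", "1.0", "7.0", "20", "20"]]

minimum = min(values)
maximum = max(values)
mean = statistics.mean(values)
std_dev = statistics.pstdev(values)        # population standard deviation (divide by n)
sample_std_dev = statistics.stdev(values)  # sample standard deviation (divide by n - 1)

print(f"minimum: {minimum}")
print(f"maximum: {maximum}")
print(f"mean: {mean}")
print(f"population standard deviation: {std_dev}")
print(f"sample standard deviation: {sample_std_dev}")
```

The population value agrees with the profile output, while the sample value is noticeably larger on a dataset this small.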


For columns with a low number of distinct values, we collect the full value distribution. Here are the exact counts and ratios of the values in the `status` column:

In [36]:
status_profile = result.profiles['status']

print('Value distribution in \'status\':')
for unique_entry in status_profile.histogram: 
    print('\t',f"{unique_entry.value} occurred {unique_entry.count} times (ratio is {unique_entry.ratio})")

Value distribution in 'status':
	 IN_TRANSIT occurred 2 times (ratio is 0.25)
	 UNKNOWN occurred 2 times (ratio is 0.25)
	 DELAYED occurred 4 times (ratio is 0.5)
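
The same distribution can be cross-checked in plain Python with `collections.Counter`. This is a small illustrative sketch over the `status` values of the toy data, not a description of how PyDeequ builds its histogram; the name `distribution` is chosen here for illustration.

```python
from collections import Counter

# status values of the eight toy rows, in row order.
statuses = [
    "IN_TRANSIT", "DELAYED", "DELAYED", "IN_TRANSIT",
    "DELAYED", "UNKNOWN", "UNKNOWN", "DELAYED",
]

counts = Counter(statuses)
total = len(statuses)

# Map each distinct value to (absolute count, ratio of all rows).
distribution = {value: (count, count / total) for value, count in counts.items()}

for value, (count, ratio) in distribution.items():
    print(f"{value} occurred {count} times (ratio is {ratio})")
```

The ratios sum to 1.0 because `status` is fully populated. For a column with missing values, whether nulls get their own bucket depends on the profiler, so check that before comparing ratios.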
