# KLL Example

Here is a basic example of running a `ColumnProfiler` with [KLL sketches](https://arxiv.org/abs/1603.05346).

We'll start by creating a Spark session and a small sample dataframe.

In [1]:
import pydeequ

import sagemaker_pyspark
from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

In [2]:
df = spark.sparkContext.parallelize([
    Row(idx=1, name="Thingy A", description="awesome thing.", rating="high", units=0),
    Row(idx=2, name="Thingy B", description="available at http://thingb.com", rating=None, units=0),
    Row(idx=3, name=None, description=None, rating="low", units=5),
    Row(idx=4, name="Thingy D", description="checkout https://thingd.ca", rating="low", units=10),
    Row(idx=5, name="Thingy E", description=None, rating="high", units=12)]).toDF()

In [3]:
from pydeequ.profiles import ColumnProfilerRunner
from pydeequ.analyzers import KLLParameters

# KLLParameters(spark, sketchSize, shrinkingFactor, numberOfBuckets)
result = ColumnProfilerRunner(spark) \
            .onData(df) \
            .withKLLProfiling() \
            .setKLLParameters(KLLParameters(spark, 2, 0.64, 2)) \
            .run()
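
The last argument to `KLLParameters` is the number of buckets used to summarize each numeric column. As a rough point of comparison, the sketch below computes an exact equal-width histogram for the `units` values in plain Python. It is only an illustration: the helper `equal_width_buckets` is hypothetical and not part of PyDeequ, and it assumes buckets of equal width between the column minimum and maximum.

```python
def equal_width_buckets(values, num_buckets):
    """Exact equal-width histogram as a list of (low, high, count) tuples.

    Hypothetical helper for illustration; assumes at least two distinct values.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp the index so the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]


# The `units` column from the sample dataframe above.
units = [0, 0, 5, 10, 12]
buckets = equal_width_buckets(units, 2)
for low, high, count in buckets:
    print(f"[{low}, {high}]: {count}")
```

A KLL sketch is an approximate summary, so with a sketch size as small as the one used here, the bucket counts reported by the profiler may differ from these exact counts.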

In [18]:
import json

for col, profile in result.profiles.items():
    print(f'Column: {col}')
    
    if isinstance(profile, pydeequ.profiles.NumericColumnProfile):  
        d = {}
        d['minimum'] = profile.minimum
        d['maximum'] = profile.maximum
        d['mean'] = profile.mean
        d['standard_deviation'] = profile.stdDev
        d['distribution'] = {}
        d['distribution']['KLL'] = {}
        d['distribution']['KLL']['buckets'] = {}
        for b, bucket in enumerate(profile.kll.buckets):
            d['distribution']['KLL']['buckets'][f'bucket_{b}'] = {
                'lowValue': bucket.lowValue,
                'highValue': bucket.highValue,
                'count': bucket.count
            }
        d['distribution']['KLL']['sketch'] = {
            'c': profile.kll.parameters[0],
            'k': profile.kll.parameters[1]
        }
        d['distribution']['KLL']['data'] = profile.kll.data

        print(json.dumps(d, indent=2))
 
    else: 
        for i in profile.histogram: 
            print(f"{i.value} occurred {i.count} times (ratio is: {i.ratio})")
        
    print('\n')


Column: name
NullValue occurred 1 times (ratio is: 0.2)
Thingy E occurred 1 times (ratio is: 0.2)
Thingy D occurred 1 times (ratio is: 0.2)
Thingy B occurred 1 times (ratio is: 0.2)
Thingy A occurred 1 times (ratio is: 0.2)


Column: description
awesome thing. occurred 1 times (ratio is: 0.2)
available at http://thingb.com occurred 1 times (ratio is: 0.2)
NullValue occurred 2 times (ratio is: 0.4)
checkout https://thingd.ca occurred 1 times (ratio is: 0.2)


Column: rating
NullValue occurred 1 times (ratio is: 0.2)
low occurred 2 times (ratio is: 0.4)
high occurred 2 times (ratio is: 0.4)


Column: units
{
  "minimum": 0.0,
  "maximum": 12.0,
  "mean": 5.4,
  "standard_deviation": 4.963869458396343,
  "distribution": {
    "KLL": {
      "buckets": {
        "bucket_0": {
          "lowValue": 0.0,
          "highValue": 6.0,
          "count": 4
        },
        "bucket_1": {
          "lowValue": 6.0,
          "highValue": 12.0,
          "count": 1
        }
      },
      "sketc