# Run HyUCC Algorithm - Example

This notebook shows how to run HyUCC, an algorithm for discovering unique column combinations (UCCs), on an example dataset.

In [1]:
# Use a Socrata dataset as the example input.

from openclean.data.source.socrata import Socrata
df = Socrata().dataset('bre9-aqqr').load()

In [2]:
df

Unnamed: 0,Report Date,FIPS,Locality,VDH Health District,Total Cases,Hospitalizations,Deaths
0,11/28/2020,51001,Accomack,Eastern Shore,1340,107,21
1,11/28/2020,51003,Albemarle,Thomas Jefferson,1896,100,27
2,11/28/2020,51005,Alleghany,Alleghany,316,19,7
3,11/28/2020,51007,Amelia,Piedmont,210,21,6
4,11/28/2020,51009,Amherst,Central Virginia,810,31,6
...,...,...,...,...,...,...,...
46279,02/27/2021,51800,Suffolk,Western Tidewater,6876,387,150
46280,02/27/2021,51810,Virginia Beach,Virginia Beach,30491,1291,300
46281,02/27/2021,51820,Waynesboro,Central Shenandoah,2126,67,30
46282,02/27/2021,51830,Williamsburg,Peninsula,524,24,9


## Run using local Java JRE

In [3]:
# Configure the runtime environment.

# When running the algorithms using the local Java installation, make sure
# that the environment variable METANOME_JARPATH references the Metanome.jar
# file on your machine. Alternatively, pass the path in an environment
# dictionary like the one below:

import openclean_metanome.config as config

env = {config.METANOME_JARPATH: '/path/to/Metanome.jar'}
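
The first option mentioned in the comments above, setting the environment variable itself, could look roughly like the sketch below. The variable name METANOME_JARPATH is taken from the comment in this notebook; the path is the same placeholder as above and has to be replaced with the real location of the jar file.

```python
import os

# Placeholder path; replace it with the location of Metanome.jar on your machine.
os.environ['METANOME_JARPATH'] = '/path/to/Metanome.jar'

# Show the value that processes started from this session will see.
print(os.environ['METANOME_JARPATH'])
```

This only affects the current Python process and anything it launches. Whether the `env` argument can then be omitted depends on the library falling back to the process environment, which the comment above suggests but which is treated as an assumption here.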

In [4]:
# Run the HyUCC algorithm on the downloaded dataset.

from openclean_metanome.algorithm.hyucc import hyucc

keys = hyucc(df, max_ucc_size=3, env=env)

Metanome Data Profiling Wrapper - Version 0.1.0

Initializing ...
Reading data and calculating plis ...
Sorting plis by number of clusters ...
Inverting plis ...
Extracting integer representations for the records ...
Investigating comparison suggestions ... 
Sorting clusters ...(86ms)
Running initial windows ...(140ms)
Moving window over clusters ... 
Inducing UCC candidates ...
Validating UCCs using plis ...
	Level 1: 4 elements; (V)(C)(G); 0 intersections; 0 validations; 0 invalid; 0 new candidates; --> 0 UCCs
	Level 2: 9 elements; (V)(C)(G); 4 intersections; 4 validations; 2 invalid; 7 new candidates; --> 2 UCCs
Investigating comparison suggestions ... 
Moving window over clusters ... 
Inducing UCC candidates ...
Validating UCCs using plis ...
	Level 3: 13 elements; (V)(-)(-); 13 intersections; 13 validations; 13 invalid; - new candidates; --> 0 UCCs
Translating UCC-tree into result format ...
... done! (2 UCCs)
Time: 631 ms
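
The log above validates candidates level by level, where a level corresponds to the number of columns in a combination. As a rough illustration of what is being searched for, the following brute-force sketch enumerates minimal unique column combinations up to a maximum size. It is not the HyUCC algorithm, which relies on sampling and position list indexes (plis) to avoid checking every candidate; the function name `minimal_uccs` and the toy data are hypothetical and not part of openclean or the Socrata dataset.

```python
from itertools import combinations


def minimal_uccs(columns, rows, max_size):
    """Brute-force search for minimal unique column combinations.

    columns: list of column names.
    rows: list of tuples, one value per column.
    max_size: largest number of columns in a candidate combination.
    """
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(columns)), size):
            # A superset of a known UCC is unique but not minimal; skip it.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            # The combination is unique if no projected row repeats.
            if len(set(projected)) == len(projected):
                found.append(combo)
    return [[columns[i] for i in combo] for combo in found]


# Hypothetical toy data, not taken from the dataset used in this notebook.
columns = ['day', 'region', 'code', 'cases']
rows = [
    ('d1', 'A', 1, 10),
    ('d1', 'B', 2, 10),
    ('d2', 'A', 1, 10),
    ('d2', 'B', 2, 12),
]

print(minimal_uccs(columns, rows, max_size=3))
# [['day', 'region'], ['day', 'code']]
```

The cost of this approach grows quickly with the number of columns and rows, which is why it is only suitable for cross-checking results on small samples.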



In [5]:
# Print each discovered unique column combination.
for ucc in keys:
    print(ucc)

['Report Date', 'FIPS']
['Report Date', 'Locality']
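
A reported combination can be double-checked directly with pandas: a set of columns is unique if no two rows share the same values in those columns. The sketch below uses a hypothetical helper `is_unique_combination` and made-up toy data; it is independent of openclean and only relies on `DataFrame.duplicated`.

```python
import pandas as pd


def is_unique_combination(frame, columns):
    """Return True if no two rows of `frame` agree on all of `columns`."""
    return not frame.duplicated(subset=columns).any()


# Hypothetical toy data, not taken from the dataset used in this notebook.
toy = pd.DataFrame({
    'day': ['d1', 'd1', 'd2', 'd2'],
    'region': ['A', 'B', 'A', 'B'],
    'cases': [10, 10, 12, 12],
})

print(is_unique_combination(toy, ['day']))            # False
print(is_unique_combination(toy, ['day', 'region']))  # True
```

Applied to the data frame loaded earlier, a call such as `is_unique_combination(df, ['Report Date', 'FIPS'])` would check the first combination printed above.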


## Run using local Docker instance

In [6]:
# Configure the runtime environment.

# When running the algorithms using Docker, make sure that the environment
# variable METANOME_WORKERS references a worker configuration file that
# assigns a Docker worker to the Docker image 'heikomueller/openclean-metanome:0.1.0'.
# You can also specify a different image using the environment
# variable METANOME_CONTAINER.

# An example Docker worker configuration file is included in the package:

import openclean_metanome.config as config

env = {
    config.METANOME_WORKERS: '../../config/docker_worker.yaml',
    config.METANOME_CONTAINER: 'heikomueller/openclean-metanome:0.1.0'
}

In [7]:
# Run the HyUCC algorithm on the downloaded dataset.

from openclean_metanome.algorithm.hyucc import hyucc

keys = hyucc(df, max_ucc_size=3, env=env)

Metanome Data Profiling Wrapper - Version 0.1.0

Initializing ...
Reading data and calculating plis ...
Sorting plis by number of clusters ...
Inverting plis ...
Extracting integer representations for the records ...
Investigating comparison suggestions ... 
Sorting clusters ...(129ms)
Running initial windows ...(141ms)
Moving window over clusters ... 
Inducing UCC candidates ...
Validating UCCs using plis ...
	Level 1: 4 elements; (V)(C)(G); 0 intersections; 0 validations; 0 invalid; 0 new candidates; --> 0 UCCs
	Level 2: 9 elements; (V)(C)(G); 4 intersections; 4 validations; 2 invalid; 7 new candidates; --> 2 UCCs
Investigating comparison suggestions ... 
Moving window over clusters ... 
Inducing UCC candidates ...
Validating UCCs using plis ...
	Level 3: 13 elements; (V)(-)(-); 13 intersections; 13 validations; 13 invalid; - new candidates; --> 0 UCCs
Translating UCC-tree into result format ...
... done! (2 UCCs)
Time: 720 ms



In [8]:
# Print each discovered unique column combination.
for ucc in keys:
    print(ucc)

['Report Date', 'FIPS']
['Report Date', 'Locality']
