## Profiler - openclean_pattern

In this notebook we create a new custom profiler for openclean that returns patterns for each column, using a pipeline we define in **openclean_pattern**.

Let's first set up a pattern resolver pipeline. Our pipeline will consist of the following stages:
1. Tokenization + Encoding
2. Collection
3. Alignment
4. Regex generation

### Setting up

In [1]:
# A tokenizer that identifies address designators (e.g. Ave, Blvd), secondary unit
# descriptors (e.g. Apt, Unit), and default types (alpha, alphanum, numeric, etc.)

from openclean_pattern.datatypes.resolver import AddressDesignatorResolver, DefaultTypeResolver
from openclean_pattern.tokenize.regex import RegexTokenizer

types_resolver = DefaultTypeResolver(interceptors=AddressDesignatorResolver())
tokenizer = RegexTokenizer(type_resolver=types_resolver)
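
To make the tokenization and encoding stage concrete, here is a minimal pure-Python sketch of the idea. It is not the `RegexTokenizer` implementation: the `encode` function and the `STREET_DESIGNATORS` table are hypothetical, and the labels simply mimic the ones that appear in the output later in this notebook.

```python
import re

# Hypothetical lookup table (an assumption for this sketch); the library
# ships its own vocabulary of address designators.
STREET_DESIGNATORS = {"street", "st", "avenue", "ave", "blvd", "boulevard"}


def encode(value):
    """Split a string into tokens and map each token to a coarse type label."""
    # Runs of letters/digits, runs of whitespace, or single punctuation marks.
    tokens = re.findall(r"[A-Za-z0-9]+|\s+|[^A-Za-z0-9\s]", value)
    labels = []
    for tok in tokens:
        if tok.isspace():
            labels.append("\\S")
        elif tok.lower() in STREET_DESIGNATORS:
            # Domain-specific types are checked before the default types.
            labels.append("STREET")
        elif tok.isdigit():
            labels.append("NUMERIC")
        elif tok.isalpha():
            labels.append("ALPHA")
        elif tok.isalnum():
            labels.append("ALPHANUM")
        else:
            labels.append("PUNC")
    return labels


print(" ".join(encode("23 Hoyt Street")))  # NUMERIC \S ALPHA \S STREET
```

Checking the designator table before the default types plays the same role as passing `AddressDesignatorResolver` as an interceptor: specific types get the first chance to claim a token.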

In [2]:
# The collection stage groups values with similar types, e.g. all alphas could form one group
# and all digits another. A pattern is then generated per group.

from openclean_pattern.collect.cluster import Cluster

# We use the tree edit distance (TED) as the distance metric and require a minimum of
# 3 samples (DBSCAN's min_samples). The Cluster collector is an implementation of the
# DBSCAN clusterer.
collector = Cluster(dist='TED', min_samples=3)
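
As a rough illustration of density-based grouping, the sketch below runs a small pure-Python DBSCAN over sequences of type labels. Two things here are assumptions that do not come from the library: a plain Levenshtein distance over flat label sequences stands in for the tree edit distance, and `eps=2` is an arbitrary neighborhood radius chosen for the example. The helper names `edit_distance` and `dbscan` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of type labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def dbscan(rows, eps=2, min_samples=3):
    """Return one cluster label per row; -1 marks noise."""
    n = len(rows)
    # A point's neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if edit_distance(rows[i], rows[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


address = ["NUMERIC", "\\S", "ALPHA", "\\S", "STREET"]
rows = [address, address, address, ["NUMERIC", "\\S", "STREET"], ["ALPHA"]]
print(dbscan(rows))  # [0, 0, 0, 0, -1]
```

The short address is two edits away from the full ones, so it joins their cluster, while the lone `ALPHA` value has too few neighbors and is marked as noise.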

In [3]:
# The Progressive Aligner uses techniques from bioinformatics (multiple sequence alignment)
# to add gaps and align the values in each group. This ensures that values such as
# '3 John Avenue' (Num, Alpha, Street) and '23 Blvd' (Num, Street) are aligned as
# (Num, Alpha, Street) <==> (Num, Gap, Street) if they appear in the same group

from openclean_pattern.align.progressive import ProgressiveAligner

aligner = ProgressiveAligner()
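
The gap insertion described in the comment above can be sketched with a pairwise global alignment (Needleman-Wunsch). This is only an illustration of the underlying idea, not the `ProgressiveAligner` code: a progressive aligner repeats this kind of pairwise step across a whole group. The scoring values and the `align_pair` name are assumptions made for this example.

```python
def align_pair(a, b, match=1, mismatch=-1, gap=-1, gap_symbol="GAP"):
    """Globally align two label sequences, inserting gap symbols where needed."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_symbol)
            i -= 1
        else:
            out_a.append(gap_symbol)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


left, right = align_pair(["NUM", "ALPHA", "STREET"], ["NUM", "STREET"])
print(left)   # ['NUM', 'ALPHA', 'STREET']
print(right)  # ['NUM', 'GAP', 'STREET']
```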

In [4]:
# The final stage is generating regular expressions from these preprocessed values

from openclean_pattern.regex.compiler import DefaultRegexCompiler

compiler = DefaultRegexCompiler(per_group='top') # returns only the most dominant expression per group

Now let's put it all together and see it in action.

In [5]:
from openclean.profiling.pattern.base import PatternFinder

class CustomPatternFinder(PatternFinder):
    """Find regex patterns using the pipeline defined above."""
    
    def process(self, values):
        return self.find(values)
    
    def find(self, values):
        # get all distinct values
        values = list(set(values))
        
        # tokenize
        tokens = list()
        for t in values:
            tokens.append(tokenizer.tokens(t))
        
        # collect
        groups = collector.collect(tokens)
        
        # align
        aligned = aligner.align(tokens, groups)
        
        # compile
        patterns = list()
        for gr in groups:
            patterns.append(compiler.compile_each(aligned[gr]))
        
        # Pattern selection strategy: several groups can each yield a dominant pattern.
        # Here we simply return the patterns of the first group.
        if patterns:
            return patterns[0]
        
        return None
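
Returning `patterns[0]` is the simplest possible selection strategy. One alternative, sketched below, is to pick the pattern whose group covers the most values. The `select_pattern` helper and its `group_sizes` argument are hypothetical, and plain strings stand in for the pattern objects the compiler returns.

```python
def select_pattern(patterns, group_sizes):
    """Return the pattern whose group covers the most values, or None if empty.

    patterns[i] is the pattern compiled for group i and group_sizes[i] is the
    number of values in that group (both are assumptions of this sketch).
    """
    if not patterns:
        return None
    # On ties, max() keeps the first of the largest groups.
    best = max(range(len(patterns)), key=lambda i: group_sizes[i])
    return patterns[best]


chosen = select_pattern(["ALPHA", "NUMERIC \\S ALPHA \\S STREET"], [2, 5])
print(chosen)  # NUMERIC \S ALPHA \S STREET
```

Whether the largest group is the right choice depends on the data: a column with many outliers might be better served by returning all patterns above a coverage threshold.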

### Testing the pattern finder

In [6]:
import pandas as pd

values = pd.DataFrame(['23 Hoyt Street', '31 West Avenue', '50 Bitcoin Blvd'])
values.columns = ['Address']

In [7]:
profiler = CustomPatternFinder()
profiler.find(values['Address'].to_list())

RowPatterns(NUMERIC \S ALPHA \S STREET)

### Adding to openclean-core profilers

In [8]:
from openclean.profiling.dataset import dataset_profile

patterns = dataset_profile(values, [('Address', profiler)])

In [9]:
patterns

[{'column': 'Address', 'stats': RowPatterns(NUMERIC \S ALPHA \S STREET)}]