## Addresses Notebook

The main goals for this notebook is to walk through all the steps in a pattern and mismatch detection process while performing the following tasks on a dataset containing data in an Address format:
- Load / Sample data
- Tokenize into encoded types
    - Basic
    - Advanced
- Collect similar rows together
- Align groups or clusters of similar rows
- Generate Patterns for each group
- Identify mismatches from the full dataset
- PatternFinder Class Abstraction
- Evaluate other address values on a pattern

### Load Data
We're using a dataset with two addresse columns. Let's combine to get a full, more complicated column and drop any row with nulls. For the purposes of this example, we shall detect patterns from all distinct values

In [1]:
import pandas as pd
from openclean.data.load import dataset

In [2]:
# Openclean abstracts over pandas dataframes
address = dataset('../resources/dev/urban.csv', none_is='')
address = address[['Address ', 'Address Continued']].dropna(how='any')
address['Address'] = address['Address ']+ '|' + address['Address Continued']
address.sample(5)

Unnamed: 0,Address,Address Continued,Address.1
8039,5305 RIVER RD NORTH,STE B,5305 RIVER RD NORTH|STE B
21067,5305 RIVER ROAD NORTH,SUITE B,5305 RIVER ROAD NORTH|SUITE B
16566,63722 ELLEN STREET,APT 1,63722 ELLEN STREET|APT 1
20486,2390 EL CAMINO REAL,SUITE 210,2390 EL CAMINO REAL|SUITE 210
6458,247 S LOCUST ST,#247,247 S LOCUST ST|#247


In [3]:
# There are ~2700 values
address.shape

(2712, 3)

In [4]:
# with 969 unique addresses to detect dominant patterns from
address_unique = address['Address'].unique()
len(address_unique)

969

### Tokenize

Splitting the values into basic and advanced types

In [5]:
from openclean_pattern.tokenize.regex import DefaultTokenizer

In [6]:
# The default tokenizer shall split it on all punctuation, keeping '.'s intact if the 
# punctuation flag is set to true.

dt = DefaultTokenizer()
dt.tokenize(address_unique)[8]

('c',
 '/',
 'o',
 ' ',
 'pnc',
 ' ',
 'real',
 ' ',
 'estate',
 ' ',
 'tax',
 ' ',
 'credit',
 ' ',
 'capital',
 '|',
 '121',
 ' ',
 'sw',
 ' ',
 'morrison',
 ' ',
 'street',
 ' ',
 'suite',
 ' ',
 '1300')

#### Basic Types
With no type resolvers attached, the tokenizer shall convert each token to a supported basic datatype

In [7]:
# For use with the proceeding components, we convert the tokens into an internal Token representation
encoded = dt.encode(address_unique)
encoded[8]

(_'ALPHA'_(1,'c'),
 _'PUNC'_(1,'/'),
 _'ALPHA'_(1,'o'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'pnc'),
 _'\\S'_(1,' '),
 _'ALPHA'_(4,'real'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'estate'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'tax'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'credit'),
 _'\\S'_(1,' '),
 _'ALPHA'_(7,'capital'),
 _'PUNC'_(1,'|'),
 _'NUMERIC'_(3,'121'),
 _'\\S'_(1,' '),
 _'ALPHA'_(2,'sw'),
 _'\\S'_(1,' '),
 _'ALPHA'_(8,'morrison'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'street'),
 _'\\S'_(1,' '),
 _'ALPHA'_(5,'suite'),
 _'\\S'_(1,' '),
 _'NUMERIC'_(4,'1300'))

In [8]:
# Each tuple in the list represents a row with each element inside the tuple, a Token Element. Each token element 
# maintains a bunch of profiling information which is later aggregated into patterns, anomalies, profiles etc.
print(encoded[5][0])
for v, item in vars(encoded[8][0]).items():
    print('{}: {}'.format(v, item))

_'NUMERIC'_(4,'1741')
regex_type: SupportedDataTypes.ALPHA
size: 1
value: c
rowidx: 8


#### Advanced Types
Here we attach an Address type resolver to a Tokenizer, which enables us to identify more complex tokens such as 'Street', 'Ave', 'Apt' etc

In [9]:
from openclean_pattern.datatypes.resolver import AddressDesignatorResolver, DefaultTypeResolver
from openclean_pattern.tokenize.regex import RegexTokenizer

In [10]:
# The default type resolver identifies basic types, and adding an address resolver shall empower it use a 
# a repository of master data to identify specialized tokens by building a prefix tree

tr = DefaultTypeResolver(interceptors=AddressDesignatorResolver())
rt = RegexTokenizer(type_resolver=tr)

In [11]:
# We see now there exist internal representations for _STREET_ and a _SUD_ (secondary address designator) tokens
address_encoded = rt.encode(address_unique)
address_encoded[8]

(_'ALPHA'_(1,'c'),
 _'PUNC'_(1,'/'),
 _'ALPHA'_(1,'o'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'pnc'),
 _'\\S'_(1,' '),
 _'ALPHA'_(4,'real'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'estate'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'tax'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'credit'),
 _'\\S'_(1,' '),
 _'ALPHA'_(7,'capital'),
 _'PUNC'_(1,'|'),
 _'NUMERIC'_(3,'121'),
 _'\\S'_(1,' '),
 _'ALPHA'_(2,'sw'),
 _'\\S'_(1,' '),
 _'ALPHA'_(8,'morrison'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'street'),
 _'\\S'_(1,' '),
 _'SUD'_(5,'suite'),
 _'\\S'_(1,' '),
 _'NUMERIC'_(4,'1300'))

### Collect
We aim to collect similar looking rows with each other using each token's regex_type and it's position as variables to calculate proximity

In [12]:
from openclean_pattern.align.cluster import Cluster

In [13]:
# Cluster uses a DBSCAN clusterer to calculate the distance between encoded rows and group them into clusters.
# Here we use the tree-edit-distance to compute proximity
clusters = Cluster(dist='TED', min_samples=10).collect(address_encoded)

In [14]:
# We discover 11 different clusters with atleast 10 samples
for cluster in clusters:
    if cluster != -1:
        print(cluster)
        print(len(clusters[cluster]))
        print(address.iloc[list(clusters[0])]['Address'].sample(5))
        print()

0
422
1708                       1413 HAWTHORNE AVENUE|SPACE 23
6095    MOTSCHENBACHER & BLATTNER LLP|117 SW TAYLOR ST...
7932                          4201 NE 125TH PLACE|APT 173
2495                               1843 SW 16TH AVE|APT 3
3991                             843 ALDER CREEK DRIVE|#B
Name: Address, dtype: object

1
15
92      C/O PNC REAL ESTATE TAX CREDIT CAPITAL|121 SW ...
2735                              1306 SE 36TH AVE|APT 11
9492                   967 NE ORENCO STATION LOOP|APT 550
4702                            3122 N WILLIAMS AVE|APT B
1011                            5305 RIVER RD NORTH|STE B
Name: Address, dtype: object

2
11
7216    300 CARLSBAD VILLAGE DR|SUITE 108A-211
4131                  3519 NE 15TH AVE|STE 424
8045                   1837 SE ANKENY ST|APT B
2351               10000 NE 7TH AVE|SUITE 330I
607                    520 NW DAVIS ST|STE 215
Name: Address, dtype: object

3
59
8903           4207 SE WOODSTOCK BLVD|#419
200            1926 W BURNSIDE 

### Align
Next, for the identified groups, we add Gap characters i.e. align them such that each row in the group has the same length

In [15]:
from openclean_pattern.align.pad import Padder

In [16]:
# The padder appends n gap Tokens to each row where n is the difference between the 
# longest row in the group and the current row
pd = Padder()
tokens = pd.align(address_encoded, clusters)
tokens[8]

(_'ALPHA'_(1,'c'),
 _'PUNC'_(1,'/'),
 _'ALPHA'_(1,'o'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'pnc'),
 _'\\S'_(1,' '),
 _'ALPHA'_(4,'real'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'estate'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'tax'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'credit'),
 _'\\S'_(1,' '),
 _'ALPHA'_(7,'capital'),
 _'PUNC'_(1,'|'),
 _'NUMERIC'_(3,'121'),
 _'\\S'_(1,' '),
 _'ALPHA'_(2,'sw'),
 _'\\S'_(1,' '),
 _'ALPHA'_(8,'morrison'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'street'),
 _'\\S'_(1,' '),
 _'SUD'_(5,'suite'),
 _'\\S'_(1,' '),
 _'NUMERIC'_(4,'1300'),
 _'G'_(0,''))

### Generate Patterns
We analyse each cluster of aligned rows and generate an Openclean Pattern object. Pattern generation can either be row-wise, i.e. all rows with the same tokens are aggregated into one pattern, or column-wise, i.e. tokens at each position in rows are pooled to retrieve the most common token type per position.

In [17]:
from openclean_pattern.regex.compiler import DefaultRegexCompiler

In [18]:
# the column method pools the majority token type from each token position across the rows. We discover
# that 174 values of the 422 in the cluster 0 make a pattern, 32 of 59 make a pattern in cluster 3 and
#  67 / 67 in cluster 6 follow the generated pattern

rc = DefaultRegexCompiler(method='col', per_group='all')
patterns = rc.compile(address_encoded, clusters)

for cluster, pattern_group in patterns.items():
    if cluster != -1:
        print('{}'.format(pattern_group.top(1, pattern=True).freq))
        print('{} : {}'.format(cluster, pattern_group))
        print()

174
0 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

7
1 : RowPatterns(ALPHA(1-14) \S() ALPHA(1-9) \S() ALPHA(1-14) \S() ALPHA(2-9) PUNC(|) NUMERIC(1-4) \S() sw(2-2) \S() ALPHA(5-10) \S() STREET(2-6) \S() sXXXX(3-5) \S() NUMERIC(1-4))

1
2 : RowPatterns(NUMERIC(3-4) \S() ALPHA(1-5) \S() ALPHA(3-7) \S() STREET(3-7) PUNC(|) SUD(3-4) \S() PUNC(#) NUMERIC(1-3))

32
3 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(|#) PUNC(#) NUMERIC(1-4))

4
4 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-10) \S() ALPHA(4-9) \S() STREET(2-6) PUNC(|) ALPHA(1-7) \S() ALPHA(1-5) \S() NUMERIC(1-5))

16
5 : RowPatterns(NUMERIC(2-5) \S() ALPHA(2-2) \S() ALPHANUM(3-5) \S() avX(2-3) PUNC(|) NUMERIC(1-5))

67
6 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(3-5) \S() NUMERIC(1-4))

11
7 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-9) \S() ALPHA(3-10) \S() STREET(2-7) \S() STREET(2-7) PUNC(|) SUD(3-

### Identify Mismatches
We identify mismatches in the column. i.e. values in each group/cluster that didn't match the selected patterns

In [19]:
# Let's select the top pattern in each cluster and filter out values that didn't match them. 
# Using the following patterns we discover 595 values that didn't match any pattern.

print('Selected Patterns')
selected_patterns = list()
for pattern_group in patterns.values():
    top = pattern_group.top(pattern=True)
    selected_patterns.append(top)
    print(top)
    print()
print('------------------')

    
mismatches = rc.mismatches(tokens, selected_patterns)
print('Mismatches: {}'.format(len(address_unique[mismatches])))
print('Sample:')
print(address_unique[mismatches][:5])

Selected Patterns
[DIGIT(1-5), SPACE_REP(1-1), ALPHA(1-11), SPACE_REP(1-1), ALPHA(1-13), SPACE_REP(1-1), STREET(2-9), PUNCTUATION(1-1), SUD(2-9), SPACE_REP(1-1), DIGIT(1-5)]

[DIGIT(1-5), SPACE_REP(1-1), ALPHA(1-14), SPACE_REP(1-1), ALPHA(1-12), SPACE_REP(1-1), ALPHA(1-14), SPACE_REP(1-1), ALPHA(1-11), SPACE_REP(1-1), ALPHA(1-12), SPACE_REP(1-1), ALPHA(1-8), SPACE_REP(1-1), ALPHA(1-12), SPACE_REP(1-1), STREET(2-6), SPACE_REP(1-1), SPACE_REP(1-1), SPACE_REP(1-1), DIGIT(3-4), SPACE_REP(1-1), DIGIT(1-4), SPACE_REP(1-1), SUD(5-5), SPACE_REP(1-1), DIGIT(4-4), DIGIT(3-3)]

[ALPHA(1-14), SPACE_REP(1-1), ALPHA(1-9), SPACE_REP(1-1), ALPHA(1-14), SPACE_REP(1-1), ALPHA(2-9), PUNCTUATION(1-1), DIGIT(1-4), SPACE_REP(1-1), ALPHA(2-2), SPACE_REP(1-1), ALPHA(5-10), SPACE_REP(1-1), STREET(2-6), SPACE_REP(1-1), SUD(3-5), SPACE_REP(1-1), DIGIT(1-4)]

[DIGIT(3-4), SPACE_REP(1-1), ALPHA(1-5), SPACE_REP(1-1), ALPHA(3-7), SPACE_REP(1-1), STREET(3-7), PUNCTUATION(1-1), SUD(3-4), SPACE_REP(1-1), PUNCTUATION(1-

### PattenFinder Abstraction
A PatternFinder pipeline can be built with all the above components to easily detect patterns in the dataset.

In [20]:
from openclean_pattern.opencleanpatternfinder import OpencleanPatternFinder

In [21]:
# The sequence of operations remains the same, i.e.: 
# sampling -> type resolution + tokenization -> collection -> alignment -> compilation

pf = OpencleanPatternFinder(
    distinct=True,
    frac=1,
    tokenizer=RegexTokenizer(
        type_resolver=DefaultTypeResolver(
            interceptors=AddressDesignatorResolver()
        )
    ),
    collector=Cluster(dist='TED', min_samples=10),
    aligner=Padder(),
    compiler=DefaultRegexCompiler(method='col', per_group='all')
)
patterns = pf.find(address['Address'])

In [22]:
# We see the same patterns as without PatternFinder process
for cluster, pattern in patterns.items():
    if cluster != -1:
        print('{}'.format(pattern.top(1, pattern=True).freq))
        print('{} : {}'.format(cluster, pattern))
        print()

174
0 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

68
1 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(3-5) \S() NUMERIC(1-4))

4
2 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-10) \S() ALPHA(4-9) \S() STREET(2-6) PUNC(|) ALPHA(1-7) \S() ALPHA(1-5) \S() NUMERIC(1-5))

7
3 : RowPatterns(ALPHA(1-14) \S() ALPHA(1-9) \S() ALPHA(1-14) \S() ALPHA(2-9) PUNC(|) NUMERIC(1-4) \S() sw(2-2) \S() ALPHA(5-10) \S() STREET(2-6) \S() sXXXX(3-5) \S() NUMERIC(1-4))

32
4 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(|#) PUNC(#) NUMERIC(1-4))

28
5 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-9) \S() ALPHA(1-10) \S() STREET(2-6) PUNC(|) NUMERIC(1-5))

10
6 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-2) \S() ALPHA(4-13) \S() STREET(2-3) PUNC(|) ALPHANUM(2-4))

11
7 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-9) \S() ALPHA(3-10) \S() STREET(2-7) \S() STREET(2-7) PUNC(|) SUD(3-5) \S() NUMERIC(1-

### Evaluate other data on a pattern
To be able to evaluate patterns on other columns, we'll use a patternfinder object to help perform the same set and sequence of operations on the new column

In [23]:
# get the top Pattern object from the 0th cluster. If pattern=False, a string would be returned instead
# of a Pattern object
pat = patterns[0].top(pattern=True)
pat

[DIGIT(1-5), SPACE_REP(1-1), ALPHA(1-11), SPACE_REP(1-1), ALPHA(1-13), SPACE_REP(1-1), STREET(2-9), PUNCTUATION(1-1), SUD(2-9), SPACE_REP(1-1), DIGIT(1-5)]

In [24]:
# returns True if the value follows the pattern, else False. 
test = '23 Nelson Jansen Ave|Apt 2' # DIGIT SPACE ALPHA SPACE ALPHA SPACE STREET PUNC SUD SPACE DIGIT
pat.compare(test, pf) #test has an extra pair of SPACE ALPHA than the pattern

True

In [25]:
# we could also use the PatternFinder object to quickly compare a list of values with a pattern. 
# We see rows 2 and 3 match the pattern
test = [
    '832 SW VISTA AVENUE|APT 4',
    '23 Nelson Jansen Ave|Apt 2',
    '3 M J Ave|Fl 2',
    '521 Avalon block |House 1'
]
pf.compare(pat, test)

[False, True, True, False]

------------------------------------------------------------------------------------------------------------