## Addresses Notebook

The main goal of this notebook is to walk through every step of a pattern and mismatch detection process by performing the following tasks on a dataset of address values:
- Load / Sample data
- Tokenize into encoded types
    - Basic
    - Advanced
- Collect similar rows together
- Align groups or clusters of similar rows
- Generate Patterns for each group
- Identify mismatches from the full dataset
- PatternFinder Class Abstraction
- Evaluate other address values on a pattern

### Load Data
We're using a dataset with two address columns. Let's combine them into a single, more complicated column and drop any row with nulls. For the purposes of this example, we detect patterns from all distinct values.

In [1]:
import pandas as pd
from openclean.data.load import dataset

In [2]:
# Openclean abstracts over pandas dataframes
address = dataset('data/urban.csv', none_is='')
# Note the trailing space in the source column name 'Address '
address = address[['Address ', 'Address Continued']].dropna(how='any')
address['Address'] = address['Address '] + '|' + address['Address Continued']
address.sample(5)

Unnamed: 0,Address,Address Continued,Address.1
18794,ROBERT E KABACY,520 SW YAMHILL ST STE 600,ROBERT E KABACY|520 SW YAMHILL ST STE 600
11429,3181 NE 23RD ST,APT 1103,3181 NE 23RD ST|APT 1103
944,2241 GREENSPRINGS DRIVE,UNIT 66,2241 GREENSPRINGS DRIVE|UNIT 66
18624,841 O'HARE PARKWAY,STE 100,841 O'HARE PARKWAY|STE 100
30860,14708 SW BEARD RD,APT 225,14708 SW BEARD RD|APT 225


In [3]:
# There are ~2700 values
address.shape

(2712, 3)

In [4]:
# with 969 unique addresses to detect dominant patterns from
address_unique = address['Address'].unique()
len(address_unique)

969

### Tokenize

Splitting the values into basic and advanced types

In [5]:
from openclean_pattern.tokenize.regex import DefaultTokenizer

In [6]:
# The default tokenizer splits each value on all punctuation, keeping '.' characters intact
# if the punctuation flag is set to True.

dt = DefaultTokenizer()
dt.encode(address_unique)[8]

['c',
 '/',
 'o',
 ' ',
 'pnc',
 ' ',
 'real',
 ' ',
 'estate',
 ' ',
 'tax',
 ' ',
 'credit',
 ' ',
 'capital',
 '|',
 '121',
 ' ',
 'sw',
 ' ',
 'morrison',
 ' ',
 'street',
 ' ',
 'suite',
 ' ',
 '1300']
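
To make the splitting behaviour concrete, here is a minimal standalone sketch that produces the same kind of token list with Python's `re` module. It is not the `DefaultTokenizer` implementation; `simple_tokenize` is a hypothetical helper, and it treats every non-alphanumeric character as a delimiter that is kept as its own token.

```python
import re

def simple_tokenize(value):
    """Lowercase a value and split it on every non-alphanumeric character,
    keeping each delimiter as its own token (hypothetical helper)."""
    # The capturing group makes re.split return the delimiters as well.
    parts = re.split(r"([^a-z0-9])", value.lower())
    # Adjacent delimiters leave empty strings between them; drop those.
    return [p for p in parts if p != ""]

tokens = simple_tokenize("121 SW MORRISON STREET|SUITE 1300")
print(tokens)
```

Keeping the delimiters matters here because the later steps treat spaces and the `|` separator as positions in the pattern, not as noise to discard.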

#### Basic Types
With no type resolvers attached, the tokenizer converts each token to a supported basic datatype.

In [7]:
# For use with the subsequent components, we convert the tokens into an internal Token representation
encoded = dt.encode(address_unique)

for token in encoded[8]:
    print('{} - {}'.format(token, token.regex_type))

c - ALPHA
/ - PUNC
o - ALPHA
  - \S
pnc - ALPHA
  - \S
real - ALPHA
  - \S
estate - ALPHA
  - \S
tax - ALPHA
  - \S
credit - ALPHA
  - \S
capital - ALPHA
| - PUNC
121 - NUMERIC
  - \S
sw - ALPHA
  - \S
morrison - ALPHA
  - \S
street - ALPHA
  - \S
suite - ALPHA
  - \S
1300 - NUMERIC
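
The mapping from token to basic type can be approximated with plain string predicates. The sketch below is an illustration only: `basic_type` is a hypothetical function, and it labels whitespace `SPACE` where the library output above prints `\S`.

```python
def basic_type(token):
    """Map a token to a coarse type label (hypothetical helper, labels
    chosen to resemble the output above)."""
    if token.isspace():
        return "SPACE"
    if token.isdigit():
        return "NUMERIC"
    if token.isalpha():
        return "ALPHA"
    if token.isalnum():
        return "ALPHANUM"  # mixed letters and digits, e.g. '5th'
    return "PUNC"

row = ["121", " ", "sw", " ", "5th", "|", "suite", " ", "1300"]
types = [basic_type(token) for token in row]
for token, label in zip(row, types):
    print("{!r} - {}".format(token, label))
```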


In [8]:
# Each tuple in the list represents a row, and each element inside the tuple is a Token element. Each Token
# maintains profiling information that is later aggregated into patterns, anomalies, profiles, etc.
print(encoded[5][0])
for v, item in vars(encoded[8][0]).items():
    print('{}: {}'.format(v, item))

1741
token_type: ALPHA
rowidx: 8


#### Advanced Types
Here we attach an Address type resolver to a Tokenizer, which enables us to identify more complex tokens such as 'Street', 'Ave', 'Apt', etc.

In [9]:
from openclean_pattern.datatypes.resolver import AddressDesignatorResolver, DefaultTypeResolver
from openclean_pattern.tokenize.regex import RegexTokenizer

In [10]:
# The default type resolver identifies basic types. Adding an address resolver lets it use a
# repository of master data to identify specialized tokens by building a prefix tree.

tr = DefaultTypeResolver(interceptors=AddressDesignatorResolver())
rt = RegexTokenizer(type_resolver=tr)

In [11]:
# We now see internal representations for _STREET_ and _SUD_ (secondary unit designator) tokens
address_encoded = rt.encode(address_unique)

for token in address_encoded[8]:
    print('{} - {}'.format(token, token.regex_type))

c - ALPHA
/ - PUNC
o - ALPHA
  - \S
pnc - ALPHA
  - \S
real - ALPHA
  - \S
estate - STREET
  - \S
tax - ALPHA
  - \S
credit - ALPHA
  - \S
capital - ALPHA
| - PUNC
121 - NUMERIC
  - \S
sw - ALPHA
  - \S
morrison - ALPHA
  - \S
street - STREET
  - \S
suite - SUD
  - \S
1300 - NUMERIC
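
The comment above mentions that the resolver builds a prefix tree from master data. A minimal sketch of that idea follows, assuming a tiny hand-written word list; `PrefixTree`, `resolve`, and the word lists are hypothetical and far smaller than any real master data set.

```python
class PrefixTree:
    """A character-level prefix tree mapping known words to a type label."""

    def __init__(self):
        self.root = {}

    def add(self, word, label):
        node = self.root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node["$label"] = label  # marks the end of a complete word

    def lookup(self, word):
        node = self.root
        for ch in word.lower():
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$label")

# Hypothetical master data, for illustration only.
tree = PrefixTree()
for word in ("street", "st", "ave", "avenue", "blvd"):
    tree.add(word, "STREET")
for word in ("apt", "suite", "ste", "unit"):
    tree.add(word, "SUD")

def resolve(token, fallback="ALPHA"):
    """Return the specialized label for known tokens, else the fallback type."""
    return tree.lookup(token) or fallback

resolved = [resolve(t) for t in ["morrison", "street", "suite", "stree"]]
print(resolved)
```

Note that `stree` falls back to `ALPHA`: it is a prefix of a known word but not a complete entry, so no label is stored at that node.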


### Collect
We aim to collect similar-looking rows together, using each token's regex_type and its position as variables to calculate proximity.

In [12]:
from openclean_pattern.collect.cluster import Cluster

In [13]:
# Cluster uses a DBSCAN clusterer to calculate the distance between encoded rows and group them into clusters.
# Here we use the tree-edit-distance to compute proximity
clusters = Cluster(dist='TED', min_samples=10).collect(address_encoded)

In [14]:
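# We discover 11 different clusters with at least 10 samples
# Cluster members index the rows of address_unique, so look them up in the de-duplicated column
distinct_addresses = address['Address'].drop_duplicates()
for cluster in clusters:
    if cluster != -1:
        print(cluster)
        print(len(clusters[cluster]))
        print(distinct_addresses.iloc[list(clusters[cluster])].sample(5))
        print()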

0
422
4240               1416 SW 174TH AVE|APT 203
2260    300 EXECUTIVE CENTER DRIVE|SUITE 201
737           1277 TREAT BOULEVARD|SUITE 400
1408                 235 FRONT ST SE|STE 400
846         485 MASSACHUSETTS AVENUE|SUITE 3
Name: Address, dtype: object

1
15
10256    101 CRAWFORDS CORNER ROAD|STE 4-204R
77                       1302 NE 3RD ST|STE 1
9048            2050 GOODPASTURE LOOP|APT 141
5503        1600 PIONEER TOWER|888 SW 5TH AVE
9145                1300 22ND STREET|UNIT 503
Name: Address, dtype: object

2
11
9187    199 SW SHEVLIN HIXON DRIVE|SUITE A
200            1926 W BURNSIDE ST|UNIT 317
4328       11529 SW ZURICH STREET|UNIT 205
3802      3340 NE M L KING JR BLVD|APT 407
1408               235 FRONT ST SE|STE 400
Name: Address, dtype: object

3
59
6454    4137 SE CESAR E CHAVEZ BLVD|APT 16
2571            550 CALIFORNIA AVE|STE 200
4413               5305 RIVER RD N|SUITE B
9721                  109 SE ALDER ST|#710
1674             7885 SW VLAHOS DR|APT 106
Name
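
To illustrate how rows can be grouped by the similarity of their type sequences, the sketch below swaps in two simpler techniques: a Levenshtein edit distance over type labels instead of the tree-edit-distance, and a greedy threshold grouping instead of DBSCAN. Both functions and the sample rows are hypothetical; the point is only to show distance-based grouping of encoded rows.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of type labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def greedy_groups(rows, max_dist=1):
    """Put each row in the first group whose representative is within max_dist."""
    groups = []  # list of (representative row, member indices)
    for idx, row in enumerate(rows):
        for rep, members in groups:
            if edit_distance(rep, row) <= max_dist:
                members.append(idx)
                break
        else:
            groups.append((row, [idx]))
    return [members for _, members in groups]

rows = [
    ["NUMERIC", "SPACE", "ALPHA", "SPACE", "STREET", "PUNC", "SUD", "SPACE", "NUMERIC"],
    ["NUMERIC", "SPACE", "ALPHA", "SPACE", "STREET", "PUNC", "SUD", "SPACE", "ALPHA"],
    ["ALPHA", "SPACE", "ALPHA", "PUNC", "NUMERIC"],
]
groups = greedy_groups(rows)
print(groups)
```

Unlike DBSCAN, this greedy scheme has no notion of noise or minimum cluster size, so it should be read as a teaching aid and not as a replacement for `Cluster`.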

### Align
Next, for the identified groups, we add gap characters, i.e. we align them so that each row in the group has the same length.

In [15]:
from openclean_pattern.align.pad import Padder

In [16]:
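# The padder appends n gap Tokens to each row, where n is the difference between the
# longest row in the group and the current row. The instance is named `padder` so it
# does not shadow the pandas alias `pd` imported earlier.
padder = Padder()
tokens = padder.align(address_encoded, clusters)
tokens[8]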

('c',
 '/',
 'o',
 ' ',
 'pnc',
 ' ',
 'real',
 ' ',
 'estate',
 ' ',
 'tax',
 ' ',
 'credit',
 ' ',
 'capital',
 '|',
 '121',
 ' ',
 'sw',
 ' ',
 'morrison',
 ' ',
 'street',
 ' ',
 'suite',
 ' ',
 '1300',
 '')
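
The padding step itself is simple enough to sketch in a few lines. This is an illustration of the idea described in the comment above, not the `Padder` source; `pad_group` and the `GAP` constant are hypothetical names, with the gap represented as an empty string as in the output above.

```python
GAP = ""  # gap token; the aligned row above ends with an empty-string token

def pad_group(rows):
    """Right-pad every row with gap tokens up to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [tuple(row) + (GAP,) * (width - len(row)) for row in rows]

group = [
    ("109", " ", "se", " ", "alder", " ", "st", "|", "#", "710"),
    ("550", " ", "california", " ", "ave", "|", "ste", " ", "200"),
]
padded = pad_group(group)
for row in padded:
    print(len(row), row)
```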

### Generate Patterns
We analyse each cluster of aligned rows and generate an Openclean Pattern object. Pattern generation can either be row-wise, i.e. all rows with the same tokens are aggregated into one pattern, or column-wise, i.e. tokens at each position in rows are pooled to retrieve the most common token type per position.
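
As a rough illustration of the column-wise method, the sketch below takes the majority type at each position of a group of aligned rows and records the shortest and longest token of that type. `column_pattern` and the sample rows are hypothetical, and the way `freq` is counted here (rows whose types equal the majority at every position) is an assumption about what the reported frequencies mean.

```python
from collections import Counter

def column_pattern(rows):
    """rows: equal-length lists of (token, type) pairs.
    Returns (pattern, freq), where pattern is a list of (type, min_len, max_len)."""
    pattern = []
    for column in zip(*rows):
        # Majority type at this position.
        majority, _ = Counter(t for _, t in column).most_common(1)[0]
        lengths = [len(tok) for tok, t in column if t == majority]
        pattern.append((majority, min(lengths), max(lengths)))
    # Count the rows whose type at every position equals the majority type.
    majority_types = [p[0] for p in pattern]
    freq = sum([t for _, t in row] == majority_types for row in rows)
    return pattern, freq

rows = [
    [("1416", "NUMERIC"), (" ", "SPACE"), ("174th", "ALPHANUM"), (" ", "SPACE"), ("ave", "STREET")],
    [("235", "NUMERIC"), (" ", "SPACE"), ("front", "ALPHA"), (" ", "SPACE"), ("st", "STREET")],
    [("1277", "NUMERIC"), (" ", "SPACE"), ("treat", "ALPHA"), (" ", "SPACE"), ("boulevard", "STREET")],
]
pattern, freq = column_pattern(rows)
print(pattern, freq)
```

In this toy group two of the three rows follow the pooled pattern, which is the same kind of "n of m values make a pattern" figure reported for the real clusters.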

In [17]:
from openclean_pattern.regex.compiler import DefaultRegexCompiler

In [18]:
# The column method pools the majority token type from each token position across the rows. We discover
# that 174 of the 422 values in cluster 0 make a pattern, 32 of 59 make a pattern in cluster 3, and
# 67 of 67 in cluster 6 follow the generated pattern.

rc = DefaultRegexCompiler(method='col', per_group='all')
patterns = rc.compile(address_encoded, clusters)

for cluster, pattern_group in patterns.items():
    if cluster != -1:
        print('{}'.format(pattern_group.top(1, pattern=True).freq))
        print('{} : {}'.format(cluster, pattern_group))
        print()

174
0 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

7
1 : RowPatterns(ALPHA(1-14) \S() ALPHA(1-9) \S() ALPHA(1-14) \S() ALPHA(2-9) PUNC(|) NUMERIC(1-4) \S() sw(2-2) \S() ALPHA(5-10) \S() STREET(2-6) \S() sXXXX(3-5) \S() NUMERIC(1-4))

1
2 : RowPatterns(NUMERIC(3-4) \S() ALPHA(1-5) \S() ALPHA(3-7) \S() STREET(3-7) PUNC(|) SUD(3-4) \S() PUNC(#) NUMERIC(1-3))

32
3 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(#|) PUNC(#) NUMERIC(1-4))

4
4 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-10) \S() ALPHA(4-9) \S() STREET(2-6) PUNC(|) ALPHA(1-7) \S() ALPHA(1-5) \S() NUMERIC(1-5))

16
5 : RowPatterns(NUMERIC(2-5) \S() ALPHA(2-2) \S() ALPHANUM(3-5) \S() avX(2-3) PUNC(|) NUMERIC(1-5))

67
6 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(3-5) \S() NUMERIC(1-4))

11
7 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-9) \S() ALPHA(3-10) \S() STREET(2-7) \S() STREET(2-7) PUNC(|) SUD(3-

### Identify Mismatches
We identify mismatches in the column, i.e. values in each group/cluster that didn't match the selected patterns.
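
Conceptually, a row matches a pattern when it has the same number of tokens and every token has the expected type and a length within the pattern's range. A minimal sketch under that assumption follows; `matches`, `find_mismatches`, and the sample data are hypothetical.

```python
def matches(row, pattern):
    """row: list of (token, type); pattern: list of (type, min_len, max_len)."""
    if len(row) != len(pattern):
        return False
    return all(t == p_type and lo <= len(tok) <= hi
               for (tok, t), (p_type, lo, hi) in zip(row, pattern))

def find_mismatches(rows, patterns):
    """Indices of rows that match none of the selected patterns."""
    return [i for i, row in enumerate(rows)
            if not any(matches(row, p) for p in patterns)]

selected = [[("NUMERIC", 1, 5), ("SPACE", 1, 1), ("STREET", 2, 9)]]
rows = [
    [("12", "NUMERIC"), (" ", "SPACE"), ("ave", "STREET")],      # matches
    [("123456", "NUMERIC"), (" ", "SPACE"), ("ave", "STREET")],  # number too long
    [("12", "NUMERIC"), (" ", "SPACE"), ("main", "ALPHA")],      # wrong type
]
bad = find_mismatches(rows, selected)
print(bad)
```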

In [19]:
# Let's select the top pattern in each cluster and filter out values that didn't match them. 
# Using the following patterns we discover 595 values that didn't match any pattern.

print('Selected Patterns')
selected_patterns = []
for pattern_group in patterns.values():
    top = pattern_group.top(pattern=True)
    selected_patterns.append(top)
    print(top)
    print()
print('------------------')

    
mismatches = rc.mismatches(tokens, selected_patterns)
print('Mismatches: {}'.format(len(address_unique[mismatches])))
print('Sample:')
print(address_unique[mismatches][:5])

Selected Patterns
[NUMERIC(1-5), \S(1-1), ALPHA(1-11), \S(1-1), ALPHA(1-13), \S(1-1), STREET(2-9), PUNC(1-1), SUD(2-9), \S(1-1), NUMERIC(1-5)]

[NUMERIC(1-5), \S(1-1), ALPHA(1-14), \S(1-1), ALPHA(1-12), \S(1-1), ALPHA(1-14), \S(1-1), ALPHA(1-11), \S(1-1), ALPHA(1-12), \S(1-1), ALPHA(1-8), \S(1-1), ALPHA(1-12), \S(1-1), STREET(2-6), \S(1-1), \S(1-1), \S(1-1), NUMERIC(3-4), \S(1-1), NUMERIC(1-4), \S(1-1), SUD(5-5), \S(1-1), NUMERIC(4-4), NUMERIC(3-3)]

[ALPHA(1-14), \S(1-1), ALPHA(1-9), \S(1-1), ALPHA(1-14), \S(1-1), ALPHA(2-9), PUNC(1-1), NUMERIC(1-4), \S(1-1), ALPHA(2-2), \S(1-1), ALPHA(5-10), \S(1-1), STREET(2-6), \S(1-1), SUD(3-5), \S(1-1), NUMERIC(1-4)]

[NUMERIC(3-4), \S(1-1), ALPHA(1-5), \S(1-1), ALPHA(3-7), \S(1-1), STREET(3-7), PUNC(1-1), SUD(3-4), \S(1-1), PUNC(1-1), NUMERIC(1-3)]

[NUMERIC(2-5), \S(1-1), ALPHA(1-11), \S(1-1), ALPHA(2-11), \S(1-1), STREET(2-6), PUNC(1-1), PUNC(1-1), NUMERIC(1-4)]

[NUMERIC(2-5), \S(1-1), ALPHA(1-10), \S(1-1), ALPHA(4-9), \S(1-1), STREET(2-6), P

### PatternFinder Abstraction
A PatternFinder pipeline can be built with all the above components to easily detect patterns in the dataset.

In [20]:
from openclean_pattern.opencleanpatternfinder import OpencleanPatternFinder

In [21]:
# The sequence of operations remains the same, i.e.: 
# sampling -> type resolution + tokenization -> collection -> alignment -> compilation

pf = OpencleanPatternFinder(
    distinct=True,
    frac=1,
    tokenizer=RegexTokenizer(
        type_resolver=DefaultTypeResolver(
            interceptors=AddressDesignatorResolver()
        )
    ),
    collector=Cluster(dist='TED', min_samples=10),
    aligner=Padder(),
    compiler=DefaultRegexCompiler(method='col', per_group='all')
)
patterns = pf.find(address['Address'])

In [22]:
# We see the same patterns as without the PatternFinder abstraction, although cluster labels can differ between runs
for cluster, pattern in patterns.items():
    if cluster != -1:
        print('{}'.format(pattern.top(1, pattern=True).freq))
        print('{} : {}'.format(cluster, pattern))
        print()

32
0 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(#|) PUNC(#) NUMERIC(1-4))

16
1 : RowPatterns(NUMERIC(2-5) \S() ALPHA(2-2) \S() ALPHANUM(3-5) \S() avX(2-3) PUNC(|) NUMERIC(1-5))

174
2 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

10
3 : RowPatterns(NUMERIC(2-4) \S() STREET(4-7) \S() STREET(2-6) PUNC(|) SUD(3-5) \S() NUMERIC(1-3))

4
4 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-10) \S() ALPHA(4-9) \S() STREET(2-6) PUNC(|) ALPHA(1-7) \S() ALPHA(1-5) \S() NUMERIC(1-5))

11
5 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-9) \S() ALPHA(3-10) \S() STREET(2-7) \S() STREET(2-7) PUNC(|) SUD(3-5) \S() NUMERIC(1-3))

7
6 : RowPatterns(ALPHA(1-14) \S() ALPHA(1-9) \S() ALPHA(1-14) \S() ALPHA(2-9) PUNC(|) NUMERIC(1-4) \S() sw(2-2) \S() ALPHA(5-10) \S() STREET(2-6) \S() sXXXX(3-5) \S() NUMERIC(1-4))

28
7 : RowPatterns(NUMERIC(3-5) \S() ALPHA(1-9) \S() ALPHA(1-10) \S() STREET(2-6) PUNC(|) NUMERIC(1-5)

### Evaluate other data on a pattern
To evaluate patterns on other columns, we use a PatternFinder object to perform the same set and sequence of operations on the new column.

In [26]:
# As an example, get the top Pattern object from the cluster labelled 2. If pattern=False, a string would be
# returned instead of a Pattern object. The pattern looks like:
# [NUMERIC(1-5), \S(1-1), ALPHA(1-11), \S(1-1), ALPHA(1-13), \S(1-1), STREET(2-9), PUNC(1-1), SUD(2-9), \S(1-1), NUMERIC(1-5)]

# Note: when re-running the notebook, this pattern might belong to a different cluster than in the previous cell.
# This is because of the non-deterministic nature of DBSCAN clustering at the collect stage.

pat = patterns[2].top(pattern=True)
pat

[NUMERIC(1-5), \S(1-1), ALPHA(1-11), \S(1-1), ALPHA(1-13), \S(1-1), STREET(2-9), PUNC(1-1), SUD(2-9), \S(1-1), NUMERIC(1-5)]

In [27]:
# Returns True if the value follows the pattern, else False.
test = '23 Nelson Jansen Ave|Apt 2'  # NUMERIC SPACE ALPHA SPACE ALPHA SPACE STREET PUNC SUD SPACE NUMERIC
pat.compare(test, pf.tokenizer)  # test has the same token sequence as the pattern, so it matches

True

In [28]:
# we could also use the PatternFinder object to quickly compare a list of values with a pattern. 
# We see rows 2 and 3 match the pattern
test = [
    '832 SW VISTA AVENUE|APT 4', # vista is a street identifier (https://pe.usps.com/text/pub28/28apc_002.htm)
    '23 Nelson Jansen Ave|Apt 2',
    '3 M J Ave|Fl 2',
    '521 Avalon block |House 1' # extra space before |
]
pf.compare(pat, test)

[False, True, True, False]
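
The result above can be reasoned through with a small standalone sketch that tokenizes each test value, labels the tokens, and compares the type sequence with the pattern. Everything here is an assumption for illustration: the `STREET` and `SUD` word sets are hypothetical stand-ins for the master data, and the length ranges of the real pattern are ignored.

```python
import re

STREET = {"ave", "avenue", "st", "street", "vista"}  # hypothetical master data
SUD = {"apt", "fl", "suite", "ste", "unit"}          # hypothetical master data

def token_types(value):
    """Tokenize a raw value and return the type label of every token."""
    types = []
    for tok in re.split(r"([^a-z0-9])", value.lower()):
        if tok == "":
            continue
        if tok in STREET:
            types.append("STREET")
        elif tok in SUD:
            types.append("SUD")
        elif tok.isdigit():
            types.append("NUMERIC")
        elif tok.isalpha():
            types.append("ALPHA")
        elif tok.isalnum():
            types.append("ALPHANUM")
        elif tok.isspace():
            types.append("SPACE")
        else:
            types.append("PUNC")
    return types

# Types only; the length ranges of the real pattern are not checked in this sketch.
PATTERN = ["NUMERIC", "SPACE", "ALPHA", "SPACE", "ALPHA", "SPACE",
           "STREET", "PUNC", "SUD", "SPACE", "NUMERIC"]

tests = [
    "832 SW VISTA AVENUE|APT 4",
    "23 Nelson Jansen Ave|Apt 2",
    "3 M J Ave|Fl 2",
    "521 Avalon block |House 1",
]
results = [token_types(t) == PATTERN for t in tests]
print(results)
```

The first value fails because `VISTA` is itself resolved as a street token, so the sequence reads ALPHA, STREET, STREET where the pattern expects ALPHA, ALPHA, STREET. The last value fails because the extra space before `|` shifts every later token by one position.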
