## Addresses Notebook

The main goals for this notebook is to perform the following tasks on a dataset with Addresses:
- Load
- Tokenize into encoded types
    - Basic
    - Advanced
- Collect
- Align
- Generate Patterns
- Identify Anomalies
- Evaluate other address columns on a pattern
- Abstraction

### Load Data
We're using a dataset with two addresse columns. Let's combine to get a full, more complicated column and drop any row with nulls. For the purposes of this example, we shall detect patterns from all distinct values

In [1]:
import pandas as pd

In [2]:
address = pd.read_csv('../resources/dev/urban.csv', usecols=['Address ', 'Address Continued']).dropna(how='any')
address['Address'] = address['Address '].fillna('')+ '|' + address['Address Continued'].fillna('')
address.sample(5)

Unnamed: 0,Address,Address Continued,Address.1
13945,14200 SOUTHEAST MCLOUGHLIN BOULEVARD,SUITE K,14200 SOUTHEAST MCLOUGHLIN BOULEVARD|SUITE K
4672,15642 NE GLISAN STREET,#1,15642 NE GLISAN STREET|#1
11379,5305 RIVER RD NORTH,STE B,5305 RIVER RD NORTH|STE B
1711,742 SW VISTA AVE,APT 42,742 SW VISTA AVE|APT 42
8864,2025 NE 42ND AVE,APT 205,2025 NE 42ND AVE|APT 205


In [3]:
# There are ~2700 values
address.shape

(2711, 3)

In [4]:
# with 968 unique addresses to detect dominant patterns from
address_unique = address['Address'].unique()
len(address_unique)

968

### Tokenize

Splitting the values into basic and advanced types

In [5]:
from openclean_pattern.tokenize.regex import DefaultTokenizer

In [6]:
# The default tokenizer shall split it on all punctuation, keeping '.'s intact if the 
# punctuation flag is set to true.

dt = DefaultTokenizer()
dt.tokenize(address_unique)[8]

('c',
 '/',
 'o',
 ' ',
 'pnc',
 ' ',
 'real',
 ' ',
 'estate',
 ' ',
 'tax',
 ' ',
 'credit',
 ' ',
 'capital',
 '|',
 '121',
 ' ',
 'sw',
 ' ',
 'morrison',
 ' ',
 'street',
 ' ',
 'suite',
 ' ',
 '1300')

#### Basic Types
With no type resolvers attached, the tokenizer shall convert each token to a supported basic datatype

In [7]:
# For use with the proceeding components, we convert the tokens into an internal Token representation
encoded = dt.encode(address_unique)
encoded[8]

(_'ALPHA'_(1,'c'),
 _'PUNC'_(1,'/'),
 _'ALPHA'_(1,'o'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'pnc'),
 _'\\S'_(1,' '),
 _'ALPHA'_(4,'real'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'estate'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'tax'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'credit'),
 _'\\S'_(1,' '),
 _'ALPHA'_(7,'capital'),
 _'PUNC'_(1,'|'),
 _'NUMERIC'_(3,'121'),
 _'\\S'_(1,' '),
 _'ALPHA'_(2,'sw'),
 _'\\S'_(1,' '),
 _'ALPHA'_(8,'morrison'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'street'),
 _'\\S'_(1,' '),
 _'ALPHA'_(5,'suite'),
 _'\\S'_(1,' '),
 _'NUMERIC'_(4,'1300'))

In [8]:
# Each tuple in the list represents a row with each element inside the tuple, a Token Element. Each token element 
# maintains a bunch of profiling information which is later aggregated into patterns, anomalies, profiles etc.
print(encoded[5][0])
for v, item in vars(encoded[8][0]).items():
    print('{}: {}'.format(v, item))

_'NUMERIC'_(4,'1741')
regex_type: SupportedDataTypes.ALPHA
size: 1
value: c
rowidx: 8


#### Advanced Types
Here we attach an Address type resolver to a Tokenizer, which enables us to identify more complex tokens such as 'Street', 'Ave', 'Apt' etc

In [9]:
from openclean_pattern.datatypes.resolver import AddressDesignatorResolver, DefaultTypeResolver
from openclean_pattern.tokenize.regex import RegexTokenizer

In [10]:
# The default type resolver identifies basic types, and adding an address resolver shall empower it use a 
# a repository of master data to identify specialized tokens by building a prefix tree

tr = DefaultTypeResolver(interceptors=AddressDesignatorResolver())
rt = RegexTokenizer(type_resolver=tr)

In [11]:
# We see now there exist internal representations for _STREET_ and a _SUD_ (secondary address designator) tokens
address_encoded = rt.encode(address_unique)
address_encoded[8]

(_'ALPHA'_(1,'c'),
 _'PUNC'_(1,'/'),
 _'ALPHA'_(1,'o'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'pnc'),
 _'\\S'_(1,' '),
 _'ALPHA'_(4,'real'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'estate'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'tax'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'credit'),
 _'\\S'_(1,' '),
 _'ALPHA'_(7,'capital'),
 _'PUNC'_(1,'|'),
 _'NUMERIC'_(3,'121'),
 _'\\S'_(1,' '),
 _'ALPHA'_(2,'sw'),
 _'\\S'_(1,' '),
 _'ALPHA'_(8,'morrison'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'street'),
 _'\\S'_(1,' '),
 _'SUD'_(5,'suite'),
 _'\\S'_(1,' '),
 _'NUMERIC'_(4,'1300'))

### Collect
We aim to collect similar looking rows with each other using each token's regex_type and it's position as variables to calculate proximity

In [12]:
from openclean_pattern.align.cluster import Cluster

In [13]:
# Cluster uses a DBSCAN clusterer to calculate the distance between encoded rows and group them into clusters.
# Here we use the tree-edit-distance to compute proximity
clusters = Cluster(dist='TED', min_samples=50).collect(address_encoded)

In [14]:
# We discover 3 different patterns with atleast 50 values
for cluster in clusters:
    if cluster != -1:
        print(cluster)
        print(len(clusters[cluster]))
        print(address.iloc[list(clusters[0])]['Address'].sample(5))
        print()

0
403
2571               550 CALIFORNIA AVE|STE 200
3354        1580 NE 32ND AVENUE|APARTMENT 415
5835    533 NORTHEAST HOLLADAY STREET|APT 304
9667           243 S 2ND ST|C/O SHANNON SOUZA
1673                7885 SW VLAHOS DR|APT 106
Name: Address, dtype: object

2
59
8305               545 MERIDIAN AVE|STE D #26787
5686                   5305 RIVER RD NORTH|STE B
8658               707 SW WASHINGTON ST|STE 1500
8014    16123 LOWER HARBOR ROAD|GENERAL DELIVERY
4669                   15642 NE GLISAN STREET|#1
Name: Address, dtype: object

1
67
8204          5125 SW SCHOLLS FERRY RD|34
4887         18423 NW CHEMEKETA LN|UNIT B
8983    38100 SANDY HEIGHTS ST|APT # L135
1407              235 FRONT ST SE|STE 400
9049        2050 GOODPASTURE LOOP|APT 141
Name: Address, dtype: object



### Align
Next, for the identified groups, we add Gap characters i.e. align the such that each group has the same length

In [15]:
from openclean_pattern.align.pad import Padder

In [16]:
# The padder appends n gap Tokens to each row where n is the difference between the longest row in the column and
# the current row
pd = Padder()
tokens = pd.align(address_encoded, clusters)
tokens[8]

(_'ALPHA'_(1,'c'),
 _'PUNC'_(1,'/'),
 _'ALPHA'_(1,'o'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'pnc'),
 _'\\S'_(1,' '),
 _'ALPHA'_(4,'real'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'estate'),
 _'\\S'_(1,' '),
 _'ALPHA'_(3,'tax'),
 _'\\S'_(1,' '),
 _'ALPHA'_(6,'credit'),
 _'\\S'_(1,' '),
 _'ALPHA'_(7,'capital'),
 _'PUNC'_(1,'|'),
 _'NUMERIC'_(3,'121'),
 _'\\S'_(1,' '),
 _'ALPHA'_(2,'sw'),
 _'\\S'_(1,' '),
 _'ALPHA'_(8,'morrison'),
 _'\\S'_(1,' '),
 _'STREET'_(6,'street'),
 _'\\S'_(1,' '),
 _'SUD'_(5,'suite'),
 _'\\S'_(1,' '),
 _'NUMERIC'_(4,'1300'),
 _'G'_(0,''))

### Generate Patterns
We analyse each cluster of aligned rows and generate an Openclean Pattern object. Pattern generation can either be row-wise, i.e. all rows with the same tokens are aggregated into one pattern, or column-wise, i.e. tokens at each position in rows are pooled to retrieve the most common token type per position.

In [17]:
from openclean_pattern.regex.compiler import DefaultRegexCompiler

In [18]:
# the column method pools the majority token type from each token position across the rows. We discover
# that 174 values follow cluster 0, 32 follow cluster 2 and 67 follow cluster 1

rc = DefaultRegexCompiler(method='col', per_group='all')
patterns = rc.compile(address_encoded, clusters)

for cluster, pattern in patterns.items():
    if cluster != -1:
        print('{}'.format(pattern.top(1, pattern=True).freq))
        print('{} : {}'.format(cluster, pattern))
        print()

174
0 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

32
2 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(|#) PUNC(#) NUMERIC(1-4))

67
1 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(3-5) \S() NUMERIC(1-4))



### Identify Anomalies
We identify anomalies in the column. i.e. values in each group/cluster that didn't match the dominant(top) pattern

In [19]:
anoms = rc.anomalies(encoded, clusters)
for an in anoms:
    if an != -1:
        print(an, len(anoms[an]))
        print(patterns[an])
        if len(anoms[an]) > 0:
            print()
            print(address.iloc[anoms[an]]['Address'].sample(5))
        print()

0 171
RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

5686                            5305 RIVER RD NORTH|STE B
7339                           13000 NW CORNELL RD|APT 15
10469    ATTN OFFICE FACILITIES JONES|3601 SW MURRAY BLVD
8297                            5305 RIVER RD NORTH|STE B
8269                          528 WEST 10TH AVENUE|APT #1
Name: Address, dtype: object

2 23
RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(|#) PUNC(#) NUMERIC(1-4))

1269          1051C NE 6TH STREET|SUITE 1C
9816             5305 RIVER RD NORTH|STE B
7995    101 SOUTHWEST MADISON STREET|#8352
9650                 2936 WILLAMETTE ST|10
1280          1051C NE 6TH STREET|SUITE 1C
Name: Address, dtype: object

1 0
RowPatterns(NUMERIC(1-5) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(3-5) \S() NUMERIC(1-4))



### PattenFinder Abstraction
A PatternFinder object can be built with all the above components to easily detect patterns in the dataset.

In [20]:
from openclean_pattern.opencleanpatternfinder import OpencleanPatternFinder

In [21]:
# The sequence of operations remains the same, i.e.: 
# sampling -> type resolution + tokenization -> collection -> alignment -> compilation

pf = OpencleanPatternFinder(
    distinct=True,
    frac=1,
    tokenizer=RegexTokenizer(
        type_resolver=DefaultTypeResolver(
            interceptors=AddressDesignatorResolver()
        )
    ),
    collector=Cluster(dist='TED', min_samples=50),
    aligner=Padder(),
    compiler=DefaultRegexCompiler(method='col', per_group='all')
)
patterns = pf.find(address['Address'])

In [22]:
# We see the same patterns as with the longer process
for cluster, pattern in patterns.items():
    if cluster != -1:
        print('{}'.format(pattern.top(1, pattern=True).freq))
        print('{} : {}'.format(cluster, pattern))
        print()

174
0 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-11) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(2-9) \S() NUMERIC(1-5))

32
2 : RowPatterns(NUMERIC(2-5) \S() ALPHA(1-11) \S() ALPHA(2-11) \S() STREET(2-6) PUNC(|#) PUNC(#) NUMERIC(1-4))

68
1 : RowPatterns(NUMERIC(1-5) \S() ALPHA(1-13) \S() STREET(2-9) PUNC(|) SUD(3-5) \S() NUMERIC(1-4))



### Evaluate other data on a pattern
To be able to evaluate patterns on other columns, we'll use a patternfinder object to help perform the same set and sequence of operations on the new column

In [23]:
# get the top Pattern object from the 0th cluster. If pattern=False, a string would be returned instead
# of a Pattern object
pat = patterns[0].top(pattern=True)
pat

[DIGIT(1-5), SPACE_REP(1-1), ALPHA(1-11), SPACE_REP(1-1), ALPHA(1-13), SPACE_REP(1-1), STREET(2-9), PUNCTUATION(1-1), SUD(2-9), SPACE_REP(1-1), DIGIT(1-5)]

In [24]:
# returns True if match, else False
test = '23 Nelson Jansen Ave|Apt 2'
pat.compare(test, pf)

True

In [25]:
# we could also use the PatternFinder object to quickly compare a list of values with a pattern
test = [
    '832 SW VISTA AVENUE|APT 4',
    '23 Nelson Jansen Ave|Apt 2',
    '3 M J Ave|Fl 2',
    '521 Avalon block |House 1'
]
pf.compare(pat, test)

[False, True, True, False]

------------------------------------------------------------------------------------------------------------