## Businesses - Type Resolution Tutorial

In this notebook, we analyze several columns of a business registrations dataset and look at ways to detect inconsistencies in the data using column patterns.

Goals:
- Pattern generation with type resolvers
- Basic anomaly detection
- Non-basic/advanced anomaly detection

In [1]:
import pandas as pd

In [2]:
# load data
# The dataset has registration information for many businesses in the US including names, 
# entity information, dates, ownership information, and address

df = pd.read_csv('../resources/dev/urban.csv', usecols=['Business Name','Entity Type','Registry Date','Address ', 'City', 'Zip Code']).drop_duplicates().reset_index(drop=True)
df.sample(5)

Unnamed: 0,Business Name,Entity Type,Registry Date,Address,City,Zip Code
9495,CHIEF AUTO SPORTS LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/27/2020,940 WILLAMETTE ST STE 400,EUGENE,97401
10682,UAS AERIAL SOLUTIONS LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/30/2020,5823 NW VILLAGE GREEN PL,CORVALLIS,97330
9993,FLYING CLIPPER,ASSUMED BUSINESS NAME,07/29/2020,497 OAKWAY RD,EUGENE,97401
6700,GLOW BEAUTY LOUNGE,ASSUMED BUSINESS NAME,07/20/2020,1820 COMMERCIAL ST SE,SALEM,97302
4752,ROGUE ARTIST SUPPORT SERVICES LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/14/2020,4850 S PACIFIC HWY,PHOENIX,97535


In [3]:
df.shape

(11501, 6)

### Pattern Generation with Type Resolution
We use a patternfinder object to identify patterns in the data, using the internal business entity type resolver to help distinguish between business registration suffix tokens.

The tokenizer accepts a type resolver object to allow prefix matching as part of the tokenization. Internally, each type resolver is built on a master vocabulary, with prefix searches optimized using a prefix tree. The type resolver base class is easy to implement, which makes pattern detection with custom non-basic types flexible.

The DefaultTypeResolver detects the basic types (Alpha, Alphanum, Digit, Punctuation, Space, Gap), whereas the more advanced built-in implementations detect Business Entities (through their suffixes), Addresses (using USPS primary and secondary unit designator terms, e.g. Street, Ave, Apt), Dates (Month and Weekday), and GeoSpatial Entities (using NYU's datamart-geo as the base master data). Of course, as mentioned earlier, the type resolver class can be extended to suit one's needs.
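
To make the idea of a vocabulary-backed resolver more concrete, here is a minimal standalone sketch of prefix-tree matching over multi-word terms. It is not the openclean_pattern implementation: `VocabularyTrie`, `basic_type`, `resolve`, and the tiny suffix vocabulary are all hypothetical names chosen for illustration, and the real resolvers may differ in how they tokenize and label values.

```python
# Illustrative sketch only; none of these names come from openclean_pattern.
_END = object()  # sentinel key marking the end of a complete vocabulary term


def basic_type(word):
    """Fallback label for a token that no vocabulary term covers."""
    if word.isalpha():
        return "ALPHA"
    if word.isdigit():
        return "DIGIT"
    if word.isalnum():
        return "ALPHANUM"
    return "PUNC"


class VocabularyTrie:
    """Prefix tree over a vocabulary of (possibly multi-word) terms."""

    def __init__(self, vocabulary):
        self.root = {}
        for term in vocabulary:
            node = self.root
            for word in term.lower().split():
                node = node.setdefault(word, {})
            node[_END] = True

    def longest_match(self, words, start):
        """Number of words from `start` that form the longest known term (0 if none)."""
        node, best = self.root, 0
        for offset, word in enumerate(words[start:], 1):
            node = node.get(word.lower())
            if node is None:
                break
            if _END in node:
                best = offset
        return best


def resolve(value, trie, label="BUSINESS"):
    """Label vocabulary terms with `label` and every other token with a basic type."""
    words = value.split()
    tokens, i = [], 0
    while i < len(words):
        n = trie.longest_match(words, i)
        if n:
            tokens.append((label, " ".join(words[i:i + n])))
            i += n
        else:
            tokens.append((basic_type(words[i]), words[i]))
            i += 1
    return tokens


# Assumed toy vocabulary; a real master vocabulary would be much larger.
trie = VocabularyTrie(["llc", "inc", "corp", "limited liability company"])
print(resolve("CHIEF AUTO SPORTS LLC", trie))
print(resolve("ACME LIMITED LIABILITY COMPANY", trie))
```

Because the trie is walked word by word, a multi-word suffix such as "limited liability company" is matched as one token, while "LIMITED" on its own falls back to a basic type.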

In [4]:
# using the patternfinder to quickly understand column patterns
from openclean_pattern.opencleanpatternfinder import OpencleanPatternFinder as PatternFinder
from openclean_pattern.datatypes.resolver import BusinessEntityResolver, DefaultTypeResolver
from openclean_pattern.tokenize.regex import RegexTokenizer
from openclean_pattern.align.cluster import Cluster
from openclean_pattern.align.pad import Padder
from openclean_pattern.regex.compiler import DefaultRegexCompiler

# create a new DefaultTypeResolver object (identifies basic types)
# intercepted by a BusinessEntityResolver (identifies company suffixes)
# plug these into a new RegexTokenizer that'll tokenize the remaining values not identified by the type resolvers
# on all delimiters except dots(.) because they're abbreviation characters

rt = RegexTokenizer(
        abbreviations=True,
        type_resolver=DefaultTypeResolver(
            interceptors=BusinessEntityResolver()
        )
    )
column = 'Business Name'

In [5]:
# create a new PatternFinder object with the Cluster collector

pf = PatternFinder(distinct=True,
                    tokenizer=rt,
                    collector=Cluster(dist="TED", min_samples=100),
                    aligner=Padder(),
                    compiler=DefaultRegexCompiler(method='col', per_group='all')
                  )

In [6]:
# find patterns
grouped_patterns = pf.find(df[column])

In [7]:
# We see 13 patterns generated from row clusters with at least 100 samples. Most of the
# dominant patterns end in a BUSINESS suffix; the exceptions are clusters 1, 6, 8 and 9.

for i, gp in grouped_patterns.items():
    if i != -1:
        print(len(gp.top(pattern=True).idx))
        print(i, gp.top())
        print()

1174
0 ALPHA(1-14) \S() ALPHA(1-14) \S() BUSINESS(2-25)

641
1 ALPHA(1-13) \S() ALPHA(1-14)

904
2 ALPHA(1-12) \S() ALPHA(1-13) \S() ALPHA(1-14) \S() BUSINESS(2-25)

370
3 ALPHA(2-21) \S() BUSINESS(2-25)

111
4 ALPHA(1-11) \S() ALPHA(1-13) \S() ALPHA(1-14) \S() ALPHA(1-12) PUNC(,) \S() BUSINESS(2-12)

394
5 ALPHA(1-11) \S() ALPHA(1-14) \S() ALPHA(1-14) PUNC(,!) \S() BUSINESS(2-9)

673
6 ALPHA(1-14) \S() ALPHA(1-13) \S() ALPHA(1-14)

469
7 ALPHA(1-14) \S() ALPHA(1-15) PUNC(,!) \S() BUSINESS(2-4)

235
8 ALPHA(1-12) \S() ALPHA(1-13) \S() ALPHA(1-12) \S() ALPHA(1-14)

168
9 ALPHA(1-21)

132
10 ALPHA(3-18) PUNC(,) \S() BUSINESS(3-4)

256
11 ALPHA(1-11) \S() ALPHA(1-14) \S() ALPHA(1-12) \S() ALPHA(1-14) \S() BUSINESS(2-25)

69
12 ALPHA(1-11) \S() ALPHA(1-12) \S() ALPHA(1-12) \S() ALPHA(1-14) \S() ALPHA(1-13) \S() BUSINESS(3-25)



### Basic Outliers

Outliers are a constant nuisance in data cleaning. openclean_pattern supports a less conventional anomaly detection technique: using row patterns to detect mismatched entries. In this example, we discover two patterns in the Zip Code column: a single digit token, and two alphanumeric tokens separated by a space. The latter appears in less than 0.1% of the values, so these entries can be regarded as outliers. Inspecting them reveals that some Canadian records sneaked into the dataset.
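
The reasoning above can be sketched without the library: map every value to a coarse pattern of basic types, count the patterns, and flag values whose pattern is rare. The helper names (`basic_pattern`, `rare_pattern_values`), the `max_share` threshold, and the sample values are assumptions made for this illustration; the notebook cells below use the actual PatternFinder instead.

```python
import re
from collections import Counter


def basic_pattern(value):
    """Map a string to a space-joined sequence of basic type labels."""
    labels = []
    # Split into alphanumeric runs, whitespace runs, and punctuation runs.
    for token in re.findall(r"[A-Za-z0-9]+|\s+|[^\sA-Za-z0-9]+", value):
        if token.isspace():
            labels.append("SPACE_REP")
        elif token.isdigit():
            labels.append("DIGIT")
        elif token.isalpha():
            labels.append("ALPHA")
        elif token.isalnum():
            labels.append("ALPHANUM")
        else:
            labels.append("PUNC")
    return " ".join(labels)


def rare_pattern_values(values, max_share=0.01):
    """Return the values whose pattern covers at most `max_share` of all values."""
    patterns = [basic_pattern(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total <= max_share]


# Hypothetical sample: nine US zip codes and one Canadian postal code.
zips = ["97401"] * 6 + ["97330", "97302", "97535", "M5E 1K3"]
print(Counter(basic_pattern(z) for z in zips))
print(rare_pattern_values(zips, max_share=0.2))
```

The threshold is a judgment call: set it too high and legitimate minority formats are flagged, too low and genuine mismatches slip through.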

In [8]:
from openclean_pattern.align.group import Group

# create a new PatternFinder object with the group by length collector

pf = PatternFinder(distinct=True,
                    tokenizer=rt,
                    collector=Group(),
                    aligner=Padder()
                  )
column = 'Zip Code'
patterns = pf.find(df[column])

In [9]:
# Looking at the patterns discovered in the Zip Code column, we see 9 distinct values
# that deviate from the rest of the column, whereas the remaining 1148 distinct values follow a
# single digit pattern

for i, gp in patterns.items():
    if i != -1:
        print(len(gp.top(pattern=True).idx))
        print(i, gp.top())
        print()

1148
1 DIGIT

9
3 ALPHANUM SPACE_REP ALPHANUM



In [10]:
# these are Canadian records

import numpy as np

pred = np.logical_not(pf.compare(patterns[1].top(pattern=True), df[column]))
df.loc[pred]

Unnamed: 0,Business Name,Entity Type,Registry Date,Address,City,Zip Code
4538,"MKII SERVICE, INC.",FOREIGN BUSINESS CORPORATION,07/14/2020,69 YONGE ST SUITE 600,TORONTO,M5E 1K3
4551,"MK PAYMENT SOLUTIONS, INC.",FOREIGN BUSINESS CORPORATION,07/14/2020,69 YONGE ST SUITE 600,TORONTO,M5E 1K3
4562,"MKII MARKETING, INC.",FOREIGN BUSINESS CORPORATION,07/14/2020,69 YONGE ST SUITE 600,TORONTO,M5E 1K3
5732,TASMAN AIR SERVICES LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/16/2020,#5 4340 KING ST,DELTA,V4K 0A5
5949,SLANG WORLDWIDE INC.,FOREIGN BUSINESS CORPORATION,07/17/2020,50 CARROLL STREET,TORONTO,M4M 3G3
5992,"TPD RESOURCES, INC.",FOREIGN BUSINESS CORPORATION,07/17/2020,980 HOWE STREET,VANCOUVER,V6Z 0C8
6007,QUARTECH CORRECTIONS LLC,FOREIGN LIMITED LIABILITY COMPANY,07/17/2020,2889 12TH AVENUE E,VANCOUVER,V5M 4T5
7808,DESIGN TRANSPORT USA INC.,DOMESTIC BUSINESS CORPORATION,07/22/2020,8705 170 STREET,SURREY,V4N 5K8
7868,NOBLE FOODS NUTRITION USA INC.,FOREIGN BUSINESS CORPORATION,07/22/2020,250 AV AVRO,POINTE-CLAIRE,H9R 6B1
7879,NOBLE FOODS NUTRITION USA HOLDINGS INC.,FOREIGN BUSINESS CORPORATION,07/22/2020,250 AV AVRO,POINTE-CLAIRE,H9R 6B1


### Non-Basic Outliers

This seems tricky, but non-basic mismatches are automatically pushed out as noise by the heuristic hyperparameter optimization for the DBSCAN clusterer. Let's revisit non-basic type resolution, this time with the geospatial vocabulary from datamart_geo.
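
As a rough intuition for how a minimum cluster size turns rare patterns into noise, the sketch below groups rows by identical type signature and sends every group smaller than `min_samples` to the label -1. This is a deliberate simplification and not DBSCAN, which clusters on distances between rows rather than exact matches; `group_with_noise` and the sample signatures are hypothetical.

```python
from collections import defaultdict


def group_with_noise(signatures, min_samples=10):
    """Group row indices by signature; groups below `min_samples` become noise (-1)."""
    by_signature = defaultdict(list)
    for idx, sig in enumerate(signatures):
        by_signature[sig].append(idx)

    clusters = {-1: []}
    next_label = 0
    for rows in by_signature.values():
        if len(rows) >= min_samples:
            clusters[next_label] = rows
            next_label += 1
        else:
            clusters[-1].extend(rows)
    return clusters


# Hypothetical encoded rows: 5 plain ALPHA cities, 1 ADMIN_LEVEL_0, 3 ADMIN_LEVEL_2.
signatures = [("ALPHA",)] * 5 + [("ADMIN_LEVEL_0",)] + [("ADMIN_LEVEL_2",)] * 3

print(group_with_noise(signatures, min_samples=3))
print(group_with_noise(signatures, min_samples=4))
```

Raising `min_samples` from 3 to 4 moves the three ADMIN_LEVEL_2 rows into the noise group, which mirrors the trade-off discussed for the clusterer in the cells below.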

In [11]:
from openclean_pattern.datatypes.resolver import GeoSpatialResolver

In [12]:
# create a new DefaultTypeResolver object (identifies basic types)
# intercepted by a GeoSpatialResolver (identifies geospatial entities and their administrative levels)

dtr = DefaultTypeResolver(interceptors=[GeoSpatialResolver()])

# create a new RegexTokenizer that'll tokenize the remaining values not identified by the type resolvers
# on all delimiters

rt = RegexTokenizer(type_resolver=dtr)

# select the distinct values of the City column; in the following cells we use the tokenizer directly
# and play with the number of samples in the Cluster collector

column = 'City'
city = df[column].drop_duplicates()

In [13]:
# we type encode the city column, 
# notice how the tokens are now attached to an administrative level if found in the datamart_geo vocabulary

enc = rt.encode(city)
enc[:5]

[(_'ALPHA'_(9,'wenatchee'),),
 (_'ALPHA'_(6,'keizer'),),
 (_'ADMIN_1'_(8,'new york'),),
 (_'ADMIN_2'_(5,'salem'),),
 (_'ADMIN_1'_(8,'portland'),)]

In [14]:
# we cluster with min_samples = 10, and align similar tokens

gr = Cluster(min_samples=10).collect(enc)
al = Padder().align(enc, gr)

In [15]:
# and compile the patterns

pn = DefaultRegexCompiler(method='row', per_group='all').compile(al, gr)

In [16]:
# Upon inspection, we see that cluster -1 (noise) holds numerous patterns, each backed by fewer than 10 rows.
# Remember that we set min_samples in the clusterer to 10? We can tweak that hyperparameter to lower or raise the
# noise threshold. We also see how important it is for the type resolver to produce correct mappings: many of the
# values categorized as noise are rows that only partially matched the master vocabulary. This is a two-way
# street, so when values are mis-categorized as noise we should also review the respective type resolver.

for i, pnn in pn.items():
    print(i)
    for pat in pnn.values():
        print(pat, pat.freq, list(pat.idx)[:5])
    print()

0
[ALPHA(3-13)] 400 [0, 1, 5, 8, 9]

1
[ADMIN_LEVEL_1(4-12)] 13 [2, 163, 4, 584, 719]

2
[ADMIN_LEVEL_2(4-13)] 60 [129, 3, 259, 771, 7]

3
[ALPHA(2-8), SPACE_REP(1-1), ADMIN_LEVEL_4(5-8)] 16 [160, 450, 762, 644, 389]

4
[ADMIN_LEVEL_4(4-12)] 37 [128, 261, 646, 775, 10]

5
[ADMIN_LEVEL_3(4-13)] 50 [512, 258, 643, 394, 522]

-1
[ADMIN_LEVEL_3(4-9), SPACE_REP(1-1), ALPHA(4-7), GAP(0-0), GAP(0-0)] 8 [33, 16, 720, 84, 309]
[ADMIN_LEVEL_1(3-9), SPACE_REP(1-1), ALPHA(4-7), GAP(0-0), GAP(0-0)] 9 [256, 41, 745, 746, 652]
[ADMIN_LEVEL_0(7-7), GAP(0-0), GAP(0-0), GAP(0-0), GAP(0-0)] 1 [51]
[ADMIN_LEVEL_2(6-9), SPACE_REP(1-1), ALPHA(5-7), GAP(0-0), GAP(0-0)] 9 [364, 716, 687, 315, 52]
[ADMIN_LEVEL_1(5-11), SPACE_REP(1-1), ADMIN_LEVEL_2(4-4), GAP(0-0), GAP(0-0)] 4 [64, 271, 702, 215]
[ADMIN_LEVEL_4(4-4), SPACE_REP(1-1), ADMIN_LEVEL_4(5-5), GAP(0-0), GAP(0-0)] 1 [93]
[ADMIN_LEVEL_1(5-6), SPACE_REP(1-1), ADMIN_LEVEL_4(5-9), GAP(0-0), GAP(0-0)] 2 [248, 94]
[ADMIN_LEVEL_2(4-8), SPACE_REP(1-1), ADMIN_LE

In [17]:
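# Looking at the noisy data, one thing that stands out is ADMIN_LEVEL_0, which is usually reserved for
# country names. False alarm! The rows below show that LEBANON is a city in this dataset, not the country.

print(city.iloc[51])  # 51 is the positional index of the ADMIN_LEVEL_0 row above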
df[df['City']==city.iloc[51]].head()

LEBANON


Unnamed: 0,Business Name,Entity Type,Registry Date,Address,City,Zip Code
133,VESSEL MUSIC GROUP LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/01/2020,755 E ASH ST,LEBANON,97355
134,VESSEL MUSIC GROUP LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/01/2020,1200 E GRANT ST STE E,LEBANON,97355
599,SONS OF STONE CARPENTRY LLC,DOMESTIC LIMITED LIABILITY COMPANY,07/02/2020,475 W ASH ST,LEBANON,97355
780,BEAUTY MARKED BY MICHELLE MARKS,ASSUMED BUSINESS NAME,07/02/2020,971 E GRANT ST,LEBANON,97355
781,BEAUTY MARKED BY MICHELLE MARKS,ASSUMED BUSINESS NAME,07/02/2020,455 S MAIN ST,LEBANON,97355


In [18]:
# Other values that stand out are those in the noise cluster that were not even partially type resolved.
# The type resolvers strip all punctuation to perform partial string matching, so more preprocessing
# may help here

city.iloc[[209, 224, 448, 736, 735]]

997             LEE'S SUMMIT
1086       ELK GROVE VILLAGE
4020     MOUNT HOOD PARKDALE
10056          COEUR D ALENE
10055          COEUR D'ALENE
Name: City, dtype: object

In [19]:
# It might also be worth looking at the smaller single-pattern clusters 6 and 9.
# Again, these are legitimate values, so the easiest way to align them with the rest of the
# group could be removing the hyphens

print(city.iloc[list(pn[6].top(2, pattern=True).idx)])
print()
print(city.iloc[list(pn[9].top(2, pattern=True).idx)])

7868    POINTE-CLAIRE
Name: City, dtype: object

3911    MILTON-FREEWATER
Name: City, dtype: object
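
One possible form of the preprocessing suggested above is to replace punctuation with spaces before tokenizing, so that hyphenated and apostrophized spellings line up with their plain counterparts. `normalize_city` is a hypothetical helper written for this sketch, not part of openclean_pattern, and whether this normalization suits a given dataset is an assumption to verify.

```python
import re


def normalize_city(name):
    """Replace punctuation with spaces, collapse whitespace, and upper-case the name."""
    cleaned = re.sub(r"[^\w\s]|_", " ", name)  # keeps letters (incl. accented) and digits
    return re.sub(r"\s+", " ", cleaned).strip().upper()


for raw in ["POINTE-CLAIRE", "MILTON-FREEWATER", "COEUR D'ALENE", "COEUR D ALENE"]:
    print(raw, "->", normalize_city(raw))
```

With this rule both spellings of COEUR D'ALENE collapse to the same string. The trade-off is that possessives are split as well (LEE'S SUMMIT becomes LEE S SUMMIT), so dropping apostrophes instead of replacing them may be preferable for some columns.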

