# Discover Statistical Outliers in the City Name Column

This notebook demonstrates the anomaly detection operators in **openclean**, which are implemented on top of the [scikit-learn machine learning library](https://scikit-learn.org/stable/). **openclean** includes five such operators. Here we use a simple ensemble approach: we apply all five operators to a sample of the *DOB Job Application Filings* dataset and count, for each value, how many operators classified it as an outlier.


In [1]:
# Use the 'DOB Job Application Filings - Download' notebook to download the
# 'DOB Job Application Filings' dataset for this example.

datafile = './ic3t-wcy2.tsv.gz'


In [2]:
# Use a random sample of 10,000 records for this example.

from openclean.pipeline import stream

# Note that the column name in this dataset includes a trailing space ('City ').
df = stream(datafile).select('City ').update('City ', str.upper).sample(10000, seed=42).to_df()

In [3]:
# Print (a subset of) the distinct city names in the sample together with
# their frequencies.

df['City '].value_counts()

NEW YORK      3713
BROOKLYN      1577
QUEENS         542
NY             459
BRONX          433
              ... 
ENGELWOOD        1
NEW CANAAN       1
EATONTOWN        1
FLORAL           1
CALVERTON        1
Name: City , Length: 508, dtype: int64

In [4]:
# Use a counter to keep track of how many anomaly detection operators
# classified each value as an outlier.

from collections import Counter

ensemble = Counter()
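
The vote counting relies only on the standard behavior of `collections.Counter`: calling `update()` with a list increments the count of every element in that list by one. The following is a minimal sketch of that mechanism. The three result lists are made-up stand-ins for what individual operators might return, not actual operator output.

```python
from collections import Counter

# Hypothetical outlier lists, one per detector (toy data for illustration).
detector_results = [
    ['N.Y.', 'BRO0KLYN'],
    ['N.Y.', '6132'],
    ['6132', 'N.Y.', 'MIAMI'],
]

votes = Counter()
for outliers in detector_results:
    # Each value in the list receives one vote from this detector.
    votes.update(outliers)

for value, count in votes.most_common():
    print(count, value)
```

A value's final count is therefore the number of detectors that flagged it, provided each detector lists a value at most once.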

In [5]:
# Apply five different anomaly detection operators to the values in the city column.
# Here we use a default value embedding that ignores the frequency of each value (since
# in this NYC Open Data dataset city names like NEW YORK and any of the five boroughs
# are more frequent than other names).

from openclean.embedding.feature.default import UniqueSetEmbedding
from openclean.profiling.anomalies.sklearn import (
    dbscan,
    isolation_forest,
    local_outlier_factor,
    one_class_svm,
    robust_covariance
)

for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
    ensemble.update(f(df, 'City ', features=UniqueSetEmbedding()))
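
To make the idea of a frequency-independent value embedding more concrete, the sketch below maps each distinct string to a small numeric feature vector. This is an illustration only: the function `char_features` and its four character-class features are assumptions chosen for this example and are not taken from the openclean source. What the sketch does share with the cell above is that duplicates are removed first, so a value that occurs thousands of times contributes exactly one vector.

```python
def char_features(value):
    """Hypothetical features: fractions of digits, letters, whitespace
    and all other characters in the given string."""
    n = len(value)
    if n == 0:
        return (0.0, 0.0, 0.0, 0.0)
    digits = sum(c.isdigit() for c in value)
    letters = sum(c.isalpha() for c in value)
    spaces = sum(c.isspace() for c in value)
    other = n - digits - letters - spaces
    return (digits / n, letters / n, spaces / n, other / n)


# Toy column with a repeated value; the repetition is ignored because
# only the set of distinct values is embedded.
values = ['NEW YORK', 'NEW YORK', 'BROOKLYN', '6132', 'N.Y.']
embedding = {v: char_features(v) for v in sorted(set(values))}

for v, vec in embedding.items():
    print(v, vec)
```

Under features like these, a purely numeric value such as `6132` or a heavily punctuated one such as `N.Y.` lies far from ordinary city names, which is the kind of separation an outlier detector can pick up on.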


In [6]:
# Output values that have been classified as outliers by at least three out of the
# five operators.

prev = 0
for value, count in ensemble.most_common():
    if count < 3:
        break
    if count < prev:
        print()
    if count != prev:
        print('{}\t{}'.format(count, value))
    else:
        print('\t{}'.format(value))
    prev = count

5	6132

4	B'KLYN
	L.I. CITY
	L.I.C
	NB.Y.
	NEW  YORK
	N.Y.C.
	N.Y.
	BRONX,
	SI,NY
	L.I.C.
	S.I.
	BRO0KLYN
	N Y
	N.Y
	QUEENS,
	521 FIFTH AVENU

3	LONG IS. CITY
	NEW YORK, NY
	HENAU(SWITZERLA
	FLUSHING MEADOW
	NEW YOR
	HOLLIS HILLS
	WILLIAMSBURG,
	S. OZONE PARK
	FLORAL  PARK
	NEW YORK,
	E. JERSEY CITY
	SO. OZONE PARK
	NEW HYPDE PK.
	RICHMOND-HILL
	BROOKLYN,
	ST HELIER, BVI
	FLUSHING,QUEENS
	MIAMI
	NW YORK
	NEEWARK,
	LONG ISL. CITY
	MT.VERNON
	LI CITY
	PHILADELPHIA
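
If the grouped result is needed for further processing rather than for printing, the same thresholding can be expressed as a small helper that returns a dictionary. This is a sketch, not part of openclean: `group_by_votes` is a hypothetical name, and the counter below is toy data that reuses a few values from the output above plus one invented entry (`EXAMPLE`) that falls below the threshold.

```python
from collections import Counter, defaultdict


def group_by_votes(votes, min_votes=3):
    """Group values by vote count, keeping only counts >= min_votes."""
    groups = defaultdict(list)
    # most_common() yields entries in descending order of count, so we
    # can stop at the first count below the threshold.
    for value, count in votes.most_common():
        if count < min_votes:
            break
        groups[count].append(value)
    return dict(groups)


# Toy counter; 'EXAMPLE' is an invented entry with only two votes.
votes = Counter({'6132': 5, "B'KLYN": 4, 'L.I.C': 4, 'MIAMI': 3, 'EXAMPLE': 2})
grouped = group_by_votes(votes)
print(grouped)
```

Raising `min_votes` trades recall for precision: a threshold of five keeps only values that every operator flagged, while a threshold of three also admits plausible place names such as MIAMI or PHILADELPHIA.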
