# Standardization of Street Names

Find groups of different street names that might be alternative representations of the same street. This is an example of the key collision clustering supported by **openclean**. It uses the **NYC Parking Violations Issued - Fiscal Year 2014** dataset.
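
To give a rough idea of what key collision clustering does, the sketch below groups values whose normalized key is identical. The `fingerprint` and `collide` helpers are hypothetical and written in plain Python. They are not openclean's implementation, and the normalization steps (lower-casing, stripping punctuation, sorting and de-duplicating tokens) are an assumption modeled on the common fingerprint approach.

```python
import re
from collections import Counter, defaultdict


def fingerprint(value):
    """Hypothetical key function (an assumption, not openclean's code):
    lower-case the value, drop punctuation, then sort and de-duplicate
    the whitespace-separated tokens.
    """
    cleaned = re.sub(r'[^\w\s]', '', value.lower())
    tokens = sorted(set(cleaned.split()))
    return ' '.join(tokens)


def collide(counts, minsize=2):
    """Group values that share a key; keep groups with at least
    `minsize` distinct values. Each group maps value -> frequency.
    """
    groups = defaultdict(Counter)
    for value, count in counts.items():
        groups[fingerprint(value)][value] += count
    return [group for group in groups.values() if len(group) >= minsize]


# A few values and counts copied from this notebook's output, plus one
# made-up singleton ('BROADWAY') that should not end up in any cluster.
sample = Counter({
    '2ND AVE': 4075,
    '2nd Ave': 67751,
    '2ND  AVE': 5,
    '2ND AVE.': 1,
    'AVE 2ND': 1,
    'W 125 ST': 3365,
    'W .125 ST': 5,
    'BROADWAY': 10,
})

clusters = collide(sample)
for cluster in clusters:
    print(dict(cluster))
```

Because tokens are sorted and de-duplicated, values such as `AVE 2ND` and `2ND AVE.` collapse to the same key, which is why differences in token order and punctuation do not prevent a collision under this scheme.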

In [1]:
# Download the full 'Parking Violations Issued - Fiscal Year 2014' dataset.
# Note that this is a file of ~ GB!

import gzip
import os

from openclean.data.source.socrata import Socrata

datafile = './jt7v-77mi.tsv.gz'

# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        ds = Socrata().dataset('jt7v-77mi')
        print('Downloading ...\n')
        print(ds.name + '\n')
        print(ds.description)
        ds.write(f)

        
# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
# datafile = './data/jt7v-77mi.tsv.gz'

In [2]:
# Use streaming function to avoid having to load the full dataset
# into memory.

from openclean.pipeline import stream

df = stream(datafile)

In [3]:
# Get the distinct set of street names. By computing the distinct set
# first, we avoid computing the key for the same street name multiple
# times.

streets = df.select('Street').distinct()

print('{} distinct streets (for {} total values)'.format(len(streets), sum(streets.values())))

115567 distinct streets (for 9100278 total values)


In [4]:
# Cluster street names using key collision (with the default key generator).
# Remove clusters that contain fewer than seven distinct values (for display
# purposes). Use multiple threads (4) to generate value keys in parallel.

from openclean.cluster.key import key_collision

# Minimum cluster size. Use seven as the default for the full dataset (to
# limit the number of clusters that are printed in the next cell).
minsize = 7

# Use minimum cluster size of 2 when using the dataset sample
# minsize = 2

clusters = key_collision(values=streets, minsize=minsize, threads=4)

print('{} clusters of size {} or greater'.format(len(clusters), minsize))

13 clusters of size 7 or greater
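
Once clusters are available, a common next step is to pick one representative per cluster and map the other values to it. The sketch below assumes that the representative is simply the most frequent value, which matches the frequency counts printed in this notebook but is an assumption about how openclean chooses its suggestion. The helpers `suggest` and `replacement_map` are hypothetical names, not part of the openclean API.

```python
from collections import Counter


def suggest(cluster):
    """Return the most frequent value in a cluster (value -> count)."""
    return cluster.most_common(1)[0][0]


def replacement_map(cluster):
    """Map every non-representative value to the suggested value."""
    target = suggest(cluster)
    return {value: target for value in cluster if value != target}


# Three values and counts taken from one of the clusters in this notebook.
cluster = Counter({'W 125 ST': 3365, 'W .125 ST': 5, 'W. 125 ST': 3})

print(suggest(cluster))
print(replacement_map(cluster))
```

A mapping of this shape can then be applied to the street column to standardize the values, ideally after a manual review, since a key collision only indicates that values might refer to the same street.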


In [5]:
# For each cluster print cluster values, their frequency counts,
# and the suggested common value for the cluster.

def print_cluster(cnumber, cluster):
    print('Cluster {} (of size {})\n'.format(cnumber, len(cluster)))
    for val, count in cluster.items():
        print('{} ({})'.format(val, count))
    print('\nSuggested value: {}\n\n'.format(cluster.suggestion()))
    
# Sort clusters by decreasing number of distinct values.
clusters.sort(key=len, reverse=True)

# Number the clusters starting at 1 for display.
for i, cluster in enumerate(clusters, start=1):
    print_cluster(i, cluster)


Cluster 1 (of size 8)

2ND AVE (4075)
2nd Ave (67751)
2ND  AVE (5)
2ND AVE. (1)
AVE 2ND (1)
2ND      AVE (1)
2ND    AVE (2)
2ND       AVE (1)

Suggested value: 2nd Ave


Cluster 2 (of size 8)

ST NICHOLAS AVE (2451)
ST. NICHOLAS AVE (125)
St Nicholas Ave (23462)
ST, NICHOLAS AVE (1)
ST NICHOLAS  AVE (9)
ST NICHOLAS   AVE (1)
ST  NICHOLAS AVE (4)
ST. NICHOLAS  AVE (1)

Suggested value: St Nicholas Ave


Cluster 3 (of size 8)

LAWRENCE ST (165)
ST LAWRENCE (34)
LAWRENCE  ST (1)
Lawrence St (2368)
ST. LAWRENCE (2)
ST LAWRENCE ST (1)
LAWRENCE ST. (1)
ST. LAWRENCE ST (1)

Suggested value: Lawrence St


Cluster 4 (of size 8)

ST NICHOLAS (847)
ST NICHOLAS ST (31)
NICHOLAS ST (27)
ST. NICHOLAS (27)
ST  NICHOLAS (2)
ST NICHOLAS  ST (1)
Nicholas St (79)
ST. NICHOLAS ST (1)

Suggested value: ST NICHOLAS


Cluster 5 (of size 7)

W 125 ST (3365)
W 125    ST (1)
W. 125 ST. (1)
W .125 ST (5)
W  125 ST (2)
W 125  ST (1)
W. 125 ST (3)

Suggested value: W 125 ST


Cluster 6 (of size 7)

FERRY LOT 2 (74