# NYC Parking Violations - Data Profiling and Data Cleaning Demo

This notebook contains examples that demonstrate data profiling and data cleaning functionality in [openclean](https://github.com/VIDA-NYU/openclean).

*Dataset*. All examples in this notebook use the **NYC Parking Violations Issued - Fiscal Year 2014** dataset, which is available via the [Socrata Open Data API](https://dev.socrata.com/). The dataset contains over 9 million rows, and the data file is about 380 MB in size. A smaller version of the dataset is included in the repository so that the notebook can be run without downloading the full dataset. Be aware, however, that all examples are designed for the full dataset; with the smaller example dataset, the results of individual steps may differ.

In [1]:
# Download the full 'NYC Parking Violations Issued - Fiscal Year 2014' dataset.
# Note that the downloaded full dataset file is about 380 MB in size! Use the
# alternative data file with 10,000 rows that is included in the repository if
# you do not want to download the full data file.

import gzip
import humanfriendly
import os

from openclean.data.source.socrata import Socrata

dataset = Socrata().dataset('jt7v-77mi')
datafile = './jt7v-77mi.tsv.gz'

# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)


# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
#datafile = './data/jt7v-77mi.tsv.gz'

fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Using 'Parking Violations Issued - Fiscal Year 2014' in file ./jt7v-77mi.tsv.gz of size 379.19 MB


In [2]:
# Due to the size of the full dataset file, we make use of openclean's
# stream operator to avoid having to load the dataset into main memory.

from openclean.pipeline import stream

ds = stream(datafile)
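
The idea behind streaming can be sketched with the standard library alone: rows are read one at a time from the compressed file and only the running result (here, a value counter) is kept in memory. This is an illustration of the general technique, not of how openclean's `stream` is implemented. The helper `stream_distinct` is hypothetical, and the tab delimiter is an assumption based on the `.tsv.gz` file name.

```python
import csv
import gzip
import io
from collections import Counter


def stream_distinct(fileobj, column, delimiter='\t'):
    """Count the values of one column while reading rows one at a time."""
    counts = Counter()
    # gzip.open accepts a file name or an open binary file object.
    with gzip.open(fileobj, mode='rt', newline='') as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for the real data file (made-up rows).
raw = 'Plate ID\tRegistration State\nA1\tNY\nB2\tNJ\nC3\tNY\n'
buf = io.BytesIO(gzip.compress(raw.encode('utf-8')))
counts = stream_distinct(buf, 'Registration State')
print(counts.most_common())
```

Memory use grows with the number of distinct values rather than with the number of rows, which is why this pattern suits a file of this size.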

In [3]:
# Print the first ten rows of the dataset to get a first
# idea of the content.

ds.head()

Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,...,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
0,1361929741,FCJ5493,NY,PAS,12/18/1970,20,SUBN,GMC,S,35030,...,BLACK,0,2013,-,0,,,,,
1,1366962000,63540MC,NY,COM,02/02/1971,46,DELV,FRUEH,P,58830,...,BRN,0,2013,-,0,,,,,
2,1342296187,GCY4187,NY,SRF,09/18/1971,21,VAN,FORD,S,11790,...,BLUE,0,2002,-,0,,,,,
3,1342296199,95V6675,TX,PAS,09/18/1971,21,,GMC,S,11790,...,SILVR,0,2008,-,0,,,,,
4,1342296217,FYM5117,NY,SRF,09/18/1971,21,SUBN,NISSA,S,28190,...,WHITE,0,2012,-,0,,,,,
5,1356906515,GFM1421,NY,PAS,09/18/1971,40,SDN,MAZDA,X,13610,...,BLK,0,2010,-,7,,,,,
6,1337077380,18972BB,NY,999,10/10/1971,14,BUS,INTER,P,8440,...,YELLO,0,2001,-,0,,,,,
7,1364523796,WNJ4730,VA,PAS,04/05/1973,14,SDN,TOYOT,P,50830,...,BLK,0,0,-,0,,,,,
8,1359914924,68091JZ,NY,COM,07/22/1973,46,DELV,TOYOT,P,10610,...,WH,0,2010,-,0,,,,,
9,1355498326,EWV4127,NY,PAS,08/12/1973,21,SUBN,ACURA,X,42630,...,GREY,0,2005,-,0,,,,,


## Data Profiling

We start by computing basic data profiling statistics to get a better understanding of the dataset.

In [4]:
# Profile a sample of 1000 rows using the default data profiler.

from openclean.profiling.column import DefaultColumnProfiler

profiles = ds.sample(n=1000, random_state=42).profile(default_profiler=DefaultColumnProfiler)

In [5]:
# Print overview of profiling results.

profiles.stats()

Unnamed: 0,total,empty,distinct,uniqueness,entropy
Summons Number,1000,0,1000,1.0,9.965784
Plate ID,1000,0,997,0.997,9.959784
Registration State,1000,0,36,0.036,1.559809
Plate Type,1000,0,18,0.018,1.312779
Issue Date,1000,0,306,0.306,8.045415
Violation Code,1000,0,52,0.052,4.366667
Vehicle Body Type,1000,7,27,0.02719,2.806224
Vehicle Make,1000,9,59,0.059536,4.660736
Issuing Agency,1000,0,5,0.005,1.065302
Street Code1,1000,0,531,0.531,8.125434
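
The `uniqueness` and `entropy` values above are consistent with the usual definitions: the number of distinct values divided by the number of non-empty values (for example, 27 / 993 ≈ 0.02719 for *Vehicle Body Type*), and the Shannon entropy in bits of the value distribution (log2(1000) ≈ 9.965784 for a column with 1,000 distinct values). The following plain-Python sketch reproduces these statistics under those assumptions. `column_stats` is a hypothetical helper and not part of openclean, and treating `None` and the empty string as empty values is an assumption.

```python
import math
from collections import Counter


def column_stats(values):
    """Compute basic profiling statistics for a list of column values."""
    total = len(values)
    # Assumption: None and the empty string count as empty values.
    non_empty = [v for v in values if v not in (None, '')]
    counts = Counter(non_empty)
    n = len(non_empty)
    # Shannon entropy (in bits) of the distribution of non-empty values.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {
        'total': total,
        'empty': total - n,
        'distinct': len(counts),
        'uniqueness': len(counts) / n if n else None,
        'entropy': entropy,
    }


stats = column_stats(['NY', 'NY', 'NJ', '', 'PA'])
print(stats)
```

A column of identifiers such as *Summons Number* reaches the maximum entropy for its size, while a column dominated by a single value such as *Issuing Agency* stays close to zero.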


In [6]:
# Print dataset columns that contain non-empty values together with the most
# frequent data type from the profiling results.

print('Schema\n------')
for col in ds.columns:
    p = profiles.column(col)
    if p['emptyValueCount'] == 1000:
        continue
    print("  '{}' ({})".format(col, p['datatypes']['distinct'].most_common(1)[0][0]))
    
# Print number of rows.
    
print('\n{:,} rows.'.format(ds.count()))

Schema
------
  'Summons Number' (int)
  'Plate ID' (str)
  'Registration State' (str)
  'Plate Type' (str)
  'Issue Date' (date)
  'Violation Code' (int)
  'Vehicle Body Type' (str)
  'Vehicle Make' (str)
  'Issuing Agency' (str)
  'Street Code1' (int)
  'Street Code2' (int)
  'Street Code3' (int)
  'Vehicle Expiration Date' (int)
  'Violation Location' (int)
  'Violation Precinct' (int)
  'Issuer Precinct' (int)
  'Issuer Code' (int)
  'Issuer Command' (int)
  'Issuer Squad' (str)
  'Violation Time' (str)
  'Time First Observed' (str)
  'Violation County' (str)
  'Violation In Front Of Or Opposite' (str)
  'Number' (int)
  'Street' (str)
  'Intersecting Street' (str)
  'Date First Observed' (int)
  'Law Section' (int)
  'Sub Division' (str)
  'Violation Legal Code' (str)
  'Days Parking In Effect    ' (str)
  'From Hours In Effect' (str)
  'To Hours In Effect' (str)
  'Vehicle Color' (str)
  'Unregistered Vehicle?' (int)
  'Vehicle Year' (int)
  'Meter Number' (str)
  'Feet From Curb

### Outliers and Anomalies in Dataset Columns

In the following, we use the distinct value sets from two dataset columns to give examples of how to identify potentially erroneous values (outliers).

#### Registration State

This example looks at the distinct values in the column *Registration State*. It shows that instead of the expected 50 U.S. states, the column contains 69 different values. We demonstrate how to use **openclean**'s reference data repository to help identify the 19 values that do not represent valid U.S. states (also see note at the end of this section).

In [7]:
# Get set of distinct values for column 'Registration State'. Print the
# values in decreasing order of frequency.

states = ds.distinct('Registration State')
for rank, val in enumerate(states.most_common()):
    st, freq = val
    print('{:<3} {}  {:>10}'.format('{}.'.format(rank + 1), st, '{:,}'.format(freq)))

1.  NY   7,029,804
2.  NJ     878,677
3.  PA     225,760
4.  CT     136,973
5.  FL     111,887
6.  MA      78,650
7.  VA      60,951
8.  MD      50,407
9.  IN      49,126
10. NC      47,117
11. 99      38,080
12. IL      31,763
13. GA      30,837
14. AZ      24,245
15. TX      24,092
16. OH      21,995
17. CA      20,100
18. OK      19,688
19. SC      19,529
20. ME      19,459
21. TN      18,396
22. MI      16,365
23. DE      14,643
24. RI      13,296
25. MN      12,901
26. NH       9,930
27. VT       7,215
28. IA       7,166
29. WA       5,967
30. ID       5,863
31. AL       5,828
32. QB       5,336
33. WI       5,311
34. DP       5,264
35. ON       5,183
36. DC       3,728
37. CO       3,663
38. OR       3,484
39. MS       3,428
40. KY       3,222
41. NM       2,936
42. MO       2,876
43. AR       2,716
44. LA       2,500
45. NV       2,131
46. WV       1,688
47. NE       1,626
48. GV       1,317
49. KS       1,226
50. AK         961
51. UT         942
52. SD         691
53. MT      

The output of the previous cell shows that value frequency alone is not a reliable way to identify outlier values (e.g., *99* is more frequent than many of the actual U.S. states). A more reliable solution is to use a curated list of U.S. states to identify invalid values.

**openclean** provides easy access to reference datasets that can be used for data cleaning. One of these datasets is the [list of U.S. States and Territories from Wikipedia](https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States).


In [8]:
# List identifier and names for available reference datasets.

from openclean.data.refdata import RefStore

refdata = RefStore()
for entry in refdata.repository().find():
    print('{:<35}:  {}'.format(entry.identifier, entry.name))

encyclopaedia_britannica:us_cities :  Cities in the U.S.
restcountries.eu                   :  REST Countries
usps:street_abbrev                 :  C1 Street Suffix Abbreviations
usps:secondary_unit_designators    :  C2 Secondary Unit Designators
wikipedia:us_states                :  States and territories of the U.S.


In [9]:
# Download the States and Territories dataset.

refdata.load('wikipedia:us_states', auto_download=True).df().head()

Unnamed: 0,name,postal_abbreviation,capital_city,largest_city,ratification_date,population,total_area_mi_2,total_area_km_2,land_area_mi_2,land_area_km_2,water_area_mi_2,water_area_km_2,number_of_reps
0,Alabama,AL,Montgomery,Birmingham,1819-12-14,4903185,52420,135767,50645,131171,1775,4597,7
1,Alaska,AK,Juneau,Anchorage,1959-01-03,731545,665384,1723337,570641,1477953,94743,245384,1
2,Arizona,AZ,Phoenix,Phoenix,1912-02-14,7278717,113990,295234,113594,294207,396,1026,9
3,Arkansas,AR,Little Rock,Little Rock,1836-06-15,3017804,53179,137732,52035,134771,1143,2961,4
4,California,CA,Sacramento,Los Angeles,1850-09-09,39512223,163695,423967,155779,403466,7916,20501,53


In [10]:
# Get set of distinct state name abbreviations (i.e., postal abbreviations).

states_ref = refdata.load('wikipedia:us_states', auto_download=True).distinct('postal_abbreviation')

In [11]:
# Print information for entries in the 'Registration State' column that
# do not occur in the reference list of U.S. states.

for rank, val in enumerate(states.most_common()):
    st, freq = val
    if st not in states_ref:
        print('{:<3} {}  {:>10}'.format('{}.'.format(rank + 1), st, '{:,}'.format(freq)))

11. 99      38,080
32. QB       5,336
34. DP       5,264
35. ON       5,183
36. DC       3,728
48. GV       1,317
56. NS         373
57. BC         329
59. AB         243
60. PR         211
61. NB         151
62. MX         108
63. PE          99
64. SK          25
65. MB          22
66. YT          14
67. FO           9
68. NT           6
69. NF           1


**Note**: The [NYC Plate Types & State Codes](http://www.nyc.gov/html/dof/html/pdf/faq/stars_codes.pdf) document lists 17 *OTHER CODES* in addition to the 50 state codes. These can be used to explain all but two of the *outliers* (not included are *99* and *PR*).

We currently do not have this list in our [reference data repository](https://github.com/VIDA-NYU/openclean-reference-data). **Contributions are welcome!**

#### Vehicle Expiration Date

In this section we look at the values in the *Vehicle Expiration Date* column. This example makes use of outlier detection functionality that has been integrated into **openclean** from [scikit-learn](https://scikit-learn.org).

In [12]:
# Print the ten most frequent values for the 'Vehicle Expiration Date' column.

expiration_dates = ds.distinct('Vehicle Expiration Date')

for rank, val in enumerate(expiration_dates.most_common(10)):
    st, freq = val
    print('{:<3} {:>8}  {:>10}'.format('{}.'.format(rank + 1), st, '{:,}'.format(freq)))

print('\nTotal number of distinct values is {}'.format(len(expiration_dates)))


1.         0   1,036,939
2.  88888888   1,034,518
3.  88880088     275,925
4.  20140088     163,398
5.  20130088     155,346
6.  20140930     127,904
7.  20140430      92,368
8.  20141231      91,262
9.  20141130      90,542
10. 20140228      87,149

Total number of distinct values is 4415


#### Detect Outliers using scikit-learn

Looking at the most frequent values already provides an idea of the possible outliers in this column.

**openclean** also provides functionality to detect outliers in a dataset column (or any list of values). Next we give a brief example for using the [DBSCAN clustering algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) from scikit-learn for anomaly detection. When using DBSCAN, values that are not included in any cluster are considered outliers. 
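
The noise rule that makes DBSCAN usable for outlier detection can be sketched in a few lines of plain Python. A point is a core point if at least `min_samples` points (itself included) lie within distance `eps`. A point is noise if it is neither a core point nor within `eps` of one. The sketch below is a brute-force, one-dimensional illustration with made-up numeric values. It is not the scikit-learn implementation, and it does not reproduce how openclean turns string values into feature vectors before clustering.

```python
def dbscan_noise(points, eps, min_samples):
    """Return the points that DBSCAN would label as noise (1-D, brute force)."""
    # The neighborhood of a point includes the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if abs(p - q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_samples}
    # Noise: not a core point and not within eps of any core point.
    return [
        p for i, p in enumerate(points)
        if i not in core and not any(j in core for j in neighbors[i])
    ]


values = [1.0, 1.1, 1.2, 1.3, 5.0, 9.0, 9.1, 9.2]
noise = dbscan_noise(values, eps=0.25, min_samples=3)
print(noise)
```

In this toy input only `5.0` has no dense neighborhood, so it is the only value reported. The choice of `eps` decides how isolated a value has to be before it is reported.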

In [13]:
# Using the default settings yields two outliers.

from openclean.profiling.anomalies.sklearn import DBSCANOutliers

DBSCANOutliers().find(expiration_dates)

['88888888', '0']

In [14]:
# If we change the eps parameter (maximum distance between two samples for one to be considered
# as in the neighborhood of the other) we can find even more potential outliers (including one that
# we had not seen before).

DBSCANOutliers(eps=0.05).find(expiration_dates)

['88880088', '20140088', '88888888', '0', '19750423', '20130088']

In [15]:
# Take a look at the record(s) that have an expiration date of '19750423'.

from openclean.function.eval.base import Col

ds\
    .filter(Col('Vehicle Expiration Date') == '19750423')\
    .select(['Plate ID', 'Plate Type', 'Registration State', 'Street', 'Vehicle Make', 'Violation Code'])\
    .to_df()

Unnamed: 0,Plate ID,Plate Type,Registration State,Street,Vehicle Make,Violation Code
631299,GFR1342,PAS,NY,FLUSHING MEADOW CORO,DODGE,20


## Street Names

In this section we take a look at the *Street* column that contains the names of streets where a parking violation occurred. Street address columns can be very noisy due to different abbreviations, different representations for street numbers, etc. (e.g., 5 Ave vs. Fifth Avenue vs. 5th Av).

### Key Collision Clustering

**openclean** provides functionality for grouping values based on similarity. This functionality is adopted from [OpenRefine Clustering](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth). The main idea is to identify clusters of values that are different but might be alternative representations of the same thing.

*Key collision* methods create an alternative representation for each value (i.e., a key) and then group values based on their keys. The default key generator in **openclean** is the [fingerprint](https://github.com/VIDA-NYU/openclean-core/blob/master/openclean/function/value/key/fingerprint.py) that was adopted from [OpenRefine](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth). The main steps in creating a fingerprint key value are:

- Remove leading and trailing whitespace,
- Convert string to lower case,
- Normalize string by removing punctuation and control characters and replacing non-diacritic characters (if the default normalizer is used),
- Tokenize string by splitting on whitespace characters,
- Sort the tokens and remove duplicates,
- Concatenate remaining (sorted) tokens using a single space character as the delimiter.
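
The steps above can be sketched in plain Python as follows. This is a simplified illustration and not the openclean implementation. In particular, the default normalizer's character replacement is approximated here by stripping diacritics via Unicode decomposition, which is an assumption of this sketch.

```python
import string
import unicodedata


def fingerprint(value):
    """Simplified fingerprint key for a string value."""
    # Remove leading and trailing whitespace, convert to lower case.
    text = value.strip().lower()
    # Stand-in for the normalizer: decompose characters, drop combining marks.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Remove punctuation and non-whitespace control characters.
    text = ''.join(
        c for c in text
        if c not in string.punctuation
        and (unicodedata.category(c) != 'Cc' or c.isspace())
    )
    # Tokenize on whitespace, remove duplicates, sort, and concatenate.
    return ' '.join(sorted(set(text.split())))


for name in ['2ND AVE', '2nd Ave', '2ND  AVE', '2ND AVE.', 'AVE 2ND']:
    print(repr(name), '->', repr(fingerprint(name)))
```

All five spellings map to the same key, so a key collision clusterer would place them in one group. Values such as `2 AVE` or `SECOND AVENUE` produce different keys and are not merged by this method.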


In [16]:
# Cluster street names using 'Key Collision' clustering with the
# default fingerprint key generator.

streets = ds.select('Street')

from openclean.cluster.key import KeyCollision
from openclean.function.value.key.fingerprint import Fingerprint

clusters = streets.cluster(clusterer=KeyCollision(func=Fingerprint(), threads=3))

In [17]:
# Define simple helper method to print the k largest clusters.

def print_k_clusters(clusters, k=10):
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    val_count = sum([len(c) for c in clusters])
    print('Total number of clusters is {} with {} values'.format(len(clusters), val_count))
    for i in range(min(k, len(clusters))):
        print('\nCluster {}'.format(i + 1))
        for key, cnt in clusters[i].items():
            if key == '':
                key = "''"
            print('  {} (x {})'.format(key, cnt))

In [18]:
print_k_clusters(clusters)

Total number of clusters is 8478 with 18836 values

Cluster 1
  2ND AVE (x 4075)
  2nd Ave (x 67751)
  2ND  AVE (x 5)
  2ND AVE. (x 1)
  AVE 2ND (x 1)
  2ND      AVE (x 1)
  2ND    AVE (x 2)
  2ND       AVE (x 1)

Cluster 2
  ST NICHOLAS AVE (x 2451)
  ST. NICHOLAS AVE (x 125)
  St Nicholas Ave (x 23462)
  ST, NICHOLAS AVE (x 1)
  ST NICHOLAS  AVE (x 9)
  ST NICHOLAS   AVE (x 1)
  ST  NICHOLAS AVE (x 4)
  ST. NICHOLAS  AVE (x 1)

Cluster 3
  LAWRENCE ST (x 165)
  ST LAWRENCE (x 34)
  LAWRENCE  ST (x 1)
  Lawrence St (x 2368)
  ST. LAWRENCE (x 2)
  ST LAWRENCE ST (x 1)
  LAWRENCE ST. (x 1)
  ST. LAWRENCE ST (x 1)

Cluster 4
  ST NICHOLAS (x 847)
  ST NICHOLAS ST (x 31)
  NICHOLAS ST (x 27)
  ST. NICHOLAS (x 27)
  ST  NICHOLAS (x 2)
  ST NICHOLAS  ST (x 1)
  Nicholas St (x 79)
  ST. NICHOLAS ST (x 1)

Cluster 5
  W 125 ST (x 3365)
  W 125    ST (x 1)
  W. 125 ST. (x 1)
  W .125 ST (x 5)
  W  125 ST (x 2)
  W 125  ST (x 1)
  W. 125 ST (x 3)

Cluster 6
  FERRY LOT 2 (x 743)
  FERRY LOT #2 

In [19]:
# Convert all street names to upper case before clustering.

streets = ds.select('Street').update('Street', str.upper)
clusters = streets.cluster(clusterer=KeyCollision(func=Fingerprint(), threads=3))
print_k_clusters(clusters, k=5)

Total number of clusters is 4119 with 9164 values

Cluster 1
  W 125 ST (x 3365)
  W 125    ST (x 1)
  W. 125 ST. (x 1)
  W .125 ST (x 5)
  W  125 ST (x 2)
  W 125  ST (x 1)
  W. 125 ST (x 3)

Cluster 2
  FERRY LOT 2 (x 743)
  FERRY LOT #2 (x 140)
  FERRY  LOT #2 (x 1)
  FERRY LOT  2 (x 3)
  FERRY LOT # 2 (x 121)
  FERRY LOT  # 2 (x 2)
  FERRY LOT  #2 (x 1)

Cluster 3
  2ND AVE (x 71826)
  2ND  AVE (x 5)
  2ND AVE. (x 1)
  AVE 2ND (x 1)
  2ND      AVE (x 1)
  2ND    AVE (x 2)
  2ND       AVE (x 1)

Cluster 4
  ST NICHOLAS AVE (x 25913)
  ST. NICHOLAS AVE (x 125)
  ST, NICHOLAS AVE (x 1)
  ST NICHOLAS  AVE (x 9)
  ST NICHOLAS   AVE (x 1)
  ST  NICHOLAS AVE (x 4)
  ST. NICHOLAS  AVE (x 1)

Cluster 5
  LGA TERMINAL B (x 26)
  LGA, TERMINAL B (x 1)
  LGA/ TERMINAL B (x 1)
  TERMINAL B LGA (x 20)
  TERMINAL B - LGA (x 2)
  TERMINAL B -LGA (x 1)
  LGA TERMINAL B, (x 1)


### Specialized Key Generators

The default fingerprint key generator already shows some promising results. The method, however, misses many cases that we found to be common in U.S. street address columns. A few examples are:

- Different abbreviations for street types, e.g., 35 St vs. 35 Str vs. 35 Street
- Missing whitespace between street number and street type, e.g., 35St vs. 35 St
- Text representations of street numbers, e.g., Fifth Ave vs. 5th Ave vs. 5 Ave

To address these issues, the [geospatial extension package for openclean](https://github.com/VIDA-NYU/openclean-geo) contains a specialized key generator and value standardizer that are demonstrated in the following.
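
As a rough illustration of what a street-specific key has to handle, the following sketch normalizes the three cases listed above using a deliberately tiny lookup table. It is not the openclean-geo implementation. The function `street_key` and the `REPLACEMENTS` table are hypothetical and cover only a handful of tokens.

```python
import re

# Hypothetical and deliberately incomplete lookup table (illustration only).
REPLACEMENTS = {
    # street types
    'STREET': 'ST', 'STR': 'ST', 'AVENUE': 'AVE', 'AV': 'AVE',
    # text representations of street numbers
    'FIRST': '1', 'SECOND': '2', 'THIRD': '3', 'FOURTH': '4', 'FIFTH': '5',
    # directions
    'WEST': 'W', 'EAST': 'E', 'NORTH': 'N', 'SOUTH': 'S',
}


def street_key(name):
    """Map different spellings of a street name to a common key."""
    text = name.upper()
    # Drop an ordinal suffix attached to a number if another token follows
    # ('5TH AVE' -> '5 AVE'); a trailing '35ST' keeps ST as the street type.
    text = re.sub(r'(\d+)(?:ST|ND|RD|TH)\b(?=\s+\w)', r'\1', text)
    # Insert missing whitespace between digits and letters ('35ST' -> '35 ST').
    text = re.sub(r'(\d)([A-Z])', r'\1 \2', text)
    text = re.sub(r'([A-Z])(\d)', r'\1 \2', text)
    # Keep alphanumeric tokens only and replace known variants.
    tokens = re.findall(r'[A-Z0-9]+', text)
    return ' '.join(REPLACEMENTS.get(t, t) for t in tokens)


for name in ['Fifth Ave', '5th Ave', '5 Ave', '35St', '35 Street']:
    print(repr(name), '->', repr(street_key(name)))
```

A real key generator needs far larger lookup tables and has to resolve ambiguities that this sketch ignores, for example `ST` as an abbreviation for both *Street* and *Saint*.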

In [20]:
# Use a key generator that was specifically designed for street names.

from openclean.cluster.key import KeyCollision
from openclean_geo.address.usstreet import USStreetNameKey

# In this example we take a different approach: we first extract the list of
# distinct street names from the data file. We then apply transformations and
# clustering directly on the list of names using three parallel threads.

street_names = ds.update('Street', str.upper).distinct('Street')
clusters = KeyCollision(func=USStreetNameKey(), threads=3).clusters(street_names)
print_k_clusters(clusters, k=5)

Total number of clusters is 10386 with 33342 values

Cluster 1
  W 43 STREET (x 200)
  W 43 ST (x 1666)
  W 43RD ST (x 19864)
  WEST 43 STREET (x 425)
  W 43RD STREET (x 52)
  WEST 43RD ST (x 147)
  WEST 43 ST (x 366)
  WEST 43RD STREET (x 210)
  W 43ST (x 11)
  W 43 RD ST (x 8)
  W 43TH ST (x 10)
  WEST 43RD  STREET (x 1)
  WEST  43 ST (x 1)
  W.43 RD ST (x 1)
  W.43 STREET (x 3)
  W.43RD ST (x 1)
  W.43 ST (x 9)
  WEST 43ST (x 10)
  W43 ST (x 9)
  W. 43 STREET (x 3)
  W43ST (x 1)
  W. 43RD ST (x 1)
  WEST 43TH ST (x 1)
  W .43RD ST (x 2)
  W 43RD  ST (x 1)
  W. 43 ST (x 1)
  W 43 RD STREET (x 1)
  W 43  STREET (x 1)
  W  43 ST (x 1)
  W43RD ST (x 1)
  WEST 43  STREET (x 1)
  W.43 TH  ST (x 1)
  W.43 TH ST (x 1)
  WEST 43TH STREET (x 1)
  WEST 43  ST (x 1)
  W .43 ST (x 1)
  WEST  43ST (x 1)
  WEST 43 RD ST (x 1)

Cluster 2
  W 125 ST (x 3365)
  W 125    ST (x 1)
  W 125 STREET (x 451)
  WEST 125 ST (x 522)
  WEST 125TH ST (x 81)
  W 125TH ST (x 11611)
  WEST 125 STREET (x 354)
  W 12

In [21]:
# Use street name standardization operator to modify street names
# before clustering them using the default fingerprint operator.

from openclean_geo.address.usstreet import StandardizeUSStreetName

street_names_std = StandardizeUSStreetName(characters='upper').apply(street_names, threads=3)
clusters = KeyCollision(func=Fingerprint(), threads=3).clusters(street_names_std)
print_k_clusters(clusters, k=5)

Total number of clusters is 2354 with 5075 values

Cluster 1
  LGA TERMINAL B (x 26)
  LGA , TERMINAL B (x 1)
  LGA / TERMINAL B (x 1)
  TERMINAL B LGA (x 20)
  TERMINAL B - LGA (x 6)
  LGA TERMINAL B , (x 1)

Cluster 2
  B WAY (x 211)
  B - WAY (x 11)
  B / WAY (x 2)
  B . WAY (x 42)
  B . WAY . (x 1)
  B ; WAY (x 1)

Cluster 3
  LGA , CTB (x 1)
  LGA / CTB (x 1)
  LGA CTB (x 10)
  CTB LGA (x 3)
  LGA - CTB (x 1)
  CTB - LGA (x 1)

Cluster 4
  EAST L GRANT HWY (x 48)
  EAST . L GRANT HWY (x 18)
  EAST . L . GRANT HWY (x 25)
  EAST L . GRANT HWY (x 1)
  EAST / L / GRANT HWY (x 1)
  EAST - L GRANT HWY (x 1)

Cluster 5
  JOHN ST (x 4395)
  ST JOHN (x 10)
  ST JOHN ST (x 8)
  ST . JOHN ST (x 1)
  ST . JOHN (x 1)
  JOHN ST . (x 1)


In [22]:
# Use option to remove special characters (keep only alpha-numeric tokens)
# when standardizing street names and option to remove repeated tokens.

f = StandardizeUSStreetName(characters='upper', alphanum=True, repeated=False)
street_names_std = f.apply(street_names, threads=3)
clusters = KeyCollision(func=Fingerprint(), threads=3).clusters(street_names_std)
print_k_clusters(clusters, k=5)

Total number of clusters is 761 with 1541 values

Cluster 1
  SOUTH E C O 14 ST (x 1)
  SOUTH E C O E 14 ST (x 1)
  SOUTH O C O E 14 ST (x 1)

Cluster 2
  20 FT FROM C O S W (x 1)
  20 FT FROM S W C O (x 1)
  20 FT FROM S W C O C (x 1)

Cluster 3
  NORTH W C O NORTH 4 ST (x 1)
  NORTH W C O W 4 ST (x 1)
  NORTH W C O 4 ST (x 1)

Cluster 4
  NORTH E C O E 71 (x 5)
  NORTH O C O E 71 (x 1)
  NORTH E C O 71 (x 1)

Cluster 5
  ANN ST (x 1171)
  ST ANN ST (x 1)
  ST ANN (x 2)


## Vehicle Color - Using Functional Dependencies to assist with data standardization


In the last part we make use of the fact that 'Plate ID' and 'Registration State' uniquely identify a vehicle. We further assume that the color of a vehicle does not change within the one year of data that we are looking at here. Thus, the functional dependency 'Plate ID', 'Registration State' -> 'Vehicle Color' should hold. Violations of that dependency often point to different representations of the same color value.

**Note** that this example uses a random sample of 100,000 rows from the full dataset. Conflict repair on large datasets is still expensive and slow.
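
Checking a functional dependency amounts to grouping rows by the left-hand-side columns and reporting every group that contains more than one distinct right-hand-side value. The following plain-Python sketch shows the idea on three toy rows. `fd_conflicts` is a hypothetical helper and not the openclean operator used in the cells below.

```python
from collections import defaultdict


def fd_conflicts(rows, lhs, rhs):
    """Return the groups of rows that violate the dependency lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        # Group by the values of the left-hand-side columns.
        key = tuple(row[col] for col in lhs)
        groups[key].add(row[rhs])
    # A violation is a key with more than one distinct right-hand-side value.
    return {key: vals for key, vals in groups.items() if len(vals) > 1}


# Toy rows for illustration.
rows = [
    {'Plate ID': '15993JW', 'Registration State': 'NY', 'Vehicle Color': 'WHITE'},
    {'Plate ID': '15993JW', 'Registration State': 'NY', 'Vehicle Color': 'WH'},
    {'Plate ID': 'ETP5761', 'Registration State': 'NY', 'Vehicle Color': 'BLUE'},
]
conflicts = fd_conflicts(rows, ['Plate ID', 'Registration State'], 'Vehicle Color')
print(conflicts)
```

Only the first vehicle is reported, because it appears with two different color values.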

In [23]:
from openclean.function.eval.base import Col
from openclean.function.eval.logic import And
from openclean.function.eval.null import IsNotEmpty

df = ds\
    .select(['Plate ID', 'Registration State', 'Vehicle Color'])\
    .rename('Registration State', 'State')\
    .where(And(IsNotEmpty('Vehicle Color'), Col('State') != '99', Col('Plate ID') != '999'))\
    .update('Vehicle Color', str.upper)\
    .sample(n=100000, random_state=42)\
    .to_df()

In [24]:
# The Plate ID and Registration State should identify a vehicle uniquely. We use
# this key to find conflicts in the 'Vehicle Color' column.

from openclean.operator.map.violations import fd_violations

# Find violations of the FD ['Plate ID', 'Registration State'] -> ['Vehicle Color']

groups = fd_violations(df, lhs=['Plate ID', 'State'], rhs='Vehicle Color')

# Print number of conflicting groups.

print('{} vehicles with conflicting colors'.format(len(groups)))

2626 vehicles with conflicting colors


In [25]:
# Show examples for vehicles that occur in the dataset with
# different colors.

for key in list(groups.keys())[:3]:
    print(groups.get(key))
    print('\n')

        Plate ID State Vehicle Color
5058377  20223PC    NY         BLACK
5203178  20223PC    NY         WHITE
7877971  20223PC    NY         BLACK


        Plate ID State Vehicle Color
5850925  15993JW    NY         WHITE
1088940  15993JW    NY            WH


        Plate ID State Vehicle Color
6031626  17614MD    NY         WHITE
3442567  17614MD    NY            WH
4772395  17614MD    NY            WH




In [26]:
# Find the most frequent combinations of values that occur in these conflicts.

from collections import Counter

freq_conflicts = Counter()
for key in groups.keys():
    conflicts = tuple(sorted(groups.conflicts(key, 'Vehicle Color').keys()))
    freq_conflicts[conflicts] += 1

In [27]:
# Print the 25 most frequent conflict sets.

freq_conflicts.most_common(25)

[(('WH', 'WHITE'), 893),
 (('BK', 'BLACK'), 163),
 (('RD', 'RED'), 101),
 (('BL', 'BLUE'), 93),
 (('GREY', 'GY'), 87),
 (('BR', 'BROWN'), 83),
 (('GY', 'SILVE'), 77),
 (('BRN', 'BROWN'), 65),
 (('WHITE', 'WHT'), 63),
 (('GREY', 'SILVE'), 48),
 (('WH', 'WHT'), 34),
 (('GR', 'GREEN'), 33),
 (('OTHER', 'WHITE'), 23),
 (('WH', 'WHITE', 'WHT'), 22),
 (('ORANG', 'WH'), 19),
 (('WHITE', 'WT'), 19),
 (('BN', 'BROWN'), 19),
 (('GRAY', 'GY'), 19),
 (('BR', 'BRN', 'BROWN'), 18),
 (('BK', 'BLK'), 17),
 (('BLACK', 'GREY'), 16),
 (('BLACK', 'WHITE'), 14),
 (('GREY', 'OTHER'), 11),
 (('BLACK', 'BLUE'), 11),
 (('YELLO', 'YW'), 11)]

### How could this be turned into a conflict resolution function?

In [28]:
# This is just a naive and incomplete example for demonstration purposes.
# Finding a good conflict resolution strategy in this case is still an
# open issue.

from openclean.util.core import scalar_pass_through
from openclean.function.value.base import CallableWrapper, ConstantValue, UnpreparedFunction

class ColorResolve(UnpreparedFunction):
    """Conflict resolution function that defines for each possible group
    of conflicts what the resolution value is.
    """
    def prepare(self, values):
        key = tuple(sorted(set(values)))
        if key == ('WH', 'WHITE'):
            return ConstantValue('WHITE')
        elif key == ('BK', 'BLACK'):
            return ConstantValue('BLACK')
        elif key == ('RD', 'RED'):
            return ConstantValue('RED')
        elif key == ('BL', 'BLUE'):
            return ConstantValue('BLUE')
        elif key == ('GREY', 'GY'):
            return ConstantValue('GREY')
        else:
            return CallableWrapper(scalar_pass_through)
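
Listing every conflict set by hand does not scale. One possible generalization, sketched below in plain Python, resolves a conflict set only if every value in it is an abbreviation of the longest value. This is a hypothetical heuristic for illustration, not an openclean feature, and it shares the limits of the example above: genuinely different colors such as `BLACK` and `WHITE` are left unresolved.

```python
def is_abbreviation(short, full):
    """True if short is shorter than full, starts with the same letter,
    and its letters occur in full in the same order."""
    if not short or len(short) >= len(full) or short[0] != full[0]:
        return False
    letters = iter(full)
    # 'ch in letters' consumes the iterator, which enforces the order.
    return all(ch in letters for ch in short)


def resolve(values):
    """Return the longest value if all others abbreviate it, else None."""
    distinct = set(values)
    if not distinct:
        return None
    longest = max(distinct, key=len)
    others = distinct - {longest}
    if others and all(is_abbreviation(v, longest) for v in others):
        return longest
    return None


for group in [('WH', 'WHITE'), ('GREY', 'GY'), ('BR', 'BRN', 'BROWN'), ('BLACK', 'WHITE')]:
    print(group, '->', resolve(group))
```

The heuristic can only pick a value that occurs in the conflict set, so a group such as `('BK', 'BLK')` resolves to `BLK` rather than `BLACK`.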

In [29]:
# Repair conflicts using the defined conflict resolution function.

from openclean.operator.collector.repair import conflict_repair

strategy = {'Vehicle Color': ColorResolve()}
df = conflict_repair(conflicts=groups, strategy=strategy, in_order=False)

In [30]:
groups = fd_violations(df, lhs=['Plate ID', 'State'], rhs='Vehicle Color')
print('{} vehicles with conflicting colors'.format(len(groups)))

1289 vehicles with conflicting colors


In [31]:
# Show examples for vehicles that still occur in the dataset with
# different colors.

for key in list(groups.keys())[:3]:
    print(groups.get(key))
    print('\n')

        Plate ID State Vehicle Color
5058377  20223PC    NY         BLACK
5203178  20223PC    NY         WHITE
7877971  20223PC    NY         BLACK


        Plate ID State Vehicle Color
7214805  ETP5761    NY          BLUE
3800302  ETP5761    NY         GREEN


         Plate ID State Vehicle Color
6797054  G130516N    GV          GREY
6050816  G130516N    GV         BLACK


