# Functional Dependency Violations

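This example shows how to detect violations of a functional dependency (FD). It uses the **NYC Parking Violations Issued - Fiscal Year 2014** dataset to identify violations of the functional dependency `Meter Number -> Registration State, Street`, i.e., groups of rows that share a meter number but disagree on the registration state or the street.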
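
To make the definition concrete, the following is a minimal sketch in plain pandas, not the openclean implementation. The helper `violating_keys` and the toy values are hypothetical and only illustrate the idea: an FD `lhs -> rhs` is violated by every `lhs` value that occurs with more than one distinct combination of `rhs` values.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from the parking dataset).
toy = pd.DataFrame({
    'Meter Number': ['A-1', 'A-1', 'B-2', 'B-2'],
    'Registration State': ['NY', 'NY', 'NY', 'NY'],
    'Street': ['MAIN ST', 'MAIN STREET', 'BROADWAY', 'BROADWAY'],
})


def violating_keys(df, lhs, rhs):
    """Return the lhs values that occur with more than one distinct
    combination of rhs values, i.e., the keys that violate lhs -> rhs.
    """
    # Count the distinct (lhs, rhs...) combinations for each lhs value.
    distinct_rhs = df[[lhs] + rhs].drop_duplicates().groupby(lhs).size()
    return sorted(distinct_rhs[distinct_rhs > 1].index)


keys = violating_keys(toy, 'Meter Number', ['Registration State', 'Street'])
print(keys)
```

In this toy frame only meter `A-1` violates the dependency, because it appears with two different spellings of the street name.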

In [1]:
# Download the full 'Parking Violations Issued - Fiscal Year 2014' dataset.

import gzip

from openclean.data.source.socrata import Socrata

datafile = './jt7v-77mi.tsv.gz'

with gzip.open(datafile, 'wb') as f:
    ds = Socrata().dataset('jt7v-77mi')
    print('Downloading ...\n')
    print(ds.name + '\n')
    print(ds.description)
    ds.write(f)


# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
# datafile = './data/jt7v-77mi.tsv.gz'

Downloading ...

Parking Violations Issued - Fiscal Year 2014

Parking Violations Issuance datasets contain violations issued during the respective fiscal year.  The Issuance datasets are not updated to reflect violation status, the information only represents the violation(s) at the time they are issued. Since appearing on an issuance dataset, a violation may have been paid, dismissed via a hearing, statutorily expired, or had other changes to its status. To see the current status of outstanding parking violations, please look at the Open Parking & Camera Violations dataset.
• Parking Violations Issued Fiscal Year 2020 can be found at https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2020/pvqr-7yc4
• Parking Violations Issued Fiscal Year 2019 can be found at https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2019/faiq-9dfq
• Parking Violations Issued Fiscal Year 2018 can be found 

In [2]:
# Verify that the download was successful. Print dataset columns and number of rows.
# This example makes use of the streaming option to avoid loading the full data frame
# into memory.

from openclean.pipeline import stream

df = stream(datafile)


print('Schema\n------')
for col in df.columns:
    print("  '{}'".format(col))
    
print('\n{} rows.'.format(df.count()))

Schema
------
  'Summons Number'
  'Plate ID'
  'Registration State'
  'Plate Type'
  'Issue Date'
  'Violation Code'
  'Vehicle Body Type'
  'Vehicle Make'
  'Issuing Agency'
  'Street Code1'
  'Street Code2'
  'Street Code3'
  'Vehicle Expiration Date'
  'Violation Location'
  'Violation Precinct'
  'Issuer Precinct'
  'Issuer Code'
  'Issuer Command'
  'Issuer Squad'
  'Violation Time'
  'Time First Observed'
  'Violation County'
  'Violation In Front Of Or Opposite'
  'Number'
  'Street'
  'Intersecting Street'
  'Date First Observed'
  'Law Section'
  'Sub Division'
  'Violation Legal Code'
  'Days Parking In Effect    '
  'From Hours In Effect'
  'To Hours In Effect'
  'Vehicle Color'
  'Unregistered Vehicle?'
  'Vehicle Year'
  'Meter Number'
  'Feet From Curb'
  'Violation Post Code'
  'Violation Description'
  'No Standing or Stopping Violation'
  'Hydrant Violation'
  'Double Parking Violation'

9100278 rows.


In [3]:
# Get the first 100 rows that have a defined meter number, i.e., skip rows where
# the meter number is either an empty string or '-'.

from openclean.function.eval.domain import IsNotIn

df = df\
    .select(['Plate ID', 'Registration State', 'Plate Type', 'Meter Number', 'Street'])\
    .where(IsNotIn('Meter Number', {'-', ''}), limit=100)\
    .to_df()

df.head()

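|      | Plate ID | Registration State | Plate Type | Meter Number | Street         |
|------|----------|--------------------|------------|--------------|----------------|
| 661  | FXY1858  | NY                 | PAS        | 407-3018     | QUEENS BLVD    |
| 780  | 89988JX  | NY                 | COM        | 3 -          | FRESH POND TRD |
| 901  | FGX2747  | NY                 | PAS        | 504-3043     |                |
| 2287 | 23161JR  | NY                 | COM        | 144-3942     | WEST 42 STREET |
| 2346 | 47153MC  | NY                 | COM        | 144-3987     | W 40TH ST      |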


In [4]:
# Find violations of the functional dependency Meter Number -> Registration State, Street.

from openclean.operator.map.violations import fd_violations

groups = fd_violations(df, lhs='Meter Number', rhs=['Registration State', 'Street'])
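
The cells that follow iterate over the result and call `.get(key)` to obtain a data frame per meter number. As a rough mental model only, and assuming nothing about how openclean implements it, such a grouping can be sketched in plain pandas as a dictionary that maps each violating key to its rows. The function `violation_groups` and the toy values are hypothetical.

```python
import pandas as pd


def violation_groups(df, lhs, rhs):
    """Map each lhs value that violates lhs -> rhs to the data frame
    containing all rows for that value (hypothetical helper).
    """
    result = {}
    for key, rows in df.groupby(lhs):
        # A key violates the FD if its rows contain more than one
        # distinct combination of rhs values.
        if len(rows[rhs].drop_duplicates()) > 1:
            result[key] = rows
    return result


# Toy data with hypothetical values (not taken from the parking dataset).
toy = pd.DataFrame({
    'Meter Number': ['A-1', 'A-1', 'A-1', 'B-2', 'B-2'],
    'Registration State': ['NY', 'NY', 'NY', 'NY', 'NY'],
    'Street': ['MAIN ST', 'MAIN STREET', 'MAIN ST', 'BROADWAY', 'BROADWAY'],
})

toy_groups = violation_groups(toy, 'Meter Number', ['Registration State', 'Street'])
for key in toy_groups:
    # Print the violating key and the number of rows in its group.
    print('{} {}'.format(key, toy_groups[key].shape[0]))
```

Here the group for `A-1` contains all three of its rows, including the two that agree with each other, since the conflict is a property of the group as a whole.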

In [5]:
# List the meter numbers that have violations, together with the number of
# rows in each group of conflicting rows.

for key in groups:
    print('{} {}'.format(key, groups.get(key).shape[0]))

144-3942 2
144-3937 5
143-3785 5
144-6376 5
144-3958 3
144-3955 8
144-6088 4
144-6377 3
143-5983 3
140-5816 2
105-8347 2


In [6]:
# Show street names that cause violations of the functional dependency.

from openclean.operator.collector.count import distinct

print('Meter Number | Street (Count)')
print('=============|===============')
for key in groups:
    conflicts = distinct(groups.get(key), 'Street').most_common()
    street, count = conflicts[0]
    print('{:<12} | {} x {}'.format(key, count, street))
    for street, count in conflicts[1:]:
        print('             | {} x {}'.format(count, street))
    print('-------------|---------------')

Meter Number | Street (Count)
=============|===============
144-3942     | 1 x WEST 42 STREET
             | 1 x WEST 42 ST
-------------|---------------
144-3937     | 3 x WEST 42 STREET
             | 1 x WEST 42 ST
             | 1 x W 42ND ST
-------------|---------------
143-3785     | 3 x WEST 43RD ST
             | 1 x WEST 43 ST
             | 1 x W 43RD ST
-------------|---------------
144-6376     | 3 x 8TH AVENUE
             | 2 x 8TH AVE
-------------|---------------
144-3958     | 2 x WEST 41ST STREET
             | 1 x 7TH AVENUE
-------------|---------------
144-3955     | 4 x W 41 ST
             | 3 x W 41ST STREET
             | 1 x TIMES SQUARE
-------------|---------------
144-6088     | 3 x WEST 36 STREET
             | 1 x W 36TH ST
-------------|---------------
144-6377     | 1 x 35 ST
             | 1 x W 35 ST
             | 1 x 8TH AVENUE
-------------|---------------
143-5983     | 2 x W 43RD STREET
             | 1 x WEST 43RD STREET
-------------|---------------
140-5816     | 1 x 8TH 