# Data Quality Assignment

## Agenda:
- We are working in the IT department of the NY city hall.
- We are receiving csv-files with the parking violations from the Police.
- **We need to know if the data is correct and if not - make the corrections.**
- We are working with the following data - https://www.kaggle.com/new-york-city/nyc-parking-tickets

**TASK 1**

Explore the data quality dimensions of the data. Name each one correctly according to DAMA-DMBOK (see pages 458-459 of the DAMA book).
Hint: DAMA-DMBOK can be found in the additional materials.
See also: https://www.cdc.gov/ncbddd/hearingloss/documents/dataqualityworksheet.pdf

In 2013, DAMA UK produced a white paper describing six core dimensions of data quality:
- **Completeness**: The proportion of data stored against the potential for 100%.
- **Uniqueness**: No entity instance (thing) will be recorded more than once based upon how that thing is
identified.
- **Timeliness**: The degree to which data represent reality from the required point in time.
- **Validity**: Data is valid if it conforms to the syntax (format, type, range) of its definition.
- **Accuracy**: The degree to which data correctly describes the ‘real world’ object or event being
described.
- **Consistency**: The absence of difference, when comparing two or more representations of a thing
against a definition.
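
As a rough illustration of how the first two definitions turn into numbers, the sketch below computes a completeness ratio per column and a uniqueness ratio for one identifying column. The table and its values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Summons Number": [1, 2, 2, 4],
    "House Number": ["10", None, None, "7"],
})

# Completeness: share of non-null values per column
completeness = toy.notna().mean()

# Uniqueness: share of distinct values in the identifying column
uniqueness = toy["Summons Number"].nunique() / len(toy)

print(completeness)
print(f"Uniqueness of 'Summons Number': {uniqueness:.2f}")
```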

**Dimension 1** (Completeness)
1. A house number exists for every row.
2. "Registration state" exists for every row as two letters (example “CA”).

**Dimension 2** (Timeliness)
1. We do not need data from any year but 2017.

**Dimension 3**  (Consistency)
1. House where violation was captured is present;
2. "Violation description" should exist.

**Dimension 4** (Uniqueness)
1. No double rows, no duplicates.

**Dimension 5** (Validity)
1. Data types are as described by business:
Numbers: Summons Number, Violation Code, Street Code1, Street Code2, Street Code3, Vehicle Expiration Date, Violation Precinct, Issuer Precinct, Issuer Code, Date First Observed, Law Section, Vehicle Year, Feet From Curb;
Date & Time: Issue Date;
Plain text: all remaining columns.
Additionally: Check for outliers (deviations);

**Dimension 6** (Accuracy)
1. The house numbers are real.

## TASK 1.

We will use Parking_Violations_Issued_-_Fiscal_Year_2017.csv because we are interested only in data collected during 2017.

In [1]:
# Import modules
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read data from file and convert columns in appropriate format according to the business description
df = pd.read_csv('/kaggle/input/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2017.csv', 
                 low_memory=False, dtype='object',
                 converters={'Summons Number': int, 'Violation Code': int, 
                        'Street Code1': int, 'Street Code2': int, 
                         'Street Code3': int, 'Vehicle Expiration Date': int, 
                         'Violation Precinct': int, 'Issuer Precinct': int, 
                         'Issuer Code': int, 'Date First Observed': int, 
                         'Law Section': int, 'Vehicle Year': int, 
                         'Feet From Curb': int, 'Issue Date': str}, 
                         parse_dates=['Issue Date'])
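
A plain `int` converter raises an error on a blank cell, so one possible alternative is pandas' nullable `"Int64"` dtype. The sketch below is a minimal illustration on a made-up two-row sample; the sample rows and the date format are assumptions, not the contents of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file
sample = io.StringIO(
    "Summons Number,Issue Date,Vehicle Year\n"
    "5092469481,07/10/2016,2001\n"
    "5092451658,07/08/2016,\n"
)

sample_df = pd.read_csv(
    sample,
    # "Int64" (capital I) is the nullable integer dtype and tolerates blanks
    dtype={"Summons Number": "int64", "Vehicle Year": "Int64"},
    parse_dates=["Issue Date"],
)

print(sample_df.dtypes)
print(sample_df)
```

With this approach a blank "Vehicle Year" becomes a missing value instead of stopping the import.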

### Exploratory Data Analysis

In [3]:
# Show first 5 rows as columns
df.head(5).T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Summons Number | 5092469481 | 5092451658 | 4006265037 | 8478629828 | 7868300310 |
| Plate ID | GZH7067 | GZH7067 | FZX9232 | 66623ME | 37033JV |
| Registration State | NY | NY | NY | NY | NY |
| Plate Type | PAS | PAS | PAS | COM | COM |
| Issue Date | 2016-07-10 00:00:00 | 2016-07-08 00:00:00 | 2016-08-23 00:00:00 | 2017-06-14 00:00:00 | 2016-11-21 00:00:00 |
| Violation Code | 7 | 7 | 5 | 47 | 69 |
| Vehicle Body Type | SUBN | SUBN | SUBN | REFG | DELV |
| Vehicle Make | TOYOT | TOYOT | FORD | MITSU | INTER |
| Issuing Agency | V | V | V | T | T |
| Street Code1 | 0 | 0 | 0 | 10610 | 10510 |


In [4]:
# Check the shape of imported data
print(f"\nNumber of rows - {df.shape[0]}")
print(f"\nNumber of columns - {df.shape[1]}")


Number of rows - 10803028

Number of columns - 43


In [5]:
# Columns data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10803028 entries, 0 to 10803027
Data columns (total 43 columns):
 #   Column                             Dtype         
---  ------                             -----         
 0   Summons Number                     int64         
 1   Plate ID                           object        
 2   Registration State                 object        
 3   Plate Type                         object        
 4   Issue Date                         datetime64[ns]
 5   Violation Code                     int64         
 6   Vehicle Body Type                  object        
 7   Vehicle Make                       object        
 8   Issuing Agency                     object        
 9   Street Code1                       int64         
 10  Street Code2                       int64         
 11  Street Code3                       int64         
 12  Vehicle Expiration Date            int64         
 13  Violation Location                 object        
 14  

In [6]:
# Count null values in every column
df.isnull().sum()

Summons Number                              0
Plate ID                                  728
Registration State                          0
Plate Type                                  0
Issue Date                                  0
Violation Code                              0
Vehicle Body Type                       42711
Vehicle Make                            73050
Issuing Agency                              0
Street Code1                                0
Street Code2                                0
Street Code3                                0
Vehicle Expiration Date                     0
Violation Location                    2072400
Violation Precinct                          0
Issuer Precinct                             0
Issuer Code                                 0
Issuer Command                        2062645
Issuer Squad                          2063541
Violation Time                             63
Time First Observed                   9962281
Violation County                  

###  Dimension 1 (Completeness)

The proportion of data stored against the potential for 100%

Tasks:
1. A house number exists for every row.
2. "Registration state" exists for every row as two letters (example “CA”).



#### 1. Check that a house number exists for every row

In [7]:
# Share of rows where 'House Number' is missing, in percent
perc_missing_house_num = df['House Number'].isnull().sum() / df.shape[0] * 100
print(f"Percentage of rows with a house number is {100 - perc_missing_house_num:.2f}%")

Percentage of rows with a house number is 78.82%


#### 2. Check that "Registration state" exists for every row as two letters (example “CA”).

In [8]:
# Look at unique values in the 'Registration State' column 
df['Registration State'].unique()

array(['NY', 'NJ', 'MA', 'VA', 'AZ', 'FL', 'AL', 'SC', 'MN', 'MD', 'PA',
       'IN', 'CA', 'WY', 'OR', 'CT', 'TX', 'ON', 'DE', '99', 'NC', 'ME',
       'IA', 'GA', 'TN', 'RI', 'IL', 'MI', 'VT', 'OH', 'NE', 'SD', 'NH',
       'UT', 'WI', 'KY', 'NM', 'QB', 'WA', 'OK', 'CO', 'MO', 'ID', 'AR',
       'KS', 'MS', 'MT', 'WV', 'LA', 'HI', 'DC', 'DP', 'AB', 'GV', 'NV',
       'NS', 'AK', 'ND', 'MB', 'BC', 'NB', 'PR', 'NT', 'FO', 'PE', 'SK',
       'MX'], dtype=object)

The column contains the invalid value '99' and several two-letter codes that are not US state abbreviations (for example, 'ON', 'BC', and 'AB' are Canadian province codes).

In [9]:
# Check percentage of rows with exactly 2 letters in the "Registration state" column
perc_rows_with_2_letters = df['Registration State'].str.match(r'^[A-Z]{2}$').sum() / df.shape[0] * 100
print(f"\nPercentage of rows with 2 letters in 'Registration State' is {perc_rows_with_2_letters:.2f}%")


Percentage of rows with 2 letters in 'Registration State' is 99.66%
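
When checking a fixed-width code, it matters whether the pattern constrains the whole value or only its beginning. The sketch below contrasts the two on made-up values; the sample strings are assumptions chosen to show the difference.

```python
import pandas as pd

# Hypothetical values: two valid codes, one numeric placeholder, one overly long string
codes = pd.Series(["NY", "CA", "99", "NYC"])

prefix_only = codes.str.match(r"^([A-Z]{2})")   # anchors only at the start
whole_value = codes.str.fullmatch(r"[A-Z]{2}")  # the entire value must be two letters

print(prefix_only.tolist())
print(whole_value.tolist())
```

Only the whole-value check rejects "NYC", because a start-anchored pattern ignores anything after the first two letters.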


### Dimension 2 (Timeliness)
The degree to which data represent reality from the required point in time

Task:
1. We do not need data from any year but 2017

We need data for 2017 only, so we keep the rows whose issue date falls within that year.

In [10]:
# Filter data by year, remembering the row count before filtering
rows_before = df.shape[0]
df = df[(df['Issue Date'] >= "2017-01-01") & (df['Issue Date'] <= "2017-12-31")]

# Print what share of rows was filtered out
print(f"{(1 - df.shape[0] / rows_before) * 100: .2f}% of rows were filtered by year.")

 49.72% of rows were filtered by year.
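
A minimal sketch of the same kind of year filter on made-up dates, keeping the row count from before the filter so the removed share can be reported. The dates are hypothetical.

```python
import pandas as pd

# Hypothetical issue dates spanning two calendar years
tickets = pd.DataFrame({
    "Issue Date": pd.to_datetime(["2016-07-10", "2016-11-21", "2017-06-14", "2017-01-01"])
})

rows_before = len(tickets)
tickets_2017 = tickets[tickets["Issue Date"].dt.year == 2017]
share_removed = (1 - len(tickets_2017) / rows_before) * 100

print(f"{share_removed:.2f}% of rows were filtered out by year.")
```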


### Dimension 3  (Consistency)

The absence of difference, when comparing two or more representations of a thing
against a definition.

Tasks:
1. House where violation was captured is present;
2. "Violation description" should exist.



#### 1. House where violation was captured is present


In [11]:
print(f"Null numbers in 'Violation Location' - {df['Violation Location'].isna().sum()}")

Null numbers in 'Violation Location' - 925596


In [12]:
df['Violation Location'].value_counts()

0019    274445
0014    203553
0001    174702
0018    169131
114     147444
         ...  
802          1
183          1
269          1
667          1
234          1
Name: Violation Location, Length: 170, dtype: int64

'Violation Location' holds numeric codes stored as text, so let's replace null values with the placeholder '0'.

In [13]:
# Replace 'Violation Location' nulls with '0'
df['Violation Location'] = df['Violation Location'].fillna('0')
print(f"Null numbers in 'Violation Location' after imputing - {df['Violation Location'].isna().sum()}")

Null numbers in 'Violation Location' after imputing - 0
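
The value counts above show both zero-padded codes ('0019') and unpadded ones ('114'). If the codes are meant to share one width, which is an assumption here and not something the data description confirms, they could be normalized as in this sketch on made-up values.

```python
import pandas as pd

# Hypothetical codes mixing padded and unpadded spellings
locations = pd.Series(["0019", "19", "114", None])

# Assumption: codes are four characters wide and "0" is the placeholder for nulls
normalized = locations.fillna("0").str.zfill(4)

print(normalized.tolist())
```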


#### 2. "Violation description" should exist

In [14]:
print(f"Null numbers in 'Violation Description' - {df['Violation Description'].isna().sum()}")

Null numbers in 'Violation Description' - 502169


'Violation Code' exists for every row in the data, so we use it to build a dict with the violation code as key and the most common description as value.

In [15]:
# Create dict with the violation code and the most common violation description for this code
df_description = df[['Violation Code', 'Violation Description']]
df_description = df_description.dropna()
val_counts_dict = df_description.groupby('Violation Code')['Violation Description'].agg(lambda x:x.value_counts().index[0]).to_dict()

In [16]:
# Fill nulls with 'Violation Code'
df['Violation Description'] = df['Violation Description'].fillna(df['Violation Code'])
# Replace filled Violation Code with most common description
df['Violation Description'] = df['Violation Description'].replace(val_counts_dict)
df[['Violation Code','Violation Description']].head(10)

| | Violation Code | Violation Description |
|---|---|---|
| 3 | 47 | 47-Double PKG-Midtown |
| 5 | 7 | FAILURE TO STOP AT RED LIGHT |
| 10 | 78 | 78-Nighttime PKG on Res Street |
| 14 | 40 | 40-Fire Hydrant |
| 17 | 64 | 64-No STD Ex Con/DPL, D/S Dec |
| 18 | 20 | 20A-No Parking (Non-COM) |
| 19 | 36 | PHTO SCHOOL ZN SPEED VIOLATION |
| 20 | 38 | 38-Failure to Display Muni Rec |
| 22 | 14 | 14-No Standing |
| 23 | 75 | 75-No Match-Plate/Reg. Sticker |


In [17]:
print(f"Null numbers in 'Violation Description' after imputing - {df['Violation Description'].isna().sum()}")

Null numbers in 'Violation Description' after imputing - 0
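
An alternative way to express the same imputation is to look up the most common description with `map` and pass the result to `fillna`. This is a sketch on made-up rows, not the code used for the results above.

```python
import pandas as pd

# Hypothetical rows: code 14 has one missing description, code 40 has none missing
violations = pd.DataFrame({
    "Violation Code": [14, 14, 14, 40],
    "Violation Description": ["14-No Standing", None, "14-No Standing", "40-Fire Hydrant"],
})

# Most common non-null description for each code
most_common = (
    violations.dropna(subset=["Violation Description"])
    .groupby("Violation Code")["Violation Description"]
    .agg(lambda s: s.value_counts().index[0])
)

# Fill only the missing descriptions, looking each one up by its code
violations["Violation Description"] = violations["Violation Description"].fillna(
    violations["Violation Code"].map(most_common)
)

print(violations)
```

With this variant, a code that never appears with a description stays null, which keeps the remaining gaps visible.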


### Dimension 4 (Uniqueness)

No entity instance (thing) will be recorded more than once based upon how that thing is identified.

Task:

1. No double rows, no duplicates

In [18]:
# Count full duplicates across dataframe
cnt_full_duplicates = df.duplicated().sum()
if cnt_full_duplicates == 0:
    print("We didn't have full duplicates across data.")
else:
    print(f"Number of duplicated records - {cnt_full_duplicates}")

We didn't have full duplicates across data.


The table contains no fully duplicated rows.

In [19]:
# Check duplicates across the columns that identify a record
df[['Plate ID', 'Registration State', 'Issue Date', 'Violation Code', 'Street Name', 'Violation Description']].duplicated().sum()

55661

In [20]:
# Take a closer look at the duplicates
df[df[['Plate ID', 'Registration State', 'Issue Date', 'Violation Code', 'Street Name', 'Violation Description']].duplicated(keep=False)].sort_values(['Plate ID']).head(6).T

| | 2578103 | 3099896 | 1256465 | 5429961 | 6282528 | 3756591 |
|---|---|---|---|---|---|---|
| Summons Number | 8533017212 | 8532663667 | 8100540688 | 8100540676 | 1419473128 | 1419450487 |
| Plate ID | 0094VA | 0094VA | 009USH | 009USH | 0112055 | 0112055 |
| Registration State | NC | NC | AK | AK | NJ | NJ |
| Plate Type | PAS | PAS | PAS | PAS | PAS | PAS |
| Issue Date | 2017-04-18 00:00:00 | 2017-04-18 00:00:00 | 2017-02-24 00:00:00 | 2017-02-24 00:00:00 | 2017-02-01 00:00:00 | 2017-02-01 00:00:00 |
| Violation Code | 14 | 14 | 14 | 14 | 46 | 46 |
| Vehicle Body Type | SUBN | SUBN | 4DSD | 4DSD | SDN | SDN |
| Vehicle Make | ACURA | ACURA | TOYOT | TOYOT | CHEVR | CHEVR |
| Issuing Agency | T | T | T | T | P | P |
| Street Code1 | 38430 | 38430 | 49690 | 49690 | 59430 | 59430 |


In [21]:
df_dropped = df.drop_duplicates(['Plate ID', 'Registration State', 'Issue Date', 'Violation Code', 'Street Name', 'Violation Description'])
print(f"{(1 - df_dropped.shape[0] / df.shape[0]) * 100: .2f}% of rows were dropped.")

 1.02% of rows were dropped.
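
The difference between inspecting and removing duplicates on a subset of columns can be seen on a small made-up table. The rows and the choice of key columns below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tickets: the first two rows differ only in the summons number
tickets = pd.DataFrame({
    "Summons Number": [101, 102, 103],
    "Plate ID": ["ABC123", "ABC123", "XYZ789"],
    "Issue Date": pd.to_datetime(["2017-04-18", "2017-04-18", "2017-04-18"]),
    "Violation Code": [14, 14, 14],
})
key_columns = ["Plate ID", "Issue Date", "Violation Code"]

# keep=False flags every member of a duplicate group, which is convenient for inspection
suspects = tickets[tickets.duplicated(subset=key_columns, keep=False)]

# drop_duplicates keeps the first row of each group by default
deduplicated = tickets.drop_duplicates(subset=key_columns)

print(suspects)
print(len(deduplicated))
```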


### Dimension 5 (Validity)

Data types are as described by business: 
1. Numbers: Summons Number, Violation Code, Street Code1, Street Code2, Street Code3, Vehicle Expiration Date, Violation Precinct, Issuer Precinct, Issuer Code, Date First Observed, Law Section, Vehicle Year, Feet From Curb; 
2. Date & Time: Issue Date; 
3. Plain text: all the rest columns. 

**We enforce the validity of the data types when calling pd.read_csv(), which sets the type of every column as described by the business.**

Additionally: Check for outliers (deviations);

In [22]:
int64_columns = []
datetime_columns = []
object_columns = []

for column in df.columns:
    if df[column].dtype == 'int64':
        int64_columns.append(column)
    elif df[column].dtype == '<M8[ns]' or df[column].dtype == 'datetime64':
        datetime_columns.append(column)
    elif df[column].dtype == 'O':
        object_columns.append(column)

# Report a mismatch if the number of columns of each type differs from the description
if len(int64_columns) == 13 and len(datetime_columns) == 1 and len(object_columns) == 29:
    print('Column types match our business description.')
else:
    print('Some mistakes made in datatypes when comparing to our description.')

Column types match our business description.
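
Counting columns per type confirms the totals but not which column has which type. A per-column comparison is one possible refinement; the sketch below uses a made-up three-column frame and a hypothetical `expected_kinds` mapping.

```python
import pandas as pd

# Hypothetical three-column frame and its expected types
frame = pd.DataFrame({
    "Summons Number": [5092469481],
    "Issue Date": pd.to_datetime(["2016-07-10"]),
    "Plate ID": ["GZH7067"],
})
expected_kinds = {"Summons Number": "i", "Issue Date": "M", "Plate ID": "O"}

# dtype.kind is a one-letter code: "i" integer, "M" datetime, "O" object
mismatches = {
    column: frame[column].dtype.kind
    for column, kind in expected_kinds.items()
    if frame[column].dtype.kind != kind
}

print(mismatches or "All column types match the expected kinds.")
```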


In [23]:
# Check statistics for numbers columns
df.describe()

| | Summons Number | Violation Code | Street Code1 | Street Code2 | Street Code3 | Vehicle Expiration Date | Violation Precinct | Issuer Precinct | Issuer Code | Date First Observed | Law Section | Vehicle Year | Feet From Curb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 | 5431918.0 |
| mean | 7116378000.0 | 35.05772 | 24103.36 | 20399.79 | 20423.98 | 25656330.0 | 45.86077 | 47.68946 | 337834.7 | 359529.2 | 535.6195 | 1574.531 | 0.1289889 |
| std | 2294136000.0 | 19.33282 | 22644.64 | 21949.19 | 21988.29 | 26502460.0 | 40.33149 | 61.58776 | 209967.2 | 2668814.0 | 282.74 | 828.2559 | 0.8626523 |
| min | 1002885000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 |
| 25% | 5096371000.0 | 20.0 | 5780.0 | 0.0 | 0.0 | 20170210.0 | 10.0 | 6.0 | 346210.0 | 0.0 | 408.0 | 1999.0 | 0.0 |
| 50% | 8483102000.0 | 36.0 | 18210.0 | 14010.0 | 14030.0 | 20171230.0 | 34.0 | 32.0 | 358645.0 | 0.0 | 408.0 | 2009.0 | 0.0 |
| 75% | 8521005000.0 | 40.0 | 35530.0 | 33970.0 | 34030.0 | 20181030.0 | 79.0 | 79.0 | 363523.0 | 0.0 | 408.0 | 2014.0 | 0.0 |
| max | 8585600000.0 | 99.0 | 98020.0 | 98310.0 | 98280.0 | 88888890.0 | 918.0 | 992.0 | 999992.0 | 20200520.0 | 6408.0 | 2069.0 | 16.0 |


#### Check outliers:

**Unusual values in the columns:**

"Feet From Curb" - the usual value is 0, but the max is 16

"Vehicle Year" - the min value is 0 and the max value is 2069

"Date First Observed" - stored as an integer that looks like a date in YYYYMMDD form (max 20200520)

"Vehicle Expiration Date" - suspicious max value 88888888

In [24]:
# Check "Feet From Curb" values
df['Feet From Curb'].value_counts()

0     5286820
5       31167
6       19145
4       17172
3       16845
2       14816
7       14314
1       11965
8       11822
9        4910
10       2510
11        243
12        117
13         42
15         23
14          6
16          1
Name: Feet From Curb, dtype: int64

We can see that everything is OK with these values.

Now let's check what is stored in the 'Vehicle Year' column.

In [25]:
# Count suspicious years; these values should be replaced with the median
print(f"Year bigger than 2017 - {df[(df['Vehicle Year'] > 2017)]['Vehicle Year'].count()} counts")
print(f"Year equal to 0 - {df[(df['Vehicle Year'] == 0)]['Vehicle Year'].count()} counts ")
print(f"Year bigger than 0 and less than 1970 {df[(df['Vehicle Year'] != 0) & (df['Vehicle Year'] < 1970)]['Vehicle Year'].count()} counts")

Year bigger than 2017 - 2563 counts
Year equal to 0 - 1177265 counts 
Year bigger than 0 and less than 1970 0 counts
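
The median replacement mentioned in the cell above could look like the following sketch. The sample years and the plausible range of 1970 to 2018 are assumptions made for the example.

```python
import pandas as pd

# Hypothetical model years, including a 0 placeholder and an implausible future year
years = pd.Series([1999, 2009, 0, 2014, 2069])

# Assumption: for 2017 tickets, plausible model years fall between 1970 and 2018
plausible = years.between(1970, 2018)
median_year = int(years[plausible].median())

# Keep plausible values, replace everything else with the median
cleaned = years.where(plausible, median_year)

print(cleaned.tolist())
```

The median is computed from the plausible values only, so the 0 placeholders do not pull it down.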


In [26]:
# Wrong format for column 'Date First Observed', should be datetime
df["Date First Observed"].value_counts()

0           5335096
20170125        960
20170118        957
20170201        947
20170111        923
             ...   
20160227          1
20140220          1
20140202          1
20170231          1
20171228          1
Name: Date First Observed, Length: 247, dtype: int64

Most values are '0', which is most likely a placeholder for missing values.
The output also includes 20170231, which is not a valid calendar date.
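
One way to turn such integers into real dates is `pd.to_datetime` with an explicit format and `errors="coerce"`. The sketch below uses three values that appear in the output above.

```python
import pandas as pd

# Values taken from the output above: a placeholder, a valid date, and an impossible date
observed = pd.Series([0, 20170125, 20170231])

# errors="coerce" turns anything that is not a real YYYYMMDD date into NaT
parsed = pd.to_datetime(observed.astype(str), format="%Y%m%d", errors="coerce")

print(parsed)
```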

In [27]:
# Wrong format for column 'Vehicle Expiration Date', should be datetime
df["Vehicle Expiration Date"].value_counts()

0           1078077
88880088     576057
20170088     168191
88888888     167587
20170930      81742
             ...   
20110121          1
20150514          1
20370630          1
20220121          1
20120112          1
Name: Vehicle Expiration Date, Length: 3041, dtype: int64

A lot of strange values:
'0', '88888888', '88880088', etc. They could all be replaced with a single placeholder value.
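
To separate real expiration dates from placeholders, each value can be tested against the calendar. The helper `parse_yyyymmdd` below is hypothetical and written only for this illustration; it uses the standard library and returns `None` for anything that is not a real date.

```python
from datetime import date, datetime
from typing import Optional


def parse_yyyymmdd(value: int) -> Optional[date]:
    """Return a date for a real YYYYMMDD integer, or None for placeholders."""
    text = str(value)
    if len(text) != 8:
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        # Raised for impossible months or days, e.g. 88888888 or 20170088
        return None


for raw in (0, 88880088, 20170088, 88888888, 20170930):
    print(raw, "->", parse_yyyymmdd(raw))
```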

### Dimension 6 (Accuracy)

The degree to which data correctly describes the ‘real world’ object or event being
described.

1. The house numbers are real.


To check whether a house number looks real, we will use a regex:
https://regex101.com/r/N1gAqn/1.

Valid number:
- a number that starts with a non-zero digit
- optionally, a separator followed by another number; the separator can be either - or / and can be surrounded by space characters
- optionally, a single letter that can be preceded by a space character; as written in the code, the pattern lists lowercase letters only and sets no case-insensitive flag, so uppercase suffixes are not matched


In [28]:
# Check percentage of rows with a correctly formatted house number in the "House Number" column
perc_valid_house_numbers = df['House Number'].str \
                                             .match(r'^[1-9]\d*(?: ?(?:[a-z]|[/-] ?\d+[a-z]?))?$') \
                                             .sum() / df.shape[0] * 100
print(f"\nPercentage of rows with correct 'House Number' is {perc_valid_house_numbers:.2f}%")


Percentage of rows with correct 'House Number' is 69.05%
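
To see what the pattern accepts, it can be tried on a few sample strings. The sketch below uses the same pattern as the check above; the sample values are made up, and the `re.IGNORECASE` variant is shown only for comparison.

```python
import re

# Same pattern as in the check above
HOUSE_NUMBER = r"^[1-9]\d*(?: ?(?:[a-z]|[/-] ?\d+[a-z]?))?$"

# Hypothetical sample values
samples = ["12", "12a", "12A", "12-34", "12 / 5b", "012", "N"]

for value in samples:
    strict = bool(re.match(HOUSE_NUMBER, value))
    relaxed = bool(re.match(HOUSE_NUMBER, value, flags=re.IGNORECASE))
    print(f"{value!r:>10}  case-sensitive={strict}  ignore-case={relaxed}")
```

Only "12A" changes between the two columns: it is rejected by the case-sensitive pattern and accepted once case is ignored.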


## TASK #2
Using the data described above (2017 only), filter it according to the business rules. As a result, you should obtain check results for every business rule, with a DataFrame containing the deviations.

Business rules that our data MUST follow:
1. 'Vehicle Make' column should not be 'TOYOT'.
2. No fully null columns.
3. No nulls or 'BLANKPLATE' in the Plate ID column.
4. "Registration state": should be only from 50 USA states, named as in ANSI standard INCITS 38:2009 (https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations).
5. "Plate type" only PAS, COM.
6. Issue date - convert to dates. Find min/max.
7. No unregistered vehicle in the "Unregistered Vehicle" column (1 means unregistered vehicle, 0 means registered vehicle).
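
One possible way to obtain a DataFrame of deviations per rule is to express each rule as a boolean mask that is True where the rule is broken. The sketch below covers three of the rules on made-up rows; the `rules` dict and its labels are hypothetical names introduced for the example.

```python
import pandas as pd

# Hypothetical rows that violate some of the rules
tickets = pd.DataFrame({
    "Plate ID": ["GZH7067", "BLANKPLATE", None, "66623ME"],
    "Vehicle Make": ["FORD", "TOYOT", "HONDA", "MITSU"],
    "Plate Type": ["PAS", "PAS", "OMT", "COM"],
})

# Each rule maps to a boolean mask that is True where the rule is broken
plate_missing = tickets["Plate ID"].isna() | (tickets["Plate ID"] == "BLANKPLATE")
rules = {
    "Vehicle Make is not 'TOYOT'": tickets["Vehicle Make"] == "TOYOT",
    "Plate ID is present and not 'BLANKPLATE'": plate_missing,
    "Plate Type is PAS or COM": ~tickets["Plate Type"].isin(["PAS", "COM"]),
}

# One DataFrame of deviating rows per rule
deviations = {name: tickets[mask] for name, mask in rules.items()}

for name, rows in deviations.items():
    print(f"{name}: {len(rows)} deviating row(s)")
```

Keeping the deviations separate from the filtered data makes it possible to report what was removed and why.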

In [29]:
# 1. 'Vehicle Make' column should not be 'TOYOT'.
df = df[df['Vehicle Make'] != 'TOYOT']

assert 'TOYOT' not in df['Vehicle Make'].values

In [30]:
# 2. No fully null columns.
# axis=1 drops columns (not rows) in which every value is null
df = df.dropna(axis=1, how='all')

assert not df.isna().all(axis=0).any()

In [31]:
# 3. No nulls or 'BLANKPLATE' in the Plate ID column.
df = df[(df['Plate ID'] != 'BLANKPLATE') & (df['Plate ID'].notnull())]

assert 'BLANKPLATE' not in df['Plate ID'].values
assert df['Plate ID'].isnull().sum() == 0

In [32]:
# 4. "Registration state": should be only from 50 USA states, named as in ANSI standard INCITS 38:2009
# Note: the list below has 51 entries, the 50 states plus "DC"
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

df = df[df['Registration State'].isin(states)]

# Every remaining value must come from the allowed list
assert set(df['Registration State'].unique()).issubset(states)

In [33]:
# 5. "Plate type" only PAS, COM.
df = df[df['Plate Type'].isin(['PAS', 'COM'])]

assert set(df['Plate Type'].unique()).issubset({'PAS', 'COM'})

In [34]:
# 6. Issue date - convert to dates. Find min/max.
df['Issue Date'] = pd.to_datetime(df['Issue Date'])
print(f"Min date in DataFrame - {df['Issue Date'].min()}")
print(f"Max date in DataFrame - {df['Issue Date'].max()}")

assert df['Issue Date'].dtype == 'datetime64[ns]'

Min date in DataFrame - 2017-01-01 00:00:00
Max date in DataFrame - 2017-12-31 00:00:00


In [35]:
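# 7. No unregistered vehicle in the "Unregistered Vehicle" column (1 means unregistered vehicle, 0 means registered vehicle).
# The column was read as text, so convert it to numbers before comparing; rows with a missing flag are kept
unregistered_flag = pd.to_numeric(df['Unregistered Vehicle?'], errors='coerce')
df = df[unregistered_flag != 1]

assert not (pd.to_numeric(df['Unregistered Vehicle?'], errors='coerce') == 1).any()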

In [36]:
print(f'DataFrame has {df.shape[0]} rows after filtering.')

DataFrame has 4455822 rows after filtering.
