# Data Science with Python and Dask

**Author:** David-Alexandre Guenette <br />
**Date:** 2021-01-20 <br />

## Problem Definition

**Problem:** What patterns can we find in the data that are correlated with increases or decreases in the number of parking tickets issued by the New York City parking authority?

**Hypotheses:**

 - We might find that older vehicles are more likely to receive tickets.
 - We might find that a particular color attracts more attention from the parking authority than other colors.

## Data Gathering

#### Importing CSV files using Dask defaults

In [10]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

fy14 = dd.read_csv('./data/nyu_parking_data/parking_violations_2014.csv')
fy15 = dd.read_csv('./data/nyu_parking_data/parking_violations_2015.csv')
fy16 = dd.read_csv('./data/nyu_parking_data/parking_violations_2016.csv')
fy17 = dd.read_csv('./data/nyu_parking_data/parking_violations_2017.csv')

#### Finding the common columns between the four DataFrames

In [55]:
from functools import reduce


def set_columns(df):
    """
    Purpose : Return the column names of a DataFrame as a set.
    """

    scolumns = set(df.columns)
    return scolumns


def find_common_columns(ls):
    """
    Purpose : Take a list of column-name sets and return a list of the names common to all of them.
    """

    # Intersect the sets pairwise; the order of the resulting list is arbitrary.
    cc = list(reduce(lambda a, i: a.intersection(i), ls))
    return cc



columns = [
    set_columns(fy14),
    set_columns(fy15),
    set_columns(fy16),
    set_columns(fy17)
]

common_columns = find_common_columns(columns)
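
To make the intersection step easier to follow, here is a minimal sketch on toy data. The column sets below are hypothetical stand-ins for the four yearly DataFrames, not the real headers. The sketch also shows that `set.intersection` can take all the sets at once, which should give the same result as the `reduce` fold used above.

```python
from functools import reduce

# Hypothetical column sets standing in for the four yearly DataFrames.
toy_columns = [
    {'Plate ID', 'Issue Date', 'Vehicle Color', 'Latitude'},
    {'Plate ID', 'Issue Date', 'Vehicle Color', 'Longitude'},
    {'Plate ID', 'Issue Date', 'Vehicle Color'},
    {'Plate ID', 'Issue Date', 'Vehicle Color', 'Vehicle Year'},
]

# Fold the sets pairwise, as find_common_columns does.
common_via_reduce = reduce(lambda a, b: a.intersection(b), toy_columns)

# set.intersection accepts any number of sets, so the fold fits in one call.
common_via_builtin = set.intersection(*toy_columns)

# Sets are unordered; sorting gives a reproducible column list.
toy_common = sorted(common_via_reduce)
print(toy_common)
```

Only the names present in every set survive, so a column that appears in a single year is dropped.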


#### Building a generic schema

In [63]:
import numpy as np
import pandas as pd

# First we need to build a dictionary that maps column names to datatypes.
# Text columns use the builtin str; numeric columns use fixed-width NumPy types.
dtypes = {
 'Date First Observed': str,
 'Days Parking In Effect    ': str,
 'Double Parking Violation': str,
 'Feet From Curb': np.float32,
 'From Hours In Effect': str,
 'House Number': str,
 'Hydrant Violation': str,
 'Intersecting Street': str,
 'Issue Date': str,
 'Issuer Code': np.float32,
 'Issuer Command': str,
 'Issuer Precinct': np.float32,
 'Issuer Squad': str,
 'Issuing Agency': str,
 'Law Section': np.float32,
 'Meter Number': str,
 'No Standing or Stopping Violation': str,
 'Plate ID': str,
 'Plate Type': str,
 'Registration State': str,
 'Street Code1': np.uint32,
 'Street Code2': np.uint32,
 'Street Code3': np.uint32,
 'Street Name': str,
 'Sub Division': str,
 'Summons Number': np.uint32,
 'Time First Observed': str,
 'To Hours In Effect': str,
 'Unregistered Vehicle?': str,
 'Vehicle Body Type': str,
 'Vehicle Color': str,
 'Vehicle Expiration Date': str,
 'Vehicle Make': str,
 'Vehicle Year': np.float32,
 'Violation Code': np.uint16,
 'Violation County': str,
 'Violation Description': str,
 'Violation In Front Of Or Opposite': str,
 'Violation Legal Code': str,
 'Violation Location': str,
 'Violation Post Code': str,
 'Violation Precinct': np.float32,
 'Violation Time': str
}
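
A schema like this is easy to get out of sync with the list of common columns. Any column left out of the mapping falls back to type inference when the files are read. The sketch below is one way to compare the two. `compare_schema` is a hypothetical helper, and the shortened `toy_common_columns` and `toy_dtypes` are illustrative stand-ins for `common_columns` and `dtypes`.

```python
# Hypothetical, shortened stand-ins for `common_columns` and `dtypes`.
toy_common_columns = ['Plate ID', 'Vehicle Year', 'Vehicle Color', 'Issue Date']
toy_dtypes = {
    'Plate ID': str,
    'Vehicle Year': 'float32',
    'Vehicle Color': str,
    'Violation Code': 'uint16',
}


def compare_schema(columns, schema):
    """Return (columns without a dtype, schema keys that are not loaded)."""
    missing = sorted(set(columns) - set(schema))
    unused = sorted(set(schema) - set(columns))
    return missing, unused


missing, unused = compare_schema(toy_common_columns, toy_dtypes)
print(missing)  # columns whose dtype would still be inferred
print(unused)   # schema entries for columns that are not being read
```

Running the same comparison on the real `common_columns` and `dtypes` before loading the data would flag any mismatch, including names that differ only by whitespace.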

#### Applying the schema to all four DataFrames

In [62]:
data = dd.read_csv('./data/nyu_parking_data/*.csv', dtype=dtypes, usecols=common_columns)

## Data Cleaning