# 4.3 Loading data into Dask DataFrames

<img src="./images/Pandas_Dask_DataFrames.png" width="1000"/>

## *Subjects covered*

* Creating DataFrames from delimited text files
* Defining data schemas for DataFrames

## *Content*

- [Reading data from text files](#Reading-data-from-text-files)
    - [Using Dask datatypes](#Using-Dask-datatypes)
    - [Creating schemas for Dask DataFrames](#Creating-schemas-for-Dask-DataFrames)

## Reading data from text files

**Possible problem definition**

*What patterns can we find in the data that are correlated with increases or decreases in the
number of parking tickets issued by the New York City parking authority?*

==> Need to gather, clean, and explore the relevant data with Dask DataFrames.

**Popular row and column delimiters**

Common row delimiters

* `\n`, `\r\n`


Common column delimiters

* `,`, `;`, `\t`, `|`, ` `
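
As a minimal sketch of how row and column delimiters interact, the standard-library `csv` module can parse the same records whether rows end in `\n` or `\r\n` and whether columns are separated by `,` or `|`. The sample strings and the `parse` helper are made up for illustration and are not part of the parking ticket data.

```python
import csv
import io


def parse(text, delimiter):
    """Split delimited text into rows of fields (hypothetical helper)."""
    # newline="" leaves row endings untouched so the csv module can
    # handle both \n and \r\n itself.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


comma_unix = "Plate ID,Registration State\nGBB9093,NY\n"
pipe_windows = "Plate ID|Registration State\r\nGBB9093|NY\r\n"

rows_a = parse(comma_unix, ",")
rows_b = parse(pipe_windows, "|")
print(rows_a == rows_b)  # True: same records, different delimiters
```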

**Load NYC parking ticket data**

First import all needed modules.

In [1]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

Now load all files into separate Dask DataFrames.

In [2]:
# Listing 4.1
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

fy14 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv')
fy15 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv')
fy16 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv')
fy17 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2017.csv')
fy17

Unnamed: 0_level_0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
npartitions=33,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1
,int64,object,object,object,object,int64,object,object,object,int64,int64,int64,int64,float64,int64,int64,int64,object,object,object,object,object,object,object,object,object,int64,int64,object,object,object,object,object,object,float64,int64,object,int64,object,object,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


Show all column names in data from first file, second file, etc.

In [3]:
fy14.columns

Index(['Summons Number', 'Plate ID', 'Registration State', 'Plate Type',
       'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make',
       'Issuing Agency', 'Street Code1', 'Street Code2', 'Street Code3',
       'Vehicle Expiration Date', 'Violation Location', 'Violation Precinct',
       'Issuer Precinct', 'Issuer Code', 'Issuer Command', 'Issuer Squad',
       'Violation Time', 'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation', 'Latitude', 'Longitude', 'Comm

In [4]:
fy15.columns

Index(['Summons Number', 'Plate ID', 'Registration State', 'Plate Type',
       'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make',
       'Issuing Agency', 'Street Code1', 'Street Code2', 'Street Code3',
       'Vehicle Expiration Date', 'Violation Location', 'Violation Precinct',
       'Issuer Precinct', 'Issuer Code', 'Issuer Command', 'Issuer Squad',
       'Violation Time', 'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation', 'Latitude', 'Longitude', 'Comm

In [5]:
fy16.columns

Index(['Summons Number', 'Plate ID', 'Registration State', 'Plate Type',
       'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make',
       'Issuing Agency', 'Street Code1', 'Street Code2', 'Street Code3',
       'Vehicle Expiration Date', 'Violation Location', 'Violation Precinct',
       'Issuer Precinct', 'Issuer Code', 'Issuer Command', 'Issuer Squad',
       'Violation Time', 'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation', 'Latitude', 'Longitude', 'Comm

In [6]:
# Listing 4.2
fy17.columns

Index(['Summons Number', 'Plate ID', 'Registration State', 'Plate Type',
       'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make',
       'Issuing Agency', 'Street Code1', 'Street Code2', 'Street Code3',
       'Vehicle Expiration Date', 'Violation Location', 'Violation Precinct',
       'Issuer Precinct', 'Issuer Code', 'Issuer Command', 'Issuer Squad',
       'Violation Time', 'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation'],
      dtype='object')

* Quick check of whether the number of columns is equal across the files.
* **NOTE**: Even an equal number of columns would not guarantee that the columns themselves are the same across all files.

In [7]:
print('number of columns in data for 14:', len(fy14.columns))
print('number of columns in data for 15:', len(fy15.columns))
print('number of columns in data for 16:', len(fy16.columns))
print('number of columns in data for 17:', len(fy17.columns))

number of columns in data for 14: 51
number of columns in data for 15: 51
number of columns in data for 16: 51
number of columns in data for 17: 43
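
A count only says that the files differ, not where. One way to see which columns are responsible is a set difference in each direction. The sketch below uses shortened, made-up stand-ins for `fy14.columns` and `fy17.columns`; with the real DataFrames you would pass those attributes instead.

```python
# Shortened stand-ins for fy14.columns and fy17.columns (illustrative only).
fy14_columns = ["Summons Number", "Plate ID", "Latitude", "Longitude"]
fy17_columns = ["Summons Number", "Plate ID"]

# Columns present in one file but missing from the other.
only_in_fy14 = sorted(set(fy14_columns) - set(fy17_columns))
only_in_fy17 = sorted(set(fy17_columns) - set(fy14_columns))

print(only_in_fy14)  # ['Latitude', 'Longitude']
print(only_in_fy17)  # []
```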


* Concatenating the datasets as they are would produce a DataFrame with many missing values, because `fy17` lacks columns that the other years have.
* To avoid this, find the columns that all four DataFrames have in common.

In [8]:
from functools import reduce

columns = [set(fy14.columns),
    set(fy15.columns),
    set(fy16.columns),
    set(fy17.columns)]
columns

[{'BBL',
  'BIN',
  'Census Tract',
  'Community Board',
  'Community Council ',
  'Date First Observed',
  'Days Parking In Effect    ',
  'Double Parking Violation',
  'Feet From Curb',
  'From Hours In Effect',
  'House Number',
  'Hydrant Violation',
  'Intersecting Street',
  'Issue Date',
  'Issuer Code',
  'Issuer Command',
  'Issuer Precinct',
  'Issuer Squad',
  'Issuing Agency',
  'Latitude',
  'Law Section',
  'Longitude',
  'Meter Number',
  'NTA',
  'No Standing or Stopping Violation',
  'Plate ID',
  'Plate Type',
  'Registration State',
  'Street Code1',
  'Street Code2',
  'Street Code3',
  'Street Name',
  'Sub Division',
  'Summons Number',
  'Time First Observed',
  'To Hours In Effect',
  'Unregistered Vehicle?',
  'Vehicle Body Type',
  'Vehicle Color',
  'Vehicle Expiration Date',
  'Vehicle Make',
  'Vehicle Year',
  'Violation Code',
  'Violation County',
  'Violation Description',
  'Violation In Front Of Or Opposite',
  'Violation Legal Code',
  'Violation Loc

A short refresher on `lambda`, `filter`, `map` and `reduce`

https://www.python-course.eu/python3_lambda.php

In [9]:
common_columns = list(reduce(lambda a, i: a.intersection(i), columns))
print(len(common_columns))
common_columns

43


['Unregistered Vehicle?',
 'Violation Location',
 'Sub Division',
 'Street Code3',
 'Violation In Front Of Or Opposite',
 'Vehicle Color',
 'Registration State',
 'Issue Date',
 'Feet From Curb',
 'No Standing or Stopping Violation',
 'Issuer Code',
 'Hydrant Violation',
 'Days Parking In Effect    ',
 'Issuing Agency',
 'Intersecting Street',
 'Street Code1',
 'Vehicle Body Type',
 'From Hours In Effect',
 'To Hours In Effect',
 'Issuer Command',
 'Street Name',
 'Violation Precinct',
 'Plate Type',
 'Vehicle Year',
 'Time First Observed',
 'Law Section',
 'Street Code2',
 'House Number',
 'Meter Number',
 'Violation Legal Code',
 'Issuer Precinct',
 'Summons Number',
 'Violation County',
 'Issuer Squad',
 'Violation Code',
 'Plate ID',
 'Double Parking Violation',
 'Vehicle Expiration Date',
 'Date First Observed',
 'Violation Description',
 'Vehicle Make',
 'Violation Post Code',
 'Violation Time']

In [10]:
common_columns_2 = list(set.intersection(*columns))
print(len(common_columns_2))
common_columns_2

43


['Unregistered Vehicle?',
 'Violation Location',
 'Sub Division',
 'Street Code3',
 'Violation In Front Of Or Opposite',
 'Vehicle Color',
 'Registration State',
 'Issue Date',
 'Feet From Curb',
 'No Standing or Stopping Violation',
 'Issuer Code',
 'Hydrant Violation',
 'Days Parking In Effect    ',
 'Issuing Agency',
 'Intersecting Street',
 'Street Code1',
 'Vehicle Body Type',
 'From Hours In Effect',
 'To Hours In Effect',
 'Issuer Command',
 'Street Name',
 'Violation Precinct',
 'Plate Type',
 'Vehicle Year',
 'Time First Observed',
 'Law Section',
 'Street Code2',
 'House Number',
 'Meter Number',
 'Violation Legal Code',
 'Issuer Precinct',
 'Summons Number',
 'Violation County',
 'Issuer Squad',
 'Violation Code',
 'Plate ID',
 'Double Parking Violation',
 'Vehicle Expiration Date',
 'Date First Observed',
 'Violation Description',
 'Vehicle Make',
 'Violation Post Code',
 'Violation Time']

* A quick check of whether `common_columns` and `common_columns_2` contain the same columns.
* Use this check when the order may differ between the two.

In [11]:
set(common_columns).symmetric_difference(common_columns_2)

set()

* A quick check of whether `common_columns` and `common_columns_2` are identical, including their order.
* Use this check when the order is supposed to be identical.

In [12]:
common_columns == common_columns_2

True
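
Because sets are unordered, both approaches return the common columns in an arbitrary order rather than the order used in the files. If the original order matters, one option is to filter the first file's column list by the shared set. This is a sketch with made-up column lists; `column_lists` and `ordered_common` are names chosen for illustration.

```python
# Made-up column lists standing in for the .columns of several DataFrames.
column_lists = [
    ["Summons Number", "Plate ID", "Issue Date", "Latitude"],
    ["Summons Number", "Plate ID", "Issue Date", "Longitude"],
    ["Summons Number", "Issue Date", "Plate ID"],
]

# Intersect as sets, then restore the order of the first list.
shared = set.intersection(*(set(cols) for cols in column_lists))
ordered_common = [name for name in column_lists[0] if name in shared]

print(ordered_common)  # ['Summons Number', 'Plate ID', 'Issue Date']
```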

* Display the first 10 rows of `fy17` for the columns common to all four files.
* Keep in mind that when you get rows back from Dask, they’re being loaded into your computer’s RAM.
* If you try to return too many rows of data, you will receive an out-of-memory error.

In [None]:
fy17[common_columns].head(10)

Do the same for `fy14`.

In [13]:
# Note: This is supposed to produce an error - scroll down and inspect the ValueError
fy14[common_columns].head()

  args2 = [_execute_task(a, cache) for a in args]


ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------+---------+----------+
| Column                | Found   | Expected |
+-----------------------+---------+----------+
| Issuer Squad          | object  | int64    |
| Unregistered Vehicle? | float64 | int64    |
| Violation Description | object  | float64  |
| Violation Legal Code  | object  | float64  |
| Violation Post Code   | object  | float64  |
+-----------------------+---------+----------+

The following columns also raised exceptions on conversion:

- Issuer Squad
  ValueError('cannot convert float NaN to integer')
- Violation Description
  ValueError("could not convert string to float: 'BUS LANE VIOLATION'")
- Violation Legal Code
  ValueError("could not convert string to float: 'T'")
- Violation Post Code
  ValueError("could not convert string to float: 'H -'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Issuer Squad': 'object',
       'Unregistered Vehicle?': 'float64',
       'Violation Description': 'object',
       'Violation Legal Code': 'object',
       'Violation Post Code': 'object'}

to the call to `read_csv`/`read_table`.

* Five columns (`Issuer Squad`, `Unregistered Vehicle?`, `Violation Description`, `Violation Legal Code`, and `Violation Post Code`) failed to be read correctly.
* Their datatypes were not what Dask expected.
* All values contained in a column must conform to the same datatype, since Dask DataFrames use explicit typing.
* This can happen because Dask infers datatypes from a sample of the data instead of scanning the entire file.


* This usually works well, but the process can break down when
    * a large number of values are missing in a column
    * the vast majority of values fit one datatype (such as an integer), but a small number of edge cases break that assumption (such as a stray string or two)
* When that happens, Dask throws an exception (like the one above) once it begins to work on a computation
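
The edge-case failure can be reproduced on a small scale with Pandas alone. The sketch below emulates sample-based inference by reading only the first rows of a made-up column, then applies the inferred datatype to the full data. The column name `value` and its contents are invented, and `nrows` is only a stand-in for Dask's sampling, not how Dask implements it.

```python
import io

import pandas as pd

# Five integer rows followed by one stray string (made-up data).
raw = "value\n" + "0\n" * 5 + "A\n"

# Emulate sampling: only the first rows are inspected.
sample = pd.read_csv(io.StringIO(raw), nrows=5)
inferred = sample["value"].dtype
print(inferred)  # int64

# Applying the sampled dtype to the full data fails on the stray string.
try:
    pd.read_csv(io.StringIO(raw), dtype={"value": inferred})
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print("conversion failed:", error)
```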

**Remedy** 
* Manually define a schema for our data instead of relying on type inference
* Need to know about Dask datatypes

### Using Dask datatypes

* Column datatypes play an important role in Dask DataFrames
* They control
    * what kind of operations can be performed on a column
    * how overloaded operators (+, -, and so on) behave
    * how memory is allocated to store and access the column’s values

* Use smaller datatypes where appropriate (e.g. `int8` instead of `int16`)
* ==> Can hold more data in RAM and the CPU’s cache at one time
* ==> Leading to faster, more efficient computations

* **Risk**: if a value exceeds the maximum size allowed by the particular datatype, it overflows, which either raises an error or silently produces a wrong value
* ==> Think carefully about the range and domain of your data
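
To make the trade-off concrete, here is a small NumPy sketch comparing memory use and value ranges. The year values are illustrative. Note that NumPy array arithmetic wraps around silently on overflow instead of raising, which is why the range check matters.

```python
import numpy as np

years = [2013, 2012, 0, 2010]

# Same values, different storage cost: 2 bytes vs 8 bytes per element.
as_uint16 = np.array(years, dtype=np.uint16)
as_int64 = np.array(years, dtype=np.int64)
print(as_uint16.nbytes, as_int64.nbytes)  # 8 32

# Largest value each unsigned type can hold.
print(np.iinfo(np.uint8).max, np.iinfo(np.uint16).max)  # 255 65535

# Exceeding the range in array arithmetic wraps around without an error.
small = np.array([250, 255], dtype=np.uint8)
wrapped = small + np.array([10, 1], dtype=np.uint8)
print(wrapped)  # [4 0]
```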

* Dask defaults to the `object` type when
    * a column has a mix of numbers and strings
    * type inference cannot determine an appropriate datatype to use
* Exception to this rule: a column with a high percentage of missing data

<img src="./images/Dask_datatypes.png" width="800"/>

* The error message shows that Dask expected the column named `Violation Description` to be a `float64` datatype
* One would assume that this column contains text, so Dask should expect the `object` datatype, yet it expected `float64`

Reason for this behaviour

* It turns out that a large majority of records in this DataFrame have missing violation descriptions (blanks in the raw data)
* Dask treats blank records as null values when parsing files
* By default, Dask fills in missing values with NumPy’s NaN (not a number) object, `np.nan`
* `type(np.nan)` returns `float`
* Since the values that Dask’s type inference sampled from the `Violation Description` column were `np.nan` objects, it assumed that the column must contain floating-point numbers
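
The same effect is visible in plain Pandas, which parses each Dask partition. In this sketch the rows are made up and the description field is blank in every one of them, so the parser has nothing but missing values to go on.

```python
import io

import numpy as np
import pandas as pd

# np.nan is an ordinary Python float.
print(type(np.nan) is float)  # True

# Made-up rows in which every description is blank.
raw = "Plate ID,Violation Description\nGBB9093,\n62416MB,\n78755JZ,\n"
frame = pd.read_csv(io.StringIO(raw))

# Blanks become NaN, so the text column is inferred as float64.
description_dtype = frame["Violation Description"].dtype
print(description_dtype)  # float64
print(frame["Violation Description"].isna().all())  # True
```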

<img src="./images/Dask_mismatched_datatype.png" width="500"/>

### Creating schemas for Dask DataFrames

* A dataset's *schema* is the knowledge, ahead of time, about
    * each column’s datatype
    * whether a column can contain missing values
    * the valid range of values
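
The three parts of a schema can be written down explicitly. The following is only an illustrative sketch in plain Python, not something Dask provides: `ColumnSpec`, `conforms`, and the year range of 0 to 2100 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical description of one column in a schema."""
    dtype: type
    nullable: bool
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def conforms(values, spec):
    """Return True if every value satisfies the column spec."""
    for value in values:
        if value is None:
            if not spec.nullable:
                return False
            continue
        if not isinstance(value, spec.dtype):
            return False
        if spec.minimum is not None and value < spec.minimum:
            return False
        if spec.maximum is not None and value > spec.maximum:
            return False
    return True


# Assumed rules for a year column: integers, may be missing, 0 to 2100.
vehicle_year = ColumnSpec(dtype=int, nullable=True, minimum=0, maximum=2100)
print(conforms([2013, 0, None], vehicle_year))  # True
print(conforms([2013, "N/A"], vehicle_year))    # False
```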

* Sometimes the schema is not known ahead of time
* Then you need to figure it out on your own by
    * guess-and-check
    * manually sampling the data

**Building a generic schema**

In [14]:
import numpy as np
import pandas as pd

# Map every common column to the built-in str type (np.str is not available in current NumPy)
dtype_tuples = [(x, str) for x in common_columns]
dtypes = dict(dtype_tuples)
dtypes

{'Unregistered Vehicle?': str,
 'Violation Location': str,
 'Sub Division': str,
 'Street Code3': str,
 'Violation In Front Of Or Opposite': str,
 'Vehicle Color': str,
 'Registration State': str,
 'Issue Date': str,
 'Feet From Curb': str,
 'No Standing or Stopping Violation': str,
 'Issuer Code': str,
 'Hydrant Violation': str,
 'Days Parking In Effect    ': str,
 'Issuing Agency': str,
 'Intersecting Street': str,
 'Street Code1': str,
 'Vehicle Body Type': str,
 'From Hours In Effect': str,
 'To Hours In Effect': str,
 'Issuer Command': str,
 'Street Name': str,
 'Violation Precinct': str,
 'Plate Type': str,
 'Vehicle Year': str,
 'Time First Observed': str,
 'Law Section': str,
 'Street Code2': str,
 'House Number': str,
 'Meter Number': str,
 'Violation Legal Code': str,
 'Issuer Precinct': str,
 'Summons Number': str,
 'Violation County': str,
 'Issuer Squad': str,
 'Violation Code': str,
 'Plate ID': str,
 'Double Parking Violation': str,
 'Vehicle Expiration Date': str,
 'Dat

**Create a DataFrame with a generic schema**

In [15]:
# Listing 4.7
fy14 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv', dtype=dtypes)

with ProgressBar():
    display(fy14[common_columns].head())

[########################################] | 100% Completed |  2.5s


Unnamed: 0,Unregistered Vehicle?,Violation Location,Sub Division,Street Code3,Violation In Front Of Or Opposite,Vehicle Color,Registration State,Issue Date,Feet From Curb,No Standing or Stopping Violation,...,Issuer Squad,Violation Code,Plate ID,Double Parking Violation,Vehicle Expiration Date,Date First Observed,Violation Description,Vehicle Make,Violation Post Code,Violation Time
0,0,33,F1,21190,F,GY,NY,08/04/2013,0,,...,0,46,GBB9093,,20140831,0,,AUDI,,0752A
1,0,33,C,40404,O,WH,NY,08/04/2013,0,,...,0,46,62416MB,,20140430,0,,FORD,,1240P
2,0,33,F7,13610,O,,NY,08/05/2013,0,,...,0,46,78755JZ,,20140228,0,,CHEVR,,1243P
3,0,33,F1,12010,O,WH,NY,08/05/2013,0,,...,0,46,63009MA,,20141031,0,,FORD,,0232P
4,0,33,E1,31190,F,BR,NY,08/08/2013,0,,...,0,41,91648MC,,0,0,,GMC,,1239P


What happens under the hood

* Dask disables type inference for columns
* Dask uses explicitly specified types instead

With the explicit schema in place, executing `fy14[common_columns].head()` no longer triggers an error.

Now have a look at each column and pick a more appropriate datatype (if possible) to maximize efficiency.

**Inspecting the `Vehicle Year` column**

In [16]:
with ProgressBar():
    print(fy14['Vehicle Year'].unique().head(10))

[########################################] | 100% Completed |  1min  4.9s
0    2013
1    2012
2       0
3    2010
4    2011
5    2001
6    2005
7    1998
8    1995
9    2003
Name: Vehicle Year, dtype: object


**Findings from only the first 10 unique values**
* These are just the first 10 unique values, not every unique value in the column
* All of them look like integers that would comfortably fit into the `uint16` datatype
* `uint16` would be the most appropriate because years can’t be negative
* `uint8` would be too small, since it only holds numbers up to 255
**Any strings in it?**

* If we had seen any letters or special characters, we would not need to proceed any further with analyzing this column
* The string datatype we had already selected would be the only datatype suitable for the column

**How many unique values should we check to make a better datatype guess?**

* 10 unique values might not be a large enough sample to rule out edge cases that need to be considered
* Could use `.compute()` instead of `.head()`, but this could return a very large series and take a long time to compute
* Aim for roughly 10 to 50 unique samples to make a safer educated guess about the column datatype

**But what about potential missing values in the same column?**

* If the conclusion is that `uint16` is the appropriate datatype, check for missing values next
* `np.nan` cannot be coerced to an integer type such as `uint16`, since `np.nan` is a `float`
* If the column contains missing data, it needs to be defined as `float32`, not `uint16`
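
A short Pandas sketch of this constraint, using made-up year values: casting a series that contains `np.nan` to `uint16` raises an error, while casting it to `float32` keeps the missing value.

```python
import numpy as np
import pandas as pd

# Made-up years with one missing value.
years = pd.Series([2013.0, 1998.0, np.nan])

# The integer cast is rejected because NaN has no integer representation.
try:
    years.astype("uint16")
    cast_ok = True
except ValueError as error:
    cast_ok = False
    print("uint16 cast failed:", error)

# float32 can hold both the years and the missing value.
as_float = years.astype("float32")
print(as_float.isna().sum())  # 1
```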

In [17]:
with ProgressBar():
    print(fy14['Vehicle Year'].isnull().values.any().compute())

[########################################] | 100% Completed |  1min  5.7s
True


* At least one row contains missing data, hence the column datatype needs to be set to `float32`
* Repeat this process for the other 42 columns
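
Going through 42 more columns by hand is tedious, so part of the reasoning above could be automated. The helper below is a hypothetical sketch, not part of Dask or Pandas. It assumes the values were loaded as strings and have already been brought into memory as a Pandas series, for example a sample of unique values. It is a starting point for a guess, not a replacement for inspecting the data.

```python
import numpy as np
import pandas as pd


def suggest_dtype(values: pd.Series) -> str:
    """Suggest a dtype name for a column loaded as strings (hypothetical helper)."""
    numeric = pd.to_numeric(values, errors="coerce")
    # A non-missing value that failed to parse means the column holds text.
    if (numeric.isna() & values.notna()).any():
        return "str"
    # Missing values cannot be stored in NumPy integer types.
    if numeric.isna().any():
        return "float32"
    # Fractional numbers need a float type.
    if (numeric % 1 != 0).any():
        return "float32"
    # Negative integers need a signed type.
    if numeric.min() < 0:
        return "int64"
    # Otherwise pick the smallest unsigned integer type that fits.
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if numeric.max() <= np.iinfo(candidate).max:
            return np.dtype(candidate).name
    return "float64"


print(suggest_dtype(pd.Series(["2013", "0", "1998"])))        # uint16
print(suggest_dtype(pd.Series(["2013", None])))               # float32
print(suggest_dtype(pd.Series(["46", "BUS LANE VIOLATION"]))) # str
```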


* The dictionary `dtypes` below contains the correct datatypes for each column, i.e. the final schema for each of the four files

In [None]:
# Final schema shared by all four files
dtypes = {
 'Date First Observed': str,
 'Days Parking In Effect    ': str,
 'Double Parking Violation': str,
 'Feet From Curb': np.float32,
 'From Hours In Effect': str,
 'House Number': str,
 'Hydrant Violation': str,
 'Intersecting Street': str,
 'Issue Date': str,
 'Issuer Code': np.float32,
 'Issuer Command': str,
 'Issuer Precinct': np.float32,
 'Issuer Squad': str,
 'Issuing Agency': str,
 'Law Section': np.float32,
 'Meter Number': str,
 'No Standing or Stopping Violation': str,
 'Plate ID': str,
 'Plate Type': str,
 'Registration State': str,
 'Street Code1': np.uint32,
 'Street Code2': np.uint32,
 'Street Code3': np.uint32,
 'Street Name': str,
 'Sub Division': str,
 'Summons Number': np.uint32,
 'Time First Observed': str,
 'To Hours In Effect': str,
 'Unregistered Vehicle?': str,
 'Vehicle Body Type': str,
 'Vehicle Color': str,
 'Vehicle Expiration Date': str,
 'Vehicle Make': str,
 'Vehicle Year': np.float32,
 'Violation Code': np.uint16,
 'Violation County': str,
 'Violation Description': str,
 'Violation In Front Of Or Opposite': str,
 'Violation Legal Code': str,
 'Violation Location': str,
 'Violation Post Code': str,
 'Violation Precinct': np.float32,
 'Violation Time': str
}

**Final schema for the NYC parking ticket data**

* Use final schema to reload all four of the DataFrames
* Then union all four years of data together into a final DataFrame

In [18]:
data = dd.read_csv('nyc-parking-tickets/*.csv', dtype=dtypes, usecols=common_columns)

* The `usecols` argument comes from Pandas (you won't find it in the Dask DataFrame documentation)
* Dask passes any Pandas `read_csv` arguments along through keyword arguments (`**kwargs`)
* These arguments control the underlying Pandas DataFrames that make up each partition
* This interface is also how you control things like which column delimiter to use, whether the data has a header, and so on
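
Since these arguments end up in `pandas.read_csv`, their effect can be tried out on a small in-memory sample with Pandas before running a full Dask load. In this sketch the pipe-delimited rows are made up. The same keyword names (`sep`, `header`, `usecols`, `dtype`) are the ones you would give to `dd.read_csv`, on the assumption that your Dask version forwards them unchanged.

```python
import io

import pandas as pd

# Made-up, pipe-delimited sample with a header row.
raw = (
    "Plate ID|Vehicle Year|Vehicle Make\n"
    "GBB9093|2013|AUDI\n"
    "62416MB|2012|FORD\n"
)

frame = pd.read_csv(
    io.StringIO(raw),
    sep="|",                                  # column delimiter
    header=0,                                 # first row holds the column names
    usecols=["Plate ID", "Vehicle Year"],     # load only these columns
    dtype={"Plate ID": str, "Vehicle Year": "float32"},
)

print(list(frame.columns))          # ['Plate ID', 'Vehicle Year']
print(frame["Vehicle Year"].dtype)  # float32
```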