# Dataset

The data consists of parking violations issued in New York City.

The data is split across multiple datasets, one per fiscal year. The datasets are available on [NYC OpenData](https://data.cityofnewyork.us/browse?Data-Collection_Data-Collection=DOF+Parking+Violations+Issued&q=&sortBy=alpha&utf8=%E2%9C%93).

We chose this data primarily because it allows us to answer interesting questions about parking violations in New York City. Additionally, the data is well-structured, well-documented and easy to download as CSV files.

## Structure of the dataset

The data is contained in one table with 43 columns. The most interesting columns for us are:

- **Plate ID**: Registered plate ID of the vehicle that received the parking violation
- **Issue Date**: Date of the parking violation
- **Violation Time**: Time of the parking violation
- **Violation Code**: Code for the type of the parking violation
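
As a rough illustration of working with the date and time columns, the sketch below combines `Issue Date` and `Violation Time` into a single timestamp. It assumes the date is formatted as `MM/DD/YYYY` and the time as four digits followed by `A` or `P` (for example `0752A`). Neither format is stated above, so both should be checked against the downloaded files. The function name `parse_violation_timestamp` is hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_violation_timestamp(issue_date: str, violation_time: str) -> Optional[datetime]:
    """Combine both columns into one datetime, or return None if malformed.

    Assumed formats (not confirmed by the dataset description above):
      issue_date     -> "MM/DD/YYYY"
      violation_time -> "HHMM" followed by "A" (AM) or "P" (PM)
    """
    value = violation_time.strip().upper()
    if len(value) != 5 or not value[:4].isdecimal() or value[4] not in "AP":
        return None
    hour, minute = int(value[:2]), int(value[2:4])
    if not 1 <= hour <= 12 or minute > 59:
        return None
    # Convert the 12-hour clock to a 24-hour clock (12 AM becomes hour 0).
    hour = hour % 12 + (12 if value[4] == "P" else 0)
    try:
        day = datetime.strptime(issue_date.strip(), "%m/%d/%Y")
    except ValueError:
        return None
    return day.replace(hour=hour, minute=minute)
```

Returning `None` for values that do not match the assumed formats keeps malformed rows visible, so they can be counted or filtered instead of silently distorting later results.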

The `Violation Code` serves as a foreign key, which we later use to join in additional information about each type of violation from the violation code mapping.
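
The semantics of that join can be sketched in plain Python as a left join on `Violation Code`. This is only a minimal illustration: the sample rows, the `Violation Description` field, and the helper name `join_violation_codes` are made up for the example and are not taken from the real files.

```python
from typing import Dict, Iterable, Iterator


def join_violation_codes(
    violations: Iterable[Dict[str, str]],
    code_mapping: Dict[str, Dict[str, str]],
) -> Iterator[Dict[str, str]]:
    """Left join: keep every violation, add mapping fields when the code is known."""
    for row in violations:
        extra = code_mapping.get(row["Violation Code"], {})
        yield {**row, **extra}


# Made-up sample rows for illustration only; not real data.
violations = [
    {"Plate ID": "ABC1234", "Violation Code": "21"},
    {"Plate ID": "XYZ9876", "Violation Code": "99"},
]
# Hypothetical mapping: code -> extra fields from the mapping file.
code_mapping = {"21": {"Violation Description": "placeholder description"}}

joined = list(join_violation_codes(violations, code_mapping))
```

A left join keeps violations whose code is missing from the mapping, which makes unmapped codes easy to spot. For the full dataset, the processing framework would perform this join instead of a Python loop.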

## Big Data

The raw data from 2013 to the present amounts to about 25 GB and 130 million rows, which makes this a Big Data problem. We need dedicated tools to process this volume of data efficiently.
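
A quick back-of-the-envelope calculation puts these figures in perspective. It uses only the numbers quoted in this notebook and assumes decimal gigabytes (10^9 bytes), which is an assumption about how the size was measured.

```python
# Figures quoted in the text; decimal gigabytes (10**9 bytes) are an assumption.
total_bytes = 25 * 10**9
total_rows = 130 * 10**6
column_count = 43

bytes_per_row = total_bytes / total_rows
bytes_per_field = bytes_per_row / column_count

print(f"average row size:   {bytes_per_row:.0f} bytes")
print(f"average field size: {bytes_per_field:.1f} bytes")
```

Each row is small (roughly 190 bytes), so the difficulty comes from the number of rows, not from the size of individual records.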

## Download the data

### Set up a temporary directory for files

In [None]:
!mkdir -p /data/hdfscluster/rawdata/
!cp ./download.sh /data/hdfscluster/rawdata/

### Download Violation Code Mapping

Download the violation code mapping from the official City of New York data site as an XLSX file:

In [None]:
!wget -O Codes-Mapping.xlsx "https://data.cityofnewyork.us/api/views/pvqr-7yc4/files/7875fa68-3a29-4825-9dfb-63ef30576f9e?download=true&filename=ParkingViolationCodes_January2020.xlsx"

### Upload Violation Code Mapping to HDFS

In [None]:
!hdfs dfs -mkdir /parkingviolations-codes

In [None]:
!hdfs dfs -put ./Codes-Mapping.xlsx /parkingviolations-codes

### Download raw parking violation data

In [None]:
import os
original_directory = os.getcwd()  # restored in the final cleanup step
os.chdir("/data/hdfscluster/rawdata/")

In [None]:
# Use bash script to download all the CSV files.
!chmod +x download.sh
!./download.sh
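
The contents of `download.sh` are not shown here. As a rough sketch of what such a download step could look like in Python, the code below fetches one CSV export per dataset. The dataset identifiers are hypothetical placeholders, the export URL pattern is an assumption about the portal that should be verified before use, and the helper names `export_url` and `download_all` are invented for this example.

```python
import os
from typing import Callable, Dict, List
from urllib.request import urlretrieve

# Hypothetical placeholders: replace with the real dataset identifiers
# listed on the NYC OpenData page. None of these IDs are real.
DATASET_IDS: Dict[str, str] = {
    "fiscal_year_A": "xxxx-xxxx",
    "fiscal_year_B": "yyyy-yyyy",
}


def export_url(dataset_id: str) -> str:
    # Assumed CSV export URL pattern of the portal; verify before use.
    return f"https://data.cityofnewyork.us/api/views/{dataset_id}/rows.csv?accessType=DOWNLOAD"


def download_all(
    datasets: Dict[str, str],
    target_dir: str,
    fetch: Callable[[str, str], object] = urlretrieve,
) -> List[str]:
    """Download one CSV per dataset into target_dir and return the local paths."""
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for label, dataset_id in datasets.items():
        path = os.path.join(target_dir, f"{label}.csv")
        if not os.path.exists(path):  # skip files that are already present
            fetch(export_url(dataset_id), path)
        paths.append(path)
    return paths
```

Skipping files that already exist makes the step safe to re-run after an interrupted download, which matters when the files add up to many gigabytes. Note that this check cannot detect a partially written file.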

### Upload raw parking violation data to HDFS

In [None]:
!hdfs dfs -mkdir -p /parkingviolations/rawdata/

In [None]:
!hdfs dfs -put /data/hdfscluster/rawdata/*.csv /parkingviolations/rawdata/
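
Before deleting the local copies, it can be worth confirming that the upload arrived. The sketch below is one possible check, not part of the original workflow: it asks HDFS for the total size of the target directory with `hdfs dfs -du -s -h`. The helper names `hdfs_command` and `hdfs_summary` are hypothetical, and the function returns `None` when the `hdfs` command is not on the `PATH`.

```python
import shutil
import subprocess
from typing import List, Optional


def hdfs_command(*args: str) -> List[str]:
    """Build the argument list for an `hdfs dfs` call."""
    return ["hdfs", "dfs", *args]


def hdfs_summary(path: str) -> Optional[str]:
    """Return the output of `hdfs dfs -du -s -h <path>`, or None if hdfs is unavailable."""
    if shutil.which("hdfs") is None:
        return None
    result = subprocess.run(
        hdfs_command("-du", "-s", "-h", path),
        capture_output=True,
        text=True,
        check=True,  # raise if HDFS reports an error, e.g. a missing path
    )
    return result.stdout.strip()


print(hdfs_summary("/parkingviolations/rawdata/"))
```

The reported size should be close to the roughly 25 GB of raw CSV data mentioned earlier in this notebook.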

### Final cleanup

In [None]:
!rm -rf /data/hdfscluster/rawdata/

In [None]:
os.chdir(original_directory)