# Parking Violations in NYC, a Detailed Analysis and a Prediction Model  



Eren Tumkaya – tumkaya19@itu.edu.tr - 090190328


  Every large city has its own problems, and although most people prefer using public transportation, NYC has always been considered one of a kind in terms of its traffic. That comes at a cost, both literally and figuratively. A city with millions of cars is bound to see some parking violations, and for every violation a ticket is issued by the authorities.  
  
  On November 14, 2023, a dataset containing all the parking violations issued between July 1, 2022 and June 30, 2023 was published on the NYC Open Data platform (1). It is provided by the Department of Finance (DOF). As the dates indicate, it covers NYC fiscal year 2023. 


### About the Dataset
 This relatively large dataset consists of 43 features (columns) and more than 20 million rows, each representing an individual violation. It is in a structured, rectangular (tabular) format. Each violation is accompanied by a “Summons Number”, which can be entered into the “NYC Serv” site (2) to look up the ticket. Since not all of the features will be useful to me, I will give brief information only on the ones I need for the analysis I intend. 
 
 The “Issue Date” and “Violation Time” features hold the day on which the violation was issued and its hour and minute, respectively. I verified this by looking up specific tickets on the NYC Serv site using their summons numbers, where I was able to see the tickets in their original format. The “Violation Code” column gives the type of the violation; a list describing each code is available under the dataset information on the NYC Open Data site. The “Vehicle Color”, “Vehicle Body Type”, “Vehicle Year” and “Vehicle Make” columns describe the specific car, as their names suggest, and I included them as well. The “Unregistered Vehicle” column is a binary column indicating whether the car was registered or not. In the US, it is quite important to register a car after buying it. Each state has its own conditions, and New York, the state in which NYC is located, requires a renewal every 2 years. A detailed description can be found on the New York State Department of Motor Vehicles site (3). “Violation County” is another column; it refers to the 5 counties of NYC: New York County (Manhattan), Kings County (Brooklyn), Bronx County (The Bronx), Richmond County (Staten Island), and Queens County (Queens). The last column I will be using is “Registration State”, which, as its label suggests, indicates the state in which the ticketed car is registered. 
 
 There is one more dataset that I want to merge: I will try to find hourly historical weather data for NYC. The columns I intend to add are Temperature (°C), Rain (mm) and Snow. For my purposes it is crucial to find a dataset with hourly resolution that covers the widest possible time range. I will explain why I need this specific data in the later parts of my proposal. 
 
 


### Obtaining the Data
 NYC Open Data provides API access to most of its datasets via the SODA API. After reading its documentation (4), I found ways to restrict the request to the columns I wanted. I looked up the “API Column Name” of each feature I wanted to obtain and also capped the number of rows with the “$limit” parameter. The query language is SQL-like, and it was relatively easy to use. I will request the data in JSON format, although there is an option to get it in CSV as well. 
An example of the API call can be seen below.

*** “summons_number” was included only for my own purposes, such as looking up the corresponding tickets. The “unregistered_vehicle” column might not be used in the analysis because of missing data.


In [1]:
import requests
import pandas as pd

In [4]:
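# Request 600 rows, restricted to the selected columns, from the SODA endpoint
url = "https://data.cityofnewyork.us/resource/869v-vr48.json?$select=summons_number,issue_date,violation_time,violation_code,vehicle_color,vehicle_body_type,vehicle_make,vehicle_year,unregistered_vehicle,violation_county,registration_state&$limit=600"
response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
data = response.json()  # list of dicts, one per violation
df = pd.DataFrame(data)
df.head()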

| | summons_number | issue_date | violation_time | violation_code | vehicle_color | vehicle_body_type | vehicle_make | vehicle_year | unregistered_vehicle | violation_county | registration_state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1484697303 | 2022-06-10T00:00:00.000 | 1037A | 67 | BLK | SDN | TOYOT | 2004 | 0 | NY | NY |
| 1 | 1484697315 | 2022-06-13T00:00:00.000 | 1045A | 51 | GRAY | SUBN | JEEP | 2017 | 0 | NY | NY |
| 2 | 1484697625 | 2022-06-19T00:00:00.000 | 1116A | 63 | GRAY | SDN | JEEP | 0 | 0 | NY | NJ |
| 3 | 1484697674 | 2022-06-19T00:00:00.000 | 1052A | 63 | | SUBN | LEXUS | 0 | 0 | NY | NY |
| 4 | 1484697686 | 2022-06-19T00:00:00.000 | 1107A | 63 | BLUE | SDN | HYUND | 0 | 0 | NY | NJ |
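
Since the full dataset has more than 20 million rows, a single request with a fixed limit will not be enough. The following is a minimal sketch of how request URLs for successive pages could be built, assuming the standard SODA parameters `$select`, `$where`, `$limit` and `$offset`. The helper name `build_soda_url` and the shortened column list are hypothetical and only serve as an illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above.
BASE_URL = "https://data.cityofnewyork.us/resource/869v-vr48.json"


def build_soda_url(columns, limit, offset=0, where=None):
    """Build a SODA request URL (hypothetical helper, not part of any library)."""
    params = {
        "$select": ",".join(columns),
        "$limit": limit,
        "$offset": offset,
    }
    if where is not None:
        params["$where"] = where
    # Keep "$" and "," readable; everything else is percent-encoded.
    return BASE_URL + "?" + urlencode(params, safe="$,")


columns = ["summons_number", "issue_date", "violation_time"]
first_page = build_soda_url(columns, limit=600)
second_page = build_soda_url(columns, limit=600, offset=600)
print(first_page)
print(second_page)
```

Each URL can then be passed to `requests.get` exactly as in the cell above. For stable paging it is probably also worth adding an `$order` clause, which this sketch leaves out.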


### Initial Operations
 
 As in most data analysis tasks, I will start with the necessary cleaning: removing duplicates, commenting on null values, and deleting or filling them depending on the nature of each feature. I may also defer handling the null values of a feature until that feature is actually needed in my analysis. I will then filter the data once again according to the analysis ideas I share in the next parts, and I will group the data in advance in the way each task requires. 
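
The cleaning steps described above could look roughly like this on a tiny made-up sample. The sample values are hypothetical, and the reading of a vehicle year of 0 as "not recorded" is my assumption based on the output shown earlier.

```python
import pandas as pd

# Tiny hypothetical sample using the API column names from the request above.
df = pd.DataFrame(
    {
        "summons_number": ["1", "1", "2", "3"],
        "vehicle_color": ["BLK", "BLK", None, "GRAY"],
        "vehicle_year": ["2004", "2004", "0", "2017"],
    }
)

# 1) Remove repeated tickets; a summons number should identify one violation.
df = df.drop_duplicates(subset="summons_number").reset_index(drop=True)

# 2) Share of missing values per column, to decide between dropping and filling.
null_share = df.isna().mean()

# 3) Fill a categorical column with an explicit placeholder instead of dropping rows.
df["vehicle_color"] = df["vehicle_color"].fillna("UNKNOWN")

# 4) Assumption: a vehicle year of 0 means "not recorded", so treat it as missing.
year = pd.to_numeric(df["vehicle_year"])
df["vehicle_year"] = year.where(year != 0)

print(null_share)
print(df)
```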
 
 Before I implement my prediction model, I will perform some further operations, such as outlier cleaning and a different treatment of null values.  


### Feature Engineering
 An initial look at the data indicated that I will be doing quite a lot of feature engineering during this analysis. Some fields, such as the time and date columns, need to be split, and new columns may have to be created from them. Apart from that, the “Vehicle Make” column will require me to correct some of the brand names so that the records can be grouped properly.  
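
As an illustration of the splitting mentioned above, here is a minimal sketch for the time column. It assumes that values follow the layout seen in the sample output (for example `1037A`): two hour digits, two minute digits and an `A` or `P` suffix. The function name `parse_violation_time` is hypothetical, and malformed values simply return `None`.

```python
import re

# Assumed layout: two hour digits, two minute digits, then "A" or "P".
TIME_PATTERN = re.compile(r"^(\d{2})(\d{2})([AP])$")


def parse_violation_time(raw):
    """Return (hour, minute) on a 24-hour clock, or None if the value is malformed."""
    if not isinstance(raw, str):
        return None
    match = TIME_PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 12 or minute > 59:
        return None
    hour %= 12  # 12 AM becomes 0, 12 PM becomes 0 before the shift below
    if match.group(3) == "P":
        hour += 12
    return hour, minute


print(parse_violation_time("1037A"))
print(parse_violation_time("0130P"))
```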
 
 I will also try to add new features by merging the dataset with an hourly historical weather dataset for NYC. I will try to gather hourly information on “Temperature”, “Rain”, “Snow” and also “Visibility”. I might convert these numerical values into categorical ones in order to achieve a clearer result. Hopefully, this process will give me some new columns.  
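
A sketch of how such a merge might work, assuming the weather table has one row per hour. The weather dataset has not been chosen yet, so its column names (`hour`, `temperature_c`, `rain_mm`) and all values are hypothetical.

```python
import pandas as pd

# Hypothetical violation timestamps (date and time already combined).
violations = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2022-07-01 10:37", "2022-07-01 10:52", "2022-07-01 13:05"]
        )
    }
)

# Hypothetical hourly weather table; column names are placeholders.
weather = pd.DataFrame(
    {
        "hour": pd.to_datetime(["2022-07-01 10:00", "2022-07-01 11:00"]),
        "temperature_c": [24.5, 25.1],
        "rain_mm": [0.0, 1.2],
    }
)

# Align each violation with the weather observation of the hour it falls in.
violations["hour"] = violations["timestamp"].dt.floor("h")

# A left merge keeps every violation; hours without weather data become NaN.
merged = violations.merge(weather, on="hour", how="left", validate="many_to_one")
print(merged)
```

The `validate="many_to_one"` argument makes pandas raise an error if the weather table ever contains two rows for the same hour, which would otherwise silently duplicate violations.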
 
 I will also examine which data type suits each feature best and cast types whenever necessary. Moreover, capping, binning and other operations will be performed whenever they can benefit the analysis and the visualization. Scaling will not be overlooked where it is required; I might apply it both for the analysis and, at a later stage of the project, for the prediction model. 


### Questions to Ask 

My initial questions to ask of the dataset are these: 

1)	Which part of the city is in need of more parking spaces? 

2)	Do weather conditions have an effect on parking violations? If so, to what extent? 

As I will also discuss in the exploratory data analysis part, I will try to answer the first question by visualizing the counts of parking violations on a map of the city. One thing to be careful about here is that not all parking violations result from a need for more parking spaces. Separating these will be possible by carefully inspecting the “violation code” feature and including only the types that are relevant to this question. 
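
A rough sketch of the filtering and counting step. The codes in `SPACE_RELATED_CODES` are placeholders only, since the real selection still has to be made from the code descriptions, and the county values in the sample are made up.

```python
import pandas as pd

# Placeholder codes only: the real list must come from the code descriptions.
SPACE_RELATED_CODES = {"14", "21"}

# Hypothetical sample rows.
df = pd.DataFrame(
    {
        "violation_code": ["14", "21", "67", "14", "21"],
        "violation_county": ["NY", "K", "NY", "NY", "Q"],
    }
)

# Keep only the violation types assumed to be related to a lack of parking space.
space_related = df[df["violation_code"].isin(SPACE_RELATED_CODES)]

# Number of such violations per county, largest first.
counts = space_related.groupby("violation_county").size().sort_values(ascending=False)
print(counts)
```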

After merging with the hourly weather data, I will statistically analyze whether the rain, snow and temperature features are related to parking violations. I might also bin these column values and illustrate my findings. The various plots mentioned in the next section will be of help.


### Exploratory Data Analysis 
 
 I believe that this is an important subject which requires a detailed analysis from many perspectives. I will share my initial analysis ideas here, but the analysis will not be limited to them: I will go further whenever I find something interesting along the way. 
 
 I will start with the question of whether there is a relation between the hour at which a violation took place and the violation counts. I will plot the number of violations against the time of day and try to draw conclusions. 
Another analysis idea I am curious about is the number of violations in each county. In my opinion it will make a good visual representation of the data. 

My data includes fields such as vehicle color, make, body type and year. I will group by each of them, check their relation with the number of violations, and illustrate my findings as well. 

The difference in the probability of parking violations between vehicles registered in NY and those registered in other states is a question I am really curious about. For many people, NYC is a tourist destination and thus not their local environment. In my opinion, it is always easier to make mistakes somewhere you are not used to. Moreover, NYC is a city where small mistakes can happen quite often. 

I will try different methods to perform this analysis. Grouping the data for each task will be the first part. Scatter plots, box plots, line graphs and histograms will be common tools to visualize the data. To back some statements statistically, I will do regression analysis for the numerical values, and for the categorical data I am planning to use chi-square contingency tests.
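
To make the chi-square contingency test concrete, here is a small hand-written sketch that computes the Pearson statistic and the degrees of freedom for a table of counts. The counts are made up and the function name is hypothetical. In practice a library routine such as `scipy.stats.chi2_contingency` would probably be used, since it also returns the p-value; note that for 2x2 tables it applies a continuity correction by default, so its statistic can differ from the uncorrected one below.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed_count - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = registered in NY / elsewhere,
# columns = two groups of violation types.
observed = [[30, 10], [20, 40]]
statistic, dof = chi_square_statistic(observed)
print(statistic, dof)
```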


### Prediction Model for Hourly Parking Violation Counts

After the detailed analysis, I will train a machine learning model to predict hourly parking violation counts. My plan is as follows: 
I will start by checking whether my data is clean enough to train on. Following that evaluation, I will do some outlier analysis. Since the first model I am planning to implement is a regression-based algorithm, I know how badly it can be affected by outliers, and I will find ways to minimize their effect. 

As I am trying to predict hourly data, my next step will be to group all the data by hour. I will round the “violation time” column to the nearest hour. I will then have as many data points as there are hours in a year. 
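
A minimal sketch of this grouping step on made-up timestamps. It assumes the date and time columns have already been combined into proper timestamps. Rounding is done by shifting 30 minutes and truncating, so that exact half hours always round up.

```python
import pandas as pd

# Hypothetical violation timestamps.
timestamps = pd.Series(
    pd.to_datetime(
        [
            "2022-07-01 10:37",
            "2022-07-01 10:52",
            "2022-07-01 11:10",
            "2022-07-01 13:05",
        ]
    )
)

# Round to the nearest hour: shift by 30 minutes, then truncate to the hour.
nearest_hour = (timestamps + pd.Timedelta(minutes=30)).dt.floor("h")

# Number of violations per rounded hour.
counts = nearest_hour.value_counts().sort_index()

# Reindex over every hour in the range so that hours without tickets count as 0.
all_hours = pd.date_range(counts.index.min(), counts.index.max(), freq="h")
hourly_counts = counts.reindex(all_hours, fill_value=0)
print(hourly_counts)
```

Without the reindexing step, hours in which no ticket was issued would be missing from the table instead of appearing with a count of zero.
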
At this point, I will hopefully have already merged the hourly historical weather data with my table and will have three additional columns, namely temperature, snow and rain. Because these stages come after the exploratory analysis, I will already know their correlation with the violation counts, and I will include them in the model if I am content with my findings. I may also bin the snow and rain data into “None”, “None to Little”, “Moderate” and “Heavy”. One-hot encoding will then let me feature them in my model. 
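
The binning and encoding could be sketched as follows. The thresholds are hypothetical values in mm per hour and would have to be tuned to the actual weather data; the labels follow the four levels named above.

```python
import pandas as pd

# Hypothetical hourly rain amounts in mm.
rain_mm = pd.Series([0.0, 1.0, 5.0, 12.0], name="rain_mm")

# Hypothetical thresholds; each bin includes its right edge.
bins = [float("-inf"), 0.0, 2.5, 7.5, float("inf")]
labels = ["None", "None to Little", "Moderate", "Heavy"]

# Turn the numeric amounts into ordered categories.
rain_level = pd.cut(rain_mm, bins=bins, labels=labels)

# One column per category, ready to be used as model features.
rain_dummies = pd.get_dummies(rain_level, prefix="rain")
print(rain_dummies)
```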

Another feature I want to add is the day of the week, which I will derive from the “issue date” column. Again, my initial analysis will tell me a lot about this valuable information. Before fitting, I will one-hot encode this categorical data as well. I will also consider adding the month as a new feature; it won’t be hard to encode either. 
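
A short sketch of deriving and encoding these calendar features, using made-up dates in the same string format as the sample output above.

```python
import pandas as pd

# Hypothetical dates in the format returned by the API.
df = pd.DataFrame(
    {
        "issue_date": [
            "2022-07-01T00:00:00.000",
            "2022-07-02T00:00:00.000",
            "2022-12-25T00:00:00.000",
        ]
    }
)

issue_date = pd.to_datetime(df["issue_date"])
df["day_of_week"] = issue_date.dt.day_name()
df["month"] = issue_date.dt.month

# One-hot encode both features; month is treated as categorical, not numeric.
features = pd.get_dummies(df[["day_of_week", "month"]], columns=["day_of_week", "month"])
print(features.columns.tolist())
```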

I will split the data into train and test sets and also do k-fold cross-validation. 
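
One possible way to set this up is sketched below with scikit-learn, which is my assumption since no library has been fixed yet. The data is a synthetic stand-in for the hourly table, and the 80/20 split and 5 folds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the hourly table: 200 rows, 3 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% as a final test set that is not used during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training part only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=cv, scoring="r2")

# Final fit on the full training part, evaluated once on the held-out test set.
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(cv_r2.mean(), test_r2)
```

Because the real data is ordered in time, shuffled folds may let information from later hours leak into the training folds. A time-aware splitter such as scikit-learn's `TimeSeriesSplit` might be worth comparing against.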

Eventually, I will have a set where most of my independent variables are categorical and my target variable is numerical. I will start with less complex models such as linear regression, and if they cannot provide a good fit (in terms of my evaluation metrics), I will move on to more complex ones. I won’t need scaling for a model like linear regression: it might give valuable information on the coefficients, but it won’t change the predicted values. If I change my algorithm, however, I will scale when necessary. 

As some features are not likely to have a linear relationship with my target, I will try different models to capture this relation. My first approach is linear regression, but I acknowledge that my data needs some adjustment before I can use it. For instance, the “hour” feature is probably not going to have a linear relation with the target. I will therefore try turning it into a categorical feature, with one category per hour, and one-hot encoding it. That is one way I can think of to respect the linearity assumption for the independent variables. 

I will use typical evaluation metrics if I use this (regression) model. “Mean Squared Error (MSE)” and “R-squared” will give me some insights. 
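
For reference, both metrics can be computed directly from their definitions. The sketch below uses made-up true and predicted counts; the function names are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of the residual sum of squares to the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical hourly counts and predictions.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mse, r2)
```

MSE is expressed in squared units of the target, while R-squared is unitless, so the two complement each other when comparing models.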

If I still fail to obtain a good fit, I will go on to try neural network algorithms, which can be quite good at handling complex relations. 

*** It is important to mention that my methods may change depending on the nature of the data and my new findings. 


### Hardware and Software 

I will be using my own device (Dell Latitude 7490) for most of the project. Its processor is an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz (2.11 GHz), and it has 16 GB of RAM. It has a 64-bit operating system on an x64-based processor and is currently running Windows 11 Pro. If required, I will also try to use the resources of the National Center for High Performance Computing at ITU. 

Jupyter notebooks will be used and all the code will be written in Python.  


### References
1) [NYC Open Data Site, My Dataset](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2023/869v-vr48)
2) [NYC Serv Site, Checking for Tickets by the Summons Number](https://nycserv.nyc.gov/NYCServWeb/NYCSERVMain)
3) [Info Regarding Registration](https://dmv.ny.gov/registration/how-register-vehicle)
4) [API Documents, Socrata](https://dev.socrata.com/foundry/data.cityofnewyork.us/869v-vr48)
