# Chicago Parking Ticket Data Analysis

We will use a notebook to perform some basic analysis of the Chicago Parking Ticket data sample.

This is a 1 million row randomized sample from [Daniel Hutmacher's Chicago Parking Tickets database](https://sqlsunday.com/2022/12/05/new-demo-database/).  I have added a new label column, `PaymentIsOutstanding`, which represents whether the ticket recipient still owes the City of Chicago any money.

Our goal in the subsequent analysis is to gain a better feel for this dataset and some of the things we will need to do when we want to run an experiment to train a model.

## Retrieve Data

In [None]:
from azureml.core import Workspace, Datastore, Dataset
import numpy as np
import pandas as pd

ws=Workspace.from_config()

In [None]:
dataset=Dataset.get_by_name(ws, name='ChicagoParkingTickets')
df=dataset.to_pandas_dataframe()

## Perform Exploratory Analysis

`len()` tells us how many rows there are in a dataframe.

In [None]:
len(df)

Review the top few rows in the dataframe.

In [None]:
df.head()

We have a mix of datetime, string, and numeric data, as well as one label:  `PaymentIsOutstanding`.

What is the set of unique values for one of these columns?  I'll choose `Police_District` as an example.

In [None]:
df['Police_District'].unique()

It looks like police district is really an integer, even though it's stored as a float.  That's a side effect of the `NaN` values:  by default, pandas stores a numeric column containing missing values as `float64`.  The `NaN` indicates some police districts were either missing or had data which didn't fit the datatype.  We'll want to get back to this.
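
As a minimal sketch of how we might inspect this later, the snippet below counts the missing districts and converts the column to pandas' nullable `Int64` type.  It runs against a tiny made-up dataframe rather than the real data, and whether we ultimately fill, drop, or keep the missing values is still an open decision.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real dataframe; these values are made up.
sample = pd.DataFrame({"Police_District": [1.0, 12.0, np.nan, 25.0]})

# Count the rows with no police district.
missing_count = sample["Police_District"].isna().sum()

# The nullable Int64 dtype stores whole numbers and still allows missing values.
district_int = sample["Police_District"].astype("Int64")

print(missing_count)
print(district_int.dtype)
```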

Before we go too much further, what is the cardinality (distinct count) of every column?

In [None]:
df.apply(pd.Series.nunique)

We can see that issued date is almost entirely unique.  For analysis, we might want to split this either into smaller components (e.g., year, month, day columns) or split out the date and time.  We may also wish to bin the times in some fashion, such as into 6-hour blocks.
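
Here is a rough sketch of that split, using a few made-up timestamps.  The `Day` and `Time_Block` column names are my own placeholders, and the 6-hour block is just integer division of the hour by 6, which is one of several reasonable ways to bin.

```python
import pandas as pd

# Made-up timestamps standing in for the real Issued_date column.
sample = pd.DataFrame({
    "Issued_date": pd.to_datetime([
        "2015-03-01 02:15:00",
        "2016-07-04 09:30:00",
        "2017-11-23 14:45:00",
        "2018-01-15 23:59:00",
    ])
})

issued = sample["Issued_date"]
sample["Year"] = issued.dt.year
sample["Month"] = issued.dt.month
sample["Day"] = issued.dt.day

# Hypothetical Time_Block feature:
# 0 = 00:00-05:59, 1 = 06:00-11:59, 2 = 12:00-17:59, 3 = 18:00-23:59
sample["Time_Block"] = issued.dt.hour // 6

print(sample)
```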

We can pick out a little bit more from this as well:

* Every community has a few measures: hardship index, per-capita income, percent unemployed, percent without diploma, percent households below poverty.  For the most part, each community has its own distinct value for these columns, so their cardinality is close to the number of communities.
* Given that there are 22 police districts, 13 sectors, 98 neighborhoods, 77 communities, and 59 ZIP codes, there's going to be some overlap between these.
* There are 64 license plate states, which may seem weird when you consider there are 50 states in the US.  This dataset also includes out-of-country visitors, especially from Canada.  It might make sense to reshape this data:  in-state, out-of-state, out-of-country.

In [None]:
pd.options.display.max_rows=64
df['License_Plate_State'].value_counts()
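
The in-state, out-of-state, out-of-country reshaping could look something like the sketch below.  I'm assuming the column holds two-letter postal codes and that Illinois appears as `IL`.  The `Plate_Origin` name and the `plate_origin` helper are hypothetical.  Note that this simple version treats every code outside the 50 states plus DC as out-of-country, which will mislabel any malformed codes, so the real values deserve a look first.

```python
import pandas as pd

# The 50 US states plus DC, as two-letter postal codes.
US_CODES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)

def plate_origin(code):
    """Hypothetical helper: bucket a plate code into a coarse origin."""
    if pd.isna(code):
        return "Unknown"
    if code == "IL":
        return "In-State"
    if code in US_CODES:
        return "Out-of-State"
    # Assumption: anything else is a non-US plate (e.g., a Canadian province).
    return "Out-of-Country"

# Made-up values standing in for the real column.
sample = pd.DataFrame({"License_Plate_State": ["IL", "IN", "ON", None, "WI"]})
sample["Plate_Origin"] = sample["License_Plate_State"].map(plate_origin)

print(sample["Plate_Origin"].value_counts())
```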

`Plate_Type` is intended to indicate what kind of vehicle this is.

In [None]:
pd.options.display.max_rows=53
df['Plate_Type'].value_counts()

We can see that after the first three plate types, there are very few examples of any other.  This indicates that we probably want to collapse the column into four categories:  Passenger (PAS), Truck (TRK), Temporary Tags (TMP), and Other.
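
A minimal sketch of that grouping, on made-up values:  keep the three common codes and send everything else to `Other`.  The `Vehicle_Type` column name is a placeholder, and `MOT` is only an example of a rare code.  Missing plate types also land in `Other` here, which may or may not be what we want.

```python
import pandas as pd

# Made-up values standing in for the real Plate_Type column.
sample = pd.DataFrame(
    {"Plate_Type": ["PAS", "PAS", "TRK", "TMP", "MOT", "PAS", None]}
)

common_types = ["PAS", "TRK", "TMP"]

# Keep the value where it is one of the common types; otherwise use "Other".
# Missing values fail the isin() check, so they become "Other" as well.
sample["Vehicle_Type"] = sample["Plate_Type"].where(
    sample["Plate_Type"].isin(common_types), "Other"
)

print(sample["Vehicle_Type"].value_counts())
```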

Let's take a look at some of the rows which are marked as having outstanding payments.

In [None]:
df[df['PaymentIsOutstanding'] == 1].head()

And here are a few rows where payment is complete.  This could be because the person made payment or because a judge threw out the ticket.

In [None]:
df[df['PaymentIsOutstanding'] == 0].head()

## Correlation Analysis

Something we'll want to do is perform a basic correlation analysis.  We'll focus on the numeric features and see if any of these correlate to whether payment is outstanding.

In [None]:
pd.options.display.max_rows=20
df.corr(numeric_only=True)

As an initial look, there aren't many numeric features which correlate well linearly with `PaymentIsOutstanding`.

We can also see some tight correlation between hardship index, per-capita income (in the negative direction), percent unemployed, percent without diploma, and percent households below poverty.  To avoid the risk of multicollinearity, we probably want to drop either hardship index or the other columns mentioned.
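
Rather than eyeballing the matrix, we could list the strongly correlated pairs directly.  The sketch below does this on a small made-up dataframe.  The column names are hypothetical stand-ins (I'm not assuming these are the exact names in the dataset), and the 0.9 cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up numbers with hypothetical column names, for illustration only.
sample = pd.DataFrame({
    "Hardship_Index": [10, 35, 60, 80, 95],
    "Per_Capita_Income": [60000, 40000, 25000, 15000, 10000],
    "Percent_Unemployed": [4.0, 8.0, 15.0, 20.0, 26.0],
    "Fine_Amount": [50, 25, 100, 25, 60],
})

corr = sample.corr(numeric_only=True)
threshold = 0.9  # arbitrary cutoff for "tight" correlation

# Walk the upper triangle so each pair is reported once.
strong_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
print(strong_pairs)

# One option:  drop the composite measure and keep its components.
reduced = sample.drop(columns=["Hardship_Index"])
```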

One last thing we'll do is include year and month features from the issuance date, and see if those are correlated at all with payment.

In [None]:
df['Year'], df['Month']=df['Issued_date'].dt.year, df['Issued_date'].dt.month
df.corr(numeric_only=True)

It looks like the year has a small effect (older tickets are more likely to be paid up), but month of year doesn't have any real benefit to us.

## Final Cleanup

Before we wrap things up, we'll want to see what other types of data cleanup we might want to do.  For example, which columns have `NaN`?

In [None]:
pd.options.display.max_rows=30
df.isna().any()

It looks like the following columns do:

* Tract
* Police District
* Plate Type
* License Plate State
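
Knowing which columns have `NaN` is a start, but the counts tell us how big the problem is.  A small sketch on made-up data, assuming column names like the ones above:

```python
import numpy as np
import pandas as pd

# Made-up data; only some columns contain missing values.
sample = pd.DataFrame({
    "Tract": [1.0, np.nan, 3.0],
    "Police_District": [np.nan, np.nan, 8.0],
    "Plate_Type": ["PAS", None, "TRK"],
    "Year": [2015, 2016, 2017],
})

# Count missing values per column, keep only the affected columns,
# and list the worst offenders first.
na_counts = sample.isna().sum()
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)

print(na_counts)
```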

## Plan of Action

* Find and replace missing values for police district
* Create features for year, time block during the day, license plate origin (based on license plate state), vehicle type (based on plate type)
* Remove Hardship index and Census tract
* Encode categorical features so we can perform a classification analysis
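
For the encoding step, one common option is one-hot encoding with `pd.get_dummies`.  This is only a sketch on made-up rows:  `Vehicle_Type` and `Plate_Origin` are hypothetical names for the derived features in the plan, and a different encoder may turn out to be a better fit once we pick a model.

```python
import pandas as pd

# Made-up rows with hypothetical derived feature names.
sample = pd.DataFrame({
    "Vehicle_Type": ["PAS", "TRK", "Other", "PAS"],
    "Plate_Origin": ["In-State", "Out-of-State", "In-State", "Out-of-Country"],
    "PaymentIsOutstanding": [0, 1, 0, 1],
})

# One-hot encode the categorical columns; the label column is left untouched.
encoded = pd.get_dummies(
    sample, columns=["Vehicle_Type", "Plate_Origin"], dtype=int
)

print(encoded.columns.tolist())
```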