# Chicago crimes with tensorflow

In this project I will use the chicago crime dataset to skill up in Tensorflow. The idea is either to predict if a crime will end up in an arrest.

Let's start by importing the relevant packages

In [1]:
import os
import pandas
import tensorflow.keras as keras

## Importing data
First, let's import a subset of the data in a .csv file via pandas and make some exploratory analysis. This way I can  get a first understanding of the kind of pre-processing I need for the data.

In [2]:
file_name = 'Crimes_-_2020.csv'
file_source_path = os.path.join(os.getcwd(), file_name)
chicago_crime_dataset = pandas.read_csv(file_source_path)

First of all, let's inspect a few entries of the dataset and the data types

In [3]:
chicago_crime_dataset.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,12226684,JD437696,11/21/2020 10:24:00 AM,048XX W IRVING PARK RD,1330,CRIMINAL TRESPASS,TO LAND,SMALL RETAIL STORE,False,False,...,45.0,15,26,,,2020,11/28/2020 03:46:29 PM,,,
1,12227040,JD437515,11/21/2020 05:25:00 AM,060XX S KOMENSKY AVE,033A,ROBBERY,ATTEMPT ARMED - HANDGUN,RESIDENCE - GARAGE,False,False,...,23.0,65,3,,,2020,11/28/2020 03:46:29 PM,,,
2,12227432,JD438595,11/21/2020 09:30:00 PM,001XX E 117TH PL,1562,SEX OFFENSE,AGGRAVATED CRIMINAL SEXUAL ABUSE,RESIDENCE,False,False,...,9.0,53,17,,,2020,11/28/2020 03:46:29 PM,,,
3,12228918,JD439369,11/21/2020 05:00:00 PM,007XX E 81ST ST,0930,MOTOR VEHICLE THEFT,THEFT / RECOVERY - AUTOMOBILE,STREET,False,False,...,6.0,44,7,,,2020,11/28/2020 03:46:29 PM,,,
4,12231193,JD443060,10/16/2020 12:00:00 PM,060XX W MIAMI AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,39.0,10,11,,,2020,11/28/2020 03:46:29 PM,,,


In [4]:
chicago_crime_dataset.dtypes

ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                  int64
Ward                    float64
Community Area            int64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
dtype: object

This shows immediately some anomalies that do not make sense according to the type of data I expect. In particular:
1. Case number and Block are misnamed. Case number is an alphanumeric identifier, while Block is the address.
2. Date and Updated on are date objects and not strings. Moreover, the format is non-standard.
3. ID, Beat, District, Wards and Community areas are identification codes, not numbers.
4. Year is redundant. It's also not particularly interesting because this dataset covers 2020 only.
5. Location is redundant, at least for the specific purpoeses of this project. Furthermore, I checked that entries with missing values of XY coordinates do not have entries of latitude-longitude coordinates. The analysis is not shown here, but it can be performed with chicago_crime_dataset.loc[chicago_crime_dataset['Longitude'].isna(), ['X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude']].head(10).sum()

For the moment I will correct point 2 to 5 only because I will need to stick with the original column namings when I will play around with tensorflow later on.




In [5]:
# recast dates in correct iso formats
date_format = '%m/%d/%Y %I:%M:%S %p'
chicago_crime_dataset['Date'] = pandas.to_datetime(chicago_crime_dataset['Date'], format = date_format).head()
chicago_crime_dataset['Updated On'] = pandas.to_datetime(chicago_crime_dataset['Updated On'], format = date_format).head()

# recast faux integers as string identifiers
int_to_str_conversions = {'ID' : 'str',
                          'Beat' : 'str',
                          'District' : 'str',
                          'Ward' : 'str',
                          'Community Area' : 'str'}
chicago_crime_dataset = chicago_crime_dataset.astype(int_to_str_conversions)

# drop redundant columns
chicago_crime_dataset.pop('Year')
chicago_crime_dataset.pop('Location')
chicago_crime_dataset.pop('Longitude')
chicago_crime_dataset.pop('Latitude')


# show new data structure
chicago_crime_dataset.dtypes

ID                              object
Case Number                     object
Date                    datetime64[ns]
Block                           object
IUCR                            object
Primary Type                    object
Description                     object
Location Description            object
Arrest                            bool
Domestic                          bool
Beat                            object
District                        object
Ward                            object
Community Area                  object
FBI Code                        object
X Coordinate                   float64
Y Coordinate                   float64
Updated On              datetime64[ns]
Latitude                       float64
Longitude                      float64
dtype: object

I might need more adjustmens later on, but let's be happy with it for now.  Let's describe the dataframe content now

In [6]:
# continuous variables
chicago_crime_dataset.describe()

Unnamed: 0,X Coordinate,Y Coordinate,Latitude,Longitude
count,164094.0,164094.0,164094.0,164094.0
mean,1164935.0,1885071.0,41.840223,-87.670294
std,16219.82,31750.52,0.087321,0.059017
min,1092647.0,1813897.0,41.64459,-87.934567
25%,1152945.0,1857909.0,41.765422,-87.713787
50%,1166492.0,1890436.0,41.854951,-87.664536
75%,1176660.0,1908192.0,41.90377,-87.627485
max,1205112.0,1951527.0,42.022586,-87.524618


In [7]:
# categorical variables
chicago_crime_dataset.astype('object').describe()


Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Updated On,Latitude,Longitude
count,187645,187645,5,187645,187645,187645,187645,186649,187645,187645,187645,187645,187645.0,187645,187645,164094.0,164094.0,5,164094.0,164094.0
unique,187645,187618,5,26314,308,33,414,163,2,2,274,23,51.0,77,26,44415.0,58848.0,1,91658.0,91653.0
top,12003879,JD286197,2020-11-21 17:00:00,001XX N STATE ST,486,BATTERY,SIMPLE,STREET,False,False,1834,11,28.0,25,6,1176352.0,1900927.0,2020-11-28 15:46:29,41.8835,-87.627877
freq,1,4,1,450,18643,37846,21111,45393,157760,152262,1710,13337,9609.0,11338,37077,267.0,203.0,5,201.0,201.0


The continuous variables about position seems to have the same number of entries. control that there is an actual correspondence

In [22]:
chicago_crime_dataset.loc[chicago_crime_dataset['Longitude'].isna(), ['X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude']].head(10).sum()

X Coordinate    0.0
Y Coordinate    0.0
Latitude        0.0
Longitude       0.0
dtype: float64