Big City Crime Analysis

by: Cameron Sudderth, Ayana Rodgers, Moni Buddha

https://github.com/aya-rodgers97/Crime-Project-Tools-1


2024 is a Presidential Election year and crime is always a major topic with which Americans vote, both on a Federal and Local level​

Every major US city publishes data related to its crimes, so we wanted to look at the major metro areas in the US to see what similarities exist

API Connections​:

NYC Crime Data​
City of LA Crime Data​
City of Chicago Crime Data​

Categorize the Crime Descriptions into commonly  themed categories​

Identify the most significant crimes in each city​

Identify the most affected gender in each city.​

To find patterns in each season, day of the week and hour of the day in each city.

Literature

Los Angeles Neighborhood Analysis by Chaitany Krishna Kasaraneni
Towards Data Science


In [1]:
import requests
import pandas as pd

NYC API Connection

In [2]:
import pandas as pd
from sodapy import Socrata

# Number of Records limit
nyc_limit = 500
la_limit = 500
chi_limit = 500

# Connect to NYC API
NYCAppToken = '3u5hcZ6WwKere5Mb5nm5S9mT2'
nyc_client = Socrata("data.cityofnewyork.us",
                 NYCAppToken,
                 username="Cameron.Suddreth@du.edu",
                 password="COMP4447groupproject")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
nyc_results = nyc_client.get("5uac-w243", limit=nyc_limit)

# Convert to pandas DataFrame
nyc_df = pd.DataFrame.from_records(nyc_results)


LA API Connection

In [3]:

# Example authenticated client (needed for non-public datasets):
LA_AppToken = 'mEU8HkgWCvfkWLHKGxfiUFecc'
la_client = Socrata("data.lacity.org",
                 LA_AppToken,
                 username="Cameron.Suddreth@du.edu",
                 password="COMP4447groupproject")


la_results = la_client.get("2nrs-mtv8", limit=la_limit)

# Convert to pandas DataFrame
la_df = pd.DataFrame.from_records(la_results)

Chicago API Connection

In [4]:
CHI_AppToken = '6rxQVr5BfXAbUUccKTodxYVdj'
chi_client = Socrata("data.cityofchicago.org",
                    CHI_AppToken,
                    username="Cameron.Suddreth@du.edu",
                    password="COMP4447groupproject")
chi_results = chi_client.get("9hwr-2zxp", limit=chi_limit)
chi_df = pd.DataFrame.from_records(chi_results)

Above, we have established three different API connections. We have established an API connection with each of the three largest cities in the United States: New York, NY; Los Angeles, CA; and Chicago, IL. We are also able to change the number of records that we are pulling in for analysis as each of the databases combined would result in close to one million records!

With each connection, we have created a separate dataframe which allowed us to easily pull in all the records from each city's database. As we now have each of the cities with their own dataframe, we will begin to merge the dataframes together to allow us to look at the data amongst the cities together.

The below cell compares the number of columns within each dataframe. This was the initial step for us to begin merging the dataset together.

In [5]:
nyc_columns = nyc_df.columns
la_columns = la_df.columns
chi_columns = chi_df.columns

print(f'Number of NYC Columns: {len(nyc_columns)}')
print(f'Number of LA Columns: {len(la_columns)}')
print(f'Number of CHI Columns: {len(chi_columns)}')

Number of NYC Columns: 40
Number of LA Columns: 26
Number of CHI Columns: 22


We reviewed the column names within the API documentation and determined what information we wanted from each city and then reduced the size of each dataframe to contain only those columns which we wanted to analyze. We then renamed each column so to allow for easier analysis and merging.

Before we create the new column headers, we must also resolve any NAN values else we will receive an error. We have imported the Numpy module to fill in the Weapon user for each NYC and Chicago; the time of the crime in Chicago; and the victim sex for Chicago as this data was unavailable in each respective city's database.

We have also created a new column in each city's original dataframe to help identify which city each record belongs to after we have merged the data.

In [6]:
import numpy as np

nyc_df['Weapon'] = np.nan
chi_df['Weapon'] = np.nan
chi_df['Time'] = np.nan
chi_df['Victim Sex'] = np.nan
nyc_df['City'] = 'NYC'
la_df['City'] = "LA"
chi_df['City'] = 'CHI'

la_df = la_df[['dr_no', 'date_rptd', 'date_occ', 'crm_cd', 'crm_cd_desc', 
'weapon_desc', 'vict_sex', 'lat', 'lon', 'City']]
nyc_df = nyc_df[['cmplnt_num', 'cmplnt_fr_dt', 'cmplnt_fr_tm','ky_cd', 'ofns_desc', 
'Weapon', 'vic_sex', 'latitude','longitude', 'City']]
chi_df = chi_df[['id', 'date','Time', 'iucr','description','Weapon', 
'Victim Sex','latitude','longitude', 'City']]
generic_columns = ['Case Number', 'Date', 'Time', 'Crime Code', 'Crime Description','Weapon', 
'Victim Sex', 'Latitude', 'Longitude', 'City']

# Rename the columns
la_df.columns = generic_columns
nyc_df.columns = generic_columns
chi_df.columns = generic_columns

By renaming the columns of the dataframe above, it simplified the merging process, so we did not have to specify what column in each dataframe to merge based on.

In [7]:
combined_df = pd.concat([la_df, nyc_df, chi_df],ignore_index=True)
print(len(combined_df))


1500


In [8]:
combined_df['Crime Description'].sample(25)

628                                         HARRASSMENT 2
486                             OTHER MISCELLANEOUS CRIME
1155                                            OVER $500
1220                                               SIMPLE
1495                              DOMESTIC BATTERY SIMPLE
215                                     THEFT OF IDENTITY
359                                     THEFT OF IDENTITY
792                                        FELONY ASSAULT
592                                         HARRASSMENT 2
1216                                           TO VEHICLE
293                                       ORAL COPULATION
739                                         PETIT LARCENY
743                          ASSAULT 3 & RELATED OFFENSES
1111                                            OVER $500
278                                     THEFT OF IDENTITY
220                             OTHER MISCELLANEOUS CRIME
196                                     THEFT OF IDENTITY
1212          

In [9]:
combined_df['str_length'] = combined_df['Crime Description'].str.len()
number_of_crimes = combined_df['Crime Description'].value_counts()
list_of_crimes = combined_df['Crime Description'].unique().tolist()
list_of_crimes_lower = [crime.lower() for crime in list_of_crimes]

print(len(list_of_crimes))

152


In [10]:
# !pip install transformers
# !pip3 install torch torchvision

Simplify the Descriptions

In [11]:
# !pip install spacy

In [12]:
combined_df[['Crime Description', 'City']].sample(20)

Unnamed: 0,Crime Description,City
230,THEFT PLAIN - PETTY ($950 & UNDER),LA
431,BURGLARY,LA
1400,ANIMAL ABUSE / NEGLECT,CHI
484,DOCUMENT FORGERY / STOLEN FELONY,LA
552,VEHICLE AND TRAFFIC LAWS,NYC
634,HARRASSMENT 2,NYC
367,THEFT OF IDENTITY,LA
478,THEFT OF IDENTITY,LA
456,THEFT OF IDENTITY,LA
429,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",LA


In [13]:
import sys
import re
try:
    # Try to import the SpaCy model to check if it's installed
    import en_core_web_sm
except ImportError:
    # If the model is not installed, download it using the subprocess module
    print("Downloading the 'en_core_web_sm' model...")
    !python3.7 -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    # Remove digits and punctuation using a regular expression
    text = re.sub(r'[\d]', '', text)  # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation except underscores and spaces
    return text.lower()

def lemmatize(text):
    # Apply spaCy nlp pipeline to pre-processed text
    doc = nlp(text)
    # Return lemmatized text, only including tokens that are alphabetic
    return " ".join([token.lemma_ for token in doc if token.is_alpha])

# Assuming 'combined_df' is your DataFrame and 'Lemmatized Text' is the column to process
# Step 1: Pre-process the text to remove digits and punctuation
combined_df['Preprocessed Text'] = combined_df['Crime Description'].apply(preprocess_text)

# Step 2: Apply lemmatization
combined_df['Lemmatized Text'] = combined_df['Preprocessed Text'].apply(lemmatize)

# Display the processed DataFrame
print(combined_df[['Lemmatized Text']].head())

               Lemmatized Text
0                vehicle steal
1        burglary from vehicle
2                   bike steal
3  shopliftinggrand theft over
4            theft of identity


In [19]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import pandas as pd  # Assuming you have a DataFrame 'combined_df'

def run_lda(n_topics, max_df, min_df, combined_df, top_topicwords):

    custom_stop_words = ["ord"]
    stop_words = ENGLISH_STOP_WORDS.union(custom_stop_words)
    cv = CountVectorizer(max_df=max_df, min_df=min_df, stop_words=stop_words)
    dtm = cv.fit_transform(combined_df['Lemmatized Text'])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(dtm)

    top_words_per_topic = []
    for index, topic in enumerate(lda.components_):
        top_words = [cv.get_feature_names()[i] for i in topic.argsort()[-top_topicwords:]]
        top_words_per_topic.append(top_words)
        print(f'Top {top_topicwords} words for topic {index}: {top_words}')
    
    # Transform the DTM to get the topic results
    topic_results = lda.transform(dtm)

    # Return the perplexity and top words for each topic
    return lda.perplexity(dtm), top_words_per_topic, topic_results

# Initialize variables to keep track of the best model
lowest_perplexity = float('inf')
best_model_config = None
best_top_words = None
best_topic_results = None

# Parameters to iterate over
n_topics_options = [5, 7, 10]
max_df_options = [0.7, 0.75, 0.8, 0.85]
min_df_options = [2, 5]

for n_topics in n_topics_options:
    for max_df in max_df_options:
        for min_df in min_df_options:
            perplexity, top_words, topic_results = run_lda(n_topics, max_df, min_df, combined_df, best_top_words)
            if perplexity < lowest_perplexity:
                lowest_perplexity = perplexity
                best_model_config = (n_topics, max_df, min_df)
                best_top_words = top_words
                best_topic_results = topic_results

# Use the best model to append the dominant topic's top word to each document in the DataFrame
dominant_topics = best_topic_results.argmax(axis=1)
combined_df['Dominant Topic Word'] = [best_top_words[topic][0] for topic in dominant_topics]  # Using the first top word as representative

print(f'Best Model Configuration: {best_model_config} with Lowest Perplexity: {lowest_perplexity}')

InvalidParameterError: The 'stop_words' parameter of CountVectorizer must be a str among {'english'}, an instance of 'list' or None. Got frozenset({'somehow', 'fill', 'there', 'even', 'someone', 'itself', 'sometime', 'inc', 'system', 'amount', 'over', 'nothing', 'out', 'something', 'thence', 'along', 'latter', 'fifteen', 'thereupon', 'least', 'much', 'ord', 'bottom', 'else', 'if', 'fire', 'last', 'wherever', 'by', 'every', 'already', 'etc', 'together', 'eg', 'sometimes', 'eight', 'others', 'please', 'beside', 'anyhow', 'cannot', 'everywhere', 'where', 'could', 'hers', 'for', 'herein', 'move', 'own', 'be', 'they', 'formerly', 'another', 'three', 'via', 'hence', 'than', 'she', 'across', 'elsewhere', 'couldnt', 'thus', 'seems', 'mine', 'otherwise', 'former', 'nine', 'hereafter', 'has', 'the', 'at', 'thick', 'un', 'us', 'whenever', 'here', 'why', 'any', 'four', 'into', 'with', 'you', 'hereby', 'your', 'whereby', 'while', 'either', 'further', 'bill', 'or', 'front', 'towards', 'then', 'six', 'next', 'seemed', 're', 'moreover', 'within', 'do', 'perhaps', 'whereas', 'forty', 'down', 'as', 'eleven', 'which', 'interest', 'in', 'were', 'beforehand', 'very', 'co', 'ever', 'and', 'once', 'almost', 'ten', 'found', 'more', 'anywhere', 'rather', 'part', 'often', 'but', 'except', 'whereupon', 'up', 'thereby', 'always', 'afterwards', 'had', 'show', 'again', 'can', 'have', 'therefore', 'keep', 'ie', 'two', 'therein', 'that', 'less', 'when', 'ltd', 'nobody', 'an', 'first', 'same', 'hasnt', 'to', 'so', 'i', 'never', 'become', 'indeed', 'since', 'off', 'several', 'whither', 'are', 'describe', 'upon', 'due', 'well', 'too', 'nevertheless', 'seem', 'noone', 'only', 'below', 'might', 'find', 'thin', 'also', 'herself', 'been', 'none', 'their', 'whom', 'most', 'side', 'cry', 'get', 'alone', 'whose', 'meanwhile', 'whatever', 'him', 'whether', 'through', 'anyone', 'however', 'among', 'back', 'many', 'above', 'mill', 'yours', 'ourselves', 'top', 'full', 'of', 'though', 'empty', 'whence', 'throughout', 'yourselves', 'is', 'thereafter', 'serious', 'whole', 'myself', 'besides', 'under', 'what', 'sixty', 'should', 'per', 'amoungst', 'call', 'those', 'ours', 'without', 'am', 'no', 'take', 'twenty', 'before', 'such', 'sincere', 'everything', 'he', 'becoming', 'wherein', 'his', 'it', 'themselves', 'who', 'detail', 'namely', 'hereupon', 'everyone', 'seeming', 'now', 'not', 'until', 'because', 'third', 'go', 'me', 'anyway', 'on', 'cant', 'latterly', 'done', 'must', 'whereafter', 'a', 'whoever', 'amongst', 'neither', 'five', 'may', 'fifty', 'few', 'from', 'anything', 'con', 'nowhere', 'our', 'enough', 'behind', 'thru', 'see', 'name', 'give', 'them', 'one', 'during', 'becomes', 'nor', 'onto', 'this', 'would', 'its', 'how', 'my', 'these', 'other', 'being', 'was', 'made', 'beyond', 'somewhere', 'each', 'became', 'all', 'yet', 'some', 'around', 'both', 'mostly', 'about', 'himself', 'we', 'still', 'after', 'put', 'hundred', 'twelve', 'de', 'her', 'will', 'between', 'although', 'toward', 'yourself', 'against'}) instead.

In [16]:
combined_df.sample(5)

Unnamed: 0,Case Number,Date,Time,Crime Code,Crime Description,Weapon,Victim Sex,Latitude,Longitude,City,str_length,Preprocessed Text,Lemmatized Text
1238,12938222,2022-12-31T17:27:00.000,,1320,TO VEHICLE,,,41.843082239,-87.631807329,CHI,10,to vehicle,to vehicle
1275,12945997,2022-12-31T16:30:00.000,,1120,FORGERY,,,42.000843647,-87.807064135,CHI,7,forgery,forgery
1438,12937863,2022-12-31T11:15:00.000,,1345,TO CITY OF CHICAGO PROPERTY,,,41.754325322,-87.62710508,CHI,27,to city of chicago property,to city of chicago property
1366,12939476,2022-12-31T13:40:00.000,,520,AGGRAVATED - KNIFE / CUTTING INSTRUMENT,,,41.835655908,-87.647194113,CHI,39,aggravated knife cutting instrument,aggravate knife cut instrument
248,231220613,2023-10-02T00:00:00.000,2020-07-17T00:00:00.000,354,THEFT OF IDENTITY,,F,33.997,-118.2871,LA,17,theft of identity,theft of identity


In [17]:
# new_descriptions = ['0': 'Traffic', '6':'Financial']
combined_df['New Description'] = np.nan
for index, row in combined_df.iterrows():
    if row['topic'] == 0:
        combined_df.loc[index, 'New Description'] = 'Vehicular'
    elif row['topic'] == 6:
        combined_df.loc[index, 'New Description'] = 'Financial'
    elif row['topic'] == 8:
        combined_df.loc[index, 'New Description'] = 'Larceny'
    else:
        combined_df.loc[index, 'New Description'] = 'Unassigned'
›combined_df[['Crime Description', 'topic', 'New Description']]
     


SyntaxError: invalid character '›' (U+203A) (2092999737.py, line 12)

In [18]:
import matplotlib.pyplot as plt
def plot_top_words_for_all_topics(lda_model, feature_names, num_top_words):
    """
    Plots the top words for all topics in the LDA model.

    Parameters:
    - lda_model: The fitted LDA model.
    - feature_names: The names of the features (words) from the CountVectorizer.
    - num_top_words: The number of top words to include in each plot.
    """
    # Number of topics
    num_topics = lda_model.components_.shape[0]

    # Create a figure to contain subplots for each topic.
    fig, axes = plt.subplots(num_topics, 1, figsize=(10, 6 * num_topics), sharex=True)
    axes = axes.flatten()

    for topic_idx, topic_word_weights in enumerate(lda_model.components_):
        # Get the indices of the top words for this topic.
        top_word_indices = topic_word_weights.argsort()[-num_top_words:][::-1]

        # Get the top words and their weights.
        top_words = [feature_names[i] for i in top_word_indices]
        top_words_weights = topic_word_weights[top_word_indices]

        # Plot for the current topic.
        ax = axes[topic_idx]
        ax.barh(top_words, top_words_weights, color='lightblue')
        ax.set_title(f'Topic {topic_idx + 1}', fontsize=14)
        ax.invert_yaxis()  # Invert y-axis to have the highest weight on top.

    plt.subplots_adjust(top=0.95, bottom=0.05, hspace=0.3)
    plt.show()

# Extract feature names from the CountVectorizer
feature_names = cv.get_feature_names_out()

# Choose the number of top words to display for each topic
num_top_words = 10

# Plot the top words for all topics
plot_top_words_for_all_topics(lda, feature_names, num_top_words)



NameError: name 'cv' is not defined

In [None]:
combined_df.sample(10)

Graphing the crimes and their cities

In [None]:
import folium
from folium import plugins

In [None]:
df = combined_df 
 
# Convert Latitude and Longitude to float
df['Latitude'] = df['Latitude'].astype(float)
df['Longitude'] = df['Longitude'].astype(float)
 
# Create a map centered around the mean of latitude and longitude
map1 = folium.Map(location=[df['Latitude'].mean(), df['Longitude'].mean()], zoom_start=10)
 
# Add markers for each crime location
for index, row in df.iterrows():
    # Check for NaN values
    if not pd.isnull(row['Latitude']) and not pd.isnull(row['Longitude']):
        folium.Marker(location=[row['Latitude'], row['Longitude']], popup=row['Case Number']).add_to(m)

# Save the map
filename = 'crime_map.html'  # Specify a full file path
#map1.save(filename)
map1

In [None]:
la_lon = -118.2426
la_lat = 34.0549
#Create a second map with the same LA coordinates
la_map2 = folium.Map(location = [la_lat, la_lon], zoom_start = 9)

#Instantiate a mark cluster object for the incidents in the dataframe
incidents2 = plugins.MarkerCluster().add_to(la_map2)

df = df.dropna(subset=['Longitude','Latitude'])

#print(type(lat))
#Loop through the dataframe and add each data point to the mark cluster
for lat, lng, in zip(df.Latitude, df.Longitude):
    
    folium.Marker(
        location=[lat, lng],
        icon=None,
    ).add_to(incidents2)

#Display map
la_map2