# LA Crime
Author: Dhanush Vasa

#### Table of Contents:
1. Introduction
2. Data Collection 
3. Data Cleaning and Exploratory Analysis
4. Modeling
5. Interpretation of Results
6. Conclusion

### 1. Introduction 
The aim of this tutorial is to guide you through the data science lifecycle and introduce its key concepts along the way. The stages of the data science lifecycle are:
1. Data Collection 
2. Data Cleaning
3. Exploratory Analysis and Visualization
4. Modeling
5. Results Interpretation

Crime in Los Angeles is a complicated and dynamic issue, influenced by the city's size, diversity, and socioeconomic circumstances. As one of the major metropolitan areas in the United States, Los Angeles sees a wide range of criminal activity and violence. Understanding crime patterns and trends in Los Angeles is crucial for keeping the public safe and ensuring that law enforcement agencies allocate resources efficiently.

Many factors influence criminal activity in Los Angeles, including local demographics, economic conditions, and the physical environment. Some parts of the city are recurring hotspots for particular kinds of crime, often associated with population density, accessibility, or the presence of specific establishments. For example, theft may be more prevalent in commercial districts, whereas violent crimes may cluster in economically deprived regions. Temporal trends are also important, as certain crimes tend to increase in specific seasons, on particular days of the week, or even at certain hours of the day.

By using statistics to examine crime in Los Angeles, policymakers and law enforcement may uncover critical patterns and build focused prevention and intervention initiatives. For example, assessing geographic trends can help assign police patrols more effectively to high-crime areas, while investigating temporal trends can inform resource deployment during peak hours. Furthermore, knowing the root causes of crime, whether they are social, economic, or environmental, can help drive community-based programs to reduce criminal behavior. A comprehensive approach to studying crime in Los Angeles is critical for developing safer communities and building confidence between residents and law enforcement.

Important Note:
- In certain sections of the code, you may encounter warning messages. The import cell below suppresses them with `warnings.filterwarnings("ignore")`; any that still appear can be ignored while focusing on the intended output of the code.

### 2. Data Collection

To begin any analysis, we must collect data that is relevant to the question we want to answer. The quality of a machine learning model depends heavily on the quality of the data it processes. A solid dataset makes it far more likely that the model finds relevant patterns and supports sound conclusions. As a result, selecting an appropriate dataset is an important stage in the data science lifecycle.

In this tutorial, we will use the "Crime_Data from 2020 to Present from LA" dataset from OpenML, which provides a detailed record of reported criminal incidents in Los Angeles beginning in 2020. This dataset contains key attributes such as unique report numbers, dates and times of reporting and occurrence, crime descriptions with related codes, and geographic information such as area names, premises descriptions, and latitude/longitude coordinates. It also provides demographic information about victims, weapons used, and the status of each crime report.

For public safety agencies, analysts, and researchers, this dataset is significant because it makes it easier to identify patterns in crime, analyze hotspots, and assess the efficacy of law enforcement. By using this data, we can investigate a range of use cases, including building prediction models, understanding the social factors that influence crime, and informing decision-makers. Spatial data, for instance, can be used to identify high-crime areas and guide resource allocation, and knowledge of victim demographic trends can help guide community safety efforts. Because of its extensive scope, this dataset is well suited to studying and addressing crime methodically.

##### Importing Python Libraries

Before we begin, we import the Python libraries used throughout this tutorial, as shown below. The code is written to run in Jupyter Notebook, a popular tool among data scientists because it makes data analysis and visualization easier, so we recommend using it to follow along. The role of each library is explained as it comes up in the tutorial.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
import geopandas as gpd
import folium
from shapely.geometry import Point
from folium.plugins import HeatMap
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Input
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Filter out the warnings
import warnings
warnings.filterwarnings("ignore")

##### Download and Import The Data

Head to the [LA Crime dataset](https://openml.org/search?type=data&status=active&order=asc&id=45954&sort=runs) page on OpenML and download the dataset file. Keep the downloaded file in the same folder as this notebook, because the code below reads it by a relative path; this also makes the analysis easy to replicate.

In [None]:
# Read the uploaded file to determine its format
file_path = 'dataset_'
# Read the first few bytes of the file to inspect its structure
with open(file_path, 'rb') as file:
    file_head = file.read(512)  # Read the first 512 bytes

file_head.decode(errors='replace')

Begins the exploratory analysis by reading the first 512 bytes of the dataset file in binary mode and decoding them to inspect its structure. The output reveals metadata about criminal incidents, including report numbers, dates, times, crime descriptions, locations, and victim/suspect details.

In [None]:
# Display the first few lines of the file to identify delimiters or formatting issues
with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
    for i in range(10):  # Display the first 10 lines
        print(file.readline())

Reads and prints the first 10 lines of the dataset file (`file_path`) as UTF-8 text to inspect its structure, delimiters, and metadata, revealing attributes like `DR_NO`, report dates, times, area details, district numbers, and offense classification (`Part 1` or `Part 2`) for further analysis.

In [None]:
# Attempt to locate the line where the actual dataset starts
with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
    lines = file.readlines()

# Display lines to find the starting point of the dataset
for i, line in enumerate(lines[:50]):  # Check the first 50 lines
    print(f"Line {i + 1}: {line.strip()}")


This process identifies where the actual dataset starts in a file containing descriptive metadata by reading the first 50 lines, revealing attribute descriptions in the `@ATTRIBUTE` format (typical of ARFF files), ensuring accurate data parsing for further analysis.

In [None]:
# Locate the starting point of the actual data
data_start = None
for i, line in enumerate(lines):
    if "@DATA" in line.upper():  # ARFF files typically use '@DATA' to mark the start of data
        data_start = i + 1  # Data starts after this line
        break

# Display a few lines of the actual data if found
if data_start:
    print(f"Data starts at line {data_start + 1}.")
    for line in lines[data_start:data_start + 10]:
        print(line.strip())
else:
    print("No '@DATA' section found; the structure might differ.")


This process locates the starting point of the dataset in an ARFF file by identifying the `@DATA` marker, confirming that data begins at line 55, and displaying initial rows of comma-separated crime records, ensuring accurate parsing for further analysis.
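The same scan can also collect the attribute names declared in the ARFF header, which avoids typing column names by hand later. The following is a minimal sketch, not the notebook's own method: `parse_arff_header` is a hypothetical helper, the sample text is made up, and the simple `split()` assumes attribute names contain no spaces.

```python
def parse_arff_header(lines):
    """Return (attribute_names, data_start) for a list of ARFF text lines.

    data_start is the 0-based index of the first data row, or None if
    no @DATA marker is present.
    """
    names, data_start = [], None
    for i, line in enumerate(lines):
        stripped = line.strip()
        upper = stripped.upper()
        if upper.startswith("@ATTRIBUTE"):
            # The second whitespace-separated token is the attribute name;
            # this assumes names contain no spaces (an assumption of this sketch).
            names.append(stripped.split()[1].strip("'\""))
        elif upper.startswith("@DATA"):
            data_start = i + 1  # data rows begin on the line after the marker
            break
    return names, data_start


# Hypothetical miniature file, not the real dataset.
sample = """@RELATION crime
@ATTRIBUTE DR_NO INTEGER
@ATTRIBUTE 'AREA_NAME' STRING
@ATTRIBUTE LAT REAL
@DATA
190326475,'Wilshire',34.0375
200106753,'Central',34.0444
""".splitlines()

names, data_start = parse_arff_header(sample)
print(names)        # ['DR_NO', 'AREA_NAME', 'LAT']
print(data_start)   # 5
```

On the real file, `lines` from the cell above could be passed in directly; the returned names would still need checking against the dataset documentation.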

In [None]:
# Re-import necessary libraries
import pandas as pd

# Re-attempt to process the dataset
try:
    # Reload and inspect the first few lines of the file
    with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
        for i in range(10):  # Display the first 10 lines
            print(file.readline().strip())

    # Load the dataset by skipping the metadata block.
    # data_start (computed above) is the 0-based index of the first data row,
    # which equals the number of lines to skip; using it avoids a hard-coded count.
    df = pd.read_csv(file_path, skiprows=data_start, delimiter=',', on_bad_lines='skip', header=None)
    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error occurred while processing the dataset: {e}")


This process reloads the dataset by skipping the metadata lines that precede the first data row (`skiprows=data_start`, so the count always matches the `@DATA` marker found above) and using `pandas` to parse the rest as CSV. `on_bad_lines='skip'` drops malformed rows, and `header=None` prevents the first data row from being read as column names, giving a clean DataFrame for analysis.
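To see how these `read_csv` options behave, the sketch below applies them to a small in-memory text instead of the real file. The ARFF-style text and its values are hypothetical, and the malformed row is included only to show the effect of `on_bad_lines='skip'` (available in pandas 1.3 and later).

```python
import io
import pandas as pd

# Hypothetical miniature ARFF-style text standing in for the real file.
arff_text = """@RELATION crime
@ATTRIBUTE DR_NO INTEGER
@ATTRIBUTE AREA_NAME STRING
@DATA
190326475,Wilshire
200106753,Central
this,row,has,too,many,fields
200320258,Southwest
"""

text_lines = arff_text.splitlines()
# 0-based index of the first row after the @DATA marker
first_data_row = next(
    i for i, line in enumerate(text_lines) if line.strip().upper() == "@DATA"
) + 1

toy_df = pd.read_csv(
    io.StringIO(arff_text),
    skiprows=first_data_row,  # skip the metadata block and the @DATA line
    header=None,              # data rows carry no header row
    on_bad_lines="skip",      # drop rows with more fields than expected
)
toy_df.columns = ["DR_NO", "AREA_NAME"]
print(toy_df)
```

Note that skipped rows disappear silently, so comparing the loaded row count with the number of data lines in the file is a reasonable sanity check.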

In [None]:
df.head()

In [None]:
df.info()

### 3. Data Cleaning and Exploratory Analysis 

Data cleaning is the essential process of preparing a dataset for analysis or machine learning by ensuring it is consistent, complete, and accurate. This process involves several key tasks, such as removing unnecessary or irrelevant data, filling in missing values, and standardizing metrics or measurements to create uniformity. Additionally, new features can be derived from existing data to make the dataset more useful and meaningful. By addressing errors and inconsistencies, data cleaning ensures the dataset is reliable and forms a solid foundation for further analysis or model training.

Often combined with data cleaning, exploratory analysis involves examining the dataset to uncover patterns, trends, and relationships that provide valuable insights. This step includes creating visualizations, such as graphs or plots, to identify correlations or significant variables, and spotting potential issues like outliers that may need cleaning. Insights gained during this process may guide the creation of new features or adjustments to existing ones, refining the dataset for better performance in a machine learning model. By integrating these two steps, we not only ensure the dataset is clean but also well-understood, which is critical for building effective models or conducting insightful analysis.

##### Key Steps in Data Cleaning:
- Remove unnecessary or irrelevant data.
- Fill in missing values to address gaps.
- Standardize metrics or measurements.
- Create new features from existing data to enhance usability.
- Ensure data accuracy and reliability.
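The cleaning steps listed above can be sketched on a toy table before applying them to the real data. The records below are hypothetical (only the column names mirror this dataset), and the derived `Is_Adult` feature is an illustration rather than something used later in the tutorial.

```python
import pandas as pd

# Hypothetical records; the column names mirror the tutorial's dataset.
raw = pd.DataFrame({
    "Vict_Age": ["34", "x", None, "27"],
    "Vict_Sex": ["F", None, "M", "M"],
    "Lat": ["34.05", "34.10", "bad", "33.99"],
    "Unused": [1, 2, 3, 4],
})

clean = raw.drop(columns=["Unused"])                      # remove irrelevant data
clean["Vict_Sex"] = clean["Vict_Sex"].fillna("None")      # fill categorical gaps
# Standardize types: values that cannot be parsed become NaN
clean["Vict_Age"] = pd.to_numeric(clean["Vict_Age"], errors="coerce")
clean["Lat"] = pd.to_numeric(clean["Lat"], errors="coerce")
# Keep only rows that are usable for analysis
clean = clean.dropna(subset=["Vict_Age", "Lat"]).copy()
clean["Is_Adult"] = clean["Vict_Age"] >= 18               # derive a new feature

print(clean)
```

Coercing first and dropping afterwards keeps the two decisions separate: what counts as invalid, and what to do with invalid rows.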

##### Key Steps in Exploratory Analysis:
- Visualize data through graphs and plots to uncover patterns and relationships.
- Identify significant features or correlations.
- Detect issues like outliers or irregularities for further cleaning.
- Refine research questions based on insights from the data.
- Create or adjust features to align with identified trends or insights.
- Combine exploratory analysis with cleaning for a comprehensive understanding of the dataset.

In [None]:
df.columns = [
    "DR_NO", "Date_Rptd", "Date_Occ", "Time_Occ", "Area", "Area_Name",
    "Rpt_Dist_No", "Part_1_2", "Crm_Cd", "Crm_Cd_Desc", "Mocodes", "Vict_Age",
    "Vict_Sex", "Vict_Descent", "Premis_Cd", "Premis_Desc", "Weapon_Used_Cd",
    "Weapon_Desc", "Status", "Status_Desc", "Crm_Cd_1", "Crm_Cd_2", "Crm_Cd_3",
    "Crm_Cd_4", "Location", "Cross_Street", "Lat", "Lon"
]
df.info()

Renames the columns of the DataFrame `df` to a specified list of column names and displays a summary of the DataFrame structure using `df.info()`.

In [None]:
df['Crm_Cd_Desc'].unique()

Retrieves and displays all the unique values in the `Crm_Cd_Desc` column of the DataFrame `df`, which represent the unique descriptions of crime categories in the dataset.

In [None]:
# `columns=` already identifies the axis, so `axis=1` is not needed
df.drop(columns=['Weapon_Used_Cd', 'Weapon_Desc', 'Crm_Cd_1', 'Crm_Cd_2', 'Crm_Cd_3', 'Crm_Cd_4', 'Cross_Street'], inplace=True)

df['Vict_Descent'] = df['Vict_Descent'].fillna('None')
df['Vict_Sex'] = df['Vict_Sex'].fillna('None')
df['Mocodes'] = df['Mocodes'].fillna('none')
df['Premis_Desc'] = df['Premis_Desc'].fillna('None')

df['Date_Rptd'] = pd.to_datetime(df['Date_Rptd'].str[:11])
df['Date_Occ'] = pd.to_datetime(df['Date_Occ'].str[:11])

df.head()

Cleans the DataFrame by dropping unnecessary columns, filling missing values in the text columns with `'None'` (or `'none'` for `Mocodes`), converting the date columns to `datetime`, and previewing the cleaned data.

In [None]:
df.isnull().sum()

Checks for missing values in the DataFrame `df` by using `df.isnull().sum()`. It outputs the total count of missing values for each column. The result shows that all columns have 0 missing values, indicating the dataset has been successfully cleaned of any null or missing data.

In [None]:
# Copy so the assignments below modify an independent DataFrame
df_cleaned = df.dropna().copy()

df_cleaned['Vict_Age'] = pd.to_numeric(df_cleaned['Vict_Age'], errors='coerce').astype('Int64')
df_cleaned['Lat'] = pd.to_numeric(df_cleaned['Lat'], errors='coerce')
df_cleaned['Lon'] = pd.to_numeric(df_cleaned['Lon'], errors='coerce')

df_cleaned['Vict_Sex'] = df_cleaned['Vict_Sex'].astype('category')
df_cleaned['Vict_Descent'] = df_cleaned['Vict_Descent'].astype('category')

print(df_cleaned.isnull().sum())
print(df_cleaned.info())

Cleans the dataset by removing null values, converting `Vict_Age` to a nullable integer type (`Int64`) and `Lat`/`Lon` to numeric (floating-point) values, and storing `Vict_Sex` and `Vict_Descent` as categorical data types to save memory.
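The benefit of the categorical conversion can be checked by measuring memory directly. This is a small sketch on synthetic data: the three codes and the series length are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic column: 100,000 values drawn from three hypothetical codes.
rng = np.random.default_rng(0)
sex = pd.Series(rng.choice(["F", "M", "X"], size=100_000))

as_text = sex.memory_usage(deep=True)
as_category = sex.astype("category").memory_usage(deep=True)

print(f"text dtype: {as_text:,} bytes")
print(f"category:   {as_category:,} bytes")
```

A categorical column stores each distinct value once plus a small integer code per row, so the saving grows with the number of repeated values.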

In [None]:
df_cleaned = df_cleaned.dropna(subset=['Lat', 'Lon', 'Vict_Age'])
# Verify the cleaned DataFrame
print(df_cleaned.isnull().sum())
print(df_cleaned.info())

Removes rows with missing values in the `Lat`, `Lon`, and `Vict_Age` columns from the `df_cleaned` DataFrame, then verifies the cleaned dataset by printing the count of missing values and displaying the DataFrame's structure and summary using `df_cleaned.info()`.

## EDA

#### Step 1: Exploratory Data Analysis

In [None]:
eda_results = {
    "Crime Type Frequency": df['Crm_Cd_Desc'].value_counts().head(10),
    "Area Crime Count": df['Area_Name'].value_counts(),
    "Victim Age Statistics": df['Vict_Age'].describe(),
    "Crimes by Time of Day": df['Time_Occ'].value_counts(bins=4).sort_index(),
    "Top Premises for Crimes": df['Premis_Desc'].value_counts().head(10)
}

# Prepare for time-series analysis
df['Year_Month'] = df['Date_Occ'].dt.to_period('M')
crimes_by_month = df.groupby('Year_Month').size()
crimes_by_month

Performs EDA by summarizing crime frequencies, victim age statistics, and crime timings, and prepares the dataset for time-series analysis by grouping crimes by month.

In [None]:
eda_results

Displays the `eda_results` dictionary, which contains key insights from the dataset, such as the top 10 crime types, crime counts by area, victim age statistics, crime distributions by time of day, and the top premises for crimes.

##### Graph 1: Top 10 Crime Types

In [None]:
plt.figure(figsize=(10, 10))
df['Crm_Cd_Desc'].value_counts().head(10).plot(kind='bar', title="Top 10 Crime Types")
plt.xlabel('Crime Type')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

The bar chart illustrates the top 10 most frequent crime types, with **"Vehicle - Stolen"** leading significantly, surpassing 100,000 reported incidents. This highlights vehicle theft as a prominent issue in the dataset's coverage area. Following this, crimes like **"Burglary from Vehicle"**, **"Burglary"**, and **"Theft of Identity"** also show high frequencies, emphasizing a pattern of property-related offenses and vulnerabilities in vehicle and property security.

Petty theft-related crimes, including **"Theft Plain - Petty ($950 & Under)"**, **"Theft from Motor Vehicle"**, and **"Shoplifting - Petty Theft"**, are also prevalent, reflecting opportunistic behaviors targeting easily accessible items. Less frequent but still notable offenses, such as **"Robbery"** and **"Vandalism - Misdemeanor ($399 or Under)"**, further underscore the dominance of property crimes in the area, suggesting a need for focused preventive measures.

##### Graph 2: Crime Count by Area

In [None]:
plt.figure(figsize=(15, 10))
df['Area_Name'].value_counts().plot(kind='bar', title="Crime Count by Area")
plt.xlabel('Area Name')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

The bar chart visualizes the number of crimes reported in different areas, with the area names on the x-axis and the number of crimes on the y-axis. The findings indicate that **"Central"** and **"Pacific"** areas have the highest crime counts, each with over 30,000 incidents, making them hotspots for criminal activity. These are followed closely by **"77th Street"**, **"Wilshire"**, and **"West Hollywood"**, which also show high crime counts.

Other areas, such as **"Southwest"**, **"Newton"**, and **"West LA"**, report moderately high crime counts, while areas like **"Foothill"** and **"Rampart"** have relatively lower counts compared to the leading areas. This distribution suggests that certain regions experience a disproportionate amount of crime, highlighting the need for targeted law enforcement and community safety initiatives in these high-crime areas.

##### Graph 3: Victim Age Distribution

In [None]:
df['Vict_Age'] = pd.to_numeric(df['Vict_Age'], errors='coerce')
df['Vict_Age'] = df['Vict_Age'].fillna(df['Vict_Age'].median())
df['Lat'] = pd.to_numeric(df['Lat'], errors='coerce')
df['Lon'] = pd.to_numeric(df['Lon'], errors='coerce')
df = df.dropna(subset=['Lat', 'Lon'])  
df = df[(df['Vict_Age'] > 0) & (df['Vict_Age'] <= 100)]
print(df['Vict_Age'].describe())
df.reset_index(drop=True, inplace=True)
 

plt.figure(figsize=(10, 6))
df['Vict_Age'].plot(kind='hist', bins=20, title="Victim Age Distribution", color='blue')
plt.xlabel('Victim Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

The histogram visualizes the age distribution of crime victims. Preprocessing converts `Vict_Age` to numeric values, fills missing ages with the median, and keeps only realistic ages greater than 0 and at most 100 (rows with missing coordinates are also dropped). This ensures a clean and accurate representation of the data.

The histogram reveals that most crime victims are between 20 and 40 years old, peaking around 30, indicating young adults are the most affected group. Victim frequency declines steadily beyond 40 and drops significantly after 60, suggesting lower victimization rates among older individuals. These findings underscore the need for targeted safety measures for young adults, who are at a higher risk of crime.

##### Graph 4: Crimes by Time of Day

In [None]:
time_bins = [0, 600, 1200, 1800, 2400]
time_labels = ['Midnight to Morning', 'Morning to Noon', 'Noon to Evening', 'Evening to Midnight']
df['Time_Binned'] = pd.cut(df['Time_Occ'], bins=time_bins, labels=time_labels, right=False)

# Plot the cleaned Time of Day distribution
plt.figure(figsize=(10, 6))
df['Time_Binned'].value_counts().sort_index().plot(kind='bar', title="Crimes by Time of Day (Cleaned)")
plt.xlabel('Time of Day')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

The bar chart visualizes the distribution of crimes across four time intervals: **Midnight to Morning**, **Morning to Noon**, **Noon to Evening**, and **Evening to Midnight**. The data reveals that most crimes occur **Noon to Evening**, followed by **Evening to Midnight**, indicating a higher crime rate during the latter part of the day.

In contrast, fewer crimes are reported **Morning to Noon**, with the lowest frequency occurring **Midnight to Morning**. This trend suggests that criminal activity peaks in the afternoon and evening hours, tapering off during the early morning, potentially reflecting variations in daily routines and societal activity levels.
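The binning behind this chart depends on how `pd.cut` treats values that sit exactly on a bin edge. The sketch below uses hypothetical occurrence times in HHMM form to show that, with `right=False`, each interval includes its left edge and excludes its right edge.

```python
import pandas as pd

time_bins = [0, 600, 1200, 1800, 2400]
time_labels = ['Midnight to Morning', 'Morning to Noon', 'Noon to Evening', 'Evening to Midnight']

# Hypothetical occurrence times in HHMM form, including values on bin edges.
times = pd.Series([0, 559, 600, 1159, 1200, 1800, 2359])
binned = pd.cut(times, bins=time_bins, labels=time_labels, right=False)

print(pd.DataFrame({"Time_Occ": times, "Time_Binned": binned}))
```

With these settings a time of 600 counts as "Morning to Noon" rather than "Midnight to Morning", and a value of exactly 2400 would fall outside every bin and become missing, so it is worth confirming that `Time_Occ` never exceeds 2359.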

##### Graph 5: Crimes by Year and Month

In [None]:
# Remove the last data point (potentially incomplete month/year) from the time series
filtered_crimes_by_month = crimes_by_month.iloc[:-1]

# Plot the filtered Crimes by Year and Month
plt.figure(figsize=(12, 6))
filtered_crimes_by_month.plot(kind='line', title="Crimes by Year and Month (Filtered)")
plt.xlabel('Year-Month')
plt.ylabel('Number of Crimes')
plt.grid()
plt.tight_layout()
plt.show()

The line chart visualizes the monthly trend of reported crimes from 2020 to 2023, excluding the final incomplete data point for precision. The data reveals a clear upward trajectory in crime rates over the years, punctuated by occasional dips and spikes, suggesting periods of varying criminal activity.

These fluctuations hint at potential seasonality or specific factors influencing crime rates, while the overall increase underscores a growing concern. This trend highlights the importance of sustained efforts to address and mitigate criminal activity in the region.
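The monthly aggregation and the removal of the last, possibly incomplete month can be sketched on a handful of made-up dates. The 2-month moving average at the end is an addition for illustration (not part of the notebook's code) that can make an underlying trend easier to separate from month-to-month fluctuation.

```python
import pandas as pd

# Hypothetical occurrence dates; the final month (2020-04) is only partly observed.
dates = pd.to_datetime([
    "2020-01-03", "2020-01-17", "2020-01-29",
    "2020-02-02", "2020-02-20",
    "2020-03-05", "2020-03-11", "2020-03-19", "2020-03-30",
    "2020-04-01",
])
toy = pd.DataFrame({"Date_Occ": dates})

# Count incidents per calendar month
monthly = toy.groupby(toy["Date_Occ"].dt.to_period("M")).size()
complete = monthly.iloc[:-1]                   # drop the partial last month
smoothed = complete.rolling(window=2).mean()   # 2-month moving average

print(complete)
print(smoothed)
```

On the real series a longer window, such as 12 months, would be a more natural choice for averaging out seasonal effects.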

In [None]:
# Create a sparse matrix for area and crime type
area_crime_matrix = pd.crosstab(df['Area_Name'], df['Crm_Cd_Desc'])
sparse_matrix = csr_matrix(area_crime_matrix.values)

# Calculate additional metrics
metrics = {
    "Total Records": len(df),
    "Total Unique Crime Types": df['Crm_Cd_Desc'].nunique(),
    "Total Unique Areas": df['Area_Name'].nunique(),
    "Missing Values": df.isnull().sum().sum(),
    "Density of Sparse Matrix": (sparse_matrix.nnz / np.prod(sparse_matrix.shape)),
}


# Sparse Matrix Dimensions
sparse_matrix_shape = sparse_matrix.shape


# Combine the metrics with the matrix dimensions in a single summary
metrics_output = {**metrics, "Sparse Matrix Shape": sparse_matrix_shape}

This code creates a sparse matrix to analyze the relationship between areas and crime types, calculates metrics, and outputs key dataset statistics:

1. **Sparse Matrix Creation**:
   - A crosstabulation is created using `pd.crosstab` to map `Area_Name` (rows) to `Crm_Cd_Desc` (columns), showing the frequency of each crime type in each area.
   - The resulting matrix is converted into a sparse matrix format using `csr_matrix` for efficient storage.

2. **Metrics Calculation**:
   - **Total Records**: The number of rows in the dataset.
   - **Total Unique Crime Types**: The number of distinct crime types.
   - **Total Unique Areas**: The number of unique areas.
   - **Missing Values**: The total number of missing values in the dataset.
   - **Density of Sparse Matrix**: The ratio of non-zero elements to the total elements in the sparse matrix, indicating how "dense" the matrix is.

3. **Output**:
   - Outputs the calculated metrics and the dimensions of the sparse matrix for further analysis or reporting.

This step summarizes the dataset's structure and provides a compressed representation of the area-crime relationships, useful for efficient data manipulation and machine learning applications.
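The density calculation described above can be verified by hand on a small example. The records below are hypothetical and chosen so that the crosstab has a few zero cells.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical records with three areas and three crime types.
toy = pd.DataFrame({
    "Area_Name":   ["Central", "Central", "Pacific", "Wilshire", "Wilshire", "Wilshire"],
    "Crm_Cd_Desc": ["BURGLARY", "ROBBERY", "BURGLARY", "ROBBERY", "ROBBERY", "ARSON"],
})

table = pd.crosstab(toy["Area_Name"], toy["Crm_Cd_Desc"])
sparse = csr_matrix(table.values)

# Density = stored non-zero cells / total cells
density = sparse.nnz / np.prod(sparse.shape)

print(table)
print(f"shape={sparse.shape}, nnz={sparse.nnz}, density={density:.2f}")
```

Here 5 of the 9 cells are non-zero, giving a density of about 0.56. Compressed sparse formats save the most memory when density is low, so for a mostly filled matrix the dense crosstab is usually just as practical.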

In [None]:
area_crime_matrix

The `area_crime_matrix` presents a crosstabulation of `Area_Name` (rows) and `Crm_Cd_Desc` (columns), detailing the frequency of each crime type in different areas. Each cell indicates how often a specific crime occurred in a given area, offering a granular view of crime distribution.

Key findings reveal that areas like **Central**, **Wilshire**, and **Pacific** exhibit higher counts across multiple crime types, marking them as crime hotspots. Conversely, certain areas report low or zero occurrences for specific crimes, highlighting regional variations in crime patterns. This matrix serves as a valuable tool for targeted interventions and area-specific crime analysis.

In [None]:
metrics

The `metrics` output summarizes the dataset with 326,863 records, 107 crime types, 21 areas, no missing values, and a sparse matrix density of 74%.

In [None]:
sparse_matrix_shape

The `sparse_matrix_shape` output shows that the sparse matrix has 21 rows (areas) and 107 columns (crime types).

In [None]:
metrics_output

The `metrics_output` summarizes the dataset with 326,863 records, 107 crime types, 21 areas, no missing values, a sparse matrix density of 74.28%, and dimensions of (21, 107).

In [None]:
# Removing extra quotes if any
df['Area_Name'] = df['Area_Name'].str.replace("'", "")
df['Crm_Cd_Desc'] = df['Crm_Cd_Desc'].str.replace("'", "")

# Create a sparse matrix (Area vs. Crime Type)
area_crime_matrix = pd.crosstab(df['Area_Name'], df['Crm_Cd_Desc'])
sparse_matrix = csr_matrix(area_crime_matrix.values)

# Plot the sparse matrix as a heatmap
plt.figure(figsize=(12, 8))
plt.imshow(area_crime_matrix.values, cmap="YlGnBu", aspect="auto")
plt.colorbar(label="Crime Count")
plt.xticks(range(area_crime_matrix.columns.size), area_crime_matrix.columns, rotation=90, fontsize=8)
plt.yticks(range(area_crime_matrix.index.size), area_crime_matrix.index, fontsize=10)
plt.title("Area vs Crime Type (Heatmap)", fontsize=14)
plt.xlabel("Crime Type", fontsize=12)
plt.ylabel("Area", fontsize=12)
plt.tight_layout()
plt.show()

This heatmap visualizes the relationship between areas and crime types, with preprocessing steps including the removal of extra quotes from `Area_Name` and `Crm_Cd_Desc` and the creation of a sparse matrix where rows represent areas, columns represent crime types, and values indicate crime counts.

The heatmap uses the `YlGnBu` color scheme, with darker shades signifying higher crime counts. It highlights areas like **Central** and **Wilshire**, which show higher activity across multiple crime types. Most crimes are sparsely distributed, with a few types dominating specific areas. This visualization effectively identifies patterns and hotspots, aiding targeted analysis and intervention strategies.

In [None]:
# Identify the top 10 crime types
top_10_crime_types = df['Crm_Cd_Desc'].value_counts().head(10).index

# Filter the area-crime matrix for the top 10 crime types
filtered_area_crime_matrix = area_crime_matrix[top_10_crime_types]

# Plot the filtered matrix as a heatmap
plt.figure(figsize=(12, 8))
plt.imshow(filtered_area_crime_matrix.values, cmap="YlGnBu", aspect="auto")
plt.colorbar(label="Crime Count")
plt.xticks(range(filtered_area_crime_matrix.columns.size), filtered_area_crime_matrix.columns, rotation=45, ha="right")
plt.yticks(range(filtered_area_crime_matrix.index.size), filtered_area_crime_matrix.index)
plt.title("Top 10 Crime Types by Area (Heatmap)", fontsize=14)
plt.xlabel("Crime Type", fontsize=12)
plt.ylabel("Area", fontsize=12)
plt.tight_layout()
plt.show()


This heatmap visualizes the distribution of the top 10 most frequent crime types across different areas, focusing on high-frequency crimes. The data was filtered to include only the top 10 crime types, creating a focused representation of key patterns. The x-axis represents these crime types, while the y-axis represents various areas, with darker shades in the `YlGnBu` color scheme indicating higher crime counts.

Key insights reveal that areas like **Central**, **Wilshire**, and **77th Street** exhibit heightened activity across multiple crime types, particularly **Burglary from Vehicle** and **Theft of Identity**. In contrast, crimes such as **Robbery** and **Vandalism** appear more localized to specific areas. This visualization highlights crime hotspots for the most common offenses, offering valuable insights for targeted prevention and intervention strategies.

### 4. Modeling

#### Analysis based on Hypothesis
##### Relationship Between Crime Type and Area
- Hypothesis: Specific crime types are concentrated in certain areas. For instance, vehicle-related crimes might be more common in high traffic or urban areas.
- Reasoning: The heatmap suggests certain crime types have hotspots in specific areas.

The dataset contains 23 columns with over 500,000 rows, including the following key attributes relevant to the hypothesis:

- `Crm_Cd_Desc`: Describes the type of crime.
- `Area_Name`: Provides the name of the area where the crime occurred.
- `Lat` and `Lon`: Coordinates for geographical analysis.
- `Premis_Desc`: Describes the premises (type of location) where the crime occurred.
- `Date_Occ` and `Time_Occ`: Provide the date and time of occurrence.

To explore the relationship between crime types and areas, we will focus on `Crm_Cd_Desc` and `Area_Name` and analyze their distribution. We will also visualize potential hotspots using heatmaps or similar methods.
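One way to check such a hypothesis formally, offered here as a sketch rather than as the method this tutorial follows, is a chi-square test of independence on the area-by-crime-type contingency table using `scipy.stats.chi2_contingency`. The counts below are hypothetical and deliberately constructed so that crime type depends on area.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts chosen so that crime type clearly depends on area.
toy = pd.DataFrame({
    "Area_Name": ["Central"] * 60 + ["Topanga"] * 60,
    "Crm_Cd_Desc": (
        ["BURGLARY FROM VEHICLE"] * 45 + ["BURGLARY"] * 15      # Central
        + ["BURGLARY FROM VEHICLE"] * 15 + ["BURGLARY"] * 45    # Topanga
    ),
})

observed = pd.crosstab(toy["Area_Name"], toy["Crm_Cd_Desc"])
# Null hypothesis: crime type is independent of area
chi2, p_value, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.2e}")
```

A small p-value argues against independence. With several hundred thousand records, however, even weak associations become statistically significant, and cells with very small expected counts can make the test unreliable, so the result is best read alongside the per-area counts and heatmaps.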

Let’s start by examining the most frequent crime types per area.

In [None]:
# Grouping data by Area_Name and Crm_Cd_Desc to find the most common crimes in each area
crime_area_group = (
    df.groupby(['Area_Name', 'Crm_Cd_Desc'])
    .size()
    .reset_index(name='Count')
)

# Finding the most frequent crime type per area
most_frequent_crimes_per_area = (
    crime_area_group.loc[crime_area_group.groupby('Area_Name')['Count'].idxmax()]
    .sort_values(by='Count', ascending=False)
)

# Display the results
most_frequent_crimes_per_area

Identifies the most frequent crime type in each area by grouping the dataset by `Area_Name` and `Crm_Cd_Desc`, calculating counts, and filtering for the most common crime per area. The analysis highlights distinct crime patterns across regions.
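The `groupby` plus `idxmax` pattern used above is easier to follow on a tiny table. The records below are hypothetical; the logic is the same as in the cell above.

```python
import pandas as pd

# Hypothetical records for two areas.
toy = pd.DataFrame({
    "Area_Name":   ["Central", "Central", "Central", "Topanga", "Topanga", "Topanga"],
    "Crm_Cd_Desc": ["BURGLARY FROM VEHICLE", "BURGLARY FROM VEHICLE", "ROBBERY",
                    "BURGLARY", "BURGLARY", "ROBBERY"],
})

# One row per (area, crime type) pair with its frequency
counts = toy.groupby(["Area_Name", "Crm_Cd_Desc"]).size().reset_index(name="Count")

# idxmax returns, for each area, the row label of the largest Count;
# .loc then selects those rows from the counts table.
top_rows = counts.loc[counts.groupby("Area_Name")["Count"].idxmax()]
print(top_rows)
```

When two crime types tie within an area, `idxmax` keeps only the first one it encounters, so a tie is not visible in the result.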

##### Findings:
1. **"Burglary from Vehicle"** is most common in areas like **Central**, **Hollywood**, and **Pacific**, with **Central** reporting the highest count (8,117 incidents).
2. **"Theft of Identity"** dominates areas such as **Southeast**, **West LA**, and **Devonshire**.
3. **Topanga** reports **"Burglary"** as the most frequent crime, showing regional variation.

#### 1. Overall Crime Type Distribution

In [None]:
# Plot the overall crime type distribution (top 10 crime types)
crime_type_counts = df['Crm_Cd_Desc'].value_counts().head(10)
crime_type_counts.plot(kind='bar')
plt.title('Top 10 Crime Types')
plt.ylabel('Count')
plt.xlabel('Crime Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

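The bar chart visualizes the top 10 most frequent crime types, emphasizing the dominance of property-related offenses, particularly those involving vehicles and theft. Because `df` was restricted to records with a victim age between 1 and 100 before Graph 3, this ranking is computed on fewer records than Graph 1, which is presumably why **"Vehicle - Stolen"** no longer appears at the top.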

##### Findings:
1. **"Burglary from Vehicle"** leads significantly with over 40,000 incidents, making it the most common crime.
2. Crimes like **"Theft of Identity"**, **"Burglary"**, and **"Theft from Motor Vehicle - Grand ($950.01 and Over)"** also show high prevalence.
3. Less frequent crimes, including **"Robbery"**, **"Vandalism - Misdemeanor ($399 or Under)"**, and **"Brandish Weapon"**, still feature prominently in the dataset.

#### 2. Top Crime Types by Area

In [None]:
# Group data by Area_Name and Crm_Cd_Desc to find the most common crimes in each area
crime_area_group = (
    df.groupby(['Area_Name', 'Crm_Cd_Desc'])
    .size()
    .reset_index(name='Count')
)

# Find the most frequent crime type per area
most_frequent_crimes_per_area = (
    crime_area_group.loc[crime_area_group.groupby('Area_Name')['Count'].idxmax()]
    .sort_values(by='Count', ascending=False)
)

# Get the top 10 areas with the highest count of a specific crime type
top_crimes_by_area = most_frequent_crimes_per_area.head(10)
plt.barh(top_crimes_by_area['Area_Name'], top_crimes_by_area['Count'])
plt.xlabel('Count')
plt.ylabel('Area Name')
plt.title('Top Crime Types by Area')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


# Display the results for analysis
most_frequent_crimes_per_area.head(10)

The table highlights the most frequent crime types in each area, revealing distinct patterns of geographic concentration and dominance of certain offenses.

##### Findings:
1. **"Burglary from Vehicle"** is the leading crime in areas like **Central** (8,117 incidents), **Hollywood**, and **Pacific**.
2. **"Theft of Identity"** is most common in areas such as **77th Street**, **Southeast**, and **Devonshire**.
3. **West LA** and **North Hollywood** also report high occurrences of **"Burglary from Vehicle"**, underscoring its prevalence.
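The `groupby(...).idxmax()` step is the core of this analysis, so here is a small sketch of how it behaves on made-up toy data. The `Label` column is a hypothetical addition, not part of the original notebook; it combines area and crime type so that a chart label can show both.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the rows are made up for illustration.
toy = pd.DataFrame({
    "Area_Name": ["Central", "Central", "Central", "Topanga", "Topanga", "Pacific"],
    "Crm_Cd_Desc": ["BURGLARY FROM VEHICLE", "BURGLARY FROM VEHICLE", "ROBBERY",
                    "BURGLARY", "BURGLARY", "BURGLARY FROM VEHICLE"],
})

# Count incidents for every (area, crime type) pair
pair_counts = (
    toy.groupby(["Area_Name", "Crm_Cd_Desc"])
    .size()
    .reset_index(name="Count")
)

# idxmax returns, for each area, the row label of the pair with the largest Count
top_idx = pair_counts.groupby("Area_Name")["Count"].idxmax()
top_per_area = pair_counts.loc[top_idx].sort_values("Count", ascending=False)

# Hypothetical helper column: a label that names both the area and its top crime
top_per_area["Label"] = top_per_area["Area_Name"] + ": " + top_per_area["Crm_Cd_Desc"]
print(top_per_area[["Label", "Count"]])
```

Note that `idxmax` keeps only the first maximum per group, so an area with two equally frequent crime types would show just one of them.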

### 3. Temporal Analysis: Analyze crime trends over time

In [None]:
# Remove the last month in the dataset for temporal analysis
df['Date_Occ'] = pd.to_datetime(df['Date_Occ'], errors='coerce')
latest_month = df['Date_Occ'].max().month
latest_year = df['Date_Occ'].max().year

# Filter out the last month and create a copy to avoid warnings
filtered_data = df[
    ~((df['Date_Occ'].dt.month == latest_month) & (df['Date_Occ'].dt.year == latest_year))
].copy()  # Use .copy() here to ensure it's a new DataFrame

# Extract year and month for temporal analysis
filtered_data['Year'] = filtered_data['Date_Occ'].dt.year
filtered_data['Month'] = filtered_data['Date_Occ'].dt.month

# Group data by Year and Month for crime trends
temporal_trends_filtered = (
    filtered_data.groupby(['Year', 'Month'])
    .size()
    .reset_index(name='Crime_Count')
    .sort_values(by=['Year', 'Month'])
)

# Plotting the temporal trends without the last month
plt.figure(figsize=(14, 8))
plt.plot(
    temporal_trends_filtered['Year'].astype(str) + '-' + temporal_trends_filtered['Month'].astype(str),
    temporal_trends_filtered['Crime_Count'],
    marker='o'
)
plt.title('Crime Trends Over Time')
plt.xlabel('Time (Year-Month)')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


This analysis examines temporal trends in crime by grouping incidents by year and month, excluding the latest incomplete month to ensure accurate insights.

##### Findings:
1. Crime counts generally increase over the analyzed period, peaking mid-way before showing a decline toward the end.
2. Fluctuations in the trends suggest possible seasonal or external factors influencing criminal activity.

These findings highlight temporal patterns, aiding in better resource allocation and intervention planning.
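An alternative way to express the monthly grouping, sketched here on made-up dates, is to convert each date to a monthly `Period`. This keeps year and month in one sortable value and makes dropping the latest, possibly incomplete, month a single comparison. It is an illustration of the idea, not the notebook's own approach.

```python
import pandas as pd

# Toy dates; made up for illustration.
toy = pd.DataFrame({
    "Date_Occ": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2023-02-11",
        "2023-02-15", "2023-02-28", "2023-03-02",
    ])
})

# A monthly Period keeps year and month together and sorts chronologically
toy["Month"] = toy["Date_Occ"].dt.to_period("M")

# Drop the most recent month, which may be only partially reported
latest = toy["Month"].max()
complete = toy[toy["Month"] < latest]

# Number of incidents per remaining month
monthly_counts = complete.groupby("Month").size()
print(monthly_counts)
```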

### 4. Premises Analysis: Study relationships between crime types and locations.

In [None]:
# Group data by Premis_Desc and Crm_Cd_Desc to find the most common crime types at each location type
premises_crime_group = (
    df.groupby(['Premis_Desc', 'Crm_Cd_Desc'])
    .size()
    .reset_index(name='Count')
    .sort_values(by='Count', ascending=False)
)

# Get the 10 most frequent (premises, crime type) pairs
top_premises_crimes = premises_crime_group.head(10)

# Plot the top premises for crimes
plt.barh(top_premises_crimes['Premis_Desc'], top_premises_crimes['Count'])
plt.xlabel('Number of Crimes')
plt.ylabel('Premises Description')
plt.title('Top Premises for Crimes')
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.tight_layout()
plt.show()

This analysis identifies the most common premises where crimes occur, focusing on top locations and their frequency through data grouping and visualization.

##### Findings:
1. **Single Family Dwelling**: The most frequent crime location, with over 25,000 incidents, emphasizing residential areas as significant crime sites.
2. **Street**: The second most common location, highlighting public spaces as key areas of concern.
3. **Parking Lot**: The third most frequent site, pointing to potential security issues in these areas.

These findings underscore the need for targeted safety measures in both residential and public spaces to address crime effectively.
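Because the grouping above counts (premises, crime type) pairs, the same premises can appear more than once among the top rows, and its ranking can differ from a ranking by total incidents per premises. The sketch below uses made-up toy rows to show the two views side by side; which view matches the real data better is a judgment call.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the rows are made up for illustration.
toy = pd.DataFrame({
    "Premis_Desc": ["STREET", "STREET", "STREET", "PARKING LOT",
                    "SINGLE FAMILY DWELLING", "SINGLE FAMILY DWELLING"],
    "Crm_Cd_Desc": ["BURGLARY FROM VEHICLE", "ROBBERY", "VANDALISM",
                    "BURGLARY FROM VEHICLE", "BURGLARY", "BURGLARY"],
})

# View 1: counts per (premises, crime type) pair, as in the cell above
pair_counts = (
    toy.groupby(["Premis_Desc", "Crm_Cd_Desc"])
    .size()
    .reset_index(name="Count")
    .sort_values(by="Count", ascending=False)
)

# View 2: counts per premises alone, i.e. every incident at that location type
premises_totals = toy["Premis_Desc"].value_counts()

print(pair_counts.head(3))
print(premises_totals)
```

In this toy example the most frequent pair involves a dwelling, while the premises with the most incidents overall is the street.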

## ML Analysis
#### GeoSpatial Analysis

In [None]:
df.head()

In [None]:
# Create a geometry column from LAT/LON coordinates
geometry = [Point(lon, lat) for lon, lat in zip(df_cleaned['Lon'], df_cleaned['Lat'])]

# Create a GeoDataFrame
gdf = gpd.GeoDataFrame(df_cleaned, geometry=geometry)

# Set the coordinate reference system (CRS) to WGS84
gdf.set_crs(epsg=4326, inplace=True)

# Display the first few rows of the GeoDataFrame
gdf.head()

Converts the cleaned dataset into a geospatial format for mapping and spatial analysis.

##### Steps:
1. **Create Geometry Column**:
   - Combines latitude (`Lat`) and longitude (`Lon`) coordinates into `Point` objects for each record using the `shapely.geometry.Point` class.

2. **Create GeoDataFrame**:
   - Converts the `df_cleaned` DataFrame into a GeoDataFrame (`gdf`) using `geopandas.GeoDataFrame`, incorporating the geometry column.

3. **Set Coordinate Reference System (CRS)**:
   - Sets the CRS to **WGS84 (EPSG:4326)**, a standard for geographic coordinates, enabling accurate mapping and geospatial analysis.

4. **Preview GeoDataFrame**:
   - Displays the first 5 rows of the GeoDataFrame, which now includes a `geometry` column for spatial representation.

##### Purpose:
This prepares the dataset for geospatial analysis, allowing crimes to be visualized on maps and enabling spatial queries to identify trends or hotspots.

In [None]:
# Create a map centered around the mean latitude and longitude of the crime locations
map_center = [df_cleaned['Lat'].mean(), df_cleaned['Lon'].mean()]

# Prepare data for HeatMap (LAT/LON coordinates)
heat_data = df_cleaned[['Lat', 'Lon']].values.tolist()

# Create the HeatMap
heatmap = folium.Map(location=map_center, zoom_start=12)
HeatMap(heat_data).add_to(heatmap)

# Display the heatmap
heatmap

The heatmap visualizes crime density, with red areas indicating hotspots of high activity, primarily in central and urban regions.

##### Insights:
1. High crime concentrations are visible in central areas and densely populated urban zones.
2. Peripheral areas show significantly lower crime density.
3. This visualization highlights where law enforcement and public safety measures should be prioritized.
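A heatmap shows density visually; to put a number on a hotspot, one option is to count incidents per grid cell. The sketch below does this with `numpy.histogram2d` on synthetic coordinates (a dense cluster plus scattered background points). The cluster location, grid size, and bounding box are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic coordinates: a dense cluster plus uniformly scattered background points
cluster = rng.normal(loc=[34.05, -118.25], scale=0.01, size=(500, 2))
background = rng.uniform(low=[33.7, -118.8], high=[34.35, -118.0], size=(200, 2))
coords = np.vstack([cluster, background])
lat, lon = coords[:, 0], coords[:, 1]

# Count incidents per grid cell (20 x 20 cells over the bounding box)
counts, lat_edges, lon_edges = np.histogram2d(
    lat, lon, bins=20, range=[[33.7, 34.35], [-118.8, -118.0]]
)

# Locate the busiest cell and report its extent
i, j = np.unravel_index(np.argmax(counts), counts.shape)
print("Busiest cell latitude range:", lat_edges[i], lat_edges[i + 1])
print("Busiest cell longitude range:", lon_edges[j], lon_edges[j + 1])
print("Incidents in that cell:", int(counts[i, j]))
```

Applied to the real `Lat` and `Lon` columns, the same call would give a table of cell counts that can be ranked or compared across time periods.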

In [None]:
# Create the figure and axis
fig, ax = plt.subplots(figsize=(20, 20))

# Plot the GeoDataFrame on a Matplotlib axis
gdf.plot(ax=ax, color='red', markersize=1)

# Set axis limits
ax.set_xlim(-118.8, -118)
ax.set_ylim(33.7, 34.35)

# Set labels and title
ax.set_title('Crime Locations')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')

# Show the plot
plt.show()

This scatter plot visualizes individual crime locations across the region using their latitude and longitude coordinates.

##### Insights:
1. The points form a detailed outline of the mapped area, indicating widespread crime occurrences.
2. Densely packed clusters represent urban areas with higher crime activity.
3. Sparse points highlight regions with lower crime occurrences, likely less populated or rural.

This visualization provides a comprehensive spatial overview of crime distribution, aiding in identifying high and low-crime regions.
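The axis limits above hide any points outside the plotted window, but those rows remain in the data. If out-of-range coordinates exist (for example placeholder values such as `(0, 0)`, which is an assumption here and not something the notebook confirms), they can be filtered explicitly. A minimal sketch on made-up rows, reusing the same bounds as the plot:

```python
import pandas as pd

# Toy coordinates; the (0, 0) row mimics a record with a placeholder location
toy = pd.DataFrame({
    "Lat": [34.05, 33.95, 0.0, 34.20],
    "Lon": [-118.25, -118.40, 0.0, -118.60],
})

# Same bounds as the axis limits used in the scatter plot
LAT_MIN, LAT_MAX = 33.7, 34.35
LON_MIN, LON_MAX = -118.8, -118.0

# Boolean mask that is True only for rows inside the bounding box
in_bounds = toy["Lat"].between(LAT_MIN, LAT_MAX) & toy["Lon"].between(LON_MIN, LON_MAX)
inside = toy[in_bounds]
print(f"Kept {len(inside)} of {len(toy)} rows")
```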

## Predictive Modeling (Crime Prevention) using Random Forest Classifier and ANN

### Random Forest Classifier

In [None]:
# Encode categorical variables
df['Vict_Sex'] = df['Vict_Sex'].astype('category').cat.codes
df['Vict_Descent'] = df['Vict_Descent'].astype('category').cat.codes
df['Crm_Cd'] = df['Crm_Cd'].astype('category').cat.codes

# Create target variable (e.g., crime type or severity)
df['Target'] = df['Part_1_2'].astype('category').cat.codes 

The cell encodes categorical variables (`Vict_Sex`, `Vict_Descent`, `Crm_Cd`, and `Part_1_2`) into numerical codes for machine learning or statistical analysis, creating a target variable `Target` based on `Part_1_2`.

In [None]:
# Select relevant features for modeling
features = ['Lat', 'Lon', 'Vict_Age', 'Vict_Sex', 'Vict_Descent', 'Crm_Cd']
X = df[features]
y = df['Target']

The code selects relevant features (`Lat`, `Lon`, `Vict_Age`, `Vict_Sex`, `Vict_Descent`, `Crm_Cd`) for modeling as `X` and defines the target variable `y` as `Target`.
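Before modeling, it can be worth checking whether any single feature already determines the target, since that would make high accuracy easy to reach. Whether this applies to `Crm_Cd` and `Part_1_2` in the real data is not established here; the sketch below only shows the check on made-up toy values.

```python
import pandas as pd

# Toy data in which each crime code maps to exactly one Part_1_2 value.
# The codes and labels are made up for illustration.
toy = pd.DataFrame({
    "Crm_Cd":   [330, 330, 354, 354, 624, 624],
    "Part_1_2": [1,   1,   2,   2,   2,   2],
})

# Number of distinct target values observed for each feature value
targets_per_code = toy.groupby("Crm_Cd")["Part_1_2"].nunique()

# Share of feature values that map to a single target value
share_deterministic = (targets_per_code == 1).mean()
print(f"Share of codes with a single target value: {share_deterministic:.0%}")
```

A share close to 100% on the real data would suggest evaluating the models with and without that feature.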

In [None]:
# Initialize K-Fold cross-validator
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

This code initializes a stratified K-Fold cross-validator with 10 splits. Stratification preserves the class distribution in each fold, `shuffle=True` randomizes the order of samples before splitting, and the fixed seed (`random_state=64`) makes the splits reproducible.
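To see what stratification does in practice, the following sketch splits a synthetic, imbalanced label vector (70% class 0, 30% class 1, chosen for illustration) and records the share of class 1 in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 70 samples of class 0 and 30 of class 1
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

# Share of class 1 in each test fold
fold_rates = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    fold_rates.append(y_toy[test_idx].mean())

print(fold_rates)
```

Every fold ends up with the same class mix as the full label vector, which is what keeps per-fold accuracy scores comparable.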

In [None]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=50, max_depth=10, max_features=2, random_state=64)

# Perform cross-validation
cv_scores = cross_val_score(rf_model, X, y, cv=kfold, scoring='accuracy')

# Report cross-validation scores
print("10-Fold Cross-Validation Accuracy Scores: \n", cv_scores)
print("\n Mean CV Accuracy: \n", np.mean(cv_scores))
print("\n Standard Deviation of CV Accuracy: \n", np.std(cv_scores))

# Train and evaluate on a single 60/40 train/test split as an additional example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=64)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model on the test set
print("\nRandom Forest Classifier Performance on Test Set:")
print(classification_report(y_test, y_pred_rf))
print("Test Set Accuracy:", accuracy_score(y_test, y_pred_rf))


In [None]:
print(confusion_matrix(y_test, y_pred_rf))

The Random Forest Classifier is configured with 50 trees, a maximum depth of 10, and two features considered at each split. Using the stratified folds defined earlier, the cell runs 10-fold cross-validation, computes the accuracy for each fold, and prints the mean accuracy and its standard deviation. As an additional illustration, it splits the dataset into training (60%) and testing (40%) subsets, trains the model on the training set, and evaluates it on the test set. The evaluation includes the classification report, the overall test accuracy, and the confusion matrix, which together describe the model's performance and stability.
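Accuracy is easier to interpret next to a simple reference point. The sketch below runs the same cross-validation setup on synthetic data from `make_classification` and compares the Random Forest with a majority-class baseline (`DummyClassifier`). The baseline is an addition for illustration and is not part of the original notebook; the synthetic data and class weights are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary classification data (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=2,
    weights=[0.65, 0.35], random_state=64,
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

# Same hyperparameters as the notebook's Random Forest
rf = RandomForestClassifier(n_estimators=50, max_depth=10, max_features=2, random_state=64)
# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")

rf_scores = cross_val_score(rf, X_toy, y_toy, cv=cv, scoring="accuracy")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=cv, scoring="accuracy")

print("Random Forest mean accuracy:", rf_scores.mean())
print("Baseline mean accuracy:     ", baseline_scores.mean())
```

The gap between the two means indicates how much the model adds beyond simply guessing the majority class.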


With a mean 10-fold cross-validation accuracy of **92.38%** and a low standard deviation of **0.14%**, the Random Forest Classifier performs well and consistently across folds. Its accuracy on the test set is **92.23%**, with balanced precision and recall for both classes. The high F1-scores for class 0 (**0.93**) and class 1 (**0.91**) show that the model handles both classes well. The weighted-average metrics confirm strong overall performance, so the model predicts the target variable reliably on this dataset.


### Sequential Model (Artificial Neural Network) - Dense Layers

In [None]:
# Normalize the features for better performance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Standardize the features with `StandardScaler` (zero mean, unit variance) to improve model performance and convergence.
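As a sketch of what the scaler computes, the example below standardizes synthetic data and checks the result against the manual formula z = (x - mean) / std. It also fits the scaler on the training rows only and reuses those statistics for the held-out rows, a common variation that keeps test data out of the fitted statistics. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(64)

# Synthetic training and held-out feature matrices (illustrative only)
X_train = rng.normal(loc=40.0, scale=15.0, size=(100, 2))
X_test = rng.normal(loc=40.0, scale=15.0, size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them for held-out rows

# Equivalent manual computation: z = (x - mean) / std, column by column
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(manual, X_train_scaled))
```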

In [None]:
# K-Fold Cross-Validation
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

# Store metrics for each fold
fold_accuracies = []
fold_reports = []

Set up stratified K-Fold cross-validation and create lists to record the metrics of each fold.

In [None]:
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_scaled, y)):
    print(f"Training Fold {fold + 1}...")
    
    # Split data into train and test for this fold
    X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]  # positional indexing, independent of the DataFrame index

    # Define the ANN model
    model = Sequential([
        Input(shape=(X_train.shape[1],)),  # Specify the input shape here
        Dense(units=64, activation='relu'),
        BatchNormalization(),
        Dropout(0.2),
        Dense(units=32, activation='relu'),
        BatchNormalization(),
        Dropout(0.3),
        Dense(units=16, activation='relu'),
        Dropout(0.3),
        Dense(units=1, activation='sigmoid')  # Output layer for binary classification
    ])

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Early stopping
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    # Train the model
    model.fit(X_train, y_train,
              validation_split=0.2,
              epochs=10,
              batch_size=64,
              callbacks=[early_stopping],
              verbose=1)

    # Evaluate the model on the fold test set
    y_pred_prob = model.predict(X_test)
    y_pred = (y_pred_prob > 0.5).astype(int).flatten()

    # Calculate accuracy for this fold
    accuracy = accuracy_score(y_test, y_pred)
    fold_accuracies.append(accuracy)

    # Store classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    fold_reports.append(report)

    print(f"Fold {fold + 1} Accuracy: {accuracy:.2f}")

# Display overall performance
mean_accuracy = np.mean(fold_accuracies)
std_accuracy = np.std(fold_accuracies)

print("\nK-Fold Cross-Validation Results:")
print(f"Mean Accuracy: {mean_accuracy:.2f}")
print(f"Standard Deviation of Accuracy: {std_accuracy:.2f}")

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
model.summary()

In [None]:
print(classification_report(y_test, y_pred))

K-Fold cross-validation is used to assess the performance of an artificial neural network (ANN) on the binary classification task. `StratifiedKFold` divides the dataset into ten stratified folds, keeping the class distribution consistent across folds. For each fold, the data is split into training and testing sets, and a fresh sequential ANN is defined and trained. The network consists of several dense layers with ReLU activations, batch normalization for stability, dropout for regularization, and a sigmoid output layer for binary classification. The model is compiled with the Adam optimizer and binary cross-entropy loss, and early stopping is used to limit overfitting. The accuracy and classification report of each fold are recorded, showing how performance varies across folds.
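The interaction between `patience=5` and `epochs=10` is easier to see with a small simulation. The function below is a simplified, hypothetical patience rule written for illustration, not Keras's actual implementation: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The loss values are made up.

```python
def epochs_run(val_losses, patience):
    """Return (number of epochs run, index of best epoch) under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the waiting counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                # No improvement for `patience` epochs in a row: stop here
                return epoch + 1, best_epoch
    # Ran out of epochs before the patience limit was reached
    return len(val_losses), best_epoch

# Made-up validation losses for 10 epochs; the best value is at index 2
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52]
print(epochs_run(losses, patience=5))
print(epochs_run(losses, patience=2))
```

Under this rule, a 10-epoch run with a patience of 5 can only stop early if the best validation loss occurs within the first five epochs, so in many folds all 10 epochs would run.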


K-Fold cross-validation of the ANN yielded a mean accuracy of 87% with a standard deviation of 0.03, which suggests that the model performed consistently across folds. The confusion matrix, which the cell above computes from the final fold only, shows 18,651 true negatives and 9,483 true positives correctly classified, along with 2,647 false positives and 1,616 false negatives. These results indicate that the model separates the two classes reasonably well but still has room to reduce both false positives and false negatives. Overall, the ANN generalizes consistently across the folds of this dataset.
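The headline metrics can be recomputed directly from the four confusion matrix counts quoted above. The sketch assumes the usual scikit-learn layout for a binary confusion matrix, `[[TN, FP], [FN, TP]]`, and treats class 1 as the positive class.

```python
# Counts as reported in the text for the ANN confusion matrix
tn, fp, fn, tp = 18651, 2647, 1616, 9483

total = tn + fp + fn + tp

# Share of all predictions that are correct
accuracy = (tp + tn) / total
# Of the cases predicted positive, the share that really are positive
precision = tp / (tp + fp)
# Of the truly positive cases, the share the model found
recall = tp / (tp + fn)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

The accuracy computed this way rounds to the 87% reported for the ANN, while precision and recall for the positive class show where the remaining errors sit.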

## 5. Interpretation of Results

Understanding the patterns that drive crime distribution is critical for making sound decisions about resource allocation and public safety measures. Our study investigated the hypothesis that particular types of crime are concentrated in specific places, such as vehicle-related crimes being more common in high-traffic urban areas. We tested how well machine learning models, namely Random Forest and an Artificial Neural Network (ANN), predict the crime category (`Part_1_2`) from geographical and contextual variables.


#### Random Forest Results:
The Random Forest model performed best on this task, with a mean cross-validation accuracy of 92.38% and a low standard deviation of 0.14%, indicating consistent performance across folds. On the test set, the model achieved 92.23% accuracy and a weighted F1-score of 0.92. The confusion matrix showed 76,406 true negatives, 42,650 true positives, 9,445 false positives, and 2,245 false negatives. Type I errors (false positives) are roughly four times as frequent as Type II errors (false negatives), indicating room for improvement, perhaps through feature engineering or more thorough hyperparameter tuning.


#### Artificial Neural Network (ANN) Results:
The ANN had somewhat lower performance metrics, with a mean cross-validation accuracy of 87% and a standard deviation of 0.03, indicating reasonable reliability. On the test set, the ANN achieved an accuracy of 87.23% with a weighted F1-score of 0.87. The confusion matrix showed 18,651 true negatives, 9,483 true positives, 2,647 false positives, and 1,616 false negatives. Compared to Random Forest, the ANN made proportionally more Type II errors, missing a larger share of the true positive cases. This may reflect the ANN's reliance on larger datasets, and it suggests potential for improvement by adding more features or modifying the architecture.


### Analysis and Key Takeaways:
The results of both models are consistent with the hypothesis, showing patterns in crime distribution that correspond to spatial trends in the data. The Random Forest model proved to be the more robust and reliable option for this investigation, surpassing the ANN on most metrics. However, both models showed limitations in handling false positives and false negatives, implying that additional contextual features, such as traffic patterns, population density, or time of day, could improve prediction accuracy.


### Future Improvements:
The findings show that, while our models capture the overall patterns, there is still room for improvement. Enriching the dataset with new factors such as weather, socioeconomic indices, or proximity to key landmarks may yield more detailed insights. Furthermore, experimenting with advanced ensemble techniques or hybrid systems that combine Random Forest and ANN could help reduce errors. Iterating on these findings by refining models and features should provide a clearer view of crime distribution patterns and more precise predictions that better reflect real-world conditions.


## 6. Conclusion 

My models did not achieve the level of robustness I had hoped for, but that is a normal part of the data science process. Regardless, the journey has been extremely beneficial. I was able to find patterns in crime data, clean and process it efficiently, test my hypothesis regarding crime type concentration, and build models that predict crime categories from location and victim features. While the existing results are insufficient to influence important policy decisions, they do suggest areas for improvement and serve as a solid foundation for future research.


This project allowed me to work through the key stages of the data science lifecycle in a practical context:

1. **Data Collection**: Gathering detailed crime records from 2020 onward to analyze spatial and contextual factors.  
2. **Data Processing**: Cleaning and preparing the data to ensure consistency and relevance for my models.  
3. **Exploratory Analysis and Visualization**: Using heatmaps and other tools to uncover trends, such as hotspots for vehicle-related crimes.  
4. **Model Analysis and Testing**: Training and evaluating Random Forest and ANN models, understanding their strengths and limitations.  
5. **Interpretation of Results**: Drawing insights from the models, such as the need for additional features to improve predictions.

This exploration reaffirmed my belief that data science is iterative: results frequently lead to new questions and opportunities to improve methods. My Random Forest model, for example, performed well with 92% accuracy, but adding features such as traffic patterns or socioeconomic data could lead to even better results. The ANN model, despite its lower accuracy of 87%, pointed to areas where design changes or additional data could improve performance.


This initiative is an important step forward in my career as a data scientist. Every step of the process, from developing hypotheses to evaluating results, has increased my understanding of how data can be used to solve real-world problems. I'm driven to keep refining my technique, adding additional layers of complexity, and eventually contribute to significant solutions that help analyze and manage criminal patterns.

