---
format: 
  html:
    toc: true
    page-layout: full
execute:
    warning: false
    echo: true
    eval: true
---

## Assault Data (Data Gathering)

***

Our data comes from the Chicago Data Portal, which publishes a variety of municipal datasets for public use. For our model, we use assault case data from 2021. Our original goal was to analyze assault cases from 2020 to 2023 to explore patterns during the COVID-19 pandemic, but for simplicity and because of time constraints we chose 2021 as a representative year. We plan to predict assault cases for 2022 in a later stage and assess the accuracy of our model against them.


In [None]:
#| code-fold: true

!pip install census
!pip install us
!pip install sodapy

import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from census import Census
from us import states
import os
from sodapy import Socrata
from shapely.geometry import Point
import numpy as np
from scipy.stats import gaussian_kde
from matplotlib.colors import LinearSegmentedColormap
from sklearn.neighbors import KernelDensity
from shapely.geometry import box
from matplotlib.colors import Normalize
from matplotlib.colorbar import ColorbarBase
import seaborn as sns

To analyze crime patterns in Chicago, we began by retrieving data from the city's open data portal using the Socrata API. The Socrata API provides an efficient method for accessing large public datasets. Below is an overview of the steps involved in the data retrieval process:

1.	**Data Retrieval**:
We requested the crime reports dataset (dataset ID: `dwme-t96c`) with a limit of 210,000 records. Without an explicit limit, the API returns only the first 1,000 rows, so the limit is set high with the intent of capturing the full dataset. The dataset includes key details such as the date, location, crime type, and other contextual information about each incident.

2.	**Conversion to Pandas DataFrame**:
The data returned from the API is in JSON format, which was then converted into a Pandas DataFrame. This transformation makes the data easier to manipulate and analyze, using Pandas' powerful tools for data cleaning, filtering, and exploration.
 
3.	**Data Structure**:
The dataset contains several columns, including:
	- Date and Location: Timestamp and coordinates of each incident.
	- Crime Descriptions: Categorized details about the crime, including the type of crime and location (e.g., street address, community area).
	- Administrative Details: Information about police districts, case numbers, and arrest status.

4.	**Setup**:
Together, these steps give us a single tabular dataset to work from. It has not yet been cleaned or filtered at this point; it serves as the foundation for preprocessing, feature engineering, and model development in the subsequent stages of the project.

In [4]:
#| code-fold: true

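# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are placeholders; use whatever names your setup defines.
client = Socrata("data.cityofchicago.org",
                 os.environ.get("SOCRATA_APP_TOKEN"),
                 username=os.environ.get("SOCRATA_USERNAME"),
                 password=os.environ.get("SOCRATA_PASSWORD"))

# Request up to 210,000 rows from the crime reports dataset
results = client.get("dwme-t96c", limit=210000)

# The API returns a list of JSON records; load them into a DataFrame
results_df = pd.DataFrame.from_records(results)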

After these steps, we filtered `results_df` to the rows whose `primary_type` is "ASSAULT", creating a new DataFrame named `assault`. We then added a `geometry` column by converting each record's longitude and latitude into a `Point` object from `shapely.geometry`, which gives every incident a spatial representation.

The filtered dataset was then converted into a GeoDataFrame called `Assault21`. A GeoDataFrame is a GeoPandas extension of the Pandas DataFrame that supports spatial operations, with the `geometry` column holding the spatial data. The coordinate reference system (CRS) was set to `EPSG:4326` (WGS 84), which expresses locations as latitude and longitude in degrees.

The resulting GeoDataFrame, `Assault21`, includes both the original crime data and the new `geometry` column, so we can map the assault locations and examine their spatial distribution.
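The filtering and conversion described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact code used in the project: the helper name `build_assault_gdf` and the toy records are hypothetical, it uses `geopandas.points_from_xy` in place of building `shapely` `Point` objects one by one, and it assumes that records without coordinates are dropped.

```python
import geopandas as gpd
import pandas as pd


def build_assault_gdf(results_df: pd.DataFrame) -> gpd.GeoDataFrame:
    """Filter crime records to assaults and attach point geometry."""
    # Keep only assault records; copy so the original frame is untouched.
    assault = results_df[results_df["primary_type"] == "ASSAULT"].copy()

    # The API returns coordinates as strings, so coerce them to numbers.
    assault["longitude"] = pd.to_numeric(assault["longitude"], errors="coerce")
    assault["latitude"] = pd.to_numeric(assault["latitude"], errors="coerce")

    # Assumption: records without usable coordinates are dropped.
    assault = assault.dropna(subset=["longitude", "latitude"])

    # points_from_xy takes x (longitude) first, then y (latitude).
    geometry = gpd.points_from_xy(assault["longitude"], assault["latitude"])
    return gpd.GeoDataFrame(assault, geometry=geometry, crs="EPSG:4326")


# Toy records standing in for `results_df` (values are illustrative only).
sample = pd.DataFrame({
    "primary_type": ["ASSAULT", "THEFT", "ASSAULT"],
    "longitude": ["-87.65", "-87.70", None],
    "latitude": ["41.85", "41.90", None],
})
Assault21 = build_assault_gdf(sample)
print(len(Assault21), Assault21.crs)
```

In the project itself, the function would be called on the full `results_df` rather than on the toy sample.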

To visualize the assault incidents in Chicago, we combined a city boundary layer with the `Assault21` GeoDataFrame. First, we retrieved census data for Chicago’s tracts using the `c.acs5.state_county_tract` function from the `census` package. Relevant demographic variables, such as total population, were selected and stored in a Pandas DataFrame called `chidf`. We also loaded Chicago’s city boundary from the GeoJSON file `chicagoBoundary.geojson` into a GeoDataFrame named `chicagoBoundary`.
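A rough sketch of the census query is shown here. Several details are assumptions rather than facts from this project: the helper name `fetch_tract_population` is hypothetical, the query is restricted to Cook County (FIPS `17` / `031`), the only variable requested is `B01003_001E` (total population), and the year is left to the caller. The client is passed in as an argument so the function can be exercised without network access.

```python
import pandas as pd

ILLINOIS_FIPS = "17"      # state FIPS code for Illinois
COOK_COUNTY_FIPS = "031"  # county FIPS code for Cook County (assumed scope)


def fetch_tract_population(client, year: int) -> pd.DataFrame:
    """Query ACS 5-year total population for every tract in the county."""
    rows = client.acs5.state_county_tract(
        ("NAME", "B01003_001E"),  # B01003_001E = total population estimate
        ILLINOIS_FIPS,
        COOK_COUNTY_FIPS,
        "*",                      # wildcard: all tracts
        year=year,
    )
    chidf = pd.DataFrame(rows)
    # Make sure the population column is numeric before any arithmetic.
    chidf["B01003_001E"] = pd.to_numeric(chidf["B01003_001E"], errors="coerce")
    return chidf


# Typical use (requires network access and a Census API key):
#   from census import Census
#   import geopandas as gpd
#   c = Census("YOUR_CENSUS_API_KEY")
#   chidf = fetch_tract_population(c, year=2021)   # year is an assumption
#   chicagoBoundary = gpd.read_file("chicagoBoundary.geojson")
```

Because the county extends beyond the city limits, tract-level results would still need to be clipped or joined to the city boundary before use.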

***


The map was created using matplotlib and geopandas. The `chicagoBoundary` GeoDataFrame was plotted first, filled in black to show the city's extent. The `Assault21` GeoDataFrame was overlaid on top, with assault incident locations marked in yellow-green (`#c2e538`). Markers were kept very small to limit overplotting, which makes the distribution of incidents easier to read.

From the map, we observe that assault incidents are concentrated in the northern and central areas of the city, while the southern region shows fewer incidents. This spatial distribution allows us to identify potential hotspots and explore correlations with other socioeconomic or geographic factors in the following stages of the analysis.


In [None]:
#| code-fold: true

import matplotlib as mpl
import matplotlib.pyplot as plt

# Global font settings for the figure
mpl.rcParams['font.family'] = 'sans-serif'
mpl.rcParams['font.sans-serif'] = 'Futura'
mpl.rcParams['font.size'] = 12

fig, ax = plt.subplots(figsize=(10, 10))

# City boundary as the base layer, reprojected to match the incident points (EPSG:4326)
chicagoBoundary.to_crs(4326).plot(ax=ax, color='black')

# One small marker per assault incident
Assault21.plot(ax=ax, color='#c2e538', markersize=0.1, label='Assault Incidents')

ax.set_title("Assault Incidents in Chicago 2021")
ax.set_xticks([])
ax.set_yticks([])

# Hide the axes frame. This also hides the background patch and all spines,
# so no separate face color or spine styling is needed.
ax.set_frame_on(False)

plt.show()

![](../images/dots.jpeg){width=45%}