# FBI analysis

In this notebook, we analyze **violent offense data** across the United States, focusing on the year **2012**. This year was selected because it provides the **most consistent** dataset across states.

Our goal is to explore trends in violent offenses, normalize the data relative to population sizes, and identify patterns across states. This includes metrics like **total offenses per state** and **offenses per 100,000 inhabitants** to ensure fair comparisons between regions with different population sizes.

### Understanding the NIBRS and Crime Data Explorer (CDE)
The **National Incident-Based Reporting System** (NIBRS) is a U.S. system for collecting and analyzing crime data. The data found on the **Crime Data Explorer** (CDE) is sourced from the **FBI’s Uniform Crime Reporting** (UCR) Program, which collects information from over **18,000 federal, state, local, tribal and territorial law enforcement agencies** across the U.S. (Source: [CDE](https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/explorer/crime/crime-trend)).

#### Key points about this data:
* Reporting is **voluntary for non-federal agencies** and depends on **state participation** or **direct submissions** to the FBI.
* It reflects **reported crimes** and may not include all crime occurrences.
* Various **confounding factors** (socio-economic and legal factors such as population size and economic conditions) influence crime reporting and activity.
#### Why NIBRS Data is useful for our analysis
* **Insights**: It provides specific details about incidents (including their exact dates, types, victim injuries and locations).
* **Types of offenses**: Knowing the nature of offenses helps us classify crimes by their violent range.
* **Definition of violent crime**: The FBI categorizes violent crimes into **four offenses**: *homicide, rape, robbery, and aggravated assault, all involving force or threats* (Source: [CDE](https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/explorer/crime/crime-trend)).
  
For our analysis, we **adjusted these categories** to include offenses that are clearly violent but were not classified as such by the FBI (e.g., Animal Cruelty).

The following analysis focuses on **2012**, the year in which the most states reported crimes.

In [1]:
#----------------------------------------------------------------------------
# Generate 2 datasets: offenses grouped by state and offense type in 2012,
#                      offenses grouped by offense type in 2012
#----------------------------------------------------------------------------
import pandas as pd

data = pd.read_csv("../../data/CLEAN/2012.csv")

# To ensure state prefixes are consistent
data['state_prefix'] = data['state_prefix'].str.strip().str[:2].str.upper()

# Count per state for the same offense
state_offense_count = data.groupby(['state_prefix', 'offense_type']).size().reset_index(name='count')

# Count per offense
offense_count = data['offense_type'].value_counts().reset_index()
offense_count.columns = ['offense_type', 'count']

# Save results
state_offense_count.to_csv("state_offense_count.csv", index=False)
offense_count.to_csv("offense_count.csv", index=False)

# Display the results
print("Counts per state and offense:")
print(state_offense_count)

print("\nCounts per offense:")
print(offense_count)

with pd.ExcelWriter("offense_analysis.xlsx") as writer:
    state_offense_count.to_excel(writer, sheet_name="State_Offense_Count", index=False)
    offense_count.to_excel(writer, sheet_name="Offense_Count", index=False)

print("Results saved in 'offense_analysis.xlsx'")


Counts per state and offense:
    state_prefix                              offense_type  count
0             AL                                     Arson      1
1             AL                          Assault Offenses    825
2             AL              Burglary/Breaking & Entering    391
3             AL                    Counterfeiting/Forgery     55
4             AL  Destruction/Damage/Vandalism of Property    280
..           ...                                       ...    ...
650           WV                     Prostitution Offenses    125
651           WV                                   Robbery    780
652           WV                              Sex Offenses   1246
653           WV                  Stolen Property Offenses    855
654           WV                     Weapon Law Violations    975

[655 rows x 3 columns]

Counts per offense:
                                offense_type    count
0                     Larceny/Theft Offenses  1582037
1                        

In [2]:
# Save in an xlsx for Flourish use
state_offense_count.to_excel('../../data/CLEAN/statecount.xlsx')
offense_count.to_excel('../../data/CLEAN/offensecount.xlsx')

In [3]:
#-----------------------------------------------------------------------
# For better visualization, we will use state names instead of prefixes
#-----------------------------------------------------------------------
# Dictionary mapping state prefixes to full state names
state_prefix_to_name = {
    "AL": "Alabama",
    "AK": "Alaska",
    "AZ": "Arizona",
    "AR": "Arkansas",
    "CA": "California",
    "CO": "Colorado",
    "CT": "Connecticut",
    "DE": "Delaware",
    "DC": "District of Columbia",
    "FL": "Florida",
    "GA": "Georgia",
    "HI": "Hawaii",
    "ID": "Idaho",
    "IL": "Illinois",
    "IN": "Indiana",
    "IA": "Iowa",
    "KS": "Kansas",
    "KY": "Kentucky",
    "LA": "Louisiana",
    "ME": "Maine",
    "MD": "Maryland",
    "MA": "Massachusetts",
    "MI": "Michigan",
    "MN": "Minnesota",
    "MS": "Mississippi",
    "MO": "Missouri",
    "MT": "Montana",
    "NE": "Nebraska",
    "NV": "Nevada",
    "NH": "New Hampshire",
    "NJ": "New Jersey",
    "NM": "New Mexico",
    "NY": "New York",
    "NC": "North Carolina",
    "ND": "North Dakota",
    "OH": "Ohio",
    "OK": "Oklahoma",
    "OR": "Oregon",
    "PA": "Pennsylvania",
    "RI": "Rhode Island",
    "SC": "South Carolina",
    "SD": "South Dakota",
    "TN": "Tennessee",
    "TX": "Texas",
    "UT": "Utah",
    "VT": "Vermont",
    "VA": "Virginia",
    "WA": "Washington",
    "WV": "West Virginia",
    "WI": "Wisconsin",
    "WY": "Wyoming"
} # generated by ChatGPT

data = pd.read_excel("../../data/CLEAN/statecount.xlsx")

# Map the state prefixes to full names
data['state_name'] = data['state_prefix'].map(state_prefix_to_name)

# Save the updated df
data.to_excel("../../data/CLEAN/updatedstatecount.xlsx", index=False)


### Adjustments for Our Analysis
As mentioned previously, we included additional offenses directly targeting individuals, such as:
- Larceny/Theft Offenses
- Assault Offenses
- Destruction/Damage/Vandalism of Property
- Burglary/Breaking & Entering
- Sex Offenses
- Robbery
- Arson
- Kidnapping/Abduction
- Animal Cruelty
- Homicide Offenses
- Etc.
#### Selected violent offenses from previous analysis

In [4]:
violent_offenses = ["Larceny/Theft Offenses", "Assault Offenses", "Destruction/Damage/Vandalism of Property",
                    "Burglary/Breaking & Entering", "Sex Offenses", "Robbery", "Arson", "Kidnapping/Abduction", "Homicide Offenses"]

#### Filter offense types

In [5]:
filtered_data = data[data['offense_type'].isin(violent_offenses)]
filtered_data.to_excel("../../data/CLEAN/violentstatecount.xlsx", index=False)
filtered_data = filtered_data.drop(columns=["Unnamed: 0"])
filtered_data

Unnamed: 0,state_prefix,offense_type,count,state_name
0,AL,Arson,1,Alabama
1,AL,Assault Offenses,825,Alabama
2,AL,Burglary/Breaking & Entering,391,Alabama
4,AL,Destruction/Damage/Vandalism of Property,280,Alabama
8,AL,Homicide Offenses,2,Alabama
...,...,...,...,...
645,WV,Homicide Offenses,63,West Virginia
646,WV,Kidnapping/Abduction,117,West Virginia
647,WV,Larceny/Theft Offenses,28676,West Virginia
651,WV,Robbery,780,West Virginia
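The filtering step can be checked on a handful of rows. The sketch below is only an illustration: it reuses the five Alabama rows visible in the first output of this notebook (not the full dataset), applies the same `isin` filter, and sums what remains. The names `rows`, `kept` and `state_totals` are introduced here for the example.

```python
import pandas as pd

# Five Alabama rows copied from the printed output earlier in the notebook
# (a small excerpt, not the full dataset)
rows = pd.DataFrame({
    "state_name": ["Alabama"] * 5,
    "offense_type": [
        "Arson",
        "Assault Offenses",
        "Burglary/Breaking & Entering",
        "Counterfeiting/Forgery",
        "Destruction/Damage/Vandalism of Property",
    ],
    "count": [1, 825, 391, 55, 280],
})

# Same selection as the violent_offenses list defined in the notebook
violent_offenses = [
    "Larceny/Theft Offenses", "Assault Offenses",
    "Destruction/Damage/Vandalism of Property",
    "Burglary/Breaking & Entering", "Sex Offenses", "Robbery",
    "Arson", "Kidnapping/Abduction", "Homicide Offenses",
]

# Keep only the selected offense types, then total them per state
kept = rows[rows["offense_type"].isin(violent_offenses)]
state_totals = kept.groupby("state_name", as_index=False)["count"].sum()
print(state_totals)
```

Counterfeiting/Forgery is not in the list, so it is dropped before the per-state total is computed.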


We also normalize the data to **offenses per 100,000 inhabitants** for a fairer comparison across states. As the plots further down show, some states with high offense counts have lower rates once adjusted for population.

## Normalization per capita

To compare states on an equal footing, the count of each offense was normalized using 2012 state population data from [Fact Monster](https://www.factmonster.com/us/fifty-states/state-population-rank-2012).

The resulting metric is **offenses per 100k capita**: the offense count divided by the state population, multiplied by 100,000.
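A small worked example may make the metric concrete. In this sketch the populations are the 2012 figures for Montana and Massachusetts from the population data used in this notebook, while the offense counts are hypothetical values chosen only for illustration. The helper `offenses_per_100k` is a name introduced here.

```python
def offenses_per_100k(count: int, population: int) -> float:
    """Return the offense rate per 100,000 inhabitants."""
    return count / population * 100_000


# Populations from the notebook's population data (2012)
montana_population = 1_005_141
massachusetts_population = 6_646_144

# Hypothetical offense counts, for illustration only
montana_count = 5_000
massachusetts_count = 20_000

rate_mt = offenses_per_100k(montana_count, montana_population)
rate_ma = offenses_per_100k(massachusetts_count, massachusetts_population)

print(f"Montana:       {rate_mt:.1f} per 100k")
print(f"Massachusetts: {rate_ma:.1f} per 100k")
```

With these made-up counts, the state with four times as many offenses ends up with the lower rate, which is the effect normalization is meant to expose.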

In [6]:
# Data about population from Fact Monster
population = pd.read_csv('../../data/CLEAN/state_population_data.csv')
population.sample(10)

Unnamed: 0,State,Population
27,Oklahoma,3814820
13,Massachusetts,6646144
24,Louisiana,4601893
43,Montana,1005141
48,DC,632323
26,Oregon,3899353
38,Idaho,1595728
15,Indiana,6537334
28,Connecticut,3590347
32,Kansas,2885905


#### Normalization based on population

In [7]:
# Merge violent offense data with population data based on state name
merged_data = pd.merge(
    filtered_data,
    population,
    left_on="state_name",  # Column in violent offense data
    right_on="State",      # Column in population data
    how="left"             # Keep all violent offense data
)

# Calculate normalized offenses per 100,000 people
merged_data["offenses_per_100k"] = (merged_data["count"] / merged_data["Population"]) * 100000

# Drop unnecessary columns
merged_data_cleaned = merged_data.drop(columns=["State", merged_data.columns[0]], errors="ignore")

# Save the cleaned and normalized data to a new CSV
output_file_path = "../../data/CLEAN/normalized_violent_offenses.csv"
merged_data_cleaned.to_csv(output_file_path, index=False)
merged_data_cleaned.to_excel("../../data/CLEAN/normalized_violent_offenses.xlsx" )


In [8]:
# Merge violent offense data with population data based on state name
merged_data = pd.merge(
    data,
    population,
    left_on="state_name",  # Column in violent offense data
    right_on="State",      # Column in population data
    how="left"             # Keep all violent offense data
)

# Calculate normalized offenses per 100,000 people
merged_data["offenses_per_100k"] = (merged_data["count"] / merged_data["Population"]) * 100000

# Drop unnecessary columns
merged_data_cleaned = merged_data.drop(columns=["Unnamed: 0", "State", merged_data.columns[0]], errors="ignore")

# Save the cleaned and normalized data to a new CSV
output_file_path = "../../data/CLEAN/normalized_offenses.csv"
merged_data_cleaned.to_csv(output_file_path, index=False)
merged_data_cleaned.to_excel("../../data/CLEAN/normalized_offenses.xlsx")

## Observing Trends Over Time

In [9]:
import plotly.express as px

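# NOTE: merged_data was last assigned in the previous cell, so it covers every offense type.
# Filter on violent_offenses first if only violent offenses should be counted.
just_counts = merged_data.groupby(['state_name'], as_index=False).agg({'count':'sum'})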

# Sort data for clarity
df_sorted = just_counts.sort_values('count', ascending=False)
df_sorted['count'] = df_sorted['count'].clip(lower=1e-5)

fig = px.treemap(
    df_sorted, 
    path=['state_name'], 
    values='count', 
    color='count', 
    color_continuous_scale='Bluered', 
    title='Violent Offenses by State in 2012 in the U.S.'
)

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()



- **Highest Offense Counts:** Michigan, Tennessee and Ohio report the highest violent offense counts.
- **Moderate Counts:** South Carolina, Washington and Arkansas fall in the mid-range with noticeable counts.
- **Lowest Counts:** Maine, Vermont, and Illinois report the lowest counts, as shown by their small tiles and blue color.
#### Observations
Without normalization, populous states like Michigan and Ohio dominate the visualization due to their larger populations.

In [10]:
per_capita = merged_data_cleaned.groupby(['state_name'], as_index=False).agg({'offenses_per_100k':'sum'})


# Sort data for clarity
df_sorted = per_capita.sort_values('offenses_per_100k', ascending=False)
df_sorted['offenses_per_100k'] = df_sorted['offenses_per_100k'].clip(lower=1e-5)

fig = px.treemap(
    df_sorted, 
    path=['state_name'], 
    values='offenses_per_100k', 
    color='offenses_per_100k', 
    color_continuous_scale='Bluered', 
    title='Violent Offenses Per 100k Capita by State in 2012 in the U.S.'
)

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()



- **Highest Offense Rates:** Delaware and Arkansas have the highest offenses per 100k capita.
- **Moderate Rates:** States like Michigan, Kansas, Rhode Island and South Carolina show mid-range offense rates.
- **Lowest Rates:** Alabama, Arizona, Illinois, Texas and Louisiana report the lowest offenses per 100k capita.
#### Observations
Normalization reveals significant differences in crime rates that are independent of state population and highlights states like Delaware and Arkansas, which go largely unnoticed in the previous plot.


## Deep Dive: 2012 Crime Analysis

In [None]:
import pandas as pd
import os
from scipy.stats import zscore

# Path to the folder containing the files
folder_path = "../../data/CLEAN/FBI_91_12"  # folder holding the violent offenses of each state

state_name_mapping = {
    "Vermont": "Vermont",
    "SouthCarolina": "South Carolina",
    "Texas": "Texas",
    "WestVirginia": "West Virginia",
    "Michigan": "Michigan",
    "Kentuchy": "Kentucky",  # Correct spelling error
    "Wisconsin": "Wisconsin",
    "Montana": "Montana",
    "Louisiana": "Louisiana",
    "Connecticut": "Connecticut",
    "Washington": "Washington",
    "Tennessee": "Tennessee",
    "RhodeIsland": "Rhode Island",
    "Massachusetts": "Massachusetts",
    "Arizona": "Arizona",
    "Alabama": "Alabama",
    "Pennsylvania": "Pennsylvania",
    "SouthDakota": "South Dakota",
    "Utah": "Utah",
    "NewHampshire": "New Hampshire",
    "Maine": "Maine",
    "DistrictColumbia": "DC",  # Change to "DC"
    "Nebraska": "Nebraska",
    "NorthDakota": "North Dakota",
    "Delaware": "Delaware",
    "Iowa": "Iowa",
    "Oklahoma": "Oklahoma",
    "Ohio": "Ohio",
    "Virginia": "Virginia",
    "Colorado": "Colorado",
    "Oregon": "Oregon",
    "Arkansas": "Arkansas",
    "Missouri": "Missouri",
    "Illinois": "Illinois",
    "Kansas": "Kansas",
    "Idaho": "Idaho",
} # Generated by ChatGPT to correct state names

z_score_all_states_data = []
normalized_states_data = []

# Iterate through files and process each state's data
for file in os.listdir(folder_path):
    if file.endswith("_violence.csv"):

        state_data = pd.read_csv(os.path.join(folder_path, file))
        state_abbreviation = file.split('_')[0]  # Extract state name from filename
        if state_abbreviation in state_name_mapping:
            state_name = state_name_mapping[state_abbreviation]
        else:
            print(f"Skipping file: {file} (no matching state abbreviation)")
            continue
        
        state_data['state'] = state_name  # Replace state name with the proper format
        
        # Filter for the year 2012
        state_data = state_data[state_data['year'] == 2012]
        if state_data.empty:
            # debug
            print(f"No data for 2012 in state: {state_name}")
            continue  # Skip if no data for 2012
        
        # Group by week to calculate weekly offense counts
        weekly_data = state_data.groupby('week')['incident_id'].count().reset_index()
        weekly_data['state'] = state_name
        
        # Calculate z-score for weekly offenses
        if len(weekly_data) > 1:  # Ensure enough data points for z-score
            weekly_data['z_score'] = zscore(weekly_data['incident_id'])
            z_score_all_states_data.append(weekly_data)
        else:
            print(f"Not enough data for z-score calculation in state: {state_name}")

        # Normalize offense counts by population
        state_population = population[population['State'] == state_name]['Population'].values
        if len(state_population) > 0:
            total_population = state_population[0]
            weekly_data['normalized_offenses'] = weekly_data['incident_id'] / total_population * 100000  # Per 100k capita as before
            normalized_states_data.append(weekly_data)
        else:
            print(f"No population data found for {state_name}, skipping normalization")


# Combine all states' data
if z_score_all_states_data:
    z_scores_offenses_by_week = pd.concat(z_score_all_states_data)
    print("Z-Score data processed successfully!")
else:
    print("No z-score data to combine.")
if normalized_states_data:
    normalized_offenses_by_week = pd.concat(normalized_states_data)
    print("Normalized data processed successfully!")
else:
    print("No normalized data to combine.")

Not enough data for z-score calculation in state: Pennsylvania
Z-Score data processed successfully!
Normalized data processed successfully!
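The per-state steps in the loop above (keep 2012, count incidents per week, scale per 100k) can be tried on a toy frame. This is a minimal sketch with made-up incident rows. The population is the DC figure from the population sample, and the count column is renamed to `offenses` here for readability, whereas the notebook keeps the name `incident_id`.

```python
import pandas as pd

# Hypothetical incident rows for a single state
incidents = pd.DataFrame({
    "incident_id": [1, 2, 3, 4, 5, 6],
    "year": [2011, 2012, 2012, 2012, 2012, 2012],
    "week": [52, 1, 1, 2, 2, 2],
})
state_population = 632_323  # DC figure from the population sample

# Keep 2012 only, then count incidents in each week
in_2012 = incidents[incidents["year"] == 2012]
weekly = (
    in_2012.groupby("week")["incident_id"]
    .count()
    .reset_index(name="offenses")
)

# Scale the weekly counts to offenses per 100,000 inhabitants
weekly["per_100k"] = weekly["offenses"] / state_population * 100_000
print(weekly)
```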


In [12]:
# For flourish
z_scores_offenses_by_week.to_excel("../../data/CLEAN/z_scores_offenses_by_week.xlsx" )
normalized_offenses_by_week.to_excel("../../data/CLEAN/normalized_offenses_by_week.xlsx" )

### Evolution across the year of 2012 in all states
#### Z-score

In [54]:
# Highlighted weeks with events
highlighted_weeks = {
    'Birthday of Martin Luther King, Jr.': 3,
    "Washington's Birthday/Presidents Day": 8,
    "Memorial Day": 21,
    "Independence Day": 27,
    "Labor Day": 36,
    "Veterans Day": 45,
    "Thanksgiving Day": 47,
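    'Inauguration Day': 3,  # NOTE: no presidential inauguration took place in 2012 (they were held in January 2009 and January 2013)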
    'Halloween': 44,
    'Valentine\'s Day': 7,
    'Pride Month Start': 22,
    # Major events of 2012
    'Obama Supports Same-Sex Marriage': 18,
    '2012 Presidential Election': 45,  # Week of November 6
}


# Avoid overlap
staggered_positions = [-40, -60, -80, -100, -120]  # Different vertical offsets
annotation_index = 0  # Counter to cycle through staggered positions

# Create the list of annotations
annotations = []
for event, week in highlighted_weeks.items():
    annotations.append(
        dict(
            x=week,
            y=max(z_scores_offenses_by_week['z_score']) * 1.1,  # Slightly above the max value
            text=event,
            showarrow=True,
            arrowhead=2,
            ax=0,  # No horizontal offset
            ay=staggered_positions[annotation_index % len(staggered_positions)],  # Cycle through vertical offsets
            font=dict(size=10, color="black"),
            bgcolor="yellow",
            bordercolor="black"
        )
    )
    annotation_index += 1  # Increment to use the next staggered position

# Create the line plot
fig = px.line(
    z_scores_offenses_by_week,
    x="week",
    y="z_score",
    color="state",  # One color for each state
    labels={
        "week": "Week of the Year",
        "z_score": "Z-Score of Weekly Offenses",
        "state": "State"
    },
    title="Z-Score Evolution of Violent Offenses Across Weeks in 2012 (By State)",
)

# Add hover to display week and z-score
fig.update_traces(hovertemplate="<b>Week:</b> %{x}<br><b>Z-Score:</b> %{y:.2f}")

# Add annotations and adjust layout
fig.update_layout(
    height=800,  # Increase height
    title=dict(font=dict(size=16)),  
    xaxis_title="Week of the Year", 
    yaxis_title="Z-Score of Weekly Offenses",  
    legend_title="State",  
    hovermode="closest",
    annotations=annotations  # Add the staggered annotations
)

# Display the figure
fig.show()
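The week numbers in `highlighted_weeks` can be cross-checked against calendar dates. The sketch below assumes that the `week` column follows ISO 8601 week numbering, which the notebook does not state. The `events` dictionary is introduced here and lists the 2012 dates of a few of the highlighted holidays.

```python
from datetime import date

# 2012 calendar dates of some highlighted events
events = {
    "Birthday of Martin Luther King, Jr.": date(2012, 1, 16),
    "Memorial Day": date(2012, 5, 28),
    "Independence Day": date(2012, 7, 4),
    "Labor Day": date(2012, 9, 3),
    "Halloween": date(2012, 10, 31),
    "Thanksgiving Day": date(2012, 11, 22),
}

# isocalendar() returns (ISO year, ISO week, ISO weekday)
iso_weeks = {name: day.isocalendar()[1] for name, day in events.items()}

for name, week in iso_weeks.items():
    print(f"{name}: ISO week {week}")
```

Under this assumption most entries agree with the dictionary above, but Memorial Day falls in ISO week 22 rather than 21, so it may be worth confirming how the `week` column was derived before reading too much into a one-week offset.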


#### Normalized data

In [49]:
# Avoid overlap
staggered_positions = [-40, -60, -80, -100, -120]  # Different vertical offsets
annotation_index = 0  # Counter to cycle through staggered positions

# Create the list of annotations
annotations = []
for event, week in highlighted_weeks.items():
    annotations.append(
        dict(
            x=week,
            y=max(normalized_offenses_by_week['normalized_offenses']) * 1.1,  # Slightly above the max value
            text=event,
            showarrow=True,
            arrowhead=2,
            ax=0,  # No horizontal offset
            ay=staggered_positions[annotation_index % len(staggered_positions)],  # Cycle through vertical offsets
            font=dict(size=10, color="black"),
            bgcolor="yellow",
            bordercolor="black"
        )
    )
    annotation_index += 1  # Increment to use the next staggered position

# Create the line plot
fig = px.line(
    normalized_offenses_by_week,
    x="week",
    y="normalized_offenses",
    color="state",  # One color for each state
    labels={
        "week": "Week of the Year",
        "normalized_offenses": "Offenses Per 100k Capita",
        "state": "State"
    },
    title="Normalized Evolution of Violent Offenses Across Weeks in 2012 (Per 100k Capita & by State)",
)

# Add hover template to display the week and the offenses per 100k
fig.update_traces(hovertemplate="<b>Week:</b> %{x}<br><b>Offenses:</b> %{y:.2f}")

# Add annotations and adjust layout
fig.update_layout(
    height=800,  # Increase height
    title=dict(font=dict(size=16)),
    xaxis_title="Week of the Year",
    yaxis_title="Offenses Per 100k Capita",
    legend_title="State",
    hovermode="closest",  # Highlight the closest line on hover
    annotations=annotations  # Add the staggered annotations
)

# Display the figure
fig.show()


## Analysis of Violent Offenses Across States in 2012

### Z-Score Evolution

The z-score plot shows how weekly violent offense counts deviate from the mean for each state, scaled by the standard deviation. We wanted to standardize the data to highlight anomalies and state-specific crime trends.
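As a rough illustration of the standardization, the sketch below computes z-scores by hand with NumPy, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. The two weekly series are hypothetical, and `weekly_zscore` is a helper name introduced here.

```python
import numpy as np


def weekly_zscore(counts):
    """Standardize weekly counts: subtract the mean, divide by the std (ddof=0)."""
    counts = np.asarray(counts, dtype=float)
    return (counts - counts.mean()) / counts.std()


# Hypothetical weekly counts: the second state has ten times the volume
small_state = [10, 12, 9, 15, 11]
large_state = [100, 120, 90, 150, 110]

z_small = weekly_zscore(small_state)
z_large = weekly_zscore(large_state)

print(np.round(z_small, 2))
print(np.round(z_large, 2))
```

Both series produce the same z-scores even though one state has ten times as many offenses, which is why the z-score view shows when a state deviates from its own norm but not how large its crime volume is.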

#### Observations:
- States show significant **variability** in weekly violent offense counts, with z-scores **frequently exceeding ±2**, which highlights weeks with notable crime surges or declines compared to each state's average.
- Pronounced spikes could suggest **potential localized events** or **external factors** that influence weekly crime patterns in specific states.
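- States whose weekly counts have no extreme weeks show **less pronounced spikes**, visible as flatter z-score trends (z-scores are standardized within each state, so the overall spread is the same for every state).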
- The overlapping lines make it challenging to isolate individual states but overall trends highlight nationwide fluctuations.

### Observations based on highlighted dates

1. **Birthday of Martin Luther King, Jr. (Week 3)**:
   - A subtle rise in violent offense counts in some states may be observed and could potentially be linked to increased activity around public holidays.

2. **Washington's Birthday/Presidents Day (Week 8)**:
   - No consistent spikes appear across states in 2012, which could indicate that this holiday does not significantly influence violent offenses in the U.S.

3. **Obama Supports Same-Sex Marriage (Week 18)**:
   - A notable increase in violent offenses is observed in some states, which might correlate with public reactions or protests following this announcement.

4. **Memorial Day (Week 21)**:
   - A slight uptick in z-scores for certain states.

5. **Pride Month Start (Week 22)**:
   - Certain states show higher crime deviations which could be linked to events during Pride Month.

6. **Labor Day (Week 36)**:
   - Similar to Memorial Day, a modest rise in offenses is observed.

7. **Halloween (Week 44)**:
   - Noticeable spikes in violent offenses appear in several states and could possibly be linked to increased nighttime activity or parties.

8. **Veterans Day (Week 45)**:
   - Minimal fluctuation observed, suggesting this holiday has little impact on violent offenses.

9. **Thanksgiving Day (Week 47)**:
   - A sharp decline in offenses toward this holiday in most states, potentially because family-focused holidays reduce public activity. A sharp increase is also observed the week following Thanksgiving.

#### Limitations:
- Overlapping lines make it difficult to distinguish individual state trends.
- The z-score only reflects **deviations** and not the **magnitude of crimes** which can obscure the real-world impact of high crime rates.

### Normalized Evolution (Per 100k Capita)

The normalized plot adjusts weekly violent offense counts by state population (per 100,000 inhabitants). We performed this analysis in order to allow **direct comparisons of crime rates across states** and to highlight the proportional impact of violent offenses.

#### Observations:
- States such as **Delaware**, **Arkansas**, **Tennessee** and **South Carolina** exhibit higher normalized crime rates, with peaks often exceeding 80–100 offenses per 100k capita.
- Larger states like **Texas** show steadier trends, likely due to the dilution effect of the large population.
- A **mid-year rise in offenses** is visible for certain states and could possibly be linked to seasonal factors, while declines toward the year’s end may correspond to **holiday-related trends.**
- Normalization reveals disparities in crime rates and allows us to make comparisons between states of different sizes.

#### Possible limitations:
- **Population Dependency**
  - Small population states with low absolute offenses may appear disproportionately significant after normalization.
- **Subtle Trends**
  - States with stable crime patterns might not show significant variability even if underlying dynamics are important.

### Comparison of both approaches

The **z-score plot** focuses on deviations within states, making it ideal for detecting anomalies and understanding intra-state patterns. In contrast, the **normalized plot** highlights inter-state differences by accounting for population disparities, offering a proportional view of crime rates.

The two visualizations support complementary analyses:
- The **z-score plot** is effective for identifying unusual events or weeks within states.
- The **normalized plot** can illustrate state-to-state differences and overall trends in crime rate.
