# Exploratory Data Analysis: Geospatial Crime Mapping

- State/district crime mapping


### Geospatial Risk Mapping Using FIR Data

#### Objective

- Geospatial Risk Mapping: Utilize geographic information from FIRs to create detailed maps highlighting high-risk areas, aiding in targeted law enforcement and resource allocation.


#### Analysis of Crime Data (2001-2014)

In this notebook, we will analyze crime data from 2001 to 2014 to identify spatial and temporal trends in criminal activities. The datasets include:

1. `01_District_wise_crimes_committed_IPC_2001_2012.csv`
2. `05_State_UT_wise_crimes_committed_2001_2012.csv`
3. `42_District_wise_crimes_committed_against_women_2001_2012.csv`



## Dataset Overview

| **Dataset**                                      | **Usefulness**                                                                 | **Pros**                                                                                     | **Cons**                                   |
|--------------------------------------------------|--------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|-------------------------------------------|
| `01_District_wise_crimes_committed_IPC_2001_2012.csv` | ⭐⭐⭐⭐⭐ High spatial resolution (district level), IPC crimes, multi-year          | Detailed district-level data for granular analysis                                          | Requires district boundaries (GeoJSON)    |
| `05_State_UT_wise_crimes_committed_2001_2012.csv` | ⭐⭐⭐⭐ Easier for quick mapping (state level), complete IPC/SLL overview          | Simpler for state-level analysis and visualization                                          | Less granular                             |
| `42_District_wise_crimes_committed_against_women_2001_2012.csv` | ⭐⭐⭐ Focused category (e.g., crimes against women)                               | Useful for analyzing specific crime categories                                              | Limited to specific crime types           |


## Analysis

- Use **`01_District_wise_crimes_committed_IPC_2001_2012.csv`** for detailed district-level heatmaps that show granular crime trends.
- Use **`05_State_UT_wise_crimes_committed_2001_2012.csv`** for simpler state-level heatmaps, either as a fallback or for an overview.
- Use **`42_District_wise_crimes_committed_against_women_2001_2012.csv`** to focus on crimes against women and identify high-risk areas for targeted interventions.

#### **Notebook Structure**

1. Introduction
2. Dataset Description
3. Data Preprocessing
4. Exploratory Analysis
    - Aggregate total IPC crimes per district over all years
5. Key Insights

In [None]:
from pathlib import Path
import pandas as pd

root = Path().resolve().parent
df_2001_2012 = pd.read_csv(root / 'dataset/State-wise data from 2001 is classified according to 40+factors/crime/crime/01_District_wise_crimes_committed_IPC_2001_2012.csv')
df_2013 = pd.read_csv(root / 'dataset/State-wise data from 2001 is classified according to 40+factors/crime/crime/01_District_wise_crimes_committed_IPC_2013.csv')

### Preprocessing

In [None]:
# Load the datasets through the project-relative `root` defined above
# instead of a machine-specific absolute path
data_dir = root / 'dataset/State-wise data from 2001 is classified according to 40+factors/crime/crime'
df_2001_2012 = pd.read_csv(data_dir / '01_District_wise_crimes_committed_IPC_2001_2012.csv')
df_2013 = pd.read_csv(data_dir / '01_District_wise_crimes_committed_IPC_2013.csv')

In [None]:
# Combine them into one
df_all_years = pd.concat([df_2001_2012, df_2013], ignore_index=True)

# Standardize column names
df_all_years.columns = df_all_years.columns.str.strip() # Remove leading/trailing whitespace
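
`pd.concat` aligns frames by column name, so a header that differs between the two files (extra whitespace, different case) would produce a separate, mostly empty column. The sketch below normalizes the headers before concatenating and lists any names that still differ. The two miniature frames and the `normalize_columns` helper are hypothetical stand-ins for `df_2001_2012` and `df_2013`; the real files may already agree.

```python
import pandas as pd

# Hypothetical miniature frames with deliberately inconsistent headers
df_a = pd.DataFrame({"STATE/UT": ["KERALA"], "DISTRICT": ["KOLLAM"], "YEAR": [2012], "MURDER": [10]})
df_b = pd.DataFrame({"STATE/UT": ["KERALA"], "DISTRICT ": ["KOLLAM"], "Year": [2013], "MURDER": [12]})

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are stripped and upper-cased."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.upper()
    return out

df_a_norm = normalize_columns(df_a)
df_b_norm = normalize_columns(df_b)

# Names present in only one frame; an empty set means the frames line up
mismatched = set(df_a_norm.columns) ^ set(df_b_norm.columns)
print("Columns present in only one frame:", mismatched)

combined = pd.concat([df_a_norm, df_b_norm], ignore_index=True)
print(combined)
```

Normalizing before the concat, rather than after, avoids ending up with two columns that share the same name once whitespace is stripped.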

In [None]:
# View the first few rows of the combined DataFrame
df_all_years.head()

In [None]:
# Check for nulls
df_all_years.isnull().sum()

In [None]:
# Columns
columns = df_all_years.columns.tolist()
print(columns)

In [None]:
# Aggregate total IPC crimes per district over all years
crime_totals = df_all_years.groupby(['STATE/UT', 'DISTRICT'])['TOTAL IPC CRIMES'].sum().reset_index()
crime_totals.rename(columns={'TOTAL IPC CRIMES': 'Total_Crimes'}, inplace=True)

# Result
crime_totals.head()

In [None]:
# Name of all the districts
districts = crime_totals['DISTRICT'].unique()
print("Number of unique districts:", len(districts))

# List of all the districts
districts_list = districts.tolist()
districts_list

In [None]:
# Removing districts that are not valid
invalid_districts = [
    # 🚫 Clearly Invalid Entries
    'A AND N ISLANDS', 'ANDAMAN', 'CAR', 'NICOBAR', 'TOTAL', 'ZZ TOTAL', 'DELHI UT TOTAL',
    'G.R.P.', 'G.R.P.(RLY)', 'GRP', 'GRP(RLY)', 'GRP RAIPUR',
    'RAILWAYS', 'RAILWAYS JAMMU', 'RAILWAYS KASHMIR', 'RAILWAYS KATRA', 'RAILWAYS KMR',
    'W.RLY', 'W.RLY AHMEDABAD', 'W.RLY VADODARA',
    'METRO RAIL', 'IGI AIRPORT', 'I.G.I. AIRPORT',
    'S.T.F.', 'STF',
    'CID', 'CID CRIME', 'CBCID', 'CRIME BRANCH', 'CRIME JAMMU', 'CRIME KASHMIR', 'CRIME SRINAGAR',
    'EOW', 'SPL CELL', 'CAW', 'SPL NARCOTIC', 'C.I.D.', 'R.P.O.',
    'DCP BBSR', 'DCP CTC', 'SRP(CUTTACK)', 'SRP(ROURKELA)',
    'TRAFFIC PS', 'I&P HARYANA',

    # ⚠️ RLY & Urban-specific Names
    'JAMALPUR RLY.', 'MUZAFFARPUR RLY.', 'PATNA RLY.',
    'BHOPAL RLY.', 'INDORE RLY.', 'JABALPUR RLY.',
    'DHANBAD RLY.', 'JAMSHEDPUR RLY.',
    'VIJAYAWADA RLY.', 'SECUNDERABAD RLY.',
    'MUMBAI RLY.', 'NAGPUR RLY.', 'PUNE RLY.',
    'MYSORE RURAL', 'ERNAKULAM COMMR.', 'KOLLAM COMMR.', 'THRISSUR COMMR.',

    # 🧾 Commissionerates & Zones
    'BANGALORE COMMR.', 'BANGALORE RURAL',
    'MUMBAI COMMR.', 'NAGPUR COMMR.', 'NAGPUR RURAL',
    'PUNE COMMR.', 'PUNE RURAL',
    'THANE COMMR.', 'THANE RURAL',
    'SURAT COMMR.', 'SURAT RURAL',
    'RAJKOT COMMR.', 'RAJKOT RURAL',
    'VADODARA COMMR.', 'VADODARA RURAL',
    'CP AMRITSAR', 'CP LUDHIANA', 'CP JALANDHAR',

    # 🧐 Ambiguous/Obsolete Names
    'N.C.HILLS', 'K/KUMEY', 'G.R.P', 'GARO HILLS NORTH', 'GARO HILLS SOUTH W.',
    
    # Bare directional names (not resolvable to a single district)
    'NORTH', 'SOUTH', 'EAST', 'WEST', 'CENTRAL', 'OUTER', 'NORTH-EAST', 'SOUTH-WEST'
]

# Drop rows where DISTRICT is in the invalid list
df_cleaned = df_all_years[~df_all_years['DISTRICT'].isin(invalid_districts)].reset_index(drop=True)
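
A hand-maintained list like the one above is easy to leave incomplete, so a pattern-based check can act as a second net. The following is a minimal sketch: the pattern, the `is_non_district` helper, and the sample names are illustrative assumptions and do not cover every entry in `invalid_districts`.

```python
import re

# Illustrative pattern for names that denote police units or summary rows
NON_DISTRICT_PATTERN = re.compile(
    r"""
      \bRLY\b             # railway police units such as 'PUNE RLY.' or 'W.RLY'
    | \bG\.?R\.?P\b       # Government Railway Police, with or without dots
    | \bCOMMR\b           # police commissionerates
    | \bTOTAL\b           # state/UT summary rows
    | \bCID\b             # investigation branches
    | ^(?:NORTH|SOUTH|EAST|WEST|CENTRAL|OUTER)(?:-(?:EAST|WEST))?$  # bare directions
    """,
    re.VERBOSE,
)

def is_non_district(name: str) -> bool:
    """Return True when the name looks like a police unit or a total row."""
    return NON_DISTRICT_PATTERN.search(name.upper().strip()) is not None

# Hypothetical sample of DISTRICT values
sample = ["PUNE", "PUNE RLY.", "G.R.P.", "THANE COMMR.", "ZZ TOTAL",
          "NORTH", "NORTH GOA", "CID CRIME", "LUCKNOW"]
valid = [name for name in sample if not is_non_district(name)]
print(valid)
```

The directional alternative is anchored so that a genuine district such as `NORTH GOA` is kept while a bare `NORTH` is flagged.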

In [None]:
df_cleaned.columns

In [None]:
df_cleaned.dtypes

#### Visualisation of data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 🔹 1. Top 10 States by Total IPC Crimes (2001–2014) using df_cleaned
state_crime = df_cleaned.groupby('STATE/UT')['TOTAL IPC CRIMES'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=state_crime.values, y=state_crime.index, palette="Reds_r")
plt.title("Top 10 States by Total IPC Crimes (2001–2014)")
plt.xlabel("Total Crimes")
plt.ylabel("State/UT")
plt.tight_layout()
plt.show()

In [None]:
# 🔹 2. Top 10 Districts by Total IPC Crimes (2001–2014)
# Group by state as well: names such as AURANGABAD or BILASPUR occur in more than one state
district_crime = (
    df_cleaned.groupby(['STATE/UT', 'DISTRICT'])['TOTAL IPC CRIMES']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
district_labels = [f"{district} ({state})" for state, district in district_crime.index]

plt.figure(figsize=(12, 6))
sns.barplot(x=district_crime.values, y=district_labels, palette="Blues_r")
plt.title("Top 10 Districts by Total IPC Crimes (2001–2014)")
plt.xlabel("Total Crimes")
plt.ylabel("District")
plt.tight_layout()
plt.show()

In [None]:
# Yearly trend of total IPC crimes

yearly_trend = df_cleaned.groupby('YEAR')['TOTAL IPC CRIMES'].sum()

plt.figure(figsize=(10, 5))
sns.lineplot(x=yearly_trend.index, y=yearly_trend.values, marker="o", color='darkred')
plt.title("Total IPC Crimes Over Years")
plt.xlabel("Year")
plt.ylabel("Total Crimes")
plt.grid(True)
plt.tight_layout()
plt.show()
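
The line plot shows absolute totals; the year-over-year percentage change can make accelerations and dips easier to read. A minimal sketch, using a hypothetical three-year series in place of `yearly_trend`:

```python
import pandas as pd

# Hypothetical yearly totals standing in for yearly_trend
yearly = pd.Series([100, 110, 99], index=pd.Index([2001, 2002, 2003], name="YEAR"))

# Percentage change relative to the previous year; the first year has no baseline
yoy_pct = yearly.pct_change().mul(100).round(1)
print(yoy_pct)
```

The first entry is `NaN` because there is no earlier year to compare against.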

In [None]:
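# Top 8 crime categories in the dataset
# Select crime columns by name: a positional slice such as columns[4:-1] can silently skip a category
non_crime_cols = ['YEAR', 'TOTAL IPC CRIMES', 'HEINOUS_TOTAL', 'PETTY_TOTAL']
crime_types = df_cleaned.select_dtypes(include='number').columns.difference(non_crime_cols)
# Use a separate name: `crime_totals` holds the per-district totals needed for the map merge later
top_crimes = df_cleaned[crime_types].sum().sort_values(ascending=False).head(8)

# Plot as a pie chart
plt.figure(figsize=(8, 8))
top_crimes.plot.pie(autopct='%1.1f%%', startangle=140, cmap="viridis", legend=False)
plt.title("Top 8 Crimes in the Dataset")
plt.ylabel("")  # Remove y-axis label for better aesthetics
plt.tight_layout()
plt.show()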

In [None]:
# Heinous Crimes (States)

heinous_cols = ['MURDER', 'RAPE', 'DACOITY', 'KIDNAPPING & ABDUCTION', 'ROBBERY']
df_cleaned['HEINOUS_TOTAL'] = df_cleaned[heinous_cols].sum(axis=1)

heinous_state = df_cleaned.groupby('STATE/UT')['HEINOUS_TOTAL'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=heinous_state.values, y=heinous_state.index, palette='magma')
plt.title("Top 10 States with Highest Heinous Crimes")
plt.xlabel("Total Heinous Crimes")
plt.ylabel("State/UT")
plt.tight_layout()
plt.show()

In [None]:
# Petty Crimes 

petty_cols = ['THEFT', 'BURGLARY', 'CHEATING', 'COUNTERFIETING']
df_cleaned['PETTY_TOTAL'] = df_cleaned[petty_cols].sum(axis=1)

petty_state = df_cleaned.groupby('STATE/UT')['PETTY_TOTAL'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=petty_state.values, y=petty_state.index, palette='coolwarm')
plt.title("Top 10 States with Highest Petty Crimes")
plt.xlabel("Total Petty Crimes")
plt.ylabel("State/UT")
plt.tight_layout()
plt.show()
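
Raw counts tend to favor states that record more crime overall. One possible complement is the share of heinous crimes within each state's total, sketched below with hypothetical numbers; the `HEINOUS_SHARE_PCT` column name is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical state-level totals
state_totals = pd.DataFrame(
    {
        "STATE/UT": ["STATE A", "STATE B"],
        "HEINOUS_TOTAL": [500, 300],
        "TOTAL IPC CRIMES": [10000, 2000],
    }
).set_index("STATE/UT")

# Share of heinous crimes in all IPC crimes, in percent
state_totals["HEINOUS_SHARE_PCT"] = (
    state_totals["HEINOUS_TOTAL"] / state_totals["TOTAL IPC CRIMES"] * 100
).round(1)

ranked = state_totals.sort_values("HEINOUS_SHARE_PCT", ascending=False)
print(ranked)
```

In this toy example the state with fewer heinous crimes in absolute terms ranks first by share, which is the kind of reordering a count-based chart cannot show.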

In [None]:
# # Drop conflicting columns from district_map before merging
# district_map = district_map.drop(columns=['HEINOUS_TOTAL', 'PETTY_TOTAL'], errors='ignore')

# # Perform the merge
# district_map = district_map.merge(
#     df_cleaned.groupby(['STATE/UT', 'DISTRICT'])[['HEINOUS_TOTAL', 'PETTY_TOTAL']].sum().reset_index(),
#     on=['STATE/UT', 'DISTRICT'],
#     how='left'
# )

In [None]:
import geopandas as gpd

# Load the composite GeoJSON
district_geo = gpd.read_file("/Users/ananthakrishnab/Downloads/india_district.geojson")

# Explore columns
print(district_geo.columns)
district_geo.head()

In [None]:
print(district_geo[['NAME_1', 'NAME_2', 'TYPE_2']].head())


In [None]:
# Uppercase and strip whitespaces
district_geo['DISTRICT'] = district_geo['NAME_2'].str.upper().str.strip()
crime_totals['DISTRICT'] = crime_totals['DISTRICT'].str.upper().str.strip()

In [None]:
print(crime_totals.columns)

In [None]:
district_map = district_geo.merge(crime_totals, on='DISTRICT', how='left')

In [None]:
district_map.head()

In [None]:
district_map.columns

In [None]:
print(district_map[['DISTRICT', 'Total_Crimes']].dropna().head(10))
print("***************************" * 2)
print(district_map['Total_Crimes'].isna().sum(), "districts have missing crime data")
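
The missing-value count says how many polygons found no crime record, but not which names failed to match on either side. An outer merge with `indicator=True` lists them. This is a sketch on plain DataFrames with hypothetical names; in the notebook the left side would be the GeoDataFrame.

```python
import pandas as pd

# Hypothetical district names from the boundary file and from the crime table
geo_names = pd.DataFrame({"DISTRICT": ["AGRA", "ALLAHABAD", "BARA BANKI"]})
crime = pd.DataFrame(
    {"DISTRICT": ["AGRA", "ALLAHABAD", "BARABANKI"], "Total_Crimes": [500, 700, 300]}
)

# The _merge column records whether each key came from the left, right, or both
check = geo_names.merge(crime, on="DISTRICT", how="outer", indicator=True)

only_in_geo = check.loc[check["_merge"] == "left_only", "DISTRICT"].tolist()
only_in_crime = check.loc[check["_merge"] == "right_only", "DISTRICT"].tolist()

print("Polygons without crime data:", only_in_geo)
print("Crime rows without a polygon:", only_in_crime)
```

Comparing the two lists side by side usually reveals spelling variants that refer to the same district.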

In [None]:
district_map.head()

In [None]:
# Dropping unnecessary columns
district_map = district_map.drop(columns=['NAME_1', 'NAME_2', 'TYPE_2'], errors='ignore')

In [None]:
# Checking the missing values 
district_map.isna().sum()

In [None]:
# Counting the number of rows in district_map
print("Number of rows in district_map:", len(district_map))

In [None]:
# Removing rows with missing values in district_map
district_map = district_map[district_map['Total_Crimes'].notna()]
print("Number of rows in district_map after removing missing values:", len(district_map))
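
Dropping unmatched rows also discards districts whose names differ only in spelling between the two sources. Before removing them, close matches could be suggested with the standard library's `difflib`. The names, the `suggest_matches` helper, and the `0.85` cutoff below are hypothetical choices for illustration, and any suggestion should be reviewed by hand.

```python
from difflib import get_close_matches

# Hypothetical names: polygons on one side, unmatched crime-table names on the other
geo_districts = ["BARA BANKI", "KANPUR NAGAR", "LUCKNOW"]
unmatched_crime_districts = ["BARABANKI", "KANPUR NAGER", "G.R.P."]

def suggest_matches(names, candidates, cutoff=0.85):
    """Map each name to its closest candidate, or None if nothing is similar enough."""
    suggestions = {}
    for name in names:
        close = get_close_matches(name, candidates, n=1, cutoff=cutoff)
        suggestions[name] = close[0] if close else None
    return suggestions

suggestions = suggest_matches(unmatched_crime_districts, geo_districts)
for crime_name, geo_name in suggestions.items():
    print(f"{crime_name!r} -> {geo_name!r}")
```

A fairly high cutoff keeps unrelated entries such as police units from being paired with a real district.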

In [None]:
import warnings

warnings.filterwarnings("ignore")

# Get the top 5 districts based on total crimes
top_5_districts = district_map.nlargest(5, 'Total_Crimes')

# Display the names of the top 5 districts
print("Top 5 Districts by Total Crimes:")
print(top_5_districts[['DISTRICT', 'Total_Crimes']])

# Plot the map with the top 5 districts highlighted
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
district_map.plot(column='Total_Crimes', cmap='Reds', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)

# Annotate the top 5 districts on the map
for _, row in top_5_districts.iterrows():
    district_name = row['DISTRICT']
    total_crimes = row['Total_Crimes']
    district_geometry = district_map[district_map['DISTRICT'] == district_name].geometry
    if not district_geometry.empty:
        centroid = district_geometry.centroid.iloc[0]
        # Total_Crimes is a float after the left merge, so format it as a whole number
        ax.text(centroid.x, centroid.y, f"{district_name}\n{int(total_crimes):,}", fontsize=8, color='blue')

ax.set_title('Total IPC Crimes by District (Top 5 Highlighted)')
ax.axis('off')
plt.show()
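
With a continuous color scale, a handful of very large districts can compress every other district into the palest shades. One alternative is to bin the totals into quantile classes before mapping. The sketch below uses `pd.qcut` on hypothetical totals; the four labels are an assumption for illustration.

```python
import pandas as pd

# Hypothetical district totals with one extreme value
totals = pd.Series([10, 20, 30, 40, 1000, 50, 60, 70], name="Total_Crimes")

# Four equally populated classes, so the outlier does not dominate the scale
risk_class = pd.qcut(totals, q=4, labels=["low", "moderate", "high", "very high"])

print(pd.DataFrame({"Total_Crimes": totals, "risk_class": risk_class}))
```

The resulting categorical column could then be passed as the `column` argument of the plot instead of the raw totals.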

In [None]:
top_5_districts

In [None]:
# import plotly.express as px

# # Make sure your GeoDataFrame has the crime data and district names
# fig = px.choropleth(
#     district_map,
#     geojson=district_map.geometry,
#     locations=district_map.index,
#     color='Total_Crimes',
#     hover_name='DISTRICT',  # or 'DISTRICT' based on your column
#     projection='mercator',
#     color_continuous_scale='Reds'
# )

# fig.update_geos(fitbounds="locations", visible=False)
# fig.update_layout(title_text="Total IPC Crimes by District (Interactive)", margin={"r":0,"t":30,"l":0,"b":0})
# fig.show()

In [None]:
import folium
from folium.features import GeoJsonTooltip

# Give every feature an explicit string key so the data can be joined to the geometry
folium_df = district_map.reset_index(drop=True)
folium_df['feature_key'] = folium_df.index.astype(str)

# Create base map
m = folium.Map(location=[22.5, 80], zoom_start=5, tiles="cartodbpositron")

# Add choropleth
folium.Choropleth(
    geo_data=folium_df,
    data=folium_df,
    columns=['feature_key', 'Total_Crimes'],  # [key column, value column]
    key_on='feature.properties.feature_key',
    fill_color='Reds',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Total IPC Crimes'
).add_to(m)

# Add tooltip
tooltip = GeoJsonTooltip(fields=['DISTRICT', 'Total_Crimes'], aliases=['District:', 'Total Crimes:'])
folium.GeoJson(folium_df, tooltip=tooltip).add_to(m)

# Save the map
m.save("/Users/ananthakrishnab/Desktop/Projects/Community Risk Profiling Using FIR Data/output/map_images/interactive_crime_map.html")

In [None]:
# Ensure names are standardized
df_cleaned['DISTRICT'] = df_cleaned['DISTRICT'].str.upper().str.strip()
df_cleaned['STATE/UT'] = df_cleaned['STATE/UT'].str.upper().str.strip()
district_map['DISTRICT'] = district_map['DISTRICT'].str.upper().str.strip()
district_map['STATE/UT'] = district_map['STATE/UT'].str.upper().str.strip()

# Aggregate Heinous and Petty totals by district
agg_crime_types = df_cleaned.groupby(['STATE/UT', 'DISTRICT'])[['HEINOUS_TOTAL', 'PETTY_TOTAL']].sum().reset_index()

# Merge into district_map
district_map = district_map.merge(
    agg_crime_types,
    on=['STATE/UT', 'DISTRICT'],
    how='left'
)

# ✅ Confirm merge success
print(district_map[['DISTRICT', 'HEINOUS_TOTAL', 'PETTY_TOTAL']].dropna().head())
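
The merge above keys on both `STATE/UT` and `DISTRICT`. The reason is that some district names are shared across states, so grouping or merging on the name alone would pool unrelated districts. A small sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical counts; AURANGABAD exists in both Bihar and Maharashtra
records = pd.DataFrame(
    {
        "STATE/UT": ["BIHAR", "MAHARASHTRA", "BIHAR"],
        "DISTRICT": ["AURANGABAD", "AURANGABAD", "PATNA"],
        "HEINOUS_TOTAL": [100, 250, 400],
    }
)

# Name-only grouping pools the two AURANGABAD districts into one figure
by_name_only = records.groupby("DISTRICT")["HEINOUS_TOTAL"].sum()

# The composite key keeps them separate
by_state_and_name = records.groupby(["STATE/UT", "DISTRICT"])["HEINOUS_TOTAL"].sum()

print(by_name_only)
print(by_state_and_name)
```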

In [None]:
import plotly.express as px

fig_heinous = px.choropleth(
    district_map,
    geojson=district_map.geometry.__geo_interface__,  # Important for proper serialization
    locations=district_map.index,
    color='HEINOUS_TOTAL',
    hover_name='DISTRICT',
    projection='mercator',
    color_continuous_scale='OrRd',
    labels={'HEINOUS_TOTAL': 'Heinous Crimes'}
)

fig_heinous.update_geos(fitbounds="locations", visible=False)
fig_heinous.update_layout(
    title_text="🔴 Heinous Crimes by District (2001–2014)",
    margin={"r":0,"t":40,"l":0,"b":0}
)
fig_heinous.show()

In [None]:
import plotly.express as px

fig_petty = px.choropleth(
    district_map,
    geojson=district_map.geometry.__geo_interface__,
    locations=district_map.index,
    color='PETTY_TOTAL',
    hover_name='DISTRICT',
    projection='mercator',
    color_continuous_scale='Blues',
    labels={'PETTY_TOTAL': 'Petty Crimes'}
)

fig_petty.update_geos(fitbounds="locations", visible=False)
fig_petty.update_layout(
    title_text="🔵 Petty Crimes by District (2001–2014)",
    margin={"r":0,"t":40,"l":0,"b":0}
)
fig_petty.show()

In [None]:
# Standardize names again just in case
df_cleaned['STATE/UT'] = df_cleaned['STATE/UT'].str.upper()
district_map['STATE/UT'] = district_map['STATE/UT'].str.upper()

# Filter to only Uttar Pradesh
df_up = df_cleaned[df_cleaned['STATE/UT'] == 'UTTAR PRADESH']
district_map_up = district_map[district_map['STATE/UT'] == 'UTTAR PRADESH'].copy()

In [None]:
agg_crime_types_up = df_up.groupby(['STATE/UT', 'DISTRICT'])[['HEINOUS_TOTAL', 'PETTY_TOTAL']].sum().reset_index()

# Merge into UP map; the suffix keeps the freshly aggregated columns distinct
district_map_up = district_map_up.merge(
    agg_crime_types_up,
    on=['STATE/UT', 'DISTRICT'],
    how='left',
    suffixes=('', '_agg')
)

# Overwrite the existing totals with the freshly aggregated UP values
district_map_up['HEINOUS_TOTAL'] = district_map_up['HEINOUS_TOTAL_agg']
district_map_up['PETTY_TOTAL'] = district_map_up['PETTY_TOTAL_agg']

# Drop the temporary '_agg' columns
district_map_up = district_map_up.drop(columns=['HEINOUS_TOTAL_agg', 'PETTY_TOTAL_agg'], errors='ignore')

In [None]:
fig_heinous_up = px.choropleth(
    district_map_up,
    geojson=district_map_up.geometry.__geo_interface__,
    locations=district_map_up.index,
    color='HEINOUS_TOTAL',
    hover_name='DISTRICT',
    projection='mercator',
    color_continuous_scale='OrRd',
    labels={'HEINOUS_TOTAL': 'Heinous Crimes'}
)

fig_heinous_up.update_geos(fitbounds="locations", visible=False)
fig_heinous_up.update_layout(
    title_text="🔴 Heinous Crimes by District in Uttar Pradesh (2001–2014)",
    margin={"r": 0, "t": 40, "l": 0, "b": 0}
)
fig_heinous_up.show()

In [None]:
# Filter GeoDataFrame for Uttar Pradesh
up_geo = district_geo[district_geo['NAME_1'].str.upper().str.strip() == 'UTTAR PRADESH']

# Clean crime_totals to match UP districts only
up_crime_totals = crime_totals[crime_totals['STATE/UT'].str.upper().str.strip() == 'UTTAR PRADESH']

# Merge for UP-specific mapping
up_map = up_geo.merge(up_crime_totals, on='DISTRICT', how='left')

In [None]:
# IPC Crimes by District in UP

fig_up = px.choropleth(
    up_map,
    geojson=up_map.geometry,
    locations=up_map.index,
    color='Total_Crimes',
    hover_name='DISTRICT',
    projection='mercator',
    color_continuous_scale='OrRd'
)

fig_up.update_geos(fitbounds="locations", visible=False)
fig_up.update_layout(title_text="IPC Crimes by District in Uttar Pradesh", margin={"r":0,"t":30,"l":0,"b":0})
fig_up.show()

In [None]:
import folium
from folium.features import GeoJsonTooltip

# Reproject to projected CRS (Web Mercator)
up_geo_proj = up_geo.to_crs(epsg=3857)

# Calculate safe centroid
up_center = up_geo_proj.geometry.centroid.to_crs(epsg=4326).unary_union.centroid

# Give every feature an explicit string key so the data can be joined to the geometry
up_folium_df = up_map.reset_index(drop=True)
up_folium_df['feature_key'] = up_folium_df.index.astype(str)

# Base map centered on Uttar Pradesh
m_up = folium.Map(location=[up_center.y, up_center.x], zoom_start=6, tiles="cartodbpositron")

# Add choropleth layer
folium.Choropleth(
    geo_data=up_folium_df,
    data=up_folium_df,
    columns=['feature_key', 'Total_Crimes'],  # [key column, value column]
    key_on='feature.properties.feature_key',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Total IPC Crimes (Uttar Pradesh)'
).add_to(m_up)

# Add tooltips
tooltip_up = GeoJsonTooltip(fields=['DISTRICT', 'Total_Crimes'], aliases=['District:', 'Crimes:'])
folium.GeoJson(up_folium_df, tooltip=tooltip_up).add_to(m_up)

# Save the map
m_up.save("/Users/ananthakrishnab/Desktop/Projects/Community Risk Profiling Using FIR Data/output/map_images/up_interactive_map.html")