In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('SHR65_23.csv')
df.head()

In [None]:
df['State'].value_counts()

In [None]:
print(df.info())
print(df.describe())
print(df.shape)

In [None]:
df.isnull().sum()

In [None]:
df_new = df.drop(['Subcircum', 'FileDate'], axis='columns')
df_new.isnull().sum()

In [None]:
df_new.dropna(inplace=True)
df_new.isnull().sum()

In [None]:
df_new.shape

In [None]:
df.columns

In [None]:
df_new['VicCount'].value_counts()

In [None]:
df_new['Homicide'].value_counts()

In [None]:
df_new.info()

In [None]:
df['MSA'].value_counts()

In [None]:
df['Weapon'].value_counts()

In [None]:
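# Name the column explicitly; newer seaborn versions treat the first positional argument as `data`
sns.countplot(data=df_new, x='Weapon')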

In [None]:
df['Situation'].value_counts()

In [None]:
df['VicCount'].value_counts()

In [None]:
df_new['VicSex'].value_counts()

In [None]:
# 1. Start with the cleaned df_new, keeping only the selected features
cols = ['State', 'Circumstance', 'Homicide', 'Weapon', 'Relationship', 'Situation', 'Solved']
df_state = df_new[cols].copy()

# 2. One-hot encode categorical features; this matters because PCA and KMeans need numeric inputs
df_encoded = pd.get_dummies(df_state.drop(columns='State'), drop_first=True)

# 3. Group by state and compute means (proportion of each feature per state)
df_grouped = pd.concat([df_state['State'], df_encoded], axis=1).groupby('State').mean()

# 4. Scale the aggregated state-level data
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(df_grouped)

# 5. Apply KMeans to the states
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
df_grouped['Cluster'] = kmeans.fit_predict(X_scaled)

# 6. See which cluster each state falls into
print(df_grouped['Cluster'].sort_values())

from sklearn.decomposition import PCA

# --- 1. PCA for Visualization ---
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(12, 8))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=df_grouped['Cluster'], palette='tab10', s=100)

# Label each state on the plot
for i, state in enumerate(df_grouped.index):
    plt.text(X_pca[i, 0] + 0.1, X_pca[i, 1], state, fontsize=8)

plt.title('PCA Projection of State Clusters')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.legend(title='Cluster', loc='best')
plt.tight_layout()
plt.show()


# --- Working with cluster summaries ---

# --- 2. Cluster Counts ---
print("\n--- Cluster Counts (Number of States per Cluster) ---")
print(df_grouped['Cluster'].value_counts())

# --- 3. Cluster Centers (in scaled feature space) ---
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=df_grouped.columns[:-1])
print("\n--- Cluster Centers (Scaled Feature Space) ---")
print(cluster_centers)

# --- 4. Cluster Feature Summary (Means in original units) ---
# One-hot encoded data, aggregated by state
df_encoded_with_state = pd.concat([df_state['State'], df_encoded], axis=1)
df_grouped_unscaled = df_encoded_with_state.groupby('State').mean()

# Add the cluster assignments back (aligned on the State index)
df_grouped_unscaled['Cluster'] = df_grouped['Cluster']

# Now compute the mean of each feature per cluster
cluster_summary = df_grouped_unscaled.groupby('Cluster').mean()

print("\n--- Feature Means per Cluster (Proportions) ---")
print(cluster_summary)
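
The choice of `n_clusters=4` above is not justified in the notebook. One common sanity check, which is not part of the original analysis, is to compare silhouette scores across several values of k. The sketch below assumes a numeric matrix like `X_scaled`; the helper name `silhouette_by_k` and the toy data are hypothetical, and `n_init=10` is set explicitly so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values=range(2, 8), random_state=42):
    """Fit KMeans for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Toy stand-in for X_scaled: three well-separated blobs (illustrative data only)
rng = np.random.default_rng(0)
X_demo = np.vstack(
    [rng.normal(loc=center, scale=0.3, size=(20, 2)) for center in (-3, 0, 3)]
)

demo_scores = silhouette_by_k(X_demo)
best_k = max(demo_scores, key=demo_scores.get)
print(demo_scores)
print("k with the highest silhouette score:", best_k)
```

In the notebook, calling `silhouette_by_k(X_scaled)` would show whether 4 clusters separates the states better than nearby values of k. With only 51 rows the scores can be noisy, so they are a guide rather than a verdict.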


The PCA projection gives us a lot to dissect. Most states fall into cluster 1 (30 states), followed by cluster 3 (16), cluster 0 (4), and cluster 2 with a single state, Arizona, which suggests an unusual (outlier) pattern. The values in the cluster-centers table are z-scores: they show how far above or below the across-state mean a feature is for each cluster. A value of 0 means the feature is at the average, a positive value means higher than average, and a negative value means lower than average. The PCA axes themselves are combinations of these standardized features rather than z-scores of any single feature. Now let's look at each cluster to note any patterns.
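
To make the z-score reading concrete, here is a minimal sketch on made-up numbers (not taken from the dataset). It standardizes a small table of state-level proportions the same way `StandardScaler` is used above, then fits a two-component PCA to show how much variance each axis carries and which features load on it. The feature names `Relationship_Stranger` and `Weapon_Knife` and the state labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy state-level proportions (illustrative values only)
toy = pd.DataFrame(
    {
        "Solved_Yes": [0.90, 0.70, 0.80, 0.60, 0.75],
        "Relationship_Stranger": [0.05, 0.20, 0.10, 0.25, 0.15],
        "Weapon_Knife": [0.12, 0.10, 0.15, 0.11, 0.13],
    },
    index=["State A", "State B", "State C", "State D", "State E"],
)

# Standardize: each column ends up with mean 0 and population std 1
z = pd.DataFrame(
    StandardScaler().fit_transform(toy), index=toy.index, columns=toy.columns
)
print(z.round(2))  # positive = above the across-state mean, negative = below

# PCA on the standardized data: axes are weighted mixes of the features
pca = PCA(n_components=2).fit(z)
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
explained = pca.explained_variance_ratio_
print(loadings.round(2))
print("Share of variance per component:", explained.round(2))
```

Applied to the notebook's own `pca` object, `pca.explained_variance_ratio_` indicates how much of the state-to-state variation the 2D chart actually shows.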

1. Cluster 0: 4 states (Idaho, Maine, Montana, West Virginia)

a. Key Patterns: It has the highest proportion of solved homicides (Solved_Yes ~ 90%) and a high share of single-victim/single-offender cases (66%). Urban crime types such as “multiple unknown offenders,” “stranger relationships,” and felonies are low, while domestic relationships (wife, husband, family ties) are high, which is common in rural, low-population states.

Insights: These are mostly rural states with solvable, domestic-related homicides and low crime complexity.

2. Cluster 1: (30 states)

a. Key Patterns: It has the lowest solved rate (71%), along with high rates of unknown-offender cases and homicides involving strangers. These are more complex, urban homicides in high-population states.

Insights: These (high population) states represent diverse, urban-driven homicide patterns where crimes are harder to solve.

3. Cluster 2: (1 state, Arizona)

a. Key Patterns: It has high z-scores for felony-related homicides, arguments over money, and stranger-related homicides. It also has a low solved rate and high rates of multiple-victim cases and unknown offenders.

Insights: Arizona shows an extreme outlier pattern, especially in felony circumstances and argument-based killings. What could've caused this?

4. Cluster 3: (16 states)

a. Key Patterns: It has an above-average solved rate (85%) and high family or known-offender rates (wife, boyfriend, etc.). There is also an interesting elevation in child-related homicide indicators (babysitter, family), pointing to a more generalized mixture of motives.

Insights: These states form a moderate cluster. They may reflect balanced investigative systems and a mix of urban and rural crime.
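
To follow up on what sets a single-state cluster such as Arizona apart, one option is to rank that state's standardized features by absolute z-score. This is a sketch under assumptions: the helper `top_deviations` is hypothetical, and in the notebook the input would be something like `pd.DataFrame(X_scaled, index=df_grouped.index, columns=df_grouped.columns[:-1])` (dropping the trailing `Cluster` column). The demo below uses made-up values.

```python
import pandas as pd


def top_deviations(z_scores: pd.DataFrame, state: str, n: int = 5) -> pd.Series:
    """Return the n features where `state` is furthest from the across-state mean."""
    row = z_scores.loc[state]
    # Sort by magnitude but keep the sign so the direction stays visible
    order = row.abs().sort_values(ascending=False).index
    return row.loc[order].head(n)


# Toy z-score table (illustrative values only)
z_demo = pd.DataFrame(
    {
        "feature_1": [0.1, -0.2, 3.0],
        "feature_2": [0.5, -0.4, -2.5],
        "feature_3": [0.0, 0.3, 0.2],
    },
    index=["State A", "State B", "State C"],
)

result = top_deviations(z_demo, "State C", n=2)
print(result)
```

A state whose largest deviations rest on very few records for a category may be an artifact of reporting rather than a real pattern, so it is worth checking the underlying counts as well.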

Conclusions

Based on the above analysis, low-population states with stronger social ties appear to have more solvable, domestic homicides, while densely populated states show more violence by unknown offenders, which goes along with unsolved crimes, stranger homicides, and felony situations. Medium-population states have a balanced mix of domestic and unknown-offender cases.
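
The link between anonymity and unsolved cases can be probed directly at the state level by correlating two columns of the grouped proportions. The sketch below uses made-up numbers; the helper `state_level_correlation` and the column name `Relationship_Stranger` are assumptions about how `pd.get_dummies` labels the features, so the real column names should be checked in `df_grouped.columns` first. Population itself is not in the feature set, so this does not test the population part of the conclusion.

```python
import pandas as pd


def state_level_correlation(grouped: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Pearson correlation between two state-level proportion columns."""
    return grouped[col_a].corr(grouped[col_b])


# Toy state-level proportions (illustrative values only)
grouped_demo = pd.DataFrame(
    {
        "Solved_Yes": [0.90, 0.70, 0.80, 0.60],
        "Relationship_Stranger": [0.05, 0.20, 0.10, 0.25],
    },
    index=["State A", "State B", "State C", "State D"],
)

r = state_level_correlation(grouped_demo, "Solved_Yes", "Relationship_Stranger")
print(f"Correlation between solved share and stranger share: {r:.2f}")
```

A strongly negative value on the real data would be consistent with the conclusion above, though with 51 observations and no controls it would not establish cause.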

Group commentary: Just by looking at the Kepler.gl map and the PCA visualization, it's clear that the states cluster (we can see the separation). Relating this back to our area of interest and purpose, the cluster summary suggests that states in or near highly populated areas tend to have more diverse cases with high anonymity (stranger homicides and unknown offenders), while states in or near sparsely populated areas have stronger social ties, leading to more solvable domestic crimes.

Here is what we'd like to convey to police officers and state defenders. For the outlier state, deploy specialized task forces for gang violence, trafficking, or serial offenses. Also push for more forensic transparency: we already saw that higher unsolved rates may mean that the evidence is weak and that forensic efforts need more investment.

For cluster 3 in particular, since there are elevated rates of arson, children killed by babysitters, and stepfamily-related cases, we would advise police officers to coordinate with CPS and fire departments during investigations, and to get more training in recognizing nuanced abuse in families. Also, request psychological assessments for defendants in domestic or child-involved homicides, especially where abuse cycles may be present.

Additionally, for the high-solve-rate (domestic-focused) states, leverage prior restraining orders or 911 call patterns to preempt escalation.

In [None]:
import plotly.express as px
import pandas as pd

# Data: states and their corresponding cluster numbers
data = {
    'State': [
        'Idaho', 'Maine', 'Montana', 'West Virginia', 'Delaware', 'Georgia', 'Connecticut', 'Alabama',
        'Indiana', 'Illinois', 'Florida', 'Maryland', 'Kansas', 'Kentucky', 'District of Columbia',
        'Arkansas', 'New Mexico', 'Mississippi', 'Missouri', 'Nevada', 'New Jersey', 'Massachusetts',
        'Michigan', 'Louisiana', 'Rhode Island', 'Tennessee', 'Texas', 'Pennsylvania', 'North Carolina',
        'Ohio', 'New York', 'California', 'Virginia', 'South Carolina', 'Arizona', 'Minnesota',
        'Colorado', 'Alaska', 'Hawaii', 'Iowa', 'New Hampshire', 'Nebraska', 'North Dakota', 'Oklahoma',
        'South Dakota', 'Utah', 'Vermont', 'Oregon', 'Washington', 'Wisconsin', 'Wyoming'
    ],
    'Cluster': [
        0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
    ]
}

# Creating DataFrame (note: this rebinds `df`, replacing the raw dataset loaded earlier)
df = pd.DataFrame(data)

# Ensure that the 'Cluster' column is treated as categorical data
df['Cluster'] = df['Cluster'].astype(str)


# Lat/lon centroids for each state
state_coords = {
    'Alabama': (32.806671, -86.791130),
    'Alaska': (61.370716, -152.404419),
    'Arizona': (33.729759, -111.431221),
    'Arkansas': (34.969704, -92.373123),
    'California': (36.116203, -119.681564),
    'Colorado': (39.059811, -105.311104),
    'Connecticut': (41.597782, -72.755371),
    'Delaware': (39.318523, -75.507141),
    'District of Columbia': (38.897438, -77.026817),
    'Florida': (27.766279, -81.686783),
    'Georgia': (33.040619, -83.643074),
    'Hawaii': (21.094318, -157.498337),
    'Idaho': (44.240459, -114.478828),
    'Illinois': (40.349457, -88.986137),
    'Indiana': (39.849426, -86.258278),
    'Iowa': (42.011539, -93.210526),
    'Kansas': (38.526600, -96.726486),
    'Kentucky': (37.668140, -84.670067),
    'Louisiana': (31.169546, -91.867805),
    'Maine': (44.693947, -69.381927),
    'Maryland': (39.063946, -76.802101),
    'Massachusetts': (42.230171, -71.530106),
    'Michigan': (43.326618, -84.536095),
    'Minnesota': (45.694454, -93.900192),
    'Mississippi': (32.741646, -89.678696),
    'Missouri': (38.456085, -92.288368),
    'Montana': (46.921925, -110.454353),
    'Nebraska': (41.125370, -98.268082),
    'Nevada': (38.313515, -117.055374),
    'New Hampshire': (43.452492, -71.563896),
    'New Jersey': (40.298904, -74.521011),
    'New Mexico': (34.840515, -106.248482),
    'New York': (42.165726, -74.948051),
    'North Carolina': (35.630066, -79.806419),
    'North Dakota': (47.528912, -99.784012),
    'Ohio': (40.388783, -82.764915),
    'Oklahoma': (35.565342, -96.928917),
    'Oregon': (44.572021, -122.070938),
    'Pennsylvania': (40.590752, -77.209755),
    'Rhode Island': (41.680893, -71.511780),
    'South Carolina': (33.856892, -80.945007),
    'South Dakota': (44.299782, -99.438828),
    'Tennessee': (35.747845, -86.692345),
    'Texas': (31.054487, -97.563461),
    'Utah': (40.150032, -111.862434),
    'Vermont': (44.045876, -72.710686),
    'Virginia': (37.769337, -78.169968),
    'Washington': (47.400902, -121.490494),
    'West Virginia': (38.491226, -80.954570),
    'Wisconsin': (44.268543, -89.616508),
    'Wyoming': (42.755966, -107.302490)
}

# Add lat/lon to your dataframe
df["Latitude"] = df["State"].map(lambda x: state_coords[x][0])
df["Longitude"] = df["State"].map(lambda x: state_coords[x][1])

# Save to CSV for Kepler
df.to_csv("state_clusters_for_kepler.csv", index=False)
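
The state-to-cluster table above is typed in by hand, which can drift out of sync if the clustering is rerun. As a sketch, the same table could be built from the cluster assignments themselves. The helper `build_kepler_frame` is hypothetical; in the notebook it would be called with `df_grouped['Cluster']` and `state_coords`. It raises an error for any state without coordinates, since state spellings in the CSV may differ from the dictionary keys.

```python
import pandas as pd


def build_kepler_frame(clusters: pd.Series, coords: dict) -> pd.DataFrame:
    """Turn a state-indexed Series of cluster labels into a Kepler-ready table."""
    out = clusters.rename("Cluster").rename_axis("State").reset_index()

    # Fail early if a state name has no matching coordinate entry
    missing = sorted(set(out["State"]) - set(coords))
    if missing:
        raise KeyError(f"No coordinates for: {missing}")

    out["Cluster"] = out["Cluster"].astype(str)  # treat labels as categories
    out["Latitude"] = out["State"].map(lambda s: coords[s][0])
    out["Longitude"] = out["State"].map(lambda s: coords[s][1])
    return out


# Small demo using two entries copied from the table above
clusters_demo = pd.Series([0, 2], index=["Idaho", "Arizona"])
coords_demo = {
    "Idaho": (44.240459, -114.478828),
    "Arizona": (33.729759, -111.431221),
}

kepler_demo = build_kepler_frame(clusters_demo, coords_demo)
print(kepler_demo)
```

The result can be written out the same way as before, for example `build_kepler_frame(df_grouped['Cluster'], state_coords).to_csv("state_clusters_for_kepler.csv", index=False)`.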