# Predictive Cluster Models for Crime Victim Analysis:
## A. C. Coffin 
### 11/10/2023
### Northwestern Missouri State University
### MS Data Analytics Capstone
---
## Introduction:
This section demonstrates two clustering algorithms, K-Means and DBSCAN, applied to crime victim data gathered from the NYPD and the NCVS Dashboard. The objective is to determine which model is most successful at predictive clustering on both data sets, and then to begin a process of incremental learning, in which the model trained on the NCVS data is subsequently fitted to the NYPD data. Because crime data is extremely complex, incremental learning is used to address the inconsistencies that are common in it.
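
As a rough illustration of the incremental idea described above, the sketch below updates one clustering model with two batches in sequence. It rests on several assumptions that are not from this project: scikit-learn's plain `KMeans` has no `partial_fit` method, so `MiniBatchKMeans` is swapped in; the two arrays are synthetic stand-ins for the national and regional data; and the cluster count of 3 is arbitrary.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): three scaled features per row,
# one batch for the "national" source and one for the "regional" source.
national = rng.normal(loc=0.0, size=(200, 3))
regional = rng.normal(loc=0.5, size=(100, 3))

# Fit the scaler on the first batch and reuse it so both batches share one scale.
scaler = StandardScaler().fit(national)

# MiniBatchKMeans is used here because it supports partial_fit.
model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)
model.partial_fit(scaler.transform(national))
centers_before = model.cluster_centers_.copy()

# Second pass: update the same centroids with the second batch.
model.partial_fit(scaler.transform(regional))
shift = np.linalg.norm(model.cluster_centers_ - centers_before, axis=1)

# Assign the second batch to the updated clusters.
labels = model.predict(scaler.transform(regional))
print("centroid shift per cluster:", shift)
```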
---
## Importing packages:

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import warnings
warnings.filterwarnings("default")

# Pre-Processing:
All of the data within these crime sets is categorical in nature and, as a result, was processed differently. Part of pre-processing was grouping crime types together and removing those that did not apply to specific models. Within the NYPD dataset, crimes were grouped by general type and severity. The grouping of key codes is explained in the following table, along with their offense descriptions. The data pulled from the NCVS pertains mainly to Personal Crimes, such as sexual assault, robbery, aggravated assault and simple assault. To make the NYPD data compatible with the incremental learning step, these crimes were grouped and sorted a second time based on the types found in the NCVS data. Data pertaining to age groups had already been separated, with an adjustment made to the NCVS data to group minors together, as that data set breaks the younger age groups apart in greater depth. All null values for age found within the NYPD data have been removed; they had been included in the analysis to visualize the amount of known data in comparison to the unknown within the filtered set.

## Modifications Made to both NCVS and NYPD Data:
Please see the following tables for the modified ML sets. Each of these sets was exported from the SQL server independently and labeled ML for this specific section of the project.

Modifications to Age Groups within the NCVS data are as follows:
|AGE_GM|NCVS Age Groups|New Range|
|:---|:---|:---|
|1 |12-14, 15-17| <18|
|2 |18-20, 21-24| 18-25|
|3 |25-34, 35-49| 25-49|
|4 |50-64| 50-64|
|5 | 65+ | 65+|
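
The regrouping in the table above can be sketched in pandas as a simple lookup. This is only an illustration: the actual regrouping was done before the sets were exported from SQL, and the column name `age_group`, the toy rows and the victim counts below are assumptions made up for the example.

```python
import pandas as pd

# Lookup from the original NCVS age brackets to the AGE_GM code,
# following the table above.
AGE_GM_MAP = {
    "12-14": 1, "15-17": 1,
    "18-20": 2, "21-24": 2,
    "25-34": 3, "35-49": 3,
    "50-64": 4,
    "65+": 5,
}

# Toy rows (assumption): 'age_group' is a hypothetical column name.
raw = pd.DataFrame({
    "age_group": ["12-14", "15-17", "21-24", "35-49", "65+"],
    "vic_num": [10, 5, 20, 30, 40],
})
raw["age_gm"] = raw["age_group"].map(AGE_GM_MAP)

# Brackets that now share a code are collapsed by summing their victim counts.
grouped = raw.groupby("age_gm", as_index=False)["vic_num"].sum()
print(grouped)
```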

Modifications to NYPD data as NCIC Codes and CT_M:
|NCIC|KY_CD|OFNS_DESC|CT_M|CT_M Meaning|
|:---|:---|:---|:---|:---|
|1011 |104, 115, 116, 235|Sex Crimes| 2| Violent Crime Except Simple Assault|
|1201 |105, 107, 109-113, 231, 313, 340-343| Robbery/Fraud| 2| Violent Crime Except Simple Assault|
|1301 |101, 103, 106, 114| Homicide/Aggravated Assault| 2| Violent Crime Except Simple Assault|
|1313 |344, 578, 230, 355| Simple Assault/Related Crimes| 1| Simple Assault|
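
One way to apply the grouping in the table above is to expand it into a `KY_CD` to `NCIC` lookup. The sketch below is illustrative only: the helper `expand`, the toy complaint rows and the code 999 are hypothetical, and the project itself performed this step before exporting from SQL.

```python
import pandas as pd

def expand(*parts):
    """Flatten single codes and inclusive (start, end) ranges into one list."""
    codes = []
    for part in parts:
        if isinstance(part, tuple):
            codes.extend(range(part[0], part[1] + 1))
        else:
            codes.append(part)
    return codes

# NCIC group -> NYPD key codes, copied from the table above.
NCIC_GROUPS = {
    1011: expand(104, 115, 116, 235),
    1201: expand(105, 107, (109, 113), 231, 313, (340, 343)),
    1301: expand(101, 103, 106, 114),
    1313: expand(344, 578, 230, 355),
}
# NCIC group -> CT_M (1 = Simple Assault, 2 = Violent Crime Except Simple Assault).
CT_M = {1011: 2, 1201: 2, 1301: 2, 1313: 1}

# Invert to a flat KY_CD -> NCIC lookup.
KY_TO_NCIC = {ky: ncic for ncic, kys in NCIC_GROUPS.items() for ky in kys}

# Toy rows (assumption); 999 stands in for any code outside the table.
complaints = pd.DataFrame({"KY_CD": [104, 111, 344, 999]})
complaints["ncic"] = complaints["KY_CD"].map(KY_TO_NCIC)

# Codes that are not in the table are dropped, mirroring the removal of
# Public or Society Crimes.
complaints = complaints.dropna(subset=["ncic"]).astype({"ncic": int})
complaints["ct_m"] = complaints["ncic"].map(CT_M)
print(complaints)
```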

Data pertaining to Public or Society Crimes, such as driving under the influence, traffic violations, child abandonment, possession of a deadly weapon or drug possession, have been removed from this data. This was done because the NCVS data pulled pertains only to Personal Crimes. This data was used both in the initial run of the models and again when fitted to the selected NCVS model. The only difference is that when the NYPD data was fitted to the NCVS model, the NCIC model was used. This was to demonstrate the differences between the sets as a whole.

---
# NCVS Baseline:
Because this analysis relies heavily on comparing national data to regional data, both to explore crime in America and to demonstrate a possible application of machine learning, creating a baseline is critical. For this section each of the models has been run three times on the pre-processed NCVS data. Two major data sets are used: the first is the NCVS_RegionalML data and the second is the NCVS_AgeSegML data.

## Importing NCVS Data:
All data for each of the sets being analyzed has been brought in independently to decrease processing load.


In [22]:
NCVS_AgeSeg = pd.read_csv('Data/ML_PreProcess/NCVS_AgeSegML.csv')
NCVS_Region = pd.read_csv('Data/ML_PreProcess/NCVS_RegionSegML.csv')


In [20]:
NCVS_AgeSeg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   rpt_dt   960 non-null    int64
 1   age_gm   960 non-null    int64
 2   ncic     960 non-null    int64
 3   vic_num  960 non-null    int64
dtypes: int64(4)
memory usage: 30.1 KB


In [23]:
NCVS_Region.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   rpt_dt    432 non-null    int64
 1   region_m  432 non-null    int64
 2   ncic      432 non-null    int64
 3   vic_num   432 non-null    int64
dtypes: int64(4)
memory usage: 13.6 KB


In [28]:
# File overview
NCVS_AgeSeg.head(2)

Unnamed: 0,rpt_dt,age_gm,ncic,vic_num
0,1993,1,1101,47196
1,1994,1,1101,49561


In [25]:
# Dropping rows with NA values in any column
# (dropna returns a new DataFrame, so the result is reassigned)
NCVS_AgeSeg = NCVS_AgeSeg.dropna()

# Creating a scaled array where each column has a mean of 0 and stdev of 1
scaler = StandardScaler()
scaler.fit(NCVS_AgeSeg[['vic_num', 'ncic', 'age_gm']])
SC_NCVSAgeSeg = scaler.transform(NCVS_AgeSeg[['vic_num', 'ncic', 'age_gm']])

# Review the first scaled rows
print(SC_NCVSAgeSeg[:5])

[[-0.41536431 -1.49281923 -1.23390539]
 [-0.40168172 -1.49281923 -1.23390539]
 [-0.46738132 -1.49281923 -1.23390539]
 [-0.52411347 -1.49281923 -1.23390539]
 [-0.39471025 -1.49281923 -1.23390539]]


In [26]:
# Wrapping the scaled array in a DataFrame with the original column names
SC_NCVSAgeSeg = pd.DataFrame(data=SC_NCVSAgeSeg, columns=['vic_num', 'ncic', 'age_gm'])
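
The scaling step above replaces each value with its z-score, computed per column. The sketch below checks that on a toy column; the numbers are made up for illustration and are not taken from the NCVS file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy victim counts (assumption): one feature column, four rows.
x = np.array([[100.0], [200.0], [300.0], [400.0]])

scaled = StandardScaler().fit_transform(x)

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Because the divisor is the population standard deviation, results can differ slightly from a z-score computed with pandas' default `std()`, which uses `ddof=1`.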

In [27]:
# File overview for Region
NCVS_Region.head(2)

Unnamed: 0,rpt_dt,region_m,ncic,vic_num
0,1996,2,1101,145675
1,1997,2,1101,210209


In [31]:
# Dropping rows with NA values in any column
# (dropna returns a new DataFrame, so the result is reassigned)
NCVS_Region = NCVS_Region.dropna()

# Creating a scaled array where each column has a mean of 0 and stdev of 1
scaler = StandardScaler()
scaler.fit(NCVS_Region[['vic_num', 'ncic', 'region_m']])
SC_NCVSRegion = scaler.transform(NCVS_Region[['vic_num', 'ncic', 'region_m']])

# Review the first scaled rows
print(SC_NCVSRegion[:5])

[[-0.29002227 -1.49281923 -0.4472136 ]
 [-0.02305437 -1.49281923 -0.4472136 ]
 [-0.39446549 -1.49281923 -0.4472136 ]
 [-0.29783679 -1.49281923 -0.4472136 ]
 [-0.62775938 -1.49281923 -0.4472136 ]]


In [32]:
# Creating the corresponding DataFrame
SC_NCVSRegion = pd.DataFrame(data=SC_NCVSRegion, columns=['vic_num', 'ncic', 'region_m'])

## K-Means Model:
There are a number of clustering models used for data analysis; one of them is K-Means. The objective of the model is to assign data points to clusters based on proximity to a centroid. In doing so it explores the association between data points through distance calculations between each point and the cluster centroids. The model operates under the assumption that clusters are spherical, equally sized and of similar density. Clustering is often used when analyzing crime statistics and conceptualizing the actual presence of crime in US society. The reason is that crime classifications are based on a group of characteristics that are required for a crime to be considered one class or another. Because this data has a categorical structure rather than an interval one, clustering is possible. It is important to note that, while offenses do carry an ordinal weight within society, this weight is difficult to build into a model without adding an extra layer or category.
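
The centroid-based assignment described above can be tried on synthetic data before touching the crime sets. The sketch below is a minimal example, not the configuration used in this project: it assumes `make_blobs` data in place of the NCVS frames and compares a few candidate cluster counts with the silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic, roughly spherical clusters (assumption, stands in for real data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    # n_init=10 repeats the centroid initialization and keeps the best run.
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```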

K-Means has limitations: it performs poorly when clusters are irregular in size, shape and density. Because the algorithm is based on distance calculations from a centroid, it can be sensitive to the initial placement of the centroids and can give outliers more influence than is warranted. This is why the analysis begins with K-Means but adds a further layer with DBSCAN.
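
The shape limitation can be illustrated on data that is deliberately non-spherical. The sketch below assumes scikit-learn's `make_moons` toy set and illustrative DBSCAN settings (`eps=0.3`, `min_samples=5`); none of these values come from this project, and they would need tuning for the crime data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: clusters that are not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# K-Means splits the plane by distance to two centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points by density; the label -1 marks points treated as noise.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN ARI:", adjusted_rand_score(y_true, db_labels))
print("DBSCAN noise points:", int((db_labels == -1).sum()))
```

On shapes like these, a density-based method would usually be expected to follow the curved groups more closely than a centroid-based one, although the outcome depends on the `eps` and `min_samples` chosen.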