### 3. High-level analysis

Perform **at least 6** higher-level analyses of your data. You are free to use any techniques we discussed in class, including but not limited to:

* Use Pandas features to answer specific questions about the data
* Perform a cluster analysis to identify groups within your data
* Identify and motivate a machine learning problem in your data (classification or regression). Create a train/test/validation split and evaluate how well an appropriate model performs
* Perform a linear regression to show the relationship between two variables

If applicable to an analysis, you **must** include:

* Appropriate statistical test(s)
* An appropriate visualization

Please take advantage of the check-ins or office hours if you are unsure whether a visualization or statistical test is necessary for an analysis.
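
As a rough illustration of the train/test/validation split mentioned in the list above, the sketch below shuffles a small made-up frame and slices it 70/15/15 with plain pandas. The toy columns, the `split_frame` helper, and the proportions are all assumptions for illustration, not requirements of the assignment; scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd


def split_frame(frame, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle `frame` and cut it into train / validation / test parts.

    Hypothetical helper: whatever remains after the train and validation
    slices becomes the test set.
    """
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test


# Toy data standing in for a real feature table (values are made up).
toy = pd.DataFrame({
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
})

train, val, test = split_frame(toy)
print(len(train), len(val), len(test))
```

Fitting on `train`, tuning on `val`, and reporting the final score once on `test` keeps the reported score from being biased by model selection.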

In [1]:
import pandas as pd

df = pd.read_csv("../datasets/processed/arrests2025_cleaned.csv")
df['ARREST_DATE'] = pd.to_datetime(df['ARREST_DATE'])
df.head()

|   | ARREST_KEY | ARREST_DATE | PD_CD | PD_DESC | KY_CD | OFNS_DESC | LAW_CODE | LAW_CAT_CD | ARREST_BORO | ARREST_PRECINCT | ... | PERP_RACE | X_COORD_CD | Y_COORD_CD | Latitude | Longitude | New Georeferenced Column | MIN_AGE | MAX_AGE | DAY_OF_WEEK | MONTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 298799078 | 2025-01-02 | 101 | ASSAULT 3 | 344.0 | ASSAULT 3 & RELATED OFFENSES | PL 1200001 | M | M | 23 | ... | BLACK | 1000213 | 228833 | 40.794755 | -73.942348 | POINT (-73.9423482609703 40.79475532416718) | 25 | 44 | Thursday | January |
| 1 | 299008265 | 2025-01-07 | 105 | STRANGULATION 1ST | 106.0 | FELONY ASSAULT | PL 1211200 | F | Q | 113 | ... | BLACK | 1046399 | 187126 | 40.680086 | -73.775931 | POINT (-73.775931 40.680086) | 45 | 64 | Tuesday | January |
| 2 | 298969999 | 2025-01-06 | 793 | WEAPONS POSSESSION 3 | 118.0 | DANGEROUS WEAPONS | PL 2650201 | F | M | 5 | ... | WHITE | 983907 | 199958 | 40.715526 | -74.001238 | POINT (-74.001238 40.715526) | 25 | 44 | Monday | January |
| 3 | 299436365 | 2025-01-14 | 157 | RAPE 1 | 104.0 | RAPE | PL 130352B | F | Q | 112 | ... | BLACK | 1025401 | 202586 | 40.722641 | -73.851542 | POINT (-73.8515418216779 40.7226409964758) | 45 | 64 | Tuesday | January |
| 4 | 299562518 | 2025-01-16 | 397 | ROBBERY,OPEN AREA UNCLASSIFIED | 105.0 | ROBBERY | PL 1601504 | F | M | 26 | ... | BLACK | 996342 | 236149 | 40.814853 | -73.956314 | POINT (-73.956314 40.814853) | 0 | 17 | Thursday | January |


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71028 entries, 0 to 71027
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   ARREST_KEY                71028 non-null  int64         
 1   ARREST_DATE               71028 non-null  datetime64[ns]
 2   PD_CD                     71028 non-null  int64         
 3   PD_DESC                   71028 non-null  object        
 4   KY_CD                     71024 non-null  float64       
 5   OFNS_DESC                 71028 non-null  object        
 6   LAW_CODE                  71028 non-null  object        
 7   LAW_CAT_CD                70668 non-null  object        
 8   ARREST_BORO               71028 non-null  object        
 9   ARREST_PRECINCT           71028 non-null  int64         
 10  JURISDICTION_CODE         71028 non-null  int64         
 11  AGE_GROUP                 71028 non-null  object        
 12  PERP_SEX          

In [3]:
df['DAY_OF_WEEK'].value_counts()

DAY_OF_WEEK
Wednesday    12618
Thursday     11733
Tuesday      10991
Friday       10408
Saturday      8875
Monday        8837
Sunday        7566
Name: count, dtype: int64
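
The weekday counts above look uneven, and one way to check that statistically is a chi-square goodness-of-fit test. The sketch below is a minimal version that hard-codes the counts printed above and uses `scipy.stats.chisquare`. The null hypothesis of equal counts per weekday is an assumption: it presumes each weekday occurs about equally often in the period the data covers.

```python
from scipy.stats import chisquare

# Arrest counts per weekday, copied from the value_counts() output above.
day_counts = {
    "Wednesday": 12618,
    "Thursday": 11733,
    "Tuesday": 10991,
    "Friday": 10408,
    "Saturday": 8875,
    "Monday": 8837,
    "Sunday": 7566,
}
observed = list(day_counts.values())

# With no f_exp argument, chisquare() tests against equal expected counts.
result = chisquare(observed)
print(f"chi2 = {result.statistic:.1f}, p = {result.pvalue:.3g}")
```

A p-value below the chosen significance level (0.05 is a common choice) would suggest that arrests are not spread evenly across the week.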

In [4]:
# value_counts() already groups and sorts in descending order
df_age = df['AGE_GROUP'].value_counts().reset_index()
df_age.head()

|   | AGE_GROUP | count |
|---|---|---|
| 0 | 25-44 | 41711 |
| 1 | 45-64 | 15016 |
| 2 | 18-24 | 10516 |
| 3 | <18 | 2442 |
| 4 | 65+ | 1343 |
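
Because `AGE_GROUP` is stored as text, sorting it alphabetically would not follow age order. A possible fix, sketched below with the counts copied from the output above, is to convert the column to an ordered categorical. The `age_order` list and the `by_age` name are illustrative choices rather than part of the original notebook.

```python
import pandas as pd

# Counts copied from the AGE_GROUP output above.
age_counts = pd.DataFrame({
    "AGE_GROUP": ["25-44", "45-64", "18-24", "<18", "65+"],
    "count": [41711, 15016, 10516, 2442, 1343],
})

# Youngest to oldest; an ordered categorical sorts in this order.
age_order = ["<18", "18-24", "25-44", "45-64", "65+"]
age_counts["AGE_GROUP"] = pd.Categorical(
    age_counts["AGE_GROUP"], categories=age_order, ordered=True
)

by_age = age_counts.sort_values("AGE_GROUP").reset_index(drop=True)
print(by_age)
```

The same ordering also carries over to plots, so a bar chart built from `by_age` would show the groups from youngest to oldest.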


In [5]:
df_race = df['PERP_RACE'].value_counts().reset_index()
df_race.head()

|   | PERP_RACE | count |
|---|---|---|
| 0 | BLACK | 33878 |
| 1 | WHITE HISPANIC | 18314 |
| 2 | WHITE | 7244 |
| 3 | BLACK HISPANIC | 6963 |
| 4 | ASIAN / PACIFIC ISLANDER | 4114 |


In [6]:
# Use a separate name so the age-group counts in df_age are not overwritten
df_sex = df['PERP_SEX'].value_counts().reset_index()
df_sex.head()

|   | PERP_SEX | count |
|---|---|---|
| 0 | M | 58528 |
| 1 | F | 12500 |


In [7]:
df_borough = df['ARREST_BORO'].value_counts().reset_index()
df_borough.head()

|   | ARREST_BORO | count |
|---|---|---|
| 0 | K | 20024 |
| 1 | M | 17067 |
| 2 | B | 15776 |
| 3 | Q | 15154 |
| 4 | S | 3007 |
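
The single-letter borough codes are hard to read in tables and plots. The sketch below maps them to borough names (K is Brooklyn, for Kings County) and adds each borough's share of arrests, using the counts copied from the output above. The `boro_names` mapping and the `BORO_NAME` and `pct` column names are additions for illustration.

```python
import pandas as pd

# Counts copied from the ARREST_BORO output above.
boro_counts = pd.DataFrame({
    "ARREST_BORO": ["K", "M", "B", "Q", "S"],
    "count": [20024, 17067, 15776, 15154, 3007],
})

# Borough codes used in the arrest data.
boro_names = {
    "K": "Brooklyn",
    "M": "Manhattan",
    "B": "Bronx",
    "Q": "Queens",
    "S": "Staten Island",
}
boro_counts["BORO_NAME"] = boro_counts["ARREST_BORO"].map(boro_names)

# Percentage of all arrests made in each borough, rounded to one decimal.
boro_counts["pct"] = (100 * boro_counts["count"] / boro_counts["count"].sum()).round(1)
print(boro_counts)
```

Raw shares like these do not account for borough population, so a per-capita rate would be needed before comparing boroughs directly.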


In [8]:
df_day = df['DAY_OF_WEEK'].value_counts().reset_index()
df_day.head()

|   | DAY_OF_WEEK | count |
|---|---|---|
| 0 | Wednesday | 12618 |
| 1 | Thursday | 11733 |
| 2 | Tuesday | 10991 |
| 3 | Friday | 10408 |
| 4 | Saturday | 8875 |


In [9]:
df_crime = df['OFNS_DESC'].value_counts().reset_index()
df_crime.head()

|   | OFNS_DESC | count |
|---|---|---|
| 0 | ASSAULT 3 & RELATED OFFENSES | 9185 |
| 1 | PETIT LARCENY | 7447 |
| 2 | DANGEROUS DRUGS | 6146 |
| 3 | FELONY ASSAULT | 5285 |
| 4 | MISCELLANEOUS PENAL LAW | 4730 |


In [10]:
df_crime_detailed = df['PD_DESC'].value_counts().reset_index()
df_crime_detailed.head()

|   | PD_DESC | count |
|---|---|---|
| 0 | LARCENY,PETIT FROM OPEN AREAS, | 7447 |
| 1 | ASSAULT 3 | 6737 |
| 2 | THEFT OF SERVICES, UNCLASSIFIE | 4534 |
| 3 | TRAFFIC,UNCLASSIFIED MISDEMEAN | 4435 |
| 4 | ASSAULT 2,1,UNCLASSIFIED | 3601 |
