EDA & Visualization
MSDS Summer 2024
Final Group Project

# Police arrests in Montgomery County from 2021 to 2023

This dataset contains information about individuals who have been arrested by police officers in Montgomery County. The data includes demographic information about the individuals arrested, such as their race, gender, age, and ethnicity.

The data is available at https://www.kaggle.com/datasets/shayanshahid997/police-arrest-from-2021-2023/data

In [111]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors as clrs
import datetime

# EDA

In [112]:
arrests = pd.read_csv("Police_Arrests_20240702.csv")

## A first look at the data

In [113]:
print(arrests.shape)
arrests.head()

(23510, 10)


Unnamed: 0,ID Reference Number,Subject's race,Subject's gender,Subject's age,Ethnicity,District of occurrence,Adjacent to School,Assigned Division,Assigned Bureau,Event Date/Time
0,1,Asian,Male,43,NON-HISPANIC,6.0,0,TOD,FSB,1/1/2021 1:07
1,2,Black/African American,Male,23,NON-HISPANIC,1.0,0,RCPD,,1/1/2021 0:52
2,3,Black/African American,Male,18,NON-HISPANIC,1.0,0,RCPD,,1/1/2021 0:52
3,4,Black/African American,Male,21,NON-HISPANIC,1.0,0,RCPD,,1/1/2021 0:52
4,5,White,Female,38,HISPANIC,4.0,0,TOD,FSB,1/1/2021 3:00


In [114]:
arrests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23510 entries, 0 to 23509
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID Reference Number     23510 non-null  int64  
 1   Subject's race          23510 non-null  object 
 2   Subject's gender        23510 non-null  object 
 3   Subject's age           23510 non-null  int64  
 4   Ethnicity               23510 non-null  object 
 5   District of occurrence  23408 non-null  float64
 6   Adjacent to School      23510 non-null  int64  
 7   Assigned Division       23381 non-null  object 
 8   Assigned Bureau         18851 non-null  object 
 9   Event Date/Time         23510 non-null  object 
dtypes: float64(1), int64(3), object(6)
memory usage: 1.8+ MB


In [115]:
arrests.describe()

Unnamed: 0,ID Reference Number,Subject's age,District of occurrence,Adjacent to School
count,23510.0,23510.0,23408.0,23510.0
mean,11755.5,31.813143,3.989192,0.085155
std,6786.89675,12.946909,1.748587,0.279119
min,1.0,1.0,1.0,0.0
25%,5878.25,22.0,3.0,0.0
50%,11755.5,30.0,4.0,0.0
75%,17632.75,39.0,5.0,0.0
max,23510.0,99.0,8.0,1.0


## Variables

### Types

<b>The dataset's source provides the following information about each variable:</b>

<div class="alert alert-block alert-info">

The `ID Reference Number` column likely serves as a unique identifier for each arrest record.

The `Subject's race`, `Subject's gender`, `Subject's age`, and `Ethnicity` columns provide demographic information about the individuals who were arrested.

The `District of occurrence` column likely indicates the district in which the arrest took place.

The `Adjacent to School` column is particularly interesting. It appears to indicate whether the arrest took place within 500 feet of a school, with a 1 indicating that it did and a 0 indicating that it did not. This could be useful for analyzing patterns in where arrests occur.

The `Assigned Division` and `Assigned Bureau` columns likely indicate the division and bureau of the police department that were assigned to the arrest.

Finally, the `Event Date/Time` column provides the date and time of the arrest. This could be useful for analyzing patterns in when arrests occur.

</div>

<div class="alert alert-block alert-info">

| Column Name            | Description                                      | Field Name         | Data Type          |
| ---------------------- | ------------------------------------------------ | ------------------ | ------------------ | 
| ID Reference Number    | Row number/ID Reference Number                   | id                 | Text               |
| Subject's race         | Subject's race                                   | race               | Text               |
| Subject's gender       | Subject's gender                                 | gender             | Text               |
| Subject's age          | Subject's age                                    | age                | Text               |
| Ethnicity              | Subject's ethnicity                              | ethnicity          | Text               |
| District of occurrence | District of occurrence                           | district           | Text               |
| Adjacent to School     | Arrest occurred within 500 ft. of a school (1/0) | adjacent_to_school | Text               |
| Assigned Division      | District/division of officer's assignment        | division           | Text               |
| Assigned Bureau        | Bureau of officer's assignment                   | bureau             | Text               |
| Event Date/Time        | Event Date/Time                                  | event_date_time    | Floating Timestamp |

</div>

Let's look at the dtype of each variable after importing.

In [116]:
arrests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23510 entries, 0 to 23509
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID Reference Number     23510 non-null  int64  
 1   Subject's race          23510 non-null  object 
 2   Subject's gender        23510 non-null  object 
 3   Subject's age           23510 non-null  int64  
 4   Ethnicity               23510 non-null  object 
 5   District of occurrence  23408 non-null  float64
 6   Adjacent to School      23510 non-null  int64  
 7   Assigned Division       23381 non-null  object 
 8   Assigned Bureau         18851 non-null  object 
 9   Event Date/Time         23510 non-null  object 
dtypes: float64(1), int64(3), object(6)
memory usage: 1.8+ MB


Before redefining our dtypes, let's rename the columns to the names in the <b>Field Name</b> column above, for simplicity's sake.

In [117]:
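renames = {}
for colname in arrests:
    words = colname.lower().split(' ')
    if words[0] in ('id', 'district', 'ethnicity'):
        # The first word alone is the field name ('ID Reference Number ' -> 'id')
        renames[colname] = words[0]
    elif words[0] in ("subject's", 'assigned'):
        # Drop the prefix and keep the second word ("Subject's race" -> 'race')
        renames[colname] = words[1]
    elif words[0] == 'event':
        # 'Event Date/Time' -> 'event_date_time': also split the second word on '/'
        words = [words[0]] + words[1].split("/")
        renames[colname] = "_".join(words)
    else:
        # Otherwise join every word with underscores ('Adjacent to School' -> 'adjacent_to_school')
        renames[colname] = "_".join(words)
print(renames)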

{'ID Reference Number ': 'id', "Subject's race": 'race', "Subject's gender": 'gender', "Subject's age": 'age', 'Ethnicity': 'ethnicity', 'District of occurrence': 'district', 'Adjacent to School': 'adjacent_to_school', 'Assigned Division': 'division', 'Assigned Bureau': 'bureau', 'Event Date/Time': 'event_date_time'}


In [118]:
arrests.rename(columns=renames, inplace=True)
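An alternative to deriving the names with string rules is to write the mapping out explicitly from the Field Name column of the table above. The following is a minimal sketch of that idea, not the approach used in this notebook: `FIELD_NAMES` and `build_renames` are hypothetical names introduced here, and the sketch assumes the raw headers differ from the documented ones only by stray whitespace (such as the trailing space in `'ID Reference Number '`).

```python
# Field names copied from the "Field Name" column of the variables table.
# FIELD_NAMES and build_renames are illustrative names, not part of the notebook.
FIELD_NAMES = {
    "ID Reference Number": "id",
    "Subject's race": "race",
    "Subject's gender": "gender",
    "Subject's age": "age",
    "Ethnicity": "ethnicity",
    "District of occurrence": "district",
    "Adjacent to School": "adjacent_to_school",
    "Assigned Division": "division",
    "Assigned Bureau": "bureau",
    "Event Date/Time": "event_date_time",
}


def build_renames(columns):
    """Map each raw header to its field name, ignoring surrounding whitespace.

    An unexpected header raises KeyError instead of being renamed silently.
    """
    return {col: FIELD_NAMES[col.strip()] for col in columns}


# Raw headers as read from the CSV; the first one carries a trailing space.
raw_columns = [
    "ID Reference Number ", "Subject's race", "Subject's gender",
    "Subject's age", "Ethnicity", "District of occurrence",
    "Adjacent to School", "Assigned Division", "Assigned Bureau",
    "Event Date/Time",
]

renames_explicit = build_renames(raw_columns)
print(renames_explicit)
```

The resulting dictionary can be passed to `DataFrame.rename(columns=...)` in the same way as `renames`. The explicit version is longer, but it fails loudly if a future export of the file changes a header.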

Now, let's look at all the <b>unique values</b> in each of our variables.

In [119]:
for col in arrests:
    uniques = arrests[col].unique()
    print(f"\nThe column {col} has {len(uniques)} unique values.\n")
    print(uniques)


The column id has 23510 unique values.

[    1     2     3 ... 23508 23509 23510]

The column race has 6 unique values.

['Asian' 'Black/African American' 'White' 'Unknown'
 'American Indian/ALSK Natv' 'Hawaiian/Pacific Islander']

The column gender has 2 unique values.

['Male' 'Female']

The column age has 78 unique values.

[43 23 18 21 38 31 37 40 34 49 27 32 28 25 30 29 35 26 24 39 41 20 19 16
 33 60 62 55 45 36 52 56 53 17 22 48 15 51 57 47 44 42 50 64 46 54 58 13
 14 59 66 79 80 63 61 69 71 12 72 65 67 73 10 68 83 76 99 75 11 70 77 78
  1 74 84 85 86 82]

The column ethnicity has 2 unique values.

['NON-HISPANIC' 'HISPANIC']

The column district has 8 unique values.

[ 6.  1.  4.  3.  5.  2.  8. nan]

The column adjacent_to_school has 2 unique values.

[0 1]

The column division has 25 unique values.

['TOD' 'RCPD' 'TPPD' 'GPD' '4D' '3D' '5D' '6D' 'IMTD' '2D' '1D' 'SVID'
 'CID' 'MCFM' 'SID' 'PSTA' nan 'MCSO' 'MCD' 'SOD' 'CED' 'PSCC' 'FSB'
 'PERS' 'ISB']

The column bureau has 5

Based on this information, these are the types we consider most appropriate for each column:

| Column Name            | Description                                      | Field Name         | Data Type                   |
| ---------------------- | ------------------------------------------------ | ------------------ | --------------------------- | 
| ID Reference Number    | Row number/ID Reference Number                   | id                 | <b> int </b>                |
| Subject's race         | Subject's race                                   | race               | <b> categorical </b>        |
| Subject's gender       | Subject's gender                                 | gender             | <b> categorical </b>        |
| Subject's age          | Subject's age                                    | age                | <b> int </b>                |
| Ethnicity              | Subject's ethnicity                              | ethnicity          | <b> categorical </b>        |
| District of occurrence | District of occurrence <b> represented by an integer value </b>  | district           | <b> categorical </b>        |
| Adjacent to School     | Arrest occurred within 500 ft. of a school (1/0) | adjacent_to_school | <b> bool </b>               |
| Assigned Division      | District/division of officer's assignment        | division           | <b> categorical </b>        |
| Assigned Bureau        | Bureau of officer's assignment                   | bureau             | <b> categorical </b>        |
| Event Date/Time        | Event Date/Time                                  | event_date_time    | <b> datetime </b>           |

Let's check which dtypes our columns ended up with after importing the data, then adjust the ones that need it.

In [120]:
arrests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23510 entries, 0 to 23509
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  23510 non-null  int64  
 1   race                23510 non-null  object 
 2   gender              23510 non-null  object 
 3   age                 23510 non-null  int64  
 4   ethnicity           23510 non-null  object 
 5   district            23408 non-null  float64
 6   adjacent_to_school  23510 non-null  int64  
 7   division            23381 non-null  object 
 8   bureau              18851 non-null  object 
 9   event_date_time     23510 non-null  object 
dtypes: float64(1), int64(3), object(6)
memory usage: 1.8+ MB


In [121]:
# Categorical variables
categorical_cols = ['race', 'gender', 'ethnicity', 'district', 'division', 'bureau']
arrests[categorical_cols] = arrests[categorical_cols].astype('category')

# Boolean variable
arrests['adjacent_to_school'] = arrests['adjacent_to_school'].astype(bool)

# Date time variable
arrests["event_date_time"] = pd.to_datetime(arrests['event_date_time'])

arrests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23510 entries, 0 to 23509
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   id                  23510 non-null  int64         
 1   race                23510 non-null  category      
 2   gender              23510 non-null  category      
 3   age                 23510 non-null  int64         
 4   ethnicity           23510 non-null  category      
 5   district            23408 non-null  category      
 6   adjacent_to_school  23510 non-null  bool          
 7   division            23381 non-null  category      
 8   bureau              18851 non-null  category      
 9   event_date_time     23510 non-null  datetime64[ns]
dtypes: bool(1), category(6), datetime64[ns](1), int64(2)
memory usage: 713.6 KB
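When no format is given, `pd.to_datetime` infers one from the strings. The sample rows shown earlier (for example `1/1/2021 1:07`) look like month/day/year with a 24-hour clock and no seconds, so the conversion could also be written with an explicit format. The sketch below assumes every row follows that layout; the format string is our reading of the sample values, not something documented by the data provider.

```python
import pandas as pd

# Sample values copied from the first rows of the dataset.
raw = pd.Series(["1/1/2021 1:07", "1/1/2021 0:52", "1/1/2021 3:00"])

# Assumed layout: month/day/year, 24-hour clock, no seconds.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

# The hour of each event is now available through the .dt accessor.
print(parsed.dt.hour.tolist())
```

With an explicit format, a row that does not match raises an error by default instead of being parsed under a different interpretation, which makes day/month mix-ups easier to catch.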


### Categorical variable study

#### Districts

In [125]:
arrests['district'].value_counts()

district
3.0    4782
6.0    4748
4.0    4563
5.0    3820
1.0    2513
2.0    2196
8.0     786
Name: count, dtype: int64

In [126]:
arrests['district'].unique()

[6.0, 1.0, 4.0, 3.0, 5.0, 2.0, 8.0, NaN]
Categories (7, float64): [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]

We have no records for district 7, so we assume that district 8 in the data corresponds to Montgomery County district 7.
This assumption is based on the county's political and administrative district divisions.
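If the assumption above is applied to the data, one option is to relabel the category instead of editing individual rows. The sketch below uses a small toy series in place of `arrests['district']`; the values in it are illustrative, and whether to recode at all depends on the assumption holding.

```python
import pandas as pd

# Toy stand-in for arrests['district'] after the categorical conversion.
district = pd.Series([6.0, 1.0, 8.0, None, 8.0]).astype("category")

# Relabel category 8 as 7; other categories and missing values are untouched.
district_recoded = district.cat.rename_categories({8.0: 7.0})

print(district_recoded.cat.categories.tolist())
```

Keeping the original column and storing the recoded version under a new name makes it easy to undo the change if the assumption turns out to be wrong.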

#### Bureau

In [127]:
arrests['bureau'].unique()

['FSB', NaN, 'PSB', 'MSB', 'ISB']
Categories (4, object): ['FSB', 'ISB', 'MSB', 'PSB']

source: https://www.montgomerycountymd.gov/pol/bureaus/index.html <br>
The Police Department is structured into the Office of the Chief and five major bureaus:
- the Community Resources Bureau
- the Field Services Bureau
- the Investigative Services Bureau
- the Management Services Bureau
- the Patrol Services Bureau

The Patrol Services Bureau, the largest and most visible bureau in the Police Department, oversees most of the Department’s uniformed officers on patrol. The Patrol Services Bureau is divided into six police districts.

In [128]:
bureaus = {'FSB': 'Field Services Bureau',
           'PSB': 'Patrol Services Bureau',
           'MSB': 'Management Services Bureau',
           'ISB': 'Investigative Services Bureau'}

#### Division
District/Division of officer's assignment

Forensic Science Evidence Management Division
Internal Affairs Division

In [129]:
arrests['division'].unique()

divisions = {'TOD': 'Traffic Operations Division',
             'IMTD': 'Information Management and Technology Division',
             'SVID': 'Special Victims Investigations Division',
             'CID': 'Criminal Investigations Division',
             'SID': 'Special Investigations Division',
             'MCD': 'Major Crimes Division',
             'SOD': 'Special Operations Division',
             'CED': 'Community Engagement Division',
             'RCPD': 'Rockville City Police Department',
             'TPPD': 'Takoma Park Police Department',
             'GPD': 'Gaithersburg Police Department',
             'PSTA': 'Public Safety Training Academy',
             'MCSO': 'Montgomery County Sheriff\'s Office',
             'PSCC': 'Public Safety Communications Centre',
             '1D': 'District 1',
             '2D': 'District 2',
             '3D': 'District 3',
             '4D': 'District 4',
             '5D': 'District 5',
             '6D': 'District 6'}
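Not every division code in the data has an entry in the dictionary. A quick way to check is to compare the observed codes with the dictionary keys; the sketch below hard-codes both lists from this notebook so that it runs on its own, and the variable names are ours.

```python
# Division codes observed in the data (from the unique-values listing, NaN omitted).
observed_codes = [
    'TOD', 'RCPD', 'TPPD', 'GPD', '4D', '3D', '5D', '6D', 'IMTD', '2D', '1D',
    'SVID', 'CID', 'MCFM', 'SID', 'PSTA', 'MCSO', 'MCD', 'SOD', 'CED', 'PSCC',
    'FSB', 'PERS', 'ISB',
]

# Codes that have a label in the `divisions` dictionary.
labelled_codes = {
    'TOD', 'IMTD', 'SVID', 'CID', 'SID', 'MCD', 'SOD', 'CED', 'RCPD', 'TPPD',
    'GPD', 'PSTA', 'MCSO', 'PSCC', '1D', '2D', '3D', '4D', '5D', '6D',
}

# Codes that appear in the data but have no label yet.
unlabelled = sorted(set(observed_codes) - labelled_codes)
print(unlabelled)  # ['FSB', 'ISB', 'MCFM', 'PERS']
```

In the notebook itself the same check could use `arrests['division'].dropna().unique()` and `divisions.keys()`. Two of the unlabelled codes, `FSB` and `ISB`, are bureau abbreviations that show up in the division column.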

In [130]:
arrests['division'][arrests['bureau'] == "FSB"].value_counts()

division
TOD     1033
CED      361
SOD        7
FSB        4
1D         0
MCFM       0
SVID       0
SID        0
RCPD       0
PSTA       0
PSCC       0
PERS       0
MCSO       0
MCD        0
2D         0
ISB        0
IMTD       0
GPD        0
CID        0
6D         0
5D         0
4D         0
3D         0
TPPD       0
Name: count, dtype: int64

In [131]:
arrests['division'][arrests['bureau'] == "MSB"].value_counts()

division
IMTD    463
PSTA     17
PSCC      1
PERS      1
1D        0
MCFM      0
TOD       0
SVID      0
SOD       0
SID       0
RCPD      0
MCSO      0
MCD       0
2D        0
ISB       0
GPD       0
FSB       0
CID       0
CED       0
6D        0
5D        0
4D        0
3D        0
TPPD      0
Name: count, dtype: int64

In [132]:
arrests['division'][arrests['bureau'] == "ISB"].value_counts()

division
SID     254
SVID    249
MCD      75
CID      74
ISB       3
MCFM      0
TOD       0
SOD       0
RCPD      0
PSTA      0
PSCC      0
PERS      0
MCSO      0
1D        0
2D        0
IMTD      0
GPD       0
FSB       0
CED       0
6D        0
5D        0
4D        0
3D        0
TPPD      0
Name: count, dtype: int64

In [133]:
arrests['division'][arrests['bureau'] == "PSB"].value_counts()

division
3D      4331
4D      3908
5D      3502
2D      1870
6D      1766
1D       926
MCSO       6
PERS       0
TOD        0
SVID       0
SOD        0
SID        0
RCPD       0
PSTA       0
PSCC       0
MCD        0
MCFM       0
ISB        0
IMTD       0
GPD        0
FSB        0
CID        0
CED        0
TPPD       0
Name: count, dtype: int64
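The per-bureau counts above list every division, including those with zero arrests, because `division` is a categorical column and `value_counts()` reports all of its categories. A minimal sketch of one way to show only the divisions that actually occur within a bureau is given below; it runs on a toy frame with illustrative values, not on the real data.

```python
import pandas as pd

# Toy stand-in for the arrests table, with the same categorical dtypes.
toy = pd.DataFrame({
    "bureau": ["FSB", "FSB", "FSB", "PSB", "PSB", None],
    "division": ["TOD", "TOD", "CED", "3D", "4D", "RCPD"],
}).astype("category")

# Select the division column for one bureau with a single .loc call.
fsb_divisions = toy.loc[toy["bureau"] == "FSB", "division"]

# Default behaviour: every category is listed, including zero counts.
all_counts = fsb_divisions.value_counts()

# Dropping unused categories first keeps only divisions seen in this bureau.
observed_counts = fsb_divisions.cat.remove_unused_categories().value_counts()

print(observed_counts)
```

The shorter output makes it easier to see which divisions belong to which bureau, at the cost of hiding the fact that the other divisions have no arrests under that bureau.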