This notebook explores San Diego RIPA (Racial and Identity Profiling Act) stop data.

First, drop unnecessary columns. These free-text columns will not be used in the later analysis; the information derived from them has already been encoded in other columns.

In [1]:
%store -r ripa_all_combo

In [2]:
ripa_all_combo.head()

Unnamed: 0,stop_id,ori,agency,exp_years,date_stop,time_stop,stopduration,stop_in_response_to_cfs,officer_assignment_key,assignment,...,resulttext,race,reason_for_stop,reason_for_stopcode,reason_for_stop_code_text,reason_for_stop_detail,reason_for_stop_explanation,action,consented,contraband
0,2443,CA0371100,SD,10,2018-07-01,00:01:37,30,0,1,"Patrol, traffic enforcement, field operations",...,647(F) PC - DISORD CONDUCT:ALCOHOL (M) 64005,White,Reasonable Suspicion,64005,647(F) PC - DISORD CONDUCT:ALCOHOL (M) 64005,Officer witnessed commission of a crime,"staggering, unable to safely walk",,,
1,2444,CA0371100,SD,18,2018-07-01,00:03:34,10,0,1,"Patrol, traffic enforcement, field operations",...,22349(B) VC - EXC 55MPH SPEED:2 LANE RD (I) 54395,White,Traffic Violation,54106,22350 VC - UNSAFE SPEED:PREVAIL COND (I) 54106,Moving Violation,Speeding,,,
2,2447,CA0371100,SD,1,2018-07-01,00:05:43,15,1,10,Other,...,,Hispanic/Latino/a,Reasonable Suspicion,53072,415(1) PC - FIGHT IN PUBLIC PLACE (M) 53072,Matched suspect description,Both parties involved in argument.,Curbside detention,,
3,2447,CA0371100,SD,1,2018-07-01,00:05:43,15,1,10,Other,...,,Hispanic/Latino/a,Reasonable Suspicion,53072,415(1) PC - FIGHT IN PUBLIC PLACE (M) 53072,Other Reasonable Suspicion of a crime,Both parties engaged in argument.,Curbside detention,,
4,2448,CA0371100,SD,3,2018-07-01,00:19:06,5,0,1,"Patrol, traffic enforcement, field operations",...,,White,Traffic Violation,54106,22350 VC - UNSAFE SPEED:PREVAIL COND (I) 54106,Moving Violation,UNSAFE DRIVING,,,


In [3]:
print(ripa_all_combo.columns)

Index(['stop_id', 'ori', 'agency', 'exp_years', 'date_stop', 'time_stop',
       'stopduration', 'stop_in_response_to_cfs', 'officer_assignment_key',
       'assignment', 'intersection', 'address_block', 'land_mark',
       'address_street', 'highway_exit', 'isschool', 'school_name',
       'address_city', 'beat', 'beat_name', 'pid', 'isstudent',
       'perceived_limited_english', 'perceived_age', 'perceived_gender',
       'gender_nonconforming', 'gend', 'gend_nc', 'perceived_lgbt',
       'resultkey', 'result', 'code', 'resulttext', 'race', 'reason_for_stop',
       'reason_for_stopcode', 'reason_for_stop_code_text',
       'reason_for_stop_detail', 'reason_for_stop_explanation', 'action',
       'consented', 'contraband'],
      dtype='object')


In [4]:
ripa_data = ripa_all_combo.drop(['ori','agency','intersection','highway_exit','land_mark',
                                'address_street','address_block','address_city','school_name',
                                'beat_name','resulttext','reason_for_stop_code_text','reason_for_stop_detail',
                                'reason_for_stop_explanation'], axis=1)
print(ripa_data.columns)

Index(['stop_id', 'exp_years', 'date_stop', 'time_stop', 'stopduration',
       'stop_in_response_to_cfs', 'officer_assignment_key', 'assignment',
       'isschool', 'beat', 'pid', 'isstudent', 'perceived_limited_english',
       'perceived_age', 'perceived_gender', 'gender_nonconforming', 'gend',
       'gend_nc', 'perceived_lgbt', 'resultkey', 'result', 'code', 'race',
       'reason_for_stop', 'reason_for_stopcode', 'action', 'consented',
       'contraband'],
      dtype='object')


Now some data exploration

In [5]:
print(ripa_data.dtypes)

stop_id                        int64
exp_years                      int64
date_stop                     object
time_stop                     object
stopduration                   int64
stop_in_response_to_cfs        int64
officer_assignment_key         int64
assignment                    object
isschool                       int64
beat                           int64
pid                            int64
isstudent                      int64
perceived_limited_english      int64
perceived_age                  int64
perceived_gender              object
gender_nonconforming           int64
gend                           int64
gend_nc                      float64
perceived_lgbt                object
resultkey                      int64
result                        object
code                         float64
race                          object
reason_for_stop               object
reason_for_stopcode           object
action                        object
consented                     object
c

Some columns have the object dtype because they are encoded as Yes/No. Other columns are object because they hold text; these will be kept for now to preserve information until they are no longer needed.
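A minimal sketch of how the Yes/No object columns could be separated from the free-text ones and encoded as 1/0. The toy frame, the helper `is_yes_no`, and the 1/0 mapping are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for ripa_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "perceived_lgbt": ["No", "Yes", "No"],
    "race": ["White", "Hispanic/Latino/a", "White"],
    "stopduration": [30, 10, 15],
})

def is_yes_no(column: pd.Series) -> bool:
    """True if the column's non-null values are all 'Yes' or 'No'."""
    values = set(column.dropna().unique())
    return bool(values) and values <= {"Yes", "No"}

# Columns that only ever hold Yes/No.
yes_no_cols = [col for col in toy.columns if is_yes_no(toy[col])]

# Encode Yes/No as 1/0 in a copy; text columns such as 'race' stay untouched.
encoded = toy.copy()
for col in yes_no_cols:
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

print(yes_no_cols)
print(encoded)
```

On the real data the same helper could be applied to `ripa_data.columns` to see which object columns are candidates for encoding.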

In [6]:
#how many observations and how many attributes
ripa_data.shape

(529600, 28)

In [7]:
#presence of missing values

ripa_data.isnull().sum()

stop_id                           0
exp_years                         0
date_stop                         0
time_stop                         0
stopduration                      0
stop_in_response_to_cfs           0
officer_assignment_key            0
assignment                        0
isschool                          0
beat                              0
pid                               0
isstudent                         0
perceived_limited_english         0
perceived_age                     0
perceived_gender                152
gender_nonconforming              0
gend                              0
gend_nc                      529322
perceived_lgbt                    0
resultkey                         0
result                            0
code                         227492
race                              0
reason_for_stop                   0
reason_for_stopcode           26411
action                            0
consented                    520968
contraband                  

Most columns do not have missing values. Those that do deserve additional investigation.
The perceived_gender column is missing values; however, gend is not. Whether these two columns differ is unclear as of now, since the data dictionary states that both are the officer's perceived gender of the individual.
gend_nc was encoded as missing = No, 5 = Yes.
code is the specific violation recorded when the stop outcome is a warning, citation, or custodial arrest, so it is missing whenever the outcome is not one of those results.
reason_for_stopcode is included if "... reason for stop is traffic violation or reasonable suspicion of criminal activity".
consented only has data if the individual stopped consented to a search, so in most cases (520968 of 529600 rows are missing) no consent to a search is recorded.
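One way to check that the missing values in code are structural rather than data-entry gaps is to compute the share of missing codes per stop result. This is a sketch on a made-up toy frame; only the column names come from the data.

```python
import pandas as pd

# Toy rows standing in for ripa_data; the values are illustrative only.
toy = pd.DataFrame({
    "result": [
        "Citation for infraction",
        "No Action",
        "Field interview card completed",
        "Custodial Arrest without warrant",
    ],
    "code": [54106.0, None, None, 64005.0],
})

# Fraction of rows with a missing 'code' within each stop result.
missing_by_result = toy["code"].isnull().groupby(toy["result"]).mean()
print(missing_by_result)
```

If the documentation holds, results such as citations and arrests should show a share near 0 and the remaining results a share near 1.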

Check differences among the different gender columns

In [8]:
ripa_data['perceived_gender'].value_counts()

Male                      386042
Female                    142001
Transgender man/boy          767
Transgender woman/girl       638
Name: perceived_gender, dtype: int64

In [9]:
ripa_data['gend'].value_counts()

1    386042
2    142001
3       767
4       638
0       152
Name: gend, dtype: int64

In [10]:
ripa_data['gender_nonconforming'].value_counts()

0    529322
1       278
Name: gender_nonconforming, dtype: int64

In [11]:
ripa_data['gend_nc'].value_counts()

5.0    278
Name: gend_nc, dtype: int64

So columns 'gend_nc' and 'gender_nonconforming' carry the same information. Column 'gend' has one more value (0) than column 'perceived_gender'.
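The value counts suggest the two flags agree, but matching totals do not prove that they agree row by row. A small sketch of a row-level check, using a made-up toy frame with the same encoding (missing = No, 5 = Yes):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for ripa_data; the values are illustrative only.
toy = pd.DataFrame({
    "gender_nonconforming": [0, 1, 0, 1],
    "gend_nc": [np.nan, 5.0, np.nan, 5.0],
})

# Turn gend_nc into a 0/1 flag: any non-missing value counts as "Yes".
flag_from_gend_nc = toy["gend_nc"].notnull().astype(int)

# True only if the two flags match on every row.
same_flag = bool((flag_from_gend_nc == toy["gender_nonconforming"]).all())
print(same_flag)
```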

In [12]:
ripa_data[ripa_data['gend']==0][['gend','perceived_gender','gender_nonconforming']]

Unnamed: 0,gend,perceived_gender,gender_nonconforming
270,0,,1
3945,0,,1
4507,0,,1
4685,0,,1
5052,0,,1
...,...,...,...
511932,0,,1
517862,0,,1
520168,0,,1
526386,0,,1


https://oag.ca.gov/sites/all/files/agweb/pdfs/ripa/stop-data-reg-final-text-110717.pdf?

So the fifth value in the 'gend' column (0) refers to gender nonconforming: in the rows shown above, perceived_gender is missing and gender_nonconforming is 1. Columns 'gender_nonconforming' and 'gend' have different counts because, according to the data documentation, "... an officer may select 'gender nonconforming' in addition to any of the other data values, if applicable."

So we can keep columns: 'gend', 'gender_nonconforming'

In [13]:
ripa_data = ripa_data.drop(['perceived_gender'], axis=1)

In [19]:
#save data set

%store ripa_data

Stored 'ripa_data' (DataFrame)


Now to check for any outliers 

In [15]:
#are there outliers in the officers' years of experience
ripa_data['exp_years'].value_counts()

1     187515
3      45353
2      41552
5      38500
4      33459
10     22624
11     15787
18     15570
9      14235
12     12325
6      11601
20     11164
15      8248
14      8226
8       7177
19      6642
13      6409
7       6139
24      6030
30      5026
17      4415
23      3779
29      3397
16      3121
22      2939
28      2617
21      1795
26      1380
25       951
27       836
32       322
48       231
31       113
33        49
45        36
49        23
50         6
34         3
36         2
37         2
35         1
Name: exp_years, dtype: int64

In [16]:
ripa_data['exp_years'].median() #let's see how the data are distributed

3.0

The median officer experience is low (3 years), yet values run as high as 50 years, so this distribution has a long right tail.

In [17]:
ripa_data['stopduration'].value_counts()

10     127385
15      63811
5       57473
20      52907
30      36609
        ...  
234         1
228         1
224         1
222         1
511         1
Name: stopduration, Length: 371, dtype: int64

In [18]:
ripa_data['stopduration'].median()

15.0

Stop duration has a long-tailed distribution: the median is 15 minutes, and few stops last longer than 200 minutes.
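The long tail can be quantified with upper quantiles, the share of stops above a cutoff, and the sample skewness. A sketch on made-up durations; the 200-minute cutoff mirrors the remark above, and the toy values are assumptions.

```python
import pandas as pd

# Made-up durations in minutes, standing in for ripa_data['stopduration'].
durations = pd.Series(
    [5, 10, 10, 10, 15, 15, 20, 30, 60, 1440], name="stopduration"
)

# Median and upper quantiles show how far the tail stretches.
summary = durations.quantile([0.5, 0.95, 0.99])

# Share of stops longer than 200 minutes.
share_over_200 = (durations > 200).mean()

# Positive skewness indicates a right (long upper) tail.
skewness = durations.skew()

print(summary)
print(share_over_200, skewness)
```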

Now let's look at some summary stats

In [20]:
ripa_data['exp_years'].describe()

count    529600.000000
mean          6.466161
std           7.187756
min           1.000000
25%           1.000000
50%           3.000000
75%          10.000000
max          50.000000
Name: exp_years, dtype: float64

In [21]:
ripa_data['stopduration'].describe()

count    529600.000000
mean         28.324239
std          50.069074
min           1.000000
25%          10.000000
50%          15.000000
75%          28.000000
max        1440.000000
Name: stopduration, dtype: float64

In [22]:
ripa_data['perceived_age'].describe()

count    529600.000000
mean         37.233801
std          13.417861
min           1.000000
25%          26.000000
50%          35.000000
75%          46.000000
max         120.000000
Name: perceived_age, dtype: float64

Some errors are present in perceived age: the maximum is 120 years old, and the minimum of 1 also looks questionable.
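A sketch of how implausible ages could be flagged for review instead of being dropped outright. The bounds `MIN_PLAUSIBLE_AGE` and `MAX_PLAUSIBLE_AGE` are assumptions chosen for illustration; the documentation does not define them.

```python
import pandas as pd

# Made-up ages standing in for ripa_data['perceived_age'].
ages = pd.Series([1, 26, 35, 46, 120], name="perceived_age")

# Hypothetical plausibility bounds (assumptions, not from the data dictionary).
MIN_PLAUSIBLE_AGE = 5
MAX_PLAUSIBLE_AGE = 100

# between() is inclusive on both ends, so negate it to get the suspect rows.
suspect = ~ages.between(MIN_PLAUSIBLE_AGE, MAX_PLAUSIBLE_AGE)
print(ages[suspect])
```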

In [23]:
ripa_data['resultkey'].value_counts()

3     129497
7     106090
2      97391
1      70342
6      47748
4      27495
10     22759
5      17813
8       9128
9       1291
12        29
11        11
13         6
Name: resultkey, dtype: int64

The most common result key is 3, which according to the documentation https://oag.ca.gov/sites/all/files/agweb/pdfs/ripa/stop-data-reg-final-text-110717.pdf?
is a citation for infraction. The second most common is 7, a field interview card completed. The 'result' column gives the text label for each resultkey.
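Since the result column labels the resultkey, a lookup table of the distinct pairs makes the mapping explicit and shows whether every key has exactly one label. A sketch on a made-up toy frame:

```python
import pandas as pd

# Toy rows standing in for ripa_data; the values are illustrative only.
toy = pd.DataFrame({
    "resultkey": [3, 7, 3, 1, 7],
    "result": [
        "Citation for infraction",
        "Field interview card completed",
        "Citation for infraction",
        "No Action",
        "Field interview card completed",
    ],
})

# Distinct (key, label) pairs, ordered by key.
lookup = (
    toy[["resultkey", "result"]]
    .drop_duplicates()
    .sort_values("resultkey")
    .reset_index(drop=True)
)

# True only if each key maps to exactly one label.
labels_per_key = toy.groupby("resultkey")["result"].nunique()
one_to_one = bool((labels_per_key == 1).all())

print(lookup)
print(one_to_one)
```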

In [30]:
ripa_data[ripa_data['resultkey']==7]

Unnamed: 0,stop_id,exp_years,date_stop,time_stop,stopduration,stop_in_response_to_cfs,officer_assignment_key,assignment,isschool,beat,...,perceived_lgbt,resultkey,result,code,race,reason_for_stop,reason_for_stopcode,action,consented,contraband
12,2458,1,2018-07-01,00:37:08,10,0,1,"Patrol, traffic enforcement, field operations",0,124,...,No,7,Field interview card completed,,White,Traffic Violation,54141,,,
20,2469,1,2018-07-01,01:45:00,10,0,1,"Patrol, traffic enforcement, field operations",0,122,...,No,7,Field interview card completed,,White,Traffic Violation,54159,,,
21,2471,5,2018-07-01,01:53:43,20,0,5,Roadblock or DUI sobriety checkpoint,0,834,...,No,7,Field interview card completed,,Black/African American,Reasonable Suspicion,13174,,,
23,2474,3,2018-07-01,02:16:33,15,0,1,"Patrol, traffic enforcement, field operations",0,441,...,No,7,Field interview card completed,,Hispanic/Latino/a,Reasonable Suspicion,24067,Handcuffed or flex cuffed,,
24,2475,4,2018-07-01,02:05:07,15,0,1,"Patrol, traffic enforcement, field operations",0,122,...,No,7,Field interview card completed,,White,Reasonable Suspicion,53072,Handcuffed or flex cuffed,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
529515,478588,4,2021-09-30,16:33:13,200,0,1,"Patrol, traffic enforcement, field operations",0,531,...,No,7,Field interview card completed,,Black/African American,Traffic Violation,54644.0,,,
529549,478620,1,2021-09-30,22:40:19,20,0,1,"Patrol, traffic enforcement, field operations",0,627,...,No,7,Field interview card completed,,White,Reasonable Suspicion,35150.0,,,
529587,478681,1,2021-09-30,23:45:00,15,0,1,"Patrol, traffic enforcement, field operations",0,623,...,No,7,Field interview card completed,,Hispanic/Latino/a,Reasonable Suspicion,52687.0,Handcuffed or flex cuffed,,
529597,478867,1,2021-09-30,15:50:00,20,0,1,"Patrol, traffic enforcement, field operations",0,611,...,No,7,Field interview card completed,,White,Reasonable Suspicion,32022.0,,,


In [39]:
ripa_data['reason_for_stop'].value_counts() #most stops due to reasonable suspicion

Reasonable Suspicion                                                                                                 271553
Traffic Violation                                                                                                    231637
Investigation to determine whether the person was truant                                                               7551
Known to be on Parole / Probation / PRCS / Mandatory Supervision                                                       6792
Consensual Encounter resulting in a search                                                                             6705
Knowledge of outstanding arrest warrant/wanted person                                                                  5325
Determine whether the student violated school policy                                                                     30
Possible conduct warranting discipline under Education Code sections 48900, 48900.2, 48900.3, 48900.4 and 48900.7         7
Name: re

In [40]:

ripa_data['action'].value_counts() #most stops resulted in no action, fewer stops resulted in physical contact

None                                                    325468
Handcuffed or flex cuffed                                56108
Curbside detention                                       55353
Search of person was conducted                           25531
Patrol car detention                                     23489
Search of property was conducted                         10678
Person removed from vehicle by order                      8136
Person photographed                                       4907
Asked for consent to search person                        4768
Asked for consent to search property                      3864
Field sobriety test conducted                             2826
Physical or Vehicle contact                               2704
Property was seized                                       2298
Vehicle impounded                                         2251
Firearm pointed at person                                  497
Person removed from vehicle by physical contact        

In [41]:
ripa_data['result'].value_counts()

Citation for infraction                                                      129497
Field interview card completed                                               106090
No Action                                                                     70342
Custodial Arrest without warrant                                              47748
In-field cite and release                                                     27495
Psychiatric hold                                                              22759
Custodial Arrest pursuant to outstanding warrant                              17810
Noncriminal transport or caretaking transport                                  9128
Contacted parent/legal guardian or other person responsible for the minor      1291
Referral to school administrator                                                 29
Contacted U.S. Department of Homeland Security                                   14
Referral to school counselor or other support staff                         

In [43]:
ripa_data['time_stop'].value_counts() #a lot of stops are logged at exactly 4pm; this is only a cursory glance since time is not binned by hour

16:00:00    1498
15:00:00    1280
08:00:00    1262
10:00:00    1236
09:00:00    1227
            ... 
02:24:17       1
19:54:21       1
07:06:24       1
04:29:01       1
01:21:28       1
Name: time_stop, Length: 81330, dtype: int64
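Because exact timestamps are nearly unique, counting stops per hour gives a clearer picture than counting raw time_stop values. A sketch that assumes the times are strings in HH:MM:SS format, as they appear in the outputs above; the toy values are made up.

```python
import pandas as pd

# Made-up times standing in for ripa_data['time_stop'].
times = pd.Series(
    ["16:00:00", "16:42:10", "00:01:37", "08:00:00", "16:05:00"],
    name="time_stop",
)

# Parse the strings and keep only the hour of day (0 to 23).
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour

# Number of stops in each hour, ordered by hour.
stops_per_hour = hours.value_counts().sort_index()
print(stops_per_hour)
```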

In [45]:
ripa_data['date_stop'].describe()

count         529600
unique          1188
top       2020-02-12
freq             799
Name: date_stop, dtype: object
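describe() treats date_stop as text, so it does not report the date range. Parsing the column first gives the first and last day and the length of the span. A sketch that assumes YYYY-MM-DD strings, as in the outputs above; the toy values are made up.

```python
import pandas as pd

# Made-up dates standing in for ripa_data['date_stop'].
dates = pd.Series(
    ["2018-07-01", "2020-02-12", "2020-02-12", "2021-09-30"],
    name="date_stop",
)

parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# First and last day, and the number of calendar days they span (inclusive).
first_day, last_day = parsed.min(), parsed.max()
span_days = (last_day - first_day).days + 1

# Day with the most stops.
busiest_day = parsed.value_counts().idxmax()

print(first_day.date(), last_day.date(), span_days, busiest_day.date())
```

Comparing `span_days` with the number of unique dates shows whether any calendar days have no recorded stops.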