This notebook writes CSV files of ratio-based summary statistics, broken down by race, for each county in Texas.

Definition of summary statistics:

All statistics use speeding-only records (stops with a single speeding violation or multiple speeding violations), pooled over all years (2006-2017). A denotes one race group.

1. pct_spd_over_lmt:

(# of violation == 'speeding over limit' & race == A) / (# of race == A)

2. pct_srch:

(# of search_conducted == 1 & race == A) / (# of race == A)

3. pct_contraband_srch: 

(# of search_conducted == 1 & race == A & contraband_found == 1) / (# of search_conducted == 1 & race == A)

4. pct_stopped:

(# of race == A) / (# of all records)
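
The four definitions above can be illustrated on a handful of made-up records. This is only a sketch: the column names follow the data used in this notebook, while the toy values and the helper `summary_stats` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy records; column names mirror the notebook's data, values are made up.
toy = pd.DataFrame({
    'subject_race':     ['white', 'white', 'white', 'white', 'black', 'black'],
    'violation':        ['speeding over limit', 'unsafe speed',
                         'speeding over limit', 'speeding over limit',
                         'speeding over limit', 'unsafe speed'],
    'search_conducted': [1, 0, 1, 0, 1, 0],
    'contraband_found': [1, 0, 0, 0, 0, 0],
})

def summary_stats(df, race):
    """Return the four ratios for one race (hypothetical helper)."""
    is_race = df['subject_race'] == race
    over_limit = is_race & (df['violation'] == 'speeding over limit')
    searched = is_race & (df['search_conducted'] == 1)
    found = searched & (df['contraband_found'] == 1)
    return {
        'pct_spd_over_lmt': over_limit.sum() / is_race.sum(),
        'pct_srch': searched.sum() / is_race.sum(),
        'pct_contraband_srch': found.sum() / searched.sum(),
        'pct_stopped': is_race.sum() / len(df),
    }

white_stats = summary_stats(toy, 'white')
print(white_stats)
```

Note that pct_contraband_srch uses the searched stops as its denominator, so it is undefined for a race with no searches.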

Output file format:

1. cite_rate_spd_cat.csv
For each county and each of the 4 speeding types below: the citation rate (for 'speeding over limit', which has a low citation rate) or the share of stops without a citation (for the other 3 types, which have high citation rates):

['unsafe speed','fail to control speed','speeding-10% or more above posted speed','speeding over limit']

columns:

asian/pacific islander | black | hispanic | white | county | type

definition: 

**'cite_rate_speeding over limit':**

citation rate = (# of race == A & citation_issued == 1) / (# of race == A)

**Other 3 types:**

e.g., cite_rate_unsafe speed:

citation rate = (# of race == A & citation_issued == 0) / (# of race == A)

Note that for the other 3 types, cite_rate_ followed by the violation type is actually the share of stops that did not result in a citation.
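
A minimal sketch of this asymmetric definition, on made-up records. The helper `cite_rate` and the constant `LOW_CITATION_TYPE` are names introduced here for illustration only.

```python
import pandas as pd

LOW_CITATION_TYPE = 'speeding over limit'  # hypothetical constant name

def cite_rate(df, speeding_type, race):
    """Share of cited stops for the low-citation type, else share of uncited stops."""
    sub = df[(df['violation'] == speeding_type) & (df['subject_race'] == race)]
    # Count citations for 'speeding over limit', non-citations for the other types.
    target = 1 if speeding_type == LOW_CITATION_TYPE else 0
    return (sub['citation_issued'] == target).sum() / len(sub)

# Made-up records for a single race.
toy = pd.DataFrame({
    'subject_race':    ['white'] * 5,
    'violation':       ['speeding over limit'] * 2 + ['unsafe speed'] * 3,
    'citation_issued': [1, 0, 1, 1, 0],
})

over_limit_rate = cite_rate(toy, 'speeding over limit', 'white')  # cited share
unsafe_rate = cite_rate(toy, 'unsafe speed', 'white')             # uncited share
print(over_limit_rate, unsafe_rate)
```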

2. spd_other_pct.csv:

columns:

asian/pacific islander | black | hispanic | white | county | type

(type: pct_spd_over_lmt, pct_srch, pct_contraband_srch, pct_stopped)
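
Both files store one row per (county, type) with one column per race. The sketch below shows two ways such a table might be consumed downstream: selecting one statistic, or reshaping to one row per (county, type, race). The rows are made up and the variable names are assumptions.

```python
import pandas as pd

race_cols = ['asian/pacific islander', 'black', 'hispanic', 'white']

# Made-up rows in the documented layout: one row per (county, type).
out = pd.DataFrame(
    [[0.10, 0.20, 0.30, 0.40, 'County X', 'pct_srch'],
     [0.02, 0.08, 0.25, 0.65, 'County X', 'pct_stopped'],
     [0.15, 0.25, 0.35, 0.45, 'County Y', 'pct_srch']],
    columns=race_cols + ['county', 'type'],
)

# Keep a single statistic and index it by county.
pct_srch = out.loc[out['type'] == 'pct_srch'].set_index('county')[race_cols]

# Long form: one row per (county, type, race).
long = out.melt(id_vars=['county', 'type'], var_name='race', value_name='value')

print(pct_srch)
print(long.head())
```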


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm



In [None]:
from google.colab import drive
import os 
drive.mount('/gdrive/')

Mounted at /gdrive/


In [None]:
# get list of counties
filename = '/gdrive/MyDrive/traffic_stop/year_data_speeding_only/traffic_' + str(2016) + '.parquet'
df = pd.read_parquet(filename, engine = 'pyarrow')
county_lst = list(df['county_name'].unique())

In [None]:
df.head()

Unnamed: 0,date,time,county_name,subject_race,subject_sex,violation,citation_issued,contraband_found,contraband_drugs,contraband_weapons,search_conducted,search_vehicle,lat,lng,all_violation,speeding_only,county_type,sunset,sunrise,dawn,dusk,stop_time,light_cat,holiday
0,2016-01-01,00:04:00,Hays County,white,female,speeding over limit,0,0,0,0,0,0,29.955355,-97.87788,[speeding over limit],speeding-1,Metropolitan,2016-01-01 17:42:32,2016-01-01 07:27:28,2016-01-01 07:00:42,2016-01-01 18:09:18,2016-01-01 00:04:00,dark,1
1,2016-01-01,00:04:00,Wilson County,hispanic,female,speeding-10% or more above posted speed,1,0,0,0,0,0,29.164392,-98.177765,[speeding-10% or more above posted speed],speeding-1,Metropolitan,2016-01-01 17:45:32,2016-01-01 07:26:52,2016-01-01 07:00:21,2016-01-01 18:12:03,2016-01-01 00:04:00,dark,1
2,2016-01-01,00:06:00,Tarrant County,white,male,speeding over limit,0,0,0,0,0,0,32.717438,-97.38813,[speeding over limit],speeding-1,Metropolitan,2016-01-01 17:34:05,2016-01-01 07:32:01,2016-01-01 07:04:19,2016-01-01 18:01:47,2016-01-01 00:06:00,dark,1
3,2016-01-01,00:09:00,Gregg County,white,male,speeding over limit,0,0,0,0,0,0,32.502914,-94.710304,[speeding over limit],speeding-1,Metropolitan,2016-01-01 17:23:53,2016-01-01 07:20:47,2016-01-01 06:53:09,2016-01-01 17:51:30,2016-01-01 00:09:00,dark,1
4,2016-01-01,00:09:00,Jackson County,hispanic,female,speeding over limit,0,0,0,0,0,0,28.945862,-96.71141,[speeding over limit],speeding-1,Non core,2016-01-01 17:40:09,2016-01-01 07:20:31,2016-01-01 06:54:03,2016-01-01 18:06:37,2016-01-01 00:09:00,dark,1


### race distribution in a specific speeding type

In [None]:
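def race_distribution(speeding_type, res_df):
  """Append one row per county to res_df with the by-race rate for speeding_type.

  For 'speeding over limit' the rate is the share of stops with a citation;
  for every other type it is the share of stops without a citation.
  Relies on the global county_lst.
  """
  print(speeding_type)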
  years = list(range(2006, 2018))
  for county in county_lst:
    df_cite = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])
    df_all = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])
    for year in years:
      file_name = '/gdrive/MyDrive/traffic_stop/year_data_speeding_only/traffic_' + str(year) + '.parquet'
      df = pd.read_parquet(file_name, engine = 'pyarrow')

      # keep this county's speeding-only stops (a single speeding entry or repeated entries) of the given type
      df = df.loc[df['county_name'] == county,:] 
      df = df.loc[df['speeding_only'].isin(['speeding-repeated_entries','speeding-1']),:]
      df = df.loc[df['violation'] == speeding_type,:]

      for race in ['white','hispanic','black','asian/pacific islander']:
        df_all[race] += len(df.loc[df['subject_race'] == race,:])
        if speeding_type == 'speeding over limit':
          df_cite[race] += len(df.loc[(df['citation_issued'] == 1) & (df['subject_race'] == race),:])
        else:
          df_cite[race] += len(df.loc[(df['citation_issued'] == 0) & (df['subject_race'] == race),:])
    
    cite_pct = df_cite.divide(df_all)
    cite_pct.sort_index(inplace = True)

    df1 = pd.DataFrame(dict(zip(list(cite_pct.index),list(cite_pct.values))),index=[0])
    df1['county'] = county
    df1['type'] = 'cite_rate_' + speeding_type
    
    res_df = pd.concat([res_df, df1], ignore_index = True)

  return res_df
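
The function above re-reads every yearly file once per county. A possible alternative, sketched below under stated assumptions, reads each year once and lets `groupby` count all counties and races together. The function name `cite_rate_table` and its `yearly_frames` argument (any iterable of yearly DataFrames, here two made-up ones instead of the parquet files) are hypothetical.

```python
import pandas as pd

RACES = ['asian/pacific islander', 'black', 'hispanic', 'white']

def cite_rate_table(yearly_frames, speeding_type):
    """County x race table of the rate defined above, in one pass per year."""
    target = 1 if speeding_type == 'speeding over limit' else 0
    pieces = []
    for df in yearly_frames:
        keep = (df['speeding_only'].isin(['speeding-repeated_entries', 'speeding-1'])
                & (df['violation'] == speeding_type))
        sub = df.loc[keep].copy()
        sub['hit'] = (sub['citation_issued'] == target).astype(int)
        # 'sum' is the numerator, 'count' the denominator, per county and race.
        pieces.append(sub.groupby(['county_name', 'subject_race'])['hit']
                         .agg(['sum', 'count']))
    total = pd.concat(pieces).groupby(level=['county_name', 'subject_race']).sum()
    # Combinations with no stops are absent and become NaN after unstack.
    return (total['sum'] / total['count']).unstack().reindex(columns=RACES)

# Two made-up "years" of data.
y1 = pd.DataFrame({
    'county_name':     ['A County', 'A County', 'A County', 'B County'],
    'subject_race':    ['white', 'white', 'black', 'white'],
    'violation':       ['unsafe speed'] * 4,
    'citation_issued': [1, 0, 1, 0],
    'speeding_only':   ['speeding-1'] * 4,
})
y2 = pd.DataFrame({
    'county_name':     ['A County', 'B County'],
    'subject_race':    ['white', 'hispanic'],
    'violation':       ['unsafe speed'] * 2,
    'citation_issued': [0, 0],
    'speeding_only':   ['speeding-1', 'speeding-repeated_entries'],
})

rates = cite_rate_table([y1, y2], 'unsafe speed')
print(rates)
```

In the notebook's setting, `yearly_frames` could be a generator that calls `pd.read_parquet` once per year, so each file is loaded 12 times in total rather than once per county.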

In [None]:
column_names = ['asian/pacific islander', 'black', 'hispanic', 'white']
res_df = pd.DataFrame(columns = column_names)
res_df

Unnamed: 0,asian/pacific islander,black,hispanic,white


In [None]:
vio_lst = ['unsafe speed','fail to control speed','speeding-10% or more above posted speed','speeding over limit']
for vio in vio_lst:
  res_df = race_distribution(vio, res_df)

unsafe speed
fail to control speed
speeding-10% or more above posted speed
speeding over limit


In [None]:
# NaN entries are presumably cases where the denominator is 0 (no stops of that race, county, and speeding type)
res_df

Unnamed: 0,asian/pacific islander,black,hispanic,white,county,type
0,0.000000,0.000000,0.073529,0.017167,Hays County,cite_rate_unsafe speed
1,0.000000,0.000000,0.008475,0.010563,Wilson County,cite_rate_unsafe speed
2,,0.166667,0.333333,0.058824,Tarrant County,cite_rate_unsafe speed
3,0.000000,0.011236,0.000000,0.023750,Gregg County,cite_rate_unsafe speed
4,0.000000,0.000000,0.052632,0.051724,Jackson County,cite_rate_unsafe speed
...,...,...,...,...,...,...
1011,0.000000,0.014493,0.000926,0.000862,Hansford County,cite_rate_speeding over limit
1012,0.028571,0.007042,0.020718,0.012461,Lynn County,cite_rate_speeding over limit
1013,,0.000000,0.036810,0.008818,Cochran County,cite_rate_speeding over limit
1014,,0.000000,0.054054,0.023256,Loving County,cite_rate_speeding over limit
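
The NaN entries are consistent with a zero denominator: in pandas, dividing an integer count of 0 by 0 gives NaN rather than raising an error. A small check with made-up counts:

```python
import pandas as pd

races = ['white', 'hispanic', 'black', 'asian/pacific islander']

# Made-up counts; the last race has no stops at all.
cited = pd.Series([1, 2, 1, 0], index=races)
stops = pd.Series([17, 6, 6, 0], index=races)

rate = cited.divide(stops)
print(rate)

# Races whose rate is undefined because there were no stops.
no_stops = stops.index[stops == 0].tolist()
print(no_stops)
```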


In [None]:
file_name = 'cite_rate_spd_cat.csv'
path = "/gdrive/MyDrive/traffic_stop/TX-county/summarystat/"
save_path = path + file_name
os.makedirs(path, exist_ok = True)  # create the folder (and parents) if missing; no error if it already exists
res_df.to_csv(save_path, index = False)

In [None]:
# check implementation
for year in list(range(2006, 2018)):
  file_name = '/gdrive/MyDrive/traffic_stop/year_data_speeding_only/traffic_' + str(year) + '.parquet'
  df = pd.read_parquet(file_name, engine = 'pyarrow')
  print(year)
  df = df.loc[(df['county_name'] == 'Tarrant County') & (df['violation'] == 'unsafe speed') & (df['speeding_only'].isin(['speeding-repeated_entries','speeding-1'])),:]
  print(df[['subject_race','citation_issued']])

2006
        subject_race  citation_issued
1056662        white                1
2007
       subject_race  citation_issued
960631        white                1
961116        white                1
2008
Empty DataFrame
Columns: [subject_race, citation_issued]
Index: []
2009
Empty DataFrame
Columns: [subject_race, citation_issued]
Index: []
2010
Empty DataFrame
Columns: [subject_race, citation_issued]
Index: []
2011
        subject_race  citation_issued
82819       hispanic                1
945581         black                0
1076848     hispanic                0
1076850        white                1
2012
Empty DataFrame
Columns: [subject_race, citation_issued]
Index: []
2013
       subject_race  citation_issued
825249        white                0
2014
       subject_race  citation_issued
412904        black                1
775054        white                1
2015
       subject_race  citation_issued
193400        white                1
254745        white                1
254880   

### % speeding over limit, % searched, % contraband_found, % race stopped

In [None]:
def race_pct_distribution(pct_names, res_df):
  years = list(range(2006, 2018))
  print(pct_names)
  for county in county_lst:
    df_sub1 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])
    df_all1 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])

    df_sub2 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])
    df_all2 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])

    df_sub3 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])
    df_all3 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])

    df_sub4 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])
    df_all4 = pd.Series([0]*4, index = ['white','hispanic','black','asian/pacific islander'])

    for year in years:
      file_name = '/gdrive/MyDrive/traffic_stop/year_data_speeding_only/traffic_' + str(year) + '.parquet'
      df = pd.read_parquet(file_name, engine = 'pyarrow')
      df = df.loc[df['county_name'] == county,:]

      # loop through percentages
      for pct_name in pct_names:
        # share of 'speeding over limit' among speeding-only stops (a single speeding entry or repeated entries)
        if pct_name == 'pct_spd_over_lmt':
          df1 = df.loc[df['speeding_only'].isin(['speeding-repeated_entries','speeding-1'])]

          for race in ['white','hispanic','black','asian/pacific islander']:
            df_all1[race] += len(df1.loc[df1['subject_race'] == race,:])
            df_sub1[race] += len(df1.loc[((df1['violation'] == 'speeding over limit') & (df1['subject_race'] == race)),:])

        # for % of search_conducted (2017 records are skipped for this statistic)
        if pct_name == 'pct_srch':
          if year != 2017:
            for race in ['white','hispanic','black','asian/pacific islander']:
              df_all2[race] += len(df.loc[df['subject_race'] == race,:])
              df_sub2[race] += len(df.loc[((df['search_conducted'] == 1) & (df['subject_race'] == race)),:])

        # for % of contraband_found
        if pct_name == 'pct_contraband_srch':
          for race in ['white','hispanic','black','asian/pacific islander']:
            df_all3[race] += len(df.loc[((df['search_conducted'] == 1) & (df['subject_race'] == race)),:])
            df_sub3[race] += len(df.loc[((df['search_conducted'] == 1) & (df['subject_race'] == race) & (df['contraband_found'] == 1)),:])

        if pct_name == 'pct_stopped':
          for race in ['white','hispanic','black','asian/pacific islander']:
            df_all4[race] += len(df)
            df_sub4[race] += len(df.loc[df['subject_race'] == race,:])

    sub_pct1 = df_sub1.divide(df_all1)
    sub_pct1.sort_index(inplace = True)
    df1 = pd.DataFrame(dict(zip(list(sub_pct1.index),list(sub_pct1.values))),index=[0])
    df1['county'] = county
    df1['type'] = 'pct_spd_over_lmt'

    sub_pct2 = df_sub2.divide(df_all2)
    sub_pct2.sort_index(inplace = True)
    df2 = pd.DataFrame(dict(zip(list(sub_pct2.index),list(sub_pct2.values))),index=[0])
    df2['county'] = county
    df2['type'] = 'pct_srch'

    sub_pct3 = df_sub3.divide(df_all3)
    sub_pct3.sort_index(inplace = True)
    df3 = pd.DataFrame(dict(zip(list(sub_pct3.index),list(sub_pct3.values))),index=[0])
    df3['county'] = county
    df3['type'] = 'pct_contraband_srch'

    sub_pct4 = df_sub4.divide(df_all4)
    sub_pct4.sort_index(inplace = True)
    df4 = pd.DataFrame(dict(zip(list(sub_pct4.index),list(sub_pct4.values))),index=[0])
    df4['county'] = county
    df4['type'] = 'pct_stopped'

    res_df = pd.concat([res_df, df1, df2, df3, df4], ignore_index = True)

  return res_df
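
The four blocks that turn a numerator and a denominator into a one-row DataFrame differ only in their inputs and the type label. One way to factor that out is sketched below; `ratio_row` is a hypothetical helper, not part of the notebook, and the counts are made up.

```python
import pandas as pd

def ratio_row(sub, total, county, stat_type):
    """Build one output row: by-race ratio plus county and type labels."""
    pct = sub.divide(total).sort_index()          # races in alphabetical order
    row = pct.to_frame().T.reset_index(drop=True)  # one row, one column per race
    row['county'] = county
    row['type'] = stat_type
    return row

races = ['white', 'hispanic', 'black', 'asian/pacific islander']
sub = pd.Series([3, 1, 1, 0], index=races)    # made-up numerators
total = pd.Series([4, 2, 2, 1], index=races)  # made-up denominators

row = ratio_row(sub, total, 'Hays County', 'pct_srch')
print(row)
```

With such a helper, each of the four statistics would need a single call inside the county loop before the rows are concatenated.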

In [None]:
column_names = ['asian/pacific islander', 'black', 'hispanic', 'white']
res_df1 = pd.DataFrame(columns = column_names)
pct_names = ['pct_spd_over_lmt','pct_srch','pct_contraband_srch','pct_stopped']
res_df1 = race_pct_distribution(pct_names, res_df1)

['pct_spd_over_lmt', 'pct_srch', 'pct_contraband_srch', 'pct_stopped']


In [None]:
res_df1.head()

Unnamed: 0,asian/pacific islander,black,hispanic,white,county,type
0,0.476044,0.56265,0.582365,0.601788,Hays County,pct_spd_over_lmt
1,0.004008,0.00635,0.005467,0.005281,Hays County,pct_srch
2,0.333333,0.259259,0.127451,0.066421,Hays County,pct_contraband_srch
3,0.019996,0.056217,0.250528,0.673258,Hays County,pct_stopped
4,0.555944,0.638695,0.6236,0.656829,Wilson County,pct_spd_over_lmt


In [None]:
file_name = 'spd_other_pct.csv'
path = "/gdrive/MyDrive/traffic_stop/TX-county/summarystat/"
save_path = path + file_name
res_df1.to_csv(save_path, index = False)

### ignore

In [None]:
cluster_1 = pd.DataFrame([['a', 1], ['b', 2]],
                columns=['letter  ', 'number'])
cluster = pd.concat([cluster_1,cluster_1,cluster_1,cluster_1], ignore_index = True)

In [None]:
file_name = '/gdrive/MyDrive/traffic_stop/year_data_speeding_only/traffic_' + str(2016) + '.parquet'
df = pd.read_parquet(file_name, engine = 'pyarrow')

In [None]:
df['violation'].value_counts()

speeding over limit                                                                                                 552218
speeding-10% or more above posted speed                                                                             226367
fail to control speed                                                                                                 8500
unsafe speed                                                                                                          5901
speeding-school zone                                                                                                  2756
speeding - zoned (inclement weather, signs posted, military zone, beach)                                               688
speeding over limit |speeding-10% or more above posted speed                                                            53
speed under minimum                                                                                                     30
any speedometer 