In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
import statsmodels
from numpy.random import seed

Function for z-tests: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html

Function for one-way chi-squared test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare

The DataFrame used in this hypothesis testing stage is a sample from the cleaned "NYPD Complaint Data Historic" dataset, which was downloaded from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i. 

In [2]:
nyc = pd.read_csv('nyc_crime_sample_HT.csv', index_col=0)

On load, pandas warns that columns 6 and 10 contain mixed data types. Neither column is used in this analysis, so this is not a concern.
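If the mixed-type warning ever needs to be avoided, one option is to declare the dtypes of the affected columns when loading. The sketch below uses a small in-memory CSV as a stand-in for `nyc_crime_sample_HT.csv`; the rows are invented for illustration, and reading both columns as strings is an assumption about how they would be handled.

```python
import io
import pandas as pd

# Invented stand-in rows; the real file has 127 columns.
csv_text = (
    "idx,RPT_DT,PD_DESC\n"
    "0,01/02/2015,MARIJUANA POSSESSION\n"
    "1,20150103,101\n"
)

# Declaring dtypes up front means pandas does not have to infer a type
# for columns whose values look like a mix of text and numbers.
sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={"RPT_DT": str, "PD_DESC": str},
)
print(sample.dtypes)
```

With explicit dtypes, the value `101` in `PD_DESC` stays the string `"101"` instead of being parsed as a number.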

In [3]:
nyc.columns[6]

'RPT_DT'

In [4]:
nyc.columns[10]

'PD_DESC'

In [5]:
nyc.shape

(646388, 127)

The first null hypothesis is that cannabis crimes are as likely as non-cannabis crimes to have their suspect's race reported by the arresting NYPD officer.

In the sample, there are 22,253 cannabis arrests out of 646,388 NYC crimes. 

In [6]:
nyc['cannabis_crime'].value_counts()

0    624135
1     22253
Name: cannabis_crime, dtype: int64

In [7]:
observed_cann_pctg_s = round(round(22253/646388, 3)*100, 2)
observed_cann_pctg_s

3.4

So cannabis crimes account for 3.4% of overall crimes in this sample. The cleaned total population data set of New York City crimes between 2006 and 2018 has 220,306 cannabis crimes out of 6,463,881 total crimes. So 3.4% of overall crimes are cannabis crimes in the population as well.

In [8]:
observed_cann_pctg = round(round(220306/6463881, 3)*100, 2)
observed_cann_pctg

3.4

In [9]:
cann = nyc[nyc.cannabis_crime == 1]

In [10]:
non_cann = nyc[nyc.cannabis_crime == 0]

Definition of stats for cannabis crime group:

In [11]:
n_cann = len(cann)
n_cann

22253

In [12]:
cann_race = cann['SUSP_RACE_cleaned'].value_counts()
cann_race

unknown                           18797
BLACK                              1764
WHITE HISPANIC                      965
BLACK HISPANIC                      402
WHITE                               258
ASIAN / PACIFIC ISLANDER             57
AMERICAN INDIAN/ALASKAN NATIVE       10
Name: SUSP_RACE_cleaned, dtype: int64

In [13]:
race_reported_c = cann.SUSP_RACE_cleaned != 'unknown'

In [14]:
cann['race_reported'] = race_reported_c.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [15]:
cann['race_reported'].value_counts()

0    18797
1     3456
Name: race_reported, dtype: int64
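The SettingWithCopyWarning shown above appears because `cann` is a filtered slice of `nyc`. A minimal sketch of one common way to avoid it is to take an explicit copy before adding columns. The DataFrame below is invented toy data, and the `_demo` names are hypothetical.

```python
import pandas as pd

# Invented rows standing in for the `nyc` DataFrame.
nyc_demo = pd.DataFrame({
    "cannabis_crime": [1, 1, 0, 1],
    "SUSP_RACE_cleaned": ["unknown", "BLACK", "WHITE", "unknown"],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and does not touch the original.
cann_demo = nyc_demo[nyc_demo.cannabis_crime == 1].copy()
cann_demo["race_reported"] = (cann_demo.SUSP_RACE_cleaned != "unknown").astype(int)
print(cann_demo["race_reported"].tolist())
```

The original `nyc_demo` is left without the new column, which is usually the intent when working on a subset.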

As reported in previous notebooks of this project, the population percentage of cannabis crimes that have their suspect's race reported is 15.8%. In the sample, it is 15.5%.

In [16]:
round(cann['race_reported'].value_counts(normalize=True), 3)*100

0    84.5
1    15.5
Name: race_reported, dtype: float64

In [17]:
cann_race_reported = np.mean(cann['race_reported'])
cann_race_reported

0.155304902709747

In [18]:
cann_std = np.std(cann['race_reported'])
cann_std

0.3621950992270472

Definition of stats for non-cannabis crime group:

In [19]:
n_non_cann = len(non_cann)
n_non_cann

624135

In [20]:
non_cann['SUSP_RACE_cleaned'].value_counts()

unknown                           386090
BLACK                             119699
WHITE HISPANIC                     54528
WHITE                              36163
BLACK HISPANIC                     16506
ASIAN / PACIFIC ISLANDER           10159
AMERICAN INDIAN/ALASKAN NATIVE       988
Name: SUSP_RACE_cleaned, dtype: int64

In [21]:
race_reported_nc = non_cann.SUSP_RACE_cleaned != 'unknown'

In [22]:
non_cann['race_reported'] = race_reported_nc.astype(int)

As reported in previous notebooks of this project, the population percentage of non-cannabis crimes that have their suspect's race reported is 38.1%. In the sample, it is also 38.1%.

In [23]:
non_cann['race_reported'].value_counts()

0    386090
1    238045
Name: race_reported, dtype: int64

In [24]:
non_cann_race_reported = np.mean(non_cann['race_reported'])
non_cann_race_reported

0.38139985740264526

In [25]:
non_cann_std = np.std(non_cann['race_reported'])
non_cann_std

0.48573038424137105

In [26]:
scipy.stats.ttest_ind_from_stats(non_cann_race_reported, non_cann_std, n_non_cann, cann_race_reported, cann_std, n_cann)

Ttest_indResult(statistic=68.75859586807616, pvalue=0.0)

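The null hypothesis is rejected; the p-value is small enough to underflow to 0.0. In this sample, the suspect's race is reported for 38.1% of non-cannabis crimes but only 15.5% of cannabis crimes. Although the reason is outside the scope of this analysis, one cannot say that NYPD officers are equally likely to record the suspect's race for cannabis crimes as they are for non-cannabis crimes.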
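Because `race_reported` is a 0/1 indicator, the same comparison can also be written as a pooled two-proportion z-test on the raw counts. The sketch below is an illustration rather than the test run above; the helper name `two_proportion_ztest` is hypothetical, and only the standard library is used.

```python
import math

def two_proportion_ztest(successes_1, n_1, successes_2, n_2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / std_err
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the cells above: race reported for 238,045 of 624,135
# non-cannabis crimes and for 3,456 of 22,253 cannabis crimes.
z, p_value = two_proportion_ztest(238045, 624135, 3456, 22253)
print(z, p_value)
```

The z statistic lands close to the t statistic reported above, which is expected at sample sizes this large.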

Null hypothesis: African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

In [27]:
african_american_c = cann.SUSP_RACE_cleaned == 'BLACK'

In [28]:
cann['african_american'] = african_american_c.astype(int)

In [30]:
cann['african_american'].value_counts()

0    20489
1     1764
Name: african_american, dtype: int64

In [31]:
cann_af_am = np.mean(cann['african_american'])
cann_af_am

0.07927021075810003

In [33]:
cann_af_am_std = np.std(cann['african_american'])
cann_af_am_std

0.27016003487647783

In [34]:
african_american_nc = non_cann.SUSP_RACE_cleaned == 'BLACK'

In [35]:
non_cann['african_american'] = african_american_nc.astype(int)

In [36]:
non_cann['african_american'].value_counts()

0    504436
1    119699
Name: african_american, dtype: int64

In [38]:
non_cann_af_am = np.mean(non_cann['african_american'])
non_cann_af_am

0.19178382881908562

In [39]:
non_cann_af_am_std = np.std(non_cann['african_american'])
non_cann_af_am_std

0.393703939300504

In [40]:
scipy.stats.ttest_ind_from_stats(non_cann_af_am, non_cann_af_am_std, n_non_cann, cann_af_am, cann_af_am_std, n_cann)

Ttest_indResult(statistic=42.277969963971316, pvalue=0.0)

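Null hypothesis rejected. African-Americans arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes: in this sample, a Black suspect is recorded for 7.9% of cannabis crimes versus 19.2% of non-cannabis crimes. Both percentages include crimes with an unknown suspect race in the denominator.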
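The indicator, mean, standard deviation, and t-test steps used above repeat for every racial category that follows. As a sketch of how they could be bundled, the hypothetical helper `compare_share` below wraps them in one call; the toy DataFrames are invented and only demonstrate usage.

```python
import numpy as np
import pandas as pd
import scipy.stats

def compare_share(group_a, group_b, column, value):
    """t-test on the share of rows where `column` equals `value` in two groups."""
    a = (group_a[column] == value).astype(int).to_numpy()
    b = (group_b[column] == value).astype(int).to_numpy()
    # np.std uses the population standard deviation (ddof=0) by default,
    # matching the np.std calls in the cells above.
    return scipy.stats.ttest_ind_from_stats(
        np.mean(a), np.std(a), len(a),
        np.mean(b), np.std(b), len(b),
    )

# Invented toy data, only to show the call.
demo_a = pd.DataFrame({"SUSP_RACE_cleaned": ["BLACK"] * 30 + ["WHITE"] * 70})
demo_b = pd.DataFrame({"SUSP_RACE_cleaned": ["BLACK"] * 10 + ["WHITE"] * 90})
result = compare_share(demo_a, demo_b, "SUSP_RACE_cleaned", "BLACK")
print(result.statistic, result.pvalue)
```

With the notebook's data, a call such as `compare_share(non_cann, cann, 'SUSP_RACE_cleaned', 'BLACK')` would follow the same steps as the cells above.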

Null hypothesis: African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes. This time the test uses a separate sample of NYC crimes restricted to those with the suspect's race reported.

In [41]:
nyc_rr = pd.read_csv('nyc_crime_race_reported_sample_HT.csv', index_col=0)

In [42]:
len(nyc_rr)

241923

In [43]:
cann_rr = nyc_rr[nyc_rr.cannabis_crime == 1]

In [44]:
non_cann_rr = nyc_rr[nyc_rr.cannabis_crime == 0]

In [45]:
n_cann_rr = len(cann_rr)
n_cann_rr

3536

In [46]:
n_non_cann_rr = len(non_cann_rr)
n_non_cann_rr

238387

In [47]:
cann_rr['SUSP_RACE_cleaned'].value_counts()

BLACK                             1846
WHITE HISPANIC                     924
BLACK HISPANIC                     403
WHITE                              263
ASIAN / PACIFIC ISLANDER            90
AMERICAN INDIAN/ALASKAN NATIVE      10
Name: SUSP_RACE_cleaned, dtype: int64

In [48]:
african_american_c_rr = cann_rr.SUSP_RACE_cleaned == 'BLACK'

In [49]:
cann_rr['af_am'] = african_american_c_rr.astype(int)

In [50]:
cann_rr['af_am'].value_counts()

1    1846
0    1690
Name: af_am, dtype: int64

In [51]:
cann_af_am_rr = np.mean(cann_rr['af_am'])
cann_af_am_rr

0.5220588235294118

In [52]:
cann_af_am_rr_std = np.std(cann_rr['af_am'])
cann_af_am_rr_std

0.4995131713023196

In [53]:
non_cann_rr['SUSP_RACE_cleaned'].value_counts()

BLACK                             120314
WHITE HISPANIC                     54299
WHITE                              36339
BLACK HISPANIC                     16273
ASIAN / PACIFIC ISLANDER           10111
AMERICAN INDIAN/ALASKAN NATIVE      1048
Name: SUSP_RACE_cleaned, dtype: int64

In [54]:
african_american_nc_rr = non_cann_rr.SUSP_RACE_cleaned == 'BLACK'

In [55]:
non_cann_rr['af_am'] = african_american_nc_rr.astype(int)

In [56]:
non_cann_rr['af_am'].value_counts()

1    120314
0    118073
Name: af_am, dtype: int64

In [57]:
non_cann_af_am_rr = np.mean(non_cann_rr['af_am'])
non_cann_af_am_rr

0.504700340203115

In [58]:
non_cann_af_am_rr_std = np.std(non_cann_rr['af_am'])
non_cann_af_am_rr_std

0.4999779063137545

In [59]:
scipy.stats.ttest_ind_from_stats(non_cann_af_am_rr, non_cann_af_am_rr_std, n_non_cann_rr, cann_af_am_rr, cann_af_am_rr_std, n_cann_rr)

Ttest_indResult(statistic=-2.049395091323461, pvalue=0.04042457286179611)

Null hypothesis rejected at the 5% significance level (p ≈ 0.04). In this separate sample of NYC crimes where the suspect's race was reported, the p-value is far larger than before and the direction reverses: a Black suspect is recorded for 52.2% of cannabis crimes versus 50.5% of non-cannabis crimes. The null hypothesis that African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes is still rejected.

Null hypothesis: Whites arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

In [60]:
whites_c = cann.SUSP_RACE_cleaned == 'WHITE'
cann['white'] = whites_c.astype(int)

In [61]:
cann['white'].value_counts()

0    21995
1      258
Name: white, dtype: int64

In [62]:
cann_white = np.mean(cann['white'])
cann_white

0.01159394238979014

In [63]:
cann_white_std = np.std(cann['white'])
cann_white_std

0.10704916108805024

In [64]:
whites_nc = non_cann.SUSP_RACE_cleaned == 'WHITE'
non_cann['white'] = whites_nc.astype(int)

In [65]:
non_cann['white'].value_counts()

0    587972
1     36163
Name: white, dtype: int64

In [66]:
non_cann_white = np.mean(non_cann['white'])
non_cann_white

0.05794099033061757

In [67]:
non_cann_white_std = np.std(non_cann['white'])
non_cann_white_std

0.23363182995822473

In [68]:
scipy.stats.ttest_ind_from_stats(non_cann_white, non_cann_white_std, n_non_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=29.482528355205954, pvalue=6.46178201921188e-191)

Null hypothesis rejected (p ≈ 6.5e-191). Whites arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes: in this sample, a White suspect is recorded for 1.2% of cannabis crimes versus 5.8% of non-cannabis crimes.

Null hypothesis: Hispanic whites arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

In [69]:
hisp_whites_c = cann.SUSP_RACE_cleaned == 'WHITE HISPANIC'
cann['hisp_whites'] = hisp_whites_c.astype(int)

In [73]:
cann['hisp_whites'].value_counts()

0    21288
1      965
Name: hisp_whites, dtype: int64

In [74]:
cann_hisp_whites = np.mean(cann['hisp_whites'])
cann_hisp_whites

0.04336493955871119

In [75]:
cann_hisp_whites_std = np.std(cann['hisp_whites'])
cann_hisp_whites_std

0.2036772485472899

In [76]:
hisp_whites_nc = non_cann.SUSP_RACE_cleaned == 'WHITE HISPANIC'
non_cann['hisp_whites'] = hisp_whites_nc.astype(int)

In [77]:
non_cann['hisp_whites'].value_counts()

0    569607
1     54528
Name: hisp_whites, dtype: int64

In [78]:
non_cann_hisp_whites = np.mean(non_cann['hisp_whites'])
non_cann_hisp_whites

0.08736571414838136

In [79]:
non_cann_hisp_whites_std = np.std(non_cann['hisp_whites'])
non_cann_hisp_whites_std

0.2823702288464741

In [81]:
scipy.stats.ttest_ind_from_stats(non_cann_hisp_whites, non_cann_hisp_whites_std, n_non_cann, cann_hisp_whites, cann_hisp_whites_std, n_cann)

Ttest_indResult(statistic=23.03266809571235, pvalue=2.4477626728318438e-117)

Null hypothesis rejected with a very low p-value. So Hispanic whites arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

Null hypothesis: Hispanic blacks arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

In [82]:
hisp_blacks_c = cann.SUSP_RACE_cleaned == 'BLACK HISPANIC'
cann['hisp_blacks'] = hisp_blacks_c.astype(int)

In [83]:
cann['hisp_blacks'].value_counts()

0    21851
1      402
Name: hisp_blacks, dtype: int64

In [84]:
cann_hisp_blacks = np.mean(cann['hisp_blacks'])
cann_hisp_blacks

0.018064980002696265

In [85]:
cann_hisp_blacks_std = np.std(cann['hisp_blacks'])
cann_hisp_blacks_std

0.13318647266217973

In [86]:
hisp_blacks_nc = non_cann.SUSP_RACE_cleaned == 'BLACK HISPANIC'
non_cann['hisp_blacks'] = hisp_blacks_nc.astype(int)

In [88]:
non_cann['hisp_blacks'].value_counts()

0    607629
1     16506
Name: hisp_blacks, dtype: int64

In [89]:
non_cann_hisp_blacks = np.mean(non_cann['hisp_blacks'])
non_cann_hisp_blacks

0.02644620154293542

In [90]:
non_cann_hisp_blacks_std = np.std(non_cann['hisp_blacks'])
non_cann_hisp_blacks_std

0.1604580941152736

In [91]:
scipy.stats.ttest_ind_from_stats(non_cann_hisp_blacks, non_cann_hisp_blacks_std, n_non_cann, cann_hisp_blacks, cann_hisp_blacks_std, n_cann)

Ttest_indResult(statistic=7.697864029768781, pvalue=1.3855380465208633e-14)

Null hypothesis rejected. So Hispanic blacks arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

Null hypothesis: Asians arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

In [92]:
asians_c = cann.SUSP_RACE_cleaned == 'ASIAN / PACIFIC ISLANDER'
cann['asians'] = asians_c.astype(int)

In [93]:
cann['asians'].value_counts()

0    22196
1       57
Name: asians, dtype: int64

In [94]:
cann_asians = np.mean(cann['asians'])
cann_asians

0.002561452388442008

In [95]:
cann_asians_std = np.std(cann['asians'])
cann_asians_std

0.050545933071853726

In [96]:
asians_nc = non_cann.SUSP_RACE_cleaned == 'ASIAN / PACIFIC ISLANDER'
non_cann['asians'] = asians_nc.astype(int)

In [97]:
non_cann['asians'].value_counts()

0    613976
1     10159
Name: asians, dtype: int64

In [98]:
non_cann_asians = np.mean(non_cann['asians'])
non_cann_asians

0.016276927267337996

In [99]:
non_cann_asians_std = np.std(non_cann['asians'])
non_cann_asians_std

0.12653848784462388

In [100]:
scipy.stats.ttest_ind_from_stats(non_cann_asians, non_cann_asians_std, n_non_cann, cann_asians, cann_asians_std, n_cann)

Ttest_indResult(statistic=16.123167600010156, pvalue=1.800635931559938e-58)

Null hypothesis rejected, so Asians arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

Null hypothesis: African-Americans arrested for a crime are equally likely to be charged for cannabis crimes as white people arrested for a crime.

In [101]:
scipy.stats.ttest_ind_from_stats(cann_af_am, cann_af_am_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=34.7409132242668, pvalue=5.964923628783365e-261)

Null hypothesis rejected, so African-Americans arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

Null hypothesis: Hispanic whites arrested for a crime are equally likely to be charged for cannabis crimes as white people arrested for a crime.

In [102]:
scipy.stats.ttest_ind_from_stats(cann_hisp_whites, cann_hisp_whites_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=20.597617201604326, pvalue=7.911075976945696e-94)

Null hypothesis rejected, so Hispanic whites arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

Null hypothesis: Hispanic blacks arrested for a crime are equally likely to be charged for cannabis crimes as white people arrested for a crime.

In [103]:
scipy.stats.ttest_ind_from_stats(cann_hisp_blacks, cann_hisp_blacks_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=5.649245789915209, pvalue=1.621349009434323e-08)

Null hypothesis rejected, so Hispanic blacks arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

Null hypothesis: Asians arrested for a crime are equally likely to be charged for cannabis crimes as white people arrested for a crime.

In [104]:
scipy.stats.ttest_ind_from_stats(cann_asians, cann_asians_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=-11.381887752780784, pvalue=5.663369203012483e-30)

Null hypothesis rejected, so Asians arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

Null hypothesis: African-Americans arrested for a crime are equally likely to be charged for cannabis crimes as white Hispanic people arrested for a crime.

In [105]:
scipy.stats.ttest_ind_from_stats(cann_af_am, cann_af_am_std, n_cann, cann_hisp_whites, cann_hisp_whites_std, n_cann)

Ttest_indResult(statistic=15.830878331203309, pvalue=2.7156074400977062e-56)

Null hypothesis rejected, so African-Americans arrested for a crime are NOT equally likely to be charged for cannabis crimes as white Hispanic people arrested for a crime.

Null hypothesis: African-Americans arrested for a crime are equally likely to be charged for cannabis crimes as black Hispanic people arrested for a crime.

In [106]:
scipy.stats.ttest_ind_from_stats(cann_af_am, cann_af_am_std, n_cann, cann_hisp_blacks, cann_hisp_blacks_std, n_cann)

Ttest_indResult(statistic=30.312317515402764, pvalue=8.574831684177874e-200)

Null hypothesis rejected, so African-Americans arrested for a crime are NOT equally likely to be charged for cannabis crimes as black Hispanic people arrested for a crime.

Null hypothesis: White Hispanics arrested for a crime are equally likely to be charged for cannabis crimes as black Hispanic people arrested for a crime.

In [107]:
scipy.stats.ttest_ind_from_stats(cann_hisp_whites, cann_hisp_whites_std, n_cann, cann_hisp_blacks, cann_hisp_blacks_std, n_cann)

Ttest_indResult(statistic=15.508460789028748, pvalue=4.214779580557901e-54)

Null hypothesis rejected, so White Hispanics arrested for a crime are NOT equally likely to be charged for cannabis crimes as black Hispanic people arrested for a crime.

Null hypothesis: African-Americans arrested for a cannabis crime are equally likely to be charged for misdemeanor cannabis possession as they are for violation cannabis possession.

In [116]:
af_am_cann = cann[cann.african_american == 1]

In [117]:
n_af_am_cann = len(af_am_cann)
n_af_am_cann

1764

In [118]:
af_am_misd_poss = (cann['african_american'] == 1) & (cann['misd_poss'] == 1)
cann['af_am_misd_poss'] = af_am_misd_poss.astype(int)

In [119]:
cann['af_am_misd_poss'].value_counts()

0    20718
1     1535
Name: af_am_misd_poss, dtype: int64

In [120]:
af_am_misd_poss_m = np.mean(cann['af_am_misd_poss'])
af_am_misd_poss_m

0.06897946344313126

In [121]:
af_am_misd_poss_std = np.std(cann['af_am_misd_poss'])
af_am_misd_poss_std

0.2534192121095084

In [124]:
af_am_viol_poss = (cann['african_american'] == 1) & (cann['viol_poss'] == 1)
cann['af_am_viol_poss'] = af_am_viol_poss.astype(int)

In [125]:
cann['af_am_viol_poss'].value_counts()

0    22175
1       78
Name: af_am_viol_poss, dtype: int64

In [126]:
af_am_viol_poss_m = np.mean(cann['af_am_viol_poss'])
af_am_viol_poss_m

0.003505145373657484

In [127]:
af_am_viol_poss_std = np.std(cann['af_am_viol_poss'])
af_am_viol_poss_std

0.059100417338348324

In [128]:
scipy.stats.ttest_ind_from_stats(af_am_misd_poss_m, af_am_misd_poss_std, n_af_am_cann, af_am_viol_poss_m, af_am_viol_poss_std, n_af_am_cann)

Ttest_indResult(statistic=10.56770151473383, pvalue=1.0157825879751098e-25)

The null hypothesis is rejected, so African-Americans arrested for a cannabis crime are NOT equally likely to be charged for misdemeanor cannabis possession as they are for violation cannabis possession.
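The one-way chi-squared test linked at the top of this notebook offers another way to frame this comparison. The sketch below assumes the question is how the 1,535 misdemeanor and 78 violation possession charges split among African-American suspects in the cannabis subset, and tests that split against equal expected counts. It is an illustration under that assumption, not the test run above, and it ignores all other charge types.

```python
import scipy.stats

# Counts from the cells above: 1,535 misdemeanor and 78 violation possession
# charges among African-American suspects in the cannabis subset.
observed = [1535, 78]

# With no expected frequencies given, chisquare tests against equal counts.
result = scipy.stats.chisquare(observed)
print(result.statistic, result.pvalue)
```

Framed this way, the two charge types share a single denominator, so no separate sample size has to be supplied for each group.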

EDA suggested that more whites are arrested for violation possession charges. The next null hypothesis to be tested is that Whites arrested for a cannabis crime are equally likely to be charged for misdemeanor cannabis possession as they are for violation cannabis possession.

In [130]:
white_cann = cann[cann.white == 1]

In [131]:
n_white_cann = len(white_cann)
n_white_cann

258

In [132]:
white_misd_poss = (cann['white'] == 1) & (cann['misd_poss'] == 1)
cann['white_misd_poss'] = white_misd_poss.astype(int)

In [133]:
cann['white_misd_poss'].value_counts()

0    22024
1      229
Name: white_misd_poss, dtype: int64

In [134]:
white_misd_poss_m = np.mean(cann['white_misd_poss'])
white_misd_poss_m

0.010290747314968767

In [135]:
white_misd_poss_std = np.std(cann['white_misd_poss'])
white_misd_poss_std

0.10092000710795918

In [136]:
white_viol_poss = (cann['white'] == 1) & (cann['viol_poss'] == 1)
cann['white_viol_poss'] = white_viol_poss.astype(int)

In [137]:
cann['white_viol_poss'].value_counts()

0    22241
1       12
Name: white_viol_poss, dtype: int64

In [138]:
white_viol_poss_m = np.mean(cann['white_viol_poss'])
white_viol_poss_m

0.0005392531344088437

In [139]:
white_viol_poss_std = np.std(cann['white_viol_poss'])
white_viol_poss_std

0.023215562462838735

In [140]:
scipy.stats.ttest_ind_from_stats(white_misd_poss_m, white_misd_poss_std, n_white_cann, white_viol_poss_m, white_viol_poss_std, n_white_cann)

Ttest_indResult(statistic=1.5125385767500596, pvalue=0.13101148742325466)

The null hypothesis is not rejected at the 5% significance level (p ≈ 0.13), so this test does not provide evidence that Whites arrested for a cannabis crime are more likely to be charged with one of misdemeanor or violation cannabis possession than the other.

To follow up on this finding, the next null hypothesis is that African-Americans arrested for a cannabis crime are equally likely to be arrested for violation possession as are Whites arrested for a cannabis crime.

In [141]:
scipy.stats.ttest_ind_from_stats(af_am_viol_poss_m, af_am_viol_poss_std, n_af_am_cann, white_viol_poss_m, white_viol_poss_std, n_white_cann)

Ttest_indResult(statistic=0.7969895485561075, pvalue=0.4255507657116251)

The null hypothesis here is not rejected (p ≈ 0.43), so this test does not provide evidence that African-Americans and Whites arrested for a cannabis crime differ in how likely they are to be arrested for violation possession.
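As a cross-check under a different framing, the violation possession shares can be computed within each racial group of the cannabis subset (78 of 1,764 African-American suspects and 12 of 258 White suspects) and compared with a two-proportion z-test. This is a sketch rather than the notebook's method; it assumes `statsmodels.stats.proportion.proportions_ztest` is installed and that within-group denominators are the ones of interest.

```python
from statsmodels.stats.proportion import proportions_ztest

# Violation possession counts and group sizes taken from the cells above.
viol_counts = [78, 12]      # African-American, White
group_sizes = [1764, 258]   # suspects of each race in the cannabis subset

# Two-sided test of equal proportions, using a pooled variance estimate.
z_stat, p_value = proportions_ztest(count=viol_counts, nobs=group_sizes)
print(z_stat, p_value)
```

Under these assumptions the two shares are both between 4% and 5%, and the test likewise does not reject the null hypothesis.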

The next null hypothesis is that African-Americans arrested for a cannabis crime are equally likely to be arrested for misdemeanor possession as they are for felony possession.

In [142]:
af_am_felony_poss = (cann['african_american'] == 1) & (cann['felony_poss'] == 1)
cann['af_am_felony_poss'] = af_am_felony_poss.astype(int)

In [143]:
cann['af_am_felony_poss'].value_counts()

0    22214
1       39
Name: af_am_felony_poss, dtype: int64

In [144]:
af_am_felony_poss_m = np.mean(cann['af_am_felony_poss'])
af_am_felony_poss_m

0.001752572686828742

In [145]:
af_am_felony_poss_std = np.std(cann['af_am_felony_poss'])
af_am_felony_poss_std

0.04182703881231521

In [146]:
scipy.stats.ttest_ind_from_stats(af_am_misd_poss_m, af_am_misd_poss_std, n_af_am_cann, af_am_felony_poss_m, af_am_felony_poss_std, n_af_am_cann)

Ttest_indResult(statistic=10.993005530879257, pvalue=1.155697744923015e-27)

The null hypothesis is rejected, so African-Americans arrested for a cannabis crime are NOT equally likely to be arrested for misdemeanor possession as they are for felony possession.

The next null hypothesis is that African-Americans arrested for a cannabis crime are equally likely to be arrested for misdemeanor sales as they are for felony sales.

In [147]:
af_am_misd_sales = (cann['african_american'] == 1) & (cann['misd_sales'] == 1)
cann['af_am_misd_sales'] = af_am_misd_sales.astype(int)

In [148]:
cann['af_am_misd_sales'].value_counts()

0    22148
1      105
Name: af_am_misd_sales, dtype: int64

In [149]:
af_am_misd_sales_m = np.mean(cann['af_am_misd_sales'])
af_am_misd_sales_m

0.004718464926077383

In [150]:
af_am_misd_sales_std = np.std(cann['af_am_misd_sales'])
af_am_misd_sales_std

0.06852883345586251

In [151]:
af_am_felony_sales = (cann['african_american'] == 1) & (cann['felony_sales'] == 1)
cann['af_am_felony_sales'] = af_am_felony_sales.astype(int)

In [152]:
cann['af_am_felony_sales'].value_counts()

0    22246
1        7
Name: af_am_felony_sales, dtype: int64

In [154]:
af_am_felony_sales_m = np.mean(cann['af_am_felony_sales'])
af_am_felony_sales_m

0.00031456432840515884

In [155]:
af_am_felony_sales_std = np.std(cann['af_am_felony_sales'])
af_am_felony_sales_std

0.01773317167594061

In [156]:
scipy.stats.ttest_ind_from_stats(af_am_misd_sales_m, af_am_misd_sales_std, n_af_am_cann, af_am_felony_sales_m, af_am_felony_sales_std, n_af_am_cann)

Ttest_indResult(statistic=2.6129978588465477, pvalue=0.009013304688616644)

The null hypothesis is rejected, so African-Americans arrested for a cannabis crime are NOT equally likely to be arrested for misdemeanor sales as they are for felony sales.

The next subject for hypothesis testing is geography. The machine learning phase of this project will go into more detail on geographic location as a predictive factor for cannabis arrests and their sub-types, but for now I'll start by looking at borough. The first null hypothesis is that cannabis arrests are equally likely to happen in the Bronx as they are in Manhattan.

In [157]:
cann['BORO_NM'].value_counts()

BRONX            8880
BROOKLYN         7128
MANHATTAN        4593
QUEENS           1049
STATEN ISLAND     591
unknown            12
Name: BORO_NM, dtype: int64

In [158]:
bronx_cann = cann['BORO_NM'] == 'BRONX'
cann['bronx_cann'] = bronx_cann.astype(int)

In [159]:
cann['bronx_cann'].value_counts()

0    13373
1     8880
Name: bronx_cann, dtype: int64

In [161]:
bronx_cann_m = np.mean(cann['bronx_cann'])
bronx_cann_m

0.3990473194625444

In [162]:
bronx_cann_std = np.std(cann['bronx_cann'])
bronx_cann_std

0.48970251816007104

In [163]:
manhattan_cann = cann['BORO_NM'] == 'MANHATTAN'
cann['manhattan_cann'] = manhattan_cann.astype(int)

In [164]:
cann['manhattan_cann'].value_counts()

0    17660
1     4593
Name: manhattan_cann, dtype: int64

In [165]:
manhattan_cann_m = np.mean(cann['manhattan_cann'])
manhattan_cann_m

0.20639913719498496

In [166]:
manhattan_cann_std = np.std(cann['manhattan_cann'])
manhattan_cann_std

0.40472031498319616

In [167]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, manhattan_cann_m, manhattan_cann_std, n_cann)

Ttest_indResult(statistic=45.23554284349289, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in the Bronx as they are in Manhattan.
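The reported p-value of 0.0 is a floating-point underflow rather than an exact zero. To get a sense of its order of magnitude, the tail probability can be computed on a log scale. The sketch below treats the statistic as a z-score, which is an approximation and an assumption on my part; in the far tail a t-distribution would give a somewhat larger p-value.

```python
import math
import scipy.stats

t_statistic = 45.23554284349289  # Bronx vs. Manhattan statistic from the cell above

# Log of the one-sided normal tail probability, computed without underflow.
log_tail = scipy.stats.norm.logsf(t_statistic)

# Two-sided p-value on a log10 scale: log10(2 * tail).
log10_p = (log_tail + math.log(2)) / math.log(10)
print(log10_p)
```

A log10 value in the negative hundreds is far below the smallest positive double (roughly 1e-308 for normal values), which is why the cell above prints 0.0.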

The next null hypothesis is that cannabis arrests are equally likely to happen in the Bronx and in Brooklyn.

In [168]:
brooklyn_cann = cann['BORO_NM'] == 'BROOKLYN'
cann['brooklyn_cann'] = brooklyn_cann.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [169]:
cann['brooklyn_cann'].value_counts()

0    15125
1     7128
Name: brooklyn_cann, dtype: int64

In [170]:
brooklyn_cann_m = np.mean(cann['brooklyn_cann'])
brooklyn_cann_m

0.3203163618388532

In [171]:
brooklyn_cann_std = np.std(cann['brooklyn_cann'])
brooklyn_cann_std

0.4665981034864472

In [172]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, brooklyn_cann_m, brooklyn_cann_std, n_cann)

Ttest_indResult(statistic=17.363355710916625, pvalue=2.6074354160951398e-67)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in the Bronx and in Brooklyn: the Bronx accounts for about 39.9% of the sample's cannabis arrests and Brooklyn for about 32.0%.

The next null hypothesis is that cannabis arrests are equally likely to happen in the Bronx and in Queens.

In [173]:
queens_cann = cann['BORO_NM'] == 'QUEENS'
cann['queens_cann'] = queens_cann.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [174]:
cann['queens_cann'].value_counts()

0    21204
1     1049
Name: queens_cann, dtype: int64

In [175]:
queens_cann_m = np.mean(cann['queens_cann'])
queens_cann_m

0.04713971149957309

In [176]:
queens_cann_std = np.std(cann['queens_cann'])
queens_cann_std

0.2119376302106154

In [177]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, queens_cann_m, queens_cann_std, n_cann)

Ttest_indResult(statistic=98.38055959667413, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in the Bronx and in Queens: the Bronx accounts for about 39.9% of the sample's cannabis arrests and Queens for about 4.7%.

The next null hypothesis is that cannabis arrests are equally likely to happen in the Bronx and in Staten Island.

In [178]:
staten_cann = cann['BORO_NM'] == 'STATEN ISLAND'
cann['staten_cann'] = staten_cann.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [179]:
cann['staten_cann'].value_counts()

0    21662
1      591
Name: staten_cann, dtype: int64

In [180]:
staten_cann_m = np.mean(cann['staten_cann'])
staten_cann_m

0.026558216869635554

In [181]:
staten_cann_std = np.std(cann['staten_cann'])
staten_cann_std

0.16078830177081022

In [182]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=107.80616550805878, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in the Bronx and in Staten Island: the Bronx accounts for about 39.9% of the sample's cannabis arrests and Staten Island for about 2.7%.

The next null hypothesis is that cannabis arrests are equally likely to happen in Manhattan and in Brooklyn.

In [183]:
scipy.stats.ttest_ind_from_stats(manhattan_cann_m, manhattan_cann_std, n_cann, brooklyn_cann_m, brooklyn_cann_std, n_cann)

Ttest_indResult(statistic=-27.512458102741178, pvalue=3.0284379982446205e-165)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in Manhattan and in Brooklyn. The negative statistic reflects Manhattan's smaller share (about 20.6%, against about 32.0% for Brooklyn).

The next null hypothesis is that cannabis arrests are equally likely to happen in Manhattan and in Queens.

In [184]:
scipy.stats.ttest_ind_from_stats(manhattan_cann_m, manhattan_cann_std, n_cann, queens_cann_m, queens_cann_std, n_cann)

Ttest_indResult(statistic=52.00216938379398, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in Manhattan and in Queens: Manhattan accounts for about 20.6% of the sample's cannabis arrests and Queens for about 4.7%.

The next null hypothesis is that cannabis arrests are equally likely to happen in Manhattan and in Staten Island.

In [185]:
scipy.stats.ttest_ind_from_stats(manhattan_cann_m, manhattan_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=61.60341114114099, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in Manhattan and in Staten Island: Manhattan accounts for about 20.6% of the sample's cannabis arrests and Staten Island for about 2.7%.

The next null hypothesis is that cannabis arrests are equally likely to happen in Brooklyn and in Queens.

In [186]:
scipy.stats.ttest_ind_from_stats(brooklyn_cann_m, brooklyn_cann_std, n_cann, queens_cann_m, queens_cann_std, n_cann)

Ttest_indResult(statistic=79.51784045445179, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in Brooklyn and in Queens: Brooklyn accounts for about 32.0% of the sample's cannabis arrests and Queens for about 4.7%.

The next null hypothesis is that cannabis arrests are equally likely to happen in Brooklyn and in Staten Island.

In [187]:
scipy.stats.ttest_ind_from_stats(brooklyn_cann_m, brooklyn_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=88.7922933811341, pvalue=0.0)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in Brooklyn and in Staten Island: Brooklyn accounts for about 32.0% of the sample's cannabis arrests and Staten Island for about 2.7%.

The next null hypothesis is that cannabis arrests are equally likely to happen in Queens and in Staten Island.

In [188]:
scipy.stats.ttest_ind_from_stats(queens_cann_m, queens_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=11.541048207759841, pvalue=9.062260122217387e-31)

The null hypothesis is rejected, so cannabis arrests are NOT equally likely to happen in Queens and in Staten Island: Queens accounts for about 4.7% of the sample's cannabis arrests and Staten Island for about 2.7%.
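The ten borough comparisons above can also be run in one loop, which makes it easy to apply a correction for multiple testing. The sketch below is one way to do that under stated assumptions: it reuses the same `scipy.stats.ttest_ind_from_stats` call as the cells above, rebuilds each mean and standard deviation from the borough counts, and compares every p-value against a Bonferroni-adjusted threshold of 0.05 / 10. The 0.05 significance level and the names `boro_counts`, `alpha_adj` and `results` are choices made for this example.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind_from_stats

n_cann = 22253  # cannabis arrests in the sample
boro_counts = {
    'BRONX': 8880,
    'BROOKLYN': 7128,
    'MANHATTAN': 4593,
    'QUEENS': 1049,
    'STATEN ISLAND': 591,
}

alpha = 0.05                         # assumed significance level
pairs = list(combinations(boro_counts, 2))
alpha_adj = alpha / len(pairs)       # Bonferroni threshold for 10 tests

results = {}
for a, b in pairs:
    p_a = boro_counts[a] / n_cann
    p_b = boro_counts[b] / n_cann
    # Std of a 0/1 indicator with mean p (matches np.std with ddof=0).
    s_a = np.sqrt(p_a * (1 - p_a))
    s_b = np.sqrt(p_b * (1 - p_b))
    t_stat, p_value = ttest_ind_from_stats(p_a, s_a, n_cann,
                                           p_b, s_b, n_cann)
    results[(a, b)] = (t_stat, p_value, p_value < alpha_adj)
    print(a, 'vs', b, t_stat, p_value, p_value < alpha_adj)
```

Because even the largest p-value above (Queens vs Staten Island) is far below 0.005, every null hypothesis is still rejected after the correction.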