# Targets: Using Machine Learning Classification Models to Identify Salient Predictors of Cannabis Arrests in New York City, 2006-2018

# Statistical Data Analysis notebook

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
import statsmodels
from numpy.random import seed

The focus of this sub-report is to run hypothesis tests on a series of null hypotheses about the likelihood of different types of cannabis arrests amongst different demographic groups. The null hypotheses will be evaluated using the t-test.

The DataFrame used in this hypothesis testing stage is a sample from the cleaned "NYPD Complaint Data Historic" dataset, which was downloaded from https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i. The data cleaning for cannabis crimes was conducted separately from all other crimes, so two .csv files are imported and subsequently concatenated.

In [6]:
nyc_cann = pd.read_csv('nyc_cann_HT_sample.csv', index_col=0)

In [3]:
nyc_non_cann = pd.read_csv('nyc_non_cann_HT_sample.csv', index_col=0)

The two DataFrames are then concatenated in the following cell, and the shapes of the individual DataFrames and of the concatenated DataFrame are displayed.

In [8]:
df = pd.concat([nyc_cann, nyc_non_cann], sort=True)

In [9]:
nyc_cann.shape

(22030, 136)

In [10]:
nyc_non_cann.shape

(623849, 124)

In [11]:
df.shape

(645879, 136)

Duplicate rows are then dropped from the concatenated DataFrame. The shape is unchanged, so no duplicates were present.

In [12]:
df = df.drop_duplicates()
df.shape

(645879, 136)
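
`drop_duplicates()` removes repeated rows silently, so the unchanged shape above is the only evidence that none were present. A minimal sketch of counting duplicates explicitly before dropping them follows; the `toy` DataFrame and its values are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first.
toy = pd.DataFrame({
    "PD_CD": [566.0, 101.0, 566.0],
    "SUSP_RACE_cleaned": ["unknown", "WHITE", "unknown"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())
print(f"duplicate rows: {n_duplicates}")

# Drop the repeats only after the count has been recorded.
deduped = toy.drop_duplicates()
print(deduped.shape)
```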

The presence of null values is checked.

In [13]:
df_nulls = list(df.columns[df.isna().any()])  # isnull() is an alias of isna()
df_nulls

['Unnamed: 0.1',
 'felony',
 'felony_poss',
 'felony_sales',
 'misd_poss',
 'misd_sales',
 'misdemeanor',
 'possession',
 'sales',
 'viol_poss',
 'viol_sales',
 'violation']

The features with null values are put into a list called 'fill_w_zero', and then any null values for these features are filled with zeroes.

In [14]:
fill_w_zero = list(df.columns[df.isna().any()])  # isnull() is an alias of isna()
fill_w_zero

['Unnamed: 0.1',
 'felony',
 'felony_poss',
 'felony_sales',
 'misd_poss',
 'misd_sales',
 'misdemeanor',
 'possession',
 'sales',
 'viol_poss',
 'viol_sales',
 'violation']

In [15]:
df[fill_w_zero] = df[fill_w_zero].fillna(value=0, axis=1)

In [16]:
null_recheck = list(df.columns[df.isna().any()])  # isnull() is an alias of isna()
null_recheck

[]

There are no more missing values. 

The 'cannabis_crime' feature is created here from the 'PD_CD' codes: rows with one of the cannabis crime codes receive a value of '1' (cannabis crime), and all other rows receive '0' (non-cannabis crime).

In [17]:
# PD_CD codes 566-570 identify cannabis crimes
cannabis_crime = df.PD_CD.isin([566.0, 567.0, 568.0, 569.0, 570.0])

In [18]:
df['cannabis_crime'] = cannabis_crime.astype(int)

In the sample, there are 22,030 cannabis crimes and 623,849 non-cannabis crimes, for a total of 645,879 NYC crimes.

In [19]:
df['cannabis_crime'].value_counts()

0    623849
1     22030
Name: cannabis_crime, dtype: int64

In this sample, the number of cannabis crimes is 3.5% of the number of non-cannabis crimes (22,030 / 623,849), which corresponds to about 3.4% of all 645,879 sampled crimes. The same ratio in the cleaned full data set of New York City crimes between 2006 and 2018 is also 3.5% (220,304 / 6,238,491, as computed below).

In [22]:
observed_cann_pctg_s = round(round(22030/623849, 3)*100, 2)
observed_cann_pctg_s

3.5

In [23]:
observed_cann_pctg = round(round(220304/6238491, 3)*100, 2)
observed_cann_pctg

3.5
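
The two cells above divide by 623,849, which is the non-cannabis count in the value counts rather than the total number of rows. A small sketch, using only the counts printed above, that separates the ratio to non-cannabis crimes from the share of all sampled crimes:

```python
# Counts taken from the value_counts() output above.
n_cannabis = 22030
n_non_cannabis = 623849
n_total = n_cannabis + n_non_cannabis  # all rows in the sample

# Ratio of cannabis crimes to non-cannabis crimes (what the cells above compute).
ratio_to_non_cannabis = n_cannabis / n_non_cannabis

# Share of cannabis crimes among all sampled crimes.
share_of_total = n_cannabis / n_total

print(f"ratio to non-cannabis: {ratio_to_non_cannabis:.1%}")
print(f"share of all crimes:   {share_of_total:.1%}")
```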

For ease of use in running hypothesis tests, separate DataFrames for cannabis crimes ('cann') and non-cannabis crimes ('non_cann') are created.

In [26]:
# .copy() so that columns added later do not write into a slice of df
cann = df[df.cannabis_crime == 1].copy()
len(cann)

22030

In [27]:
# .copy() so that columns added later do not write into a slice of df
non_cann = df[df.cannabis_crime == 0].copy()
len(non_cann)

623849

# Reported Race of Suspect: First Null Hypothesis

The first null hypothesis states that the difference seen between the percentage of cannabis crimes where the suspect's race was reported and the percentage of non-cannabis crimes where the suspect's race was reported is due to chance and not some mediating factor.

For the cannabis crime group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_cann'.

In [28]:
n_cann = len(cann)
n_cann

22030

Second, the value counts for suspect race are assigned to 'cann_race'.

In [29]:
cann_race = cann['SUSP_RACE_cleaned'].value_counts()
cann_race

unknown                           18578
BLACK                              1784
WHITE HISPANIC                      924
BLACK HISPANIC                      361
WHITE                               293
ASIAN / PACIFIC ISLANDER             82
AMERICAN INDIAN/ALASKAN NATIVE        8
Name: SUSP_RACE_cleaned, dtype: int64

A new feature named 'race_reported' is created that assigns a value of '1' for rows where the suspect's race was reported, and '0' for those rows where it was not.

In [30]:
race_reported_c = cann.SUSP_RACE_cleaned != 'unknown'

In [32]:
cann['race_reported'] = race_reported_c.astype(int)

In [33]:
cann['race_reported'].value_counts()

0    18578
1     3452
Name: race_reported, dtype: int64
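
Adding a column to a filtered subset of a DataFrame is a common trigger for pandas' SettingWithCopyWarning. One hedged way to build the same kind of flag without writing into a slice is to take an explicit copy and use `DataFrame.assign`. The `toy` frame below is a hypothetical stand-in for 'df', not project data.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`.
toy = pd.DataFrame({
    "cannabis_crime": [1, 1, 0, 1],
    "SUSP_RACE_cleaned": ["unknown", "BLACK", "WHITE", "WHITE HISPANIC"],
})

# Take an explicit copy of the filtered rows, then add the flag with assign(),
# which returns a new DataFrame instead of modifying `toy`.
cann_toy = (
    toy[toy["cannabis_crime"] == 1]
    .copy()
    .assign(race_reported=lambda d: (d["SUSP_RACE_cleaned"] != "unknown").astype(int))
)

print(cann_toy["race_reported"].value_counts())
```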

As reported in previous notebooks of this project, the population percentage of cannabis crimes that have their suspect's race reported is 15.8%. In the sample, it is 15.7%.

In [34]:
round(cann['race_reported'].value_counts(normalize=True), 3)*100

0    84.3
1    15.7
Name: race_reported, dtype: float64

The mean of the 'race_reported' feature is assigned to 'cann_race_reported', for use in the t-test function that will be run below.

In [35]:
cann_race_reported = np.mean(cann['race_reported'])
cann_race_reported

0.15669541534271447

The standard deviation of the 'race_reported' feature is assigned to 'cann_std', for use in the t-test function that will be run below.

In [36]:
cann_std = np.std(cann['race_reported'])
cann_std

0.3635133589750448
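
Because 'race_reported' only takes the values 0 and 1, its mean is the proportion p of rows with a reported race, and its population standard deviation is sqrt(p(1 - p)). The short check below reproduces the two values printed above from the counts alone. It assumes the population form of the standard deviation, which is what `np.std` computes by default (`ddof=0`).

```python
import math

# Counts from the value_counts() output above.
n_reported = 3452
n_total = 22030

# The mean of a 0/1 flag is the proportion of ones.
p = n_reported / n_total

# Population standard deviation of a 0/1 flag (np.std's default, ddof=0).
std = math.sqrt(p * (1 - p))

print(p, std)
```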

For the non-cannabis crime group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_non_cann'.

In [37]:
n_non_cann = len(non_cann)
n_non_cann

623849

Second, the value counts for suspect race are called.

In [20]:
non_cann['SUSP_RACE_cleaned'].value_counts()

unknown                           386766
BLACK                             119377
WHITE HISPANIC                     54579
WHITE                              35661
BLACK HISPANIC                     16256
ASIAN / PACIFIC ISLANDER           10211
AMERICAN INDIAN/ALASKAN NATIVE       999
Name: SUSP_RACE_cleaned, dtype: int64

A new feature named 'race_reported' is created that assigns a value of '1' for rows where the suspect's race was reported, and '0' for those rows where it was not.

In [38]:
race_reported_nc = non_cann.SUSP_RACE_cleaned != 'unknown'

In [39]:
non_cann['race_reported'] = race_reported_nc.astype(int)

As reported in previous notebooks of this project, the population percentage of non-cannabis crimes that have their suspect's race reported is 38.1%. In the sample, it is 38.0%.

In [40]:
round(non_cann['race_reported'].value_counts(normalize=True), 3)*100

0    62.0
1    38.0
Name: race_reported, dtype: float64

The mean of the 'race_reported' feature is assigned to 'non_cann_race_reported', for use in the t-test function that will be run below.

In [43]:
non_cann_race_reported = np.mean(non_cann['race_reported'])
non_cann_race_reported

0.38003266816168657

The standard deviation of the 'race_reported' feature is assigned to 'non_cann_std', for use in the t-test function that will be run below.

In [44]:
non_cann_std = np.std(non_cann['race_reported'])
non_cann_std

0.4853945192236991

The t-test is run using the mean and standard deviation of the 'race_reported' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group. The t-score is approximately 67.6 and the p-value is reported as 0.0, meaning it is too small to be represented as a floating-point number.

In [45]:
scipy.stats.ttest_ind_from_stats(non_cann_race_reported, non_cann_std, n_non_cann, cann_race_reported, cann_std, n_cann)

Ttest_indResult(statistic=67.62616088664134, pvalue=0.0)
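
With its default `equal_var=True`, `ttest_ind_from_stats` computes a pooled-variance two-sample t statistic. The sketch below reproduces that calculation from the summary statistics printed above, using only the standard library. `pooled_t` is a hypothetical helper name introduced for illustration, not part of SciPy.

```python
import math

def pooled_t(mean1, std1, n1, mean2, std2, n2):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / dof
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err

# Summary statistics copied from the cells above.
t_stat = pooled_t(
    0.38003266816168657, 0.4853945192236991, 623849,  # non-cannabis group
    0.15669541534271447, 0.3635133589750448, 22030,   # cannabis group
)
print(t_stat)
```

SciPy documents the standard deviation arguments as corrected sample standard deviations (`ddof=1`), whereas `np.std` defaults to `ddof=0`. With groups this large the difference is negligible, but passing `ddof=1` to `np.std` would match the documented input more closely.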

The first null hypothesis is rejected. Although the mediating factor is outside the scope of this analysis, one cannot say that the difference seen between the percentage of cannabis crimes where the suspect's race was reported and the percentage of non-cannabis crimes where the suspect's race was reported is due to chance.

# Likelihood of Arrest for Cannabis and Non-Cannabis Crimes for African-Americans:  Second Null Hypothesis

The second null hypothesis states that African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

For the cannabis crime group, the necessary statistics for running the second hypothesis test are assigned to objects below.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed by a recorded African-American suspect ('1' for African-American and '0' for not African-American).

In [46]:
african_american_c = cann.SUSP_RACE_cleaned == 'BLACK'

In [47]:
cann['african_american'] = african_american_c.astype(int)

There are 1,784 cannabis crimes with a recorded African-American suspect.

In [48]:
cann['african_american'].value_counts()

0    20246
1     1784
Name: african_american, dtype: int64

The mean of the 'african_american' feature is assigned to 'cann_af_am', for use in the t-test function that will be run below.

In [49]:
cann_af_am = np.mean(cann['african_american'])
cann_af_am

0.08098048116205175

The standard deviation of the 'african_american' feature is assigned to 'cann_af_am_std', for use in the t-test function that will be run below.

In [50]:
cann_af_am_std = np.std(cann['african_american'])
cann_af_am_std

0.2728051371085288

A flag feature is created in the 'non_cann' DataFrame which flags those non-cannabis crimes which were committed by a recorded African-American suspect ('1' for African-American and '0' for not African-American).

In [51]:
african_american_nc = non_cann.SUSP_RACE_cleaned == 'BLACK'

In [52]:
non_cann['african_american'] = african_american_nc.astype(int)

There are 119,377 non-cannabis crimes with a recorded African-American suspect.

In [53]:
non_cann['african_american'].value_counts()

0    504472
1    119377
Name: african_american, dtype: int64

The mean of the 'african_american' feature is assigned to 'non_cann_af_am', for use in the t-test function that will be run below.

In [54]:
non_cann_af_am = np.mean(non_cann['african_american'])
non_cann_af_am

0.1913556004738326

The standard deviation of the 'african_american' feature is assigned to 'non_cann_af_am_std', for use in the t-test function that will be run below.

In [55]:
non_cann_af_am_std = np.std(non_cann['african_american'])
non_cann_af_am_std

0.39336831931663324

The t-test is run using the mean and standard deviation of the 'african_american' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group. The t-score is approximately 41.3 and the p-value is 0.0.

In [56]:
scipy.stats.ttest_ind_from_stats(non_cann_af_am, non_cann_af_am_std, n_non_cann, cann_af_am, cann_af_am_std, n_cann)

Ttest_indResult(statistic=41.29731509436323, pvalue=0.0)

The second null hypothesis is rejected. African-Americans arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.
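
Since the 'african_american' flag is binary, a two-proportion z-test is a common alternative to the t-test for the same comparison. The sketch below is not what the notebook ran. It is a swapped-in technique, shown under the assumption that the two groups are independent, using the counts printed above. `two_proportion_z` is a hypothetical helper name.

```python
import math

def two_proportion_z(successes1, n1, successes2, n2):
    """z statistic and two-sided p-value for H0: the two proportions are equal."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the value_counts() outputs above:
# non-cannabis group first, cannabis group second.
z, p_value = two_proportion_z(119377, 623849, 1784, 22030)
print(z, p_value)
```

The z statistic lands close to the t-score above, which is expected at these sample sizes.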

# Likelihood of Arrest for Cannabis and Non-Cannabis Crimes for African-Americans (re-run on crimes with the suspect's race reported): Third Null Hypothesis

The third null hypothesis states that African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes. It is tested on the subset of sampled NYC crimes where the suspect's race was reported. This test is exploratory, and its result may be spurious because every crime where the suspect's race was not recorded is excluded.

A DataFrame of crimes where the suspect's race was reported is first subset from 'df', the 10% sample of all cannabis and non-cannabis crimes. Its length is then displayed.

In [57]:
nyc_rr = df[df.SUSP_RACE_cleaned != 'unknown']

In [58]:
len(nyc_rr)

240535

For ease of use in running hypothesis tests on just the crimes with the suspect's race recorded, separate DataFrames for cannabis crimes ('cann_rr') and non-cannabis crimes ('non_cann_rr') are created.

In [59]:
# .copy() so that columns added later do not write into a slice of nyc_rr
cann_rr = nyc_rr[nyc_rr.cannabis_crime == 1].copy()

In [60]:
# .copy() so that columns added later do not write into a slice of nyc_rr
non_cann_rr = nyc_rr[nyc_rr.cannabis_crime == 0].copy()

First, the N of the group of cannabis crimes whose suspect's race was recorded is assigned to 'n_cann_rr'.

In [61]:
n_cann_rr = len(cann_rr)
n_cann_rr

3452

Second, the N of the group of non-cannabis crimes whose suspect's race was recorded is assigned to 'n_non_cann_rr'.

In [62]:
n_non_cann_rr = len(non_cann_rr)
n_non_cann_rr

237083

The value counts for 'SUSP_RACE_cleaned' are called for cannabis crimes where the suspect's race was reported.

In [63]:
cann_rr['SUSP_RACE_cleaned'].value_counts()

BLACK                             1784
WHITE HISPANIC                     924
BLACK HISPANIC                     361
WHITE                              293
ASIAN / PACIFIC ISLANDER            82
AMERICAN INDIAN/ALASKAN NATIVE       8
Name: SUSP_RACE_cleaned, dtype: int64

First, a flag feature is created in the 'cann_rr' DataFrame which flags those cannabis crimes which were committed by a recorded African-American suspect ('1' for African-American and '0' for not African-American).

In [64]:
african_american_c_rr = cann_rr.SUSP_RACE_cleaned == 'BLACK'

In [65]:
cann_rr['af_am'] = african_american_c_rr.astype(int)

In [66]:
cann_rr['af_am'].value_counts()

1    1784
0    1668
Name: af_am, dtype: int64

The mean of the 'af_am' feature is assigned to 'cann_af_am_rr', for use in the t-test function that will be run below.

In [67]:
cann_af_am_rr = np.mean(cann_rr['af_am'])
cann_af_am_rr

0.5168018539976825

The standard deviation of the 'af_am' feature is assigned to 'cann_af_am_rr_std', for use in the t-test function that will be run below.

In [68]:
cann_af_am_rr_std = np.std(cann_rr['af_am'])
cann_af_am_rr_std

0.4997176179626278

The value counts for 'SUSP_RACE_cleaned' are called for non-cannabis crimes where the suspect's race was reported.

In [69]:
non_cann_rr['SUSP_RACE_cleaned'].value_counts()

BLACK                             119377
WHITE HISPANIC                     54579
WHITE                              35661
BLACK HISPANIC                     16256
ASIAN / PACIFIC ISLANDER           10211
AMERICAN INDIAN/ALASKAN NATIVE       999
Name: SUSP_RACE_cleaned, dtype: int64

A flag feature is created in the 'non_cann_rr' DataFrame which flags those non-cannabis crimes which were committed by a recorded African-American suspect ('1' for African-American and '0' for not African-American).

In [70]:
african_american_nc_rr = non_cann_rr.SUSP_RACE_cleaned == 'BLACK'

In [71]:
non_cann_rr['af_am'] = african_american_nc_rr.astype(int)

In [72]:
non_cann_rr['af_am'].value_counts()

1    119377
0    117706
Name: af_am, dtype: int64

The mean of the 'af_am' feature is assigned to 'non_cann_af_am_rr', for use in the t-test function that will be run below.

In [73]:
non_cann_af_am_rr = np.mean(non_cann_rr['af_am'])
non_cann_af_am_rr

0.5035240822834197

The standard deviation of the 'af_am' feature is assigned to 'non_cann_af_am_rr_std', for use in the t-test function that will be run below.

In [74]:
non_cann_af_am_rr_std = np.std(non_cann_rr['af_am'])
non_cann_af_am_rr_std

0.4999875806908565

The t-test is run using the mean and standard deviation of the 'af_am' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group where the suspect's race was reported. The t-score is approximately -1.55 and the p-value is 0.12.

In [75]:
scipy.stats.ttest_ind_from_stats(non_cann_af_am_rr, non_cann_af_am_rr_std, n_non_cann_rr, cann_af_am_rr, cann_af_am_rr_std, n_cann_rr)

Ttest_indResult(statistic=-1.5490513161568045, pvalue=0.12137070030085367)

The p-value in this t-test is 0.12, well above 0.05. So in this separate sample of NYC crimes where the race was reported, the third null hypothesis that African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes is NOT rejected.
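
A confidence interval for the difference between the two proportions gives a complementary view of the same result. The sketch below uses the normal approximation with an unpooled standard error, which is an assumption of this example rather than something the notebook computes, and takes its counts from the value counts printed above.

```python
import math

# Counts from the value_counts() outputs above (race-reported subset).
black_cann, n_cann_rr = 1784, 3452
black_non_cann, n_non_cann_rr = 119377, 237083

p_cann = black_cann / n_cann_rr
p_non_cann = black_non_cann / n_non_cann_rr
diff = p_cann - p_non_cann

# Unpooled standard error of the difference between two proportions.
std_err = math.sqrt(
    p_cann * (1 - p_cann) / n_cann_rr
    + p_non_cann * (1 - p_non_cann) / n_non_cann_rr
)

# 1.96 is the usual normal critical value for a 95% interval.
ci_low, ci_high = diff - 1.96 * std_err, diff + 1.96 * std_err
print(f"difference: {diff:.4f}, 95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

The interval contains zero, which is consistent with not rejecting the third null hypothesis at the 0.05 level.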

# Likelihood of Arrest for Cannabis and Non-Cannabis Crimes for Whites:  Fourth Null Hypothesis

The fourth null hypothesis states that Whites arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed by a recorded White suspect ('1' for White and '0' for not White).

In [76]:
whites_c = cann.SUSP_RACE_cleaned == 'WHITE'
cann['white'] = whites_c.astype(int)

There are 293 cannabis crimes with a recorded White suspect.

In [77]:
cann['white'].value_counts()

0    21737
1      293
Name: white, dtype: int64

The mean of the 'white' feature is assigned to 'cann_white', for use in the t-test function that will be run below.

In [78]:
cann_white = np.mean(cann['white'])
cann_white

0.013300045392646391

The standard deviation of the 'white' feature is assigned to 'cann_white_std', for use in the t-test function that will be run below.

In [79]:
cann_white_std = np.std(cann['white'])
cann_white_std

0.11455633629441452

A flag feature is created in the 'non_cann' DataFrame which flags those non-cannabis crimes which were committed by a recorded White suspect ('1' for White and '0' for not White).

In [80]:
whites_nc = non_cann.SUSP_RACE_cleaned == 'WHITE'
non_cann['white'] = whites_nc.astype(int)

There are 35,661 non-cannabis crimes with a recorded White suspect.

In [81]:
non_cann['white'].value_counts()

0    588188
1     35661
Name: white, dtype: int64

The mean of the 'white' feature is assigned to 'non_cann_white', for use in the t-test function that will be run below.

In [82]:
non_cann_white = np.mean(non_cann['white'])
non_cann_white

0.05716287114349786

The standard deviation of the 'white' feature is assigned to 'non_cann_white_std', for use in the t-test function that will be run below.

In [83]:
non_cann_white_std = np.std(non_cann['white'])
non_cann_white_std

0.23215356406192092

The t-test is run using the mean and standard deviation of the 'white' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group. The t-score is approximately 27.9 and the p-value is approximately 1.8e-171: not reported as zero, as it was for African-Americans, but still vanishingly small.

In [84]:
scipy.stats.ttest_ind_from_stats(non_cann_white, non_cann_white_std, n_non_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=27.92345147365235, pvalue=1.7536178040202593e-171)

The fourth null hypothesis is rejected. Whites arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.
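
The same steps (build a 0/1 flag, take its mean, standard deviation, and sample size, then call the t-test) are repeated for every demographic group. A hedged sketch of a helper that wraps the last three steps is shown below. `compare_flag` is a hypothetical name, and the two input lists are made up for illustration rather than drawn from the NYPD data.

```python
import numpy as np
from scipy import stats

def compare_flag(flag_a, flag_b):
    """Run the summary-statistics t-test on two arrays of 0/1 flags."""
    a = np.asarray(flag_a, dtype=float)
    b = np.asarray(flag_b, dtype=float)
    # Mirrors the notebook's calls: mean, np.std (ddof=0), and sample size.
    return stats.ttest_ind_from_stats(
        a.mean(), a.std(), a.size,
        b.mean(), b.std(), b.size,
    )

# Made-up flags for illustration only.
group_a = [1, 0, 0, 1, 1, 0, 1, 1]
group_b = [0, 0, 1, 0, 0, 0, 1, 0]
result = compare_flag(group_a, group_b)
print(result.statistic, result.pvalue)
```

In the notebook's terms, a call such as `compare_flag(non_cann['white'], cann['white'])` would stand in for the separate mean, standard deviation, and t-test cells.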

# Likelihood of Arrest for Cannabis and Non-Cannabis Crimes for Hispanic Whites:  Fifth Null Hypothesis

The fifth null hypothesis states that Hispanic Whites arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed by a recorded Hispanic White suspect ('1' for Hispanic White and '0' for not Hispanic White).

In [85]:
hisp_whites_c = cann.SUSP_RACE_cleaned == 'WHITE HISPANIC'
cann['hisp_whites'] = hisp_whites_c.astype(int)

There are 924 cannabis crimes with a recorded Hispanic White suspect.

In [86]:
cann['hisp_whites'].value_counts()

0    21106
1      924
Name: hisp_whites, dtype: int64

The mean of the 'hisp_whites' feature is assigned to 'cann_hisp_whites', for use in the t-test function that will be run below.

In [87]:
cann_hisp_whites = np.mean(cann['hisp_whites'])
cann_hisp_whites

0.04194280526554698

The standard deviation of the 'hisp_whites' feature is assigned to 'cann_hisp_whites_std', for use in the t-test function that will be run below.

In [88]:
cann_hisp_whites_std = np.std(cann['hisp_whites'])
cann_hisp_whites_std

0.2004584903464932

A flag feature is created in the 'non_cann' DataFrame which flags those non-cannabis crimes which were committed by a recorded Hispanic White suspect ('1' for Hispanic White and '0' for not Hispanic White).

In [89]:
hisp_whites_nc = non_cann.SUSP_RACE_cleaned == 'WHITE HISPANIC'
non_cann['hisp_whites'] = hisp_whites_nc.astype(int)

There are 54,579 non-cannabis crimes with a recorded Hispanic White suspect.

In [90]:
non_cann['hisp_whites'].value_counts()

0    569270
1     54579
Name: hisp_whites, dtype: int64

The mean of the 'hisp_whites' feature is assigned to 'non_cann_hisp_whites', for use in the t-test function that will be run below.

In [91]:
non_cann_hisp_whites = np.mean(non_cann['hisp_whites'])
non_cann_hisp_whites

0.08748751701132806

The standard deviation of the 'hisp_whites' feature is assigned to 'non_cann_hisp_whites_std', for use in the t-test function that will be run below.

In [92]:
non_cann_hisp_whites_std = np.std(non_cann['hisp_whites'])
non_cann_hisp_whites_std

0.28254813993038513

The t-test is run using the mean and standard deviation of the 'hisp_whites' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group. The t-score is approximately 23.7 and the p-value is 2.8e-124.

In [93]:
scipy.stats.ttest_ind_from_stats(non_cann_hisp_whites, non_cann_hisp_whites_std, n_non_cann, cann_hisp_whites, cann_hisp_whites_std, n_cann)

Ttest_indResult(statistic=23.71520037603833, pvalue=2.8414252698367278e-124)

The fifth null hypothesis is rejected at a very low p-value. Hispanic Whites arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

# Likelihood of Arrest for Cannabis and Non-Cannabis Crimes for Hispanic Blacks:  Sixth Null Hypothesis

The sixth null hypothesis states that Hispanic Blacks arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed by a recorded Hispanic Black suspect ('1' for Hispanic Black and '0' for not Hispanic Black).

In [94]:
hisp_blacks_c = cann.SUSP_RACE_cleaned == 'BLACK HISPANIC'
cann['hisp_blacks'] = hisp_blacks_c.astype(int)

There are 361 cannabis crimes with a recorded Hispanic Black suspect.

In [95]:
cann['hisp_blacks'].value_counts()

0    21669
1      361
Name: hisp_blacks, dtype: int64

The mean of the 'hisp_blacks' feature is assigned to 'cann_hisp_blacks', for use in the t-test function that will be run below.

In [96]:
cann_hisp_blacks = np.mean(cann['hisp_blacks'])
cann_hisp_blacks

0.016386745347253744

The standard deviation of the 'hisp_blacks' feature is assigned to 'cann_hisp_blacks_std', for use in the t-test function that will be run below.

In [97]:
cann_hisp_blacks_std = np.std(cann['hisp_blacks'])
cann_hisp_blacks_std

0.1269575516626439

A flag feature is created in the 'non_cann' DataFrame which flags those non-cannabis crimes which were committed by a recorded Hispanic Black suspect ('1' for Hispanic Black and '0' for not Hispanic Black).

In [98]:
hisp_blacks_nc = non_cann.SUSP_RACE_cleaned == 'BLACK HISPANIC'
non_cann['hisp_blacks'] = hisp_blacks_nc.astype(int)

There are 16,256 non-cannabis crimes with a recorded Hispanic Black suspect.

In [99]:
non_cann['hisp_blacks'].value_counts()

0    607593
1     16256
Name: hisp_blacks, dtype: int64

The mean of the 'hisp_blacks' feature is assigned to 'non_cann_hisp_blacks', for use in the t-test function that will be run below.

In [100]:
non_cann_hisp_blacks = np.mean(non_cann['hisp_blacks'])
non_cann_hisp_blacks

0.026057587653422542

The standard deviation of the 'hisp_blacks' feature is assigned to 'non_cann_hisp_blacks_std', for use in the t-test function that will be run below.

In [101]:
non_cann_hisp_blacks_std = np.std(non_cann['hisp_blacks'])
non_cann_hisp_blacks_std

0.15930659050730908

The t-test is run using the mean and standard deviation of the 'hisp_blacks' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group. The t-score is approximately 8.9 and the p-value is 5.1e-19.

In [102]:
scipy.stats.ttest_ind_from_stats(non_cann_hisp_blacks, non_cann_hisp_blacks_std, n_non_cann, cann_hisp_blacks, cann_hisp_blacks_std, n_cann)

Ttest_indResult(statistic=8.910896298244786, pvalue=5.074843146331189e-19)

The sixth null hypothesis is rejected. Hispanic Blacks arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

# Likelihood of Arrest for Cannabis and Non-Cannabis Crimes for Asians:  Seventh Null Hypothesis

The seventh null hypothesis states that Asians arrested for a crime are equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed by a recorded Asian suspect ('1' for Asian and '0' for not Asian).

In [103]:
asians_c = cann.SUSP_RACE_cleaned == 'ASIAN / PACIFIC ISLANDER'
cann['asians'] = asians_c.astype(int)

There are 82 cannabis crimes with a recorded Asian suspect.

In [104]:
cann['asians'].value_counts()

0    21948
1       82
Name: asians, dtype: int64

The mean of the 'asians' feature is assigned to 'cann_asians', for use in the t-test function that will be run below.

In [105]:
cann_asians = np.mean(cann['asians'])
cann_asians

0.003722197004085338

The standard deviation of the 'asians' feature is assigned to 'cann_asians_std', for use in the t-test function that will be run below.

In [106]:
cann_asians_std = np.std(cann['asians'])
cann_asians_std

0.060896159596039376

A flag feature is created in the 'non_cann' DataFrame which flags those non-cannabis crimes which were committed by a recorded Asian suspect ('1' for Asian and '0' for not Asian).

In [107]:
asians_nc = non_cann.SUSP_RACE_cleaned == 'ASIAN / PACIFIC ISLANDER'
non_cann['asians'] = asians_nc.astype(int)

There are 10,211 non-cannabis crimes with a recorded Asian suspect.

In [108]:
non_cann['asians'].value_counts()

0    613638
1     10211
Name: asians, dtype: int64

The mean of the 'asians' feature is assigned to 'non_cann_asians', for use in the t-test function that will be run below.

In [109]:
non_cann_asians = np.mean(non_cann['asians'])
non_cann_asians

0.016367742835205316

The standard deviation of the 'asians' feature is assigned to 'non_cann_asians_std', for use in the t-test function that will be run below.

In [110]:
non_cann_asians_std = np.std(non_cann['asians'])
non_cann_asians_std

0.1268851442439904

The t-test is run using the mean and standard deviation of the 'asians' feature, along with the sample size, of the cannabis crime group and the non-cannabis crime group. The t-score is approximately 14.7 and the p-value is 4.1e-49.

In [111]:
scipy.stats.ttest_ind_from_stats(non_cann_asians, non_cann_asians_std, n_non_cann, cann_asians, cann_asians_std, n_cann)

Ttest_indResult(statistic=14.732441192840762, pvalue=4.064362489290671e-49)

The seventh null hypothesis is rejected. Asians arrested for a crime are NOT equally likely to be arrested for cannabis crimes as they are for non-cannabis crimes.

# Likelihood of Arrest for Cannabis Crimes for African-Americans and Whites:  Eighth Null Hypothesis

The eighth null hypothesis states that African-Americans arrested for a crime are equally likely to be arrested for cannabis crimes as Whites arrested for a crime. The arguments for the t-test function are already defined above.

The t-test is run using the means and standard deviations of the 'african_american' and 'white' features, along with the sample size of the cannabis crime group. The t-score is approximately 34.0 and the p-value is approximately 2.0e-249.

In [112]:
scipy.stats.ttest_ind_from_stats(cann_af_am, cann_af_am_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=33.95101358854242, pvalue=1.9730978972156837e-249)

The eighth null hypothesis is rejected, so African-Americans arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

# Likelihood of Arrest for Cannabis Crimes for Hispanic Whites and Whites:  Ninth Null Hypothesis

The ninth null hypothesis states that Hispanic Whites arrested for a crime are equally likely to be arrested for cannabis crimes as Whites arrested for a crime. The arguments for the t-test function are already defined above. The t-score is approximately 18.4 and the p-value is 2.0e-75.

In [113]:
scipy.stats.ttest_ind_from_stats(cann_hisp_whites, cann_hisp_whites_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=18.413271851747407, pvalue=1.9750398165467757e-75)

The ninth null hypothesis was rejected, so Hispanic whites arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

# Likelihood of Arrest for Cannabis Crimes for Hispanic Blacks and Whites:  10th Null Hypothesis

The 10th null hypothesis states that Hispanic blacks arrested for a crime are equally likely to be charged for cannabis crimes as white people arrested for a crime. The arguments for the t-test function are already defined above. The t-score is approximately 2.7 and the p-value is approximately 0.007.

In [114]:
scipy.stats.ttest_ind_from_stats(cann_hisp_blacks, cann_hisp_blacks_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=2.6791854474292762, pvalue=0.0073828913926165885)

The 10th null hypothesis was rejected, so Hispanic blacks arrested for a crime are NOT equally likely to be charged for cannabis crimes as white people arrested for a crime.

# Likelihood of Arrest for Cannabis Crimes for Asians and Whites:  11th Null Hypothesis

The 11th null hypothesis states that Asians arrested for a crime are equally likely to be charged for cannabis crimes as white people arrested for a crime. The arguments for the t-test function are already defined above. The t-score is approximately -11.0 and the p-value is 6.6e-28.

In [115]:
scipy.stats.ttest_ind_from_stats(cann_asians, cann_asians_std, n_cann, cann_white, cann_white_std, n_cann)

Ttest_indResult(statistic=-10.957559610650065, pvalue=6.641606022332508e-28)

The 11th null hypothesis was rejected, so Asians arrested for a crime are NOT equally likely to be charged for cannabis crimes as Whites arrested for a crime.

# Likelihood of Arrest for Cannabis Crimes for African-Americans and Hispanic Whites:  12th Null Hypothesis

The 12th null hypothesis states that African-Americans arrested for a crime are equally likely to be charged for cannabis crimes as Hispanic Whites arrested for a crime. The arguments for the t-test function are already defined above. The t-score is approximately 17.1 and the p-value is 1.9e-65.

In [116]:
scipy.stats.ttest_ind_from_stats(cann_af_am, cann_af_am_std, n_cann, cann_hisp_whites, cann_hisp_whites_std, n_cann)

Ttest_indResult(statistic=17.115399986624872, pvalue=1.8561500214806594e-65)

The 12th null hypothesis was rejected, so African-Americans arrested for a crime are NOT equally likely to be charged for cannabis crimes as Hispanic Whites arrested for a crime.

# Likelihood of Arrest for Cannabis Crimes for African-Americans and Hispanic Blacks:  13th Null Hypothesis

The 13th null hypothesis states that African-Americans arrested for a crime are equally likely to be charged for cannabis crimes as Hispanic Blacks arrested for a crime. The arguments for the t-test function are already defined above. The t-score is approximately 31.9 and the p-value is 2.9e-220.

In [117]:
scipy.stats.ttest_ind_from_stats(cann_af_am, cann_af_am_std, n_cann, cann_hisp_blacks, cann_hisp_blacks_std, n_cann)

Ttest_indResult(statistic=31.86216902974593, pvalue=2.8657113854483013e-220)

The 13th null hypothesis was rejected, so African-Americans arrested for a crime are NOT equally likely to be charged for cannabis crimes as black Hispanic people arrested for a crime.

# Likelihood of Arrest for Cannabis Crimes for White Hispanics and Black Hispanics:  14th Null Hypothesis

The 14th null hypothesis states that White Hispanics arrested for a crime are equally likely to be charged for cannabis crimes as Black Hispanics arrested for a crime. The arguments for the t-test function are already defined above. The t-score is approximately 16.0 and the p-value is 2.3e-57.

In [118]:
scipy.stats.ttest_ind_from_stats(cann_hisp_whites, cann_hisp_whites_std, n_cann, cann_hisp_blacks, cann_hisp_blacks_std, n_cann)

Ttest_indResult(statistic=15.986003550012267, pvalue=2.320652865432582e-57)

The 14th null hypothesis was rejected, so White Hispanics arrested for a crime are NOT equally likely to be charged for cannabis crimes as Black Hispanics arrested for a crime.

# Likelihood of Misdemeanor versus Violation Possession for African-Americans: 15th Null Hypothesis

The 15th null hypothesis was that African-Americans arrested for a cannabis crime are equally likely to be charged for misdemeanor cannabis possession as they are for violation cannabis possession. This hypothesis test was warranted by the fact that in the Data Story and EDA report, Whites were slightly more likely to be arrested for violation cannabis possession than misdemeanor cannabis possession.

A DataFrame of African-Americans who committed cannabis crimes is first subsetted. Its length is then taken to define the sample size of this group, which will be used in the hypothesis test below.

In [119]:
af_am_cann = cann[cann.african_american == 1]

In [120]:
n_af_am_cann = len(af_am_cann)
n_af_am_cann

1784

Next, a flag feature is created in the 'cann' DataFrame which flags those misdemeanor cannabis possession crimes which were committed by a recorded African-American suspect ('1' for African-American misdemeanor cannabis possession and '0' for not African-American misdemeanor cannabis possession).

In [121]:
af_am_misd_poss = (cann['african_american'] == 1) & (cann['misd_poss'] == 1)
cann['af_am_misd_poss'] = af_am_misd_poss.astype(int)

There are 1,539 misdemeanor cannabis possession crimes with a recorded African-American suspect.

In [122]:
cann['af_am_misd_poss'].value_counts()

0    20491
1     1539
Name: af_am_misd_poss, dtype: int64

The mean of the 'af_am_misd_poss' feature is assigned to 'af_am_misd_poss_m', for use in the t-test function that will be run below.

In [123]:
af_am_misd_poss_m = np.mean(cann['af_am_misd_poss'])
af_am_misd_poss_m

0.06985928279618701

The standard deviation of the 'af_am_misd_poss' feature is assigned to 'af_am_misd_poss_std', for use in the t-test function that will be run below.

In [124]:
af_am_misd_poss_std = np.std(cann['af_am_misd_poss'])
af_am_misd_poss_std

0.2549097161808717

A flag feature is created in the 'cann' DataFrame which flags those violation cannabis possession crimes which were committed by a recorded African-American suspect ('1' for African-American violation cannabis possession and '0' for not African-American violation cannabis possession).

In [125]:
af_am_viol_poss = (cann['african_american'] == 1) & (cann['viol_poss'] == 1)
cann['af_am_viol_poss'] = af_am_viol_poss.astype(int)

There are 84 violation cannabis possession crimes with a recorded African-American suspect.

In [126]:
cann['af_am_viol_poss'].value_counts()

0    21946
1       84
Name: af_am_viol_poss, dtype: int64

The mean of the 'af_am_viol_poss' feature is assigned to 'af_am_viol_poss_m', for use in the t-test function that will be run below.

In [127]:
af_am_viol_poss_m = np.mean(cann['af_am_viol_poss'])
af_am_viol_poss_m

0.0038129822968679073

The standard deviation of the 'af_am_viol_poss' feature is assigned to 'af_am_viol_poss_std', for use in the t-test function that will be run below.

In [128]:
af_am_viol_poss_std = np.std(cann['af_am_viol_poss'])
af_am_viol_poss_std

0.061631513553297955

The t-test is run using the means and standard deviations of the 'af_am_misd_poss' and 'af_am_viol_poss' features, along with the sample size of the African-American cannabis crime group. The t-score is approximately 10.6 and the p-value is 4.9e-26.

In [129]:
scipy.stats.ttest_ind_from_stats(af_am_misd_poss_m, af_am_misd_poss_std, n_af_am_cann, af_am_viol_poss_m, af_am_viol_poss_std, n_af_am_cann)

Ttest_indResult(statistic=10.637094620604225, pvalue=4.8992078156127774e-26)

The 15th null hypothesis is rejected, so African-Americans arrested for a cannabis crime are NOT equally likely to be charged for misdemeanor cannabis possession as they are for violation cannabis possession.

# Likelihood of Misdemeanor versus Violation Possession for Whites: 16th Null Hypothesis

EDA suggested that Whites were slightly more likely to be arrested for violation cannabis possession than for misdemeanor cannabis possession. The 16th null hypothesis to be tested is that Whites arrested for a cannabis crime are equally likely to be charged for misdemeanor cannabis possession as they are for violation cannabis possession.

A DataFrame of Whites who committed cannabis crimes is first subsetted. Its length is then taken to define the sample size of this group, which will be used in the hypothesis test below.

In [130]:
white_cann = cann[cann.white == 1]

In [131]:
n_white_cann = len(white_cann)
n_white_cann

293

Next, a flag feature is created in the 'cann' DataFrame which flags those misdemeanor cannabis possession crimes which were committed by a recorded White suspect ('1' for White misdemeanor cannabis possession and '0' for not White misdemeanor cannabis possession).

In [132]:
white_misd_poss = (cann['white'] == 1) & (cann['misd_poss'] == 1)
cann['white_misd_poss'] = white_misd_poss.astype(int)

There are 261 misdemeanor cannabis possession crimes with a recorded White suspect.

In [133]:
cann['white_misd_poss'].value_counts()

0    21769
1      261
Name: white_misd_poss, dtype: int64

The mean of the 'white_misd_poss' feature is assigned to 'white_misd_poss_m', for use in the t-test function that will be run below.

In [134]:
white_misd_poss_m = np.mean(cann['white_misd_poss'])
white_misd_poss_m

0.011847480708125283

The standard deviation of the 'white_misd_poss' feature is assigned to 'white_misd_poss_std', for use in the t-test function that will be run below.

In [135]:
white_misd_poss_std = np.std(cann['white_misd_poss'])
white_misd_poss_std

0.10819943580719762

A flag feature is created in the 'cann' DataFrame which flags those violation cannabis possession crimes which were committed by a recorded White suspect ('1' for White violation cannabis possession and '0' for not White violation cannabis possession).

In [136]:
white_viol_poss = (cann['white'] == 1) & (cann['viol_poss'] == 1)
cann['white_viol_poss'] = white_viol_poss.astype(int)

There are 12 violation cannabis possession crimes with a recorded White suspect.

In [137]:
cann['white_viol_poss'].value_counts()

0    22018
1       12
Name: white_viol_poss, dtype: int64

The mean of the 'white_viol_poss' feature is assigned to 'white_viol_poss_m', for use in the t-test function that will be run below.

In [138]:
white_viol_poss_m = np.mean(cann['white_viol_poss'])
white_viol_poss_m

0.0005447117566954153

The standard deviation of the 'white_viol_poss' feature is assigned to 'white_viol_poss_std', for use in the t-test function that will be run below.

In [139]:
white_viol_poss_std = np.std(cann['white_viol_poss'])
white_viol_poss_std

0.023332703353820677

The t-test is run using the means and standard deviations of the 'white_misd_poss' and 'white_viol_poss' features, along with the sample size of the White cannabis crime group. The t-score is approximately 1.5 and the p-value is approximately 0.13.

In [140]:
scipy.stats.ttest_ind_from_stats(white_misd_poss_m, white_misd_poss_std, n_white_cann, white_viol_poss_m, white_viol_poss_std, n_white_cann)

Ttest_indResult(statistic=1.5125385767500596, pvalue=0.13101148742325466)

The 16th null hypothesis is not rejected, so this test provides no evidence that Whites arrested for a cannabis crime are charged with misdemeanor cannabis possession at a different rate than violation cannabis possession.
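
The flag means used above are taken over all 22,030 cannabis rows (261/22,030 and 12/22,030), while the sample size passed to the test is 293. As a cross-check that works directly from the two counts, the sketch below runs an exact two-sided binomial test of whether the 261 + 12 charges split evenly between the two categories. This is an alternative technique rather than the notebook's method, and the function name is an illustrative assumption.

```python
from math import comb

def equal_split_p_value(x1, x2):
    """Exact two-sided binomial test that two category counts split 50/50.

    Conditional on m = x1 + x2, each count follows Binomial(m, 0.5) under
    the null hypothesis that both categories are equally likely. By symmetry,
    the two-sided p-value is twice the lower tail at the smaller count.
    """
    m = x1 + x2
    k = min(x1, x2)
    # P(X <= k) for X ~ Binomial(m, 0.5), computed with exact integers.
    lower_tail = sum(comb(m, i) for i in range(k + 1)) / 2**m
    # Double the tail for a two-sided test and cap the result at 1.
    return min(1.0, 2 * lower_tail)

# Counts printed above: 261 misdemeanor and 12 violation possession charges.
p_white = equal_split_p_value(261, 12)
print(p_white)
```

Under this conditional test, a 261-to-12 split yields a very small p-value, so the conclusion is sensitive to how the two rates and the sample size are defined.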

# Likelihood of Violation Cannabis Possession Charges Among African-Americans and Whites: 17th Null Hypothesis

To follow up on the 16th null hypothesis not being rejected, the 17th null hypothesis is that African-Americans arrested for a cannabis crime are equally likely to be arrested for violation possession as are Whites arrested for a cannabis crime.

The t-test is run using the means and standard deviations of the 'af_am_viol_poss' and 'white_viol_poss' features, along with the sample sizes of the African American and White cannabis crime groups. The t-score is approximately 0.9 and the p-value is approximately 0.37.

In [140]:
scipy.stats.ttest_ind_from_stats(af_am_viol_poss_m, af_am_viol_poss_std, n_af_am_cann, white_viol_poss_m, white_viol_poss_std, n_white_cann)

Ttest_indResult(statistic=0.8970641100980942, pvalue=0.3697887733367471)

The 17th null hypothesis is not rejected, so this test provides no evidence that African-Americans arrested for a cannabis crime are arrested for violation possession at a different rate than Whites arrested for a cannabis crime. This suggests that violation possession charges are not applied differently to African-American and White suspects. It bears repeating that violation possession charges are the least severe cannabis crime charges in NYC.
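
Because both groups here are separate sets of suspects, the within-group rates (84 of 1,784 and 12 of 293) can also be compared directly. The following is a minimal sketch of a pooled two-proportion z-test built from those counts. It is offered as an alternative check under a normal approximation, not as the notebook's method, and the function name is an illustrative assumption.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts reported above: 84 of 1,784 and 12 of 293.
z, p_value = two_proportion_z(84, 1784, 12, 293)
print(z, p_value)
```

Using each group's own size as the denominator keeps the rates and the sample sizes on the same footing.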

# Likelihood of Misdemeanor versus Felony Possession for African-Americans: 18th Null Hypothesis

The 18th null hypothesis is that African-Americans arrested for a cannabis crime are equally likely to be arrested for misdemeanor possession as they are for felony possession.

First, a flag feature is created in the 'cann' DataFrame which flags those felony cannabis possession crimes which were committed by a recorded African-American suspect ('1' for African-American felony cannabis possession and '0' for not African-American felony cannabis possession).

In [141]:
af_am_felony_poss = (cann['african_american'] == 1) & (cann['felony_poss'] == 1)
cann['af_am_felony_poss'] = af_am_felony_poss.astype(int)

There are 42 felony cannabis possession crimes with a recorded African-American suspect.

In [142]:
cann['af_am_felony_poss'].value_counts()

0    21988
1       42
Name: af_am_felony_poss, dtype: int64

The mean of the 'af_am_felony_poss' feature is assigned to 'af_am_felony_poss_m', for use in the t-test function that will be run below.

In [143]:
af_am_felony_poss_m = np.mean(cann['af_am_felony_poss'])
af_am_felony_poss_m

0.0019064911484339537

The standard deviation of the 'af_am_felony_poss' feature is assigned to 'af_am_felony_poss_std', for use in the t-test function that will be run below.

In [144]:
af_am_felony_poss_std = np.std(cann['af_am_felony_poss'])
af_am_felony_poss_std

0.04362174274297742

The t-test is run using the means and standard deviations of the 'af_am_misd_poss' and 'af_am_felony_poss' features, along with the sample size of the African-American cannabis crime group. The t-score is approximately 11.1 and the p-value is 3.7e-28.

In [145]:
scipy.stats.ttest_ind_from_stats(af_am_misd_poss_m, af_am_misd_poss_std, n_af_am_cann, af_am_felony_poss_m, af_am_felony_poss_std, n_af_am_cann)

Ttest_indResult(statistic=11.098152479679825, pvalue=3.6838426185513833e-28)

The 18th null hypothesis is rejected, so African-Americans arrested for a cannabis crime are NOT equally likely to be arrested for misdemeanor possession as they are for felony possession.

# Likelihood of Misdemeanor versus Felony Sales for African-Americans: 19th Null Hypothesis

The 19th null hypothesis is that African-Americans arrested for a cannabis crime are equally likely to be arrested for misdemeanor sales as they are for felony sales.

First, a flag feature is created in the 'cann' DataFrame which flags those misdemeanor cannabis sales crimes which were committed by a recorded African-American suspect ('1' for African-American misdemeanor cannabis sales and '0' for not African-American misdemeanor cannabis sales).

In [146]:
af_am_misd_sales = (cann['african_american'] == 1) & (cann['misd_sales'] == 1)
cann['af_am_misd_sales'] = af_am_misd_sales.astype(int)

There are 108 misdemeanor cannabis sales crimes with a recorded African-American suspect.

In [147]:
cann['af_am_misd_sales'].value_counts()

0    21922
1      108
Name: af_am_misd_sales, dtype: int64

The mean of the 'af_am_misd_sales' feature is assigned to 'af_am_misd_sales_m', for use in the t-test function that will be run below.

In [148]:
af_am_misd_sales_m = np.mean(cann['af_am_misd_sales'])
af_am_misd_sales_m

0.004902405810258738

The standard deviation of the 'af_am_misd_sales' feature is assigned to 'af_am_misd_sales_std', for use in the t-test function that will be run below.

In [149]:
af_am_misd_sales_std = np.std(cann['af_am_misd_sales'])
af_am_misd_sales_std

0.0698453450670224

A flag feature is created in the 'cann' DataFrame which flags those felony cannabis sales crimes which were committed by a recorded African-American suspect ('1' for African-American felony cannabis sales and '0' for not African-American felony cannabis sales).

In [150]:
af_am_felony_sales = (cann['african_american'] == 1) & (cann['felony_sales'] == 1)
cann['af_am_felony_sales'] = af_am_felony_sales.astype(int)

There are 11 felony cannabis sales crimes with a recorded African-American suspect.

In [151]:
cann['af_am_felony_sales'].value_counts()

0    22019
1       11
Name: af_am_felony_sales, dtype: int64

The mean of the 'af_am_felony_sales' feature is assigned to 'af_am_felony_sales_m', for use in the t-test function that will be run below.

In [152]:
af_am_felony_sales_m = np.mean(cann['af_am_felony_sales'])
af_am_felony_sales_m

0.0004993191103041308

The standard deviation of the 'af_am_felony_sales' feature is assigned to 'af_am_felony_sales_std', for use in the t-test function that will be run below.

In [153]:
af_am_felony_sales_std = np.std(cann['af_am_felony_sales'])
af_am_felony_sales_std

0.0223398699801565

The t-test is run using the means and standard deviations of the 'af_am_misd_sales' and 'af_am_felony_sales' features, along with the sample size of the African-American cannabis crime group. The t-score is approximately 2.5 and the p-value is 0.01.

In [155]:
scipy.stats.ttest_ind_from_stats(af_am_misd_sales_m, af_am_misd_sales_std, n_af_am_cann, af_am_felony_sales_m, af_am_felony_sales_std, n_af_am_cann)

Ttest_indResult(statistic=2.5361024521197204, pvalue=0.011251721921620172)

The 19th null hypothesis is rejected, so African-Americans arrested for a cannabis crime are NOT equally likely to be arrested for misdemeanor sales as they are for felony sales.
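
Because many hypotheses are tested on the same sample, a multiple-comparison correction is one common safeguard. The following is a minimal sketch of a Bonferroni adjustment applied to the larger p-values reported so far. The 0.05 significance level and the family size of 28 tests are both assumptions made for illustration.

```python
# p-values copied from the test outputs above (hypothesis number -> p-value).
p_values = {
    10: 0.0073828913926165885,
    16: 0.13101148742325466,
    17: 0.3697887733367471,
    19: 0.011251721921620172,
}

ALPHA = 0.05   # assumed significance level, not stated in this section
N_TESTS = 28   # assumed family size: hypotheses numbered 1 through 28

threshold = ALPHA / N_TESTS  # Bonferroni-adjusted per-test threshold

for number, p in sorted(p_values.items()):
    verdict = "reject" if p < threshold else "fail to reject"
    print(f"H0 #{number}: p={p:.4g} -> {verdict} at {threshold:.4g}")
```

Under these assumptions, the 10th and 19th null hypotheses would no longer be rejected, while the tests with p-values many orders of magnitude smaller are unaffected.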

# Likelihood of Cannabis Arrests in the Bronx and Manhattan: 20th Null Hypothesis

The next subject for hypothesis testing is geography. The machine learning phase of this project will go into more depth on geographic location as a differential predictive factor for cannabis arrests and their sub-types, but the next series of hypothesis tests starts with borough as a geographic predictor. The 20th null hypothesis is therefore that cannabis arrests are equally likely to happen in the Bronx as they are in Manhattan.

The borough feature's value counts for the cannabis group are called first.

In [156]:
cann['BORO_NM'].value_counts()

BRONX            8663
BROOKLYN         7164
MANHATTAN        4674
QUEENS            939
STATEN ISLAND     575
unknown            15
Name: BORO_NM, dtype: int64

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed in the Bronx ('1' for Bronx cannabis crimes and '0' for cannabis crimes outside of the Bronx).

In [157]:
bronx_cann = cann['BORO_NM'] == 'BRONX'
cann['bronx_cann'] = bronx_cann.astype(int)

In the sample, there are 8,663 cannabis crimes that occurred in the Bronx.

In [159]:
cann['bronx_cann'].value_counts()

0    13367
1     8663
Name: bronx_cann, dtype: int64

The mean of the 'bronx_cann' feature is assigned to 'bronx_cann_m', for use in the t-test function that will be run below.

In [160]:
bronx_cann_m = np.mean(cann['bronx_cann'])
bronx_cann_m

0.3932364956876986

The standard deviation of the 'bronx_cann' feature is assigned to 'bronx_cann_std', for use in the t-test function that will be run below.

In [161]:
bronx_cann_std = np.std(cann['bronx_cann'])
bronx_cann_std

0.4884685805115633

Next, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed in Manhattan ('1' for Manhattan cannabis crimes and '0' for cannabis crimes outside of Manhattan).

In [162]:
manhattan_cann = cann['BORO_NM'] == 'MANHATTAN'
cann['manhattan_cann'] = manhattan_cann.astype(int)

In the sample, there are 4,674 cannabis crimes that occurred in Manhattan.

In [165]:
cann['manhattan_cann'].value_counts()

0    17356
1     4674
Name: manhattan_cann, dtype: int64

The mean of the 'manhattan_cann' feature is assigned to 'manhattan_cann_m', for use in the t-test function that will be run below.

In [166]:
manhattan_cann_m = np.mean(cann['manhattan_cann'])
manhattan_cann_m

0.21216522923286427

The standard deviation of the 'manhattan_cann' feature is assigned to 'manhattan_cann_std', for use in the t-test function that will be run below.

In [167]:
manhattan_cann_std = np.std(cann['manhattan_cann'])
manhattan_cann_std

0.40884122191564554

The t-test is run using the means and standard deviations of the 'bronx_cann' and 'manhattan_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 42.2 and the p-value is reported as 0.0, meaning it is too small to be represented as a floating-point number.

In [169]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, manhattan_cann_m, manhattan_cann_std, n_cann)

Ttest_indResult(statistic=42.19160816065755, pvalue=0.0)

The 20th null hypothesis is rejected, so cannabis arrests are NOT equally as likely to happen in the Bronx as they are in Manhattan.
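
With its default equal-variance setting, scipy.stats.ttest_ind_from_stats computes a pooled two-sample t statistic. The following minimal sketch reproduces the statistic above by hand from the printed means and standard deviations. The function name 'pooled_t' is an illustrative assumption, and the sample size of 22,030 is taken from the value counts above (8,663 + 13,367).

```python
import math

def pooled_t(mean1, std1, n1, mean2, std2, n2):
    """Equal-variance two-sample t statistic from summary statistics."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / dof
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, dof

n_cann = 22030  # rows in the cannabis sample (8663 + 13367)
t_stat, dof = pooled_t(
    0.3932364956876986, 0.4884685805115633, n_cann,    # Bronx mean, std
    0.21216522923286427, 0.40884122191564554, n_cann,  # Manhattan mean, std
)
print(t_stat, dof)
```

Writing the formula out makes it clear that the test treats the two flags as independent samples of equal size, which is worth keeping in mind because both flags are computed on the same rows.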

# Likelihood of Cannabis Arrests in the Bronx and Brooklyn: 21st Null Hypothesis

The 21st null hypothesis is that cannabis arrests are equally as likely to happen in the Bronx as they are in Brooklyn.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed in Brooklyn ('1' for Brooklyn cannabis crimes and '0' for cannabis crimes outside of Brooklyn).

In [170]:
brooklyn_cann = cann['BORO_NM'] == 'BROOKLYN'
cann['brooklyn_cann'] = brooklyn_cann.astype(int)

In the sample, there are 7,164 cannabis crimes that occurred in Brooklyn.

In [172]:
cann['brooklyn_cann'].value_counts()

0    14866
1     7164
Name: brooklyn_cann, dtype: int64

The mean of the 'brooklyn_cann' feature is assigned to 'brooklyn_cann_m', for use in the t-test function that will be run below.

In [173]:
brooklyn_cann_m = np.mean(cann['brooklyn_cann'])
brooklyn_cann_m

0.32519291874716294

The standard deviation of the 'brooklyn_cann' feature is assigned to 'brooklyn_cann_std', for use in the t-test function that will be run below.

In [174]:
brooklyn_cann_std = np.std(cann['brooklyn_cann'])
brooklyn_cann_std

0.4684468852964487

The t-test is run using the means and standard deviations of the 'bronx_cann' and 'brooklyn_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 14.9 and the p-value is 3.1e-50.

In [176]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, brooklyn_cann_m, brooklyn_cann_std, n_cann)

Ttest_indResult(statistic=14.922471795513756, pvalue=3.123682217096033e-50)

The 21st null hypothesis is rejected, so cannabis arrests are NOT equally as likely to happen in the Bronx as they are in Brooklyn.

# Likelihood of Cannabis Arrests in the Bronx and Queens: 22nd Null Hypothesis

The 22nd null hypothesis is that cannabis arrests are equally as likely to happen in the Bronx as they are in Queens.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed in Queens ('1' for Queens cannabis crimes and '0' for cannabis crimes outside of Queens).

In [177]:
queens_cann = cann['BORO_NM'] == 'QUEENS'
cann['queens_cann'] = queens_cann.astype(int)

In the sample, there are 939 cannabis crimes that occurred in Queens.

In [179]:
cann['queens_cann'].value_counts()

0    21091
1      939
Name: queens_cann, dtype: int64

The mean of the 'queens_cann' feature is assigned to 'queens_cann_m', for use in the t-test function that will be run below.

In [180]:
queens_cann_m = np.mean(cann['queens_cann'])
queens_cann_m

0.04262369496141625

The standard deviation of the 'queens_cann' feature is assigned to 'queens_cann_std', for use in the t-test function that will be run below.

In [181]:
queens_cann_std = np.std(cann['queens_cann'])
queens_cann_std

0.20200721667614016

The t-test is run using the means and standard deviations of the 'bronx_cann' and 'queens_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 98.5 and the p-value is reported as 0.0 (too small to be represented as a floating-point number).

In [182]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, queens_cann_m, queens_cann_std, n_cann)

Ttest_indResult(statistic=98.44988952598659, pvalue=0.0)

The 22nd null hypothesis is rejected, so cannabis arrests are NOT equally as likely to happen in the Bronx as they are in Queens.
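
A p-value printed as 0.0 is not exactly zero. It is smaller than the smallest positive double-precision number (about 5e-324) and underflows. The following is a minimal sketch of estimating its order of magnitude, assuming the normal approximation to the t distribution is acceptable at this many degrees of freedom (44,058). The function name is an illustrative assumption.

```python
import math

def log10_two_sided_p(z):
    """Approximate log10 of the two-sided normal tail probability for large |z|.

    Uses the leading term of the asymptotic expansion
    P(|Z| > z) ~ 2 * exp(-z**2 / 2) / (z * sqrt(2 * pi)).
    """
    z = abs(z)
    log_p = math.log(2) - z * z / 2 - math.log(z * math.sqrt(2 * math.pi))
    return log_p / math.log(10)

# Direct evaluation underflows: the true value is below the smallest float.
print(math.erfc(98.45 / math.sqrt(2)))   # prints 0.0
print(log10_two_sided_p(98.45))          # order of magnitude of the p-value
```

Working on the log scale keeps extremely small p-values comparable with each other instead of collapsing them all to 0.0.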

# Likelihood of Cannabis Arrests in the Bronx and Staten Island: 23rd Null Hypothesis

The 23rd null hypothesis is that cannabis arrests are equally as likely to happen in the Bronx as they are in Staten Island.

First, a flag feature is created in the 'cann' DataFrame which flags those cannabis crimes which were committed in Staten Island ('1' for Staten Island cannabis crimes and '0' for cannabis crimes outside of Staten Island).

In [184]:
staten_cann = cann['BORO_NM'] == 'STATEN ISLAND'
cann['staten_cann'] = staten_cann.astype(int)

In the sample, there are 575 cannabis crimes that occurred in Staten Island.

In [185]:
cann['staten_cann'].value_counts()

0    21455
1      575
Name: staten_cann, dtype: int64

The mean of the 'staten_cann' feature is assigned to 'staten_cann_m', for use in the t-test function that will be run below.

In [186]:
staten_cann_m = np.mean(cann['staten_cann'])
staten_cann_m

0.026100771674988654

The standard deviation of the 'staten_cann' feature is assigned to 'staten_cann_std', for use in the t-test function that will be run below.

In [187]:
staten_cann_std = np.std(cann['staten_cann'])
staten_cann_std

0.15943500679886

The t-test is run using the means and standard deviations of the 'bronx_cann' and 'staten_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 106.1 and the p-value is reported as 0.0 (too small to be represented as a floating-point number).

In [189]:
scipy.stats.ttest_ind_from_stats(bronx_cann_m, bronx_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=106.05095800776438, pvalue=0.0)

The 23rd null hypothesis is rejected, so cannabis arrests are NOT equally as likely to happen in the Bronx as they are in Staten Island.

# Likelihood of Cannabis Arrests in Manhattan and Brooklyn: 24th Null Hypothesis

The 24th null hypothesis is that cannabis arrests are equally as likely to happen in Manhattan as they are in Brooklyn.

The t-test is run using the means and standard deviations of the 'manhattan_cann' and 'brooklyn_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately -27.0 and the p-value is 4.8e-159.

In [191]:
scipy.stats.ttest_ind_from_stats(manhattan_cann_m, manhattan_cann_std, n_cann, brooklyn_cann_m, brooklyn_cann_std, n_cann)

Ttest_indResult(statistic=-26.98141688192628, pvalue=4.821290618883564e-159)

The 24th null hypothesis is rejected, so cannabis arrests are NOT equally as likely to happen in Manhattan as they are in Brooklyn.

# Likelihood of Cannabis Arrests in Manhattan and Queens: 25th Null Hypothesis

The 25th null hypothesis is that cannabis arrests are equally as likely to happen in Manhattan as they are in Queens.

The t-test is run using the means and standard deviations of the 'manhattan_cann' and 'queens_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 55.2 and the p-value is reported as 0.0 (too small to be represented as a floating-point number).

In [193]:
scipy.stats.ttest_ind_from_stats(manhattan_cann_m, manhattan_cann_std, n_cann, queens_cann_m, queens_cann_std, n_cann)

Ttest_indResult(statistic=55.18175117654442, pvalue=0.0)

The 25th null hypothesis is rejected, so cannabis arrests are NOT equally as likely to happen in Manhattan as they are in Queens.

# Likelihood of Cannabis Arrests in Manhattan and Staten Island: 26th Null Hypothesis

The 26th null hypothesis is that cannabis arrests are equally likely to happen in Manhattan and Staten Island.

The t-test is run using the means and standard deviations of the 'manhattan_cann' and 'staten_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 62.9 and the p-value is reported as 0.0, meaning it is too small to be represented as a floating-point number.

In [195]:
scipy.stats.ttest_ind_from_stats(manhattan_cann_m, manhattan_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=62.93258968480057, pvalue=0.0)

The 26th null hypothesis is rejected: cannabis arrests are not equally likely to happen in Manhattan and Staten Island. The positive t-score indicates that a larger share of the sampled cannabis arrests occurred in Manhattan.

# Likelihood of Cannabis Arrests in Brooklyn and Queens: 27th Null Hypothesis

The 27th null hypothesis is that cannabis arrests are equally likely to happen in Brooklyn and Queens.

The t-test is run using the means and standard deviations of the 'brooklyn_cann' and 'queens_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 82.2 and the p-value is reported as 0.0, meaning it is too small to be represented as a floating-point number.

In [197]:
scipy.stats.ttest_ind_from_stats(brooklyn_cann_m, brooklyn_cann_std, n_cann, queens_cann_m, queens_cann_std, n_cann)

Ttest_indResult(statistic=82.21238337492828, pvalue=0.0)

The 27th null hypothesis is rejected: cannabis arrests are not equally likely to happen in Brooklyn and Queens. The positive t-score indicates that a larger share of the sampled cannabis arrests occurred in Brooklyn.

# Likelihood of Cannabis Arrests in Brooklyn and Staten Island: 28th Null Hypothesis

The 28th null hypothesis is that cannabis arrests are equally likely to happen in Brooklyn and Staten Island.

The t-test is run using the means and standard deviations of the 'brooklyn_cann' and 'staten_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 89.7 and the p-value is reported as 0.0, meaning it is too small to be represented as a floating-point number.

In [199]:
scipy.stats.ttest_ind_from_stats(brooklyn_cann_m, brooklyn_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=89.71221047700361, pvalue=0.0)

The 28th null hypothesis is rejected: cannabis arrests are not equally likely to happen in Brooklyn and Staten Island. The positive t-score indicates that a larger share of the sampled cannabis arrests occurred in Brooklyn.

# Likelihood of Cannabis Arrests in Queens and Staten Island: 29th Null Hypothesis

The 29th null hypothesis is that cannabis arrests are equally likely to happen in Queens and Staten Island.

The t-test is run using the means and standard deviations of the 'queens_cann' and 'staten_cann' features, along with the sample size of the cannabis crime group. The t-score is approximately 9.5 and the p-value is approximately 1.7e-21.

In [201]:
scipy.stats.ttest_ind_from_stats(queens_cann_m, queens_cann_std, n_cann, staten_cann_m, staten_cann_std, n_cann)

Ttest_indResult(statistic=9.529682683533519, pvalue=1.654821103194533e-21)

The 29th null hypothesis is rejected: cannabis arrests are not equally likely to happen in Queens and Staten Island. The positive t-score indicates that a larger share of the sampled cannabis arrests occurred in Queens.
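
The ten pairwise borough comparisons can also be run in a single loop rather than one cell each. The sketch below is a minimal illustration under stated assumptions: `shares`, `borough_stats` and `n` are hypothetical names, the share values are placeholders invented for the example (they are not the notebook's means), and `n` stands in for `n_cann`.

```python
from itertools import combinations
import math

import scipy.stats

n = 20000  # placeholder for n_cann; NOT the actual sample size

# Placeholder shares of arrests per borough; invented for illustration only
shares = {
    "bronx": 0.28,
    "manhattan": 0.22,
    "brooklyn": 0.32,
    "queens": 0.12,
    "staten": 0.06,
}

# For a 0/1 indicator with mean p, the sample std (ddof=1) is sqrt(p(1-p) * n/(n-1))
borough_stats = {
    name: (p, math.sqrt(p * (1 - p) * n / (n - 1)))
    for name, p in shares.items()
}

# Run the same t-test for every unordered pair of boroughs
results = {}
for first, second in combinations(borough_stats, 2):
    m1, s1 = borough_stats[first]
    m2, s2 = borough_stats[second]
    results[(first, second)] = scipy.stats.ttest_ind_from_stats(m1, s1, n, m2, s2, n)

for pair, res in results.items():
    print(pair, round(res.statistic, 2), res.pvalue)
```

In the notebook itself, the dictionary would be filled from the existing variables (for example `manhattan_cann_m` and `manhattan_cann_std`) instead of the placeholder values.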