# Statistical Inference: Crime Rates in NYC

## Import packages and data

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
df = pd.read_csv('data/final_NYC_crimes.csv')

## Question 1: NYC Census Population vs. NYC Crime Victim Population

**Null Hypothesis**: The racial distribution of crime victims in NYC in 2019 **is the same** as the racial distribution of the NYC population in 2019.

**Alternate Hypothesis**: The racial distribution of crime victims in NYC in 2019 **is different** from the racial distribution of the NYC population in 2019.

In [2]:
two_way_table = pd.crosstab(index=df["VIC_RACE"], columns=df["CRIME_TYPE"])
two_way_table

VIC_RACE \ CRIME_TYPE,ARSON,ASSAULT,BURGLARY,DRIVING_UNDER_INFLUENCE,FRAUD,GAMBLING,HARRASSMENT,KIDNAPPING,LARCENY,MISC_PENAL_LAW,MURDER,OFFENSES_AGAINST_PUBLIC_ORDER,POSSESSION_CONTROLLED_SUBSSTANCE,POSSESSION_WEAPON,ROBBERY,SEX_CRIMES,SOCIAL_RELATED_CRIMES,THEFT,TRAFFIC_LAWS_VIOLATION,UNCLASSIFIED
AMERICAN INDIAN/ALASKAN NATIVE,0,433,53,2,36,0,559,0,850,57,1,379,3,7,118,29,0,47,49,9
ASIAN / PACIFIC ISLANDER,10,5906,1333,45,537,1,5588,5,11579,772,21,4501,12,175,1712,594,2,373,733,69
BLACK,37,28116,1804,87,1077,0,26391,38,24133,4221,168,18430,52,761,3386,2437,31,327,2028,436
BLACK HISPANIC,9,4820,349,12,170,0,3770,8,4048,502,19,2928,11,81,928,429,4,50,324,64
UNKNOWN,537,4797,6311,4052,6446,288,4825,12,50314,4008,4,25856,13209,6084,2018,642,231,1857,1100,995
WHITE,31,9333,2668,83,1933,0,13455,17,26249,1957,15,9869,42,353,1621,1331,10,220,1065,177
WHITE HISPANIC,51,19697,1455,85,677,0,16412,46,16949,2531,75,12068,38,330,3542,2073,9,179,1441,219


### Chi-square Goodness-of-fit Test

*Following http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html*

The one-sample t-test checks whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is the analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution.

**We will use the chi-square goodness-of-fit test to test our Null Hypothesis.**

In [3]:
# Get the count of victim race in the sample
df.VIC_RACE.value_counts()

UNKNOWN                           133586
BLACK                             113960
WHITE HISPANIC                     77877
WHITE                              70429
ASIAN / PACIFIC ISLANDER           33968
BLACK HISPANIC                     18526
AMERICAN INDIAN/ALASKAN NATIVE      2632
Name: VIC_RACE, dtype: int64

In [4]:
# Using population count from outside source
nyc_pop = pd.DataFrame(["white"]*2737163 + ["hispanic"]*2489089 +\
                        ["black"]*1899379 + ["asian"]*1247479 + ["other"]*164563)
           
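# Use victim counts from df.VIC_RACE.value_counts(); rows with race UNKNOWN are left out.
# NOTE: 148306 equals WHITE HISPANIC (77877) + WHITE (70429);
# WHITE HISPANIC + BLACK HISPANIC would be 77877 + 18526 = 96403.
vic_pop = pd.DataFrame(["white"]*70429 + ["hispanic"]*148306 +\
                        ["black"]*113960 + ["asian"]*33968 + ["other"]*2632)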

# Create crosstab of sample and population data
nyc_table = pd.crosstab(index=nyc_pop[0], columns="count")
vic_table = pd.crosstab(index=vic_pop[0], columns="count")

print( "NYC Population 2019")
print(nyc_table)
print(" ")
print( "Victim Population")
print(vic_table)

NYC Population 2019
col_0       count
0                
asian     1247479
black     1899379
hispanic  2489089
other      164563
white     2737163
 
Victim Population
col_0      count
0               
asian      33968
black     113960
hispanic  148306
other       2632
white      70429
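Expanding each count into millions of repeated strings is not required to get these tables. A minimal sketch, assuming the same counts as printed above, that stores them directly in a `pd.Series` (the names `nyc_counts`, `vic_counts` and `expected_counts` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Counts copied from the two tables printed above; the Series names are hypothetical.
nyc_counts = pd.Series(
    {"asian": 1247479, "black": 1899379, "hispanic": 2489089,
     "other": 164563, "white": 2737163},
    name="count",
)
vic_counts = pd.Series(
    {"asian": 33968, "black": 113960, "hispanic": 148306,
     "other": 2632, "white": 70429},
    name="count",
)

# Population share per group, scaled to the size of the victim sample
expected_counts = nyc_counts / nyc_counts.sum() * vic_counts.sum()
print(expected_counts.round(1))
```

Because both Series share the same index, later arithmetic between them lines up by group label.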


Chi-squared tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the following formula:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. 
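To make the formula concrete, here is a small sketch on invented counts (the numbers and the `toy_` names are assumptions for illustration only). It computes the statistic by hand and compares it with `scipy.stats.chisquare`:

```python
import numpy as np
from scipy import stats

# Toy data, invented for illustration
toy_observed = np.array([18, 22, 60])          # observed counts per category
toy_shares = np.array([0.2, 0.3, 0.5])         # assumed population shares
toy_expected = toy_shares * toy_observed.sum() # expected counts: 20, 30, 50

# sum((observed - expected)^2 / expected)
toy_stat = ((toy_observed - toy_expected) ** 2 / toy_expected).sum()

# scipy computes the same statistic from the same two arrays
toy_result = stats.chisquare(f_obs=toy_observed, f_exp=toy_expected)
print(toy_stat, toy_result.statistic)
```

Term by term this is 4/20 + 64/30 + 100/50, which is about 4.33.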

**Let's calculate the chi-squared statistic for our data.**

In [5]:
# observed is the sample
observed = vic_table

# nyc_ratios has the population ratio for each race
nyc_ratios = nyc_table/len(nyc_pop)  

# expected is the product of the population ratios and the sample size
expected = nyc_ratios * len(vic_pop)   # Get expected counts

chi_squared_stat = (((observed-expected)**2)/expected).sum()

print(chi_squared_stat)

col_0
count    57318.734242
dtype: float64


From this, we can conclude that **the calculated chi-square statistic is 57318.734242.**

*Note: The chi-squared test assumes none of the expected counts are less than 5.*

Similar to the t-test where we compared the t-test statistic to a critical value based on the t-distribution to determine whether the result is significant, in the chi-square test we compare the chi-square test statistic to a critical value based on the chi-square distribution. 

The scipy library shorthand for the chi-square distribution is **chi2**. 

Let's use this knowledge to find the critical value for 95% confidence level and check the p-value of our result:

In [6]:
# crit is the critical value; the chi-square test only uses the upper tail
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 4)   # df = number of categories - 1 = 5 - 1


# p_value is the probability of a statistic at least as extreme as the one observed, assuming the null hypothesis holds
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=4)

print("Critical value:", crit)
print("P value:", p_value)

Critical value: 9.487729036781154
P value: [0.]
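A note on the printed p-value: for large statistics, `1 - cdf` rounds to zero because the CDF is indistinguishable from 1 in floating point. A hedged alternative is the survival function `stats.chi2.sf`, which computes the upper tail directly. The sketch below uses an arbitrary statistic of 100 with 4 degrees of freedom (five categories minus one); the variable names are illustrative:

```python
from scipy import stats

x = 100.0   # an arbitrary large statistic, chosen for illustration
dof = 4     # five categories - 1

p_naive = 1 - stats.chi2.cdf(x, df=dof)   # the tail is lost to rounding
p_tail = stats.chi2.sf(x, df=dof)         # upper-tail probability computed directly
print(p_naive, p_tail)
```

For a statistic as large as the one in this section (about 57318), even `sf` underflows to 0.0, so the practical conclusion does not change.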


**Since our chi-squared statistic exceeds the critical value, we reject the null hypothesis that the two distributions are the same.**

The same goodness-of-fit test can also be run automatically with the scipy function `scipy.stats.chisquare()`, as done in the Experimenting section below.

## Experimenting

Checking whether the statistic computed above holds up against scipy's built-in functions.

In [7]:
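# chi2_contingency expects a two-way table; with a single column of counts there are
# 0 degrees of freedom, so it returns a statistic of 0.0 and a p-value of 1.0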
stats.chi2_contingency(observed= observed)

(0.0, 1.0, 0, array([[ 33968.],
        [113960.],
        [148306.],
        [  2632.],
        [ 70429.]]))

In [8]:
# Run the chi-squared goodness-of-fit test with scipy's built-in function
stats.chisquare(f_obs= observed,   # Array of observed counts
                f_exp= expected)   # Array of expected counts

Power_divergenceResult(statistic=array([57318.7342417]), pvalue=array([0.]))

In [9]:
two_way_table2 = pd.crosstab(index=df["BOROUGH"], columns=df["CRIME_TYPE"])
two_way_table2

BOROUGH \ CRIME_TYPE,ARSON,ASSAULT,BURGLARY,DRIVING_UNDER_INFLUENCE,FRAUD,GAMBLING,HARRASSMENT,KIDNAPPING,LARCENY,MISC_PENAL_LAW,MURDER,OFFENSES_AGAINST_PUBLIC_ORDER,POSSESSION_CONTROLLED_SUBSSTANCE,POSSESSION_WEAPON,ROBBERY,SEX_CRIMES,SOCIAL_RELATED_CRIMES,THEFT,TRAFFIC_LAWS_VIOLATION,UNCLASSIFIED
BRONX,210,19904,2455,677,1762,49,16733,30,22686,2124,5,18089,4638,1797,3517,1483,67,572,1640,564
BROOKLYN,198,20421,4494,1388,3367,157,20289,38,36437,5401,2,21503,3149,2804,4016,2153,97,858,2044,554
MANHATTAN,117,15120,3379,688,3292,76,14799,22,45601,2089,2,15736,3807,1246,3089,1967,39,991,1200,390
QUEENS,119,15096,3206,969,2005,6,15148,32,25507,3555,1,14795,1286,1621,2476,1688,68,538,1691,340
STATEN ISLAND,31,2559,439,644,449,1,4030,4,3889,879,1,3908,487,323,227,244,16,93,165,121
UNKNOWN,0,2,0,0,1,0,1,0,2,0,292,0,0,0,0,0,0,1,0,0


In [10]:
two_way_table3 = pd.crosstab(index=df["CRIME_TYPE"], columns=df["VIC_AGE_GROUP"])
two_way_table3

CRIME_TYPE \ VIC_AGE_GROUP,18-24,25-44,45-64,65+,<18,UNKNOWN
ARSON,8,72,64,21,0,510
ASSAULT,12126,34716,14743,2039,6081,3397
BURGLARY,626,3986,2459,730,41,6131
DRIVING_UNDER_INFLUENCE,38,151,102,18,4,4053
FRAUD,364,1972,1566,668,36,6270
GAMBLING,0,1,0,0,0,288
HARRASSMENT,7595,32350,19940,3945,3856,3314
KIDNAPPING,19,49,8,1,37,12
LARCENY,10887,43765,23315,6945,2012,47198
MISC_PENAL_LAW,1195,5517,2525,620,374,3817


In [11]:
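# Chi-square test of independence: crime type vs. victim age group
result_chi2 = stats.chi2_contingency(observed = two_way_table3)

# Same test for borough vs. crime type (its results are not printed below).
# Distinct names keep the earlier `expected` table from being overwritten.
chi2_boro, p_boro, dof_boro, expected_boro = stats.chi2_contingency(observed = two_way_table2)

print('chi-square statistic :', result_chi2[0])
print('p-value :', result_chi2[1])
print('degrees of freedom :', result_chi2[2])
print('expected counts : \n', result_chi2[3])

# Standardized residuals for borough vs. crime type
table2 = sm.stats.Table(two_way_table2)
table2.standardized_resids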

chi-square statistic : 147466.3500345064
p-value : 0.0
degrees of freedom : 95
expected counts : 
 [[6.50426407e+01 2.39775887e+02 1.25824042e+02 2.85684113e+01
  2.86836608e+01 1.87105358e+02]
 [7.04406980e+03 2.59675510e+04 1.36266506e+04 3.09393778e+03
  3.10641922e+03 2.02633716e+04]
 [1.34643084e+03 4.96353847e+03 2.60465088e+03 5.91387276e+02
  5.93773027e+02 3.87321950e+03]
 [4.20705436e+02 1.55090596e+03 8.13848547e+02 1.84784717e+02
  1.85530168e+02 1.21022517e+03]
 [1.04800557e+03 3.86341118e+03 2.02735153e+03 4.60311173e+02
  4.62168141e+02 3.01475240e+03]
 [2.78478861e+01 1.02659602e+02 5.38713308e+01 1.22315124e+01
  1.22808563e+01 8.01088124e+01]
 [6.84152220e+03 2.52208711e+04 1.32348252e+04 3.00497364e+03
  3.01709618e+03 1.96807117e+04]
 [1.21412929e+01 4.47581656e+01 2.34871546e+01 5.33277011e+00
  5.35428336e+00 3.49263334e+01]
 [1.29239245e+04 4.76432911e+04 2.50011440e+04 5.67652217e+03
  5.69942216e+03 3.71776960e+04]
 [1.35365780e+03 4.99018024e+03 2.61863133e+03

BOROUGH \ CRIME_TYPE,ARSON,ASSAULT,BURGLARY,DRIVING_UNDER_INFLUENCE,FRAUD,GAMBLING,HARRASSMENT,KIDNAPPING,LARCENY,MISC_PENAL_LAW,MURDER,OFFENSES_AGAINST_PUBLIC_ORDER,POSSESSION_CONTROLLED_SUBSSTANCE,POSSESSION_WEAPON,ROBBERY,SEX_CRIMES,SOCIAL_RELATED_CRIMES,THEFT,TRAFFIC_LAWS_VIOLATION,UNCLASSIFIED
BRONX,5.74062,37.643913,-12.71686,-10.343861,-14.669059,-2.066914,11.327549,0.479119,-53.177478,-19.879215,-8.540636,17.843973,36.135457,2.390381,12.571202,-4.805552,0.553986,-4.308552,4.752629,7.182319
BROOKLYN,0.359715,-4.905412,9.225532,4.553138,5.302933,9.612857,-0.706256,0.336603,-14.673782,25.982572,-10.789617,2.366452,-13.309617,14.374879,3.760807,-0.221943,1.894901,-0.714136,2.995874,-0.547852
MANHATTAN,-4.720686,-30.729854,-2.818182,-14.443925,12.323277,0.412315,-29.129605,-2.023267,88.549783,-28.65194,-9.842248,-27.039126,8.864692,-18.885809,-5.450005,1.820176,-4.545452,9.270233,-14.094553,-5.530206
QUEENS,-1.542923,4.887068,8.871638,3.65746,-4.101743,-7.625076,9.77234,1.492894,-10.609951,16.007555,-8.559657,-0.031084,-30.430409,1.816411,-4.125595,5.279258,1.553658,-3.281323,10.545501,-3.031297
STATEN ISLAND,0.635651,-8.988983,-5.827085,35.626662,0.127745,-3.224823,22.997306,-0.534283,-26.532049,13.065251,-3.312714,17.61859,-2.728427,0.184887,-14.180704,-3.822813,1.248955,-2.957129,-6.906776,4.571901
UNKNOWN,0.061608,-7.386175,-2.964049,-1.43824,-2.378058,0.685396,-7.407031,1.414359,-11.134517,-2.973131,643.975461,-7.677643,-2.889745,-2.100744,-2.884533,-2.058018,0.690849,-0.747264,-1.919829,-0.725304


## Question 2: Victim Race Population Independence

### Chi-square Independence Test
Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another.

**We will use the chi-square independence test to test whether:**

H0: Victim race and the type of crime reported in NYC are independent (not related).

Ha: Victim race and the type of crime reported in NYC are related.

In [12]:
# Compute each race's share of all crime reports

# Remove rows with race "UNKNOWN"
new_df = df[df.VIC_RACE != "UNKNOWN"]


# Get frequency of each race
count_race = new_df.VIC_RACE.value_counts()
  
# Use the indexes to iterate over
races = count_race.index

# list of objects used to construct the df
d = []

for race in races:
    count = count_race[race]
    ratio = count_race[race] / len(new_df.VIC_RACE)
    d.append(
        {
            'RACE': race,
            'COUNT': count,
            'RATIO': np.round(ratio, 5),
        }
    )
            
# Create the df using the list
df_race = pd.DataFrame(d)
print(df_race.RATIO.sum())
df_race.RATIO

1.0


0    0.35905
1    0.24537
2    0.22190
3    0.10702
4    0.05837
5    0.00829
Name: RATIO, dtype: float64
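The loop above can also be expressed with pandas directly. A minimal sketch, assuming a toy column in place of `new_df.VIC_RACE` (the data and the `toy_` names are invented), that uses `value_counts(normalize=True)` to get the ratios:

```python
import pandas as pd

# Toy column standing in for new_df.VIC_RACE (values invented)
toy_race = pd.Series(["BLACK", "WHITE", "BLACK", "ASIAN", "BLACK", "WHITE"])

# Counts and shares share the same index, so they align by label
toy_summary = pd.DataFrame({
    "COUNT": toy_race.value_counts(),
    "RATIO": toy_race.value_counts(normalize=True),
}).rename_axis("RACE").reset_index()
print(toy_summary)
```

This skips the explicit rounding step, so the ratios sum to 1 up to floating-point error.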

In [13]:
# Compute each crime type's share of all crime reports
count_offense = new_df.CRIME_TYPE.value_counts()

all_offenses = count_offense.index

# list of objects used to construct the df
d = []

for offense in all_offenses:
    count = count_offense[offense]
    ratio = count_offense[offense] / len(new_df.CRIME_TYPE)
    # Append object to list
    d.append(
        {
            'OFFENSE': offense,
            'COUNT': count,
            'RATIO': np.round(ratio, 9),
        }
    )
    
# Create the df using the list
df_offense = pd.DataFrame(d)
df_offense
print(df_offense.RATIO.sum())

1.0


In [14]:
np.random.seed(12)

# Sample data randomly at fixed probabilities
# Create a random sample distribution of victim race
victim_race = np.random.choice(a= df_race.RACE,
                              p = df_race.RATIO,
                              size=10000)

# Sample data randomly at fixed probabilities
# Create a random sample distribution of offense type
offense = np.random.choice(a=df_offense.OFFENSE,
                              p = df_offense.RATIO,
                              size=10000)

victims = pd.DataFrame({"victim_race":victim_race, 
                       "offense":offense})

# Create the crosstab of random dist
victim_tab = pd.crosstab(victims.offense, victims.victim_race, margins = True)

# Keep only the first 5 offenses and the first 3 races (a 5x3 sub-table, no totals) for later use
observed = victim_tab.iloc[0:5,0:3]  

victim_tab

offense \ victim_race,AMERICAN INDIAN/ALASKAN NATIVE,ASIAN / PACIFIC ISLANDER,BLACK,BLACK HISPANIC,WHITE,WHITE HISPANIC,All
ARSON,0,0,1,0,0,0,1
ASSAULT,19,220,802,129,479,515,2164
BURGLARY,1,21,82,15,62,49,230
DRIVING_UNDER_INFLUENCE,0,0,2,0,1,2,5
FRAUD,0,18,49,4,28,37,136
HARRASSMENT,22,212,764,116,417,478,2009
KIDNAPPING,0,0,1,0,2,0,3
LARCENY,22,285,940,153,568,658,2626
MISC_PENAL_LAW,4,29,102,22,81,87,325
MURDER,0,2,5,0,1,1,9
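One caveat when sampling with fixed probabilities: the values passed as `p` must sum to 1 (NumPy checks this within a small tolerance), and ratios that were rounded beforehand can fall short. A minimal sketch with made-up categories, using the newer `np.random.default_rng` generator instead of the global seed; all names and values here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(12)   # seed chosen to mirror the cell above

labels = np.array(["A", "B", "C"])               # hypothetical categories
ratios = np.round(np.array([1, 1, 1]) / 3, 5)    # rounding leaves a sum of 0.99999
probs = ratios / ratios.sum()                    # renormalize so the sum is 1

# Draw 10,000 labels at the renormalized probabilities
draws = rng.choice(labels, size=10_000, p=probs)
print(probs.sum(), np.unique(draws, return_counts=True))
```

Dividing by the sum is a small change, but it keeps the sampling step from depending on how many decimals were kept earlier.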


**For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test.** 


The main difference is we have to calculate the expected counts of each cell in a 2-dimensional table instead of a 1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells in the table by taking the row totals and column totals of the table, performing an outer product on them with the np.outer() function and dividing by the number of observations:

In [15]:
# Calculate the expected distribution using the crosstab of randomly sampled data
expected =  np.outer(victim_tab["All"],
                     victim_tab.loc["All"]) / 10000

expected = pd.DataFrame(expected)
expected.columns = df_race.RACE.tolist() + ["All"]
# Drop GAMBLING (0 occurrences in the sample) and use the remaining OFFENSE values, plus "All", as row labels
expected.index = df_offense[df_offense.OFFENSE != "GAMBLING"].OFFENSE.tolist() + ["All"]
expected

Unnamed: 0,BLACK,WHITE HISPANIC,WHITE,ASIAN / PACIFIC ISLANDER,BLACK HISPANIC,AMERICAN INDIAN/ALASKAN NATIVE,All
LARCENY,0.0087,0.1058,0.366,0.0576,0.2173,0.2446,1.0
ASSAULT,18.8268,228.9512,792.024,124.6464,470.2372,529.3144,2164.0
HARRASSMENT,2.001,24.334,84.18,13.248,49.979,56.258,230.0
OFFENSES_AGAINST_PUBLIC_ORDER,0.0435,0.529,1.83,0.288,1.0865,1.223,5.0
ROBBERY,1.1832,14.3888,49.776,7.8336,29.5528,33.2656,136.0
MISC_PENAL_LAW,17.4783,212.5522,735.294,115.7184,436.5557,491.4014,2009.0
BURGLARY,0.0261,0.3174,1.098,0.1728,0.6519,0.7338,3.0
SEX_CRIMES,22.8462,277.8308,961.116,151.2576,570.6298,642.3196,2626.0
TRAFFIC_LAWS_VIOLATION,2.8275,34.385,118.95,18.72,70.6225,79.495,325.0
FRAUD,0.0783,0.9522,3.294,0.5184,1.9557,2.2014,9.0
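The row-total times column-total rule is easy to verify on a small table. The sketch below uses invented 2x3 counts (the `toy` names are hypothetical) and compares the outer-product result with the expected table that `stats.chi2_contingency` returns as its fourth value:

```python
import numpy as np
from scipy import stats

toy = np.array([[10, 20, 30],
                [20, 20, 20]])   # invented 2x3 table of counts

row_totals = toy.sum(axis=1)     # 60, 60
col_totals = toy.sum(axis=0)     # 30, 40, 50

# expected[i, j] = row_total[i] * col_total[j] / n
toy_expected = np.outer(row_totals, col_totals) / toy.sum()

# scipy's expected counts for the same table
scipy_expected = stats.chi2_contingency(toy)[3]
print(toy_expected)
```

Both rows come out as 15, 20, 25 because the two row totals are equal.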


Now we can follow the same steps we took before to calculate the chi-square statistic, the critical value and the p-value:

*Note: We call .sum() twice: once to get the column sums and a second time to add the column sums together, returning the sum of the entire 2D table.*

In [16]:
# Compute the chi-squared stat
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(chi_squared_stat)

324397.7782883418


**The computed chi-squared statistic is 324397.7782883418**

We will now compute the critical value and p-value:

In [17]:
# Compute a critical value 
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 8)   # df = (rows - 1) * (columns - 1) = (5 - 1) * (3 - 1)


# Compute a p-value
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=8)

print("Critical value:", np.round(crit, 3))
print("P value:", np.round(p_value, 3))

Critical value: 15.507
P value: 0.0


From this, we can see that **the critical value is 15.507** and **the p-value is 0.0**

The chi-squared statistic is **324397.7782883418**, which is far higher than our critical value of **15.507**.

(Source: https://www.stat.purdue.edu/~lfindsen/stat503/Chi-Square.pdf)


Since the p-value is 0 and the chi-squared statistic is far above the critical value, this manual calculation rejects the null hypothesis.

As with the goodness-of-fit test, we can use scipy to conduct a test of independence quickly. Use the `stats.chi2_contingency()` function to conduct a test of independence automatically given a frequency table of observed counts:

In [18]:
stats.chi2_contingency(observed = observed)

(3.7184736750924667,
 0.8815887828944307,
 8,
 array([[1.64609053e-02, 2.13168724e-01, 7.70370370e-01],
        [1.71358025e+01, 2.21908642e+02, 8.01955556e+02],
        [1.71193416e+00, 2.21695473e+01, 8.01185185e+01],
        [3.29218107e-02, 4.26337449e-01, 1.54074074e+00],
        [1.10288066e+00, 1.42823045e+01, 5.16148148e+01]]))

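The output shows the chi-square statistic, the p-value and the degrees of freedom followed by the expected counts.
As expected for two variables that were sampled independently of each other, the high p-value (0.88) means the test **does not detect a significant relationship between the variables**.

This result disagrees with the manual calculation above, and the scipy result is the one to trust here. The manual `expected` table was built from the margins of the full `victim_tab` but labeled with the count-ordered lists from `df_race` and `df_offense` (its first row is labeled LARCENY with a total of 1.0, while `victim_tab` gives LARCENY a total of 2626), so `observed - expected` pairs cells with the wrong expected counts.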
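To cross-check a manual calculation against scipy, the observed and expected tables need the same shape and the same labels. A minimal sketch on invented data (every name below is hypothetical) that strips the "All" margins, reuses the observed labels for the expected table, and compares the result with `stats.chi2_contingency`:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented data: two categorical columns
toy_df = pd.DataFrame({
    "group":   ["a"] * 50 + ["b"] * 50,
    "outcome": ["x"] * 30 + ["y"] * 20 + ["x"] * 20 + ["y"] * 30,
})
tab = pd.crosstab(toy_df.outcome, toy_df.group, margins=True)

obs = tab.iloc[:-1, :-1]   # drop the "All" row and column

# Expected counts from the margins, labeled exactly like obs
exp = np.outer(tab["All"].iloc[:-1], tab.loc["All"].iloc[:-1]) / tab.loc["All", "All"]
exp = pd.DataFrame(exp, index=obs.index, columns=obs.columns)

manual_stat = ((obs - exp) ** 2 / exp).to_numpy().sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
p_manual = stats.chi2.sf(manual_stat, df=dof)

# correction=False turns off Yates' continuity correction, which scipy applies to 2x2 tables by default
scipy_stat, scipy_p, scipy_dof, _ = stats.chi2_contingency(obs, correction=False)
print(manual_stat, scipy_stat)
```

Taking the index and columns from `obs` is the design choice that matters: pandas aligns `obs - exp` by label, so mismatched labels would silently compare the wrong cells.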