# Lesson 3: Statistics and Significance

## Ideas

1. What is statistics?
2. How is statistics related to probability?

## Terminology Recap

See PA 2.1.1, PA 2.2.2, and PA 2.4.2:
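- **sample:** the subset of units we actually measure (each single measurement is an observation)
- **sample size ($n$):** number of observations in the sample
- **population:** the entire group that we're sampling from
- **parameter ($\pi$):** a numerical summary of the entire population, such as the true long-run proportion
- **variable ($x$):** the characteristic measured on each unit in the sample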


See PA 2.1.5 and PA 2.3.3:
- **model:** a mathematical description of a process, informed by data
- **chance model:** a probabilistic model of the process under stated assumptions (for example, a coin flip)
- **statistical inference:** conclusions drawn about the population from our data and statistical model
- **statistical significance:** how unlikely the observed result would be if only random chance were at work


- **measures of center (central tendency):** the value the data tend to "gravitate" toward
- **mean:** arithmetic mean of the dataset
- **median:** exact middle of the sorted dataset
- **mode:** most frequently occurring value in the dataset
- **outlier:** an observation that deviates markedly from the rest of the data
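
To make the measures of center concrete, here is a small sketch using Python's standard `statistics` module. The numbers are invented for illustration and do not come from any dataset in this lesson. It shows how a single outlier pulls the mean while barely moving the median or mode.

```python
import statistics

# Hypothetical measurements; the values are invented for illustration.
values = [2, 3, 3, 4, 5]
values_with_outlier = values + [40]

for label, data in [("original", values), ("with outlier", values_with_outlier)]:
    print(
        label,
        "| mean:", statistics.mean(data),
        "| median:", statistics.median(data),
        "| mode:", statistics.mode(data),
    )
```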

## Sampling (PA 2.2.5)

- **simple random sampling:** draw units at random from the population, each equally likely
- **systematic sampling:** take every $k$-th unit from the population
- **stratified sampling:** split the population into groups (strata) and take a random subset of $k$ units from each group
- **cluster sampling:** randomly choose one or more whole groups (clusters) and sample the units inside them
- **convenience sampling:** take whatever samples are easiest to get
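
The sampling schemes above can be sketched with pandas and NumPy. This is only an illustration on a made-up population of 100 units in four groups. The column names `unit` and `group`, the sample sizes, and the seeds are all assumptions of the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical population: 100 units split evenly across 4 groups.
groups = ["A", "B", "C", "D"]
population = pd.DataFrame({
    "unit": np.arange(100),
    "group": np.repeat(groups, 25),
})

# Simple random sampling: 10 units chosen uniformly at random.
simple = population.sample(n=10, random_state=0)

# Systematic sampling: every k-th unit, starting from a random offset.
k = 10
start = int(rng.integers(k))
systematic = population.iloc[start::k]

# Stratified sampling: a random subset of 3 units from every group.
stratified = population.groupby("group").sample(n=3, random_state=0)

# Cluster sampling: pick one whole group at random and keep all of its units.
chosen_group = str(rng.choice(groups))
cluster = population[population["group"] == chosen_group]

# Convenience sampling: whatever is easiest to reach, here the first 10 rows.
convenience = population.head(10)

for name, sample in [("simple", simple), ("systematic", systematic),
                     ("stratified", stratified), ("cluster", cluster),
                     ("convenience", convenience)]:
    print(name, len(sample))
```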

## Statistical Testing (PA 2.4.6, 2.6.2, 2.6.3)

- **null hypothesis ($H_0$):** process is random chance
- **alternative hypothesis ($H_A$):** process is not random chance
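- **standardized statistic ($z$):** $z = \frac{\bar{x} - \pi_0}{SD_0}$, where $\bar{x}$ is the sample proportion (the mean of the 0/1 variable $x$), $\pi_0$ is the value of $\pi$ under $H_0$, and $SD_0 = \sqrt{\pi_0(1 - \pi_0)/n}$ is the standard deviation of the null distribution
- **p-value:** probability of obtaining a statistic at least as extreme as the one observed, assuming $H_0$ is true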
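
As a worked example of these definitions, the sketch below computes a standardized statistic and a two-sided p-value by hand, using only the standard library. The counts (60 successes in 100 trials) are hypothetical, and the normal tail probability is obtained from `math.erfc` rather than a statistics package.

```python
import math

def normal_upper_tail(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical data: 60 successes in 100 trials, tested against pi_0 = 0.5.
successes, n, pi_0 = 60, 100, 0.5

sample_prop = successes / n
sd_null = math.sqrt(pi_0 * (1 - pi_0) / n)   # SD of the null distribution
z = (sample_prop - pi_0) / sd_null

# Two-sided p-value: probability of a statistic at least this far from 0
# in either direction, assuming H_0 is true.
p_two_sided = 2 * normal_upper_tail(abs(z))

print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
```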

![p-value closeness to random, z-score is distance from random](https://desktop.arcgis.com/en/arcmap/10.3/tools/spatial-statistics-toolbox/GUID-CBF63B74-D1B2-44FC-A316-7AC2B1C1D464-web.png)

- How does sample size affect these (PA 2.6.7)?
- Are $H_A : \pi > 0.5$ and $H_A : \pi \ne 0.5$ the same?
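
One way to explore both questions is to hold a hypothetical observed proportion fixed (0.55 here, an invented value) and vary the sample size, printing the p-value under each alternative. The helper name `z_and_p_values` is made up for this sketch.

```python
import math

def z_and_p_values(sample_prop, n, pi_0=0.5):
    """Return (z, p-value for H_A: pi > pi_0, p-value for H_A: pi != pi_0)."""
    z = (sample_prop - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided
    both_tails = math.erfc(abs(z) / math.sqrt(2))    # two-sided
    return z, upper_tail, both_tails

# Same hypothetical observed proportion at increasing sample sizes.
for n in [25, 100, 400, 1600]:
    z, p_greater, p_not_equal = z_and_p_values(0.55, n)
    print(f"n={n:5d}  z={z:5.2f}  "
          f"p(pi > 0.5)={p_greater:.4f}  p(pi != 0.5)={p_not_equal:.4f}")
```

With the observed proportion held fixed, $z$ grows like $\sqrt{n}$, and for a positive $z$ the two-sided p-value is twice the one-sided one.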

## Let's do some statistical inference!

So what are we really trying to do with statistics?

Is there a difference between **binary** (categorical) data and numerical data?

Let's analyze the dataset in `bac.csv`!  It has two columns of interest: `BAC`, the subject's blood alcohol content (continuous numerical data), and `PASS`, whether the subject passed the field sobriety test (binary data).  We also need a third, binary column recording whether the subject was truly over the legal limit, defined as BAC >= 0.08.  All subjects tested were pulled over by a police officer on suspicion of driving under the influence (DUI).
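
A minimal sketch of building that third column is shown here on a few stand-in rows. The rows are invented (the real data comes from `bac.csv`), and the column name `OVER_LIMIT` is a hypothetical choice.

```python
import pandas as pd

# Stand-in rows for illustration only; the real data comes from bac.csv.
df_example = pd.DataFrame({
    "BAC": [0.000344, 0.05, 0.08, 0.12],
    "PASS": [1, 0, 0, 0],
})

LEGAL_LIMIT = 0.08  # threshold stated in the text

# 1 if the subject is at or over the legal limit, 0 otherwise.
df_example["OVER_LIMIT"] = (df_example["BAC"] >= LEGAL_LIMIT).astype(int)
print(df_example)
```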

### 1. What research questions can we ask?  What type of data do we have here?

**Q1:** Is a police officer able to use driver behavior to predict DUI better than random chance?

Here we treat each stop as the officer's prediction that the driver is DUI, so $\pi$ is the long-run proportion of stops in which the driver really is over the limit. The hypotheses are:

$H_0(\text{pulled\_over\_correctly}): \pi = 0.5$

$H_A(\text{pulled\_over\_correctly}): \pi \gt 0.5$.



**Q2:** An alternative question could be to ask whether a sobriety test is better than random chance at predicting whether someone is biologically drunk, based on suspicions by police officers on drunk driving.

This means that the null hypothesis is that the sobriety test is correct no more often than random chance (50%), and the alternative hypothesis is that it does better.  The following hypotheses are formed as a result of this question.

$H_0(\text{sobriety\_test\_correctness}): \pi = 0.5$

$H_A(\text{sobriety\_test\_correctness}): \pi \gt 0.5$


### 2. What preliminary metrics (statistics) should we compute?
Mean, standard deviation, histogram.

In [11]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.stats.proportion as pp  # importing the proportions module from statsmodels

In [2]:
df_bac = pd.read_csv("../datasets_as/bac.csv")
print(df_bac.head())

   Subject       BAC  PASS   forpass      pred  pass.ols  PASS.OLS
0       54  0.000344     1  0.518128  0.971148         0         0
1       94  0.000523     1  0.464620  0.970857         0         0
2       72  0.001530     1  0.797098  0.969166         0         1
3        1  0.001819     1  0.327069  0.968663         0         0
4       49  0.001850     1  0.546285  0.968608         0         0


Some of the column names are not very descriptive, and we do not need most of the columns.  Let's first keep only the ones we need.

In [3]:
df_bac_trim = df_bac[['Subject', 'BAC', 'PASS']].copy()
df_bac_trim.head()

   Subject       BAC  PASS
0       54  0.000344     1
1       94  0.000523     1
2       72  0.001530     1
3        1  0.001819     1
4       49  0.001850     1


Then, let's give the `PASS` column a more descriptive name.

In [6]:
df_bac_trim.rename(columns={"PASS": "FST Pass"}, inplace=True)

In [7]:
df_bac_trim.head()

   Subject       BAC  FST Pass
0       54  0.000344         1
1       94  0.000523         1
2       72  0.001530         1
3        1  0.001819         1
4       49  0.001850         1


In [9]:
df_bac_trim.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Subject   100 non-null    int64  
 1   BAC       100 non-null    float64
 2   FST Pass  100 non-null    int64  
dtypes: float64(1), int64(2)
memory usage: 3.1 KB
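
The preliminary metrics listed for this step (mean, standard deviation, histogram) could be computed along the following lines. To keep the sketch self-contained it uses a synthetic stand-in frame, `df_demo`, with invented values. In the notebook you would use `df_bac_trim` instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic stand-in with the same columns as df_bac_trim; values are invented.
df_demo = pd.DataFrame({
    "BAC": rng.uniform(0.0, 0.2, size=100),
    "FST Pass": rng.integers(0, 2, size=100),
})

# Mean and standard deviation of each column.
print(df_demo[["BAC", "FST Pass"]].agg(["mean", "std"]))

# Side-by-side histograms: continuous BAC and the binary FST result.
fig, (ax_bac, ax_fst) = plt.subplots(1, 2, figsize=(8, 3))
ax_bac.hist(df_demo["BAC"], bins=20)
ax_bac.axvline(0.08, linestyle="--")  # legal limit
ax_bac.set_xlabel("BAC")
ax_fst.hist(df_demo["FST Pass"], bins=[-0.5, 0.5, 1.5])
ax_fst.set_xticks([0, 1])
ax_fst.set_xlabel("FST Pass")
fig.tight_layout()
```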


Looking at histograms of the two columns, BAC by itself does not split the subjects into clearly distinct groups.  The picture changes sharply, however, once BAC is compared against the legal limit of 0.08 to decide whether someone is truly legal to drive.  A higher proportion of subjects pass the sobriety test than fail it.  Let's compute the counts so we can see more quantitatively the difference between failing a sobriety test and truly driving under the influence, and whether the original suspicions were correct:

In [8]:
df_bac_trim[df_bac_trim['BAC'] >= 0.08].count()

Subject     18
BAC         18
FST Pass    18
dtype: int64

In [10]:
df_bac_trim[df_bac_trim['FST Pass'] == 0].count()

Subject     39
BAC         39
FST Pass    39
dtype: int64

In [15]:
df_bac_trim['Subject'].count()

100

Indeed, most of the drivers pulled over by the police officers were not drunk: only 18 of the 100 subjects had a BAC at or above 0.08, while 39 failed the field sobriety test.
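
The same proportions can be read off directly by taking the mean of a boolean Series. The sketch below uses a stand-in frame, `df_counts`, that reproduces only the counts reported above. Its BAC values are placeholders, not the real measurements.

```python
import pandas as pd

# Stand-in frame that reproduces only the reported counts
# (18 over the limit, 39 failed tests, 100 subjects); BAC values are placeholders.
df_counts = pd.DataFrame({
    "BAC": [0.10] * 18 + [0.02] * 82,
    "FST Pass": [0] * 39 + [1] * 61,
})

# The mean of a boolean Series is the proportion of True values.
prop_over_limit = (df_counts["BAC"] >= 0.08).mean()
prop_failed_fst = (df_counts["FST Pass"] == 0).mean()

print(f"over the legal limit: {prop_over_limit:.2f}")
print(f"failed the FST:       {prop_failed_fst:.2f}")
```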

Now, let's compute the z-score for the proportion of stops in which the officers' intuition was correct, that is, the driver really was over the legal limit.

Let's check out how to use statsmodels' `stats` package: https://www.statsmodels.org/stable/stats.html

In [34]:
pp.proportions_ztest(count=df_bac_trim[df_bac_trim['BAC'] >= 0.08]['Subject'].count(),
                     nobs=df_bac_trim['Subject'].count(),
                     value=1.0  # null hypothesis assumes officer is always correct
                    )

(-21.343747458109497, 4.457478379159264e-101)

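This p-value is far smaller than 0.05, so the observed 18% correctness is statistically significantly different from the null value passed to the test.  Note that the call above uses `value=1.0` (the officer is always correct) rather than the coin-flip value of 0.5 from Q1, and that `proportions_ztest` runs a two-sided test by default.  So statistically speaking, the officers in this study are definitely not 100% correct: 82 of the 100 stopped drivers were under the legal limit, which is a high false positive rate.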
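
For comparison, here is a hedged sketch of testing the hypotheses exactly as stated in Q1, $H_0: \pi = 0.5$ against $H_A: \pi > 0.5$, using the counts reported above. Choosing `alternative="larger"` and `prop_var=0.5` (so the standard error uses the null proportion) is an assumption of this sketch, not something the notebook does.

```python
import statsmodels.stats.proportion as pp

# Counts reported above: 18 of the 100 stopped drivers had BAC >= 0.08.
z_stat, p_value = pp.proportions_ztest(
    count=18,
    nobs=100,
    value=0.5,             # H_0: pi = 0.5
    alternative="larger",  # H_A: pi > 0.5
    prop_var=0.5,          # use the null proportion in the standard error
)
print(z_stat, p_value)
```

Because the observed proportion (0.18) lies below 0.5, the z-score is negative and the one-sided p-value is close to 1: the data give no evidence that the officers do better than a coin flip.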

Let's now see whether the sobriety test is correct by filtering on multiple conditions.  Make sure to wrap each condition in parentheses: the ampersand (bitwise AND) binds more tightly than comparison operators such as `>=` and `==`, so without the parentheses you'll get confusing errors:

In [25]:
df_bac_trim[(df_bac_trim['BAC'] >= 0.08) & (df_bac_trim['FST Pass'] == 0)].count()

Subject     18
BAC         18
FST Pass    18
dtype: int64
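
To see why the parentheses matter, the sketch below runs the filter with and without them on a tiny invented frame (`df_small` is a stand-in, not the real data). The exact exception type may vary between pandas versions, so the sketch catches both `TypeError` and `ValueError`.

```python
import pandas as pd

# Tiny stand-in frame; values are invented for illustration.
df_small = pd.DataFrame({"BAC": [0.02, 0.09, 0.12], "FST Pass": [1, 0, 0]})

# Correct: each comparison is parenthesized before combining with &.
both = df_small[(df_small["BAC"] >= 0.08) & (df_small["FST Pass"] == 0)]
print(len(both))

# Without parentheses, & binds tighter than >= and ==, so Python evaluates
# `0.08 & df_small["FST Pass"]` first and the expression fails.
try:
    df_small[df_small["BAC"] >= 0.08 & df_small["FST Pass"] == 0]
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)
```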

In [26]:
df_bac_trim[df_bac_trim['FST Pass'] == 0].count()

Subject     39
BAC         39
FST Pass    39
dtype: int64

In [30]:
qty_subj_failed_and_drunk = (
    df_bac_trim[(df_bac_trim['FST Pass'] == 0) &
                (df_bac_trim['BAC'] >= 0.08)].count()['Subject'])

qty_subj_failed = df_bac_trim[df_bac_trim['FST Pass'] == 0].count()['Subject']

ratio_sobriety_correct = qty_subj_failed_and_drunk / qty_subj_failed

print(ratio_sobriety_correct)

0.46153846153846156


So we see that there are 39 total failures of the sobriety test, but only 18 of the subjects who failed actually have a blood alcohol content at or above the legal limit.  So, how bad is it that only 46% of the failed tests were correct?
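
The same counts can be laid out as a two-way table with `pd.crosstab`. The frame below is a stand-in that reproduces only the counts reported above (its BAC values are placeholders), and the labels `Over limit` and `Failed FST` are hypothetical names.

```python
import pandas as pd

# Stand-in frame reproducing only the reported counts; BAC values are placeholders.
df_counts = pd.DataFrame({
    "BAC": [0.10] * 18 + [0.02] * 82,
    "FST Pass": [0] * 39 + [1] * 61,
})

over_limit = (df_counts["BAC"] >= 0.08).rename("Over limit")
failed_fst = (df_counts["FST Pass"] == 0).rename("Failed FST")

# Rows: failed the test or not; columns: over the legal limit or not.
table = pd.crosstab(failed_fst, over_limit)
print(table)
```

Reading the table, all 18 over-limit subjects failed the test, while 21 of the 39 failures came from subjects under the limit.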

In [38]:
pp.proportions_ztest(count=qty_subj_failed_and_drunk,
                     nobs=qty_subj_failed,
                     value=0.5  # null hypothesis assumes sobriety test is just a coin flip
                    )

(-0.4818120558297154, 0.6299394643484414)

In this case, the 46% is not statistically significant (p-value of about 0.63), meaning that we fail to reject the null hypothesis: among subjects who fail it, the sobriety test is really no better than a coin flip at indicating who is truly over the limit.

### 3. What statistical inferences can we draw?
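The proportion of drivers pulled over correctly on the officers' intuition, 18%, was statistically significantly different from the tested null value of 100% correct.  Among subjects who failed the sobriety test, the 46% who were truly over the limit was not significantly different from a coin flip.  The sobriety test's overall correctness, 79% (18 subjects who failed and were over the limit plus 61 who passed and were under it, out of 100), is well above random chance, although that test is not run in the cells above.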
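
The figures in this section can be recomputed from the counts reported in the earlier cells. The sketch below does so with plain arithmetic. The variable names are made up, and the z-score uses the null proportion in the standard error, which is an assumption and differs slightly from the `proportions_ztest` default.

```python
import math

# Counts taken from the cells above.
failed_and_over = 18   # failed the FST and BAC >= 0.08
failed_total = 39      # failed the FST
over_total = 18        # BAC >= 0.08
n_subjects = 100

# Subjects who passed the FST and were under the limit.
passed_and_under = (n_subjects - failed_total) - (over_total - failed_and_over)

precision = failed_and_over / failed_total      # failed tests that were truly over the limit
sensitivity = failed_and_over / over_total      # over-limit subjects who failed the test
accuracy = (failed_and_over + passed_and_under) / n_subjects

# z-score of the overall accuracy against a coin flip (pi_0 = 0.5).
pi_0 = 0.5
z_accuracy = (accuracy - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n_subjects)

print(f"precision={precision:.2f} sensitivity={sensitivity:.2f} "
      f"accuracy={accuracy:.2f} z={z_accuracy:.2f}")
```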

### 4. Can we make any statistically-significant conclusions?
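We can conclude that the police officers in this dataset do not necessarily have great intuition when it comes to detecting drunk drivers.  Sobriety tests do seem effective at detecting high BACs, since all 18 subjects over the limit failed the test, but a failed test on its own is weak evidence of DUI: 21 of the 39 subjects who failed were under the limit.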

### 5. How could we improve this study?
Please try to think of some ideas :)