# Statistical Thinking in Python Functions for Reuse:

In [33]:
import numpy as np
import matplotlib.pyplot as plt

## Statistical inference packages:
scipy.stats \
statsmodels \
scikit-learn (`sklearn`) \
numpy (for hacker statistics)

## PMF vs PDF vs CDF vs ECDF:
__PMF__ (probability mass function): *Discrete* outcomes (discrete random variables); e.g. Binomial, Poisson \
__PDF__ (probability density function): *Continuous* outcomes (continuous random variables); e.g. normal/Gaussian distribution (bell curve); for a histogram: `density=True` (`normed=True` in older Matplotlib versions) \
__CDF__: *Theoretical* (hypothetical) cumulative distribution; e.g. exponential or normal (S-shaped, sigmoid-like curve) \
__ECDF__: *Observed* cumulative distribution computed from data; compared against a theoretical CDF such as exponential or normal

## ECDF
__E__ mpirical __C__ umulative __D__ istribution __F__ unction

In [29]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    
    #Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    
    return x, y

`#Compute ECDF for versicolor data: x_vers, y_vers` \
`x_vers, y_vers = ecdf(versicolor_petal_length)` \
`#Generate plot` \
`_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')` \
`#Label the axes` \
`_ = plt.xlabel('versicolor_petal_length')` \
`_ = plt.ylabel('ECDF')` \
`#Display the plot` \
`plt.show()`

### Binomial Distribution:
The number *r* of successes in *n* Bernoulli (success/fail) trials, with probability *p* of success, is Binomially distributed. \
__`np.random.binomial()`__

### Bernoulli Trials

In [30]:
def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return number of successes."""
    # Initialize number of successes: n_success
    n_success = 0

    # Perform trials
    for i in range(n):
        # Choose random number between zero and one: random_number
        random_number = np.random.random()

        # If less than p, it's a success so add one to n_success
        if random_number < p:
            n_success += 1

    return n_success
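The loop above amounts to one draw from a Binomial distribution. As a rough sketch (the seed and the values of `n`, `p`, and `n_repeats` are illustrative assumptions, and the function is restated in vectorized form so the block runs on its own), the two approaches can be compared like this:

```python
import numpy as np

def perform_bernoulli_trials(n, p):
    """Vectorized restatement: n Bernoulli trials, return number of successes."""
    return int(np.sum(np.random.random(size=n) < p))

np.random.seed(42)  # arbitrary seed, used only for reproducibility

# Illustrative values (assumptions, not from the notes)
n, p, n_repeats = 100, 0.05, 10000

# Repeat the n-trial experiment many times with the hand-written function
manual = np.array([perform_bernoulli_trials(n, p) for _ in range(n_repeats)])

# The same experiment drawn directly from the Binomial distribution
direct = np.random.binomial(n, p, size=n_repeats)

# Both sample means should be close to the theoretical mean n * p
print(manual.mean(), direct.mean(), n * p)
```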

## Poisson: 
__Poisson process:__ The timing of the next event is completely independent of when the previous event happened (e.g., bus arrivals in Poissonville). \
__Poisson distribution:__ A limit of the Binomial distribution for low probability of success and a large number of trials (i.e., for rare events):
1) The number *r* of arrivals of a Poisson process in a given time interval with an average rate of $\lambda$ arrivals per interval is Poisson distributed. \
2) Example: the number *r* of hits on a website in one hour with an average hit rate of 6 hits per hour is Poisson distributed. \
__`np.random.poisson()`__
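The "limit of the Binomial" statement can be checked numerically. The sketch below uses an assumed rate of 10 arrivals per interval and an arbitrary seed; it holds n * p fixed at 10 while n grows and p shrinks, and the Binomial standard deviation should move toward the Poisson one:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for reproducibility

# Poisson samples with an average of 10 arrivals per interval (illustrative rate)
samples_poisson = np.random.poisson(10, size=10000)
print('Poisson:     ', np.mean(samples_poisson), np.std(samples_poisson))

# Binomial samples with n * p = 10; large n and small p approach the Poisson limit
for n, p in [(20, 0.5), (100, 0.1), (1000, 0.01)]:
    samples_binomial = np.random.binomial(n, p, size=10000)
    print('n =', n, 'Binom:', np.mean(samples_binomial), np.std(samples_binomial))
```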

The waiting time between arrivals of a Poisson process is Exponentially distributed, so the total wait for two successive processes is the sum of two Exponential draws.

__`successive_poisson()` function:__

In [32]:
def successive_poisson(tau1, tau2, size=1):
    """Compute time for arrival of 2 successive Poisson processes."""
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size=size)

    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size=size)

    return t1 + t2
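A minimal usage sketch, assuming the function passes its `size` argument through to both draws. The mean waiting times 3.0 and 5.0 and the seed are made-up values for illustration:

```python
import numpy as np

def successive_poisson(tau1, tau2, size=1):
    """Total waiting time for two successive Poisson processes."""
    t1 = np.random.exponential(tau1, size=size)
    t2 = np.random.exponential(tau2, size=size)
    return t1 + t2

np.random.seed(42)  # arbitrary seed for reproducibility

# Illustrative mean waiting times (assumptions, not from the notes)
waiting_times = successive_poisson(3.0, 5.0, size=100000)

# The mean total wait should be close to tau1 + tau2
print(np.mean(waiting_times))
```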

## Exponential: 
The waiting time between arrivals of a Poisson process is Exponentially distributed. \
Parameters: mean (waiting time), size \
__`np.random.exponential(scale=1.0, size=None)`__

`inter_nohitter_time = np.random.exponential(tau, 100000)`

## Checking normality of distribution:

`import numpy as np` \
`mean = np.mean(michelson_speed_of_light)` \
`std = np.std(michelson_speed_of_light)` \
`samples = np.random.normal(mean, std, size = 10000)` \
`x, y = ecdf(michelson_speed_of_light)` \
`x_theor, y_theor = ecdf(samples)`

Then, plot the empirical and theoretical CDFs on the same plot to check for a normal distribution. \
__This is preferable to a histogram check for normality because there is no binning bias.__
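The visual comparison can be backed up with a number. The sketch below is only an illustration: it uses synthetic data in place of real measurements and an arbitrary seed, and the "largest vertical gap" summary is an added convenience that the notes do not cover. It evaluates both ECDFs on a common grid and reports how far apart they get:

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    return np.sort(data), np.arange(1, n + 1) / n

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic stand-in for real measurements (an assumption for illustration)
measurements = np.random.normal(10.0, 2.0, size=100)

# Theoretical samples using the mean and std estimated from the data
samples = np.random.normal(np.mean(measurements), np.std(measurements), size=10000)

x, y = ecdf(measurements)
x_theor, y_theor = ecdf(samples)

# Evaluate both ECDFs on a common grid and report the largest vertical gap
grid = np.linspace(x[0], x[-1], 200)
ecdf_data = np.searchsorted(x, grid, side='right') / len(x)
ecdf_theor = np.searchsorted(x_theor, grid, side='right') / len(x_theor)
max_gap = np.max(np.abs(ecdf_data - ecdf_theor))
print(max_gap)
```

A small gap is consistent with normality; plotting `x, y` against `x_theor, y_theor` remains the primary check.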

## Computing Percentiles:
__`np.percentile(df['column'], [list of percentiles])`__ 

#### 25th, 50th, 75th percentiles:
__`np.percentile(df['column'], [25, 50, 75])`__ 
#### 95% Confidence interval: 
__`np.percentile(df['column'], [2.5, 97.5])`__ 
#### 99% Confidence interval:
__`np.percentile(df['column'], [0.5, 99.5])`__
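A short sketch of these calls on synthetic standard-normal data. The array name `column` and the seed are assumptions, and a plain NumPy array stands in for `df['column']`. For a standard normal, the 2.5th and 97.5th percentiles should land near -1.96 and 1.96:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic stand-in for df['column']
column = np.random.normal(0.0, 1.0, size=10000)

# Quartiles: 25th, 50th (median), 75th percentiles
quartiles = np.percentile(column, [25, 50, 75])

# Central 95% and 99% of the values
interval_95 = np.percentile(column, [2.5, 97.5])
interval_99 = np.percentile(column, [0.5, 99.5])

print(quartiles, interval_95, interval_99)
```

Applied to raw data these give the range holding the central 95% or 99% of values; applied to bootstrap replicates of a statistic they give a confidence interval for that statistic.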

## Pearson Correlation Coefficient:
Pearson correlation coefficient, $\rho$, ranges from -1 (for complete anti-correlation) to 1 (for complete positive correlation). $\rho$ = 0 indicates no correlation.

__Covariance:__ a measure of how two quantities vary *together.*

$$\rho = \frac{\text{covariance}(x, y)}{(\text{std of } x)\,(\text{std of } y)}$$

$\rho$ = variability due to codependence / independent variability

In [31]:
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x,y)

    # Return entry [0,1]
    return corr_mat[0,1]
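To connect the covariance formula to `np.corrcoef`, here is a small check on synthetic data (the data and seed are illustrative assumptions). `np.cov` uses `ddof=1` by default, so the standard deviations are computed with the same `ddof` to keep the ratio consistent:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic, positively correlated data (illustrative only)
x = np.random.normal(size=500)
y = 2.0 * x + np.random.normal(size=500)

# Covariance of x and y is the off-diagonal entry of the covariance matrix
cov_xy = np.cov(x, y)[0, 1]

# Pearson r from the definition: covariance / (std of x * std of y)
rho_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Pearson r from the correlation matrix, as in pearson_r()
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_numpy)
```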

__Standard error of the mean (sem):__ 

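```
# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, size=10000)

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)

# Compute and print the standard deviation of the bootstrap replicates,
# which should be close to the SEM
bs_std = np.std(bs_replicates)
print(bs_std)
```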

## The np.random module:
A suite of functions based on pseudo-random number generation.

__`np.random.seed()`__ \
set the seed

__`np.random.random(size= )`__ \
draw random numbers uniformly from the interval [0, 1)

__`np.random.choice()`__ \
`np.random.choice([1,2,3,4,5], size = 10)`\
first argument: array of values to "choose" from\
size: how many samples we want to take out of that array\
default: `np.random.choice(a, size=None, replace=True, p=None)`\
__`bs_sample`__ `= np.random.choice(michelson_speed_of_light, size=100)`\
 --this is a bootstrap sample since there were 100 data points in the original data set, and we are choosing 100 of them with replacement.


__`np.random.binomial(4, 0.5)`__ \
__`np.random.binomial(4, 0.5, 10)`__ 

sampling from a Binomial distribution \
arguments: \
(4) = number of Bernoulli trials (coin flips) \
(0.5) = probability of success (50:50) \
(10) = how many times to repeat the (4 flip) experiment


__`np.random.poisson(5, 10000)`__ \
`np.random.poisson(lam=1.0, size=None)`

__`np.random.normal(mean, std, size)`__ \
`np.random.normal(np.mean(height), np.std(height), size = 10000)`

__`np.random.exponential(scale=1.0, size=None)`__  \
`np.random.exponential(mean, 10000)`

## Linear Regression

__Residual:__ Vertical distance between a single data point and the line of best fit; "residual error" \
__Least Squares:__ The process of finding the parameters for which the sum of the squares of the residuals is minimal.

__`np.polyfit()`__ performs least squares analysis with polynomial functions (a linear function is a first degree polynomial). \
__`slope, intercept = np.polyfit(x, y, 1)`__ (degree 1 returns two coefficients, highest power first)
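A brief sketch of what "least squares" means in practice, using synthetic data. The true slope 3.0, intercept 1.0, noise level, and seed are all made-up for illustration. The fitted line should have a smaller residual sum of squares (RSS) than a line with a slightly different slope:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic data scattered around the line y = 3x + 1 (illustrative only)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 1.0 + np.random.normal(0, 0.5, size=50)

# Fit a first-degree polynomial (a straight line)
slope, intercept = np.polyfit(x, y, 1)

# Residuals and their sum of squares for the fitted line
residuals = y - (slope * x + intercept)
rss = np.sum(residuals ** 2)

# RSS for a line with a deliberately perturbed slope
rss_perturbed = np.sum((y - ((slope + 0.1) * x + intercept)) ** 2)

print('slope:', slope, 'intercept:', intercept)
print('RSS (fit):', rss, 'RSS (perturbed):', rss_perturbed)
```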


__Zip lists together for x, y coords and compute parameters of resulting linear regression:__ 

```
# Iterate through x, y pairs
for x, y in zip(anscombe_x, anscombe_y):
    # Compute the slope and intercept: a, b
    a, b = np.polyfit(x, y, 1)
    # Print the result
    print('slope:', a, 'intercept:', b)
```

## Bootstrapping
__Bootstrapping:__ The use of resampled data to perform statistical inference. \
__Bootstrap Sample:__ A resampled array of data. \
__Bootstrap replicate:__ A statistic computed from resampled array (for example; mean of a bootstrap sample/ resampled array). "A simulated replica of the original data acquired by bootstrapping."


__default:__ `np.random.choice(a, size=None, replace=True, p=None)`\
__`bs_sample = np.random.choice(michelson_speed_of_light, size=100)`__\
 --this is a bootstrap sample since there were 100 data points in the original data set, and we are choosing 100 of them with replacement.

Since we will compute the bootstrap replicates over and over again, we can write a function to generate a bootstrap replicate:

__Single bootstrap replicate of 1D data:__
```
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1-D data array."""
    # Bootstrap sample needs the same number of entries as the original data
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)
```

__Many bootstrap replicates:__
```
bs_replicates = np.empty(10000)
for i in range(10000):
    bs_replicates[i] = bootstrap_replicate_1d(michelson_speed_of_light, np.mean)
```

In [35]:
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

### Bootstrap confidence intervals
__p% confidence interval of a statistic:__
If we repeated the measurements over and over again, p% of the observed values of the statistic would lie within the p% confidence interval.

`conf_int_95 = np.percentile(bs_replicates, [2.5, 97.5])`
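Putting the bootstrap pieces together, the sketch below runs on synthetic data (an Exponential sample with an assumed mean of 5.0 and an arbitrary seed). The helper functions are restated compactly so the block runs on its own. It also shows that the spread of the bootstrap replicates of the mean is close to the standard error of the mean:

```python
import numpy as np

def bootstrap_replicate_1d(data, func):
    """Generate one bootstrap replicate of 1-D data."""
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates."""
    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic data set (illustrative only)
data = np.random.exponential(5.0, size=200)

# Bootstrap replicates of the mean and their 95% confidence interval
bs_replicates = draw_bs_reps(data, np.mean, size=10000)
conf_int_95 = np.percentile(bs_replicates, [2.5, 97.5])

# Standard error of the mean, for comparison with the replicate spread
sem = np.std(data) / np.sqrt(len(data))

print('95% CI for the mean:', conf_int_95)
print('SEM:', sem, 'std of replicates:', np.std(bs_replicates))
```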

### Pairs bootstrap

__Pairs bootstrap for linear regression:__ \
We can perform bootstrap estimates to get the confidence intervals of the slope and intercept of a linear regression model as well.\
In instances where we cannot resample individual data because each observation has two variables associated with it, we resample pairs. 

For example: voting counties in PA have total number of votes and democratic share of votes attributed to them, so we resample pairs of data (total votes per county with their respective democratic share of votes) together.

1) Resample data in pairs.\
2) Compute slope and intercept from resampled data.\
3) Each slope and intercept is a bootstrap replicate.\
4) Compute confidence intervals from percentiles of bootstrap replicates. 

Because __`np.random.choice()`__ must sample a 1-D array, sample the indices of the data points. \
Generate the indices of a numpy array using __`np.arange(n)`__, which gives us a range of sequential integers, beginning with 0, and ending with n-1.\
The bootstrap sample is generated by slicing out the respective values from the original data arrays.

__Generating a pairs bootstrap sample:__
```
inds = np.arange(len(total_votes))
bs_inds = np.random.choice(inds, len(inds))
bs_total_votes = total_votes[bs_inds]
bs_dem_share = dem_share[bs_inds]
```

__Computing a pairs bootstrap replicate:__\
`bs_slope, bs_intercept = np.polyfit(bs_total_votes, bs_dem_share, 1)` \
The 1 is the degree of the polynomial, i.e., we're using a linear model.

__Function for pairs bootstrap:__

In [36]:
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""

    # Set up array of indices to sample from: inds
    inds = np.arange(0, len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps
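A usage sketch for the function above on synthetic paired data. The true slope 0.05, intercept 2.0, noise level, and seed are invented for illustration, and the function is repeated so the block is self-contained. The percentiles of the slope replicates give a confidence interval for the slope:

```python
import numpy as np

def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    inds = np.arange(len(x))
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)
    for i in range(size):
        # Resample indices so that each (x, y) pair stays together
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(x[bs_inds], y[bs_inds], 1)
    return bs_slope_reps, bs_intercept_reps

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic paired observations (illustrative only)
x = np.random.uniform(0, 100, size=150)
y = 0.05 * x + 2.0 + np.random.normal(0, 1.0, size=150)

bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(x, y, size=1000)

# 95% confidence intervals from the percentiles of the replicates
slope_ci = np.percentile(bs_slope_reps, [2.5, 97.5])
intercept_ci = np.percentile(bs_intercept_reps, [2.5, 97.5])
print('slope CI:', slope_ci, 'intercept CI:', intercept_ci)
```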

__Plotting bootstrap regressions:__

```
# Generate array of x-values for bootstrap lines: x = array([0, 100])
x = np.array([0, 100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x,
                 bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()
```

## Hypothesis testing
How do we assess how reasonable it is that our observed data are actually described by a chosen model? 

__Hypothesis testing:__ Assessment of how reasonable the observed data are assuming a hypothesis is true.\
__Null hypothesis:__ $H_{0}$; the hypothesis that there is no significant statistical difference between specified populations, any observed difference being due to sampling or experimental error ("chance"). Typically, $H_{0}$ is the hypothesis you test.

__Simulating the $H_{0}$__: Simulate what the data would look like if the county-level voting trends in the two states were identically distributed. We do this by:

### Permutation
1) Put the Democratic share of the vote for all of PA's 67 counties and Ohio's 88 counties together. \
2) Ignore what state each data point belongs to. \
3) Randomly scramble the order. \
4) Relabel the first 67 as "PA" and the last 88 as "Ohio". \
__*So, essentially, we just redid the election as if there were no difference between PA county votes and OH county votes.*__

__Permutation:__ \
Random reordering of array.

__Permutation sample:__ \
The permutation, or newly shuffled arrangement assigned to values (`perm_sample_PA`, `perm_sample_OH`, for example).

__Permutation replicate:__ \
A single value of a statistic computed from a permutation sample. (Test statistic of permutation sample)

### Generating a permutation sample

```
# Make a single array with all of the data in it (all counties from both states)
# Note that concatenate takes the arrays together as a single tuple.
dem_share_both = np.concatenate((dem_share_PA, dem_share_OH))
# Shuffle concatenated array
dem_share_perm = np.random.permutation(dem_share_both)
# Slice first 67 counties as PA's permuted sample
perm_sample_PA = dem_share_perm[:len(dem_share_PA)]
# Slice last 88 counties as OH's permuted sample
perm_sample_OH = dem_share_perm[len(dem_share_PA):]
```

`perm_sample_PA` and `perm_sample_OH` are called __permutation samples.__

__Generate single permutation sample:__

In [40]:
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))

    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:(len(data1))]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

### Generating permutation replicates
Below: `func` *must* be a function that accepts two arrays as arguments. \
In most circumstances, func will be a function you write yourself:

In [38]:
def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates

__diff_of_means()__

In [None]:
def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1)- np.mean(data_2)

    return diff

__p-value__

```
# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
                                 diff_of_means, size=10000)
# Compute p-value: p
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)
```
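For a version that runs end to end, the sketch below swaps in synthetic `force_a` and `force_b` arrays (made-up values, with group A given a larger mean) and restates the helpers compactly. The p-value is one-sided: it counts permutation replicates at least as large as the observed difference.

```python
import numpy as np

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    return np.mean(data_1) - np.mean(data_2)

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""
    data = np.concatenate((data_1, data_2))
    perm_replicates = np.empty(size)
    for i in range(size):
        # Shuffle the pooled data, then split it back into two groups
        permuted = np.random.permutation(data)
        perm_replicates[i] = func(permuted[:len(data_1)], permuted[len(data_1):])
    return perm_replicates

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic measurements (illustrative only, not real experimental data)
force_a = np.random.normal(1.0, 0.2, size=30)
force_b = np.random.normal(0.5, 0.2, size=30)

empirical_diff_means = diff_of_means(force_a, force_b)
perm_replicates = draw_perm_reps(force_a, force_b, diff_of_means, size=10000)

# Fraction of replicates at least as extreme as the observed difference
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)
print('observed difference:', empirical_diff_means, 'p-value:', p)
```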

## Test statistics and p-values
Hypothesis testing: What about the data do we assess and how do we quantify the assessment? Test statistics and p-values (respectively).

__Test statistic:__ A single number that can be computed from observed data and from data you simulate under the null hypothesis, that serves as a basis of comparison between what the hypothesis predicts and what we actually observe. \
Example test statistic: difference in means (if the null hypothesis is correct, the difference in means should be close to 0).

__Permutation replicate:__ The value of a test statistic computed from a permutation sample. \
`np.mean(perm_sample_PA) - np.mean(perm_sample_OH)` \
= difference in mean vote share by state in the permuted data, to be compared against the observed difference `np.mean(dem_share_PA) - np.mean(dem_share_OH)`

__p-value:__ the probability of obtaining a value of your test statistic that is at least as extreme as what was observed, under the assumption the null hypothesis is true. \
When the p-value is small, it is often said that the data are statistically significantly different.

__Statistical significance:__ determined by smallness of p-value.

Pipeline for hypothesis testing: 
1) Clearly state the null hypothesis \
2) Define test statistic \
3) Generate many sets of simulated data assuming the null hypothesis is true \
4) Compute the test statistic for each simulated data set \
5) The p-value is the fraction of your simulated data sets for which the test statistic is at least as extreme as for the real data 

__One sample test:__ Compare one set of data to a single number\
__Two sample test:__ Compare two sets of data

__A/B Testing, diff_frac() function:__

In [41]:
def diff_frac(data_A, data_B):
    """Difference in fraction of successes (B minus A) for two arrays of 0/1 or boolean outcomes."""
    frac_A = np.sum(data_A) / len(data_A)
    frac_B = np.sum(data_B) / len(data_B)
    return frac_B - frac_A

for A/B testing: \
`diff_frac_obs = diff_frac(clickthrough_A, clickthrough_B)`
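A hedged sketch of a full A/B permutation test built on `diff_frac`. The click counts (45 of 500 for A, 67 of 500 for B) are invented for illustration, and the one-sided p-value asks how often a random relabeling gives B an advantage at least as large as the observed one:

```python
import numpy as np

def diff_frac(data_A, data_B):
    """Difference in fraction of successes (B minus A)."""
    frac_A = np.sum(data_A) / len(data_A)
    frac_B = np.sum(data_B) / len(data_B)
    return frac_B - frac_A

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic clickthrough data: 1 = clicked, 0 = did not click (made-up counts)
clickthrough_A = np.array([1] * 45 + [0] * 455)
clickthrough_B = np.array([1] * 67 + [0] * 433)

diff_frac_obs = diff_frac(clickthrough_A, clickthrough_B)

# Permutation replicates: shuffle the pooled outcomes and relabel A and B
both = np.concatenate((clickthrough_A, clickthrough_B))
perm_replicates = np.empty(10000)
for i in range(10000):
    permuted = np.random.permutation(both)
    perm_replicates[i] = diff_frac(permuted[:len(clickthrough_A)],
                                   permuted[len(clickthrough_A):])

p_value = np.sum(perm_replicates >= diff_frac_obs) / len(perm_replicates)
print('observed difference:', diff_frac_obs, 'p-value:', p_value)
```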

__Hypothesis test of correlation:__ \
1) Posit null hypothesis: the two variables are completely uncorrelated.\
2) Simulate data assuming null hypothesis is true.\
3) Use Pearson correlation coefficient, $\rho$, as test statistic.\
4) Compute p-value as fraction of replicates that have $\rho$ at least as large as observed.
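The four steps above can be sketched as follows. This assumes synthetic correlated data and an arbitrary seed, and it simulates the null hypothesis by permuting one variable while leaving the other fixed, which is one common way to break any association:

```python
import numpy as np

def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

np.random.seed(42)  # arbitrary seed for reproducibility

# Synthetic, positively correlated data (illustrative only)
x = np.random.normal(size=100)
y = x + np.random.normal(size=100)

# Observed test statistic
r_obs = pearson_r(x, y)

# Simulate the null hypothesis: permute x, keep y fixed
perm_replicates = np.empty(10000)
for i in range(10000):
    x_permuted = np.random.permutation(x)
    perm_replicates[i] = pearson_r(x_permuted, y)

# Fraction of replicates with a correlation at least as large as observed
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('observed r:', r_obs, 'p-value:', p)
```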

## Plots:

### Bee swarm plot:
`sns.swarmplot()` \
`sns.swarmplot(x=x)` \
`sns.swarmplot(x='x_column', y='y_column', data=df)` 

`sns.swarmplot(*, x=None, y=None, hue=None, data=None, order=None, hue_order=None, dodge=False, orient=None, color=None, palette=None, size=5, edgecolor='gray', linewidth=0, ax=None, **kwargs)`

## Other:

__`np.arange()`__ \
`([start, ]stop, [step, ]dtype=None, *, like=None)` \
Create a NumPy array of evenly spaced values; with a single argument *n*, the integers 0 through n-1 \
`np.arange(7)` \
creates: `array([0, 1, 2, 3, 4, 5, 6])`

__`np.empty()`__ \
`(shape, dtype=float, order='C', *, like=None)` \
Create an uninitialized NumPy array of the given shape (its entries are arbitrary until assigned)

__`np.empty_like()`__ returns a new array with the same shape and type as a given array (which you provide as an argument). \
`rss = np.empty_like(a_vals)`

__`np.concatenate()`__ \
`((a1, a2, ...), axis=0, out=None, dtype=None, casting="same_kind")` \
Concatenate arrays to perform permutation when testing null hypothesis. \
__Note:__ The arrays are passed together as one sequence, typically a *tuple* (note the extra parentheses).

__`np.random.permutation(x)`__ \
Randomly permute a sequence, or return a permuted range.\
If *x* is a multi-dimensional array, it is only shuffled along its first index.

__Nonparametric inference:__ \
Makes no assumptions about the model or probability distribution underlying the data. Estimates/summary statistics are computed using data alone (and no underlying/assumed models).

## Exponential + half tau + 2x tau

`#Create an ECDF from real data: x, y` \
`x, y = ecdf(nohitter_times)` 

`#Create a CDF from theoretical samples: x_theor, y_theor`\
`x_theor, y_theor = ecdf(inter_nohitter_time)`

`#Take samples with half tau: samples_half`\
`samples_half = np.random.exponential(tau/2, 10000)`

`#Take samples with double tau: samples_double`\
`samples_double = np.random.exponential(2*tau, 10000)`

`#Generate CDFs from these samples`\
`x_half, y_half = ecdf(samples_half)`\
`x_double, y_double = ecdf(samples_double)`

`#Plot these CDFs as lines`\
`_ = plt.plot(x_half, y_half)`\
`_ = plt.plot(x_double, y_double)`

`#Overlay the plots`\
`plt.plot(x_theor, y_theor)`\
`plt.plot(x, y, marker='.', linestyle='none')`

`#Margins and axis labels`\
`plt.margins(0.02)`\
`plt.xlabel('Games between no-hitters')`\
`plt.ylabel('CDF')`

`plt.show()`

__Simulating coin flips__:

In [None]:
n_all_heads = 0 
for i in range(10000): 
    heads = np.random.random(size=4) < 0.5
    n_heads = np.sum(heads)
    if n_heads == 4:
        n_all_heads += 1
n_all_heads/10000
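The same simulation can be written without an explicit loop and checked against the exact answer, 0.5 ** 4 = 0.0625. The sketch below uses an arbitrary seed and compares the vectorized coin flips with draws from `np.random.binomial`:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for reproducibility

n_repeats = 10000

# Each row is one experiment of 4 coin flips; True means heads
flips = np.random.random(size=(n_repeats, 4)) < 0.5
frac_all_heads = np.mean(np.sum(flips, axis=1) == 4)

# Same experiment drawn directly from the Binomial distribution
frac_binomial = np.mean(np.random.binomial(4, 0.5, size=n_repeats) == 4)

# Exact probability of four heads in four fair flips
exact = 0.5 ** 4

print(frac_all_heads, frac_binomial, exact)
```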