## Import Statements ##

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from env import host, user, password
np.random.seed(123)

## Types of Distributions ##

#### Normal Distribution ####
- models a continuous random variable 
- the further away from the mean you are, the less likely the outcome
- commonly referred to as the 'bell curve'
- many continuous variables tend to follow a normal distribution
- defined by a mean and a standard deviation
    - standard normal distribution is a normal distribution with a mean
      of 0 and a standard deviation of 1

In [None]:
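# template (fill in your own values):
# mean = <mean of the distribution>
# std = <standard deviation of the distribution>

# theoretical
# distribution = stats.norm(mean, std)
# distribution.<method>()
#     method can be .pdf(), .cdf(), .ppf(), .sf(), .isf(), or .rvs()
#     (a continuous distribution has .pdf(), not .pmf())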

##Example:    

mean = 3.0
std = .3
students = stats.norm(mean, std)    
    
# simulation/experimental
# np.random.normal(mean, std, size)
#     returns a numpy array of `size` random draws
    
##Example:

mean = 3.0
std = .3
n_students = 100_000
grades = np.random.normal(mean, std, n_students)
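
As a rough sketch of how the theoretical and simulated approaches line up, the cell below estimates the probability of a grade above a cutoff both ways. The cutoff of 3.5 is a hypothetical value chosen only for illustration; the mean and standard deviation are the ones used above.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

mean, std = 3.0, 0.3
cutoff = 3.5  # hypothetical cutoff, chosen only for illustration

# theoretical: P(grade > cutoff) from the frozen distribution
students = stats.norm(mean, std)
p_theory = students.sf(cutoff)

# simulated: share of 100,000 random grades above the cutoff
grades = np.random.normal(mean, std, 100_000)
p_sim = (grades > cutoff).mean()

print(f"theoretical: {p_theory:.4f}, simulated: {p_sim:.4f}")
```

The two numbers should agree to a couple of decimal places, and the gap shrinks as the number of simulated grades grows.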

#### Binomial Distribution ####
- models the number of successes after a number of trials, given a certain 
  probability of success
- defined by a number of trials and a probability of success
- assumes that each trial is independent of the others
- the definition of a 'success' can vary
    - really we need a random variable with two outcomes, and we can define one of the outcomes as a success
    - a binomial distribution with an n of 1 is referred to as a Bernoulli Distribution

In [None]:
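# template (fill in your own values):
# n = <number of trials>
# p = <probability of success>

# theoretical
# distribution = stats.binom(n, p)
# distribution.<method>()
#     method can be .pmf(), .cdf(), .ppf(), .sf(), .isf(), or .rvs()
#     (a discrete distribution has .pmf(), not .pdf())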

##Example:

clicks = stats.binom(4326, .02)
    
# simulation/experimental
# np.random.binomial(n, p, size)
#     returns a numpy array of `size` random draws
    
##Example:

click = np.random.binomial(4326, .02, 100_000)
(click >= 97).mean()
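
The simulated answer above can be checked against the theoretical distribution. The sketch below assumes the question is "what is the probability of 97 or more clicks?"; because `sf()` is strictly greater than, the equivalent theoretical call is `sf(96)`.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

n_trials, p_success = 4326, 0.02

# theoretical: P(X >= 97) is the same as P(X > 96) for a discrete variable
clicks = stats.binom(n_trials, p_success)
p_theory = clicks.sf(96)

# simulated: share of 100,000 simulated days with 97 or more clicks
click = np.random.binomial(n_trials, p_success, 100_000)
p_sim = (click >= 97).mean()

print(f"theoretical: {p_theory:.4f}, simulated: {p_sim:.4f}")
```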

#### Poisson Distribution ####
- models a situation where a certain number of events happen over a
  specified time interval
- number of events is a discrete measure
- this distribution can tell us the likelihood of a certain number of 
  events occurring over the time period
- assumes that the events are independent of each other and independent 
  of the time since the last event
- must know the average rate to use the Poisson distribution

In [None]:
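# template (fill in your own values):
# mean = <average number of events per time interval>

# theoretical
# distribution = stats.poisson(mean)
# distribution.<method>()
#     method can be .pmf(), .cdf(), .ppf(), .sf(), .isf(), or .rvs()
#     (a discrete distribution has .pmf(), not .pdf())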
    
##Example:

cars = 2
no_cars = stats.poisson(2).cdf(0)
    
# simulation/experimental
# np.random.poisson(mean, n_trials)
#     returns a numpy array of n_trials random draws

##Example:    

n = 100_000
zero_cars = np.random.poisson(2, n)
none = (zero_cars == 0).mean()
# or, drawing the random values with scipy instead of numpy
(stats.poisson(2).rvs(n) == 0).mean()
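
For the zero-cars example, the theoretical answer has a closed form: with an average rate of 2, P(X = 0) = e^-2. The sketch below compares that value with both simulation styles; the variable names are illustrative.

```python
import math

import numpy as np
from scipy import stats

np.random.seed(123)

avg_cars = 2
n = 100_000

# theoretical: probability of exactly zero events
p_theory = stats.poisson(avg_cars).pmf(0)

# simulated with numpy, then with scipy's rvs()
p_sim_numpy = (np.random.poisson(avg_cars, n) == 0).mean()
p_sim_scipy = (stats.poisson(avg_cars).rvs(n) == 0).mean()

print(p_theory, math.exp(-avg_cars), p_sim_numpy, p_sim_scipy)
```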

#### Uniform Distribution ####
- the discrete uniform distribution models events with a finite set of discrete outcomes
- each outcome has an equally likely chance of happening

In [None]:
# example: rolling a 6-sided die

die_dist = stats.randint(1, 7)
# .randint gives a range 1-7, 7 being exclusive

die_dist.rvs() 
    # returns a single random value from die_dist
die_dist.rvs(5)
    # returns 5 random values
die_dist.rvs((5, 5))
    # returns a 5x5 matrix of random values
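
One way to see the "equally likely" property is to roll the die many times and compare each face's observed frequency with its theoretical probability of 1/6. This is a minimal sketch; the number of rolls is an arbitrary choice.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

die_dist = stats.randint(1, 7)

# roll the die many times and count how often each face appears
rolls = die_dist.rvs(60_000)
faces, counts = np.unique(rolls, return_counts=True)
freqs = counts / rolls.size

# observed frequency next to the theoretical probability for each face
for face, freq in zip(faces, freqs):
    print(face, round(freq, 3), round(die_dist.pmf(face), 3))
```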

## Distribution Methods ##

#### rvs() ####
- generates random values based on distribution
- can pass:
    - no argument to get a single random value
        - distribution.rvs()
        - 6
    - a single integer to get that many random values
        - distribution.rvs(5)
        - array([3, 5, 3, 2, 5])
    - a tuple with the dimensions of a matrix of random values
        - distribution.rvs((5, 5))
        - array([[3,4,2,2,1],
                 [2,2,1,1,2],
                 [4,6,5,1,1],
                 [5,2,4,3,5],
                 [3,5,1,6,1]])

PDF/PMF: probability our random variable takes on a given value '=='

#### Probability Density Function - pdf()
- for continuous distributions 
- accepts a value
- returns the probability density at that value (for a continuous variable, the probability of any single exact value is 0)

#### Probability Mass Function - pmf()
- for discrete distributions
- accepts single value
- returns the probability of any single outcome
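
The difference between a mass and a density can be sketched with two small examples. The 10 coin flips and the narrow normal distribution are illustrative choices, not part of the notes above.

```python
import math

from scipy import stats

# pmf: a true probability for a discrete outcome
coin_flips = stats.binom(10, 0.5)
p_five_heads = coin_flips.pmf(5)  # C(10, 5) / 2**10 = 252 / 1024

# pdf: a density for a continuous variable, not a probability
standard_normal = stats.norm(0, 1)
density_at_zero = standard_normal.pdf(0)  # 1 / sqrt(2 * pi)

# a density can exceed 1 when the distribution is narrow
narrow_density = stats.norm(0, 0.1).pdf(0)

print(p_five_heads, density_at_zero, narrow_density)
```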

CDF/PPF: probability our random variable takes on a value less than or equal to a given point '<='

#### Cumulative Distribution Function - cdf() 
- given a value, what is the probability?
    - looking left, inclusive

#### Percent Point Function - ppf()
- also known as the quantile function
- given a probability, what is the value?

SF/ISF: probability our random variable takes on a value greater than a given point '>'

#### Survival Function - sf()
- given a value, what is the probability?
- greater than a number, not including that number 
    - looking right, non inclusive

#### Inverse Survival Function - isf()
- given a probability, what is the value?
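
The four methods pair up as inverses of each other, which the sketch below checks on a standard normal distribution. The value 1.0 is an arbitrary point chosen for illustration.

```python
import numpy as np
from scipy import stats

dist = stats.norm(0, 1)
value = 1.0  # arbitrary point, for illustration only

p_left = dist.cdf(value)   # P(X <= value), looking left
p_right = dist.sf(value)   # P(X > value), looking right

# ppf undoes cdf, and isf undoes sf
value_from_ppf = dist.ppf(p_left)
value_from_isf = dist.isf(p_right)

print(p_left, p_right, value_from_ppf, value_from_isf)
```

Because the left and right probabilities cover every outcome, they always sum to 1.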

## Hypothesis Testing ##
- process of comparing one hypothesis to another
- uses statistics to help evaluate the hypothesis

In [None]:
# Null Hypothesis --> Ho --> the 'default' hypothesis, usually no change
# Alternative Hypothesis --> H1 or Ha --> some sort of change, difference between what is being tested
# Significance Level, False Positive Rate --> P(FP) = P(Type I Error)
# Statistical Power --> P(Reject Ho when Ho is false)
# False Negative Rate --> P(FN) = P(Type II Error)
# p-value --> P(observing a result at least this extreme | Ho is true)

#### Central Limit Theorem ####
- tells us that the sampling distribution of the sample mean is 
  approximately normally distributed when the sample size is large enough, 
  even if the underlying random variable is not
- this concept allows us to make calculations using the normal 
  distribution based on the values we calculate from our samples
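
A small simulation can illustrate the idea. The sketch below assumes an exponential distribution as the non-normal underlying variable, with a sample size of 50 and 10,000 repeated samples; all three choices are arbitrary.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

sample_size = 50
n_samples = 10_000

# each row is one sample drawn from a skewed (exponential) distribution
samples = np.random.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# skewness near 0 suggests a roughly symmetric, bell-like shape
raw_skew = stats.skew(samples.ravel())
means_skew = stats.skew(sample_means)

# the spread of the sample means should be close to sigma / sqrt(n)
expected_std = 1.0 / np.sqrt(sample_size)

print(raw_skew, means_skew, sample_means.std(), expected_std)
```

The raw values stay strongly skewed, while the sample means are much closer to symmetric and cluster around the population mean of 1.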

In [None]:
# Reject the Null Hypothesis
# Fail to reject the Null Hypothesis 
- two statements used to determine the result of the statistical testing
- this does not tell us that the alternative hypothesis is true
- the alternative hypothesis can state either that there is any difference, or that the difference is in a specific direction (greater than or less than)
- this choice tells us whether we are setting up a two-tailed test (for any difference)
- or a one-tailed test (for a specific difference)

#### Hypothesis Test Results ####
Confidence Interval - range of values within which we expect our statistic to fall a certain percent of the time; we choose the confidence level before testing begins
Significance Level - we set our significance level (alpha) by choosing a confidence level.
- alpha is defined as 1 - confidence level, e.g. 1 - .95 == .05
- typical confidence levels are 95%, 99%, 99.9%

#### p-values ####
- represents the probability that we would obtain the results we did (or more extreme results) due to
  chance if the null hypothesis is true
- if the p-value is less than alpha, we reject the null hypothesis 
- if p-value is more than alpha, we fail to reject the null hypothesis 

#### Hypothesis Testing Errors ####
Type I Error
- when we reject the null hypothesis, but in reality, the null hypothesis is true
- you rejected the Ho when you should have left it alone - Sin of Commission
Type II Error
- when we fail to reject the null hypothesis when the null hypothesis is actually false
- you left the Ho alone when you should have rejected it - Sin of Omission
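
The link between alpha and the Type I error rate can be sketched with a simulation: when the null hypothesis really is true, about alpha of all tests should still reject it. The number of experiments and the sample size below are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

alpha = 0.05
n_experiments = 5_000
sample_size = 30

# every sample comes from a population whose mean really is 0, so Ho is true
samples = np.random.normal(loc=0, scale=1, size=(n_experiments, sample_size))

# one one-sample t-test per row, each testing Ho: mean == 0
result = stats.ttest_1samp(samples, 0, axis=1)

# share of experiments that rejected Ho even though it is true (Type I errors)
false_positive_rate = (result.pvalue < alpha).mean()

print(false_positive_rate)
```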

## T-Test ##
- a t-test lets us compare a categorical and a continuous variable by comparing the mean of the continuous variable by subgroups based on the categorical variable
- can help answer questions such as:
    - are the salaries of the marketing department higher than the 
      company average?
    - do customers that receive marketing emails spend more money?
    - are sales for product A higher when we run a promotion for it?

#### One Sample T-Test ####
- lets us compare the mean for a specific subgroup against the population 
  mean
- lets us compare a categorical and continuous variable by using subgroup
  and population means
- Null Hypothesis: there is no difference in the means of our subgroup 
  and the population
- we assume that the continuous variable is normally distributed

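$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

- $\bar{x}$ = sample mean
- $\mu$ = population mean
- $s$ = sample standard deviation
- $n$ = sample size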


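# xbar, pop_mean, sample_stdev, and sample_pop_number (the sample size) must be defined first
t = (xbar - pop_mean) / (sample_stdev / np.sqrt(sample_pop_number))
print(f"t-score = {t}")
# two-tailed p-value: abs(t) keeps the result valid when the t-score is negative
p = stats.t(sample_pop_number - 1).sf(abs(t)) * 2
print(f"p-value = {p}")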
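
As a worked sketch of the calculation above, the cell below computes the t-score and two-tailed p-value by hand and compares them with `stats.ttest_1samp`. The population mean of 100 and the randomly generated subgroup are hypothetical data, used only to have something to test.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

pop_mean = 100                           # hypothetical population mean
sample = np.random.normal(103, 15, 40)   # hypothetical subgroup of 40 values

# manual calculation
xbar = sample.mean()
sample_stdev = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation
n = sample.size
t = (xbar - pop_mean) / (sample_stdev / np.sqrt(n))
p = stats.t(n - 1).sf(abs(t)) * 2  # two-tailed p-value

# the same test using scipy
t_scipy, p_scipy = stats.ttest_1samp(sample, pop_mean)

print(f"manual: t = {t:.4f}, p = {p:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```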

In [None]:
Two Sample T-Test

## Misc. ##

In [None]:
Trial vs Simulation:
21 students toss 10 coins each
10 trials - specific to binom dist. (experiment)
21 simulations - number of times
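
The coin-toss setup above can be sketched in two equivalent ways: tossing every coin individually, or drawing each student's number of heads directly from a binomial distribution. Treating 1 as heads and assuming a fair coin are choices made for this example.

```python
import numpy as np

np.random.seed(123)

n_trials = 10        # coins tossed by each student
n_simulations = 21   # number of students

# option 1: toss every coin (1 = heads, 0 = tails), one row per student
tosses = np.random.choice([0, 1], size=(n_simulations, n_trials))
heads_per_student = tosses.sum(axis=1)

# option 2: draw each student's number of heads from a binomial distribution
heads_binomial = np.random.binomial(n_trials, 0.5, n_simulations)

print(heads_per_student)
print(heads_binomial)
```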

Generating Random Numbers with Numpy
The numpy.random module provides a number of functions for generating random numbers.

np.random.choice: selects random options from a list
np.random.uniform: generates numbers between a given lower and upper bound
np.random.random: generates numbers between 0 and 1
np.random.randn: generates numbers from the standard normal distribution
np.random.normal: generates numbers from a normal distribution with a specified mean and standard deviation
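
A quick sketch of each function listed above, drawing five values apiece. The option list, bounds, mean, and standard deviation are arbitrary illustration values.

```python
import numpy as np

np.random.seed(123)

picked = np.random.choice(["heads", "tails"], size=5)   # random options from a list
between = np.random.uniform(10, 20, size=5)             # floats in [10, 20)
unit = np.random.random(5)                              # floats in [0, 1)
std_normal = np.random.randn(5)                         # mean 0, standard deviation 1
custom_normal = np.random.normal(3.0, 0.3, 5)           # mean 3.0, standard deviation 0.3

print(picked, between, unit, std_normal, custom_normal, sep="\n")
```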

In [None]:
discrete variable is counted
    - n_students, n_parents
continuous variable is measured
    - height, weight

In [None]:
add additional t-test
add correlation
add chi squared

In [None]:
# t-test for one sample - compares one subgroup to the population, with a sequence and a scalar
# stats.ttest_1samp(sample, population_mean)
# (sequence of values: list/array/Series, population average: float)

# t-test for 2 samples - compares 2 subgroups, with 2 sequences
# stats.ttest_ind(sample_1, sample_2)
# (sequence of values 1: list, sequence of values 2: list)
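
A minimal sketch of the two-sample call, using two made-up groups whose true means differ. The group sizes, means, and standard deviation are assumptions for the example, and `ttest_ind` is used with its default settings.

```python
import numpy as np
from scipy import stats

np.random.seed(123)

# hypothetical subgroups, e.g. spending with and without marketing emails
group_a = np.random.normal(50, 10, 300)
group_b = np.random.normal(56, 10, 300)

# Ho: the two groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
reject_null = p_value < alpha

print(f"t = {t_stat:.4f}, p = {p_value:.6f}, reject Ho: {reject_null}")
```

A negative t-statistic here means the first group's sample mean is lower than the second's.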