# STUDY GROUP - M02S19
## Distributions & Sampling

### Objectives
You will be able to:
* Understand and explain why statistical distributions are useful to data scientists
* Understand and explain the uses of the different distribution functions
* Explain the Central Limit Theorem


### Statistical Distributions

#### Uniform, Normal, Binomial Distributions
**uniform distribution** - describes an event where every possible outcome is equally likely. No single outcome carries any more or less probability of happening than any other possible outcome. The Uniform Distribution can be discrete or continuous.

**normal distribution** - continuous, symmetric, bell-shaped distribution that is fully described by its mean and standard deviation

**An example question we can answer with the Normal Distribution is "what percentage of people are at least 2 inches shorter than the global average height?"**
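As a rough sketch of how such a question could be answered, the snippet below uses `statistics.NormalDist` from Python's standard library. The standard deviation of 3 inches is an assumed value chosen purely for illustration; it is not a figure from these notes.

```python
from statistics import NormalDist

# Assumption for illustration only: heights have a standard deviation of 3 inches.
# We work in deviations from the average height, so the mean is 0.
height_deviation = NormalDist(mu=0, sigma=3)

# P(height <= average - 2 inches) is the CDF evaluated at -2
p_shorter = height_deviation.cdf(-2)
print("Share at least 2 inches below average: {:.1f}%".format(p_shorter * 100))
```

Under this assumed spread, roughly a quarter of people fall at least 2 inches below the average; a different standard deviation would change the answer.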

**binomial distribution** - discrete distribution that describes the probability of a given number of successes in a set of repeated Bernoulli Trials, also known as a Binomial Experiment. For a large number of trials, its shape is closely approximated by the normal distribution.
* describes the number of successes  k  achieved in  n  trials, where the probability of success is  p 
* Recall that Binomial Experiments have the following constraints:

    1. Each experiment consists of  n  repeated trials.

    2. The outcome of each trial is binary, resulting in either success or failure (it doesn't matter which outcome we label as success or failure, just that we're able to assign the labels).

    3. The probability  p  of a given outcome is the same on every trial.

    4. The trials are independent. The results of a given trial are not influenced by prior trial results, and will not influence future trial results in turn.
    
**An example question we could answer with the Binomial Distribution is "if I flip a fair coin 5 times, what is the probability that exactly 2 of those flips lands on heads?"**
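The coin-flip question can be worked out directly from the binomial formula: count the ways to choose which `k` of the `n` trials succeed, then multiply by the probability of any one such sequence. A minimal sketch, where the function name `binom_prob` is our own choice for illustration:

```python
from math import comb

def binom_prob(n, k, p):
    # Number of ways to pick the k successful trials,
    # times the probability of one specific sequence with k successes
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 5 flips of a fair coin, exactly 2 heads
print(binom_prob(5, 2, 0.5))  # 0.3125
```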

#### Negative Binomial Distribution
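**negative binomial distribution** - discrete distribution that describes how many trials of one outcome occur before a fixed number of trials of the opposite outcome has been observed
* describes the number of successes  k  until observing a pre-determined number of failures  r  where the probability of success for each independent trial is  p 
* conventions differ between sources: NumPy's `negative_binomial(n, p)` samples the number of failures that occur before `n` successes, where `p` is the probability of success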

In [1]:
import numpy as np

# Each well drilled has a 10% chance of success.
# Every sample is the number of failed wells before the first success.
s = np.random.negative_binomial(1, 0.1, 100000)
for i in range(1, 11):
    # Fraction of samples with fewer than i failures, i.e. a success within i wells
    probability = np.mean(s < i)
    print("{} wells drilled, probability of success: {:.4f}%".format(i, probability * 100))

1 wells drilled, probability of success: 9.9710%
2 wells drilled, probability of success: 18.9620%
3 wells drilled, probability of success: 27.0710%
4 wells drilled, probability of success: 34.5710%
5 wells drilled, probability of success: 41.1910%
6 wells drilled, probability of success: 47.0890%
7 wells drilled, probability of success: 52.4170%
8 wells drilled, probability of success: 57.1340%
9 wells drilled, probability of success: 61.4160%
10 wells drilled, probability of success: 65.3890%
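
The quantity this simulation estimates, the chance of at least one success within `i` wells, also has a closed form. There is no success only if all of the first `i` wells fail, so the probability is `1 - (1 - p)**i`. A minimal sketch, assuming the same `p = 0.1` as the simulation:

```python
p = 0.1  # probability of success for a single well, as in the simulation

for i in range(1, 11):
    # "No success in i wells" happens with probability (1 - p)**i
    within_i = 1 - (1 - p) ** i
    print("{} wells drilled, probability of success: {:.4f}%".format(i, within_i * 100))
```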


#### Geometric Distribution
**geometric distribution** - discrete probability distribution that gives the probability that the first success in a sequence of independent trials occurs on trial  x , **special case of the Negative Binomial Distribution** in which we wait for only a single occurrence

**"What is the probability that a coin first lands on tails on flip X?"**

In [2]:
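def geom_prob(p, x):
    # Probability that the first success occurs on trial x:
    # x - 1 failures (each with probability q) followed by one success
    q = 1 - p
    ex = x - 1
    return q**ex * p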

# Let's test that it works
geom_prob(0.5, 3) # Expected Output 0.125

0.125

In [3]:
for x in range(1, 11):
    p = 0.474  # probability of red on a single spin
    print("Probability that red first lands on spin {}: {:.5f}%".format(x, geom_prob(p, x) * 100))

Probability that red first lands on spin 1: 47.40000%
Probability that red first lands on spin 2: 24.93240%
Probability that red first lands on spin 3: 13.11444%
Probability that red first lands on spin 4: 6.89820%
Probability that red first lands on spin 5: 3.62845%
Probability that red first lands on spin 6: 1.90857%
Probability that red first lands on spin 7: 1.00391%
Probability that red first lands on spin 8: 0.52805%
Probability that red first lands on spin 9: 0.27776%
Probability that red first lands on spin 10: 0.14610%
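
Two quick sanity checks on the geometric formula, offered as a sketch: the probabilities over all possible values of `x` should sum to 1, and the expected number of trials until the first success should equal `1/p`. The helper is redefined here so the snippet runs on its own, and the cutoff of 200 terms is an arbitrary choice that makes the truncated tail negligible.

```python
def geom_prob(p, x):
    # Probability that the first success occurs on trial x
    return (1 - p) ** (x - 1) * p

p = 0.474
xs = range(1, 201)  # truncate the infinite sum; the remaining tail is negligible

total = sum(geom_prob(p, x) for x in xs)
mean_trials = sum(x * geom_prob(p, x) for x in xs)

print("Sum of probabilities: {:.6f}".format(total))
print("Expected trials until first success: {:.4f} (1/p = {:.4f})".format(mean_trials, 1 / p))
```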


#### Poisson Distribution
**Poisson Distribution** - discrete distribution that gives the probability of a given number of events occurring in a fixed time period, based on the mean number of events that happen in that period
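
When the mean number of events per period is `lam`, the Poisson probability of seeing exactly `k` events is `lam**k * exp(-lam) / k!`. A minimal sketch follows; the function name `poisson_prob` and the rate of 3 website visits per minute are assumptions made for illustration only.

```python
from math import exp, factorial

def poisson_prob(lam, k):
    # P(exactly k events) when the mean number of events per period is lam
    return lam**k * exp(-lam) / factorial(k)

# Assumed for illustration: a website averages 3 visits per minute.
# Probability of exactly 5 visits in a given minute:
print("{:.4f}".format(poisson_prob(3, 5)))
```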

#### Exponential Distribution
**Exponential Distribution** - describes the probability distribution of the amount of time it may take before an event occurs. In a way, it solves the **inverse** of the problem solved by the Poisson Distribution. It is the **continuous analogue of the Geometric Distribution**.

The Poisson Distribution lets us ask how likely any given number of events is over a set interval of time.

The Exponential Distribution lets us ask how likely it is that a given length of time passes before the next event occurs.

* **How long before a sensor in this factory breaks down?**
* **How long until the next earthquake happens?**
* **How long will the next customer interaction take?**
* **How long until the next person visits my website?**
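
Questions like these can be answered with the exponential cumulative distribution function: for events arriving at a mean rate of `rate` per unit of time, the probability that the wait for the next event is at most `t` is `1 - exp(-rate * t)`. The sketch below uses a hypothetical helper `prob_wait_at_most` and an assumed rate of 3 website visits per minute.

```python
from math import exp

def prob_wait_at_most(rate, t):
    # Exponential CDF: probability that the next event arrives within time t
    return 1 - exp(-rate * t)

# Assumed for illustration: visitors arrive at an average rate of 3 per minute.
for seconds in (10, 30, 60):
    t = seconds / 60  # convert to minutes so the units match the rate
    print("Next visit within {} s: {:.1f}%".format(seconds, prob_wait_at_most(3, t) * 100))
```

Note that `rate` and `t` must share the same time unit, which is why the seconds are converted to minutes first.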

### Sampling

#### Central Limit Theorem
Rarely, if ever, are we able to completely survey a population of interest, and we will often deal with missing data as well. Whether we are estimating asthma rates, fish populations, daily temperatures, material volumes, risk, manufacturing defects, or any other quantity that is unknown or too large to measure in full, we are unlikely to have complete information about the system in question. As a result, we take samples and use them to estimate the corresponding measurements for the complete population from which the sample was drawn. These estimates of population parameters are known as point estimates. Interestingly, point estimates of a given population parameter behave predictably: across repeated samples, the point estimates themselves form specific probability distributions.

**central limit theorem** - states that, under fairly general conditions, the sum (and therefore the mean) of independent random variables tends toward a normal distribution as the number of variables increases, even when the individual variables are not normally distributed
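
A small simulation can make this concrete. The sketch below draws many samples from a uniform distribution, which is flat rather than bell-shaped, and looks at how the sample means behave. The sample size of 30, the 5000 repetitions, and the fixed seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

n = 30              # observations per sample
num_samples = 5000  # number of samples drawn

# Each sample mean averages n draws from Uniform(0, 1)
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(num_samples)
]

# Theory: the means centre on 0.5 with spread sigma / sqrt(n),
# where sigma = sqrt(1/12) is the standard deviation of Uniform(0, 1)
predicted_sd = (1 / 12) ** 0.5 / n ** 0.5

# If the means are approximately normal, about 68% of them
# should fall within one predicted standard deviation of 0.5
within_one_sd = sum(abs(m - 0.5) <= predicted_sd for m in sample_means) / num_samples

print("Mean of sample means: {:.4f}".format(statistics.fmean(sample_means)))
print("SD of sample means:   {:.4f} (predicted {:.4f})".format(
    statistics.stdev(sample_means), predicted_sd))
print("Share within one SD:  {:.1f}%".format(within_one_sd * 100))
```

Swapping `random.random()` for a skewed source such as `random.expovariate(1)` (and adjusting the theoretical mean and standard deviation to match) is a useful follow-up experiment, since the means should still look roughly normal.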
#### Confidence Intervals

