## Probability Distributions


In [24]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
#import viz # curriculum viz example code

np.random.seed(123)

In [27]:
die_distribution = stats.randint(1, 7)

die_distribution.rvs()
die_distribution.rvs(10)    # returns a 1D array
die_distribution.rvs((10,10))  # returns a 2D array

array([[2, 2, 1, 1, 2, 4, 6, 5, 1, 1],
       [5, 2, 4, 3, 5, 3, 5, 1, 6, 1],
       [2, 4, 5, 5, 5, 2, 6, 4, 3, 2],
       [5, 1, 4, 3, 6, 1, 4, 3, 3, 3],
       [6, 3, 5, 4, 4, 6, 5, 5, 6, 4],
       [3, 1, 5, 4, 2, 4, 3, 6, 2, 3],
       [5, 1, 2, 5, 3, 2, 2, 4, 5, 6],
       [2, 1, 1, 4, 2, 4, 4, 4, 6, 2],
       [2, 3, 4, 4, 4, 4, 1, 2, 4, 2],
       [5, 4, 2, 6, 3, 4, 5, 4, 2, 6]])

### PMF / PDF

The probability mass function (pmf) gives us the probability of any single outcome of a discrete distribution. For continuous distributions the analogous function is the probability density function (pdf), which gives a density rather than a probability. For example, we could use the pmf to give us the probability of rolling a 4 with our dice rolling distribution:

In [25]:
die_distribution.pmf(4)   # What's the likelihood we roll a four?

0.16666666666666666
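
As a rough sanity check, the pmf value can be compared with the share of simulated rolls that come up 4. This is only a sketch: the sample size and the seed value are arbitrary choices, not part of the original notebook.

```python
from scipy import stats

die_distribution = stats.randint(1, 7)  # fair six-sided die: values 1 through 6

# Draw many rolls with a fixed seed (the seed value 0 is an arbitrary choice).
rolls = die_distribution.rvs(size=100_000, random_state=0)

# The share of rolls equal to 4 should land close to pmf(4) = 1/6.
empirical = (rolls == 4).mean()
exact = die_distribution.pmf(4)
print(empirical, exact)
```

The two numbers will not match exactly, but they should agree to about two decimal places at this sample size.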

#### Note: 

All of the functions that we discuss here can accept a single value or a list of values and will produce either a single number, or a numpy array of results that correspond to the inputs.


For example, we can calculate multiple pmfs at once like this:

    die_distribution.pmf([1, 2, 3])

### CDF / PPF

The cumulative distribution function (cdf) tells us the probability of getting a given outcome or any outcome below it. For our dice rolling example, this might be something like "what is the probability of rolling a 3 or lower?"

In [26]:
die_distribution.cdf(3)  # What's the likelihood we roll a 3 or less?


0.5

The percent point function (ppf) (also known as the quantile function) can be thought of as the inverse of the cdf.

The ppf accepts a probability and gives us the value that is associated with that probability:

In [28]:
die_distribution.ppf(5/6)

5.0
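
The inverse relationship can be seen by feeding a cdf result back into the ppf. This is a small sketch using the same die distribution; the variable names are only illustrative.

```python
from scipy import stats

die_distribution = stats.randint(1, 7)  # fair six-sided die: values 1 through 6

# cdf maps a value to a cumulative probability: P(X <= 5) = 5/6
probability = die_distribution.cdf(5)

# ppf maps that probability back to the value we started from
value = die_distribution.ppf(probability)

print(probability, value)
```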

### SF / ISF

The survival function (sf) tells us the probability that our random variable falls above a certain value. This is the same as 1 minus the cdf at that value.

We can use this to answer questions like: "What is the likelihood we roll a value higher than 4?"

In [30]:
# The sum of these will always be 1. They are complementary:
# sf is the area to the right, cdf is the area to the left
die_distribution.sf(4) + die_distribution.cdf(4)


1.0
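
To answer the question itself, a minimal sketch is to call the survival function directly and compare it with 1 minus the cdf. The variable names here are only illustrative.

```python
from scipy import stats

die_distribution = stats.randint(1, 7)  # fair six-sided die: values 1 through 6

# P(X > 4): the roll is a 5 or a 6, so this should be 2/6
p_above_4 = die_distribution.sf(4)

# The same quantity computed from the cdf
p_via_cdf = 1 - die_distribution.cdf(4)

print(p_above_4, p_via_cdf)
```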

Like the ppf, the inverse survival function (isf) will give us a value when we provide a probability.

For example: "There is a 1/3 chance a dice roll will be higher than what value?"

In [31]:
die_distribution.isf(1/3)


4.0

### Mini Exercises

Use a scipy statistical distribution to answer the questions below:

die_distribution = stats.randint(1, 7)

    What is the probability of rolling a 1?
    
    There's a 1 in 2 chance that I'll roll higher than what number?
    
    What is the probability of rolling less than or equal to 2?
    
    There's a 5 in 6 chance that my roll will be less than or equal to what number?
    
    There's a 1 in 2 chance that my roll will be less than or equal to what number?
    
    What is the probability of rolling less than or equal to 6?
    
    There's a 1 in 3 chance that I'll roll higher than what number?
    
    What is the probability of rolling higher than a 1?
    
    There's a 2 in 3 chance that my roll will be less than or equal to what number?
    
    There's a 2 in 3 chance that I'll roll higher than what number?
    
    There's a 1 in 3 chance that my roll will be less than or equal to what number?
    
    There's a 1 in 6 chance that I'll roll higher than what number?

In [33]:
# What is the probability of rolling a 1?
die_distribution.pmf(1)

0.16666666666666666

In [37]:
# There's a 1 in 2 chance that I'll roll higher than what number?
die_distribution.isf(1/2)

3.0

In [36]:
# What is the probability of rolling less than or equal to 2?
die_distribution.cdf(2)

0.3333333333333333

In [38]:
# There's a 5 in 6 chance that my roll will be less than or equal to what number?
die_distribution.ppf(5/6)

5.0

In [40]:
# There's a 1 in 2 chance that my roll will be less than or equal to what number?
die_distribution.ppf(1/2)

3.0

In [41]:
# What is the probability of rolling less than or equal to 6?
die_distribution.cdf(6)

1.0

In [42]:
# There's a 1 in 3 chance that I'll roll higher than what number?
die_distribution.isf(1/3)

4.0

In [43]:
# What is the probability of rolling higher than a 1?
die_distribution.sf(1)

0.8333333333333334

In [44]:
# There's a 2 in 3 chance that my roll will be less than or equal to what number?
die_distribution.ppf(2/3)

4.0

In [45]:
# There's a 2 in 3 chance that I'll roll higher than what number?
die_distribution.isf(2/3)

2.0

In [46]:
# There's a 1 in 3 chance that my roll will be less than or equal to what number?
die_distribution.ppf(1/3)

2.0

In [47]:
# There's a 1 in 6 chance that I'll roll higher than what number?
die_distribution.isf(1/6)

5.0

### Binomial Distribution


The binomial distribution lets us model the number of successes after a number of trials, given a certain probability of success. The classic example of this is the number of heads you would expect to see after flipping a coin a certain number of times.


A binomial distribution is defined by a number of trials, and a probability of success. These two pieces of information are what we need in order to model a problem with the binomial distribution.


The binomial distribution assumes that each trial is independent of the others.
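
For intuition, the binomial pmf can be written out by hand as C(n, k) * p^k * (1 - p)^(n - k) and compared with scipy. This is a sketch only; the values of `n`, `p`, and `k` (8 fair coin flips, exactly 3 heads) are chosen for illustration.

```python
from math import comb
from scipy import stats

n, p, k = 8, 0.5, 3  # illustrative values: 8 fair coin flips, exactly 3 heads

# Number of ways to pick which k trials succeed, times the probability
# of any one such sequence of k successes and n - k failures
by_hand = comb(n, k) * p**k * (1 - p)**(n - k)

from_scipy = stats.binom(n, p).pmf(k)

print(by_hand, from_scipy)
```

Multiplying the per-trial probabilities together like this is only valid because the trials are independent.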

Let's take an example:

    You are taking a multiple choice test consisting of 30 questions that you forgot to study for. Each question has 4 possible answers and you will choose one at random. What is the probability you get more than 10 of the questions right?

Here we have a probability of success, 0.25, and a number of trials, 30. We'll define X as the number of questions we get right on the test. We want to know the probability that X > 10, which tells us we want to use the survival function.
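
A minimal sketch of that calculation, assuming the setup described above (30 trials, success probability 0.25); the variable names are only illustrative:

```python
from scipy import stats

n_questions = 30   # number of trials
p_correct = 0.25   # one correct answer out of four, chosen at random

test_distribution = stats.binom(n_questions, p_correct)

# sf(10) is P(X > 10), the chance of getting 11 or more questions right
p_more_than_10 = test_distribution.sf(10)

print(p_more_than_10)
```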


In [48]:
# The probability of getting exactly 3 heads in 8 coin tosses
stats.binom(8, .5).pmf(3)   # 8 tosses, p = .5

0.21875000000000014

In [49]:
# The probability of getting more than 3 heads (4 or more) in 8 tosses; sf(3) is P(X > 3)
stats.binom(8, .5).sf(3)

0.6367187499999999

In [51]:
# Simulate 100 rounds of 8 coin tosses; each value is the number of heads in one round
stats.binom(8, .5).rvs(100)

array([4, 4, 3, 3, 5, 3, 5, 4, 4, 6, 5, 3, 2, 3, 4, 5, 7, 3, 5, 4, 5, 3,
       4, 3, 3, 4, 5, 2, 2, 3, 5, 5, 4, 5, 4, 3, 3, 3, 5, 3, 4, 4, 4, 0,
       7, 6, 3, 3, 4, 6, 7, 3, 4, 5, 4, 5, 3, 4, 6, 7, 2, 4, 3, 4, 4, 2,
       3, 6, 4, 4, 5, 5, 2, 5, 5, 3, 5, 3, 3, 5, 2, 5, 6, 5, 4, 4, 5, 4,
       2, 4, 5, 3, 6, 5, 4, 5, 2, 6, 5, 6])

In [52]:
# The food truck example solved analytically, a way to check your simulations and how you set them up
stats.binom(3, .7).pmf(0)     # The likelihood that no food truck shows up on any of 3 days

0.027000000000000007

In [53]:
# The likelihood that a food truck shows up on at least one of 5 days
stats.binom(5, .7).sf(0)

0.99757

In [None]:
# Simulate one year: the number of food truck days in each of 52 five-day weeks
stats.binom(5, .7).rvs(52)

### Normal Distribution

The normal distribution models a continuous random variable where the further away from the mean you are, the less likely the outcome. This is commonly referred to as the "bell curve", and many continuous variables tend to follow a normal distribution.

A normal distribution is defined by a mean and a standard deviation. The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1.

    Suppose that a store's daily sales are normally distributed with a mean of 12,000 dollars and standard deviation of 2000 dollars. How much would the daily sales have to be to be in the top 10% of all days?

We are given the mean and standard deviation, and are asked to find the value that marks the top 10%. Since we know the probability and want a value, we can use the inverse survival function with 0.1 or, equivalently, the percent point function with 0.9.

In [54]:
# distribution object to interrogate
stats.norm(12000, 2000)

<scipy.stats._distn_infrastructure.rv_frozen at 0x1a18b42a90>

In [55]:
# What's the likelihood we sell more than $10,000?
stats.norm(12000, 2000).sf(10000)

# 84% chance our sales are more than $10,000

0.8413447460685429
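
The same probability can be obtained from the standard normal distribution by converting the sales figure to a z-score first. This is a sketch using the mean and standard deviation from the example; the variable names are only illustrative.

```python
from scipy import stats

mean, sd = 12_000, 2_000
sales = stats.norm(mean, sd)

# z-score: how many standard deviations $10,000 is from the mean
z = (10_000 - mean) / sd

p_direct = sales.sf(10_000)
p_standard = stats.norm().sf(z)  # stats.norm() defaults to mean 0, sd 1

print(z, p_direct, p_standard)
```

Because $10,000 sits exactly one standard deviation below the mean, both calls give the probability of landing above z = -1.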

In [57]:
# What's the cutoff point that determines whether a day is in the bottom 5% of all sales?
stats.norm(12000, 2000).ppf(.05)

# If the sales for a day fall below $8,710, that day is in the bottom 5% of all sales days

8710.292746097053

In [58]:
# What's the cutoff point that determines whether a day is in the top 10% of all sales?
stats.norm(12000, 2000).isf(.1)

# For the sales of a day to be in the top 10%, they have to be above $14,563

14563.103131089201

In [59]:
# ppf(.9) is equivalent to isf(.1), so this returns the same value
stats.norm(12000, 2000).ppf(.9)

14563.103131089201

### Poisson Distribution - events happen over specified time interval


The Poisson distribution lets us model a situation where a certain number of events happen over a specified time interval. The number of events that happen is a discrete measure, and this distribution can tell us the likelihood of a certain number of events occurring over the time period.

The Poisson distribution assumes that the events are independent of each other and independent of the time since the last event. We must also know the average rate to use a Poisson distribution.

The outcome of a Poisson distribution is discrete, not continuous like the normal distribution.

Some examples of real-world processes that can be modeled with a Poisson distribution are:

- The number of emails sent by a mail server in a day
- The number of phone calls received by a call center per hour
- The number of decay events per second from a radioactive source

Let's dive into a specific example:

    Codeup knows that, on average, students consume 5 lbs of coffee per week.

In [61]:
# How likely is it that the coffee consumption for this week is only 3 lbs?
stats.poisson(5).pmf(3)
# There is about a 14% chance

0.1403738958142805
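
For intuition, the Poisson pmf can be written out by hand as e^(-rate) * rate^k / k! and compared with scipy. This is a sketch using the coffee example's numbers; the variable names are only illustrative.

```python
from math import exp, factorial
from scipy import stats

rate = 5  # average lbs of coffee consumed per week
k = 3     # the count we are asking about

# Poisson pmf written out by hand
by_hand = exp(-rate) * rate**k / factorial(k)

from_scipy = stats.poisson(rate).pmf(k)

print(by_hand, from_scipy)
```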

In [67]:
# Suppose we sell an average of 300 tacos per lunch
# How likely is it that we sell only 200 or fewer tacos?
stats.poisson(300).cdf(200)
# Practically zero: about 5 in 10 billion

5.10497804862988e-10

In [None]:
# taco cutoff for the top 10% of busiest lunches
stats.poisson(300).isf(.1)

What is the likelihood that more than 7 lbs of coffee are consumed?



In [68]:
stats.poisson(5).sf(7)
# There's about a 13% chance of this happening

0.13337167407000744

### More Normal Examples

In [None]:
# Suppose the average Codeup admissions phone call is 15 mins long with a standard deviation of 3 mins
# How likely is it that a phone call will go on for 20 mins or longer?


In [62]:
# Both of these will get you the answer to the question above...
stats.norm(15, 3).sf(20)

0.0477903522728147

In [65]:
1 - stats.norm(15, 3).cdf(20)

0.047790352272814696

In [66]:
# How short does a phone call have to be to fall in the bottom 25% of all phone calls?
stats.norm(15, 3).ppf(.25)
# A 12.98 min phone call is the cutoff for the bottom quartile of phone call length

12.976530749411754

In [None]:
# NOTE: on macOS, Option + m types µ
    