In [1]:
from scipy import stats

## Probability Distribution Practice

### Lt Col Horton

In this notebook, I will provide some practice problems that will help reinforce your understanding of probability distributions. Throughout this practice, I highly encourage you to refer to the `scipy` documentation online. While you are there, feel free to explore areas we don't cover in this notebook (particularly plotting and randomization). 

For each of the tasks below **_1)_** define a random variable that will help you answer the question; **_2)_** state the distribution and parameters of that random variable; **_3)_** determine the expected value and variance of that random variable. 

I will demonstrate using **_1.1_** below. 

#### Problem 1

The T-6 training aircraft is used during UPT. Suppose that on each training sortie, aircraft return with a maintenance-related failure at a rate of 1 per 100 sorties. 

**_1.1_** Find the probability of no maintenance failures in 15 sorties. 

**_1.2_** Find the probability of at least two maintenance failures in 15 sorties. 

**_1.3_** Find the probability of at least 30 successful (no mx failures) sorties before the first failure.

**_1.4_** Find the probability of at least 50 successful sorties before the third failure. 



##### Demonstration using 1.1

Find the probability of no maintenance failures in 15 sorties.

$X$: the number of maintenance failures in 15 sorties. 

$X\sim \textsf{Bin}(n=15, p=0.01)$

$E(X) = np = 15 \times 0.01 = 0.15$

$V(X) = np(1-p) = 15 \times 0.01 \times 0.99 = 0.1485$

Probability of no maintenance failures, $P(X=0)$:  

In [5]:
stats.binom.pmf(0,15,0.01)

0.8600583546412884

The probability of getting no maintenance failures in 15 sorties is about 0.86. It is worth taking a moment to make sure this value makes sense. Since failures are fairly unlikely, and the **expected** number of failures (0.15) is close to 0, then 0 failures should be a fairly likely outcome and a probability of 0.86 makes sense. 
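
As a cross-check on the hand calculations, the sketch below asks `scipy` for the mean and variance of the same binomial. The variable names `n` and `p` are chosen here for illustration and are not part of the original notebook.

```python
from scipy import stats

n, p = 15, 0.01  # 15 sorties, failure probability 0.01 on each sortie

# moments='mv' asks for the mean and the variance of Bin(n, p)
mean, var = stats.binom.stats(n, p, moments='mv')

print(float(mean))  # should agree with n*p
print(float(var))   # should agree with n*p*(1-p)
```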

**_1.2_** Find the probability of at least two maintenance failures in 15 sorties. 

$P(X\geq 2)$

In [12]:
1-stats.binom.cdf(1,15,.01)

0.009629773443364797

In [14]:
1-stats.binom.pmf(0,15,.01)-stats.binom.pmf(1,15,.01)

0.009629773443364603
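
The two cells above compute the same quantity and differ only in the last few digits because of floating-point rounding. A third option, sketched here, is the survival function `sf`, which `scipy` defines as `1 - cdf`. The variable name is illustrative.

```python
from scipy import stats

# P(X >= 2) = P(X > 1), which is the survival function evaluated at 1
p_at_least_two = stats.binom.sf(1, 15, 0.01)
print(p_at_least_two)
```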

**_1.3_** Find the probability of at least 30 successful (no mx failures) sorties before the first failure.


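`scipy`'s negative binomial counts the number of failures that occur before the $n$th success. Here we treat a maintenance failure as the "success" ($p=0.01$), so the random variable counts successful sorties.

$X$: the number of successful sorties before the first maintenance failure.

$X\sim \textsf{NegBin}(n=1, p=0.01)$

$E(X) = \frac{n(1-p)}{p} = 99$

$V(X) = \frac{n(1-p)}{p^2} = 9900$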

In [19]:
1-stats.nbinom.cdf(29,1,.01)

0.7397003733882803

Here, we are computing $1-P(X\leq 29)=P(X\geq 30)$.
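
One way to sanity-check this answer is to note that at least 30 successful sorties before the first failure simply means the first 30 sorties all succeed. The sketch below compares that closed form with the geometric and negative binomial distributions in `scipy`; the variable names are illustrative.

```python
from scipy import stats

p = 0.01  # probability of a maintenance failure on any one sortie

# The first 30 sorties must all be failure-free
closed_form = (1 - p) ** 30

# scipy's geom counts the trial on which the first "success" (here, a
# maintenance failure) occurs, so we need that trial to come after 30: P(T > 30)
via_geom = stats.geom.sf(30, p)

# nbinom counts the failure-free sorties before the first maintenance failure
via_nbinom = stats.nbinom.sf(29, 1, p)

print(closed_form, via_geom, via_nbinom)
```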

**_1.4_** Find the probability of at least 50 successful sorties before the third failure. 

In [23]:
1-stats.nbinom.cdf(49,3,.01)

0.9846473742663409

In [27]:
#Moments of X ~ NegBin(3, 0.01) from 1.4; only the variance is displayed
mean, var, skew, kurt = stats.nbinom.stats(3,.01, moments='mvsk')
var

array(29700.)

#### Problem 2

On a given Saturday, suppose vehicles arrive at the USAFA North Gate according to a Poisson process at a rate of 40 arrivals per hour. 

**_2.1_** Find the probability no vehicles arrive in 10 minutes. 

**_2.2_** Find the probability at least 50 vehicles arrive in an hour. 

**_2.3_** Find the probability that at least 5 minutes will pass before the next arrival.

**_2.4_** Find the probability that the next vehicle will arrive between 2 and 10 minutes from now. 

**_2.5_** Find the probability that at least 7 minutes will pass before the next arrival, given that 2 minutes have already passed. Compare this answer to **_2.3_**. This is an example of the *memoryless* property of the exponential distribution.

**_2.6_** Fill in the blank. There is a probability of 90% that the next vehicle will arrive within __ minutes. This value is known as the 90th percentile of the random variable. 


**_2.1_** Find the probability no vehicles arrive in 10 minutes. 

In [29]:
#Poisson process... 
lamb=40/6 #arrivals per 10 minutes
x=0 #no arrivals
stats.poisson.pmf(x,lamb)

0.0012726338013398079

**_2.2_** Find the probability at least 50 vehicles arrive in an hour. 

In [31]:
lamb=40 #back to the hourly rate because the interval is now one hour
x=49 
1-stats.poisson.cdf(x,lamb)

0.07033506665939493

Why 49? Because the question asks for *at least* 50 arrivals: $P(X\geq 50)=1-P(X\leq 49)$, so we evaluate the cumulative distribution at 49 and take the complement. 

**_2.3_** Find the probability that at least 5 minutes will pass before the next arrival.


In [34]:
#The Poisson rate scales with the length of the interval
lamb=40/12 #expected arrivals in 5 minutes
x=0 #number of arrivals in those 5 minutes
#At least 5 minutes before the next arrival means 0 arrivals in the next 5 minutes,
#so a single pmf value is enough (no cdf needed)
stats.poisson.pmf(x,lamb)

0.035673993347252395

In [45]:
#We can also do this with an exponential waiting time
lamb=40/60 #arrivals per minute
#1-P(next arrival within 5 minutes)=P(at least 5 minutes pass before the next arrival)
1-stats.expon.cdf(5,scale=1/lamb)

0.03567399334725241

**_2.4_** Find the probability that the next vehicle will arrive between 2 and 10 minutes from now. 


In [43]:
#The waiting time until the next arrival follows an exponential distribution
lamb=40/60 #arrival rate per minute

stats.expon.cdf(10,scale=1/lamb)-stats.expon.cdf(2,scale=1/lamb)

0.2623245043143869

**_2.5_** Find the probability that at least 7 minutes will pass before the next arrival, given that 2 minutes have already passed. Compare this answer to **_2.3_**. This is an example of the *memoryless* property of the exponential distribution.


$P(Y\geq 7|Y\geq 2) = \frac{P(Y\geq 7, Y\geq 2)}{P(Y\geq 2)} = \frac{P(Y\geq 7)}{P(Y\geq 2)}$, where $Y$ is the time in minutes until the next arrival.

In the numerator, $P(Y\geq 7, Y\geq 2)=P(Y\geq 7)$ because $Y$ is at least both 7 and 2 exactly when $Y\geq 7$.

In [52]:
#cdf gives P(Y<=y), so 1-cdf gives P(Y>y); for a continuous Y this equals P(Y>=y)
(1-stats.expon.cdf(7,scale=1/lamb))/(1-stats.expon.cdf(2,scale=1/lamb))

0.03567399334725243
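
This matches the answer to **_2.3_**, which is what the memoryless property predicts: having already waited 2 minutes does not change the distribution of the remaining wait. A minimal sketch of that comparison, using a frozen exponential and illustrative variable names:

```python
from scipy import stats

lamb = 40 / 60                      # arrivals per minute
wait = stats.expon(scale=1 / lamb)  # time until the next arrival, in minutes

conditional = wait.sf(7) / wait.sf(2)  # P(Y >= 7 | Y >= 2)
unconditional = wait.sf(5)             # P(Y >= 5), the answer to 2.3

print(conditional, unconditional)
```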

**_2.6_** Fill in the blank. There is a probability of 90% that the next vehicle will arrive within __ minutes. This value is known as the 90th percentile of the random variable. 

In [61]:
stats.expon.ppf(.9,scale=1/lamb)

3.453877639491069
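
The percentile can also be found by hand, because the exponential cdf $1-e^{-\lambda t}$ is easy to invert. The sketch below solves $1-e^{-\lambda t}=0.9$ for $t$ and compares the result with `ppf`; the variable names are illustrative.

```python
import math
from scipy import stats

lamb = 40 / 60  # arrivals per minute

# Solve 1 - exp(-lamb * t) = 0.9 for t
t_closed_form = -math.log(1 - 0.9) / lamb
t_scipy = stats.expon.ppf(0.9, scale=1 / lamb)

print(t_closed_form, t_scipy)

# Plugging the percentile back into the cdf should recover 0.9
check = stats.expon.cdf(t_scipy, scale=1 / lamb)
print(check)
```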

#### Problem 3

Suppose there are 12 male and 7 female cadets in a classroom. I select 5 completely at random (without replacement). 

**_3.1_** Find the probability I select no female cadets. 

**_3.2_** Find the probability I select more than 2 female cadets. 

**_3.1_** Find the probability I select no female cadets. 

In [62]:
TotalPopulation=12+7
successes=7
sampleSize=5
sampleSuccess=0
stats.hypergeom.pmf(sampleSuccess,M=TotalPopulation,n=successes,N=sampleSize)

0.06811145510835913

In [67]:
(12/19)*(11/18)*(10/17)*(9/16)*(8/15)

0.06811145510835913
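
The same probability can be written as a ratio of counts: the number of ways to choose all 5 cadets from the 12 males, divided by the number of ways to choose any 5 from all 19. A short sketch using the standard library's `math.comb`:

```python
from math import comb

# Ways to pick 5 of the 12 males, out of all ways to pick 5 of the 19 cadets
p_no_female = comb(12, 5) / comb(19, 5)
print(p_no_female)
```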

**_3.2_** Find the probability I select more than 2 female cadets. 

In [66]:
TotalPopulation=12+7
successes=7
sampleSize=5
sampleSuccess=2 #P(X>2)=1-P(X<=2), so the cdf is evaluated at 2
1-stats.hypergeom.cdf(sampleSuccess,M=TotalPopulation,n=successes,N=sampleSize)

0.23658410732714208
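
As a check on the complement approach, the probability of more than 2 female cadets can also be summed directly over the outcomes 3, 4, and 5. This sketch keeps `scipy`'s argument names `M`, `n`, and `N`; the result name is illustrative.

```python
from scipy import stats

M, n, N = 19, 7, 5  # population size, females in the population, sample size

# P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5)
p_more_than_two = sum(stats.hypergeom.pmf(k, M, n, N) for k in range(3, N + 1))
print(p_more_than_two)
```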

#### Problem 4

Suppose PFT scores in the cadet wing follow a normal distribution with mean 330 and standard deviation 50. 

**_4.1_** Find the probability a randomly selected cadet has a PFT score higher than 450. 

**_4.2_** Find the probability a randomly selected cadet has a PFT score within 2 standard deviations of the mean.

**_4.3_** Find $a$ and $b$ such that 90% of PFT scores will be between $a$ and $b$. 

**_4.4_** Find the probability a randomly selected cadet has a PFT score higher than 450 given he/she is among the top 10% of cadets. 

**_4.1_** Find the probability a randomly selected cadet has a PFT score higher than 450. 

In [71]:
x=450 #score threshold
mu=330 #mean
sigma=50 #std deviation

#side note: create a frozen distribution with these parameters
dist=stats.norm(mu,sigma)

print(1-dist.cdf(x))
#should be equivalent to:
1-stats.norm.cdf(x,mu,sigma)

0.008197535924596155


0.008197535924596155

**_4.2_** Find the probability a randomly selected cadet has a PFT score within 2 standard deviations of the mean.


In [73]:
#This is the area under the density curve between mu-2*sigma and mu+2*sigma
dist.cdf(mu+2*sigma)-dist.cdf(mu-2*sigma)

0.9544997361036416

**_4.3_** Find $a$ and $b$ such that 90% of PFT scores will be between $a$ and $b$. 


In [77]:
print('a=',dist.ppf(.05)) #5% of scores are below this number
print('b=',dist.ppf(.95)) #95% of scores are below this number

#For comparison only: the problem doesn't require the interval to be centered on the mean,
#so a one-sided interval also works.
print(dist.ppf(.9)) #90% of scores are below this number

a= 247.75731865242636
b= 412.2426813475736
394.07757827723003
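
Frozen distributions in `scipy` also provide an `interval` method that returns the endpoints of the central interval, leaving equal probability in each tail. A brief sketch, with `scores` as an illustrative name for the PFT distribution:

```python
from scipy import stats

scores = stats.norm(330, 50)  # PFT scores: mean 330, standard deviation 50

# interval(0.90) returns the endpoints that leave 5% in each tail
a, b = scores.interval(0.90)
print(a, b)
```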


**_4.4_** Find the probability a randomly selected cadet has a PFT score higher than 450 given he/she is among the top 10% of cadets. 

$P(\text{PFT}>450 \mid \text{top } 10\%)=\frac{P(\text{PFT}>450,\ \text{top } 10\%)}{P(\text{top } 10\%)}$

In [80]:
PgreaterThan450=1-dist.cdf(450)
PgreaterThan450

0.008197535924596155

In [82]:
#450 is above the top-10% cutoff (dist.ppf(.9) is about 394), so every score above 450 is
#already in the top 10%. The joint probability is therefore just P(PFT>450), and P(top 10%)=.1
PgreaterThan450/.1

0.08197535924596155
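
The reasoning can be spelled out in code by computing the top-10% cutoff explicitly and taking the joint event to be a score above whichever threshold is larger. This is a sketch with illustrative variable names, not the only way to set it up.

```python
from scipy import stats

dist = stats.norm(330, 50)  # PFT scores: mean 330, standard deviation 50

cutoff = dist.ppf(0.90)           # score that separates the top 10% from the rest
joint = dist.sf(max(450, cutoff)) # P(PFT > 450 and PFT > cutoff)
conditional = joint / dist.sf(cutoff)

print(cutoff)
print(conditional)
```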

#### Problem 5

Suppose time until computer errors on the F-35 follows a Gamma distribution with mean 20 hours and variance 10.  

**_5.1_** Find the probability that 50 hours pass without a computer error. 

**_5.2_** Find the probability that 75 hours pass without a computer error, given that 25 hours have already passed. Does the memoryless property apply to the Gamma distribution? 

**_5.3_** Find $a$ and $b$: There is a 95% probability time until next computer error will be between $a$ and $b$.  

**_5.1_** Find the probability that 50 hours pass without a computer error. 

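$\text{Mean}=\frac{\alpha}{\lambda}$ and $\text{Variance}=\frac{\alpha}{\lambda^2}$. Dividing the mean by the variance gives $\lambda=\frac{20}{10}=2$, and then $\alpha=20\lambda=40$.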

In [87]:
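#Given the mean and variance, solve alpha/lambda=20 and alpha/lambda^2=10 by hand
lamb=2 #rate: mean/variance
alpha=40 #shape: mean*lamb

#scipy's gamma takes (x, a, loc, scale), so the scale 1/lambda must be passed by keyword;
#a third positional argument would be read as loc
stats.gamma.sf(50,alpha,scale=1/lamb) #sf is 1-cdf and stays accurate for tiny tail probabilities

The result is essentially zero (on the order of $10^{-12}$): 50 hours is about 9.5 standard deviations ($\sqrt{10}\approx 3.16$ hours) above the mean of 20 hours.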
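
A frozen distribution makes it easy to confirm that a choice of shape and scale reproduces the stated mean and variance. The sketch below assumes the shape-rate parameterization given above and passes the scale by keyword, since `scipy.stats.gamma` takes its arguments in the order `(a, loc, scale)`. The variable names are illustrative.

```python
from scipy import stats

mean, variance = 20, 10

# mean = alpha/lamb and variance = alpha/lamb**2, so dividing gives lamb
lamb = mean / variance  # rate
alpha = mean * lamb     # shape

# scale must be passed by keyword; the third positional argument is loc
time_to_error = stats.gamma(alpha, scale=1 / lamb)

print(alpha, lamb)
print(time_to_error.mean(), time_to_error.var())
```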

**_5.2_** Find the probability that 75 hours pass without a computer error, given that 25 hours have already passed. Does the memoryless property apply to the Gamma distribution? 

In [90]:
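#P(X>=75 | X>=25)=P(X>=75)/P(X>=25); pass the scale by keyword and use sf (1-cdf)
#so the tiny upper-tail probability does not round to zero
stats.gamma.sf(75,alpha,scale=1/lamb)/stats.gamma.sf(25,alpha,scale=1/lamb)

With the scale passed by keyword, this conditional probability is vanishingly small (far below $10^{-20}$). Memorylessness would require it to equal $P(X\geq 50)$ from **_5.1_**; it does not, so the memoryless property does not apply to this Gamma distribution.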
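
To answer the memoryless question directly, compare the conditional probability with the unconditional probability of lasting 50 hours. This sketch assumes shape 40 and rate 2 (from the mean of 20 and variance of 10) and uses illustrative variable names.

```python
from scipy import stats

alpha, lamb = 40, 2  # shape and rate, assumed from mean 20 and variance 10
time_to_error = stats.gamma(alpha, scale=1 / lamb)  # scale = 1/rate, by keyword

conditional = time_to_error.sf(75) / time_to_error.sf(25)  # P(X >= 75 | X >= 25)
unconditional = time_to_error.sf(50)                       # P(X >= 50)

print(conditional, unconditional)
# A memoryless distribution would make these two values equal
print(conditional == unconditional)
```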

**_5.3_** Find $a$ and $b$: There is a 95% probability time until next computer error will be between $a$ and $b$.  

In [93]:
#95% of the probability lies between these two points (2.5% in each tail), like the 90% interval above.
#The scale is passed by keyword; a third positional argument would be read as loc.
print('Left end:',round(stats.gamma.ppf(.025,alpha,scale=1/lamb),4))
print('Right end:',round(stats.gamma.ppf(.975,alpha,scale=1/lamb),4))

Left end: 14.2883
Right end: 26.6571
