***Product Defects***

You are in charge of monitoring the number of defective products from a specific factory. You’ve been told that the number of defects on a given day follows the Poisson distribution with the rate parameter (lambda) equal to 7. You’re new here, so you want to get a feel for what it means to follow the Poisson(7) distribution. You remember that the Poisson distribution is special because the rate parameter represents the expected value of the distribution, so in this case, the expected value of the Poisson(7) distribution is 7 defects per day.

You will investigate certain attributes of the Poisson(7) distribution to get an intuition for how many defective objects you should expect to see in a given amount of time. You will also practice and apply what you know about the Poisson distribution on a practice data set that you will simulate yourself.

In [1]:
import scipy.stats as stats
import numpy as np
## Task 1: 
# define expected value (lambda value for poisson distribution)
lam = 7

You know that the rate parameter of a Poisson distribution is equal to the expected value. So in our factory, the rate parameter would equal the expected number of defects on a given day. You are curious about how often we might observe the exact expected number of defects.

Calculate and print the probability of observing exactly "lam" defects on a given day.

In [2]:
print("The probability of observing exactly {} defects on a given day is {}".format(lam, round(stats.poisson.pmf(lam, lam), 3)))

The probability of observing exactly 7 defects on a given day is 0.149
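
For intuition about what `stats.poisson.pmf` returns, the same probability can be computed directly from the Poisson formula P(X = k) = e^(-lambda) * lambda^k / k!. The sketch below uses only the standard library. The helper name `poisson_pmf` is made up for illustration and is not part of SciPy.

```python
import math

def poisson_pmf(k, lam):
    """Hypothetical helper: P(X = k) for X ~ Poisson(lam), straight from the formula."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 7
p_exact = poisson_pmf(lam, lam)
print(round(p_exact, 3))  # 0.149, the same value SciPy reports
```

Even the single most likely outcome happens on only about 15% of days, so seeing a count other than 7 is the norm rather than the exception.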


Our boss said that having 4 or fewer defects on a given day is an exceptionally good day. You are curious about how often that might happen.

Calculate and print the probability of having one of these days.

In [3]:
print("The probability of having 4 defects or fewer a day is {}".format(round(stats.poisson.cdf(4, lam),3)))

The probability of having 4 defects or fewer a day is 0.173
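
The CDF at 4 is simply the sum of the probabilities of seeing 0, 1, 2, 3, or 4 defects. A minimal sketch of that sum, again with hypothetical helper names and only the standard library:

```python
import math

def poisson_pmf(k, lam):
    """Hypothetical helper: P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_cdf(k, lam):
    """Hypothetical helper: P(X <= k), adding up the pmf over 0..k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 7
# Show how much each count from 0 to 4 contributes to the total.
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))

good_day = poisson_cdf(4, lam)
print(round(good_day, 3))  # 0.173
```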


On the other hand, our boss said that having more than 9 defects on any given day is considered a bad day.

Calculate and print the probability of having one of these bad days.

In [4]:
print("The probability of having 10 defects or more a day is {}".format(round(1 - stats.poisson.cdf(9, lam),3)))

The probability of having 10 defects or more a day is 0.17


You’ve familiarized yourself a little with how the Poisson distribution works in theory by calculating different probabilities. Now let’s look at what this might look like in practice.

Create a variable that has 365 random values from the Poisson distribution (simulated year).

In [5]:
year_defects = stats.poisson.rvs(lam, size = 365)
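
Because the values are random, every run of the cell above produces a different year. If you want results you can reproduce, one option is to draw from a seeded generator. The sketch below uses NumPy's `default_rng` rather than `stats.poisson.rvs` (SciPy's `rvs` also accepts a `random_state` argument). The seed value and the variable names are arbitrary choices for illustration, not part of the original exercise.

```python
import numpy as np

lam = 7
seed = 42  # arbitrary illustrative seed, not from the original notebook

rng = np.random.default_rng(seed)
year_defects_seeded = rng.poisson(lam, size=365)

# A second generator built from the same seed reproduces the same year exactly.
repeat = np.random.default_rng(seed).poisson(lam, size=365)
print(np.array_equal(year_defects_seeded, repeat))  # True
```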

Let’s take a look at our new simulated dataset. Print the first 20 values in this data set.

In [6]:
print(year_defects[0:20])

[ 5  8  5  8  8  8  6  6  5  6 10  7  6  8  7  7  7  5 10  5]


If we expect 7 defects on a given day, what is the total number of defects we would expect over 365 days?

Calculate and print this value to the output terminal.

In [7]:
print("The total number of defects we would expect over 365 days is {}".format(lam*365))

The total number of defects we would expect over 365 days is 2555


Calculate and print the total sum of the simulated dataset. How does this compare to the total number of defects we expected over 365 days?

In [8]:
print("The total sum of randomised defects for the year is {}".format(year_defects.sum()))

The total sum of randomised defects for the year is 2554


Calculate and print the average number of defects per day from our simulated dataset.

How does this compare to the expected average number of defects each day that we know from the given rate parameter of the Poisson distribution?

In [9]:
print("The average number of defects per day from our randomised defects dataset is {}".format(round(year_defects.mean(),1)))

The average number of defects per day from our randomised defects dataset is 7.0
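
How close should the simulated total and average land to their expected values? For a Poisson distribution the variance equals the rate parameter, and a sum of independent Poisson counts is itself Poisson, so the yearly total behaves like a Poisson(7 * 365) variable. A rough sketch of the typical spread, assuming the days are independent:

```python
import math

lam = 7
days = 365

expected_total = lam * days            # expected yearly total
sd_total = math.sqrt(lam * days)       # variance of a Poisson equals its mean
se_daily_mean = math.sqrt(lam / days)  # standard error of the daily average

print(expected_total, round(sd_total, 1), round(se_daily_mean, 2))  # 2555 50.5 0.14
```

With a standard deviation of about 50 defects, a simulated total within one defect of 2555 is an unusually close match. Totals anywhere between roughly 2450 and 2650 would be unremarkable.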


You’re worried about what the highest number of defects in a single day might be because that would be a hectic day.

Print the maximum value of the simulated year.

In [10]:
worst_day = year_defects.max()
print("The highest amount of defects in a single day (worst day) from our randomised defects dataset is {}".format(worst_day))

The highest amount of defects in a single day (worst day) from our randomised defects dataset is 15


It would probably be super busy if there were that many defects on a single day. Hopefully, it is a rare event!

Calculate and print the probability of observing that maximum value or more from the Poisson(7) distribution.

In [11]:
# P(X >= worst_day) = 1 - P(X <= worst_day - 1), so the CDF is evaluated one step below the maximum
print("The probability of having the highest number of defects or more in a single day is {}".format(round(1 - stats.poisson.cdf(worst_day - 1, lam), 4)))

The probability of having the highest number of defects or more in a single day is 0.0057


Let’s say we want to know how many defects in a given day would put us in the 90th percentile of the Poisson(7) distribution. One way we could calculate this is by using the following method:

stats.poisson.ppf(percentile, lambda) 
percentile is equal to the desired percentile (a decimal between 0 and 1), and lambda is the lambda parameter of the Poisson distribution. This function is essentially the inverse of the CDF.

Use this method to calculate and print the number of defects that would put us in the 90th percentile for a given day. In other words, on at least 90% of days, we will observe this many defects or fewer.

In [12]:
percentile = 0.9
xth_per_defects = stats.poisson.ppf(percentile, lam)
print("The number of defects that would put us in the {}th percentile for a given day is {}".format(int(percentile*100), xth_per_defects))

The number of defects that would put us in the 90th percentile for a given day is 10.0
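
For a discrete distribution, the "inverse of the CDF" is by the usual convention (which `stats.poisson.ppf` follows) the smallest count whose cumulative probability reaches the requested level. A minimal sketch of that search, using hypothetical helper names and only the standard library:

```python
import math

def poisson_pmf(k, lam):
    """Hypothetical helper: P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_ppf(q, lam):
    """Hypothetical helper: smallest k with P(X <= k) >= q, for 0 < q < 1."""
    k = 0
    cumulative = poisson_pmf(0, lam)
    while cumulative < q:
        k += 1
        cumulative += poisson_pmf(k, lam)
    return k

lam = 7
print(poisson_ppf(0.9, lam))  # 10
```

The cumulative probability is about 0.83 at 9 defects and about 0.90 at 10, so 10 is the first count that reaches the 90% mark.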


Now let’s see what proportion of our simulated dataset is greater than or equal to the number we calculated in the previous step.

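By the definition of a percentile, we might expect about 1 - 0.90, or 10%, of days to fall in this range. Because the Poisson distribution is discrete, though, this range includes the percentile value itself, so the true probability is P(X >= 10) = 1 - P(X <= 9), which is roughly 17%.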

To calculate this:
Count the number of values in the dataset that are greater than or equal to the 90th percentile value.
Divide this number by the length of the dataset.

In [15]:
counter = 0
for i in year_defects:
  if i >= xth_per_defects:
    counter += 1
print("The percentage of our simulated dataset that is greater than or equal to the 90th percentile value is {}".format(round(counter/len(year_defects)*100,1)))
# Simpler way to calculate the same percentage: compare the whole array at once and sum the True values
print("The same percentage using a different calculation method is " + str(round(sum(year_defects >= xth_per_defects)/len(year_defects)*100,1)))

The percentage of our simulated dataset that is greater than or equal to the 90th percentile value is 16.4
The same percentage using a different calculation method is 16.4
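
The gap between the simulated 16.4% and a naive 10% can be checked against the theoretical distribution. Counting days with 10 or more defects includes the percentile value itself, and a single value of a discrete distribution can carry a lot of probability. A small sketch of both tail probabilities, with hypothetical helper names and only the standard library:

```python
import math

def poisson_pmf(k, lam):
    """Hypothetical helper: P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_cdf(k, lam):
    """Hypothetical helper: P(X <= k)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 7
threshold = 10  # the 90th percentile value found earlier

p_at_least = 1 - poisson_cdf(threshold - 1, lam)  # P(X >= 10)
p_more_than = 1 - poisson_cdf(threshold, lam)     # P(X > 10)
print(round(p_at_least, 3), round(p_more_than, 3))  # 0.17 0.099
```

So about 17% of days are expected to have 10 or more defects, which is close to what the simulation shows, while only about 10% are expected to have strictly more than 10.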
