## Discrete Distributions

A random experiment can produce two types of outcomes: discrete or continuous. We will start by describing discrete outcomes. To describe them we can use a mathematical function f(x) that gives the probability of each possible outcome in a trial or a number of trials.

For discrete distributions, these functions are called Probability Mass Functions (PMFs).

## Probability Mass Function (PMF)

A Probability Mass Function gives the probability of each value that a discrete random variable can take. Together, these probabilities describe the distribution of the data, and they can be presented as a formula, a table, or a graph. This function has three properties:


1. $P(X = x) = f(x) > 0$ if $x \in S$


2. $ \sum_{x\in S}f(x) = 1 $


3. $ P(X \in A) = \sum_{x\in A}f(x)$


Property 3 means that to find the probability of any event A, you must sum the probabilities of the x values in A.

#### PMFs can be represented in multiple ways (as a table, a graph, or a formula).

Example:

In a class we count the number of siblings of each student. Let X be the number of siblings (a random variable) with possible values {0,1,2,3}. We obtained this result:

PMF = 


|X      |0    |1    |2    |3    |
|:------|:----|:----|:----|:----|
|p(X)   |0.35 |0.40 |0.20 |0.05 |



In [None]:
x = c(0.35,0.40,0.20,0.05) #probabilities for 0, 1, 2 and 3 siblings
plot(0:3,x, type = 'h', xlab = 'Number of siblings', ylab = 'p(X)')
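
The three PMF properties can be checked numerically on the sibling table. The following is only a minimal sketch: the names `siblings` and `p_siblings` and the event A = {1, 2} are assumptions introduced here for illustration.

```r
# Hypothetical names (not from the original notebook): siblings, p_siblings, A
siblings   <- 0:3
p_siblings <- c(0.35, 0.40, 0.20, 0.05)

# Property 1: every outcome in S has positive probability
all_positive <- all(p_siblings > 0)

# Property 2: the probabilities sum to 1 (compared with a tolerance, not ==)
sums_to_one <- isTRUE(all.equal(sum(p_siblings), 1))

# Property 3: P(X in A) is the sum of f(x) over the x values in A
A   <- c(1, 2)                               # event: "one or two siblings"
p_A <- sum(p_siblings[siblings %in% A])

all_positive
sums_to_one
p_A
```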

Example 2: I toss a fair coin twice, and let X be defined as the number of heads I observe. Find the range of X, $R_X$, as well as its probability mass function $P_X$.

What is the sample space?

What is $P_X(k)$?
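
One possible way to work through this example is to enumerate the sample space in R and count the heads in each outcome. This is a sketch rather than the only approach; the names `tosses`, `n_heads` and `coin_pmf` are made up for illustration.

```r
# Enumerate the sample space of two tosses: HH, TH, HT, TT
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)

# X = number of heads in each outcome, so the range of X is {0, 1, 2}
n_heads <- rowSums(tosses == "H")

# For a fair coin each of the 4 outcomes has probability 1/4,
# so the PMF is the share of outcomes that give each value of X
coin_pmf <- table(n_heads) / nrow(tosses)
coin_pmf
```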

### Cumulative Distributions

Another way to represent a discrete distribution is with a step function called the Cumulative Distribution Function (CDF).

The CDF is formally defined as $F_x(t) = P(X \leq t)$. (Note the difference in notation: a CDF is written with a capital F, whereas a PMF is written with a lowercase f.)

This definition matters because it makes the CDF a function of the threshold t: for each value of t, it accumulates the probability of every outcome less than or equal to t.

Then

$$F_x(t) = \sum_{x_j \leq t} p(x_j)$$

In [None]:
cumsum(x)
plot(0:3,cumsum(x),type = 's', ylim = c(0,1), xlab = 'Number of siblings', ylab = 'F(x)')

#### **Note that the range for the y axis in a CDF always goes from 0 to 1.**
___

## Discrete distributions

We will explore the most commonly used discrete distributions. Our description of each distribution will revolve around four main properties.

PMF = The function that gives the probability of each outcome of a random trial.

CDF = The cumulative function, which shows how probability accumulates as the value of the random variable increases.

Mean = $\mu$ = The center of the distribution  

Standard Deviation = $\sigma$ = The spread of the distribution  



#### Bernoulli Trial

A Bernoulli trial is one, and only one, random experiment with two outcomes, e.g. success/failure, yes/no, on/off. In this experiment the probability of success or failure doesn't change from trial to trial.


## Bernoulli Distribution

The Bernoulli distribution is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p.


The distribution of heads and tails in coin tossing is an example of a Bernoulli distribution with p=q=1/2. The Bernoulli distribution is the simplest discrete distribution, and it is the building block for other more complicated discrete distributions.

Because this distribution deals with only one trial, its CDF is a trivial step function (0 for k < 0, 1-p for 0 ≤ k < 1, and 1 for k ≥ 1), so we do not discuss it further.

$$ \text{PMF} = f(k;p) = \begin{cases} p & \text{if } k = 1, \\ 1-p & \text{if } k = 0. \end{cases}$$

This can also be written as $f(k;p) = p^k(1-p)^{1-k}$.

$$\mu = p$$

$$\sigma = \sqrt{p(1-p)}$$
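
As a quick sanity check of these formulas, we can simulate Bernoulli trials in R. This is a sketch under assumed values: p = 0.3 and the seed are arbitrary choices, and a Bernoulli draw is generated as a binomial draw with `size = 1`.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
p_bern <- 0.3                # assumed success probability (illustrative)

# A Bernoulli draw is a binomial draw with size = 1
draws <- rbinom(10000, size = 1, prob = p_bern)

mean(draws)                       # should be close to p
sd(draws)                         # should be close to sqrt(p * (1 - p))
sqrt(p_bern * (1 - p_bern))

# The compact PMF formula agrees with dbinom(k, 1, p) for k = 0 and k = 1
k_bern   <- 0:1
bern_pmf <- p_bern^k_bern * (1 - p_bern)^(1 - k_bern)
all.equal(bern_pmf, dbinom(k_bern, size = 1, prob = p_bern))
```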

## Geometric Distribution

The next simple case of a discrete distribution arises when we run multiple Bernoulli trials until we observe the first success.


The PMF (which is derived from the geometric series - discrete growth) is given by:

$f(x)=P(X=x)=(1-p)^{x-1}p $

for $0 < p < 1$ and $x = 1, 2, \ldots$, where the geometric random variable X is the number of the trial on which the first success occurs.


The CDF is:

$F(x)=P(X \leq x)=1-(1-p)^x $

The mean is:

$\mu = E(X) = \frac {1}{p}$

The variance:

$\sigma^2 = \frac {1-p}{p^2}$
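
The formulas above can be compared against R's built-in geometric functions. Note that `dgeom()` and `pgeom()` count the number of failures before the first success, so a first success on trial x corresponds to x - 1 failures. The sketch below assumes p = 0.2, and all variable names are illustrative.

```r
p_geo <- 0.2                   # assumed success probability (illustrative)
trial <- 1:10                  # trial on which the first success occurs

# PMF and CDF written out from the formulas above
pmf_formula <- (1 - p_geo)^(trial - 1) * p_geo
cdf_formula <- 1 - (1 - p_geo)^trial

# R counts failures before the first success, hence the shift trial - 1
all.equal(pmf_formula, dgeom(trial - 1, prob = p_geo))
all.equal(cdf_formula, pgeom(trial - 1, prob = p_geo))

# Mean and variance approximated by summing over a long range of trials
trials_long <- 1:2000
f_long      <- (1 - p_geo)^(trials_long - 1) * p_geo
mu_approx   <- sum(trials_long * f_long)                    # close to 1 / p
var_approx  <- sum((trials_long - mu_approx)^2 * f_long)    # close to (1 - p) / p^2
c(mu_approx, var_approx)
```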

---

Example 1:

Products produced by a machine have a 3% defective rate.

What is the probability that the first defective occurs in the fifth item inspected?

P(X=5)  = P(first 4 not defective)*P(5th defective)

        = (0.97)^4 * (0.03)
        = 0.0266
        ≈ 2.7%

In [None]:
dgeom(4,0.03) #4 is the number of failures before the first success (here the first success is 
#finding the first defective item)

What is the probability that the first defective occurs in the first five inspections?

P(X<=5) = 1 - P(first 5 non-defective)
        = 1 - 0.97^5
        = 0.1413


This is the cumulative probability that the first defective appears on any one of the first five inspections. It is about 14%, slightly below the rough estimate of 15% (3%*5).


In [None]:
pgeom(4,0.03) #P(at most 4 failures before the first success) = P(X<=5)

Example 2:

A patient is waiting for a suitable matching kidney donor for a transplant. If the probability that a randomly selected donor is a suitable match is p=0.1, what is the expected number (*This means what is the average number*) of donors who will be tested before a matching donor is found?

Intuitively, if the probability is 0.1 then 1 in 10 donors is suitable, so the expected number of donors who will be tested before a matching donor is found should be 9. We can use the mean to get the same answer.

$\mu = E(X) = \frac {1}{p}$ is the expected number of trials up to and including the first success. The expected number of failures before the first success is

$\mu = E(X) = \frac {1-p}{p}$

$\mu = E(X) = \frac {1-0.1}{0.1}$ = 9
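
A simulation can be used to check this result. The sketch below relies on the fact that `rgeom()` returns the number of failures before the first success; the seed, the number of draws and the variable names are arbitrary choices made for illustration.

```r
set.seed(42)                           # arbitrary seed for reproducibility
p_match <- 0.1

# Each draw is the number of non-matching donors tested before a match
non_matches <- rgeom(100000, prob = p_match)

mean(non_matches)        # should be close to (1 - p) / p = 9
mean(non_matches + 1)    # donors tested including the match, close to 1 / p = 10
```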



Let’s generate 100 random samples where the probability of success on any given trial is 1/2, as if we were repeatedly flipping a coin and recording how many heads we got before we got a tail.

In [None]:
sample <- rgeom(100, 1/2)
summary(sample)
sd(sample)

In [None]:
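#the breaks must cover the largest value drawn, otherwise hist() stops with an error
hist(sample, breaks=seq(-0.5, max(sample) + 0.5, 1), col='light grey', border='grey', xlab="")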

---

## Binomial Distribution (Sampling with replacement)

**Binomial distributions are defined as a collection of Bernoulli trials only if each individual trial is independent.** 

Similar to the Bernoulli distribution, the binomial distribution pertains to random experiments with two possible outcomes: Success (S) and Failure (F). Thus, for any random variable X we can assign x = 1 for a success and x = 0 for a failure.

If P(S) = p then P(F) = 1-p.

The PMF for one trial is

$$f_x(k) = p^k (1-p)^{1-k}$$

The binomial model has three properties:

* It uses multiple Bernoulli trials (n times)
* The trials are independent
* P(S) stays the same from trial to trial

If X counts the number of successes in the n independent trials, then the PMF of X is

$$f_x(k) = {n \choose k} p^k (1-p)^{n-k}$$ 

The mean of the distribution is
$$\mu = np$$

The standard deviation of the distribution is

$$\sigma = \sqrt{np(1-p)}$$
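
The PMF, mean and standard deviation above can be verified numerically. This is a minimal sketch with assumed values n = 10 and p = 0.3 (they are not taken from the text), and the variable names are illustrative.

```r
n_trials  <- 10        # assumed number of trials (illustrative)
p_success <- 0.3       # assumed success probability (illustrative)
k_vals    <- 0:n_trials

# PMF written out from the formula, compared with R's dbinom()
binom_pmf <- choose(n_trials, k_vals) * p_success^k_vals *
  (1 - p_success)^(n_trials - k_vals)
all.equal(binom_pmf, dbinom(k_vals, size = n_trials, prob = p_success))

sum(binom_pmf)                                          # a valid PMF sums to 1

# Mean and standard deviation computed directly from the PMF
mu_binom    <- sum(k_vals * binom_pmf)                      # n * p
sigma_binom <- sqrt(sum((k_vals - mu_binom)^2 * binom_pmf)) # sqrt(n * p * (1 - p))
c(mu_binom, sigma_binom)
```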

Consider a four-child family. Each child may be either a boy (B) or a girl (G). For simplicity we suppose that P(B) = P(G) = 1/2 and that the genders of the children are determined independently. 
If we let X count the number of B’s, then X ~ binom(size = 4, prob = 1/2). 



In [None]:
##we can calculate the binomial probability of not having any boys in the family of four
##in R using the function dbinom()

dbinom(0,4,0.5)

How about finding the probability of having two boys in a family of four? P(X = 2) is

$$f_x(2) = {4 \choose 2} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^{2} = \frac{6}{2^4}$$ 

In [None]:
choose(4,2) #which calculates the combinations of two successes in four trials

6/2^4 ##probability of having two boys in a family of 4

dbinom(2,4,0.5)

Find $\mu$ and $\sigma$ for this example.

In [None]:
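#let's plot the PMF and CDF for this example

two_boys_pmf <- dbinom(0:4, size = 4, prob = 0.5)
#use 0:4 on the x axis so each bar sits at the number of boys it represents
plot(0:4, two_boys_pmf, type = "h", ylim= c(0,0.5), xlab = "Number of boys", ylab = "P(X = k)")
points(0:4, two_boys_pmf,pch=19)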

In [None]:
plot(0:4,cumsum(two_boys_pmf), type = 'l') #x axis is the number of boys, 0 to 4
#plot(0:4,cumsum(two_boys_pmf), type = 's')
#plot(0:4,pbinom(0:4,4,0.5), type = 'l')

Exercise: The CDC estimates that 22% of adults in the U.S. smoke.
If we randomly sample 10 individuals from the US population, what is the probability that 5 individuals from the sample smoke?
In this case smoking is the success and not smoking is the failure.
Plot the PMF and CDF for this distribution.

In [None]:
#dbinom(5,size=10,prob=.22)

In [None]:
#plot(pbinom(0:10,10,.22), type = 'l')

### Confidence intervals:

In most cases reporting a single probability estimate is not very reasonable, as we are dealing with random trials. It is better to report a confidence interval; normally we use a 95% confidence interval (CI).

For our previous example, calculate the 95% confidence interval for the proportion of smokers when we observe 5 smokers in our sample of 10.


In [None]:
#Generate a sequence from 0 to 1 in steps of 0.01
se = seq(0,1,by = 0.01)

#calculate the binomial probability of 5 successes in 10 trials for each candidate value of p
a = dbinom(5,10,prob = se)

#combine data
#df = as.data.frame(cbind(se,a))
#df

In [None]:
plot(se, a, type = 'l', xlab = 'p', ylab = 'P(X = 5)')
abline(h = 0.05, col ='red') #xlim is not a graphical parameter of abline(), so it is dropped

We can obtain the interval more easily in R with the binom package.

In [None]:
#install.packages("binom")
library(binom)
binom.confint(5, 10, conf.level = 0.95)
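
If the binom package is not installed, base R's `binom.test()` also reports an exact (Clopper-Pearson) confidence interval for a proportion. The sketch below is offered as an alternative, not as the method used in this notebook; it should correspond to the "exact" row of the `binom.confint()` output.

```r
# Exact (Clopper-Pearson) interval from base R; no extra package needed
res <- binom.test(x = 5, n = 10, conf.level = 0.95)
ci  <- res$conf.int
ci

# With 5 successes in 10 trials the interval is symmetric around 0.5
ci[1] + ci[2]
```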

### Comparing Distributions

We can see how two distributions differ from each other by plotting their PMFs (or histograms) on the same plot.

Let's compare two distributions from a coin flip simulation where one of the coins is biased (p = 0.3) and the other is a fair coin (p = 0.5).

We know that this type of trial (sampling with replacement) follows a binomial distribution, so we can randomly sample from the binomial distribution (rbinom).

In [None]:
experiment_num<-10000
my_size<-100

##we will run 10000 experiments of flipping 100 coins each, and get the count of successes in each experiment
fair_flips <- rbinom(n = experiment_num, size = my_size, prob = 0.5)
biased_flips <- rbinom(n = experiment_num, size = my_size, prob = 0.3)#note this is the biased coin

## We can construct histograms to represent each distribution

hist(fair_flips, col = rgb(1,1,0,0.5), xlim = c(10,70), ylim = c(0,1000), breaks = 40 ,
     main = "Histogram of biased (green) and fair (yellow) flips")
hist(biased_flips,col = rgb(0,1,0,0.5),breaks = 40,add=T) ##add = T, adds the plot to a previous plot call
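
Besides the histograms, the two simulated distributions can be compared numerically against the theoretical binomial mean and standard deviation. This is a self-contained sketch: it draws its own samples with an arbitrary seed, and the names `fair`, `biased` and `comparison` are illustrative.

```r
set.seed(123)                        # arbitrary seed for reproducibility
n_experiments <- 10000               # same settings as the simulation above
n_flips       <- 100
probs         <- c(0.5, 0.3)         # fair coin, biased coin

fair   <- rbinom(n_experiments, size = n_flips, prob = probs[1])
biased <- rbinom(n_experiments, size = n_flips, prob = probs[2])

# Simulated vs. theoretical mean (n * p) and sd (sqrt(n * p * (1 - p)))
comparison <- data.frame(
  coin      = c("fair", "biased"),
  sim_mean  = c(mean(fair), mean(biased)),
  theo_mean = n_flips * probs,
  sim_sd    = c(sd(fair), sd(biased)),
  theo_sd   = sqrt(n_flips * probs * (1 - probs))
)
comparison
```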





In [None]:
# with plot = FALSE, hist() returns a list holding the breaks and the counts in each bin
ff = hist(fair_flips, breaks = 40 ,plot = FALSE)
bf = hist(biased_flips,breaks = 40,plot = FALSE)
names(ff)
ff$counts


In [None]:
my_xlim<-range(c(ff$breaks, bf$breaks))
my_ylim<-range(c(ff$counts, bf$counts))
hist(fair_flips, breaks = 40, 
     xlim= my_xlim, ylim=my_ylim, 
     main="Histogram of coin flips", 
     xlab=paste("Heads out of", my_size, "flips"))

b<-par('usr') ##let's save the coordinates of the previous plot



In [None]:
rect(b[1], b[3], b[2], b[4], col="gray")##using the previously recorded coordinates, let's draw a nicer plot
plot(ff, col=rgb(1,0,0,.6), add=TRUE)
plot(bf, col=rgb(0,0,1,.6), add=TRUE)

In [None]:
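##Let's construct the PMFs with size = 100 so the plot stays readable (you can try a larger size, but the idea is the same)
##k = 0 is left out because its probability is negligible for both coins

fair_flips_pmf <- dbinom(1:100, size = 100, prob = 0.5)
biased_flips_pmf <- dbinom(1:100, size = 100, prob = 0.3)
##use a shared y axis so both PMFs are drawn on the same scale
plot(fair_flips_pmf, type = "h", ylim = range(c(fair_flips_pmf, biased_flips_pmf)))
points(fair_flips_pmf,pch=19)
lines(biased_flips_pmf, type = "h", col = "red")
points(biased_flips_pmf,pch=19, col = "red")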

In [None]:
##or in two different plots

par(mfrow=c(1,2))
plot(fair_flips_pmf, type = "h")
points(fair_flips_pmf,pch=19)
plot(biased_flips_pmf, type = "h")
points(biased_flips_pmf,pch=19)
par(mfrow=c(1,1))

In [None]:
##And the CDF

plot(cumsum(fair_flips_pmf), type = "l")
lines(cumsum(biased_flips_pmf), col = "red")

In [None]:
###Which is the same as

plot(pbinom(1:100,100,0.5), type = "l")
lines(pbinom(1:100,100,0.3), col = "red")