# Chapter 4. Theoretical Distributions

## Imports

In [None]:
import math
import numpy as np
from scipy.stats import norm, binom
from typing import Dict, List, Tuple

# Warning regarding solutions

NO GUARANTEE THAT THE SOLUTIONS WILL WORK OR WORK CORRECTLY! USE THEM AT YOUR OWN RISK!

THE ANSWERS PROVIDED BELOW MAY BE WRONG. USE THEM AT YOUR OWN RISK!

## Exercises

### Exercise 4.1

Assuming that the height of adult males has a Normal distribution,
what proportion of males will be more than two standard deviations
above the mean height?

#### Ex 4.1. Solution with scipy

In [None]:
1 - norm.cdf(x=2, loc=0, scale=1)

#### Answer to Exercise 4.1

Approximately 0.02275013194817921, i.e. approx. 2.3% of men.

### Exercise 4.2

The probability of being blood group B is 0.08. What is the
probability that if one pint of blood is taken from each of
100 unrelated blood donors fewer than three pints of group B blood will be obtained?

#### Ex. 4.2. Solution with scipy

In [None]:
binom.cdf(k=2, n=100, p=0.08)

#### Ex. 4.2. Solution by running n simulations

In [None]:
def get_n_blood_pints(n: int = 100) -> np.ndarray:
    # 0 - blood group B, 1 - other blood group
    return np.random.choice(
        a=np.repeat(a=[0, 1], repeats=[8, 92]),
        size=n, replace=True)

In [None]:
def get_counts(vector: np.ndarray) -> Dict[int, int]:
    uniques, counts = np.unique(vector, return_counts=True)
    return {k: v for (k, v) in zip(uniques, counts)}

In [None]:
def is_gr_b_found(less_than: int = 3) -> bool:
    return get_counts(get_n_blood_pints()).get(0, 0) < less_than

In [None]:
def get_n_simulations_for_gr_b(n_simuls: int = 1000) -> List[bool]:
    return [is_gr_b_found() for _ in range(n_simuls)]

In [None]:
def get_probability_of_gr_b(n_simuls: int = 10_000) -> float:
    return np.mean(np.array(get_n_simulations_for_gr_b(n_simuls)))

In [None]:
get_probability_of_gr_b()

#### Answer to Exercise 4.2

The probability is around 0.011, i.e. approx. 1.1%.
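As a cross-check on the scipy result, the same probability can be summed by hand from the binomial formula. This is only a minimal sketch; the helper name `binom_pmf_by_hand` is made up for this illustration and is not part of the solutions above.

```python
import math


def binom_pmf_by_hand(k: int, n: int, p: float) -> float:
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# fewer than three pints of group B means 0, 1 or 2 pints
prob_fewer_than_3 = sum(binom_pmf_by_hand(k, 100, 0.08) for k in range(3))
print(prob_fewer_than_3)
```

The printed value should agree with `binom.cdf(k=2, n=100, p=0.08)` up to floating point error.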

### Exercise 4.3

The probability of a baby being a boy is 0.52.
For six women delivering consecutively in the same labour ward on one day,
which of the following exact sequences of boys and girls is most likely and which least likely?

GBGBGB
BBBGGG
GBBBBB

In [None]:
births_seqs: np.ndarray = np.array(["GBGBGB", "BBBGGG", "GBBBBB"])

#### Ex 4.3 Solution with mathematical calculations

In [None]:
def get_prob_of_birth_seq(birth_seq: str) -> np.float64:
    return np.prod([0.52 if b == "B" else 0.48 for b in birth_seq])

In [None]:
births_seqs_probs1: List[Tuple[float, str]] = [(get_prob_of_birth_seq(bs), bs) for bs in births_seqs]

#### Ex. 4.3 Solution with computer simulation

In [None]:
def get_births_seq(how_many: int = 6) -> str:
    births: np.ndarray = np.random.choice(
        a = ["G", "B"], size=how_many, replace=True, p = [0.48, 0.52])
    return "".join(births)

In [None]:
def get_n_birth_seqs(n: int = 100_000) -> np.ndarray:
    return np.array([get_births_seq() for _ in range(n)])

In [None]:
rand_birth_seqs: np.ndarray = get_n_birth_seqs()
rand_births_counts: Dict[str, int] = {k: v for (k, v) in zip(*np.unique(rand_birth_seqs, return_counts=True))}
rand_births_probs: Dict[str, float] = {k: rand_births_counts.get(k, 0) / rand_birth_seqs.size for k in rand_births_counts}

In [None]:
births_seqs_probs2: List[Tuple[float, str]] = [(rand_births_probs.get(bs, 0), bs) for bs in births_seqs]

#### Answer to Ex. 4.3

Assuming the exact order of births and the exact sex of the children (where p for B: 0.52, and for G: 0.48) we get:

According to mathematical calculations the probabilities are:
- (0.015550119935999997, 'GBGBGB')
- (0.015550119935999997, 'BBBGGG')
- (0.018249793536, 'GBBBBB')

According to simulation (100'000 random birth sequences generated) the probabilities are:
- (0.01573, 'GBGBGB')
- (0.01541, 'BBBGGG')
- (0.01852, 'GBBBBB')

So GBBBBB is the most likely sequence, while GBGBGB and BBBGGG are equally likely (both contain three boys and three girls) and are the least likely ones.

### Exercise 4.4

The Binomial distribution with p = 0.15 and n = 10

(a) 

If 15% of all pregnancies result in miscarriages,
what is the probability that more than half of a group of ten pregnant women will have a miscarriage?

(b)

Among groups of users of video display terminals there are 20'000 large enough for ten women to become pregnant in one year. If we call six or more miscarriages out of 10 a 'cluster', how many clusters would we expect in one year, assuming that there is no increased risk of miscarriage associated with using a terminal? (Based on Blackwell and Chang, 1988)

#### My Comments to Ex 4.4

Not sure how I should understand the Ex4.4b task, especially the phrase '...there are 20'000 large enough for ten women to become pregnant in one year.'
I just assume that those 20'000 women get pregnant, and determine how many 'clusters' (>=6 out of 10) I can expect there.

#### Ex. 4.4a Solution with scipy

In [None]:
# P(at most 4 successful births out of 10) equals P(6 or more miscarriages out of 10)
prob_miscar1: np.float64 = binom.cdf(k=4, n=10, p=1-0.15)

#### Ex. 4.4a Solution with computer simulation

In [None]:
# 0 - birth, 1 - miscarriage
def get_n_miscarriages(n: int) -> np.ndarray:
    return np.random.choice(a=[0, 1], size=n, replace=True, p=[0.85, 0.15])

In [None]:
def is_more_than_k_miscarriages(n_of_births: int, k: int) -> bool:
    births: np.ndarray = get_n_miscarriages(n_of_births)
    return sum(births) > k

In [None]:
prob_miscar2: np.float64 = np.mean(
    np.array(
        [is_more_than_k_miscarriages(10, 5) for _ in range(100_000)])
)

#### Answer to Ex 4.4a

Probability of more than 5 miscarriages out of 10 pregnancies:
- calculated with scipy, p = 0.0013832352123046884
- estimated with computer simulation, p = 0.00124

#### Solution to Ex4.4b using scipy

In [None]:
expected_n_of_misc_clusters1: np.float64 = prob_miscar1 * 20_000 / 10


#### Solution to Ex4.4b using computer simulation

In [None]:
# execution time around 8 secs on my laptop
expected_n_of_misc_clusters2: np.float64 = prob_miscar2 * 20_000 / 10

#### Answer to Ex4.4b

Assuming that 20'000 women become pregnant (2'000 groups of ten) and that the probability of miscarriage is 0.15, we would expect the following number of miscarriage clusters (>=6 miscarriages out of 10 pregnancies):
- 2.76647 (according to mathematical calculations)
- 2.48 (according to computer simulation)

So in practice 2 or 3 such clusters are expected to be found.
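The wording of Ex. 4.4b can also be read literally as 20'000 groups, each with ten pregnancies in a year. The sketch below compares that reading with the one used above. Which reading the exercise intends is an assumption either way, and the variable names are made up for this illustration.

```python
import math

# P(6 or more miscarriages among 10 pregnancies), with p = 0.15
p_cluster = sum(
    math.comb(10, k) * 0.15**k * 0.85 ** (10 - k) for k in range(6, 11)
)

# reading used above: 20'000 pregnant women, i.e. 2'000 groups of ten
clusters_if_20k_women = p_cluster * 20_000 / 10
# literal reading (an assumption): 20'000 groups of ten pregnant women each
clusters_if_20k_groups = p_cluster * 20_000

print(clusters_if_20k_women, clusters_if_20k_groups)
```

Under the literal reading the expected number of clusters is ten times larger, so the choice of interpretation matters for the final answer.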


### Exercise 4.5

If an infection is present in a school it would be expected to spread to 10% of the children

(a) How many children should be tested to have a probability of 0.95 (95%) of detecting the infection if it is there? (Hint: consider the probability of all the children in the sample being negative to the test if the infection is present in the school.)

(b) What is the effect of the number of children in the school on this calculation?

#### Ex. 4.5a Solution with mathematical calculations 

In [None]:
# 0.1 - probability of being sick, 0.9 - probability of being healthy
# so I need to find x in: 0.9^x = 0.05 (5% probability that all kids are healthy)
# so x is the log base 0.9 of 0.05
math.log(0.05, 0.9)

#### Ex. 4.5a Answer

The equation gives x = 28.43315880574342 children.

So, one needs to test at least 29 children to have a 95% probability of detecting such an infection (with 28 children the probability is still slightly below 0.95).
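Since only a whole number of children can be tested, it helps to check the detection probability on both sides of 28.43. This is a minimal sketch; the names `p_healthy`, `detection_prob` and `min_children` are made up for this illustration.

```python
import math

p_healthy = 0.9  # probability that a single tested child is not infected


def detection_prob(n_tested: int) -> float:
    # probability that at least one of n_tested children tests positive
    return 1 - p_healthy**n_tested


# smallest whole number of children that reaches the 0.95 target
min_children = math.ceil(math.log(0.05, p_healthy))
print(min_children, detection_prob(28), detection_prob(29))
```

Rounding up rather than to the nearest integer is what guarantees that the target probability is actually reached.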

#### Ex. 4.5b Answer

Not sure. I guess you need to have enough children in the school to take a sample of that size.

Apart from that, the number of children should not influence the answer to Ex. 4.5a, at least when the school is large compared with the sample

(and provided that it doesn't affect the spread of the disease).
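One way to probe the effect of school size is to treat the testing as sampling without replacement from a finite school. This is a sketch under assumptions that are not in the exercise: exactly 10% of the pupils are infected and the tested children are picked at random. The function name `min_sample_size` is made up for this illustration.

```python
import math


def min_sample_size(school_size: int, infected_frac: float = 0.1,
                    target: float = 0.95) -> int:
    # assumption: exactly infected_frac of the pupils are infected
    n_infected = round(school_size * infected_frac)
    n_healthy = school_size - n_infected
    for n in range(1, school_size + 1):
        # P(all n sampled children are healthy), sampling without replacement
        p_all_negative = math.comb(n_healthy, n) / math.comb(school_size, n)
        if 1 - p_all_negative >= target:
            return n
    return school_size


for size in (50, 200, 1_000, 1_000_000):
    print(size, min_sample_size(size))
```

For a very large school the result agrees with the calculation in Ex. 4.5a, while for a small school fewer children need to be tested, because every healthy child drawn raises the chance that the next one is infected.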

### Exercise 4.6

Over a 25 year period the mean height of adult males increased from 175.8 cm to 179.1 cm, but the standard deviation stayed at 5.84 cm. The minimum height requirement for men to join the police force is 172 cm. What proportion of men would be too short to become policemen at the beginning and end of the 25 year period, assuming that the height of adult males has a Normal distribution?

In [None]:
norm.cdf(x=172, loc=175.8, scale=5.84)

In [None]:
norm.cdf(x=172, loc=179.1, scale=5.84)

#### Ex. 4.6. Answer

At the beginning of the 25 year period approx. 25.8% of men were too short to become policemen.

At the end of the 25 year period approx. 11.2% of men were too short to become policemen.

### Exercise 4.7

A researcher plans to measure blood pressure in a number of subjects. He proposes to take three measurements, but intends to discard the third measurement as unreliable if it does not fall between the first two measurements. Assuming that the subjects' blood pressure stays constant during the measuring, what is the probability that for a given subject the third value will not lie between the other two? (Hint: the answer does not depend upon the variability of blood pressure measurements.) Comment on the researcher's proposal.

#### Ex 4.7 My Notes/Answer

Not sure I understand the question.

So the blood pressure is constant during all 3 measurements? And the different values obtained during the measurement are a result of imprecise reading of the measuring device (or a person that uses it) or because of some other unspecified but random factors?

Hmm, if so then I should assume that I will always get three separate values (and not e.g. 3x the exact same value)? The values will be randomly dispersed within some small distance around the true blood pressure value?

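Hmm, so I got 2 values, and the third can be:
- a) lower than the lower of the two
- b) higher than the higher of the two
- c) in between the two previous values

If all three readings come from the same distribution, each of them is equally likely to be the lowest, the middle or the highest one, so each of the cases a), b), c) has probability 1/3. The third value lies between the other two only in case c), so the probability that it does not lie between them is 2/3 = 0.67 = 67%.

So the researcher would discard about two out of three third measurements purely by chance, which does not look like a sensible rule.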
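A quick simulation can check how often the third reading falls outside the first two. This is a minimal sketch, assuming the three readings are independent draws from one Normal distribution. The mean of 120 and standard deviation of 10 are arbitrary illustrative values, and the function name is made up.

```python
import random


def third_is_outside(rng: random.Random) -> bool:
    # assumption: three independent readings from the same Normal distribution
    # (mean 120 and sd 10 are arbitrary; the result should not depend on them)
    first, second, third = (rng.gauss(120, 10) for _ in range(3))
    return not (min(first, second) < third < max(first, second))


rng = random.Random(42)
n_simuls = 100_000
prob_outside = sum(third_is_outside(rng) for _ in range(n_simuls)) / n_simuls
print(prob_outside)
```

The estimate should come out close to 2/3, and changing the mean or the standard deviation should not move it, in line with the hint in the exercise.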

### Exercise 4.8

In Britain the commonest autosomal recessive disorder is cystic fibrosis, with about one in 2000 live births being affected. If both parents are heterozygous for the abnormal gene there is a 1 in 4 chance of their child having cystic fibrosis.

(a) What is the probability that a couple who are both heterozygous will have two unaffected children?

(b) If they have four unaffected children, what is the probability that their fifth child would be unaffected?

(c) About one in 22 people is heterozygous for cystic fibrosis. In a hospital where there are 3500 births a year, what is the expected number of babies per year affected by cystic fibrosis (assuming that there is no genetic counselling)?

#### Ex. 4.8a Solution

In [None]:
# p that one child is unaffected is 0.75, so for two children (one healthy AND the other healthy) 0.75 * 0.75 = 0.5625
# let's check it with scipy
binom.pmf(k=2, n=2, p=0.75)

#### Ex 4.8a Answer

If two parents are heterozygous (Cc x Cc) then the probability that two of their children are healthy is equal to 0.5625.

#### Ex. 4.8b Answer

Not sure how to understand "If they have four unaffected children"? Who are "they"?

I assume those are the heterozygous parents from Ex. 4.8a, if so:

the probability that a fifth child is healthy is p = 0.75.
Reason: any given child has a probability of 0.75 of being healthy.

It is like tossing a coin, it has no memory effect, so the result of one throw does not affect the result of the next throw.

#### Ex. 4.8c Answer

The description of Ex 4.8c tells only about frequency of heterozygous people in the population.

Therefore I guess I should assume that people with cystic fibrosis (cc) do not have children.

If so, then two heterozygous parents (Cc x Cc) meet at random with probability 1/22 * 1/22, and the probability of their child having cystic fibrosis is 1/4.

Therefore the probability is 1/22 * 1/22 * 1/4 = 0.00051652

And the expected value is p * 3500 = 1.81 children
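The arithmetic can be kept exact with fractions. This is a minimal sketch under the same assumptions as the answer (random mating, only heterozygous parents contribute); the variable names are made up for this illustration.

```python
from fractions import Fraction

p_carrier = Fraction(1, 22)  # probability that one parent is heterozygous
p_affected_given_carriers = Fraction(1, 4)

# both parents heterozygous AND the child inherits both abnormal copies
p_affected = p_carrier * p_carrier * p_affected_given_carriers
expected_per_year = p_affected * 3500

print(p_affected, float(p_affected))
print(float(expected_per_year))
```

The exact probability is 1/1936, which is close to the "about one in 2000 live births" quoted in the exercise.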