# Probability 1
In this exercise, you will practice solving probability problems. Many of these will be math problems. You can write your answers in Markdown cells. Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$

For a reference if you are new to LaTeX, see the [overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions). Some particular LaTeX commands you might find useful include:
- To write a fraction, use `\frac{numerator}{denominator}` like $\frac{3}{4}$
- To write set intersection and union use `\cap` and `\cup` respectively like $A \cap B$ and $A \cup B$.
- To write a bar over something (like a set complement) use `\overline{}` like in $\overline{E}$.
- To write a subscript or superscript, use `_{}` or `^{}` as in $X_{n}$ or $X^{2}$.
- To write a sum or product, use `\sum` or `\prod` as in $\sum_{i=1}^{n} X_i$ or $\prod_{i=1}^{n} X_i$.

As an example of the level of explanation we expect: if we asked you to calculate the probability that a die roll is even or greater than 3 (as in the videos), we would first expect a translation into probabilistic terms, followed by transformations until you arrive at a number, like so:

$$ P(\text{roll is even or} > 3) = P(\text{even} \cup \text{roll} > 3) $$
$$ = P(\text{even}) + P(\text{roll} > 3) - P(\text{even} \cap \text{roll} > 3) $$
$$ = \frac{1}{2} + \frac{1}{2} - \frac{1}{3} $$
$$ = \frac{2}{3} $$

### Question 1
Suppose we roll a single fair six-sided die (meaning we get a number from 1 to 6 drawn uniformly at random) and toss a single fair coin (meaning we get heads or tails each with 50% probability). For each of the following events, what is the probability of the event?

1. An odd die roll and coin toss of heads.
2. A die roll greater than or equal to 4 and anything for the coin.
3. A coin toss of tails and anything for the die.
4. A die roll greater than or equal to 3 or a coin toss of tails (possibly both).

1. 
- Let A = odd die roll, B = coin toss of heads
- P(A) = 1/2
- P(B) = 1/2
- Since the two events are independent, P(A and B) = P(A) * P(B) = (1/2)x(1/2) = 1/4
- Ans: 0.25 (or 1/4)

2. 
- Let A = die roll greater than or equal to 4, B = anything for the coin 
- P(A) = 1/2
- P(B) = 1
- Since the two events are independent, P(A and B) = P(A) * P(B) = (1/2)x(1) = 1/2
- Ans: 0.5 (or 1/2)

3. 
- Let A = coin toss of tails, B = anything for the die
- P(A) = 1/2
- P(B) = 1
- Since the two events are independent, P(A and B) = P(A) * P(B) = (1/2)x(1) = 1/2
- Ans: 0.5 (or 1/2)

4. 
- Let A = die roll greater than or equal to 3, B = coin toss of tails 
- P(A) = 4/6 = 2/3
- P(B) = 1/2
- P(A and B) = P(A) * P(B) = (2/3)x(1/2) = 1/3
- Since the two events are not mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B) = (2/3) + (1/2) - (1/3) = 5/6
- Ans: 0.83 (or 5/6)
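
The four answers above can be cross-checked by brute-force enumeration. The sketch below assumes the 12 (die, coin) outcomes are equally likely, as the question states; the helper name `prob` and the `"H"`/`"T"` labels are illustrative choices, not part of the original exercise.

```python
from fractions import Fraction
from itertools import product

# Joint sample space: 6 die faces x 2 coin sides, all 12 outcomes equally likely.
sample_space = list(product(range(1, 7), ["H", "T"]))

def prob(event):
    """Exact probability of an event given as a predicate on (die, coin)."""
    favorable = sum(1 for die, coin in sample_space if event(die, coin))
    return Fraction(favorable, len(sample_space))

answers = [
    prob(lambda die, coin: die % 2 == 1 and coin == "H"),  # part 1
    prob(lambda die, coin: die >= 4),                      # part 2
    prob(lambda die, coin: coin == "T"),                   # part 3
    prob(lambda die, coin: die >= 3 or coin == "T"),       # part 4
]
for part, p in enumerate(answers, start=1):
    print(f"part {part}: {p} = {float(p):.4f}")
```

Counting outcomes directly avoids relying on the independence and inclusion-exclusion steps, so agreement between the two approaches is a useful sanity check.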

### Question 2
Assume 500 students enrolled in both Calculus and Physics. Of these students,
- 82 got an A in calculus, 
- 73 got an A in physics, and 
- 42 got an A in both courses.

For each of the following, compute the probability that a student chosen uniformly at random:
1. Got an A in both courses
2. Got less than an A in at least one of the two courses
3. Got an A in calculus but not in physics
4. Got an A in at least one of the two courses

Let C = A in calculus, P = A in physics
1. 
- P(C and P) = 42/500
- Ans: 0.084 (or 42/500)

2. 
- P(!C or !P) = P(!(C and P)) = 458/500
- Ans: 0.916 (458/500)

3. 
- P(C and !P) = P(C) - P(C and P) = (82/500) - (42/500) = 40/500 
- Ans: 0.08 (40/500)

4. 
- P(C or P) = P(C) + P(P) - P(C and P) = (82/500) + (73/500) - (42/500) = 113/500
- Ans: 0.226 (113/500)
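
As a quick check, the same four probabilities can be computed from the raw counts with exact fractions. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction

total = 500
a_calc, a_phys, a_both = 82, 73, 42  # counts given in the question

p_both = Fraction(a_both, total)                      # part 1
p_not_both = 1 - p_both                               # part 2 (De Morgan)
p_calc_only = Fraction(a_calc - a_both, total)        # part 3
p_either = Fraction(a_calc + a_phys - a_both, total)  # part 4 (inclusion-exclusion)

for label, p in [("both", p_both), ("not both", p_not_both),
                 ("calculus only", p_calc_only), ("at least one", p_either)]:
    print(f"{label}: {float(p):.3f}")
```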

### Question 3
Consider a certain disease. Suppose that the probability $P(D)$ that a random person has the disease is 1% or 0.01.

1. Suppose the probability $P(T)$ that a random person gets tested for the disease is 20% or 0.2. Also, suppose the probability $P(D \cap T)$ that someone has the disease and gets tested is 0.5% or 0.005. What is the probability $P(D \mid T)$ that a random person has the disease given that they get tested?
2. Suppose the test is not 100% accurate. In particular, the probability $P(pos \mid D)$ of a positive test result given you have the disease is 95% or 0.95. The probability $P(pos \mid \overline{D})$ of a positive test result given you do not have the disease is 10% or 0.1. Suppose a random person takes the test. What is the probability $P(pos)$ that they test positive? 
3. What is the probability $P(D \mid pos)$ that a random person has the disease given that they test positive?

1. 
- P(D given T) = P(D and T) / P(T) = 0.005/0.2 = 0.025
- Ans: 2.5% 

2. 
- P(pos) = P(pos given D) * P(D) + P(pos given !D) * P(!D) = 0.95 * 0.01 + 0.1 * 0.99 = 0.1085
- Ans: 10.85%

3. 
- P(D given pos) = P(D and pos) / P(pos) = (P(pos given D) * P(D)) / P(pos) = (0.95 * 0.01) / 0.1085 = 0.08756
- Ans: 8.76%
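
The three calculations (conditional probability, law of total probability, Bayes' rule) can be sketched in a few lines of Python. The variable names below are illustrative; the numbers are those given in the question.

```python
p_d = 0.01                # P(D)
p_t = 0.2                 # P(T)
p_d_and_t = 0.005         # P(D and T)
p_pos_given_d = 0.95      # P(pos | D)
p_pos_given_not_d = 0.10  # P(pos | not D)

# Part 1: definition of conditional probability.
p_d_given_t = p_d_and_t / p_t

# Part 2: law of total probability over D and not D.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Part 3: Bayes' rule.
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(f"P(D|T) = {p_d_given_t:.4f}")
print(f"P(pos) = {p_pos:.4f}")
print(f"P(D|pos) = {p_d_given_pos:.4f}")
```

Even with a fairly accurate test, fewer than 9% of positive results correspond to a true case here, because the disease is rare and false positives dominate.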

### Question 4

Suppose a student is taking a 50-question multiple-choice exam with 4 answer choices per question. For each question, the probability that the student knows the correct answer is 80%. The student correctly answers these questions, and guesses uniformly at random on the other questions.

1. For a single question, what is the student's expected score if correct answers get 2 points and incorrect answers get 0 points?
2. What is the student's expected overall score on all 50 questions if correct answers get 2 points and incorrect answers are penalized with 0.5 negative points?

1. Let K = knows correct answer, C = correct answer 
- P(K) = 0.8, P(!K) = 0.2
- If K, then P(C) = 1; If !K, then P(C) = 1/4
- Let ppq = points per question. E(ppq given K) = 2, E(ppq given !K) = (1/4) * 2 = 0.5
- E(ppq) = 0.8 * 2 + 0.2 * 0.5 = 1.6 + 0.1 = 1.7
- Ans: 1.7

2. Modify E(ppq given !K) = (1/4) * 2 + (3/4) * -0.5 = 0.5 - 0.375 = 0.125
- E(ppq) = 0.8 * 2 + 0.2 * 0.125 = 1.625
- E(score) = 1.625 * 50 = 81.25
- Ans: 81.25
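
The expected scores can also be computed by first finding the overall probability of a correct answer, which is a slightly different route from conditioning on K but gives the same result. The function `expected_points` below is a hypothetical helper written for this check.

```python
from fractions import Fraction

def expected_points(p_know, n_choices, reward, penalty):
    """Expected points on one question.

    The student answers correctly if they know the answer, or if they
    do not know it and guess right among n_choices options.
    """
    p_correct = p_know + (1 - p_know) * Fraction(1, n_choices)
    return p_correct * reward + (1 - p_correct) * penalty

p_know = Fraction(4, 5)  # 80% chance of knowing the answer

part1 = expected_points(p_know, 4, reward=2, penalty=0)
part2 = 50 * expected_points(p_know, 4, reward=2, penalty=Fraction(-1, 2))

print(float(part1), float(part2))
```

Multiplying the per-question expectation by 50 in part 2 relies on linearity of expectation, which holds whether or not the questions are independent.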

### Question 5
To simplify this problem, assume that a year has twelve equal duration months, rather than the unequal months as in the standard Gregorian calendar. 

1. Assuming that people are equally likely to be born in any month, how many people must be selected independently and uniformly at random before the probability that at least two of the selected people have the same birth month is at least 1/2?
2. Now assume that for each month among May, June, July, and August, the probability of being born in that month is 1/10, whereas for every other month the probability of being born in that month is 3/40. What is the probability that two people selected uniformly at random have the same birth month? Is this more or less than under the equal-likelihood assumption of part 1? 

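1. n = 5, because 1 - (12 * 11 * 10 * 9 * 8) / 12^5 = 0.618 > 0.5, whereas for n = 4 the probability is only 1 - (12 * 11 * 10 * 9) / 12^4 = 0.427 < 0.5
2. P(same month) = 4 * (1/10)^2 + 8 * (3/40)^2 = 0.04 + 0.045 = 0.085, which is greater than the 1/12 = 0.083 we would get for two people if all months were equally likely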
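
Both parts can be verified with exact arithmetic. The sketch below searches for the smallest group size in the uniform case and then sums the squared month probabilities for the non-uniform case; the function name `p_shared_month` is illustrative.

```python
from fractions import Fraction

def p_shared_month(n, months=12):
    """P(at least two of n people share a birth month), uniform months."""
    p_all_distinct = Fraction(1)
    for k in range(n):
        p_all_distinct *= Fraction(months - k, months)
    return 1 - p_all_distinct

# Part 1: smallest n whose shared-month probability reaches 1/2.
n = 1
while p_shared_month(n) < Fraction(1, 2):
    n += 1

# Part 2: two people match if both fall in the same month,
# so sum p_month^2 over all twelve months.
month_probs = [Fraction(1, 10)] * 4 + [Fraction(3, 40)] * 8
p_match_nonuniform = sum(p * p for p in month_probs)
p_match_uniform = Fraction(1, 12)

print(n, float(p_match_nonuniform), float(p_match_uniform))
```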

### Question 6

1. Suppose the average/mean home price in the United States is 300,000 dollars. If we select a home uniformly at random, what is an upper bound on the probability that its price is more than 1 million dollars? That is, provide and justify a statement of the form "The probability that a home chosen uniformly at random has price greater than $X$ is at most $P$." Use Markov's inequality to provide the bound.
2. Suppose the standard deviation in home prices is 100,000 dollars. Use Chebyshev's inequality to provide a better upper bound on the probability that a home selected uniformly at random has price more than 1 million dollars.
3. Suppose we have an iid sample of $n$ home prices and the empirical mean price $\overline{X}_n$ is greater than 301,000 dollars. Assuming that we know the standard deviation in home prices is 100,000 dollars and assuming $n$ is large enough that the central limit theorem applies, how many samples $n$ would we need before we could conclude that the probability of observing $\overline{X}_n > 301,000$ is at most 5%, assuming the true mean of the underlying distribution is 300,000 dollars?    

1. Markov's Inequality: P(X >= a) <= E(X) / a
- P(X >= 1,000,000) <= 300,000 / 1,000,000 = 0.3
- Ans: 30%

2. Chebyshev's Inequality: P(|X - E(X)| >= a) <= Var(X) / a^2
- P(X > 1,000,000) <= P(|X - 300,000| >= 700,000) <= (100,000)^2 / (700,000)^2 = 1/49 = 0.0204
- Ans: about 2%

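3. Central Limit Theorem: Z = (Xn - mu) / (sigma / sqrt(n)) is approximately standard normal
- We want P(Xn > 301,000) <= 0.05
- Z = (301,000 - 300,000) / (100,000 / sqrt(n)) = sqrt(n) / 100
- Taking Z = 2 (about 95% of the data are within 2 standard deviations of the mean, so the upper tail beyond Z = 2 is about 2.5%, safely below 5%) gives sqrt(n) = 200, so n = 40,000
- This is a conservative choice: the exact one-sided 5% cutoff is Z = 1.645, which would need only about n = 27,060
- Ans: 40,000 (conservative; about 27,060 with Z = 1.645)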
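
The three bounds can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library to get the one-sided 5% critical value, and compares it with the Z = 2 rule of thumb; `samples_needed` is a hypothetical helper, and rounding up to a whole sample is an assumption about how the answer should be reported.

```python
import math
from statistics import NormalDist

mean, sd, threshold = 300_000, 100_000, 1_000_000

# Part 1: Markov's inequality, P(X >= a) <= E(X) / a (prices are nonnegative).
markov = mean / threshold

# Part 2: Chebyshev's inequality with a = threshold - mean.
chebyshev = sd**2 / (threshold - mean) ** 2

# Part 3: solve z = margin / (sd / sqrt(n)) for n.
margin = 301_000 - mean

def samples_needed(z):
    """Smallest whole n with margin / (sd / sqrt(n)) >= z."""
    return math.ceil((z * sd / margin) ** 2)

z_one_sided = NormalDist().inv_cdf(0.95)  # upper-tail area of 5%

print(f"Markov bound:    {markov:.4f}")
print(f"Chebyshev bound: {chebyshev:.4f}")
print(f"n with Z = 2:    {samples_needed(2.0)}")
print(f"n with Z = {z_one_sided:.3f}: {samples_needed(z_one_sided)}")
```

The Z = 2 rule gives a larger n than strictly necessary, since the event of interest is a one-sided tail.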

### Question 7
There were 151 "pokemon" in the original generation of the popular franchise. Suppose you start purchasing pokemon cards. Each card you purchase has a pokemon chosen independently and uniformly at random from the 151 possible with replacement (that is, you can get the same card multiple times). Use Monte Carlo Simulation to estimate answers to the following questions. You can use Python's `random` library or the `numpy.random` library.

1. Suppose you have a particular favorite pokemon. Suppose you purchase 150 cards total. What is the probability that you will get your favorite pokemon's card? Conduct enough simulations that you are confident in your answer +/- 0.01 (or 1%).
2. Suppose you purchase 300 cards total. What is the expected number of *unique* pokemon cards that you will have? Conduct enough simulations that you are confident in your answer +/- 1 card.  
3. Suppose you buy 600 cards total. What is the probability that you will have all 151 unique cards? Conduct enough simulations that you are confident in your answer +/- 0.01 (or 1%).


In [5]:
import numpy as np
import random

In [1]:
# Put your code to answer question 7 here
# Feel free to add extra code cells as needed
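# 1
# Probability of getting the favorite pokemon card at least once in 150 draws.
# This is the exact closed-form value, 1 - P(no favorite in any draw);
# a Monte Carlo estimate should agree with it to within sampling error.
notFavProb = 150/151
numCards = 150
print(1 - notFavProb**numCards)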

0.6308976890038165
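
Since the question asks for a Monte Carlo estimate, here is a minimal simulation sketch for part 1 using `numpy.random`. The seed, the number of trials, and the choice of index 0 as the "favorite" are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
num_pokemon, num_cards, num_trials = 151, 150, 20_000
favorite = 0  # hypothetical index of the favorite pokemon

# Each row is one simulated purchase of 150 cards,
# drawn uniformly at random with replacement from 0..150.
draws = rng.integers(0, num_pokemon, size=(num_trials, num_cards))
got_favorite = (draws == favorite).any(axis=1)

estimate = got_favorite.mean()
# Standard error of a proportion; twice this is a rough 95% margin.
std_error = np.sqrt(estimate * (1 - estimate) / num_trials)
print(f"estimate = {estimate:.4f}, approx. 95% margin = {2 * std_error:.4f}")
```

With 20,000 trials the rough 95% margin is below the requested 0.01, and the estimate can be compared against the closed-form value printed above.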


In [6]:
# 2
# Estimate the expected number of unique cards among 300 purchases.
cardList = list(range(1, 152))
uniqueList = []
for _ in range(1000):
    # One trial: draw 300 cards uniformly at random with replacement.
    trialList = random.choices(cardList, k=300)
    uniqueList.append(len(set(trialList)))
print(np.mean(uniqueList))

130.515


In [7]:
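# 3
# Estimate the probability of holding all 151 unique cards after 600 purchases.
cardList = list(range(1, 152))
completeList = []
for _ in range(1000):
    # One trial: draw 600 cards uniformly at random with replacement.
    trialList = random.choices(cardList, k=600)
    # Record 1 if this trial collected every card, else 0.
    completeList.append(1 if len(set(trialList)) == 151 else 0)
print(np.mean(completeList))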

0.057


1. 0.63 +/- 0.01
2. 130 +/- 1
3. 0.06 +/- 0.01
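
To judge whether enough simulations were run for the stated margins, one option is to report an approximate 95% margin of error alongside each estimate. The sketch below is a vectorized variant for parts 2 and 3; the helper names, the seed, and the trial counts are assumptions chosen for illustration, and the margin uses the normal approximation 1.96 * s / sqrt(trials).

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only
NUM_POKEMON = 151

def unique_counts(num_cards, num_trials):
    """Number of distinct pokemon in each of num_trials simulated purchases."""
    draws = rng.integers(0, NUM_POKEMON, size=(num_trials, num_cards))
    draws.sort(axis=1)
    # In a sorted row, distinct values = 1 + number of places the value changes.
    return 1 + (np.diff(draws, axis=1) != 0).sum(axis=1)

def margin_95(samples):
    """Approximate 95% margin of error for the mean of the samples."""
    return 1.96 * samples.std(ddof=1) / np.sqrt(len(samples))

# Part 2: expected number of unique cards among 300 purchases.
uniques_300 = unique_counts(300, 2_000)

# Part 3: indicator of a complete collection after 600 purchases.
complete_600 = (unique_counts(600, 5_000) == NUM_POKEMON).astype(float)

print(f"part 2: {uniques_300.mean():.2f} +/- {margin_95(uniques_300):.2f}")
print(f"part 3: {complete_600.mean():.3f} +/- {margin_95(complete_600):.3f}")
```

If a reported margin exceeds the target (1 card or 0.01), increasing the trial count by a factor of four roughly halves it.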