In previous projects, we covered the fundamentals of probability and learned about:

* Theoretical and empirical probabilities
* Probability rules (the addition rule and the multiplication rule)
* Counting techniques (the rule of product, permutations, and combinations)

In this project, we'll build on what we've learned and develop new techniques that will enable us to better estimate probabilities. We will learn how to calculate probabilities based on certain conditions, hence the name conditional probability.

Let's start by considering rolling a regular six-sided die. This random experiment has six possible outcomes: 1, 2, 3, 4, 5, and 6. The set of all possible outcomes associated with a random experiment is called a sample space, and we denote it by the Greek letter Ω ("Omega"). We represent the sample space of a die roll as a set:

![image.png](attachment:image.png)

In [1]:
# The probability of getting a 2
p_2 = 1/6

# The probability of getting an odd number (1, 3, or 5)
p_odd = 3/6

# The probability of getting a 2 or a 4
p_2_or_4 = 2/6
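
The three values above follow the same pattern: count the outcomes in the event and divide by the number of outcomes in the sample space. Here is a minimal sketch of that pattern, assuming a fair die so that all outcomes are equally likely. The helper name `probability` is our own choice for illustration and is not part of the project.

```python
from fractions import Fraction

# Sample space of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def probability(event, sample_space):
    """Theoretical probability of an event when all outcomes are equally likely."""
    favorable = event & sample_space  # ignore outcomes outside the sample space
    return Fraction(len(favorable), len(sample_space))

p_2 = probability({2}, omega)
p_odd = probability({1, 3, 5}, omega)
p_2_or_4 = probability({2, 4}, omega)
print(p_2, p_odd, p_2_or_4)  # 1/6 1/2 1/3
```

Using `Fraction` keeps the results exact, so 3/6 is shown in its reduced form 1/2.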

Now suppose the die is rolled and we're told some new information: the die showed an odd number (1, 3, or 5) after landing. Is the probability of getting a 5 still P(5) = 1/6? Or should we instead update the probability based on the information we have?

When we don't know whether the number is odd, the possible outcomes of the experiment are 1, 2, 3, 4, 5, or 6. But after we find out the number is odd, the possible outcomes are 1, 3, or 5. In other words, the new information we got reduced the sample space from {1, 2, 3, 4, 5, 6} to {1, 3, 5}:



In [2]:
# A fair six-sided die is rolled. All we know is that the number we got is less than 5.
# Calculate the following probabilities.

# The probability of getting a 3
p_3 = 1/4

# The probability of getting a 6
p_6 = 0/4

# The probability of getting an odd number
p_odd = 2/4

# The probability of getting an even number
p_even = 2/4
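
One way to see why every denominator above is 4 is to build the reduced sample space explicitly. The sketch below assumes a fair die; the names `reduced` and `probability_given` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

# New information: the number rolled is less than 5.
reduced = {outcome for outcome in omega if outcome < 5}  # {1, 2, 3, 4}

def probability_given(event, reduced_space):
    """Probability of an event once the sample space has been reduced."""
    return Fraction(len(event & reduced_space), len(reduced_space))

p_3 = probability_given({3}, reduced)
p_6 = probability_given({6}, reduced)        # 6 is no longer a possible outcome
p_odd = probability_given({1, 3, 5}, reduced)
p_even = probability_given({2, 4, 6}, reduced)
print(p_3, p_6, p_odd, p_even)  # 1/4 0 1/2 1/2
```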

![image.png](attachment:image.png)

In [3]:
# A student is randomly selected from a class.
# All we know is that he was born during winter.
# Assume the winter months are December, January, and February, and
# ignore the fact that these three months have different numbers of days.

# The probability that he was born in December
p_december = 1/3

# The probability that he was born in a 31-day month
p_31 = 2/3

# The probability that he was born during summer
p_summer = 0/3

# The probability that he was born in a month which ends in the letter "r"
# ("September," for instance, ends in "r," while "April" doesn't)
p_ends_r = 1/3
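
The same answers can be checked by enumeration. This is only a sketch under the exercise's own simplification (each winter month is equally likely). The day counts take February as 28 days, and the set of summer months is our assumption, since the exercise does not list them.

```python
from fractions import Fraction

# Days per winter month (February taken as 28 days; leap years ignored).
days_in_month = {"December": 31, "January": 31, "February": 28}
winter = set(days_in_month)             # the reduced sample space
summer = {"June", "July", "August"}     # assumed; not stated in the exercise

def probability_given_winter(event):
    """Probability of an event, given the birth month is a winter month."""
    return Fraction(len(event & winter), len(winter))

p_december = probability_given_winter({"December"})
p_31 = probability_given_winter({m for m, d in days_in_month.items() if d == 31})
p_summer = probability_given_winter(summer)
p_ends_r = probability_given_winter({m for m in winter if m.endswith("r")})
print(p_december, p_31, p_summer, p_ends_r)  # 1/3 2/3 0 1/3
```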

Say we roll a fair six-sided die and want to find the probability of getting an odd number, given the die showed a number greater than 1 after landing. Using probability notation, we want to find P(A|B) where:

* A is the event that the number is odd: A = {1, 3, 5}
* B is the event that the number is greater than 1: B = {2, 3, 4, 5, 6}

![image.png](attachment:image.png)

The only odd numbers we can still get are 3 and 5, so the number of possible successful outcomes is also given by the cardinality of the set {3, 5}:
![image.png](attachment:image.png)


![image.png](attachment:image.png)
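
The die example can be reproduced with Python sets, where `&` computes the intersection and `len` gives the cardinality. This is a minimal sketch assuming equally likely outcomes; `conditional_probability` is a hypothetical helper name, not something defined elsewhere in this project.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
a = {n for n in omega if n % 2 == 1}  # the number is odd: {1, 3, 5}
b = {n for n in omega if n > 1}       # the number is greater than 1: {2, 3, 4, 5, 6}

def conditional_probability(a, b):
    """P(A|B) = card(A ∩ B) / card(B), for equally likely outcomes."""
    if not b:
        raise ValueError("B must contain at least one outcome")
    return Fraction(len(a & b), len(b))

p_a_given_b = conditional_probability(a, b)
print(p_a_given_b)  # 2/5
```

Note that the outcome 1 belongs to A but not to B, so it drops out of the intersection and does not count as a successful outcome.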

Two fair six-sided dice are simultaneously rolled, and the two numbers they show are added together. The diagram below shows all the possible results that we can get from adding the two numbers together.

![image.png](attachment:image.png)

In [5]:
# Find P(A|B), where A is the event where the sum is an even number, and B is the event that the sum is less than eight.

# Find card(B). Assign your answer to card_b. We'll have to treat identical sums differently if they come from different die numbers.

card_b = 21  

# Find card(A ∩ B)

card_a_and_b = 9

# Calculate P(A|B)

p_a_given_b = 9/21
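
The counts above can be verified by listing all 36 ordered rolls instead of reading them off the diagram. The sketch below uses `itertools.product` and treats (1, 2) and (2, 1) as different outcomes, which matches the instruction to treat identical sums differently when they come from different die numbers.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die): 36 equally likely outcomes.
rolls = list(product(range(1, 7), repeat=2))

# B: the sum is less than eight.
b = [roll for roll in rolls if sum(roll) < 8]

# A ∩ B: the sum is even and less than eight.
a_and_b = [roll for roll in b if sum(roll) % 2 == 0]

card_b = len(b)
card_a_and_b = len(a_and_b)
p_a_given_b = Fraction(card_a_and_b, card_b)
print(card_b, card_a_and_b, p_a_given_b)  # 21 9 3/7
```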


We'll now use the formula in the context of a more realistic example. A team of biologists wants to measure the effectiveness of a new HIV test they developed (HIV is a virus that causes AIDS, a disease which affects the immune system). They used the new method to test 53 people, and the results are summarized in the table below:

![image.png](attachment:image.png)

By reading the table above, we can see that:

* 23 people are infected with HIV.
* 30 people are not infected with HIV (HIV^C means not infected with HIV; the superscript "C" indicates a set complement).
* 45 people tested positive for HIV.
* 8 people tested negative for HIV.
* Out of the 23 infected people, 21 tested positive (correct diagnosis).
* Out of the 30 not-infected people, 24 tested positive (wrong diagnosis).

The team now intends to use these results to calculate probabilities for new patients and figure out whether the test is reliable enough to use in hospitals. They want to know:

* What is the probability of testing positive, given that a patient is infected with HIV?
* What is the probability of testing negative, given that a patient is not infected with HIV?

According to the formula, we have:
![image.png](attachment:image.png)

The probability of testing positive, given that the patient is infected with HIV, is therefore 91.30%. This may suggest that the new test is fairly good at detecting HIV when the virus is actually present. However, at a probability of 91.30%, we can expect that for every 10,000 patients infected with HIV, about 9,130 patients will get a correct diagnosis, while the other 870 will not. The team should probably conclude that the test needs more refinement with respect to detecting the virus.
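
The arithmetic behind this conclusion can be sketched as follows, using only the two counts read from the table (23 infected people, 21 of whom tested positive). The variable names are our own.

```python
from fractions import Fraction

infected = 23               # card(HIV)
positive_and_infected = 21  # card(T+ ∩ HIV)

p_positive_given_hiv = Fraction(positive_and_infected, infected)
print(f"{float(p_positive_given_hiv):.2%}")  # 91.30%

# Expected outcomes if 10,000 infected patients were tested.
patients = 10_000
expected_correct = round(patients * p_positive_given_hiv)
expected_missed = patients - expected_correct
print(expected_correct, expected_missed)  # 9130 870
```

Keep in mind that these figures come from only 23 infected people, so they are rough estimates rather than precise rates.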

In [1]:
# Calculate P(T = - | HIV^C).

p_negative_given_non_hiv = 6/30
p_negative_given_non_hiv

0.2

'''
The probability of testing negative given that a patient is not
infected with HIV is 20%. This means that for every 10,000 healthy
patients, only about 2,000 will get a correct diagnosis, while the
other 8,000 will not. It looks like the test is almost completely
ineffective for healthy patients, and it could be dangerous to have
it used in hospitals.
'''

![image.png](attachment:image.png)

The formula above is useful when we only know probabilities and not the underlying counts.
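
As a quick check, and assuming the formula above is P(A|B) = P(A ∩ B) / P(B), the sketch below applies it to the HIV test data and compares the result with the count-based calculation. Both routes should agree, because dividing the numerator and the denominator by the same total (53 people) does not change the ratio.

```python
from fractions import Fraction

total = 53  # people tested

# Probabilities estimated from the table.
p_hiv = Fraction(23, total)               # P(HIV)
p_positive_and_hiv = Fraction(21, total)  # P(T+ ∩ HIV)

# Route 1: ratio of probabilities.
from_probabilities = p_positive_and_hiv / p_hiv

# Route 2: ratio of counts.
from_counts = Fraction(21, 23)

print(from_probabilities, from_counts)  # 21/23 21/23
```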

![image.png](attachment:image.png)

A company offering a browser-based task manager tool intends to do some targeted advertising based on people's browsers. The data they collected about their users is described in the table below:
![image.png](attachment:image.png)

In [5]:
# Find:

# P(Premium | Chrome): the probability that a randomly chosen user has a premium subscription,
# provided their browser is Chrome
p_premium_given_chrome = 158/2762

# P(Basic | Safari)
p_basic_given_safari = 274/1288

# P(Free | Firefox)
p_free_given_firefox = 2103/2285

# Between a Chrome user and a Safari user, who is more likely to have a premium subscription?
# Safari, because P(Premium | Safari) = 120/1288 is higher than P(Premium | Chrome) = 158/2762.
more_likely_premium = "Safari"
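
The Chrome versus Safari comparison can also be made programmatically rather than by eye. This sketch reuses only the counts quoted in the cell above (158 of 2,762 Chrome users and 120 of 1,288 Safari users have a premium subscription); the dictionary name `p_premium_given` is our own.

```python
from fractions import Fraction

# P(Premium | browser), from the counts used in the exercise.
p_premium_given = {
    "Chrome": Fraction(158, 2762),
    "Safari": Fraction(120, 1288),
}

for browser, p in p_premium_given.items():
    print(f"{browser}: {float(p):.4f}")
# Chrome: 0.0572
# Safari: 0.0932

# Pick the browser with the highest conditional probability.
more_likely_premium = max(p_premium_given, key=p_premium_given.get)
print(more_likely_premium)  # Safari
```

Chrome has more premium users in absolute terms (158 against 120), but the conditional probability is what matters here because it accounts for the size of each group.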

In this project, we learned the fundamentals of conditional probability and managed to derive two important formulas:
![image.png](attachment:image.png)