### **Conditional Probability**

The probability of an event occurring given that another event, on which it depends, has already occurred.

If I have two events that depend on each other, what's the probability that both will occur?

- Notation: P(A,B) is the joint probability of A and B both occurring.
- P(B│A): probability of B given that A has occurred. If B depends on A, this differs from P(B).

**P(B│A) = P(A,B) / P(A)**

_Example_: Two tests for students. 60% of the students passed both tests, but the first test was easier (80% passed that one). What percentage of students who passed the first test also passed the second?

A = passing the first test, so P(A) = 0.8;
B = passing the second test, so P(A,B) = 0.6.

P(B│A): ?

In [2]:
prob_b_a = 0.6/0.8
round(prob_b_a, 2)

0.75

> **75% of students who passed the first test passed the second.**
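
The same calculation can be wrapped in a small helper, shown here as a minimal sketch. The function name `conditional_probability` and its argument names are illustrative and not part of the original notebook; the input checks are an added assumption about what counts as a valid probability pair.

```python
def conditional_probability(p_joint, p_given):
    """Return P(B|A) = P(A,B) / P(A).

    p_joint: P(A,B), the probability that both events occur.
    p_given: P(A), the probability of the conditioning event.
    """
    if not 0 < p_given <= 1:
        raise ValueError("P(A) must be in (0, 1]")
    if not 0 <= p_joint <= p_given:
        raise ValueError("P(A,B) must be between 0 and P(A)")
    return p_joint / p_given


# Student example: 60% passed both tests, 80% passed the first one.
p_b_given_a = conditional_probability(0.6, 0.8)
print(round(p_b_given_a, 2))
```

The second check reflects the fact that the joint probability can never exceed the probability of either event on its own.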

_Example_: Using fake data on how much stuff people purchase given their age range.

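- The code below generates 100000 random "people" and randomly assigns them as being in their 20's, 30's, 40's, 50's, 60's, or 70's.
- It then assigns each person a purchase probability equal to their age decade divided by 100 (20% for people in their 20's, up to 70% for people in their 70's), so younger people are less likely to buy stuff.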

In the end, there are two Python dictionaries:

1. **"totals"** contains the total number of people in each age group.
2. **"purchases"** contains the total number of things purchased by people in each age group.

The grand total of purchases is in `totalPurchases`, and the total number of people is known to be 100000.

In [3]:
from numpy import random
random.seed(0)

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = float(ageDecade) / 100.0
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1

In [4]:
totals

{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}

In [5]:
purchases

{20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

In [6]:
totalPurchases

45012

- Compute **P(E|F)**, where E is "purchase" and F is "you're in your 30's".
- The probability of someone in their 30's buying something is just the fraction of people in their 30's who bought something:

In [8]:
prob_e_f = float(purchases[30])/float(totals[30])
print('P(purchase|30s): ' + str(prob_e_f))

P(purchase|30s): 0.29929598652145134


- P(F) is just the probability of being in your 30's in this data set:

In [9]:
prob_f = float(totals[30])/100000
print("P(30's): " +  str(prob_f))

P(30's): 0.16619


- P(E) is the overall probability of buying something, regardless of age:

In [10]:
prob_e = float(totalPurchases)/100000
print("P(Purchase):" + str(prob_e))

P(Purchase):0.45012


> **About a 45% chance of buying stuff.**

If E and F were independent, we would expect P(E|F) to be about the same as P(E). But they're not: P(E) is 0.45, and P(E|F) is 0.3. So **E and F are dependent** (the dependency was created artificially: older people buy more).
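
To see that the dependency is not specific to the 30's group, the comparison can be repeated for every age decade. This is a minimal sketch that hardcodes the `totals` and `purchases` counts printed above rather than rerunning the simulation; the names `n_people`, `p_purchase` and `p_purchase_given_age` are illustrative.

```python
# Counts copied from the notebook output above (seed 0, 100000 simulated people).
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

n_people = sum(totals.values())
p_purchase = sum(purchases.values()) / n_people  # P(E)

# P(E|F) for each age decade F: purchases in the group / people in the group.
p_purchase_given_age = {age: purchases[age] / totals[age] for age in totals}

for age, p in p_purchase_given_age.items():
    print(f"P(purchase|{age}s) = {p:.3f}   P(purchase) = {p_purchase:.3f}")
```

Each conditional probability lands near the purchase probability assigned to that decade in the simulation, and only the groups close to the overall average come near P(E).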

P(E,F) is different from P(E|F). **P(E,F) is the probability of both being in your 30's _and_ buying something, out of the total population**, not just the population of people in their 30's:

In [11]:
print("P(30's, Purchase)" + str(float(purchases[30])/100000))

P(30's, Purchase)0.04974


Let's also compute the product of P(E) and P(F), P(E)P(F):

- the overall probability of buying something multiplied by the overall probability of being in your 30's.

In [12]:
print("P(30's)P(Purchase)" + str(prob_e * prob_f))

P(30's)P(Purchase)0.07480544280000001


In statistics it is common to see **P(E,F) = P(E)P(F)**, but this assumes E and F are **independent**. In this example, P(E,F) is about 0.05, while P(E)P(F) is about 0.075. The two differ, so **E and F are dependent** and the conditional probability is needed instead: P(E,F) = P(E|F)P(F).
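
The relationship P(E,F) = P(E|F)P(F), which is just the conditional probability formula rearranged, holds whether or not the events are independent. The sketch below checks it against the counts printed earlier; the variable names are illustrative and the counts are copied from the notebook output rather than recomputed.

```python
import math

# Counts copied from the notebook output above.
n_people = 100000
n_30s = 16619            # totals[30]
n_30s_purchases = 4974   # purchases[30]
n_purchases = 45012      # totalPurchases

p_f = n_30s / n_people                # P(F): being in your 30's
p_e = n_purchases / n_people          # P(E): buying something
p_e_given_f = n_30s_purchases / n_30s # P(E|F)
p_e_and_f = n_30s_purchases / n_people  # P(E,F)

# Product rule: always holds.
assert math.isclose(p_e_and_f, p_e_given_f * p_f)

# Independence shortcut: only holds when E and F are independent.
print("P(E,F)     =", p_e_and_f)
print("P(E|F)P(F) =", p_e_given_f * p_f)
print("P(E)P(F)   =", p_e * p_f)
```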

#### **Changing the example above to remove the dependency**:

> Modify the code so that the purchase probability is fixed at 40%, regardless of age.

In [13]:
random.seed(0)

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = 0.4 # fixed purchase prob
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1

- Compute P(E|F) for some age group - 30's:

In [14]:
prob_e_f = float(purchases[30])/float(totals[30])
print("P(purchase|30s): " + str(prob_e_f))

P(purchase|30s): 0.3987604549010169


- Compute P(E):

In [15]:
prob_e = float(totalPurchases)/100000
print("P(Purchase):" + str(prob_e))

P(Purchase):0.4003


> **P(E|F) is close to P(E), so it is reasonable to conclude that E and F are likely independent.**
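
A single age group is thin evidence, so the same check can be run for every decade. The sketch below is an assumption-laden variant of the simulation: it uses Python's standard-library `random.Random` instead of `numpy.random`, so its counts will not match the notebook's numbers exactly, and the names `rng`, `p_given_age` and `max_gap` are illustrative.

```python
import random

rng = random.Random(0)  # stdlib generator, not the numpy one used in the notebook
ages = [20, 30, 40, 50, 60, 70]
totals = {age: 0 for age in ages}
purchases = {age: 0 for age in ages}

for _ in range(100000):
    age = rng.choice(ages)
    totals[age] += 1
    if rng.random() < 0.4:  # same fixed purchase probability for every age
        purchases[age] += 1

p_purchase = sum(purchases.values()) / sum(totals.values())  # P(E)
p_given_age = {age: purchases[age] / totals[age] for age in ages}  # P(E|F) per decade

# Largest distance between any P(E|F) and P(E); small values suggest independence.
max_gap = max(abs(p - p_purchase) for p in p_given_age.values())

for age in ages:
    print(f"P(purchase|{age}s) = {p_given_age[age]:.4f}")
print(f"P(purchase) = {p_purchase:.4f}, largest gap = {max_gap:.4f}")
```

With roughly 16000 to 17000 people per group, sampling noise alone accounts for gaps of a fraction of a percentage point, so small differences here do not indicate dependence.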