# Conditional Probability

Notes: https://github.com/daviskregers/notes/blob/master/data-science/02-statistics-and-probability-refresher/09-conditional-probability.md

We create data that represents random people grouped by age decade (20's, 30's, and so on), then give each person a probability of buying something that depends on their age.

In [1]:
from numpy import random
random.seed(0)  # fixed seed so the run is reproducible

# Number of people, and number of purchasers, seen in each age decade
totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    # Purchase probability grows with age: 0.2 for the 20's up to 0.7 for the 70's
    purchaseProbability = float(ageDecade) / 100.0
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1

In [2]:
totals

{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}

In [3]:
purchases

{20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

In [4]:
totalPurchases

45012

Now for the conditional probability. We can compute P(E|F), where E is "purchase" and F is "being in your 30's". The probability of someone in their 30's buying something is just the fraction of people in their 30's who bought something.

In [16]:
PEF = float(purchases[30]) / float(totals[30])
print('P(purchase | 30s) = ' + str(PEF))

P(purchase | 30s) = 0.29929598652145134


P(F) is just the probability of being in your 30's in this data set:

In [15]:
PF = float(totals[30]) / 100000.0
print("P(30's) = " +  str(PF))

P(30's) = 0.16619


And P(E) is the overall probability of buying something, regardless of your age:

In [14]:
PE = float(totalPurchases) / 100000.0
print("P(Purchase) = " + str(PE))

P(Purchase) = 0.45012


If E and F were independent, we would expect P(E|F) to be about the same as P(E). But P(E|F) is about 0.30 while P(E) is about 0.45, which tells us that E and F are dependent.
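
The same comparison can be made for every age group, not just the 30's. The following is a minimal sketch that hard-codes the counts printed by the cells above; the helper name `conditional_purchase_rates` is hypothetical and not part of the original notebook.

```python
# Counts copied from the outputs of the cells above.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}


def conditional_purchase_rates(purchases, totals):
    """Return P(purchase | age group) for every age group (hypothetical helper)."""
    return {age: purchases[age] / totals[age] for age in totals}


rates = conditional_purchase_rates(purchases, totals)
overall = sum(purchases.values()) / sum(totals.values())  # P(E)

# Print each conditional probability next to the overall purchase probability.
for age, rate in rates.items():
    print(f"P(purchase | {age}s) = {rate:.3f}   P(purchase) = {overall:.3f}")
```

Every group's rate differs from the overall rate of about 0.45, rising from roughly 0.20 for the 20's to roughly 0.70 for the 70's, which is what dependence on age looks like.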

P(E,F) is different from P(E|F). P(E,F) is the probability of both being in your 30's and buying something, out of the total population, not just the population of people in their 30's.

In [10]:
print("P(30's, Purchase) = " + str(float(purchases[30]) / 100000.0))

P(30's, Purchase) = 0.04974


Let's also compute the product of P(E) and P(F), P(E)P(F):

In [12]:
print("P(30's)P(Purchase) = " + str(PE * PF))

P(30's)P(Purchase) = 0.07480544280000001


Something you may learn in stats is that P(E,F) = P(E)P(F), but this assumes E and F are independent. We've found here that P(E,F) is about 0.05, while P(E)P(F) is about 0.075. So when E and F are dependent, and we have a conditional probability going on, we can't just say that P(E,F) = P(E)P(F).

We can also check that P(E|F) = P(E,F)/P(F), which is the definition of conditional probability. The result should match the P(E|F) we computed directly from the counts:

In [17]:
print((purchases[30] / 100000.0) / PF)

0.29929598652145134
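
The two product formulas can be checked side by side. This sketch reuses the counts reported above and introduces its own variable names purely for illustration. It suggests that the product rule P(E,F) = P(E|F)P(F) holds whether or not the events are dependent, while the independence shortcut P(E)P(F) misses P(E,F) by a wide margin here.

```python
import math

# Counts copied from the outputs of the cells above.
n = 100000
total_30s = 16619
purchases_30s = 4974
total_purchases = 45012

p_f = total_30s / n                      # P(F): being in your 30's
p_e = total_purchases / n                # P(E): buying something
p_e_and_f = purchases_30s / n            # P(E,F): both at once
p_e_given_f = purchases_30s / total_30s  # P(E|F)

# Product rule: valid for any pair of events with P(F) > 0.
product_rule_holds = math.isclose(p_e_and_f, p_e_given_f * p_f)

# Independence shortcut: only valid when E and F are independent.
# The 5% relative tolerance is an arbitrary allowance for sampling noise.
independence_shortcut_holds = math.isclose(p_e_and_f, p_e * p_f, rel_tol=0.05)

print("P(E,F) == P(E|F)P(F):", product_rule_holds)
print("P(E,F) == P(E)P(F):  ", independence_shortcut_holds)
```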


## Now when the variables are independent

In [18]:
from numpy import random
random.seed(0)

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = 0.4  # same for every age group, so purchasing no longer depends on age
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1

Next we compute P(E|F) for the 30's age group:

In [19]:
PEF = float(purchases[30]) / float(totals[30])
print("P(purchase | 30s) = " + str(PEF))

P(purchase | 30s) = 0.3987604549010169


And we'll compute P(E):

In [20]:
PE = float(totalPurchases) / 100000.0
print("P(Purchase) = " + str(PE))

P(Purchase) = 0.4003


P(E|F) is close to P(E) (about 0.399 versus 0.400), so we can say that E and F are likely independent.
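
To compare the two scenarios in one place, here is a rough sketch that wraps the simulation in a function and reports the largest gap between P(E|F) and P(E) across all age groups. The names `simulate` and `max_gap` are hypothetical, and the sketch uses the standard library's `random` module rather than NumPy, so it will not reproduce the exact numbers above. The gap is only an informal indicator, not a statistical test.

```python
import random


def simulate(purchase_probability, n=100_000, seed=0):
    """Simulate n shoppers; purchase_probability maps an age decade to a probability."""
    rng = random.Random(seed)
    decades = [20, 30, 40, 50, 60, 70]
    totals = {d: 0 for d in decades}
    purchases = {d: 0 for d in decades}
    for _ in range(n):
        decade = rng.choice(decades)
        totals[decade] += 1
        if rng.random() < purchase_probability(decade):
            purchases[decade] += 1
    return totals, purchases


def max_gap(totals, purchases):
    """Largest |P(E|F) - P(E)| over all age groups."""
    p_e = sum(purchases.values()) / sum(totals.values())
    return max(abs(purchases[d] / totals[d] - p_e) for d in totals)


# Constant probability: purchase is independent of age.
independent_gap = max_gap(*simulate(lambda decade: 0.4))
# Probability tied to the age decade: purchase depends on age.
dependent_gap = max_gap(*simulate(lambda decade: decade / 100.0))

print(f"independent: {independent_gap:.3f}, dependent: {dependent_gap:.3f}")
```

With a constant purchase probability the gap should stay near zero, reflecting only sampling noise, while tying the probability to age should produce a gap of roughly 0.25.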