# Conditional Probability Activity & Exercise

The code below generates fake data on how much people purchase, broken down by age range.

It generates 100,000 random "people" and randomly assigns them as being in their 20's, 30's, 40's, 50's, 60's, or 70's.

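It then gives younger people a lower probability of buying something: the purchase probability is the age decade divided by 100, so 20% for people in their 20's, rising to 70% for people in their 70's.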

In the end, we have two Python dictionaries:

- "totals" contains the total number of people in each age group.
- "purchases" contains the total number of things purchased by people in each age group.

The grand total of purchases is in totalPurchases, and we know the total number of people is 100,000.

Let's run it and have a look:

In [61]:
from numpy import random
random.seed(0)

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
totalPeople = 100000
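for _ in range(totalPeople):
    # Pick an age decade uniformly at random.
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    # Purchase probability grows with age: 0.2 for 20's up to 0.7 for 70's.
    purchaseProbability = 1.0 * ageDecade / 100
    totals[ageDecade] += 1
    # Record a purchase with that probability.
    if random.random() < purchaseProbability:
        totalPurchases += 1
        purchases[ageDecade] += 1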

In [62]:
totals

{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}

In [63]:
purchases

{20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

In [64]:
totalPurchases

45012
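
To make the age dependence visible, here is a minimal sketch that computes the observed purchase rate for every decade. It assumes the counts printed above, copied in as literals so the snippet runs on its own; the name `rates` is introduced here for illustration only.

```python
# Counts copied from the outputs above so this snippet is self-contained.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

# Observed purchase rate per decade: purchases divided by people in that decade.
rates = {decade: purchases[decade] / totals[decade] for decade in totals}

for decade, rate in rates.items():
    # The simulation configured decade / 100 as the purchase probability.
    print(f"{decade}'s: observed {rate:.3f}, configured {decade / 100:.2f}")
```

Each observed rate should land close to the configured probability for its decade, which is the dependence the rest of this activity measures.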

Let's play with conditional probability.

First let's compute P(E|F), where E is "purchase" and F is "you're in your 30's". The probability of someone in their 30's buying something is just the fraction of people in their 30's who bought something:

In [72]:
PEgF = 1.0 * purchases[30] / totals[30]
print("P(Purchase|30s): " + str(PEgF))

P(Purchase|30s): 0.299295986521


P(E) is the overall probability of buying something, regardless of your age:

In [71]:
PE = 1.0 * totalPurchases / totalPeople
print("P(Purchase): " + str(PE))

P(Purchase): 0.45012


P(F) is just the probability of being in your 30's in this data set:

In [67]:
PF = 1.0 * totals[30] / totalPeople
print("P(30's): " + str(PF))

P(30's): 0.16619


If E and F were independent, we would expect P(E|F) to be about the same as P(E). But they're not: P(E) is 0.45 and P(E|F) is 0.30. That tells us E and F are dependent (which we know they are in this example, because the purchase probability was set from the age).

What is P(E,F)? P(E,F) would be the probability of both being in your 30's and buying something, out of the total population - not just the population of people in their 30's:

In [115]:
PEaF = 1.0 * purchases[30] / totalPeople
print("P(30's,Purchase): " + str(PEaF))

P(30's,Purchase): 0.04974


P(E,F), at about 0.05, is different from P(E|F), at about 0.30. This is no surprise: P(E,F) is measured against the whole population, while P(E|F) only looks at people in their 30's.

If E and F were independent of each other, P(E,F) would equal P(E)xP(F). That is not the case here, since the product comes out at about 0.075 rather than 0.05:

In [40]:
print("P(30's)xP(Purchase): " + str(PE * PF))

P(30's)xP(Purchase): 0.0748054428


We can also check that P(E|F) = P(E,F)/P(F) and sure enough, it is:

In [73]:
print(PEaF / PF)

0.299295986521
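
This check is guaranteed to pass for any data set, because the population size cancels out. As a short derivation, writing N for totalPeople, n_F for totals[30], and n_{E,F} for purchases[30] (notation introduced here, not in the code above):

```latex
\[
\frac{P(E,F)}{P(F)}
  = \frac{n_{E,F}/N}{n_F/N}
  = \frac{n_{E,F}}{n_F}
  = P(E \mid F)
\]
```

With the counts above, that is 4974 / 16619, or about 0.2993.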


## Your Assignment

Modify the code above such that the purchase probability does NOT vary with age, making E and F actually independent.

Then, confirm that P(E|F) is about the same as P(E), showing that the conditional probability of purchase for a given age is no different from the a priori probability of purchase regardless of age.


# Making E and F independent

In [116]:
from numpy import random
random.seed(0)

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
totalPeople = 100000
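for _ in range(totalPeople):
    # Pick an age decade uniformly at random.
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    # Same purchase probability for every age, so E and F are independent.
    purchaseProbability = 0.25
    totals[ageDecade] += 1
    # Record a purchase with that probability.
    if random.random() < purchaseProbability:
        totalPurchases += 1
        purchases[ageDecade] += 1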

In [117]:
totals

{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}

In [118]:
purchases

{20: 4264, 30: 4133, 40: 4155, 50: 4149, 60: 4155, 70: 4203}

In [119]:
totalPurchases

25059

P(E|F), where E is "purchase" and F is "you're in your 30's":

In [120]:
PEgF = 1.0 * purchases[30] / totals[30]
print("P(purchase|30s): " + str(PEgF))

P(purchase|30s): 0.248691256995


P(E) is the overall probability of buying something, regardless of your age:

In [121]:
PE = 1.0 * totalPurchases / totalPeople
print("P(Purchase): " + str(PE))

P(Purchase): 0.25059


P(F) is just the probability of being in your 30's in this data set:

In [122]:
PF = 1.0 * totals[30] / totalPeople
print("P(30's): " + str(PF))

P(30's): 0.16619


Since E and F are independent, P(E|F) should be about the same as P(E), and it is: P(E|F) is about 0.249 and P(E) is about 0.251, both close to the 0.25 we configured.
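
One way to judge whether "about the same" is close enough is to compare the gap with the sampling noise you would expect. The sketch below is a rough yardstick, not a formal test: it assumes each person in their 30's is an independent draw with purchase probability 0.25, and uses the usual binomial standard error for a proportion. The counts are copied from the independent run above.

```python
import math

# Counts copied from the independent run above.
purchases_30s = 4133
people_30s = 16619
total_purchases = 25059
total_people = 100000

p_e_given_f = purchases_30s / people_30s
p_e = total_purchases / total_people

# Assumption: binomial model with the configured probability of 0.25.
standard_error = math.sqrt(0.25 * (1 - 0.25) / people_30s)

gap = abs(p_e_given_f - p_e)
print(f"gap = {gap:.4f}, standard error = {standard_error:.4f}")
```

Under these assumptions the gap is smaller than one standard error, so it is consistent with random sampling noise rather than a real dependence on age.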

P(E,F) would be the probability of both being in your 30's and buying something, out of the total population:

In [123]:
PEaF = 1.0 * purchases[30] / totalPeople
print("P(30's, Purchase): " + str(PEaF))

P(30's, Purchase): 0.04133


Since E and F are independent of each other, P(E,F) should be approximately P(E)xP(F), and it is (about 0.0413 versus 0.0416):

In [126]:
print("P(30's)xP(Purchase): " + str(PE * PF))

P(30's)xP(Purchase): 0.0416455521


We can also check that P(E|F) = P(E,F)/P(F) and sure enough, it still is:

In [125]:
print(PEaF / PF)

0.248691256995
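
The two runs differ only in how the purchase probability is chosen, so the simulation can be wrapped in one function that takes that rule as an argument. This is a sketch under stated assumptions, not the notebook's own code: `simulate_purchases` and `prob_for_decade` are hypothetical names, and it uses NumPy's `default_rng` generator, so its counts will not match the numbers printed above.

```python
import numpy as np

DECADES = np.array([20, 30, 40, 50, 60, 70])

def simulate_purchases(prob_for_decade, n_people=100_000, seed=0):
    """Simulate purchases; return (totals, purchases) dicts keyed by decade.

    prob_for_decade is a hypothetical callback that maps a decade (e.g. 30)
    to a purchase probability between 0 and 1.
    """
    rng = np.random.default_rng(seed)
    # Index of each person's age decade, chosen uniformly at random.
    decade_idx = rng.integers(0, len(DECADES), size=n_people)
    # Look up each person's purchase probability from their decade.
    probs = np.array([prob_for_decade(int(d)) for d in DECADES])
    bought = rng.random(n_people) < probs[decade_idx]
    totals = {int(d): int(np.sum(decade_idx == i))
              for i, d in enumerate(DECADES)}
    purchases = {int(d): int(np.sum(bought & (decade_idx == i)))
                 for i, d in enumerate(DECADES)}
    return totals, purchases

# Dependent case: probability grows with age. Independent case: flat 0.25.
dep_totals, dep_purchases = simulate_purchases(lambda decade: decade / 100)
ind_totals, ind_purchases = simulate_purchases(lambda decade: 0.25)

for decade in ind_totals:
    dep_rate = dep_purchases[decade] / dep_totals[decade]
    ind_rate = ind_purchases[decade] / ind_totals[decade]
    print(f"{decade}'s: dependent {dep_rate:.3f}, independent {ind_rate:.3f}")
```

Passing the rule as a function keeps the dependent and independent experiments side by side, so P(E|F) can be compared across all decades in a single pass.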
