# Conditional Probability Activity & Exercise

The code below creates some fake data on how much stuff people purchase given their age range.

It generates 100,000 random "people" and randomly assigns them as being in their 20's, 30's, 40's, 50's, 60's, or 70's.

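It then assigns a purchase probability that grows with age: each person buys something with probability equal to their age decade divided by 100 (20% for people in their 20's, up to 70% for people in their 70's).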

In the end, we have two Python dictionaries:

"totals" contains the total number of people in each age group.
"purchases" contains the total number of things purchased by people in each age group.
The grand total of purchases is in totalPurchases, and we know the total number of people is 100,000.

Let's run it and have a look:

In [1]:
from numpy import random
random.seed(0)  # seed the generator so every run produces the same (reproducible) pseudo-random results

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = float(ageDecade) / 100.0
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1

In [2]:
totals

{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}

In [3]:
purchases

{20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

In [4]:
totalPurchases

45012
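
As an aside, the same simulation can be sketched without an explicit loop. This is only an illustration, not the notebook's method: it assumes NumPy's newer `default_rng` generator, whose random stream differs from `random.seed(0)`, so the counts will be close to, but not identical to, the ones printed above. The names `totals_vec`, `purchases_vec`, and `total_purchases_vec` are hypothetical.

```python
import numpy as np

# Assumption: a fresh Generator seeded with 0. Its stream is not the same as
# numpy.random.seed(0), so exact counts will differ from the loop version.
rng = np.random.default_rng(0)
n_people = 100000
decades = np.array([20, 30, 40, 50, 60, 70])

# Draw every person's age decade at once, then one uniform number per person.
ages = rng.choice(decades, size=n_people)
bought = rng.random(n_people) < ages / 100.0

# Count people and purchases per decade.
totals_vec = {int(d): int(np.sum(ages == d)) for d in decades}
purchases_vec = {int(d): int(np.sum(bought & (ages == d))) for d in decades}
total_purchases_vec = int(bought.sum())

print(totals_vec)
print(purchases_vec)
print(total_purchases_vec)
```

The overall purchase rate should again land near 0.45, the average of the six per-decade probabilities.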

Let's play with conditional probability.

First let's compute P(E|F), where E is "purchase" and F is "you're in your 30's". The probability of someone in their 30's buying something is just the fraction of people in their 30's who bought something:

In [5]:
PEF = float(purchases[30]) / float(totals[30])
print('P(purchase | 30s): ' + str(PEF))

P(purchase | 30s): 0.299295986521
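
The same calculation can be repeated for every age group. The sketch below hard-codes the counts printed by cells In [2] and In [3] so that it runs on its own; `cond_by_decade` is a hypothetical name introduced for illustration.

```python
# Counts copied from the outputs of cells In [2] and In [3] above.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

# P(purchase | decade) = purchases in that decade / people in that decade
cond_by_decade = {d: float(purchases[d]) / totals[d] for d in sorted(totals)}

for decade in sorted(cond_by_decade):
    print("P(purchase | %ds): %.4f" % (decade, cond_by_decade[decade]))
```

Each value should sit close to the decade divided by 100, which is the purchase probability the simulation assigned to that group.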


P(F) is just the probability of being in your 30's in this data set:

In [6]:
PF = float(totals[30]) / 100000.0
print("P(30's): " +  str(PF))

P(30's): 0.16619


And P(E) is the overall probability of buying something, regardless of your age:

In [7]:
PE = float(totalPurchases) / 100000.0
print("P(Purchase):" + str(PE))

P(Purchase):0.45012


If E and F were independent, then we would expect P(E|F) to be about the same as P(E). But they're not: P(E) is about 0.45, and P(E|F) is about 0.3. That tells us that E and F are dependent (which we know they are in this example).

P(E,F) is different from P(E|F). P(E,F) would be the probability of both being in your 30's and buying something, out of the total population - not just the population of people in their 30's:

In [8]:
print("P(30's, Purchase): " + str(float(purchases[30]) / 100000.0))

P(30's, Purchase): 0.04974


Let's also compute the product of P(E) and P(F), P(E)P(F):

In [9]:
print("P(30's)P(Purchase): " + str(PE * PF))

P(30's)P(Purchase): 0.0748054428


Something you may learn in stats is that P(E,F) = P(E)P(F), but this assumes E and F are independent. We've found here that P(E,F) is about 0.05, while P(E)P(F) is about 0.075. So when E and F are dependent - and we have a conditional probability going on - we can't just say that P(E,F) = P(E)P(F).
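
To see that this gap is not specific to the 30's, here is a minimal sketch that compares P(E,F) with P(E)P(F) for every decade. It reuses the counts printed earlier; `gaps` is a hypothetical name.

```python
# Counts copied from the outputs of cells In [2], In [3], and In [4] above.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}
total_purchases = 45012
n_people = 100000.0

p_e = total_purchases / n_people  # P(E), overall purchase probability

gaps = {}
for d in sorted(totals):
    p_f = totals[d] / n_people         # P(F): being in this decade
    p_joint = purchases[d] / n_people  # P(E,F): in this decade AND purchased
    gaps[d] = p_joint - p_e * p_f      # zero only if E and F were independent
    print("%ds: P(E,F)=%.5f  P(E)P(F)=%.5f  gap=%+.5f"
          % (d, p_joint, p_e * p_f, gaps[d]))
```

The younger decades come out below the product and the older ones above it, which matches how the purchase probability was constructed.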

We can also check that P(E|F) = P(E,F)/P(F), which is the relationship we showed in the slides - and sure enough, it is:

In [10]:
print((purchases[30] / 100000.0) / PF)

0.299295986521
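
One consequence of this relationship is that weighting each decade's conditional probability by the probability of that decade and summing recovers the overall P(E). The sketch below checks this on the printed counts; the names `p_e_direct` and `p_e_total` are hypothetical.

```python
# Counts copied from the outputs of cells In [2], In [3], and In [4] above.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}
n_people = 100000.0

# P(E) computed directly from the grand total of purchases.
p_e_direct = 45012 / n_people

# P(E) rebuilt as the sum over decades of P(E|F) * P(F).
p_e_total = sum(
    (float(purchases[d]) / totals[d]) * (totals[d] / n_people)
    for d in totals
)

print(p_e_direct)
print(p_e_total)
```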


## Your Assignment

Modify the code above such that the purchase probability does NOT vary with age, making E and F actually independent.

Then, confirm that P(E|F) is about the same as P(E), showing that the conditional probability of purchase for a given age is not any different than the a-priori probability of purchase regardless of age.


In [11]:
from numpy import random
random.seed(0)  # seed the generator so every run produces the same (reproducible) pseudo-random results

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = 0.4
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1

In [12]:
totals

{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}

In [13]:
purchases

{20: 6710, 30: 6627, 40: 6670, 50: 6665, 60: 6638, 70: 6720}

In [14]:
totalPurchases

40030

In [15]:
PEF = float(purchases[30]) / float(totals[30])
print('P(purchase | 30s): ' + str(PEF))

P(purchase | 30s): 0.398760454901


In [16]:
PE = float(totalPurchases) / 100000.0
print("P(Purchase):" + str(PE))

P(Purchase):0.4003
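
With a constant purchase probability, P(E|F) for the 30's (about 0.399) is essentially equal to P(E) (about 0.400). As a rough check that this holds for every age group, the sketch below uses the counts printed by cells In [12] to In [14]; `diffs` and `max_abs_diff` are hypothetical names.

```python
# Counts copied from the outputs of cells In [12], In [13], and In [14] above.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 6710, 30: 6627, 40: 6670, 50: 6665, 60: 6638, 70: 6720}
total_purchases = 40030
n_people = 100000.0

p_e = total_purchases / n_people  # overall P(E)

# Difference between P(E | decade) and P(E) for each decade.
diffs = {d: float(purchases[d]) / totals[d] - p_e for d in sorted(totals)}
for d in sorted(diffs):
    print("%ds: P(E|F) - P(E) = %+.4f" % (d, diffs[d]))

max_abs_diff = max(abs(v) for v in diffs.values())
print("largest gap: %.4f" % max_abs_diff)
```

The remaining gaps are sampling noise: they stay well under one percentage point, compared with a gap of about 15 points in the age-dependent version.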


## New Example

In [17]:
import pandas as pd
import numpy as np
# Load the data set; pandas reads empty fields as NaN by default
data = pd.read_csv('C:\\Users\\Diego Alves\\Desktop\\Data_sets\\advertising.csv',header=0,encoding = 'unicode_escape')

In [18]:
SALES = data[['sales']]
TV = data[['TV']]

In [19]:
from itertools import product
list(product([TV], [SALES]))


[(        TV
  0    230.1
  1     44.5
  2     17.2
  3    151.5
  4    180.8
  5      8.7
  6     57.5
  7    120.2
  8      8.6
  9    199.8
  10    66.1
  11   214.7
  12    23.8
  13    97.5
  14   204.1
  15   195.4
  16    67.8
  17   281.4
  18    69.2
  19   147.3
  20   218.4
  21   237.4
  22    13.2
  23   228.3
  24    62.3
  25   262.9
  26   142.9
  27   240.1
  28   248.8
  29    70.6
  ..     ...
  170   50.0
  171  164.5
  172   19.6
  173  168.4
  174  222.4
  175  276.9
  176  248.4
  177  170.2
  178  276.7
  179  165.6
  180  156.6
  181  218.5
  182   56.2
  183  287.6
  184  253.8
  185  205.0
  186  139.5
  187  191.1
  188  286.0
  189   18.7
  190   39.5
  191   75.5
  192   17.2
  193  166.8
  194  149.7
  195   38.2
  196   94.2
  197  177.0
  198  283.6
  199  232.1
  
  [200 rows x 1 columns],      sales
  0     22.1
  1     10.4
  2      9.3
  3     18.5
  4     12.9
  5      7.2
  6     11.8
  7     13.2
  8      4.8
  9     10.6
  10     8.6
  11    17.

In [20]:
n = 5
omega = set(product(['Not Invest', 'Invest'], repeat=n))

In [21]:
omega

{('Invest', 'Invest', 'Invest', 'Invest', 'Invest'),
 ('Invest', 'Invest', 'Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Invest', 'Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Invest', 'Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Invest', 'Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Not Invest', 'Not Invest'),
 ('Not Invest', 'Invest'

In [22]:
len(omega)

32

In [31]:
A = {om for om in omega if om[0] == 'Invest'}
B = {om for om in omega if om.count('Invest') == 2}

In [32]:
A

{('Invest', 'Invest', 'Invest', 'Invest', 'Invest'),
 ('Invest', 'Invest', 'Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Invest', 'Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Invest', 'Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Invest', 'Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Invest', 'Not Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Not Invest', 'Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Not Invest', 'Not Invest')}

In [33]:
B

{('Invest', 'Invest', 'Not Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Invest', 'Not Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Invest', 'Not Invest'),
 ('Invest', 'Not Invest', 'Not Invest', 'Not Invest', 'Invest'),
 ('Not Invest', 'Invest', 'Invest', 'Not Invest', 'Not Invest'),
 ('Not Invest', 'Invest', 'Not Invest', 'Invest', 'Not Invest'),
 ('Not Invest', 'Invest', 'Not Invest', 'Not Invest', 'Invest'),
 ('Not Invest', 'Not Invest', 'Invest', 'Invest', 'Not Invest'),
 ('Not Invest', 'Not Invest', 'Invest', 'Not Invest', 'Invest'),
 ('Not Invest', 'Not Invest', 'Not Invest', 'Invest', 'Invest')}

In [34]:
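def prob(X):
    # float() forces true division; under Python 2, int / int truncates to 0
    return float(len(X)) / len(omega)

In [35]:
def cond_prob(X, Y):
    # P(X | Y) = |X & Y| / |Y| when all outcomes are equally likely
    return float(len(X & Y)) / len(Y)

In [36]:
prob(A)

0.5

In [37]:
prob(B)

0.3125

In [38]:
cond_prob(A, B)

0.4

In [42]:
prob(A & B)

0.125

In [43]:
prob(A) * prob(B)

0.15625

In [44]:
def are_indep(X, Y):
    return prob(X & Y) == prob(X) * prob(Y)

In [45]:
are_indep(A, B)

False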
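
Because these probabilities are ratios of set sizes, they can also be computed exactly. The sketch below is one possible variant, not part of the original notebook: it uses the standard-library `fractions.Fraction` so that neither integer-division truncation nor floating-point rounding can affect the equality test for independence. The `*_exact` function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Rebuild the sample space and events so the block runs on its own.
n = 5
omega = set(product(['Not Invest', 'Invest'], repeat=n))
A = {om for om in omega if om[0] == 'Invest'}        # first decision is Invest
B = {om for om in omega if om.count('Invest') == 2}  # exactly two Invest

def prob_exact(X):
    # Exact probability of X under equally likely outcomes.
    return Fraction(len(X), len(omega))

def cond_prob_exact(X, Y):
    # Exact P(X | Y) = |X & Y| / |Y|.
    return Fraction(len(X & Y), len(Y))

def are_indep_exact(X, Y):
    # Exact comparison of P(X & Y) with P(X) * P(Y).
    return prob_exact(X & Y) == prob_exact(X) * prob_exact(Y)

print(prob_exact(A))          # 1/2
print(prob_exact(B))          # 5/16
print(cond_prob_exact(A, B))  # 2/5
print(are_indep_exact(A, B))  # False
```

Here P(A, B) is 1/8 while P(A)P(B) is 5/32, so A and B are not independent.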

In [47]:
C = {om for om in omega if om[1] == 'Invest'}