### Class #2: Probability - Examples
In this notebook we'll go over how we simulate data. First we import the necessary packages.

In [1]:
import random
import pandas as pd
import numpy as np

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

How do we simulate random processes using Python?

In [2]:
dice_rolls = np.arange(1,7)
coin_flips = np.arange(1,3)

print(dice_rolls)
print(coin_flips)

[1 2 3 4 5 6]
[1 2]
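Once the possible outcomes are stored in arrays, many draws can be made in a single call rather than one at a time. The sketch below assumes NumPy's newer `Generator` interface (`np.random.default_rng`); the seed and the names `dice_faces`, `coin_sides`, `ten_rolls` and `ten_flips` are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Seeded generator so the draws are reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)

dice_faces = np.arange(1, 7)   # faces 1..6
coin_sides = np.arange(1, 3)   # 1 and 2 stand in for the two sides of a coin

# Draw ten outcomes at once instead of looping
ten_rolls = rng.choice(dice_faces, size=10)
ten_flips = rng.choice(coin_sides, size=10)

print(ten_rolls)
print(ten_flips)
```

Passing `size=` returns an array of independent draws, which is usually faster than appending to a list inside a `for` loop.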


Roll one die

In [5]:
all_rolls=[]
for i in range(1000):
    roll = np.random.choice(dice_rolls)
    all_rolls.append(roll)

data = [go.Histogram(x=all_rolls, histnorm='probability')]
iplot(data)

Roll two dice

In [4]:
all_rolls=[]
for i in range(10000):
    roll_1 = np.random.choice(dice_rolls)
    roll_2 = np.random.choice(dice_rolls)
    all_rolls.append(roll_1+roll_2)

data = [go.Histogram(x=all_rolls, histnorm='probability')]
iplot(data)
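To judge whether the simulated histogram looks right, it helps to know the exact distribution of the sum of two fair dice. A minimal sketch that enumerates all 36 equally likely outcomes (the names `counts` and `exact_probs` are ours):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 ordered pairs produce each total
counts = Counter(a + b for a, b in product(faces, repeat=2))

# Convert counts to exact probabilities
exact_probs = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, p in exact_probs.items():
    print(total, p, round(float(p), 3))
```

The totals peak at 7 (probability 1/6) and fall off symmetrically toward 2 and 12, which is the triangular shape the histogram should approach.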

Pick a random sample from a non-uniform but known discrete distribution. In this case we have a die rigged to favor higher rolls.

In [5]:
all_rolls=[]
for i in range(100):
    roll = np.random.choice(np.arange(1,7),p=[0.05,  0.10,  0.15,  0.20,  0.25,  0.25])
    all_rolls.append(roll)

data = [go.Histogram(x=all_rolls, histnorm='probability')]

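# NOTE: this layout (a vertical line at x = 1) is defined but never passed to iplot below, so it is not drawn
layout = {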
    'shapes': [
        # Line Vertical
        {
            'type': 'line',
            'x0': 1,
            'y0': 0,
            'x1': 1,
            'y1': 2,
            'line': {
                'color': 'rgb(55, 128, 191)',
                'width': 3,
            },
        }
    ]
}

iplot(data)
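With only 100 rolls the histogram can differ noticeably from the weights we specified. As a rough check, we can draw a larger sample and compare observed frequencies with the known probabilities. This is a sketch under our own assumptions: the sample size, the seed, and the names `weights` and `observed` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility

faces = np.arange(1, 7)
weights = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])  # same weights as the rigged die

n_rolls = 100_000  # assumed sample size, larger than the 100 rolls used in the cell
rolls = rng.choice(faces, size=n_rolls, p=weights)

# bincount tallies each face; drop index 0 because no face is 0
observed = np.bincount(rolls, minlength=7)[1:] / n_rolls

for face, p, freq in zip(faces, weights, observed):
    print(f"face {face}: expected {p:.2f}, observed {freq:.3f}")
```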

We're going to use this plotting method several times, so let's define it as a function.

In [6]:
def plot_probs(x_, y_):
    trace = go.Scatter(
        x = x_,
        y = y_
    )

    layout = go.Layout(
        xaxis=dict(
            title='Number of Simulations'
        ),
        yaxis=dict(
            title='Probability'
        )
    )

    fig = go.Figure(data = [trace], layout = layout)
    return fig

Let's denote the probability of a single event (rolling a 4) as:
### P(A4)
How can we simulate and show that this is 1/6 (about 0.167)?

In [7]:
dice_rolls = np.arange(1,7)
count_hits = 0
x_data = []
y_data = []

for num_simulations in range(1, 10001):
    roll = np.random.choice(dice_rolls)
    if roll == 4:
        count_hits += 1
    if (num_simulations % 10) == 0:
        y_data.append(float(count_hits) / float(num_simulations))
        x_data.append(num_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)

This stabilization of the estimate as the number of simulations grows is predicted by the <b>Law of Large Numbers</b>.
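The same running estimate can be computed without an explicit loop. The sketch below uses a cumulative sum of hits divided by the number of rolls so far; it assumes NumPy's `default_rng`, and the names `hits` and `running_estimate` are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed for reproducibility

n = 10_000
rolls = rng.choice(np.arange(1, 7), size=n)

# True where the roll is a 4; cumsum counts hits up to each roll
hits = (rolls == 4)
running_estimate = np.cumsum(hits) / np.arange(1, n + 1)

# Estimates after 10, 100, 1000 and 10000 rolls
for k in (10, 100, 1000, 10000):
    print(k, round(float(running_estimate[k - 1]), 4))
```

Early estimates can be far from 1/6, while later ones should settle close to it.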

For two events, the probability that at least one occurs is given by:
## P(A or B) = P(A) + P(B) - P(A and B)
For our dice example, a single roll cannot be both a 4 and a 3, so the two events are mutually exclusive and P(A4 and A3) = 0:<br>
P(A4 or A3) = P(A4) + P(A3) - P(A4 and A3)<br>
P(A4 or A3) = 1/6 + 1/6 - 0<br>
P(A4 or A3) = 2/6 = 1/3 ≈ 0.333


In [8]:
dice_rolls = np.arange(1,7)
count_hits = 0
x_data = []
y_data = []

for num_simulations in range(1, 10001):
    roll = np.random.choice(dice_rolls)
    if roll == 4 or roll == 3:
        count_hits += 1
    if (num_simulations % 10) == 0:
        y_data.append(float(count_hits) / float(num_simulations))
        x_data.append(num_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)
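The addition rule can also be checked exactly by enumerating the six faces. The sketch below uses a small helper `prob` (our own, hypothetical name) and adds a second pair of events that do overlap, to show why the P(A and B) term is subtracted.

```python
from fractions import Fraction

faces = range(1, 7)

def prob(event):
    """Exact probability of an event (a predicate on a face) for one fair die."""
    return Fraction(sum(1 for f in faces if event(f)), len(faces))

# Mutually exclusive events: rolling a 4, rolling a 3
p_a4 = prob(lambda f: f == 4)
p_a3 = prob(lambda f: f == 3)
p_both = prob(lambda f: f == 4 and f == 3)      # impossible on one roll
p_either = prob(lambda f: f == 4 or f == 3)
print(p_either, p_a4 + p_a3 - p_both)

# Overlapping events: even roll, roll greater than 3 (they share 4 and 6)
p_even = prob(lambda f: f % 2 == 0)
p_high = prob(lambda f: f > 3)
p_even_and_high = prob(lambda f: f % 2 == 0 and f > 3)
p_even_or_high = prob(lambda f: f % 2 == 0 or f > 3)
print(p_even_or_high, p_even + p_high - p_even_and_high)
```

In the second case, adding P(even) and P(high) alone would count the faces 4 and 6 twice; subtracting the overlap corrects for that.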

When two events are independent, the probability that both occur is given by:
## P(A and B) = P(A) * P(B)
So, for our dice example, let's say we have two dice. The probability of rolling sixes on both can be stated as:<br>
P(A6 and B6) = P(A6) * P(B6)<br>
P(A6 and B6) = 1/6 * 1/6 = 1/36 = 0.0278

In [9]:
dice_rolls = np.arange(1,7)
count_hits = 0
x_data = []
y_data = []

for num_simulations in range(1, 10001):
    roll_1 = np.random.choice(dice_rolls)
    roll_2 = np.random.choice(dice_rolls)
    if roll_1 == 6 and roll_2 == 6:
        count_hits += 1
    if (num_simulations % 10) == 0:
        y_data.append(float(count_hits) / float(num_simulations))
        x_data.append(num_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)
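As an exact cross-check of the simulation, we can enumerate all 36 outcomes of two fair dice and confirm that the joint probability equals the product of the individual probabilities. The helper `prob` is a hypothetical name introduced for this sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die A, die B) outcomes
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a predicate on an outcome pair)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_a6 = prob(lambda o: o[0] == 6)
p_b6 = prob(lambda o: o[1] == 6)
p_a6_and_b6 = prob(lambda o: o[0] == 6 and o[1] == 6)

print(p_a6_and_b6, p_a6 * p_b6)
```

The product rule holds here because the dice do not influence each other; for dependent events it would not.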

Next, we can consider the probability of one event given that another has been observed. Provided P(B) > 0, we describe this as:
## P(A | B) = P(A and B) / P(B)
Let's look at this scenario for our dice:<br>
P(A6 | B6) = P(A6 and B6) / P(B6)<br>
P(A6 | B6) = (1/36) / (1/6) = 1/6 = 0.167<br>
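Because the two dice are independent, knowing that one die shows a 6 does not change the probability that the other does: P(A6 | B6) = P(A6). The simulation below only counts trials in which the first roll is a 6, and estimates how often the second roll is also a 6.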

In [10]:
dice_rolls = np.arange(1,7)
effective_simulations = 0
count_hits = 0
x_data = []
y_data = []

while effective_simulations < 10000:
    roll_1 = np.random.choice(dice_rolls)
    roll_2 = np.random.choice(dice_rolls)
    if roll_1 == 6: 
        effective_simulations += 1
        if roll_2 == 6:
            count_hits += 1
        if (effective_simulations % 10) == 0:
            y_data.append(float(count_hits) / float(effective_simulations))
            x_data.append(effective_simulations)

# Plot the figure
fig = plot_probs(x_data, y_data)
iplot(fig)
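The definition of conditional probability can be applied directly to the 36 outcomes of two dice. The sketch below defines hypothetical helpers `prob` and `conditional`, and contrasts the independent case above with a dependent one (a total of at least 10, given that the first die is a 6), which we add purely for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a predicate on an outcome pair)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(event, given):
    """P(event | given) = P(event and given) / P(given); assumes P(given) > 0."""
    return prob(lambda o: event(o) and given(o)) / prob(given)

def a6(o):
    return o[0] == 6            # first die shows 6

def b6(o):
    return o[1] == 6            # second die shows 6

def high_sum(o):
    return o[0] + o[1] >= 10    # total of at least 10

# Independent events: conditioning changes nothing
print(conditional(a6, b6), prob(a6))

# Dependent events: knowing the first die is a 6 raises the chance of a high total
print(conditional(high_sum, a6), prob(high_sum))
```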

### The M&M Problem:
Blue M&M’s were introduced in 1995. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.

Suppose a friend of mine has two bags of M&M’s, and he tells me that one is from 1994 and one from 1996. He won’t tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow one came from the 1994 bag?

In [12]:
colours_old = ["Brown", "Yellow", "Red", "Green", "Orange", "Tan"]
pOld = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
colours_new = ["Blue", "Green", "Orange", "Yellow", "Red", "Brown"]
pNew = [0.24, 0.2, 0.16, 0.14, 0.13, 0.13]
one_yellow_one_green = 0
old_yellow_new_green = 0

for mnm_select in range(1,100001):
    m_from_old = np.random.choice(colours_old, p=pOld)
    m_from_new = np.random.choice(colours_new, p=pNew)
    
    if (m_from_old == "Yellow") | (m_from_new == "Yellow"):
        if (m_from_old == "Green") | (m_from_new == "Green"):
            one_yellow_one_green += 1
            if (m_from_old == "Yellow"):
                old_yellow_new_green += 1

# Output the probability
print("Prob is %.3f" % (float(old_yellow_new_green) / 
                        (float(one_yellow_one_green))))

Prob is 0.736
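The simulated value can be compared with the exact answer from Bayes' rule. There are two ways to see one yellow and one green: yellow from 1994 and green from 1996, or yellow from 1996 and green from 1994. Each is equally likely beforehand, so the posterior depends only on the two likelihoods. A minimal sketch with our own variable names:

```python
from fractions import Fraction

# Color proportions taken from the problem statement
p_yellow_1994 = Fraction(20, 100)
p_green_1994 = Fraction(10, 100)
p_yellow_1996 = Fraction(14, 100)
p_green_1996 = Fraction(20, 100)

# Likelihood of the observation under each hypothesis
like_yellow_from_1994 = p_yellow_1994 * p_green_1996   # yellow 1994, green 1996
like_yellow_from_1996 = p_yellow_1996 * p_green_1994   # yellow 1996, green 1994

# Equal priors cancel, leaving a ratio of likelihoods
posterior = like_yellow_from_1994 / (like_yellow_from_1994 + like_yellow_from_1996)

print(posterior, round(float(posterior), 3))
```

The exact value is 20/27, about 0.741, so a simulated estimate near 0.74 is consistent with it.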


## Examples from Open Intro Stats, Chapter 2

<b>2.25</b> It’s never lupus. Lupus is a medical phenomenon where antibodies that are supposed to attack foreign cells to prevent infections instead see plasma proteins as foreign bodies, leading to a high risk of blood clotting. It is believed that 2% of the population suffer from this disease. The test is 98% accurate if a person actually has the disease. The test is 74% accurate if a person does not have the disease. There is a line from the Fox television show House that is often used after a patient tests positive for lupus: “It’s never lupus.” Do you think there is truth to this statement? Use appropriate probabilities to support your answer.

In [25]:
has_lupus = 0.02
no_lupus = 0.98

has_lupus_tests_pos = 0.98
has_lupus_tests_neg = 0.02
no_lupus_tests_pos = 0.26
no_lupus_tests_neg = 0.74

# Given that the test for lupus is positive, what is the probability you actually have the disease?
# Joint probabilities: P(positive and disease) and P(positive and no disease)
pos_test_and_disease = has_lupus * has_lupus_tests_pos
pos_test_and_no_disease = no_lupus * no_lupus_tests_pos
# Bayes' rule: P(disease | positive) = P(positive and disease) / P(positive)
disease_given_pos_test = pos_test_and_disease / (pos_test_and_no_disease + pos_test_and_disease)

output_percent = disease_given_pos_test * 100

print("If you have a positive test, there is a %.2f%% chance that you have the disease" % output_percent)

If you have a positive test, there is a 7.14% chance that you have the disease
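In the spirit of the earlier cells, the same result can be checked by simulation: generate a large population, assign disease status and test results with the stated rates, and look only at the people who test positive. This is a sketch; the population size, the seed, and the variable names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed for reproducibility
n_people = 1_000_000                 # assumed population size

# 2% of the population has lupus
has_lupus = rng.random(n_people) < 0.02

# Positive test with probability 0.98 if diseased, 0.26 if not
u = rng.random(n_people)
tests_positive = np.where(has_lupus, u < 0.98, u < 0.26)

# Among positive tests, what fraction actually has the disease?
p_lupus_given_positive = has_lupus[tests_positive].mean()

print("Simulated P(lupus | positive test) = %.4f" % p_lupus_given_positive)
```

The simulated fraction should land close to the 7.14% computed above, since most positive tests come from the much larger healthy group.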


<b>2.35</b> Hearts win. In a new card game, you start with a well-shuffled full deck and draw 3 cards without replacement. If you draw 3 hearts, you win \$50. If you draw 3 black cards, you win \$25. For any other draws, you win nothing.<br>
(a) Create a probability model for the amount you win at this game, and find the expected winnings. Also compute the standard deviation of this distribution.<br>
(b) If the game costs \$5 to play, what would be the expected value and standard deviation of the net profit (or loss)? (Hint: profit = winnings - cost; X - 5)<br>
(c) If the game costs \$5 to play, should you play this game? Explain.

In [40]:
odds_of_three_hearts = 13/52 * 12/51 * 11/50
odds_of_three_black_cards = 26/52 * 25/51 * 24/50
odds_of_losing = 1 - odds_of_three_hearts - odds_of_three_black_cards

#a) How much can you expect to win:
expected_win_value = odds_of_three_hearts * 50 + odds_of_three_black_cards * 25
print("You can expect to win $%.2f" % expected_win_value)

variance_in_win_value = ((50.0 - expected_win_value)**2 * odds_of_three_hearts) + \
                        ((25.0 - expected_win_value)**2 * odds_of_three_black_cards) + \
                        ((0 - expected_win_value)**2 * odds_of_losing)

std_in_win_value = np.sqrt(variance_in_win_value)

print("The variance in the win value is $%.2f" % variance_in_win_value)
print("The standard deviation in the win value is $%.2f" % std_in_win_value)

#b) Expected value and SD of the net profit (or loss)
cost_to_play = 5
expected_value = expected_win_value - cost_to_play

print("We expect each time we play to have a net gain of $%.2f, with a standard deviation of %.2f" % (expected_value, std_in_win_value))

#c) Should you play
print("Since the net gain is negative, it is not recommendable to play")

You can expect to win $3.59
The variance in the win value is $93.01
The standard deviation in the win value is $9.64
We expect each time we play to have a net gain of $-1.41, with a standard deviation of 9.64
Since the net gain is negative, it is not recommendable to play
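Part (b) relies on the fact that subtracting a constant cost shifts the expected value but leaves the standard deviation unchanged. The sketch below checks this by computing both directly from the probability model, using `math.comb` to count hands; the helper `mean_and_sd` is a hypothetical name introduced here.

```python
from math import comb, sqrt

# Number of 3-card hands from a 52-card deck
total_hands = comb(52, 3)

p_hearts = comb(13, 3) / total_hands   # three hearts
p_black = comb(26, 3) / total_hands    # three black cards
p_nothing = 1 - p_hearts - p_black

def mean_and_sd(values, probs):
    """Mean and standard deviation of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    return mean, sqrt(variance)

winnings = [50, 25, 0]
probs = [p_hearts, p_black, p_nothing]
cost = 5

mean_win, sd_win = mean_and_sd(winnings, probs)
mean_net, sd_net = mean_and_sd([w - cost for w in winnings], probs)

print("Winnings:   mean %.2f, sd %.2f" % (mean_win, sd_win))
print("Net profit: mean %.2f, sd %.2f" % (mean_net, sd_net))
```

The two standard deviations agree because every outcome moves by the same \$5, so the spread around the mean is unaffected.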
