<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools</div>
<div style="text-align: right">Dino Konstantopoulos, 31 January 2019</div>

Today we run a lab to make you comfortable working with probabilities by writing code instead of doing math. Please rearrange yourselves to work in pairs, each on your own laptop. **Do not advance past your teammate** at any time. This lab may count as homework.

<br />
<center>
<img src = ipynb.images/pairs.png width = 500 />
</center>

# Two Exams

<br />
<center>
<img src = ipynb.images/exam.jpg width = 500 />
</center>

You have two tests scheduled for the semester: a midterm exam and a final exam. (Actually, I got rid of your final exam for the semester, but let's assume it's still on.)

- What's the probability that you'll pass *both* exams? 
- What's the probability you'll pass both exams if one of them is held on a Friday?

<center>
<img src = ipynb.images/friday.jpg width = 300 />
</center>

## Part 1

Let's import our probability counting framework which we introduced in class. We use the function `P` and the class `ProbDist`.

In [1]:
from fractions import Fraction

def P(pred, dist):
    "Probability that pred is true, given probability distribution dist."
    return sum(dist[event] for event in dist if pred(event))

class ProbDist(dict):
    "A discrete probability distribution: an {outcome: probability} mapping."
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        if isinstance(total, int):
            total = Fraction(total, 1)  # keep exact arithmetic for integer weights
        for key in self:  # normalize so the probabilities sum to 1
            self[key] = self[key] / total

    def __and__(self, pred):  # invoked as `probdist & pred`
        "New ProbDist restricted to the outcomes where pred is true."
        return ProbDist({event: self[event] for event in self if pred(event)})

print('hello world')

hello world
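
As a quick sanity check of the framework, here is a minimal sketch on a fair six-sided die. The die is a hypothetical example and not part of the lab; `die`, `even`, and `high` are illustrative names. `P` and `ProbDist` are repeated from the cell above only so that the snippet runs on its own.

```python
from fractions import Fraction

def P(pred, dist):
    "Probability that pred is true, given distribution dist."
    return sum(dist[event] for event in dist if pred(event))

class ProbDist(dict):
    "An {outcome: probability} mapping, normalized to sum to 1."
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        if isinstance(total, int):
            total = Fraction(total, 1)
        for key in self:
            self[key] = self[key] / total

    def __and__(self, pred):
        return ProbDist({e: self[e] for e in self if pred(e)})

# Hypothetical example: six equally weighted faces.
die = ProbDist({face: 1 for face in range(1, 7)})

def even(face): return face % 2 == 0
def high(face): return face >= 4

p_even = P(even, die)                   # plain probability
p_even_given_high = P(even, die & high) # conditional: restrict, then ask
print(p_even, p_even_given_high)        # 1/2 2/3
```

The `&` operator restricts the sample space to the outcomes satisfying the condition and renormalizes, so conditional probabilities need no extra machinery.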


Now write a function that returns a **joint distribution** of two variables $X$ and $Y$, leaving out the pairs in which both take the same value. Each variable can be thought of as a Python **set** of outcomes.

In [2]:
def joint_exclusive(A, B, sep=' '):
    """The joint distribution of two independent probability distributions,
    restricted to pairs with a != b (e.g. two different drivers).
    Result is all entries of the form {a+sep+b: P(a)*P(b)}, which ProbDist
    then renormalizes to sum to 1."""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B
                    if a != b})
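
To see what excluding the `a == b` pairs does to the numbers, here is a dependency-free sketch of the same computation on three made-up drivers. The names `points`, `raw`, `mass`, and `pairs` and the point totals are assumptions for illustration only; the rescaling step mirrors what the `ProbDist` constructor does when it normalizes.

```python
from fractions import Fraction

# Hypothetical point totals for three made-up drivers.
points = {'A': 3, 'B': 2, 'C': 1}
total = sum(points.values())
prob = {d: Fraction(p, total) for d, p in points.items()}

# Product of marginals for every ordered pair of *different* drivers.
raw = {a + ' ' + b: prob[a] * prob[b]
       for a in prob for b in prob if a != b}

# The excluded diagonal (A A, B B, C C) carried some probability mass,
# so the remaining entries are rescaled to sum to 1.
mass = sum(raw.values())
pairs = {k: v / mass for k, v in raw.items()}

print(mass)                 # 11/18
print(pairs['A B'])         # 3/11
print(sum(pairs.values()))  # 1
```

Because of the rescaling, each entry is no longer exactly `P(a)*P(b)`; it is that product divided by the total mass of the off-diagonal pairs.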

Use this function to build the joint distribution of any two F1 drivers finishing in 1st and 2nd place. Call this probability distribution `OneTwo`:

OneTwo = 

In [3]:
RGP = ProbDist(
    LH = 281,
    SV = 241,
    KR = 174,
    VB = 171,
    MV = 148,
    DR = 126,
    NH = 53,
    FA = 50,
    KM = 49,
    SP = 46,
    EO = 45,
    CS = 38,
    PG = 28,
    RG = 27,
    CL = 15,
    SD = 8,
    LS = 6,
    ME = 6,
    BH = 2,
    SS = 1)

In [4]:
OneTwo = joint_exclusive(RGP, RGP)
import random
random.sample(list(OneTwo), 10)

['PG KR',
 'KM SD',
 'ME KM',
 'CS SS',
 'LS SV',
 'VB EO',
 'BH SV',
 'EO PG',
 'KR SS',
 'DR SS']

In [5]:
OneTwo

{'LH SV': Fraction(67721, 2045712),
 'LH KR': Fraction(8149, 340952),
 'LH VB': Fraction(16017, 681904),
 'LH MV': Fraction(10397, 511428),
 'LH DR': Fraction(5901, 340952),
 'LH NH': Fraction(14893, 2045712),
 'LH FA': Fraction(7025, 1022856),
 'LH KM': Fraction(13769, 2045712),
 'LH SP': Fraction(281, 44472),
 'LH EO': Fraction(4215, 681904),
 'LH CS': Fraction(5339, 1022856),
 'LH PG': Fraction(1967, 511428),
 'LH RG': Fraction(2529, 681904),
 'LH CL': Fraction(1405, 681904),
 'LH SD': Fraction(281, 255714),
 'LH LS': Fraction(281, 340952),
 'LH ME': Fraction(281, 340952),
 'LH BH': Fraction(281, 1022856),
 'LH SS': Fraction(281, 2045712),
 'SV LH': Fraction(67721, 2045712),
 'SV KR': Fraction(6989, 340952),
 'SV VB': Fraction(13737, 681904),
 'SV MV': Fraction(8917, 511428),
 'SV DR': Fraction(5061, 340952),
 'SV NH': Fraction(12773, 2045712),
 'SV FA': Fraction(6025, 1022856),
 'SV KM': Fraction(11809, 2045712),
 'SV SP': Fraction(241, 44472),
 'SV EO': Fraction(3615, 681904),
 'S

Now write a Python function called `Uniform`, which returns a uniform probability distribution. Look up *uniform probability distribution* on Wikipedia if you need a refresher.

In [6]:
# import uniform distribution
from scipy.stats import uniform

In [7]:
def joint(A, B, sep=' '):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B})


In [8]:
def Uniform(outcomes): return ProbDist({event: 1 for event in outcomes})

We're now ready to start with our lab. Let's describe the experiment, the sample space, and do some counting!

**Test Experiment**: At the end of the semester, a student is chosen at random in this class. The student is asked if he passed at least one of his exams for the class. "Yes", he says.

We'll use the letter 'P' for `pass`, and the letter 'F' for `fail` to describe the outcome of an exam. 

Assume each exam is either a pass or a fail, and that pass and fail are **equiprobable**. Write down the probability distribution describing the outcome of two exams. Call it **two_exams**. This is the sample space to use when we don't care about the day of the week on which an exam is held.

In [9]:
two_exams = Uniform({'PP', 'PF', 'FP', 'FF'})

Now, let's start caring about the day of the week that the exams are held on! Assume 7 days in the week. Yes, Professor Dino sometimes schedules exams on Sundays! In fact, Professor Dino is so random that an exam is equally likely to be scheduled on any day of the week.

Write down the probability distribution for one exam, taken any day of the week. Call it  **one_exam_week**.

Write down the probability distribution for two exams, each taken on any day of the week. Call it  **two_exams_week**.

Number the days of the week starting on Sunday: Sunday is 1, so Friday is 6.

What's the number of possible outcomes for **two_exams_week**?
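
One way to check the count independently of the framework is to enumerate the outcomes with `itertools.product`. This is a minimal sketch; the names `one_exam_outcomes` and `two_exam_outcomes` are illustrative and the labels simply follow the `'P 6'` naming convention used in this lab.

```python
from itertools import product

results = 'PF'      # pass or fail
days = '1234567'    # Sunday = 1, ..., Saturday = 7

# One exam: a result paired with a day, e.g. 'P 6'.
one_exam_outcomes = [r + ' ' + d for r, d in product(results, days)]

# Two exams: every ordered pair of single-exam outcomes, e.g. 'P 6 F 5'.
two_exam_outcomes = [a + ' ' + b
                     for a, b in product(one_exam_outcomes, repeat=2)]

print(len(one_exam_outcomes), len(two_exam_outcomes))  # 14 196
```

Each exam has 2 × 7 = 14 outcomes, so two exams have 14 × 14 = 196.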

In [10]:
one_exam_week = joint(Uniform('PF'), Uniform('1234567'))

In [11]:
one_exam_week

{'P 1': Fraction(1, 14),
 'P 2': Fraction(1, 14),
 'P 3': Fraction(1, 14),
 'P 4': Fraction(1, 14),
 'P 5': Fraction(1, 14),
 'P 6': Fraction(1, 14),
 'P 7': Fraction(1, 14),
 'F 1': Fraction(1, 14),
 'F 2': Fraction(1, 14),
 'F 3': Fraction(1, 14),
 'F 4': Fraction(1, 14),
 'F 5': Fraction(1, 14),
 'F 6': Fraction(1, 14),
 'F 7': Fraction(1, 14)}

In [12]:
two_exams_week = joint(one_exam_week, one_exam_week)
two_exams_week
# random.sample(list(two_exams_week), 10)

{'P 1 P 1': Fraction(1, 196),
 'P 1 P 2': Fraction(1, 196),
 'P 1 P 3': Fraction(1, 196),
 'P 1 P 4': Fraction(1, 196),
 'P 1 P 5': Fraction(1, 196),
 'P 1 P 6': Fraction(1, 196),
 'P 1 P 7': Fraction(1, 196),
 'P 1 F 1': Fraction(1, 196),
 'P 1 F 2': Fraction(1, 196),
 'P 1 F 3': Fraction(1, 196),
 'P 1 F 4': Fraction(1, 196),
 'P 1 F 5': Fraction(1, 196),
 'P 1 F 6': Fraction(1, 196),
 'P 1 F 7': Fraction(1, 196),
 'P 2 P 1': Fraction(1, 196),
 'P 2 P 2': Fraction(1, 196),
 'P 2 P 3': Fraction(1, 196),
 'P 2 P 4': Fraction(1, 196),
 'P 2 P 5': Fraction(1, 196),
 'P 2 P 6': Fraction(1, 196),
 'P 2 P 7': Fraction(1, 196),
 'P 2 F 1': Fraction(1, 196),
 'P 2 F 2': Fraction(1, 196),
 'P 2 F 3': Fraction(1, 196),
 'P 2 F 4': Fraction(1, 196),
 'P 2 F 5': Fraction(1, 196),
 'P 2 F 6': Fraction(1, 196),
 'P 2 F 7': Fraction(1, 196),
 'P 3 P 1': Fraction(1, 196),
 'P 3 P 2': Fraction(1, 196),
 'P 3 P 3': Fraction(1, 196),
 'P 3 P 4': Fraction(1, 196),
 'P 3 P 5': Fraction(1, 196),
 'P 3 P 6'

Now draw a random sample of the possible outcomes:

In [13]:
random.sample(list(two_exams_week), 10)

['P 6 F 5',
 'F 5 P 2',
 'P 7 F 2',
 'P 5 F 4',
 'F 1 P 5',
 'P 6 F 6',
 'P 1 F 1',
 'F 1 P 4',
 'F 3 P 6',
 'F 1 F 4']

Now let's define some relevant predicates, and start playing around with the problem.

Write a predicate (a function that returns true or false when given one possible outcome) called `at_least_one_pass`, which returns true if at least one of the exams was a pass.

Write a second predicate called `two_passes`, which returns true if both exams were a pass.

In [14]:
def at_least_one_pass(outcome): return 'P' in outcome
def two_passes(outcome): return outcome.count('P') == 2
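
A quick way to gain confidence in string-based predicates like these is to try them on a few hand-picked outcomes from both sample spaces. The sketch below repeats the two definitions so it runs on its own; the `samples` list is an illustrative choice.

```python
def at_least_one_pass(outcome): return 'P' in outcome
def two_passes(outcome): return outcome.count('P') == 2

# Outcomes in both label formats: without days ('PF') and with days ('P 6 F 5').
samples = ['PP', 'PF', 'FF', 'P 6 F 5', 'F 1 F 4', 'P 1 P 1']

for s in samples:
    print(repr(s), at_least_one_pass(s), two_passes(s))
```

The substring tests are only safe because outcome labels contain nothing but `P`, `F`, digits, and spaces; a label scheme in which `P` could appear for another reason would need a stricter predicate.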

Evaluate the probability (as a `Fraction`) of passing at least one exam, first when the sample space is **two_exams_week** (where we keep track of the day of the week) and then when it is **two_exams** (where we don't). Use your `P` function.

In [15]:
P(at_least_one_pass, two_exams)

Fraction(3, 4)

In [16]:
P(two_passes, two_exams)

Fraction(1, 4)

Evaluate the probability (as a `Fraction`) of passing both exams, when the sample space is **two_exams_week** (where we keep track of the day of the week) and when it is **two_exams** (where we don't).

Evaluate the probability (as a `Fraction`) of passing both exams, given that at least one of them is a pass (a **conditional probability**), when the sample space is **two_exams** (where we don't keep track of days of week). 

Then, evaluate the probability (as a `Fraction`) of passing both exams, given that at least one of them is a pass (a **conditional probability**), when the sample space is **two_exams_week** (where we keep track of the day of week).

In [17]:
P(two_passes, two_exams & at_least_one_pass)

Fraction(1, 3)
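
The same conditional probability can be cross-checked in both sample spaces by plain counting, since every outcome is equally likely. This is a sketch under that assumption, written without the lab's framework; `conditional`, `no_days`, and `with_days` are illustrative names.

```python
from fractions import Fraction
from itertools import product

def conditional(outcomes, event, given):
    "P(event | given) for equally likely outcomes, by counting."
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in kept if event(o)), len(kept))

def at_least_one_pass(o): return 'P' in o
def two_passes(o): return o.count('P') == 2

# Sample space without days: 'PP', 'PF', 'FP', 'FF'.
no_days = [a + b for a, b in product('PF', repeat=2)]

# Sample space with days: 196 labels such as 'P 6 F 5'.
one_exam = [r + ' ' + d for r, d in product('PF', '1234567')]
with_days = [a + ' ' + b for a, b in product(one_exam, repeat=2)]

print(conditional(no_days, two_passes, at_least_one_pass))    # 1/3
print(conditional(with_days, two_passes, at_least_one_pass))  # 1/3
```

Tracking the day of the week multiplies every count by the same factor, so the ratio is unchanged as long as the condition itself does not mention the day.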

Is the probability of two passes equal to the conditional probability of two passes given at least one pass? Does the sample space make a difference? Please write your answer in English and explain why you think you got the results you got.

Now, define a predicate for at least one pass **on Friday**, and call it `at_least_one_pass_friday`:

In [18]:
def at_least_one_pass_friday(outcome): return 'P 6' in outcome

And now answer the question: What's the probability you'll pass two exams if one of them is held on a Friday and is a pass?

In [19]:
P(two_passes, two_exams_week & at_least_one_pass_friday)

Fraction(13, 27)
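
The 13 and the 27 can be recovered by listing the outcomes directly. The sketch below assumes the same `'P 6'` label convention (Friday is day 6) and equally likely outcomes; `pairs`, `friday_pass`, and `both_pass` are illustrative names.

```python
from fractions import Fraction
from itertools import product

one_exam = [r + ' ' + d for r, d in product('PF', '1234567')]
pairs = [a + ' ' + b for a, b in product(one_exam, repeat=2)]

# Condition: at least one exam was passed on a Friday.
friday_pass = [o for o in pairs if 'P 6' in o]

# Event within that restricted space: both exams were passed.
both_pass = [o for o in friday_pass if o.count('P') == 2]

answer = Fraction(len(both_pass), len(friday_pass))
print(len(both_pass), len(friday_pass), answer)  # 13 27 13/27
```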

Is it the same as the probability of passing both exams given that at least one of them is a pass? Why or why not?

## Part 2

Let's display results as a 2D grid of outcomes. A cell is colored green if both `event` and `condition` are true, yellow if only `condition` is true, and white otherwise. The sample space is `dist`:


In [20]:
from IPython.display import HTML

def Pgrid(event, dist, condition):
    def first_half(s): return s[:len(s)//2]
    firsts = sorted(set(map(first_half, dist)))
    return HTML('<table>' +
                cat(row(first, event, dist, condition) for first in firsts) +
                '</table>' + 
                '<tt>P({} | {}) = {}</tt>'.format(
                event.__name__, condition.__name__, 
                P(event, dist & condition)))

def row(first, event, dist, condition):
    "Display a row where the first exam is paired with each possible second exam."
    thisrow = sorted(outcome for outcome in dist if outcome.startswith(first))
    return '<tr>' + cat(cell(outcome, event, condition) for outcome in thisrow) + '</tr>'

def cell(outcome, event, condition): 
    "Display outcome in appropriate color."
    color = ('lightgreen' if event(outcome) and condition(outcome) else
             'yellow' if condition(outcome) else
             'white')
    return '<td style="background-color: {}">{}</td>'.format(color, outcome)    

cat = ''.join

Draw the grid that displays in green all possibilities of two passes, in yellow all possibilities of one pass and one fail, and in white all double fails. In other words, the colored cells are all the outcomes with at least one pass. The sample space is **two_exams_week**.


In [21]:
Pgrid(two_passes, two_exams_week, at_least_one_pass)

0,1,2,3,4,5,6,7,8,9,10,11,12,13
F 1 F 1,F 1 F 2,F 1 F 3,F 1 F 4,F 1 F 5,F 1 F 6,F 1 F 7,F 1 P 1,F 1 P 2,F 1 P 3,F 1 P 4,F 1 P 5,F 1 P 6,F 1 P 7
F 2 F 1,F 2 F 2,F 2 F 3,F 2 F 4,F 2 F 5,F 2 F 6,F 2 F 7,F 2 P 1,F 2 P 2,F 2 P 3,F 2 P 4,F 2 P 5,F 2 P 6,F 2 P 7
F 3 F 1,F 3 F 2,F 3 F 3,F 3 F 4,F 3 F 5,F 3 F 6,F 3 F 7,F 3 P 1,F 3 P 2,F 3 P 3,F 3 P 4,F 3 P 5,F 3 P 6,F 3 P 7
F 4 F 1,F 4 F 2,F 4 F 3,F 4 F 4,F 4 F 5,F 4 F 6,F 4 F 7,F 4 P 1,F 4 P 2,F 4 P 3,F 4 P 4,F 4 P 5,F 4 P 6,F 4 P 7
F 5 F 1,F 5 F 2,F 5 F 3,F 5 F 4,F 5 F 5,F 5 F 6,F 5 F 7,F 5 P 1,F 5 P 2,F 5 P 3,F 5 P 4,F 5 P 5,F 5 P 6,F 5 P 7
F 6 F 1,F 6 F 2,F 6 F 3,F 6 F 4,F 6 F 5,F 6 F 6,F 6 F 7,F 6 P 1,F 6 P 2,F 6 P 3,F 6 P 4,F 6 P 5,F 6 P 6,F 6 P 7
F 7 F 1,F 7 F 2,F 7 F 3,F 7 F 4,F 7 F 5,F 7 F 6,F 7 F 7,F 7 P 1,F 7 P 2,F 7 P 3,F 7 P 4,F 7 P 5,F 7 P 6,F 7 P 7
P 1 F 1,P 1 F 2,P 1 F 3,P 1 F 4,P 1 F 5,P 1 F 6,P 1 F 7,P 1 P 1,P 1 P 2,P 1 P 3,P 1 P 4,P 1 P 5,P 1 P 6,P 1 P 7
P 2 F 1,P 2 F 2,P 2 F 3,P 2 F 4,P 2 F 5,P 2 F 6,P 2 F 7,P 2 P 1,P 2 P 2,P 2 P 3,P 2 P 4,P 2 P 5,P 2 P 6,P 2 P 7
P 3 F 1,P 3 F 2,P 3 F 3,P 3 F 4,P 3 F 5,P 3 F 6,P 3 F 7,P 3 P 1,P 3 P 2,P 3 P 3,P 3 P 4,P 3 P 5,P 3 P 6,P 3 P 7


So the probability of two passes given at least one pass is indeed 1/3: we throw away the white upper-left quadrant (both exams failed), and the green cells make up one third of the remaining colored cells.
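
The count behind this 1/3 can be written out quadrant by quadrant. The 14 × 14 grid splits into four quadrants of 7 × 7 = 49 cells each (fail/fail, fail/pass, pass/fail, pass/pass), and only the pass/pass quadrant is green. Writing $|PP|$, $|PF|$, $|FP|$ for the number of cells in each colored quadrant:

$$
P(\text{two passes} \mid \text{at least one pass})
= \frac{|PP|}{|PP| + |PF| + |FP|}
= \frac{49}{49 + 49 + 49}
= \frac{1}{3}
$$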

Now draw the grid that displays in green all possibilities of two passes when one is on a Friday, in yellow all possibilities of one pass and one fail when one pass is on a Friday, and in white for the rest. In other words, colored cells are all cells where at least one pass happened on Friday. The sample space is **two_exams_week**.


In [22]:
Pgrid(two_passes, two_exams_week, at_least_one_pass_friday)

0,1,2,3,4,5,6,7,8,9,10,11,12,13
F 1 F 1,F 1 F 2,F 1 F 3,F 1 F 4,F 1 F 5,F 1 F 6,F 1 F 7,F 1 P 1,F 1 P 2,F 1 P 3,F 1 P 4,F 1 P 5,F 1 P 6,F 1 P 7
F 2 F 1,F 2 F 2,F 2 F 3,F 2 F 4,F 2 F 5,F 2 F 6,F 2 F 7,F 2 P 1,F 2 P 2,F 2 P 3,F 2 P 4,F 2 P 5,F 2 P 6,F 2 P 7
F 3 F 1,F 3 F 2,F 3 F 3,F 3 F 4,F 3 F 5,F 3 F 6,F 3 F 7,F 3 P 1,F 3 P 2,F 3 P 3,F 3 P 4,F 3 P 5,F 3 P 6,F 3 P 7
F 4 F 1,F 4 F 2,F 4 F 3,F 4 F 4,F 4 F 5,F 4 F 6,F 4 F 7,F 4 P 1,F 4 P 2,F 4 P 3,F 4 P 4,F 4 P 5,F 4 P 6,F 4 P 7
F 5 F 1,F 5 F 2,F 5 F 3,F 5 F 4,F 5 F 5,F 5 F 6,F 5 F 7,F 5 P 1,F 5 P 2,F 5 P 3,F 5 P 4,F 5 P 5,F 5 P 6,F 5 P 7
F 6 F 1,F 6 F 2,F 6 F 3,F 6 F 4,F 6 F 5,F 6 F 6,F 6 F 7,F 6 P 1,F 6 P 2,F 6 P 3,F 6 P 4,F 6 P 5,F 6 P 6,F 6 P 7
F 7 F 1,F 7 F 2,F 7 F 3,F 7 F 4,F 7 F 5,F 7 F 6,F 7 F 7,F 7 P 1,F 7 P 2,F 7 P 3,F 7 P 4,F 7 P 5,F 7 P 6,F 7 P 7
P 1 F 1,P 1 F 2,P 1 F 3,P 1 F 4,P 1 F 5,P 1 F 6,P 1 F 7,P 1 P 1,P 1 P 2,P 1 P 3,P 1 P 4,P 1 P 5,P 1 P 6,P 1 P 7
P 2 F 1,P 2 F 2,P 2 F 3,P 2 F 4,P 2 F 5,P 2 F 6,P 2 F 7,P 2 P 1,P 2 P 2,P 2 P 3,P 2 P 4,P 2 P 5,P 2 P 6,P 2 P 7
P 3 F 1,P 3 F 2,P 3 F 3,P 3 F 4,P 3 F 5,P 3 F 6,P 3 F 7,P 3 P 1,P 3 P 2,P 3 P 3,P 3 P 4,P 3 P 5,P 3 P 6,P 3 P 7


Now compute the probability of two passes, given that at least one pass was on a Friday, by dividing the number of green cells by the number of colored cells, and draw your conclusion.
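
As a worked check, the cells can be counted with inclusion-exclusion. A cell is colored if the first exam is `P 6` (14 cells, one full row), or the second exam is `P 6` (14 cells, one full column), and the cell `P 6 P 6` is counted twice. Among those, a cell is green if the other exam is also a pass (7 cells in the row, 7 in the column, again with `P 6 P 6` shared):

$$
|\text{colored}| = 14 + 14 - 1 = 27, \qquad
|\text{green}| = 7 + 7 - 1 = 13, \qquad
P(\text{two passes} \mid \text{a pass on Friday}) = \frac{13}{27}
$$

This is close to 1/2 and clearly different from the 1/3 obtained when the condition does not mention the day.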

Probability theory is all about counting!

<br />
<center>
<img src = ipynb.images/counting.jpg width = 400 />
</center>

The critical twist here is that the information about a pass on Friday is used at the outset to determine the sample space. In other words, we are really asking: “Among all pairs of exams in which at least one was held on a Friday and passed, what fraction are double passes?” This is the information we're after, and it determines the sample space.

Paradoxes usually appear when the sampling procedure is not fully specified and the reader has to interpret it. A careful interpretation of the hypothesis is required to quench paradoxes.

<br />
<center>
<img src = ipynb.images/paradox.jpg width = 400 />
</center>