<div style="text-align: right">INFO 6105 Data Sci Engineering Methods and Tools, Week 4 Lecture 1</div>
<div style="text-align: right">Dino Konstantopoulos, 4 February 2019</div>

We correct your data science homework involving Formula 1 driving. 

Let's use the following probability function.

In [2]:
def p(event, space): 
    """The probability of an event, given a sample space of outcomes. 
    event: a collection of outcomes, or a predicate that is true of outcomes in the event. 
    space: a set of outcomes or a probability distribution of {outcome: frequency} pairs."""
    if is_predicate(event):
        event = such_that(event, space)
    if isinstance(space, ProbDist):
        return sum(space[o] for o in space if o in event)
    else:
        return Fraction(len(event & space), len(space))

is_predicate = callable

def such_that(predicate, space): 
    """The outcomes in the sample space for which the predicate is true.
    If space is a set, return a subset {outcome,...} with outcomes where predicate(element) is true;
    if space is a ProbDist, return a ProbDist {outcome: frequency,...} with outcomes where predicate(element) is true."""
    if isinstance(space, ProbDist):
        return ProbDist({o:space[o] for o in space if predicate(o)})
    else:
        return {o for o in space if predicate(o)}
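
The functions above rely on `Fraction` and `ProbDist`, neither of which is defined in this notebook. Here is a minimal sketch, assuming `ProbDist` is the normalizing `dict` subclass from the earlier probability workbook. The class body and the toy coin frequencies are assumptions for illustration, not the course's official definition:

```python
from fractions import Fraction  # p() needs this for uniform sample spaces

class ProbDist(dict):
    """A probability distribution: an {outcome: probability} mapping.
    Assumed definition: raw frequencies are normalized to sum to 1."""
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        for outcome in self:
            # Divide each raw frequency by the total
            self[outcome] = self[outcome] / total
            assert self[outcome] >= 0

# Toy example with made-up frequencies: a biased coin
coin = ProbDist(H=3, T=1)
print(coin)  # {'H': 0.75, 'T': 0.25}
```

With a definition like this, passing driver points as keyword arguments yields single-race win probabilities directly.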

# F1

<br />
<center>
    <img src="ipynb.images/f1races.png" width=800 />
</center>

Question 1.1 (20 points) There are two F1 races coming up: The Russian Grand Prix this weekend and the Japanese Grand Prix the weekend after. The 2018 driver standings are given [here](https://www.formula1.com/en/results.html/2018/drivers.html). Given these standings (please do not use team standings given on the same Web site, use driver standings), what is the Probability Distribution for each F1 driver to win the Russian Grand Prix? What is the Probability Distribution for each F1 driver to win *both* the Russian and Japanese Grand Prix? What is the probability for Ferrari to win both races? What is the probability for Ferrari to win at least one race? Note that Ferrari, and each other racing team, has two drivers per race.

Question 1.2 (30 points) If Ferrari wins the first race, what is the probability that Ferrari wins the next one? If Ferrari wins at least one of these two races, what is the probability Ferrari wins both races? How about Mercedes, Red Bull, and Williams?

Question 1.3 (50 points) Ferrari wins one of these two races on a rainy day. What is the probability Ferrari wins both races, assuming races can be held on either rainy, sunny, cloudy, snowy or foggy days? Assume that rain, sun, clouds, snow, and fog are the only possible weather conditions on race tracks.

You need to provide *proof* for your answers. `I think it's one in a million because Ferrari sucks` is not a good answer. Leverage the counting framework in this workbook.

Put the standings here, so you don't have to refer to the Web site every time. Leave the cell unformatted.

...

Hint: Use `RGP` to denote the Probability Distribution given by F1 driver wins. Write driver initials as keys and driver wins as values in a dictionary that you pass to our function `ProbDist`.

### Question 1

In [None]:
RGP = ProbDist(
    LH = 281,
    SV = 241,
    KR = 174,
    VB = 171,
    MV = 148,
    DR = 126,
    NH = 53,
    FA = 50,
    KM = 49,
    SP = 46,
    EO = 45,
    CS = 38,
    PG = 28,
    RG = 27,
    CL = 15,
    SD = 8,
    LS = 6,
    ME = 6,
    BH = 2,
    SS = 1)
RGP

Assuming the two races are independent, the probability of two successive wins by the same driver is just the square of that driver's single-win probability:

In [None]:
def f(x): return x ** 2.0
RJGP = {k: f(v) for k, v in RGP.items()}
RJGP

You could also compute this in a different way. You could leverage:

In [None]:
def joint(A, B, sep=' '):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B})

RJGP2 = joint(RGP, RGP, ' ')
import random
winners_rnd = random.sample(list(RJGP2), 10)
print(winners_rnd)
probs_rnd = [RJGP2[k] for k in winners_rnd]
winners_n_probs_rnd = ...
print('\n'.join([str(outcome) for outcome in winners_n_probs_rnd]))

Verify the sum of all probabilities is 1:
```python
all_probs = [RJGP2[k] for k in RJGP2]
sum(all_probs)
```

and finally:

In [None]:
#print([outcome for outcome in RJGP2 ])
#print(list(zip(RJGP2.keys(), RJGP2.values())))
winners = RJGP2.keys()
probs = RJGP2.values()
double_winners = {x for x in winners if x[0] == x[3] and x[1] == x[4]}
#print(double_winners)
probs = [RJGP2[k] for k in double_winners]
double_winners_n_probs = ...
print('\n'.join([str(outcome) for outcome in double_winners_n_probs]))

.. which matches our previous result!
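
The claim that both routes agree can be checked mechanically. This is a small sketch that uses plain dictionaries as a stand-in for `ProbDist` and, to keep it short, only the top four drivers from the standings cell, so the probabilities here are not the full-field values:

```python
from math import isclose

# Top four entries of the standings cell (points used as win frequencies).
# Truncating the field is an assumption made to keep the check small.
points = {'LH': 281, 'SV': 241, 'KR': 174, 'VB': 171}
total = sum(points.values())
single = {d: pts / total for d, pts in points.items()}

# Route 1: square each single-race probability
squares = {d: pr ** 2 for d, pr in single.items()}

# Route 2: build the joint distribution and keep the diagonal keys 'XX XX'
joint_dist = {a + ' ' + b: single[a] * single[b]
              for a in single for b in single}
diagonal = {k[:2]: pr for k, pr in joint_dist.items() if k[:2] == k[3:]}

# Both routes give the same double-win probabilities
assert all(isclose(squares[d], diagonal[d]) for d in single)
# The joint distribution is itself a probability distribution
assert isclose(sum(joint_dist.values()), 1.0)
```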

Since the Ferrari drivers are Kimi Räikkönen (KR) and Sebastian Vettel (SV), the outcomes in which Ferrari wins *both* races are represented by the keys `KR SV`, `SV KR`, `KR KR`, and `SV SV`. And so the probability of Ferrari winning *both* races is:

In [None]:
winners = RJGP2.keys()
probs = RJGP2.values()
print(len(winners))
#Ferrari_wins_2 = {x for x in winners if x in ('KR SV', 'SV KR', 'KR KR', 'SV SV')}
Ferrari_wins_2 = {...}
#print(Ferrari_wins_2)
Ferrari_probs_2 = [RJGP2[k] for k in Ferrari_wins_2]
#Ferrari_wins_2_n_probs = list(zip(Ferrari_wins_2, Ferrari_probs_2))
#print('\n'.join([str(outcome) for outcome in Ferrari_wins_2_n_probs]))
Ferrari_wins_2_prob = sum(Ferrari_probs_2)
print('Probability of Ferrari drivers winning both races is: ' + str(Ferrari_wins_2_prob))

Now, the event of Ferrari winning *at least one of these races* is any win combination whose key includes one of the Ferrari drivers. Note that this includes *both* Ferrari drivers winning!

In [None]:
winners = RJGP2.keys()
print(len(winners))
#Ferrari_wins_1 = {x for x in winners if 'KR ' in x or ' KR' in x or 'SV ' in x or ' SV' in x}
Ferrari_wins_1 = {...}
print('all possible combinations for one Ferrari win: ')
print(sorted(Ferrari_wins_1))
#print('\n'.join(sorted(Ferrari_wins_1)))
print(len(Ferrari_wins_1))
Ferrari_probs_1 = [RJGP2[k] for k in Ferrari_wins_1]
#print(Ferrari_probs_1)
#print(len(Ferrari_probs_1))
#Ferrari_wins_1_probs = list(zip(Ferrari_wins_1, Ferrari_probs_1))
#print('\n'.join([str(outcome) for outcome in Ferrari_wins_1_n_probs]))
Ferrari_wins_1_prob = sum(Ferrari_probs_1)
print('Probability of Ferrari drivers winning one of these races is: ' + str(Ferrari_wins_1_prob))

Wait.. That seems strange... Sounds like a good bet for Ferrari.. 
<center>
    <img src="ipynb.images/scrat.png" width=200 />
</center>

Let's do some debugging. You should always do this when you get strange results in Data Science. Slow down, and retrace your steps in more detail.

In [25]:
def joint_dbg(A, B, sep=' '):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + '(' + "{0:.4f}".format(A[a]) + ')' + sep + b + '(' + "{0:.4f}".format(B[b]) + ')' + sep 
                     + "{0:.4f}".format(A[a] * B[b]): A[a] * B[b]
                    for a in A
                    for b in B})

Let's recompute *carefully*. 

In [None]:
RJGP3 = joint_dbg(RGP, RGP)
winners = RJGP3.keys()
print('Possible win combinations: ' + str(len(winners)))

#print('Mercedes:')
Mercedes_wins_1 = [x for x in winners if x.count('LH') + x.count('VB') ...]
#print(Mercedes_wins_1)
Mercedes_wins_1_probs = [float(x[22:]) for x in Mercedes_wins_1]
#print(Mercedes_wins_1_probs)
print('Probability of at least one Mercedes victory: ' + str(sum(Mercedes_wins_1_probs)))
#print('---------')

#print('Ferrari:')
Ferrari_wins_1 = [x for x in winners if x.count('SV') + x.count('KR') ...]
#print(Ferrari_wins_1)
Ferrari_wins_1_probs = [float(x[22:]) for x in Ferrari_wins_1]
#print(Ferrari_wins_1_probs)
print('Probability of at least one Ferrari victory: ' + str(sum(Ferrari_wins_1_probs)))
#print('---------')

#print('Red Bull:')
RedBull_wins_1 = [x for x in winners if x.count('MV') + x.count('DR') ...]
#print(RedBull_wins_1)
RedBull_wins_1_probs = [float(x[22:]) for x in RedBull_wins_1]
#print(RedBull_wins_1_probs)
print('Probability of at least one Red Bull victory: ' + str(sum(RedBull_wins_1_probs)))
#print('---------')
      
#print('Williams:')
Williams_wins_1 = [x for x in winners if x.count('LS') + x.count('SS') ...]
#print(Williams_wins_1)
Williams_wins_1_probs = [float(x[22:]) for x in Williams_wins_1]
#print(Williams_wins_1_probs)
print('Probability of at least one Williams victory: ' + str(sum(Williams_wins_1_probs)))
#print('---------')

Why do the probabilities *not add up to 1*?!!

If you had computed the probability of at least one Ferrari win by squaring the probability of a 'KR' or 'SV' win, you would be making a *mistake*, because squaring gives the probability of *two* wins:

In [None]:
Ferrari_win = [RGP[k] for k in ('KR', 'SV')]  # a list: a set would merge two equal probabilities
print(Ferrari_win)
print('one win: ' + str(sum(Ferrari_win)))
print('two wins: ' + str(sum(Ferrari_win) ** 2))

And so the probability of Mercedes or Ferrari drivers winning at least one of these races is much higher than the probability of winning one race. See the difference that *at least* makes? The probability of winning both races is much easier to evaluate.
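
One way to see both points at once is the complement rule. This sketch assumes independent races and uses the points from the standings cell; the helper name `at_least_one` is made up for illustration:

```python
from fractions import Fraction

# Points from the standings cell, in the same order
points = [281, 241, 174, 171, 148, 126, 53, 50, 49, 46,
          45, 38, 28, 27, 15, 8, 6, 6, 2, 1]
total = sum(points)

f = Fraction(241 + 174, total)   # Ferrari (SV + KR) wins a given race
m = Fraction(281 + 171, total)   # Mercedes (LH + VB) wins a given race

def at_least_one(x):
    """P(at least one win in two independent races) = 1 - P(no win twice)."""
    return 1 - (1 - x) ** 2

print(float(f))                  # about 0.27: one given race
print(float(f ** 2))             # about 0.075: both races
print(float(at_least_one(f)))    # about 0.47: at least one of the two

# 'At least one Mercedes win' and 'at least one Ferrari win' overlap:
# each team can take one race, in either order.
overlap = 2 * f * m
print(float(overlap) > 0)        # True
```

Because the "at least one win" events of different teams can happen together, they are not mutually exclusive, which is why their probabilities need not add up to 1.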

### Question 2

If Ferrari wins one of these races (assume that is given as evidence), what is the probability that Ferrari wins both of them? If Ferrari wins at least one of these two races (assume that is given as evidence), what is the probability Ferrari wins both races? How about Mercedes, Red Bull, and Williams?

`RJGP2` has 400 possible outcomes. This is a random sampling:

In [None]:
print(len(RJGP2))
import random
winners = random.sample(list(RJGP2), 10)
probs = [RJGP2[k] for k in winners]
winners_n_probs = list(zip(winners, probs))
print('\n'.join([str(outcome) for outcome in winners_n_probs]))

If Ferrari wins one race, then the probability that Ferrari also wins the next one is just the probability of Ferrari winning one race, which is 27%.

Let's look at this from a different perspective:

If Ferrari wins the first race (Russian Grand Prix), what is the probability Ferrari wins the next one (Japanese Grand Prix)? If Ferrari wins at least one of these races, what is the probability Ferrari wins both?

Let's define predicates for Ferrari winning the next race `next_ferrari_win_p`, Ferrari winning at least one race (not only one race!) `one_ferrari_win_p`, and Ferrari winning 2 races `two_ferrari_wins_p`:

In [5]:
def next_ferrari_win_p(outcome): ...
def one_ferrari_win_p(outcome): return outcome.count('KR') + outcome.count('SV') ...
def two_ferrari_wins_p(outcome): return outcome.count('KR') + outcome.count('SV') ...

Let's define a new joint probability distribution that takes in a predicate as an argument:

In [6]:
def joint_p(A, B, pred, sep=' '):
    """The joint distribution of two independent probability distributions, with a condition 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B
                    ...})


The probability Ferrari wins the Japanese Grand Prix given that it won the Russian Grand Prix:

In [None]:
next_ferrari_win = joint_p(RGP, RGP, next_ferrari_win_p, ' ')
len(next_ferrari_win)

In [None]:
p(two_ferrari_wins_p, next_ferrari_win)

And now the probability that Ferrari wins both races given that it won *at least one of them*:

In [None]:
one_ferrari_win = joint_p(RGP, RGP, one_ferrari_win_p, ' ')
len(one_ferrari_win)

In [None]:
p(two_ferrari_wins_p, one_ferrari_win)

Understanding the answer is tougher. Some people may think the answer should be 27%. Can we justify the answer 16%? 

A win at the Russian Grand Prix does not really affect the probability of winning the next race, so it's just the probability for Ferrari of winning *a* race.

But there are more possibilities for `at least one Ferrari win` than `a Ferrari win for the Japanese Grand Prix`. And so our sample space is different. When we add up favorable outcomes, the total probability is smaller. 
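
The two answers can be reproduced by hand as P(both) divided by P(evidence). This is a sketch with exact fractions, assuming independent races and the Ferrari share of points from the standings cell:

```python
from fractions import Fraction

# P(Ferrari wins one race): (241 + 174) points out of 1515 in the standings
f = Fraction(241 + 174, 1515)

p_both = f ** 2
p_one_given_race = f                  # evidence: Ferrari wins one particular race
p_at_least_one = 1 - (1 - f) ** 2     # evidence: Ferrari wins at least one race

# Conditional probability = P(both) / P(evidence)
given_one_race = p_both / p_one_given_race
given_at_least_one = p_both / p_at_least_one

print(float(given_one_race))       # about 0.27
print(float(given_at_least_one))   # about 0.16
```

The numerator is the same in both cases. Only the evidence changes, and the larger evidence set ("at least one win") gives the smaller conditional probability.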

Intuitively, if I tell you the Patriots won the AFC Championship game against the Chiefs two weekends ago, what's the probability they will win the Super Bowl against the Rams, too? You will say, "oh, good chance, good team"! But if I tell you that over this weekend *and* next, the Patriots are going to win at least one game, and you know they already won one game, you will probably say, "oh, that means they will lose the next one", or at least you won't be as confident about the next one.

### Question 3

Ferrari wins one of these races on a rainy day (Ferrari has great Pirelli tires). What is the probability Ferrari wins both races, assuming races can be held on either rainy, sunny, cloudy, snowy or foggy days? Assume that rain, sun, clouds, snow, and fog are the only possible weather conditions on race tracks.
 
One Ferrari win is on a rainy day. What's the probability both races are wins for Ferrari?

Whaaaaat? Why do we care that Ferrari wins on a rainy day? Well, we do, because that is *extra* information. For example, it may tell us that Ferrari has great tires, so any time it rains, Ferrari is a proven winner on rainy tracks. It allows us to compute a *posterior* probability to update a *prior*, given new evidence. That is the foundation of Bayesian statistics and Bayesian Machine Learning, which we'll cover this week.
 
We have a *non-uniform* probability distribution for racers, and a *uniform* probability distribution for weather conditions. Let's derive the joint distribution for one race, and the joint distribution over two races.

A Uniform distribution is just like a Probability distribution, except that all the values (probabilities) are the same!

In [None]:
def Uniform(outcomes): return ProbDist({event: 1 for event in outcomes})
F1_1_conditions  = ...
F1_2_conditions = joint(F1_1_conditions, F1_1_conditions)
len(F1_2_conditions)

We have 10,000 different combinations of wins over two races, with weather conditions! Let's sample:

In [None]:
import random
random.sample(list(F1_2_conditions), 10)

For example, one possibility is Lewis Hamilton winning the first race on a foggy day, and Fernando Alonso winning the second race on a rainy day: {LHf FAr}.

We determine below that the probability of *at least* one Ferrari win over two races is the same whether or not we keep track of weather conditions:

In [None]:
def one_ferrari_win_p(outcome): return outcome.count('KR') + outcome.count('SV') ...
p(one_ferrari_win_p, RJGP2)

In [None]:
"LHf FAr".count('LH') + "LHf FAr".count('FA')

In [None]:
p(one_ferrari_win_p, F1_2_conditions)

We determine below that the probability of two Ferrari wins, both where we keep track of weather conditions and where we don't, is the same:

In [None]:
p(two_ferrari_wins_p, RJGP2)

In [None]:
p(two_ferrari_wins_p, F1_2_conditions)

So the probability of one Ferrari win and of two Ferrari wins is independent of weather conditions.

Ok, now we're reassured that we are not crazy, and the professor is not part of some alien conspiracy to make us *think we're crazy*.

<center>
    <img src="ipynb.images/Scrat_Ice_Age.png" width=200 />
</center>

Now, we have to figure the probability of two Ferrari wins given the evidence of one Ferrari win on a rainy day. Hmmmmmm..

Let's define a predicate for the event of a Ferrari win on a rainy day:

In [14]:
def at_least_one_Ferrari_win_on_a_rainy_day(outcome): return ...

and so:

In [None]:
F1_2_conditions_at_least_one_Ferrari_win_on_a_rainy_day = (
    joint_p(...)
)
len(F1_2_conditions_at_least_one_Ferrari_win_on_a_rainy_day)

Out of 10,000 possibilities, 396 outcomes match the evidence. And so the probability for two Ferrari wins over two races given that Ferrari won one of them on a rainy day:

In [None]:
def two_ferrari_wins_p(outcome): return outcome.count('KR') + outcome.count('SV') ...
p(two_ferrari_wins_p, F1_2_conditions_at_least_one_Ferrari_win_on_a_rainy_day)

The probability of two Ferrari wins when we know that Ferrari wins at least one race on a rainy day is 25%, which is much higher than the probability of two Ferrari wins (7%), and lower than the probability of at least one Ferrari win (47%). That's the value of extra information: It produces a different posterior!
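
The 396 outcomes and the 25% posterior can be cross-checked by brute-force enumeration. This sketch is independent of the notebook's helpers: it uses (driver, weather) tuples instead of string keys, and the weather labels are illustrative names, not the notebook's one-letter codes:

```python
from fractions import Fraction
from itertools import product

points = {'LH': 281, 'SV': 241, 'KR': 174, 'VB': 171, 'MV': 148, 'DR': 126,
          'NH': 53, 'FA': 50, 'KM': 49, 'SP': 46, 'EO': 45, 'CS': 38,
          'PG': 28, 'RG': 27, 'CL': 15, 'SD': 8, 'LS': 6, 'ME': 6,
          'BH': 2, 'SS': 1}
weather = ('rain', 'sun', 'cloud', 'snow', 'fog')  # labels chosen for this sketch
ferrari = {'KR', 'SV'}
total = sum(points.values())

# One race: (driver, weather) -> probability; weather is uniform and
# independent of the driver
race = {(d, w): Fraction(pts, total) / len(weather)
        for d, pts in points.items() for w in weather}

def rainy_ferrari_win(result):
    driver, sky = result
    return driver in ferrari and sky == 'rain'

n_evidence, p_evidence, p_both = 0, Fraction(0), Fraction(0)
for first, second in product(race, repeat=2):
    if rainy_ferrari_win(first) or rainy_ferrari_win(second):
        pr = race[first] * race[second]
        n_evidence += 1
        p_evidence += pr
        if first[0] in ferrari and second[0] in ferrari:
            p_both += pr

print(n_evidence)                   # 396 outcomes match the evidence
print(float(p_both / p_evidence))   # about 0.25
```

Each matching outcome is weighted by its probability rather than counted once, which is what `p` does when the sample space is a `ProbDist`.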

Wow, Bayesian statistics looks like great stuff, like maybe I can use it to play at the new Boston [Casino](https://encorebostonharbor.com/), but what if the professor is just trying to seduce us with complicated programming?

<br />
<center>
    <img src="ipynb.images/seduce.png" width=300 />
</center>

Let's do some debugging. Let's display results as a 2D grid of outcomes. A cell will be colored white if Ferrari has no win on a rainy day, yellow if Ferrari has at least one win on a rainy day but does not win both races, and green if Ferrari wins both races with at least one win on a rainy day. 

Let's reduce the amount of data to begin our debugging, to just Ferrari, Renault, and Williams. You always debug with less data! And I picked these teams because they cover the range of wins: a lot, medium, very few. I'll also reduce weather to (r)ain, and (s)un.

In [19]:
def Uniform(outcomes): return ProbDist({event: 1 for event in outcomes})

def joint(A, B, sep=' '):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B})

def next_ferrari_win_p(outcome): return outcome.count(' KR') + outcome.count(' SV') == 1
def one_ferrari_win_p(outcome): return outcome.count('KR') + outcome.count('SV') >= 1
def two_ferrari_wins_p(outcome): return outcome.count('KR') + outcome.count('SV') == 2

def at_least_one_Ferrari_win_on_a_rainy_day(outcome): return 'KRr' in outcome or 'SVr' in outcome

In [None]:
# probability of winning one race. Ferrari is SV and KR
RGPr = ProbDist(
    SV = 241,
    KR = 174,
    NH = 53,
    CS = 38,
    LS = 6,
    SS = 1)
RGPr

In [None]:
# probability of winning a race on a specific weather condition
RGPrw  = joint(RGPr, ..., '') #care about the weather
RGPrw

In [None]:
# probability of winning two races on specific weather conditions
RJGPrw  = joint(RGPrw, RGPrw)
len(RJGPrw)

Ok, I can work with 12 x 12 data points, they won't fry my kernel. What do they look like?

In [None]:
import random
random.sample(list(RJGPrw), 10)

Let's do some plotting. Machine learning is all about *geometry* (specifically, building outcome manifolds in state space that represent the surface joining all possible outcomes). That is why we debug everything with pictures.

Let's plot all possible outcomes of our discrete probability distribution on a grid, and color cells in green and yellow depending on two respective predicates. If one is true, color the cell any color (yellow or green), if the other is true *as well*, color the cell green.

In [24]:
from IPython.display import HTML

def Pgrid(event, condition, dist):
    def first_half(s): return s[:len(s)//2]
    firsts = sorted(set(map(first_half, dist)))
    return HTML('<table>' +
                cat(row(first, event, dist, condition) for first in firsts) +
                '</table>')

def row(first, event, dist, condition):
    "Display a row where the first race result is paired with each of the possible second race results."
    thisrow = sorted(outcome for outcome in dist if outcome.startswith(first))
    return '<tr>' + cat(cell(outcome, event, condition) for outcome in thisrow) + '</tr>'

def cell(outcome, event, condition): 
    "Display outcome in appropriate color."
    color = ('lightgreen' if event(outcome) and condition(outcome) else
             'yellow' if condition(outcome) else
             'white')
    return '<td style="background-color: {}">{}</td>'.format(color, outcome)    

cat = ''.join

In [None]:
# Let's plot all the possible outcomes
# white cells: no ferrari win on a rainy day
# colored cells: at least one ferrari win on a rainy day
# green cells: two ferrari wins with at least one of them on a rainy day
Pgrid(...)

Let's count.

Number of cells where Ferrari wins at least once on a rainy day = 12 + 12 + (12 - 2) + (12 - 2) = 44

Number of those cells where Ferrari also wins both races = 3 + 3 + 3 + 3 = 12

And so the probability of two Ferrari wins given that Ferrari won at least one race on a rainy day is 12 / 44 = 27%.

Correct!
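
The cell counts can be recounted in code. This sketch rebuilds the reduced grid with plain dictionaries (not the notebook's `ProbDist` or `Pgrid`), taking KR = 174 from the standings cell. One caveat worth checking: a ratio of cell counts treats every cell as equally likely, while the drivers here are not, so the sketch also computes the probability-weighted ratio that `p` would report:

```python
from fractions import Fraction
from itertools import product

points = {'SV': 241, 'KR': 174, 'NH': 53, 'CS': 38, 'LS': 6, 'SS': 1}
weather = 'rs'            # (r)ain and (s)un
ferrari = ('KR', 'SV')
total = sum(points.values())

# One race: keys like 'SVr'; weather uniform and independent of the driver
race = {d + w: Fraction(pts, total) / len(weather)
        for d, pts in points.items() for w in weather}

def rainy_win(cell):      # at least one Ferrari win on a rainy day
    return 'KRr' in cell or 'SVr' in cell

def two_wins(cell):       # Ferrari wins both races
    return sum(cell.count(d) for d in ferrari) == 2

grid = {a + ' ' + b: race[a] * race[b] for a, b in product(race, repeat=2)}
colored = [c for c in grid if rainy_win(c)]
green = [c for c in colored if two_wins(c)]

print(len(grid), len(colored), len(green))   # 144 44 12

count_ratio = Fraction(len(green), len(colored))   # 12 / 44
weighted = sum(grid[c] for c in green) / sum(grid[c] for c in colored)
print(float(count_ratio), float(weighted))
```

In this reduced field the two Ferrari drivers hold most of the points, so the weighted ratio comes out well above 12/44. The cell count verifies the bookkeeping of the grid, while the weighted ratio is the conditional probability under the non-uniform driver distribution.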

And now let's color a slightly bigger table. Let's increase the amount of data to Mercedes, Ferrari, Renault, and Williams. Also, let's add (c)loudy day.

In [None]:
# probability of winning one race. Ferrari is SV and KR
RGPr = ProbDist(
    LH = 281,
    SV = 241,
    VB = 171,
    KR = 174,
    NH = 53,
    CS = 38,
    LS = 6,
    SS = 1)
RGPr

In [None]:
# probability of winning a race on a specific weather condition
RGPrw  = joint(RGPr, Uniform('rsc'), '') #care about the weather
RGPrw

In [None]:
# probability of winning two races on specific weather conditions
RJGPrw  = joint(RGPrw, RGPrw)
len(RJGPrw)

Yikes, that's a 24 x 24 table, four times the size of our previous table!

In [None]:
Pgrid(...)

But the counting is similar: 

24 + 24 + (24 - 2) + (24 - 2) = 92 colored cells (yellow or green)

5 + 5 + 5 + 5 = 20 green cells

So the probability of two Ferrari wins given at least one Ferrari win on a rainy day = 20/92 = 22%.

And so it makes sense that, for all Formula 1 drivers and all 5 weather conditions, the probability hovers around 25%. Do you want to run this? Sorry, I am not going to risk frying my kernel, but you can :-)


Some of you may have gotten confused when I added evidence to a probability distribution, and you wondered how **the heck** are weather conditions related to F1 rankings? But you were thinking too hard. Probability theory is just counting. Given all possible outcomes, figure out the joint sample space, and just count favorable outcomes over all possible outcomes.

<center>
    <img src="ipynb.images/elementary.png" width=300 />
</center>

But you know what's the strangest thing? A Ferrari win on a rainy day is understandable, because Ferrari has great Pirelli tires, and once I figure that out, I'm ready to bet on Ferrari on rainy days. But if I told you that Ferrari wins on the day it rains... in Australia (not on the race track), and that there are 5 possible weather conditions in Australia, *the same listing of cells above holds*! Same sample space, same favorable outcomes and unfavorable ones!

<center>
    <img src="ipynb.images/cmon-iceage.png" width=200 />
</center>

You may say... "Wait a minute, if this theory that you call probability theory gives me illogical answers, how can I trust it"? 

Probability theory is mathematically sound, but it all comes down to **the model you apply it to**. 

If your sample space is the joint distribution of F1 drivers, and.. weather conditions **in Australia**, then you've probably built the wrong statistical model, and there ain't a single Machine Learning algorithm that will help you here. Junk in, junk out. But weather on race tracks and F1 racing, that does make sense, right? Skills of drivers, performance of tires..

That is why the model is so important in Data Science. Before you start applying statistics to data, or feeding data into a Machine, you need to work on your model. In the case of discrete random variables, your pdf is a dictionary, much like our Danish or Formula 1 exemplars: Even though you have *some* data based on *some* experiment, you still need to build a model with joint distributions to complete your sample space for the problem at hand. In the case of continuous random variables your state space is most often infinite (countable or uncountable) and you need to build a much denser model. In fact, a high dimensional manifold in state space.

Statistics is the field of mathematics which deals with the understanding and interpretation of data. Specifically, you want to find the underlying mechanism that yields the data you're analyzing. You took the first step by learning probability theory. Now we'll continue by cataloging all possible probability distributions that lead to seemingly random data (Poisson, Gaussian, exponential, etc.), and by shaping models with Bayesian inference: You match the histogram of your data to a pdf in the catalog, you find its parameters using probabilistic programming techniques such as variational inference or Monte Carlo methods, and once you have your analytic model, you extract all kinds of interesting statistics from it (instead of from the data).

In fact, that is exactly what machines do. They use a deep neural structure to build a pdf, which they can then use for prediction. Machine Learning experts have a stronger stats foundation than CS undergraduates in a deep learning class. Information theory, in general, requires a strong understanding of data and probability (and linear algebra), and anyone interested in becoming a Data Scientist or Machine Learning Engineer needs to develop a deep intuition of statistical (and linear algebra) concepts. That is our journey in this class.

In many cases, predictive Machine Learning algorithms are completely useless in helping with the understanding of data. They do not yield their data model. Bayesian ML will change all that, and Alexa will be able to tell you why it lowered room temperature, John.