<div style="text-align: right">INFO 6105 Data Sci Engineering Methods and Tools, Week 3 Lecture 1</div>
<div style="text-align: right">Dino Konstantopoulos, 23 January 2019, with material from Peter Norvig and Cam Davidson-Pilon</div>

Please unzip all images from each week's `images.zip` on blackboard onto your `C:/Users/<username>/ipynb.images` folder (create it if it doesn't exist). If there's a `data.zip` file on blackboard, unzip its contents onto your `C:/Users/<username>/data` folder (create it if it doesn't exist).

At the end of this lecture, you should have a good understanding of probabilities, combinatorics, Python list comprehensions, Python lambda arithmetic, and how to compute probabilities.
![Bayes](http://img1.ph.126.net/xKZAzeOv_mI8a4Lwq7PHmw==/2547911489202312541.jpg)
<center><a href="https://en.wikipedia.org/wiki/Thomas_Bayes">Rev. Thomas Bayes</a><br>1701-1761
</center>

# 1. Probability

In 1814, Pierre-Simon, marquis de Laplace (23 March 1749 – 5 March 1827), a French scholar whose work was important to the development of mathematics, statistics, physics and astronomy, [wrote](https://en.wikipedia.org/wiki/Classical_definition_of_probability):

>*Probability ... is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible ... when nothing leads us to expect that any one of these cases should occur more than any other.*


Here's some vocabulary:

- **[Experiment](https://en.wikipedia.org/wiki/Experiment_(probability_theory%29):**
  An occurrence with an uncertain outcome that we can observe.
  <br>*For example, rolling a die.*
- **[Outcome](https://en.wikipedia.org/wiki/Outcome_(probability%29):**
  The result of an experiment; one particular state of the world. What Laplace calls a "case."
  <br>*For example:* `4`.
- **[Sample Space](https://en.wikipedia.org/wiki/Sample_space):**
  The set of all possible outcomes for the experiment. 
  <br>*For example,* `{1, 2, 3, 4, 5, 6}`.
- **[Event](https://en.wikipedia.org/wiki/Event_(probability_theory%29):**
  A subset of possible outcomes that together have some property we are interested in.
  <br>*For example, the event "even die roll" is the set of outcomes* `{2, 4, 6}`. 
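
To make the vocabulary concrete, here is a minimal sketch that applies Laplace's definition to a single die roll. The names `sample_space`, `high_roll`, and `laplace_probability` are illustrative choices, not part of the lecture code, and the sketch assumes every outcome is equally likely.

```python
from fractions import Fraction

# Sample space: every possible outcome of rolling one six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event (illustrative): "the roll is greater than 4".
high_roll = {outcome for outcome in sample_space if outcome > 4}

def laplace_probability(event, space):
    """Favorable cases divided by all possible cases (equiprobable outcomes assumed)."""
    return Fraction(len(event & space), len(space))

print(high_roll)                                     # the outcomes 5 and 6
print(laplace_probability(high_roll, sample_space))  # 1/3
```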
  
And here's a little game: What's the probability that someone in this classroom shares your birthday? 
Each person can have your birthday with probability 1/365. There are n−1 people other than yourself, so the probability that someone shares your birthday is ...

Now, what is the probability that *two* students in this classroom have the *same* birthday? Which of the two do you think is higher?


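Answer to the first question: each person shares your birthday with probability 1/365, and there are n-1 people other than yourself. For a small class, the probability that at least one of them shares your birthday is approximately (n-1)/365. The exact value is $1 - (364/365)^{n-1}$, because (n-1)/365 counts the cases where several classmates share your birthday more than once.

Answer to the second question: the probability that at least two students in this classroom have the same birthday is one minus the probability that all n birthdays are different, that is, $1 - (365 \cdot 364 \cdots (365 - n + 1)) \;/\; 365^n$.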



In [1]:
from operator import mul
from functools import reduce

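# assume 10 people in class and 365 equally likely birthdays
def probSomeoneShares():
    """Approximate probability that one of the 9 other people shares your birthday."""
    return 9/365
def prob2StudentsShare():
    """return 1 - (365 * 364 * ... * 356)/(365 ** 10)"""
    lc = list(range(365, 355, -1))  # 365, 364, ..., 356
    return 1 - (reduce(mul, lc, 1) / (365 ** 10))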

print("Shares: " + str(probSomeoneShares()))
print("2 share: " + str(prob2StudentsShare()))

Shares: 0.024657534246575342
2 share: 0.11694817771107768
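
The cell above hard-codes a class of 10. As a hedged sketch, the same two computations can be written for any class size `n`; the function names and the `days` parameter are assumptions introduced here for illustration, and leap years are ignored. The first function uses the exact formula rather than the (n-1)/365 approximation.

```python
def prob_someone_shares_mine(n, days=365):
    """Exact probability that at least one of the other n - 1 people has my birthday."""
    return 1 - ((days - 1) / days) ** (n - 1)

def prob_any_two_share(n, days=365):
    """Probability that at least two of n people share a birthday."""
    all_different = 1.0
    for k in range(n):
        # The k-th person must avoid the k birthdays already taken.
        all_different *= (days - k) / days
    return 1 - all_different

for n in (10, 23, 50):
    print(n, round(prob_someone_shares_mine(n), 4), round(prob_any_two_share(n), 4))
```

For `n = 10` the second function reproduces the value printed above, and it first exceeds 1/2 at `n = 23`, which is why the second probability is the higher one.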


# 2. Dice (singular: Die) 
<center>
<img src="https://i.warosu.org/data/sci/img/0067/71/1411506402282.png" width="100" height="100" />
</center>

`p` is the traditional name for the Probability function:
```python
from fractions import Fraction
def p(event, space): 
    "The probability of an event, given a sample space of equiprobable outcomes."
    return Fraction(len(event & space), 
                    len(space))
```

To note:
* We use ```Fraction``` rather than regular division because we want exact answers like 1/3, not 0.3333333333333333.
* `&` is the Python set *intersection* operator, while `|` is the Python set *union* operator.
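
A small sketch of both notes, using an eight-sided die so as not to give away the exercise that follows. The names `D8`, `odd`, and `low` are illustrative and not part of the lecture code.

```python
from fractions import Fraction

def p(event, space):
    "The probability of an event, given a sample space of equiprobable outcomes."
    return Fraction(len(event & space), len(space))

D8 = {1, 2, 3, 4, 5, 6, 7, 8}  # illustrative sample space: one eight-sided die
odd = {1, 3, 5, 7}             # illustrative event
low = {1, 2, 3, 4}             # illustrative event

print(p(odd & low, D8))  # intersection {1, 3}: prints 1/4
print(p(odd | low, D8))  # union {1, 2, 3, 4, 5, 7}: prints 3/4
print(1 / 3)             # regular division gives an inexact float instead
```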

**Exercise**: What's the probability of rolling an even number with a single six-sided fair die? Use Python sets (unordered collections with no duplicate elements), since `p` relies on the set intersection operator `&`.

Define the sample space D:
```D    = {...}```

and the event even:
```even = {...}```

and compute the probability:
```p(even, D)```

Copy and paste the code above in the cell below, and replace ```...``` with the right values!

In [6]:
from fractions import Fraction
def p(event, space): 
    "The probability of an event, given a sample space of equiprobable outcomes."
    return Fraction(len(event & space), 
                    len(space))

D = {1,2,3,4,5,6}
even = {2,4,6}
p(even, D)

Fraction(1, 2)

What happens if you specify ```even = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}```?

To note:
* The definition of ```p``` uses ```len(event & space)``` rather than ```len(event)``` because I don't want to count outcomes that were specified in event but aren't actually in the sample space.
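
To see why the intersection matters, here is a sketch that compares `p` with a hypothetical `p_naive` that skips the intersection. `p_naive` is introduced only for this comparison and is not part of the lecture code.

```python
from fractions import Fraction

def p(event, space):
    "The probability of an event, given a sample space of equiprobable outcomes."
    return Fraction(len(event & space), len(space))

def p_naive(event, space):
    "Hypothetical variant that forgets to intersect the event with the sample space."
    return Fraction(len(event), len(space))

D = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}

print(p(even, D))        # 1/2
print(p_naive(even, D))  # 5/3, which is not a valid probability
```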

# 3. Urns, combinations, and permutations

<br />
<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Urn_problem_qtl6.svg/200px-Urn_problem_qtl6.svg.png"/>
</center>


Around 1700, Jacob Bernoulli wrote about removing colored balls from an urn in his landmark treatise *[Ars Conjectandi](https://en.wikipedia.org/wiki/Ars_Conjectandi)*.

Here is a three-part problem [adapted](http://mathforum.org/library/drmath/view/69151.html)  from mathforum.org:

> An urn contains 23 balls: 8 white, 6 blue, and 9 red.  We select six balls at random (each possible selection is equally likely). What is the probability of each of these possible outcomes:

> 1. All balls are red
> 2. 3 are blue, 2 are white, and 1 is red
> 3. Exactly 4 balls are white

So, outcome = set of 6 balls, sample space = set of all possible 6 ball combinations. 

We'll mark our balls `'W1'` through `'W8'`, `'B1'` through `'B6'`, and `'R1'` through `'R9'`.

To note:
- An outcome is a *set* of balls, where order doesn't matter, not a *sequence*, where order matters. When order **matters**, the possible outcomes are called **permutations**. When order **does not matter**, they are called **combinations**.

The number of *combinations* of balls is the number of *permutations* divided by `c!`, where *c* is the number of balls chosen. So there are fewer combinations than permutations. If I want to choose 2 white balls from the 8 available, there are 8 ways to choose a first white ball and 7 ways to choose a second (because the first one has been picked and is no longer available in the sample space), and therefore 8 &times; 7 = 56 permutations of two white balls. But there are only 56 / 2 = 28 combinations, because `(W1, W2)` is the same combination as `(W2, W1)`.
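
The 56 versus 28 count can be checked by brute force. This is a sketch using the standard library's `itertools.permutations` and `itertools.combinations`; the variable names are illustrative.

```python
import itertools
from math import factorial

white = ['W' + str(i) for i in range(1, 9)]  # 'W1' through 'W8'

ordered = list(itertools.permutations(white, 2))    # order matters
unordered = list(itertools.combinations(white, 2))  # order does not matter

print(len(ordered))    # 56
print(len(unordered))  # 28
print(len(ordered) // factorial(2) == len(unordered))  # True
```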

We'll start by defining the contents of the urn, and we'll use a Python `set` (an unordered collection with no duplicate elements), since the order of the balls in the urn doesn't matter. Since we're passing in strings, the `+` operator will concatenate strings together.

```python
def cross(A, B):
    """The set of ways of concatenating one item from collection A with one from B."""
    return {a + b 
            for a ... for b ...}  # fill in the ...

urn = cross('W', '12345678') | cross('B', ...) | cross('R', ...) #fill in the ...
urn
len(urn)
```

In [7]:
def cross(A, B):
    """The set of ways of concatenating one item from collection A with one from B."""
    return {a + b 
            for a in A for b in B}

urn = cross('W', '12345678') | cross('B', '123456') | cross('R', '123456789')
urn

{'B1',
 'B2',
 'B3',
 'B4',
 'B5',
 'B6',
 'R1',
 'R2',
 'R3',
 'R4',
 'R5',
 'R6',
 'R7',
 'R8',
 'R9',
 'W1',
 'W2',
 'W3',
 'W4',
 'W5',
 'W6',
 'W7',
 'W8'}

Now let's define the sample space, `U6`, as the set of all 6-ball combinations (physicists define all forces in nature in terms of similar sample spaces where the number of samples is the number of symmetries in the behavior of the objects that the forces act upon. Read [here](https://arxiv.org/pdf/hep-th/9712154.pdf) for a good introduction to symmetry groups in physics).

We will use Python's `itertools.combinations` function to generate the combinations, and then join each combination into a string:

```python
import itertools

def combinations(items, n):
    "All combinations of n items; each combination as a concatenated str."
    return {' '.join(combo) 
            for combo in itertools.combinations(items, n)}

U6 = combinations(urn, 6)
len(U6)
```

In [8]:
import itertools

def combinations(items, n):
    "All combinations of n items; each combination as a concatenated str."
    return {' '.join(combo) 
            for combo in itertools.combinations(items, n)}

U6 = combinations(urn, 6)
len(U6)

100947

You should find that there are 100,947 members in our sample space. To take a peek at a random sample of 10 of them (you should always take a peek at big datasets in Data Science. *Always*):

```python
import random
random.sample(sorted(U6), 10)  # sample() needs a sequence; sets are rejected since Python 3.11
```

In [9]:
import random
random.sample(sorted(U6), 10)  # sample() needs a sequence; sets are rejected since Python 3.11

['B4 B6 R6 W3 R7 B3',
 'R8 R4 R6 W2 B3 R5',
 'R4 R3 R6 W6 R5 R2',
 'W5 B6 R1 W2 R7 R2',
 'R4 W4 B6 W8 W2 R7',
 'R8 W8 B5 R1 W2 R5',
 'R8 B4 R4 W4 R3 W3',
 'W8 B5 W3 B1 R7 R2',
 'R9 R6 W1 W2 B1 R5',
 'R8 B5 R1 W2 B1 B3']

We can pick any of 23 balls for the first item, any of 22 for the second, ..., and any of 18 for the sixth. But since we don't care about the ordering of the six items, we divide the product by 6! (the number of possible orderings, or permutations, of 6 things) and thus:

$$23 ~\mbox{choose}~ 6 = \frac{23 \cdot 22 \cdot 21 \cdot 20 \cdot 19 \cdot 18}{6!} = 100947$$

But since $23 \cdot 22 \cdot 21 \cdot 20 \cdot 19 \cdot 18 = 23! \;/\; 17!$, we can write:

$$n ~\mbox{choose}~ c = \frac{n!}{(n - c)! \cdot c!}$$

To translate that to code, use the following, and note that
* Python 3 has two division operators: a single slash `/` for *true* division, which always returns a float (`7 / 2` is `3.5`), and a double slash `//` for *floor* division, which rounds down to the nearest whole number (`7 // 2` is `3`). We use `//` here so that the result stays an exact integer. (In Python 2, `/` applied to two integers performed floor division.)

```python
from math import factorial

def choose(n, c):
    """Number of ways to choose c items from a list of n items."""
    return factorial(n) // (factorial(n - c) * factorial(c))
choose(23, 6)
```
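
As a sanity check on `choose`, the sketch below compares it with the standard library's `math.comb`. This assumes Python 3.8 or later, where `math.comb` was added. It also shows the difference between the two division operators.

```python
from math import comb, factorial

def choose(n, c):
    """Number of ways to choose c items from a list of n items."""
    return factorial(n) // (factorial(n - c) * factorial(c))

print(choose(23, 6))                 # 100947
print(choose(23, 6) == comb(23, 6))  # True

# True division returns a float even when the result is a whole number.
as_float = factorial(23) / (factorial(17) * factorial(6))
print(type(as_float).__name__)       # float
```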


To note:
* `count()` is the Python string method that returns the number of non-overlapping occurrences of its argument in the string (lists and tuples have a similar `count()` method; sets and dicts do not). True statement: ```'foobar'.count('o') == 2```. 
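
Here is a sketch of how `count()` behaves on one of the outcome strings stored in `U6`. The string `'R1 R11'` is a hypothetical label used only to show a caveat; no such ball exists in our urn.

```python
combo = 'R8 R4 R6 W2 B3 R5'  # one 6-ball outcome, formatted as in U6

print(combo.count('R'))  # 4
print(combo.count('W'))  # 1
print(combo.count('B'))  # 1

# Caveat: count() matches substrings, so counting colors this way only works
# because each color letter appears exactly once per ball label.
print('R1 R11'.count('R1'))  # 2, since 'R1' also occurs inside 'R11'
```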

Now we're ready to answer the 3 problems: 

### Urn Problem 1: what's the probability of selecting 6 red balls? 

```python
red6 = {b for b in U6 if b.count(...) == ...}  # fill in the ...
print(red6)
p(red6, U6)
```

Go ahead, cut and paste below and replace `...` with the right answer. Then verify your answer by running the code below and ensuring that it equals the probability of picking 6 out of the 9 red balls, in an unordered fashion, out of sample space U6:
```python
p(red6, U6) == Fraction(choose(9, 6), 
                        len(U6))
```

In [10]:
red6 = {b for b in U6 if b.count('R') == 6}
print(red6)
p(red6, U6)

{'R4 R9 R3 R6 R1 R7', 'R8 R6 R1 R7 R5 R2', 'R8 R4 R3 R6 R7 R2', 'R8 R4 R6 R1 R7 R5', 'R9 R3 R6 R1 R7 R5', 'R4 R3 R6 R1 R5 R2', 'R4 R3 R6 R1 R7 R2', 'R8 R4 R9 R6 R7 R5', 'R8 R4 R3 R6 R1 R2', 'R8 R4 R9 R6 R5 R2', 'R8 R4 R9 R3 R7 R5', 'R8 R4 R9 R1 R7 R5', 'R8 R4 R9 R1 R5 R2', 'R4 R3 R1 R7 R5 R2', 'R8 R9 R3 R6 R1 R2', 'R8 R9 R6 R7 R5 R2', 'R8 R4 R6 R1 R5 R2', 'R8 R4 R9 R6 R7 R2', 'R9 R6 R1 R7 R5 R2', 'R8 R9 R6 R1 R7 R2', 'R3 R6 R1 R7 R5 R2', 'R8 R4 R3 R1 R7 R5', 'R8 R4 R9 R3 R6 R5', 'R8 R4 R3 R6 R1 R7', 'R4 R3 R6 R7 R5 R2', 'R8 R3 R6 R1 R7 R2', 'R8 R9 R3 R1 R7 R2', 'R8 R4 R9 R6 R1 R2', 'R4 R9 R1 R7 R5 R2', 'R8 R4 R9 R3 R1 R7', 'R8 R4 R9 R3 R1 R2', 'R8 R9 R1 R7 R5 R2', 'R8 R4 R9 R3 R6 R2', 'R4 R9 R3 R6 R1 R2', 'R8 R4 R9 R3 R6 R1', 'R4 R3 R6 R1 R7 R5', 'R4 R9 R6 R1 R5 R2', 'R8 R4 R3 R6 R7 R5', 'R8 R4 R3 R1 R7 R2', 'R4 R9 R3 R6 R7 R5', 'R9 R3 R6 R1 R5 R2', 'R8 R4 R3 R6 R1 R5', 'R8 R3 R1 R7 R5 R2', 'R4 R9 R3 R1 R7 R2', 'R4 R9 R3 R6 R7 R2', 'R8 R9 R3 R6 R1 R7', 'R8 R4 R9 R3 R1 R5', 'R4 R9 R6 R1

Fraction(4, 4807)

### Urn Problem 2: what is the probability of selecting 3 blue, 2 white, and 1 red?

```python
b3w2r1 = {s for s in U6 if ...}
p(b3w2r1, U6)
```

and verify that it's equal to the probability of picking 3 blue out of 6 and 2 white out of 8 and 1 red out of 9:
```python
p(b3w2r1, U6) == Fraction(choose(6, 3) * choose(8, 2) * choose(9, 1), 
                          len(U6))
```

You can also reason that there are 6 ways to pick the first blue, 5 ways to pick the second blue, and 4 ways to pick the third. Then 8 ways to pick the first white and 7 to pick the second. Then 9 ways to pick a red. But the order 'B1, B2, B3' should count as the same as 'B2, B3, B1' and all the other orderings; so divide by 3! to account for the permutations of blues, by 2! to account for the permutations of whites, and finally by 100947 to get a probability.
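
Written out as arithmetic, using only the numbers stated in the paragraph above, this reasoning gives the same value that `p(b3w2r1, U6)` returns:

$$\frac{1}{100947} \cdot \frac{6 \cdot 5 \cdot 4}{3!} \cdot \frac{8 \cdot 7}{2!} \cdot 9 = \frac{20 \cdot 28 \cdot 9}{100947} = \frac{5040}{100947} = \frac{240}{4807}$$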

In [11]:
b3w2r1 = {s for s in U6 if s.count('B') == 3 and s.count('W') == 2 and s.count('R') == 1}
p(b3w2r1, U6)

Fraction(240, 4807)

In [14]:
# choose() is defined in the block after the n-choose-c formula; run it first
Fraction(choose(6, 3) * choose(8, 2) * choose(9, 1), len(U6))

Fraction(240, 4807)

### Urn Problem 3: What is the probability of exactly 4 white balls?

In other words, choosing 4 out of the 8 white balls and 2 out of the 15 non-white balls.
```python
w4 = {s for s in U6 if
      s.count('W') == 4}

p(w4, U6)

p(w4, U6) == Fraction(choose(8, 4) * choose(15, 2),
                      len(U6))

p(w4, U6) == Fraction((8 * 7 * 6 * 5) * (15 * 14),
                      factorial(4) * factorial(2) * len(U6))
```

In [16]:
w4 = {s for s in U6 if
      s.count('W') == 4}

p(w4, U6)

Fraction(350, 4807)

In [17]:
# choose() must already be defined in this session
p(w4, U6) == Fraction(choose(8, 4) * choose(15, 2),
                      len(U6))

True

# 4. Working with transformations instead of samples

<br />
<center>
<img src="http://agilitrix.com/wp-content/uploads/2013/02/1-Transformation.jpg"  width="500" />
</center>

Sometimes we don't have a straightforward way to easily enumerate all possible samples in a sample space, but we have an easy way of defining a transformation that will yield a desired sample. In other words, we want to work with **lambdas** instead of **objects**. 

Here's a generator for natural numbers, a compact, transformation-based way of defining natural numbers so that we don't have to write a lot of data:

In [11]:
def numbers():
    i = 0
    while True:
        yield i
        i += 1

And here's how to consume the generator (a bit ugly, granted, can you improve on this, i.e. no `break` statement?):

In [81]:
import queue as Q
q = Q.Queue()
for index, item in enumerate(numbers()):
    q.put(item)
    if index == 100:
        break
print(list(q.queue))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
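
One possible answer to the question above is `itertools.islice`, which takes a fixed number of items from an infinite generator without an explicit `break` or a queue. This is a sketch of one option among several.

```python
from itertools import islice

def numbers():
    """Generator for the natural numbers 0, 1, 2, ..."""
    i = 0
    while True:
        yield i
        i += 1

first_101 = list(islice(numbers(), 101))  # stop after 101 items, no break needed
print(first_101[:5], '...', first_101[-1])  # [0, 1, 2, 3, 4] ... 100
```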


In the case of the die roll, I can define `even` as a function instead of a set: 

In [18]:
def even(n): return n % 2 == 0

Now in order to make `p(even, D)` work, I'll have to modify `p` to accept an event as either
a *set* of outcomes (as before), or a *predicate* over outcomes&mdash;a function that returns true for an outcome that is in the event:

In [19]:
def p(event, space): 
    """The probability of an event, given a sample space of equiprobable outcomes.
    event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
    if is_predicate(event):
        event = such_that(event, space)
    return Fraction(len(event & space), len(space))

is_predicate = callable

def such_that(predicate, collection): 
    """The subset of elements in the collection for which the predicate is true."""
    return {e for e in collection if predicate(e)}

Here we see how `such_that`, the new `even` predicate, and the new `p` work:

In [20]:
such_that(even, D)

{2, 4, 6}

In [21]:
p(even, D)

Fraction(1, 2)

In [22]:
D12 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

such_that(even, D12)

{2, 4, 6, 8, 10, 12}

In [23]:
p(even, D12)

Fraction(1, 2)

Note: `such_that` is just like the built-in Python function `filter` (recall that we talked about the most important functions in Python for Data Science: `map`, `filter`, and `functools.reduce`), except that `such_that` returns a *set*.
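
The relationship between the two can be sketched as follows. In Python 3, `filter` returns a lazy iterator, so it is wrapped in `set()` for the comparison.

```python
def even(n): return n % 2 == 0

def such_that(predicate, collection):
    """The subset of elements in the collection for which the predicate is true."""
    return {e for e in collection if predicate(e)}

D12 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

lazy = filter(even, D12)                  # a lazy iterator, not a set
print(set(lazy) == such_that(even, D12))  # True
print(sorted(such_that(even, D12)))       # [2, 4, 6, 8, 10, 12]
```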

We can now define more interesting events using predicates; for example we can determine the probability that the sum of a three-dice roll is *prime*:

```python
D = {1,2,3,4,5,6}

D3 = {(d1, d2, d3) for d1 in D for d2 in D for d3 in D}

def prime_sum(outcome): return is_prime(sum(outcome))

def is_prime(n): return ...  # implement is_prime()!

p(prime_sum, D3)
```

Go ahead, please implement `is_prime()`.

In [21]:
D3 = {(d1, d2, d3) for d1 in D for d2 in D for d3 in D}

def prime_sum(outcome): return is_prime(sum(outcome))

def is_prime(n): return ...  # implement is_prime()!

p(prime_sum, D3)

TypeError: unsupported operand type(s) for &: 'function' and 'set'
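
The `TypeError` above is what the original, set-only `p` raises when it is handed a function, so the cell was run without the predicate-aware `p` in effect. Below is a minimal self-contained sketch of one way to complete the exercise. The trial-division `is_prime` is just one possible implementation, and `p` is restated compactly so the block runs on its own.

```python
from fractions import Fraction

def p(event, space):
    """Probability of an event (a set or a predicate) over equiprobable outcomes."""
    if callable(event):
        event = {e for e in space if event(e)}
    return Fraction(len(event & space), len(space))

def is_prime(n):
    """One possible implementation: trial division up to the square root of n."""
    return n > 1 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def prime_sum(outcome): return is_prime(sum(outcome))

D = {1, 2, 3, 4, 5, 6}
D3 = {(d1, d2, d3) for d1 in D for d2 in D for d3 in D}

print(p(prime_sum, D3))  # 73/216
```

Of the 216 equally likely three-dice outcomes, 73 have a sum in {3, 5, 7, 11, 13, 17}.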

# Conclusion of Week 3, Lecture 1
We talked about **frequentist** and **Bayesian** statistics, defined each, and took a first peek at **Bayes' Theorem**, a pillar of Data Science. **Probability** is all about *counting*: What's the number of alternate (quantum) universes that can exist? In how many universes does Lewis Hamilton win the F1 championship? In how many universes does Fernando Alonso win the F1 championship? Computers are essentially counting machines. So we built a counting framework to let computers do the counting for us. Use it to get an idea of which driver to place your bet on!

We'll continue this lab next week. In the meantime, make sure you understand everything we did in this notebook, because our data science will get more complicated.

<br />
<center>
    <img src="ipynb.images/complicated.jpg" width=400 />**Complicated**<br>unknown artist
</center>

# Homework for next week

<br />
<center>
    <img src="ipynb.images/f1races.png" width=800 />
</center>

Assume there are two F1 races coming up: The Russian Grand Prix this weekend and the Japanese Grand Prix the weekend after. The 2018 driver standings are given [here](https://www.formula1.com/en/results.html/2018/drivers.html). Given these standings (please do not use team standings given on the same Web site, use driver standings), what is the Probability Distribution for each F1 driver to win the Russian Grand Prix? What is the Probability Distribution for each F1 driver to win *both* the Russian and Japanese Grand Prix? What is the probability for Ferrari to win both races? What is the probability for Ferrari to win at least one race? Note that Ferrari, and each other racing team, has two drivers per race.