# Probability: Introduction to Foundational Concepts

## Objectives

- Recognize combinatorics
- Describe the fundamentals of probability theory
- Describe set theory and its terminology
- Define conditional probability

![xkcd](images/increased_risk_2x.png)
[[Image Source]](https://xkcd.com/1252/)

# Set Theory Basics

In probability theory, a **set** is a well-defined collection of objects (called **elements**). An element is either in a given set or not, and order is not considered.

> If an **element** $x$ belongs to a set $S$, then you'd write $x \in S$. 
> 
> On the other hand, if $x$ does not belong to $S$, then you'd write $x\notin S$.

Although sets can be abstract mathematical objects, we'll typically think of a set as a collection of things sharing some relevant attribute(s).

## Set Notation

<img src="images/new_venn_diagram.png" alt="venn diagram with set notation" width=500>

The **union** of two sets $A$ and $B$ is the set $T$ of elements in either $A$ or $B$, or in both. We denote this with the symbol $\cup$.


$$\large T = A\cup B = B\cup A$$

The **intersection** of two sets $A$ and $B$ is the set $V$ that contains all elements of $A$ that also belong to $B$. We denote this with the symbol $\cap$.

$$\large V = A\cap B = B\cap A$$

<img src="images/setnotation.jpg" alt="set notation, found from https://slideplayer.com/slide/10502152/" width=800>

[Image Source](https://slideplayer.com/slide/10502152/)

More on Sets: https://www.mathsisfun.com/sets/sets-introduction.html

### But Now, in Python:

These are set methods: they can only be called on Python sets (though the argument `t` can be any iterable).

| Method                      | Result |
| ------                      | ------ |
| `s.issubset(t)`             | test whether every element in s is in t |
| `s.issuperset(t)`           | test whether every element in t is in s |
| `s.union(t)`                | new set with elements from s or t (or both) |
| `s.intersection(t)`         | new set with elements common to s and t |
| `s.difference(t)`           | new set with elements in s but not in t |
| `s.symmetric_difference(t)` | new set with elements in either s or t but not both |
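
As a minimal sketch of the methods in the table, here they are applied to a few made-up sets. The names `evens`, `high`, and `all_faces` are illustrative assumptions, not part of the lesson's data.

```python
# Hypothetical example sets: faces of a six-sided die
evens = {2, 4, 6}
high = {4, 5, 6}
all_faces = {1, 2, 3, 4, 5, 6}

# Membership: "x in S" and "x not in S"
print(2 in evens)       # True
print(3 not in evens)   # True

# Subset and superset tests
print(evens.issubset(all_faces))    # True
print(all_faces.issuperset(high))   # True

# Union and intersection give the same result in either order
union = evens.union(high)                 # {2, 4, 5, 6}
intersection = evens.intersection(high)   # {4, 6}
print(union == high.union(evens))                 # True
print(intersection == high.intersection(evens))   # True

# Difference depends on the order; symmetric difference does not
only_evens = evens.difference(high)                  # {2}
either_not_both = evens.symmetric_difference(high)   # {2, 5}
print(only_evens, either_not_both)
```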

# Combinatorics

Before we dive in much further, let's briefly talk about how you can take elements from different collections and group them.

### Combinatorics Definition

> Combinatorics is the branch of mathematics that deals with the relations characterizing sets, subsets, lists, and multisets.
>
> Sometimes combinatorics is said to be the branch of math that deals with counting; and that’s true, but not in the sense in which you learned to count in kindergarten. Though combinatorics deals with numbering and finding out how many members are in sets, it’s designed to find ways of doing that **without actual, potentially tedious, counting** involved.

-- [Statistics How To](https://www.statisticshowto.com/probability-and-statistics/combinatorics/)

### What's the difference between a permutation and a combination?

In a **permutation**, order matters. If you have a race, it matters who arrives in first, or second, or third place - there's a difference in the ordering of the group!

> The number of **permutations** of **n** objects taken **r** at a time is given by the formula:
>
> $$\large P(n,r) = \frac{n!}{(n - r)!}$$

In a **combination**, you only care about which items are members of the set. For example, if you're creating groups of students to work on a project, the order in which you add their names to the group doesn't really matter. It's the group members themselves, not any ordering, that you care about.

> The number of **combinations** of **n** objects taken **r** at a time is given by the formula:
>
> $$\large C(n,r) = \frac{n!}{r!(n - r)!}$$

Main things to ask when dealing with combinations or permutations:
- Does order matter? 
- With or without replacement? (aka does the first choice then restrict the second choice?)
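
The two formulas above can be sketched directly in Python and cross-checked against a brute-force count from `itertools`. The helper names `n_permutations` and `n_combinations` are made up for this illustration; `math.perm` and `math.comb` are the built-in equivalents, available in Python 3.8 and later.

```python
import itertools
import math


def n_permutations(n, r):
    """P(n, r) = n! / (n - r)!  (order matters, no replacement)."""
    return math.factorial(n) // math.factorial(n - r)


def n_combinations(n, r):
    """C(n, r) = n! / (r! (n - r)!)  (order ignored, no replacement)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


items = ['A', 'B', 'C', 'D', 'E']
n, r = len(items), 3

perm_count = n_permutations(n, r)   # 60
comb_count = n_combinations(n, r)   # 10

# Brute-force check: count what itertools actually generates
print(perm_count == len(list(itertools.permutations(items, r))))   # True
print(comb_count == len(list(itertools.combinations(items, r))))   # True

# Each combination of r items can be ordered in r! ways
print(perm_count == comb_count * math.factorial(r))   # True

# Built-in equivalents (Python 3.8+)
print(math.perm(n, r), math.comb(n, r))   # 60 10
```

Using integer division (`//`) keeps the counts as exact integers rather than floats.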

### Q: How many possible orders are there for first/second/third place in a race with 30 contestants?

In [3]:
# Can use math:
import math

math.factorial(30) / (math.factorial(30-3))

24360.0

In [4]:
range(1,31)

range(1, 31)

In [7]:
# The standard-library module itertools also has combinations and permutations!
import itertools

list(itertools.permutations(['A', 'B', 'C', 'D', 'E'], 3))

[('A', 'B', 'C'),
 ('A', 'B', 'D'),
 ('A', 'B', 'E'),
 ('A', 'C', 'B'),
 ('A', 'C', 'D'),
 ('A', 'C', 'E'),
 ('A', 'D', 'B'),
 ('A', 'D', 'C'),
 ('A', 'D', 'E'),
 ('A', 'E', 'B'),
 ('A', 'E', 'C'),
 ('A', 'E', 'D'),
 ('B', 'A', 'C'),
 ('B', 'A', 'D'),
 ('B', 'A', 'E'),
 ('B', 'C', 'A'),
 ('B', 'C', 'D'),
 ('B', 'C', 'E'),
 ('B', 'D', 'A'),
 ('B', 'D', 'C'),
 ('B', 'D', 'E'),
 ('B', 'E', 'A'),
 ('B', 'E', 'C'),
 ('B', 'E', 'D'),
 ('C', 'A', 'B'),
 ('C', 'A', 'D'),
 ('C', 'A', 'E'),
 ('C', 'B', 'A'),
 ('C', 'B', 'D'),
 ('C', 'B', 'E'),
 ('C', 'D', 'A'),
 ('C', 'D', 'B'),
 ('C', 'D', 'E'),
 ('C', 'E', 'A'),
 ('C', 'E', 'B'),
 ('C', 'E', 'D'),
 ('D', 'A', 'B'),
 ('D', 'A', 'C'),
 ('D', 'A', 'E'),
 ('D', 'B', 'A'),
 ('D', 'B', 'C'),
 ('D', 'B', 'E'),
 ('D', 'C', 'A'),
 ('D', 'C', 'B'),
 ('D', 'C', 'E'),
 ('D', 'E', 'A'),
 ('D', 'E', 'B'),
 ('D', 'E', 'C'),
 ('E', 'A', 'B'),
 ('E', 'A', 'C'),
 ('E', 'A', 'D'),
 ('E', 'B', 'A'),
 ('E', 'B', 'C'),
 ('E', 'B', 'D'),
 ('E', 'C', 'A'),
 ('E', 'C', 'B'),
 ('E', 'C', 'D'),
 ('E', 'D', 'A'),
 ('E', 'D', 'B'),
 ('E', 'D', 'C')]

### Q: How many possible codes are there for a standard padlock?

> Hint: there are 40 numbers on the dial, and a code uses 3 of them **with replacement**

In [None]:
# Calculate it:

# Why can't we use itertools.permutations?
# Because it doesn't allow for replacement (alas); itertools.product(..., repeat=3) does

40 * 40 * 40
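
As a hedged sketch of the difference replacement makes, `itertools.product` with `repeat` enumerates ordered codes where numbers may repeat, while `itertools.permutations` never reuses a number. The variable names and the choice of `range(40)` to stand in for the dial are assumptions for illustration.

```python
import itertools

dial = range(40)   # stand-in for the 40 numbers on the dial
code_length = 3

# Ordered triples where a number may be reused (with replacement)
codes_with_replacement = sum(
    1 for _ in itertools.product(dial, repeat=code_length)
)

# Ordered triples where each number is used at most once (without replacement)
codes_without_replacement = sum(
    1 for _ in itertools.permutations(dial, code_length)
)

print(codes_with_replacement)      # 64000, which is 40 ** 3
print(codes_without_replacement)   # 59280, which is 40 * 39 * 38
```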

### Q: How many unique 3 topping pizzas can you make from the following 8 ingredients:

- Mushrooms
- Pepperoni
- Onion
- Peppers
- Ham
- Pineapple
- Sausage
- Olives
    
> Side note: which is the worst?

In [9]:
# Using math
n = 8
r = 3

# n! / (r! * (n-r)!)

math.factorial(n) / (math.factorial(r) * math.factorial(n - r))

56.0

In [12]:
# Our list of toppings
toppings = ["Mushrooms", "Pepperoni", "Onion", "Peppers", 
            "Ham", "Pineapple", "Sausage", "Olives"]

# Using itertools
three_topping_pizzas = list(itertools.combinations(toppings, 3))
three_topping_pizzas
len(three_topping_pizzas)

56

# Probability Theory

Probability theory is the study of the frequency of a given event occurring with respect to all possible events.

> **Probability is the likelihood of a specific outcome occurring out of all possible outcomes, expressed as a number between 0 and 1.**

But perhaps more importantly:

> **"Probabilities do not tell us what will happen for sure; they tell us what is _likely to happen_ and what is _less likely to happen_."**
>
> -- _Naked Statistics_, by Charles Wheelan, p. 72

When every outcome is equally likely, you can think of it as dividing the number of outcomes in the event you're exploring by the number of all possible outcomes:

$$\large P(Event) = \frac{|Event|}{|Sample\ Space|} $$

## Why Should I Care About Probabilities?

Studying probabilities allows us to make better and more informed decisions, based on data previously collected. For example, understanding the fact that it is nearly impossible for us to ever win the lottery from a probabilistic standpoint deters us from using it as a source of income.

Probability theory also lies at the heart of making inferences using our data, which is what statistics is all about!

## Probability Terminology

From [_Introduction to Probability, Statistics, and Random Processes_](https://www.probabilitycourse.com/chapter1/1_3_1_random_experiments.php):

> **Outcome:** A result of a random experiment.
>
> **Sample Space:** The set of all possible outcomes.
>
> **Event:** A subset of the sample space.

To be more specific, an **event** is a set of one or more outcomes of a random experiment, often the result we are hoping to predict (so an outcome in the event is sometimes called a successful outcome).

A random variable can be either continuous or discrete:
- **Continuous**: the variable can take any value within a range (e.g. a person's height)
- **Discrete**: the variable can only take distinct, countable values (e.g. number of cars owned)

This is important because, for continuous variables, the probability of some specific event x _will functionally be zero_! BUT "in reality we are always interested in the probability of some intervals rather than a specific point x" [[Source]](https://www.probabilitycourse.com/chapter1/1_3_5_continuous_models.php)
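
A small simulation can make this concrete. The sketch below assumes a continuous variable drawn uniformly from the interval [0, 1) with `random.random()`. The seed, sample size, target point, and interval are arbitrary choices for illustration.

```python
import random

random.seed(0)      # fixed seed so the run is repeatable
n_draws = 100_000

# Draws from a continuous uniform distribution on [0, 1)
draws = [random.random() for _ in range(n_draws)]

# Share of draws that land exactly on one specific point
share_exact = sum(d == 0.5 for d in draws) / n_draws

# Share of draws that land in an interval of width 0.3
share_interval = sum(0.2 <= d < 0.5 for d in draws) / n_draws

print(share_exact)      # effectively zero
print(share_interval)   # close to 0.3, the width of the interval
```

The exact point is essentially never hit, while the interval is hit in proportion to its width.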

#### Independent Events
A special condition is when the outcome of $A$ has no bearing on the outcome of $B$. We say these two events are **independent** (e.g. rolling a die and tossing a coin).

Formally, $A$ and $B$ are *independent* if and only if the probability that *both* $A$ *and* $B$ happen is:

$$\large P(A \cap B) = P(A) \cdot P(B)$$
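
One way to check this condition is to enumerate a small sample space and compare both sides exactly. This is a sketch using the die-and-coin example. The event names and the `prob` helper are made up for illustration, and every outcome is assumed equally likely.

```python
from fractions import Fraction
from itertools import product

# One die roll and one coin toss: 12 equally likely outcomes
sample_space = set(product(range(1, 7), ['heads', 'tails']))


def prob(event):
    """Probability of an event (a subset of sample_space)."""
    return Fraction(len(event), len(sample_space))


even_roll = {o for o in sample_space if o[0] % 2 == 0}
heads = {o for o in sample_space if o[1] == 'heads'}
high_roll = {o for o in sample_space if o[0] > 3}

# The die and the coin do not influence each other: 1/4 == 1/2 * 1/2
die_coin_independent = prob(even_roll & heads) == prob(even_roll) * prob(heads)

# Two events about the same die roll can be dependent: 1/3 != 1/2 * 1/2
same_die_independent = (
    prob(even_roll & high_roll) == prob(even_roll) * prob(high_roll)
)

print(die_coin_independent)   # True
print(same_die_independent)   # False
```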

## Probability Using Sets

In its most basic form, the **probability** of an event occurring is the size of the specific outcome set divided by the size of the sample space. In some ways, probability is just about counting outcomes.

**Example**: What is the probability of rolling an even number on a six-sided die?

* Event: {2, 4, 6} has size 3
* Sample space: {1, 2, 3, 4, 5, 6} has size 6
* Thus the probability = 3/6 = 1/2 = 50%

We sometimes write probability statements using notation like $P(E) = 0.5$, where $E$ is an event, such as rolling an even number on a six-sided die.
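
The die example can be written as a short counting sketch with Python sets. The `probability` helper is a hypothetical name, and it assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction


def probability(event, sample_space):
    """Size of the event divided by the size of the sample space."""
    event = set(event)
    if not event.issubset(sample_space):
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


sample_space = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}

p_even = probability(even, sample_space)
print(p_even)          # 1/2
print(float(p_even))   # 0.5
```

`Fraction` keeps the answer exact, so 3/6 is reduced to 1/2 rather than rounded.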

### 🧠 Knowledge Check

### 1) AND Question:

What is the probability of rolling a 5 on a fair die _and_ getting a tails on a fair coin toss?

**Your answer here**:

- 1/6 * 1/2 = 1/12


In [None]:
# The 12 equally likely outcomes:
# roll a 1 and get a heads
# roll a 2 and get a heads
# etc.
# roll a 1 and get a tails
# etc. etc.


### 2) OR Question: 

What is the probability of rolling a 5 on a die _or_ getting a tails on a coin toss?

**Your answer here**:

- P(rolling a 5 on a die) = 1/6
- P(getting a tails on a coin toss) = 1/2
- P(rolling a 5 and getting a tails) = 1/6 * 1/2 = 1/12

Adding 1/6 and 1/2 counts the outcome "5 tails" twice, so we need to subtract P(A and B) once to get the final answer: 1/6 + 1/2 - 1/12 = 7/12.

Of the 12 equally likely outcomes, 7 (in bold) include a 5 or a tails:

- heads: 1, 2, 3, 4, **5**, 6
- tails: **1**, **2**, **3**, **4**, **5**, **6**

answer: 7/12

P(A or B) = P(A) + P(B) - P(A and B)
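
The addition rule can be checked by listing the outcomes in Python. This is a sketch for the die-or-coin question. The variable names are illustrative, and the 12 outcomes are assumed equally likely.

```python
from fractions import Fraction
from itertools import product

# All (roll, coin) pairs for one die roll and one coin toss
outcomes = list(product(range(1, 7), ['heads', 'tails']))

rolled_five = {o for o in outcomes if o[0] == 5}
got_tails = {o for o in outcomes if o[1] == 'tails'}

p_a = Fraction(len(rolled_five), len(outcomes))                    # 1/6
p_b = Fraction(len(got_tails), len(outcomes))                      # 1/2
p_a_and_b = Fraction(len(rolled_five & got_tails), len(outcomes))  # 1/12

# Counting the union directly never counts (5, 'tails') twice
p_a_or_b = Fraction(len(rolled_five | got_tails), len(outcomes))

print(p_a_or_b)                            # 7/12
print(p_a_or_b == p_a + p_b - p_a_and_b)   # True
```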

## Enough Talk - Let's Explore in Python!

### Mushroom dataset

Let's look at a modified version of the Mushroom dataset from UCI [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). Each row in this dataset corresponds to one observation (one mushroom). 

In [13]:
# Import pandas
import pandas as pd

In [14]:
# Read in the data, check the first few rows
df = pd.read_csv('data/Mushrooms_cleaned.csv')
df.head()

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
0,poisonous,close,enlarging,white,white,black,True
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
3,poisonous,close,enlarging,white,white,brown,True
4,edible,crowded,tapering,white,white,black,False


In [15]:
# Describe the dataset
df.describe()

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
count,8124,8124,8124,8124,8124,8124,8124
unique,2,2,2,9,9,12,2
top,edible,close,tapering,white,white,buff,False
freq,4208,6812,4608,4464,4384,1728,4748


#### 1) If you picked a row from this dataset at random, what is the probability it corresponds to a bruised mushroom? 

In other words, find $P(bruised)$

In [20]:
df['bruised'].value_counts(normalize=True)

False    0.584441
True     0.415559
Name: bruised, dtype: float64

In [18]:
# Let's find our bruised mushrooms
df.loc[df['bruised'] == True]

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
0,poisonous,close,enlarging,white,white,black,True
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
3,poisonous,close,enlarging,white,white,brown,True
5,edible,close,enlarging,white,white,brown,True
...,...,...,...,...,...,...,...
7940,edible,close,enlarging,white,white,white,True
7946,edible,close,enlarging,white,white,white,True
7952,edible,close,enlarging,white,white,white,True
7965,edible,close,enlarging,white,white,white,True


In [31]:
# And now the probability
p_bruised = len(df.loc[df['bruised'] == True]) / len(df)

In [26]:
# Let's see what we get with a quick sample row...
df.sample(1)

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
3693,edible,close,tapering,white,white,purple,True


#### 2) What is the probability you pick a row corresponding to a mushroom that is bruised _AND_ edible?

$P(edible \cap bruised)$

BUT! Are they independent events ...?


In [28]:
# Use loc to find where they're bruised and edible
df.loc[(df['bruised'] == True) & (df['edible-poisonous'] == 'edible')]

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
5,edible,close,enlarging,white,white,brown,True
6,edible,close,enlarging,white,white,gray,True
7,edible,close,enlarging,white,white,brown,True
...,...,...,...,...,...,...,...
7940,edible,close,enlarging,white,white,white,True
7946,edible,close,enlarging,white,white,white,True
7952,edible,close,enlarging,white,white,white,True
7965,edible,close,enlarging,white,white,white,True


In [30]:
# Calculate the probability
p_bruised_and_edible = len(df.loc[(df['bruised'] == True) & (df['edible-poisonous'] == 'edible')]) / len(df)
p_bruised_and_edible

0.33874938453963566

Are being bruised and being edible independent of each other?

> Formally, $A$ and $B$ are *independent* if and only if the probability that *both* $A$ *and* $B$ happen is:
> 
> $$\large P(A \cap B) = P(A) \cdot P(B)$$

In [32]:
# Let's find P(bruised)
p_bruised = len(df.loc[df['bruised'] == True]) / len(df)

In [33]:
# And now P(edible)
p_edible = len(df.loc[df['edible-poisonous'] == 'edible']) / len(df)

In [34]:
# And check - does P(bruised and edible) == P(bruised) * P(edible)?
p_bruised * p_edible

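0.21524761082589627

Since $P(bruised \cap edible) \approx 0.339$ is not equal to $P(bruised) \cdot P(edible) \approx 0.215$, being bruised and being edible are **not** independent in this dataset.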

## Enter: Conditional Probability

### When do we compute conditional probabilities? 

We need to compute conditional probabilities when the outcome of an event depends on the outcome of previous events (**dependent** events). A conditional probability of an event is the probability of the event *given* another event has occurred.


When events _are_ independent, the rule for probabilistic AND ('$\cap$') is simple:

$$\large P(a\cap b) = P(a) P(b)$$

But the more general rule, which includes non-independent events, is:

$$\large P(a\cap b) = P(a | b) P(b)$$

In fact, this is the definition of conditional probability. Rearranging:

$$\large P(a | b) = \frac{P(a\cap b)}{P(b)}$$

The `|` here should be read as "given". We are **given** some information, `b`, and thus it reduces our sample space!
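
To see that the formula and the "reduced sample space" reading agree, here is a minimal sketch with a single die roll. The events `a` (an even roll) and `b` (a roll greater than 3) are made-up examples, and all six faces are assumed equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
a = {2, 4, 6}   # event a: the roll is even
b = {4, 5, 6}   # event b: the roll is greater than 3


def prob(event):
    """Probability of an event within the full sample space."""
    return Fraction(len(event), len(sample_space))


# Definition: P(a | b) = P(a and b) / P(b)
p_a_given_b_formula = prob(a & b) / prob(b)

# Same idea, counted inside the reduced sample space b
p_a_given_b_reduced = Fraction(len(a & b), len(b))

print(p_a_given_b_formula)   # 2/3
print(p_a_given_b_reduced)   # 2/3
```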

#### 3) What is the probability of picking an edible mushroom GIVEN it is bruised? 

$P(edible | bruised)$

In [36]:
# Can calculate this with our earlier probabilities
p_edible_given_bruised = p_bruised_and_edible / p_bruised
p_edible_given_bruised

0.8151658767772512

In [None]:
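# Can also use loc! Keep only the bruised rows, then find the share that are edible
df.loc[df['bruised'] == True, 'edible-poisonous'].value_counts(normalize=True)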


#### 4) What is the probability of picking a bruised mushroom GIVEN it is edible? 

$P(bruised | edible)$

* Note that even though the probability that a mushroom is edible and bruised is the same as the probability that it is bruised and edible, the two conditional probabilities are **not the same**, because the condition that must be met is different (i.e. the sample space is different).

In [38]:
# Let's calculate
p_bruised_given_edible = p_bruised_and_edible / p_edible
p_bruised_given_edible

0.6539923954372624

In [39]:
# Let's use loc
edible = df.loc[df['edible-poisonous'] == 'edible']

In [40]:
edible['bruised'].value_counts(normalize=True)

True     0.653992
False    0.346008
Name: bruised, dtype: float64

### Intuition behind conditional probability: 

How do you compute the probability that mushrooms are edible given they are bruised? 

When you ask the question "what is the probability that the mushrooms are edible and bruised?", the sample space originally contains all 8124 rows of mushrooms. 

<img src="images/Image_72_Cond4.png" width="300">

However, to compute the probability that the mushrooms are edible given they are bruised, you need to consider the reduced size of the sample space. 

In the image above, S is the universe of all mushrooms in the dataset, A is the set of mushrooms that are edible, and B is the set of mushrooms that are bruised.

* When you ask the question "what is the probability that the mushrooms are edible given the mushrooms are bruised?", you have effectively reduced the size of the sample space to include only those mushrooms that are bruised. 

* Given that mushrooms are bruised, the only way for the mushrooms to be edible is for them to fall in the intersection of the set of mushrooms that are edible _and_ the set of mushrooms that are bruised, $edible \cap bruised$.

* To account for the smaller sample space, you divide the probability mushrooms are edible and bruised by the probability the mushrooms are bruised: $$\large P(edible|bruised) = \frac{P(edible \cap bruised)}{P(bruised)}$$
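
The same reasoning can be sketched with raw row counts instead of probabilities. The total and the edible count come from the `describe()` output earlier, and the bruised count is the total minus the 4748 unbruised rows. The count of edible-and-bruised rows (2752) is an assumption inferred from the printed proportion 0.3387 times 8124, not a value shown directly in the notebook.

```python
from fractions import Fraction

n_total = 8124                # rows in the dataset
n_edible = 4208               # edible rows, from describe()
n_bruised = n_total - 4748    # 4748 rows are not bruised, leaving 3376
n_edible_and_bruised = 2752   # inferred from 0.33874938... * 8124 (assumption)

# Same numerator, different reduced sample space in the denominator
p_edible_given_bruised = Fraction(n_edible_and_bruised, n_bruised)
p_bruised_given_edible = Fraction(n_edible_and_bruised, n_edible)

print(float(p_edible_given_bruised))   # about 0.815
print(float(p_bruised_given_edible))   # about 0.654
```

Dividing counts gives the same result as dividing probabilities, because the shared factor of 8124 cancels.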




## Partitioning Complex Events

You're not really a mushroom expert, but you can see that the mushroom in your hand has some orange on the stalk. Given the data at your disposal, what's the probability that the mushroom is edible?



$$\large P(\text{edible|orange stalk}) = \frac{P(\text{edible} \cap \text{orange stalk})}{P(\text{orange stalk})}$$

Furthermore, we can decompose $P(\text{orange stalk})$ into the two possibilities:

$P(\text{orange stalk}) = P(\text{orange_stalk_below_ring}\cup\text{orange_stalk_above_ring})$

But be careful here! 

$P(\text{orange_stalk_below_ring}\cup \text{orange_stalk_above_ring})$ **DOES NOT EQUAL** $P(\text{orange_stalk_below_ring}) + P(\text{orange_stalk_above_ring})$

While this may seem correct, adding these individual probabilities **double counts** mushrooms that have entirely orange stalks. In pandas, however, filtering with `loc` and an or condition (`|`) selects each matching row only once, so you don't need to worry about it here!
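
A toy illustration of the double counting, using plain Python sets of row numbers. The five rows below are invented for this sketch and are not taken from the mushroom dataset.

```python
from fractions import Fraction

# Hypothetical toy rows: stalk color above and below the ring
rows = [
    {'above': 'orange', 'below': 'orange'},   # entirely orange stalk
    {'above': 'orange', 'below': 'white'},
    {'above': 'white', 'below': 'orange'},
    {'above': 'white', 'below': 'white'},
    {'above': 'white', 'below': 'white'},
]
n = len(rows)

# Row numbers that satisfy each condition
orange_above = {i for i, row in enumerate(rows) if row['above'] == 'orange'}
orange_below = {i for i, row in enumerate(rows) if row['below'] == 'orange'}

p_above = Fraction(len(orange_above), n)                 # 2/5
p_below = Fraction(len(orange_below), n)                 # 2/5
p_both = Fraction(len(orange_above & orange_below), n)   # 1/5
p_either = Fraction(len(orange_above | orange_below), n) # 3/5

print(p_above + p_below)                        # 4/5, row 0 is counted twice
print(p_either)                                 # 3/5
print(p_either == p_above + p_below - p_both)   # True
```

The set union plays the same role as the or condition on the DataFrame: a row that matches both conditions appears only once.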

In [41]:
df

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
0,poisonous,close,enlarging,white,white,black,True
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
3,poisonous,close,enlarging,white,white,brown,True
4,edible,crowded,tapering,white,white,black,False
...,...,...,...,...,...,...,...
8119,edible,close,enlarging,orange,orange,yellow,False
8120,edible,close,enlarging,orange,orange,yellow,False
8121,edible,close,enlarging,orange,orange,brown,False
8122,poisonous,close,tapering,white,white,buff,False


In [46]:
df.shape[0] == len(df)

True

In [43]:
# Can prove that P(orange above) + P(orange below) != p(orange stalk)
p_orange_above = df[df['stalk-color-above-ring'] == 'orange'].shape[0]/df.shape[0]
p_orange_below = df[df['stalk-color-below-ring'] == 'orange'].shape[0]/df.shape[0]

In [42]:
p_orange_stalk = df[(df['stalk-color-above-ring'] == 'orange')
                    | (df['stalk-color-below-ring'] == 'orange')
                    ].shape[0]/df.shape[0]

In [47]:
p_orange_stalk == (p_orange_above + p_orange_below)

False

Now you try!

In [48]:
# Find P(edible | orange stalk)
p_edible_and_orange = df[((df['stalk-color-above-ring'] == 'orange')
                          | (df['stalk-color-below-ring'] == 'orange')
                          )
                         & (df['edible-poisonous'] == 'edible')
                         ].shape[0]/df.shape[0]

p_edible_given_orange = p_edible_and_orange / p_orange_stalk

p_edible_given_orange

1.0

*Disclaimer: none of this should be taken as definitive foraging advice!*

## Extra Credit: Additional Conditional Probability Practice

What's the probability that a mushroom is poisonous if it has close gill spacing AND a tapering stalk?

$$\large P(poisonous|close \cap tapering) = \frac{P(poisonous \cap close \cap tapering)}{P(close \cap tapering)}$$

In [None]:
# P that mushroom is poisonous given close gill spacing and tapering stalk
