# Probability: Introduction to Foundational Concepts

Aka:

 - Set Notation
 - Permutations vs. Combinations
 - What is Probability? 
 - Plus: Dependent vs. Independent Probability, aka when to use Conditional Probabilities


## Set Notation

Sets are collections of elements. The language and notation used to talk about sets show up in a lot of other places, so it's worth laying the groundwork here.

<img src="../images/setnotation.jpg" alt="set notation, found from https://slideplayer.com/slide/10502152/" width=800>

[Image Source](https://slideplayer.com/slide/10502152/)

More on Sets: https://www.mathsisfun.com/sets/sets-introduction.html

<img src="../images/new_venn_diagram.png" alt="venn diagram with set notation" width=500>

### But Now, in Python:

 

| Method | Equivalent | Result |
| ------ | ------ | ------ |
| s.issubset(t) | s <= t | test whether every element in s is in t |
| s.issuperset(t) | s >= t | test whether every element in t is in s |
| s.union(t) | s $\mid$ t | new set with elements from both s and t |
| s.intersection(t) | s & t | new set with elements common to s and t |
| s.difference(t) | s - t | new set with elements in s but not in t |
| s.symmetric_difference(t) | s ^ t | new set with elements in either s or t but not both |
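
As a quick check of the table, here is a minimal sketch that runs each operation on two small sets. The values in `s` and `t` are made up purely for illustration.

```python
# Two small example sets (hypothetical values, chosen only for illustration)
s = {1, 2, 3}
t = {2, 3, 4, 5}

subset_check = s.issubset(t)            # same as s <= t
superset_check = s.issuperset({1, 2})   # same as s >= {1, 2}
union = s | t                           # same as s.union(t)
intersection = s & t                    # same as s.intersection(t)
difference = s - t                      # same as s.difference(t)
sym_difference = s ^ t                  # same as s.symmetric_difference(t)

print(subset_check, superset_check)
print(union, intersection, difference, sym_difference)
```

The operator forms require both operands to be sets, while the method forms accept any iterable (for example `s.union([4, 5])`).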

## Permutations vs. Combinations

Now let's talk about how you can take elements from different collections and group them.

### What's the difference between a permutation and a combination?

In a **permutation**, order matters. If you have a race, it matters who arrives in first, or second, or third place - there's a difference in the ordering of the group!

In a **combination**, you only care about which items are members of the set. For example, if you're creating groups of students to work on a project, the order in which you add their names to the group doesn't really matter - it's the group members themselves, not any order, that you care about.
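
To make the distinction concrete, the sketch below counts both kinds of grouping for the same four people using `itertools`. The runner names are hypothetical.

```python
from itertools import combinations, permutations

runners = ["Ana", "Ben", "Cy", "Dee"]  # hypothetical names

# Ordered top-three finishes: 4 * 3 * 2 possibilities
podiums = list(permutations(runners, 3))

# Unordered groups of three: each group appears once, whatever the order
teams = list(combinations(runners, 3))

print(len(podiums), len(teams))
```

Every group of three can be ordered in $3! = 6$ ways, which is why there are six times as many podiums as teams.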

#### How many possible codes are there for a standard padlock?

> Hint: there are 40 numbers on a padlock, and a code uses 3 of them (numbers can repeat).

In [1]:
# A:
40**3

64000

For the first number there are 40 choices. For the second number there are still 40 choices: $40\cdot40=40^2$. For the third number there are again 40 choices: $40\cdot40\cdot40=40^3$

This is an example of... 

 - **Permutation** or Combination ?
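
Because a padlock code is ordered and numbers may repeat, it can be enumerated as a Cartesian product. The sketch below assumes the dial is numbered 0 through 39, and confirms that the count matches $40^3$.

```python
from itertools import product

dial = range(40)                  # assumed numbering: 0 through 39
codes = product(dial, repeat=3)   # ordered triples, repeats allowed

# Count the codes without storing all of them in memory
n_codes = sum(1 for _ in codes)

print(n_codes == 40**3)
```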

#### How many unique 3 topping pizzas can you make from the following ingredients:

- Mushrooms
- Pepperoni
- Onion
- Peppers
- Ham
- Pineapple
- Sausage
- Olives
    
> Side note: which is the worst?

In [2]:
import itertools

In [3]:
toppings = ["Mushrooms","Pepperoni","Onion","Peppers","Ham","Pineapple","Sausage","Olives"]
three_topping_pizzas = list(itertools.combinations(toppings, 3))
len(three_topping_pizzas)

56

This is an example of... 

 - Permutation or **Combination**?

## What is Probability?

> **Probability is the likelihood of a specific outcome occurring out of all possible outcomes, expressed as a number between 0 and 1.**

Perhaps more importantly:

> **"Probabilities do not tell us what will happen for sure; they tell us what is _likely to happen_ and what is _less likely to happen_."**
>
> -- _Naked Statistics_, by Charles Wheelan, p. 72

When every outcome is equally likely, you can think of dividing the number of outcomes in the event you're exploring by the number of all possible outcomes:

$$ P(Event) = \frac{|Event|}{|Sample\ Space|} $$
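
As a small illustration of the formula, the sketch below computes the probability of rolling an even number on a fair six-sided die. The die is an assumed example and is not part of the lesson's data.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (assumed example)
sample_space = {1, 2, 3, 4, 5, 6}

# Event: the roll is even
event = {x for x in sample_space if x % 2 == 0}

# |Event| / |Sample Space|, kept as an exact fraction
p_event = Fraction(len(event), len(sample_space))

print(p_event, float(p_event))
```

Counting outcomes this way only works when each outcome in the sample space is equally likely.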

## Planning Party Playlists with Probabilities

- We are constructing a dinner party playlist for a gathering we are planning. 
- We asked our attendees to each provide a handful of songs they would like to be played at the dinner party.

In [4]:
import pandas as pd
import numpy as np

In [5]:
# This code is a bit different from anything we've seen before...
# Point is, it grabs each of the CSVs in a folder and reads them in 
import os, glob

datafolder = "probability_playlists/"
rec_files = glob.glob(datafolder+"*.csv")

playlists = {}
for file in rec_files:
    key = os.path.basename(file).replace('_recs.csv','')
    playlists[key] = pd.read_csv(file)
playlists.keys()

dict_keys(['joe', 'james', 'anne', 'john', 'samantha'])

In [6]:
for name, playlist in playlists.items():
    print(f"{name.title()}'s Requests:")
    display(playlist)

Joe's Requests:


Unnamed: 0,artist,track,Recommended By
0,Green Day,Time of your Life,Joe
1,B-52s,Rock Lobster,Joe
2,Lady GaGa,Poker Face,Joe
3,John Lennon,Imagine,Joe


James's Requests:


Unnamed: 0,artist,track,Recommended By
0,Eve 6,Here's to the Night,James
1,Neutral Milk Hotel,Into the Aeroplane Over the Sea,James
2,Rilo Kiley,With Arms Outstretched,James
3,Red Hot Chili Peppers,Otherside,James


Anne's Requests:


Unnamed: 0,artist,track,Recommended By
0,Smashing Pumpkins,"Tonight, Tonight",Anne
1,Black Eyed Peas,Let's Get it Started,Anne
2,Green Day,Time of your Life,Anne


John's Requests:


Unnamed: 0,artist,track,Recommended By
0,Black Eyed Peas,Let's Get it Started,John
1,Lady GaGa,Poker Face,John
2,Lady GaGa,Bad Romance,John
3,Lady GaGa,Just Dance,John


Samantha's Requests:


Unnamed: 0,artist,track,Recommended By
0,Black Eyed Peas,Let's Get it Started,Samantha
1,Panic at the Disco,Hallelujah,Samantha
2,Adele,Set Fire to the Rain,Samantha


For now, let's assume we take everyone's recommendations and add them all to our playlist, even if the same song has been recommended by someone else.

In [7]:
# Create 1 df for all recs
df = pd.concat(playlists).reset_index(drop=True)
df

Unnamed: 0,artist,track,Recommended By
0,Green Day,Time of your Life,Joe
1,B-52s,Rock Lobster,Joe
2,Lady GaGa,Poker Face,Joe
3,John Lennon,Imagine,Joe
4,Eve 6,Here's to the Night,James
5,Neutral Milk Hotel,Into the Aeroplane Over the Sea,James
6,Rilo Kiley,With Arms Outstretched,James
7,Red Hot Chili Peppers,Otherside,James
8,Smashing Pumpkins,"Tonight, Tonight",Anne
9,Black Eyed Peas,Let's Get it Started,Anne


### Q1: What is the probability of the next song being by Lady Gaga?

Assume we just accept everyone's suggestions allowing duplicate songs and play on shuffle.

Remember: 


$$ P(E) = \frac{|E|}{|S|} $$

In [8]:
# Our sample space - number of songs, grouped by artist
sample_space = df['artist'].value_counts()
sample_space

Lady GaGa                4
Black Eyed Peas          3
Green Day                2
Rilo Kiley               1
Smashing Pumpkins        1
Panic at the Disco       1
John Lennon              1
B-52s                    1
Neutral Milk Hotel       1
Red Hot Chili Peppers    1
Eve 6                    1
Adele                    1
Name: artist, dtype: int64

In [9]:
# What is the event space?
E = sample_space['Lady GaGa']
E

4

In [10]:
# What about the sample space? Numerically, I mean
S = sample_space.sum()
S

18

In [11]:
# Find the probability of lady gaga playing
P_lady_gaga = E/S
P_lady_gaga

0.2222222222222222

### Q2: What is the probability of the next song being "Time of Your Life"?

In [12]:
# Sample space by tracks this time
sample_space = df['track'].value_counts()
display(sample_space)

Let's Get it Started               3
Time of your Life                  2
Poker Face                         2
Imagine                            1
Bad Romance                        1
Otherside                          1
Into the Aeroplane Over the Sea    1
Hallelujah                         1
With Arms Outstretched             1
Set Fire to the Rain               1
Just Dance                         1
Tonight, Tonight                   1
Rock Lobster                       1
Here's to the Night                1
Name: track, dtype: int64

In [13]:
# Event space?
E = sample_space.loc['Time of your Life']
E

2

In [14]:
# Sample space?
S = sample_space.sum()
S

18

In [15]:
# Probability of 'Time of Your Life'
P_time_of_your_life = E/S
P_time_of_your_life

0.1111111111111111

### Q3: What is the probability of hearing a song by Lady GaGa or Green Day?


In [16]:
# Again, by artist
sample_space = df['artist'].value_counts()
sample_space

Lady GaGa                4
Black Eyed Peas          3
Green Day                2
Rilo Kiley               1
Smashing Pumpkins        1
Panic at the Disco       1
John Lennon              1
B-52s                    1
Neutral Milk Hotel       1
Red Hot Chili Peppers    1
Eve 6                    1
Adele                    1
Name: artist, dtype: int64

In [17]:
# Event Space
E = sample_space.loc['Lady GaGa'] + sample_space.loc['Green Day']
E

6

In [18]:
# Sample space
S = sample_space.sum()
S

18

In [19]:
# Find the probability
P_lady_gaga_or_greenday = E/S
P_lady_gaga_or_greenday

0.3333333333333333
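
Each song here is credited to exactly one artist, so "Lady GaGa" and "Green Day" are mutually exclusive events and their probabilities can simply be added. The sketch below checks this, with the artist counts copied by hand from the `value_counts()` output above and exact fractions used in place of floats.

```python
from fractions import Fraction

# Counts copied from the value_counts() output above
artist_counts = {
    "Lady GaGa": 4, "Black Eyed Peas": 3, "Green Day": 2,
    "Rilo Kiley": 1, "Smashing Pumpkins": 1, "Panic at the Disco": 1,
    "John Lennon": 1, "B-52s": 1, "Neutral Milk Hotel": 1,
    "Red Hot Chili Peppers": 1, "Eve 6": 1, "Adele": 1,
}

total = sum(artist_counts.values())

p_gaga = Fraction(artist_counts["Lady GaGa"], total)
p_green_day = Fraction(artist_counts["Green Day"], total)

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
p_either = p_gaga + p_green_day

print(p_either, float(p_either))
```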

Let's bring it back to our different permutation/combination notation:

### Q4: How many different ways could we build a playlist using everyone's recommendations (without shuffle, no looping, and no repeated songs)?

- Combination or **permutation**?
    - Order of a playlist matters!

- What formula would we use?

$$\large P(n) = n!$$

In [20]:
# First, let's deal with those duplicates we've been ignoring
df = df.drop_duplicates(subset='track')

In [21]:
len(df)

14

In [34]:
from math import factorial
ans = factorial(len(df))
print(f"{ans:,d}")

87,178,291,200


### Q5: What if we limit the playlist to only 10 songs, without replacement? How many possible playlists?

- What formula would we use?

$$ \large P_{k}^{n}= \dfrac{n!}{(n-k)!}$$ 

In [36]:
n = len(df)
k = 10
n,k

(14, 10)

In [37]:
print(f"{factorial(n)/factorial(n-k):,.0f}")

3,632,428,800
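
Dividing two factorials with `/` produces a float, which loses precision for large values. As an alternative sketch, `math.perm` (available in Python 3.8 and later) computes the same quantity using integers only.

```python
import math

n, k = 14, 10

# Number of ordered selections of k items from n: n! / (n - k)!
n_playlists = math.perm(n, k)

# With k omitted, math.perm(n) is n!, the number of full orderings
n_full_orderings = math.perm(n)

print(f"{n_playlists:,}")
print(f"{n_full_orderings:,}")
```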


### Q6: What if we limit the playlist to 10 songs, WITH replacement?

- What formula would we use?

$$ \large {P}_{j}^{n} = n^j $$

In [38]:
# n = 18: every suggestion counts as its own entry, duplicates included
print(f"{18**10:,}")

3,570,467,226,624


### Q7: What if we select 10 songs out of the total number of suggestions (18), allowing duplicate suggestions?

- What formula would we use?

$$\large C_{k}^{n} = \displaystyle\binom{n}{k} = \dfrac{P_{k}^{n}}{k!}=\dfrac{ \dfrac{n!}{(n-k)!}}{k!} = \dfrac{n!}{(n-k)!k!}$$

In [39]:
ans = factorial(18) / (factorial(18-10)*factorial(10))
print(f"{ans:,.0f}")

43,758


## So....  We realize we need to relax and not worry about the song-order. That's what Shuffle is for, right? 😅

### Q8: How many playlists can we produce for an 8-track playlist from the unique suggested songs (14)?

- What formula would we use?

    $$\large C_{k}^{n} = \displaystyle\binom{n}{k} = \dfrac{P_{k}^{n}}{k!}=\dfrac{ \dfrac{n!}{(n-k)!}}{k!} = \dfrac{n!}{(n-k)!k!}$$

In [41]:
n = 14
k = 8

ans = factorial(n) / (factorial(n-k)*factorial(k))
print(f"{ans:.0f}")

3003
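
The combination counts can also be computed with `math.comb` (Python 3.8 and later), which stays in integer arithmetic. This sketch reproduces the two counts used in this lesson and checks the relationship $C_{k}^{n} = P_{k}^{n} / k!$.

```python
import math

# Unordered 8-song playlists from the 14 unique songs
eight_track = math.comb(14, 8)

# Unordered 10-song selections from all 18 suggestions
ten_from_all = math.comb(18, 10)

# Each unordered selection of k songs corresponds to k! ordered ones
same_via_perm = math.perm(14, 8) // math.factorial(8)

print(eight_track, ten_from_all, same_via_perm)
```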


### Conditional Probability

#### When do we compute conditional probabilities? 

- We need to compute conditional probabilities when the outcome of an event depends on the outcome of previous events (dependent events). The conditional probability of an event is the probability of that event given that another event has occurred.


### Mushroom dataset

To discuss conditional probability, let's look at a modified version of the Mushroom dataset from UCI [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). Each row in this dataset corresponds to one observation (one mushroom). 

The modified dataset includes 7 variables:

* **edible-poisonous**
    * This categorical variable can have one of two values: if the mushroom is edible, "edible". If not, "poisonous"

* **bruised**
    * This is a Boolean variable that can assume either one of two values, True or False.

* **gill-spacing**
    * This categorical variable can have one of three values: "close", "crowded", or "distant"
    
* **stalk-shape**
    * This categorical variable can have one of two values: "enlarging" or "tapering"
* **stalk-color-above-ring**
    * This categorical variable can have one of nine values:  "brown","buff","cinnamon","gray","orange", "pink","red","white" or "yellow"

* **stalk-color-below-ring**
    * This categorical variable can have one of nine values:  "brown","buff","cinnamon","gray","orange", "pink","red","white" or "yellow"

* **gill-color**
    * This categorical variable can have one of twelve values: "black","brown","buff","chocolate","gray", "green","orange","pink","purple","red", "white" or "yellow" 



In [1]:
df = pd.read_csv('Mushrooms_cleaned.csv')
df.head()

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
0,poisonous,close,enlarging,white,white,black,True
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
3,poisonous,close,enlarging,white,white,brown,True
4,edible,crowded,tapering,white,white,black,False


#### If you picked a row from this dataset at random, what is the probability it corresponds to a bruised mushroom? $P(bruised)$

In [7]:
print(len(df.index))
print(len(df.loc[df['bruised'] == True].index))

8124
3376


In [9]:
# Instead of df.shape[0] could use len(df.index)
# And instead of the first part, len(df.loc[df['bruised'] == True].index)
p_bruised = df[df['bruised'] == True].shape[0]/df.shape[0]
p_bruised

0.4155588380108321

#### What is the probability you pick a row corresponding to a mushroom that is bruised _AND_ edible? $P(edible \cap bruised)$ 

In [10]:
p_bruised_and_edible = df[(df['bruised'] == True) & (df['edible-poisonous'] == 'edible')].shape[0]/df.shape[0]

In [11]:
p_bruised_and_edible

0.33874938453963566

#### What is the probability of picking an edible mushroom given it is bruised? $P(edible | bruised)$

In [12]:
p_edible_given_bruised = p_bruised_and_edible/p_bruised
p_edible_given_bruised

0.8151658767772512

In [7]:
bruised = df[df['bruised'] == True]
bruised['edible-poisonous'].value_counts(normalize=True)

edible       0.815166
poisonous    0.184834
Name: edible-poisonous, dtype: float64

#### What is the probability of picking a bruised mushroom given it is edible? $P(bruised | edible)$

* Note that $P(edible \cap bruised)$ and $P(bruised \cap edible)$ are the same quantity, but the two conditional probabilities are **not the same**: conditioning on a different event changes the sample space, so $P(bruised | edible)$ generally differs from $P(edible | bruised)$.

In [14]:
p_edible = df[df['edible-poisonous'] == 'edible'].shape[0]/df.shape[0]
p_edible

0.517971442639094

In [18]:
p_bruised_given_edible = p_bruised_and_edible/p_edible
p_bruised_given_edible

0.6539923954372624

In [6]:
edible = df[df['edible-poisonous'] == 'edible']
edible['bruised'].value_counts(normalize=True)

True     0.653992
False    0.346008
Name: bruised, dtype: float64

### Intuition behind conditional probability: 

How do you compute the probability that mushrooms are edible given they are bruised? 

When you ask the question "what is the probability that the mushrooms are edible and bruised?", the sample space originally contains all 8124 rows of mushrooms. 

<img src="images/Image_72_Cond4.png" width="300">

However, to compute the probability that the mushrooms are edible given they are bruised, you need to consider the reduced size of the sample space. 

In the image above, S is the universe of all mushrooms in the dataset, A is the set of mushrooms that are edible, and B is the set of mushrooms that are bruised.

* When you ask the question "what is the probability that the mushrooms are edible given the mushrooms are bruised?", you have effectively reduced the size of the sample space to include only those mushrooms that are bruised. 

* Given that mushrooms are bruised, the only way for the mushrooms to be edible is for these mushrooms to fall in the intersection of the set of mushrooms that are edible _and_ the set of mushrooms that are bruised, $edible \cap bruised$.

* To account for the smaller sample space, you divide the probability mushrooms are edible and bruised by the probability the mushrooms are bruised: $$\large P(edible|bruised) = \frac{P(edible \cap bruised)}{P(bruised)}$$
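
The same idea can be checked with raw row counts. In the sketch below, the total and bruised counts are the ones printed earlier. The edible count and the edible-and-bruised count are assumptions, back-computed from the probabilities shown above. Dividing the two probabilities gives the same answer as counting within the reduced sample space.

```python
from fractions import Fraction

n_total = 8124                # rows in the dataset (printed above)
n_bruised = 3376              # bruised rows (printed above)
n_edible = 4208               # assumption: back-computed from p_edible
n_edible_and_bruised = 2752   # assumption: back-computed from p_bruised_and_edible

p_bruised = Fraction(n_bruised, n_total)
p_edible = Fraction(n_edible, n_total)
p_both = Fraction(n_edible_and_bruised, n_total)

# Conditional probability as a ratio of probabilities
p_edible_given_bruised = p_both / p_bruised
p_bruised_given_edible = p_both / p_edible

# Equivalent view: count only within the reduced sample space
same_by_counting = Fraction(n_edible_and_bruised, n_bruised)

print(float(p_edible_given_bruised), float(p_bruised_given_edible))
```

The numerator is the same in both conditionals. Only the denominator changes, which is why the two results differ.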




## Partitioning Complex Events

You're not really a mushroom expert, but you can see a bunch of orange spots all over the mushroom in your hand. Given the data at your disposal, what's the probability that the mushroom is edible?



$$\large P(edible|orange) = \frac{P(edible \cap orange)}{P(orange)}$$

Furthermore, we can decompose $P(orange)$ into all of the possibilities:

$P(orange) = P(\text{orange_gill}\cup\text{orange_stalk_below_ring}\cup\text{orange_stalk_above_ring})$

But be careful here! 

$P(\text{orange_gill}\cup\text{orange_stalk_below_ring}\cup \text{orange_stalk_above_ring}) \neq P(\text{orange_gill}) + P(\text{orange_stalk_below_ring}) + P(\text{orange_stalk_above_ring})$

While the sum may seem correct, adding these individual probabilities double counts any mushroom that is orange in more than one place, for example one with both orange gills and an orange stalk, or one whose stalk is orange both above and below the ring.
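
The double counting is easy to see with plain Python sets. The sketch below uses ten invented mushroom ids (not taken from the dataset) and two overlapping events, then applies the inclusion-exclusion correction $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

```python
from fractions import Fraction

# Hypothetical mushroom ids, invented for illustration only
all_ids = set(range(10))
orange_gill = {0, 1, 2}
orange_stalk = {2, 3}     # id 2 has both an orange gill and an orange stalk

def prob(event):
    """Probability of picking an id in `event`, each id equally likely."""
    return Fraction(len(event), len(all_ids))

p_union = prob(orange_gill | orange_stalk)            # counts id 2 once
naive_sum = prob(orange_gill) + prob(orange_stalk)    # counts id 2 twice
corrected = naive_sum - prob(orange_gill & orange_stalk)

print(p_union, naive_sum, corrected)
```

Building a single boolean mask with `|` avoids the problem entirely, because each row is either selected or not.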

In [10]:
df.head()

Unnamed: 0,edible-poisonous,gill-spacing,stalk-shape,stalk-color-above-ring,stalk-color-below-ring,gill-color,bruised
0,poisonous,close,enlarging,white,white,black,True
1,edible,close,enlarging,white,white,black,True
2,edible,close,enlarging,white,white,brown,True
3,poisonous,close,enlarging,white,white,brown,True
4,edible,crowded,tapering,white,white,black,False


In [8]:
p_orange = df[(df['gill-color'] == 'orange')
              | (df['stalk-color-above-ring'] == 'orange')
              | (df['stalk-color-below-ring'] == 'orange')
              ].shape[0]/df.shape[0]
p_edible_and_orange = df[((df['gill-color'] == 'orange')
                          | (df['stalk-color-above-ring'] == 'orange')
                          | (df['stalk-color-below-ring'] == 'orange')
                          )
                         & (df['edible-poisonous'] == 'edible')
                         ].shape[0]/df.shape[0]
p_edible_given_orange = p_edible_and_orange / p_orange
p_edible_given_orange

# Apparently orange mushrooms seem fairly safe... (Disclaimer: don't take this as definitive foraging advice!)

1.0

## Summary


In this lesson, you reviewed 4 major foundational concepts for probability: permutations, combinations, conditional probability and partitioning complex events. Remember that your standard padlock would more accurately be called a permutation lock! Order matters for permutations, whereas only the members of the set are important for combinations. Conditional probability is the probability of an event occurring given other information. In these instances, the sample space shrinks to reflect the given information. In the mushroom example, the probability of a mushroom being edible given that it is bruised can be computed by dividing the probability that a mushroom is both edible AND bruised by the probability that it is bruised.

Mathematically: 

$$\large P(edible|bruised) = \frac{P(edible \cap bruised)}{P(bruised)}$$ 

Finally, you investigated partitioning complex events. Often, complex events can be broken into constituent parts, and the total probability can be calculated by combining these smaller events.
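
One common way to combine the pieces is the law of total probability: if groups $B_1, \dots, B_m$ are disjoint and cover the whole sample space, then $P(A) = \sum_i P(A|B_i)\,P(B_i)$. The sketch below uses entirely hypothetical group sizes and conditional probabilities to show the arithmetic.

```python
# Hypothetical partition of a population into three disjoint groups
p_group = {"a": 0.5, "b": 0.3, "c": 0.2}

# Hypothetical probability of the event within each group
p_event_given_group = {"a": 0.10, "b": 0.20, "c": 0.50}

# Law of total probability: weight each conditional by its group's share
p_event = sum(p_event_given_group[g] * p_group[g] for g in p_group)

print(round(p_event, 4))
```

The group probabilities must sum to 1 for the decomposition to be valid.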

# Additional Resources

## Additional Conditional Probability Practice

What's the probability that a mushroom is poisonous if it has close gill spacing and a tapering stalk?

$$\large P(poisonous|close \cap tapering) = \frac{P(poisonous \cap close \cap tapering)}{P(close \cap tapering)}$$

In [17]:
# P that mushroom is poisonous given close gill spacing and tapering stalk


0.46153846153846156

## Challenge Problem

Let's take some time and review questions like those from the [dsc-law-of-total-probability-lab](https://github.com/learn-co-curriculum/dsc-law-of-total-probability-lab).  

According to the CDC, [14% of Americans currently smoke, 15.8% of males and 12.2% of females](https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm). According to the American Lung Association, [men who smoke are 23 times more likely to develop lung cancer than never-smokers, and women are 13 times as likely](https://www.lung.org/lung-health-and-diseases/lung-disease-lookup/lung-cancer/resource-library/lung-cancer-fact-sheet.html). The American Cancer Society estimates that [the lifetime risk of developing lung cancer is 6.85% for males and 5.95% for females](https://www.cancer.org/cancer/cancer-basics/lifetime-probability-of-developing-or-dying-from-cancer.html). Currently, the census estimates that [women are 50.8% of the population](https://www.census.gov/quickfacts/fact/table/US/PST045218).

What is the risk of lung cancer for non-smokers? Non-smoker males? Non-smoker females?

> To learn more about lung-cancer risks for non-smokers, see https://www.cancer.org/latest-news/why-lung-cancer-strikes-nonsmokers.html.

In [None]:
# Risk of lung cancer for non-smokers


In [None]:
# Risk of lung cancer for non-smoking males


In [None]:
# Risk of lung cancer for non-smoking females
