# Probability Practice


## What is Probability?

> **Probability is the likelihood of a specific outcome occurring out of all possible outcomes, expressed as a fraction between 0 and 1.**

Perhaps more importantly:

> **"Probabilities do not tell us what will happen for sure; they tell us what is _likely to happen_ and what is _less likely to happen_."**
>
> -- _Naked Statistics_, by Charles Wheelan, p. 72

In general, when every outcome is equally likely, you can divide the number of outcomes that make up the event you're exploring by the number of all possible outcomes:

$$ P(Event) = \frac{|Event|}{|Sample\ Space|} $$
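As a minimal sketch of this formula, consider one roll of a fair six-sided die. The die is a hypothetical example rather than part of the playlist data, and the names `sample_space` and `event` are illustrative only.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Event of interest: rolling an even number
event = {roll for roll in sample_space if roll % 2 == 0}

# P(Event) = |Event| / |Sample Space|
p_even = Fraction(len(event), len(sample_space))
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```

Using `Fraction` keeps the result exact, which makes it easy to compare against the formula by hand.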

## Planning Party Playlists with Probabilities & Combinatorics

- We are constructing a dinner party playlist for a gathering we are planning. 
- We asked our attendees to each provide a handful of songs they would like to be played at the dinner party.

In [None]:
# Imports
import pandas as pd
import numpy as np
import math

In [None]:
# This code might be a bit different from anything you've seen before...
# Point is, it grabs each of the CSVs in a folder and reads them in 
import os, glob

datafolder = "probability_playlists/"
rec_files = glob.glob(datafolder+"*.csv")

playlists = {}
for file in rec_files:
    key = os.path.basename(file).replace('_recs.csv','')
    playlists[key] = pd.read_csv(file)
playlists.keys()

In [None]:
for name, playlist in playlists.items():
    print(f"{name.title()}'s Requests:")
    display(playlist)

For now, let's assume we take everyone's recommendations and add them all to our playlist, even if the same song has been recommended by someone else.

In [None]:
# Create 1 df for all recs
df = pd.concat(playlists).reset_index(drop=True)
df

### Q1: What is the probability of the next song being by Lady Gaga?

Assume we accept everyone's suggestions, duplicates included, and play the playlist on shuffle.

Remember: 


$$ P(E) = \frac{|E|}{|S|} $$
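To see how duplicates affect the count, here is a small sketch on a made-up playlist. The artists and songs are hypothetical placeholders, not the guests' real requests, and the sketch uses plain Python so it runs without the CSV files.

```python
from collections import Counter

# Hypothetical mini-playlist of (artist, track) pairs; the repeated
# entry stands in for a song that two guests both requested
toy_playlist = [
    ("Artist A", "Song 1"),
    ("Artist A", "Song 2"),
    ("Artist B", "Song 3"),
    ("Artist A", "Song 1"),  # duplicate request
    ("Artist C", "Song 4"),
]

artist_counts = Counter(artist for artist, _ in toy_playlist)

S = len(toy_playlist)          # sample space: every slot in the playlist
E = artist_counts["Artist A"]  # event: slots holding an Artist A song
p_artist_a = E / S
print(p_artist_a)  # 0.6
```

Because duplicates stay in the playlist, the repeated song counts twice in both the event and the sample space.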

In [None]:
# Set up: number of songs, grouped by artist
df['artist'].value_counts()

In [None]:
# What is the sample space?
S = None

In [None]:
# What about the event space?
E = None

In [None]:
# Find the probability of Lady Gaga playing
P_lady_gaga = None

### Q2: What is the probability of the next song being "Time of Your Life"?

In [None]:
# Set Up
df['track'].value_counts()

In [None]:
# Event space?
E = None

In [None]:
# Sample space?
S = None

In [None]:
# Probability of 'Time of Your Life'
P_time_of_your_life = None

### Q3: What is the probability of hearing a song by Lady Gaga or Green Day?


In [None]:
# Set up
df['artist'].value_counts()

In [None]:
# Event Space
E = None

In [None]:
# Sample space
S = None

In [None]:
# Find the probability
P_lady_gaga_or_greenday = None
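
An "or" question like this one can be checked two ways. The sketch below uses hypothetical artist labels and assumes each song is credited to exactly one artist, so the two events cannot overlap. If a song could belong to both events, the overlap would need to be subtracted once.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical artist labels, one per song on a shuffled playlist
toy_artists = ["Artist A", "Artist B", "Artist A",
               "Artist C", "Artist B", "Artist D"]
counts = Counter(toy_artists)
total = len(toy_artists)

# Direct count: songs by either artist
n_either = sum(1 for artist in toy_artists if artist in {"Artist A", "Artist B"})
p_direct = Fraction(n_either, total)

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
p_sum = Fraction(counts["Artist A"], total) + Fraction(counts["Artist B"], total)

print(p_direct, p_sum)  # 2/3 2/3
```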

### Q4: How many different ways could we build a playlist using everyone's recommendations (without shuffle, no looping, and no repeated songs)?

In [None]:
# First, let's deal with those duplicates we've been ignoring
df = df.drop_duplicates(subset=['artist', 'track'])

In [None]:
# Calculate how many possible playlists
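
As a sanity check on the counting rule, the sketch below uses a hypothetical four-song playlist, small enough that every ordering can be enumerated. It compares the factorial formula with a brute-force count.

```python
import math
from itertools import permutations

# Hypothetical tiny playlist so the orderings can be listed explicitly
toy_songs = ["Song 1", "Song 2", "Song 3", "Song 4"]

# Orderings of n distinct songs: n!
n_orderings = math.factorial(len(toy_songs))

# Brute-force check: count every ordering one by one
brute_force = sum(1 for _ in permutations(toy_songs))

print(n_orderings, brute_force)  # 24 24
```

Brute-force enumeration only works for tiny inputs; the factorial grows far too quickly to enumerate a full party playlist.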


### Q5: What if we limit the playlist to only 10 songs, without replacement? How many possible playlists?

In [None]:
# Calculate how many possible playlists
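
A sketch for choosing a shorter playlist without replacement, using hypothetical small numbers. It assumes order matters, since a playlist is an ordered sequence; the unordered count is shown only for contrast. `math.perm` and `math.comb` require Python 3.8 or later.

```python
import math

n_songs = 6  # hypothetical number of unique songs
k_slots = 3  # hypothetical playlist length

# Ordered selections without replacement: n! / (n - k)!
ordered = math.perm(n_songs, k_slots)

# Unordered selections without replacement: n! / (k! * (n - k)!)
unordered = math.comb(n_songs, k_slots)

print(ordered, unordered)  # 120 20
```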


### Q6: What if we select 10 songs out of the total number of suggestions and allow for repetition?

In [None]:
# Calculate how many possible playlists
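
When repetition is allowed, each slot can hold any of the songs independently of the others. The sketch below checks that rule on a hypothetical four-song pool with three slots, again assuming order matters.

```python
from itertools import product

# Hypothetical pool of songs and playlist length
toy_songs = ["Song 1", "Song 2", "Song 3", "Song 4"]
k_slots = 3

# Ordered selections with repetition: n ** k
with_repetition = len(toy_songs) ** k_slots

# Brute-force check: enumerate every possible sequence
brute_force = sum(1 for _ in product(toy_songs, repeat=k_slots))

print(with_repetition, brute_force)  # 64 64
```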


Hooray! Great job practicing probabilities and combinatorics!

---

## Conditional Probability with Mushrooms

#### When do we compute conditional probabilities? 

- We need to compute conditional probabilities when the outcome of an event depends on the outcome of previous events (dependent events). The conditional probability of an event is the probability of that event given that another event has occurred.
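
The standard definition is $P(A|B) = \frac{P(A \cap B)}{P(B)}$. The sketch below applies it to hypothetical weather and traffic observations (not the mushroom data) and shows that $P(A|B)$ and $P(B|A)$ are generally different numbers.

```python
from fractions import Fraction

# Hypothetical observations as (weather, traffic) pairs
observations = (
    [("rain", "heavy")] * 3
    + [("rain", "light")] * 1
    + [("sun", "heavy")] * 2
    + [("sun", "light")] * 4
)
total = len(observations)

n_rain = sum(1 for w, _ in observations if w == "rain")
n_heavy = sum(1 for _, t in observations if t == "heavy")
n_both = sum(1 for w, t in observations if w == "rain" and t == "heavy")

p_rain = Fraction(n_rain, total)
p_heavy = Fraction(n_heavy, total)
p_both = Fraction(n_both, total)

# P(A | B) = P(A and B) / P(B)
p_heavy_given_rain = p_both / p_rain
p_rain_given_heavy = p_both / p_heavy

print(p_heavy_given_rain, p_rain_given_heavy)  # 3/4 3/5
```

Conditioning shrinks the sample space to the rows where the given event occurred, which is why the two directions use different denominators.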


### Mushroom dataset

To discuss conditional probability, let's look at a modified version of the Mushroom dataset from UCI [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). Each row in this dataset corresponds to one observation (one mushroom). 

The modified dataset includes 7 variables:

* **edible-poisonous**
    * This categorical variable can have one of two values: "edible" if the mushroom is edible, or "poisonous" if it is not

* **bruised**
    * This is a Boolean variable that can assume either one of two values, True or False.

* **gill-spacing**
    * This categorical variable can have one of three values: "close", "crowded", or "distant"
    
* **stalk-shape**
    * This categorical variable can have one of two values: "enlarging" or "tapering"

* **stalk-color-above-ring**
    * This categorical variable can have one of nine values: "brown", "buff", "cinnamon", "gray", "orange", "pink", "red", "white" or "yellow"

* **stalk-color-below-ring**
    * This categorical variable can have one of nine values: "brown", "buff", "cinnamon", "gray", "orange", "pink", "red", "white" or "yellow"

* **gill-color**
    * This categorical variable can have one of twelve values: "black", "brown", "buff", "chocolate", "gray", "green", "orange", "pink", "purple", "red", "white" or "yellow"



In [None]:
df = pd.read_csv('../data/Mushrooms_cleaned.csv')
df.head()

### Q1: If you picked a row from this dataset at random, what is the probability it corresponds to a bruised mushroom? 

$P(bruised)$

In [None]:
# Calculate the probability of a bruised mushroom
p_bruised = None

### Q2: What is the probability you pick a row corresponding to a mushroom that is bruised _AND_ edible? 

$P(edible \cap bruised)$ 

In [None]:
# Calculate the probability of a bruised and edible mushroom
p_bruised_and_edible = None

### Q3: What is the probability of picking an edible mushroom given it is bruised? 

$P(edible | bruised)$

In [None]:
# Calculate the probability of an edible mushroom if you know it's bruised
p_edible_given_bruised = None

### Q4: What is the probability of picking a bruised mushroom given it is edible? 

$P(bruised | edible)$

In [None]:
# Calculate the probability of a bruised mushroom if you know it's edible
p_bruised_given_edible = None

### Q5: What is the probability that a mushroom is edible if you can see that part of it is orange?

$P(edible | orange)$

Note - explore the data! Lots of parts of a mushroom could be orange!
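
One way to handle a condition that can be met by several columns is to flag a row when any of them matches. The rows below are hypothetical, and the column names are taken from the variable list above; the names in the actual CSV may differ.

```python
from fractions import Fraction

# Hypothetical rows; the real dataset's values and column names may differ
toy_rows = [
    {"edible": True,  "stalk-color-above-ring": "orange",
     "stalk-color-below-ring": "white",  "gill-color": "brown"},
    {"edible": False, "stalk-color-above-ring": "white",
     "stalk-color-below-ring": "white",  "gill-color": "orange"},
    {"edible": True,  "stalk-color-above-ring": "white",
     "stalk-color-below-ring": "orange", "gill-color": "orange"},
    {"edible": False, "stalk-color-above-ring": "brown",
     "stalk-color-below-ring": "brown",  "gill-color": "white"},
    {"edible": True,  "stalk-color-above-ring": "white",
     "stalk-color-below-ring": "white",  "gill-color": "white"},
]
color_columns = ["stalk-color-above-ring", "stalk-color-below-ring", "gill-color"]

# A row counts as "orange" if ANY color column is orange
orange_rows = [
    row for row in toy_rows
    if any(row[col] == "orange" for col in color_columns)
]
edible_orange_rows = [row for row in orange_rows if row["edible"]]

p_edible_given_orange_toy = Fraction(len(edible_orange_rows), len(orange_rows))
print(p_edible_given_orange_toy)  # 2/3
```

The third row is orange in two columns but is still counted once, which is the reason to filter rows rather than add up per-column counts.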

In [None]:
# Explore the data and find which columns tell you about the mushroom's color


In [None]:
# Calculate the probability of an edible mushroom if you know part of it is orange
p_edible_given_orange = None

## Level Up

What's the probability that a mushroom is edible if it has close gill spacing and a tapering stalk?

$P(edible|close \cap tapering)$
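
Conditioning on two features at once means keeping only the rows where both hold, then counting within that subset. The sketch below uses hypothetical rows and also shows the complement, since with only two classes the poisonous probability is one minus the edible probability.

```python
from fractions import Fraction

# Hypothetical rows as (edible, gill_spacing, stalk_shape)
toy_rows = [
    (True,  "close",   "tapering"),
    (False, "close",   "tapering"),
    (False, "close",   "tapering"),
    (True,  "close",   "enlarging"),
    (True,  "crowded", "tapering"),
    (False, "distant", "enlarging"),
]

# Keep only rows that satisfy BOTH conditions
matching = [row for row in toy_rows if row[1] == "close" and row[2] == "tapering"]

p_edible_toy = Fraction(sum(1 for row in matching if row[0]), len(matching))
p_poisonous_toy = 1 - p_edible_toy

print(p_edible_toy, p_poisonous_toy)  # 1/3 2/3
```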

In [None]:
# P that mushroom is edible given close gill spacing and tapering stalk
p_edible_given_close_tapering = None