In [None]:
import numpy as np
import pandas as pd
from scipy import stats

from plotnine import *

# Together

## Distributions

A distribution is a curve (although sometimes it's pretty straight) that shows how common or uncommon different values are. For example, this is a normal distribution with mean = 0 and standard deviation = 1. Which values are relatively common under this distribution? Uncommon?

<img src="https://drive.google.com/uc?export=view&id=1G0l7s0oheMJRt0I0j9qm_rNbRpogVobo" width = 600px />
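As a rough numeric companion to the curve, the height of a normal distribution can be evaluated at a few points with `scipy.stats.norm.pdf`. This is only a sketch: the five `values` below are arbitrary picks for illustration and are not taken from the figure.

```python
import numpy as np
from scipy import stats

# Standard normal distribution: mean = 0, standard deviation = 1
values = np.array([-3, -1, 0, 1, 3])  # arbitrary points chosen for illustration
density = stats.norm.pdf(values, loc=0, scale=1)

# Higher density means values near that point are more common
for v, d in zip(values, density):
    print(f"x = {v:>2}: density = {d:.4f}")
```

The density is highest at 0 and falls off symmetrically, so values near the mean are relatively common and values far from it are relatively uncommon.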


What about this one?
<img src="https://drive.google.com/uc?export=view&id=1FZ0YZCHBIZZ-4RYo7pUssuSz-fwGc2Jn" width = 600px />


## Function Optimization

In our lecture we talked about derivatives, and how we often want to *minimize* functions when doing Data Science + Machine Learning. While I won't dive into ALL the calculus now (if this kind of thing excites you, you should take CPSC 393!), I want to talk about some of the ideas behind minimizing functions.

<img src="https://drive.google.com/uc?export=view&id=1GP0gQm1spsSU5qZWZJseENIz6q7XUfvK" width = 600px />
or

<img src="https://drive.google.com/uc?export=view&id=1UT7GKUDqjNU1yHcQHK2bLmgHsxWBUNJv" width = 600px />

or

<img src="https://drive.google.com/uc?export=view&id=1Fn8W3IMhYZDk-Ut7VtWgA_gyQw3uSJCq" width = 600px />
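One common way to use derivatives for minimization is gradient descent: start somewhere, then repeatedly take a small step in the direction that makes the function smaller. The sketch below is an illustration only. The function `f`, its derivative `f_prime`, the starting point, and the `learning_rate` are all made up for this example and are not the functions shown in the figures.

```python
def f(x):
    # Hypothetical function to minimize; its lowest point is at x = 3
    return (x - 3) ** 2

def f_prime(x):
    # Derivative of f: positive when x > 3, negative when x < 3
    return 2 * (x - 3)

x = 0.0              # arbitrary starting guess
learning_rate = 0.1  # how big a step to take each time (assumed value)

for step in range(100):
    # Move against the slope: downhill toward the minimum
    x = x - learning_rate * f_prime(x)

print(x, f(x))
```

After enough steps, `x` ends up very close to 3, where the derivative is 0 and the function is at its minimum.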

# In Your Groups

## Logarithms

Use your new pandas skills to add a column, `logX`, to the dataframe `DF` that contains the log ([`np.log()`](https://numpy.org/doc/stable/reference/generated/numpy.log.html)) of the column `X`.

Then run the pre-written code (no need to change anything) to plot the log function.

What range of values can the log() function take in? What range of values can the log function spit out? What happens to values < 1 when you log() them? What about values > 1?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

In [None]:
DF = pd.DataFrame({"X": np.linspace(0.0001,10, 10000)})

DF.head()

### YOUR CODE HERE ###




### /YOUR CODE HERE ###

In [None]:
# DON'T CHANGE, JUST RUN

(ggplot(DF, aes(x = "X", y = "logX")) +
 geom_line(color = "darkblue", size = 3) +
 theme_bw() +
 theme(panel_border = element_blank(),
       panel_grid_minor = element_blank()))

## Data Types

In your lecture, you learned about the different types of data/variables we could have. Go to our course GitHub and click on the *Data* folder. Get the raw URL for the **Beyonce_data.csv** dataset and load it in using `pd.read_csv()`. Store your dataframe in the variable `bey`, and print the head of the dataframe.
<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

In [None]:
bey = ###

What types are all the variables?

- Categorical:
    - nominal:
    - ordinal:
    - interval:
- Numeric:
- Boolean:
- Text:

If a variable is Categorical, how do you decide if it's nominal, ordinal, or interval? Give an example of each.
<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

## Probabilities and Conditional Probabilities

Remember that in general, probabilities are

$\frac{\text{# of events of interest}}{\text{total # of events}}$

Given this information, and the dataframe `voters`, calculate the probability of:

- being a registered voter
- being a vegetarian AND a registered voter

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

In [33]:
registered = ['no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'no',
       'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'yes']

diet = ['Meat_Eater', 'Vegetarian', 'Vegetarian', 'Vegan', 'Vegetarian',
       'Vegetarian', 'Vegetarian', 'Vegetarian', 'Meat_Eater', 'Vegan',
       'Vegan', 'Vegetarian', 'Meat_Eater', 'Meat_Eater', 'Vegan',
       'Vegan', 'Vegetarian', 'Meat_Eater', 'Meat_Eater', 'Vegetarian']

voters = pd.DataFrame({"RegisteredToVote": registered, "Diet": diet})

voters

| | RegisteredToVote | Diet |
|---|---|---|
| 0 | no | Meat_Eater |
| 1 | yes | Vegetarian |
| 2 | yes | Vegetarian |
| 3 | no | Vegan |
| 4 | no | Vegetarian |
| 5 | yes | Vegetarian |
| 6 | yes | Vegetarian |
| 7 | yes | Vegetarian |
| 8 | no | Meat_Eater |
| 9 | no | Vegan |


**Conditional probabilities** are just probabilities where the total events are *restricted* by some kind of information.

For example: $P(\textbf{Vegetarian | registered to vote})$ (in words we'd say this as "the Probability of being Vegetarian *given* that you are registered to vote") means that we want to know the probability of being Vegetarian when ONLY looking at registered voters. This means that the denominator of our probability will only count registered voters.

There are 9 registered voters in our data frame, and out of those 9, 6 are Vegetarian. So $P(\textbf{Vegetarian | registered to vote})$  = $\frac{6}{9}$.
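The 6-out-of-9 count above can be checked in pandas by first filtering to registered voters and then counting Vegetarians within that subset. This is one possible way to do it, not the only one. The helper names `registered_voters`, `n_registered`, and `n_veg_given_registered` are made up for this sketch, and the data is copied from the `voters` cell above so the block runs on its own.

```python
import pandas as pd

registered = ['no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'no',
              'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'yes']
diet = ['Meat_Eater', 'Vegetarian', 'Vegetarian', 'Vegan', 'Vegetarian',
        'Vegetarian', 'Vegetarian', 'Vegetarian', 'Meat_Eater', 'Vegan',
        'Vegan', 'Vegetarian', 'Meat_Eater', 'Meat_Eater', 'Vegan',
        'Vegan', 'Vegetarian', 'Meat_Eater', 'Meat_Eater', 'Vegetarian']
voters = pd.DataFrame({"RegisteredToVote": registered, "Diet": diet})

# Restrict the "total events" to registered voters only (the denominator)
registered_voters = voters[voters["RegisteredToVote"] == "yes"]
n_registered = len(registered_voters)

# Within that restricted group, count the Vegetarians (the numerator)
n_veg_given_registered = (registered_voters["Diet"] == "Vegetarian").sum()

p_veg_given_registered = n_veg_given_registered / n_registered
print(n_veg_given_registered, n_registered, p_veg_given_registered)
```

The key step is the filter: everything after it only ever sees the rows that satisfy the "given" condition.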

Using the data frame `booksRead` below, which records the responses from 25 people about which books they had read in the past 5 years, calculate (using code or by hand) the following probabilities:
- P(read Tale of Two Cities)
- P(read the Bible)
- P(read What to Expect When You're Expecting **|** read Tale of Two Cities)
- P(read What to Expect When You're Expecting **|** read Tale of Two Cities **AND** the Bible)
- P(read How to Win Friends and Influence People **|** did not read LOTR)
- P(read LOTR **AND** Tale of Two Cities)

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

In [40]:
taleOfTwoCities = ['yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no',
       'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes']

bible = ['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes']

howToWinFriendsAndInfluencePeople = ['yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no',
       'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no',
       'no', 'no', 'no', 'yes']

whatToExpectWhenYoureExpecting = ['yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes',
       'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no',
       'no', 'yes', 'no', 'yes']

LOTR = ['yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no',
       'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no',
       'yes', 'yes', 'no', 'yes', 'no']

booksRead = pd.DataFrame({"taleOfTwoCities": taleOfTwoCities,
                         "bible": bible,
                         "howToWinFriendsAndInfluencePeople": howToWinFriendsAndInfluencePeople,
                         "whatToExpectWhenYoureExpecting": whatToExpectWhenYoureExpecting,
                         "LOTR": LOTR})

booksRead

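| | taleOfTwoCities | bible | howToWinFriendsAndInfluencePeople | whatToExpectWhenYoureExpecting | LOTR |
|---|---|---|---|---|---|
| 0 | yes | yes | yes | yes | yes |
| 1 | yes | yes | no | yes | yes |
| 2 | yes | yes | no | yes | no |
| 3 | no | yes | no | no | no |
| 4 | yes | yes | no | yes | no |
| 5 | yes | yes | no | yes | no |
| 6 | no | yes | yes | yes | no |
| 7 | yes | yes | yes | no | yes |
| 8 | yes | yes | no | no | no |
| 9 | no | yes | no | yes | no |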


## Odds

Odds are the probability of something happening, divided by the probability of it not happening. What are the **Odds** of the following events:

- The odds of Bob scoring a goal during a soccer game if P(Bob scoring a goal during a soccer game) = 0.2
- The odds of flipping a heads on a fair coin if P(head) = 0.5
- The odds of your professor showing up in a Dinosaur costume today if P(professor showing up in a Dinosaur costume) = 0.7
- The odds of NOT winning the lottery if P(winning the lottery) = 0.0000001

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

If my odds of ordering pizza tonight are 3, what is the probability that I order pizza? If I increase my odds by 10x and my odds are now 30, what is the probability that I order pizza?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />
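The odds definition in this section, odds = p / (1 - p), can be rearranged to go the other way: p = odds / (1 + odds). A minimal sketch of both directions is below. The function names `prob_to_odds` and `odds_to_prob` and the example probability 0.25 are made up for illustration and are not one of the exercise values.

```python
def prob_to_odds(p):
    """Odds = P(event) / P(not event). Assumes 0 <= p < 1."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Rearranged definition: p = odds / (1 + odds). Assumes odds >= 0."""
    return odds / (1 + odds)

p = 0.25                      # hypothetical probability
odds = prob_to_odds(p)        # 0.25 / 0.75, about one third
p_back = odds_to_prob(odds)   # converts back to (approximately) 0.25

print(odds, p_back)
```

Note that odds can be any non-negative number, while probabilities always stay between 0 and 1.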