## Simulations and Hypothesis Testing Code Snippets

## <span style='background :yellow' > Simulations </span>

**If you flip 8 coins, what is the probability of getting exactly 3 heads? What is the probability of getting more than 3 heads?**


In [8]:
import numpy as np
import pandas as pd

np.random.seed(1349)

# set outcomes as 0 and 1 (0 = tails, 1 = heads)
outcomes = [1, 0]
flips = np.random.choice(outcomes, size=(100_000,8))
flips

# count the heads in each simulated trial of 8 flips
n_heads = flips.sum(axis=1)

# P(exactly 3 heads)
(n_heads == 3).mean()

# P(more than 3 heads)
(n_heads > 3).mean()


0.63618
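
As a cross-check on the simulation, the same two probabilities can be computed exactly from the binomial distribution. This is a sketch rather than part of the original notebook; it assumes a fair coin (p = 0.5) and uses `scipy.stats.binom`.

```python
from scipy import stats

# Exact check of the coin-flip simulation, assuming a fair coin (p = 0.5)
n_flips = 8
p_heads = 0.5
heads = stats.binom(n_flips, p_heads)

# P(exactly 3 heads)
p_exactly_three = heads.pmf(3)

# P(more than 3 heads) = P(X > 3), the survival function at 3
p_more_than_three = heads.sf(3)

p_exactly_three, p_more_than_three
```

The exact values are 56/256 = 0.21875 and 163/256 ≈ 0.6367, so the simulated estimates should land close to them.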

**There are approximately 3 web development cohorts for every 1 data science cohort at Codeup. Assuming that Codeup randomly selects an alumnus to put on a billboard, what are the odds that the two billboards I drive past both have data science students on them?**

In [9]:
# set outcomes as 0 and 1 (0 = webdev, 1 = datasci)
outcomes = [0, 1]
bboards = np.random.choice(outcomes, size=(100_000,2), p=[0.75, 0.25])
bboards

two_datasci = bboards.sum(axis=1)
(two_datasci == 2).mean()

0.06267
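
The simulated value can be checked by hand: if each billboard shows a data science student with probability 0.25, both do with probability 0.25 × 0.25. The sketch below assumes the two billboards are independent (as the simulation does). It also converts the result to odds, since the prompt says "odds" while the simulation estimates a probability.

```python
# Analytic check, assuming the two billboards are independent
p_datasci = 0.25

# P(both billboards show data science students)
p_both = p_datasci ** 2

# Odds in favor = P(event) / P(not event)
odds_both = p_both / (1 - p_both)

p_both, odds_both
```

The probability is 0.0625, which corresponds to odds of 1 to 15.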

**Compare Heights**
* Men have a mean height of 178 cm and a standard deviation of 8 cm.
* Women have a mean height of 170 cm and a standard deviation of 6 cm.

If a man and a woman are chosen at random, what is the probability that the woman is taller than the man?

In [10]:
man = np.random.normal(178, 8, size=(100_000, 1))
woman = np.random.normal(170, 6, size=(100_000, 1))

woman_is_taller = (woman > man)

woman_is_taller.mean()


0.21127
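
The difference of two independent normal variables is itself normal, so this probability also has a closed form: `woman - man` has mean 170 - 178 = -8 and standard deviation sqrt(6² + 8²) = 10. A minimal sketch, assuming the two heights are independent:

```python
import numpy as np
from scipy import stats

# Difference D = woman - man, assuming independent normal heights
mean_diff = 170 - 178
sd_diff = np.sqrt(6**2 + 8**2)

# P(woman taller than man) = P(D > 0)
p_woman_taller = stats.norm(mean_diff, sd_diff).sf(0)
p_woman_taller
```

This comes to roughly 0.212, in line with the simulated estimate.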

## <span style='background :yellow' > Probability Distributions -- Poisson</span>

**A bank found that the average number of cars waiting during the noon hour at a drive-up window follows a Poisson distribution with a mean of 2 cars. Make a chart of this distribution and answer these questions concerning the probability of cars waiting at the drive-up window.**
* What is the probability that no cars drive up in the noon hour?
* What is the probability that 3 or more cars come through the drive through?
* How likely is it that the drive through gets at least 1 car?

In [12]:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

λ = 2

x = np.arange(0,10)
y = stats.poisson(λ).pmf(x)

# What is the probability that no cars drive up in the noon hour?
stats.poisson(λ).pmf(0).round(3)

# What is the probability that 3 or more cars come through the drive through?
stats.poisson(λ).sf(2).round(3)

# How likely is it that the drive through gets at least 1 car?
# P(X >= 1) = P(X > 0); round the final result rather than an intermediate value
stats.poisson(λ).sf(0).round(3)

0.865
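
The prompt asks for a chart of the distribution, which the cell above computes (`x`, `y`) but does not draw. One way to plot the probability mass function is sketched below; the axis labels and title are illustrative choices, not taken from the original.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

lam = 2  # mean number of cars waiting during the noon hour

# Probability of each count from 0 to 9 cars
x = np.arange(0, 10)
y = stats.poisson(lam).pmf(x)

# Bar chart of the probability mass function
fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_xticks(x)
ax.set_xlabel("Number of cars waiting")
ax.set_ylabel("P(X = x)")
ax.set_title("Poisson distribution with mean 2")
```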

## <span style='background :yellow' > Probability Distributions -- Normal Distribution</span>
Grades of State University graduates are normally distributed with a mean of 3.0 and a standard deviation of 0.3. Calculate the following:


In [13]:
# What grade point average is required to be in the top 5% of the graduating class?
stats.norm(3, 0.3).isf(0.05)

# What GPA constitutes the bottom 15% of the class?
stats.norm(3, 0.3).ppf(0.15)

# Confirm with a simulation
np.quantile(np.random.normal(3, 0.3, 10_000), 0.15)

# An eccentric alumnus left scholarship money for students in the third decile from the bottom of their class.
# Determine the range of the third decile.
# Would a student with a 2.8 grade point average qualify for this scholarship?

stats.norm(3, 0.3).ppf([0.2,0.3])

# If I have a GPA of 3.5, what percentile am I in?
stats.norm(3, 0.3).cdf(3.5) *100

95.22096477271853
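
The scholarship question can be answered directly by testing whether 2.8 falls between the 20th and 30th percentiles. A small sketch, assuming "third decile from the bottom" means the 20th to 30th percentile range, as the cell above does:

```python
from scipy import stats

gpa = stats.norm(3, 0.3)

# Bounds of the third decile (20th to 30th percentile)
lower, upper = gpa.ppf([0.2, 0.3])

# Does a 2.8 GPA fall inside that range?
qualifies = lower <= 2.8 <= upper
lower, upper, qualifies
```

The range is roughly 2.75 to 2.84, so under this model a student with a 2.8 would qualify.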

**Connect to the employees database and find the average salary of current employees, along with the standard deviation. For the following questions, calculate the answer by modeling the employees' salaries with a normal distribution defined by the calculated mean and standard deviation, then compare this answer to the actual values present in the salaries dataset.**

In [None]:
import env
import pandas as pd

url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/employees'
query = '''
SELECT *
FROM salaries s
WHERE emp_no IN (
    SELECT emp_no FROM dept_emp
    WHERE to_date > NOW()
) AND to_date > NOW()
'''
    
salaries = pd.read_sql(query, url)

In [None]:
# Model salaries with a normal distribution using the sample mean and standard deviation
mean = salaries.salary.mean()
sd = salaries.salary.std()
mean, sd

# What percent of employees earn less than 60,000?
stats.norm(mean, sd).cdf(60000)

# What percent of employees earn more than 95,000?
stats.norm(mean, sd).sf(95000)

# What percent of employees earn between 65,000 and 80,000?
np.diff(stats.norm(mean, sd).cdf([65000, 80000]))

# What do the top 5% of employees make?
stats.norm(mean, sd).isf(.05)

# Compare with the actual 95th percentile in the data
salaries.salary.quantile(.95)

# A histogram of salaries.salary can help check whether the normal model is reasonable
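
To compare the normal model against the data itself, the modeled and empirical proportions can be computed side by side. The helper below, `compare_below`, is hypothetical (not part of the original notebook). Because the `salaries` table needs a database connection, the usage lines run on synthetic stand-in values rather than real salaries.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_below(values: pd.Series, threshold: float) -> tuple[float, float]:
    """Return (modeled, empirical) proportion of values below threshold.

    The modeled proportion uses a normal distribution fitted with the
    sample mean and standard deviation. Hypothetical helper for illustration.
    """
    modeled = stats.norm(values.mean(), values.std()).cdf(threshold)
    empirical = (values < threshold).mean()
    return float(modeled), float(empirical)


# Synthetic stand-in data (an assumption, NOT the real salaries table)
rng = np.random.default_rng(0)
fake_salaries = pd.Series(rng.normal(70_000, 15_000, size=50_000))

modeled, empirical = compare_below(fake_salaries, 60_000)
modeled, empirical
```

With the real data the call would be `compare_below(salaries.salary, 60000)`. A large gap between the two numbers would suggest the normal model fits the salaries poorly.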

## <span style='background :yellow' > Probability Distributions -- Binomial Distribution</span>

**A marketing website has an average click-through rate of 2%. One day they observe 4326 visitors and 97 click-throughs. How likely is it that this many people or more click through?**

In [None]:
n = 4326
p = 0.02
stats.binom(n, p).sf(96)

# Using simulation (note: this 100_000 x 4326 array can take several GB of memory)
clicks = np.random.choice([0,1], (100_000, 4326), p = [0.98, 0.02])
clicks

(clicks.sum(axis =1) >= 97).mean()

# Using the Poisson approximation (reasonable when n is large and p is small)
λ = n * p

stats.poisson(λ).sf(96)
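
Since only the number of clicks per simulated day matters, the simulation can draw binomial counts directly instead of building a 100,000 by 4,326 array of individual visitors. This is an alternative to the `np.random.choice` approach above, sketched with NumPy's `Generator.binomial`; the seed is arbitrary.

```python
import numpy as np
from scipy import stats

n = 4326
p = 0.02

# One binomial draw per simulated day: the number of click-throughs
rng = np.random.default_rng(1349)
daily_clicks = rng.binomial(n, p, size=100_000)

# Simulated and exact P(97 or more click-throughs)
simulated = (daily_clicks >= 97).mean()
exact = stats.binom(n, p).sf(96)
simulated, exact
```

This version stores one integer per simulated day, so it needs only a small fraction of the memory.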

**You are working on some statistics homework consisting of 100 questions where all of the answers are a probability rounded to the hundredths place. Looking to save time, you put down random probabilities as the answer to each question.
What is the probability that at least one of your first 60 answers is correct?**

In [14]:
n = 60
p = 0.01

stats.binom(n, p).sf(0)

0.4528433576092388
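
The same answer follows from the complement rule: getting at least one answer right is the complement of getting all 60 wrong. A quick check, assuming each guess is independently correct with probability 0.01, as in the cell above:

```python
n = 60
p = 0.01

# P(at least one correct) = 1 - P(all n guesses wrong)
p_at_least_one = 1 - (1 - p) ** n
p_at_least_one
```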