This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge or an interview. That said, don't hesitate to ask for help if you are truly stuck.

Have fun studying!

In [0]:
# Load a dataset, fix headers and missing value indicators
# Identify variables for, run, and correctly interpret both 1-sample and 2-sample t-tests.
# Identify variables for, run, and correctly interpret chi^2 tests for independence between two categorical variables.
# Be able to generate and graph a confidence interval around a sample estimate (mean).
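
The first objective above (loading a dataset, fixing headers, and handling missing value indicators) can be sketched as follows. This is a minimal illustration, not a required approach: the in-memory CSV text, the `?` missing-value marker, and the column names are all hypothetical stand-ins for whatever dataset you are given.

```python
import io
import pandas as pd

# Hypothetical raw file: no header row, and '?' marks missing values
raw_csv = io.StringIO(
    "34,m,?\n"
    "29,f,9\n"
    "?,f,7\n"
)

# Hypothetical column names, supplied by hand because the file has none
column_names = ['age', 'gender', 'rating']

# header=None stops pandas from treating the first data row as headers;
# na_values tells pandas which strings should become NaN
people = pd.read_csv(raw_csv, header=None, names=column_names, na_values='?')

# Count the missing values in each column
print(people.isnull().sum())
```

With a real file you would pass a path or URL to `pd.read_csv` instead of the `io.StringIO` object.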

# Resources

[Scipy Stats Documentation](https://docs.scipy.org/doc/scipy/reference/stats.html)

# General Terms

Define the following terms. *Double click the text to edit the markdown cells.*
<br/><br/>

**Normal Distribution:** A symmetric, bell-shaped distribution in which the data cluster around a central value (the mean) with no skew to the left or right.

**Standard Deviation:** A measure of the spread of the data. It is the square root of the variance, which is the average of the squared differences from the mean.

**Z-Score:** Tells you how many standard deviations a data point is from the mean.

**P-Value:** This number is used to test your null hypothesis. It is the probability of observing results at least as extreme as the ones you got, assuming the null hypothesis is true. A small p-value (typically below 0.05) means the data would be unlikely under the null hypothesis, so you reject it; otherwise you fail to reject it.

**Null Hypothesis:** The default assumption in a hypothesis test, typically that there is no effect or no difference. We test it against an alternative hypothesis, and it is usually the claim we are aiming to reject.

**Sample:** A portion of the population meant to represent that population.

**Statistical Significance:** Results are considered statistically significant when the p-value falls below the chosen significance level (alpha, commonly 0.05). It means the results are saying something other than our null hypothesis with enough statistical credibility to be worth considering.
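
To make the definitions above concrete, here is a small sketch that computes a standard deviation, a z-score, and a two-sided p-value by hand. The numbers are hypothetical and chosen only so the arithmetic is easy to follow; the p-value step assumes the z-score is compared against a standard normal distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, chosen so the arithmetic is easy to follow
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()
variance = ((data - mean) ** 2).mean()  # average squared difference from the mean
std_dev = np.sqrt(variance)             # population standard deviation (ddof=0)

# Z-score: how many standard deviations each point lies from the mean
z_scores = (data - mean) / std_dev
z = z_scores[-1]                        # z-score of the last point (9)

# Two-sided p-value: probability of a standard normal value
# at least this far from zero in either direction
p_value = 2 * stats.norm.sf(abs(z))

print(mean, std_dev)      # 5.0 2.0
print(z)                  # 2.0
print(round(p_value, 4))  # 0.0455
```

Note that `np.std(data)` gives the same population value; pass `ddof=1` when you want the sample standard deviation instead.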

# T-Test

Answer the following questions as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*
<br/><br/>

1. What is a T-Test? What is it used for?

A T-Test is a hypothesis test that we use to compare means: either the mean of one sample against a known or hypothesized value, or the means of two samples against each other, to determine whether they differ in a statistically significant way.

2. What is the difference between the normal distribution and the t-distribution?

  ` Your Answer Here `

3. What is the difference between a 1-sample and a 2-sample t-test?

  ` Your Answer Here `
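
As a reference for the mechanics behind the questions above, the sketch below runs both kinds of t-test with `scipy.stats`. The samples, the seed, and the hypothesized mean of 5 are hypothetical values made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples drawn from normal distributions
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=5, scale=2, size=50)
sample_b = rng.normal(loc=6, scale=2, size=50)

# 1-sample t-test: is the mean of sample_a different from a fixed value (5)?
one_sample = stats.ttest_1samp(sample_a, popmean=5)

# 2-sample t-test: are the means of sample_a and sample_b different?
two_sample = stats.ttest_ind(sample_a, sample_b)

print(one_sample.statistic, one_sample.pvalue)
print(two_sample.statistic, two_sample.pvalue)
```

In both cases the result carries a test statistic and a p-value; the p-value is what you compare against your significance level.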

We are scientists running a drug trial and want to know whether our drug reduced patient symptoms. Below are the results (just random numbers). Explain in 2-3 sentences whether or not the drug was effective. How can we tell that from the t-test?

```
Your Answer Here
```

What is likely our null hypothesis?

```
Your Answer Here
```

In [0]:
from scipy import stats
import numpy as np
import pandas as pd

# Get our "results" with random numbers
np.random.seed(42)
with_drug = stats.norm.rvs(loc=5, scale=10, size=500)
without_drug = stats.norm.rvs(loc=5, scale=10, size=500)

# See if our drug made a difference
stats.ttest_ind(with_drug, without_drug)

Ttest_indResult(statistic=-0.40331379088750186, pvalue=0.6868037874359643)

Here is a dataframe of movie ratings. Divide the dataframe by gender and then use t-tests to show which movies have a statistically significant difference in rating when divided by gender. Give a sentence explanation of the results.

In [0]:
df = pd.DataFrame({'gender':['m','f','f','m','m','m','f','f','m','f'],
                   'jurassic park':[10,9,10,9,9,10,10,10,9,9],
                   'love actually':[6,9,10,7,6,7,10,10,5,8],
                   'pacific rim':[10,3,4,8,9,8,5,4,9,3]})
df.head()

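|   | gender | jurassic park | love actually | pacific rim |
|---|--------|---------------|---------------|-------------|
| 0 | m      | 10            | 6             | 10          |
| 1 | f      | 9             | 9             | 3           |
| 2 | f      | 10            | 10            | 4           |
| 3 | m      | 9             | 7             | 8           |
| 4 | m      | 9             | 6             | 9           |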


In [0]:
# Divide the dataframe here

**Jurassic Park**

Explanation of results:

```
Your Answer Here
```

In [0]:
# T-Test Code Here

**Love Actually**

Explanation of results:

```
Your Answer Here
```

In [0]:
# T-Test Code Here

**Pacific Rim**

Explanation of results:

```
Your Answer Here
```

In [0]:
# T-Test Code Here

# Confidence Interval

Answer the following question as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*
<br/><br/>

1. What is a confidence interval?

  ` Your Answer Here `

Using the movie rating data, graph the ratings with a confidence interval. After graphing the ratings with the confidence interval, write a brief explanation of how to interpret the graph.

```
Your interpretation here
```
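
One common way to build a confidence interval around a sample mean uses the t-distribution. The sketch below is an illustration under assumptions: `mean_confidence_interval` is a hypothetical helper name and `scores` is made-up data, not the movie ratings.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean (uses ddof=1)
    # Critical t value for a two-sided interval with n - 1 degrees of freedom
    margin = sem * stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, mean - margin, mean + margin

# Hypothetical scores, for illustration only
scores = [7, 8, 6, 9, 7, 8, 7, 6, 8, 9]
mean, lower, upper = mean_confidence_interval(scores)
print(mean, lower, upper)
```

To graph the interval, the half-width `upper - mean` can be passed as the `yerr` argument of `plt.errorbar` or `plt.bar`.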

In [0]:
import matplotlib.pyplot as plt

# Your Graph Code Here

# Chi Squared

Answer the following questions as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*
<br/><br/>

1. What is a Chi Squared Test? What is it used for?

  ` Your Answer Here `

2. What type of data is it used on?

  ` Your Answer Here `

3. What is a contingency table?

  ` Your Answer Here `

4. Define Degrees of Freedom

  ` Your Answer Here `
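
The mechanics behind the questions above can be sketched with `pd.crosstab` and `scipy.stats.chi2_contingency`. The `survey` dataframe and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical survey with two categorical variables
survey = pd.DataFrame({
    'owns_pet':     ['yes', 'yes', 'yes', 'yes', 'yes', 'yes',
                     'no', 'no', 'no', 'no', 'no', 'no'],
    'likes_hiking': ['yes', 'yes', 'yes', 'yes', 'no', 'no',
                     'yes', 'no', 'no', 'no', 'no', 'yes'],
})

# Contingency table: counts for every combination of the two categories
table = pd.crosstab(survey['owns_pet'], survey['likes_hiking'])
print(table)

# Chi-squared test of independence on the observed counts
chi2, p_value, dof, expected = stats.chi2_contingency(table)

# For an r x c table the degrees of freedom are (r - 1) * (c - 1)
print(chi2, p_value, dof)
print(expected)  # counts we would expect if the variables were independent
```

A large p-value here means we fail to reject the null hypothesis that the two variables are independent. With counts this small the chi-squared approximation is rough, so treat the example as a demonstration of the calls rather than a meaningful result.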

Use the `grades` dataframe below to complete the following:
- Create at least 2 contingency tables
- Use chi-squared tests to find 2 features that are independent of each other.
 - Write a brief interpretation of the results
- Use chi-squared tests to find 2 features that are dependent on each other.
 - Write a brief interpretation of the results

In [0]:
grades = pd.DataFrame({'good_standing':[True, True, False, False, False, True, True, False, True, True],
                       'grade_1':['A', 'B', 'A', 'C', 'A', 'A', 'D', 'A', 'B', 'B'],
                       'grade_2':['Pass', 'Pass', 'Fail', 'Fail', 'Fail','Pass', 'Pass', 'Fail', 'Pass', 'Fail'],
                       'grade_3':[10, 5, 6, 10, 9, 9, 8, 7, 3, 9]})
grades.head()

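|   | good_standing | grade_1 | grade_2 | grade_3 |
|---|---------------|---------|---------|---------|
| 0 | True          | A       | Pass    | 10      |
| 1 | True          | B       | Pass    | 5       |
| 2 | False         | A       | Fail    | 6       |
| 3 | False         | C       | Fail    | 10      |
| 4 | False         | A       | Fail    | 9       |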


In [0]:
# Contingency Table 1

In [0]:
# Contingency Table 2

In [0]:
# Chi Squared, independent features

In [0]:
# Chi Squared, dependent features

# Bayesian Statistics

Answer the following questions as though you are explaining it to a non-technical person. *Double click the text to edit the markdown cells.*
<br/><br/>

1. What is the difference between Bayesian and Frequentist Statistics?

  ` Your Answer Here `

2. What is a prior belief? How is it used in Bayesian Statistics?

  ` Your Answer Here `

3. What is the law of total probability?

  ` Your Answer Here `

4. What is the law of conditional probability?

  ` Your Answer Here `

5. Give an example of when you might use Bayesian statistics. Do not use an example given during the lecture or assignment.

  ` Your Answer Here `
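
As a worked illustration of how a prior belief, the law of total probability, and conditional probability fit together, the sketch below applies Bayes' rule to a screening test. Every number is hypothetical and chosen only to keep the arithmetic readable.

```python
# Hypothetical inputs for a screening test
prior = 0.01                      # P(condition): belief before seeing the test result
p_pos_given_condition = 0.95      # P(positive | condition)
p_pos_given_no_condition = 0.05   # P(positive | no condition)

# Law of total probability: overall chance of a positive result
p_positive = (p_pos_given_condition * prior
              + p_pos_given_no_condition * (1 - prior))

# Bayes' rule: updated belief after seeing a positive result
posterior = p_pos_given_condition * prior / p_positive

print(round(p_positive, 3))  # 0.059
print(round(posterior, 3))   # 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior is small. Rerunning the calculation with `posterior` as the new `prior` shows how beliefs update as more evidence arrives.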

# Graphing

Use any of the dataframes above and make two additional visualizations to explore the data. Make sure to include axis labels and a title for each graph.