# Overview

The aim of this notebook is to go over rudimentary statistical concepts and how to apply them using Python. The examples use the data set below, which should be loaded before any of the example cells are run.

If you do not have pydataset installed, it can be installed with the following command.

`$ pip install pydataset`

Also, keep in mind that R is generally better suited to pure statistical testing. Python is good for mixing stats with other things, but it is not as strong for pure statistics.

## Example Data - Journals

**title** - journal title

**pub** - publisher

**society** - whether the journal is published by a scholarly society

**libprice** - library subscription price

**pages** - number of pages

**charpp** - characters per page

**citestot** - total number of citations

**date1** - year journal was founded

**oclc** - number of library subscriptions

**field** - field description

In [2]:
from pydataset import data

journals = data('Journals')

# Populations and Samples

**Population**: Any large collection of objects or individuals about which information is desired.  
**Parameter**: Any summary number that describes the entire population. (Ex. average or percentage)  
**μ**: ("mu") Population mean.  
**p**: Population proportion.

**Sample**: Any representative group drawn from the population.  
**Statistic**: Any summary number that describes the sample. (Ex. average or percentage). An estimate of the parameter.  
**x̄**: ("x-bar") Sample mean.  
**p̂**: Sample proportion

Having data on an entire population is rare. When you do have it, the parameters can be computed directly and no statistical inference is needed. The example data is a sample.

In [None]:
# Sample mean
print("Mean of pages:\n", journals['pages'].mean())

# Sample proportion: each publisher's share of all journals in the sample
# (divide by the total count, not by the size of the largest group)
print("\nProportion of publishers:\n",
      journals['pub'].value_counts(normalize=True))

# Confidence Intervals

A confidence interval is the range of values that we can be a stated percentage confident contains the population parameter. The confidence level is usually 95%.

Lower value < population mean **μ** < Upper value

*In the most general form:*  
Sample estimate ± margin of error

*More specific:*  
Sample mean ± (t-multiplier × standard error)
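As a rough illustration of the formula above, here is a minimal sketch that computes a 95% two-sided interval for a mean. The `pages` values are made up for the example (they are not taken from the Journals data), and the t-multiplier is read from a t-table for 9 degrees of freedom rather than computed.

```python
import math
import statistics

# Hypothetical sample of page counts (made up for illustration)
pages = [531, 430, 639, 914, 520, 479, 742, 1200, 360, 585]

n = len(pages)
sample_mean = statistics.mean(pages)        # x-bar
sample_sd = statistics.stdev(pages)         # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)   # standard error of the mean

# t-multiplier for 95% two-sided confidence with n - 1 = 9 degrees of freedom,
# taken from a t-table. If scipy is installed, scipy.stats.t.ppf(0.975, 9)
# gives the same multiplier.
t_multiplier = 2.262

margin_of_error = t_multiplier * standard_error
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"Sample mean: {sample_mean:.1f}")
print(f"95% CI for the population mean: ({lower:.1f}, {upper:.1f})")
```

With real data, `pages` could be replaced by `journals['pages']`, and the multiplier would need to match that sample's degrees of freedom.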

A confidence interval can also be one-sided, where either the lower or the upper limit is infinite. This gives a tighter bound on the finite side.

**When to use one-sided** - When specifically asked whether the unknown mean is more than (or not more than) a specific value, or when the practical consequences of being wrong matter in only one direction. For example, how much weight can this bridge hold? Underestimating doesn't hurt, but overestimating can mean death.

**When to use two-sided** - When specifically asked whether an unknown mean is equal to a specific value, or when errors in either direction matter. For example, where is the wormhole located? Being too far over or under means missing it, and thus potential death in the freezer of space.
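To make the "tighter bound" point concrete, the sketch below compares the lower limit of a two-sided 95% interval with a one-sided 95% lower bound on the same sample. The `loads` values are hypothetical bridge load-test results invented for this example, and both t-multipliers are t-table values for 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical load-test results in tonnes (made up for illustration)
loads = [52.1, 49.8, 50.6, 51.3, 48.9, 50.2, 51.7, 49.5, 50.9, 50.0]

n = len(loads)
mean_load = statistics.mean(loads)
standard_error = statistics.stdev(loads) / math.sqrt(n)

# t-table values for n - 1 = 9 degrees of freedom
t_two_sided = 2.262   # leaves 2.5% in each tail
t_one_sided = 1.833   # leaves the full 5% in the lower tail

# Two-sided interval: both limits are finite
two_sided_lower = mean_load - t_two_sided * standard_error
two_sided_upper = mean_load + t_two_sided * standard_error

# One-sided interval: (one_sided_lower, +infinity)
one_sided_lower = mean_load - t_one_sided * standard_error

print(f"Two-sided 95% CI: ({two_sided_lower:.2f}, {two_sided_upper:.2f})")
print(f"One-sided 95% lower bound: {one_sided_lower:.2f}")
```

Because all of the 5% error is placed in one tail, the one-sided multiplier is smaller, so the lower bound sits closer to the sample mean than the two-sided lower limit does.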

# Hypothesis Testing

Hypothesis testing has three steps:
* Make initial assumption
* Collect evidence
* Based on available evidence, decide whether or not to reject the initial assumption

Don't make a very unusual claim the null hypothesis and then require that it be proven otherwise. The null hypothesis is the default assumption: innocent until proven guilty.

### Type I Error versus Type II Error
**Type I error**: The null hypothesis is rejected when it is true. The probability of making this error is **alpha ( α )**.  
**Type II error**: The null hypothesis is not rejected when it is false.

The goal is to minimize these errors. Which one is worse to make depends on the situation.
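One way to see what alpha means is to simulate it. The sketch below repeatedly draws samples from a population where the null hypothesis really is true and counts how often a test at alpha = 0.05 rejects it anyway. To keep it simple it assumes a z-test with a known population standard deviation. All of the constants (`MU_0`, `SIGMA`, `N`, `TRIALS`) are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)   # fixed seed so the run is repeatable

MU_0 = 100.0     # null hypothesis: population mean is 100 (true in this simulation)
SIGMA = 15.0     # population standard deviation, assumed known (z-test)
N = 30           # size of each sample
ALPHA = 0.05     # significance level
TRIALS = 2000    # number of simulated experiments

# Two-sided cutoff from the standard normal distribution (about 1.96)
z_critical = NormalDist().inv_cdf(1 - ALPHA / 2)

rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    z = (mean(sample) - MU_0) / (SIGMA / math.sqrt(N))
    if abs(z) > z_critical:
        # The null is true here, so every rejection is a Type I error
        rejections += 1

type_i_rate = rejections / TRIALS
print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {ALPHA})")
```

The observed rate should land close to alpha, since alpha is by definition the probability of rejecting a true null hypothesis.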

### Likely vs Unlikely 
The decision comes down to whether the observed evidence is likely or unlikely if the null hypothesis is true. There are two ways to determine if something is likely vs unlikely.
* **critical value approach** - Older
* **P-value approach** - Most often used

### Critical value approach
Decide whether or not to reject the null hypothesis by checking whether the observed test statistic is more extreme than some cutoff value, the **critical value**.

The four steps are:
* Specify the null and alternative hypotheses.
* Using the sample data and assuming the null is true, calculate the value of the test statistic.
* Determine the critical value by finding the value of the known distribution of the test statistic such that the probability of making a Type I error, alpha, is small. (Jorgenson always used 0.05.) Alpha, the probability of a Type I error, is the **significance level of the test**.
* Compare the test statistic to the critical value. If the test statistic is more extreme, reject the null hypothesis.
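The four steps above can be sketched as a one-sample, two-sided t-test. The `pages` sample and the hypothesized mean of 600 are invented for this example, and the critical value is a t-table entry for alpha = 0.05 with 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical sample of page counts (made up for illustration)
pages = [531, 430, 639, 914, 520, 479, 742, 1200, 360, 585]

# Step 1: specify the hypotheses
#   H0: mu = 600    Ha: mu != 600   (600 is an arbitrary value for the example)
mu_0 = 600

# Step 2: calculate the test statistic, assuming H0 is true
n = len(pages)
standard_error = statistics.stdev(pages) / math.sqrt(n)
t_stat = (statistics.mean(pages) - mu_0) / standard_error

# Step 3: determine the critical value for alpha = 0.05 (two-sided)
# with n - 1 = 9 degrees of freedom, read from a t-table
t_critical = 2.262

# Step 4: reject H0 only if the test statistic is more extreme than the cutoff
reject_null = abs(t_stat) > t_critical

print(f"t statistic: {t_stat:.3f}, critical value: ±{t_critical}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

For this made-up sample the test statistic is well inside the cutoff, so the null hypothesis is not rejected.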