<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Fundamental Concepts in Statistics

_Authors: Matt Brems (DC)_

---

### Learning Objectives
*After this lesson, you will be able to:*
- Define target population, sampled population, sample, sampling unit, observation unit, sampling frame, probability sampling, and non-probability sampling, and identify these within the context of a real-world situation.
- Differentiate between descriptive and inferential statistics.
- Calculate, interpret, and apply properties of univariate and bivariate descriptive statistics.

### Lesson Guide

- [Introduction: What is "statistics"?](#introduction)
- [Sampling](#sampling)
- [Probability and Non-Probability Samples](#prob-nonprob)
- [Types of Samples](#three-types)
- [Replication](#replication)
- [Descriptive vs. Inferential Statistics](#descriptive_inferential)
- [Univariate vs. Bivariate Statistics and Parameters](#univariate_bivariate)
- [Pearson Correlation Coefficient](#correlation)
- [Covariance](#covariance)
- [Additional Resources](#additional-resources)

<a id='introduction'></a>

## Introduction: What is "Statistics"?

---

**When I say "statistics," what comes to mind? How do we define statistics?**

- "Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data." —  _Wikipedia_
- "A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data." — _Merriam-Webster_
- "Statistics is the science of data." — _The Ohio State University, STAT 1450_
- "Statistics is a method of solving problems through data-driven, quantitative reasoning." — _Matt Brems (DSI-DC)_

<a id='sampling'></a>
## Sampling

---

Before we work with data, we must consider how it will be collected. It is important to collect data in a theoretically proper manner and, in your analysis, account for the "real-world" issues that will naturally arise.

It is often expensive, time-consuming, and, in some cases, impossible to take measurements of every person or observation of interest. In order to overcome this reality, statisticians have developed a variety of methods to collect data from a subset of the population of interest in an intentional manner and then extrapolate those findings to the larger group.

Speaking broadly, in statistics we take measurements from a sample to learn about a broader population. We must, however, be *very* careful about how we define this population, as any inferences we make based on a sample can only be extended to the sampled population.

### Sampling Terms
- **Sample:** A subset of a population.
- **Target Population:** The population about which we want to make inferences; the population of interest.
- **Sampled Population:** The population about which we *can* make inferences; the population from which we sample.
- **Sampling Unit:** A unit that can be selected for a sample.
- **Observation Unit:** An object on which a measurement is taken.
- **Sampling Frame:** A specification of sampling units in the population from which a sample may be selected.



### Sampling anecdote from DSI-DC instructor Matt Brems:

> A lot of my work at Optimus revolved around survey design and analysis. My goal was to estimate the proportion of people in New Hampshire who planned to vote for each candidate in the Republican and Democratic presidential primaries. 

> I had a voter file containing the names, contact information, and demographic data of everyone registered to vote within the state of New Hampshire — roughly 900,000 people. 

> I chose to call about 10,000 people. Of those people, about 6% responded to the question "Who do you plan to vote for in the upcoming presidential primary?"

**Using Matt's example:**

- Describe the sample.
- Describe the target population.
- Describe the sampled population.
- Describe the sampling unit.
- Describe the observation unit.
- Describe the sampling frame.

### Another hypothetical example:

> Say I am interested in finding the true average height of all undergraduate women enrolled at Ohio State University. 

> Rather than taking a census of all 25,000 undergraduate women, I decide to survey a sample. 

> I contact the Ohio State registrar for a list of names and contact information for all undergraduate women. I identify 1,000 women whose height I wish to measure, and reach out to them. 

> 249 women show up to have their heights measured.

**Given the example above, as a group:**

- Describe the sample.
- Describe the target population.
- Describe the sampled population.
- Describe the sampling unit.
- Describe the observation unit.
- Describe the sampling frame.

### Independent practice

Taken from ["Sampling: Design and Analysis", 2nd ed., Sharon Lohr, 2010.](http://evalenzu.mat.utfsm.cl/Docencia/2016/Primer%20semestre/Metodos%20Estadisticos%20en%20Ingenieria/Apunte1.pdf)

> Many scholars and policy makers are interested in the proportion of homeless people who are mentally ill. 

> Wright (1988) estimated that 33% of all homeless people are mentally ill by sampling homeless persons who received medical attention from one of the clinics in the Health Care for the Homeless (HCH) project.


- Describe the sample.
- Describe the target population.
- Describe the sampled population.
- Describe the sampling unit.
- Describe the observation unit.
- Describe the sampling frame.

---

<a id='prob-nonprob'></a>

### When sampling, one can conduct _probability_ samples or _non-probability_ samples.

In a probability sample, the probability of getting a particular sample can be calculated. (In other words, each unit in the population has a known probability of selection). In a non-probability sample, this is not the case. 

When we want to conduct standard inference, one of the first assumptions we make is that observations come from a random probability sample. (Research has only recently started diving into methods of conducting inference from non-probability samples.)

Similarly, whenever you generate a confidence interval or execute a hypothesis test, you rely on the assumption that your data comes from a random probability sample. (It is far more common to see references to "random sample," in which the "probability" is implied).

<a id='three-types'></a>

## Three basic types of random probability samples

---

1. **Simple Random Sample (SRS):** A sample in which every possible subset of `n` units in the population has the same chance of being the sample.<br><br>

2. **Stratified Random Sample:** A sample in which the population is broken into subgroups (sometimes called strata), a simple random sample is pulled from within each subgroup, and those samples are combined into one larger "stratified" sample.<br><br>

3. **Cluster Random Sample:** A sample in which observation units are grouped into larger sampling units (sometimes called clusters), a sample of larger sampling units is selected, and then observation units within the selected larger sampling units are selected.
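
The three designs above can be sketched with Python's `random` module. This is a minimal illustration, not a survey-grade implementation: the 12-unit population, the `year` strata, the `dorm` clusters, and the sample sizes are all made up for the example.

```python
import random

# Hypothetical population: 12 units, each tagged with a class year (stratum)
# and a dorm (cluster). Every value here is invented for illustration.
population = [
    {"id": i, "year": ["first", "second", "third"][i % 3], "dorm": "ABCD"[i % 4]}
    for i in range(12)
]

rng = random.Random(42)  # a local, seeded generator so the draws are repeatable

# 1. Simple random sample: every subset of 4 units is equally likely.
srs = rng.sample(population, 4)

# 2. Stratified random sample: an SRS of 2 units from within each class year,
#    combined into one larger sample.
stratified = []
for year in ["first", "second", "third"]:
    stratum = [unit for unit in population if unit["year"] == year]
    stratified.extend(rng.sample(stratum, 2))

# 3. Cluster random sample: select 2 dorms at random, then select 2 units
#    within each selected dorm.
chosen_dorms = rng.sample(["A", "B", "C", "D"], 2)
cluster = []
for dorm in chosen_dorms:
    members = [unit for unit in population if unit["dorm"] == dorm]
    cluster.extend(rng.sample(members, 2))
```

Note that the stratified sample is guaranteed to contain units from every class year, while the cluster sample only contains units from the dorms that happened to be selected.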

<a id='replication'></a>

## Replication

---

Whether it be for an article in an academic journal, results for your supervisor, or in-house testing among peers, you want to be able to replicate your work. When drawing samples using Python, however, randomness can make code results difficult to duplicate.

The random numbers that Python's `random` module produces are not *truly* random. Instead, they are "pseudorandom" numbers generated deterministically from a "seed" value. Setting the seed returns the generator to a particular state before it generates a random number. If you set your seed to 6, generate a random sample, set your seed to 6 again, and generate another random sample, you'll notice that the exact same random sample was generated!

```python
import random  # part of the Python standard library, just like math

random.seed(insert_integer_here)  # fix the generator's starting state
random.sample(sequence_to_be_sampled, sample_size)  # draw sample_size items without replacement
```

In [1]:
import random 
random.seed(3)
random.sample([1,2,3,4,5,6,7,8,9,10], 3)

[3, 5, 10]

In [2]:
random.seed(3)
random.sample([1,2,3,4,5,6,7,8,9,10], 3)

[3, 5, 10]

In [3]:
random.seed(5)
random.sample([1,2,3,4,5,6,7,8,9,10], 3)

[7, 10, 9]
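
`random.seed` changes the state of one generator shared by all code that uses the `random` module. One alternative, sketched here as a suggestion rather than as part of the original lesson, is to create a dedicated `random.Random` instance with its own seed. The variable names are illustrative.

```python
import random

population = list(range(1, 11))

# Two generators created with the same seed start in the same state,
# so they produce the same sample.
first_draw = random.Random(6).sample(population, 3)
second_draw = random.Random(6).sample(population, 3)

print(first_draw == second_draw)  # True
```

Because each generator carries its own state, other code that calls `random` functions elsewhere cannot disturb the sequence it produces.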

<a id='descriptive_inferential'></a>
## Descriptive vs. inferential statistics

---

Loosely speaking:
- **Descriptive statistics** is the branch of statistics that deals with _summarizing available information_.
- **Inferential statistics** is the branch of statistics that deals with _generalizing available information to a larger population_.

**Discussion:** In what cases would we want to use descriptive statistics? In what cases would we want to use inferential statistics? Does it make sense to use one without the other?
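
As a rough sketch of the distinction, the code below first summarizes a sample (descriptive) and then uses that summary to make a statement about the population mean (inferential). The heights are simulated rather than real, and the interval uses the normal critical value 1.96 as a simplifying assumption.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical sample of 50 heights in inches (simulated for illustration).
sample = [rng.gauss(65, 3) for _ in range(50)]

# Descriptive: summarize the information we actually have.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential: generalize to the population with an approximate
# 95% confidence interval for the population mean.
standard_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"mean = {sample_mean:.2f}, sd = {sample_sd:.2f}")
print(f"approximate 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```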

<a id='univariate_bivariate'></a>
## Univariate vs. bivariate statistics and parameters

--- 

Formally, 

> A **statistic** is a function of the data. 

Imagine a function that calculates standard deviation; we input every data point (the sample), and the output is the value of the sample standard deviation.

> A **parameter** is a characteristic of the population. 

This might be the true average height of all undergraduate women at Ohio State University, the true median salary in the United States, or the true standard deviation of hours of Netflix watched among users between 18 and 35 years old.

Unless we have access to every observation in the population and the ability to measure each one, we cannot know the true value of the parameter of interest. However, we can draw a sample and calculate a *statistic*, which should be a reasonably precise estimate of our parameter provided the sample is representative and large enough and the measurements are done properly.

**Put more succinctly: "Statistics estimate parameters."**
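
A small simulation can make the distinction concrete. Here the whole population is generated in code, so the parameter can be computed directly, which is rarely possible in practice. The population size, mean, and spread are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of 25,000 heights in inches (simulated).
population = [rng.gauss(64, 2.5) for _ in range(25_000)]
parameter = statistics.mean(population)  # the true population mean

# In practice we only observe a sample, so we compute a statistic instead.
sample = rng.sample(population, 250)
statistic = statistics.mean(sample)  # the sample mean, an estimate of the parameter

print(f"parameter: {parameter:.2f}, statistic: {statistic:.2f}")
```

Drawing a different sample would give a slightly different statistic, while the parameter stays fixed.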

---

### Univariate Statistics and Parameters

In a univariate (one variable) case, we are interested in describing the distribution of a variable, where "distribution" is the set of all possible values the variable can take on, as well as how frequently it takes on those values.

**The most important aspects of a variable's distribution:**
- What is the center of the distribution? (Mean, median, mode)
- What is the spread around the center of the distribution? (Standard deviation/variance/range/IQR)
- What is the shape of the distribution? (Skewed/symmetric, unimodal/multimodal)
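
The measures of center and spread listed above can be computed with Python's built-in `statistics` module (`statistics.quantiles` requires Python 3.8 or later). The data values below are made up and deliberately right-skewed.

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 40]  # invented values with one large outlier

# Center
mean = statistics.mean(data)
median = statistics.median(data)  # 3
mode = statistics.mode(data)      # 2

# Spread
sd = statistics.stdev(data)           # sample standard deviation
variance = statistics.variance(data)  # sample variance (sd squared)
value_range = max(data) - min(data)   # 39
q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1

# Shape: a mean well above the median suggests right skew.
print(mean > median)  # True
```

The single value of 40 pulls the mean and standard deviation upward, while the median and IQR barely react to it.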


**Check:** Can measures of spread be zero? If so, when? If not, why not? Can measures of spread be negative?

**Check:** For skewed data, which measure(s) of center and spread is/are most appropriate? Why?

**Check:** For symmetric data, which measure(s) of center and spread is/are most appropriate? Why?



---

### Bivariate Statistics and Parameters

When working within a univariate case, we're interested in knowing what the distribution of a particular variable looks like. 

In a **bivariate** (two variable) case, we're more interested in the relationship between two variables. The most common measures are the correlation (Pearson correlation coefficient) and the covariance.



<a id='correlation'></a>
## Pearson's Correlation

---

Correlation (the Pearson correlation coefficient) measures the *strength and direction* of the *linear* relationship between two variables and can take on any value between -1 and +1. 

The sample correlation is denoted by $r$, while the population correlation is denoted by $\rho$ (rho).

**Interpreting correlation:**
- Values close to -1 or +1 indicate a strong and linear relationship between the two variables. 
- Values close to 0 indicate a weak and/or nonlinear relationship between the two variables. 
- Values above 0 indicate a positive relationship between the two variables. 
- Values below 0 indicate a negative relationship between the two variables.

![](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)
Graphic pulled from [Wikipedia's article on correlation and dependence.](https://en.wikipedia.org/wiki/Correlation_and_dependence)
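
To see these interpretations in numbers, here is a short sketch with three invented relationships: a perfectly linear increasing one, a perfectly linear decreasing one, and a strong but U-shaped one.

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])

r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]  # perfect positive linear relationship
r_negative = np.corrcoef(x, -3 * x)[0, 1]     # perfect negative linear relationship
r_curved = np.corrcoef(x, x ** 2)[0, 1]       # y is fully determined by x, but not linearly

print(r_positive, r_negative, r_curved)
```

The first two values are +1 and -1 (up to floating-point rounding), while the third is 0 even though `x ** 2` depends entirely on `x`. A correlation near 0 rules out a linear relationship, not every relationship.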


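If you want to calculate the correlation between two variables with NumPy, use the following code. Note that `np.corrcoef` returns a correlation matrix; the off-diagonal entries hold the correlation between the two variables.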
```python
import numpy as np
np.corrcoef(var1,var2)
```

In [4]:
import numpy as np
np.corrcoef([1,2,5,2,3,7,1,2,6,8,3,4,2],
            [0,6,2,3,7,1,3,4,21,3,7,4,5])

array([[ 1.        ,  0.21122262],
       [ 0.21122262,  1.        ]])
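
As a check on what `np.corrcoef` computes, the sketch below recalculates $r$ for the same data from the usual definition, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, and pulls the single value out of the matrix. The variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 5, 2, 3, 7, 1, 2, 6, 8, 3, 4, 2])
y = np.array([0, 6, 2, 3, 7, 1, 3, 4, 21, 3, 7, 4, 5])

# Deviations of each observation from its variable's mean.
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson's r from the definition.
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same value is the off-diagonal entry of the 2x2 correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_manual, r_numpy))  # True
```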

### Properties of Pearson's Correlation

$ -1 \le Cor(X, Y) \le 1 $

$ Cor(X,Y) = Cor(Y,X) $

$ Cor(a + bX,c + dY) = Cor(X,Y)$ if $bd > 0$, and

$ Cor(a + bX,c + dY) = -Cor(X,Y) $ if $bd < 0$
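
These properties can be checked numerically. The sketch below reuses the data from the earlier example; the helper `cor` and the constants `a`, `b`, `c`, `d` are arbitrary choices made for illustration.

```python
import numpy as np

x = np.array([1, 2, 5, 2, 3, 7, 1, 2, 6, 8, 3, 4, 2])
y = np.array([0, 6, 2, 3, 7, 1, 3, 4, 21, 3, 7, 4, 5])


def cor(u, v):
    """Return the Pearson correlation between two arrays."""
    return np.corrcoef(u, v)[0, 1]


a, b, c, d = 10, 2, -4, 3

symmetric = np.isclose(cor(x, y), cor(y, x))
same_sign = np.isclose(cor(a + b * x, c + d * y), cor(x, y))       # bd > 0
opposite_sign = np.isclose(cor(a + b * x, c - d * y), -cor(x, y))  # bd < 0

print(symmetric, same_sign, opposite_sign)  # True True True
```

In practical terms, shifting or rescaling a variable (for example, converting inches to centimeters) leaves the correlation unchanged, apart from a sign flip when one variable is multiplied by a negative number.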

<a id='covariance'></a>
## Covariance

---

Covariance measures how two variables vary together. Unlike correlation, it is not scaled by the variables' individual standard deviations; correlation is simply the covariance divided by the product of the two standard deviations.

Covariances can take on any real value, and their magnitude depends on the units in which the variables are measured, which makes a covariance difficult to interpret by itself. It makes more sense to compare two correlations than to compare two covariances.

If you want to calculate the covariance between two variables in Python, use the following code:
```python
import numpy as np
np.cov(var1,var2)
```

In [5]:
np.cov([1,2,5,2,3,7,1,2,6,8,3,4,2],
       [0,6,2,3,7,1,3,4,21,3,7,4,5])

array([[  5.26923077,   2.53846154],
       [  2.53846154,  27.41025641]])
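
The matrix returned by `np.cov` holds the sample variances on the diagonal and the covariance off the diagonal. As a sketch of how covariance and correlation relate, dividing the covariance by the product of the two standard deviations recovers the correlation computed earlier for the same data.

```python
import numpy as np

x = np.array([1, 2, 5, 2, 3, 7, 1, 2, 6, 8, 3, 4, 2])
y = np.array([0, 6, 2, 3, 7, 1, 3, 4, 21, 3, 7, 4, 5])

cov_matrix = np.cov(x, y)
var_x = cov_matrix[0, 0]   # sample variance of x (n - 1 in the denominator)
var_y = cov_matrix[1, 1]   # sample variance of y
cov_xy = cov_matrix[0, 1]  # covariance between x and y

# Scale the covariance by both standard deviations to get the correlation.
r_from_cov = cov_xy / np.sqrt(var_x * var_y)

print(np.isclose(r_from_cov, np.corrcoef(x, y)[0, 1]))  # True
```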

### Properties of Covariance

$Cov(X,X) = Var(X)$

$Cov(X,Y) = Cov(Y,X)$

$Cov(a + bX,c + dY) = bd \cdot Cov(X,Y)$
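
A quick numerical check of these properties, again using the data from the examples above. The helper `cov` and the constants `a`, `b`, `c`, `d` are arbitrary and chosen only for illustration.

```python
import numpy as np

x = np.array([1, 2, 5, 2, 3, 7, 1, 2, 6, 8, 3, 4, 2])
y = np.array([0, 6, 2, 3, 7, 1, 3, 4, 21, 3, 7, 4, 5])


def cov(u, v):
    """Return the sample covariance between two arrays."""
    return np.cov(u, v)[0, 1]


a, b, c, d = 10, 2, -4, 3

variance_case = np.isclose(cov(x, x), np.var(x, ddof=1))               # Cov(X, X) = Var(X)
symmetric = np.isclose(cov(x, y), cov(y, x))                           # Cov(X, Y) = Cov(Y, X)
scaling = np.isclose(cov(a + b * x, c + d * y), b * d * cov(x, y))     # shifts drop out, scales multiply

print(variance_case, symmetric, scaling)  # True True True
```

The last property is why covariance depends on units: rescaling either variable rescales the covariance by the same factor, whereas the correlation is unaffected.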

<a id='additional-resources'></a>
## Additional resources

---

- ["Sampling: Design and Analysis", 2nd ed., Sharon Lohr, 2010, online .pdf](http://evalenzu.mat.utfsm.cl/Docencia/2016/Primer%20semestre/Metodos%20Estadisticos%20en%20Ingenieria/Apunte1.pdf)
- [Stats Cookbook](https://github.com/mavam/stat-cookbook)