One important element of inferential statistics (including, out of necessity, descriptive statistics) can be seen as overcoming less-than-ideal circumstances.

If you want to know the average height and weight of all adult Americans, ideally, you would measure the height and weight of every adult American, then calculate the averages.  If you were able to accurately obtain all of these measures, you would have exact numbers for the average height and weight of all adult Americans (at least at one point in time).  While this would be ideal, it is not realistic.

A __'population'__ consists of all members belonging to the group of interest, in this case, the population is all American adults (let's say every person living in the US, regardless of citizenship status, age 18 or older).  Where, due to the constraints of time, money or feasibility, it is not possible to record measurements for all of the variables of interest for all members of the population of interest, one may elect to use a subset of the population (a __'sample'__).  It is imperative to eliminate or minimize random and systematic errors while designing the study, as well as the biases one may encounter when selecting the members of the sample.  The results of each measurement for each observation are collected into a table known as a __'DataFrame'__.

Let's say we're able to conduct a study of 500 randomly-selected Americans.  We want to record the gender, age, race, height and weight for each member in the sample data set.  Rather than using their names, we'll assign an ID to each.

Each row of the DataFrame represents a single __'observation'__.  In our case, that would be the gender, age, race, height and weight of a single person from our sample set.  While designing our study, we must establish a system for measuring and recording data.  As an example, height could be measured in inches or centimeters (numeric, discrete data type), or fractions thereof (could be seen as numeric, continuous), or, it could be binned into bins with 2-inch intervals (categorical).  The same data could be treated in a variety of ways, depending upon the goals of the study.  In designing our study, we've decided to assign data types as follows:  
>gender - categorical, binary  
age - numeric, discrete, but we'll bin these into age ranges in 10-year increments, making them categorical  
race - categorical  
height - numeric, discrete  
weight - numeric, continuous  

Each column of the DataFrame logs the outcome of each observation for that column's variable, transformed in a logical and consistent manner into a single numeric or categorical data type in order to facilitate the mathematics needed to study the results for that variable across all observations.

Once we've collected and recorded all of our observations, we can begin to study each variable in a process referred to as __'Exploratory Data Analysis'__ or __'EDA'__.  Let's start with height.  

Each variable is studied for its centrality and dispersion.  The recorded measurements for the variable of interest for each observation (all of the values in a single column of our DataFrame) are compiled and used to find the average value of the variable (its __'mean'__), as well as the distribution of all values.  Imagine the distribution as a cluster of values centered around the mean.  EDA allows us to represent how far each value is from the mean (its __'deviation'__), and how many other values are the same as or close to that value (the __'frequency'__ or __'density'__ of the distribution).

EDA 'collapses' all observed values of a variable, allowing us to describe the variable's results in terms of a single number, or a range of values.  We can also represent the values of the variable visually, in a number of different graphs.  Representing distributions using a histogram, PDF, CDF or PMF (provide a short summary of the differences between these).  Something about how data type affects tools used?

Empirical v analytic distributions.  Why model an empiric distribution as an analytic?  Think Stats explanation.  Analytic dist can smooth empiric data points and help in generating estimates of the centrality and distribution of the inferred population and confidence levels.  

Central Limit Theorem, Z-test and t-test.

Inferential stats used to estimate the actual values of the entire population.  Use probability to generate a point estimate and/or a range of possible values and a confidence interval that represents likelihood of another sample set resulting within the range of values indicated by the confidence interval.

Bootstrapping

Hypothesis testing

Relationships between two or more variables may be studied... correlation coefficient, linear regression, multivariate linear regression, etc.





In [2]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats


# 1. __Data Types__
***


## 1.1. Numeric (Quantitative) Data

### 1.1.1. Continuous
Data that can take on any value (i.e., an infinite number of values) in an interval.  
__Examples__: wind speed, temperature, time duration   
__Describing__:  
__Hypothesis Testing__:  

### 1.1.2. Discrete
Data that can take on only integer values.  
__Examples__: counts, age range bins  
__Describing__:  
__Hypothesis Testing__:  

## 1.2. Categorical (Qualitative) Data

### 1.2.1. Categorical  
Data that can only take on a specific set of values representing a set of possible categories.  
__Examples__: U.S. States, display types  
- The Likert scale might be the most common type of rating scale used in human-subject research.  Example:  
>The United States should adopt a national system of health insurance.   
>()Strongly agree ()Agree ()Neither agree nor disagree ()Disagree ()Strongly disagree  

__Describing__:  
__Hypothesis Testing__:  

### 1.2.2. Binary
A subcategory of categorical data with just two categories.  
__Examples__: True/False, Male/Female    
__Describing__:  
__Hypothesis Testing__:  

### 1.2.3. Ordinal
Categorical data that has some meaningful order, so that higher values represent more of some characteristic than lower values.  
__Examples__: numeric ratings, such as 5-stars or Likert Scale   
__Describing__:  
__Hypothesis Testing__:  

In [None]:
# generate 500 samples
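
A minimal sketch of what the empty cell above might hold, assuming synthetic data rather than real survey results. The category labels, the distribution parameters (means and standard deviations for height and weight), the column names, and the name `sample_df` are all illustrative assumptions, not values from an actual study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is repeatable
n = 500

# Hypothetical category labels and age range; placeholders only
gender = rng.choice(["Female", "Male"], size=n)
age = rng.integers(18, 90, size=n)  # whole years, 18 to 89
race = rng.choice(["Group A", "Group B", "Group C", "Group D"], size=n)

# Made-up distribution parameters (inches, pounds), chosen only for illustration
height = np.round(rng.normal(loc=67, scale=4, size=n)).astype(int)  # discrete
weight = rng.normal(loc=180, scale=35, size=n)                      # continuous

sample_df = pd.DataFrame(
    {"gender": gender, "age": age, "race": race, "height": height, "weight": weight},
    index=pd.RangeIndex(1, n + 1, name="id"),  # anonymous ID instead of a name
)

# Bin age into 10-year ranges, turning it into an ordered categorical variable
sample_df["age_range"] = pd.cut(
    sample_df["age"], bins=list(range(10, 100, 10)), right=False
)

print(sample_df.head())
print(sample_df.dtypes)
```

Each row is one observation and each column holds one variable in a single, consistent data type, which matches the layout described in the introduction.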


# 2. __Measuring & Collecting Data__
***
Before you can use statistics to analyze a problem, you must convert information about the problem into data.  That is, you must establish or adopt a system of assigning values, most often numbers, to the objects or concepts that are central to the problem in question.  Measurement is the process of systematically assigning numbers to objects and their properties to facilitate the use of mathematics in studying and describing objects and their relationships. 

__Operationalization__ is the process of specifying how a concept will be defined and measured.  
Operationalization is necessary when a quality of interest cannot be measured directly, such as intelligence. There is no way to measure intelligence directly, so in the place of such a direct measurement, we accept something that we can measure, such as the score on an IQ test.

A __proxy measurement__ is the process of substituting one measurement for another.  One example is the methods police officers use to evaluate the sobriety of individuals while in the field. Lacking a portable medical lab, an officer can’t measure a driver’s BAC directly to determine whether the driver is legally drunk. Instead, the officer might rely on observable signs that are believed to correlate well with BAC.  

## 2.1. __True and Error Scores__  
We assume that all measurements contain some error.  
__Random error__ is error due to chance: it has no particular pattern and is assumed to cancel itself out over repeated measurements.

__Systematic error__ has an observable pattern, is not due to chance, and often has a cause or causes that can be identified and remedied.

We can learn to live with random error while doing whatever we can to avoid systematic error. 

## 2.2. __Reliability and Validity__
There are many ways to assign numbers or categories to data, and not all are equally useful. Two standards we commonly use to evaluate methods of measurement (for instance, a survey or a test) are reliability and validity.

__Reliability__ refers to how consistent or repeatable measurements are.  

3 Approaches to Measuring Reliability
1. _Multiple-occasions reliability_ (or, test-retest reliability), refers to how similarly a test or scale performs over repeated administration.  
2. _Multiple-forms reliability_ (or, parallel-forms reliability), refers to how similarly different versions of a test or questionnaire perform in measuring the same entity.  
3. _Internal consistency reliability_ refers to how well the items that make up an instrument (for instance, a test or survey) reflect the same construct.  


__Validity__ refers to how well a test or rating scale measures what it is supposed to measure.  Validation is the process of gathering evidence to support the types of inferences intended to be drawn from the measurements in question.  

4 Approaches to Measuring Validity
1. _Content validity_ refers to how well the process of measurement reflects the important content of the domain of interest and is of particular concern when the purpose of the measurement is to draw inferences about a larger domain of interest.
2. _Face validity_ is closely related to content validity and refers to whether the test in question appears, on its face, to be a fair assessment of the qualities under study.  Example: if parents don't agree that your geometry test is a good or accurate test of students' knowledge of geometry, the test lacks face validity and parents might disagree with your assessment of their student's knowledge.
3. _Concurrent validity_ refers to how well inferences drawn from a measurement can be used to predict some other behavior or performance that is measured at approximately the same time.  Example: if an achievement test score is highly related to contemporaneous school performance, the test has high concurrent validity.
4. _Predictive validity_ is similar to concurrent validity but concerns the ability to draw inferences about some event in the future.  Example: a test has high predictive validity if the score is highly related to school performance the following year.


__Triangulation__ Because every system of measurement has its flaws, researchers often use several approaches to measure the same thing.  We expect that each measurement contains error, but we hope it does not include the same type of error, so that through multiple types of measurement, we can get a reasonable estimate of the quantity or quality of interest.

Hiring decisions in a company are usually made after consideration of several types of information, including an evaluation of each applicant’s work experience, his education, the impression he makes during an interview, and possibly a work sample and one or more competency or personality tests.


## 2.3. __Measurement Bias__
Measurement bias is a source of systematic error, which can lead to false conclusions despite the application of correct statistical procedures and techniques.  Failure to eliminate or minimize measurement bias can invalidate the results of an otherwise exemplary study.  

2 Primary Ways Bias can Enter into a Study
1.  __Selection and retention of subjects__
    - _Selection bias_ exists if some potential subjects are more likely than others to be selected - usually reserved for bias that occurs due to the process of sampling.  
    - _Volunteer bias_ refers to the fact that people who volunteer to be in studies are usually not representative of the population as a whole.   
    - _Nonresponse bias_ refers to the other side of volunteer bias.  Just as people who volunteer to take part in a study are likely to differ systematically from those who do not, so people who decline to participate in a study when invited to do so very likely differ from those who consent to participate.   
    - _Informative censoring_ can create bias in any long-term (longitudinal) study. Losing subjects during a long-term study is a common occurrence, but the real problem comes when subjects do not drop out at random but for reasons related to the study’s purpose. 
2.  __Method of information collection__  
    - _Interviewer bias_ is introduced into the data collected because of the attitudes or behavior of the interviewer.   
    - _Recall bias_ refers to the fact that people with a life experience such as suffering from a serious disease or injury are more likely to remember events that they believe are related to that experience.  
    - _Detection bias_ refers to the fact that certain characteristics may be more likely to be detected or reported in some people than in others.  
    - _Social desirability bias_ is caused by people’s desire to present themselves in a favorable light. This often motivates them to give responses that they believe will please the person asking the question.  


# 3. __Sampling Data__

# 4. __Probability__
***
## 4.1. __Overview__
__Probability__ the probability of an event (__$P(E)$__) is always between 0 (the event will never occur) and 1 (the event will always occur).

__Sample Space__ (__$S$__), is the set of all possible outcomes of a trial.  Therefore, the probability of the sample space is always 1.  If the trial is flipping a coin once, then the sample space is $S$ = {heads, tails} because those two alternatives represent all the possible outcomes for the experiment.  

__4 Fun Facts about Probability__  
_1: the probability of an event is always between 0 and 1_  
$0 \le P(E) \le 1$

_2: the sample space, S, includes all possible outcomes of a trial_  
$P(S) = 1$  

_3: the probabilities of an event and its complement always sum to 1_  
$P(E) + P(\sim E) = 1$   

_4: the corollary of Fact #3: the probability of the complement of an event is always 1 - the probability of the event_  
$P(\sim E) = 1 - P(E)$  

## 4.2 __Events__  
An event (__$E$__) is the specification of the outcome of a trial and can consist of either a single outcome or a set of outcomes.  

- __simple event__ is the outcome of a single experiment or observation, such as a single coin flip. 
- __compound event__ simple events can be combined into compound events.  If the experiment consists of multiple trials, all possible combinations of outcomes of the trials must be specified as part of the sample space. For instance, if the trial consists of flipping a coin twice, the sample space is S = {( h, h), (h, t), (t, h), (t, t)}.

## 4.3. __Compound Events__

### 4.3.1. Union - 'either E or F or both E and F'
The union of several simple events creates a compound event that occurs if one or more of the events occur. The union of E and F is written E ∪ F and means “either E or F or both E and F.”  The union of E and F is the shaded area in the Venn diagram.  

![Union](Venn_Union.png)  

### 4.3.2. Intersection - 'both E and F'
The intersection of two or more simple events creates a compound event that occurs only if all the simple events occur. The intersection of E and F is written E ∩ F and means “both E and F.” The intersection of E and F is the shaded area in the Venn diagram below; note that only points that belong to both E and F satisfy the condition.

![Intersection](Venn_Intersection.png)

### 4.3.3. Complement - 'not E' or 'E complement'
The complement of an event means everything in the sample space that is not that event. The complement of event E is written variously as $\sim E$, $E^c$, or Ē, and is read as “not E” or “E complement.”

![Complement](Venn_Complement.png)

### 4.3.4. Mutual Exclusivity - 'either E or F'
If events cannot occur together, they are mutually exclusive. To put it another way, if two events have no outcomes in common, they are mutually exclusive.

![Mutual Exclusivity](Venn_Mutually_Exclusive.png)
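
The compound events above map directly onto Python's built-in set operations. The sketch below reuses the two-coin-flip sample space from section 4.2 and assumes a fair coin, so every outcome is equally likely; the events `E`, `F`, `G` and the helper `prob` are names introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for flipping a fair coin twice; all four outcomes equally likely
S = set(product("ht", repeat=2))

def prob(event):
    """Probability of an event (a subset of S) when outcomes are equally likely."""
    return Fraction(len(event), len(S))

E = {o for o in S if o[0] == "h"}  # first flip is heads
F = {o for o in S if o[1] == "h"}  # second flip is heads
G = {("t", "t")}                   # both flips are tails

print("P(E or F)  =", prob(E | F))  # union
print("P(E and F) =", prob(E & F))  # intersection
print("P(not E)   =", prob(S - E))  # complement
print("E and G mutually exclusive:", E.isdisjoint(G))
```

Using `Fraction` keeps the probabilities exact, which makes it easy to confirm that $P(E) + P(\sim E) = 1$.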

## 4.4. Independence, Permutations & Combinations
__Independence__  If two trials are independent, the outcome of one trial does not influence the outcome of another. To put it another way, if the trials are independent, knowing the outcome of one trial gives you no information about the outcome of the other.  

__Permutations__ all the possible ways elements in a set can be arranged, taking into account the order of the outcomes.  The number of potential outcomes is calculated by using factorials.  

__Combinations__ are similar to permutations with the difference that the order of elements is not significant in combinations.  
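
As a small worked example of the factorial calculations mentioned above: the number of ordered selections of $k$ items from $n$ is $n!/(n-k)!$, and the number of unordered selections is $n!/(k!(n-k)!)$. The values of `n` and `k` below are arbitrary, chosen only for illustration.

```python
import math

n, k = 5, 3  # hypothetical: select 3 items from a set of 5

arrangements = math.factorial(n)                            # orderings of all n items
permutations = math.factorial(n) // math.factorial(n - k)   # ordered selections of k
combinations = permutations // math.factorial(k)            # unordered selections of k

# math.perm and math.comb (Python 3.8+) compute the same quantities directly
assert permutations == math.perm(n, k)
assert combinations == math.comb(n, k)

print(arrangements, permutations, combinations)
```

Every combination of $k$ items corresponds to $k!$ permutations, which is why the combination count is the permutation count divided by $k!$.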

# 5. __Describing Data__
## Overview:
Descriptive Statistics describe the center, variability and distribution of a feature (aka variable, outcome, or observation).  The measured quantity is referred to as a statistic if the values were taken from a sample or a parameter if the values were taken from a population.  We can also create a variety of graphs to illustrate a statistic's center and frequency distributions, or to compare two or more variables.  

Measures of central tendency show the most common values; the most frequently-observed values.  Measures of variability or dispersion show the density and distribution of values for the variable being studied; how concentrated values are around the center and how widely-distributed they can be.  The distribution of a data set is a listing or a function showing all the possible values (or intervals) of the data and how often they occur.  Statistics has identified a variety of __[analytic (theoretical) distribution shapes](https://www.wolfram.com/mathematica/new-in-8/parametric-probability-distributions/HTMLImages.en/univariate-continuous-distributions/O_1.png)__ (standard normal, binomial, lognormal, etc.).  If you know that an empirical distribution fits an analytic distribution, you can apply what is known about that analytic distribution to your data.  

You can standardize values from different distributions (normal or not) in order to compare the probability or percentile values of outcomes from different samples or populations.  This allows you to compare the results in a relative sense - which is more unlikely or extraordinary?  The standard normal distribution, or $Z$-distribution, has a mean of 0, a standard deviation of 1.  For example, you want to compare one student's ACT score to another student's SAT score.  Looking at an ACT score of 31 and an SAT score of 1500, which result is more extraordinary?  Standardize the distributions for ACT scores and SAT scores and then compare the $z$-scores for 31 and 1500.  One of them will be further from the mean than the other, making that the comparatively 'better' score.
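
The ACT/SAT comparison can be sketched as follows. The means and standard deviations are hypothetical round numbers chosen for illustration, not official figures, and the percentile step additionally assumes that both score distributions are approximately normal.

```python
from scipy import stats

# Hypothetical population parameters, chosen as round numbers for illustration only
act_mean, act_sd = 21.0, 5.0
sat_mean, sat_sd = 1050.0, 200.0

act_score, sat_score = 31, 1500

# z-score: how many standard deviations a value lies from its mean
z_act = (act_score - act_mean) / act_sd
z_sat = (sat_score - sat_mean) / sat_sd

# Percentile under the additional assumption that scores are normally distributed
pct_act = stats.norm.cdf(z_act) * 100
pct_sat = stats.norm.cdf(z_sat) * 100

print(f"ACT {act_score}: z = {z_act:.2f}, about the {pct_act:.1f}th percentile")
print(f"SAT {sat_score}: z = {z_sat:.2f}, about the {pct_sat:.1f}th percentile")
```

Under these assumed parameters the SAT score has the larger $z$-score, so it would be the more extraordinary result; different parameters could reverse the conclusion.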

## 5.1. __Estimating the Center__
Getting a typical value for each feature: an estimate of where most of the data is located (its central tendency).

***
### Mean  
The sum of all values divided by the number of values.  
- `statistics.mean(array)` `np.mean(array)` `df.mean()` 
- gives an indication of the typical magnitude of a measurement  
- appropriate for interval and ratio data and the mean of dichotomous variables coded as 0 or 1
- useful, but not robust - outliers can skew the mean  


$$Sample\ Mean = \bar x = \frac{1}{n}{\sum_{i=1}^n x_i}$$  
$$Population\ Mean = \mu = \frac{1}{n}{\sum_{i=1}^n x_i}$$

>sum all values from $x_1 ... x_n$ then divide by $n$  
>$\bar x$: the sample mean (a statistic)  
>$\mu$: the population mean (a parameter)  
>$n$ is the number of values in the sample or population  
>$x_i$ is the value of $x$ for a particular case  

- Variations:
    - __Weighted mean__:  The sum of all values times a weight divided by the sum of the weights. 
        - `np.average(array, weights=weights)`
        - highly variable observations may be given a lower weight  
        - can give higher weights to values from groups that were underrepresented 
    - __Trimmed mean__:  The average of all values after dropping a fixed number of extreme values from each end of the sorted data.  
        - `stats.trim_mean(array, proportiontocut)` (from `scipy.stats`)  
        - widely used to avoid the influence of outliers  
        - compromise between the median and the mean: it's robust to outliers, but uses more data to calculate the location  

$$Sample\ Weighted\ Mean = \bar x_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$$  

$$Sample\ Trimmed\ Mean = \bar x = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n-2p}$$  
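
A short sketch comparing the three means on a small made-up array with one extreme value. The data and the weights are invented for illustration; `proportiontocut=0.1` trims 10% of the values (here, one value) from each end.

```python
import numpy as np
from scipy import stats

# Made-up data with a single outlier (100.0)
values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 15.0, 100.0])
# Hypothetical weights that down-weight the outlier
weights = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0.1])

mean = np.mean(values)
weighted_mean = np.average(values, weights=weights)  # weights must be a keyword
trimmed_mean = stats.trim_mean(values, proportiontocut=0.1)

# Cross-check the trimmed mean against the formula with p = 1
p = 1
manual_trimmed = np.sort(values)[p:len(values) - p].sum() / (len(values) - 2 * p)

print(f"mean = {mean:.2f}")
print(f"weighted mean = {weighted_mean:.2f}")
print(f"trimmed mean = {trimmed_mean:.2f} (manual: {manual_trimmed:.2f})")
```

Note that the second positional argument of `np.average` is `axis`, so the weights have to be passed by name.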

***
### Median
The middle value of the sorted data: one-half of the values lie above it and one-half lie below it.  
- `np.median(array)`  
- robust - outliers have little effect on the median  
- Variations:  
    - __Weighted Median__:  The value in the sorted data such that the sum of the weights below it equals the sum of the weights above it.  

***
### Mode
The most frequently occurring value. 
- `df.mode()`  
- most often useful in describing ordinal or categorical data  

***
### Mean vs. Median vs. Mode 

Mode is most useful with ordinal or categorical data.  

If the distribution is:
- unimodal & symmetrical, the mean, median and mode are all the same value.   
- right-skewed: mean is higher than the median bc extreme higher values pull the mean up, but not the median.     
- left-skewed: mean is lower than the median bc extreme lower values pull the mean down, but not the median.        
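
The effect of skew on the mean and median can be checked with synthetic data. The sketch below assumes an exponential distribution as a stand-in for right-skewed data (and its mirror image for left-skewed data); the Likert-style responses used for the mode are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic data: the exponential distribution has a long right tail
right_skewed = rng.exponential(scale=1.0, size=10_000)
left_skewed = -right_skewed  # mirror image, long left tail

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {np.mean(data):.3f}, median = {np.median(data):.3f}")

# The mode suits categorical or ordinal data; these responses are invented
responses = pd.Series(["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral"])
most_common = responses.mode()[0]
print("mode:", most_common)
```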
  

## 5.2. __Estimating Variability__

Variability refers to whether the values are tightly clustered or spread out.  At the heart of statistics lies variability: measuring it, reducing it, distinguishing random from real variability, identifying the various sources of real variability and making decisions in the presence of it.

***
### Range & Percentile  
- __order statistics__: statistics based on sorted(ranked) data.  
- __range__: spread between the minimum and maximum values.  
    - extremely sensitive to outliers  
    - not particularly useful   
- __Interquartile Range (IQR)__: difference between the 25th and 75th percentiles  
- __Percentile__: the value such that x% of the values in the array are at or below it  
    - `np.percentile(array, [25, 50, 75])`
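
A brief sketch of the order statistics above on a small invented array with one outlier. It relies on NumPy's default linear interpolation for percentiles.

```python
import numpy as np

# Made-up data with a single outlier (100)
values = np.array([1, 3, 5, 7, 9, 11, 13, 15, 100])

data_range = values.max() - values.min()              # driven entirely by the outlier
q1, median, q3 = np.percentile(values, [25, 50, 75])  # quartiles
iqr = q3 - q1                                         # spread of the middle 50%

print("range:", data_range)
print("quartiles:", q1, median, q3)
print("IQR:", iqr)
```

The range jumps to 99 because of the single outlier, while the IQR is unaffected by it, which illustrates why the range is rarely useful on its own.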
        

### Variance & Standard Deviation
Standard deviation and variance are the most common measures of dispersion for continuous data, and both describe how much the individual values in a data set vary from the mean.  Because they are based on the mean, they are also both sensitive to outliers.  

***
#### Variance 
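The sum of squared deviations from the mean divided by n-1 (for a sample) or by n (for a population), where n is the number of data values.   
- `np.var(array)` divides by n (the population formula); `np.var(array, ddof=1)` divides by n-1 (the sample formula)  
- average of the squared deviations from the mean  
- not robust to outliers   

$$Population\ Variance = \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$$  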

 
$$Sample\ Variance = s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$$



***
#### Standard Deviation
The square root of the variance.  
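- `np.std(array)` divides by n (the population formula); `np.std(array, ddof=1)` divides by n-1 (the sample formula)  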
- much easier to interpret than the variance since the scale and units are the same as the original data  
- not robust to outliers  

$$Population\ Standard\ Deviation = \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2}$$  

$$Sample\ Standard\ Deviation = s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2}$$

***
__Relationship Between Variance and Standard Deviation__

$$Standard\ Deviation = s = \sqrt{variance}$$

$$Population: \sigma = \sqrt{\sigma^2}$$  

$$Sample: s = \sqrt{s^2}$$
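
The population and sample formulas differ only in the divisor, which NumPy controls through the `ddof` ("delta degrees of freedom") argument. A minimal sketch on made-up values:

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

pop_var = np.var(values)             # divides by n (NumPy's default, ddof=0)
sample_var = np.var(values, ddof=1)  # divides by n - 1

pop_std = np.std(values)
sample_std = np.std(values, ddof=1)

print(f"population: variance = {pop_var:.3f}, std = {pop_std:.3f}")
print(f"sample:     variance = {sample_var:.3f}, std = {sample_std:.3f}")

# The standard deviation is the square root of the matching variance
assert np.isclose(pop_std, np.sqrt(pop_var))
assert np.isclose(sample_std, np.sqrt(sample_var))
```

pandas uses the opposite default: `Series.var()` and `Series.std()` use `ddof=1`, so the same column can give slightly different results in NumPy and pandas unless `ddof` is set explicitly.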

#### Standard Error
The standard error is a single metric that sums up the variability in the sampling distribution for a statistic. The standard error of the mean can be estimated from the standard deviation $s$ of the sample values and the sample size $n$:

$$Standard\ Error = SE = \frac{s}{\sqrt{n}}$$

- As sample size increases, the standard error decreases  
- rather than collecting additional samples, you can use bootstrap resamples. In modern statistics, the bootstrap has become the standard way to estimate standard error. It can be used for virtually any statistic and does not rely on the central limit theorem or other distributional assumptions.  
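
A sketch comparing the formula-based standard error of the mean, $s/\sqrt{n}$, with a bootstrap estimate. The sample is synthetic (normally distributed "heights" with made-up parameters), and 2000 resamples is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic sample; the distribution parameters are made up for illustration
sample = rng.normal(loc=67, scale=4, size=500)

# Formula-based standard error of the mean: s / sqrt(n)
se_formula = sample.std(ddof=1) / np.sqrt(sample.size)

# Bootstrap: resample with replacement, record the mean of each resample,
# then take the standard deviation of those means
n_resamples = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_resamples)
])
se_bootstrap = boot_means.std(ddof=1)

print(f"formula SE   = {se_formula:.4f}")
print(f"bootstrap SE = {se_bootstrap:.4f}")
```

The two estimates should be close for the mean. The advantage of the bootstrap is that the same loop works for statistics with no simple formula, such as the median.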

#### Standard Deviation vs. Standard Error  
- Standard Deviation measures the variability of individual data points
- Standard Error measures the variability of a sample metric

- _Median Absolute Deviation from the Median (MAD)_: the median of the absolute values of the deviations from the median.  
    - robust to outliers  

- _Mean Absolute Deviation_: the average of the absolute values of the deviations from the mean.  
    - not robust to outliers  
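
The two absolute-deviation measures can be computed with plain NumPy. The array below is invented, with one outlier to show the difference in robustness.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with one outlier

# Median absolute deviation from the median (MAD)
mad = np.median(np.abs(values - np.median(values)))

# Mean absolute deviation from the mean
mean_abs_dev = np.mean(np.abs(values - np.mean(values)))

# Sample standard deviation, for comparison
std = np.std(values, ddof=1)

print(f"MAD = {mad}, mean absolute deviation = {mean_abs_dev:.1f}, std = {std:.1f}")
```

The outlier inflates the mean absolute deviation and the standard deviation, while the MAD stays at the scale of the bulk of the data.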

***
## Central Limit Theorem
<blockquote><p>The central limit theorem states that the sampling distribution of the sample mean approximates the normal distribution, regardless of the distribution of the population from which the samples are drawn if the sample size is sufficiently large. This fact enables us to make statistical inferences based on the properties of the normal distribution, even if the sample is drawn from a population that is not normally distributed.</p>
<p>Boslaugh, Sarah. Statistics in a Nutshell: A Desktop Quick Reference (p. 59). O'Reilly Media. Kindle Edition.</p>
</blockquote>
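
The theorem can be illustrated with a small simulation. The sketch assumes an exponential population (strongly right-skewed, with mean 2 and standard deviation 2), a sample size of 50, and 5000 repeated samples; all of these are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

pop_mean, pop_sd = 2.0, 2.0  # an exponential with scale=2 has mean 2 and sd 2
n, n_samples = 50, 5000

# Draw 5000 samples of size 50 and compute the mean of each one
samples = rng.exponential(scale=2.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print(f"mean of sample means = {sample_means.mean():.3f} (population mean {pop_mean})")
print(f"sd of sample means   = {sample_means.std(ddof=1):.3f} "
      f"(sigma / sqrt(n) = {pop_sd / np.sqrt(n):.3f})")
print(f"skewness: population draws = {stats.skew(samples.ravel()):.2f}, "
      f"sample means = {stats.skew(sample_means):.2f}")
```

The sample means cluster around the population mean with a spread close to $\sigma/\sqrt{n}$, and their skewness is much smaller than that of the individual draws.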


## Representing Distributions
### Histogram
### PMF
### CDF
### PDF
## Empiric v Analytic Distributions

# 6. __Visualizing Data__

# 7. __Estimating from Samples__

Confidence Intervals
https://stackoverflow.com/questions/28242593/correct-way-to-obtain-confidence-interval-with-scipy
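
A minimal sketch of a $t$-based 95% confidence interval for a mean using `scipy.stats`, assuming the sample mean is approximately normally distributed. The sample is synthetic and its parameters are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
sample = rng.normal(loc=67, scale=4, size=500)  # synthetic data, made-up parameters

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean: s / sqrt(n), with ddof=1

# 95% interval from the t-distribution with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, sample.size - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The confidence level and the degrees of freedom are passed positionally, which sidesteps a keyword rename for the first argument between SciPy versions.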

# 8. __Hypothesis Testing__

Statistical inference is the process of accounting for random variation in data as you draw conclusions.  Inference procedures, which are designed to protect you from being fooled by random variation, fall into two categories:  

__Confidence intervals__ answer the question, “How much chance error might there be in this measurement or estimate or model, owing to the luck of the draw in who/what gets selected in a sample?”

__Hypothesis tests__ answer the question, “Might this apparently interesting result have happened by chance, owing to the luck of the draw in who/what gets selected in sample(s) or assigned to different treatments?”

A __null model__ is an imaginary chance model (e.g., box with slips of paper) representing the idea that nothing new or novel is going on or that there is no difference between treatments A and B.  

The __p-value__ is the frequency with which a result as extreme as the observed result occurs just by chance, drawing from the null hypothesis model.  

The __significance level, α__ is a threshold probability level set before performing a hypothesis test. If the hypothesis test yields a p-value equal to or less than α, the result is deemed statistically significant.  

Using the resampling distribution, you can determine what value of the test statistic corresponds to a given alpha level. For alpha = 0.05, for example, you would find the value of the resampled test statistic that corresponds to the 95th percentile (1−alpha = 0.95). This is termed the __critical value__. In the formula approach, you would be calculating the critical value for a standardized test statistic such as a t-statistic.     
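
The resampling approach to the p-value and critical value can be sketched as a permutation test. The two groups below are synthetic, the test statistic (difference in means) and the one-sided alternative are choices made for this example, and 5000 permutations is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic treatment groups; the distribution parameters are made up
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=12.0, scale=2.0, size=40)

observed = group_b.mean() - group_a.mean()

# Null model: group labels are arbitrary, so shuffle the pooled values
pooled = np.concatenate([group_a, group_b])
n_a = group_a.size
n_permutations = 5000
perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

alpha = 0.05
critical_value = np.percentile(perm_diffs, 100 * (1 - alpha))  # 95th percentile
p_value = np.mean(perm_diffs >= observed)  # one-sided: as extreme or more

print(f"observed difference = {observed:.3f}")
print(f"critical value      = {critical_value:.3f}")
print(f"p-value             = {p_value:.4f}")
print("statistically significant:", p_value <= alpha)
```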

The __Central Limit Theorem (CLT)__ says that the means drawn from multiple samples will be Normally distributed, even if the source population is not Normally distributed, provided that the sample size is large enough and the departure from Normality is not too great.

Many books state that the “large enough” range is a sample size of 20–30, but they leave unanswered the question of how non-Normal a population must be for the Central Limit Theorem to not apply.

The Central Limit Theorem allows Normal-approximation formulas to be used in calculating sampling distributions for inference, that is, confidence intervals and hypothesis tests. With the advent of computer-intensive resampling methods, the Central Limit Theorem is not as important as it used to be because resampling methods for determining sampling distributions are not sensitive to departures from normality.

## 8.1. $t$-test

<table>
    <tr>
        <th>t-test</th>
        <th>Data type</th>
        <th>Question being answered</th>
   </tr>
   <tr>
        <td>One-sample t-test</td>
        <td>One sample, continuous data, approximate normality</td>
        <td>Does the sample come from a population with a specified mean?</td>
   </tr>
   <tr>
        <td>Two-sample t-test</td>
        <td>Two independent samples, continuous data, approximate normality, approximately equal variance</td>
        <td>Do the two samples come from populations with equal means?</td>
   </tr>
   <tr>
        <td>Repeated measures t-test</td>
        <td>Two related samples, equal sample sizes, continuous data, approximate normality of difference scores</td>
        <td>Do the two samples come from populations with equal means?</td>
   </tr>
   <tr>
        <td>Unequal variance t-test</td>
        <td>Two independent samples, continuous data, approximate normality</td>
        <td>Do the two samples come from populations with equal means?</td>
   </tr>
</table>
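
Each row of the table corresponds to a function in `scipy.stats`. The sketch below runs all four on synthetic data; the group names and distribution parameters are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)

# Synthetic data; all parameters are made up
before = rng.normal(loc=67.0, scale=4.0, size=30)
after = before + rng.normal(loc=0.5, scale=1.0, size=30)  # related to `before`
other = rng.normal(loc=69.0, scale=6.0, size=45)          # independent group

one_sample = stats.ttest_1samp(before, popmean=67.0)         # one-sample t-test
two_sample = stats.ttest_ind(before, other, equal_var=True)  # two-sample t-test
welch = stats.ttest_ind(before, other, equal_var=False)      # unequal variance t-test
paired = stats.ttest_rel(before, after)                      # repeated measures t-test

for name, result in [("one-sample", one_sample), ("two-sample", two_sample),
                     ("unequal variance", welch), ("repeated measures", paired)]:
    print(f"{name}: t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The repeated measures test is equivalent to a one-sample test on the difference scores, which is why the table lists normality of the differences as its assumption.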


## 8.2. $Z$-test

### Comparing two sample proportions

$$Z = \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\hat p(1 - \hat p)(\frac{1}{n_1} + \frac{1}{n_2})}}$$

where:
$$\hat p = \frac{Y_1 + Y_2}{n_1 + n_2}$$

### Formula-based alternative: Z-test for proportions
In this section, we discuss the formula-based counterpart to the resampling test for proportions outlined earlier.

The null hypothesis is that the proportions in the two populations, from which the two samples are drawn, are equal.

Let us denote the sample sizes of the two samples as $n_1$ and $n_2$ and the proportions in the two samples as $p_1$ and $p_2$.

As we are testing the hypothesis such that the population proportions are equal, we can obtain the unknown population proportion by pooling the sample proportions together.

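Thus, the sample statistic is the $Z$ given at the start of this section, with the sample proportions written there as $\hat p_1$ and $\hat p_2$; $\hat p$ is the pooled proportion, and $Y_1$ and $Y_2$ are the numbers of successes in the two samples.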

Z follows a standard Normal distribution.

$-Z_{\alpha/2}$ and $Z_{\alpha/2}$ are the $\alpha/2$ percentile and the $(1 - \alpha/2)$ percentile, respectively.

Using the sample data, we calculate the value of the sample statistic $Z$.

If calculated $Z < -Z_{\alpha/2}$ or $Z > Z_{\alpha/2}$, we reject the null hypothesis and we conclude at significance level $\alpha$ that the proportions in the two populations are not equal.

If $-Z_{\alpha/2} < Z < Z_{\alpha/2}$, then we do not reject the null hypothesis and we conclude that the proportions in the populations may be equal.

This formula relies on an approximation: when sample sizes $n_1$ and $n_2$ are large enough and/or the proportions in each sample $p_1$ and $p_2$ are close enough to 0.5, the difference between the two proportions is Normally distributed.

What is large enough? One guideline is that the following conditions must all hold true for this Normal approximation to be valid.

$$n_1 p_1 \ge 5, \quad n_1(1 - p_1) \ge 5, \quad n_2 p_2 \ge 5, \quad n_2(1 - p_2) \ge 5$$

As you can see, low probability events require large sample sizes. A sample size of 100 would just suffice in testing a proportion of 5%. However, if the percentage drops to 1%, the required sample size jumps to 500.

The Z-test for proportions is still frequently encountered, although the resampling permutation test outlined earlier is more versatile. It requires no approximations and can operate with small samples and extreme proportions.   
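
A sketch of the two-proportion $Z$-test, applied to the cholesterol counts quoted later in this section (10 heart attacks among 135 men with high cholesterol, 21 among 470 men with low cholesterol). The two-sided alternative and $\alpha = 0.05$ are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Successes (heart attacks) and sample sizes for the two groups
y1, n1 = 10, 135   # high cholesterol
y2, n2 = 21, 470   # low cholesterol

p1_hat, p2_hat = y1 / n1, y2 / n2
p_pool = (y1 + y2) / (n1 + n2)  # pooled proportion under the null hypothesis

se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
p_value = 2 * stats.norm.sf(abs(z))     # two-sided p-value
reject_null = abs(z) > z_crit

print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.3f}")
print("reject null hypothesis:", reject_null)
```

With these counts the statistic falls inside the critical values, so this test alone would not reject the hypothesis of equal proportions at the 5% level.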

### Confidence interval for a difference in proportions
The connection between cholesterol and heart disease was first suggested back in the 1960s. You can calculate your own risk of a heart attack in the next 10 years at this website: http://hp2010.nhlbihin.net/atpiii/calculator.asp. Cholesterol levels are a key driver in these calculations.

Kahn and Sempos (Statistical Methods in Epidemiology, Oxford Univ. Press, New York, 1989, p. 61) describe some early research that quantified this connection. Men with high cholesterol were found to have heart attacks at a rate that was 64% higher than men with low cholesterol (Table 7.2).

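TABLE 7.2 Cholesterol and Myocardial Infarctions (MI)

<table>
    <tr>
        <th>Cholesterol</th>
        <th>MI</th>
        <th>No MI</th>
        <th>Total</th>
        <th>Proportion with MI</th>
    </tr>
    <tr>
        <td>High</td>
        <td>10</td>
        <td>125</td>
        <td>135</td>
        <td>0.0741</td>
    </tr>
    <tr>
        <td>Low</td>
        <td>21</td>
        <td>449</td>
        <td>470</td>
        <td>0.0447</td>
    </tr>
</table>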

The difference in proportions is:

0.0741 − 0.0447 = 0.0294

In percent: 2.94%

The high cholesterol group's risk of heart attack was 2.9 percentage points higher than the low cholesterol group. In relative terms, this is a major difference, as it is almost 65% higher.

How much might this be in error, based on sampling variation? Find a 95% confidence interval.

RESAMPLING PROCEDURE

1. In one box, the high cholesterol box, put 10 slips of paper marked 1 for heart attacks and 125 slips marked 0 for no heart attacks.  
2. In a second box, the low cholesterol box, put 21 slips of paper marked 1 and 449 slips marked 0.   
3. From the first sample, draw a resample of size 135 randomly and with replacement. Record the proportion of ones.  
4. From the second sample, draw a resample of size 470 randomly and with replacement. Record the proportion of ones.   
5. Record the [result from step three] minus the [result from step four].   
6. Repeat steps three through five 1000 times.   
7. Find the interval that encloses the central 95% of the results, chopping 2.5% off each end. Figure 7.7 illustrates this interval. Specific software procedures for this example using Resampling Stats and StatCrunch can be found in the textbook supplements.

Figure 7.7 Histogram with 95% confidence interval, difference in proportion 1s, resample group of 135 minus resample group of 470.

We read the confidence interval as follows: The original study shows that men with high cholesterol suffer heart attacks at a rate that is 2.9 percentage points higher than men with low cholesterol; the 95% confidence interval runs from 1.4 percentage points lower to 7.7 percentage points higher.
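
The resampling procedure can be sketched in NumPy as follows. The arrays stand in for the two boxes of slips; the seed is arbitrary, so the interval endpoints will differ slightly from the ones quoted above and from run to run with other seeds.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# The two "boxes": 1 = heart attack, 0 = no heart attack
high = np.array([1] * 10 + [0] * 125)  # high cholesterol, n = 135
low = np.array([1] * 21 + [0] * 449)   # low cholesterol, n = 470

observed_diff = high.mean() - low.mean()

n_resamples = 1000
diffs = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample each box with replacement, keeping the original sizes
    high_prop = rng.choice(high, size=high.size, replace=True).mean()
    low_prop = rng.choice(low, size=low.size, replace=True).mean()
    diffs[i] = high_prop - low_prop

# Central 95% of the resampled differences
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"observed difference = {observed_diff:.4f}")
print(f"95% bootstrap CI    = ({ci_low:.4f}, {ci_high:.4f})")
```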

Note that subsequent studies confirmed the original result, and it is now pretty well established that cholesterol and heart disease are related in ways that chance cannot explain.

# 9. __Correlation__