# <font color = 'blue'>Naked Statistics: Stripping the Dread from the Data (Charles Wheelan)

The following are my notes and may contain direct excerpts from the book.

## <font color = 'maroon'>Introduction
- 'The problem is that if the data are poor, or if the statistical techniques are used improperly, the conclusions can be wildly misleading and even potentially dangerous'

*Example:* <br>

According to a study of 36,000 office workers, those workers who reported leaving their offices to take regular ten-minute breaks during the workday were 41% more likely to develop cancer over the next five years than those who don't leave their offices during the workday.<br>
- so, taking short breaks at work causes cancer?
- maybe we just need to think more clearly about what many workers are doing during that 10 minute break (many who leave for breaks are huddled outside smoking cigarettes -- creating a haze of smoke through which other break-takers have to walk in order to get in or out of the building)
- which is the likely cause of cancer -- taking breaks or smoking?
- *'Statistics is like a high-caliber weapon: helpful when used correctly and potentially disastrous in the wrong hands.'*
- 'It's easy to lie with statistics, but it's hard to tell the truth without them' -- Andrejs Dunkels, Swedish mathematician and writer

## <font color = 'maroon'>Chapter 1: What's the Point?

Use data from the 'known world' to make informed inferences about the 'unknown world'
- summarize huge quantities of data
- make better decisions or answer important questions
- recognize patterns that can refine how we do things
- catch cheaters/criminals
- evaluate effectiveness of policies, programs, products, etc.
- know that there are scoundrels out there that will use these powerful tools for nefarious purposes

Descriptive statistics exist to simplify, which always implies some loss of nuance or detail. Overreliance on any descriptive statistic can lead to misleading conclusions, or cause undesirable behavior
- **Gini index**: standard tool in economics for measuring income inequality, collapses complex information into a single number
- **Sampling** is the process of gathering data for a small area and then using those data to make an informed inference about the entire population

- **Regression analysis** enables us to isolate a relationship between two variables, while holding constant (or 'controlling for') the effects of other variables. {see notes on Chapter 11}


## <font color = 'maroon'>Chapter 2: Descriptive Statistics

Descriptive statistics (e.g., mean, median, standard deviation, percentile scores) summarize the information in a data set in a meaningful way that may reveal patterns; however, they do not allow us to draw conclusions beyond the data.

- **percentiles** - divide the distribution into hundredths, this can give you a measure as to how, for example, your score on a test ranks relative to all other test takers.  

- **standard deviation** - allows us to assign a single number to describe how far the data are dispersed from their mean (the standard deviation, written σ, is the square root of the variance)

- **variance** - calculated by determining how far the observations within a distribution lie from the mean {formula for variance is on page 34}

- **normal distribution** - symmetrical around its mean in a bell shape, such that:
    - 68.2% of observations are within one standard deviation of the mean
    - 95.4% are within two standard deviations of the mean
    - 99.7% are within three standard deviations of the mean
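
The definitions above can be tried out with a short Python sketch. The test scores are invented for illustration, and the normal-distribution shares are computed from the error function rather than taken from the book, so the printed figures may differ from the rounded ones above in the last digit.

```python
import math
import statistics

# Hypothetical test scores, invented purely for illustration.
scores = [62, 70, 74, 78, 81, 85, 90, 100]

mean = statistics.mean(scores)
# Population variance: the average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in scores) / len(scores)
# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)
print(f"mean={mean}, variance={variance}, standard deviation={std_dev:.2f}")


def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

Dividing by `len(scores)` treats the scores as the whole population; dividing by `len(scores) - 1` instead would give the sample variance.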
        
<img src="normal.dist.png" width="800" height="400">

Measuring change as a percentage gives us some sense of scale. (remember percent of change is amount of change divided by original amount)
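
As a small sketch of that arithmetic (the function name and the prices are made up for illustration), the same absolute change can look very different once it is scaled by the original amount:

```python
def percent_change(original: float, new: float) -> float:
    """Percentage change: amount of change divided by the original amount."""
    if original == 0:
        raise ValueError("percent change is undefined when the original amount is zero")
    return (new - original) / original * 100


# Hypothetical prices, chosen only to show the arithmetic.
print(percent_change(50, 60))    # a 10-unit rise on a base of 50
print(percent_change(500, 510))  # the same 10-unit rise on a base of 500
```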

The advantage of any index is that it consolidates lots of complex information into a single number.

## <font color = 'maroon'>Chapter 3: Deceptive Description

The overall lesson of this chapter is that statistical malfeasance has very little to do with bad math. If anything, <font color = 'red'> impressive calculations can obscure nefarious motives.  The fact that you've calculated the mean correctly will not alter the fact that the median is the more accurate indicator.  Judgment and integrity turn out to be surprisingly important.<font color = 'black'> A detailed knowledge of statistics does not deter wrongdoing any more than a detailed knowledge of the law averts criminal behavior.
    
- Pay attention to the unit of analysis
- The median is not sensitive to outliers
- The median v. mean question revolves around whether the outliers in a distribution distort what is being described or are instead an important part of the message.
- Any comprehensive statistical analysis would likely present both the mean and the median.
- **real figures** are adjusted for inflation.
- **nominal figures** are not adjusted for inflation.

A statistical index has all the potential pitfalls of any descriptive statistic - plus the distortions introduced by combining multiple indicators into a single number. By definition, any index is going to be sensitive to how it is constructed; it will be affected both by what measures go into the index and by how each of those measures is weighted. In the end, the important question is whether the simplicity and ease of use introduced by collapsing many indicators into a single number outweighs the inherent inaccuracy of the process.
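
The mean-versus-median point from the list above can be illustrated with a minimal sketch. The incomes are invented; the only claim is about how each statistic reacts to a single outlier.

```python
import statistics

# Hypothetical annual incomes for ten people; the figures are invented.
incomes = [35_000] * 10
print(statistics.mean(incomes), statistics.median(incomes))

# One extreme outlier joins the group.
incomes_with_outlier = incomes + [1_000_000_000]
print(statistics.mean(incomes_with_outlier))    # pulled far above any typical income
print(statistics.median(incomes_with_outlier))  # unchanged by the outlier
```

Which number is the better description depends on whether the outlier is a distortion or part of the story being told.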

## <font color = 'maroon'>Chapter 4: Correlation

Correlation measures the degree to which two phenomena are related to one another. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction. In a negative association, one increases while the other decreases.

- **correlation coefficient** is a single number between -1 and 1
- **1 is a perfect positive correlation** (every change in one variable is associated with an equivalent change of the other variable in the same direction)
- **-1 is a perfect negative correlation** (every change in one variable is associated with an equivalent change of the other variable in the opposite direction)
- the closer a correlation is to 1 or -1, the stronger the association
- A correlation close to 0 suggests the variables have no meaningful association with each other
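
A minimal sketch of how such a coefficient can be computed, assuming the standard Pearson formula (the notes do not give the formula, and the data below are invented):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return co_movement / (spread_x * spread_y)


# Hypothetical data, invented for illustration.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 60, 71, 80, 88]   # rises with hours studied
unrelated = [2, 1, 3, 1, 2]          # no linear relationship with hours studied

print(pearson_r(hours_studied, test_scores))
print(pearson_r(hours_studied, unrelated))
```

A coefficient near 0 only rules out a linear association; the variables could still be related in some other way.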

**Remember that correlation does not imply causation!**

## <font color = 'maroon'>Chapter 5: Basic Probability
Probability is the study of events and outcomes involving an element of uncertainty. Probabilities tell us what is likely to happen and what is less likely to happen.
- **binomial experiment** - a fixed number of independent trials (each one a Bernoulli trial), each with two possible outcomes and the same probability of success in every trial.
- The probability of two independent events' **both** happening is the **product of their respective probabilities**. In other words, the probability of Event A happening *and* Event B happening is the probability of Event A multiplied by the probability of Event B.  NOTE: this formula is only applicable if the events are *independent*, meaning the outcome of one has no effect on the outcome of the other.
- The probability that one event happens *or* another event happens: if outcome A and outcome B are mutually exclusive (they cannot both occur), the probability of getting A or B is the sum of their individual probabilities (the probability of A *plus* the probability of B).
- INDEPENDENT: and = ×; MUTUALLY EXCLUSIVE: or = +

 If the events are not mutually exclusive, the probability of getting A or B is the sum of their individual probabilities minus the probability of both events happening.
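
These rules can be checked with exact fractions. The sketch below uses coin, die and card examples chosen for illustration (they are not taken from the notes): multiplying applies to independent events, plain addition applies when the two events cannot both happen, and the overlap is subtracted when they can.

```python
from fractions import Fraction

# Independent events: two fair coin flips both landing heads ("and" -> multiply).
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Mutually exclusive events: one die roll showing a 1 or a 2 ("or" -> add).
p_one = Fraction(1, 6)
p_two = Fraction(1, 6)
p_one_or_two = p_one + p_two

# Not mutually exclusive: drawing a heart or a king from a 52-card deck.
# The king of hearts is both, so it would be counted twice without the subtraction.
p_heart = Fraction(13, 52)
p_king = Fraction(4, 52)
p_king_of_hearts = Fraction(1, 52)
p_heart_or_king = p_heart + p_king - p_king_of_hearts

print(p_two_heads, p_one_or_two, p_heart_or_king)
```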
 
 The expected value or payoff from some event, say purchasing a lottery ticket, is the sum of the payoffs of all the different outcomes, each weighted by its probability.
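
A minimal sketch of that calculation, using a made-up $1 lottery ticket (the prizes and odds are hypothetical, not figures from the book):

```python
from fractions import Fraction


def expected_value(outcomes):
    """Sum of payoff * probability over (payoff, probability) pairs."""
    if sum(probability for _, probability in outcomes) != 1:
        raise ValueError("the probabilities of all outcomes must add up to 1")
    return sum(payoff * probability for payoff, probability in outcomes)


# Hypothetical ticket costing $1; payoffs are net of the ticket price.
p_jackpot = Fraction(1, 10_000_000)
p_small_prize = Fraction(1, 100)
p_nothing = 1 - p_jackpot - p_small_prize

ticket = [
    (1_000_000 - 1, p_jackpot),  # win $1,000,000
    (10 - 1, p_small_prize),     # win $10
    (-1, p_nothing),             # lose the $1 stake
]

print(float(expected_value(ticket)))
```

A negative expected value means that, on average, each ticket costs the buyer money even though an individual ticket can still win.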
 
 - Good decisions, as measured by the underlying probabilities, can turn out badly.
 - And bad decisions can still turn out well, at least in the short run. 
 - But probability triumphs in the end. 
 
 An important theorem known as the **law of large numbers** tells us that as the number of independent trials increases, the average of the outcomes will get closer and closer to its expected value.
 - the law of large numbers explains why casinos always make money in the long run.
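
The law of large numbers can be watched in a quick simulation. This sketch assumes a fair six-sided die, whose expected value is 3.5; the seed is arbitrary and only makes the run repeatable.

```python
import random


def average_roll(num_rolls: int, rng: random.Random) -> float:
    """Average outcome of `num_rolls` rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(num_rolls)) / num_rolls


rng = random.Random(42)  # arbitrary seed so the run is repeatable
for num_rolls in (10, 1_000, 100_000):
    print(num_rolls, average_roll(num_rolls, rng))
```

With few rolls the average can land well away from 3.5; with many rolls it settles close to it, which is the same mechanism that works in a casino's favor over thousands of bets.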
 
A **probability density function** merely plots the assorted outcomes along the x-axis and the expected probability of each outcome on the y-axis; the probabilities of all possible outcomes add up to 1 (for a continuous variable, the area under the curve equals 1).

'Predictive policing' is part of a broader movement called predictive analytics


## <font color = 'maroon'>Chapter 6: Problems with Probability
    
**Value at Risk model** (aka VaR) - a common barometer of risk in the financial industry
- 'VaR's great appeal, and its great selling point...is that it expresses risk as a single number, a dollar figure, no less' -- Joe Nocera, New York Times business writer
    
- The crisis that began in 2008 destroyed trillions of dollars in wealth in the United States, drove unemployment over 10%, created waves of home foreclosures and business failures, and saddled governments around the world with huge debts. This is a sadly ironic outcome, given that sophisticated tools like VaR were designed to mitigate risk.
    
**The most common probability-related errors, misunderstandings and ethical dilemmas:**
- **assuming events are independent when they are not**
- **not understanding when events ARE independent**. (The very definition of statistical independence between two events is that the outcome of one has no effect on the outcome of the other.)
- **Clusters Happen** - when we see an anomalous event out of context, we [wrongly] assume that something besides randomness must be responsible.
- **the prosecutor's fallacy**: neglecting the context surrounding statistical evidence (the chances of finding a coincidental one-in-a-million match are relatively high if you run the sample through a database with samples from a million people)
- **reversion to the mean** (or regression to the mean) - probability tells us that any outlier is likely to be followed by outcomes that are more consistent with the long-term average
- **statistical discrimination**: When is it okay to act on the basis of what probability tells us is likely to happen, and when is it not okay? The broader point here is that our ability to analyze data has grown far more sophisticated than our thinking about what we ought to do with the results. For all the elegance and precision of probability, there is no substitute for thinking about what calculations we are doing and why we are doing them.
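
The database example in the prosecutor's fallacy item can be made concrete. This is a rough sketch that assumes every profile in the database is unrelated to the sample and matches independently with the same probability; the function name is made up for illustration.

```python
def chance_of_coincidental_match(match_probability: float, database_size: int) -> float:
    """Probability that at least one unrelated profile in the database matches by chance."""
    # Probability that no profile matches, subtracted from 1.
    return 1 - (1 - match_probability) ** database_size


# A one-in-a-million match run against a database of a million people.
print(chance_of_coincidental_match(1e-6, 1_000_000))
# The same match probability checked against a single suspect.
print(chance_of_coincidental_match(1e-6, 1))
```

Under these assumptions a chance match somewhere in the database is more likely than not, even though a chance match against any one person remains one in a million.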


## <font color = 'maroon'>Chapter 7:  The Importance of Data ('Garbage in, garbage out')

We generally ask our data to do one of three things: 
1. we demand a sample that is representative of some larger group or population
    - requires a simple random sample (each observation in the relevant population has an equal chance of being included in the sample)
    - a representative sample enables us to use the most powerful tools statistics has to offer
    - many of the most egregious statistical assertions are caused by good statistical methods applied to bad samples, not the opposite
    - sample size matters, bigger is better
    - if every member of the relevant population does not have an equal chance of ending up in the sample, we are going to have a problem with whatever results emerge from the sample.
2. we expect data to provide some source of comparison
3. we sometimes have no specific idea what we will do with the information - but we suspect it will come in handy at some point.

- **longitudinal data set** - the research equivalent of a Ferrari.  Particularly valuable when it comes to exploring causal relationships that may take years or decades to unfold<br>
- **cross-sectional data set** - a collection of data gathered at a single point in time<br>
- **self-selection bias** - bias that arises when individuals volunteer to be in a treatment group.<br>
- **publication bias** - positive findings are more likely to be published than negative findings, which can skew the results that we see.<br>
- **recall bias** - memory is not always a great source of data. Our memories  can be systematically fragile when we are trying to explain some particularly good or bad outcome in the present<br>
- **survivorship bias** - occurs when some or many of the observations are falling out of the sample, changing the composition of the observations that are left and therefore affecting the results of any analysis. (If you have a room of people with varying heights, forcing the short people to leave will raise the average height in the room, but it doesn't make anyone taller.)<br>
- **healthy user bias** - people who take vitamins regularly are likely to be healthy -- *because they are the kind of people who take vitamins regularly!*<br>
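
The room-of-people illustration of survivorship bias can be reproduced in a few lines. The heights and the 170 cm cutoff are invented for the sketch.

```python
import statistics

# Hypothetical heights (in cm) of everyone in the room.
heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190]
average_before = statistics.mean(heights_cm)

# The shorter people leave; nobody who stays has changed height.
remaining = [height for height in heights_cm if height >= 170]
average_after = statistics.mean(remaining)

print(average_before, average_after)
```

The average rises only because the composition of the group changed, which is exactly how observations dropping out of a sample can distort an analysis.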


## <font color = 'maroon'>Chapter 8: The Central Limit Theorem

The core principle underlying the central limit theorem is that a large, properly drawn sample will resemble the population from which it is drawn. This theorem allows us to make the following inferences: 

1.  If we have detailed information about some population, then we can make powerful inferences about any properly drawn sample from that population.  According to the central limit theorem, the average test score for a random sample of 100 students will not typically deviate sharply from the average test score for the whole school.
2.  If we have detailed information about a properly drawn sample (mean and standard deviation), we can make strikingly accurate inferences about the population from which that sample was drawn.
3. If we have data describing a particular sample, and data on a particular population, we can infer whether or not that sample is consistent with a sample that is likely to be drawn from that population. 
4. If we know the underlying characteristics of two samples, we can infer whether or not both samples were likely drawn from the same population.

**According to the central limit theorem, the sample means for any population will be distributed roughly as a normal distribution around the population mean.**

Our best guess for what the mean of any sample will be is the mean of the population from which it's drawn. The whole point of a representative sample is that it looks like the underlying population.  

- As a rule of thumb, the sample size must be at least 30 for the central limit theorem to hold true.
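
A small simulation can make the theorem tangible. This sketch assumes a deliberately skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen for illustration) and checks that the sample means still cluster around the population mean; the seed, sample size and number of samples are arbitrary.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed so the run is repeatable
SAMPLE_SIZE = 50
NUM_SAMPLES = 2_000
POPULATION_MEAN = 1.0   # mean of an exponential distribution with rate 1
POPULATION_SD = 1.0     # its standard deviation is also 1


def sample_mean(n: int) -> float:
    """Mean of one random sample of size n drawn from the skewed population."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]
mean_of_means = statistics.mean(sample_means)

# Spread of the sample means predicted by the theorem.
standard_error = POPULATION_SD / SAMPLE_SIZE ** 0.5
share_within_two_se = sum(
    abs(m - POPULATION_MEAN) <= 2 * standard_error for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"share within two standard errors: {share_within_two_se:.1%}")
```

Even though individual draws from this population are far from bell-shaped, roughly 95% of the sample means should fall within two standard errors of the population mean.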

### Standard Error (the standard deviation of the sample means)

- The standard error measures the dispersion of the sample means.  
- Some things to remember: 
    - The standard deviation measures dispersion in the underlying population
    - The standard error measures the dispersion of the sample means
    - So the standard error is the standard deviation of the sample means.
    
- a LARGE standard error means that the sample means are spread out widely around the population mean
- a small standard error means that they are clustered relatively tightly. 
- Sample means will cluster more tightly around the population mean as the size of each sample gets larger
- Sample means will cluster less tightly around the population mean when the underlying population is more spread out
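
Both effects in the last two bullets follow from the usual estimate of the standard error, the sample standard deviation divided by the square root of the sample size. The formula is not written out in these notes, so treat the sketch below as an assumption-labeled illustration with invented data.

```python
import math
import statistics


def standard_error(sample) -> float:
    """Estimated standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return statistics.stdev(sample) / math.sqrt(n)


# Hypothetical samples, invented for illustration.
small_sample = [60, 70, 80, 90, 100]
large_sample = small_sample * 4                  # same values, four times as many observations
wide_sample = [2 * x for x in small_sample]      # same size, twice the spread

print(standard_error(small_sample))
print(standard_error(large_sample))  # smaller: more observations
print(standard_error(wide_sample))   # larger: more dispersed data
```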

**THE BIG PICTURE:**

- if you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean (regardless of what the distribution of the underlying population looks like)

- most sample means will lie reasonably close to the population mean; the standard error is what defines 'reasonably close'

- it is relatively unlikely that a sample mean will lie more than two standard errors from the population mean

- the less likely it is that an outcome has been observed by chance, the more confident we can be in surmising that some other factor is in play
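
One way to apply these points is to ask how many standard errors separate an observed sample mean from the population mean. The numbers below are hypothetical, and the function name is made up for this sketch.

```python
import math


def standard_errors_from_mean(sample_mean: float, population_mean: float,
                              population_sd: float, n: int) -> float:
    """Distance between a sample mean and the population mean, in standard errors."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


# Hypothetical population: mean 100, standard deviation 15; a sample of 36 averages 106.
distance = standard_errors_from_mean(106, 100, 15, 36)
print(distance)
print("more than two standard errors away" if abs(distance) > 2 else "within two standard errors")
```

A result beyond two standard errors does not prove that something else is going on, but it does make chance a less comfortable explanation.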


## <font color = 'maroon'>Chapter 9: Inference
    