# Why is this stuff hard?

<img src='images/einstein.jpg' width=250>

## It's not just the math

>
> "... [S]tatisticians do not in general exactly agree on how to analyze anything but the simplest of problems" - Richard McElreath
>

>
> "Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty" https://www.pnas.org/doi/10.1073/pnas.2203150119
>

### Statistical work is usually embedded in some context with a lot at stake

* If we hope to get meaningful insights, we need to think hard about exactly what we think we know, how we might be wrong, and what we're trying to learn
* We might like to believe that statistically derived outcomes are neutral knowledge, but
    * they're not always knowledge (at best they are a combination of what we already think and what the data tells us)
    * they're often not neutral
* Specifically, many settings where statistics are used are also places where "knowledge" is connected to power, to those who seek power, and those who wield power
    * Government
    * Economics
    * Business
    * Science

*Effectively, a "data-driven" culture makes analysis into an instrument of power and thus adds complexity to our attempts at good analysis.*

It's easy to forget: statistical models aren't real. They're models. There's a difference, no matter how useful the models might be.

### Humans love stories, especially simple ones with heroes and villains

For reasons outside of our scope, humans seem to be strongly influenced by stories. Of particular power are stories with simple causes and effects, where it is clear whom to blame and to whom we should assign credit.

Many real-world phenomena -- especially those to which we apply statistical analysis -- don't work like that (see below).

This creates an inherent conflict between subtle, non-simple analysis and our (or, more importantly, our bosses'/orgs') desire to present a simple story as the outcome of our analysis. Moreover, if we resist the pressure to present a simple, appealing story, we may be marginalized by others who embrace the power of simplicity (even when it's wildly wrong).

### Interesting real-world phenomena are highly non-linear and outcomes can be impossible to intuit

Humans are reasonably good at dealing with linear phenomena, even when we don't like them.

We are worse than terrible at reasoning about, intuiting, and addressing non-linear phenomena: exponential growth, tipping points/phase changes, long/fat tails, extreme sensitivity to initial conditions ... and other aspects of the world we live in.

So we -- as well as trained scientists and statisticians -- struggle to get our heads around what the numbers mean, even when we can agree on those numbers.

### Paradoxes -- resolved and less resolved

In the world of statistics, paradoxes usually do not mean contradictory ideas in the strict sense, but puzzles with potentially multiple specifications and interpretations.

We'll discuss two of the most famous ones: Simpson's and Berkson's paradoxes, which are well understood and have accepted interpretations today.

However, there are other statistical "paradoxes" that are still debated to some degree, such as
* the St. Petersburg Paradox (https://en.wikipedia.org/wiki/St._Petersburg_paradox)
    * (still the subject of some debate, particularly around its interpretation vis-à-vis economic expected utility theory)
* Bertrand's Paradox https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)

### Human intuition is generally weak even around some of the simplest probabilistic phenomena

E.g., humans do terribly with simple conditional probabilities, giving rise to things like the Base Rate Fallacy (closely related to the Prosecutor's Fallacy)
* https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/physicians-neglect-base-rates-and-it-matters/48984E851538FA30B5DE4D6D6CC35CA3
* https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2793624

The point here is not to pick on doctors -- they are not generally worse at this than other specialists -- it's that even highly trained professionals in critical analytical situations get it wrong if they don't do the hard analytical legwork.

More on this below.
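To make base-rate neglect concrete, here is a minimal sketch of Bayes' rule applied to a diagnostic test. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen only for illustration; they are not taken from the studies linked above.

```python
# Hypothetical numbers, chosen only for illustration
prevalence = 0.01           # P(disease): the base rate
sensitivity = 0.90          # P(positive | disease)
false_positive_rate = 0.09  # P(positive | no disease)

# Total probability of a positive test, over sick and healthy people
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' rule: P(disease | positive)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
```

Under these assumed numbers, fewer than 1 in 10 positive results corresponds to actual disease, even though the test catches 90% of true cases. Ignoring the 1% base rate is what makes the intuitive answer ("about 90%") so far off.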

## What is your *context* for acquiring statistical understanding of a problem?

Typically, people say it is a combination of their business experience and their math skills. Which is true ... but leaves out some important things.

Your context for statistical understanding also includes...
* your goals, and your strengths/weaknesses
* your employees' / teammates' / bosses' / organizations' strengths/weaknesses
* industry assumptions
* cultural assumptions
* your true but latent motivations
    * Don't try to solve problems you don't really want to solve -- instead, acknowledge what you really want
        * e.g., business travel cost policy; open-plan offices; etc.

## And ... on top of that ... the math is also hard

__But not in the way you might think__

The issue is *not* that statistical thinking always requires sophisticated algorithms (although those can help)

The issue is that we humans are neither evolved nor trained to think about probabilities and distributions, and our intuition is __terrible__ for these things.


>
> __Example: in a study of average to above-average performing teams, the smallest teams had the highest performance. The larger teams had more average performance. Moreover, the sizes of teams (large to small) correlated well with performance (average to best).__
> 
> What is the conclusion? Is it meaningful? Why?
>
> Think of it this way: if you were flipping a fair coin, are you more likely to get "all heads" for a group of 3 tosses, or a group of 100 tosses?
>
> In fact, the worst-performing teams ("all tails") are also likely to be among the smallest ones. Small sample sizes make it easier to find extremes in either direction.
>
> But the example didn't mention low-performing teams. Why? Another easy-to-make, fundamental mistake: "sampling on the dependent variable" ... like those airport-bookstore volumes about successful athletes that draw their conclusions from looking at a handful of ... successful athletes.
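The team-size effect can be reproduced with a small simulation. This is a sketch under a deliberately artificial assumption: every team member's score is drawn from the same standard normal distribution, so team size has no real effect on performance at all. The names (`n_teams`, `results`) are ours, not from any study.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility

n_teams = 10_000
results = {}
for team_size in (3, 100):
    # Every member's score comes from the same distribution:
    # by construction, size has no causal effect on performance
    scores = rng.normal(loc=0.0, scale=1.0, size=(n_teams, team_size))
    team_means = scores.mean(axis=1)
    results[team_size] = team_means
    print(f"size {team_size:>3}: best = {team_means.max():.2f}, "
          f"worst = {team_means.min():.2f}, spread (std) = {team_means.std():.2f}")
```

The small teams supply both the best and the worst averages. If we then looked only at the above-average teams, we would "discover" that small teams perform best.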


## What we *do* get right ... and why it often isn't useful

The one area of statistics that trained professionals typically get right is the Gaussian or normal distribution.

<img src='images/normal.png' width=700>

This one we know. And we use it to good effect with our six-sigma quality control and in a number of other uses.

The problem is that most distributions aren't normal. And almost all *interesting* distributions aren't normal, or anything like it.

A distribution approaches the normal when
* it represents an accumulation
* by addition
* of sufficiently many
* independent observations, each with finite variance

But, in the real world, many interesting systems do not give rise to independent observations, and that's when this falls apart.
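A quick way to see the "accumulation by addition" effect is to add up independent draws from a distribution that is clearly not normal and watch the asymmetry fade. The sketch below uses exponential draws (our choice, purely for illustration) and measures skewness, which is 0 for a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility

def skewness(x):
    """Sample skewness: 0 for a symmetric distribution such as the normal."""
    centered = x - x.mean()
    return (centered ** 3).mean() / x.std() ** 3

skews = {}
for n_terms in (1, 5, 50, 200):
    # Each row is a sum of n_terms independent exponential draws
    sums = rng.exponential(scale=1.0, size=(10_000, n_terms)).sum(axis=1)
    skews[n_terms] = skewness(sums)
    print(f"sum of {n_terms:>3} terms: skewness = {skews[n_terms]:.2f}")
```

The skewness shrinks toward 0 as more independent terms are added. The same experiment with dependent terms, or with multiplication instead of addition, does not have to behave this way.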

## Averages

__Averages__ (or "expectations") are probably the most used statistics.

And what happens when we hear about an average?
* We picture something like the normal (above) and think about the corresponding areas of high likelihood

So let's explore...

### 12 ways to go totally wrong with averages, Part 1

__Shape issues, multiple modes, and skew__

1. Multiple modes, no mass near the average

2. Multiple modes, some mass near the average

3. Skewed distributions

    e.g., Log Normal https://distribution-explorer.github.io/continuous/lognormal.html

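4. Long- and heavy-tailed distributions

    e.g., Exponential https://distribution-explorer.github.io/continuous/exponential.html (strictly speaking, the Exponential is the boundary case: "heavy-tailed" usually means tails that decay more slowly than exponentially, as in the Log Normal)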
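Two of these shape problems are easy to see numerically. The following is a rough sketch with made-up parameters: a mixture of two well-separated normals (multiple modes) and a standard log-normal (skew).

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility
n = 100_000

# Two well-separated modes: the average lands where almost no data lives
bimodal = np.concatenate([rng.normal(-5, 1, n // 2), rng.normal(5, 1, n // 2)])
near_mean = np.abs(bimodal - bimodal.mean()) < 1
print(f"bimodal: mean = {bimodal.mean():.2f}, "
      f"share of data within 1 unit of the mean = {near_mean.mean():.4f}")

# Right-skewed: the mean sits well above the median
lognormal = rng.lognormal(mean=0.0, sigma=1.0, size=n)
below_mean = (lognormal < lognormal.mean()).mean()
print(f"lognormal: mean = {lognormal.mean():.2f}, "
      f"median = {np.median(lognormal):.2f}, "
      f"share of data below the mean = {below_mean:.2f}")
```

In the bimodal case the average describes essentially nobody. In the skewed case roughly two thirds of the observations fall below the "average" value.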

__Information loss, averaging models, extrema__

5. Averaging models, assessments, or projections
* imagine 12 models which each predict a COVID (or sales) surge in a different month
* now average those

6. Average as a false proxy for extrema
* Risk assessment
* "Never cross a river which is 4 feet deep on average"
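The "12 models" thought experiment is small enough to write out. In this sketch the baseline and surge levels are hypothetical: model `k` predicts a flat baseline with a single surge in month `k`.

```python
import numpy as np

n_months = 12
baseline, surge = 100, 1000  # hypothetical case (or sales) counts

# Model k predicts the baseline every month except a surge in month k
forecasts = np.full((n_months, n_months), baseline)
forecasts[np.arange(n_months), np.arange(n_months)] = surge

averaged = forecasts.mean(axis=0)
print("each model's peak:", forecasts.max(axis=1))
print("averaged forecast:", averaged)
```

Every individual model predicts a surge to 1000. The averaged forecast is the same modest value in every month, with no surge at all: a scenario that none of the 12 models considers plausible.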

### 12 ways to go totally wrong with averages, Part 2

__Good outcomes $\neq$ good decisions; bad outcomes $\neq$ bad decisions__

"Resulting" and ABIYLFOYP ("Always Be Integrating Your Loss Function Over Your Posterior")

7. Cost functions
* probability of success/failure is not always proportional to the __cost__ of success/failure
* "Don't play Russian roulette" with your life, your business, etc.
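Here is one way "integrating your loss function over your posterior" might look in practice. Everything in this sketch is hypothetical: the normal samples stand in for posterior draws of next month's demand, and the loss function assumes that running short costs five times as much per unit as overstocking.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility

# Stand-in for posterior samples of next month's demand (hypothetical)
demand = rng.normal(loc=100, scale=20, size=50_000)

def loss(stock, demand, understock_cost=5.0, overstock_cost=1.0):
    """Asymmetric loss: a unit of shortfall costs 5x a unit of surplus."""
    shortfall = np.maximum(demand - stock, 0)
    surplus = np.maximum(stock - demand, 0)
    return understock_cost * shortfall + overstock_cost * surplus

# "Integrate the loss over the posterior": average the loss across samples
candidates = np.arange(50, 201)
expected_loss = np.array([loss(s, demand).mean() for s in candidates])
best_stock = candidates[expected_loss.argmin()]

print(f"posterior mean demand: {demand.mean():.1f}")
print(f"stock level minimizing expected loss: {best_stock}")
```

The decision that minimizes expected loss sits well above the average demand, because the two kinds of error are not equally costly. Planning for "the average" would be the wrong call here even though the average itself is estimated correctly.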

__Long vs. short time horizons__

8. Don't mistake long run or asymptote for the short run
* Keynes: "In the long run, we're all dead"

__Averages that aren't "typical," are fragile, or are missing__

What if ... most of the probability mass is far from the region of highest probability density? This phenomenon is common even in "well behaved" distributions when there are many variables (high dimensionality) involved.

9. Typical sets
* The curse of dimensionality
    * https://mc-stan.org/users/documentation/case-studies/curse-dims.html
* The End of Average
    * https://www.toddrose.com/endofaverage
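To get a feel for why the "most likely" point stops being typical in high dimensions, we can sample from a standard normal in 1, 10, and 100 dimensions and ask how close the samples come to the mode (the origin). This is a minimal sketch; the 1-unit radius is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility
n_samples = 10_000

mean_distance = {}
share_near_mode = {}
for dims in (1, 10, 100):
    # Standard normal in `dims` dimensions: the density peaks at the origin
    samples = rng.standard_normal((n_samples, dims))
    distance = np.linalg.norm(samples, axis=1)
    mean_distance[dims] = distance.mean()
    share_near_mode[dims] = (distance < 1).mean()
    print(f"dims = {dims:>3}: mean distance from the mode = {mean_distance[dims]:.2f}, "
          f"share within 1 unit of the mode = {share_near_mode[dims]:.4f}")
```

In one dimension most samples sit near the mode. In 100 dimensions essentially none do: the samples concentrate in a thin shell far from the single most probable point, so an individual who is "average" on every variable at once is vanishingly rare.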

10. Averages that cannot be inferred from your sample
* Power laws and income in San Mateo county
* How Bill Gates helps your net worth when he has a coffee at your Starbucks
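The coffee-shop effect takes only a few lines to reproduce. All figures below are round, made-up numbers: 50 customers with identical net worth, then one visitor worth 100 billion dollars.

```python
import statistics

# Hypothetical net worths (in dollars) of 50 customers in a coffee shop
customers = [100_000] * 50
print(f"before: mean = {statistics.mean(customers):,.0f}, "
      f"median = {statistics.median(customers):,.0f}")

# One billionaire walks in (a round, made-up figure)
customers_with_billionaire = customers + [100_000_000_000]
mean_after = statistics.mean(customers_with_billionaire)
median_after = statistics.median(customers_with_billionaire)
print(f"after:  mean = {mean_after:,.0f}, median = {median_after:,.0f}")
```

The mean jumps by four orders of magnitude while the median does not move. When a distribution has this kind of tail, the sample average mostly tells us whether the sample happened to include one of the rare extreme observations.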

11. Averages that don't exist at all

Some distributions have no meaningful ... mean.

* Extremely fat tails
    * Cauchy distribution
    * https://distribution-explorer.github.io/continuous/cauchy.html
* Missing this phenomenon can lead to very risky decisions. You wouldn't make this mistake ... *if* you already knew what kind of distribution you had on your hands. But unless you know the Data Generating Process, you don't: you just have some data.
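One symptom of a missing mean is that collecting more data does not stabilize the sample average. The sketch below repeats an experiment 2,000 times for two sample sizes and compares how much the sample means vary for normal versus Cauchy data. We use the interquartile range as the spread measure because the variance of the Cauchy does not exist either.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility
repetitions = 2_000

def iqr(x):
    """Interquartile range: a spread measure that exists even for the Cauchy."""
    q25, q75 = np.percentile(x, [25, 75])
    return q75 - q25

spread = {}
for n in (10, 1_000):
    # One sample mean per repetition, each computed from n draws
    normal_means = rng.standard_normal((repetitions, n)).mean(axis=1)
    cauchy_means = rng.standard_cauchy((repetitions, n)).mean(axis=1)
    spread[n] = (iqr(normal_means), iqr(cauchy_means))
    print(f"n = {n:>5}: spread of sample means: "
          f"normal = {spread[n][0]:.3f}, Cauchy = {spread[n][1]:.3f}")
```

For normal data, a hundred times more observations makes the sample mean about ten times more precise. For Cauchy data the sample mean is just as erratic with 1,000 observations as with 10.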

12. Path dependence and ergodicity
    * Ensemble average vs. time average

### Example: exploring (non-)ergodicity and path dependence via simulation

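Consider a bet that, with equal probability, returns either \\$1.50 or \\$0.60 for each \\$1 staked. The traditional average ("expectation") of the return per dollar is: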

In [None]:
0.5 * 1.5 + 0.5 * 0.60

We can simulate that to get a better idea of the deviation from the ideal average

In [None]:
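import numpy as np
import matplotlib.pyplot as plt

# Number of independent one-off bets in each experiment
sample_size = range(100, 10000, 100)

outcomes = []

for i in sample_size:
    # Each bet pays 1.50 or 0.60 with equal probability
    draws = np.random.uniform(0, 1, i)
    payoffs = np.where(draws > 0.5, 1.50, 0.60)
    outcomes.append(payoffs.mean())
    
# Sample average payoff vs. sample size (the ideal average is 1.05)
plt.plot(sample_size, outcomes)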

So it looks like, even for small samples or with some "bad luck," we should do pretty well with this sort of investment.

__Ensemble average vs. time average__

But this form of average assumes that we start in the same position prior to each investment or bet.

* It's a bit like looking at hundreds or thousands of individuals or firms each making one bet. On average, they will (collectively) do well!

But let's change our perspective for a moment and look at one individual or firm making a sequence of small bets/investments.

* If they make $2n$ investments, we would expect about $n$ to yield the \\$1.50 and the other $n$ to yield the \\$0.60
* So the end result would be $(1.5)^n \cdot (0.6)^n = [(1.5)(0.6)]^n = 0.9^n$

Wait ... $0.9^n$ doesn't look very good. In fact, it goes to zero very quickly as $n$ grows.
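The two perspectives correspond to two different averages of the same pair of multipliers: the arithmetic mean (ensemble view) and the geometric mean (time view). A minimal sketch, using only the 1.5 and 0.6 multipliers from the example:

```python
import math

up, down = 1.5, 0.6  # the two equally likely multipliers from the example

# Ensemble average: the arithmetic mean of the multipliers
ensemble_factor = 0.5 * up + 0.5 * down

# Time average: the geometric mean, i.e. exp of the expected log-multiplier
time_factor = math.exp(0.5 * math.log(up) + 0.5 * math.log(down))

print(f"ensemble average factor per bet:      {ensemble_factor:.4f}")
print(f"time-average factor per bet:          {time_factor:.4f}")
print(f"time-average factor per pair of bets: {time_factor ** 2:.4f}")
```

The ensemble factor is above 1 (growth), while the time-average factor is below 1 (decay). Squaring the time-average factor recovers the 0.9 per pair of bets derived above.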

Just to be sure, let's simulate this as well:

In [None]:
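steps = 200          # sequential bets made by each agent
simulations = 10000  # number of independent agents

draws = np.random.uniform(0, 1, (simulations, steps))
# Each step multiplies wealth by 1.50 or 0.60 with equal probability
factors = np.where(draws > 0.5, 1.50, 0.60)
# Final wealth of each agent (starting from 1): the product along its path
outcomes = factors.prod(axis=1)
plt.hist(outcomes, bins=100)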

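Just for comparison, here is the ensemble expected value after `steps` investments, followed by the fraction of simulated agents that end up below or above a few thresholds: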

In [None]:
# Ensemble expectation: each step multiplies the average by 1.05
expected = 1.05 ** steps
expected

In [None]:
# Fraction of agents who lost more than 90% of their starting stake
outcomes[outcomes < 0.1].size / simulations

In [None]:
# Fraction of agents who ended with less than they started with
outcomes[outcomes < 1].size / simulations

In [None]:
# Fraction of agents who at least doubled their starting stake
outcomes[outcomes > 2].size / simulations

In [None]:
# Fraction of agents who reached the ensemble expectation
outcomes[outcomes >= expected].size / simulations
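It can also help to put the ensemble expectation next to the outcome of a typical agent. The sketch below re-runs the same game in a self-contained way (the seed and the name `final_wealth` are ours) and compares the theoretical expectation, the sample mean, and the median.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility
steps, simulations = 200, 10_000

# Same game as above: each step multiplies wealth by 1.50 or 0.60
factors = rng.choice([1.50, 0.60], size=(simulations, steps))
final_wealth = factors.prod(axis=1)

print(f"ensemble expectation (1.05 ** steps): {1.05 ** steps:,.0f}")
print(f"sample mean of final wealth:          {final_wealth.mean():,.4f}")
print(f"median final wealth:                  {np.median(final_wealth):.2e}")
print(f"typical outcome (0.9 ** (steps / 2)): {0.9 ** (steps / 2):.2e}")
```

The median agent is close to ruin, in line with the $0.9^n$ calculation. The large ensemble expectation is carried by extraordinarily rare lucky paths, so even the sample mean of 10,000 agents will usually fall far short of it.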

__A dramatic view of the "lifelines" of a number of agents facing a similar set of options__

<img src='images/ergo.webp' width=700>

From: https://www.nature.com/articles/s41567-019-0732-0

When does this occur in real life?

Although our specific numbers in the present example are contrived, path dependence is a critical factor in many real-world systems:
* economic actors
* health outcomes
* hiring and promotion
* education
* criminal justice
* participation in risk-taking and investment activities

__How does this connect to the distributions and patterns we've been talking about?__

Notice that, in the path-dependent case,
* we have a *series of compounding outcomes which are not independent*
    * (the multipliers are drawn independently, but each step's result depends on the state left by all prior steps)
* whereas, in the ensemble expectation, we *assumed* that all of the events are independent
    * (they only depend on the "rules of the game" -- every trial starts with 1 dollar)

Once again, we see a compounding effect leading to drastically large (or small) numbers.

A concrete example is insurance pools. A sufficiently large and diverse business can "self insure" anything from employee health costs to its own fleet of vehicles. Such self insurance can work, provided the losses are independent enough that the ensemble average holds.

If a company's employees were all concentrated in an area with common health hazards (say, contaminated air or groundwater), then the sequence of repeated health-cost losses would not be independent -- risk would be magnified as health losses compound over time.

__How do we use this knowledge?__

Any time we are looking to achieve an "average" result over time, we can ask whether the steps are truly independent. As a technology example, we may have a device that we deploy in the field which features high uptime (a long mean time between failures).
* To achieve long-term reliability, we want to ensure that the device is as stateless as possible when it recovers
* If a device retains state (e.g., internal storage or config) which affects its future success (after recovering from a failure) then the sequence of failures becomes path dependent