# Paradoxes and Resolutions

## "One crazy trick to succeed at programming"

<img src='images/sim1.png' width=500>

__Coding speed (x) vs. error rate (y)__: "Just write your code really fast!"



## Really?

<img src='images/sim2.png' width=500>

__Novices (blue) vs. senior engineers (orange)__: "Not so much"

This is an example of *Simpson's Paradox*. Look: the data really are the same...

<img src='images/sim.png' width=500>

See also: https://en.wikipedia.org/wiki/Simpson's_paradox

__Overcontrol__

But wait ... this is why my team lead said "control for everything" ... doesn't that help?

__No.__

For a mathematical demonstration, see Pearl; for intuition, we'll come back to this soon when we discuss causality and "causal salad."

In the meantime, here's a silly counterexample that will stick with you: see `images/simpsonreverse.mp4`

<hr/>

## The reverse problem ... or "closing those tricky sales prospects"

<img src='images/deal.svg' width=700>

#### Why do we keep getting big-volume deals at low margin and high-margin deals at low volume? Can't we change this?

### Let's take a look at all the original leads

<img src='images/berk1.png' width=600>

### Now let's look at what's getting through our sales-priority assignments

<img src='images/berk2.png' width=600>

#### This is an example of *Berkson's Paradox*

## What do these paradoxes have in common?

#### Here's a hint ... we know this chart (below) is silly ... but why?

<img src='images/pirates.png' width=700>

#### We joke about plots like this and say

> <p style="font-size:1.5em">"Correlation is not causation"</p>

#### But the reality is that, especially in business, most of the time what we are interested in is causation.

__Why? Because in business -- usually -- the data relationship in the abstract is not what we're interested in.__

* We're interested in making money, or some practices that will lead in that direction
    * More sales
    * Better products
    * The right kind of innovation
    * Advantage in the market
    * etc.
* And we want to __intervene__ in the world to __cause__ those outcomes.

Which means we need to learn about causal inference if we want to make good business decisions.

Causal inference is hard, for a number of reasons...
1. It requires more complex techniques and deeper understanding
2. We (in business) often start with *descriptive analytics* and then move on to *predictive modeling* and, paradoxically, 
> "Models that are causally incorrect can make better predictions than those that are causally correct. As a result, focusing on predictions can systematically mislead us." - R. McElreath

To see why, consider that the two (causal vs. predictive) approaches are trying to answer different questions. 

* A predictive model can tell me where in Toronto I will likely find less poverty and it may be very accurate in that regard. 
* That is totally different from the causal question 
    * What causes poverty? 
    * or, more precisely, *If* I want to reduce poverty in Toronto, *what intervention* might I try, and *what effect size* might I expect?
3. (maybe the gnarliest from a philosophical point of view) __Observational data alone cannot establish causality; parts of the causal hypothesis to test must come from *outside* the data and modeling exercise.__
    * In certain cases, i.e., with strong assumptions about linearity, we can bend this rule ... but then again, if we assume linearity, that's also coming from outside the data and modeling exercise...

This certainly throws a monkey wrench into a lot of weakly informed "data-driven" processes. But...

#### It's not quite as bad as it might seem...

Our business domain knowledge (and common sense) can often allow us to articulate a causal hypothesis.

For example, suppose (A) is my firm's holiday discount and (B) is increased sales volume during the subsequent holiday season. We observe a correlation.

We know that "B causes A" (I'll write it as B -> A) makes no sense, at least in the traditional sense of cause. (Even that is up for debate: the "prospect" of B might be considered to cause A via an agent, but let's not get too far down that rabbit hole.)

A -> B is *possible*

And we might want to investigate:

Does A cause B? (A -> B)

or does something else cause both B and A, creating the correlation I observe:

<img src='images/conf.svg'>

where (C) is, for example, a holiday season or end-of-year budget period

In this latter example, C is called a "confound" and we might want to test its role by controlling for this factor in our analysis. __In fact, this is exactly what caused our Simpson's Paradox example, above.__ 

In that example, 
* the *experience level* of the programmer was the (C) variable correlated both 
    * positively with coding speed (A) 
    * and negatively with error rates (B), 
* while A and B have their own correlation (higher coding speed is correlated with higher error rates)
    * which we could observe once we controlled for (C)
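
To make this concrete, here is a minimal simulation sketch in Python (NumPy). Every number in it (group sizes, mean speeds, base error rates, the within-group slope of 0.5) is invented for illustration and is not the data behind the plots above; `simulate_group` and `slope` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # programmers per group (made-up number)

def simulate_group(mean_speed, base_error):
    """Return (speed, error_rate) for one experience level (assumed model)."""
    speed = rng.normal(mean_speed, 1.0, n)
    # Within a group, coding faster than your peers adds errors (slope +0.5).
    error = base_error + 0.5 * (speed - mean_speed) + rng.normal(0.0, 0.5, n)
    return speed, error

# Experience (C) raises typical speed (A) and lowers typical error rate (B).
novice_speed, novice_error = simulate_group(mean_speed=3.0, base_error=8.0)
senior_speed, senior_error = simulate_group(mean_speed=7.0, base_error=3.0)

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

# Ignoring experience: pool everyone together.
pooled_slope = slope(
    np.concatenate([novice_speed, senior_speed]),
    np.concatenate([novice_error, senior_error]),
)
# Controlling for experience: fit each group separately.
novice_slope = slope(novice_speed, novice_error)
senior_slope = slope(senior_speed, senior_error)

print(f"pooled slope: {pooled_slope:+.2f}")
print(f"novice slope: {novice_slope:+.2f}")
print(f"senior slope: {senior_slope:+.2f}")
```

Under these assumptions the pooled slope is negative ("code faster, make fewer errors") while both within-group slopes are positive, which is the reversal Simpson's Paradox describes.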

So maybe the answer is to sketch out everything we can think of (or measure) and control for all of it!

Many a researcher has taken that route, but it is not quite subtle enough. Richard McElreath calls that approach __Causal Salad__:

> "Causal Salad means tossing various 'control' variables into a statistical model, observing changes in estimates, and then telling a story about causation. Causal salad seems founded on the notion that only omitted variables can mislead us about causation. But *included* variables can just as easily confound us."

#### Colliders and Berkson's Paradox

The causal salad problem is exactly what happened with our sales deal example of Berkson's Paradox. 

We had (A) deal volume and (B) deal margin which were, in the raw data, uncorrelated. 

We also had a factor (C) which amounted to how easy it was to get a deal through the sales process.
Both (A) and (B) affected (C): lower volume __or__ lower margin deals are easier to sell.

Graphically, it looked like this:

<img src='images/collider.svg'>

We then ... perhaps unconsciously ... conditioned on (C): that is, the deals we pursued and closed were selected based on how easy it was to get them through the sales process. Conditioning on (C), a variable that both (A) and (B) affect (this is called __conditioning on a collider__), created a spurious correlation between (A) and (B).
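
The selection effect can be sketched with a small simulation. This is an illustration under assumed numbers, not the data behind the charts above: volume and margin are drawn independently, a hypothetical `difficulty` score stands in for (C), and the rule "close only deals with below-zero difficulty" is an invented stand-in for the sales-priority process.

```python
import numpy as np

rng = np.random.default_rng(1)
n_leads = 20_000  # made-up number of leads

# (A) volume and (B) margin: standardized and independent by construction.
volume = rng.normal(0.0, 1.0, n_leads)
margin = rng.normal(0.0, 1.0, n_leads)

# (C) difficulty of closing: higher volume or higher margin makes a deal harder.
difficulty = volume + margin + rng.normal(0.0, 0.5, n_leads)

# Assumed selection rule: only the easier deals make it through.
closed = difficulty < 0.0

corr_all = np.corrcoef(volume, margin)[0, 1]
corr_closed = np.corrcoef(volume[closed], margin[closed])[0, 1]

print(f"all leads:    corr(volume, margin) = {corr_all:+.2f}")
print(f"closed deals: corr(volume, margin) = {corr_closed:+.2f}")
```

Across all leads the correlation is close to zero; among the closed deals it is clearly negative, even though nothing in the simulation makes volume affect margin or vice versa.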

Another classic example of Berkson's Paradox, or conditioning on a collider, is an MBA (management degree) program that discovers its students seem to be either smarter (but worse leaders) or stronger leaders (but less smart).
* Its admissions and grading processes effectively condition on "success" -- i.e., they pick students by an approximation of a "success" attribute -- which is composed of both leadership and smartness (at least in this example).
* The result is an induced negative correlation between perceived smartness and leadership ability in their grads
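
Conditioning on a collider does not require filtering rows; adding the collider to a regression as a "control" has a similar effect, which is how causal salad goes wrong. The sketch below assumes smartness and leadership are independent and that a hypothetical `success` score is their sum plus noise. The `ols` helper is an illustrative name, and all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000  # made-up number of applicants

smartness = rng.normal(0.0, 1.0, n)
leadership = rng.normal(0.0, 1.0, n)  # independent of smartness by construction
# Collider: both traits feed into the "success" score (assumed model).
success = smartness + leadership + rng.normal(0.0, 1.0, n)

def ols(y, *predictors):
    """Least-squares coefficients [intercept, b1, b2, ...] of y on predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Regress leadership on smartness alone, then with the collider as a "control".
naive = ols(leadership, smartness)[1]
adjusted = ols(leadership, smartness, success)[1]

print(f"smartness coefficient, no controls:          {naive:+.2f}")
print(f"smartness coefficient, 'controlling' for C:  {adjusted:+.2f}")
```

Without the control, the smartness coefficient is near zero, matching how the data were generated. With `success` included, it turns clearly negative: the added variable created the association rather than removing one.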

### The Takeaway

At this point, you might be feeling like any way you go with statistical analyses, you may get into trouble.

* Honestly, that's a good defensive posture for any statistician
* But it also makes it really hard to get your job done!

#### The lesson here is not to give up, but to look a bit skeptically at "__data is the new oil__" or "__data-driven__" hype.

The data we have is, in fact, quite useful... but not by itself. 

We need to start with 
* a human understanding of the business rules and patterns, 
* serious thinking about causality, 
* and then perform analyses to see whether the data provides evidence for actions that will help us reach our goals.

I call this approach *data driven with smart models* because the models that produce value don't come purely from the data, but partly from data and partly from thoughtful humans.