# Experiments (or data collection): different perspectives and what to watch out for

## Big data: blessing or curse?

> "The unreasonable effectiveness of data" - Peter Norvig et al., Google

For some applications, we can throw a *huge* amount of data at a single high-capacity model and get something useful; other times, we need a more surgical approach.

> "Do you want one big model or lots of small models?" - Andrew Gelman

### Big data success stories: high-capacity models, high-variance data

* Large language models like GPT-4
* Modern (deep-learning-based) computer vision models, like SegmentAnything

### Where big data is not so successful

For other applications, your result relies on *local features* where there are few data points.

* Big-data, general "personalized advertising" models don't work well 
    * Even if we capture an extremely high number of meaningful features (which is itself rare)
        * there aren't nearly enough data points to account for the variance once we get into interesting corners of the space
    * Despite the social/economic phenomenon of Internet advertising, it remains unclear whether any of those ad models are actually effective (!)
        * see https://us.macmillan.com/books/9780374538651/subprimeattentioncrisis
* Small data alternatives:
    * Better customer modeling: one approach is the per-customer model
    * For election modeling, the many-small-models approach, combined through a pattern called multilevel regression and post-stratification ("Mr. P"), works very well
        * http://www.stat.columbia.edu/~gelman/research/published/misterp.pdf
        * https://learnbayesstats.com/episode/83-multilevel-regression-post-stratification-electoral-dynamics-tarmo-juristo/
    * Modeling the data-generating process
        * Can be combined with big-data approaches, but typically is not; i.e., typically the big data approach is discriminative rather than generative
        * Even the most impressive large generative models, such as Stable Diffusion, do not attempt to model a data-generating process
        * Ignoring the data-generating process is not always a problem...
            * These models can be strong for prediction
            * But are weak for understanding phenomena
            * And generally unable to support causal modeling on their own
        * Bayesian approaches with domain-specific priors and/or pooling are a strong contender for inferring parameters while modeling a data-generating process, and they work well with "small data"
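
To make the "post-stratification" half of Mr. P a little more concrete, here is a minimal sketch. It assumes the per-cell estimates have already been produced by some multilevel regression (not shown or fitted here), and every number in it is made up for illustration.

```python
# Hypothetical survey cells: every number below is made up for illustration.
# "estimate" stands in for the per-cell output of a multilevel regression,
# which is assumed here rather than fitted.
cells = [
    # (cell label, respondents in sample, estimated support, population share)
    ("18-34", 50, 0.60, 0.30),
    ("35-64", 150, 0.50, 0.50),
    ("65+", 300, 0.40, 0.20),
]

total_respondents = sum(n for _, n, _, _ in cells)

# Raw sample average: each cell counts in proportion to how many people answered.
raw_estimate = sum(n * est for _, n, est, _ in cells) / total_respondents

# Post-stratified average: each cell counts in proportion to its population share.
poststratified_estimate = sum(share * est for _, _, est, share in cells)

print(f"raw sample estimate:      {raw_estimate:.2f}")
print(f"post-stratified estimate: {poststratified_estimate:.2f}")
```

In this toy setup the sample over-represents the oldest cell, so the raw average (0.45) understates support relative to the population-weighted figure (0.51). In a real Mr. P workflow, the regression step is what lets sparsely sampled cells borrow strength from similar cells before the reweighting happens.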

*And big data is not statistically big when it is biased... and since most data collection regimes are biased in some ways, this is a topic that needs to be addressed for each study.*
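
A small simulation can illustrate the point. The population, the sampling probabilities, and the sample sizes below are all invented for the example; it is a sketch of one kind of bias (over-recording high values), not a model of any real collection process.

```python
import random
import statistics

random.seed(0)

# Made-up population: 200,000 values with a true mean near 50.
population = [random.gauss(50, 10) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# "Big" but biased collection: high values are three times as likely to be recorded.
big_biased = [x for x in population if random.random() < (0.9 if x > 50 else 0.3)]

# Small but unbiased collection: a simple random sample of 500.
small_random = random.sample(population, 500)

big_error = abs(statistics.fmean(big_biased) - true_mean)
small_error = abs(statistics.fmean(small_random) - true_mean)

print(f"biased sample:  n={len(big_biased):>7}, error={big_error:.2f}")
print(f"random sample:  n={len(small_random):>7}, error={small_error:.2f}")
```

Under these assumptions the biased sample is hundreds of times larger, yet its estimate of the mean should be off by several units, while the small random sample should land much closer. Collecting more data under the same biased regime does not shrink that error.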

## Problems with Common Analyses and Experiments

### What does "statistical significance" mean? (the non-technical version)

(This discussion is based on traditional frequentist statistics; Bayesians have a much more sensible and straightforward, albeit less widely used, approach to this problem.)

* *Statistical significance* describes how (in)frequently a positive result would be detected due to random chance alone. I.e., how likely we are to find something when it's not actually there.

* *Statistical power* describes how (in)frequently a positive result would be detected when the phenomenon under test is in fact positive. I.e., how likely we are to find something when it *is* actually there.

Both of these are important: obviously we don't want to believe spurious positive results; at the same time, we want to be confident that if there is a result to find, we will find it.
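
One way to build intuition for these two quantities is to simulate many experiments and count how often a test "finds something." The sketch below assumes normally distributed data with a known standard deviation of 1 and uses a simple two-sided z-test; the function names, group size, and effect size are illustrative choices, not a recommended design.

```python
import random
from statistics import NormalDist, fmean

def p_value(group_a, group_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means, assuming a known sd."""
    n = len(group_a)  # both groups are assumed to have the same size
    z = (fmean(group_a) - fmean(group_b)) / (sd * (2 / n) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def detection_rate(true_effect, n=64, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated experiments that report p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        if p_value(treated, control) < alpha:
            hits += 1
    return hits / trials

false_positive_rate = detection_rate(true_effect=0.0)  # nothing to find
power = detection_rate(true_effect=0.5)                # something to find
print(f"false positive rate: {false_positive_rate:.3f}")
print(f"power:               {power:.3f}")
```

With no true effect, the detection rate should hover near the chosen significance level of 0.05. With a true effect of half a standard deviation and 64 observations per group, it should come out at roughly 0.8, which is the power of this particular design.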

These values are connected to our study by way of the number of data points we have (generally, more data points give stronger results), the effect size we're trying to detect (it is easier to detect a large effect than a small one ... but large effects are rare, so we may have to design around finding small effects), and the testing mechanisms we are using.
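
The relationship between effect size and required data can be sketched with the standard normal-approximation formula for a two-sample comparison, $n \approx 2\,\big((z_{1-\alpha/2} + z_{\text{power}})/d\big)^2$ per group, where $d$ is the effect size in standard-deviation units. The function name and default values below are illustrative assumptions, and an exact t-test calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect_size is the difference in means divided by the standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

Under this approximation, a large effect (0.8) needs about 25 observations per group, a medium one (0.5) about 63, and a small one (0.2) about 393. Shrinking the effect by a given factor grows the required sample by the square of that factor, which is why designing around small effects gets expensive quickly.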

Definitely consult statisticians, books, or online resources if you need to work out the analyses yourself -- don't just assume whatever pops out of statsmodels or scikit-learn is your path to business success. 

The important thing to remember is that there's no such thing as a free lunch -- if your analyses don't take all of these elements into account, you are likely to run into problems... and those problems are tricky because they won't be obvious: your code won't crash, and you won't find yourself trying to divide by zero.

Instead, you're most likely to find something that doesn't exist ... something that you found because humans are great at finding patterns that aren't really there ... something that Google Chief Decision Scientist Cassie Kozyrkov calls...

<img src='images/bread-elvis.png' width=300>

__Elvis in a slice of toast__

There are many easy ways to accidentally find Elvis in a slice of toast.

These include...

* A/B tests where results are continuously looked at and the test is stopped when a result seems to appear, especially if it's the result the boss wants to find
    * There *are* legitimate ways to design and execute A/B tests where you can analyze them incrementally, but you have to be aware that's what you're doing
* Running lots of experiments at a set significance level and "discovering" a significant result
    * This is a form of the multiple comparisons problem, often lumped in with "p-hacking", and it is often done unintentionally
    * If you expect to find positive results due purely to chance 1 time in, say, 20 and then you perform 100 experiments ... it shouldn't be surprising that you make a discovery ... but perhaps that discovery shouldn't be relied upon without further work
* Looking at pairs of variables for correlation out of a large pool of variables
    * It is almost certain that some pairs will show a high correlation even if the data is totally random ... because there are so many possible pairs
* Overfitting an unseen evaluation dataset by using an "oracle" to guide your modeling
    * E.g., consider a sequence of models tuned based on their Kaggle leaderboard scores -- i.e., based on their performance against a secret test set that only Kaggle has
        * at first glance, it might seem impossible to overfit a dataset you never train on or even see
        * but consider that the leaderboard score provides an error signal which guides the stepwise evolution of the model, so each round of tuning fits the model a little more closely to that hidden test set
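
The first two pitfalls in the list above (peeking at A/B tests and running many experiments) can be sketched in a few lines. This is a toy simulation under stated assumptions: independent experiments, normally distributed data with a known standard deviation of 1, a plain z-test, and made-up batch sizes and numbers of looks. The function names are hypothetical.

```python
import random

def chance_of_any_false_positive(num_experiments, alpha=0.05):
    """Probability that at least one of several independent null experiments 'succeeds'."""
    return 1 - (1 - alpha) ** num_experiments

def peeking_false_positive_rate(looks=10, batch=20, trials=1000, seed=0):
    """Simulate A/A tests (no real difference) that stop at the first |z| > 1.96."""
    rng = random.Random(seed)
    stopped_early = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        n = 0
        for _ in range(looks):
            # Collect one more batch per arm; both arms share the same distribution.
            sum_a += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            sum_b += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
            if abs(z) > 1.96:  # "significant" at roughly the 5% level
                stopped_early += 1
                break
    return stopped_early / trials

any_hit = chance_of_any_false_positive(100)
peeking_rate = peeking_false_positive_rate()
print(f"100 null experiments, chance of at least one 'discovery': {any_hit:.3f}")
print(f"A/A test with 10 peeks, false positive rate: {peeking_rate:.3f}")
```

With 100 independent experiments at a 1-in-20 significance level, the chance of at least one spurious "discovery" is about 0.994. In the peeking simulation there is never a real difference between the arms, yet stopping at the first significant-looking result should push the false positive rate well above the nominal 0.05.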