François Briatte edited this page May 28, 2023
I will die on those hills
-
Report as few precision digits as necessary.
Most of the time, that's one. There are reasonable counter-views.
Nathaniel Beck never published his paper in support of his leanout Stata package, but everything in there was correct and warranted. The leanout package strips regression results to their bare bones: 1-digit estimates, N, and RMSE. Damn right.
-
RMSE is useful. R-squared is not.
And RMSE should be renamed ‘penalized average error’ for clarity.
-
Anything that looks like stepwise regression is a bad idea.
It's not anyone's fault regularization came much later.
-
No synthetic data, no toy data (e.g. Iris).
- 100% real-world data.
- Try to study just one dimension (counts, timelines).
- Try to study too many dimensions (text, surveys).
-
Write code for humans, write data for computers (Vince Buffalo).
- Literate programming, style guidelines.
- Clean Code concepts (e.g. JavaScript).
- Machine-readable data formats.
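A minimal sketch of the RMSE point above, in Python (this page contains no code of its own, so the language is an arbitrary choice). The function names, the `n_params` argument, and the four data points are illustrative assumptions, made up only to check the arithmetic; none of this is taken from leanout. Setting `n_params` to the number of estimated coefficients divides by the residual degrees of freedom, which is one way to read the "penalized" in "penalized average error".

```python
import math


def rmse(observed, predicted, n_params=0):
    """Root mean squared error, in the units of the outcome.

    n_params is a hypothetical argument: the number of estimated
    coefficients, subtracted from n to penalize model size.
    """
    squared_errors = [(y - p) ** 2 for y, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / (len(observed) - n_params))


def r_squared(observed, predicted):
    """Share of outcome variance accounted for by the predictions (unitless)."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Made-up values, used only to check the arithmetic.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]

# Report one digit, as recommended above.
print(f"RMSE: {rmse(observed, predicted, n_params=2):.1f}")  # RMSE: 1.4
print(f"R-squared: {r_squared(observed, predicted):.1f}")    # R-squared: 0.8
```

RMSE reads in the units of the outcome, so "predictions are off by about 1.4" still means something after rounding. R-squared is unitless and says nothing about how far off a typical prediction is.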
Not listed in the syllabus:
- Hardin, J. et al. 2015. Data Science in Statistics Curricula: Preparing Students to ‘Think with Data’. The American Statistician.
- Imai, K. 2015. How to Teach Quantitative Methods to Social Science Students: The Princeton Experience. Q-Step Programme Symposium, Warwick.
- Schindler, M. 2014. In Defense of Imprecision: Why Traditional Approaches to Data Visualization are Changing. Boston Data Festival.
- Most people understand the difference between percentages and percentage points, but cannot explain it.
- Most people do not know how to compute a weighted mean, even when they understand the concept.
- Because so many biases affect human probabilistic reasoning, everyone gets some probabilities wrong.
- Rare independent events can occur twice in a row
- Absolute risk ≠ Relative risk, and…
- Relative risk ≠ Odds ratio
- Most people naturally understand growth rates.
- Most people naturally understand exponential growth (sometimes by confusing it with power laws).
- Understanding of linear v. nonlinear relationships: squared, exponential, asymptotic (square root).
- Almost everyone understands fractions, i.e. ratios, i.e. normalized measures.
- Compare: GDP, GDP/capita, GDP/household/week/year
- 95% of your readers will stick with simple descriptives—only the last 5% might look at the model.
- Do not use percentages on small samples.
- Do not use high levels of precision: zero or one decimal will fit most situations.
- Units: natural, indices, fractions (percentages), quantiles.
- Sometimes, the answer is not in the data (John Tukey, cited by Edward Tufte).
Remember that when building/plotting stuff.
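Several of the distinctions in the list above (percentages v. percentage points, weighted means, absolute v. relative risk v. odds ratios) are easier to explain with the arithmetic written out. The Python sketch below is one possible way to do that; the function names and the example figures are illustrative assumptions, not part of the course material.

```python
def percentage_point_change(old_pct, new_pct):
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


def percent_change(old_pct, new_pct):
    """Relative change between two percentages, in percent."""
    return (new_pct - old_pct) / old_pct * 100


def weighted_mean(values, weights):
    """Mean of values, each counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def risk_measures(events_a, n_a, events_b, n_b):
    """Compare group A to group B from event counts and group sizes."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    odds_a = risk_a / (1 - risk_a)
    odds_b = risk_b / (1 - risk_b)
    return {
        "risk_difference": risk_a - risk_b,  # absolute difference in risk
        "relative_risk": risk_a / risk_b,    # ratio of risks
        "odds_ratio": odds_a / odds_b,       # ratio of odds
    }


# A rate moving from 10% to 15%: +5 points, but +50 percent.
print(f"{percentage_point_change(10, 15):.1f} points")  # 5.0 points
print(f"{percent_change(10, 15):.1f} percent")          # 50.0 percent

# Two group means of 10 and 20, the second group three times larger.
print(f"{weighted_mean([10, 20], [1, 3]):.1f}")          # 17.5

# 30 events out of 100 in group A, 10 out of 100 in group B.
risks = risk_measures(30, 100, 10, 100)
print(f"Risk difference: {risks['risk_difference'] * 100:.1f} points")  # 20.0 points
print(f"Relative risk: {risks['relative_risk']:.1f}")                   # 3.0
print(f"Odds ratio: {risks['odds_ratio']:.1f}")                         # 3.9
```

With these made-up counts, the same comparison yields three different numbers (20 points, 3.0, 3.9), which is the reason the three measures should not be used interchangeably.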
Recommended:
- Nick J. Tierney, "Getting your head around numbers and stats" (for basic stats) (new link, assigned to Session 1)
- Edward R. Tufte (for all dataviz rules)
-
Epistemic: selected w/r/t rules of interpretation.
- Fairness, transparency
- Organised skepticism, detachment
- Openness (universalism)
-
Technical: selected w/r/t method of production.
- Effectiveness (efficiency)
- Experience (local knowledge)
- Expertise (sophistication)
-
Aesthetic: selected w/r/t elegance.
- "Authenticity"
- Introspective, "resonates with inner stuff"
- Fame
-
Normative: selected w/r/t desirable goals.
- Sollen: value-laden, moral
- Justice, rightness
- Engagement
Inspiration: Patrick Thaddeus Jackson