

Matthew L. Jones edited this page Jan 29, 2019 · 46 revisions

Data: Past, Present, and Future


Matthew L. Jones (A&S) and Chris Wiggins (SEAS)

Aaron Plasek (TA), Yumou (Will) Wei (TA), Susannah Glickman (Grader)

Course description

Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students to critical thinking and practice in understanding both how we got here and the future we are now building together as scholars, scientists, and citizens.

The intellectual content of the class will comprise

  • the history of human use of data;

  • functional literacy in how data are used to reveal insight and support decisions;

  • critical literacy in investigating how data and data-powered algorithms shape, constrain, and manipulate our commercial, civic, and personal transactions and experiences; and

  • rhetorical literacy in how exploration and analysis of data have become part of our logic and rhetoric of communication and persuasion, especially including visual rhetoric.

While introducing students to recent trends in the computational exploration of data, the course will survey the key concepts of "small data" statistics.


All students will be required to:

  • participate in all discussions and laboratory hours (20%)

  • respond to readings each week on Slack (5%)

  • write one 750-word op-ed on the ethics and practice of using data by midterm (15%)

Students will be assigned, based on their background, to one of two tracks. In short: students with less technical background will do more technical work, including problem sets; students with more technical background will do more humanistic work, including longer writing assignments.

a) more technical background track (60%)

  • pursue a semester-long project culminating in a 15 pp paper and any associated code

  • complete 3 problem sets

  • give a short final presentation on the paper

b) more humanistic background track (60%)

  • write a 10 pp paper on a topic of their choice

  • complete 5 problem sets, which will involve both computational and writing work

  • give a short final presentation on the paper


Tentative and subject to change

Week 1: Intro



Week 2: what is at stake?


Wallach, Hanna. Big data, machine learning, and the social sciences: Fairness, accountability, and transparency. Medium. Retrieved December 20, 2014, from

boyd, danah, and Kate Crawford. 2012. "Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon." Information, Communication & Society 15.5: 662-679.

Tufekci, Zeynep. "Engineering the public: Big data, surveillance and computational politics." First Monday 19, no. 7 (2014).

Lab: Intro to Python & Data Provenance

  • A quick and painless introduction to Python
  • Group Activity: Data Provenance

Week 3: society, community, and counting


Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge: Cambridge University Press, 1989, Section 1.6 ("Risk and Insurance")

Desrosières, Alain. The politics of large numbers: A history of statistical reasoning. Harvard University Press, 2002. (Ch 1)

Igo, Sarah Elizabeth. The Averaged American: Surveys, citizens, and the making of a mass public. Harvard University Press, 2007. (Introduction)


  • EDA with found data
  • Data exploration with graphics

Week 4: Social Physics


Quetelet, Adolphe. “Preface” and “Introductory,” A Treatise on Man (1842).

Porter, Theodore. The Rise of Statistical Thinking, 1820-1900 (Princeton, N.J.: Princeton University Press, 1986), chap. 2 (40-70) + 100-109.


  • Exploratory Data Analysis: fun with pandas
  • Doing snazzy stuff with groupby
  • Box, Scatter, and other basic visualizations

Week 5: Quantitative racism: the Victorian Program


Desrosières, Alain. "Correlation and the Realism of Causes," in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998, ch 4.

Galton, Francis. “Typical Laws of Heredity,” Royal Institution of Great Britain. Notices of the Proceedings at the Meetings of the Members 8 (February 16, 1877): 282ff.

Stephen J. Gould, The mismeasure of man. WW Norton & Company, 1996, ch. 3


Student (W. S. Gosset), "The Probable Error of a Mean," Biometrika, 6 (1908), 1-25.

Gillham, Nicholas. "Sir Francis Galton and the Birth of Eugenics." Ann. Rev. Genet. 35 (2001): 83-101.


  • describing and predicting: survival curves, smoothing, inventing error, and regression
  • death curves since Halley
  • Survivor curves and making smoothing "real" at the dawn of the 20th Century
  • Least Squares and Regression

Week 6: Intelligence and Policy; descriptive and prescriptive uses of data


Spearman, Charles. "'General Intelligence,' objectively determined and measured." The American Journal of Psychology 15, no. 2 (1904): 201-292, read pp. 272-277.

Gould, Stephen Jay. The mismeasure of man. WW Norton & Company, 1996. ONLY pp: 280-2, 286-288, 291-302, 347-350.

Freedman, David A. "Linear statistical models for causation: A critical review." Encyclopedia of statistics in behavioral science (2005).


  • simulating Galton
  • regression
  • Galton's plot
  • selective pressure
  • median selection

Week 7: Mathematics baptizes data


Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge: Cambridge University Press, 1989, Sections 3.4, 3.6, 3.7

Fisher, R. A. (1955), "Statistical Methods and Scientific Induction". Journal of The Royal Statistical Society Series B, 17: 69-78.

Neyman, J. (1956), "Note on an Article by Sir Ronald Fisher," Journal of the Royal Statistical Society. Series B, 18: 288-294.

Pearson, E. S. (1955), "Statistical Concepts in Their Relation to Reality," Journal of the Royal Statistical Society Series B, 17: 204-207.


  • Yule: causes of changes in pauperism in England, 1899
  • Least Squares and Regression
  • multivariate regression

Week 8: Data, the dawn of computation, and war


McGrayne, S. B. "Bayes goes to War", in The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant from Two Centuries of Controversy. Yale University Press, 2011, ch. 4 and short ch. 5

Zabell, S. "Statistics at Bletchley Park", in Field, J. V, James A Reeds, and Whitfield Diffie. Breaking Teleprinter Ciphers at Bletchley Park: An Edition of I. J. Good, D. Michie and G. Timms: General Report on Tunny with Emphasis on Statistical Methods (1945). Wiley-Blackwell, 2015, pp lxxv to xci, xcvii-ci

Abbate, J. "Breaking Codes and Finding Trajectories: Women at the Dawn of the Digital Age", Recoding Gender: Women’s Changing Participation in Computing. MIT Press, 2012. 14-16, 21-22, bottom of 26-29, 33-35.


  • Bayes factors
  • Role of Bayes in codebreaking

Week 9: The (first) birth and death of AI

Turing, Alan M. “Computing Machinery and Intelligence.” Mind 59, no. 236 (1950): 433–60.

McCarthy, John, M. L. Minsky, N. Rochester, and Claude E. Shannon. “Proposal for the 1956 Dartmouth Summer Research Project on Artificial Intelligence,” August 31, 1955.

1973: The ‘Lighthill Report’

OPTIONAL: the highly entertaining “movie” version of the Lighthill report



  • AI without ML (artificial intelligence without machine learning)
    • Chatbots
    • Expert Systems: the example of Mycin
  • The birth, death, and rebirth of the Perceptron

Week 10: Data at The Labs


Tukey, John W. "The future of data analysis." The annals of mathematical statistics 33, no. 1 (1962). Read: pp. 2-14 (end at "II. Spotty Data") and pp. 60-64 (start at "VIII. How shall we Proceed?")

Tukey, John W. Exploratory data analysis. 2 vol. 1977; read "Preface", pp. v-ix, and Sec. 1A and 1B, pp. 1-7

Chambers, John M. "Greater or lesser statistics: a choice for future research." Statistics and Computing 3, no. 4 (1993): 182-184.

Mallows, Colin. "Tukey's Paper After 40 Years." Technometrics 48, no. 3 (2006): only pp. 319-325.

OPTIONAL: Diaconis, Persi. "Theories of data analysis: from magical thinking through classical statistics." In Exploring Data Tables, Trends and Shapes. Edited by D. Hoaglin, F. Mosteller, and J. Tukey. 1-36. New York: Wiley, 1985.


  • machine learning:
    • supervised learning
    • unsupervised learning
    • reinforcement learning

Week 11: Machine Learning and the AI Renaissance


Simon, Herbert A. "Why should machines learn?." In Machine Learning, Volume I, pp. 25-37. 1983.

Breiman, Leo. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)." Statistical science 16, no. 3 (2001): 199-231.

Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260.

Lewis-Kraus, Gideon. "The great AI awakening." The New York Times Magazine (2016): 1-37. available online via

OPTIONAL: Mitchell, Tom Michael. The discipline of machine learning. Vol. 9. Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2006.


  • Decision trees
  • Random Forests

Week 12: Data Science as a Trading Zone


Galison, Peter. "Computer simulations and the trading zone." The disunity of science: Boundaries, contexts, and power (Stanford, 1996): 118-157; read only 118-121, 151-157.

Cleveland, William S. "Data science: an action plan for expanding the technical areas of the field of statistics." International statistical review 69, no. 1 (2001): 21-26.

Jones, Matthew L. "Querying the Archive: Data Mining from Apriori to PageRank." Science in the Archives: Pasts, Presents, Futures (Chicago, 2017): 311.

Hammerbacher, Jeff. "Information platforms and the rise of the data scientist." Beautiful Data (O'Reilly, 2009): 73-84.


Donoho, David. "50 Years of Data Science." Journal of Computational and Graphical Statistics 26, no. 4 (2017): 745-766.

Luhn, Hans Peter. "A business intelligence system." IBM Journal of Research and Development 2, no. 4 (1958): 314-319.


  • Databases and recommendation engines
  • The Netflix prize
    • Pivots and counting
    • Groupby and "modeling"

Week 13: Ethics, Privacy, and Anonymity


Salganik, M. J. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2017. Chapter 6, 6.1-6.8.


Zook, Matthew, Solon Barocas, danah boyd, Kate Crawford, Emily Keller, Seeta Peña Gangadharan, Alyssa Goodman, et al. “Ten Simple Rules for Responsible Big Data Research” Edited by Fran Lewitter. PLOS Computational Biology 13, no. 3 (March 30, 2017): e1005399.


  • Privacy and Anonymity
    • Constructing our own 'Database of Ruin'
  • Personally identifiable information
    • k-anonymity
    • differential privacy

Week 14: People, Products, and Platforms


Richard Serra and Carlota Fay Schoolman, "Television Delivers People" (1973).

Goldhaber, Michael H. “The attention economy and the Net.” First Monday, 2.4 (1997).

Janeway, William H. Doing Capitalism in the Innovation Economy: Markets, Speculation and the State, 2nd ed., introduction.

Grimmelmann, James. "The Platform is the Message." (2018).


  • Fairness
  • Accountability
  • Transparency