Data: Past, Present, and Future
Matthew L. Jones (A&S) and Chris Wiggins (SEAS)
Aaron Plasek (TA) Yumou (Will) Wei (TA) Susannah Glickman (Grader)
Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students to critical thinking and practice in understanding how we got here and the future we are now building together as scholars, scientists, and citizens.
The intellectual content of the class will comprise:
the history of human use of data;
functional literacy in how data are used to reveal insight and support decisions;
critical literacy in investigating how data and data-powered algorithms shape, constrain, and manipulate our commercial, civic, and personal transactions and experiences; and
rhetorical literacy in how exploration and analysis of data have become part of our logic and rhetoric of communication and persuasion, especially including visual rhetoric.
While introducing students to recent trends in the computational exploration of data, the course will survey the key concepts of "small data" statistics.
All students will be required to:
participate in all discussions and laboratory hours (20%)
respond to readings each week on Slack (5%)
write one 750-word op-ed on the ethics and practice of using data by midterm (15%)
Students will be assigned to one of two tracks based on their background. In short: students with less technical background will do more technical work, including problem sets; students with more technical background will do more humanistic work, including longer writing assignments.
a) more technical background track (60%)
pursue a semester-long project culminating in a 15 pp paper and any associated code
complete 3 problem sets
give a short final presentation on the paper
b) more humanistic background track (60%)
write a 10 pp paper on a topic of their choice
complete 5 problem sets, which will involve both computational work and writing
give a short final presentation on the paper
Tentative and subject to change
Week 1: Intro
- Connecting to Codio; basic orientation
- Basic python and jupyter orientation
- data provenance (look through data sets at https://archive.ics.uci.edu/ml/index.php if you want to get a head start)
Week 2: what is at stake?
Wallach, Hanna. Big data, machine learning, and the social sciences: Fairness, accountability, and transparency. Medium. Retrieved December 20, 2014, from https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d
boyd, danah, and Kate Crawford. 2012. "Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon." Information, Communication & Society 15.5: 662-679. http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878
Tufekci, Zeynep. "Engineering the public: Big data, surveillance and computational politics." First Monday 19, no. 7 (2014).
Lab: Intro to Python & Data Provenance
- A quick and painless introduction to Python
- Group Activity: Data Provenance
Week 3: society, community, and counting
Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge: Cambridge University Press, 1989, Section 1.6 ("Risk and Insurance")
Desrosières, Alain. The politics of large numbers: A history of statistical reasoning. Harvard University Press, 2002. (Ch 1)
Igo, Sarah Elizabeth. The Averaged American: Surveys, citizens, and the making of a mass public. Harvard University Press, 2007. (Introduction)
- EDA with found data
- Data exploration with graphics
Week 4: Social Physics
Quetelet, Adolphe. “Preface” and “Introductory,” A Treatise on Man (1842) (https://data-ppf.slack.com/files/U3SJU2P6W/F90N4JU56/quetelet_pref_intro.pdf)
Porter, Theodore. The Rise of Statistical Thinking, 1820-1900 (Princeton, N.J.: Princeton University Press, 1986), chap. 2 (40-70) + 100-109.
- Exploratory Data Analysis: fun with pandas
- Doing snazzy stuff with groupby
- Box, Scatter, and other basic visualizations
Week 5: Quantitative racism: the Victorian Program
Desrosières, Alain. "Correlation and the Realism of Causes," in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998, ch 4.
Galton, Francis. “Typical Laws of Heredity,” Royal Institution of Great Britain. Notices of the Proceedings at the Meetings of the Members 8 (February 16, 1877): 282ff.
Gould, Stephen Jay. The mismeasure of man. WW Norton & Company, 1996, ch. 3
Student (W. S. Gosset), "The Probable Error of a Mean," Biometrika, 6 (1908), 1-25. (https://www.jstor.org/stable/2331554)
Gillham, Nicholas. "Sir Francis Galton and the Birth of Eugenics." Ann. Rev. Genet. 35 (2001): 83-101.
- describing and predicting: survival curves, smoothing, inventing error, and regression
- death curves since Halley
- Survivor curves and making smoothing "real" at the dawn of the 20th Century
- Least Squares and Regression
Week 6: Intelligence and Policy; descriptive and prescriptive uses of data
Spearman, Charles. "'General Intelligence,' objectively determined and measured." The American Journal of Psychology 15, no. 2 (1904): 201-292, read pp. 272-277 (available at https://web.archive.org/web/20140407100036/http://www.psych.umn.edu/faculty/waller/classes/FA2010/Readings/Spearman1904.pdf)
Gould, Stephen Jay. The mismeasure of man. WW Norton & Company, 1996. ONLY pp. 280-282, 286-288, 291-302, 347-350.
Freedman, David A. "Linear statistical models for causation: A critical review." Encyclopedia of statistics in behavioral science (2005). ( available at https://www.wiley.com/legacy/wileychi/eosbs/pdfs/bsa598.pdf )
- simulating Galton
- Galton's plot
- selective pressure
- median selection
Week 7: Mathematics baptizes data
Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge: Cambridge University Press, 1989, Sections 3.4, 3.6, 3.7
Fisher, R. A. (1955), "Statistical Methods and Scientific Induction". Journal of The Royal Statistical Society Series B, 17: 69-78.
Neyman, J. (1956), "Note on an Article by Sir Ronald Fisher," Journal of the Royal Statistical Society. Series B, 18: 288-294.
Pearson, E. S. (1955), "Statistical Concepts in Their Relation to Reality," Journal of the Royal Statistical Society Series B, 17: 204-207.
- Yule: causes of changes in pauperism in England, 1899
- Least Squares and Regression
- multivariate regression
Week 8: Data, the dawn of computation, and war
McGrayne, S. B. "Bayes goes to War", in The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant from Two Centuries of Controversy. Yale University Press, 2011, ch. 4 and short Ch. 5
Zabell, S. "Statistics at Bletchley Park", in Field, J. V, James A Reeds, and Whitfield Diffie. Breaking Teleprinter Ciphers at Bletchley Park: An Edition of I. J. Good, D. Michie and G. Timms: General Report on Tunny with Emphasis on Statistical Methods (1945). Wiley-Blackwell, 2015, pp lxxv to xci, xcvii-ci
Abbate, J. "Breaking Codes and Finding Trajectories: Women at the Dawn of the Digital Age", Recoding Gender: Women’s Changing Participation in Computing. MIT Press, 2012. 14-16, 21-22, bottom of 26-29, 33-35.
- Bayes factors
- Role of Bayes in codebreaking
Week 9: The (first) birth and death of AI
Turing, Alan M. “Computing Machinery and Intelligence.” Mind 59, no. 236 (1950): 433–60.
McCarthy, John, M. L. Minsky, N. Rochester, and Claude E. Shannon. “Proposal for the 1956 Dartmouth Summer Research Project on Artificial Intelligence,” August 31, 1955. http://www-formal.stanford.edu/jmc/history/dartmouth.pdf.
1973: The ‘Lighthill Report’
OPTIONAL: the highly entertaining “movie” version of the Lighthill report: (http://www.aiai.ed.ac.uk/events/lighthill1973/1973-BBC-Lighthill-Controversy.mov)
- AI without ML (artificial intelligence without machine learning)
- Expert Systems: the example of Mycin
- The birth, death, and rebirth of the Perceptron
Week 10: Data at The Labs
Tukey, John W. "The future of data analysis." The annals of mathematical statistics 33, no. 1 (1962). Read: pp. 2-14 (end at "II. Spotty Data") and pp. 60-64 (start at "VIII. How shall we Proceed?")
Tukey, John W. Exploratory data analysis. 2 vol. 1977; read "Preface", pp. v-ix, and Sec. 1A and 1B, pp. 1-7
Chambers, John M. "Greater or lesser statistics: a choice for future research." Statistics and Computing 3, no. 4 (1993): 182-184.
Mallows, Colin. "Tukey's Paper After 40 Years." Technometrics 48, no. 3 (2006): only pp. 319-325.
OPTIONAL: Diaconis, Persi. "Theories of data analysis: from magical thinking through classical statistics." In Exploring Data Tables, Trends and Shapes. Edited by D. Hoaglin, F. Mosteller, and J. Tukey. 1-36. New York: Wiley, 1985.
- machine learning:
- supervised learning
- unsupervised learning
- reinforcement learning
Week 11: Machine Learning and the AI Renaissance
Simon, Herbert A. "Why should machines learn?." In Machine Learning, Volume I, pp. 25-37. 1983.
Breiman, Leo. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)." Statistical science 16, no. 3 (2001): 199-231.
Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260.
Lewis-Kraus, Gideon. "The great AI awakening." The New York Times Magazine (2016): 1-37. available online via http://publicservicesalliance.org/wp-content/uploads/2016/12/The-Great-A.I.-Awakening-The-New-York-Times.pdf
OPTIONAL: Mitchell, Tom Michael. The discipline of machine learning. Vol. 9. Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2006.
- Decision trees
- Random Forests
Week 12: Data Science as a Trading Zone
Galison, Peter. "Computer simulations and the trading zone." The disunity of science: Boundaries, contexts, and power (Stanford, 1996): 118-157; read only 118-121, 151-157.
Cleveland, William S. "Data science: an action plan for expanding the technical areas of the field of statistics." International statistical review 69, no. 1 (2001): 21-26. (http://www.jstor.org/stable/1403527)
Jones, Matthew L. "Querying the Archive: Data Mining from Apriori to PageRank." Science in the Archives: Pasts, Presents, Futures (Chicago, 2017): 311.
Hammerbacher, Jeff. "Information platforms and the rise of the data scientist." Beautiful Data (O'Reilly, 2009): 73-84.
Donoho, David. "50 Years of Data Science." Journal of Computational and Graphical Statistics 26, no. 4 (2017): 745-766.
Luhn, Hans Peter. "A business intelligence system." IBM Journal of Research and Development 2, no. 4 (1958): 314-319.
- Databases and recommendation engines
- The Netflix prize
- Pivots and counting
- Groupby and "modeling"
Week 13: Ethics, Privacy, and Anonymity
Salganik, M. J. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2017. Chapter 6, 6.1-6.8.
Zook, Matthew, Solon Barocas, danah boyd, Kate Crawford, Emily Keller, Seeta Peña Gangadharan, Alyssa Goodman, et al. “Ten Simple Rules for Responsible Big Data Research” Edited by Fran Lewitter. PLOS Computational Biology 13, no. 3 (March 30, 2017): e1005399. https://doi.org/10.1371/journal.pcbi.1005399. (http://journals.plos.org/ploscompbiol/article/comments?id=10.1371/journal.pcbi.1005399)
- Privacy and Anonymity
- Constructing our own 'Database of Ruin'
- Personally identifiable information
- differential privacy
Week 14: People, Products, and Platforms
Richard Serra and Carlota Fay Schoolman, "Television Delivers People" (1973): (http://www.vdb.org/titles/television-delivers-people)
Goldhaber, Michael H. “The attention economy and the Net.” First Monday, 2.4 (1997) (http://firstmonday.org/ojs/index.php/fm/article/view/519/440)
Janeway, William H. Doing Capitalism in the Innovation Economy: Markets, Speculation and the State, 2nd ed., introduction
Grimmelmann, James. "The Platform is the Message." (2018).