Class meeting: Mondays and Thursdays, 11:45am – 1:25pm
Instructor: David Smith (Office hours: TBA; WVH 356 or via Zoom)
A common mode for understanding artificial intelligence systems, from popular fiction to textbooks in computer science, has been the metaphor of an agent that perceives, forms beliefs about, and intervenes in the world. In the past year, some scholars have instead framed large pretrained language and vision models as cultural technologies (Gopnik; Farrell and Shalizi, 2023). Other researchers have pointed out that the builders of large AI models must take on some curatorial tasks in order to be successful and should learn from archival practice (Jo and Gebru, 2020).
In this seminar, we will read and discuss papers that address large language and vision models as tools for investigating human language, history, and culture; that analyze and audit corpus creation for model training; and that explore and mitigate biases and gaps in the archives of the past. Students will take turns presenting papers, along with the relevant background material, and leading discussion. All students will write short reviews of the papers we read and will work on a research paper on a topic of their choice.
There are no official prerequisites; however, students are expected to have some background either in NLP, computer vision, or another area of machine learning, or in working computationally with large collections of text and images in the humanities or social sciences.
Each week, we will read about two papers on a common theme. The papers could be tied together by methodology (e.g., model or inference method), by subject matter, or by media or archive type.
The first few sets of papers are listed below; further readings will be added with input from seminar participants. General topics include:
- Computational models as archives
- Archival documentation for models and datasets, “collections as data”
- Text and Natural Language Processing
  - Literary and narrative archives
  - Documentary archives
- Vision
  - OCR and textual archives: e.g., manuscripts, typewritten records, government archives
  - OCR and visual archives: e.g., text found on images, maps, photographs
  - Image recognition for journalistic and documentary collections
  - Image recognition for art archives
  - Action recognition and audiovisual archives
- Sound
  - Speech recognition: oral history, radio archives
  - Sound classification: music and ambient sound
- Generative Models: Abundance and Loss
  - Missing data
  - Bias, error, and inference
  - Text correction and restoration
  - Image inpainting and video generation
  - Narrative generation
  - Critical fabulation
Readings scheduled so far are as follows:
January 8: Prolegomena: The Sociology of Information. We'll talk about the structure and organization of the course. We'll introduce some of its themes, drawing on these background readings.
- Generating and processing information: Part I of: Suzanne Briet. What Is Documentation? Translated by Ronald E. Day et al., Scarecrow Press, 2006. Original French text published in 1951.
- States, markets, and AI: Henry Farrell and Cosma Shalizi. Shoggoths amongst us. July 3, 2023.
January 11: Archival Perspectives on AI
- Eun Seo Jo and Timnit Gebru. Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, ACM, 2020, pp. 306–16. [Giulia]
- Meera Desai, Abigail Jacobs, and Dallas Card. An Archival Perspective on Pretraining Data. 2023. [Shijia]
- An example of documenting an LLM training set: Luca Soldaini. AI2 Dolma: 3 Trillion Token Open Corpus for Language Model Pretraining. August 18, 2023.
January 15: MLK Holiday (no class)
January 18: Cultural Technologies
- Eunice Yiu, Eliza Kosoy, and Alison Gopnik. Imitation versus Innovation: What Children Can Do That Large Language and Language-and-Vision Models Cannot (Yet)? 2023. [Sheridan]
- Chapter 2 of: Albert Lord. The Singer of Tales. Harvard University Press, 2nd edition 2000, 1st edition 1960. [David]
January 22: Cultural Evolution
- Levin Brinkmann et al. Machine Culture. Nature Human Behaviour, Nov. 2023, pp. 1–14. [Jaydeep]
- Alberto Acerbi and Joseph M. Stubbersfield. Large Language Models Show Human-like Content Biases in Transmission Chain Experiments. Proceedings of the National Academy of Sciences, vol. 120, no. 44, Oct. 2023, p. e2313790120. [Sheridan]
January 25: Head Canons
- Kent K. Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. In EMNLP 2023. [Chris]
- Lyra D'Souza and David Mimno. The Chatbot and the Canon: Poetry Memorization in LLMs. In CHR 2023. [Shijia]
January 29: Data Archaeology
- Bonnie Mak. Archaeology of a Digitization. Journal of the Association for Information Science and Technology, 65(8), Aug. 2014, pp. 1515–26. [Giulia]
- Benjamin Lee et al. The Newspaper Navigator Dataset: Extracting and Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America. arXiv:2005.01583, May 2020. [Luna]
- See also: Benjamin Lee. Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset. Digital Humanities Quarterly, vol. 15, no. 4, Dec. 2021.
February 1: Privacy and Secrecy
- Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are Large Pre-Trained Language Models Leaking Your Personal Information?. In Findings of EMNLP, 2022. [Jaydeep]
- Renato Rocha Souza, Flavio Codeco Coelho, Rohan Shah, and Matthew Connelly. Using Artificial Intelligence to Identify State Secrets. arXiv:1611.00356, November 2016. [David]
- See also: America's Most Redacted
February 5: Discuss paper ideas
February 8: The Flight to Quality
- Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, and Noah A. Smith. Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection. arXiv:2201.10474, January 2022. [Sheridan]
- Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, and Jesse Dodge. AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters. arXiv:2401.06408, January 2024. [Chris]
February 12: Textual Restoration
- Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, and Graham Neubig. Lexically Aware Semi-Supervised Learning for OCR Post-Correction. arXiv:2111.02622 [cs], Nov. 2021. [David]
- Nikolai Vogler, Jonathan Parkes Allen, Matthew Thomas Miller, Taylor Berg-Kirkpatrick. Lacuna Reconstruction: Self-Supervised Pre-Training for Low-Resource Historical Document Transcription. Findings of NAACL, pp. 206–16, 2022. [Jaydeep]
February 15: Auditing Image Datasets
- Abeba Birhane, Vinay Prabhu, Sanghyun Han, Vishnu Naresh Boddeti, and Alexandra Sasha Luccioni. Into the LAIONs Den: Investigating Hate in Multimodal Datasets. arXiv:2311.03449, 6 Nov. 2023. [Shijia]
- Ali Shirali and Moritz Hardt. What Makes ImageNet Look Unlike LAION? arXiv:2306.15769, 27 June 2023. [Giulia]
February 19: Presidents Day holiday (no class)
February 22: Image and Video Restoration
- Raphaela Heil and Fredrik Wahlberg. Restoration of Archival Images Using Neural Networks. In Digital Humanities in the Nordic and Baltic Countries Conference (DHNB), 2022. [David]
- Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, and Alberto Del Bimbo. Reference-Based Restoration of Digitized Analog Videotapes. arXiv:2310.14926, 3 Nov. 2023. [Chris]
February 26: Googling for Data
- Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting Training Data from Diffusion Models. arXiv:2301.13188, 30 Jan. 2023. [Sheridan]
- Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable Extraction of Training Data from (Production) Language Models. arXiv:2311.17035, Nov. 2023. [Jaydeep]
February 29: Sounds and Symbols
- Anne-Sophie Ghyselen, Anne Breitbarth, Melissa Farasyn, Jacques Van Keymeulen, and Arjan van Hessen. Clearing the Transcription Hurdle in Dialect Corpus Building: The Corpus of Southern Dutch Dialects as Case Study. Frontiers in Artificial Intelligence 3, 2020. [Chris]
- Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, and Martijn Wieling. Neural Representations for Modeling Variation in Speech. Journal of Phonetics 92 (May): 101137, 2022. [Shijia]
March 4: Spring break (no class)
March 7: Spring break (no class)
March 11: Draft project outlines (no class)
March 14: Present project outlines
March 18: Causal Employment
- Naoki Egami, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. How to Make Causal Inferences Using Texts. Science Advances, October 2022. [Shijia]
- Anjalie Field, Doron Kliger, Shuly Wintner, Jennifer Pan, Dan Jurafsky, and Yulia Tsvetkov. Framing and Agenda-Setting in Russian News: A Computational Analysis of Intricate Political Strategies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3570–80, 2018. [Chris]
March 21: Imaging Bias
- Angelina Wang, Alexander Liu, Ryan Zhang, Anat Kleiman, Leslie Kim, Dora Zhao, Iroha Shirai, Arvind Narayanan, and Olga Russakovsky. REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets. In ECCV, 2020. [David]
- Ruixiang Tang, Mengnan Du, Yuening Li, Zirui Liu, Na Zou, and Xia Hu. Mitigating Gender Bias in Captioning Systems. In WWW, 2021. [David]
March 25: Memorizing and Forgetting
- Ronen Eldan and Mark Russinovich. Who’s Harry Potter? Approximate Unlearning in LLMs. arXiv:2310.02238, Oct. 2023. [Jaydeep]
- Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing Concepts from Diffusion Models. arXiv, 2023. [Sheridan]
March 28: The Ecology of Manuscript Culture
- Mike Kestemont, Folgert Karsdorp, Elisabeth De Bruijn, Matthew Driscoll, Katarzyna A. Kapitan, Pádraig Ó Macháin, Daniel Sawyer, Remco Sleiderink, and Anne Chao. Forgotten Books: The Application of Unseen Species Models to the Survival of Culture. Science 375 (6582): 765–69, 2022. NB: main paper and supplementary materials. [Jaydeep]
- Jean-Baptiste Camps and Julien Randon-Furling. Lost Manuscripts and Extinct Texts: A Dynamic Model of Cultural Transmission. In Computational Humanities Research, 2022. [Shijia]
April 1: Gaps and Errors
- Daniel J. Hopkins and Gary King. A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science 54 (1): 229–47, 2010. [Chris]
- Naoki Egami, Musashi Hinck, Brandon M. Stewart, and Hanying Wei. Using Imperfect Surrogates for Downstream Inference: Design-Based Supervised Learning for Social Science Applications of Large Language Models. In NeurIPS, 2023. [David]
April 4: Critical Fabulation
- Saidiya Hartman. Venus in Two Acts. Small Axe 12 (2): 1–14, 2008.
- Nina Begus. Experimental Narratives: A Comparison of Human Crowdsourced Storytelling and AI Storytelling. arXiv, 2023. [Sheridan]
April 8: Total Eclipse of the Class (no class)
April 11: Cui Bono?
- Giovanni Colavizza, Tobias Blanke, Charles Jeurgens, and Julia Noordegraaf. Archives and AI: An Overview of Current Debates and Future Perspectives. Journal on Computing and Cultural Heritage 15 (1): 1–15, 2022. [Giulia]
- Mohammad Atari, Mona J. Xue, Peter S. Park, Damián Blasi, and Joseph Henrich. Which Humans? OSF preprint, Sept. 2023. [Sheridan]
April 15: Patriots Day holiday (no class)
April 18: Meta-Models
- Jacob Andreas. Language Models as Agent Models. In EMNLP, 2022. [Jaydeep]
- Harvey Lederman and Kyle Mahowald. Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs. arXiv, Feb. 2024. [Shijia]
April 22: Present projects
April 25: Submit papers (no class)