Replication Code Archive for "What's in a Letter? Using Natural Language Processing to Investigate Systematic Differences in Teacher Letters of Recommendation"
This repository is a public-facing archive of the analytic code used for the paper, "What's in a Letter? Using Natural Language Processing to Investigate Systematic Differences in Teacher Letters of Recommendation," by Brian Heseung Kim. An early manuscript of this paper is available as part of my dissertation; the abstract is reproduced below:
While scholars have already uncovered many ways that inequities can manifest across the postsecondary application portfolio – from standardized tests to advanced course-taking opportunities – we know almost nothing about whether teacher letters of recommendation also present differential barriers to students’ college aspirations. This blind spot is especially concerning given mounting evidence that recommendation letters in other contexts can contain biased language, that teachers can form biased perceptions of their students’ abilities, and that narrative application components more generally may contribute to racial discrimination in selective college admissions. In this paper, I conduct the first system-wide, large-scale text analysis of teacher recommendation letters in U.S. postsecondary applications using data from 1.6 million students, 540,000 teachers, and 800 postsecondary institutions. I use sophisticated natural language processing methods to examine the prevalence of potential inequities within these letters: whether students are described by teachers in systematically different ways across race and gender groups, even after accounting for salient confounding factors like student academic and extracurricular qualifications, teacher fixed effects, and institution fixed effects. I find evidence of salient linguistic differences in letters across gender, but less evidence for differences across race – except in the case of highly competitive admissions, where both Black and Asian students tend to have markedly different letters than White students. Moreover, these differences are generally most meaningful in terms of the topical content of letters; differences in terms of the positivity of letters are far smaller in relative magnitudes and thus are less likely to be perceptible in the actual reading of letters. Taken together, these findings have broad implications for the use of recommendation letters in selective admissions, affirmative action policies, and gender diversity in STEM fields.
Importantly, the data necessary for this project were provided in close collaboration with researchers at The Common Application, Inc., and are not publicly accessible given the inherent risks of personally identifiable information and reidentification within the data. As such, this codebase is provided for transparency and instruction purposes only, given that complete replication is infeasible. My pre-analysis plan is also included in this repository for transparency.
Code in this repository is named in sequence, and all analytic scripts are ultimately called from within 00_teacher_recs_main.R. The general flow of the analysis proceeds in the following steps:
- Split the full dataset of students, teachers, and letters into a training set and a testing set, per recommendations by Egami et al. (2018); a minimal sketch of this split appears after this list
- Create a cleaned dataset of covariates for student data
- Create a cleaned dataset of text for the recommendations data
- Train a topic model using the training letter data, searching across several values for the number of topics (see the topic-modeling sketch after this list)
- Apply the trained topic model to the testing letter data to measure the topical content of each letter
- Conduct sentiment analysis on the letter data using several different methods (see the paper appendix for more information) and evaluate the accuracy of each method against human judgment to select the measures actually used for analysis (a simplified sketch follows this list)
- Train a joint model for sentiment analysis using all the other methods attempted. Note that this model did not perform as well as the final model I selected, and so this code is largely deprecated.
- Create a final analytic dataset with all the letter measures and student covariates together
- Produce simple descriptive tables to describe the sample
- Run regression analyses as described in the paper (see the fixed-effects regression sketch after this list)
- Run supplementary word frequency analyses as described in the paper (see the word-frequency sketch after this list)
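To make the flow above more concrete, the sketches below illustrate several of these steps in miniature. They are not excerpts from the repository: object and variable names (e.g., letters_df, applicant_id, letter_text) are placeholders, and the actual scripts handle far more edge cases. First, a minimal version of the train/test split, sketched here at the student level so that no student's letters appear in both partitions (the actual split in the paper may be defined differently):

```r
# Minimal sketch of a student-level train/test split (per Egami et al., 2018).
# Object and column names here are illustrative, not the repository's.
library(dplyr)

set.seed(20210101)  # arbitrary seed for reproducibility

student_ids <- letters_df %>%
  distinct(applicant_id)

train_ids <- student_ids %>%
  slice_sample(prop = 0.5)  # send half of students to the training set

letters_train <- letters_df %>% semi_join(train_ids, by = "applicant_id")
letters_test  <- letters_df %>% anti_join(train_ids, by = "applicant_id")
```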
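Next, the topic-modeling steps. The stm package is one common R implementation of this train-then-score workflow; whether or not the repository uses stm specifically, the general pattern of searching across candidate numbers of topics on the training letters and then scoring the held-out letters looks roughly like this (the K values and chosen K are illustrative):

```r
# Sketch of topic-model training on the training letters and application to
# the test letters, using the stm package as one possible implementation.
library(stm)

processed <- textProcessor(letters_train$letter_text, metadata = letters_train)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Search across several candidate values for the number of topics
k_search <- searchK(prepped$documents, prepped$vocab, K = c(20, 40, 60, 80))

# Fit the model at the value of K chosen from the diagnostics above
fit <- stm(prepped$documents, prepped$vocab, K = 40, data = prepped$meta)

# Process the held-out (test) letters and align them to the trained vocabulary
test_processed <- textProcessor(letters_test$letter_text, metadata = letters_test)
test_aligned   <- alignCorpus(new = test_processed, old.vocab = fit$vocab)

# Estimate topic proportions for each test letter under the trained model
test_scores <- fitNewDocuments(model = fit, documents = test_aligned$documents)
```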
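For the sentiment step, the paper appendix compares several automated measures against human judgment. The sketch below shows just one simple lexicon-based measure (the Bing lexicon via tidytext) and a bare-bones validation against hypothetical hand-coded ratings (hand_coded_df, human_rating); it is not the specific set of methods evaluated in the paper:

```r
# Sketch of one lexicon-based sentiment measure and a simple validation
# against human ratings. The hand_coded_df object and all column names are
# illustrative only.
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")  # word-level positive/negative lexicon

letter_sentiment <- letters_train %>%
  unnest_tokens(word, letter_text) %>%
  inner_join(bing, by = "word") %>%
  group_by(letter_id) %>%
  summarise(share_positive = mean(sentiment == "positive"), .groups = "drop")

# Compare the automated measure to hand-coded ratings on a validation subset
validation <- letter_sentiment %>%
  inner_join(hand_coded_df, by = "letter_id")  # hand_coded_df: human ratings

cor(validation$share_positive, validation$human_rating, use = "complete.obs")
```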
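For the regression step, the paper describes models of letter measures on student characteristics with teacher and institution fixed effects. A sketch of that kind of specification using the fixest package, with placeholder variable names (topic_share_1, female, race, hs_gpa, n_activities, analytic_df), might look like the following; consult the paper for the specifications actually estimated:

```r
# Sketch of a fixed-effects regression of a letter measure on student
# characteristics. Variable names are placeholders, not the repository's.
library(fixest)

model <- feols(
  topic_share_1 ~ female + race + hs_gpa + n_activities |
    teacher_id + institution_id,   # teacher and institution fixed effects
  data    = analytic_df,
  cluster = ~ teacher_id           # one plausible clustering choice
)

summary(model)
```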
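Finally, a sketch of the kind of supplementary word-frequency comparison described in the last step, again with illustrative column names (student_gender, letter_text):

```r
# Sketch of a word-frequency comparison across student groups: count words
# in each group's letters and compute within-group relative frequencies.
library(dplyr)
library(tidytext)

word_counts <- letters_test %>%
  unnest_tokens(word, letter_text) %>%
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(student_gender, word, sort = TRUE)

word_freqs <- word_counts %>%
  group_by(student_gender) %>%
  mutate(prop = n / sum(n)) %>%            # relative frequency within group
  ungroup()
```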
You'll note that this codebase is complicated in three important ways. First, abiding by the recommendation of Egami et al. requires careful separation of the training and testing data, so I wrote the code such that each relevant script can quickly be applied to either dataset, avoiding duplicated code (as you'll see in 00_teacher_recs_main.R). Even so, many processes apply only to one dataset or the other (e.g., training the topic model), and keeping track of this delineation becomes complicated quickly. Second, the letter data and student sample are massive, so many of these tasks required distributed cluster computing to complete in a reasonable amount of time. The Slurm scripts I used to accomplish this are included here and are called in sequence from the main script. Because of this distributed computing, I also needed to split the data in a variety of ways at different points, which adds complexity to several scripts to ensure that filenames and data segmentation are handled properly. Lastly, the analysis spans both R and Python (and Slurm, if you count it), given the value of using Python to leverage the Hugging Face NLP libraries.
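As one illustration of the chunking pattern described above (again, not taken from the actual scripts), a worker script launched as a Slurm array job can read its chunk index from the environment and process only its slice of the letters; the file paths and the process_letters() helper are hypothetical:

```r
# Sketch of how a Slurm array task can process one chunk of the letter data.
# Each task reads its index from SLURM_ARRAY_TASK_ID and works on a
# correspondingly named input file. Paths and helpers are illustrative.
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))

in_file  <- sprintf("data/letters_chunk_%03d.rds", task_id)
out_file <- sprintf("output/letter_measures_chunk_%03d.rds", task_id)

letters_chunk <- readRDS(in_file)

# Run the expensive text-processing step on just this chunk
results <- process_letters(letters_chunk)  # process_letters(): placeholder

saveRDS(results, out_file)
```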
All to say, this is a highly complex codebase, and the analysis required exceptional care to conduct properly. While this codebase cannot be applied "as-is" to any other data, I am happy to provide whatever support I can to other researchers interested in applying similar methods or encountering similar technical challenges in their own analyses. I also recognize the potential for errors and issues of all kinds, despite my best efforts. Please don't hesitate to reach out!