Updated on 2024-02-19
Marshall A. Taylor and Dustin S. Stoltz
This repository contains all R code and data necessary to reproduce the plots in "Cultural Cartography with Word Embeddings," forthcoming in Poetics. You can read the article at Poetics [PDF][DOI][preprint on SocArXiv].
To reproduce the figures, you will need two sets of pretrained word embeddings: English fastText embeddings and the HistWords word2vec embeddings trained on the Corpus of Historical American English (COHA). You can get them from the linked sites. We have prepared an R package (text2map.pretrained) for downloading and loading the embeddings:
# install the package
remotes::install_gitlab("culturalcartography/text2map.pretrained")
# load the package
library(text2map.pretrained)
You only need to download the embedding models once per machine:
# download the fasttext embeddings
download_pretrained("vecs_fasttext300_commoncrawl")
# download the histwords embeddings
download_pretrained("vecs_sgns300_coha_histwords")
# load them into the session
data("vecs_fasttext300_commoncrawl")
data("vecs_sgns300_coha_histwords")
# assign to new (shorter) objects
ft.wv <- vecs_fasttext300_commoncrawl
hi_wv <- vecs_sgns300_coha_histwords

# then remove the originals to free memory
rm(vecs_fasttext300_commoncrawl)
rm(vecs_sgns300_coha_histwords)
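Once loaded, each object is a dense matrix with one 300-dimensional row vector per word. As a quick sanity check, you can compare two related words with text2vec's `sim2()` (the words used here are illustrative; we assume both are in the fastText vocabulary):

```r
library(text2vec)

# each embedding matrix has words as rownames and 300 columns
dim(ft.wv)

# cosine similarity between two related words should be high;
# sim2() expects matrices, so keep each row as a 1 x 300 matrix
sim2(
  x = ft.wv["king", , drop = FALSE],
  y = ft.wv["queen", , drop = FALSE],
  method = "cosine"
)
```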
Next, we use roughly 200,000 news articles from the All The News (ATN) dataset, covering 2013-2018. We converted the article texts into a document-term matrix (see the paper for our preprocessing procedure) with 197,814 documents, 11,271 unique terms, and 79,483,124 total tokens. The matrix is a smidge too big for GitHub, so we have hosted it on Dropbox.
You can also download it directly using R:
temp <- tempfile()
download.file(
  "https://www.dropbox.com/scl/fi/34ic6nw4bw8ku3tdodrf7/dtm_news.Rds?rlkey=8mwuaedqpiqe6zut11ryhvg3o&raw=1",
  destfile = temp,
  mode = "wb" # binary mode, required on Windows
)
news_dtm_99 <- readRDS(temp)

# sanity checks on the dimensions
nrow(news_dtm_99) == 197814
ncol(news_dtm_99) == 11271
Then load the metadata for the ATN corpus:
news_data <- readRDS("data/news_metadata.Rds")
Finally, below are a few extra pieces of data: our anchor lists for building semantic directions and some event data for figure 5.
df_anchors <- readRDS("data/anchor_lists.Rds")
df_events <- read.csv("data/events.csv")
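The anchor lists are used to build semantic directions with text2map's `get_direction()`. A minimal sketch of the workflow (the anchor pairs below are illustrative, not necessarily the pairs stored in `df_anchors`):

```r
library(text2map)

# illustrative anchor pairs; the first column is the "positive"
# pole and the second the "negative" pole of the direction
anchors <- data.frame(
  add      = c("rich", "wealthy"),
  subtract = c("poor", "impoverished")
)

# returns a 1 x 300 matrix: an "affluence" direction
# in the fastText embedding space
sem_dir <- get_direction(anchors = anchors, wv = ft.wv)
```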
We use the following packages:
library(tidyverse)
library(reshape2)
library(ggpubr)
library(text2vec)
library(text2map)
For the ggplot aesthetics, we use another package we've developed:
# this will change the color scheme to viridis
# and tweak the rest of the ggplot2 aesthetics
remotes::install_gitlab("culturalcartography/text2map.theme")
text2map.theme::set_theme()
Provided the above data frames, matrices, and packages are loaded, the R scripts in the Scripts folder will reproduce figures 1-6 in the paper. For a more detailed guide to using Concept Mover's Distance (used for figures 4-6), see the vignette in text2map.
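Concept Mover's Distance is implemented in text2map as `CMDist()`, which takes a document-term matrix and a word embedding matrix. A minimal sketch (the concept word here is chosen for illustration; see the paper for the concepts we actually analyze):

```r
library(text2map)

# closeness of each news article to the concept "immigration",
# using the ATN document-term matrix and fastText embeddings
doc_closeness <- CMDist(
  dtm = news_dtm_99,
  cw  = "immigration",
  wv  = ft.wv
)

# one row per document: document id plus its closeness score
head(doc_closeness)
```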
We have updated this repository because some of the links to the data were broken, and we have revised some of the code to accommodate dependency changes.