Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
746 lines (585 sloc) 34.9 KB
---
title: '[R] Election Hacking: Exploring 10 Million Tweets from the Russian Internet Research
Agency Dataset, Pt. 2 - Corpus & DFM'
author: Ilja / fubits
date: '2018-11-05'
slug: election-hacking-exploring-10-million-tweets-from-the-russian-internet-research-agency-dataset-pt-2-corpus-dfm
hardslug: 2018-10-27-election-hacking-exploring-10-million-tweets-from-the-russian-internet-research-agency-dataset-pt-2-nlproc-corpus-dfm
categories:
- Rstats
- DataViz
- Natural Language Processing NLP
tags:
- ggplot
- Twitter
- CSV
- stm
- Structural Topic Modelling
- Corpus
- Quanteda
output:
blogdown::html_page:
number_sections: yes
toc: yes
lastmod: "2018-10-27T14:40:59+02:00"
description: "This is the sceond part of my \"Internet Research Agency\" series. We'll encounter in-memory computing problems, address them with tweaks on Windows and Linux, and finally succeed at building a 5.7M x 1.5M big sparse Document-Feature Matrix."
abstract: "This is the sceond part of my \"Internet Research Agency\" series. We'll encounter in-memory computing problems, address them with tweaks on Windows and Linux, and finally succeed at building a 5.7M x 1.5M big sparse Document-Feature Matrix."
thumbnail: "/img/thumbs/IRA_Tweets_2.jpg"
rmdlink: yes
keywords: []
comment: no
autoCollapseToc: no
postMetaInFooter: no
hiddenFromHomePage: no
contentCopyright: no
reward: no
mathjax: no
mathjaxEnableSingleDollar: no
mathjaxEnableAutoNumber: no
hideHeaderAndFooter: no
flowchartDiagrams:
enable: no
options: ''
sequenceDiagrams:
enable: no
options: ''
---
![*(Weekly per-language activity of specific keywords)*](/img/IRA_dataset/IRA_keywords.png)
Now that we are done with [pre-processing the Internet Research Agency dataset](/post/election-hacking-exploring-2-million-tweets-from-the-russian-internet-research-agency-dataset-pt-1), we can unleash the full R powers and dive into automated content analysis. For a primer, see the following excellent intros and posts re: core NLP concepts, R packages, and tools:
* Emmanuel Ameisen: [How to solve 90% of NLP problems: a step-by-step guide](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e){target="_blank"} (this is a very beginner-friendly overview of key NLP concepts. Not R related, but a great TL;DR primer.)
* Julia Silge:
[The game is afoot! Topic modeling of Sherlock Holmes stories ](https://juliasilge.com/blog/sherlock-holmes-stm/){target="_blank"}
* Julia Silge and David Robinson:
[Text Mining with R - A Tidy Approach](https://www.tidytextmining.com/){target="_blank"}
* Cornelius Puschmann: [Automatisierte Inhaltsanalyse mit R (in German)](http://cbpuschmann.net/inhaltsanalyse-mit-r/index.html){target="_blank"}
* Kenneth Benoit et al.: [quanteda: An R package for the quantitative analysis of textual data](http://joss.theoj.org/papers/10.21105/joss.00774){target="_blank"}
* Benoit et al.? [Quanteda vignette: Getting Started with Quanteda](http://www.et.bs.ehu.es/cran/web/packages/quanteda/vignettes/quickstart.html){target="_blank"}
* Nicolas Merz: [Using the Manifesto Corpus with quanteda](https://manifesto-project.wzb.eu/down/tutorials/quanteda){target="_blank"}
* **Bonus**: [Carsten Schwemmer](https://twitter.com/c_schwemmer){target="_blank"} (who introduced me to the `Tidyverse` and automated content analysis with R, btw!) has build an `Shiny` app, which enables you to explore Structural Topic Models with a browser-based GUI: [stminsights - A Shiny Application for Structural Topic Models](https://cschwem2er.github.io/stminsights/){target="_blank"}
> Just a word of caution: Topic Modelling is a statistical method, and doing it the way I'm doing here - on the full 5.7M Tweets corpus OR even on a subset of 2M Tweets without any frequency-based feature reduction (i.e. with `quanteda::dfm_trim()`) - will take some time. A lot of time actually. I have a decent Gamer's Laptop with 16GB RAM, a quad-core Intel i7-7500 CPU and a SSD (SATA tho) - and computing the Topic Model based on the English subset (2M Tweets, with auto-induced K=73 topics, based on 15K features -> cf. upcoming 3rd part of this series) took a whole day (22,5 hours). Don't try this at home, or without some free AWS credits :)
# Packages & Data
Let's get started. Here are some tweaks/tricks/adjustments I discovered while writing this post ([i.e., in this 8 years old Stack Overflow thread](https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb){target="_blank"}) in order to increase the computing performance ;)
> i.e.: `memory.limit(size = NA)` sets the maximum memory limit to what's ever possible on your system (it's a bit more complicated on Windows)
> You can enforce R's garabage collection at any time with `gc()` (see the vignette for the params); this is esp. helpful after `rm(huge_object)`
> Save any large output as an `.rds` or `.RData` (if you can afford) to your hard-drive, clear your environment and start the next step with as low of a RAM footprint as possible.
> My **key-learning** of this post, however: If you're on Windows, set-up a Linux distro to dual-boot. R is so much more performant on Ubuntu et al. (also with regards to Unicode, path length, memory management and much more) You'll start the environment with at least 1GB more of free RAM; you'll have full CLI control over the swap drive/file; and then some more.
Let's start. Here are the packages we'll need:
```{r message=FALSE}
library(tidyverse)
library(quanteda)
quanteda_options("threads" = 4) # 4 is the limit on my Laptop
print(str_c("Threads for Quanteda: ", quanteda_options("threads")))
library(stm)
library(lubridate)
```
Since we've already [pre-processed the Internet Research Agency dataset](/post/election-hacking-exploring-2-million-tweets-from-the-russian-internet-research-agency-dataset-pt-1), we can just load the exported data object and go on with corpus-building.
```{r}
data_path <- here::here("data", "IRA_Tweets", "/")
data_unique <- readRDS(str_c(data_path, "infoops_data_processed.rds"))
# devtools::install_github("brooke-watson/BRRR")
BRRR::skrrrahh("flava") # play sound when ready
```
# International Corpus: An Exersise in "Too Big for In-Memory"
## Data Exploration: Time-Series Plots
As I didn't do this in my last post, let's have a quick look at the time-wise distribution of Tweets in total, and grouped by Tweet and Account languages. This way, we might see what time periods and what languages might be worth special attention.
First, we'll reduce the observations to language groups and further aggregate on a weekly basis.
```{r}
IRA_TS <- data_unique %>%
mutate(tweet_time = as_date(tweet_time),
week = lubridate::floor_date(tweet_time, "week")) %>%
group_by(week, account_language, tweet_language) %>%
count() %>%
arrange(desc(n))
```
```{r}
IRA_TS %>%
head() %>%
knitr::kable("html")
```
Now we'll plot time series grouped by account languages.
```{r fig.width=10}
IRA_TS %>%
ggplot() +
geom_line(aes(x = week, y = n, color = fct_infreq(account_language)),
alpha = 0.5) +
scale_color_brewer(palette = "Paired", direction = -1) +
scale_x_date(limits = c(as_date("2009-01-01"), NA),
breaks = scales::date_breaks(width = "1 year"),
labels = scales::date_format(format = "%Y"),
minor_breaks = "6 months") +
scale_y_continuous(breaks = scales::pretty_breaks(9),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
theme_minimal() +
labs(
title = "Russian Internet Agency Dataset: Weekly Activity per Account Language",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = "@fubits",
color = "Account Language",
x = "", y = "Tweets" ) +
guides(colour = guide_legend(override.aes = list(size = 5, stroke = 1.5)))
```
Well that's some interesting peaks - first "ru", then "en", than a little phase with "ru" followed by "es" (as in Español)! We can neglect everything before 2012 for now, so the following plots should become a bit more detailed.
Let's see what we can observe by plotting the **Tweet Languages** and inspect the top peaks' dates. To further reduce noise, we'll filter out languages with less than 500 Tweets per week from 2012 on (as we have aggregated the Tweets on a weekly basis, in fact, only week-wise values will be filtered out).
```{r fig.width=10, warning=FALSE}
IRA_TS %>%
filter(week > "2011-12-25" & n > 500) %>%
ggplot() +
geom_line(aes(x = week, y = n, color = fct_infreq(tweet_language)),
alpha = 0.5) +
scale_color_brewer(palette = "Paired", direction = -1) +
scale_x_date(limits = c(as_date("2011-11-01"), NA),
breaks = scales::date_breaks(width = "1 year"),
labels = scales::date_format(format = "%Y"),
minor_breaks = "6 months") +
scale_y_continuous(breaks = scales::pretty_breaks(9),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
theme_minimal() +
labs(
title = "Russian Internet Agency Dataset: Weekly Activity per Tweet Language",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = "@fubits",
color = "Tweet Language",
x = "",
y = "Tweets"
) +
guides(colour = guide_legend(override.aes = list(size = 5, stroke = 1.5)))
```
Those patterns are really telling! And we can add one more helper: Text labels (the Mondays for the respective weeks) for those top peaks (global maxima), colored by the respective Account languages. Plus, the real activity starts to gain momentum in 2014, so we should neglect everything beforehand for now
```{r fig.width=10, warning=FALSE}
IRA_TS %>%
filter(week > "2013-12-23" & n > 500) %>%
ggplot() +
geom_line(aes(x = week, y = n, color = fct_infreq(tweet_language)),
alpha = 0.5) +
geom_point(data = IRA_TS[1:10,], aes(x = week, y = n,
color = fct_infreq(tweet_language)), size = 2) +
scale_color_brewer(palette = "Paired", direction = -1, aesthetics = c("color", "fill")) +
ggrepel::geom_label_repel(data = IRA_TS[1:10,],
aes(x = week, y = n,
label = str_c(month(week, label = T, abbr = T), "-", day(week)),
fill = account_language),
size = 3) +
scale_x_date(limits = c(as_date("2013-10-01"), NA),
breaks = scales::date_breaks(width = "1 year"),
labels = scales::date_format(format = "%Y"),
minor_breaks = "6 months") +
scale_y_continuous(breaks = scales::pretty_breaks(9),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
theme_minimal() +
labs(
title = "Russian Internet Agency Dataset: Weekly Activity per Language (Top 10 Peaks)",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users", "; Label: week's Monday; Label fill: Account language"
),
caption = "@fubits",
color = "Language",
x = "",
y = "Tweets"
) +
guides(color = guide_legend(override.aes = list(size = 5, stroke = 0)),
fill = FALSE)
```
> Technically, we could re-do this for all peaks of any particular Tweet or Account language and find out more about the particular events (i.e. German-language Tweets peaking around the Federal Elections in autum 2017 - barely visibly with this color palette) with the help of Wikipedia's super-handsome [current events portal](https://en.wikipedia.org/wiki/Portal:Current_events#Events_by_month){target="_blank"}.
For instance, we can observe **Russian-language** Tweets starting to peak at the beginning of **2014**:
* post-Maidan, the Ukraine-Russia conflict just entered a full-fledged armed conflict period, complemented by Russia's Annexation of Crimea et al.
* on 17 July Malaysia Airlines Flight MH17 was shot down over Ukraine (cf. [Bellingcat's OSINT work](https://www.bellingcat.com/tag/mh17/){target="_blank"}).
As for the **English-language** Tweets (attributed to the Russian "Internet Research Agency" by **Twitter's Election Intergrity Team** in this dataset), we see them gain momentum starting in **mid 2014**, shortly after the shot-down of MH17...
Beyond that, in **2015**:
* the Greeks elected a new Parliament on 4 January
* the Charlie Hebdo attack took place on 7 January,
* the Ferguson unrest had another bloody month in March
* well... and on 16 June 2015, **Trump** announced his presidential campaign.
Post election, English-language Tweets calmed down and didn't re-peak before August 2017:
* on 3 August 2017, "Special counsel Robert Mueller has impaneled a grand jury in Washington, D.C. to investigate allegations of Russia’s interference in the 2016 elections"^[https://en.wikipedia.org/wiki/Portal:Current_events/August_2017#2017_August_3]
* (and Neymar was bought by PSG for €222 million on the same day, which might explain why Spanish Accounts where most active here)
* on 12 August 2017, "fights break out between white supremacists and counterprotesters in **Charlottesville**, Virginia, over the removal of the Robert Edward Lee Sculpture."^[https://en.wikipedia.org/wiki/Portal:Current_events/August_2017#2017_August_12].
* on 13 August 2017 the Barcelona attacks happened
As we can see, the activities really peak around certain events, either concerning US-American left-right clashes and other intra-state affairs, or they concern topics with Russian involvement. What's striking, however, is that the days around the infamous **Brexit** referendum (23/24 June 2016) are among those time periods with the **overall lowest activies**.
Just for comparison (of plain numbers to the usefulness of exploration by data visualization), here's some tables of the data we just visualized. You can easily identify the dates of the events from the numbers alone. But in order to recognize patterns, esp. language patterns, a visual representation is really helpful. However, without this table I probably would have missed the **Spanish**-language Accounts peaking with Tweets in English in two subsequent weeks...
```{r}
IRA_TS %>%
group_by(week, account_language, tweet_language) %>%
summarise(n = sum(n)) %>%
arrange(desc(n)) %>%
head(20) %>%
knitr::kable("html")
```
Alright. Point made. Let's continue with preparing our automated content analysis.
```{r}
rm(IRA_TS) # free up some RAM first
```
## Building a Corpus with `Quanteda`
We're going to build a corpus from our dataset, turn the corpus into a sparse Document-Feature Matrix (DFM), reduce unnecessary features (words/tokens) from the DFM and then use the DFM for Structural Topic Modelling.
> Now let's try to do something really **naive**: We'll build a corpus from **all Tweets** - with **all languages**. \n This will take a minute ot two (actually: each step took me **hours**...) with `corpus()`, and `dfm()`, but this way we can explore a lot of presumed topics in various languages early on, and **test** where the limits of in-memory computing with text as data might be with R ("Big Data calling")...
**Corpus from all Tweets**
```{r eval=FALSE}
system.time({ # just measuring run-time
IRA_corpus <- data_unique %>%
corpus(text_field = "tweet_text",
docid_field = "tweetid",
docvars = data_unique, # probably obsolete when docid_field is set
compress = FALSE # compression is better for in-memory, but will slow down processing
)
})
BRRR::skrrrahh("flava")
```
### Memory Management
That's my fairly refined approach (from this post) to work in-memory with as low of a RAM footprint as possible:
* save the data object to the hard-drive with `saveRDS(obj)`
* clear your R environment with `rm()` and `gc()` ( gc := garbage collection)
* restart R with `.rs.restartR()` or `rstudioapi::restartSession()` (not sure yet, where they really differ)
* load the data object from the hard-drive with `obj <- readRDS()`
* go on with the next task
**save `IRA_corpus` to HDD**
```{r eval=FALSE}
saveRDS(IRA_corpus, str_c(data_path, "IRA_corpus_naive.rds"))
# saveRDS(IRA_corpus, str_c(data_path, "IRA_corpus_naive_compressed.rds"))
```
**clear environment and restart R**
```{r eval=FALSE}
rm(data_unique)
rm(IRA_corpus)
gc(full = TRUE, verbose = TRUE)
# .rs.restartR()
rstudioapi::restartSession()
```
> Linux: down to 1.4 GiB
**load `IRA_corpus` from HDD**
```{r}
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(str_c(data_path, "IRA_corpus_naive.rds"))
# IRA_corpus <- readRDS(stringr::str_c(data_path, "IRA_corpus_naive_compressed.rds"))
BRRR::skrrrahh("flava")
```
## Corpus Exploration: Keyword-in Context
Now we can explore some specific keywords and their full-text conetxt, i.e. those where we can already assume that they should take up a prominent spot in the corpus.
In particular, I'm interested in the following terms:
* компромат | kompromat (just for you, [\@RidT](https://twitter.com/RidT){target="_blank"})
* soros
* fake news: phrase("fake news")
* pee tape: phrase("pee tape")
* msm
* hillary | clinton
* nazi
* antifa
* zion
* civil war
* deep state
**test `kwic()` functionality**
```{r}
IRA_corpus %>%
corpus_subset(tweet_language == "ru" | tweet_language == "en") %>%
corpus_sample(1000000) %>% # always test with smaller sample
kwic(c("компромат","kompromat"), window = 10, case_insensitive = TRUE) %>%
# vs. kwic(x, phrase("term1 term2"))
as_tibble() %>% # needed for kwic()
select(pre:post) %>%
head(10) %>%
knitr::kable("html")
```
We'll put the keywords into a (`kwic()`-friendly) vector and then do a single run with all keywords instead of repeating this n-times. This will take time and might not work properly without enough RAM.
```{r}
kwic_keywords <- c(
c("компромат","kompromat"),
"soros",
phrase("fake news"),
phrase("pee tape"),
"msm",
c("hillary", "clinton"),
"nazi",
"antifa",
"zion",
phrase("civil war"),
phrase("deep state")
)
```
**enforce Garbage Collection to free up RAM**
```{r eval=FALSE}
gc(full = TRUE, verbose = TRUE)
```
**now search for Keywords-in-context from `kwic_keywords` vector**
```{r eval=FALSE}
library(magrittr)
system.time({
kwic_results <- IRA_corpus %>%
# corpus_sample(1000000) %>% # Test smale-scale sample first
quanteda::kwic(kwic_keywords, window = 10, case_insensitive = TRUE)
})
saveRDS(kwic_results, str_c(data_path, "kwic_corpus_naive.rds"))
# this genius thing plays Flavor Flav when the chunk is done
BRRR::skrrrahh("flava")
```
> I was getting this error repeatedly on Windows:\n Error: cannot allocate vector of size 64.0 Mb\n
Timing stopped at: 1599 562.6 2294
> This was the moment where I arrived at my individual "too big for in-memory" point of failure - even with 16GB RAM. After trying out some `memory.size()` tricks on Windows, among other things, I decided to change my OS (for R at least) to Ubuntu (MATE 18.10) and set up dual-boot. My first motiviation was to reduce some memory load (i.e. Skype, iTunes, Office et al. all helpers running in the background). Eventually, I had to extend the "in-memory" capacity by creating a dedicated 64GB swap drive on a USB 3.0 thumb drive. As soons as I get hold of a new, bigger, and faster SSD, I'll have a look at the performance of a SSD-based swap drive.
**Visualization**
```{r}
kwic_results <- readRDS(str_c(data_path, "kwic_corpus_naive.rds"))
```
```{r}
kwic_results %>%
mutate(keyword = tolower(keyword)) %>%
group_by(keyword) %>%
count() %>%
arrange(desc(n)) %>%
knitr::kable("html")
```
This is not as many results as I would have expected. However, this might have been caused by me by not explicitly defining the search pattern as "glob" or "regex". I'll re-do this post with better equipment soon-ish.
```{r}
kwic_results %>%
filter(keyword == "soros") %>%
select(pre, keyword, post) %>%
head(10) %>%
knitr::kable("html")
```
Now it's really easy to visualize the distribution of our selected keywords over time and by language and weekly frequency:
```{r warning=FALSE}
kwic_results %>%
left_join(data_unique, by = c("docname" = "tweetid")) %>%
select(keyword, tweet_time, account_language, tweet_language) %>%
mutate(keyword = tolower(keyword), date = floor_date(tweet_time, "week")) %>%
group_by(keyword, date, account_language) %>%
count() %>%
ggplot() +
geom_point(aes(x = date, y = fct_rev(keyword),
color = account_language,
size = n), alpha = 0.5) +
scale_color_brewer(palette = "Paired", direction = 1) +
labs(title = "Russian Internet Agency Dataset: Selected Keywords - Weekly Activity per Account Language",
subtitle = str_c("Based on n = ", nrow(kwic_results), " hits from the full IRA dataset"),
caption = "@fubits",
x = "Weekly Activity",
y = "",
color = "Account\nLanguage") +
theme_minimal() +
guides(color = guide_legend(override.aes = list(size = 5,
stroke = 0,
alpha = 1)))
```
> The Spanish-language dominance in 2017 looks really weird.
```{r}
kwic_results %>%
left_join(data_unique, by = c("docname" = "tweetid")) %>%
filter(account_language == "es") %>%
select(account_language, tweet_language, pre, keyword, post, tweet_time) %>%
head(20) %>%
knitr::kable("html")
```
> Hm, does not look too Spanish to me. That's another topic for further exploration and investigation.
> \#TODO
## Building a 5.7M x 1.5M Sparse Document-Feature Matrix
From a tactical perspective with regards to large corpora and in-memory processing, this particular Stack Overflow introduced an interesting but actually misleading approach for creating a huge `dfm`: [000andy8484 - Create dfm step by step with quanteda](https://stackoverflow.com/questions/38931507/create-dfm-step-by-step-with-quanteda){target="_blank"}.
After failing to build a `dfm` from the untouched `corpus` repeatedly (on Windows & Linux), this SO-post explained how to reduce the number of features in the corpus by tokenizing the texts and removing unwanted tokens **before** creating the `dfm`. This way I really gained some momentum for `dfm()` (by reducing dimensions), but we would also loose the analytic methods / functionality of a corpus object (i.e. `kwic()`) as we destroy the basic corpus structure (full Tweets in our case) by replacing the texts with tokens. This experiment was more of a performance evaluation than a recommended practice (for now, at least). I ended up with something like a 60M x 1.5M DFM, which obviously shows that I had lost the document-based corpus structure.
Having said that, here's our **regular strategy**:
* create untouched corpus (we did this above, already)
* ~~remove obsolete tokens from corpus by directly tokenizing the texts~~
* ~~remove stopwords from corpus~~
* create dfm from ~~reduced~~ untouched corpus
* reduce dfm by removing unwanted features (stopwords, URLs, 2-char tokens and so on)
* build dfm subsets for specific analyses
* build topic models with stm() based on variations of the `dfm`
**Naive Corpus**
```{r eval = FALSE}
# data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(stringr::str_c(data_path, "IRA_corpus_naive.rds"))
```
### Stopwords
First of all, we will need to create list of very frequent tokens/words which we can simply filter out from our corpus. This is my current **best-practice** with regards to stopwords. I'm using three different sources to compile a super-list of stopwords for the languages concerned. (As for "amp", this is an artifact I encountered in my [previous posts](/post/r-german-academic-twitter-pt-2-from-data-to-corpus-with-a-turkish-twist/#inspect-keyword-in-context-kwic-for-amp): It's just a left-over of `&amp`, which is the HTML entity for string-parsing `&`, and is created when we use `remove_punct = TRUE`. Same is true for `&quot` and probably some other unparsed/escaped entities in the IRA dataset).
```{r eval = FALSE}
# repo: https://github.com/stopwords-iso
en_stopwords <- readr::read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt")
ru_stopwords <- readr::read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-ru/master/stopwords-ru.txt")
de_stopwords <- readr::read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt")
custom_stopwords_en <- unique(c(en_stopwords,
tidytext::stop_words$word,
quanteda::stopwords("english"))) # only keep uniques
custom_stopwords_ru <- unique(c(ru_stopwords,
quanteda::stopwords("russian"))) # only keep uniques
custom_stopwords_de <- unique(c(de_stopwords,
quanteda::stopwords("german"))) # only keep uniques
stopwords_extra <- c("amp", "http", "https", "quot")
custom_stopwords_en <- c(custom_stopwords_en, stopwords_extra)
custom_stopwords_ru <- c(custom_stopwords_ru, stopwords_extra)
custom_stopwords_ru <- c(custom_stopwords_ru, stopwords_extra)
rm(en_stopwords) # free up RAM. We'll need it soon-ish
rm(ru_stopwords)
rm(de_stopwords)
```
### Deprecated Approach
> Caution: This is the **taboo**-approach, which I tried out and would not recommend re-doing. It performs. **But** the analytical value/utility of the output is rather limited!
> This task alone would first approach taking up more than 20GB "RAM"" on my Windows OS Laptop and eventually crash, so be warned again: proceed with caution!
> On Ubuntu Linux we'll get along with 9GB-11GB RAM for the same task (corpus building, tokenizing and reducing) without even touching the swap file. The difference is insane... However, this still would not work out for creating a `dfm` from the untouched, all-tweets corpus. If you want to replicate this, I suggest at least 16GB RAM + 20GB swap memory or 32GB RAM and a regular-sized swap file/drive. SSD performance might be better. Or cloud, if you're willing (and able) to pay 2-3€/hour for the larger instances...
```{r eval=FALSE}
library(quanteda)
quanteda::quanteda_options("threads" = 4)
system.time({
texts(IRA_corpus) <- quanteda::tokens(IRA_corpus,
remove_twitter = TRUE,
remove_url = TRUE,
remove_punct = TRUE,
remove_numbers = TRUE,
# ngrams = 1:2
)
})
BRRR::skrrrahh("flava")
system.time({
texts(IRA_corpus) <- tokens_remove(texts(IRA_corpus),
c(custom_stopwords_en,custom_stopwords_ru))
})
BRRR::skrrrahh("flava")
saveRDS(IRA_corpus, str_c(data_path, "IRA_corpus_naive_clean.rds"))
BRRR::skrrrahh("flava")
# **Reset session**
rm(data_unique)
rm(IRA_corpus)
rm(custom_stopwords_en)
rm(custom_stopwords_ru)
gc(full = TRUE, verbose = TRUE)
# .rs.restartR()
rstudioapi::restartSession()
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(str_c(data_path, "IRA_corpus_naive_clean.rds"))
BRRR::skrrrahh("flava")
```
### Regular Approach
**Corpus**
We already have a plain, untouched corpus object from further above, so we can skip this step.
```{r eval=FALSE}
IRA_corpus <- data_unique %>%
corpus(text_field = "tweet_text",
docid_field = "tweetid",
docvars = data_unique, # probably obsolete when docid_field is set
compress = FALSE # compression is better for in-memory, but'll slow down processing
)
```
Instead, we'll load the clean, untouched corpus from our HDD and proceed with `dfm()`
```{r eval = FALSE}
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(str_c(data_path, "IRA_corpus_naive.rds"))
BRRR::skrrrahh("flava")
```
**Sparse Document-Feature Matrix with `quanteda::dfm()`**
Now we'll turn the international corpus into a sparse [Document-Feature Matrix / Document-Term-Matrix](https://en.wikipedia.org/wiki/Document-term_matrix){target="_blank"} with `quanteda::dfm()` ~~and include n-grams of 1 and 2 words (bigrams)~~.
> My apporach for smaller tasks still is to clear the environment after every big step with `rm()` on the objects and enforcing `gc()`, then `.rs.restartR()`, and then `readRDS()` the last processed object...
> On Windows, this task would slow down first, and then crash, no matter what. Eventually, I was able to get this done by:
* installing Linux/Ubuntu MATE 18.10
* setting up a USB 3.0 thumb drive as a 64GB swap drive (my current SSD is too small for Windows and Ubuntu + swap partition)
* calling `gcinfo(TRUE)` to make the garbage collection (`gc()`) verbose, so that I was able to monitor the memory states
* and finally, starting with a clean session (see the steps described just above)
**dfm() pt.1**
```{r eval = FALSE}
gcinfo(TRUE) # sets gc() to verbose = TRUE; even when output == "FALSE""
system.time({
IRA_dfm <- dfm(
# corpus_sample(IRA_corpus, 1000000), #always test with small n first
IRA_corpus, # full corpus
tolower = TRUE,
verbose = TRUE,
min_nchar = 3,
remove = c(custom_stopwords_en, custom_stopwords_ru, custom_stopwords_de),
remove_twitter = FALSE,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_url = TRUE)
})
BRRR::skrrrahh("flava")
```
`> Argument min_nchar not used`
> We'll keep that in mind...
` ... created a 5,708,124 x 1,534,498 sparse dfm`
> Success! Thanks Linux! But don't forget to save this dfm object!!!
**dfm() pt.2**
Now we'll make sure that there are no left-over 1- and 2-character tokens (which are called `features` from now on) and then save the dfm to our HDD.
```{r eval = FALSE}
rm(IRA_corpus)
IRA_dfm <- dfm_select(IRA_dfm, min_nchar = 3) # double-checking
saveRDS(IRA_dfm, str_c(data_path, "IRA_dfm_naive_clean.rds"))
BRRR::skrrrahh("flava")
gcinfo(FALSE) # turn off verbosing for garbage collection
```
```{r}
IRA_dfm <- readRDS(str_c(data_path, "IRA_dfm_naive_clean.rds"))
```
Applying `min_nchar = 3` twice actually really helped. Now we are down to a:
> `r ndoc(IRA_dfm)` x `r nfeat(IRA_dfm)` sparse DFM (4.5K features less now).
```{r}
summary(IRA_dfm)
```
> Holy sh\*t. That's the power of `R` + `Quanteda` + `Linux` I guess. So my "little" proof-of-concept experiment worked out. You **can** build a large dfm from a large corpus without resorting to cloud computing / AWS et al.
Now we can explore a bit more and then proceed over to building a topic model.
## International Tweets: Top-Features
Let's have a quick look at the most frequent features from all ~5.7M Tweets (in all Tweet languages)
```{r}
topfeatures(IRA_dfm, 40) %>%
knitr::kable("html")
```
> Funny! **Russian terms** dominate the international corpus aka Twitter's "Internet Research Agency" dataset :)\n What's also noteworthy, is the top-ranked **news** hashtag, probably pointing at the proliferation of the "fake news" term.
Now let's create a subset of our dfm without \#hashtags and @account handles for a purer content analysis. It's a bit redundant, but due to the (in-memory) size of the full-languages dfm (2.8GB **with** hashtags/accounts; 2.7GB **w/o**) this is necessary if we want to work with and without Twitter artifacts.
```{r eval=FALSE}
IRA_dfm_no_twitter <- dfm_select(IRA_dfm, pattern = list('#*',"@*"),
selection = "remove")
saveRDS(IRA_dfm_no_twitter, str_c(data_path, "IRA_dfm_naive_clean_no_twitter.rds"))
```
> We've further reduced the number of features from 1.530.058 to 965.278 by removing Twitter @handles and \#hashtags
> This is another good moment to restart our session with a lower in-memory footprint.
**Reset session**
```{r eval=FALSE}
rm(IRA_dfm)
rm(IRA_dfm_no_twitter)
gc(full = TRUE, verbose = TRUE)
# .rs.restartR()
rstudioapi::restartSession()
```
**re-load packages and dfm**
```{r message=FALSE}
library(tidyverse)
library(quanteda)
quanteda_options("threads" = 4) # 4 is the limit on my Laptop
print(str_c("Threads for Quanteda: ", quanteda_options("threads")))
library(stm)
data_path <- here::here("data", "IRA_Tweets", "/")
```
> Linux: down to 1.3GB RAM load
```{r}
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_dfm <- readRDS(str_c(data_path, "IRA_dfm_naive_clean_no_twitter.rds"))
```
Now we can have another look at the top features from our dfm, this time without hashtags.
> While the default `scheme = "count"` would give us the absolut/total feature frequencies (i.e. 3x news in a single Tweet = 3x news in absolut terms), `scheme = "docfreq"` returns per-document frequencies (number of Tweets with this feature), which is more precise with regards to Tweets.
```{r cache=TRUE}
topfeatures(IRA_dfm, 40, scheme = "docfreq") %>%
knitr::kable("html")
```
> Wow. Russian features concerning Ukraine, Russia, and USA still dominate the top features in terms of frequency!
We should also have a look at these frequencies grouped by Tweet languages:
```{r cache=TRUE}
topfeatures(IRA_dfm, 20, scheme = "docfreq",
groups = "tweet_language") %>%
knitr::kable("html")
```
> for orientation: this is the order of languages in this table (can't get them to render as `col.names = ` with `knitr:kable()` yet):\n`"ru", "bg", "en", "und", "other_44", "de", "uk", "tl", "es", "ar", "fr", "in", "sr"`
> Mind that I only removed stopwords for Tweets labelled as English, Russian, and German (by Twitter). As we've explored in the previous post, `und` (=undefined) mostly containes Tweets in Russian. What's interesting is that Bulgarian as well as Ukrainian both predominantly contain Russian features, with Russian orthography, which should not be the case here.
Let's have a look at how these features might be connected to each other:
```{r fig.width=10, fig.asp=0.75, cache=TRUE}
topfeats <- names(topfeatures(IRA_dfm, 100))
textplot_network(dfm_select(IRA_dfm, topfeats), min_freq = 0.2,
edge_alpha = 0.7, edge_size = 5, edge_color = "grey",
vertex_size = 1)
```
> The two worlds collide on the topic of USA ...
With the mightiness of Quanteda, we could do a lot of further exploratory analysis and so on... However, it's really about time to focus on a smaller and language-specific subset, as the following task (to be addressed in part \#3 of this series) will be even more intense in terms of computing power and memory usage.
> However, here's a preview from Topic Modelling on the English-language subset (~2M Tweets):
![*(73 Topics from 2M English-language Tweets; modelled with the stm()-package; K auto-induced with t-SNE/PCA)*](/img/IRA_dataset/topic_modelling.jpg)