---
title: '[R] Election Hacking: Exploring 10 Million Tweets from the Russian Internet Research
Agency Dataset, Pt. 1 - Approaching 5.3GB in R'
author: Ilja / fubits
date: '2018-10-26'
slug: election-hacking-exploring-2-million-tweets-from-the-russian-internet-research-agency-dataset-pt-1
categories:
- Rstats
- DataViz
- Natural Language Processing NLP
tags:
- ggplot
- Twitter
- CSV
output:
blogdown::html_page:
number_sections: yes
toc: yes
lastmod: '2018-10-26T16:18:07+02:00'
description: "A bit over a week ago, Twitter's new-ish Elections integrity team released two datasets with 'all the accounts and related content associated with potential information operations that we have found on our service since 2016.' We're talking about millions of Tweets in dozens of languages stored in a single 5.3 GB CSV file. Tidyverse to the rescue!"
abstract: 'Post updated with data from the full 5.3GB CSV file from the "Internet Research Agency" dataset'
thumbnail: "/img/thumbs/IRA_Tweets.jpg"
rmdlink: yes
keywords: []
comment: no
autoCollapseToc: no
postMetaInFooter: no
hiddenFromHomePage: no
contentCopyright: no
reward: no
mathjax: no
mathjaxEnableSingleDollar: no
mathjaxEnableAutoNumber: no
hideHeaderAndFooter: no
flowchartDiagrams:
enable: no
options: ''
sequenceDiagrams:
enable: no
options: ''
---
![*(Creation Date of Top 15 Accounts in the IRA Dataset)*](/img/IRA_dataset/IRA_Tweets_Dates.png){width=100%}
A bit over a week ago, [Twitter's new-ish Elections integrity team released two datasets](https://blog.twitter.com/official/en_us/topics/company/2018/enabling-further-research-of-information-operations-on-twitter.html){target="_blank"} with "all the accounts and related content associated with potential information operations that we have found on our service since 2016."
In particular, this is what we are talking about:
> "Our initial disclosures cover two previously disclosed campaigns, and include information from 3,841 accounts believed to be connected to the Russian Internet Research Agency [also known as IRA], and 770 accounts believed to originate in Iran." ([Twitter's Election Integrity Team](https://blog.twitter.com/official/en_us/topics/company/2018/enabling-further-research-of-information-operations-on-twitter.html){target="_blank"})
For a bit of context on the IRA's activities and the Russian Influence Operation in general, [Mashable](https://mashable.com/article/russian-trolls-spies-operatives/?europe=true&geo=AS&utm_cid=mash-prod-nav-geo#7pbvrw1_DaqI){target="_blank"} offers a nice overview.
The IRA zip alone weighs in at **1.24 GB**! Let's dive in and explore. Before we can start with any analysis, automated or not, we have to inspect and prepare the data - remember: [EDA FTW](https://ellocke.github.io/post/r-german-academic-twitter-pt-2-from-data-to-corpus-with-a-turkish-twist/#the-turkish-plot-twist)!
Anyway, a dataset of this size is a perfect exercise in data wrangling and exploratory analysis with tools from the galactic `tidyverse`. So what I'm aiming to highlight with this post is my more or less systematic approach to turning a granular dataset with millions of observations into something more useful (and reliable!) for further in-depth analysis.
# Data Preparation
You can get the data from here:
* source: https://about.twitter.com/en_us/values/elections-integrity.html#data
* IRA CSV ZIP: https://storage.googleapis.com/twitter-election-integrity/hashed/ira/ira_tweets_csv_hashed.zip
* README: https://storage.googleapis.com/twitter-election-integrity/hashed/Twitter_Elections_Integrity_Datasets_hashed_README.txt
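If you prefer to script the download and extraction, here's a minimal sketch (mind the file size - and note that base R's `unzip()` may struggle with an archive this large, cf. the CRC note below):
```{r eval=FALSE}
# Sketch: scripted download & extraction (~1.24 GB ZIP, so allow plenty of time)
url <- "https://storage.googleapis.com/twitter-election-integrity/hashed/ira/ira_tweets_csv_hashed.zip"
options(timeout = 3600)
download.file(url, "ira_tweets_csv_hashed.zip", mode = "wb")
unzip("ira_tweets_csv_hashed.zip")
```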
```{r message=FALSE}
library(tidyverse)
```
First, we need to unpack the `.zip` file and then read the `.csv` file into R.
```{r eval=FALSE}
csvfile <- "ira_tweets_csv_hashed.csv"
# readr: slower, but handles missing values out of the box
data_raw <- read_csv(csvfile)
# Just for comparison: data.table's fread() is much faster
data2_raw <- data.table::fread(csvfile,
                               encoding = "UTF-8",
                               # na.strings = ",,",
                               verbose = TRUE)
```
> At first – I ([like others](https://twitter.com/avuko/status/1052573187869986816){target="_blank"}) – was getting a CRC error message when unpacking the zip file. The resulting CSV was only 1.17 GB, so it clearly did not contain all the Tweets from the IRA dataset. After upgrading `7-zip` to `v18.05`, unpacking the `zip` yields a **5.3 GB CSV**, which is quite a difference I'd say :)
> `data.table` is by far the fastest method, but `read_csv()` gives me `NA`s without any extra hassle, while `data.table::fread()` seems to ignore any patterns passed to `na.strings`. Therefore I'll be working with the data object from `read_csv()` here. Since `read_csv()` is rather slow with a `CSV` this big, you can speed up your exploration by saving the data object as an `.rds` or `.RData` file.
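If you'd rather stick with `fread()`'s speed, one possible workaround (just a sketch, assuming empty strings are the only missing-value pattern here) is to recode the missing values after import:
```{r eval=FALSE}
# Sketch: keep fread()'s speed, fix the missing values afterwards
# (assumes empty strings are the only NA pattern in this CSV)
data2_raw <- data2_raw %>%
  mutate_if(is.character, funs(na_if(., "")))
```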
```{r}
data_path <- here::here("data", "IRA_Tweets", "/")
# if (!dir.exists(data_path)) dir.create(data_path)
# saveRDS(data_raw, str_c(data_path, "infoops_data.rds"))
data_raw <- readRDS(str_c(data_path, "infoops_data.rds"))
```
With a dataset this big, `skimr::skim()` is just perfect (and its output is much more functional in RStudio)!
```{r cache=TRUE}
data_raw %>% skimr::skim_to_wide() %>% knitr::kable("html")
```
We can already make some interesting observations from this summary alone:
* The IRA dataset consists of ~~1.899.595~~ 9.041.308 Tweets in ~~51~~ 58 languages, from ~~3.460~~ 3.667 unique accounts and 11 account languages. That's pretty "diverse" but also quite complex.
* `$is_retweet` has only 2 unique values, so it's obviously a Boolean -> `mutate()`
* There's ~~1.899.595~~ 9.041.308 observations for `$tweet_text`, but only ~~1.634.942~~ 6.598.905 are unique. This huge delta just screams: spam bots and/or coordinated campaigns!
* only ~~50K~~ 266K Tweets are replies -> rather few interactions
* there are some prominent accounts with up to 257K followers
* ~~743.828~~ 2.760.160 URLs to explore
* we can see from `$retweet_userid` that apparently, ~~703.467~~ 3.333.184 Tweets are just Retweets and not unique/original Tweets.
* if we were to try to classify accounts by profile description, there's a corpus of ~~2.451~~ 2.597 unique profile descriptions (`$user_profile_description`) and 200 unique `$user_profile_url`s
* all the `$*_tweetid` vars were read as numeric, which we'll also need to change, as IDs are supposed to be unique identifiers and not continuous values -> `mutate()`
* the Tweets were posted in the period from 2009-05-09 (!) to 2018-06-21, with the median around 2015-07-17
* half of the accounts were created on or after 2014-03-28. As if there was an upcoming election or a referendum or something :)
* there are 608 unique account locations (shared by an unknown number of those 3.667 accounts at this point), and there are 4.779 geolocated Tweets. That's not much, but we could try to double-check these locations with the respective `$account_language` values.
That should give us enough leads for an initial inquiry. Let's continue with the data preparation and address what we have discovered so far.
## Change Variable Types
**convert `$is_retweet` into a boolean**
```{r}
# only 2 unique values, so coerce the strings into a proper logical
data_raw$is_retweet <- as.logical(data_raw$is_retweet)
```
**convert `$*_tweetid` vars into strings**
```{r}
data_raw <- data_raw %>%
mutate_at(vars(ends_with("tweetid")),
funs(as.character))
```
Now we can `skim` just the `$*_tweetid` vars and `$is_retweet`:
```{r}
data_raw %>%
  select(is_retweet, ends_with("tweetid")) %>%
  skimr::skim_to_wide() %>%
  knitr::kable("html")
```
Ok, now we know that 3.333.184 (~1/3 of all) Tweets are Retweets (of 1.725.841 unique Tweets). Good to know for any Natural Language Processing method which depends on statistics, e.g. Topic Modelling, or when building a corpus for descriptive analyses.
## Remove Duplicates
It's probably most efficient to remove the Retweets as the first step. This shrinks the data object we're working with.
```{r}
data_unique <- data_raw %>% filter(is_retweet == FALSE)
```
Now we're down to `r n_distinct(data_unique$tweetid)` unique Tweets in `r n_distinct(data_unique$tweet_language)` Tweet languages by `r n_distinct(data_unique$userid)` Users with `r n_distinct(data_unique$account_language)` account languages.
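Even without Retweets, some Tweet texts occur more than once (remember the delta between total and unique `$tweet_text` values from the skim summary). A quick sketch to quantify the remaining literal duplicates:
```{r eval=FALSE}
# Sketch: how many Tweets share identical text even after dropping RTs?
data_unique %>%
  count(tweet_text, sort = TRUE) %>%
  filter(n > 1) %>%
  summarise(duplicate_texts = n(), affected_tweets = sum(n))
```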
## Reduce/Recode Language Variables
**$tweet_language**
```{r cache=TRUE}
data_unique %>%
group_by(tweet_language) %>%
count() %>%
filter(n > 1000) %>%
arrange(desc(n)) %>%
knitr::kable(format = "html")
```
That's quite a lot, even if we only consider Tweet languages with n > 1000 Tweets.
**merge NA and und[defined]**
```{r}
data_unique <- data_unique %>%
mutate(tweet_language = if_else(is.na(tweet_language), "und", tweet_language))
```
**recode all langs with n < 5000 as "other"**
We need to reduce the scope of Tweet languages for now. A threshold of 5000 Tweets (roughly 0.1% of the unique Tweets) sounds like a workable cut-off.
```{r}
# I'm very sure that there's a more elegant solution for mutating observations row-wise based on grouped counts... However, whatever works, works.
other_langs <- data_unique %>%
group_by(tweet_language) %>%
count() %>%
filter(n < 5000) %>%
select(tweet_language)
```
```{r}
data_unique <- data_unique %>%
mutate(tweet_language =
if_else(tweet_language %in% other_langs$tweet_language, "other_44",
tweet_language))
```
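As the comment above admits, there's probably a more elegant route for this. One alternative (a sketch, with the same 5000 cutoff) uses `dplyr::add_count()` to attach the group sizes directly:
```{r eval=FALSE}
# Sketch: recode rare languages in a single pipe via add_count()
data_unique <- data_unique %>%
  add_count(tweet_language) %>%
  mutate(tweet_language = if_else(n < 5000, "other_44", tweet_language)) %>%
  select(-n)
```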
```{r cache=TRUE}
n_distinct(data_unique$tweet_language)
```
We're down to `r n_distinct(data_unique$tweet_language)` language categories for the Tweets. That's far better!
**$account_language**
```{r cache=TRUE}
data_unique %>%
group_by(account_language) %>%
distinct(userid) %>%
count() %>%
arrange(desc(n)) %>%
knitr::kable("html")
```
So there are `r n_distinct(data_unique$account_language)` account languages. I'm going to focus on languages with n > 100 accounts and recode the rest as "other_6" (since we can assume "uk" == "en-gb", that's one language less).
```{r}
other_langs_acc <- data_unique %>%
distinct(userid, .keep_all = TRUE) %>%
group_by(account_language) %>%
count() %>%
filter(n < 100) %>%
select(account_language)
data_unique <- data_unique %>%
mutate(account_language =
if_else(account_language %in% other_langs_acc$account_language, "other_6",
account_language))
n_distinct(data_unique$account_language)
```
Great! Now we're down to `r n_distinct(data_unique$account_language)` account languages.
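To double-check the recode, a quick re-count per account language would look like this:
```{r eval=FALSE}
# Sketch: one account per row, then count accounts per (recoded) language
data_unique %>%
  distinct(userid, .keep_all = TRUE) %>%
  count(account_language, sort = TRUE)
```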
## IRA Dataset Languages: Summary
Let's see how many different languages we have by now.
```{r}
unique(c(data_unique$tweet_language, data_unique$account_language))
```
Now let's create a preliminary overview.
```{r cache=TRUE}
data_count_by_tweet_lang <- data_unique %>%
# filter(is_retweet == TRUE) %>%
group_by(tweet_language) %>%
distinct(tweetid) %>%
count() %>%
rename(Tweets = n)
data_count_by_account_lang <- data_unique %>%
# filter(is_retweet == TRUE) %>%
group_by(account_language) %>%
distinct(userid) %>%
count() %>%
rename(Accounts = n)
lang_stats <- full_join(data_count_by_tweet_lang, data_count_by_account_lang,
by = c("tweet_language" = "account_language")) %>%
rename(Language = tweet_language) %>%
mutate(
T.Share = round(Tweets / sum(.$Tweets, na.rm = TRUE) * 100, 2),
A.Share = round(Accounts / sum(.$Accounts, na.rm = TRUE) * 100, 2)
) %>%
select(Language, Tweets, T.Share, Accounts, A.Share) %>%
arrange(desc(Accounts))
```
```{r}
lang_stats %>%
knitr::kable("html",
format.args = list(
big.mark = ".",
decimal.mark = ","),
caption = "#TwitterDump 2018 – Russian InfoOP Dataset: Languages (unique Tweets)"
)
```
> That's already interesting, but let's not jump to conclusions about who tweeted in what language, yet... This summary alone does **not** enable us to claim that, for instance, Russian accounts were responsible for all the Russian Tweets, and so on.
> What's really striking is that 50% of the Tweets from this 9M-Tweets IRA dataset are in Russian (or at least labelled as such), which does not quite fit the dominant narrative of a solely US-centric information operation. These numbers show that Russia's activities were concerned with Russian-speaking people as much as with the English-speaking audience (along with German and Spanish ones).
**Check for any remaining language NA's**
```{r cache=TRUE}
data_unique %>% filter(is.na(tweet_language) | is.na(account_language)) %>%
count()
```
That's great news so far!
**(Optional: recode if(tweet_lang == NA &/| user_lang == NA))**
Since we've reduced our dataset and already recoded the `NA`s, this step is no longer necessary (before that, I worked without `is_retweet == FALSE` and things looked a bit different). However, I'm leaving the syntax here, since it might be useful to others (and to my future self).
```{r eval=FALSE}
data_recoded <- data_unique %>%
mutate(tweet_language = if_else(is.na(tweet_language) & is.na(account_language),
"und",
if_else(is.na(tweet_language) & !is.na(account_language),
account_language,
tweet_language)
)
) %>%
mutate(account_language = if_else(is.na(tweet_language) & is.na(account_language),
"und",
if_else(is.na(account_language) & !is.na(tweet_language),
tweet_language,
account_language)
)
)
```
> This is a good moment in our data preparation cycle to create a hardcopy of our processed `data_unique` object in an `.rds` file. This way, we won't have to redo all the wrangling and recoding we did so far, and can start any in-depth analysis by loading the object with `data_unique <- readRDS(file)`. And we can reduce our local in-memory load with `rm(data_raw)`.
```{r eval=FALSE}
# data_path <- here::here("data", "IRA_Tweets", "/")
# if (!dir.exists(data_path)) dir.create(data_path)
saveRDS(data_unique, str_c(data_path, "infoops_data_processed.rds"))
rm(data_raw)
data_unique <- readRDS(str_c(data_path, "infoops_data_processed.rds"))
```
# Who tweeted in what language?
Now it's about time to look into which account language groups tweeted in what languages.
## Create Language-specific Subsets
> This is also a good moment to create language-specific or other interesting subsets from our refined dataset.
**German subset**
```{r cache=TRUE}
german_subset <- data_unique %>% filter(tweet_language == "de" | account_language == "de")
```
The German subset has `r n_distinct(german_subset$tweetid)` Tweets by `r n_distinct(german_subset$userid)` users.
**Undefined Subset**
```{r cache=TRUE}
undefined_subset <- data_unique %>%
filter(tweet_language == "und" | account_language == "und" |
is.na(tweet_language) | is.na(account_language))
```
The undefined ("und" & "NA") subset has `r n_distinct(undefined_subset$tweetid)` Tweets by `r n_distinct(undefined_subset$userid)` users.
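If more language-specific subsets are needed later on, a tiny helper function (just a sketch; `subset_by_lang` is a hypothetical name, covering the straightforward cases without the `NA` handling above) avoids the copy-paste:
```{r eval=FALSE}
# Sketch: a small helper for further language-specific subsets
# (hypothetical; covers the simple case without the NA handling)
subset_by_lang <- function(df, lang) {
  df %>% filter(tweet_language == lang | account_language == lang)
}
spanish_subset <- subset_by_lang(data_unique, "es")
```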
## Summary Plots: Languages & General Activity
```{r fig.width=7, cache=TRUE}
data_unique %>%
group_by(tweet_language) %>%
summarise(n = n()) %>%
mutate(
share = n / sum(n),
tweet_language = case_when(
tweet_language == "" ~ "unspec.",
tweet_language == "und" ~ "undef.",
TRUE ~ tweet_language
)
) %>%
arrange(desc(n)) %>%
ggplot(aes(area = share)) +
treemapify::geom_treemap(aes(fill = n), alpha = 0.8) +
treemapify::geom_treemap_text(
aes(label = paste0(tweet_language, "\n(", round(share * 100, 1), "%)"))
) +
scale_fill_viridis_c(direction = -1, option = "D") +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Shares of Languages (Tweets)",
subtitle = str_c("Total of ",
n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits")
) +
guides(fill = FALSE)
```
```{r fig.width=7, cache=TRUE}
data_unique %>%
group_by(account_language) %>%
summarise(n = n()) %>%
mutate(
share = n / sum(n),
account_language = case_when(
account_language == "" ~ "unspec.",
account_language == "und" ~ "undef.",
TRUE ~ account_language
)
) %>%
arrange(desc(n)) %>%
ggplot(aes(area = share)) +
treemapify::geom_treemap(aes(fill = n), alpha = 0.8) +
treemapify::geom_treemap_text(
aes(label = paste0(account_language, "\n(", round(share * 100, 1), "%)"))
) +
scale_fill_viridis_c(direction = -1, option = "D") +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Shares of Languages (Accounts)",
subtitle = str_c("Total of ",
n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits")
) +
guides(fill = FALSE)
```
### Consolidating Tweet Languages per Account
```{r cache=TRUE}
data_counts <- data_unique %>%
group_by(userid) %>%
mutate(Account_Lang = account_language) %>%
summarise(
Created = unique(account_creation_date),
Account_Lang = unique(Account_Lang),
Tweets = n(),
RT = sum(retweet_count),
Follower = unique(follower_count),
Following = unique(following_count),
    # ad-hoc influence heuristic: Follower/Following ratio plus absolute reach;
    # the +1 terms avoid division by zero
    Influence = (((Follower + 1) / (Following + 1)) + (Follower + 1)),
Tweet_Langs = list(tweet_language),
Tweet_Langs_Counts = list(unlist(Tweet_Langs) %>% fct_count())
) %>%
arrange(desc(Tweets))
```
Now the `$Tweet_Langs` var contains a list of all Tweet Languages from every single Tweet posted by an Account. Compare the number of `$Tweets` with the length of the vector (in the list) further below.
For `$Tweet_Langs_Counts`, we have utilized the quite elegant `forcats::fct_count()` which gives us a `list` of the aggregated language counts.
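If we ever need these counts in a long, tidy format, the list-column can be flattened with `tidyr::unnest()` (a sketch; `fct_count()` returns tibbles with columns `f` and `n`):
```{r eval=FALSE}
# Sketch: flatten the per-account language counts into one long tibble
data_counts %>%
  select(userid, Tweet_Langs_Counts) %>%
  tidyr::unnest() %>%
  rename(tweet_language = f, n_tweets = n)
```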
So the first User in our dataset - who has tweeted a total of `r data_counts[1,]$Tweets` times - has been busy in 6 languages. Impressive language skills :)
```{r cache=TRUE}
data_counts[1,]$Tweet_Langs_Counts[[1]] %>% knitr::kable("html")
```
And this is what this tibble looks like (without `$userid` and `$Tweet_Langs` for better Website readability):
```{r cache=TRUE}
data_counts %>%
select(-userid, -Tweet_Langs) %>%
head(10) %>% knitr::kable("html", digits = 0)
```
I'll get back to this in a minute. Let's now visualize the general characteristics of all accounts in the IRA dataset.
### General Activity Plots
```{r fig.width=10, cache=TRUE}
ggplot(data_counts, aes(x = Follower, y = Following)) +
geom_point(aes(size = Tweets, color = userid, alpha = Influence)) +
scale_color_viridis_d(direction = -1) +
scale_alpha_continuous(range = c(0.3,1),
breaks = scales::pretty_breaks(5)) +
scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_x_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_y_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
coord_fixed() +
theme_minimal() +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Account Stats",
subtitle = str_c("Total of ",
n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits"),
#x = "",
# y = "",
size = "# of Tweets",
    alpha = "Alpha: Influence"
) +
guides(color = FALSE, alpha = FALSE)
```
From what we can see here, we have quite a number of "influencers" - accounts with lots of followers which follow few others themselves.
What if we look at the account languages?
```{r fig.width=10, cache=TRUE}
ggplot(data_counts, aes(x = Follower, y = Following)) +
geom_point(aes(size = Tweets, color = fct_infreq(Account_Lang), alpha = Influence)) +
scale_color_viridis_d(option = "B", direction = 1) +
scale_alpha_continuous(range = c(0.3,1),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_x_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_y_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
coord_fixed() +
theme_minimal() +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Activity by Account Language",
subtitle = str_c("Total of ",
n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits"),
#x = "",
# y = "",
size = "# of Tweets",
color = "Account Language"
) +
guides(alpha = FALSE,
colour = guide_legend(override.aes = list(size = 5, stroke = 1.5))
)
```
And now let's just look at when the most influential accounts were created.
```{r fig.width=10, cache=TRUE}
influencers <- data_counts %>%
arrange(desc(Influence)) %>%
top_n(15, Influence)
data_counts %>%
arrange(desc(Influence)) %>%
ggplot(data = ., aes(x = Follower, y = Following)) +
geom_point(aes(size = Tweets, color = fct_infreq(Account_Lang), alpha = Influence)) +
ggrepel::geom_label_repel(data = influencers,
aes(label = lubridate::year(Created),
fill = Account_Lang),
alpha = 0.7) +
scale_color_viridis_d(option = "B", direction = 1) +
scale_alpha_continuous(range = c(0.3,1),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_x_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_y_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
coord_fixed() +
theme_minimal() +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Top 15 (Influence) - Account Creation Date",
subtitle = str_c("Total of ",
n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits"),
#x = "",
# y = "",
size = "# of Tweets",
color = "Account Language"
) +
guides(alpha = FALSE,
colour = FALSE)
```
Interesting, one of the most influential accounts is neither Russian nor English-speaking!
```{r}
data_counts %>%
arrange(desc(Influence)) %>%
select(userid, Account_Lang, Follower, Following) %>%
head(15) %>% knitr::kable("html")
```
> 2518710111 is not hashed :) Let's find out who this is!
```{r}
data_raw %>%
  filter(userid == 2518710111) %>%
  select(userid, user_display_name, user_screen_name, account_language) %>%
  head(1) %>%
  knitr::kable("html")
```
> "Вестник Новосибирска" - Newspaper of Novosibirsk - doesn't sound too British :) the account has been suspended by Twitter, btw.
What about accounts with the most Tweets?
```{r fig.width=10, cache=TRUE}
data_counts %>%
ggplot(data = ., aes(x = Follower, y = Following)) +
geom_point(aes(size = Tweets, color = fct_infreq(Account_Lang), alpha = RT)) +
ggrepel::geom_label_repel(data = data_counts[1:15,],
aes(label = lubridate::year(Created),
fill = Account_Lang),
alpha = 0.7) +
scale_color_viridis_d(option = "B", direction = 1) +
scale_alpha_continuous(range = c(0.3,1),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_x_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
scale_y_continuous(breaks = scales::pretty_breaks(6),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
coord_fixed() +
theme_minimal() +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Top 15 (# Tweets) - Account Creation Date",
subtitle = str_c("Total of ",
n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits"),
#x = "",
# y = "",
size = "# of Tweets",
color = "Account Language"
) +
guides(alpha = FALSE,
colour = FALSE)
```
We can see from both plots that the most influential and most active accounts were created in 2014 or later, and that the ratio between Russian- and English-labelled accounts is rather balanced in terms of maximum Tweet numbers. However, English-speaking accounts dominate in terms of follower metrics (Follower/Following).
## Languages: Accounts vs Tweets
Now it's time to have a look at the **account language to tweet language** relations.
This is what the Top 20 (of 57) language combinations look like:
```{r cache=TRUE}
data_unique %>%
group_by(account_language, tweet_language) %>%
count() %>%
arrange(desc(n)) %>%
head(20) %>%
knitr::kable("html", caption = "Top 20 Language Combinations from the IRA Dataset")
```
Ok, so we probably could have expected that Russian- and English-speaking accounts would mostly tweet in their respective languages. But who would have suspected that 185.803 Russian-language Tweets were posted by English-speaking accounts? Right :)
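To eyeball some of those Tweets - Russian-language content posted by English-labelled accounts - a quick sketch:
```{r eval=FALSE}
# Peek at Russian-language Tweets posted by English-labelled accounts
data_unique %>%
  filter(account_language == "en", tweet_language == "ru") %>%
  select(userid, tweet_text) %>%
  head(10)
```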
Now let's visualize all the 57 **Account Language -> Tweet Language** combinations. For an easier understanding of these plots, just keep in mind that Tweets are posted by accounts, so the Tweet language (bottom) is our dependent variable here.
```{r fig.width=7, cache=TRUE}
data_unique %>%
group_by(account_language, tweet_language) %>%
count() %>%
arrange(desc(n)) %>%
ggplot() +
geom_tile(aes(x = tweet_language,
y = account_language,
fill = n)) +
scale_fill_viridis_c(option = "B", direction = -1,
breaks = scales::pretty_breaks(4),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
coord_fixed() +
theme_minimal() +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Language Matrix",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits"),
fill = "Language Combo:\nTotals",
x = "Tweet Language",
y = "Account Language"
)
```
```{r fig.width=7, cache=TRUE}
data_unique %>%
group_by(account_language, tweet_language) %>%
count() %>%
arrange(desc(n)) %>%
ggplot() +
geom_tile(aes(x = tweet_language,
y = account_language,
fill = n/sum(n))) +
scale_fill_viridis_c(option = "B", direction = -1,
breaks = scales::pretty_breaks(6),
labels = scales::percent_format(accuracy = 1)) +
coord_fixed() +
theme_minimal() +
# theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(
title = "#TwitterDump 2018 – Russian InfoOP Dataset: Language Matrix",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = str_c("@fubits"),
fill = "Language Combo:\nShare of Total"
)
```
> Alright, that's enough exploration and heavy data wrangling for today. Stay tuned for Part 2: Content Analysis
Here's just a teaser of what awaits us:
```{r}
data_unique %>%
filter(tweet_language == "ru") %>%
select(tweet_text) %>%
head(10) %>%
knitr::kable("html")
```
Or what about the "undefined" ones?
```{r}
data_unique %>%
filter(tweet_language == "und") %>%
select(tweet_text) %>%
head(10) %>%
knitr::kable("html")
```
> Btw, have you already discovered the US-Republican-sponsored (sic!) [RussiaTweets.com](https://russiatweets.com/){target="_blank"} Project?