# Text mining
## Case study: Trump tweets
During the 2016 US presidential election, then-candidate Donald J. Trump used his Twitter account as a way to communicate with potential voters. On August 6, 2016, Todd Vaziri tweeted^[https://twitter.com/tvaziri/status/762005541388378112/photo/1] about Trump that "Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him)."
Data scientist David Robinson conducted an analysis^[http://varianceexplained.org/r/trump-tweets/] to determine if data supported this assertion. Here, we go through David's analysis to learn some of the basics of text mining. To learn more about text mining in R, we recommend the Text Mining with R book^[https://www.tidytextmining.com/] by Julia Silge and David Robinson.
```{r,echo=FALSE}
set.seed(2002)
```
We will use the following libraries:
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(lubridate)
library(scales)
```
In general, we can extract data directly from Twitter using the __rtweet__ package. However, in this case, a group has already compiled data for us and made it available at [http://www.trumptwitterarchive.com](http://www.trumptwitterarchive.com). We can get the data from their JSON API using a script like this:
```{r, eval=FALSE}
url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'
trump_tweets <- map(2009:2017, ~sprintf(url, .x)) |>
map_df(jsonlite::fromJSON, simplifyDataFrame = TRUE) |>
filter(!is_retweet & !str_detect(text, '^"')) |>
mutate(created_at = parse_date_time(created_at,
orders = "a b! d! H!:M!:S! z!* Y!",
tz="EST"))
```
For convenience, we include the result of the code above in the __dslabs__ package:
```{r}
library(dslabs)
```
You can see the data frame with information about the tweets by typing
```{r, eval=FALSE}
head(trump_tweets)
```
with the following variables included:
```{r}
names(trump_tweets)
```
The help file `?trump_tweets` provides details on what each variable represents. The tweets are represented by the `text` variable:
```{r}
trump_tweets$text[16413] |> str_wrap(width = options()$width) |> cat()
```
and the source variable tells us which device was used to compose and upload each tweet:
```{r}
trump_tweets |> count(source) |> arrange(desc(n)) |> head(5)
```
We are interested in what happened during the campaign, so for this analysis we will focus on what was tweeted between the day Trump announced his campaign and election day. We define the following table containing just the tweets from that time period. Note that we use `extract` to remove the `Twitter for` part of the source and filter out retweets.
```{r}
campaign_tweets <- trump_tweets |>
extract(source, "source", "Twitter for (.*)") |>
filter(source %in% c("Android", "iPhone") &
created_at >= ymd("2015-06-17") &
created_at < ymd("2016-11-08")) |>
filter(!is_retweet) |>
arrange(created_at) |>
as_tibble()
```
We can now use data visualization to explore the possibility that two different groups were tweeting from these devices. For each tweet, we will extract the hour, East Coast time (EST), at which it was tweeted and then compute the proportion of tweets tweeted at each hour for each device:
```{r tweets-by-time-by-device}
campaign_tweets |>
mutate(hour = hour(with_tz(created_at, "EST"))) |>
count(source, hour) |>
group_by(source) |>
mutate(percent = n / sum(n)) |>
ungroup() |>
ggplot(aes(hour, percent, color = source)) +
geom_line() +
geom_point() +
scale_y_continuous(labels = percent_format()) +
labs(x = "Hour of day (EST)", y = "% of tweets", color = "")
```
We notice a big peak for the Android in the early hours of the morning, between 6 and 8 AM. There seems to be a clear difference in these patterns. We will therefore assume that two different entities are using these two devices.
We will now study how the tweets differ when we compare Android to iPhone. To do this, we introduce the __tidytext__ package.
## Text as data
The __tidytext__ package helps us convert free form text into a tidy table. Having the data in this format greatly facilitates data visualization and the use of statistical techniques.
```{r}
library(tidytext)
```
The main function needed to achieve this is `unnest_tokens`. A _token_ refers to a unit that we are considering to be a data point. The most common _token_ will be words, but tokens can also be single characters, n-grams, sentences, lines, or a pattern defined by a regex. The function takes a vector of strings and extracts the tokens so that each one gets a row in the new table. Here is a simple example:
```{r}
poem <- c("Roses are red,", "Violets are blue,",
"Sugar is sweet,", "And so are you.")
example <- tibble(line = c(1, 2, 3, 4),
text = poem)
example
example |> unnest_tokens(word, text)
```
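Tokens need not be single words. For instance, we can ask `unnest_tokens` for bigrams (pairs of consecutive words) by changing the `token` argument. A quick sketch using the first line of the poem (the one-row tibble here is our own toy input):

```r
library(tibble)
library(tidytext)

# each output row is a pair of consecutive words; punctuation is
# stripped and text is lowercased by the default tokenizer
bigrams <- tibble(line = 1, text = "Roses are red, violets are blue") |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams
```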
Now let's look at an example from the tweets. We will look at tweet number 3008 because it will later permit us to illustrate a couple of points:
```{r}
i <- 3008
campaign_tweets$text[i] |> str_wrap(width = 65) |> cat()
campaign_tweets[i,] |>
unnest_tokens(word, text) |>
pull(word)
```
Note that the function tries to convert tokens into words. A minor adjustment is to remove the links to pictures:
```{r, message=FALSE, warning=FALSE}
links <- "https://t.co/[A-Za-z\\d]+|&amp;"
campaign_tweets[i,] |>
mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text) |>
pull(word)
```
Now we are ready to extract the words for all our tweets:
```{r, message=FALSE, warning=FALSE}
tweet_words <- campaign_tweets |>
mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text)
```
And we can now answer questions such as "what are the most commonly used words?":
```{r}
tweet_words |>
count(word) |>
arrange(desc(n))
```
It is not surprising that these are the top words, but they are not informative. The __tidytext__ package has a database of these commonly used words, referred to as _stop words_ in text mining:
```{r}
stop_words
```
If we filter out rows representing stop words with `filter(!word %in% stop_words$word)`:
```{r, message=FALSE, warning=FALSE}
tweet_words <- campaign_tweets |>
mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text) |>
filter(!word %in% stop_words$word)
```
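An equivalent way to remove stop words, common in the tidytext literature, is `anti_join`, which keeps only the rows without a match in `stop_words`. A minimal sketch on a toy word table (the example words are made up for illustration):

```r
library(tibble)
library(dplyr)
library(tidytext)  # provides the stop_words data frame

words <- tibble(word = c("the", "will", "great", "america"))

# drop every word that appears in the stop_words lexicons
kept <- words |> anti_join(stop_words, by = "word")
kept
```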
we end up with a much more informative set of top 10 tweeted words:
```{r}
tweet_words |>
count(word) |>
top_n(10, n) |>
mutate(word = reorder(word, n)) |>
arrange(desc(n))
```
Some exploration of the resulting words (not shown here) reveals a couple of unwanted characteristics in our tokens. First, some of our tokens are just numbers (years, for example). We want to remove these, and we can find them using the regex `^\d+$`. Second, some of our tokens come from a quote and start with `'`. We want to remove the `'` when it is at the start of a word, so we will simply use `str_replace`. We add these two lines to the code above to generate our final table:
```{r, message=FALSE, warning=FALSE}
tweet_words <- campaign_tweets |>
mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text) |>
filter(!word %in% stop_words$word &
!str_detect(word, "^\\d+$")) |>
mutate(word = str_replace(word, "^'", ""))
```
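The two new cleaning steps can be checked on a small vector of made-up tokens (these example strings are ours, chosen only to exercise both rules):

```r
library(stringr)

tokens <- c("2016", "'makeamericagreatagain", "crowds")

# drop tokens that are nothing but digits
kept <- tokens[!str_detect(tokens, "^\\d+$")]

# strip a leading quote left over from quoted tweets
cleaned <- str_replace(kept, "^'", "")
cleaned
```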
Now that we have all our words in a table, along with information about what device was used to compose the tweet they came from, we can start exploring which words are more common when comparing Android to iPhone.
For each word, we want to know if it is more likely to come from an Android tweet or an iPhone tweet. We therefore compute, for each word, what proportion of all words it represents for Android and iPhone, respectively.
```{r}
android_vs_iphone <- tweet_words |>
count(word, source) |>
pivot_wider(names_from = "source", values_from = "n", values_fill = 0) |>
mutate(p_a = Android / sum(Android), p_i = iPhone / sum(iPhone),
percent_diff = (p_a - p_i) / ((p_a + p_i)/2) * 100)
```
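To see what `percent_diff` measures, suppose a word accounts for 0.2% of all Android words but only 0.1% of all iPhone words. The difference relative to the average of the two proportions is then about 67%, meaning the word is heavily tilted toward Android. A quick check with these made-up proportions:

```r
# hypothetical proportions for one word
p_a <- 0.002  # share of Android words
p_i <- 0.001  # share of iPhone words

# difference as a percent of the average proportion
pd <- (p_a - p_i) / ((p_a + p_i) / 2) * 100
pd
```

Note this statistic is bounded between -200 (iPhone only) and 200 (Android only), with 0 meaning the word is used in the same proportion on both devices.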
For words appearing at least 100 times in total, here are the highest percent differences for Android:
```{r}
android_vs_iphone |> filter(Android + iPhone >= 100) |>
arrange(desc(percent_diff))
```
and the top for iPhone:
```{r}
android_vs_iphone |> filter(Android + iPhone >= 100) |>
arrange(percent_diff)
```
We already see somewhat of a pattern in the types of words that are being tweeted more from one device versus the other. However, we are not interested in specific words but rather in the tone. Vaziri's assertion is that the Android tweets are more hyperbolic. So how can we check this with data? _Hyperbolic_ is a hard sentiment to extract from words as it relies on interpreting phrases. However, words can be associated with more basic sentiments such as anger, fear, joy, and surprise. In the next section, we demonstrate basic sentiment analysis.
## Sentiment analysis
In sentiment analysis, we assign a word to one or more "sentiments". Although this approach will miss context-dependent sentiments, such as sarcasm, when it is performed on large numbers of words the resulting summaries can provide insights.
The first step in sentiment analysis is to assign a sentiment to each word. As we demonstrate, the __tidytext__ package includes several maps or lexicons. We will also be using the __textdata__ package.
```{r, message=FALSE, warning=FALSE}
library(tidytext)
library(textdata)
```
The `bing` lexicon divides words into `positive` and `negative` sentiments. We can see this using the _tidytext_ function `get_sentiments`:
```{r, eval=FALSE}
get_sentiments("bing")
```
The `AFINN` lexicon assigns a score between -5 and 5, with -5 the most negative and 5 the most positive. Note that this lexicon needs to be downloaded the first time you call the function `get_sentiments`:
```{r, eval=FALSE}
get_sentiments("afinn")
```
The `loughran` and `nrc` lexicons provide several different sentiments. Note that these also have to be downloaded the first time you use them.
```{r}
get_sentiments("loughran") |> count(sentiment)
```
```{r}
get_sentiments("nrc") |> count(sentiment)
```
For our analysis, we are interested in exploring the different sentiments of each tweet so we will use the `nrc` lexicon:
```{r}
nrc <- get_sentiments("nrc") |>
select(word, sentiment)
```
We can combine the words and sentiments using `inner_join`, which will only keep words associated with a sentiment. Here are five random words extracted from the tweets:
```{r}
tweet_words |> inner_join(nrc, by = "word", relationship = "many-to-many") |>
select(source, word, sentiment) |>
sample_n(5)
```
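The `relationship = "many-to-many"` argument is needed because a single word can carry several sentiments in a lexicon, so one row on the left can match several rows on the right. A minimal sketch with a made-up mini lexicon (the word-sentiment pairs below are ours, purely for illustration):

```r
library(tibble)
library(dplyr)

# made-up mini lexicon: "win" maps to two sentiments
lexicon <- tibble(word = c("win", "win", "loser"),
                  sentiment = c("joy", "anticipation", "anger"))
words <- tibble(word = c("win", "loser", "tremendous"))

# inner_join keeps only matched words; "win" yields two rows,
# and "tremendous" is dropped because it has no sentiment
joined <- inner_join(words, lexicon, by = "word",
                     relationship = "many-to-many")
joined
```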
Now we are ready to perform a quantitative analysis comparing Android and iPhone by comparing the sentiments of the tweets posted from each device. Here we could perform a tweet-by-tweet analysis, assigning a sentiment to each tweet. However, this will be challenging since each tweet will have several sentiments attached to it, one for each word appearing in the lexicon. For illustrative purposes, we will perform a much simpler analysis: we will count and compare the frequencies of each sentiment appearing in each device.
```{r}
sentiment_counts <- tweet_words |>
left_join(nrc, by = "word", relationship = "many-to-many") |>
count(source, sentiment) |>
pivot_wider(names_from = "source", values_from = "n") |>
mutate(sentiment = replace_na(sentiment, replace = "none"))
sentiment_counts
```
For each sentiment, we can compute the percent difference in proportion for Android compared to iPhone:
```{r}
sentiment_counts |>
mutate(p_a = Android / sum(Android) ,
p_i = iPhone / sum(iPhone),
percent_diff = (p_a - p_i) / ((p_a + p_i)/2) * 100) |>
arrange(desc(percent_diff))
```
So we do see some differences and the order is interesting: the largest three sentiments are disgust, anger, and negative!
If we are interested in exploring which specific words are driving these differences, we can refer back to our `android_vs_iphone` object:
```{r}
android_vs_iphone |> inner_join(nrc, by = "word") |>
filter(sentiment == "disgust") |>
arrange(desc(percent_diff))
```
and we can make a graph:
```{r percent-diff-by-word, out.width="100%", echo=FALSE}
levels <- sentiment_counts |>
mutate(p_a = Android / sum(Android) ,
p_i = iPhone / sum(iPhone),
percent_diff = (p_a - p_i) / ((p_a + p_i)/2) * 100) |>
arrange(desc(percent_diff)) |>
pull(sentiment)
android_vs_iphone |> inner_join(nrc, by = "word") |>
mutate(sentiment = factor(sentiment, levels = levels)) |>
filter(Android + iPhone > 25 & abs(percent_diff) > 25) |>
mutate(word = reorder(word, percent_diff)) |>
ggplot(aes(word, percent_diff, fill = percent_diff < 0)) +
facet_wrap(~sentiment, scales = "free_x", nrow = 2) +
geom_col(show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
This is just a simple example of the many analyses one can perform with __tidytext__.
To learn more, we again recommend the Tidy Text Mining book^[https://www.tidytextmining.com/].
## Exercises (will not be included in midterm)
Project Gutenberg is a digital archive of public domain books. The R package __gutenbergr__ facilitates the importation of these texts into R.
You can install and load by typing:
```{r, eval=FALSE}
install.packages("gutenbergr")
library(gutenbergr)
```
You can see the books that are available like this:
```{r, eval=FALSE}
gutenberg_metadata
```
(@) Use `str_detect` to find the ID of the novel _Pride and Prejudice_.
(@) We notice that there are several versions. The `gutenberg_works()` function filters this table to remove replicates and include only English language works. Read the help file and use this function to find the ID for _Pride and Prejudice_.
(@) Use the `gutenberg_download` function to download the text for _Pride and Prejudice_. Save it to an object called `book`.
(@) Use the __tidytext__ package to create a tidy table with all the words in the text. Save the table in an object called `words`.
(@) We will later make a plot of sentiment versus location in the book. For this, it will be useful to add a column with the word number to the table.
(@) Remove the stop words and numbers from the `words` object. Hint: use the `anti_join`.
(@) Now use the `AFINN` lexicon to assign a sentiment value to each word.
(@) Make a plot of sentiment score versus location in the book and add a smoother.
(@) Assume there are 300 words per page. Convert the locations to pages and then compute the average sentiment in each page. Plot that average score by page. Add a smoother that appears to go through the data.