-
Notifications
You must be signed in to change notification settings - Fork 0
/
word-network-inference.Rmd
357 lines (281 loc) · 12.9 KB
/
word-network-inference.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
---
title: "Word network inference"
author: "Dereck de Mézquita"
date: "14/09/2020"
knit: (function(inputFile, encoding) {
rmarkdown::render(inputFile,
encoding=encoding,
output_file=file.path(dirname(inputFile), "reports", "word-network-inference.html")) })
output:
html_document:
fig_caption: yes
keep_md: yes
number_sections: yes
toc: yes
toc_float: yes
editor_options:
chunk_output_type: inline
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = "center")
```
# Description
This document is meant as a script for the inference of word networks. These networks are composed of words from my chat with Liza, the nodes are individual words and the vertices are the associations between those words. A correlation limit is set in order to inferr these asssociations; as explained more in detail in the following sections. These networks are directed.
These steps to word network inference are as follows:
- Load and clean the data.
- Subset the data for network inference.
- Use a custom network inference function based on the `findAssocs()` function of the "tm" package.
- Find word associations.
- Inferr networks at different correlation limits or word list sizes for fine tuning.
- Subset objects into a directed adjacency list.
- Test plot the networks object to select a viable network.
## Project structure
The goal of this project is to analyse chat data with my girlfriend; apply statistical methods, graph theory, and other data science techniques.
Please note that this project is presented in knitted interactive `.html` reports, you can obtain these by downloading them from this github repo at the directory `./reports` *or* you can visit them hosted on my website at: https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/
This project is broken up into three big sub-projects. The project structure is as follows (hosted reports links):
- Text mining, statistical, and exploratory analysis: [derecksnotes.com: text-analysis.html](https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/text-analysis.html)
- Word association network inference: [derecksnotes.com: word-network-inference.html](https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/word-network-inference.html)
- Word association/network topological analysis and visualisation: [derecksnotes.com: word-network-analysis.html](https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/word-network-analysis.html)
- **Bonus -** interactive visualisation of network: [derecksnotes.com: word-network-interactive.html](https://www.derecksnotes.com/sharing/data-science-portfolio/ds-NLP-network-inference/word-network-interactive.html)
# Libraries
```{r libraries, message=FALSE, warning=FALSE}
library("rwhatsapp")
library("tidyverse")
library("DT")
library("tm")
library("stopwords")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("english")
library("igraph")
```
# Load data
```{r load-data}
# Date format: [2/1/20, 12:01:05]
# date_format <- "M/d/yy', HH:mm:ss"
date_format <- "M/d/yy, HH:mm:ss"
if(!file.exists("./outputs/chat.rds")) {
message("Load does not exist. Loading raw and saving object.")
chat_init <- rwa_read("./data/chat.txt", tz = "Europe/Paris", verbose = TRUE, format = date_format) %>% filter(!is.na(author))
saveRDS(chat_init, "./outputs/chat.rds")
} else {
message("Load exists, loading previous RDS.")
chat_init <- readRDS("./outputs/chat.rds")
}
chat_init <- as.data.frame(chat_init)
data <- iconv(chat_init$text, "latin1", "ASCII", sub = "")
```
# Word cloud
I create a quick wordcloud to assess the quality of the data; this is a quick way to visualise what I am doing before going onto the network inference portion which requires a lot more computational power and time!
## Corpus create
```{r init-corpus}
cloud <- Corpus(VectorSource(data)) # data | cloud
# inspect(cloud)
```
## Transform and clean text
```{r transform-clean, warning=FALSE, message=FALSE}
toSpace <- content_transformer(function (x , pattern) {
gsub(pattern, " ", x)
})
cloud <- tm_map(cloud, toSpace, "/")
cloud <- tm_map(cloud, toSpace, "@")
cloud <- tm_map(cloud, toSpace, "\\|")
```
```{r convert-text, warning=FALSE, message=FALSE}
# Convert the text to lower case
cloud <- tm_map(cloud, content_transformer(tolower))
# Remove numbers
cloud <- tm_map(cloud, removeNumbers)
# Remove english common stopwords
cloud <- tm_map(cloud, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
to_remove <- c(stopwords("en"), "media", "attached", "2020", "photo", "jpg", "02", "i'm", "03", "omitted", "image", "vm.tiktok", "d0", "d1", "0", "53", "ttl", "ms", "216.58.213.142", "80", "www.derecksnotes", "www.derecksnotes.com", "icmp_seq")
wordvec <- c(to_remove,
"image",
"ommitted",
"omitted",
"http",
"https",
"'m",
"'s",
"wwwderecksnotescom",
"wwwyoutubecom")
cloud <- tm_map(cloud, removeWords, wordvec)
# Remove punctuations
cloud <- tm_map(cloud, removePunctuation)
# Eliminate extra white spaces
cloud <- tm_map(cloud, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
```
## Word frequencies
```{r word-freq}
dtm <- TermDocumentMatrix(cloud)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
data <- data.frame(word = names(v), freq = v)
datatable(data, rownames = FALSE)
```
## Wordcloud plot
```{r plot-cloud, fig.width=5, fig.height=5}
par(mar = c(1,1,1,1))
wordcloud(
words = data$word,
freq = data$freq,
min.freq = 1,
max.words = 300,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Set1")
)
title("Test wordcloud")
par(mar = c(5.1, 4.1, 4.1, 2.1))
```
# Network creation
## Term associations data
```{r subset-words}
net_init <- head(data, 100)
```
A quick test plot to view the distribution of the frequencies. The goal is to get only the most important words so as to reduce computational power requirements. This will also increase the quality of the network and exclude infrequently used terms.
```{r test-plot-freq, fig.width=10, fig.height=10}
par(mar = c(5.1, 8, 2.1, 2.1))
barplot(height = net_init$freq,
names.arg = net_init$word,
horiz = TRUE,
las = 1)
par(mar = c(5.1, 4.1, 2.1, 2.1))
```
## Custom network inference function
This is a custom function based on the `findAssocs()` function of the "tm" package (text-mining). This function from "tm" allows you to calculate the correlation of every given word in a document term matrix (DTM). Scores go from 0 to 1, 1 meaning that two words only appear together in the document. Correlation values can be low because of word diversity in a document.
I used this function in order to get the word associations in a document. Then I converted this object to a dataframe and ran over every word in the object along with it's associations and pushed them into a new object; an `adjacency object`. The origin is set to the word, then the target is all of the combinations of the associated words. Errors are caught as sometimes in the object there are empty targets, as the word in question did not have any associated partners.
Finally all rows are bound and a new dataframe type object is created. This is the `adjacency list` object describing the network; *note these networks are directed*.
### Fine tuning the models
In order to fine tune the models and obtain a network which is usable, I will first run the infer networks function by varying the two main parameters independently: correlation limit and word list size. Then after qualitatively selecting a range of correlation limits and word list sizes, I select the best model.
```{r func-inference-word-net}
inferenceWordNet <- function(dtm, data, corlim = 0.1) {
assoc <- list()
for (i in 1:dim(data)[1]) {
message("Run: ", i, " of ", dim(data)[1])
assoc[[i]] <- findAssocs(dtm, terms = data$word[i], corlimit = corlim)
}
df <- list()
for(i in 1:length(assoc)) {
df[[i]] <- as.data.frame(assoc[[i]])
}
res <- list()
for (i in 1:length(df)) {
res[[i]] <- df[[i]]
tryCatch({
res[[i]]$origin <- colnames(df[[i]])
res[[i]]$target <- rownames(df[[i]])
res[[i]][,colnames(df[[i]])] <- NULL
}, error = function(e) {
message("Error: ", e, " index: ", i)
})
}
net <- as.data.frame(bind_rows(res, .id = NULL))
net <- net[,1:2]
return(net)
}
```
## Networks plotter function
This plotter function is used to assess the two different parameters on the networks; word list size, and correlation limit.
```{r func-plotter, warning=FALSE, message=FALSE}
plotter <- function(net, main = "", main_cex = 1) {
plot.igraph(
net,
edge.arrow.mode = net$direction,
edge.width = 0.4,
edge.arrow.width = 0.3,
edge.arrow.size = 0.3,
vertex.size = 1,
vertex.size2 = 1,
# vertex.label.cex = 1,
# vertex.label.dist = 0.2,
vertex.label = NA, # NA = no labels
layout = layout_components,
# asp = 1,
# controls aspect ratio
margin = 0
)
title(main = main, cex.main = main_cex)
}
```
# Infer networks; vary correlation limit (word list size: 50)
After having reviewed the above barplot for the word list, I qualitatively find that around 50 words is a good size to attempt a first inference.
```{r subset-top-50}
net_init <- head(data, 50)
```
Here I run over a number of different correlation thresholds in order to obtain a usuable network. I then plot all networks at once and select only the one I deem visually viable. Tthe correlation limit for network inference. This allows us to fine tune the size of the network and the "tightness" as the higher the correlation limit threshold the higher the association frequency required in order to make a connection between two nodes in the network.
*Note that these next chunks takes a while to run due to a number of factors including: the size of the initial object, the size of the `dtm` and the number of times networks are inferred - at different correlation limits.*
```{r run-inference-word-nets-vary-correlation, message=FALSE, results="hide"}
limits <- c(0.005, 0.007, 0.010, 0.030, 0.050, 0.075, 0.100, 0.300, 0.450, 0.475, 0.480, 0.481, 0.482, 0.483, 0.484, 0.485, 0.486, 0.487, 0.489, 0.490, 0.491)
if(!file.exists("./outputs/nets_vcorrs.Rds")) {
message("Previous network inference object (varying correlation limit) does not exist; running network inference and saving object.")
nets_vcorrs <- list()
for (i in 1:length(limits)) {
nets_vcorrs[[i]] <- inferenceWordNet(dtm, net_init, limits[i])
}
names(nets_vcorrs) <- as.character(limits)
saveRDS(nets_vcorrs, "./outputs/nets_vcorrs.Rds")
} else {
message("Previous network inference object exists; loading object.")
nets_vcorrs <- readRDS("./outputs/nets_vcorrs.Rds")
}
```
## Test plots
```{r test-plot-varying-corr-limit, message=FALSE, warning=FALSE, fig.width=27.5, fig.height=27.5}
nets <- lapply(nets_vcorrs, graph_from_data_frame, directed = TRUE)
par(mfrow = c(4, 4), mar = c(1, 1, 5, 1))
for (i in 1:length(nets)) {
message("Plotting ", i, " of ", length(nets))
plotter(nets[[i]], main = names(nets)[i], main_cex = 3.5)
}
# lapply(nets, plotter)
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 2.1, 2.1))
```
## Infer networks; vary word list sizes (correlation limit 0.075)
Here a different approach is used to fine tuning the network, instead of varying the correlation limit, set at a fixed 0.450, I vary the size of the word list to find the associations from! This varies the size of the network and does not touch the association frequency requirment.
```{r run-inference-word-nets-vary-nwords, message=FALSE, warning=FALSE}
if(!file.exists("./outputs/nets_vwords.Rds")) {
message("Previous network inference object (varying word list size) does not exist; running network inference and saving object.")
size <- 24
nets_vwords <- list()
nms <- list()
system.time(
for (i in 0:85) {
size <- size + 1
words <- head(data, size)
nets_vwords[[i]] <- inferenceWordNet(dtm, words, 0.075)
nms[[i]] <- as.english(size) %>% as.character() %>% strsplit(" |-") %>% unlist() %>% paste0(collapse = "_")
}
)
names(nets_vwords) <- nms
saveRDS(nets_vwords, "./outputs/nets_vwords.Rds")
} else {
message("Previous network inference object exists; loading object.")
nets_vwords <- readRDS("./outputs/nets_vwords.Rds")
}
```
## Test plots
```{r test-plot-varying-word-list-size, message=FALSE, warning=FALSE, fig.width=27.5, fig.height=27.5}
nets <- lapply(nets_vwords, graph_from_data_frame, directed = TRUE)
par(mfrow = c(4, 4), mar = c(1, 1, 5, 1))
for (i in 1:length(nets)) {
# message("Plotting ", i, " of ", length(nets))
plotter(nets[[i]], main = names(nets)[i], main_cex = 3.5)
}
# lapply(nets, plotter)
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 2.1, 2.1))
```
## Save networks object
```{r save-net}
# write.csv(net_init, "./outputs/network-adjacency-list.csv", row.names = FALSE)
saveRDS(nets, "./outputs/networks.rds")
```
# Session information
```{r sesssion-info}
sessionInfo()
```