---
title: "Text Analysis with R (22 November 2019), Part I: Fundamentals"
output:
  html_document:
    toc: yes
  html_notebook:
    theme: united
    toc: yes
---
## What exactly is programming?
Every computer program is a series of instructions---a sequence of separate, small commands. The art of programming is to take a general idea and break it apart into separate steps. (This may be just as important as learning the rules and syntax of a particular language.)
Programming styles are broadly either imperative or declarative. R uses the imperative style, stringing together instructions that tell the computer what to do. (The declarative style instead tells the computer what the end result should be, as HTML does.) There are many subdivisions of imperative style, but the primary concern for beginning programmers is the procedural style: describing the steps for achieving a task.
Each step/instruction is a *statement*---words, numbers, or equations that express a thought.
## Why are there so many languages?
The central processing unit (CPU) of the computer does not understand any of them! The CPU only takes in *machine code*, which runs directly on the computer's hardware. Machine code is basically unreadable, though: it's a series of tiny numerical operations.
Several popular programming languages are *interpreted*, as opposed to *compiled*, languages: an interpreter translates their statements into machine code as the program runs. They bridge the gap between the computer's hardware and the human programmer. What we call our *source code* is the set of statements we write in our preferred language.
Source code is simply written in plain text in a text editor. **Do not** use a word processor.
The computer identifies source code by its file extension. For us, that means the ".R" extension (and the R notebook's ".Rmd").
While you do not need a special program to write code, it is usually a good idea to use an **IDE** (integrated development environment) to help you. Many people (like me) use the [oXygen](https://www.oxygenxml.com/) IDE for editing XML documents and creating transformations with XSLT. Python users often use [Pycharm](https://www.jetbrains.com/pycharm/) or [Anaconda](https://www.anaconda.com/). For R, I like to use [RStudio](https://www.rstudio.com/) (more on that in a moment).
## Why are we using R?
Short answer: because I like R. I have learned some Python, too, but for some reason R worked better for me. This suggests an important takeaway from this session: there is no single language that is *better* than any other. What you choose to work with will depend on what materials you are working on, what level of comfort you have with a given language, and what kinds of outputs you would like from your code.
For example, if I am primarily interested in text-based edition projects, I would be wise to work mostly with XML technologies: TEI-XML, XPath, XSLT, and XQuery, just to name a few. However, I have seen people use Python and JavaScript to transform XML. While I would advocate XSLT for such an operation, it is better for you to use your preferred language to get things done.
That all said, R does have some distinct advantages:
- The visualisation libraries are excellent.
- Being so dependent on variables, the code is more readable than many other languages (like JavaScript).
- It was built by data scientists and linguists, so it is optimal for dealing with structured text and data sets.
## The R Environment (for those who are new to R)
When you first launch R, you will see a console:
![R image](https://daedalus.umkc.edu/StatisticalMethods/images/R-Console-300x280.png)
This interface allows you to run R commands just like you would run commands on a Bash scripting shell.
When you open this file in RStudio, the command line interface (labeled "Console") is below the editing window. When you run a code block in the editing window, you will see the results appear in the Console below.
## About R Markdown
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
## Some basic R functions
- to activate a package: `library(XML)`
- to set your working directory: `setwd("path/to/my/file")`
- to find your current location: `getwd()`; to change it: `setwd("~/Desktop")`
We do this to situate ourselves correctly within the filing system.
**Note**: the `~` takes you to your home directory in a Unix-based system like Mac OS; it's a handy short-cut.
In **Windows** OS you would need to type out the file path, e.g. `C:\Users\[username]\Desktop`. A handy tip: start to type your file path and use the `tab` button to auto-complete or to see a dropdown menu of your current file location.
- to list files in your current location: `list.files()`
- to get help: `?<function>`, e.g. `?stylo`
- to quit R: `q()`
### Variables in R
Variables in R are used to store data. The data stored in a variable can be changed or reused as needed; it can be a single value or a complex object; and variables can be passed to functions (a way of handing larger or more complex data, like word vectors, to functions for text processing).
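A quick sketch of those three points, using hypothetical variable names:

```{r}
word.count <- 42               # store a single value
word.count <- word.count + 1   # the stored data can be changed later
greeting <- "bated breath"     # character data works too
nchar(greeting)                # pass the variable to a function: 12
```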
Variable names are cAse seNsitiVe and can contain a combination of letters, digits, full stops (periods), and underscores (_). They can begin with a letter or a full stop, but cannot start with a digit, and reserved words cannot be used as variable names.
```{r}
8variable <- 'invalid variable' # this throws an error: names cannot start with a digit
```
```{r}
.myVariaBl3 <- 'valid variable'
# show the variable?
```
R also does not like it when you use hyphens (-) in variable names.
### Constants in R
Constants in programming are atomic values that, once created, cannot be changed. The two basic types of constants are numeric and character constants. Numeric constants are numbers (integers or doubles, i.e. floating-point numbers), and character constants can be combined into strings of text. Constants can be assigned to variables.
```{r}
typeof(10)
typeof(5)
typeof("line of text")
typeof('10')
```
Notice that the quote characters around the '10' turn this from a constant of type double to a constant of type character. Single or double quotes can be used to define a character constant.
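Coercion functions let you move between the two types when you need to, for example:

```{r}
as.numeric("10") + 5       # 15: the character constant coerced back to a double
typeof(as.numeric("10"))   # "double"
as.character(10)           # "10": coercion runs the other way too
```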
### Operators in R
Operators allow you to carry out mathematical or logical operations, such as addition and subtraction. There are four main types of operators:
- Arithmetic
- Relational
- Logical
- Assignment
#### Arithmetic operators
Arithmetic operators are used to carry out mathematical operations, like addition and subtraction.
- `+` addition
- `-` subtraction
- `*` multiplication
- `/` division
- `^` exponent
- `%%` modulus
For example:
```{r, echo=TRUE}
x <- 3
y <- 20
```
```{r}
y
# 20 to the power of 3?
```
#### Relational operators
Relational operators are used to compare two values and to control the flow of the script.
- `<` less than
- `>` greater than
- `<=` less than or equal to
- `>=` greater than or equal to
- `==` equal to (NB: a single `=` is an assignment, not a relational comparison)
- `!=` not equal to
For example:
```{r}
x <= 5
y <= 5.1
# Is x greater than y?
```
The difference between `<-` and `<=` is subtle. Earlier, `x <- 3` assigned a constant to a variable named `x`; in the block above, `x <= 5` compares the variable against a constant using the relational operator.
```{r}
x<=y
```
Operators also work across vectors and will apply to each element of the vector. Using the `c()` function, which creates simple vectors, we can add 5 to each element in a single operation:
```{r}
my.v <- c(1,2,3,4,5)
# add 5 to all the elements of the vector variable my.v?
my.v
```
You can also add two vectors together, which combines each element in turn. This is called an **element-wise operation**:
```{r}
new.v <- c(5,4,9,2,1)
my.v + new.v
```
If your vectors are of different lengths, R *recycles* the shorter vector, repeating its elements element-wise against the longer vector (with a warning if the longer length is not a multiple of the shorter one):
```{r}
short.v <- c(1,2)
my.v + short.v
```
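A sketch of recycling with lengths that divide evenly (so no warning is raised):

```{r}
long.v <- c(1, 2, 3, 4, 5, 6)
pair.v <- c(10, 20)
long.v + pair.v   # 11 22 13 24 15 26: pair.v is recycled three times
```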
You don't need to first create a variable either, you can dynamically create a vector as part of the relational operation:
```{r}
my.v - c(1,2,3,4,4)
```
Note that the elements of your vectors must be of compatible types: you can't add a character to a number.
#### Logical operators
Logical operators are used to perform Boolean operations between constants, variables or vectors.
- `!` logical NOT
- `&` element-wise logical AND (for use with vectors)
- `&&` logical AND (for use with constants or simple variables)
- `|` element-wise logical OR (for use with vectors)
- `||` logical OR (for use with constants or simple variables)
Note that the `AND` and `OR` operators come in element-wise and single-value forms.
When performing logical operations, non-zero numbers are treated as `TRUE` and `0` is treated as `FALSE`.
For example:
```{r}
x <- c(TRUE, FALSE, 12, 1)
y <- c(FALSE, FALSE, 0, 1)
# negate the elements of x?
x
```
```{r}
# perform element wise AND to x and y?
y
```
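The non-zero-is-`TRUE` rule can be checked directly on single values, for instance:

```{r}
!0           # TRUE: 0 counts as FALSE
12 && TRUE   # TRUE: any non-zero number counts as TRUE
0 || FALSE   # FALSE
```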
Element-wise logical operations are useful when you start to work with large lists of words. You can quickly create a vector of TRUE/FALSE elements indicating which items in the vector match the words you are interested in.
```{r}
word.v <- c('the', 'quick', 'brown', 'fox')
word.v == 'quick'
```
You can then use this boolean vector to filter or 'slice' elements out of the word vector, according to their position in the vector (the position of an element is called its 'index', and square brackets are used to select elements from a vector by their index):
```{r}
# Find all words which DO NOT MATCH the word 'quick'?
# (hint, use the last line in the code block above inside the square brackets, and negate the conditional)
word.v[]
```
When doing text analysis you will work with a lot of word vectors (lists) and data frames (tables), and a common variable naming convention which you will see throughout this course is to use a full stop followed by a 'v' to indicate that the variable contains a vector object.
Using the `scan()` function, you can load a file and split it line by line into a vector. The filename.txt below does not exist: you will need to change this to a file which does exist. Try to find a text file on your computer, and display it:
```{r}
# this variable name suggests the contents of this variable object are a word vector
# change the file to something which works!
myfile.v <- scan("filename.txt", what="character", sep="\n", encoding = "UTF-8")
# running the variable name prints its contents
myfile.v
```
Running the line above will display the entire file! To display a subset of elements in a vector you can use the square bracket notation `[x:y]` to slice elements from a vector. So, for example, to slice the 25th to 30th elements you would use `[25:30]`:
```{r}
# display the first 15 lines from `myfile.v`
```
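If you want to practise slicing without loading a file, the built-in `letters` vector (the lowercase alphabet) works just as well:

```{r}
letters[25:26]   # "y" "z": the 25th and 26th letters
letters[1:5]     # "a" "b" "c" "d" "e"
```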
## Vectors
Recall that a vector is a numbered list stored under a single name. An easy way to create a vector is to use the `c` command, which basically means "combine."
```{r}
v1 <- c("i", "wait", "with", "bated", "breath")
# confirm the value of the variable by running v1
v1
# identify a specific value by indicating it in brackets
v1[4]
```
[Jeff Rydberg-Cox](https://daedalus.umkc.edu/StatisticalMethods/preparing-literary-data.html) provides some helpful tips for preparing data for R processing:
- Download the text(s) from a source repository.
- Remove extraneous material from the text(s).
- Transform the text(s) to answer your research questions.
Get used to the functions that help you understand R: `?` and `example()`.
```{r}
?c
example(c, echo = FALSE) # change the echo value to TRUE to get the results
```
The `c` function is widely used, but it is really only useful for creating small data sets. Many of you will probably want to load already existing data files.
The other important data structure is called a data frame. This is probably the most useful for sophisticated analyses, because it renders the data in a table similar to a spreadsheet. It is also more than that: a data frame is actually a special kind of list of vectors and factors that have the same length.
A common workflow is to enter your data in an Excel or Google Docs spreadsheet and then export that data as a comma-separated value (.csv) or tab-separated value (.tsv) file.
## Generating, loading, and manipulating data frames
Data frames are basically two-dimensional matrices, whereas vectors are uni-dimensional. Suppose you have a group of texts and you want to keep track of some of their metadata.
David Copperfield / Charles Dickens / novel / British
Pictures from Italy / Charles Dickens / nonfiction / British
Leaves of Grass / Walt Whitman / poetry / American
Sartor Resartus / Thomas Carlyle / nonfiction / British
We can **create** a data frame to arrange this material in tabular format:
```{r}
title <- c("David Copperfield", "Pictures from Italy", "Leaves of Grass", "Sartor Resartus")
author <- c("Charles Dickens", "Charles Dickens", "Walt Whitman", "Thomas Carlyle")
genre <- c("novel", "nonfiction", "poetry", "nonfiction")
nationality <- c("British", "British", "American", "British")
```
Here we have just created variables containing vectors. The `data.frame` function takes the vector variables as arguments and combines them into a table.
```{r}
metadata <- data.frame(title, author, genre, nationality)
# each vector passed to data.frame becomes a named column of the table
str(metadata)
summary(metadata)
```
You have just created a data frame. The `str` function shows you the structure of the data frame, and the `summary` function shows you the unique values, among other interesting facts. The dollar sign ($) can be used to identify specific variables in the data frame.
```{r}
metadata$author
metadata$nationality
# how would you only print the unique data?
```
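The relational operators from earlier also filter a data frame's rows, not just vectors. A quick sketch (rebuilding a smaller version of the metadata frame inline so this chunk stands alone):

```{r}
metadata <- data.frame(
  title = c("David Copperfield", "Pictures from Italy", "Leaves of Grass"),
  author = c("Charles Dickens", "Charles Dickens", "Walt Whitman"),
  nationality = c("British", "British", "American")
)
metadata[metadata$nationality == "British", ]     # rows where the condition holds
metadata$title[metadata$author == "Walt Whitman"] # one column, filtered: "Leaves of Grass"
```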
This is a fairly simple example to show you the syntax and meaning of a data frame, but most of you will be loading data into R. (Though you should remember that the `data.frame` function is often used in code to transform lists.) Usually that data comes from spreadsheet software (Microsoft Excel, Apple Numbers, Google Sheets).
To **load** data we use the `read.csv` or `read.table` function. (See [Gries](https://www.routledge.com/Quantitative-Corpus-Linguistics-with-R-A-Practical-Introduction-2nd-Edition/Gries/p/book/9781138816275), pp. 53-54.) Our GitHub repository has the `bow-in-the-cloud-metadata-box1.csv` file. Let's use that to run some experiments on data frames.
```{r}
rm(list = ls(all=TRUE))
bow.metadata <- read.csv(file = "bow-in-the-cloud-metadata-box1.csv", header = TRUE, sep = ",")
str(bow.metadata)
```
```{r}
bow.metadata$Creator[1:10]
```
You may also want to output a file using `write.table`.
```{r}
write.table(bow.metadata, file = "bow-metadata-df.csv", sep = "\t", quote = FALSE, row.names = FALSE)
```
In your working directory you should now have a new file that looks quite similar to the original spreadsheet (note that `sep = "\t"` makes it tab-separated, despite the .csv extension). Again, not particularly interesting here, but in many cases you will find yourself turning vectors into data frames in R, and then outputting your results into files. It's also important to know the difference between the `read.csv` and `write.table` functions.
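The two functions can be checked against each other with a toy round trip (using a temporary file so nothing in your working directory is touched):

```{r}
df <- data.frame(word = c("the", "of", "and"), count = c(10, 7, 5))
out <- tempfile(fileext = ".tsv")
write.table(df, file = out, sep = "\t", quote = FALSE, row.names = FALSE)
df2 <- read.csv(out, sep = "\t")       # override the default comma separator
identical(df$word, df2$word)           # TRUE: the words survived the round trip
```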
### Reading Data in R
The best way to load text files is with the `scan` function. First, download a text file of Dickens's [*Great Expectations*](https://www.dropbox.com/s/qji9ueb46ajait9/dickens_great-expectations.txt?dl=0) onto your working directory (it is also available in our corpus directory, in the c19-20 subdirectory).
```{r}
dickens.v <- scan("corpus/c19-20_prose/dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
```
You have now loaded *Great Expectations* into a variable called `dickens.v`. It is now a vector of paragraphs in the book that can be analysed. Let's see if that is true.
```{r}
head(dickens.v)
```
The `head` function is the same as the basic Unix command for showing the first part of a file. This can be useful for testing whether your code has worked.
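Its counterpart `tail` shows the end of a vector, and both take a second argument controlling how many elements to show. A sketch on a toy vector (so it runs even without the Dickens file):

```{r}
lines.v <- c("My father's family name,", "being Pirrip,",
             "and my Christian name Philip,", "my infant tongue")
head(lines.v, 2)   # the first two elements
tail(lines.v, 2)   # the last two elements
```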
### Data wrangling
There are a few functions in R that use regular expressions: `regexpr`, `gregexpr`, `regmatches`, `sub`, `gsub`.
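For example, `gsub` replaces every match of a pattern, `sub` replaces only the first, and `regmatches` with `gregexpr` extracts every match:

```{r}
line <- "It was the best of times, it was the worst of times"
gsub("times", "days", line)                       # replace every match
sub("times", "days", line)                        # replace only the first match
regmatches(line, gregexpr("w[a-z]+", line))[[1]]  # extract all matches: "was" "was" "worst"
```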
Briefly we will perform a basic data wrangling exercise. Allison Parrish created a data set that gathers all of the poems in Project Gutenberg into one json file, which can be found on [github](https://github.com/aparrish/gutenberg-poetry-corpus). But suppose we do not want to work with json, and we just want a plain text file of all of the poems in Project Gutenberg? That could be useful. We would then use regular expressions to strip out the json and render a plain text file.
```{r}
setwd("~/Desktop") # make sure your notebook file and all other files are saved on your Desktop
gutenberg.poetry.v <- scan(file="gutenberg-poetry-v001-sample500k.ndjson", what="character", sep="\n", encoding = "UTF-8") # you may want to use the smaller file "gutenberg-poetry-v001-sample10k.ndjson" with 10k lines to test
poetry.strip.s.v <- gsub('\\{"s": "', " ", gutenberg.poetry.v)
poetry.strip.s.v
gutenberg.poems.plain.v <- gsub(', "gid": "\\d+"\\}', " ", poetry.strip.s.v)
gutenberg.poems.plain.v[1:10] # show the first ten lines just to see if it worked
write.table(gutenberg.poems.plain.v, "gutenberg-poems.txt", row.names=F)
```
Now you have a plain text file with a numbered list of lines of poetry. Now you can upload this file into Voyant or run it through AntConc for basic text analysis results.
#### Cleaning up Dickens
If you have not already, download the text file of Dickens's [*Great Expectations*](https://www.dropbox.com/s/qji9ueb46ajait9/dickens_great-expectations.txt?dl=0), or copy the file from our github corpus, onto your working directory and scan the text.
```{r}
dickens.v <- scan("corpus/c19-20_prose/dickens_great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
```
You have now loaded *Great Expectations* into a variable called `dickens.v`.
With the text loaded, you can now run quick statistical operations, such as the number of lines and word frequencies.
```{r}
length(dickens.v) # this finds the number of lines in the book
dickens.lower.v <- tolower(dickens.v) # this lowercases the whole text; it is still a vector of lines
dickens.words <- strsplit(dickens.lower.v, "\\W") # strsplit is very important: it splits each line of the lowercased vector into words by matching non-word characters, i.e., word boundaries
# the result is a list: each item holds the words of one line. In the simplest case, x is a single character string, and strsplit outputs a one-item list.
class(dickens.words) # the class function tells you the data structure of your variable
dickens.words.v <- unlist(dickens.words)
class(dickens.words.v)
dickens.words.v[1:20] # find the first 20 words in Great Expectations
```
Did you notice the "\\W" in the `strsplit` argument? What is that again? Regex! Notice that in R you need to use another backslash to indicate a character escape.
Also, did you notice the blank result on the 10th word? This requires a little clean-up step.
```{r}
not.blanks.v <- which(dickens.words.v!="")
dickens.words.v <- dickens.words.v[not.blanks.v]
```
Extra white spaces often cause problems for text analysis.
```{r}
dickens.words.v[1:20]
```
Voila! We might want to examine how many times "father" occurs (it appears early in the word list, and will probably be an important word in this book).
```{r}
length(dickens.words.v[which(dickens.words.v=="father")])
```
Or produce a list of all unique words.
```{r}
unique(sort(dickens.words.v, decreasing = FALSE))[1:50]
```
Here we find another problem: we find in our unique word list some odd non-words such as "0037m." We should strip those out.
## Exercise
Create a regular expression to remove those non-words in `dickens.words.v`. Remember that you use two backslashes (\\) for a character escape. For more information on using regex in R, RStudio has a helpful [cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).
```{r}
```
Now let's re-run that not.blanks vector to strip out the blank you just added.
```{r}
not.blanks.v <- which(dickens.words.clean.v!="")
dickens.words.clean.v <- dickens.words.clean.v[not.blanks.v]
unique(sort(dickens.words.clean.v, decreasing = FALSE))[1:50]
```
Returning to basic functions, now that we have done some more clean-up: how many unique words are in the book?
```{r}
length(unique(dickens.words.clean.v))
```
Divide this by the amount of words in the whole book to calculate vocabulary density ratios.
```{r}
unique.words <- length(unique(dickens.words.clean.v))
total.words <- length(dickens.words.clean.v)
unique.words/total.words
# you could do this quicker this way:
# length(unique(dickens.words.v))/length(dickens.words.v)
# BUT it's good to get into the practice of storing results in variables
```
That's actually a fairly small density number, 5.7% (*Moby-Dick* by comparison is about 8%).
The other important data structures are tables and data frames. These are probably the most useful for sophisticated analyses, because they render the data in a table that is very similar to a spreadsheet. A common workflow is to enter your data in an Excel or Google Docs spreadsheet and then export it as a comma-separated value (.csv) or tab-separated value (.tsv) file. Many of the tidytext operations work with data frames, as we'll see later.
# Stylometry and Text Analysis with the `stylo` package
## Installing stylo
- run RStudio (or the R console)
- in the Console, type `install.packages("stylo")` and press Enter; or, find "Packages" in the lower-right pane, click "Install," type "stylo," and click "Install"
```{r}
library(stylo)
```
## Installation issues
**NOTE** (Mac OS users): the package stylo requires the installation of X11 support. (See http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html: “Each binary distribution of R available through CRAN is build to use the X11 implementation of Tcl/Tk. Of course a X windows server has to be started first: this should happen automatically on OS X, provided it has been installed (it needs a separate install on Mountain Lion or later). The first time things are done in the X server there can be a long delay whilst a font cache is constructed; starting the server can take several seconds”.)
You may also need to download XQuartz at https://www.xquartz.org/.
- Install XQuartz, restart Mac
- Open Terminal, type: sudo ln -s /opt/X11 /usr/X11
- Run XQuartz
- Run R, type: system('defaults write org.R-project.R force.LANG en_US.UTF-8')
On macOS Mojave one usually faces the problem of tcltk support not being properly recognized. Open your terminal and type the following command:
`xcode-select --install`.
This will download and install the Xcode developer tools and fix the problem (you will need to explicitly agree to the license agreement).
You might also run into encoding errors when you start up R (e.g. “WARNING: You’re using a non-UTF8 locale” etc.). In that case, you should close R, open a new window in Applications > Terminal and execute the following line:
`defaults write org.R-project.R force.LANG en_US.UTF-8`.
Next, close the Terminal and start up R again.
See more at https://github.com/computationalstylistics/stylo.
## Activate Stylo
The package's interface is a single function: `stylo()`. This only works, though, if you have already set up a corpus directory (named "corpus").
Download the stylo corpus [here](https://www.dropbox.com/sh/n1tep3om866esa9/AABXfmK8syaAyXCjZ-ESKY2Za?dl=0). Make sure you have saved it to your Desktop, along with the R notebook. It contains works by Mark Twain and Charles Dudley Warner (more on why later).
`stylo` computes distances (differences) between texts, which are represented as rows of frequencies of the most frequent words (MFW).
It then plots graphs of those distances:
- Cluster Analysis plots (dendrograms)
- Multidimensional Scaling scatterplots
- Principal Components Analysis scatterplots
- Bootstrap Consensus Trees plots (for multiple parameter settings)
- Bootstrap Consensus Networks (other software will be needed to take over)
The plots can be both displayed on screen and saved to a file (e.g. PNG).
# stylo GUIs
The package currently has two graphical user interfaces (GUIs). One creates static visualisations of stylometric results, and the other creates a dynamic network graph that represents the distance measurements.
```{r}
stylo()
```
What should pop up is the following stylo GUI:
![stylo-gui.png](stylo-gui.png)
For your first experiment you should just click "OK" and see what happens.
The default settings use the ratios of the 100 most frequent words, Classic Delta distance measure, and Ward clustering algorithm to produce a hierarchical clustering dendrogram (think of it as akin to a stylistic family tree).
There are various parameters in `stylo` that are worth exploring further and experimenting with.
- INPUT: the text format
- LANGUAGE (several options)
- FEATURES: units to count, either words or characters. Ngram size: 1 for single words or characters, 2 for pairings, and so on. Usually people choose word unigrams (1 word).
- MFW SETTINGS: how many most frequent words to use. In most cases, you will set a single value by making Minimum = Maximum.
- CULLING: filters out unwanted words. 0 = all the words survive culling; 20 = a given word has to appear in at least 20% of the texts; 100 = removal of all words that don't appear in all the texts (this is not typical).
- DISTANCES: choose how the similarities between texts should be measured
- Classic Delta: perhaps a best choice to start; focuses on most common word frequencies
- Cosine Delta (aka Würzburg Delta): perhaps an even better choice
- Eder’s Delta: a good choice for highly inflected languages
- SAMPLING: option for splitting the texts
- no sampling: the texts will be analyzed as they are
- normal sampling: dividing the texts into equal-sized blocks
- random sampling: randomly harvesting N words from each text
- number of samples: random harvesting can be repeated n times
Other distance measurements and text parameters can be defined in the GUI. But the GUI is not necessary; one can also use the `stylo` function with various arguments.
```{r}
# this function activates an already-existing dataset:
data(lee)
# this function launches the analysis with pre-defined parameters:
stylo(frequencies = lee, analysis.type = "BCT",
mfw.min = 100, mfw.max = 3000, custom.graph.title = "Harper Lee",
write.png.file = TRUE, gui = FALSE)
```
## Creating a network of similarities
Run the chunk below to get the GUI for the network function, which outputs a bootstrap consensus network. Make sure you have installed the "networkD3" package before executing this code.
```{r}
stylo.network()
```
The relative distances are now mapped on a web browser and can be saved as html files for later use.
## Corpus ingestion and analysis
In the corpus directory above, I have included a subdirectory of texts by Mark Twain and Charles Dudley Warner. The reason for this is that they were near contemporaries and friends, and they co-wrote a novel called *The Gilded Age*. We are going to run stylo experiments to investigate the differences between the two authors.
```{r}
my.corpus <- load.corpus.and.parse(corpus.dir = "corpus", markup.type = "plain", ngram.size = 1)
```
```{r}
mt.cdw.freq.l <- make.frequency.list(my.corpus, value = FALSE, head = NULL,
relative = TRUE)
# this generates a word frequency list for the entire corpus
```
```{r}
#these two lines of code automatically generate relative frequencies based on the above frequency list
words = txt.to.words.ext(my.corpus)
mt.cdw.rel.freq.t <- make.frequency.list(words, value = TRUE)
mt.cdw.rel.freq.t[1:10]
```
```{r}
make.samples(words, sampling = "normal.sampling", sample.size = 50)
```
```{r}
complete.word.list = make.frequency.list(words)
make.table.of.frequencies(words, complete.word.list)
```
```{r}
mt.cdw.table <- make.table.of.frequencies(words, complete.word.list)
write.csv(mt.cdw.table, "mark-twain-warner-table.csv")
# this outputs all of the work-based relative frequency data into a csv file
```
```{r}
tokenized.corpus <- txt.to.words.ext(my.corpus, language = "English.all",
preserve.case = FALSE)
summary(tokenized.corpus)
```
```{r}
sliced.corpus <- make.samples(tokenized.corpus, sampling = "normal.sampling",
sample.size = 100)
frequent.features <- make.frequency.list(sliced.corpus)
frequent.features[1:50]
frequent.features[100:150]
```
## More stylo code
The code below puts your stylo results into a variable so that you can call upon different columns from its data frame.
```{r}
stylo.results <- stylo()
```
```{r}
stylo.results$features[1:100]
```
```{r}
stylo.results$distance.table
```
## Craig's zeta comparison
Craig's zeta will allow you to compare two data sets by juxtaposing word preferences. In order to do this you need to create subdirectories within the `corpus` directory called `primary_set` and `secondary_set`. Copy the Mark Twain texts into the primary set, and the Warner texts into the secondary one.
```{r}
corpus.all <- load.corpus.and.parse(files = "all", corpus.dir = "corpus",
                                    corpus.lang = "English.all",
                                    preserve.case = TRUE)
corpus.mt <- corpus.all[grep("twain", names(corpus.all))]
corpus.cdw <- corpus.all[grep("warner", names(corpus.all))]
zeta.results <- oppose(primary.corpus = corpus.mt,
secondary.corpus = corpus.cdw, gui = TRUE)
# In the GUI, navigate to the corpus folder, in which you have put primary_set and secondary_set
```
This outputs a list of preferred and avoided words by the texts in the primary set (Mark Twain).
```{r}
zeta.results$words.preferred[1:20]
zeta.results$words.avoided[1:20]
```
So, what is distinctly Mark Twain and what is Warner-esque?
## Other useful functions
For performing supervised machine-learning analyses, including Burrows’s Delta, Support Vector Machines, and so forth:
```{r}
classify()
```
Performing contrastive analyses of two subcorpora:
```{r}
oppose()
# in the GUI I have chosen Craig's zeta, which was used above, except I have checked the boxes for visualising differences: "Markers" and "Identify Points".
```
What can you gather from here about the probability of majority authorship?
The Rolling Stylometry technique slices an input text into equal-sized samples and compares them sequentially with reference data; it is good at finding local idiosyncrasies in longer texts. It can also analyse collaborative works and try to determine the authorship of fragments extracted from them. This requires that the working directory contains two subdirectories: "reference_set" and "test_set."
```{r}
rolling.classify(write.png.file = TRUE)
```
What you're seeing is a series of "windows" of the test text compared against the reference texts. By "windowing" I mean that the test text is divided into consecutive, equal-sized samples. The classifier employs the relative frequencies of a (preferably small) set of words that are also frequent in the reference collection. As Eder et al. suggest, "If the curve for a text would show a sudden drop at a given position, this could be indicative of a stylistic change in the text (which might, for instance, be caused by one author taking over from another)."
The vertical lines in the plot can be thought of as marking the position of certain events in the test text, such as a change of chapter or a change in style.
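To make the windowing idea concrete, here is a toy base-R sketch (not the actual `rolling.classify` internals, just an illustration of slicing a token stream into consecutive, equal-sized samples):

```{r}
# a stand-in token stream of 20 "words"
tokens <- letters[1:20]
# assign each token to a window of 5 consecutive tokens, then split
windows <- split(tokens, ceiling(seq_along(tokens) / 5))
windows
# yields 4 windows of 5 tokens each
```

In the real analysis, each such window would be compared against the reference sets and classified independently.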
To learn more about stylo, consult [Eder, Rybicki, and Kestemont's documentation](https://4bc8d809-a-62cb3a1a-s-sites.googlegroups.com/site/computationalstylistics/stylo/stylo_howto.pdf?attachauth=ANoY7coDX7i5IQiUFMzj3t5plryJdzEX6HalsOFNYcY0MuEkRjEcgRdxintmXDmiTmrk9iiKOLNf_u-sXgosAnlG75tz1USWfoHiNe4rhFuFjoyqPfPaFIb3W4q63VxJ3a4Etpec8SMrqdMRMvkeApHeHzPNO3zvvUwmieVvBW3H68wOsWG2ZRRc4_nO0rM5dm2cb4obSiqjRe4_-VaDfN2vshvxBf_fwtvvzmzQGpCH5U9hnvTQb-M%3D&attredirects=0).
## Using TidyText for distant reading
For these two lessons we will be modifying code from Julia Silge and David Robinson's [*Text Mining with R: A Tidy Approach*](https://www.tidytextmining.com/).
Before getting started, make sure you have set your working directory.
```{r warning = FALSE}
setwd("~/Desktop")
```
Next we load the necessary libraries for these lessons. **Note**: If you get error messages, you will need to install the libraries by navigating to the "Packages" tab on the right-side panel of RStudio. Then click "Install," enter the name of the package, and install it.
```{r warning=FALSE, message=FALSE}
library(tidytext)
library(dplyr)
library(stringr)
library(glue)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(gutenbergr)
```
Before going into more details, I will briefly explain the 'tidy' approach to data that will be used in the following. The tidy approach assumes three principles regarding data structure:^[For more on this, see Hadley Wickham's “Tidy Data,” *Journal of Statistical Software* 59 (2014): 1–23. https://doi.org/10.18637/jss.v059.i10.]
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
What results is a **table with one-token-per-row**. (Recall that a token is any meaningful unit of text: usually it is a word, but it can also be an n-gram, sentence, or even a root of a word.)
```{r}
pound_poem <- c("The apparition of these faces in the crowd;", "Petals on a wet, black bough.")
pound_poem
```
Here we have created a character vector like we did before: the vector consists of two strings of text. In order to transform this into tidy format, we need to transform it into a data frame (here called a 'tibble', a type of data frame in R that is more convenient for text-based analysis).
```{r}
pound_poem_df <- tibble(line = 1:2, text = pound_poem)
pound_poem_df
```
While better, this format is still not useful for tidy text analysis because we still need each word to be individually accounted for. To accomplish this act of tokenization, use the `unnest_tokens` function.
```{r}
pound_poem_df %>% unnest_tokens(word, text)
# the unnest_tokens function requires two arguments: the output column name (word), and the input column that the text comes from (text)
```
Notice how each word is in its own row, but also that its original line number is still intact. That is the basic logic of tidy text analysis. Now let's apply this to a larger data set.
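Recall that a token need not be a single word. The same `unnest_tokens` call can produce n-grams via its `token` argument; here is the poem from above tokenized into bigrams:

```{r}
# tokenize the poem into bigrams (overlapping two-word sequences);
# each bigram still carries its original line number
pound_poem_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
```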
**Using the `gutenbergr` package with tidytext:**
By printing the gutenberg_authors dataset, you can see the format of the author names.
```{r}
gutenberg_authors
```
Let's run our first file loading function.
```{r}
# this searches gutenberg for titles with the author name specified after the 'str_detect' function
gutenberg_works(str_detect(author, "Livy"))$title
```
Did you notice anything wrong with this? The first result duplicates some of the content of the fourth, so we should not use that first text id. Remember, the first rule of scholarship is TRUST NO ONE. In computing, never trust your data. So we'll narrow the ingestion of the gutenberg ids to start with the second result.
```{r message=FALSE}
# creates a variable holding the gutenberg ids of the remaining Livy results
ids <- gutenberg_works(str_detect(author, "Livy"))$gutenberg_id[2:5]
livy <- gutenbergr::gutenberg_download(ids)
livy <- livy %>%
group_by(gutenberg_id) %>%
mutate(line = row_number()) %>%
ungroup()
```
Here we created a new data frame called ```livy``` after invoking the ```gutenberg_works``` function to find Livy's ids. What does the ```gutenberg_download``` function do? Again, type a ? before the function name to receive a description from the R Documentation. Try the `example` function, too.
Also, from the code above you might be wondering what the ```$``` and ```%>%``` symbols mean. The ```$``` extracts a named column from a data frame (or a named element from a list). The ```%>%``` is a connector (a pipe) that mimics nesting. The rule is that the object on the left-hand side is passed as the first argument to the function on the right-hand side, so, considering the last two lines, ```mutate(line = row_number()) %>% ungroup()``` is the same as ```ungroup(mutate(line = row_number()))```. It just makes the code (and particularly multi-step functions) more readable.^[Granted, it is not part of R's base code, but it was defined by the `magrittr` package and is now widely used in the ```dplyr``` and ```tidyr``` packages.]
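To see the equivalence for yourself, compare the nested and piped forms of the same computation on a toy vector:

```{r}
# nested form: the innermost call runs first
sort(unique(c(3, 1, 2, 1)))
# piped form: reads left to right, same result
c(3, 1, 2, 1) %>% unique() %>% sort()
# both return: 1 2 3
```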
```{r}
?gutenberg_download
```
Now let's see what we have downloaded. R has a summary function to show information about the new data frame we just created, ```livy```.
```{r}
summary(livy)
```
Now we transform this into a tidy data set.
```{r}
tidy_livy <- livy %>%
unnest_tokens(output = word, input = text, token = "words")
tidy_livy %>%
count(word, sort = TRUE) %>%
filter(n > 4000) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
```
Now we are mostly seeing function words in these results. But what is interesting about the function words? Notice the prominence of pronouns, for example.
Of course you will want to complement these results with substantive results (i.e., with stop words filtered out).
```{r}
data(stop_words)
tidy_livy <- tidy_livy %>%
anti_join(stop_words)
livy_plot <- tidy_livy %>%
count(word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
ylab("Word frequencies in Livy's History of Rome") +
coord_flip()
livy_plot
```
In the visual above, you might want to locate the 'Show in New Window' button in the upper right corner so that you can view the results at a larger size.
We might also want to read the word frequencies, or have them in a searchable table. The code below renders the results above as a table and then writes all of the results into a csv (spreadsheet) file.
```{r}
tidy_livy %>%
count(word, sort = TRUE)
livy_words <- tidy_livy %>%
count(word, sort = TRUE)
write_csv(livy_words, "livy_words.csv")
# Note that if you want to retain the tidy data (that is, the title-line-word columns in multiple works, say),
# then you would just invoke the tidy_livy variable: write_csv(tidy_livy, "livy_words.csv")
```
Much of what we have done can also be done in [Voyant Tools](http://voyant-tools.org/), to be sure. However, we have been able to load data *faster* in R, and we have also organized the data in tidytext tables that allow us to make judgments about the similarities and differences between the works in the corpus. It is also important to stress that you retain more control over organizing and manipulating your data with R, whereas in Voyant you are beholden to unstructured text files in a pre-built visualization interface.
To illustrate this flexibility, let's investigate the data in ways that are unique to R (and programming in general).
We might want to make similar calculations by book, which is easier now due to the tidy data structure.
```{r}
livy_word_freqs_by_book <- tidy_livy %>%
group_by(gutenberg_id) %>%
count(word, sort = TRUE) %>%
ungroup()
livy_word_freqs_by_book %>%
filter(n > 250) %>%
ggplot(mapping = aes(x = word, y = n)) +
geom_col() +
coord_flip()
```
This shows you the general trend of each word that is used more than 250 times in alphabetical order. We can also break up the results into individual graphs for each book.
```{r}
livy_word_freqs_by_book %>%
filter(n > 250) %>%
ggplot(mapping = aes(x = word, y = n)) +
geom_col() +
coord_flip() + facet_wrap(facets = ~ gutenberg_id)
```
This might appear to be an overwhelming picture, but it is an immediate display of similarities and differences between books. Granted, they are slightly out of order (id 10907 is The History of Rome, Books 09 to 26, and 12582 is Books 01 to 08), but you can immediately notice how the first half differs from the second in its content.
We could re-engineer the code in the previous examples to look more closely at these results. First we'll narrow our data set to the more interesting id numbers mentioned already.
```{r}
livy2 <- gutenberg_download(c(10907, 44318))
livy_tidy2 <- livy2 %>%
group_by(gutenberg_id) %>%
mutate(line = row_number()) %>%
ungroup()
livy_tidy2 <- livy_tidy2 %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
livy_word_freqs_by_book <- livy_tidy2 %>%
group_by(gutenberg_id) %>%
count(word, sort = TRUE) %>%
ungroup()
livy_word_freqs_by_book %>%
filter(n > 210) %>%
ggplot(mapping = aes(x = word, y = n)) +
geom_col() +
coord_flip() + facet_wrap(facets = ~ gutenberg_id)
```
What is the most consistent word used throughout Livy's *History*?
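One way to approach this question programmatically (a sketch using the `livy_word_freqs_by_book` table built above): keep only words that occur in both book ids, then rank them by their smaller per-book count, so that words used heavily in only one half drop down the list.

```{r}
livy_word_freqs_by_book %>%
  group_by(word) %>%
  filter(n() == 2) %>%                       # word appears in both ids
  summarise(minimum = min(n), total = sum(n)) %>%
  arrange(desc(minimum))                     # most consistent words first
```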
Let's now compare these results to another important chronicler, from a different era: Herodotus.
```{r}
herodotus <- gutenberg_download(c(2707, 2456))
```
This downloads the two-volume *Histories* of Herodotus (the values passed to ```c()``` are the gutenberg ids of the two volumes; ids can be found by searching for texts on gutenberg.org, clicking on the Bibrec tab, and copying the EBook-No.).
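If you prefer not to leave R, the same ids can be looked up programmatically, just as we did for Livy:

```{r}
# list the Herodotus entries in the gutenbergr metadata,
# including their gutenberg_id values
gutenberg_works(str_detect(author, "Herodotus"))
```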
```{r}
tidy_herodotus <- herodotus %>%
unnest_tokens(word, text)
tidy_herodotus %>%
count(word, sort = TRUE)
```
What are the differences here with the Livy results?
Now let's filter out the stop words again.
```{r}
tidy_herodotus <- herodotus %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
tidy_herodotus %>%
count(word, sort = TRUE)
```
We could also add yet another text into the mix. Let's try Edward Gibbon's magisterial *Decline and Fall of the Roman Empire*.
```{r}
gutenberg_works(str_detect(author, "Gibbon, Edward"))
eg.ids <- gutenberg_works(str_detect(author, "Gibbon, Edward"))$gutenberg_id[1:6]
eg.ids
gibbon <- gutenbergr::gutenberg_download(eg.ids)
tidy_gibbon <- gibbon %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)