# Methods

## Data Cleaning 

The US Department of State produces the *Foreign Relations of the United States (FRUS)* series in both print and online forms. The online volumes are hosted in a public GitHub repository as TEI XML [here](https://github.com/HistoryAtState/frus). While the markup in the files records useful identifiers such as the officials involved and the type of document, this project requires plain text rather than encoded documents. Using an XQuery in Oxygen XML Editor, I removed the markup and introductory information, leaving only the contents of the `<text>` element. This creates a version of the files particular to my project. The XQuery removes the `<teiHeader>`, which contains metadata; the front matter, which includes the introductory publication statements and table of contents; and the back matter, such as the index. The query, which can be found [here](https://drive.google.com/file/d/1MkVhOUxD4IJbdr4Hg0z7JN83tt5Sc9Se/view?usp=sharing), converted each file into a plain text document.
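For readers who prefer to stay in R, the same kind of stripping can be approximated with the `xml2` package. This is only a sketch of an alternative, not the XQuery used for this project: the sample document is made up, and the variable names are hypothetical.

```r
library(xml2)

# A tiny stand-in for a FRUS volume (made-up content, TEI-like structure)
teiString <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>Metadata</fileDesc></teiHeader>
  <text>
    <front><p>Preface</p></front>
    <body><p>Telegram from the Embassy.</p></body>
    <back><p>Index</p></back>
  </text>
</TEI>'

doc <- read_xml(teiString)
xml_ns_strip(doc)  # drop the default namespace so plain XPath names match

# Keep only <text>, then drop its front and back matter
textNode <- xml_find_first(doc, "//text")
xml_remove(xml_find_all(textNode, "./front | ./back"))

# Extract the remaining text content without any markup
plainText <- trimws(xml_text(textNode))
plainText
```

Only the body text survives; the header, preface, and index are gone.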

To begin, make sure that you are in the correct location. Run the next chunk of code and check that the file path it prints ends with "CloutierDHProject/Methods and Code".

In [1]:
getwd()
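The same check can be made programmatically rather than by eye. The following is a minimal sketch that is not part of the original notebook; the helper name `in_project_dir` is hypothetical.

```r
# Hypothetical helper: TRUE when `path` ends with the expected project folder
in_project_dir <- function(path = getwd(),
                           expected = "CloutierDHProject/Methods and Code") {
  # Use forward slashes so the comparison also works on Windows-style paths
  path <- gsub("\\\\", "/", path)
  endsWith(path, expected)
}

in_project_dir()
```

If this returns `FALSE`, change the working directory with `setwd()` before continuing.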

In [2]:
install.packages("tidyverse")
install.packages("tidytext")
install.packages("magrittr")
install.packages("devtools")
install.packages("tsne")
install.packages("usethis")
install.packages("SnowballC")



In [5]:
library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(usethis)
library(SnowballC)
# wordVectors is installed from GitHub in the next cell; run the line below only after that install has finished
library(wordVectors)

In [4]:
devtools::install_github('bmschmidt/wordVectors', force=TRUE)


For this next portion, please be sure to provide the correct file name. The `path2file` variable currently points the code to the documents concerning Latin America during the Cold War. If you would rather analyze the European files, change it to "data/frusEU".

In [6]:
path2file <- "data/frusLA"
fileList <- list.files(path2file,full.names = TRUE) 

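# Read one file and return a one-row tibble holding the file name and its full text
readTextFiles <- function(file) { 
  message(file)  # report progress
  # Read the file line by line, trimming surrounding whitespace
  rawText <- scan(file, sep="\n", what="character", strip.white = TRUE)
  # fixed = TRUE treats the path as literal text rather than a regular expression
  output <- tibble(filename=gsub(path2file, "", file, fixed = TRUE), text=rawText) %>% 
    group_by(filename) %>% 
    summarise(text = paste(rawText, collapse = " "))  # join all lines into one string
  return(output)
}

# Apply the reader to every file and stack the results
combinedTexts <- tibble(filename=fileList) %>% 
  group_by(filename) %>% 
  do(readTextFiles(.$filename)) 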

data/frusLA/frus1945v09.txt

data/frusLA/frus1946v11.txt

data/frusLA/frus1947v08.txt

data/frusLA/frus1948v09.txt

data/frusLA/frus1949v02.txt

data/frusLA/frus1950v02.txt

data/frusLA/frus1951v02.txt

data/frusLA/frus1952-54v04.txt

data/frusLA/frus1955-57v06.txt

data/frusLA/frus1955-57v07.txt

data/frusLA/frus1958-60v05.txt

data/frusLA/frus1958-60v06.txt

data/frusLA/frus1961-63v10.txt

data/frusLA/frus1961-63v12.txt

data/frusLA/frus1964-68v31.txt

data/frusLA/frus1964-68v32.txt
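To see what this step produces without loading the full corpus, here is a small self-contained sketch in base R. It uses `readLines()` rather than `scan()`, and the two throwaway files and the `demo` names are made up for illustration.

```r
# Create two small throwaway files so the example is self-contained
tmpDir <- file.path(tempdir(), "frusDemo")
dir.create(tmpDir, showWarnings = FALSE)
writeLines(c("First line.", "Second line."), file.path(tmpDir, "a.txt"))
writeLines(c("Only line."), file.path(tmpDir, "b.txt"))

demoFiles <- list.files(tmpDir, pattern = "\\.txt$", full.names = TRUE)

# Collapse each file's lines into a single string, one row per file
demoTexts <- data.frame(
  filename = basename(demoFiles),
  text = vapply(demoFiles,
                function(f) paste(trimws(readLines(f)), collapse = " "),
                character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

demoTexts
```

Each file becomes one row, with all of its lines joined by single spaces, which is the shape `combinedTexts` has for the FRUS volumes.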



In the next chunk of code, users should replace "your_file_name" with the name they want to give their model file.

In [7]:
# Don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")
combinedTexts$text %>% write_lines(w2vInput)
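As an illustration of what these lines build, the sketch below shows the three paths produced by a hypothetical model name, "frusLA_model". The `demo` variables exist only for this example and do not replace the ones above.

```r
demoBase <- "frusLA_model"  # hypothetical name, used only for illustration

demoPaths <- c(
  input   = paste0("data/", demoBase, ".txt"),          # combined raw text
  cleaned = paste0("data/", demoBase, "_cleaned.txt"),  # text after cleaning
  model   = paste0("data/", demoBase, ".bin")           # trained model file
)

demoPaths
```

This gives "data/frusLA_model.txt", "data/frusLA_model_cleaned.txt", and "data/frusLA_model.bin", so all three files share one recognizable base name inside the data folder.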

## Parameters

Each model depends on the parameters selected before running the code. Word embedding models allow you to “choose how expansive you want the explored space to be” (Schmidt 2015). Tuning the parameters can improve accuracy, and the best settings depend on the analysis you intend to complete. To test the usage of terms in the corpus, I tried a wide range of parameters and assessed the accuracy of each resulting model. Each set of parameters was tested on both the Latin American and European corpora and run through six iterations on each corpus. Ultimately, there were six sets of parameters and 62 models created in total. A list of the models and their parameters can be found here.
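When comparing several parameter sets across both corpora, it can help to lay the combinations out in a table before training anything. The sketch below is an assumption about how that bookkeeping might look: only set "a" mirrors the values used in the training cell of this notebook, while set "b" and the file naming scheme are hypothetical.

```r
# Hypothetical parameter sets; only set "a" mirrors values used in this notebook
paramSets <- data.frame(
  set              = c("a", "b"),
  vectors          = c(500, 100),
  window           = c(10, 5),
  negative_samples = c(15, 5),
  stringsAsFactors = FALSE
)
corpora <- c("LA", "EU")

# One row per (corpus, parameter set) combination; by = NULL gives a cross join
modelGrid <- merge(data.frame(corpus = corpora, stringsAsFactors = FALSE),
                   paramSets, by = NULL)

# Hypothetical naming scheme for the model files
modelGrid$output_file <- paste0("data/", modelGrid$corpus, "_", modelGrid$set, ".bin")

modelGrid
```

Each row can then be fed to the training step in turn, which keeps the parameter choices and the resulting file names in one place.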

In [3]:
# Load wordVectors first; without it, prep_word2vec() and train_word2vec() cannot be found
library(wordVectors)

THREADS <- 3

# Clean the combined text (lowercased, no n-gram bundling) before training
prep_word2vec(origin=w2vInput,destination=w2vCleaned,lowercase=TRUE,bundle_ngrams=1)

# Train a new model only if no .bin file with this name exists; otherwise load the saved one
if (!file.exists(w2vBin)) {
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file=w2vBin,
    vectors=500,
    threads=THREADS,
    window=10, iter=10, negative_samples=15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}


If you would rather read in an existing .bin file, see the next chunk of code. I have included the .bin files for both the Latin American and European documents that I use in the later analysis. To explore those models instead of waiting for a new one to train, input "LA6a.bin" for the Latin American corpus or "EU6a.bin" for the European corpus. To use this code chunk, remove the "#" before the code. In R, "#" marks a comment, so any code that follows it on the same line is not run.

In [None]:
#This is to read in an existing file. Please see the Github repository for various iterations of this model that you can feed in. Be sure to remove the hashtag to choose this option instead.
  #w2vModel <- read.vectors("data/your_file_name.bin")

Below is a plot of the model, followed by some exploration of the words closest to selected terms.

In [None]:
w2vModel %>% plot(perplexity=10)

In [None]:
w2vModel %>% closest_to("girl", 30) %>% View()

In [None]:
w2vModel %>% closest_to(~"girl"+"woman"+"girls"+"women", 20) %>% View()
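`closest_to()` ranks words by the cosine similarity between their vectors and the query. The toy sketch below reproduces that idea in base R so it can be run without a trained model; the four word vectors are made-up numbers, and `cosine_to_all` and `nearest_words` are hypothetical helpers, not functions from the wordVectors package.

```r
# Toy "model": four words in a two-dimensional space (made-up numbers)
toyModel <- rbind(
  girl   = c(1.0, 0.1),
  woman  = c(0.9, 0.2),
  treaty = c(0.0, 1.0),
  oil    = c(0.1, 0.9)
)

# Cosine similarity between one vector and every row of a matrix
cosine_to_all <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Hypothetical helper: the n words most similar to `word`
nearest_words <- function(m, word, n = 3) {
  sims <- setNames(cosine_to_all(m, m[word, ]), rownames(m))
  head(sort(sims, decreasing = TRUE), n)
}

nearest_words(toyModel, "girl")
```

The query word itself comes first, followed by "woman" and then "oil". A combined query can be approximated in the same way by passing a sum of rows, such as `toyModel["girl", ] + toyModel["woman", ]`, to `cosine_to_all()`.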

## Clusters

My thoughts on how these clusters show interesting patterns of foreign policy.

In [None]:
centers <- 150
clustering <- kmeans(w2vModel,centers=centers,iter.max = 40)

sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

In [None]:
# Change "name_of_your_query" to a descriptive name that you want to give to your export file.
w2vExport <-sapply(sample(1:centers,150),function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

write.csv(file="output/name_of_your_query.csv", x=w2vExport)
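To make the clustering step easier to follow, here is a toy version that runs without a trained model. The six word vectors are made-up numbers arranged in two obvious groups, and the `toy` names are hypothetical.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Toy vectors: two obvious groups of "words" (made-up numbers)
toyVectors <- rbind(
  oil       = c(5.0, 5.1),
  petroleum = c(5.2, 4.9),
  export    = c(4.8, 5.0),
  treaty    = c(0.1, 0.2),
  accord    = c(0.0, 0.1),
  pact      = c(0.2, 0.0)
)

toyClustering <- kmeans(toyVectors, centers = 2, nstart = 5)

# List the member words of each cluster
toyGroups <- split(names(toyClustering$cluster), toyClustering$cluster)
toyGroups
```

As in the cells above, `clustering$cluster` is a named vector that maps each word to a cluster number, so the names can be filtered or split by cluster to read off the members.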

## Validating the Model

The original code includes a short method for validating the model. I have added five more word pairs that are good representations of the corpus. These were instrumental in determining which iterations of the model were the strongest representations of the corpus. You can find the Latin American validation data [here](https://drive.google.com/file/d/1k75cYaVa8hmUtyoxOsi_0HX_mNmldD4B/view?usp=sharing) and the European validation data [here](https://drive.google.com/file/d/1oFj3rbjzaCtJgTvN0LvwPfVmyxIigJuo/view?usp=sharing).

In [None]:
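# Find every trained model (.bin file) under the working directory.
# The pattern is a regular expression, so the dot is escaped.
files_list <- list.files(pattern="\\.bin$", recursive=TRUE)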

rownames <- c()

data_frame <- data.frame()
data = list(c("away", "off"),
            c("before", "after"),
            c("cause", "effects"),
            c("children", "parents"),
            c("come", "go"),
            c("day", "night"),
            c("first", "second"),
            c("good", "bad"),
            c("last", "first"),
            c("kind", "sort"),
            c("leave", "quit"),
            c("life", "death"),
            c("girl", "boy"),
            c("little", "small"),
            c("oil", "petroleum"),
            c("state", "department"),
            c("confidential", "secret"),
            c("east", "west"),
            c("aid", "assistance"))


data_list = list()

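# cosine() is not provided by the packages loaded above, so define it here
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

for(fn in files_list) {
  
  wwp_model = read.vectors(fn)
  sims <- c()
  for(pairs in data)
  {
    # Flatten each word's vector to a plain numeric vector
    vector1 <- as.numeric(wwp_model[[pairs[1]]])
    vector2 <- as.numeric(wwp_model[[pairs[2]]])
    
    sims <- c(sims, cosine(vector1, vector2))
  }
  # Store this model's similarities under its file name, whatever folder it is in
  data_list[[basename(fn)]] <- sims
  
}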

for(pairs in data) {
  rownames <- c(rownames, paste(pairs[1], pairs[2], sep="-"))
}

results <- structure(data_list,
                     class     = "data.frame",
                     row.names = rownames
)

write.csv(file="output/model-test-results.csv", x=results)
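Once the results are written out, one way to compare models is to average the similarities for each model. The sketch below assumes that a higher mean similarity marks a stronger model, which is an assumption made for illustration rather than the stated selection rule; the scores and model names are made up.

```r
# Made-up similarity scores in the same shape as model-test-results.csv:
# one row per word pair, one column per model file
demoResults <- data.frame(
  modelA.bin = c(0.62, 0.55, 0.71),
  modelB.bin = c(0.48, 0.60, 0.52),
  row.names  = c("oil-petroleum", "aid-assistance", "east-west")
)

# Average similarity per model, highest first
modelScores <- sort(colMeans(demoResults), decreasing = TRUE)
modelScores

# Name of the model with the highest mean similarity
bestModel <- names(modelScores)[1]
bestModel
```

Looking at individual rows remains worthwhile, since a model can score well on average while handling a pair that matters for your question poorly.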