# Methods

## Data Cleaning 

The US Department of State publishes the *Foreign Relations of the United States (FRUS)* series in both print and online forms. The online volumes are available for public use as TEI XML in a GitHub repository [here](https://github.com/HistoryAtState/frus). While the markup in the files records useful identifiers such as the officials involved and the type of document, this project requires plain text rather than encoded documents. Using Oxygen XML Editor, I ran an XQuery that removed the markup and introductory information, leaving only the contents within the `<text>` tag. The XQuery removes the `<teiHeader>`, which contains metadata; the front matter, which includes the introductory publication statements and the table of contents; and the back matter, such as the index. This creates a version of the files particular to my project. The query, which can be found [here](https://drive.google.com/file/d/1MkVhOUxD4IJbdr4Hg0z7JN83tt5Sc9Se/view?usp=sharing), converted each file into a plain text document.
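For readers without Oxygen, the same kind of reduction can be sketched in R. This is not the XQuery used for the project; it is a minimal illustration, assuming the `xml2` package is installed, and it runs on a tiny made-up TEI document rather than a real FRUS volume. It drops the header, front matter, and back matter and keeps only the body text.

```r
library(xml2)

# Hypothetical miniature TEI document standing in for a FRUS volume
tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample volume</title></titleStmt></fileDesc></teiHeader>
  <text>
    <front><div><head>Preface</head></div></front>
    <body><div type="document"><p>The Ambassador reported on the negotiations.</p></div></body>
    <back><div><head>Index</head></div></back>
  </text>
</TEI>'

doc <- read_xml(tei)
xml_ns_strip(doc)  # drop the default TEI namespace so plain XPath names work

# Keep only <text>, then discard its front and back matter
text_node <- xml_find_first(doc, "//text")
xml_remove(xml_find_all(text_node, "./front | ./back"))

# Collapse the remaining content into a single plain-text string
plain <- trimws(gsub("\\s+", " ", xml_text(text_node)))
print(plain)
```

Because the `<teiHeader>` sits outside `<text>`, selecting `<text>` already excludes it; only the front and back matter need to be removed explicitly.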

## Why word embeddings? 

Some more info on word embeddings for the academic audience.

In [3]:
# setwd() expects a local directory path, not a URL. Clone
# https://github.com/ccloutier312/CloutierDHProject.git first, then replace the
# placeholder path below with the location of your local copy.
setwd("path/to/CloutierDHProject")


In [None]:
install.packages("tidyverse")
install.packages("tidytext")
install.packages("magrittr")
install.packages("devtools")
install.packages("tsne")
install.packages("usethis")
install.packages("SnowballC")

In [None]:
library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(usethis)
library(SnowballC)
# Hold off on running the line below until after you get to the next section 
library(wordVectors)

In [None]:
devtools::install_github('bmschmidt/wordVectors', force=TRUE)

In [None]:
# Change "data/frus/" to match the name of the folder with your corpus
path2file <- "data/frus/"
fileList <- list.files(path2file, full.names = TRUE)

# Read one file and collapse its lines into a single string.
# Remember to run the whole function definition (place your cursor at the
# beginning or end, or select the whole section of code) before using it.
readTextFiles <- function(file) {
  message(file)
  rawText <- scan(file, sep = "\n", what = "character", strip.white = TRUE)
  # basename() drops the folder part of the path, leaving just the file name
  output <- tibble(filename = basename(file), text = paste(rawText, collapse = " "))
  return(output)
}

combinedTexts <- tibble(filename=fileList) %>% 
  group_by(filename) %>% 
  do(readTextFiles(.$filename)) 

In [None]:
# Don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")
combinedTexts$text %>% write_lines(w2vInput)

## Parameters

Each model depends on the parameters selected before the code is run. Word embedding models allow you to “choose how expansive you want the explored space to be” (Schmidt 2015). Which parameter values give the most accurate model depends on the analysis you intend to complete. To test the usage of terms in the corpus, I tried a wide range of parameters and compared the accuracy of the resulting models. Each set of parameters was tested on both the Latin American and European corpora and run through six iterations on each corpus. Ultimately, there were six sets of parameters and therefore 62 models created in total. A list of the models and their parameters can be found here.
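One way to keep track of such a set of runs is to lay the combinations out in a table before training. The sketch below uses only base R. The corpus labels, parameter values, and file-naming scheme are hypothetical and are not the ones used for this project.

```r
# Hypothetical parameter values, for illustration only
param_grid <- expand.grid(
  corpus  = c("latin_america", "europe"),
  vectors = c(100, 500),
  window  = c(6, 10),
  stringsAsFactors = FALSE
)

# Give each combination its own model file so no run overwrites another
param_grid$bin_file <- with(
  param_grid,
  sprintf("data/%s_v%d_w%d.bin", corpus, vectors, window)
)

print(param_grid)
```

Each row of `param_grid` describes one model to train, and the `bin_file` column could be passed as the `output_file` argument when training.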

In [None]:
THREADS <- 3

prep_word2vec(origin=w2vInput, destination=w2vCleaned, lowercase=TRUE, bundle_ngrams=1)

#See the introductory file for a reminder on how you might adjust the parameters below
if (!file.exists(w2vBin)) {
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file=w2vBin,
    vectors=500,
    threads=THREADS,
    window=10, iter=10, negative_samples=15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}

In [None]:
#This is to read in an existing file. Please see the Github repository for various iterations of this model that you can feed in. Be sure to remove the hashtag to choose this option instead.
  #w2vModel <- read.vectors("data/test_for_cassie.bin")

Here is an image of the model and some exploration into it.

In [None]:
w2vModel %>% plot(perplexity=10)

In [None]:
w2vModel %>% closest_to("girl", 30) %>% View()

In [None]:
w2vModel %>% closest_to(~"girl"+"woman"+"girls"+"women", 20) %>% View()

## Clusters

My thoughts on how these clusters show interesting patterns of foreign policy.

In [None]:
centers <- 150
clustering <- kmeans(w2vModel,centers=centers,iter.max = 40)

sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

In [None]:
# Change "name_of_your_query" to a descriptive name that you want to give to your export file.
w2vExport <-sapply(sample(1:centers,150),function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})

write.csv(file="output/name_of_your_query.csv", x=w2vExport)
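To make the clustering step easier to follow, here is a minimal sketch of the same pattern on a toy matrix, using only base R. The words and their two-dimensional vectors are invented for illustration; a real model has hundreds of dimensions and thousands of words.

```r
set.seed(42)  # kmeans() starts from random centers, so fix the seed for repeatability

# Hypothetical 2-dimensional "embeddings", one row per word
toy_vectors <- rbind(
  ambassador = c(0.90, 0.10),
  embassy    = c(0.80, 0.20),
  consul     = c(0.85, 0.15),
  oil        = c(0.10, 0.90),
  petroleum  = c(0.20, 0.80),
  tariff     = c(0.15, 0.85)
)

# nstart = 10 tries several random starts and keeps the best solution
toy_clustering <- kmeans(toy_vectors, centers = 2, nstart = 10)

# Group the words by the cluster each was assigned to
toy_groups <- split(names(toy_clustering$cluster), toy_clustering$cluster)
print(toy_groups)
```

The cluster numbers themselves are arbitrary labels; what matters is which words end up grouped together.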

## Validating the Model

The original code includes a short method for validating the model. I have included five more word pairs that are good representations of the corpus. These were instrumental in determining which iterations of the model were the strongest representations of the corpus. You can find the Latin American validation data [here](https://drive.google.com/file/d/1k75cYaVa8hmUtyoxOsi_0HX_mNmldD4B/view?usp=sharing) and the European validation data [here](https://drive.google.com/file/d/1oFj3rbjzaCtJgTvN0LvwPfVmyxIigJuo/view?usp=sharing).
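The validation scores each word pair by the cosine similarity of its two vectors. As a minimal sketch of what that measure does, the base R example below computes it by hand. The three-dimensional vectors are made up for illustration and do not come from any of the models.

```r
# Cosine similarity: the dot product divided by the product of the vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical 3-dimensional word vectors, for illustration only
oil       <- c(0.9, 0.1, 0.3)
petroleum <- c(0.8, 0.2, 0.4)
night     <- c(-0.2, 0.9, 0.1)

cosine_sim(oil, petroleum)  # close to 1: the vectors point in similar directions
cosine_sim(oil, night)      # much lower: the vectors point in different directions
```

A model in which pairs such as "oil" and "petroleum" score high is treating those words as closely related, which is what the validation pairs are meant to check.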

## Evaluate the Model

You can run this test by pressing `command-return` or `control-return` to run one line at a time, or by clicking the green button in the top right of the code block below.

In [None]:
# Find every trained model (.bin file) under the working directory
files_list <- list.files(pattern = "\\.bin$", recursive = TRUE)

rownames <- c()

data_frame <- data.frame()
data = list(c("away", "off"),
            c("before", "after"),
            c("cause", "effects"),
            c("children", "parents"),
            c("come", "go"),
            c("day", "night"),
            c("first", "second"),
            c("good", "bad"),
            c("last", "first"),
            c("kind", "sort"),
            c("leave", "quit"),
            c("life", "death"),
            c("girl", "boy"),
            c("little", "small"),
            c("oil", "petroleum"),
            c("state", "department"),
            c("confidential", "secret"),
            c("east", "west"),
            c("aid", "assistance"))


data_list = list()

for(fn in files_list) {

  wwp_model <- read.vectors(fn)
  sims <- c()
  for(pairs in data)
  {
    # Look up the vector for each word in the pair
    vector1 <- wwp_model[[pairs[1]]]
    vector2 <- wwp_model[[pairs[2]]]

    # cosineSimilarity() comes from the wordVectors package
    sims <- c(sims, as.numeric(cosineSimilarity(vector1, vector2)))
  }

  # Store one column of similarities per model, named after its file
  f_name <- basename(fn)
  data_list[[f_name]] <- sims

}

for(pairs in data) {
  rownames <- c(rownames, paste(pairs[1], pairs[2], sep="-"))
}

results <- structure(data_list,
                     class     = "data.frame",
                     row.names = rownames
)

write.csv(file="output/model-test-results.csv", x=results)