Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
1 contributor

Users who have contributed to this file

229 lines (154 sloc) 10.3 KB
---
title: "Word2Vec introduction"
author: "Ben Schmidt"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Intro
This vignette walks you through training a word2vec model, and using that model to search for similarities, to build clusters, and to visualize vocabulary relationships of that model in two dimensions. If you are working with pre-trained vectors, you might want to jump straight to the "exploration" vignette; it is a little slower-paced, but doesn't show off quite so many features of the package.
# Package installation
If you have not installed this package, paste the below. More detailed installation instructions are at the end of the [package README](https://github.com/bmschmidt/wordVectors).
```{r}
if (!require(wordVectors)) {
if (!(require(devtools))) {
install.packages("devtools")
}
devtools::install_github("bmschmidt/wordVectors")
}
```
# Building test data
We begin by importing the `wordVectors` package and the `magrittr` package, because its pipe operator makes it easier to work with data.
```{r}
library(wordVectors)
library(magrittr)
```
First we build up a test file to train on.
As an example, we'll use a collection of cookbooks from Michigan State University.
This has to download from the Internet if it doesn't already exist.
```{r}
if (!file.exists("cookbooks.zip")) {
download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip")
}
unzip("cookbooks.zip",exdir="cookbooks")
```
Then we *prepare* a single file for word2vec to read in. This does a couple things:
1. Creates a single text file with the contents of every file in the original document;
2. Uses the `tokenizers` package to clean and lowercase the original text,
3. If `bundle_ngrams` is greater than 1, joins together common bigrams into a single word. For example, "olive oil" may be joined together into "olive_oil" wherever it occurs.
You can also do this in another language: particularly for large files, that will be **much** faster. (For reference: in a console, `perl -ne 's/[^A-Za-z_0-9 \n]/ /g; print lc $_;' cookbooks/*.txt > cookbooks.txt` will do much the same thing on ASCII text in a couple seconds.) If you do this and want to bundle ngrams, you'll then need to call `word2phrase("cookbooks.txt","cookbook_bigrams.txt",...)` to build up the bigrams; call it twice if you want 3-grams, and so forth.
```{r}
if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2)
```
To train a word2vec model, use the function `train_word2vec`. This actually builds up the model. It uses an on-disk file as an intermediary and then reads that file into memory.
```{r}
if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin")
```
A few notes:
1. The `vectors` parameter is the dimensionality of the representation. More vectors usually means more precision, but also more random error and slower operations. Likely choices are probably in the range 100-500.
2. The `threads` parameter is the number of processors to use on your computer. On a modern laptop, the fastest results will probably be between 2 and 8 threads, depending on the number of cores.
3. `iter` is how many times to read through the corpus. With fewer than 100 books, it can greatly help to increase the number of passes; if you're working with billions of words, it probably matters less. One danger of too low a number of iterations is that words that aren't closely related will seem to be closer than they are.
4. Training can take a while. On my laptop, it takes a few minutes to train these cookbooks; larger models take proportionally more time. Because of the importance of more iterations to reducing noise, don't be afraid to set things up to require a lot of training time (as much as a day!)
5. One of the best things about the word2vec algorithm is that it *does* work on extremely large corpora in linear time.
6. In RStudio I've noticed that this sometimes appears to hang after a while; the percentage bar stops updating. If you check system activity it actually is still running, and will complete.
7. If at any point you want to *read in* a previously trained model, you can do so by typing `model = read.vectors("cookbook_vectors.bin")`.
Now we have a model in memory, trained on about 10 million words from 77 cookbooks. What can it tell us about food?
## Similarity searches
Well, you can run some basic operations to find the nearest elements:
```{r}
model %>% closest_to("fish")
```
With that list, you can expand out further to search for multiple words:
```{r}
model %>%
closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50)
```
Now we have a pretty expansive list of potential fish-related words from old cookbooks. This can be useful for a few different things:
1. As a list of potential query terms for keyword search.
2. As a batch of words to use as seed to some other text mining operation; for example, you could pull all paragraphs surrounding these to find ways that fish are cooked.
3. As a source for visualization.
Or we can just arrange them somehow. In this case, it doesn't look like much of anything.
```{r}
some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150)
fishy = model[[some_fish$word,average=F]]
plot(fishy,method="pca")
```
## Clustering
We can use standard clustering algorithms, like kmeans, to find groups of terms that fit together. You can think of this as a sort of topic model, although unlike more sophisticated topic modeling algorithms like Latent Direchlet Allocation, each word must be tied to single particular topic.
```{r}
set.seed(10)
centers = 150
clustering = kmeans(model,centers=centers,iter.max = 40)
```
Here are a ten random "topics" produced through this method. Each of the columns are the ten most frequent words in one random cluster.
```{r}
sapply(sample(1:centers,10),function(n) {
names(clustering$cluster[clustering$cluster==n][1:10])
})
```
These can be useful for figuring out, at a glance, what some of the overall common clusters in your corpus are.
Clusters need not be derived at the level of the full model. We can take, for instance,
the 20 words closest to each of four different kinds of words.
```{r}
ingredients = c("madeira","beef","saucepan","carrots")
term_set = lapply(ingredients,
function(ingredient) {
nearest_words = model %>% closest_to(model[[ingredient]],20)
nearest_words$word
}) %>% unlist
subset = model[[term_set,average=F]]
subset %>%
cosineDist(subset) %>%
as.dist %>%
hclust %>%
plot
```
# Visualization
## Relationship planes.
One of the basic strategies you can take is to try to project the high-dimensional space here into a plane you can look at.
For instance, we can take the words "sweet" and "sour," find the twenty words most similar to either of them, and plot those in a sweet-salty plane.
```{r}
tastes = model[[c("sweet","salty"),average=F]]
# model[1:3000,] here restricts to the 3000 most common words in the set.
sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes)
# Filter to the top 20 sweet or salty.
sweet_and_saltiness = sweet_and_saltiness[
rank(-sweet_and_saltiness[,1])<20 |
rank(-sweet_and_saltiness[,2])<20,
]
plot(sweet_and_saltiness,type='n')
text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness))
```
There's no limit to how complicated this can get. For instance, there are really *five* tastes: sweet, salty, bitter, sour, and savory. (Savory is usually called 'umami' nowadays, but that word will not appear in historic cookbooks.)
Rather than use a base matrix of the whole set, we can shrink down to just five dimensions: how similar every word in our set is to each of these five. (I'm using cosine similarity here, so the closer a number is to one, the more similar it is.)
```{r}
tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]]
# model[1:3000,] here restricts to the 3000 most common words in the set.
common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes)
common_similarities_tastes[20:30,]
```
Now we can filter down to the 50 words that are closest to *any* of these (that's what the apply-max function below does), and
use a PCA biplot to look at just 50 words in a flavor plane.
```{r}
high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) < 75,]
high_similarities_to_tastes %>%
prcomp %>%
biplot(main="Fifty words in a\nprojection of flavor space")
```
This tells us a few things. One is that (in some runnings of the model, at least--there is some random chance built in here.) "sweet" and "sour" are closely aligned. Is this a unique feature of American cooking? A relationship that changes over time? These would require more investigation.
Second is that "savory" really is an acting category in these cookbooks, even without the precision of 'umami' as a word to express it. Anchovy, the flavor most closely associated with savoriness, shows up as fairly characteristic of the flavor, along with a variety of herbs.
Finally, words characteristic of meals seem to show up in the upper realms of the file.
# Catchall reduction: TSNE
Last but not least, there is a catchall method built into the library
to visualize a single overall decent plane for viewing the library; TSNE dimensionality reduction.
Just calling "plot" will display the equivalent of a word cloud with individual tokens grouped relatively close to each other based on their proximity in the higher dimensional space.
"Perplexity" is the optimal number of neighbors for each word. By default it's 50; smaller numbers may cause clusters to appear more dramatically at the cost of overall coherence.
```{r}
plot(model,perplexity=50)
```
A few notes on this method:
1. If you don't get local clusters, it is not working. You might need to reduce the perplexity so that clusters are smaller; or you might not have good local similarities.
2. If you're plotting only a small set of words, you're better off trying to plot a `VectorSpaceModel` with `method="pca"`, which locates the points using principal components analysis.
You can’t perform that action at this time.