# Word Vector Analysis
March 05, 2021

## Notebook Introduction

This notebook trains a Gensim Word2Vec model on 266 issues of the journal *Stone: An Illustrated Magazine.* It then queries the model with a series of keywords to find which words were used in similar contexts to those keywords within the corpus.

In addition to the full-corpus model, I trained two more models on sub-corpora to see whether I could identify changes over time. The first sub-corpus consists of all issues from 1888 through 1910; the second includes all issues after 1910. Because of missing issues in the 1890s and 1900s, both sub-corpora contain approximately the same number of texts (the post-1910 corpus has only two more issues).

In [None]:
import os
import re
from glob import iglob

import gensim
import matplotlib
import matplotlib.pyplot as plt
import nltk
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from scipy.spatial.distance import cosine
from sklearn.manifold import MDS, TSNE
from sklearn.metrics import pairwise

# %matplotlib inline is used here in place of the discouraged %pylab inline magic
%matplotlib inline
matplotlib.style.use('ggplot')

## Model Creation

### Full Text Model
Create a model using the entire corpus.

In [None]:
data_folder_path = os.path.join(os.getcwd(), "data")

issue_list = []

for filename in iglob(os.path.join(data_folder_path, '*.txt')):
    
    with open(filename) as file_in:
        this_issue = file_in.read()
    
    # Add text as single string to master list
    issue_list.append(this_issue)

In [None]:
issue_list[0][8600:9000] #Testing to see if it is working

In [None]:
len(issue_list) #Here I am verifying it is picking up all the issues

### Pre-processing


In [None]:
#making the text in all the issues lower case
issues_lower = []
for issue in issue_list:
    issues_lower.append(issue.lower())

In [None]:
# Rejoin words that were split by end-of-line hyphenation and were showing up as fragments in the model.

replacements = [
    # find -> replace
    ('mar-\nket', "market"),
    ('vein-\ning', 'veining'),
    ("effi-\nciency", "efficiency"),
    ("en-\ngine", "engine"),
    ("acci-\ndent", "accident"),
    ("explo-\nsives", "explosives"),
    ("econ-\nomy", "economy"),
    ("regu-\nlations", "regulations")
    ]

issues_cleaned = []

for issue in issues_lower:
    for rep, new in replacements:
        issue = issue.replace(rep, new)
    issues_cleaned.append(issue)
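
The hand-built list above only repairs the breaks that were noticed in the model output. As a rough alternative, assuming every broken word appears as a hyphen immediately followed by a newline, a regular expression can rejoin all such breaks in one pass. The name `rejoin_linebreak_hyphens` is hypothetical and not part of the original notebook; this is a sketch, not a drop-in replacement.

```python
import re

# Matches a word fragment, a hyphen, a line break, and the continuing fragment.
LINEBREAK_HYPHEN = re.compile(r"(\w+)-\n(\w+)")

def rejoin_linebreak_hyphens(text):
    """Join words split across a line break, e.g. 'mar-' + newline + 'ket' -> 'market'."""
    return LINEBREAK_HYPHEN.sub(r"\1\2", text)

# Made-up sample text for illustration.
sample = "the mar-\nket for granite and the effi-\nciency of the en-\ngine"
print(rejoin_linebreak_hyphens(sample))
```

One trade-off: a genuine compound that happens to break at the end of a line (such as `well-` followed by `known`) would also be fused into a single token, which the explicit replacement list avoids.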
                          
    

In [None]:
issues_cleaned[0][8600:9000] #Testing to see if it is working

In [None]:
#Splitting each issue into sentences using NLTK's "sent_tokenize" function.
sentences = [sentence for issue in issues_cleaned for sentence in sent_tokenize(issue)]
sentences[0]

In [None]:
# Custom Tokenizer to prepare text for processing by Word2Vec model

def fast_tokenize(text):
    """
    A version of this function was written by Dr. Laura Nelson and provided to her
    "Analyzing Complex Digitized Data" class in Fall 2020 for easy text pre-processing.
    It takes a sentence, removes punctuation, and then splits the sentence into a list of words.
    
    Input: text string
    Output: list of words in string processed to remove punctuation
    """
    
    # Get a list of punctuation marks
    from string import punctuation
    
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in text if char not in punctuation])
    
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens
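
To make the tokenizer's behavior concrete, here is a small sketch of the same idea using `str.translate`, which does the punctuation removal in a single call. The function name `tokenize_with_translate` and the sample sentence are made up for illustration.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", punctuation)

def tokenize_with_translate(text):
    """Remove ASCII punctuation, then split on whitespace."""
    return text.translate(_STRIP_PUNCT).split()

sample = "the quarry-men's safety, first!"
print(tokenize_with_translate(sample))
```

Note that hyphens and apostrophes are deleted rather than replaced by spaces, so `quarry-men's` becomes the single token `quarrymens`. `string.punctuation` covers ASCII only, so characters such as curly quotes would pass through unchanged.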

In [None]:
words_by_sentence = [fast_tokenize(sentence) for sentence in sentences]

In [None]:
words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]

In [None]:
#Test to see if it is working by asking for a random sentence
words_by_sentence[700]

### Training the model
I played around with the values in the model in my investigations, but ultimately used default values as there was little difference between my results.

Model value meanings (from Dr. Laura Nelson's Word2Vec class tutorial for "Analyzing Complex Digitized Data," Fall 2020):

- `size`: Number of dimensions for the word embedding model
- `window`: Number of context words to observe in each direction
- `min_count`: Minimum frequency for a word to be included in the model
- `sg`: Whether the model is "Skip-Gram" or "Continuous Bag of Words"; `1` indicates Skip-Gram
- `alpha`: Learning rate (initial); prevents the model from over-correcting and enables finer tuning
- `iter`: Number of passes through the dataset
- `batch_words`: Target size, in words, of the batches of examples passed to worker threads
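
To illustrate what the window setting means for a Skip-Gram model, the sketch below lists the (target, context) pairs that a fixed window would produce from one tokenized sentence. It is a simplification written for this notebook: the function name `skipgram_pairs` and the sample sentence are hypothetical, and Gensim's actual training adds refinements that are omitted here.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (target, context) pairs within `window` positions on each side."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

# Made-up sentence; a window of 1 keeps the output short.
sentence = ["safety", "of", "the", "quarry", "workers"]
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```

With `window=5`, as in the model trained below, each word is paired with up to five neighbors on either side within the same sentence.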

I used the same process to create two smaller models: one containing all issues through 1910 and one containing all issues from 1911 to 1922. I sorted the text files into two separate folders and then pointed the code at each folder in turn.

In [None]:
# Argument names follow Gensim 3.x; Gensim 4.0+ renames size -> vector_size and iter -> epochs.
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5,
                               min_count=40, sg=1, alpha=0.025, iter=5, batch_words=10000)
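
The two period models were built by sorting the text files into folders by hand. If the year could be read from each filename, the same split could be sketched in code. The filename pattern below is an assumption (the notebook does not show how the files are named), and `split_by_year` is a hypothetical helper.

```python
import re

# Assumes each filename contains a four-digit year beginning with 18 or 19.
YEAR_PATTERN = re.compile(r"(18|19)\d{2}")

def split_by_year(filenames, cutoff=1910):
    """Return (through_cutoff, after_cutoff) lists of filenames."""
    through, after = [], []
    for name in filenames:
        match = YEAR_PATTERN.search(name)
        if match is None:
            continue  # skip files with no recognizable year
        year = int(match.group(0))
        (through if year <= cutoff else after).append(name)
    return through, after

# Made-up filenames for illustration only.
example = ["stone_1888_06.txt", "stone_1910_12.txt", "stone_1911_01.txt", "notes.txt"]
early, late = split_by_year(example)
print(early, late)
```

Files without a recognizable year are skipped silently here; in practice it would be worth logging them so that no issue is dropped unnoticed.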

### Saving Model

In [None]:
# Save current model for later use

model.wv.save_word2vec_format('resources/word2vec.stonejournal-alltext.txt')

### Loading in the models
Here I load the saved models into this Jupyter notebook so that I don't have to rerun the code that creates each one every time I use the notebook.

In [None]:
#Full corpus model
model = gensim.models.KeyedVectors.load_word2vec_format('resources/word2vec.stonejournal-alltext.txt')

#smaller models with pre and post 1910 issues
to1910_model = gensim.models.KeyedVectors.load_word2vec_format('resources/word2vec.stonejournal-to1910.txt')
post1910_model = gensim.models.KeyedVectors.load_word2vec_format('resources/word2vec.stonejournal-post1910.txt')

## Vector-Space Operations - Full Corpus
First I started with a basic investigation of several of my health and safety keywords to see what similar words turned up in different scenarios.

In [None]:
model.most_similar('safety')

In [None]:
# combining safety and safe (a second positional argument would be treated as `negative`)
model.most_similar(positive=['safety', 'safe'])

In [None]:
model.most_similar(positive=['health'], negative=['cost'])
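
As I understand it, a query with positive and negative words builds one query vector by averaging the length-normalized positive vectors and the negated negative vectors, then ranks every other word by cosine similarity to that query. The pure-Python sketch below follows that description on toy data; it is not Gensim's code, the vectors are invented, and `most_similar_sketch` is a hypothetical name.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_sketch(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the mean of +positive and -negative unit vectors."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += sign * x / len(signed)
    query = unit(query)
    # Score every word except the query words themselves.
    scores = {
        w: sum(a * b for a, b in zip(query, unit(v)))
        for w, v in vectors.items()
        if w not in positive and w not in negative
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# Invented two-dimensional vectors, not values from the Stone model.
toy = {
    "health": [1.0, 0.2],
    "cost": [0.1, 1.0],
    "sanitary": [0.9, 0.1],
    "price": [0.2, 0.9],
}
result = most_similar_sketch(toy, positive=["health"], negative=["cost"])
print(result)
```

In this toy space, `sanitary` ranks above `price` because subtracting `cost` pushes the query away from the direction that `price` shares with it.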

### Visualizations

#### Setting up the tokens

An example of a visualization with vocabulary generated from health-related keyword terms.

In [None]:
health_tokens = [token for token,weight in model.most_similar(positive=['health',], topn=50)]

In [None]:
health_tokens[:20] #print out top 20 results 

In [None]:
# Project the word vectors to 2-D with MDS on their pairwise cosine distances
vectors = [model[word] for word in health_tokens]
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

In [None]:
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(health_tokens[i], ((embeddings[i,0], embeddings[i,1])))

## Attempt to Introduce Time into Word Embedding

Because I'm interested in whether language around safety changed over time, I ran some of the same keyword investigations on my two smaller models: one of issues through 1910 (the to1910_model) and one of post-1910 issues (the post1910_model). Each of these models has approximately 20 million "words."

In [None]:
to1910_model.most_similar('safety')

In [None]:
post1910_model.most_similar('safety')

### Visualizations

In [None]:
to1910_health_tokens = [token for token,weight in to1910_model.most_similar(positive=['health',], topn=50)]

In [None]:
to1910_health_tokens[:20] #print top 20 results

In [None]:
post1910_health_tokens = [token for token,weight in post1910_model.most_similar(positive=['health',], topn=50)]

In [None]:
post1910_health_tokens[:20]
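
One simple way to compare the two periods is to look at which neighbors the two lists share and which are unique to each. Because the models were trained separately, their vector spaces are not aligned, so comparing neighbor lists is safer than comparing raw vectors. The sketch below assumes two plain lists of tokens like the ones above; `neighbor_overlap` is a hypothetical helper and the example words are made up.

```python
def neighbor_overlap(tokens_a, tokens_b):
    """Return shared words, words unique to each list, and the Jaccard index."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return {
        "shared": sorted(shared),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        "jaccard": jaccard,
    }

# Made-up neighbor lists, not results from the Stone models.
early = ["sanitary", "comfort", "welfare", "climate"]
late = ["sanitary", "welfare", "hygiene", "compensation"]
overlap = neighbor_overlap(early, late)
print(overlap)
```

In the notebook, the call would be `neighbor_overlap(to1910_health_tokens, post1910_health_tokens)`.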

In [None]:
# Use the to1910 model's own vectors for its neighbor list
vectors = [to1910_model[word] for word in to1910_health_tokens]
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

In [None]:
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(to1910_health_tokens[i], ((embeddings[i,0], embeddings[i,1])))

In [None]:
# Use the post1910 model's own vectors for its neighbor list
vectors = [post1910_model[word] for word in post1910_health_tokens]
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

In [None]:
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(post1910_health_tokens[i], ((embeddings[i,0], embeddings[i,1])))