<div style="font-size:30px" align="center"> <b> Visualizing Word2Vec Models Trained on Biomedical Abstracts in PubMed </b> </div>
<div style="font-size:22px" align="center"> <b> A Comparison of Race and Diversity Over Time </b> </div>
<br>

<div style="font-size:18px" align="center"> <b> Brandon L. Kramer - University of Virginia's Biocomplexity Institute </b> </div>

<br>

This notebook explores two Word2Vec models trained on the PubMed database as of January 2021. Overall, I am interested in testing whether diversity and racial terms are becoming more closely related over time. To do this, I [trained](https://github.com/brandonleekramer/diversity/blob/master/src/04_word_embeddings/03_train_word2vec.ipynb) two models: one on the 1990-1995 data and one on a random sample of the 2015-2020 data. Now, I will visualize the results of these models to see which words are most similar to race/diversity and plot some comparisons of these two terms over time.

For those unfamiliar with Word2Vec, it might be worth reading [this post from Connor Gilroy](https://ccgilroy.github.io/community-discourse/introduction.html), a sociologist who details how word embeddings can help us better understand the concept of "community." The post contains information on how Word2Vec and other word embedding approaches can teach us about word/document similarity, opposite words, and historical changes in words. Basically, Word2Vec turns each word in the corpus into a vector of numbers based on the contexts in which it is used, making all of the words directly comparable to one another within a vector space. The end result is that we are able to compare how similar or different words are or, as we will see below, how similar or different words become over time.
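
To make the idea of comparing words within a vector space concrete, here is a minimal sketch of cosine similarity, the measure that gensim's similarity functions report. The three-dimensional vectors are invented for illustration only; the real models use much longer vectors learned from the abstracts.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", made up for this example
toy_vectors = {
    "racial": [0.9, 0.1, 0.3],
    "ethnic": [0.8, 0.2, 0.4],
    "heterogeneity": [0.1, 0.9, 0.2],
}

sim_close = cosine_similarity(toy_vectors["racial"], toy_vectors["ethnic"])
sim_far = cosine_similarity(toy_vectors["racial"], toy_vectors["heterogeneity"])

# Vectors pointing in similar directions score closer to 1
print("racial vs ethnic:", round(sim_close, 3))
print("racial vs heterogeneity:", round(sim_far, 3))
```

A score near 1 means two words appear in very similar contexts, while a score near 0 means their contexts have little in common.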

#### Import packages and ingest data 

Let's load all of our packages and the `.bin` files that hold our models. 

In [None]:
# load packages
import os
import pandas.io.sql as psql
import pandas as pd
from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import matplotlib.cm as cm

# load data 
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
earlier_model = Word2Vec.load("word2vec_1990_95.bin")
later_model = Word2Vec.load("word2vec_2015_20.bin")

#### Analyzing Most Similar Words 

What words are most similar to "racial" and "diversity"? 

In [None]:
racial_sim_early = earlier_model.wv.most_similar('racial')
print("In the 1990-1995 model, 'racial' is similar to:")
racial_sim_early

In [None]:
racial_sim_later = later_model.wv.most_similar('racial')
print("In the 2015-2020 model, 'racial' is similar to:")
racial_sim_later

In [None]:
diversity_sim_early = earlier_model.wv.most_similar('diversity')
print("In the 1990-1995 model, 'diversity' is similar to:")
diversity_sim_early

In [None]:
diversity_sim_later = later_model.wv.most_similar('diversity')
print("In the 2015-2020 model, 'diversity' is similar to:")
diversity_sim_later

As we can see, "racial" is mostly similar to other racialized and/or gendered terms. "Diversity", on the other hand, is most similar to heterogeneity and a number of terms more generally classified under differences and/or complexity. Because the two neighbor lists barely overlap, they are hard to compare side by side, so let's use the `wv.similarity` function to score each pair of terms directly.

#### Comparing Race and Diversity 

In [23]:
racial_diversity_early = earlier_model.wv.similarity('racial','diversity')
racial_diversity_later = later_model.wv.similarity('racial','diversity')
race_diversity_early = earlier_model.wv.similarity('race','diversity')
race_diversity_later = later_model.wv.similarity('race','diversity')
ethnic_diversity_early = earlier_model.wv.similarity('ethnic','diversity')
ethnic_diversity_later = later_model.wv.similarity('ethnic','diversity')
ethnicity_diversity_early = earlier_model.wv.similarity('ethnicity','diversity')
ethnicity_diversity_later = later_model.wv.similarity('ethnicity','diversity')

print('Comparing racial and diversity:')
print('The 1990-1995 score is:', racial_diversity_early)
print('The 2015-2020 score is:', racial_diversity_later)
print('The overall difference is:', racial_diversity_early - racial_diversity_later)
print('Comparing race and diversity:')
print('The 1990-1995 score is:', race_diversity_early)
print('The 2015-2020 score is:', race_diversity_later)
print('The overall difference is:', race_diversity_early - race_diversity_later)
print('Comparing ethnic and diversity:')
print('The 1990-1995 score is:', ethnic_diversity_early)
print('The 2015-2020 score is:', ethnic_diversity_later)
print('The overall difference is:', ethnic_diversity_early - ethnic_diversity_later)
print('Comparing ethnicity and diversity:')
print('The 1990-1995 score is:', ethnicity_diversity_early)
print('The 2015-2020 score is:', ethnicity_diversity_later)
print('The overall difference is:', ethnicity_diversity_early - ethnicity_diversity_later)

Comparing racial and diversity:
The 1990-1995 score is: 0.22878885
The 2015-2020 score is: 0.19454905
The overall difference is: 0.0342398
Comparing race and diversity:
The 1990-1995 score is: 0.10367876
The 2015-2020 score is: 0.08713666
The overall difference is: 0.0165421
Comparing ethnic and diversity:
The 1990-1995 score is: 0.26469263
The 2015-2020 score is: 0.24086899
The overall difference is: 0.023823649
Comparing ethnicity and diversity:
The 1990-1995 score is: 0.17010239
The 2015-2020 score is: 0.14082599
The overall difference is: 0.0292764


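In each case the score drops slightly between the two periods. Because `wv.similarity` returns a cosine similarity, where higher values mean more similar usage, a lower score suggests that the race and ethnicity terms are becoming slightly less similar to diversity over time, not closer to it.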
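
As a compact way to read the numbers above, the sketch below tabulates the change for each term as the later score minus the earlier score, so that the sign shows the direction of the shift. The scores are copied from the printed output; the `similarity_change` helper and the `scores` dictionary are illustrative names, not part of the original analysis.

```python
# Scores copied from the printed output above: (1990-1995, 2015-2020)
scores = {
    "racial": (0.22878885, 0.19454905),
    "race": (0.10367876, 0.08713666),
    "ethnic": (0.26469263, 0.24086899),
    "ethnicity": (0.17010239, 0.14082599),
}

def similarity_change(early, later):
    """Later minus earlier: a negative value means the later score is lower."""
    return later - early

changes = {term: similarity_change(early, later)
           for term, (early, later) in scores.items()}

for term, delta in changes.items():
    direction = "lower" if delta < 0 else "higher"
    print(f"{term} vs diversity: {delta:+.4f} ({direction} in 2015-2020)")
```

Putting the later period first in the subtraction keeps the sign aligned with the passage of time, which makes the direction of each change easier to read than the earlier-minus-later differences printed above.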

#### Analyzing Analogies 

In [None]:
white_racism = earlier_model.wv.most_similar(positive=['black', 'racism'], negative=['white'], topn=20)
white_racism

In [None]:
black_racist = earlier_model.wv.most_similar(positive=['white', 'racist'], negative=['black'], topn=20)
black_racist

In [None]:
%%capture
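# NOTE: `wv.vocab` is the gensim 3.x API; in gensim 4+ use `wv.index_to_key` instead
earlier_vocab = list(earlier_model.wv.vocab)
# look vectors up through `.wv`, which works in both gensim 3.x and 4.x
earlier_x = earlier_model.wv[earlier_vocab]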
earlier_tsne = TSNE(n_components=2)
earlier_tsne_x = earlier_tsne.fit_transform(earlier_x)
df_earlier = pd.DataFrame(earlier_tsne_x, index=earlier_vocab, columns=['x', 'y'])

keys = ['race', 'racial', 'ethnic', 'ethnicity', 'diverse', 'diversity']

earlier_embedding_clusters = []
earlier_word_clusters = []
for word in keys:
    earlier_embeddings = []
    earlier_words = []
    for similar_word, _ in earlier_model.wv.most_similar(word, topn=30):
        earlier_words.append(similar_word)
        earlier_embeddings.append(earlier_model.wv[similar_word])
    earlier_embedding_clusters.append(earlier_embeddings)
    earlier_word_clusters.append(earlier_words)
    
earlier_embedding_clusters = np.array(earlier_embedding_clusters)
n, m, k = earlier_embedding_clusters.shape
e_tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
e_embeddings_en_2d = np.array(e_tsne_model_en_2d.fit_transform(earlier_embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

In [None]:
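# NOTE: `wv.vocab` is the gensim 3.x API; in gensim 4+ use `wv.index_to_key` instead
later_vocab = list(later_model.wv.vocab)
# look vectors up through `.wv`, which works in both gensim 3.x and 4.x
later_x = later_model.wv[later_vocab]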
later_tsne = TSNE(n_components=2)
later_tsne_x = later_tsne.fit_transform(later_x)
df_later = pd.DataFrame(later_tsne_x, index=later_vocab, columns=['x', 'y'])

later_embedding_clusters = []
later_word_clusters = []
for word in keys:
    later_embeddings = []
    later_words = []
    for similar_word, _ in later_model.wv.most_similar(word, topn=30):
        later_words.append(similar_word)
        later_embeddings.append(later_model.wv[similar_word])
    later_embedding_clusters.append(later_embeddings)
    later_word_clusters.append(later_words)
    
later_embedding_clusters = np.array(later_embedding_clusters)
n, m, k = later_embedding_clusters.shape
l_tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
l_embeddings_en_2d = np.array(l_tsne_model_en_2d.fit_transform(later_embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

In [None]:
def tsne_plot_similar_words(title, labels, earlier_embedding_clusters, earlier_word_clusters, a, filename=None):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, earlier_embeddings, earlier_words, color in zip(labels, earlier_embedding_clusters, earlier_word_clusters, colors):
        x = earlier_embeddings[:, 0]
        y = earlier_embeddings[:, 1]
        # `color=` takes a single RGBA value; `c=` expects one value per point
        plt.scatter(x, y, color=color, alpha=a, label=label)
        for i, word in enumerate(earlier_words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.title(title)
    plt.grid(True)
    if filename:
        plt.savefig(filename, format='png', dpi=150, bbox_inches='tight')
    plt.show()


# the function draws and saves the figure; it returns None, so there is nothing to assign
tsne_plot_similar_words('Comparing the Use of Race, Ethnicity and Diversity (PubMed 1990-1995)', 
                        keys, e_embeddings_en_2d, earlier_word_clusters, 
                        0.7, 'earlier_comparison.png')

In [None]:
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
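# tsne_plot_similar_words() already saved this file. After plt.show() the current
# figure is empty, so saving again here would overwrite the plot with a blank image.
# plt.savefig('earlier_comparison.png')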

In [None]:
def tsne_plot_similar_words(title, labels, later_embedding_clusters, later_word_clusters, a, filename=None):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, later_embeddings, later_words, color in zip(labels, later_embedding_clusters, later_word_clusters, colors):
        x = later_embeddings[:, 0]
        y = later_embeddings[:, 1]
        # `color=` takes a single RGBA value; `c=` expects one value per point
        plt.scatter(x, y, color=color, alpha=a, label=label)
        for i, word in enumerate(later_words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.title(title)
    plt.grid(True)
    if filename:
        plt.savefig(filename, format='png', dpi=150, bbox_inches='tight')
    plt.show()


tsne_plot_similar_words('Comparing the Use of Race, Ethnicity and Diversity (PubMed 2015-2020)', 
                        keys, l_embeddings_en_2d, later_word_clusters, 
                        0.7, 'later_comparison.png')

In [None]:
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
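# tsne_plot_similar_words() already saved this file. After plt.show() the current
# figure is empty, so saving again here would overwrite the plot with a blank image.
# plt.savefig('later_comparison.png')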

In [None]:
df_earlier.to_csv("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/pubmed_earlier.csv")
df_later.to_csv("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/pubmed_later.csv")

#### References

[Connor Gilroy's Tutorial](https://ccgilroy.github.io/community-discourse/word-similarity.html)
[Dominiek Ter Heide's Word2Vec Explorer](https://github.com/dominiek/word2vec-explorer)
[Sergey Smetanin's Medium Tutorial](https://towardsdatascience.com/google-news-and-leo-tolstoy-visualizing-word2vec-word-embeddings-with-t-sne-11558d8bd4d)