# Computational Text Analysis: Word Embedding Models

In this notebook, I read in the text files that I prepared in the previous notebook: five corpora of <em>Game of Thrones</em> fanfiction that have been cleaned (the words have been stemmed, all capital letters have been lower-cased, and the stopwords and punctuation have been removed). The fanfics have been split by publication date relative to when the GoT seasons aired. I then create a word embedding model for each corpus so that the results can be compared across time.

In [1]:
#pandas for working with dataframes
import pandas as pd
import numpy as np

#nltk libraries
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

#word2vec models
import gensim
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec as wv

#visualizing
import plotly.express as px

#condenses vector data
import umap
from sklearn.decomposition import PCA

#getting rid of pesky warnings
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\caram\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read in Texts

First I need to read in all the texts. I will be using the "read_txt" function, which reads a .txt file and tokenizes its contents into a list of words. I am using the <strong>clean</strong> version of the GoT corpora, as stopwords are not needed at this point.

In [2]:
def read_txt(filePath):
    '''
    This function reads a file (specifically a text file) and tokenizes that file
    Input: a .txt filepath of a string of words
    Output: a tokenized list of words
    '''
    #the with-block closes the file even if reading fails
    with open(filePath, "r") as file:
        new_string = file.read()
    corpus_token = word_tokenize(new_string)
    return corpus_token

In [3]:
#NOTE: these paths load the *_unclean.txt files, although the text above refers to the cleaned corpora
gots12 = read_txt('data/group_month/got_all_txt/gotS1_2_unclean.txt')
gots34 = read_txt('data/group_month/got_all_txt/gotS3_4_unclean.txt')
gots56 = read_txt('data/group_month/got_all_txt/gotS5_6_unclean.txt')
gots7 = read_txt('data/group_month/got_all_txt/gotS7_unclean.txt')
gots8 = read_txt('data/group_month/got_all_txt/gotS8_unclean.txt')

In [16]:
gotALL = gots12 + gots34 + gots56 + gots7 + gots8

## Word2Vec

gensim's LineSentence class takes a file that holds one sentence per line, with tokens separated by whitespace, and streams it as lists of tokens. Once my files are read in through LineSentence, I can create a word2vec model for each corpus.

Here are the five corpora I'm using from the <em>Game of Thrones</em> fanfics:
- GoT Seasons 1 & 2: all GoT fanfics published up to right before season 3 begins (including those published before season 1 began)
- GoT Seasons 3 & 4: all GoT fanfics published from the beginning of season 3 to right before season 5 begins
- GoT Seasons 5 & 6: all GoT fanfics published from the beginning of season 5 to right before season 7 begins
- GoT Season 7: all GoT fanfics published from the beginning of season 7 to right before season 8 begins
- GoT Season 8: all GoT fanfics published from the beginning of season 8 through Sep 2019, when the texts were collected
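
The splitting itself happened in the previous notebook, so the following is only a minimal sketch of how a publication date could be mapped to one of the five periods above. The premiere dates, the labels, and the helper name "period_for" are my assumptions for illustration, not the code that was actually used.

```python
from bisect import bisect_right
from datetime import date

# Assumed US premiere dates of the seasons that open each later period.
PERIOD_STARTS = [
    date(2013, 3, 31),  # season 3 premiere
    date(2015, 4, 12),  # season 5 premiere
    date(2017, 7, 16),  # season 7 premiere
    date(2019, 4, 14),  # season 8 premiere
]
# Hypothetical labels, one more than the number of boundaries.
PERIOD_LABELS = ["S1_2", "S3_4", "S5_6", "S7", "S8"]


def period_for(published):
    """Return the corpus label for a fanfic published on the given date."""
    # bisect_right counts how many boundaries fall on or before the date,
    # which is exactly the index of the period the date belongs to.
    return PERIOD_LABELS[bisect_right(PERIOD_STARTS, published)]


print(period_for(date(2012, 6, 1)))
print(period_for(date(2017, 7, 16)))
```

A fanfic published on the day a season premieres lands in the period that season opens, which matches the "from the beginning of season N" wording in the list.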

The first step for creating Word2Vec models is using gensim's "LineSentence" class, which reads .txt files into a format that gensim's word embedding model creator can read.
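
To make the expected file format concrete, here is a simplified stand-in for what LineSentence does, written in plain Python. The function name "iter_line_sentences" and the two-line toy corpus are hypothetical, and the real class has extra behavior (for example, it also limits how long a single sentence can be), so treat this as a sketch rather than gensim's implementation.

```python
import os
import tempfile


def iter_line_sentences(path):
    """Yield one list of tokens per non-empty line, split on whitespace."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens


# Hypothetical two-line corpus standing in for one of the cleaned files.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("khal drogo ride dothraki\nwinter come north\n")
    tmp_path = tmp.name

sentences = list(iter_line_sentences(tmp_path))
os.remove(tmp_path)
print(sentences)
```

Each line becomes one "sentence", so how the cleaned files are broken into lines determines what counts as context when the model is trained.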

In [2]:
sent_got12 = LineSentence('data/group_month/got_all_txt/gotS1_2_clean.txt')
sent_got34 = LineSentence('data/group_month/got_all_txt/gotS3_4_clean.txt')
sent_got56 = LineSentence('data/group_month/got_all_txt/gotS5_6_clean.txt')
sent_got7 = LineSentence('data/group_month/got_all_txt/gotS7clean.txt')
sent_got8 = LineSentence('data/group_month/got_all_txt/gotS8clean.txt')

In [17]:
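#LineSentence expects a file path, not a list of tokens, so combine the per-period sentence streams instead
sent_gotALL = [sent for corpus in (sent_got12, sent_got34, sent_got56, sent_got7, sent_got8) for sent in corpus]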

In [26]:
#Seasons 1 & 2
w2v_got12 = wv(sent_got12, window=20, min_count=5, workers=4)

#Seasons 3 & 4
w2v_got34 = wv(sent_got34, window=20, min_count=5, workers=4)

#Seasons 5 & 6
w2v_got56 = wv(sent_got56, window=20, min_count=5, workers=4)

#Season 7
w2v_got7 = wv(sent_got7, window=20, min_count=5, workers=4)

#Season 8
w2v_got8 = wv(sent_got8, window=20, min_count=5, workers=4)
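
The models above all use window=20 and min_count=5. As a rough illustration of what those two settings control, the sketch below drops rare words and lists the (center, context) pairs that fall inside a window. The helper names and the toy sentences are hypothetical, and gensim's real training involves further details, so this only shows the basic idea.

```python
from collections import Counter


def prune_rare(sentences, min_count):
    """Drop every token that occurs fewer than min_count times overall."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]


def context_pairs(sentence, window):
    """Return (center, context) pairs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical stemmed sentences for illustration only.
toy = [["khal", "drogo", "ride", "hors"], ["khal", "drogo", "wed", "dani"]]

kept = prune_rare(toy, min_count=2)
print(kept)

pairs = context_pairs(toy[0], window=1)
print(pairs)
```

With a window of 20, most words in a short line count as context for each other, which tends to pull together words that share a topic rather than words that share a grammatical role.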

In [None]:
#train on the combined sentences rather than the flat season 1 & 2 token list
w2v_gotALL = wv(sent_gotALL, window=20, min_count=5, workers=4)

### Saving Models

So that I do not need to retrain them, I am saving my models to disk.

In [66]:
w2v_got12.save('./models/got_word2vec/got12')
w2v_got34.save('./models/got_word2vec/got34')
w2v_got56.save('./models/got_word2vec/got56')
w2v_got7.save('./models/got_word2vec/got7')
w2v_got8.save('./models/got_word2vec/got8')



### Loading Models


In [3]:
w2v_got12 = wv.load("./models/got_word2vec/got12")
w2v_got34 = wv.load("./models/got_word2vec/got34")
w2v_got56 = wv.load("./models/got_word2vec/got56")
w2v_got7 = wv.load("./models/got_word2vec/got7")
w2v_got8 = wv.load("./models/got_word2vec/got8")

## Exploring Results from Models

Now that I have created and saved the models, I will go through them to find interesting results. The nearest neighbors of GoT-specific words are mostly other GoT-specific words, which makes sense. For example, the results for "dothraki" reflect GoT-specific language around the Dothraki, such as Khal and Khaleesi.

To get around this, I will use some of the vector "equations" that gensim offers, such as subtracting Dothraki from Khal, which in turn may bring up results that are less GoT-specific.
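
To show what such an "equation" does under the hood, here is a small pure-Python sketch of the idea: build a query from the unit vector for khal minus the unit vector for dothraki, then rank the remaining words by cosine similarity to it. The 3-dimensional vectors are made up for illustration (real vectors come from the trained models), and the function is my simplified approximation, not gensim's code. With a trained model the equivalent call would be along the lines of most_similar(positive=['khal'], negative=['dothraki']).

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The input words themselves are excluded from the results.
    skip = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors: first axis ~ "dothraki-ness", second ~ "ruler-ness".
toy_vectors = {
    "khal": [0.9, 0.8, 0.1],
    "dothraki": [0.9, 0.1, 0.1],
    "king": [0.1, 0.9, 0.2],
    "hors": [0.8, 0.1, 0.3],
    "sword": [0.3, 0.2, 0.9],
}

result = most_similar(toy_vectors, positive=["khal"], negative=["dothraki"])
print(result)
```

In this toy setup, removing the Dothraki direction from khal leaves mostly the "ruler" direction, so king ranks first and hors drops to the bottom.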

In [10]:
w2v_got12.wv.most_similar(['dothraki'], topn=20)

[('khalasar', 0.8875465393066406),
 ('khal', 0.8551653027534485),
 ('bloodrid', 0.8520013093948364),
 ('rhaego', 0.80260169506073),
 ('rakharo', 0.7965694665908813),
 ('khaleesi', 0.7860901355743408),
 ('drogo', 0.7815247774124146),
 ('xaro', 0.7618790864944458),
 ('vae', 0.7612751126289368),
 ('dothrak', 0.7565141320228577),
 ('hord', 0.7261720299720764),
 ('kayla', 0.7237821817398071),
 ('qarth', 0.703815758228302),
 ('jorah', 0.6969307661056519),
 ('irri', 0.6960237622261047),
 ('ikko', 0.688761830329895),
 ('arakh', 0.6871383786201477),
 ('aggo', 0.6764395833015442),
 ('meereen', 0.6701558828353882),
 ('ko', 0.6608179807662964)]

In [23]:
w2v_got56.wv.most_similar(['dothraki'], topn=20)

[('khalasar', 0.8306858539581299),
 ('khal', 0.8028331398963928),
 ('horselord', 0.8016074895858765),
 ('charro', 0.7850687503814697),
 ('bloodrid', 0.7560696601867676),
 ('jhogo', 0.7554223537445068),
 ('dothrak', 0.7431109547615051),
 ('vae', 0.7388004064559937),
 ('khalassar', 0.7270313501358032),
 ('aggo', 0.7207942008972168),
 ('rakharo', 0.7187821865081787),
 ('qotho', 0.7158142328262329),
 ('ko', 0.7157270908355713),
 ('lhazar', 0.6985082626342773),
 ('cohollo', 0.686269998550415),
 ('drogo', 0.6822618246078491),
 ('dorthraki', 0.6818742752075195),
 ('lhazareen', 0.6811560392379761),
 ('kha', 0.6733536124229431),
 ('khaleesi', 0.6684357523918152)]

In [27]:
w2v_got8.wv.most_similar(['dothraki'], topn=20)

[('bloodrid', 0.8588789701461792),
 ('khalasar', 0.8302082419395447),
 ('screamer', 0.7820744514465332),
 ('unsulli', 0.7630407214164734),
 ('khal', 0.7473677396774292),
 ('qhono', 0.731020450592041),
 ('horselord', 0.7270990610122681),
 ('hord', 0.7237305641174316),
 ('aggo', 0.7087967395782471),
 ('inmaculado', 0.70172518491745),
 ('pono', 0.6996252536773682),
 ('arakh', 0.6956338882446289),
 ('rakharo', 0.6929371356964111),
 ('kha', 0.6843449473381042),
 ('jhaqo', 0.6838114261627197),
 ('onro', 0.6830466389656067),
 ('vae', 0.6715271472930908),
 ('dothrak', 0.6676204204559326),
 ('moro', 0.6647101044654846),
 ('hosho', 0.6501988172531128)]

In [25]:
w2v_got7.wv.most_similar(['dothraki'], topn=20)

[('horselord', 0.8685603141784668),
 ('bloodrid', 0.8234521150588989),
 ('khalasar', 0.8066673278808594),
 ('screamer', 0.7891376614570618),
 ('kovarro', 0.7872434258460999),
 ('qhono', 0.785574197769165),
 ('aggo', 0.7855238914489746),
 ('unsulli', 0.7792567610740662),
 ('khal', 0.7327853441238403),
 ('moro', 0.7207859754562378),
 ('khalakka', 0.7159173488616943),
 ('rakharo', 0.7121165990829468),
 ('jhogo', 0.7112764120101929),
 ('ko', 0.7053215503692627),
 ('hord', 0.7031327486038208),
 ('kha', 0.6981562972068787),
 ('khaleesi', 0.6972546577453613),
 ('dorthraki', 0.694033145904541),
 ('khalessi', 0.6806465983390808),
 ('qoy', 0.6777968406677246)]
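
One simple way to put a number on how much the neighborhood of "dothraki" shifts across time is to compare the top-20 lists directly. The sketch below copies the season 1 & 2 and season 8 neighbors printed above and computes their overlap (Jaccard similarity); the variable names are my own, and this is just one possible comparison, not an established measure for this project.

```python
# Top-20 neighbors of 'dothraki', copied from the outputs above.
neighbors_s12 = [
    "khalasar", "khal", "bloodrid", "rhaego", "rakharo", "khaleesi", "drogo",
    "xaro", "vae", "dothrak", "hord", "kayla", "qarth", "jorah", "irri",
    "ikko", "arakh", "aggo", "meereen", "ko",
]
neighbors_s8 = [
    "bloodrid", "khalasar", "screamer", "unsulli", "khal", "qhono",
    "horselord", "hord", "aggo", "inmaculado", "pono", "arakh", "rakharo",
    "kha", "jhaqo", "onro", "vae", "dothrak", "moro", "hosho",
]

shared = sorted(set(neighbors_s12) & set(neighbors_s8))
only_s12 = sorted(set(neighbors_s12) - set(neighbors_s8))
only_s8 = sorted(set(neighbors_s8) - set(neighbors_s12))

# Jaccard similarity: shared words divided by all distinct words.
jaccard = len(shared) / len(set(neighbors_s12) | set(neighbors_s8))

print("shared:", shared)
print("only in seasons 1 & 2:", only_s12)
print("only in season 8:", only_s8)
print("jaccard:", round(jaccard, 2))
```

Nine of the twenty neighbors appear in both lists. Names such as drogo, jorah, and irri show up only in the earlier list, while unsulli and horselord show up only in the later one.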

In [48]:
w2v_got8.wv.most_similar(['khaleesi'], topn=20)

[('jorah', 0.707273542881012),
 ('dosh', 0.6951413154602051),
 ('khalessi', 0.6875307559967041),
 ('bloodrid', 0.6803869009017944),
 ('khaleen', 0.6759089827537537),
 ('jhaqo', 0.6588600277900696),
 ('qhono', 0.6587767601013184),
 ('khal', 0.6577175259590149),
 ('irri', 0.6411817669868469),
 ('khalasar', 0.6368321180343628),
 ('ko', 0.6353309154510498),
 ('rakharo', 0.6311588287353516),
 ('jhiqui', 0.6214292645454407),
 ('dothraki', 0.6133835315704346),
 ('dothrak', 0.6102162003517151),
 ('vae', 0.6071327924728394),
 ('aggo', 0.6043835878372192),
 ('mhysa', 0.5972708463668823),
 ('drogo', 0.5906261205673218),
 ('torgo', 0.5880109667778015)]

In [16]:
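#look up the raw vector through model.wv (indexing the model directly is deprecated)
khaleesi_vec = w2v_got8.wv['khaleesi']
khaleesi_vec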

array([-6.244144  , -5.6756334 ,  0.07476219,  3.533827  ,  2.2133505 ,
       -2.1928945 ,  1.4939376 ,  0.49453655,  1.5108477 , -0.08372274,
       -3.6240368 , -4.410201  ,  1.8026439 , -0.02654359, -3.7482915 ,
       -0.90297526,  1.8327967 ,  3.9062798 ,  2.2364933 ,  0.8684295 ,
        2.5081234 ,  0.54380935, -2.6268833 ,  1.3240451 ,  3.6416183 ,
       -2.299615  ,  1.2661232 ,  2.9855976 , -3.457127  ,  1.2866989 ,
       -0.07038362,  2.3689134 , -3.209186  ,  2.1301017 , -2.0172687 ,
        0.15313597, -1.9586468 , -1.4660865 ,  4.3812766 ,  1.2006686 ,
        2.2344925 ,  0.01155869, -0.1917317 , -0.27645957, -0.14072743,
        4.077713  ,  3.377289  , -3.7378435 ,  0.56044537, -0.51589704,
        5.789945  ,  1.6326916 , -0.5742531 ,  0.7769463 ,  1.371531  ,
        1.9110696 ,  1.8181335 , -1.7439198 ,  2.397974  , -4.0086923 ,
        1.7223254 ,  0.08434583,  0.87856364, -0.16006066,  4.7695875 ,
        6.0887833 , -3.4186635 , -1.176097  , -3.814159  , -0.84

## Principal Component Analysis and UMAP

I need to use Principal Component Analysis (PCA), which is a dimensionality reduction algorithm used in unsupervised machine learning. Basically, each word embedding model is a huge matrix in which every word is represented by a vector with many dimensions. PCA condenses that information to make the data 'readable' and easier to visualize: we can take multi-dimensional data and project it down to 2 or 3 dimensions. The function below runs PCA first and then passes the result to UMAP, a second dimensionality reduction technique, to get the final x and y coordinates. Helpful resource about [PCA by Michael Galarnyk](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60).
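
For intuition about what the PCA step does, here is a minimal NumPy sketch that centers the data and projects it onto its two directions of greatest variance. The random matrix is a hypothetical stand-in for a set of word vectors, and "pca_2d" is my own helper. In the notebook itself scikit-learn's PCA does this work, and its output may differ from this sketch in the sign of each axis.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    # Center each dimension so the components describe variance, not offset.
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
# Hypothetical stand-in for 50 word vectors with 100 dimensions each.
toy_vectors = rng.normal(size=(50, 100))

coords = pca_2d(toy_vectors)
print(coords.shape)
```

Each word ends up with just two numbers, and the first column always carries at least as much variance as the second.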


In [4]:
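def PCA_2_UMAP(model):
    '''
    Reduce a word2vec model's vectors to 2 dimensions with PCA, then UMAP
    Input: a trained gensim Word2Vec model
    Output: a dataframe with one row per word and its x/y coordinates
    '''
    #access the vocab through model.wv (indexing the model directly is deprecated);
    #assumes gensim 3.x, since newer versions replace wv.vocab with wv.index_to_key
    words = list(model.wv.vocab)
    vectors = model.wv[words]

    #PCA to reduce vectors
    pca = PCA(n_components=2)
    pca_results = pca.fit_transform(vectors)

    #run UMAP on PCA
    reducer = umap.UMAP()
    embedding = reducer.fit_transform(pca_results)

    #create dataframe from the two UMAP columns
    umap_df = pd.DataFrame({'word': words,
                            'x': embedding[:, 0],
                            'y': embedding[:, 1]})

    return umap_df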

In [5]:
umap_got12 = PCA_2_UMAP(w2v_got12)
umap_got12

| | word | x | y |
|---|---|---|---|
| 0 | wall | -5.779251 | -10.079158 |
| 1 | imposs | -7.631354 | -9.271155 |
| 2 | warm | -7.304817 | -10.470057 |
| 3 | wind | -7.063386 | -10.479785 |
| 4 | blow | -6.463011 | -9.030274 |
| ... | ... | ... | ... |
| 20273 | juliann | 3.085968 | -4.952371 |
| 20274 | parachut | 1.408003 | -9.534025 |
| 20275 | chandler | 5.308085 | -0.717615 |
| 20276 | algeria | 9.248048 | 6.742660 |


In [6]:
umap_got34 = PCA_2_UMAP(w2v_got34)
umap_got34

| | word | x | y |
|---|---|---|---|
| 0 | john | 3.403437 | 1.697630 |
| 1 | seem | 12.092190 | 11.342078 |
| 2 | kind | 13.416574 | 8.470289 |
| 3 | guy | 14.985053 | 5.797641 |
| 4 | like | 10.390861 | 9.591648 |
| ... | ... | ... | ... |
| 56939 | ultron | -4.346008 | -8.488113 |
| 56940 | chimichanga | 2.805781 | -12.135231 |
| 56941 | blood-traitor | -6.097734 | 0.238141 |
| 56942 | t'challa | -3.609701 | 7.567342 |


### Plot the UMAP Results

In [1]:
new_fig = px.scatter(umap_got12, x="x", y="y", hover_data=["word"], text="word", render_mode="svg")

# new_fig.update_traces(textposition="top center")

new_fig.update_layout(title_text="Game of Thrones Word2Vec")
new_fig.show()

#TODO: play with transparency

In [9]:
fig = px.scatter(umap_got34, x="x", y="y", hover_data=["word"], render_mode="svg")

# fig.update_traces(textposition="top center")

fig.update_layout(title_text="Game of Thrones Word2Vec")
fig.show()