# Word Embeddings
- Brigitte Hogan (bwh5v@virginia.edu) & Jason Tiezzi (jbt5am@virginia.edu)  
- DS 5001: Exploratory Text Analytics
- April 2020  

## Overview
This notebook trains a separate word2vec embedding model for each of three time periods (the mid-1800s, the late 1800s, and the early 1900s) so that changes in word usage over time are easier to detect.

## Set Up

In [None]:
# standard library
import os
import re
import math
import datetime

# data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# NLP and dimensionality reduction
import nltk
from gensim.models import word2vec
from sklearn.manifold import TSNE

pd.set_option('display.max_columns', None)
sns.set()


### Setting the OHCO and Bag

In [2]:
OHCO = ['period','book_id', 'vol_num','chap_num', 'recp_num','para_num', 'sent_num', 'token_num']

In [3]:
BAG = OHCO[:5]
BAG

['period', 'book_id', 'vol_num', 'chap_num', 'recp_num']

### Loading Files

In [4]:
file_dir = 'C:/Users/Jason/Documents/Data Science/Spring 2020/Text Analytics/final_project/From_Github_427/'
data_dir = 'Tables/'
os.chdir(file_dir)

In [5]:
TOKEN = pd.read_csv(data_dir + 'TOKEN.csv')
TOKEN.head()

Unnamed: 0,book_id,vol_num,chap_num,recp_num,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
0,9935,1,1,1.0,0,0,0,"('1', 'CD')",CD,1,1
1,9935,1,1,1.0,0,0,1,"('.', '.')",.,.,
2,9935,1,1,1.0,0,1,0,"('Without', 'IN')",IN,Without,without
3,9935,1,1,1.0,0,1,1,"('doubt', 'NN')",NN,doubt,doubt
4,9935,1,1,1.0,0,1,2,"(',', ',')",",",",",


In [6]:
TOKEN.shape

(1130904, 11)

In [7]:
LIB = pd.read_csv(data_dir + 'LIB.csv')
LIB.head()

Unnamed: 0,book_id,author_last,author_full,book_year,book_title,book_file,period
0,9935,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt,1900s
1,9936,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 2",Cookbooks/WIDAS1923_WILCV02_pg9936.txt,1900s
2,9937,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 3",Cookbooks/WIDAS1923_WILCV03_pg9937.txt,1900s
3,9938,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 4",Cookbooks/WIDAS1923_WILCV04_pg9938.txt,1900s
4,9939,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 5",Cookbooks/WIDAS1923_WILCV05_pg9939.txt,1900s


In [8]:
#merging the library table into the token tables
TOKEN = pd.merge(TOKEN,LIB, on='book_id')
TOKEN.shape

(1130904, 17)

In [9]:
TOKEN = TOKEN.set_index(OHCO)
TOKEN.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,pos_tuple,pos,token_str,term_str,author_last,author_full,book_year,book_title,book_file
period,book_id,vol_num,chap_num,recp_num,para_num,sent_num,token_num,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1900s,9935,1,1,1.0,0,0,0,"('1', 'CD')",CD,1,1,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt
1900s,9935,1,1,1.0,0,0,1,"('.', '.')",.,.,,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt
1900s,9935,1,1,1.0,0,1,0,"('Without', 'IN')",IN,Without,without,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt
1900s,9935,1,1,1.0,0,1,1,"('doubt', 'NN')",NN,doubt,doubt,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt
1900s,9935,1,1,1.0,0,1,2,"(',', ',')",",",",",,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt


In [10]:
VOCAB = pd.read_csv(data_dir + 'VOCAB.csv')
VOCAB.head()

Unnamed: 0,term_id,term_str,n,num,stop,stem_porter,stem_snowball,term_rank,term_rank2,p,zipf_k,zipf_k2,zipf_k3,TFIDF_sum_book,TFIDF_sum_recipe,TFIDF_sum_period
0,15108,the,60407,0,1,the,the,1,1,3.598654,60407,60407,3.598654,0.0,14.919713,0.0
1,10502,of,35149,0,1,of,of,2,2,2.093947,70298,70298,4.187895,0.0,13.203503,0.0
2,1546,and,33319,0,1,and,and,3,3,1.984928,99957,99957,5.954784,0.0,7.244055,0.0
3,1062,a,28726,0,1,a,a,4,4,1.711307,114904,114904,6.845228,0.0,12.451291,0.0
4,8071,in,22204,0,1,in,in,5,5,1.322769,111020,111020,6.613845,0.0,9.951277,0.0


## Word Embeddings for Period #1 (mid-1800s)

In [11]:
#getting a tokens table of just the mid1800s
subset = TOKEN.index.get_level_values('period') == 'mid1800s' 
TOKEN1 = TOKEN[subset]
TOKEN1.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,pos_tuple,pos,token_str,term_str,author_last,author_full,book_year,book_title,book_file
period,book_id,vol_num,chap_num,recp_num,para_num,sent_num,token_num,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
mid1800s,28681,4,33,30.0,16,1,108,"('some', 'DT')",DT,some,some,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt
mid1800s,12519,0,11,262.0,1,0,32,"('sugar', 'NN')",NN,sugar,sugar,Randolf,Mary Randolph,1860,The Virginia Housewife,Cookbooks/Randolf1860_VAHousewife_pg12519.txt
mid1800s,28681,3,27,22.0,192,3,1,"(')', ')')",),),,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt
mid1800s,28681,3,28,22.0,14,0,21,"(',', ',')",",",",",,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt
mid1800s,28681,3,26,20.0,55,0,7,"('meat', 'NN')",NN,meat,meat,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt
mid1800s,12519,0,18,418.0,1,1,15,"('mixed', 'JJ')",JJ,mixed,mixed,Randolf,Mary Randolph,1860,The Virginia Housewife,Cookbooks/Randolf1860_VAHousewife_pg12519.txt
mid1800s,28681,3,24,18.0,114,0,165,"(',', ',')",",",",",,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt
mid1800s,12519,0,10,205.0,1,3,28,"('out', 'RP')",RP,out,out,Randolf,Mary Randolph,1860,The Virginia Housewife,Cookbooks/Randolf1860_VAHousewife_pg12519.txt
mid1800s,28681,3,21,15.0,183,1,1,"('.', '.')",.,.,,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt
mid1800s,28681,3,30,25.0,200,0,152,"('to', 'TO')",TO,to,to,Kitchiner,William Kitchiner,1830,The Cook's Oracle; and Housekeeper's Manual,Cookbooks/Kitchiner1830_TCO_pg28681.txt


In [12]:
#removing blank term_strs
TOKEN1 = TOKEN1[~TOKEN1.term_str.isna()] #getting rid of NaN term strings
TOKEN1.term_str.isna().sum()

0

In [13]:
corpus = TOKEN1[~TOKEN1.pos.str.match('NNPS?')]\
    .groupby(BAG)\
    .term_str.apply(lambda  x:  x.tolist())\
    .reset_index()['term_str'].tolist()
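
The cell above drops proper nouns and then groups the remaining terms by `BAG`, so each recipe-level bag becomes one token list, which is the list-of-lists shape Word2Vec takes as input. A minimal pure-Python sketch of the same idea follows; the toy rows, `BAG_WIDTH`, and `build_corpus` are made up for illustration and are not part of the notebook.

```python
from itertools import groupby

# Toy rows: (period, book_id, vol_num, chap_num, recp_num, pos, term_str).
# The values are invented for illustration; the real rows live in TOKEN.
rows = [
    ("mid1800s", 12519, 0, 11, 262.0, "VB", "take"),
    ("mid1800s", 12519, 0, 11, 262.0, "NN", "sugar"),
    ("mid1800s", 12519, 0, 11, 262.0, "NNP", "virginia"),
    ("mid1800s", 12519, 0, 11, 263.0, "VB", "boil"),
    ("mid1800s", 12519, 0, 11, 263.0, "NN", "water"),
]

BAG_WIDTH = 5  # number of leading key fields, mirroring BAG = OHCO[:5]


def build_corpus(rows):
    """Group non-proper-noun terms into one token list per bag key."""
    # Drop NNP and NNPS tags, like ~pos.str.match('NNPS?') in the notebook.
    kept = [r for r in rows if not r[5].startswith("NNP")]
    # itertools.groupby only merges adjacent rows, so sort by the key first.
    # The sort is stable, so token order inside each bag is preserved.
    kept.sort(key=lambda r: r[:BAG_WIDTH])
    return [
        [r[6] for r in group]
        for _, group in groupby(kept, key=lambda r: r[:BAG_WIDTH])
    ]


toy_corpus = build_corpus(rows)
print(toy_corpus)  # [['take', 'sugar'], ['boil', 'water']]
```

Each inner list plays the role of one "sentence" for training, so the context window never crosses from one bag into the next.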

In [14]:
corpus

[['take',
  'four',
  'large',
  'bunches',
  'of',
  'asparagus',
  'scrape',
  'it',
  'nicely',
  'cut',
  'off',
  'one',
  'inch',
  'of',
  'the',
  'tops',
  'and',
  'lay',
  'them',
  'in',
  'water',
  'chop',
  'the',
  'stalks',
  'and',
  'put',
  'them',
  'on',
  'the',
  'fire',
  'with',
  'a',
  'piece',
  'of',
  'bacon',
  'a',
  'large',
  'onion',
  'cut',
  'up',
  'and',
  'pepper',
  'and',
  'salt',
  'add',
  'two',
  'quarts',
  'of',
  'water',
  'boil',
  'them',
  'till',
  'the',
  'stalks',
  'are',
  'quite',
  'soft',
  'then',
  'pulp',
  'them',
  'through',
  'a',
  'sieve',
  'and',
  'strain',
  'the',
  'water',
  'to',
  'it',
  'which',
  'must',
  'be',
  'put',
  'back',
  'in',
  'the',
  'pot',
  'put',
  'into',
  'it',
  'a',
  'chicken',
  'cut',
  'up',
  'with',
  'the',
  'tops',
  'of',
  'asparagus',
  'which',
  'had',
  'been',
  'laid',
  'by',
  'boil',
  'it',
  'until',
  'these',
  'last',
  'articles',
  'are',
  'sufficien

In [15]:
model_1800s = word2vec.Word2Vec(corpus, size=200, window=5, min_count=5, workers=4, seed=2887)
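
The `size` argument used here, and the `model.wv.vocab` attribute used in the next cell, belong to the gensim 3.x API; gensim 4 renamed them to `vector_size` and `wv.key_to_index`. A small sketch of a helper that picks the right keyword for the installed version is shown below. The name `word2vec_kwargs` is hypothetical (it is not part of gensim), and the version string is assumed to come from `gensim.__version__`.

```python
def word2vec_kwargs(gensim_version, dim=200):
    """Return Word2Vec keyword arguments matching this notebook's settings.

    Hypothetical helper: gensim 4 renamed `size` to `vector_size`,
    so the keyword is chosen from the major version number.
    """
    major = int(gensim_version.split(".")[0])
    size_key = "vector_size" if major >= 4 else "size"
    return {size_key: dim, "window": 5, "min_count": 5, "workers": 4, "seed": 2887}


print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.2"))
```

With this in place, the training call could be written as `word2vec.Word2Vec(corpus, **word2vec_kwargs(gensim.__version__))` under either major version.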

In [16]:
coords = pd.DataFrame(index=range(len(model_1800s.wv.vocab)))
coords['label'] = [w for w in model_1800s.wv.vocab]
coords['vector'] = coords['label'].apply(lambda x: model_1800s.wv.get_vector(x))
coords

Unnamed: 0,label,vector
0,take,"[0.110894434, 0.2795687, 0.21189256, 0.2566023..."
1,four,"[-0.21101385, -0.49630365, 0.11883352, 0.26328..."
2,large,"[-0.22206426, -0.33758843, 0.10379771, 0.19037..."
3,bunches,"[-0.030345231, -0.0024995485, 0.025864126, 0.0..."
4,of,"[-0.4067936, -0.5959758, 0.16668107, 0.1585551..."
...,...,...
2847,dumplings,"[-0.014896452, 0.010846127, 0.014518696, 0.005..."
2848,direction,"[-0.040493008, 0.052375216, 0.04544599, 0.0086..."
2849,polish,"[-0.03693674, 0.02992585, 0.026838413, -0.0003..."
2850,b,"[-0.024653722, 0.019357871, 0.027739028, 0.005..."


In [17]:
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
tsne_values = tsne_model.fit_transform(coords['vector'].tolist())

In [18]:
coords['x'] = tsne_values[:,0]
coords['y'] = tsne_values[:,1]

In [19]:
coords.head()

Unnamed: 0,label,vector,x,y
0,take,"[0.110894434, 0.2795687, 0.21189256, 0.2566023...",80.867691,-16.602646
1,four,"[-0.21101385, -0.49630365, 0.11883352, 0.26328...",62.371803,-25.710085
2,large,"[-0.22206426, -0.33758843, 0.10379771, 0.19037...",61.386784,-23.576122
3,bunches,"[-0.030345231, -0.0024995485, 0.025864126, 0.0...",-7.136811,4.855859
4,of,"[-0.4067936, -0.5959758, 0.16668107, 0.1585551...",57.943058,-26.613533


### Visualizations

In [20]:
# merging with the VOCAB table
coords.rename({'label':'term_str'}, axis=1, inplace=True)
coords = pd.merge(coords,VOCAB, on='term_str')
coords.head()

Unnamed: 0,term_str,vector,x,y,term_id,n,num,stop,stem_porter,stem_snowball,term_rank,term_rank2,p,zipf_k,zipf_k2,zipf_k3,TFIDF_sum_book,TFIDF_sum_recipe,TFIDF_sum_period
0,take,"[0.110894434, 0.2795687, 0.21189256, 0.2566023...",80.867691,-16.602646,14903,1336,0,0,take,take,103,101,0.07959,137608,134936,8.038604,0.0,5.208822,0.0
1,four,"[-0.21101385, -0.49630365, 0.11883352, 0.26328...",62.371803,-25.710085,6710,665,0,0,four,four,198,185,0.039616,131670,123025,7.329024,0.000362,4.251444,0.0
2,large,"[-0.22206426, -0.33758843, 0.10379771, 0.19037...",61.386784,-23.576122,8886,1261,0,0,larg,larg,109,106,0.075122,137449,133666,7.962945,0.000661,5.919678,0.0
3,bunches,"[-0.030345231, -0.0024995485, 0.025864126, 0.0...",-7.136811,4.855859,2757,18,0,0,bunch,bunch,3293,614,0.001072,59274,11052,0.658406,0.000208,0.444265,0.0
4,of,"[-0.4067936, -0.5959758, 0.16668107, 0.1585551...",57.943058,-26.613533,10502,35149,0,1,of,of,2,2,2.093947,70298,70298,4.187895,0.0,13.203503,0.0


In [21]:
coords = coords[coords.n >= 100]  # keep only words used at least 100 times
coords = coords[(coords.stop == 0) & (coords.num == 0)]  # remove stop words and numbers

Because each run of Word2Vec slightly alters the layout, even with seeds set (multi-threaded training with `workers=4` is not fully deterministic), it isn't possible to excerpt particular clusters by fixed coordinates. Instead, the simple function below reports where any particular word is located, so you can then dig deeper in that region of the interactive Plotly Express chart:

In [22]:
def find_cords(word):
    value = coords[coords.term_str == word]
    if len(value) > 0:
        value_x = value.x
        value_y = value.y
        print("x coordinate: " + value_x.to_string(index=False))
        print("y coordinate: " + value_y.to_string(index=False))
    else:
        print("Word not Found")
    

In [90]:
find_cords('cayenne')

x coordinate: -12.500719
y coordinate: -10.722942
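
Once a word's coordinates are known, "digging deeper" amounts to looking at which labels sit closest to it in the 2-D layout. A minimal sketch of that lookup follows, assuming the layout is available as a plain dictionary of label to `(x, y)`; the coordinates and the `nearest_in_layout` name are hypothetical, not values from the notebook's t-SNE run.

```python
import math

# Hypothetical 2-D coordinates for illustration only; in the notebook they
# would be read from the term_str, x and y columns of coords.
layout = {
    "cayenne": (-12.5, -10.7),
    "pepper": (-12.0, -11.2),
    "mace": (-13.4, -9.9),
    "milk": (49.8, -29.5),
    "oven": (20.1, 33.0),
}


def nearest_in_layout(word, layout, n=3):
    """Return the n labels closest to `word` by Euclidean distance in 2-D."""
    if word not in layout:
        return []
    origin = layout[word]
    others = [
        (label, math.dist(origin, xy))
        for label, xy in layout.items()
        if label != word
    ]
    return sorted(others, key=lambda pair: pair[1])[:n]


for label, distance in nearest_in_layout("cayenne", layout, n=2):
    print(label, round(distance, 3))
```

Distances in a t-SNE plot are only a rough guide, since the projection mainly preserves local neighborhoods, so the model's own `most_similar` remains the better check on any neighbor found this way.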


In [91]:
fig = px.scatter(coords, 'x', 'y', text='term_str', height=400, width=400).update_traces(mode='text')
fig.update_layout(title="Mid-1800s Word Neighbors of Cayenne")


## Running Word Embeddings for Period #2 (late 1800s)

In [25]:
#getting a tokens table of just the late 1800s
subset2 = TOKEN.index.get_level_values('period') == 'late1800s' 
TOKEN2 = TOKEN[subset2]
TOKEN2.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,pos_tuple,pos,token_str,term_str,author_last,author_full,book_year,book_title,book_file
period,book_id,vol_num,chap_num,recp_num,para_num,sent_num,token_num,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
late1800s,61185,0,7,0.0,5,0,13,"('being', 'VBG')",VBG,being,being,Payne,Arthur Gay Payne,1877,Common - Sense Papers on Cookery,Cookbooks/Payne1877_CSPC_pg61185.txt
late1800s,61185,0,16,0.0,19,4,12,"('a', 'DT')",DT,a,a,Payne,Arthur Gay Payne,1877,Common - Sense Papers on Cookery,Cookbooks/Payne1877_CSPC_pg61185.txt
late1800s,54568,2,24,244.0,2,1,3,"('into', 'IN')",IN,into,into,Boland,Mary A. Boland,1893,A Handbook of Invalid Cooking,Cookbooks/Boland1893_Invalid_pg54568.txt
late1800s,53521,0,3,23.0,0,0,0,"('=Plain', 'NN')",NN,=Plain,plain,Murrey,Thomas J. Murrey,1888,Oysters and Fish,Cookbooks/Murrey1888_Fish_pg53521.txt
late1800s,29519,0,4,70.0,1,1,1,"('them', 'PRP')",PRP,them,them,Hooper,Mary Hooper,1892,Nelson's Home Comforts,Cookbooks/Hooper1892_NHC_pg29519.txt
late1800s,53521,0,3,29.0,0,0,5,"('up', 'RP')",RP,up,up,Murrey,Thomas J. Murrey,1888,Oysters and Fish,Cookbooks/Murrey1888_Fish_pg53521.txt
late1800s,54568,2,16,156.0,2,0,33,"('boiling', 'VBG')",VBG,boiling,boiling,Boland,Mary A. Boland,1893,A Handbook of Invalid Cooking,Cookbooks/Boland1893_Invalid_pg54568.txt
late1800s,61185,0,14,0.0,44,5,52,"('an', 'DT')",DT,an,an,Payne,Arthur Gay Payne,1877,Common - Sense Papers on Cookery,Cookbooks/Payne1877_CSPC_pg61185.txt
late1800s,29519,0,3,49.0,1,0,39,"('over', 'IN')",IN,over,over,Hooper,Mary Hooper,1892,Nelson's Home Comforts,Cookbooks/Hooper1892_NHC_pg29519.txt
late1800s,61185,0,14,0.0,15,2,48,"('live', 'VB')",VB,live,live,Payne,Arthur Gay Payne,1877,Common - Sense Papers on Cookery,Cookbooks/Payne1877_CSPC_pg61185.txt


In [26]:
TOKEN2 = TOKEN2[~TOKEN2.term_str.isna()] #getting rid of NaN term strings
TOKEN2.term_str.isna().sum()

0

In [27]:
corpus2 = TOKEN2[~TOKEN2.pos.str.match('NNPS?')]\
    .groupby(BAG)\
    .term_str.apply(lambda  x:  x.tolist())\
    .reset_index()['term_str'].tolist()

In [28]:
corpus2

[['dinner',
  'may',
  'be',
  'pleasant',
  'may',
  'social',
  'tea',
  'but',
  'yet',
  'methinks',
  'the',
  'breakfast',
  'is',
  'best',
  'of',
  'all',
  'the',
  'three',
  'the',
  'importance',
  'of',
  'preparing',
  'a',
  'variety',
  'of',
  'dainty',
  'dishes',
  'for',
  'the',
  'breakfast',
  'table',
  'is',
  'but',
  'lightly',
  'considered',
  'by',
  'many',
  'who',
  'can',
  'afford',
  'luxuries',
  'quite',
  'as',
  'much',
  'as',
  'by',
  'those',
  'who',
  'little',
  'dream',
  'of',
  'the',
  'delightful',
  'palate',
  'pleasing',
  'compounds',
  'made',
  'from',
  'unconsidered',
  'trifles',
  'the',
  'desire',
  'of',
  'the',
  'average',
  'man',
  'is',
  'to',
  'remain',
  'in',
  'bed',
  'until',
  'the',
  'very',
  'last',
  'moment',
  'a',
  'hurried',
  'breakfast',
  'of',
  'food',
  'long',
  'cooked',
  'awaits',
  'the',
  'late',
  'riser',
  'who',
  'will',
  'not',
  'masticate',
  'it',
  'properly',
  'when',
  

In [29]:
model_late1800s = word2vec.Word2Vec(corpus2, size=200, window=5, min_count=5, workers=4, seed=2887)

In [30]:
coords2 = pd.DataFrame(index=range(len(model_late1800s.wv.vocab)))
coords2['label'] = [w for w in model_late1800s.wv.vocab]
coords2['vector'] = coords2['label'].apply(lambda x: model_late1800s.wv.get_vector(x))
tsne_model2 = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=28)
tsne_values2 = tsne_model2.fit_transform(coords2['vector'].tolist())
tsne_values2


array([[ 85.95296  ,   0.9486415],
       [ 95.55147  ,   6.047885 ],
       [ 99.15255  ,   1.0007138],
       ...,
       [-58.865345 , -42.140434 ],
       [-49.40892  , -25.700869 ],
       [-31.341259 , -29.646711 ]], dtype=float32)

In [31]:
coords2['x'] = tsne_values2[:,0]
coords2['y'] = tsne_values2[:,1]
coords2.head()

Unnamed: 0,label,vector,x,y
0,dinner,"[-0.011866427, 0.34549123, 0.23151517, -0.0491...",85.952957,0.948641
1,may,"[-0.19179231, 0.6514762, -0.034354717, -0.0734...",95.551468,6.047885
2,be,"[-0.013019375, 0.6702347, 0.57766366, -0.12187...",99.15255,1.000714
3,pleasant,"[0.010603945, 0.051528383, 0.046345446, -0.001...",-27.659208,15.333883
4,tea,"[0.19054641, 0.045194928, 0.18603972, 0.096796...",52.101986,-20.208649


### Visualizations

In [32]:
# merging with the VOCAB table
coords2.rename({'label':'term_str'}, axis=1, inplace=True)
coords2 = pd.merge(coords2,VOCAB, on='term_str')
coords2.head()

Unnamed: 0,term_str,vector,x,y,term_id,n,num,stop,stem_porter,stem_snowball,term_rank,term_rank2,p,zipf_k,zipf_k2,zipf_k3,TFIDF_sum_book,TFIDF_sum_recipe,TFIDF_sum_period
0,dinner,"[-0.011866427, 0.34549123, 0.23151517, -0.0491...",85.952957,0.948641,4967,382,0,0,dinner,dinner,349,300,0.022757,133318,114600,6.827118,0.000432,2.780056,0.0
1,may,"[-0.19179231, 0.6514762, -0.034354717, -0.0734...",95.551468,6.047885,9575,4059,0,0,may,may,25,25,0.241809,101475,101475,6.045216,0.001478,11.286535,0.0
2,be,"[-0.013019375, 0.6702347, 0.57766366, -0.12187...",99.15255,1.000714,2143,13014,0,1,be,be,9,9,0.775289,117126,117126,6.9776,0.0,16.731512,0.0
3,pleasant,"[0.010603945, 0.051528383, 0.046345446, -0.001...",-27.659208,15.333883,11477,44,0,0,pleasant,pleasant,1905,588,0.002621,83820,25872,1.541284,0.000224,0.518346,0.0
4,tea,"[0.19054641, 0.045194928, 0.18603972, 0.096796...",52.101986,-20.208649,14985,593,0,0,tea,tea,221,204,0.035327,131053,120972,7.20672,0.000788,5.685245,0.0


In [33]:
coords2 = coords2[coords2.n >= 100]  # keep only words used at least 100 times
coords2 = coords2[(coords2.stop == 0) & (coords2.num == 0)]  # remove stop words and numbers

In [34]:
def find_cords2(word):
    value = coords2[coords2.term_str == word]
    if len(value) > 0:
        value_x = value.x
        value_y = value.y
        print("x coordinate: " + value_x.to_string(index=False))
        print("y coordinate: " + value_y.to_string(index=False))
    else:
        print("Word not Found")

As with the mid-1800s model, the layout changes slightly from run to run, so fixed clusters can't be excerpted. Use `find_cords2` to locate a particular word, then dig deeper around it in the interactive Plotly Express chart:

In [35]:
find_cords2('milk')

x coordinate:  49.776691
y coordinate: -29.47077


In [36]:
px.scatter(coords2, 'x', 'y', text='term_str', height=1000, width=1000).update_traces(mode='text')

## Running Word Embeddings for Period #3 (early 1900s)

In [37]:
#getting a tokens table of just the early 1900s
subset3 = TOKEN.index.get_level_values('period') == '1900s' 
TOKEN3 = TOKEN[subset3]
TOKEN3.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,pos_tuple,pos,token_str,term_str,author_last,author_full,book_year,book_title,book_file
period,book_id,vol_num,chap_num,recp_num,para_num,sent_num,token_num,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1900s,9936,2,27,77.0,1,4,21,"('of', 'IN')",IN,of,of,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 2",Cookbooks/WIDAS1923_WILCV02_pg9936.txt
1900s,9939,5,215,564.0,4,0,10,"('and', 'CC')",CC,and,and,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 5",Cookbooks/WIDAS1923_WILCV05_pg9939.txt
1900s,9937,2,48,83.0,0,1,22,"('been', 'VBN')",VBN,been,been,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 3",Cookbooks/WIDAS1923_WILCV03_pg9937.txt
1900s,9935,3,97,197.0,0,1,11,"('have', 'VBP')",VBP,have,have,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 1",Cookbooks/WIDAS1923_WILCV01_pg9935.txt
1900s,19077,1,2,1.0,1,3,13,"('expressed', 'VBN')",VBN,expressed,expressed,Hill,Janet McKenzie Hill,1909,"Salads, Sandwiches and Chafing - Dish Dainties",Cookbooks/Hill1909_SSCDD_pg19077.txt
1900s,9939,2,87,204.0,0,3,14,"('she', 'PRP')",PRP,she,she,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 5",Cookbooks/WIDAS1923_WILCV05_pg9939.txt
1900s,32472,0,2,0.0,70,0,2,"('-', ':')",:,-,,Lusk,Graham Lusk,1918,Food in War Time,Cookbooks/Lusk1918_War_pg32472.txt
1900s,9936,1,26,66.0,3,0,7,"('in', 'IN')",IN,in,in,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 2",Cookbooks/WIDAS1923_WILCV02_pg9936.txt
1900s,9937,1,2,2.0,0,5,7,"('as', 'IN')",IN,as,as,WIDAS,Woman's Institute of Domestic Arts and Sciences,1923,"Woman's Institute Library of Cookery, Vol. 3",Cookbooks/WIDAS1923_WILCV03_pg9937.txt
1900s,15464,1,1,1.0,21,3,9,"('bread', 'NN')",NN,bread,bread,Goudiss,Alice Bradley,1918,Foods That Will Win The War And How To Cook,Cookbooks/Goudiss1918_War_pg15464.txt


In [38]:
#removing blank term_strs
TOKEN3 = TOKEN3[~TOKEN3.term_str.isna()] #getting rid of NaN term strings
TOKEN3.term_str.isna().sum()

0

In [39]:
corpus3 = TOKEN3[~TOKEN3.pos.str.match('NNPS?')]\
    .groupby(BAG)\
    .term_str.apply(lambda  x:  x.tolist())\
    .reset_index()['term_str'].tolist()

In [40]:
corpus3

[['1',
  'without',
  'doubt',
  'the',
  'greatest',
  'problem',
  'confronting',
  'the',
  'human',
  'race',
  'is',
  'that',
  'of',
  'food',
  'in',
  'order',
  'to',
  'exist',
  'every',
  'person',
  'must',
  'eat',
  'but',
  'eating',
  'simply',
  'to',
  'keep',
  'life',
  'in',
  'the',
  'body',
  'is',
  'not',
  'enough',
  'aside',
  'from',
  'this',
  'the',
  'body',
  'must',
  'be',
  'supplied',
  'with',
  'an',
  'ample',
  'amount',
  'of',
  'energy',
  'to',
  'carry',
  'on',
  'each',
  'day',
  's',
  'work',
  'as',
  'well',
  'as',
  'with',
  'the',
  'material',
  'needed',
  'for',
  'its',
  'growth',
  'repair',
  'and',
  'working',
  'power',
  'to',
  'meet',
  'these',
  'requirements',
  'of',
  'the',
  'human',
  'body',
  'there',
  'is',
  'nothing',
  'to',
  'take',
  'the',
  'place',
  'of',
  'food',
  'not',
  'merely',
  'any',
  'kind',
  'however',
  'but',
  'the',
  'kind',
  'indeed',
  'so',
  'important',
  'is',
  't

In [41]:
model_1900s = word2vec.Word2Vec(corpus3, size=200, window=5, min_count=5, workers=4, seed=2887)

In [42]:
coords3 = pd.DataFrame(index=range(len(model_1900s.wv.vocab)))
coords3['label'] = [w for w in model_1900s.wv.vocab]
coords3['vector'] = coords3['label'].apply(lambda x: model_1900s.wv.get_vector(x))
tsne_model3 = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=28)
tsne_values3 = tsne_model3.fit_transform(coords3['vector'].tolist())
tsne_values3


array([[ 62.802937 ,  17.725075 ],
       [ -5.2374864,  54.932117 ],
       [-35.327034 ,  16.02762  ],
       ...,
       [-25.746922 ,  -2.2157176],
       [ -2.6858907,  -5.767989 ],
       [ -4.139302 ,  -4.3917384]], dtype=float32)

In [43]:
coords3['x'] = tsne_values3[:,0]
coords3['y'] = tsne_values3[:,1]
coords3.head()

Unnamed: 0,label,vector,x,y
0,1,"[-0.11525625, 0.7747628, 0.7591729, -0.4639023...",62.802937,17.725075
1,without,"[-0.0794528, 0.05570672, 0.032807745, -0.10882...",-5.237486,54.932117
2,doubt,"[0.02626246, 0.051787723, 0.11587783, -0.07828...",-35.327034,16.02762
3,the,"[0.44550192, 0.33612403, 0.93510944, 0.0653333...",23.040428,36.208652
4,greatest,"[0.037300125, 0.122449726, 0.26171458, -0.1388...",-32.63324,47.372467


### Visualizations

In [44]:
coords3.rename({'label':'term_str'}, axis=1, inplace=True)
coords3 = pd.merge(coords3,VOCAB, on='term_str') #merging with the vocab table
coords3.head()

Unnamed: 0,term_str,vector,x,y,term_id,n,num,stop,stem_porter,stem_snowball,term_rank,term_rank2,p,zipf_k,zipf_k2,zipf_k3,TFIDF_sum_book,TFIDF_sum_recipe,TFIDF_sum_period
0,1,"[-0.11525625, 0.7747628, 0.7591729, -0.4639023...",62.802937,17.725075,14,4035,1,0,1,1,26,26,0.240379,104910,104910,6.249851,0.00964,25.807714,0.0
1,without,"[-0.0794528, 0.05570672, 0.032807745, -0.10882...",-5.237486,54.932117,16577,751,0,0,without,without,181,169,0.04474,135931,126919,7.561003,0.000302,3.990442,0.0
2,doubt,"[0.02626246, 0.051787723, 0.11587783, -0.07828...",-35.327034,16.02762,5252,63,0,0,doubt,doubt,1506,569,0.003753,94878,35847,2.13553,0.000224,0.504124,0.0
3,the,"[0.44550192, 0.33612403, 0.93510944, 0.0653333...",23.040428,36.208652,15108,60407,0,1,the,the,1,1,3.598654,60407,60407,3.598654,0.0,14.919713,0.0
4,greatest,"[0.037300125, 0.122449726, 0.26171458, -0.1388...",-32.63324,47.372467,7253,93,0,0,greatest,greatest,1146,539,0.00554,106578,50127,2.986239,0.000371,0.605662,0.0


In [48]:
coords3 = coords3[coords3.n >= 100]  # keep only words used at least 100 times
coords3 = coords3[(coords3.stop == 0) & (coords3.num == 0)]  # remove stop words and numbers

As before, the layout changes slightly from run to run, so fixed clusters can't be excerpted. The function below locates any particular word so you can dig deeper around it in the interactive Plotly Express chart:

In [49]:
def find_cords3(word):
    value = coords3[coords3.term_str == word]
    if len(value) > 0:
        value_x = value.x
        value_y = value.y
        print("x coordinate: " + value_x.to_string(index=False))
        print("y coordinate: " + value_y.to_string(index=False))
    else:
        print("Word not Found")

In [61]:
find_cords3('cayenne')

x coordinate:  52.761322
y coordinate:  19.960247


In [51]:
px.scatter(coords3, 'x', 'y', text='term_str', height=1000, width=1000).update_traces(mode='text')

## Writing to CSV

In [52]:
#coords.to_csv(data_dir + 'embeddings_mid1800s.csv')
#coords2.to_csv(data_dir + 'embeddings_late1800s.csv')
#coords3.to_csv(data_dir + 'embeddings_1900s.csv')

## Placeholder functions to explore and experiment

In [None]:
def complete_analogy(A, B, C, model, n=2):
    try:
        return model.wv.most_similar(positive=[B, C], negative=[A])[0:n]
    except KeyError as e:
        print('Error:', e)
        return None
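
For intuition, `most_similar(positive=[B, C], negative=[A])` ranks words by how close they are to roughly `vec(B) - vec(A) + vec(C)`. The sketch below shows that arithmetic on made-up three-dimensional vectors; `toy_vectors`, `toy_analogy`, and the word "butler" are illustrative assumptions, and gensim's own implementation also normalizes the input vectors, a detail this sketch omits.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(a, b, c, vectors, n=2):
    """Rank words by similarity to vec(b) - vec(a) + vec(c), skipping inputs."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up vectors chosen so the arithmetic is easy to follow by hand.
toy_vectors = {
    "wife": [1.0, 0.0, 1.0],
    "husband": [1.0, 0.0, -1.0],
    "cook": [0.0, 1.0, 1.0],
    "butler": [0.0, 1.0, -1.0],
    "oven": [0.2, 0.9, 0.1],
}

for word, score in toy_analogy("wife", "cook", "husband", toy_vectors):
    print(word, round(score, 3))
```

Here the target vector works out to `[0, 1, -1]`, which coincides with the toy "butler" vector, so that word ranks first.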

In [None]:
complete_analogy('wife', 'cook', 'husband', model_late1800s) #makes an analogy -- remember to select the appropriate model

In [None]:
model_1800s.wv.similarity('example', 'word') #finds the similarity between two words -- again, remember to select the appropriate model
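
For reference, `similarity` returns the cosine similarity between the two word vectors, that is, the cosine of the angle between them. Writing $u$ and $v$ for the two words' vectors (symbols introduced here only for illustration):

$$
\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

Values near 1 mean the two vectors point in nearly the same direction, values near 0 mean they are close to orthogonal, and negative values mean they point in opposing directions.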

In [82]:
model_1900s.wv.most_similar(positive='cayenne', topn=20) #finds the most similar words to another word -- for least similar, do negative='word'

[('teaspoon', 0.9788909554481506),
 ('tsp', 0.9639654159545898),
 ('paprica', 0.9626137018203735),
 ('mace', 0.9595282673835754),
 ('cinnamon', 0.959049642086029),
 ('mustard', 0.9581506252288818),
 ('bicarbonate', 0.9579082727432251),
 ('nutmeg', 0.9572535753250122),
 ('teaspoons', 0.9486986398696899),
 ('molasses', 0.9481988549232483),
 ('curry', 0.9477304220199585),
 ('peppercorns', 0.9419469237327576),
 ('18', 0.9384093284606934),
 ('allspice', 0.9360338449478149),
 ('paprika', 0.9339739084243774),
 ('sifted', 0.9267960786819458),
 ('spices', 0.9266539812088013),
 ('grating', 0.9245128035545349),
 ('granulated', 0.9239494800567627),
 ('ale', 0.9197518825531006)]

### Exploring 
To avoid cluttering the notebook, I condensed my exploration of these models. Below are a few selected results I found interesting.

#### Looking at Similar Words to Teaspoon and Tablespoon in the 1900s

In [53]:
model_1900s.wv.most_similar(positive='teaspoon', topn=10) 

[('tsp', 0.9873462915420532),
 ('cayenne', 0.9788909554481506),
 ('mace', 0.9657526612281799),
 ('teaspoons', 0.9649165868759155),
 ('mustard', 0.9602622985839844),
 ('cinnamon', 0.9567856192588806),
 ('curry', 0.9566049575805664),
 ('bicarbonate', 0.9513323307037354),
 ('18', 0.9475886821746826),
 ('nutmeg', 0.9431237578392029)]

In [54]:
model_1900s.wv.most_similar(positive='tsp', topn=10) 

[('teaspoon', 0.9873462915420532),
 ('teaspoons', 0.977495551109314),
 ('18', 0.9691494703292847),
 ('cinnamon', 0.9676050543785095),
 ('mace', 0.967529296875),
 ('cayenne', 0.9639654159545898),
 ('peppercorns', 0.9588087201118469),
 ('curry', 0.9565569758415222),
 ('mustard', 0.955380380153656),
 ('ginger', 0.953309178352356)]

In [56]:
model_1900s.wv.most_similar(positive='tablespoon', topn=10) 

[('mace', 0.9627312421798706),
 ('cinnamon', 0.9600908756256104),
 ('mustard', 0.9580898284912109),
 ('nutmeg', 0.9517815113067627),
 ('tablespoons', 0.9505784511566162),
 ('ginger', 0.9501645565032959),
 ('cloves', 0.9475411176681519),
 ('tsp', 0.9462791681289673),
 ('bay', 0.9459686279296875),
 ('allspice', 0.9450176954269409)]

#### Looking at Similar Words to Teaspoon and Tablespoon in the late 1800s

In [68]:
model_late1800s.wv.most_similar(positive='tablespoon', topn=10) 

[('curry', 0.9996558427810669),
 ('sifted', 0.9992803335189819),
 ('saffron', 0.9990987777709961),
 ('wine', 0.9988948702812195),
 ('vanilla', 0.9988189935684204),
 ('sage', 0.9986270666122437),
 ('peppers', 0.9985697269439697),
 ('couple', 0.998537540435791),
 ('rind', 0.9985110759735107),
 ('turnip', 0.998396098613739)]

In [64]:
model_late1800s.wv.most_similar(positive='teaspoon', topn=10) 

[('tablespoon', 0.999019980430603),
 ('tablespoons', 0.9982521533966064),
 ('cream', 0.9969624876976013),
 ('teaspoonful', 0.9960319995880127),
 ('pinch', 0.9938425421714783),
 ('sugar', 0.993497908115387),
 ('juice', 0.9925577044487),
 ('tablespoonful', 0.9924266338348389),
 ('milk', 0.9922417998313904),
 ('½', 0.9921829700469971)]

There do appear to be differences between the early 1900s and the late 1800s: teaspoon and tablespoon are more associated with liquids (e.g., juice, milk, cream) in the late 1800s and with spices in the early 1900s. It should be noted, though, that all the similarity scores are very high (above 0.99 for the late 1800s model), so the rankings are separated by small margins.
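
One rough way to put a number on this difference is the Jaccard overlap between the two top-10 neighbor lists. The sketch below copies the neighbor words from the outputs above; the `neighbor_overlap` helper is a hypothetical name, and the measure is only one of several that could be used here.

```python
# Top-10 neighbors copied from the most_similar outputs shown above.
teaspoon_1900s = ["tsp", "cayenne", "mace", "teaspoons", "mustard",
                  "cinnamon", "curry", "bicarbonate", "18", "nutmeg"]
teaspoon_late1800s = ["tablespoon", "tablespoons", "cream", "teaspoonful",
                      "pinch", "sugar", "juice", "tablespoonful", "milk", "½"]

tablespoon_1900s = ["mace", "cinnamon", "mustard", "nutmeg", "tablespoons",
                    "ginger", "cloves", "tsp", "bay", "allspice"]
tablespoon_late1800s = ["curry", "sifted", "saffron", "wine", "vanilla",
                        "sage", "peppers", "couple", "rind", "turnip"]


def neighbor_overlap(first, second):
    """Jaccard overlap: shared words divided by all distinct words."""
    set_first, set_second = set(first), set(second)
    union = set_first | set_second
    return len(set_first & set_second) / len(union) if union else 0.0


print(neighbor_overlap(teaspoon_1900s, teaspoon_late1800s))      # 0.0
print(neighbor_overlap(tablespoon_1900s, tablespoon_late1800s))  # 0.0
```

Neither pair of lists shares a single word, which is consistent with the shift described above, although with scores this tightly packed a larger `topn` would give a steadier comparison.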

### Drilling down into the relationship between spices and measurement terms

In [74]:
#So in light of the above, what were spices associated with in the late 1800s?

model_late1800s.wv.most_similar(positive='cayenne', topn=10) 

[('white', 0.9975549578666687),
 ('chopped', 0.9961152672767639),
 ('vinegar', 0.9955838918685913),
 ('saltspoonful', 0.9945665001869202),
 ('juice', 0.9944452047348022),
 ('cream', 0.9941798448562622),
 ('bit', 0.9940236806869507),
 ('parsley', 0.993755578994751),
 ('onion', 0.9920457601547241),
 ('½', 0.9913073778152466)]

In [75]:
model_late1800s.wv.most_similar(positive='cinnamon', topn=10) 

[('powder', 0.999305009841919),
 ('cloves', 0.9991910457611084),
 ('celery', 0.9991387128829956),
 ('grated', 0.9991214275360107),
 ('finely', 0.999059796333313),
 ('teaspoonfuls', 0.9989877939224243),
 ('minced', 0.9989538192749023),
 ('fourth', 0.9987632036209106),
 ('glassful', 0.9985424280166626),
 ('ham', 0.9985405802726746)]

In [78]:
model_late1800s.wv.most_similar(positive='nutmeg', topn=10) 

[('curry', 0.9996558427810669),
 ('sifted', 0.9992803335189819),
 ('saffron', 0.9990987777709961),
 ('wine', 0.9988948702812195),
 ('vanilla', 0.9988189935684204),
 ('sage', 0.9986270666122437),
 ('peppers', 0.9985697269439697),
 ('couple', 0.998537540435791),
 ('rind', 0.9985110759735107),
 ('turnip', 0.998396098613739)]

The spices still appear in the late 1800s corpus; perhaps they were just measured in different ways? For example, who knew saltspoons were a thing?