# ELMo

The point of this notebook is to associate sentences with vectors so that they can be compared semantically by machine. It is very memory-heavy, so you may need to run it on a cloud machine such as a DigitalOcean droplet.

## Imports:

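Packages needed for this notebook: numpy, pandas, tensorflow, tensorflow_hub, scikit-learn, spacy, ipython, chart_studio. The code uses the TensorFlow 1.x API (`hub.Module`, `tf.Session`), so a 1.x release of tensorflow is assumed.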

In [4]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from sklearn import preprocessing


If the cell below does not work on the first try, restart the kernel and try again.

In [5]:
#!python3 -m spacy download en_core_web_md
import spacy
from spacy.lang.en import English
from spacy import displacy
nlp = spacy.load('en_core_web_md')

In [6]:
from IPython.display import HTML
import logging
logging.getLogger('tensorflow').disabled = True #OPTIONAL - to disable outputs from Tensorflow

## Get the data 

We have two options for doing that.


## Only get entries of certain categories

Here we use the metadata found in m-k-manuscript-data/metadata/entry_metadata.csv to select entries with the proper category.

In [8]:
metadata = pd.read_csv(os.getcwd() + '/../metadata/entry_metadata.csv')
painting_entries = []
varnish_entries = []
armor_entries = []
# Iterate over every row of the table instead of a hard-coded row count.
for i in range(len(metadata)):
    if metadata['categories'][i] == 'painting':
        painting_entries.append(metadata['div_id'][i])
    if metadata['categories'][i] == 'varnish':
        varnish_entries.append(metadata['div_id'][i])
    if metadata['categories'][i] == 'arms and armor':
        armor_entries.append(metadata['div_id'][i])


entries

120 33 48


Unnamed: 0,folio,folio_display,div_id,categories,heading_tc,heading_tcn,heading_tl,al_tc,al_tcn,al_tl,...,it_tl,la_tc,la_tcn,la_tl,oc_tc,oc_tcn,oc_tl,po_tc,po_tcn,po_tl
0,001r,1r,001r_1,lists,[Liste de noms],[Liste de noms],[List of names],,,,...,,,,,,,,,,
1,001r,1r,001r_2,lists,[Liste],[Liste],[List],,,,...,,sacra eleusinæ deæ propalare nefas,sacra eleusinae deae propalare nefas.,sacra eleusinae deae propalare nefas,,,,,,
2,001r,1r,001r_3,lists,[Liste de livres et d'autheurs],[Liste de livres et d'autheurs],[List of books and authors],aucupio,aucupio,aucupio,...,,cum permultis aliis; in aeneidem; thebaidos; m...,cum permultis aliis; mathematicus ingolstadien...,cum permultis aliis; mathematicus ingolstadien...,,,,,,
3,001r,1r,001r_4,lists,[Liste de livres],[Liste de livres],[List of books],,,,...,,aquatilium animalium historiæ hypolito salvian...,"aquatilium animalium historiae, hypolito salvi...","aquatilium animalium historiae, hypolito salvi...",,,,,,
4,001v,1v,001v_1,medicine,Pour lascher le ventre,Pour lascher le ventre,For loosening the belly,poulet,poulet,chicken,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
924,170r,170r,170r_6,casting,Nettoyer moules clos,Nettoyer moules clos,Cleaning closed molds,,,,...,,,,,,,,,,
925,170v,170v,170v_1,manuscript structure,[Première page d'origine (1578–1579)],[Première page d'origine (1578–1579)],[Original first page (1578–1579)],,,,...,,,,,,,,,,
926,170v,170v,170v_2,medicine,Contre peste,Contre peste,Against plague,,,,...,,othonis episcopi frisigensis ab orbe condito,othonis episcopi frisigensis ab orbe condito,othonis episcopi frisigensis ab orbe condito,,,,,,
927,170v,170v,170v_3,medicine,Pour preserver,Pour preserver,For preserving,,,,...,,acetum paratum ex ruta baccis juniperi simul t...,hyeronimus mercurialis variarum; abbatis usper...,"hyeronimus mercurialis, variarum; abbatis ursp...",,,,,,
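
The three explicit `if` tests above can be generalized to group every `div_id` under its category in a single pass. The sketch below uses only the standard library and a four-row excerpt of the table shown above, reduced to the `div_id` and `categories` columns. The name `entries_by_category` is introduced here for illustration and does not appear in the notebook.

```python
import csv
import io
from collections import defaultdict

# Four rows taken from the metadata table above, reduced to two columns.
sample_csv = """div_id,categories
001r_1,lists
001v_1,medicine
170r_6,casting
170v_2,medicine
"""

# Map each category to the list of entry ids that carry it.
entries_by_category = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    entries_by_category[row["categories"]].append(row["div_id"])

print(dict(entries_by_category))
```

Like the `==` comparison in the cell above, this treats the `categories` field as a single label per entry.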


In [9]:
text = ''
# Concatenate the tl text of every painting and armor entry
# (varnish_entries is collected above but not used here).
for entry in painting_entries + armor_entries:
    path = os.getcwd() + '/../entries/txt/tl/tl_' + entry + '.txt'
    # A with-block closes the file automatically; the encoding is explicit.
    with open(path, encoding='utf-8') as fil:
        text = text + fil.read()
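
When `open` is called without an `encoding` argument, Python decodes the file with the platform's default encoding, which is not always UTF-8. The sketch below shows, under the assumption that the entry files are UTF-8, how a typographic apostrophe is garbled when its bytes are decoded as Windows-1252 and how the damage can be reversed. The sample phrase is taken from the manuscript text.

```python
original = "walnuts\u2019 skin"  # contains a right single quotation mark

# Simulate reading UTF-8 bytes with the wrong codec.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reverse the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

Passing `encoding='utf-8'` when opening the files avoids the problem in the first place.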

## Don't discriminate by entry and just get the full text

For that, we take the full untagged txt file in m-k-manuscript-data/allFolios/txt/all_tl.txt. Running this cell overwrites any `text` built from selected categories.

In [7]:
# A with-block closes the file automatically, even if reading fails.
with open(os.getcwd() + '/../allFolios/txt/all_tl.txt', 'r', encoding='utf-8') as fileo:
    text = fileo.read()

## Create sentence embeddings

In [8]:
url = "https://tfhub.dev/google/elmo/3"
path2=os.getcwd()+"/ELMo2"
path3=os.getcwd()+"/ELMo3"  #these were backup plans when the caching of the online model stopped working
embed = hub.Module(url)

In [9]:
# Normalize whitespace and the &amp; entity before sentence segmentation.
text = text.lower().replace('\n', ' ').replace('\t', ' ').replace('\xa0',' ').replace('&amp;','&')
text = ' '.join(text.split())
doc = nlp(text)

counter = 0
sentences = []
for sent in doc.sents:
    # The counter caps the number of sentences so that the data to encode
    # is not too big for your memory; adjust the limit to your needs.
    if len(sent) > 1 and counter < 100:
        # Span.text is used because Span.string was removed in spaCy 3.
        sentences.append(sent.text.strip())
        counter += 1

len(sentences)

100

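Here we inspect a few sentences to make sure that they were cut properly. The output below apparently comes from a run with a higher sentence limit: with the limit of 100 set above, `sentences[110:120]` returns an empty list.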

In [8]:
sentences[110:120]

['venice masks are made with the hollow & the male face of copper.',
 'the flemish do not use any whites for flesh colors in oil other than lead white because the ceruse turns yellow.',
 '4 or 5 year-old walnut oil which is clear is the best color, it keeps off dust.',
 'the kind which has recently been drawn with the press in the manner of almond oil is white, especially if the walnuts’ skin is removed.',
 'one needs to make at least three layers of flesh color to accomplish faces in oil.',
 'and at the beginning, one puts the black and umber where it is appropriate.',
 'next, the heightening with lead white must not be put on the black.',
 'flesh colors, and where the ceruse enters will yellow in five or six months, but lead white does not change.',
 'florence lake is better than that from flanders for in florence the best dyes are made.',
 'to make a beautiful flesh color, the reddest & liveliest lake is the best, for the kind that contains purple & violet, by admixture of too muc

In [10]:
embeddings = embed(
    sentences,
    signature="default",
    as_dict=True)["default"]

The cell below is the one that requires a lot of memory.

In [11]:
%%time
gpu_options = tf.GPUOptions(allow_growth=True) 
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  x = sess.run(embeddings)

(100, 1024)
Wall time: 26.5 s
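
One way to keep peak memory down, instead of capping the number of sentences, is to encode the sentences in fixed-size batches and collect the results. The sketch below is not part of the notebook: `batches` and `embed_in_batches` are hypothetical helpers, and `encode` stands for any function that maps a list of sentences to a list of vectors (for example, a wrapper around the session call above). A toy `encode` is used here so that the block runs without TensorFlow.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(sentences, encode, batch_size=32):
    """Encode `sentences` batch by batch and return one flat list of vectors."""
    vectors = []
    for chunk in batches(sentences, batch_size):
        vectors.extend(encode(chunk))
    return vectors


# Toy stand-in for the real encoder: one 1-dimensional "vector" per sentence.
def toy_encode(chunk):
    return [[float(len(sentence))] for sentence in chunk]


sample = ["lead white", "walnut oil", "florence lake", "umber", "ceruse"]
vectors = embed_in_batches(sample, toy_encode, batch_size=2)
print(len(vectors))
```

The batch size is a trade-off: smaller batches use less memory per call but require more calls.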


## Visualize the sentences using PCA and TSNE

In [16]:
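from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# First reduce the 1024-dimensional embeddings to 50 dimensions with PCA,
# then project those onto 2 dimensions with t-SNE for plotting.
pca = PCA(n_components=50)
y = pca.fit_transform(x)

y = TSNE(n_components=2).fit_transform(y)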

In [17]:
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)


data = [
    go.Scatter(
        x=[i[0] for i in y],
        y=[i[1] for i in y],
        mode='markers',
        text=sentences,  # hover text for each point
        marker=dict(
            size=16,
            color=[len(i) for i in sentences],  # color encodes sentence length in characters
            opacity=0.8,
            colorscale='Viridis',
            showscale=False
        )
    )
]
layout = dict(
    yaxis=dict(zeroline=False),
    xaxis=dict(zeroline=False)
)
fig = go.Figure(data=data, layout=layout)
# Write the plot to an HTML file.
file = plot(fig, filename='Sentence_encode/test.html')

## Create a semantic search engine:

Enter a set of words to find matching sentences. `results_returned` sets the number of matching sentences returned.

This step is again memory-intensive.

In [25]:
search_string = "turpentine" #@param {type:"string"}
results_returned = "5"

from sklearn.metrics.pairwise import cosine_similarity

# The corpus sentences were lowercased, so compare against lowercase query words.
search_terms = search_string.lower().split()

embeddings2 = embed(
    [search_string],
    signature="default",
    as_dict=True)["default"]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    search_vect = sess.run(embeddings2)

cosine_similarities = pd.Series(cosine_similarity(search_vect, x).flatten())
output = ""
# Series.items() replaces iteritems(), which was removed in pandas 2.0.
for idx, score in cosine_similarities.nlargest(int(results_returned)).items():
    output += '<p style="font-family:verdana; font-size:110%;"> '
    for word in sentences[idx].split():
        # Bold whole-word matches only; a substring test would also bold
        # short words such as "in" for the query "turpentine".
        if word.lower() in search_terms:
            output += " <b>" + word + "</b>"
        else:
            output += " " + word
    output += "</p><hr>"

output = '<h3>Results:</h3>' + output
display(HTML(output))


INFO:tensorflow:Saver not created because there are no variables in the graph to restore
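
The ranking step above relies on cosine similarity, the cosine of the angle between the query vector and each sentence vector. As a rough illustration of what `cosine_similarity` followed by `nlargest` computes, here is a standard-library sketch on made-up 3-dimensional vectors (the ELMo vectors in this notebook have 1024 dimensions). `cosine` and `top_k` are names introduced for this example only.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, vectors, k):
    """Return the indices of the k vectors most similar to `query`, best first."""
    scores = [cosine(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


# Made-up toy vectors, for illustration only.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_vectors, k=2))
```

Because cosine similarity ignores vector length, a long and a short sentence can still score as close neighbors if their embeddings point in a similar direction.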
