# Experimenting with word2vec

The following experiments draw heavily from Ryan Heuser's experiments with [Word Vectors in the Eighteenth Century, Episode 1: Concepts](http://ryanheuser.org/word-vectors-1/). For more on word2vec, see his great introduction and preliminary experiments with Edward Young's Riches are to Virtue as Learning is to Genius, as well as the great post by [Ben Schmidt](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html), and the many and various others whom Ryan mentions in his introduction.

#### Why am I interested in this word2vec stuff?

My work focuses on the changing linguistic style of eighteenth-century British travel writing, from 1700 to 1830 (although I'm also working on an early modern corpus for the Early Modern Conversions project). For example, how did the use of first-person narration change? How can we track the development of aesthetic language in travel writing? How do subgenres develop later in the century, especially considering our received literary histories of divisions between personal narrative and guidebooks, for example?

Okay, okay, but how is word2vec going to be useful in that endeavour? To be honest, I'm not exactly certain, but I have some ideas. The purpose of the following explorations is akin to brainstorming.

First, if, like me, you want to use IPython to experiment, see [Ryan's page](http://ryanheuser.org/word-vectors-1/) once again (what a champ, right?) for the word2vec model that he trained on the ECCO-TCP corpus using gensim. As long as you have a working IPython installation, gensim installed, and that model, you should be good to go to follow along below or experiment on your own!

So, let's get started.

In [2]:
import gensim

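# Note: in gensim 1.0 and later this loader lives on KeyedVectors instead:
# gensim.models.KeyedVectors.load_word2vec_format(...)
model = gensim.models.Word2Vec.load_word2vec_format('C:/Users/ASUS/Documents/iPython/word2vec.ECCO-TCP.txt')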

First, let's just do a few quick tests to see if things are working as we expect...

In [28]:
# does it recognize the same word as occupying the same vector space?
model.similarity('woman', 'woman')

1.0000000000000002

In [29]:
# are similar terms pretty similar?
model.similarity('woman', 'man')

0.87246097287540969
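The scores that `similarity` returns are cosine similarities between word vectors, which is why a word compared with itself comes out at 1 (the trailing `...02` above is just floating-point rounding). Here is a minimal sketch of that calculation. The three-dimensional vectors are made-up values for illustration, not taken from the ECCO-TCP model, and `cosine_similarity` is my own helper name rather than a gensim function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (hypothetical values, NOT from the real model)
woman = [0.2, 0.9, 0.4]
man = [0.3, 0.8, 0.1]

print(cosine_similarity(woman, woman))  # 1.0, up to floating-point rounding
print(cosine_similarity(woman, man))    # below 1.0: close in direction, not identical
```
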

In [30]:
# and checking the most_similar...
print(*model.most_similar(['woman']), sep='\n')

('man', 0.8724610209465027)
('girl', 0.8591000437736511)
('gentleman', 0.7977981567382812)
('creature', 0.7968071699142456)
('gentlewoman', 0.7913839221000671)
('person', 0.7872239351272583)
('lover', 0.7776159048080444)
('boy', 0.7674461603164673)
('nobleman', 0.7542575597763062)
('coquette', 0.7531524300575256)


Great! Everything looks good to go. Remember, this model was trained on the ECCO-TCP corpus, so if you're surprised by these or any of the other results, it may be because of the eighteenth-century context.

The most common example given of how word2vec works is the "king - man + woman" analogy. We know that the concept of a king, with the concept of man removed and the concept of woman added, should be queen, but what does the model think?

In [27]:
print(*model.most_similar(['woman','king'], ['man']), sep='\n')

('queen', 0.7854657173156738)
('emperor', 0.7523161768913269)
('prince', 0.7436755895614624)
('princess', 0.7133168578147888)
('conqueror', 0.7111818194389343)
('regent', 0.7088087797164917)
('empress', 0.6977599263191223)
('sultan', 0.6729022264480591)
('confessor', 0.6569845676422119)
('duke', 0.6366889476776123)


Success! I'm not sure how to interpret some of those other results - for example, why is emperor significantly higher than empress? - but let's keep those questions in mind as we push forward.
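To make the arithmetic behind that query a bit more concrete, here is a rough sketch of what I understand `most_similar` to be doing with its positive and negative lists: combine the length-normalized vectors (adding the positives, subtracting the negatives), then rank every other word by cosine similarity to the result, leaving out the query words themselves. The vectors below are invented toy values, and `unit` and `analogy` are hypothetical helper names, so treat this as an illustration rather than gensim's actual implementation.

```python
import numpy as np

# Toy vectors (hypothetical values); dimensions loosely read as
# [royalty, masculine, feminine]
vectors = {
    'king':   np.array([0.9, 0.8, 0.1]),
    'queen':  np.array([0.9, 0.1, 0.8]),
    'man':    np.array([0.1, 0.9, 0.1]),
    'woman':  np.array([0.1, 0.1, 0.9]),
    'prince': np.array([0.7, 0.7, 0.2]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative):
    """Rank the remaining words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are left out of the results
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# woman + king - man
print(analogy(['woman', 'king'], ['man']))  # 'queen' ranks first for these toy vectors
```
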

### Travel

Let's start with just a basic question: what other words are associated with travel?

In [33]:
print(*model.most_similar(['travel']), sep='\n')

('ride', 0.8182368874549866)
('wander', 0.7832238674163818)
('go', 0.7742351293563843)
('pass', 0.7557092905044556)
('walk', 0.751531720161438)
('run', 0.6949197053909302)
('live', 0.6935456991195679)
('steer', 0.6895339488983154)
('swim', 0.6809810996055603)
('advance', 0.6802776455879211)


Travel, here, is largely associated with verbs, rather than nouns. What if we add in some other elements?

In [31]:
print(*model.most_similar(['journey']), sep='\n')

('voyage', 0.8800842761993408)
('tour', 0.7904102802276611)
('route', 0.7388333082199097)
('pilgrimage', 0.7387519478797913)
('excursion', 0.713157057762146)
('departure', 0.7096347212791443)
('march', 0.6997677087783813)
('journies', 0.6862897276878357)
('travels', 0.6799992322921753)
('expedition', 0.6781469583511353)


Ah, that is more of the kind of travel that I'm interested in - although I'm also intrigued by how the verbs of travel and the nouns of travel might intersect. What if we look at travel*s*?

In [34]:
print(*model.most_similar(['travels']), sep='\n')

('tour', 0.7089195251464844)
('researches', 0.6972652673721313)
('journey', 0.6799992918968201)
('voyage', 0.6547428965568542)
('rambles', 0.6496587991714478)
('commentaries', 0.618072509765625)
('pastorals', 0.616107165813446)
('arrival', 0.612122654914856)
('synopsis', 0.6118921637535095)
('miscellanies', 0.6080632209777832)


Aaah, very interesting. The numbers aren't as high as they were for journey, but the similarity of travels to terms like researches, commentaries, pastorals (?!), synopsis, and miscellanies may bear further fruit.

*food for thought*
- is there a way to track how the vectors are changing over time?
- if nothing else, this seems like a good way to populate word lists, for example if you want to do some machine learning on travel texts...

For example, what if we take some of the highest ranked terms from our above journey experiment?

In [35]:
print(*model.most_similar(['journey', 'voyage', 'tour', 'route', 'pilgrimage', 'excursion', 'departure', 'departure', 'excursion']), sep='\n')

('arrival', 0.7368205785751343)
('embassy', 0.7238683700561523)
('expedition', 0.7216842174530029)
('excursions', 0.7171652317047119)
('entry', 0.7155791521072388)
('retreat', 0.7114778757095337)
('adventure', 0.699175238609314)
('jaunt', 0.698330819606781)
('travels', 0.6968503594398499)
('passage', 0.6941586136817932)


Huh! Travels still appears pretty low on the list. However, from my corpus building experiments with DREaM/EMC, "travel" as a title search term was a relatively reliable way to find texts about travel. But it looks like, outside of titles, other words occupy similar semantic space, with travel more on the outskirts.
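The word-list idea from the food-for-thought notes could be sketched roughly like this: take a few seed words, ask for the neighbours of each, and keep the ones above some similarity cutoff. `expand_word_list`, the 0.7 threshold, and the `fake_most_similar` stand-in are all my own assumptions for illustration. The stand-in only exists so the sketch runs without the model file, and its scores are rounded from the journey and travels output above.

```python
def expand_word_list(most_similar, seeds, topn=10, threshold=0.7):
    """Collect neighbours of each seed word whose similarity meets the threshold.

    most_similar: a callable like model.most_similar, returning (word, score) pairs.
    """
    words = set(seeds)
    for seed in seeds:
        for word, score in most_similar([seed], topn=topn):
            if score >= threshold:
                words.add(word)
    return sorted(words)

# Stand-in for model.most_similar (hypothetical helper, partial neighbour lists)
neighbours = {
    'journey': [('voyage', 0.880), ('tour', 0.790), ('route', 0.739), ('expedition', 0.678)],
    'travels': [('tour', 0.709), ('researches', 0.697), ('journey', 0.680)],
}

def fake_most_similar(positive, topn=10):
    return neighbours[positive[0]][:topn]

print(expand_word_list(fake_most_similar, ['journey', 'travels']))
# ['journey', 'route', 'tour', 'travels', 'voyage']
```

With the real model loaded, the call would presumably be `expand_word_list(model.most_similar, ['journey', 'travels'])`, and the threshold is something to tune by eye.
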

#### What other questions can we ask about travel?

Since my mind is on the gender track from earlier, what if we ask the same question about V(woman) + V(king) - V(man), but with travel?

In [41]:
print(*model.most_similar(['man','journey'], ['woman']), sep='\n')
# man + journey - woman

('voyage', 0.805957555770874)
('tour', 0.7144608497619629)
('pilgrimage', 0.6863122582435608)
('march', 0.6787299513816833)
('route', 0.6664501428604126)
('travels', 0.6568373441696167)
('excursion', 0.6390223503112793)
('passage', 0.6379192471504211)
('journies', 0.6373951435089111)
('career', 0.6349523067474365)


In [42]:
print(*model.most_similar(['woman','journey'], ['man']), sep='\n')
# woman + journey - man

('voyage', 0.7651640176773071)
('tour', 0.6965904235839844)
('departure', 0.6628884673118591)
('route', 0.6525277495384216)
('arrival', 0.636233925819397)
('excursion', 0.6341248750686646)
('pilgrimage', 0.632487416267395)
('jaunt', 0.6280655860900879)
('rambles', 0.6192945241928101)
('retreat', 0.6117883920669556)


I'm not an expert on gender and travel in the eighteenth century, but some of these differences seem notable - for example, march and career in men, and arrival, jaunt, rambles, and retreat in women. 
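One quick way to pull out those differences, rather than eyeballing the two lists, is to compare them directly. This is just a small sketch using the neighbour words copied from the two results above (scores dropped); `only_in` is a helper name I made up.

```python
# Neighbour lists copied from the two queries above (scores dropped)
man_journey = ['voyage', 'tour', 'pilgrimage', 'march', 'route',
               'travels', 'excursion', 'passage', 'journies', 'career']
woman_journey = ['voyage', 'tour', 'departure', 'route', 'arrival',
                 'excursion', 'pilgrimage', 'jaunt', 'rambles', 'retreat']

def only_in(first, second):
    """Words in `first` that do not appear in `second`, keeping rank order."""
    other = set(second)
    return [w for w in first if w not in other]

only_man = only_in(man_journey, woman_journey)
only_woman = only_in(woman_journey, man_journey)
print(only_man)    # ['march', 'travels', 'passage', 'journies', 'career']
print(only_woman)  # ['departure', 'arrival', 'jaunt', 'rambles', 'retreat']
```
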

### Differences in location

Another question under consideration is how travel is conceptualized differently depending on whether one is traveling in, say, England, France, Africa, or the Caribbean. It *seems* like it should be a good way of teasing out some of the differences between these locations, but as you can see below, my experiments so far haven't been terribly successful.

In [60]:
print(*model.most_similar(['england']), sep='\n')

('scotland', 0.9411507844924927)
('france', 0.9026877880096436)
('ireland', 0.8989571332931519)
('spain', 0.8748767375946045)
('holland', 0.8384954929351807)
('europe', 0.8096485137939453)
('italy', 0.7995690703392029)
('america', 0.797402024269104)
('poland', 0.7930401563644409)
('portugal', 0.7820289731025696)


In [62]:
print(*model.most_similar(['africa']), sep='\n')

('asia', 0.8637418746948242)
('peru', 0.8552894592285156)
('siberia', 0.8434224724769592)
('america', 0.8319177627563477)
('persia', 0.8298490047454834)
('mexico', 0.8288007974624634)
('germany', 0.8287796974182129)
('switzerland', 0.8209482431411743)
('tartary', 0.8190882802009583)
('russia', 0.8178454637527466)


In [70]:
print(*model.most_similar(['africa', 'journey'], ['england']), sep='\n')

('voyage', 0.7463215589523315)
('tour', 0.6522268056869507)
('route', 0.5995424389839172)
('coast', 0.5970306396484375)
('travels', 0.5917956829071045)
('coasts', 0.5887242555618286)
('excursion', 0.5636414885520935)
('north-east,', 0.5633217096328735)
('journies', 0.5607912540435791)
('frontiers', 0.5507453680038452)


In [71]:
print(*model.most_similar(['description']), sep='\n')

('specimen', 0.792880654335022)
('representation', 0.7700939774513245)
('sketch', 0.7623703479766846)
('picture', 0.7562278509140015)
('descriptions', 0.7303003072738647)
('narration', 0.7277277708053589)
('simile', 0.7254989147186279)
('delineation', 0.7222810983657837)
('definition', 0.7179701328277588)
('narrative', 0.7146041393280029)


In [72]:
print(*model.most_similar(['beauty']), sep='\n')

('loveliness', 0.7935914993286133)
('excellence', 0.7844477891921997)
('elegance', 0.780930757522583)
('softness', 0.7781577110290527)
('charms', 0.7685889005661011)
('sweetness', 0.7612348794937134)
('splendour', 0.7471537590026855)
('virtue', 0.7430456876754761)
('splendor', 0.7428613901138306)
('delicacy', 0.7419913411140442)


In [73]:
print(*model.most_similar(['africa', 'beauty'], ['england']), sep='\n')

('whiteness', 0.6618044972419739)
('richness', 0.6395145654678345)
('fertility', 0.6272088289260864)
('brilliancy', 0.6214094161987305)
('sublimity', 0.62090665102005)
('loveliness', 0.6144739389419556)
('beauties', 0.6129385828971863)
('brightness', 0.6119914054870605)
('excellence', 0.610403835773468)
('hues', 0.6103043556213379)


In [74]:
print(*model.most_similar(['asia', 'beauty'], ['england']), sep='\n')

('loveliness', 0.6491484045982361)
('sublimity', 0.6464680433273315)
('brightness', 0.6405013799667358)
('whiteness', 0.6328679919242859)
('splendor', 0.632349967956543)
('splendour', 0.6321321725845337)
('beauties', 0.6315566301345825)
('grandeur', 0.6281092762947083)
('hues', 0.6206766963005066)
('splendors', 0.6132285594940186)


In [80]:
print(*model.most_similar(['england', 'beauty'], ['france']), sep='\n')

('excellence', 0.7609107494354248)
('loveliness', 0.7390191555023193)
('virtue', 0.7209703922271729)
('sweetness', 0.7065260410308838)
('merit', 0.6959548592567444)
('sensibility', 0.694877564907074)
('elegance', 0.6942763924598694)
('softness', 0.6914743781089783)
('charms', 0.6880205869674683)
('beauties', 0.6824964880943298)


In [81]:
print(*model.most_similar(['france', 'beauty'], ['england']), sep='\n')

('elegance', 0.7341766953468323)
('softness', 0.7318840026855469)
('splendour', 0.7203804850578308)
('charms', 0.718169093132019)
('loveliness', 0.7148070335388184)
('splendor', 0.709806501865387)
('grandeur', 0.6994746923446655)
('delicacy', 0.6949677467346191)
('brilliancy', 0.6926240921020508)
('sweetness', 0.6878679990768433)


In [77]:
model.n_similarity(['england'], ['england'])

1.0000000000000002

In [76]:
model.n_similarity(['england'], ['ireland'])

0.6154046656104969

In [79]:
model.n_similarity(['england'], ['africa'])

0.65522880572924469

Part of my problem, I think, is that I'm not thinking in terms of how the word embedding model (WEM) is working - I expect the categories under beauty (so, descriptions of mountains) to show up, rather than just synonyms for beauty.