# Old English word embeddings
 
 <img style="float: right;" src="./wyrm.gif" width="250">  
 
This notebook learns several word embeddings for Old English using the [Dictionary of Old English Corpus](https://www.doe.utoronto.ca/pages/index.html), a complete corpus of the surviving Old English texts. Old English was a West Germanic language spoken in England from around 500 to around 1100 AD, and is the forerunner of Middle and Modern English.

Because of its proclivity towards compound words, shared with many West Germanic languages, Old English has many *hapax legomena*: words seen only once in the entire corpus. I therefore hypothesize that embeddings that capture sub-lexical patterns should perform better than embeddings that represent each whole word as a unique token.
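
As a rough illustration of what "seen only once" means in practice, the sketch below counts hapax legomena in a tokenized corpus. It assumes each sentence is a list of tokens; the function name `hapax_legomena` and the toy sentences are made up for illustration and are not taken from the corpus.

```python
from collections import Counter


def hapax_legomena(sentences):
    """Return the set of tokens that occur exactly once across all sentences."""
    counts = Counter(token for sent in sentences for token in sent)
    return {token for token, n in counts.items() if n == 1}


# Toy, made-up input: each sentence is a list of tokens.
toy_sentences = [
    ["se", "cyning", "rad"],
    ["se", "guðcyning", "rad"],
    ["se", "cyning", "for"],
]

hapaxes = hapax_legomena(toy_sentences)
vocab = {token for sent in toy_sentences for token in sent}

# Share of the vocabulary (word types) that appears only once.
hapax_share = len(hapaxes) / len(vocab)
print(sorted(hapaxes), hapax_share)
```

Run on the full list of tokenized sentences, the same function would give the share of the vocabulary that a whole-word model sees in a single context only.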



In [26]:
import re
import os
from bs4 import BeautifulSoup
from bs4 import NavigableString,Tag
import html5lib
from DOECparser import DOEC_text

In [8]:
import pandas as pd
import numpy as np
import gensim

### Get metadata on each text

In [10]:
path = '../../DOEC_complete/2488/sgml-corpus/'
text_metadata = pd.read_csv(path+'DOEC_text_metadata.csv')
text_metadata.head()

| Unnamed: 0 | cameron_number | filename | short_title | text_type | sentences | words |
|---|---|---|---|---|---|---|
| 0 | A1.1 | T00010.sgml | GenA,B | Poetry | 917 | 17075 |
| 1 | A1.2 | T00020.sgml | Ex | Poetry | 173 | 2974 |
| 2 | A1.3 | T00030.sgml | Dan | Poetry | 219 | 4472 |
| 3 | A1.4 | T00040.sgml | Sat | Poetry | 252 | 4381 |
| 4 | A2.1 | T00050.sgml | And | Poetry | 571 | 9288 |


### Read in every poetry and prose text

Glosses and inscriptions are excluded.

In [36]:
texts = text_metadata[text_metadata['text_type'].isin(['Poetry','Prose'])]['filename'].tolist()
#texts = ['T00010.sgml']

In [37]:
corpus_lines = []
for filename in texts:
    text_t = DOEC_text(path + filename).read_file()
    text_t.sentence_splitter()
    for sent in text_t.split_sentences:
        # Skip sentences whose first field has length 0 or 1.
        if len(sent[0]) > 1:
            corpus_lines.append(sent[0])

In [38]:
len(corpus_lines)

103360

# Learn embeddings

### Word2Vec:  skipgram, 7 negative samples, window = 7

In [103]:
w2v_skipgram = gensim.models.Word2Vec(size=300, window=7, min_count=1, workers=8, sg=1,
                                     negative=7)

In [104]:
w2v_skipgram.build_vocab(corpus_lines)
len(w2v_skipgram.wv.vocab)

130411

In [105]:
%%time
w2v_skipgram.train(corpus_lines
                   ,total_examples=w2v_skipgram.corpus_count
                   ,epochs=15
                   ,compute_loss=True)

CPU times: user 14min 51s, sys: 3 s, total: 14min 54s
Wall time: 3min 54s


27072673

### Word2Vec: skipgram, hierarchical softmax, window = 7

In [106]:
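# hs=1 turns on hierarchical softmax. sg and negative are left at gensim's defaults here
# (sg=0, i.e. CBOW, and negative=5); pass sg=1 and negative=0 for skip-gram with hierarchical softmax only.
w2v_skipgram_hs = gensim.models.Word2Vec(size=300, window=7, min_count=1, workers=8, hs=1)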

In [107]:
w2v_skipgram_hs.build_vocab(corpus_lines)
len(w2v_skipgram_hs.wv.vocab)

130411

In [108]:
%%time
w2v_skipgram_hs.train(corpus_lines
                   ,total_examples=w2v_skipgram_hs.corpus_count
                   ,epochs=15
                   ,compute_loss=True)

CPU times: user 5min 56s, sys: 1.91 s, total: 5min 58s
Wall time: 1min 39s


27072764

### FastText

In [112]:
fasttext = gensim.models.FastText(size=300, window=7, min_count=1, workers=8, sg=1, 
                                 negative=7,
                                 word_ngrams=1,min_n=2,max_n=6)

In [113]:
fasttext.build_vocab(corpus_lines)
len(fasttext.wv.vocab)

130411

In [114]:
%%time
fasttext.train(corpus_lines
               ,total_examples=fasttext.corpus_count
               ,epochs=15)

CPU times: user 13min 26s, sys: 3.49 s, total: 13min 29s
Wall time: 3min 50s
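
The `min_n=2` and `max_n=6` settings above control which character n-grams FastText learns vectors for, and this is what lets a compound share information with its parts. The following is a minimal sketch of the idea, not gensim's actual implementation (which also hashes n-grams into a fixed number of buckets): the word is wrapped in `<` and `>` boundary markers and windows of each length are slid over it. The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, min_n=2, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


simple = set(char_ngrams("cyning"))
compound = set(char_ngrams("guðcyning"))

# N-grams that the compound has in common with the simple word.
shared = simple & compound
print(len(simple), len(shared))
```

Every n-gram of `cyning` that does not touch the opening `<` marker also occurs in `guðcyning`, so the two words share most of their sub-word units even if the compound is a hapax.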


# Qualitative checks of embeddings
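
All of the checks in this section use `most_similar`, which ranks vocabulary words by the cosine similarity of their vectors to the query vector. The following is a minimal pure-Python sketch of that ranking on made-up three-dimensional vectors; the helper names and toy values are illustrative assumptions, not the gensim implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=10):
    """Rank all other words by cosine similarity to the query word."""
    q = vectors[query]
    scored = [
        (word, cosine_similarity(q, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "cyning": [1.0, 0.0, 0.0],
    "cining": [0.9, 0.1, 0.0],
    "wyrm": [0.0, 1.0, 0.0],
}

neighbors = most_similar("cyning", toy_vectors, topn=2)
print(neighbors)
```

Because the score depends only on direction, not vector length, the absolute values are not directly comparable across models; the rankings within one model are the useful part.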

## Nouns

### Cyning = king
FastText mostly gives alternate spellings and compounds containing this word.

- guðcyning, guðkyning = war-king
- ðeodcyning = nation-king
- folccyning = people-king

Word2Vec, on the other hand, gives a mix of alternate spellings, the synonym æþeling ("prince / noble"), inflections of Mercia (a powerful Anglo-Saxon kingdom), and a list of names of important Anglo-Saxon kings: Osweo, Alfred, Egbert.

In [115]:
fasttext.wv.most_similar(positive=['cyning'])

[('cyningwic', 0.9308552742004395),
 ('guðcyning', 0.9296610355377197),
 ('cyningæ', 0.9289181232452393),
 ('cyningc', 0.9288219213485718),
 ('cining', 0.9265061616897583),
 ('folccyning', 0.9183368682861328),
 ('eastcyning', 0.9179111123085022),
 ('guðkyning', 0.9177619218826294),
 ('cynning', 0.9173005819320679),
 ('kyning', 0.9166631698608398)]

In [109]:
w2v_skipgram.wv.most_similar(positive=['cyning'])

[('cining', 0.566181480884552),
 ('cing', 0.5258333683013916),
 ('cynincg', 0.489766925573349),
 ('herma', 0.47887736558914185),
 ('cyninge', 0.4763665199279785),
 ('madon', 0.4750809073448181),
 ('someron', 0.47457608580589294),
 ('partha', 0.47169744968414307),
 ('thenach', 0.46782156825065613),
 ('cyng', 0.46777451038360596)]

In [111]:
w2v_skipgram_hs.wv.most_similar(positive=['cyning'])

[('cing', 0.6217490434646606),
 ('cining', 0.5889766216278076),
 ('cynincg', 0.5676072835922241),
 ('cyningc', 0.5443241596221924),
 ('kyning', 0.5227587819099426),
 ('cyng', 0.5082994699478149),
 ('cyninge', 0.4633118808269501),
 ('casere', 0.4627220332622528),
 ('kasere', 0.4574437737464905),
 ('ealdormon', 0.44780227541923523)]
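
One rough way to compare the two Word2Vec variants on a query is the Jaccard overlap of their neighbor lists. The sketch below uses the `cyning` neighbors copied from the two outputs above; the function name `jaccard` and the choice of this measure are illustrative, not part of the original analysis.

```python
def jaccard(a, b):
    """Jaccard overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Neighbors of 'cyning', copied from the two Word2Vec outputs above.
negative_sampling = [
    "cining", "cing", "cynincg", "herma", "cyninge",
    "madon", "someron", "partha", "thenach", "cyng",
]
hierarchical_softmax = [
    "cing", "cining", "cynincg", "cyningc", "kyning",
    "cyng", "cyninge", "casere", "kasere", "ealdormon",
]

shared = sorted(set(negative_sampling) & set(hierarchical_softmax))
overlap = jaccard(negative_sampling, hierarchical_softmax)
print(shared, round(overlap, 3))
```

The five shared neighbors are all spelling variants or inflections of `cyning`; the models differ in what they place after those.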

### Feorh = soul, life

Note: this word is used frequently in *Beowulf*.

Again, FastText gives alternate spellings and compounds as most similar.

Word2Vec gives the inflected (genitive) form "feores" and a list of words that do not seem to be very close synonyms (arguably hæleþa is the closest). Some of these may have been used in place of feorh metaphorically, as Germanic poetry is fond of doing.

- Scyldinga = descendant of Scyld Scefing, i.e. a Dane
- banan = slayer's
- carcern = dungeon
- hæleþa = heroes, warriors, men
- eðle = homeland
- sinc = treasure
- hrinan = touch

In [116]:
fasttext.wv.most_similar(positive=['feorh'])

[('feorhræd', 0.9149259924888611),
 ('feorhgebeorh', 0.9044870138168335),
 ('feorhdolg', 0.9022006988525391),
 ('feorhseoc', 0.9001412391662598),
 ('feorhlif', 0.898395299911499),
 ('feorhadl', 0.8902829885482788),
 ('feorhbealu', 0.8769589066505432),
 ('feorg', 0.8765217065811157),
 ('feoroh', 0.8763668537139893),
 ('feorhbealo', 0.8749235272407532)]

In [117]:
w2v_skipgram.wv.most_similar(positive=['feorh'])

[('feorg', 0.5855077505111694),
 ('onwist', 0.5599532127380371),
 ('ealgian', 0.5524983406066895),
 ('magoræswan', 0.5391068458557129),
 ('frætwa', 0.5371189713478088),
 ('geyrne', 0.5327767133712769),
 ('feores', 0.5320269465446472),
 ('wedra', 0.5288721919059753),
 ('radost', 0.5286190509796143),
 ('uðgenge', 0.5285945534706116)]

In [118]:
w2v_skipgram_hs.wv.most_similar(positive=['feorh'])

[('feores', 0.4086899161338806),
 ('lic', 0.3772948682308197),
 ('aldre', 0.3772868514060974),
 ('feore', 0.3678957223892212),
 ('carcern', 0.36658716201782227),
 ('steopmoder', 0.34895774722099304),
 ('aldor', 0.34359925985336304),
 ('ongynnisse', 0.34219813346862793),
 ('sið', 0.3409523367881775),
 ('eþel', 0.34078097343444824)]
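
The contrast between FastText and Word2Vec for `feorh` can be put into a rough number: the share of returned neighbors that contain the query as a substring. The sketch below uses the neighbor lists copied from the FastText and negative-sampling Word2Vec outputs above; the helper name `share_containing` is hypothetical, and a plain substring test is a simplifying assumption that misses spelling variants such as `feorg`.

```python
def share_containing(query, neighbors):
    """Fraction of neighbor words that contain the query as a substring."""
    return sum(query in word for word in neighbors) / len(neighbors)


# Neighbors of 'feorh', copied from the outputs above.
fasttext_neighbors = [
    "feorhræd", "feorhgebeorh", "feorhdolg", "feorhseoc", "feorhlif",
    "feorhadl", "feorhbealu", "feorg", "feoroh", "feorhbealo",
]
word2vec_neighbors = [
    "feorg", "onwist", "ealgian", "magoræswan", "frætwa",
    "geyrne", "feores", "wedra", "radost", "uðgenge",
]

fasttext_share = share_containing("feorh", fasttext_neighbors)
word2vec_share = share_containing("feorh", word2vec_neighbors)
print(fasttext_share, word2vec_share)
```

A high share suggests the model is ranking mainly by shared spelling, while a low share suggests the neighbors come from shared contexts instead.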

### Wyrm = serpent, dragon

In [119]:
fasttext.wv.most_similar(positive=['wyrm'])

[('wyrmfah', 0.8998173475265503),
 ('wyrp', 0.8878295421600342),
 ('wyrms', 0.887391984462738),
 ('slawyrm', 0.8822410702705383),
 ('wyrmsi', 0.8638601303100586),
 ('wyrmod', 0.8498482704162598),
 ('wyrma', 0.8465002775192261),
 ('wyrmcyn', 0.8428762555122375),
 ('wyrmð', 0.8350868225097656),
 ('wyrgðu', 0.8079687356948853)]

In [120]:
w2v_skipgram.wv.most_similar(positive=['wyrm'])

[('swelge', 0.6503645777702332),
 ('tuncersan', 0.6176624298095703),
 ('weaxsealf', 0.6163438558578491),
 ('rysele', 0.6114765405654907),
 ('bite', 0.6096867322921753),
 ('gedrince', 0.606326699256897),
 ('aþwænan', 0.6043746471405029),
 ('gæten', 0.6021493673324585),
 ('micgan', 0.5992810130119324),
 ('worms', 0.5991789698600769)]

In [121]:
w2v_skipgram_hs.wv.most_similar(positive=['wyrm'])

[('fot', 0.4604165852069855),
 ('lig', 0.4559553563594818),
 ('searwum', 0.3889991343021393),
 ('seað', 0.37408506870269775),
 ('swile', 0.36562398076057434),
 ('horn', 0.35987937450408936),
 ('ra', 0.35331493616104126),
 ('þrexwold', 0.35093754529953003),
 ('leg', 0.3472810983657837),
 ('hunta', 0.34098267555236816)]

## Verbs

### Beo = be

In [122]:
fasttext.wv.most_similar(positive=['beo'])

[('beoit', 0.8672382831573486),
 ('beonn', 0.7934653759002686),
 ('beod', 0.7904794216156006),
 ('beom', 0.7872384786605835),
 ('beoc', 0.7866171598434448),
 ('beotige', 0.7862136363983154),
 ('beonat', 0.7854796648025513),
 ('beot', 0.7851861715316772),
 ('beteo', 0.7799439430236816),
 ('beote', 0.7732082009315491)]

In [76]:
w2v_skipgram.wv.most_similar(positive=['beo'])

[('byð', 0.7594594955444336),
 ('sig', 0.7413039803504944),
 ('wurðe', 0.7262946367263794),
 ('sy', 0.7200414538383484),
 ('si', 0.7192427515983582),
 ('byd', 0.6763575077056885),
 ('gewurðe', 0.6673541069030762),
 ('underfo', 0.6545195579528809),
 ('stande', 0.6478850245475769),
 ('byst', 0.6473221778869629)]

In [123]:
w2v_skipgram_hs.wv.most_similar(positive=['beo'])

[('sy', 0.6826016902923584),
 ('si', 0.6311945915222168),
 ('byð', 0.599108874797821),
 ('bið', 0.5752897262573242),
 ('sig', 0.5575801134109497),
 ('bist', 0.4888310432434082),
 ('cume', 0.48649322986602783),
 ('byst', 0.4778839647769928),
 ('wurðe', 0.4456484317779541),
 ('byþ', 0.43626153469085693)]

### Willan = to wish

In [83]:
fasttext.wv.most_similar(positive=['will'])

[('wil', 0.9359115362167358),
 ('willat', 0.9171969890594482),
 ('willæ', 0.9115278720855713),
 ('iwill', 0.9002031683921814),
 ('willm', 0.895616352558136),
 ('willimot', 0.8786477446556091),
 ('willun', 0.8776760101318359),
 ('willabyg', 0.8757930994033813),
 ('willæn', 0.8712247610092163),
 ('willemot', 0.8691423535346985)]

In [84]:
w2v_skipgram.wv.most_similar(positive=['will'])

[('sal', 0.9215754270553589),
 ('euere', 0.9197388887405396),
 ('hauede', 0.9145123958587646),
 ('ther', 0.912217378616333),
 ('may', 0.9099560976028442),
 ('mikel', 0.9089739322662354),
 ('ilke', 0.9077520966529846),
 ('euer', 0.9058969616889954),
 ('goð', 0.9033337235450745),
 ('my', 0.903140664100647)]

### Seo = see

In [85]:
fasttext.wv.most_similar(positive=['seo'])

[('seok', 0.8226551413536072),
 ('seox', 0.7983690500259399),
 ('iseo', 0.7713162899017334),
 ('seoðe', 0.7677260041236877),
 ('seod', 0.7589434385299683),
 ('seogen', 0.7533072233200073),
 ('seonoðe', 0.7514855861663818),
 ('seofæn', 0.7501230239868164),
 ('seoþe', 0.749377965927124),
 ('þeos', 0.7478351593017578)]

In [86]:
w2v_skipgram.wv.most_similar(positive=['seo'])

[('þeos', 0.6924037337303162),
 ('sio', 0.6012263298034668),
 ('fæmne', 0.56919926404953),
 ('sunne', 0.5559064149856567),
 ('stow', 0.5449384450912476),
 ('ðeo', 0.5420321226119995),
 ('ðeos', 0.5267632603645325),
 ('þeo', 0.5217249393463135),
 ('forme', 0.52135169506073),
 ('cwen', 0.5133857727050781)]

### Scan = shone

In [87]:
fasttext.wv.most_similar(positive=['scan'])

[('scacan', 0.9706835746765137),
 ('scancan', 0.9619744420051575),
 ('scadan', 0.9559793472290039),
 ('scuan', 0.9491746425628662),
 ('scafan', 0.9462645053863525),
 ('scagan', 0.9453237056732178),
 ('scuwan', 0.9410423040390015),
 ('sceancan', 0.9408941268920898),
 ('ycan', 0.9403120279312134),
 ('scuccan', 0.9401297569274902)]

In [88]:
w2v_skipgram.wv.most_similar(positive=['scan'])

[('dagunge', 0.926880955696106),
 ('sweartre', 0.9178236126899719),
 ('uhtan', 0.916837751865387),
 ('scima', 0.9160252809524536),
 ('upeode', 0.9137285947799683),
 ('neahte', 0.909358024597168),
 ('ðunor', 0.9064586162567139),
 ('regn', 0.9054130911827087),
 ('nontid', 0.9046434760093689),
 ('tohlad', 0.9038349390029907)]

## Adjectives

### Eald = old

In [124]:
fasttext.wv.most_similar(positive=['eald'])

[('ealdfind', 0.9155492782592773),
 ('ealdur', 0.9001302123069763),
 ('ealdwig', 0.8833523988723755),
 ('ealdæ', 0.8796008825302124),
 ('geald', 0.8776580095291138),
 ('ealdferð', 0.8776400089263916),
 ('ealdwif', 0.8748879432678223),
 ('ealdmetod', 0.873623788356781),
 ('ealdred', 0.8729549646377563),
 ('eadweald', 0.8710095286369324)]

In [125]:
w2v_skipgram.wv.most_similar(positive=['eald'])

[('endylfon', 0.5665035247802734),
 ('nihta', 0.5654948949813843),
 ('ateorod', 0.5626211166381836),
 ('eþelweard', 0.5415925979614258),
 ('sunð', 0.5403499007225037),
 ('fyrngeare', 0.536300539970398),
 ('mona', 0.5272260308265686),
 ('tel', 0.5196020603179932),
 ('wyntre', 0.5174105167388916),
 ('iunii', 0.5168569684028625)]

In [126]:
w2v_skipgram_hs.wv.most_similar(positive=['eald'])

[('twentig', 0.3950759768486023),
 ('clænsigeanne', 0.3756531774997711),
 ('sumor', 0.3562009334564209),
 ('þrittig', 0.3560253977775574),
 ('aprilis', 0.3555823564529419),
 ('dead', 0.35175517201423645),
 ('frod', 0.3489319980144501),
 ('ealdne', 0.3457050919532776),
 ('februariusmonð', 0.34340110421180725),
 ('mai', 0.33607977628707886)]

### Micel = great

In [92]:
fasttext.wv.most_similar(positive=['micel'])

[('micell', 0.9547616243362427),
 ('midmicel', 0.9180570840835571),
 ('mycel', 0.905677318572998),
 ('emmicel', 0.9032089114189148),
 ('efnmicel', 0.8915281295776367),
 ('micelu', 0.8772553205490112),
 ('micellic', 0.8711910247802734),
 ('mycell', 0.8686025738716125),
 ('micelys', 0.857871413230896),
 ('emmycel', 0.8497353792190552)]

In [93]:
w2v_skipgram.wv.most_similar(positive=['micel'])

[('mycel', 0.8458054065704346),
 ('wæl', 0.6736177802085876),
 ('geslogon', 0.6557415127754211),
 ('unlytel', 0.6553495526313782),
 ('geslægen', 0.652004599571228),
 ('egeslic', 0.6513754725456238),
 ('gefeoht', 0.6414756774902344),
 ('gehwæþere', 0.6368350982666016),
 ('geslagen', 0.6359255909919739),
 ('wundorlic', 0.635911762714386)]

### Yfel = bad

In [94]:
fasttext.wv.most_similar(positive=['yfel'])

[('yfell', 0.9452179670333862),
 ('yfelo', 0.8896181583404541),
 ('yfelu', 0.8511879444122314),
 ('yfelæ', 0.8501334190368652),
 ('yfelsoð', 0.8486564755439758),
 ('yfol', 0.8382625579833984),
 ('yfelam', 0.8342252969741821),
 ('yffel', 0.8311987519264221),
 ('ðyfel', 0.8217541575431824),
 ('yfyl', 0.8201266527175903)]

In [95]:
w2v_skipgram.wv.most_similar(positive=['yfel'])

[('good', 0.7591409683227539),
 ('nauht', 0.7514868378639221),
 ('facen', 0.736697256565094),
 ('forðæmþe', 0.7362229824066162),
 ('facn', 0.7343299388885498),
 ('wiðerweard', 0.7302685976028442),
 ('wenen', 0.7287782430648804),
 ('goode', 0.7230398654937744),
 ('nanwuht', 0.7202662825584412),
 ('wyrcanne', 0.7200943231582642)]

## Numerals

### Siex = six

In [97]:
fasttext.wv.most_similar(positive=['siex'])

[('siox', 0.9139299392700195),
 ('sielm', 0.8922587633132935),
 ('sifax', 0.8647642135620117),
 ('sieo', 0.8634213209152222),
 ('siexta', 0.8548995852470398),
 ('siþa', 0.8391749262809753),
 ('sient', 0.8310285806655884),
 ('six', 0.8277508020401001),
 ('siolf', 0.8229187726974487),
 ('sido', 0.822192370891571)]

In [98]:
w2v_skipgram.wv.most_similar(positive=['siex'])

[('syfan', 0.9511535167694092),
 ('eahtatyne', 0.9501603245735168),
 ('ðrittig', 0.9464792609214783),
 ('hundseofonti', 0.9444952607154846),
 ('seofontig', 0.9443520307540894),
 ('sixhund', 0.9425896406173706),
 ('hundeahtatigum', 0.9422545433044434),
 ('monþas', 0.9411824345588684),
 ('hundseofantig', 0.9360958337783813),
 ('chanan', 0.9348127841949463)]

### Forma = first

In [100]:
fasttext.wv.most_similar(positive=['forma'])

[('formelta', 0.9107708930969238),
 ('formis', 0.8550630807876587),
 ('feorma', 0.8548452258110046),
 ('formosus', 0.8275351524353027),
 ('ferma', 0.8262588977813721),
 ('feorða', 0.8253859877586365),
 ('sexta', 0.8243184089660645),
 ('syxta', 0.8204258680343628),
 ('fifta', 0.8150545358657837),
 ('fisica', 0.8138123750686646)]

In [101]:
w2v_skipgram.wv.most_similar(positive=['forma'])

[('æresta', 0.9077373743057251),
 ('feorða', 0.886397659778595),
 ('æftera', 0.8814852833747864),
 ('ðridda', 0.88039231300354),
 ('twelfta', 0.8571304678916931),
 ('þridda', 0.855779230594635),
 ('syxta', 0.8556851744651794),
 ('ærra', 0.8526087403297424),
 ('teoða', 0.8523637056350708),
 ('fifta', 0.8485428094863892)]