# Völuspá analysis

This notebook explores how NLP techniques can support philological studies. My first analysis looks at Völuspá, the first poem of the Elder Edda, which tells the destiny of the worlds.


#### Set your own user path

In [1]:
USER_PATH = "/home/pi"

### Download Old Norse Corpora to cltk_data directory

In [2]:

from cltk.corpus.utils.importer import CorpusImporter
onc = CorpusImporter("old_norse")
onc.import_corpus("old_norse_texts_heimskringla")
onc.import_corpus('old_norse_models_cltk')
#onc.list_corpora

Configure **ipython**.

```bash
$ ipython profile create
$ ipython locate
$ nano .ipython/profile_default/ipython_config.py
```
Add the following at the end of the file:
```python
c.InteractiveShellApp.exec_lines = [
    'import sys; sys.path.append("/home/pi/cltk_data")'
]
```
This step is needed because it makes the data provided by CLTK easier to use. You will see later in the notebook how it is used.
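
If you prefer not to edit the IPython profile, a similar effect can probably be obtained from inside the notebook. This is a minimal sketch, assuming the corpora were downloaded to a `cltk_data` directory under `USER_PATH`; the variable name `cltk_data_path` is introduced here for illustration only.

```python
import os
import sys

USER_PATH = "/home/pi"  # same value as in the first cell; adapt it to your machine
cltk_data_path = os.path.join(USER_PATH, "cltk_data")  # illustrative name

# Add the directory to the module search path, only once per session
if cltk_data_path not in sys.path:
    sys.path.append(cltk_data_path)
```

Unlike the profile setting, this only lasts for the current session, so the cell has to be run again after a kernel restart.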

Install the **kernel** associated with **python3.6** [https://ipython.readthedocs.io/en/stable/install/kernel_install.html](https://ipython.readthedocs.io/en/stable/install/kernel_install.html) 

In [3]:
# Load Völuspá from the old_norse_texts_heimskringla corpus

import os
module_path = os.path.join(USER_PATH, "cltk_data/old_norse/text/old_norse_texts_heimskringla/")
here = os.getcwd()
# text_manager is imported from the corpus directory, so move there first;
# the path given to TextLoader is relative to that directory
os.chdir(module_path)
import text_manager
loader = text_manager.TextLoader(os.path.join("Sæmundar-Edda", "Völuspá"), "txt")
text = loader.load()
# Come back to the original working directory
os.chdir(here)
print(text[:200])


1.
Hljóðs bið ek allar
helgar kindir,
meiri ok minni
mögu Heimdallar;
viltu at ek, Valföðr,
vel fyr telja
forn spjöll fira,
þau er fremst of man.
 
2.
Ek man jötna
ár of borna,
þá er forðum mik
fædda 


### Split the text into paragraphs and verses
Extract the organisation of the verses in Völuspá.

In [4]:
# Normalise a line before further processing
import re
def remove_punctuations(text):
    res = text
    # Remove hyphens, colons, question marks, semicolons and commas
    # (full stops are kept)
    res = re.sub(r"[\-:\?;,]", "", res)
    # Rewrite "z" as "s" and "x" as "ks"
    res = re.sub("z", "s", res)
    res = re.sub("x", "ks", res)
    # Collapse runs of spaces into a single space
    res = re.sub(r" +", " ", res)
    return res

  * Get the indices of the paragraph delimiters
  * Extract the content between two consecutive delimiters, split it into lines and put them in a list
  * Clean all the lines/verses

In [5]:
import re

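# Locate the stanza numbers ("1.", "2.", ...) that delimit the paragraphs
indices = [(m.start(0), m.end(0)) for m in re.finditer(r"[0-9]{1,2}\.", text)]
# Take the text between two consecutive delimiters and split it into lines
paragraphs = [text[indices[i][1]:indices[i+1][0]].split("\n") for i in range(len(indices)-1)]
# Clean every line; drop empty lines and lines made of a non-breaking space
paragraphs = [[remove_punctuations(verse).strip() for verse in paragraph if remove_punctuations(verse) != "" and verse != "\xa0"] for paragraph in paragraphs]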
paragraphs[:3]

[['Hljóðs bið ek allar',
  'helgar kindir',
  'meiri ok minni',
  'mögu Heimdallar',
  'viltu at ek Valföðr',
  'vel fyr telja',
  'forn spjöll fira',
  'þau er fremst of man.'],
 ['Ek man jötna',
  'ár of borna',
  'þá er forðum mik',
  'fædda höfðu',
  'níu man ek heima',
  'níu íviðjur',
  'mjötvið mæran',
  'fyr mold neðan.'],
 ['Ár var alda',
  'þat er ekki var',
  'vara sandr né sær',
  'né svalar unnir',
  'jörð fannsk æva',
  'né upphiminn',
  'gap var ginnunga',
  'en gras hvergi.']]
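
Pairing each delimiter with the next one leaves out the stanza that follows the last number. A possible variant, shown here on a two-stanza sample so that it runs on its own, uses the end of the text as a final boundary. The helper name `split_stanzas` and the variable `sample` are illustrative and not part of CLTK, and the cleaning step is reduced to `strip()` for brevity.

```python
import re

def split_stanzas(text):
    """Split a numbered poem into stanzas, keeping the last one."""
    delimiters = [(m.start(), m.end()) for m in re.finditer(r"[0-9]{1,2}\.", text)]
    # Each stanza runs from the end of its number to the start of the next
    # number; the last stanza runs to the end of the text.
    starts = [end for _, end in delimiters]
    stops = [start for start, _ in delimiters[1:]] + [len(text)]
    stanzas = []
    for start, stop in zip(starts, stops):
        # strip() also removes non-breaking spaces ("\xa0")
        lines = [line.strip() for line in text[start:stop].split("\n")]
        stanzas.append([line for line in lines if line])
    return stanzas

sample = "1.\nHljóðs bið ek allar\nhelgar kindir,\n\xa0\n2.\nEk man jötna\nár of borna,\n"
stanzas = split_stanzas(sample)
print(len(stanzas))
print(stanzas[-1])
```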

#### Use the phonology module to transcribe the text phonetically

In [6]:
from cltk.phonology import utils as phu
from cltk.phonology.old_norse import transcription as ont 
tr = phu.Transcriber(ont.DIPHTHONGS_IPA, ont.DIPHTHONGS_IPA_class, ont.IPA_class, ont.old_norse_rules)
for paragraph in paragraphs[:3]:
    for verse in paragraph:
        print(repr(verse)+"\t\t->\t"+tr.main(verse))

'Hljóðs bið ek allar'		->	[hljoːðs bið ɛk alːar]
'helgar kindir'		->	[hɛlɣar kindir]
'meiri ok minni'		->	[mɛiri ɔk minːi]
'mögu Heimdallar'		->	[mœɣu hɛimdalːar]
'viltu at ek Valföðr'		->	[viltu at ɛk valvœðr]
'vel fyr telja'		->	[vɛl fyr tɛlja]
'forn spjöll fira'		->	[fɔrn spjœlː fira]
'þau er fremst of man.'		->	[θɒu ɛr frɛmst ɔv man]
'Ek man jötna'		->	[ɛk man jœtna]
'ár of borna'		->	[aːr ɔv bɔrna]
'þá er forðum mik'		->	[θaː ɛr fɔrðum mik]
'fædda höfðu'		->	[fɛːdːa hœvðu]
'níu man ek heima'		->	[niːu man ɛk hɛima]
'níu íviðjur'		->	[niːu iːviðjur]
'mjötvið mæran'		->	[mjœtvið mɛːran]
'fyr mold neðan.'		->	[fyr mɔld nɛðan]
'Ár var alda'		->	[aːr var alda]
'þat er ekki var'		->	[θat ɛr ɛkːi var]
'vara sandr né sær'		->	[vara sandr neː sɛːr]
'né svalar unnir'		->	[neː svalar unːir]
'jörð fannsk æva'		->	[jœrð fanːsk ɛːva]
'né upphiminn'		->	[neː upːhiminː]
'gap var ginnunga'		->	[gap var ginːunɣa]
'en gras hvergi.'		->	[ɛn gras hvɛrɣi]


### Show alliterations in the text
Alliteration is the main stylistic device of Old Norse poetry.

In [7]:
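import re
from collections import Counter

for paragraph in paragraphs[:3]:
    for verse in paragraph:
        ipa_verse = tr.main(verse)
        # Count every character of the transcription, spaces excluded
        print(Counter(re.sub(" ", "", ipa_verse)))
        # Count pairs of adjacent characters inside words, skipping the opening bracket
        print(Counter([ipa_verse[i:i+2] for i in range(1,len(ipa_verse)-1) if " " not in ipa_verse[i:i+2]]))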

Counter({'l': 2, 'ː': 2, 'ð': 2, 'a': 2, '[': 1, 'h': 1, 'j': 1, 'o': 1, 's': 1, 'b': 1, 'i': 1, 'ɛ': 1, 'k': 1, 'r': 1, ']': 1})
Counter({'hl': 1, 'lj': 1, 'jo': 1, 'oː': 1, 'ːð': 1, 'ðs': 1, 'bi': 1, 'ið': 1, 'ɛk': 1, 'al': 1, 'lː': 1, 'ːa': 1, 'ar': 1, 'r]': 1})
Counter({'r': 2, 'i': 2, '[': 1, 'h': 1, 'ɛ': 1, 'l': 1, 'ɣ': 1, 'a': 1, 'k': 1, 'n': 1, 'd': 1, ']': 1})
Counter({'hɛ': 1, 'ɛl': 1, 'lɣ': 1, 'ɣa': 1, 'ar': 1, 'ki': 1, 'in': 1, 'nd': 1, 'di': 1, 'ir': 1, 'r]': 1})
Counter({'i': 4, 'm': 2, '[': 1, 'ɛ': 1, 'r': 1, 'ɔ': 1, 'k': 1, 'n': 1, 'ː': 1, ']': 1})
Counter({'mɛ': 1, 'ɛi': 1, 'ir': 1, 'ri': 1, 'ɔk': 1, 'mi': 1, 'in': 1, 'nː': 1, 'ːi': 1, 'i]': 1})
Counter({'m': 2, 'a': 2, '[': 1, 'œ': 1, 'ɣ': 1, 'u': 1, 'h': 1, 'ɛ': 1, 'i': 1, 'd': 1, 'l': 1, 'ː': 1, 'r': 1, ']': 1})
Counter({'mœ': 1, 'œɣ': 1, 'ɣu': 1, 'hɛ': 1, 'ɛi': 1, 'im': 1, 'md': 1, 'da': 1, 'al': 1, 'lː': 1, 'ːa': 1, 'ar': 1, 'r]': 1})
Counter({'v': 3, 'l': 2, 't': 2, 'a': 2, '[': 1, 'i': 1, 'u': 1, 'ɛ': 1, 'k': 1,

A better way to find alliteration is to look for approximate matches rather than exact ones. We can compare common consonant features such as "bilabial stop" or "labio-dental fricative".
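
The idea of matching by consonant features could be sketched as follows. The feature table `FEATURE_CLASSES`, the vowel set and both helper functions are assumptions made for this illustration and are not provided by CLTK. The sketch only looks at the first sound of each word in one transcribed verse, and treats all vowels as one class, since in Germanic alliterative verse any vowel can alliterate with any other.

```python
from collections import Counter

# Coarse, hand-made feature classes for word-initial sounds
# (an assumption for this sketch, not a CLTK table)
FEATURE_CLASSES = {
    "bilabial stop": "pb",
    "labio-dental fricative": "fv",
    "dental fricative": "θð",
    "alveolar stop": "td",
    "velar stop": "kg",
}
VOWELS = "aeiouyɛɔœøɒ"

def initial_class(word):
    """Return the feature class of the first sound of an IPA word."""
    first = word[0]
    if first in VOWELS:
        return "vowel"
    for name, sounds in FEATURE_CLASSES.items():
        if first in sounds:
            return name
    return first  # any other consonant only matches itself

def approximate_alliterations(ipa_verse):
    """Keep the feature classes shared by at least two word onsets."""
    words = ipa_verse.strip("[]").split()
    counts = Counter(initial_class(word) for word in words)
    return {name: n for name, n in counts.items() if n > 1}

# Transcriptions copied from the output above
print(approximate_alliterations("[vɛl fyr tɛlja]"))
print(approximate_alliterations("[hɛlɣar kindir]"))
```

With this table, `v` and `f` in the first verse fall into the same class although the sounds differ, which an exact character count would miss. Whether such approximate matches are philologically meaningful remains to be checked against the text.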

### Show vocabulary

The most frequent words of the poem are listed with their part-of-speech tags.

In [8]:
from cltk.utils.frequency import Frequency
from cltk.tag.pos import POSTag

tagger = POSTag('old_norse')

frq = Frequency()

text = " ".join([verse for paragraph in paragraphs for verse in paragraph])
times = frq.counter_from_str(text)
common_words = times.most_common(15)
for word in common_words:
    print(word[0], tagger.tag_tnt(word[0])[0][1])


ok N-N
er C
á P
of P
í P
en CONJ
hon Unk
ek PRO-N
um P
at C
var BEDI
né CONJ
enn ADV
eða CONJ
sá D-N


Even in a highly inflected language, there are many small words, which we may call *stop words*, that occur everywhere.
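
To look past these small words, the frequency count can be restricted to the remaining vocabulary. The sketch below uses only the standard library; the set `STOP_WORDS` is a small illustrative list copied from the frequent words printed above, not an authoritative stop list, and `content_word_counts` is a hypothetical helper.

```python
from collections import Counter

# Illustrative stop list built from the 15 most frequent words shown above
STOP_WORDS = {"ok", "er", "á", "of", "í", "en", "hon", "ek", "um", "at",
              "var", "né", "enn", "eða", "sá"}

def content_word_counts(verses, stop_words=STOP_WORDS):
    """Count the words of the verses, ignoring case and stop words."""
    words = [word.lower() for verse in verses for word in verse.split()]
    return Counter(word for word in words if word not in stop_words)

sample_verses = ["Hljóðs bið ek allar", "helgar kindir", "meiri ok minni"]
counts = content_word_counts(sample_verses)
print(counts.most_common(3))
```

In the notebook, `sample_verses` would be replaced by the flattened list of cleaned verses.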

### Show syntax

The syntax of a poem is more concise than that of a saga. Eddic poems obey strict rules, such as the one that requires 4 syllables in a verse. Yet this rule is not always respected. Are there any obstacles to writing more or fewer than 4 syllables? In which situations do the irregularities occur?

In [9]:

from cltk.tag.pos import POSTag

tagger = POSTag('old_norse')


text = [verse for paragraph in paragraphs for verse in paragraph]

for sentence in text[:5]:
    words = sentence.split(" ")
    tagged_words = [tagger.tag_tnt(word) for word in words]
    print(" ".join([word+"|"+tag[0][1] for word, tag in zip(words, tagged_words)]))

Hljóðs|Unk bið|VBPI ek|PRO-N allar|Q-A
helgar|ADJ-A kindir|Unk
meiri|QR-N ok|N-N minni|QR-N
mögu|Unk Heimdallar|Unk
viltu|Unk at|C ek|PRO-N Valföðr|Unk
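
A first look at the four-syllable question raised in this section can be had by counting syllable nuclei in each verse. This is a rough sketch, not a real syllabifier: it assumes that every vowel letter is one syllable and that `au`, `ei` and `ey` are diphthongs counting as one. The names `NUCLEUS` and `count_syllables` are introduced here for illustration.

```python
import re

# A syllable nucleus is approximated as a diphthong (au, ei, ey) or a
# single vowel letter; this is a simplification made for this sketch
NUCLEUS = re.compile(r"au|ei|ey|[aáeéiíoóuúyýæœøöǫ]", re.IGNORECASE)

def count_syllables(verse):
    """Return the approximate number of syllables of a verse."""
    return len(NUCLEUS.findall(verse))

sample_verses = ["Hljóðs bið ek allar", "helgar kindir", "meiri ok minni", "Ár var alda"]
for verse in sample_verses:
    n = count_syllables(verse)
    print(n, verse, "" if n == 4 else "<- not 4 syllables")
```

Combined with the POS tags above, such counts could help to see which kinds of words appear in the verses that depart from four syllables.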


### Future tasks
* annotate Völuspá
    * with POS tags
    * with syllabified words
    * with tokenized words
    * with phonetic transcription


By Clément Besnier, email address: clemsciences@aol.com, web site: https://clementbesnier.pythonanywhere.com/, twitter: clemsciences