# Dating texts about the discovery of America by Vikings

### Context 


The Norse discovery of America is now an established fact. Archaeological evidence shows that Vikings were present at *L’Anse aux Meadows*, in present-day Canada. (https://www.erudit.org/en/journals/nflds/2003-v19-n1-nflds_19_1/nflds19_1art02/)

See also the *Nature* article [Evidence for European presence in the Americas in AD 1021](https://rdcu.be/cCdv8) (https://doi.org/10.1038/s41586-021-03972-8 or https://www.nature.com/articles/s41586-021-03972-8).



In [1]:
import cltk

In [2]:
print(cltk.__version__)

cltk 1.0.21


## Text retrieval

First, we need to retrieve the texts we want to analyse.

In [3]:
import codecs

def load_file(filename):
    """Read a UTF-8 text file and return its content as a string."""
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()

def load_clean_text(filename):
    """Load a text, drop double quotes, turn hyphens into spaces and lowercase it."""
    text = load_file(filename)
    text = text.replace('"', "")
    text = text.replace("-", " ")
    text = text.lower()
    return text

ESR = load_clean_text("eiriks_saga_rauda.txt")
GS = load_clean_text("grænlendiga_saga.txt")

In [4]:
print(f"{ESR[:500]}\n\n===================\n\n{GS[:500]}")

1. frá auði djúpúðgu ok vífli.

óláfr hét herkonungr, er kallaðr var óláfr hvíti. hann var sonr ingjalds konungs helgasonar, óláfssonar, guðröðarsonar, hálfdanarsonar hvítbeins upplendingakonungs.
óláfr herjaði í vestrvíking ok vann dyflinni á írlandi ok dyflinnarskíri. þar gerðist hann konungr yfir. hann fekk auðar djúpúðgu, dóttur ketils flatnefs, bjarnarsonar bunu, ágæts manns ór nóregi. þorsteinn rauðr hét sonr þeira.
óláfr fell á írlandi í orrostu, en auðr ok þorsteinn fóru þá í suðreyjar. 


1. fundit ok byggt grænland.

þorvaldr hét maðr, sonr ásvalds úlfssonar, öxna þórissonar. þorvaldr ok eiríkr inn rauði, sonr hans, fóru af jaðri til íslands fyrir víga sakir. þá var víða byggt ísland. þeir bjuggu fyrst at dröngum á hornströndum. þar andaðist þorvaldr.
eiríkr fekk þá þjóðhildar, dóttur jörundar úlfssonar ok þorbjargar knarrarbringu, er þá átti þorbjörn inn haukdælski. réðst eiríkr þá norðan ok bjó á eiríksstöðum hjá vatnshorni. sonr eiríks ok þjóðhildar hét leifr.
enn efti

One way to estimate the date of writing of a text is by using linguistic features.

[https://timarit.is/page/7340957](https://timarit.is/page/7340957)

> The of/um particle can be used as a dating criterium only when the material for analysis is extensive.
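
As a rough illustration of this criterion, the sketch below counts how often the tokens `of` and `um` occur per 1,000 tokens. It is only a first approximation: both forms are also ordinary prepositions in Old Norse, so a raw count overstates the number of expletive particles, and separating the two uses would need POS or syntactic annotation. The helper name `particle_rate` and the sample sentence are assumptions made for the example; the sample is made up and is not a quotation from either saga.

```python
import re
from collections import Counter

# Candidate forms of the particle; they are homographs of prepositions,
# so these counts are an upper bound, not a particle count.
PARTICLES = ("of", "um")

def particle_rate(text, per=1000):
    """Return occurrences of each candidate form per `per` tokens."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {p: 0.0 for p in PARTICLES}
    return {p: per * counts[p] / total for p in PARTICLES}

# Hypothetical sample sentence, used only to exercise the function.
sample = "hann fór um landit ok sigldi um haf"
rates = particle_rate(sample)
print(rates)
```

In this notebook the same function could be applied to the cleaned strings, for instance `particle_rate(ESR)` and `particle_rate(GS)`.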

In [5]:
from cltk import NLP

In [6]:
# help(NLP)

In [7]:
from cltk.tokenizers.processes import OldNorseTokenizationProcess
from cltk.stops.processes import StopsProcess
from cltk.text.processes import OldNorsePunctuationRemovalProcess


In [8]:
#help(StopsProcess)

In [9]:
from cltk.core.data_types import Pipeline

In [10]:
non_pipeline_1 = Pipeline(language="non", description="", processes=[OldNorseTokenizationProcess, StopsProcess])

In [11]:
non_nlp_1 = NLP("non", custom_pipeline=non_pipeline_1)

‎𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `StopsProcess`.


In [12]:
ESR_analysed_1 = non_nlp_1.analyze(ESR)

In [13]:
ESR_analysed_1.sentences_strings[0][:100]  # sentence splitting does not work yet: the whole text is returned as one sentence

'1 . frá auði djúpúðgu ok vífli . óláfr hét herkonungr , er kallaðr var óláfr hvíti . hann var sonr i'

In [14]:
len(ESR_analysed_1.sentences)

1
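
A count of 1 means that the whole saga ended up in a single sentence: this pipeline contains no sentence-splitting step. What such a step has to do can be sketched with the standard library alone, by cutting after each `.`, `!` or `?` that is followed by whitespace. This is an illustration and not CLTK's implementation; `split_sentences` is a hypothetical helper, and the sample is the opening of the ESR text as printed earlier.

```python
import re

def split_sentences(text, end_chars=".!?"):
    """Split after any sentence-final character followed by whitespace."""
    pattern = r"(?<=[" + re.escape(end_chars) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]

sample = (
    "1. frá auði djúpúðgu ok vífli. "
    "óláfr hét herkonungr, er kallaðr var óláfr hvíti."
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

Note that the chapter number `1.` becomes a sentence of its own, a limitation of any splitter that relies on punctuation only.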

In [15]:
from cltk.sentence.sentence import RegexSentenceTokenizer

sent_end_chars = [".", "!", "?"]


class OldNorseRegexSentenceTokenizer(RegexSentenceTokenizer):
    """``RegexSentenceTokenizer`` for Old Norse."""

    def __init__(self):
        super().__init__(language="non", sent_end_chars=sent_end_chars)


In [16]:
from copy import deepcopy
from dataclasses import dataclass

from boltons.cacheutils import cachedproperty

from cltk.core import CLTKException
from cltk.core.data_types import Doc, Process
from cltk.sentence.sentence import SentenceTokenizer

@dataclass
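class SentenceTokenizationProcess(Process):
    """Base process that assigns a sentence index to every word of a ``Doc``.

    Subclasses provide the language-specific tokenizer through ``algorithm``.
    """

    model: object = None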

    @cachedproperty
    def algorithm(self):
        raise CLTKException(f"No sentence tokenization algorithm for language '{self.language}'.")

    def run(self, input_doc: Doc) -> Doc:
        output_doc = deepcopy(input_doc)
        sentence_tokenizer = self.algorithm
        if not isinstance(sentence_tokenizer, SentenceTokenizer):
            raise CLTKException("Algorithm must be an instance of SentenceTokenizer subclass")

        sentences = sentence_tokenizer.tokenize(output_doc.raw, self.model)
        sentence_indices = []
        for i, sentence in enumerate(sentences):
            if i >= 1:
                sentence_indices.append(sentence_indices[-1] + len(sentences[i]))
            else:
                sentence_indices.append(len(sentence))
        sentence_index = 0
        for j, word in enumerate(output_doc.words):
            if sentence_indices[sentence_index] < word.index_char_stop and\
                    sentence_index + 1 < len(sentence_indices):
                sentence_index += 1
            word.index_sentence = sentence_index
        return output_doc


@dataclass
class OldNorseSentenceTokenizationProcess(SentenceTokenizationProcess):

    @cachedproperty
    def algorithm(self):
        return OldNorseRegexSentenceTokenizer()


In [17]:
non_pipeline_2 = Pipeline(language="non", description="", 
                          processes=[OldNorseTokenizationProcess, 
                                     OldNorseSentenceTokenizationProcess,
                                     OldNorsePunctuationRemovalProcess,
                                     StopsProcess])

In [18]:
non_nlp_2 = NLP("non", custom_pipeline=non_pipeline_2)

‎𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `OldNorseSentenceTokenizationProcess`, `OldNorsePunctuationRemovalProcess`, `StopsProcess`.


In [19]:
ESR_analysed_2 = non_nlp_2.analyze(ESR.lower())

In [20]:
GS_analysed_2 = non_nlp_2.analyze(GS.lower())

In [21]:
len(ESR_analysed_2.sentences)

587

In [22]:
len(GS_analysed_2.sentences)

413

### Characterisation of the topic

In [23]:
print([(word.string, word.stop) for word in ESR_analysed_2.sentences[1].words])

[('frá', True), ('auði', False), ('djúpúðgu', False), ('ok', True), ('vífli', False)]


In [24]:
# from cltk.alphabet.non import 

In [25]:
custom_stop_list = ["var", "hann", "þar", "hon", "váru", "af", "ek", "svá", "eigi", "nú", "hafði", "honum", 
                    "hafa", "henni", "þér", "höfðu", "mun", "hans", "sér", "eftir", "vera", "ekki", "mér", 
                    "þú", "aftr", "einn", "hana", "sitt", "haf", "vér", "sínum", "hennar", "sínu", "þaðan", 
                    "allt", "sinn", "hvat", "sama"]



In [26]:
common_verbs = ["hét", "fór", "kom", "tók", "fara", "mælti", "sjá", "kvað", "þótti", "fóru", 
                "átti", "sagði", "bjó", "kómu", "kveðst", "verða", "segir", "leita", "sigla", 
                "vil", "segja", "svarar"]

In [27]:
def keep_informative_words(words):
    l = []
    for word in words:
        #if word.string not in GMH_PUNCTUATION and not word.stop:
        if not word.stop and word.string not in custom_stop_list and word.string not in common_verbs:
            l.append(word)
    return l

In [28]:
words_to_keep = []
words_to_keep_ESR = keep_informative_words(ESR_analysed_2.words)
words_to_keep_GS = keep_informative_words(GS_analysed_2.words)

In [29]:
len(words_to_keep_ESR), len(words_to_keep_GS)

(3712, 3025)

In [30]:
from collections import Counter

In [31]:
c_ESR = Counter([w.string for w in words_to_keep_ESR])
c_GS = Counter([w.string for w in words_to_keep_GS])

In [32]:
print(c_ESR.most_common(50))

[('menn', 40), ('karlsefni', 40), ('eiríkr', 32), ('þorbjörn', 25), ('þorsteinn', 24), ('maðr', 24), ('leifr', 22), ('landit', 17), ('guðríðr', 17), ('vetr', 16), ('verit', 16), ('væri', 15), ('skip', 14), ('ormr', 14), ('manna', 14), ('land', 13), ('bóndi', 13), ('sonr', 12), ('kona', 12), ('mönnum', 11), ('suðr', 11), ('sumar', 11), ('einarr', 11), ('vildi', 11), ('bjarni', 11), ('skrælingar', 11), ('dóttur', 10), ('grænlands', 10), ('brattahlíð', 10), ('mikill', 10), ('eitt', 10), ('hafi', 10), ('grænlandi', 10), ('þorkell', 10), ('öllum', 10), ('móti', 10), ('þórhallr', 10), ('fundu', 10), ('gerðist', 9), ('auðr', 9), ('skipi', 9), ('koma', 9), ('gekk', 9), ('mik', 9), ('tóku', 9), ('vetrinn', 9), ('íslands', 8), ('mundu', 8), ('föður', 8), ('hefir', 8)]


In [33]:
print(c_GS.most_common(50))

[('leifr', 33), ('skip', 30), ('karlsefni', 28), ('land', 25), ('þorsteinn', 24), ('menn', 21), ('landit', 20), ('guðríðr', 20), ('eiríkr', 18), ('maðr', 17), ('bjarni', 17), ('þorvaldr', 15), ('skipi', 15), ('manna', 14), ('vetr', 13), ('grænlandi', 13), ('freydís', 12), ('landi', 12), ('síns', 12), ('sína', 12), ('it', 11), ('mikil', 11), ('gera', 11), ('taka', 11), ('bað', 11), ('gekk', 11), ('sonr', 10), ('kona', 10), ('skips', 10), ('grænland', 9), ('skal', 9), ('stund', 9), ('bræðr', 9), ('bjuggu', 8), ('várit', 8), ('eiríksfjarðar', 8), ('fyrr', 8), ('grænlands', 8), ('þegar', 8), ('brátt', 8), ('allir', 8), ('ferð', 8), ('skála', 8), ('fekk', 7), ('landinu', 7), ('væri', 7), ('brattahlíð', 7), ('herjólfr', 7), ('sigldu', 7), ('þrjá', 7)]
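
The two lists are not directly comparable as raw counts, because the filtered texts differ in length (3712 words kept for ESR, 3025 for GS). A minimal sketch of a length-normalised comparison follows, using a few counts copied from the two outputs above; `per_thousand` is a hypothetical helper introduced for the example.

```python
# Sizes of the filtered word lists, as reported earlier in the notebook.
TOTAL_ESR = 3712
TOTAL_GS = 3025

# A handful of counts copied from the two most_common(50) outputs.
counts_esr = {"skip": 14, "land": 13, "leifr": 22, "karlsefni": 40}
counts_gs = {"skip": 30, "land": 25, "leifr": 33, "karlsefni": 28}

def per_thousand(count, total):
    """Relative frequency expressed per 1,000 kept words."""
    return 1000 * count / total

for word in counts_esr:
    rate_esr = per_thousand(counts_esr[word], TOTAL_ESR)
    rate_gs = per_thousand(counts_gs[word], TOTAL_GS)
    print(f"{word:<10} ESR {rate_esr:5.2f}  GS {rate_gs:5.2f}")
```

Within the notebook the same rates can be computed for every word from `c_ESR`, `c_GS` and the lengths of `words_to_keep_ESR` and `words_to_keep_GS`.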


In [34]:

def find_common_words(l1, l2):
    """Return the set of words that occur in both lists."""
    return set(l1).intersection(l2)

def find_rare_words(l1, l2):
    """Return the set of words that occur in exactly one of the two lists."""
    return set(l1).symmetric_difference(l2)
    

In [35]:
common_words = find_common_words([w.string for w in words_to_keep_ESR], [w.string for w in words_to_keep_GS])

In [36]:
len(common_words)

681

In [37]:
different_words = find_rare_words([w.string for w in words_to_keep_ESR], [w.string for w in words_to_keep_GS])

In [38]:
len(different_words)

2116
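
The two figures can be combined into a single overlap measure. The sketch below assumes that the Jaccard index (shared vocabulary divided by total distinct vocabulary) is an acceptable summary; it is not something the notebook computes itself. Since the words found in only one text and the words found in both together make up the whole vocabulary, the index follows directly from the two counts above.

```python
def jaccard(a, b):
    """Jaccard index of two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Counts reported by the notebook: shared words and words found in one text only.
n_common = 681
n_different = 2116
overlap = n_common / (n_common + n_different)
print(f"{overlap:.3f}")

# The same measure computed from word lists, on a toy example.
toy = jaccard(["skip", "land", "menn"], ["skip", "land", "vetr"])
print(toy)
```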

In [39]:
!ls

 annotated_pos_ESR.csv	    entries
 annotated_pos_GS.csv	    get_wiktionary_old_norse_lemma_pages.ipynb
 beowulf-comparison.ipynb   grænlendiga_saga.txt
 beowulf.txt		   'Hrólfs saga kraka ok kappa hans.txt'
 eiriks_saga_rauda.txt	    of_um_datation.ipynb


###  Loading POS annotations

In [40]:
with codecs.open("annotated_pos_ESR.csv", "r", encoding="utf-8") as f:
    pos_lines_ESR = [line.split("\t") for line in f.read().split("\n")]
    

In [41]:
with codecs.open("annotated_pos_GS.csv", "r", encoding="utf-8") as f:
    pos_lines_GS = [line.split("\t") for line in f.read().split("\n")]


In [42]:
PUNCTUATION = [".", ",", ";", "!", "?", ":", "\"", "'", "(", ")", "[", "]", "{", "}", "--", ".]", "-"]

In [43]:
ESR_analysed_2.tokens[:5]

['1', 'frá', 'auði', 'djúpúðgu', 'ok']

### Aligning tags with text

In [51]:
i = 0
j = 0
cltk_words = ESR_analysed_2.words
m = len(cltk_words)
n = len(pos_lines_ESR)
while i < m and j < n:
    word_from_pos_annotated = pos_lines_ESR[j][0]
    if not word_from_pos_annotated or word_from_pos_annotated in PUNCTUATION:
        # print(f"removing {word_from_pos_annotated}")
        j += 1
    elif cltk_words[i].string in PUNCTUATION:
        # print(f"removing {cltk_words[i].string}")
        i += 1
    else:
        if word_from_pos_annotated.lower() != cltk_words[i].string:
            print("mismatch")
            print(f"Aligned {cltk_words[i].string}-{word_from_pos_annotated.lower()} equal={word_from_pos_annotated.lower() == cltk_words[i].string}")
        cltk_words[i].pos = pos_lines_ESR[j][1]
        i += 1
        j += 1

mismatch
Aligned [svá-svá equal=False
mismatch
Aligned heldr'-heldr equal=False
mismatch
Aligned s-'s equal=False


In [48]:
for word in cltk_words:
    if word.pos in ["NC", "NP"]:
        print(word.string)

frá
auði
vífli
óláfr
herkonungr
óláfr
hann
sonr
ingjalds
konungs
helgasonar
óláfssonar
guðröðarsonar
hvítbeins
óláfr
vestrvíking
írlandi
þar
konungr
hann
auðar
dóttur
manns
þorsteinn
sonr
óláfr
írlandi
orrostu
auðr
suðreyjar
þar
dóttur
systur
þau
börn
hann
lags
sigurði
jarli
glumru
þeir
katanes
suðrland
ross
skotland
gerðist
konungr
orrostu
katanesi
fall
þorsteins
hon
knörr
skógi
laun
orkneyjar
þar
gró
dóttur
þorsteins
hon
móðir
þorfinnr
jarl
hausakljúfr
hon
skipi
karla
auðr
íslands
vetr
bjarnarhöfn
birni
bróður
síðan
dalalönd
dögurðarár
hon
hvammi
hon
krosshólum
þar
krossa
trúuð
með
menn
herteknir
vestrvíking
hann
maðr
haf
ok
skipverjum
bústað
mönnum
auðr
skipta
göfgan
hon
vífilsdal
hann
konu
synir
þorbjörn
þeir
menn
óxu
föður
eiríkr
maðr
hann
sonr
þórissonar
sonr
feðgar
jaðri
sakar
námu
land
hornströndum
bjuggu
þar
andaðist
eiríkr
þjóðhildar
dóttur
úlfssonar
knarrarbringu
þorbjörn
réðst
land
haukadal
eiríksstöðum
vatnshorni
felldu
eiríks
skriðu
valþjófs
valþjófsstöðum
eyjólfr
frændi


þeir
bát
seltjöru
þeir
bátinn
vinnast
þá
bjarni
bátrinn
helming
manna
ráð
menn
hlutaðir
bátinn
mannvirðingu
þetta
mæla
mennina
bátinn
helmingr
manna
bátrinn
bátinn
maðr
skipinu
bjarna
íslandi
skiljast
föður
íslandi
skiljast
eigi
bátinn
skipit
fjörsins
bjarni
skipit
maðr
bátinn
leiðar
dyflinnar
írlandi
sögu
manna
bjarni
menn
skipinu
maðksjónum
14
karlsefni
íslands
sumar
íslands
kostar
vetr
guðríðr
kvenskörungr
samfarar
snorra
karlsefnissonar
hallfríðr
móðir
byskups
runólfssonar
þau
son
dóttir
þórunn
móðir
bjarnar
byskups
þorgeirr
sonr
snorra
karlsefnissonar
faðir
móður
byskups
sögu


In [None]:
for i, word in enumerate(GS_analysed_2.words):
    word.pos = pos_lines_GS[i][1]
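
The index-by-index assignment in this cell only works if the token list and the annotation file have exactly the same length and order. If the file contains punctuation rows or empty lines, the tags drift or the loop raises `IndexError`. Below is a hedged sketch of the two-pointer alignment used for ESR, wrapped in a reusable function. `align_tags` is a hypothetical name, plain strings and `(form, tag)` rows stand in for CLTK `Word` objects and CSV lines, and every tag other than `NP` is a placeholder rather than a value from the real annotation.

```python
PUNCTUATION = {".", ",", ";", "!", "?", ":", '"', "'", "(", ")", "[", "]"}

def align_tags(tokens, annotated_rows, punctuation=PUNCTUATION):
    """Pair each token with a tag, skipping punctuation and empty rows.

    Returns (pairs, mismatches): pairs is a list of (token, tag) tuples,
    mismatches lists the (token, annotated_form) pairs whose forms differ.
    """
    pairs, mismatches = [], []
    i = j = 0
    while i < len(tokens) and j < len(annotated_rows):
        row = annotated_rows[j]
        form = row[0] if row else ""
        if not form or form in punctuation or len(row) < 2:
            j += 1  # empty, punctuation or malformed annotation row
        elif tokens[i] in punctuation:
            i += 1  # punctuation token without annotation
        else:
            if form.lower() != tokens[i]:
                mismatches.append((tokens[i], form.lower()))
            pairs.append((tokens[i], row[1]))
            i += 1
            j += 1
    return pairs, mismatches

# Toy data: placeholder tags, one empty row and one punctuation row.
tokens = ["frá", "auði", ",", "ok", "vífli"]
rows = [("Frá", "X1"), ("auði", "NP"), ("", ""), ("ok", "X2"), (".", "X3"), ("vífli", "NP")]
pairs, mismatches = align_tags(tokens, rows)
print(pairs)
print(mismatches)
```

Collecting the mismatches in a list instead of printing them makes it easier to check how far the two sources diverge before trusting the tags.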
    

In [None]:
s = set()
for i in words_to_keep:
    s.add(i.string)

In [None]:
len(s)

In [None]:
from cltk.tag.pos import POSTag

In [None]:
# POSTag(language="non")

In [None]:
POSTag(language="lat")