# Exploring the Viking discovery of America with CLTK

## Introduction
This presentation aims:
- to show how to use CLTK in practice,
- to explain how CLTK works,
- to explore classical texts with CLTK.

We analyse two texts here, *The saga of Erik the Red* (Eiríks saga rauða, or **ESR**) and *The saga of the Greenlanders* (Grœnlendinga saga, or **GS**), in the original language, namely Old Norse. The sagas tell how Norwegians and Icelanders colonized Greenland and found new lands in what is now America.


### Context

#### Archeology
The Norse discovery of America is now an established fact. Archaeological evidence shows the presence of Vikings at *L’Anse aux Meadows* in present-day Canada (see [The Norse in Newfoundland](https://www.erudit.org/en/journals/nflds/2003-v19-n1-nflds_19_1/nflds19_1art02/) and the Nature article [Evidence for European presence in the Americas in AD 1021](https://doi.org/10.1038/s41586-021-03972-8)).

#### Interpretations
Scholars interpret **ESR** as a more logical story, reworked by its author, whereas **GS** is supposedly a rawer transcription of an oral tradition. Which one came first is still debated.




## Installing CLTK 

CLTK works on POSIX systems: **Linux**, **macOS**, and **Windows** with **WSL** (Windows Subsystem for Linux).

It is good practice to work in a Python virtual environment, so create one for this project.

```bash
$ python3 -m venv cltk_venv
$ source cltk_venv/bin/activate
(cltk_venv) $ pip install cltk 
```
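
To confirm that the Python interpreter really runs inside the virtual environment, a quick check such as the following may help. It is a minimal sketch relying only on the standard library; the helper name `in_virtualenv` is ours and is not part of CLTK.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(in_virtualenv())
```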


In [1]:
import cltk

In [2]:
print(cltk.__version__)

cltk 1.0.21


## Text retrieval
First, we need to retrieve the texts we want to analyse.

The texts come from [heimskringla.no](http://heimskringla.no/wiki/Forside).

In [3]:
import codecs

def load_file(filename):
    with codecs.open(filename, encoding="utf-8") as f:
        
        return f.read()
    
def load_clean_text(filename):
    text = load_file(filename)
    text = text.replace('"', "")
    text = text.replace("-", " ")
    text = text.lower()
    return text

ESR = load_clean_text("eiriks_saga_rauda.txt")    
GS = load_clean_text("grænlendiga_saga.txt")
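
To see what the cleaning steps do without needing the saga files, the same three transformations can be applied to a short string. This is only a sketch: `clean_text` is a hypothetical helper mirroring `load_clean_text` without the file access, and the sample string is made up for illustration.

```python
def clean_text(text: str) -> str:
    """Apply the same cleaning steps as load_clean_text, on an in-memory string."""
    text = text.replace('"', "")   # drop double quotes
    text = text.replace("-", " ")  # treat hyphens as word separators
    return text.lower()            # normalise case


# Illustrative sample, not an exact quotation from the sagas.
sample = 'Óláfr hét "herkonungr", er kallaðr var Óláfr-hvíti.'
cleaned = clean_text(sample)
print(cleaned)
```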

In [4]:
print(f"{ESR[:500]}\n\n===================\n\n{GS[:500]}")

1. frá auði djúpúðgu ok vífli.

óláfr hét herkonungr, er kallaðr var óláfr hvíti. hann var sonr ingjalds konungs helgasonar, óláfssonar, guðröðarsonar, hálfdanarsonar hvítbeins upplendingakonungs.
óláfr herjaði í vestrvíking ok vann dyflinni á írlandi ok dyflinnarskíri. þar gerðist hann konungr yfir. hann fekk auðar djúpúðgu, dóttur ketils flatnefs, bjarnarsonar bunu, ágæts manns ór nóregi. þorsteinn rauðr hét sonr þeira.
óláfr fell á írlandi í orrostu, en auðr ok þorsteinn fóru þá í suðreyj


1. fundit ok byggt grænland.

þorvaldr hét maðr, sonr ásvalds úlfssonar, öxna þórissonar. þorvaldr ok eiríkr inn rauði, sonr hans, fóru af jaðri til íslands fyrir víga sakir. þá var víða byggt ísland. þeir bjuggu fyrst at dröngum á hornströndum. þar andaðist þorvaldr.
eiríkr fekk þá þjóðhildar, dóttur jörundar úlfssonar ok þorbjargar knarrarbringu, er þá átti þorbjörn inn haukdælski. réðst eiríkr þá norðan ok bjó á eiríksstöðum hjá vatnshorni. sonr eiríks ok þjóðhildar hét leifr.
enn efti

## CLTK pipeline

![CLTK pipeline](cltk-Page-2.jpg)

If everything is available in CLTK, you only need to import `NLP`.

In [5]:
from cltk import NLP

The default pipeline for Old Norse:

In [6]:
non_nlp_default = NLP("non")

‎𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `StopsProcess`, `OldNorseLexiconProcess`.


In [7]:
non_nlp_default = NLP("non", suppress_banner=True)

Which processes are there? The answer is in the banner or in an attribute.

In [8]:
non_nlp_default.pipeline.processes

[cltk.tokenizers.processes.OldNorseTokenizationProcess,
 cltk.stops.processes.StopsProcess,
 cltk.lexicon.processes.OldNorseLexiconProcess]

The pipeline contains a tokenization process for Old Norse, a stop process (a way to filter uninformative words) and a lexicon process. For our task, we do not need the lexicon process.
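
The idea of a pipeline, a list of processes applied one after another to a document, can be sketched with the standard library alone. This is a toy model and not CLTK's implementation: `ToyDoc`, `tokenize`, `drop_stops`, `run_pipeline` and `TOY_STOPS` are hypothetical names, and the toy version simply drops stop words for brevity.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple

TOY_STOPS = {"ok", "var"}  # tiny illustrative stop list, not CLTK's


@dataclass(frozen=True)
class ToyDoc:
    raw: str
    tokens: Tuple[str, ...] = ()


def tokenize(doc: ToyDoc) -> ToyDoc:
    # Whitespace tokenization stands in for a real tokenizer.
    return replace(doc, tokens=tuple(doc.raw.split()))


def drop_stops(doc: ToyDoc) -> ToyDoc:
    # Keep only the tokens that are not in the toy stop list.
    return replace(doc, tokens=tuple(t for t in doc.tokens if t not in TOY_STOPS))


def run_pipeline(raw: str, processes: Sequence[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    # Each process receives the document produced by the previous one.
    doc = ToyDoc(raw=raw)
    for process in processes:
        doc = process(doc)
    return doc


result = run_pipeline("óláfr hét herkonungr ok var konungr", [tokenize, drop_stops])
print(result.tokens)
```

The order of the processes matters: `drop_stops` needs the tokens produced by `tokenize`, just as later processes in a real pipeline depend on earlier ones.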

### Custom pipeline for Old Norse

To customize our pipeline, we have to import processes.

In [9]:
from cltk.tokenizers.processes import OldNorseTokenizationProcess
from cltk.stops.processes import StopsProcess
from cltk.text.processes import OldNorsePunctuationRemovalProcess


In [10]:
from cltk.core.data_types import Pipeline

In [11]:
non_pipeline_custom_1 = Pipeline(language="non", description="", processes=[OldNorseTokenizationProcess, StopsProcess])

In [12]:
non_nlp_custom_1 = NLP("non", custom_pipeline=non_pipeline_custom_1)

‎𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `StopsProcess`.


In [13]:
ESR_analysed_1 = non_nlp_custom_1.analyze(ESR)

In [14]:
ESR_analysed_1.sentences_strings[0][:100] # sentences are not split yet

'1 . frá auði djúpúðgu ok vífli . óláfr hét herkonungr , er kallaðr var óláfr hvíti . hann var sonr i'

In [15]:
len(ESR_analysed_1.sentences)

1

Hmmm, 1 sentence means that sentences were not recognized.

When the available processes are not enough, you can create one yourself.

### Creating a `Process` for CLTK

A process is a class that inherits from `Process` and that implements the `run` method.

In [16]:
from copy import deepcopy
from dataclasses import dataclass

from cltk.core import CLTKException
from cltk.core.data_types import Doc, Process
from cltk.sentence.sentence import RegexSentenceTokenizer

non_sent_end_chars = [".", "!", "?"]


@dataclass
class OldNorseSentenceTokenizationProcess(Process):

    model: object = None

    def run(self, input_doc: Doc) -> Doc:
        output_doc = deepcopy(input_doc)
        sentence_tokenizer = RegexSentenceTokenizer(language="non", sent_end_chars=non_sent_end_chars)

        sentences = sentence_tokenizer.tokenize(output_doc.raw, self.model)
        sentence_indices = []
        for i, sentence in enumerate(sentences):
            if i >= 1:
                sentence_indices.append(sentence_indices[-1] + len(sentences[i]))
            else:
                sentence_indices.append(len(sentence))
        sentence_index = 0
        for j, word in enumerate(output_doc.words):
            if sentence_indices[sentence_index] < word.index_char_stop and\
                    sentence_index + 1 < len(sentence_indices):
                sentence_index += 1
            word.index_sentence = sentence_index
        return output_doc
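
The index bookkeeping inside `run` can be tried out on plain Python data. The sketch below assumes that each word carries the character offset where it ends in the raw text, and that the sentence strings cover the raw text without gaps; `sentence_end_offsets` and `assign_sentences` are hypothetical helper names, not CLTK functions.

```python
from itertools import accumulate
from typing import List, Sequence


def sentence_end_offsets(sentences: Sequence[str]) -> List[int]:
    """Cumulative character offsets at which each sentence ends."""
    return list(accumulate(len(s) for s in sentences))


def assign_sentences(word_stops: Sequence[int], sentence_ends: Sequence[int]) -> List[int]:
    """Give each word (identified by its end offset) the index of its sentence."""
    indices = []
    current = 0
    for stop in word_stops:
        # Move to the next sentence once a word ends beyond the current one.
        if sentence_ends[current] < stop and current + 1 < len(sentence_ends):
            current += 1
        indices.append(current)
    return indices


sentences = ["frá auði.", " óláfr hét."]
ends = sentence_end_offsets(sentences)
# End offsets of the tokens: frá, auði, ".", óláfr, hét, "."
word_stops = [3, 8, 9, 15, 19, 20]
print(ends, assign_sentences(word_stops, ends))
```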

To be more language-agnostic, we can split the logic in two: the `run` method, which is the same for every language, stays in the parent class, while a language-specific `algorithm` property is defined in each child class.

The sentence tokenizer, which can also be used on its own:

In [17]:
from cltk.sentence.sentence import RegexSentenceTokenizer

sent_end_chars = [".", "!", "?"]


class OldNorseRegexSentenceTokenizer(RegexSentenceTokenizer):
    """``RegexSentenceTokenizer`` for Old Norse."""

    def __init__(self: object):
        super().__init__(language="non", sent_end_chars=sent_end_chars)


The generic sentence tokenization process and its Old Norse subclass:

In [18]:
from copy import deepcopy
from dataclasses import dataclass

from boltons.cacheutils import cachedproperty

from cltk.core import CLTKException
from cltk.core.data_types import Doc, Process
from cltk.sentence.sentence import SentenceTokenizer

@dataclass
class SentenceTokenizationProcess(Process):

    model: object = None

    @cachedproperty
    def algorithm(self):
        raise CLTKException(f"No sentence tokenization algorithm for language '{self.language}'.")

    def run(self, input_doc: Doc) -> Doc:
        output_doc = deepcopy(input_doc)
        sentence_tokenizer = self.algorithm
        if not isinstance(sentence_tokenizer, SentenceTokenizer):
            raise CLTKException("Algorithm must be an instance of SentenceTokenizer subclass")

        sentences = sentence_tokenizer.tokenize(output_doc.raw, self.model)
        sentence_indices = []
        for i, sentence in enumerate(sentences):
            if i >= 1:
                sentence_indices.append(sentence_indices[-1] + len(sentences[i]))
            else:
                sentence_indices.append(len(sentence))
        sentence_index = 0
        for j, word in enumerate(output_doc.words):
            if sentence_indices[sentence_index] < word.index_char_stop and\
                    sentence_index + 1 < len(sentence_indices):
                sentence_index += 1
            word.index_sentence = sentence_index
        return output_doc


@dataclass
class OldNorseSentenceTokenizationProcess(SentenceTokenizationProcess):

    @cachedproperty
    def algorithm(self):
        return OldNorseRegexSentenceTokenizer()


We can now use it in our custom pipeline at the second position.

In [19]:
non_pipeline_2 = Pipeline(language="non", description="", 
                          processes=[OldNorseTokenizationProcess, 
                                     OldNorseSentenceTokenizationProcess,
                                     OldNorsePunctuationRemovalProcess,
                                     StopsProcess])

In [20]:
non_nlp_2 = NLP("non", custom_pipeline=non_pipeline_2)

‎𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `OldNorseSentenceTokenizationProcess`, `OldNorsePunctuationRemovalProcess`, `StopsProcess`.


Let's analyse our texts.

In [21]:
ESR_analysed_2 = non_nlp_2.analyze(ESR.lower())

In [22]:
GS_analysed_2 = non_nlp_2.analyze(GS.lower())

In [23]:
len(ESR_analysed_2.sentences), len(GS_analysed_2.sentences)

(587, 413)

Our texts are now split into sentences, as needed.

## Characterisation of the topic

The idea is to find differences in the vocabulary used by the two texts. A non-factual vocabulary may reveal assumptions made by the authors.

Some words tell us what a text is about, whereas others are only there for syntactic reasons. The least informative words are called *stop words*. CLTK has a stop-word removal process for Old Norse, but its list is not complete enough, which is why I extended it with a custom list.
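
One way to find candidates for additional stop words is to look at the most frequent tokens that are not yet in a stop list. The following is a minimal sketch with the standard library; `stop_word_candidates` is a hypothetical helper and the token list is a toy example, not saga data.

```python
from collections import Counter
from typing import Iterable, List, Set


def stop_word_candidates(tokens: Iterable[str], known_stops: Set[str], top_n: int = 5) -> List[str]:
    """Return the most frequent tokens that are not already known stop words."""
    counts = Counter(t for t in tokens if t and t not in known_stops)
    return [word for word, _ in counts.most_common(top_n)]


toy_tokens = "hann fór ok hann kom ok hann sá land ok skip".split()
candidates = stop_word_candidates(toy_tokens, known_stops={"ok"}, top_n=1)
print(candidates)
```

Frequent tokens are only candidates: a human still has to decide whether a word such as *hann* is uninformative for the question at hand.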

In [24]:
custom_stop_list = ["hann", "þar", "hon", "af", "ek", "svá", "eigi", "nú", "hafði", "honum", 
                    "hafa", "henni", "þér", "höfðu", "mun", "hans", "sér", "eftir", "vera", "ekki", "mér", 
                    "þú", "aftr", "hana", "sitt", "haf", "vér", "sínum", "hennar", "sínu", "þaðan", 
                    "allt", "sinn", "hvat", "sama", "eitt", "einn", "ein", "öllum", "öðrum", ""]



In [25]:
common_verbs = ["var", "hét", "váru", "fór", "kom", "tók", "fara", "mælti", "sjá", "kvað", "þótti", "fóru", 
                "átti", "sagði", "bjó", "kómu", "kveðst", "verða", "segir", "leita", "sigla", 
                "vil", "segja", "svarar", "gekk", "koma", "hefir", "tóku", "væri", "vildi", "gerðist", 
                "gera", "gaf", "kölluðu", "kalla", "myndi", "mundu", "vildu", "gengu", "eru", "skal", "ætla",
                "fekk", "lét", "svaraði", "þóttust", "farit", "sé", "vill", "sigldu", "vita", "taka", "mundi", 
                "varð", "fundu", "gerðu", "gerði", "láta", "halda", "bjuggu", "fann"]

### Loading POS annotations

There is currently no official POS tagger for Old Norse in CLTK. There is however one that has been trained with **spaCy** on **Menota** data. 

In [26]:
pos_tags = dict(NC="common noun",
     NP="proper noun",
     AJ="adjective",
     PE="personal pronoun",
     PI="indefinite pronoun",
     DP="possessive determiner",
     DD="demonstrative determiner",
     DQ="quantifier determiner",
     PD="pronoun/determiner",
     NA="cardinal determiner",
     NO="ordinal determiner",
     VB="verb",
     AV="adverb",
     AT="article",
     AP="preposition",
     CC="coordinating conjunction",
     CS="subordinating conjunction",
     IT="interjection",
     IM="infinitive marker",
     RP="relative particle",
     UA="unassigned"
    )

informative_pos_tags = ["NC", "NP", "AJ", "VB"]

In [27]:
with codecs.open("annotated_pos_ESR.csv", "r", encoding="utf-8") as f:
    pos_lines_ESR = [line.split("\t") for line in f.read().split("\n")]
    

In [28]:
with codecs.open("annotated_pos_GS.csv", "r", encoding="utf-8") as f:
    pos_lines_GS = [line.split("\t") for line in f.read().split("\n")]


In [29]:
PUNCTUATION = [".", ",", ";", "!", "?", ":", "\"", "'", "(", ")", "[", "]", "{", "}", "--", ".]", "-"]

In [30]:
ESR_analysed_2.tokens[:5]

['1', 'frá', 'auði', 'djúpúðgu', 'ok']

### Aligning tags with text

In [31]:
def align_annotations(cltk_doc, spacy_annotations):
    """
    Assign POS tags to words of a CLTK doc from spaCy annotations
    """
    cltk_words = cltk_doc.words
    i, j = 0, 0
    m = len(cltk_words)
    n = len(spacy_annotations)
    while i < m and j < n:
        spacy_token = spacy_annotations[j][0]
        if not spacy_token or spacy_token in PUNCTUATION:
            j += 1
        elif cltk_words[i].string in PUNCTUATION:
            i += 1
        else:
            #print(spacy_token.lower(), cltk_words[i].string)
            if spacy_token.lower() == cltk_words[i].string:
                
                cltk_words[i].xpos = spacy_annotations[j][1]
            else:
                cltk_words[i].xpos = "UA"
            i += 1
            j += 1
    
    cltk_doc.words = cltk_words
    return cltk_doc
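
The two-pointer alignment can be tried on a toy example without CLTK or spaCy. In this sketch, `ToyWord` is a hypothetical stand-in for a CLTK word, `align` mirrors the logic of `align_annotations`, and the annotation pairs are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

PUNCT = {".", ","}  # reduced punctuation set for the example


@dataclass
class ToyWord:
    string: str
    xpos: Optional[str] = None


def align(words: List[ToyWord], annotations: Sequence[Tuple[str, str]]) -> List[ToyWord]:
    """Walk both sequences in lockstep, skipping punctuation on either side."""
    i = j = 0
    while i < len(words) and j < len(annotations):
        token, tag = annotations[j]
        if not token or token in PUNCT:
            j += 1  # skip empty or punctuation annotations
        elif words[i].string in PUNCT:
            i += 1  # skip punctuation words
        else:
            # Matching forms get the tag, mismatches are marked unassigned.
            words[i].xpos = tag if token.lower() == words[i].string else "UA"
            i += 1
            j += 1
    return words


words = [ToyWord("óláfr"), ToyWord("hét"), ToyWord("herkonungr")]
annotations = [("Óláfr", "NP"), (".", ""), ("het", "VB"), ("herkonungr", "NC")]
tags = [w.xpos for w in align(words, annotations)]
print(tags)
```

A mismatch only affects the current word because both pointers still advance together. If the two tokenizations differ in the number of tokens, however, every following word may end up tagged `UA`.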

In [32]:
ESR_analysed_2 = align_annotations(ESR_analysed_2, pos_lines_ESR)

In [33]:
ESR_analysed_2.words[78].xpos

'PE'

### Filtering words

In [34]:
from collections import Counter

In [35]:
def kiw(cltk_word, pos_tags=None):
    """Tell whether a word is worth counting: it is not a stop word, not a
    common verb, and its POS tag belongs to pos_tags."""
    if pos_tags is None:
        pos_tags = informative_pos_tags
    return not cltk_word.stop and cltk_word.string and\
        cltk_word.string not in custom_stop_list and\
        cltk_word.string not in common_verbs and\
        cltk_word.xpos in pos_tags

In [36]:
a = [(word.string,word.xpos) for word in ESR_analysed_2.words if kiw(word)]
c_ESR = Counter([w[0] for w in a])
c_ESR.most_common(20)

[('menn', 40),
 ('eiríkr', 29),
 ('maðr', 24),
 ('þorbjörn', 24),
 ('karlsefni', 23),
 ('leifr', 22),
 ('landit', 17),
 ('vetr', 16),
 ('verit', 16),
 ('skip', 14),
 ('manna', 14),
 ('land', 13),
 ('bóndi', 13),
 ('sonr', 12),
 ('kona', 12),
 ('sumar', 11),
 ('guðríðr', 11),
 ('ormr', 11),
 ('bjarni', 11),
 ('dóttur', 10)]

In [37]:
GS_analysed_2 = align_annotations(GS_analysed_2, pos_lines_GS)

In [38]:
b = [(word.string,word.xpos) for word in GS_analysed_2.words if kiw(word)]
c_GS = Counter([w[0] for w in b])
c_GS.most_common(20)

[('leifr', 33),
 ('skip', 30),
 ('land', 25),
 ('menn', 21),
 ('landit', 20),
 ('maðr', 17),
 ('eiríkr', 16),
 ('skipi', 15),
 ('guðríðr', 15),
 ('manna', 14),
 ('þorvaldr', 13),
 ('vetr', 13),
 ('grænlandi', 13),
 ('bjarni', 13),
 ('landi', 12),
 ('karlsefni', 12),
 ('mikil', 11),
 ('bað', 11),
 ('sonr', 10),
 ('kona', 10)]
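
A first, rough comparison is to look at which words appear in one top-20 list but not in the other. The sketch below copies the counts printed above into plain dictionaries (`top_ESR` and `top_GS` are our names). Because the two sagas differ in length, raw counts are not directly comparable, so only membership is compared here.

```python
# Counts copied from the two most_common(20) outputs above.
top_ESR = {
    "menn": 40, "eiríkr": 29, "maðr": 24, "þorbjörn": 24, "karlsefni": 23,
    "leifr": 22, "landit": 17, "vetr": 16, "verit": 16, "skip": 14,
    "manna": 14, "land": 13, "bóndi": 13, "sonr": 12, "kona": 12,
    "sumar": 11, "guðríðr": 11, "ormr": 11, "bjarni": 11, "dóttur": 10,
}
top_GS = {
    "leifr": 33, "skip": 30, "land": 25, "menn": 21, "landit": 20,
    "maðr": 17, "eiríkr": 16, "skipi": 15, "guðríðr": 15, "manna": 14,
    "þorvaldr": 13, "vetr": 13, "grænlandi": 13, "bjarni": 13, "landi": 12,
    "karlsefni": 12, "mikil": 11, "bað": 11, "sonr": 10, "kona": 10,
}

# Words that make it into the top 20 of one saga only.
only_ESR = set(top_ESR) - set(top_GS)
only_GS = set(top_GS) - set(top_ESR)
shared = set(top_ESR) & set(top_GS)

print("Only in ESR top 20:", only_ESR)
print("Only in GS top 20:", only_GS)
print("Shared:", len(shared))
```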

## Dating the texts

One way to estimate when a text was written is to use linguistic features. In [of/um partikkelen som dateringskriterium for Eddakvada, Leiv Olsen, *Gripla XXXI*, 2020](https://timarit.is/page/7340957), the author studies the frequency of the *of* and *um* particles as a dating criterion: the particle is an archaic feature, so it tends to occur more often in older texts than in more recent ones.

> The of/um particle can be used as a dating criterium only when the material for analysis is extensive.

The paper's focus is on Eddic poems (i.e. poems in the Poetic Edda).

However, *of* and *um* are not only particles (former preverbs, a notion familiar to German and Dutch speakers); they are also prepositions governing the accusative and the dative, so each occurrence has to be disambiguated before it can be counted.
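
Since the same form can be a particle or a preposition, counting has to take the POS tag into account. The sketch below groups occurrences of *of* and *um* by tag and normalises a count per thousand tokens. `tag_distribution` and `rate_per_thousand` are hypothetical helpers, and the tagged tokens are a small hand-copied sample of (token, tag) pairs, so the numbers are illustrative only.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def tag_distribution(tagged_tokens: Iterable[Tuple[str, str]],
                     targets: Sequence[str] = ("of", "um")) -> Counter:
    """Count (token, tag) pairs, restricted to the target tokens."""
    return Counter((tok, tag) for tok, tag in tagged_tokens if tok in targets)


def rate_per_thousand(count: int, total_tokens: int) -> float:
    """Normalise a raw count so that texts of different lengths can be compared."""
    return 1000 * count / total_tokens


# Small illustrative sample of (token, tag) pairs.
tagged = [
    ("honum", "PE"), ("út", "AV"), ("um", "AP"), ("eyjar", "NC"), ("eiríkr", "NC"),
    ("eystri", "AJ"), ("byggð", "NC"), ("um", "VB"), ("várit", "NC"), ("eftir", "VP"),
]
dist = tag_distribution(tagged)
um_total = sum(n for (tok, _), n in dist.items() if tok == "um")
print(dist, rate_per_thousand(um_total, len(tagged)))
```

Occurrences whose tag looks implausible for *um*, such as `VB` here, are worth checking by hand before drawing any conclusion about dating.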

### Dictionary lookup
CLTK has some dictionaries available. For Old Norse, it provides Zoëga's dictionary.

In [39]:
from cltk.lexicon.non import OldNorseZoegaLexicon

non_dictionary = OldNorseZoegaLexicon()

print(non_dictionary.lookup("of")[:500])

I)
prep.
1) with dat. and acc., over = yfir (fara of fjöll; sitja of borði); of time, = um; of haust or of haustum, in the autumn; of aptaninn, in the evening; of hríð, for a while; of allt, always;
2) with acc. of, about (bera vitni of e-t);
3) in a casual sense, poet.; of sanna sök, for a just cause, justly.
II)
an enclitic particle, chiefly placed before verbs; ek drykk of gat ens dýra mjaðar, I got a draught of the precious mead.
III)
n.
1) great quantity, number; of fjár, immensity of wealt


In [40]:
print(non_dictionary.lookup("um")[:500])

older umb, prep. with acc. and dat.
I. with acc.
1) around (slá hring um e-n);
2) about, all over (hárit féll um hana alla); um allar sveitir, all over the country; mikill um herðar, large about the shoulders, broad-shouldered; liggja um akkeri, to ride at anchor;
3) of proportion; margir voru um einn, many against one; um einn hest voru tveir menn, two men to each horse;
4) round, past, beyond, with verbs denoting motion (sigla vestr um Bretland); leggja um skut þessu skipi, to pass by this shi


### Displaying contexts and preparing analysis

In [41]:
# NLTK is one of CLTK's dependencies
from nltk.text import Text
t = Text(GS_analysed_2.tokens)
t.concordance("um", lines=10)

Displaying 10 of 58 matches:
 var búinn fylgðu þeir styrr honum út um eyjar eiríkr sagði þeim at hann ætlað
 sonr úlfs kráku sá er hann rak vestr um haf þá er hann fann gunnbjarnarsker k
 eiríksey nær miðri inni eystri byggð um várit eftir fór hann til eiríksfjarða
íksey fyrir mynni eiríksfjarðar eftir um sumarit fór hann til íslands ok kom s
 landit héti vel eiríkr var á íslandi um vetrinn en um sumarit eftir fór hann 
el eiríkr var á íslandi um vetrinn en um sumarit eftir fór hann at byggja land
 eyrar er faðir hans hafði brot siglt um várit þau tíðendi þóttu bjarna mikil 
 þetta dægr áðr þeir sá land ok ræddu um með sér hvat landi þetta mun vera en 
irðmaðr jarls ok fór út til grænlands um sumarit eftir var nú mikil umræða um 
 um sumarit eftir var nú mikil umræða um landaleitan leifr sonr eiríks rauða ó


In [42]:
def display_context(cltk_words, token, window_size=5, limit=5):
    """Print up to `limit` occurrences of `token` with `window_size` words
    of context on each side, and the POS tags on the line below."""
    limit_index = 0
    for k, word in enumerate(cltk_words):
        if word.string == token:
            # Keep only occurrences whose full window fits inside the document.
            if 0 <= k - window_size and k + window_size + 1 <= len(cltk_words) and limit_index < limit:
                limit_index += 1
                window = cltk_words[k-window_size:k+window_size+1]
                print(" ".join([f"{w.string}" for w in window]))
                print(" ".join([f"{w.xpos}"+" "*(len(w.string)-2) for w in window]))
                print("-"*50)

In [43]:
display_context(GS_analysed_2.words, "um", 2)

honum út um eyjar eiríkr
PE    AV AP NC    NC    
--------------------------------------------------
rak vestr um haf þá
VB  AV    AP NC  AV
--------------------------------------------------
eystri byggð um várit eftir
AJ     NC    VB NC    VP   
--------------------------------------------------
eiríksfjarðar eftir um sumarit fór
NC            AV    AP NC      VB 
--------------------------------------------------
á íslandi um vetrinn en
AP NC      AP NC      CC
--------------------------------------------------


In [44]:
display_context(GS_analysed_2.words, "of", 2)

hans hans of óvíða kannat
PE   PE   CU AJ    NC    
--------------------------------------------------


In [45]:
display_context(ESR_analysed_2.words, "um", 2)

leituðu hans um eyjarnar þeir
VB      PE   AP NC       AJ  
--------------------------------------------------
eiríki út um eyjarnar ok
NC     AV AP NC       CC
--------------------------------------------------
rak vestr um haf ok
VB  AV    AP NC  CC
--------------------------------------------------
eystri byggð um várit eftir
AJ     NC    VB NC    VP   
--------------------------------------------------
en eftir um sumarit fór
VB VP    AP NC      VB 
--------------------------------------------------


In [46]:
display_context(ESR_analysed_2.words, "of", 2)

dvölðust þeir of stund ok
VB       DD   AP NC    CC
--------------------------------------------------


By Clément Besnier, [www.clementbesnier.eu](https://www.clementbesnier.eu).