# Dating texts about the discovery of America by Vikings

### Context 


That Norse people reached North America is now an established fact. Archaeological evidence shows a Norse presence at *L'Anse aux Meadows*, in present-day Newfoundland, Canada. (https://www.erudit.org/en/journals/nflds/2003-v19-n1-nflds_19_1/nflds19_1art02/)

A *Nature* article dates this presence precisely: [Evidence for European presence in the Americas in AD 1021](https://rdcu.be/cCdv8) (https://doi.org/10.1038/s41586-021-03972-8 or https://www.nature.com/articles/s41586-021-03972-8)



In [1]:
import cltk

In [2]:
print(cltk.__version__)

cltk 1.0.21


## Text retrieval

First, we load the two sagas we want to analyse: *Eiríks saga rauða* (`ESR`) and *Grænlendinga saga* (`GS`).

In [3]:
def load_file(filename):
    """Read a whole UTF-8 text file into a string."""
    with open(filename, encoding="utf-8") as f:
        return f.read()


ESR = load_file("eiriks_saga_rauda.txt")
GS = load_file("grænlendiga_saga.txt")

In [4]:
print(f"{ESR[:500]}\n\n===================\n\n{GS[:500]}")

1. Frá Auði djúpúðgu ok Vífli.

Óláfr hét herkonungr, er kallaðr var Óláfr hvíti. Hann var sonr Ingjalds konungs Helgasonar, Óláfssonar, Guðröðarsonar, Hálfdanarsonar hvítbeins Upplendingakonungs.
Óláfr herjaði í vestrvíking ok vann Dyflinni á Írlandi ok Dyflinnarskíri. Þar gerðist hann konungr yfir. Hann fekk Auðar djúpúðgu, dóttur Ketils flatnefs, Bjarnarsonar bunu, ágæts manns ór Nóregi. Þorsteinn rauðr hét sonr þeira.
Óláfr fell á Írlandi í orrostu, en Auðr ok Þorsteinn fóru þá í Suðreyjar. 


1. Fundit ok byggt Grænland.

Þorvaldr hét maðr, sonr Ásvalds Úlfssonar, Öxna-Þórissonar. Þorvaldr ok Eiríkr inn rauði, sonr hans, fóru af Jaðri til Íslands fyrir víga sakir. Þá var víða byggt Ísland. Þeir bjuggu fyrst at Dröngum á Hornströndum. Þar andaðist Þorvaldr.
Eiríkr fekk þá Þjóðhildar, dóttur Jörundar Úlfssonar ok Þorbjargar knarrarbringu, er þá átti Þorbjörn inn haukdælski. Réðst 

One way to estimate when a text was written is to examine its linguistic features, for example how often the particle *of*/*um* occurs.

[https://timarit.is/page/7340957](https://timarit.is/page/7340957)

> The of/um particle can be used as a dating criterium only when the material for analysis is extensive.
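
As a rough illustration of this criterion, the sketch below counts the surface forms `of` and `um` and normalises them per 1,000 tokens, and also returns the total token count, since the quotation above stresses that the material must be extensive. The function name `particle_candidates`, the regex tokenisation, and the sample string are assumptions made for this example; they are not part of CLTK or of the cited article. The counts include the prepositions spelled the same way, so they are only an upper bound on the particle's frequency.

```python
import re
from collections import Counter


def particle_candidates(text, forms=("of", "um"), per=1000):
    """Count surface forms of the particle and normalise per `per` tokens.

    Caveat: every token spelled "of" or "um" is counted, including the
    prepositions, so the rates are upper bounds on the particle's frequency.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(token for token in tokens if token in forms)
    total = len(tokens)
    rates = {
        form: (per * counts[form] / total if total else 0.0)
        for form in forms
    }
    return total, rates


# Made-up token sequence, only to exercise the function; not a saga quotation.
sample = "þá var of sét um land ok of haf"
total, rates = particle_candidates(sample)
print(total, rates)
```

With the texts loaded earlier, the same function could be called as `particle_candidates(ESR)` and `particle_candidates(GS)` to compare the two sagas. Separating the particle from the prepositions would need syntactic information that this sketch does not attempt to provide.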

In [5]:
from cltk import NLP

In [19]:
# help(NLP)

In [7]:
from cltk.tokenizers.processes import OldNorseTokenizationProcess
from cltk.stops.processes import StopsProcess


In [8]:
#help(StopsProcess)

In [9]:
from cltk.core.data_types import Pipeline

In [10]:
non_pipeline_1 = Pipeline(language="non", description="", processes=[OldNorseTokenizationProcess, StopsProcess])

In [11]:
non_nlp_1 = NLP("non", custom_pipeline=non_pipeline_1)

𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `StopsProcess`.


In [12]:
ESR_analysed_1 = non_nlp_1.analyze(ESR)

In [13]:
ESR_analysed_1.sentences_strings[0][:100]  # sentence splitting does not work yet: the text is not cut at "."

'1 . Frá Auði djúpúðgu ok Vífli . Óláfr hét herkonungr , er kallaðr var Óláfr hvíti . Hann var sonr I'

In [14]:
len(ESR_analysed_1.sentences)

1

In [21]:
from dataclasses import dataclass

from boltons.cacheutils import cachedproperty

from cltk.sentence.sentence import RegexSentenceTokenizer
# Import paths below are assumed to match CLTK 1.0.x.
from cltk.tokenizers.non import OldNorseWordTokenizer
from cltk.tokenizers.processes import TokenizationProcess

sent_end_chars = [".", "!", "?"]


class OldNorseRegexSentenceTokenizer(RegexSentenceTokenizer):
    """``RegexSentenceTokenizer`` for Old Norse."""

    def __init__(self: object):
        super().__init__(language="non", sent_end_chars=sent_end_chars)


@dataclass
class OldNorseSentenceTokenizationProcess(TokenizationProcess):
    """Draft sentence tokenization process for Old Norse.

    Adapted from ``OldNorseTokenizationProcess``, whose doctest is kept for reference:

    >>> from cltk.tokenizers import OldNorseTokenizationProcess
    >>> from cltk.languages.example_texts import get_example_text
    >>> tok = OldNorseTokenizationProcess()
    >>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("non")[:29]))
    >>> output_doc.tokens
    ['Gylfi', 'konungr', 'réð', 'þar', 'löndum']
    """

    description = "Default Old Norse tokenizer"

    @cachedproperty
    def algorithm(self):
        # Still the word tokenizer of the copied class; not yet sentence-aware.
        return OldNorseWordTokenizer()

In [23]:
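# NOTE: OldNorseRegexSentenceTokenizer is a tokenizer class, not a Process subclass;
# this is most likely why analyze() fails with the TypeError recorded below.
non_pipeline_2 = Pipeline(language="non", description="",
                          processes=[OldNorseTokenizationProcess, OldNorseRegexSentenceTokenizer, StopsProcess])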

In [24]:
non_nlp_2 = NLP("non", custom_pipeline=non_pipeline_2)

𐤀 CLTK version '1.0.21'.
Pipeline for language 'Old Norse' (ISO: 'non'): `OldNorseTokenizationProcess`, `OldNorseRegexSentenceTokenizer`, `StopsProcess`.


In [25]:
ESR_analysed_2 = non_nlp_2.analyze(ESR)

TypeError: __init__() takes 1 positional argument but 2 were given
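
While the custom pipeline raises this error, sentences can still be split outside CLTK. The sketch below is a stopgap that uses only the standard library and the same three sentence-ending characters as `sent_end_chars`. The helper name `split_sentences` is an assumption made for this example, and the excerpt reuses three sentences from the opening of *Eiríks saga rauða* shown earlier.

```python
import re

SENT_END_CHARS = [".", "!", "?"]


def split_sentences(text, end_chars=SENT_END_CHARS):
    """Split text after any sentence-ending character followed by whitespace."""
    # Lookbehind keeps the punctuation attached to the sentence it closes.
    pattern = r"(?<=[" + re.escape("".join(end_chars)) + r"])\s+"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


excerpt = (
    "Óláfr hét herkonungr, er kallaðr var Óláfr hvíti. "
    "Þar gerðist hann konungr yfir. "
    "Þorsteinn rauðr hét sonr þeira."
)
sentences = split_sentences(excerpt)
print(len(sentences))
for sentence in sentences:
    print(sentence)
```

One known limitation: chapter headings such as `1. Frá Auði ...` would be cut after the number, because the full stop following `1` is treated as a sentence end. Headings would need to be handled separately before applying this to the full texts.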

In [None]:
len(ESR_analysed_2.sentences)

In [15]:
from cltk.tag.pos import POSTag

In [20]:
# POSTag(language="non")

In [18]:
POSTag(language="lat")

<cltk.tag.pos.POSTag at 0x7f10ddf95198>