# Loading Data

We start by importing the training dataset, which is split into three periods:
* 1066 - 1216: From the Norman Conquest to just after the issuance of the Magna Carta (1215).
* 1217 - 1272: The reign of Henry III (1216 - 1272), noted for legal and architectural advancements.
* 1273 - 1310: Edward I's era (reigned 1272 - 1307), marked by legal reforms and military campaigns.

In [3]:
# define time period split
TIME_PERIODS = (
    (1066, 1216),
    (1217, 1272),
    (1273, 1310)
)

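# split the data based on the time period split
# (each whitespace-separated date corresponds to one line of text; pairing them
# with zip keeps texts and dates aligned even when a date matches no period)
TEXTS = {period: [] for period in TIME_PERIODS}
with open("../data/EngOrdtext") as f_texts, open("../data/EngOrdDate") as f_dates:
    dates = (date for line in f_dates for date in line.split())
    for date, text in zip(dates, f_texts):
        for (start, end) in TIME_PERIODS:
            if start <= int(date) <= end:
                TEXTS[(start, end)].append(text)
                break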

# write a separate text file for each time period
for (start, end), texts in TEXTS.items():
    file_name = f"norman_{start}_{end}.txt"
    with open(file_name, "w") as f:
        # each text already ends with its own newline
        f.writelines(texts)
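
The period lookup performed inside the loading loop can be isolated into a small helper, which makes the boundary behaviour easy to check. This is only a minimal sketch: `find_period` is a hypothetical name that does not appear in the notebook, and the period boundaries are copied from `TIME_PERIODS` above.

```python
# Period boundaries copied from the notebook; both ends are inclusive.
TIME_PERIODS = (
    (1066, 1216),
    (1217, 1272),
    (1273, 1310),
)


def find_period(year, periods=TIME_PERIODS):
    """Return the (start, end) period containing `year`, or None if none matches."""
    for start, end in periods:
        if start <= year <= end:
            return (start, end)
    return None


# Boundary years belong to the period they close or open.
print(find_period(1216))  # (1066, 1216)
print(find_period(1217))  # (1217, 1272)
# Years outside every range are not assigned to any period.
print(find_period(1311))  # None
```

A text whose date returns `None` here would be left out of every period-specific file.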

# Word Embedding Training

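We trained word embeddings for the full corpus and for each period using the `FastText` skip-gram model, which incorporates subword (character n-gram) information. The hyperparameter settings are:

- `minn` and `maxn`: 6 (minimum and maximum character n-gram length)
- `lr`: 0.05 (learning rate)
- `dim`: 100 (dimension of the word vectors)
- `ws`: 5 (size of the context window)
- `epoch`: 30 (number of training epochs)
- `neg`: 5 (number of negative samples)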
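
To make the effect of `minn` and `maxn` concrete, the sketch below lists the character n-grams that a setting of 6 for both would produce for a word. It is a simplified illustration rather than the library's implementation: fastText additionally hashes n-grams into a fixed number of buckets and always keeps a vector for the whole word, both of which are omitted here. `char_ngrams` is a hypothetical helper, and the example words are chosen for illustration, not taken from the corpus.

```python
def char_ngrams(word, minn=6, maxn=6):
    """List the character n-grams of `word` with lengths minn..maxn.

    The word is first wrapped in "<" and ">" boundary markers, as fastText does.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


print(char_ngrams("chevalier"))
# ['<cheva', 'cheval', 'hevali', 'evalie', 'valier', 'alier>']
print(char_ngrams("roi"))
# [] because "<roi>" has only 5 characters, so it contains no 6-grams
```

With these settings, words of three characters or fewer contribute no subword n-grams and are represented by their whole-word vector alone.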

In [4]:
# !pip install fasttext
import fasttext
import os

# create the output directory if it does not exist yet
os.makedirs("trained_model", exist_ok=True)

minn = 6
maxn = 6
lr = 0.05
dim = 100
ws = 5
epoch = 30
neg = 5

# train the overall model
file_name = "../data/EngOrdtext"
model = fasttext.train_unsupervised(
    file_name, model="skipgram", 
    minn=minn, 
    maxn=maxn,
    lr=lr,
    dim=dim,
    ws=ws,
    epoch=epoch,
    neg=neg
)
model_name = "trained_model/norman_all.bin"
model.save_model(model_name)

# train one model per time period
for (start, end) in TIME_PERIODS:
    file_name = f"norman_{start}_{end}.txt"
    model = fasttext.train_unsupervised(
        file_name, model="skipgram", 
        minn=minn, 
        maxn=maxn,
        lr=lr,
        dim=dim,
        ws=ws,
        epoch=epoch,
        neg=neg
    )
    model_name = f"trained_model/norman_{start}_{end}.bin"
    model.save_model(model_name)
    
    # remove the temp file
    os.remove(file_name)

Read 3M words
Number of words:  25307
Number of labels: 0
Progress: 100.0% words/sec/thread:  131460 lr:  0.000000 avg.loss:  1.370082 ETA:   0h 0m 0s
Read 1M words
Number of words:  11527
Number of labels: 0
Progress: 100.0% words/sec/thread:  149511 lr:  0.000000 avg.loss:  1.684981 ETA:   0h 0m 0s
Read 1M words
Number of words:  13823
Number of labels: 0
Progress: 100.0% words/sec/thread:  137494 lr:  0.000002 avg.loss:  1.639278 ETA:   0h 0m 0s
Read 0M words
Number of words:  10632
Number of labels: 0
Progress: 100.0% words/sec/thread:  153676 lr:  0.000000 avg.loss:  1.585814 ETA:   0h 0m 0s


# Load Model

Note: this part can be run independently of the previous sections, provided the trained models already exist.

In [1]:
import fasttext

# load the overall model
model_path = "trained_model/norman_all.bin"
model = fasttext.load_model(model_path)
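
The period-specific models saved during training can be loaded the same way. The sketch below rebuilds their file names from the period boundaries and loads whichever files are present. The helper names `period_model_paths` and `load_period_models` are hypothetical, and the code assumes the models were saved under `trained_model/` with the naming scheme used in the training cell.

```python
import os

# Period boundaries and directory name copied from the earlier cells.
TIME_PERIODS = ((1066, 1216), (1217, 1272), (1273, 1310))


def period_model_paths(model_dir="trained_model", periods=TIME_PERIODS):
    """Map each (start, end) period to the path of its saved model file."""
    return {
        (start, end): os.path.join(model_dir, f"norman_{start}_{end}.bin")
        for start, end in periods
    }


def load_period_models(model_dir="trained_model", periods=TIME_PERIODS):
    """Load every period model that exists on disk, keyed by (start, end)."""
    # Imported here so the path helper can be used without fastText installed.
    import fasttext

    models = {}
    for period, path in period_model_paths(model_dir, periods).items():
        if os.path.exists(path):
            models[period] = fasttext.load_model(path)
    return models
```

Once loaded, a call such as `load_period_models()[(1066, 1216)].get_nearest_neighbors("roi", k=5)` would return the five nearest neighbours of a query word in that period's embedding space; the query word here is only a placeholder.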

