# AI Etruscan Project

Abstract: Etruscan is an ancient language that was spoken in Italy from roughly the 7th century BC to the 1st century AD. The language is both extinct and isolated, which makes its resources extremely scarce. In this project, we propose a state-of-the-art design that integrates a transformer model architecture with a BPE-dropout tokenizer, adaptive transformer layers, and a back-translation mechanism.


# Installing

In the following cells, we use pip, Python's package installer, to install the required packages. In a notebook, prefix any terminal command with '!' so that it runs in the shell rather than as Python code.

In [None]:
!pip install -U torchdata  # Installs/updates the 'torchdata' library, which provides data processing utilities and datasets for PyTorch.
!pip install -U spacy      # Installs/updates the 'spacy' library, a popular NLP library for advanced natural language processing tasks.
!pip install tokenizers # Installs the 'tokenizers' library from Hugging Face, used for the BPE tokenizer.
!pip install 'portalocker==2.8.2'  # Installs version 2.8.2 of the 'portalocker' library used for file locking - a mechanism that allows you to restrict access to a file by allowing only one process to read or write the file at once.
!pip install evaluate  # Installs Hugging Face's 'evaluate' library, used later to load metrics such as BLEU.



We will also download a pretrained pipeline from spaCy: 'el_core_news_sm', its small Greek model.

In [None]:
!python -m spacy download el_core_news_sm

Collecting el-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/el_core_news_sm-3.7.0/el_core_news_sm-3.7.0-py3-none-any.whl (12.6 MB)
     12.6/12.6 MB 47.6 MB/s eta 0:00:00
Installing collected packages: el-core-news-sm
Successfully installed el-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('el_core_news_sm')


Next, we will install small English, German, and Greek models that include capabilities for tokenization, lemmatization, POS tagging, named entity recognition, and more. The English model is trained on web text, while the German and Greek models are trained on news text.

In [None]:
!python -m spacy download en_core_web_sm # This is for English
!python -m spacy download de_core_news_sm # This is for German
!python -m spacy download el_core_news_sm # This is for Greek

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 88.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
     14.6/14.6 MB 77.6 MB/s eta 0:00:00
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
Collecting el-core-news-sm==3.7.0
  Using cached https://github.com/explosion/spac

# Setting up the Transformer

We will now begin to set up the transformer, defining its parameters and importing the necessary items. We will first use the '%matplotlib inline' magic command, which tells the Jupyter Notebook to display matplotlib (a plotting library) plots directly in the notebook rather than opening a new window.


In [None]:
%matplotlib inline
# Displays plots directly in the notebook and renders them as static images.

Now, we will define the training and validation data paths and language types using PyTorch's 'torchtext' library. Defining the path involves specifying the file and location where the training and validation data sets are stored.

# Initializing the BPE Tokenizer

In [None]:
!pip install torch



In [None]:
!python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"

2024-02-14 04:08:47.806255: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-14 04:08:47.806311: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-14 04:08:47.807697: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 12.7MB/s]
{'exact_match': 1.0}


In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
import torch
import evaluate  # Needed for evaluate.load below
from torchtext.datasets import Multi30k
bleu = evaluate.load("bleu")  # Loading the BLEU metric

# Assuming SRC_LANGUAGE and TGT_LANGUAGE are defined
SRC_LANGUAGE = 'de'  # Example source language
TGT_LANGUAGE = 'en'  # Example target language

# Initialize an empty BPE tokenizer
# To enable dropout, pass a rate to the model, e.g. BPE(unk_token="[UNK]", dropout=0.1), and experiment with the dropout rate
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Trainer for the BPE tokenizer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Yields every training sentence, first for the source language and then for the target language
def get_texts(language_pair):
    for ln in language_pair:
        train_iter = Multi30k(split='train', language_pair=language_pair)
        for data_sample in train_iter:
            yield data_sample[0 if ln == SRC_LANGUAGE else 1]

# Train the tokenizer
tokenizer.train_from_iterator(get_texts((SRC_LANGUAGE, TGT_LANGUAGE)), trainer=trainer)

# Example function to encode text using the trained BPE tokenizer
def encode_text(text):
    return tokenizer.encode(text).tokens

# Save the tokenizer
tokenizer.save("/content/drive/MyDrive/Etruscan_Project/bpe_tokenizer.json")

Testing the existing BPE tokenizer.

In [None]:
# Example usage
# example_text = "INAS, INNI, INV; MAROS."
# print(encode_text(example_text))
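
BPE-dropout, which the abstract names as part of the design, randomly skips merges while a word is being segmented, so the model sees several segmentations of the same word during training. The following is a minimal, framework-free sketch of that idea. The function name `bpe_encode`, the toy merge table, and the dropout rates are illustrative assumptions; this is not the Hugging Face implementation.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with an ordered list of BPE merges.

    Each applicable merge is skipped with probability `dropout`
    every time it is considered (the BPE-dropout idea).
    """
    rng = rng or random.Random(0)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (hypothetical, for illustration only).
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", toy_merges))               # no dropout: ['low', 'er']
print(bpe_encode("lower", toy_merges, dropout=1.0))  # every merge dropped: characters only
```

With a rate between 0 and 1, repeated calls with different random states give a mix of coarse and fine segmentations, which is the regularization effect the technique aims for.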

In [None]:
#First, we will import the necessary utilities for handling text data. These utilities include:
from torchtext.data.utils import get_tokenizer #A function to get a tokenizer based on the specified language (or a custom tokenization function)
from torchtext.vocab import build_vocab_from_iterator #This function builds a vocabulary object that maps tokens to indices. A vocabulary is a collection of unique tokens in a dataset.
# A vocabulary object is a data structure that maps each token to a unique numerical index (an integer), which allows any text to be represented as a sequence of integers that the computer can process.
from torchtext.datasets import multi30k, Multi30k #Imports the Multi30k dataset class and its module (the module is needed to override the download URLs below)
#from datasets import load_dataset #Preparing to load the huggingface dataset

from typing import Iterable, List #Importing some typing utilities for better code readability and type checking.

#dataset = load_dataset("latin_english_translation", split='train') #We are now loading the huggingface dataset

#This code defines the training and validation data paths (for the Multi30k Dataset).
#These URLs point to a GitHub repository that hosts the dataset files. This ensures that the datasets can be accessed and used for training and validation reasons.
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz" # This is for the training dataset!
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz" #This is for the validation dataset!

#The code below will define the source and target languages using the ISO language codes. In our case, we are translating from German, which is our source, to English, which is our target.
SRC_LANGUAGE = 'de' #de is the ISO language code for German
TGT_LANGUAGE = 'en' #en is the ISO language code for English

#Now, we will initialize dictionaries (key-value pair data structures) to hold the tokenization functions and vocabulary transformations for each language.
#These tokenization functions and vocabularies map text to tokens and tokens to numerical indices, which is essential for processing textual data in ML models.
token_transform = {} #Initializes the 'token_transform' dictionary to hold tokenization functions.
vocab_transform = {} #Initializes the 'vocab_transform' dictionary to hold vocabulary transformation functions.

The '!pwd' (print working directory) command is a shell command that outputs the absolute path of the current directory you are in.

In [None]:
!pwd

We will then connect Google Colab to Google Drive so that the notebook can access files stored in Google Drive.

In [None]:
from google.colab import drive #The 'drive' module provides functions to interact with Google Drive.
drive.mount('/content/drive')# This command will prompt the user to authorize access to their Google Drive.

The next two commands will upgrade the 'torchdata' and 'spacy' libraries to their latest versions.

In [None]:
!pip install -U torchdata #Upgrading(-U) the 'torchdata' library to the latest version.
!pip install -U spacy #Upgrading the 'spacy' library to the latest version.

**Do I need 'vocab_transform' here without anything else??**

Let's now import the necessary items for the BPE Dropout Tokenizer! **Make sure to figure out if we have a pre-trained BPE tokenizer that we can load for Greek.**

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

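# For demonstration, we assume the tokenizer file saved earlier is available on Google Drive
tokenizer_path = '/content/drive/MyDrive/Etruscan_Project/bpe_tokenizer.json'

# Load the tokenizer
bpe_tokenizer = Tokenizer.from_file(tokenizer_path)

# Enable dropout - experiment with the dropout rate
# Note: in the 'tokenizers' library, dropout is a parameter of the BPE model, set when the model is
# constructed, e.g. BPE(unk_token="[UNK]", dropout=0.1). The call below is a placeholder and stays commented out.
#bpe_tokenizer.enable_dropout(0.1) # 0.1 is the dropout rate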


Now, we will set up and define the necessary components for handling and transforming text data from German to English using the 'torchtext' and 'spacy' libraries.

In [None]:
#Here, we define 'token_transform' as a dictionary that holds the tokenization functions for the source (SRC_LANGUAGE) and target (TGT_LANGUAGE) languages.
#The keys of 'token_transform' are the language codes, and the values are the functions that tokenize text in the respective language.
#The get_tokenizer function is called with the spaCy model for each language and returns the appropriate tokenization function.
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')  # German tokenizer
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')  # English tokenizer

#Original spaCy-based version of the helper, kept for reference.
#It iterates over a dataset, tokenizes the text samples according to a specified language, and yields a list of tokens per sample.
#'data_iter' is a parameter that is an iterable collection of data samples, where each sample consists of text data. The iterable collection could be a list or anything else.
#'language' is a parameter that specifies the language of the text that is going to be tokenized.
# def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
#     language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1} #This dictionary maps the language identifiers to indices, which are then used to select the correct part of the data sample corresponding to the specified language.

#     for data_sample in data_iter:
#         # The function loops through each 'data_sample' in 'data_iter,' tokenizing based on the specific 'language.' This involves breaking text into words, punctuation, etc., which are the basic units for processing.
#         yield token_transform[language](data_sample[language_index[language]]) #This line is where the tokenization happens.
#         #The function retrieves the appropriate tokenization function based on the language parameter, applies it to the text selected by language_index[language], and yields the list of tokens produced.

#The SpaCy tokenizer is replaced with the BPE_Dropout Tokenizer
#This helper iterates over a dataset, tokenizes the text samples according to a specified language, and yields a list of tokens per sample.
#'data_iter' and 'language' have the same meaning as in the spaCy-based version above.
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1} #This dictionary maps the language identifiers to indices, which are then used to select the correct part of the data sample corresponding to the specified language.
    for data_sample in data_iter:
        # The function loops through each 'data_sample' in 'data_iter,' splitting the text of the given 'language' into BPE subword tokens, which are the basic units for processing.
        if language == SRC_LANGUAGE:
            tokens = bpe_tokenizer.encode(data_sample[language_index[language]]).tokens
        else:
            # Assuming you have a separate tokenizer for the target language, or use the same with different settings
            tokens = bpe_tokenizer.encode(data_sample[language_index[language]]).tokens
        yield tokens

#The break here is probably a mistake

#Here, we are defining special symbols that are used in NLP tasks, such as <unk> for unknown tokens, <pad> for padding, <bos> for the beginning of a sentence, and <eos> for the end of a sentence.
#These tokens are necessary for handling various scenarios in text processing and model training - such as padding sequences to a uniform length in collation.
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

# This for loop will iterate over both languages (German and English), building a vocabulary for each.
#The vocabularies map unique tokens to indices based on the Multi30k dataset. The vocabularies include the special symbols defined above, ensuring they are part of the vocabulary and properly indexed.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    #For each language, the Multi30k dataset is loaded with a specific 'split,' defined by the parameter. The 'language_pair' parameter specifies the source and target language.
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    #'build_vocab_from_iterator' builds the vocabulary from an iterator that yields lists of tokens. 'yield_tokens' iterates over 'train_iter', processes the samples of the current language ('ln'), and yields the tokens used to build the vocabulary.
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1, #The minimum frequency a token must have to be included in the vocabulary (here, 1)
                                                    specials=special_symbols, #The list of special symbols added to the vocabulary
                                                    special_first=True) #This parameter ensures that the special symbols are added to the beginning of the vocabulary
# Sets the default index for tokens not found in the vocabulary to UNK_IDX.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
  vocab_transform[ln].set_default_index(UNK_IDX)
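
To make the vocabulary step concrete, here is a small pure-Python sketch of what such a mapping does: special symbols come first, remaining tokens follow by frequency, and unknown tokens fall back to the index of '<unk>'. The helper names `build_vocab` and `numericalize` and the toy sentences are assumptions for illustration; torchtext's own implementation differs in its details.

```python
from collections import Counter

SPECIALS = ("<unk>", "<pad>", "<bos>", "<eos>")

def build_vocab(token_lists, specials=SPECIALS, min_freq=1):
    """Return a token -> index dict with the special symbols first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens, stoi, unk="<unk>"):
    """Map tokens to indices, using the <unk> index for unseen tokens."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Toy corpus (hypothetical).
stoi = build_vocab([["ein", "hund"], ["ein", "mann"]])
print(stoi)
print(numericalize(["ein", "katze"], stoi))  # 'katze' is unseen, so it maps to index 0
```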


# Transformer Model with Adaptive Layer

We will now create a transformer that consists of an embedding layer, a transformer model, and a linear layer. The embedding layer converts tensors of input token indices into tensors of embeddings; that is, it maps each token index to a vector of a fixed size in the latent space. The transformer then processes these vectors through a stack of self-attention and feedforward layers in the encoder and decoder, with each layer transforming the token vectors into a more abstract, context-aware representation. Finally, the linear layer maps each high-dimensional token representation to a vector whose dimensionality equals the size of the target vocabulary, giving one score per possible output token. The target vocabulary is the set of all possible output tokens of the transformer, which includes words, punctuation marks, etc.
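
Before the token embeddings reach the transformer, the PositionalEncoding class in the cell below adds a sinusoidal position signal to them. As a framework-free sketch of that formula (even dimensions use a sine, odd dimensions a cosine of the same angle), the following may help; the function name `positional_encoding` and the small sizes are illustrative assumptions.

```python
import math

def positional_encoding(maxlen, emb_size):
    """Return a maxlen x emb_size table of sinusoidal position encodings."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, emb_size, 2):
            # Same frequency term as exp(-i * log(10000) / emb_size) in the class below.
            angle = pos * math.exp(-i * math.log(10000.0) / emb_size)
            table[pos][i] = math.sin(angle)
            if i + 1 < emb_size:
                table[pos][i + 1] = math.cos(angle)
    return table

pe = positional_encoding(maxlen=4, emb_size=6)
print(pe[0])  # position 0: sines are 0.0 and cosines are 1.0
print(pe[1])
```

Each position gets a distinct vector, and the model receives the sum of this vector and the token embedding.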

In [None]:
#We will first import the necessary libraries from PyTorch for tensor operations.
from torch import Tensor # The tensor type, used for type hints below.
import torch
import torch.nn as nn
from torch.nn import Transformer
import math

# Set the device to GPU(Graphics Processing Unit) if available; otherwise, use CPU(Central Processing Unit).
#This allows leveraging hardware acceleration for training and inference, as GPU is generally acknowledged to be faster than the standard CPU especially in parallelizable tasks.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define a module for adding positional encodings to token embeddings.
# Positional encodings provide the model with information about the position of tokens in the sequence by adding a position identifier to the vector representation of a token.
class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        # Calculate positional encodings once in log space for efficiency.
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # Add positional encodings to token embeddings and apply dropout.
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# A module to convert token indices to embeddings. It maps tokens to vectors in a high-dimensional space.
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # Multiply embeddings by sqrt(emb_size) to normalize their scale.
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Define a simple square linear layer (size -> size). Note that it is not used by the Seq2SeqTransformer class below.
class AdaptiveLinear(nn.Module):
    def __init__(self, size: int):
        super(AdaptiveLinear, self).__init__()
        self.linear = nn.Linear(size, size)

    def forward(self, x: Tensor):
        # A simple linear transformation.
        return self.linear(x)

# Define adapter layers that can be inserted between transformer layers to fine-tune the model for specific tasks.
class Adapter(nn.Module):
    def __init__(self, size: int):
        super(Adapter, self).__init__()
        self.adapter_block = nn.Sequential(
            nn.Linear(size, size // 2),
            nn.ReLU(),
            nn.Linear(size // 2, size)
        )

    def forward(self, x: Tensor):
        # Apply the adapter block and add a skip connection.
        ff_out = self.adapter_block(x)
        adapter_out = ff_out + x
        return adapter_out

# The main Seq2Seq Transformer model combining the above components.
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int, emb_size: int,
                 nhead: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward: int = 512, dropout: float = 0.1, use_adaptive_layers: bool = False):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size, nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward, dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)
        self.use_adaptive_layers = use_adaptive_layers
        if use_adaptive_layers:
            # Initialize adaptive layers if enabled.
            self.adaptive_encoders = nn.ModuleList([Adapter(emb_size) for _ in range(num_encoder_layers - 1)])
            self.adaptive_decoders = nn.ModuleList([Adapter(emb_size) for _ in range(num_decoder_layers - 1)])

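    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor, tgt_mask: Tensor,
                src_padding_mask: Tensor, tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        # Process input through the encoder, optional adaptive layers, and decoder to produce output.
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        # Encoder: run the layers one by one so that an adapter can follow each layer.
        memory = src_emb
        for i, layer in enumerate(self.transformer.encoder.layers):
            # Pass the padding mask too, so attention ignores <pad> positions.
            memory = layer(memory, src_mask=src_mask, src_key_padding_mask=src_padding_mask)
            if self.use_adaptive_layers and i < len(self.adaptive_encoders):
                memory = self.adaptive_encoders[i](memory)
        # Apply the encoder's final normalization, which encode() also applies.
        if self.transformer.encoder.norm is not None:
            memory = self.transformer.encoder.norm(memory)
        # Decoder
        output = tgt_emb
        for i, layer in enumerate(self.transformer.decoder.layers):
            output = layer(output, memory, tgt_mask=tgt_mask,
                           tgt_key_padding_mask=tgt_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask)
            if self.use_adaptive_layers and i < len(self.adaptive_decoders):
                output = self.adaptive_decoders[i](output)
        # Apply the decoder's final normalization, which decode() also applies.
        if self.transformer.decoder.norm is not None:
            output = self.transformer.decoder.norm(output)
        return self.generator(output)

    # Additional methods for encoding, decoding, and toggling adaptive layers.
    # Note: encode() and decode() call the stock encoder and decoder stacks, so they bypass the adapters.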

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

    def toggle_adaptive_layers(self, use: bool):
        """
        Toggle the usage of adaptive layers in the model.
        :param use: A boolean flag to enable or disable adaptive layers.
        """
        self.use_adaptive_layers = use
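
The Adapter module above is a bottleneck block: project down to half the size, apply ReLU, project back up, then add the input back through a skip connection. A pure-Python sketch of that computation on a single vector is given below; the helper names and the hand-picked weights are assumptions for illustration, not trained values.

```python
def linear(x, weight, bias):
    """Compute weight @ x + bias for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter(x, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: down-project, ReLU, up-project, then add the input."""
    hidden = relu(linear(x, w_down, b_down))
    out = linear(hidden, w_up, b_up)
    return [o + xi for o, xi in zip(out, x)]  # skip connection

x = [1.0, -2.0, 3.0, 4.0]
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]       # 4 -> 2
w_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]     # 2 -> 4
print(adapter(x, w_down, [0.0, 0.0], w_up, [0.0] * 4))

# With an all-zero up-projection, the adapter reduces to the identity.
zero_up = [[0.0, 0.0] for _ in range(4)]
print(adapter(x, w_down, [0.0, 0.0], zero_up, [0.0] * 4))
```

Because of the skip connection, an adapter whose up-projection outputs zeros leaves the layer output unchanged, so an adapter can only add a correction on top of what the transformer layer already computes.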

# The Masking Function

We will now create a masking function. Masking hides future tokens and padding tokens during training, so that the model predicts each token based only on the previous ones and ignores irrelevant information (in this case, the padding tokens).

Padding tokens are special tokens added to sequences so that all sequences in a batch have the same length, because the transformer requires the inputs in a batch to be of uniform size. The padding needs to be hidden because it is irrelevant to training.

The padding tokens themselves are added by the collation function, which prepares batches of data by padding sequences to a uniform length.


In [None]:
# This function generates a square mask for the sequence. The mask ensures that during training,
# the predictions for position i can depend only on the known outputs at positions less than i.
def generate_square_subsequent_mask(sz):
    # Creates an upper triangular matrix of ones, with zeros elsewhere. This matrix is then transposed.
    # This operation ensures that for any position i in the sequence, positions > i are masked with `-inf`,
    # making the model unable to peek into future tokens.
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    # Converts the mask to float and changes 0s to `-inf` and 1s to 0.0.
    # `-inf` is used to mask out the future tokens by setting their attention weight to 0,
    # ensuring they don't contribute to the prediction of current and past tokens.
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

# This function creates four masks used during the forward pass of the transformer model.
def create_mask(src, tgt):
    # Calculates the sequence lengths of source and target tensors.
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    # Generates a subsequent mask for target sequence to prevent the model from accessing future tokens.
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    # Creates an all-False boolean mask for the source sequence, as self-attention for the source doesn't need masking.
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    # Generates padding masks for source and target sequences. These masks ensure that model's attention mechanism
    # ignores padding tokens by setting their positions to `True`. `PAD_IDX` is used to identify padding tokens.
    # The inputs have shape (seq_len, batch), while the attention layers expect key padding masks of shape
    # (batch, seq_len), so `.transpose(0, 1)` is applied.
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)

    # Returns the source mask, target mask, source padding mask, and target padding mask.
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
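
To see what these masks look like, the sketch below rebuilds both kinds with plain Python lists for a tiny example. The function names `causal_mask` and `padding_mask` and the toy batch are illustrative assumptions; the batch is written batch-first here for readability.

```python
NEG_INF = float("-inf")

def causal_mask(sz):
    """Row i holds 0.0 for positions j <= i and -inf for future positions j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(sz)] for i in range(sz)]

def padding_mask(batch, pad_idx=1):
    """True marks a padding position that attention should ignore."""
    return [[tok == pad_idx for tok in seq] for seq in batch]

for row in causal_mask(3):
    print(row)

# Two sequences of token ids; the second ends with one padding token (id 1).
print(padding_mask([[2, 5, 3], [2, 3, 1]]))
```

The `-inf` entries are added to the attention scores before the softmax, which drives the corresponding attention weights to zero.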


# Weights and Parameters

We will now set up the transformer model for Seq2Seq learning tasks, such as language translation, by defining its hyperparameters, initializing its weights, and preparing it for training.

In [None]:
# Set a fixed seed for reproducibility of results across different runs.
torch.manual_seed(0)

def create_German2English(vocab_transform, SRC_LANGUAGE = 'de', TGT_LANGUAGE = 'en'):
    # Determine the sizes of the source and target vocabularies.
    SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
    TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

    # Define the embedding size for input tokens. This size is also used in different parts of the transformer model.
    EMB_SIZE = 512
    # Define the number of heads in the multi-head attention mechanism. More heads allow the model to jointly attend to
    # information at different positions from different representational spaces.
    NHEAD = 8
    # The dimension of the feedforward network model in nn.TransformerEncoder and nn.TransformerDecoder
    FFN_HID_DIM = 512
    # Batch size for training; affects the number of samples processed before the model is updated.
    BATCH_SIZE = 128
    # The number of layers in the transformer's encoder and decoder stacks.
    NUM_ENCODER_LAYERS = 3
    NUM_DECODER_LAYERS = 3

    # Instantiate the Seq2SeqTransformer model with the specified parameters.
    transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                    NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM, use_adaptive_layers=False)
    return transformer

def create_English2German(vocab_transform, SRC_LANGUAGE = 'en', TGT_LANGUAGE = 'de'):
    SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
    TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
    # Define the embedding size for input tokens. This size is also used in different parts of the transformer model.
    EMB_SIZE = 512
    # Define the number of heads in the multi-head attention mechanism. More heads allow the model to jointly attend to
    # information at different positions from different representational spaces.
    NHEAD = 8
    # The dimension of the feedforward network model in nn.TransformerEncoder and nn.TransformerDecoder
    FFN_HID_DIM = 512
    # Batch size for training; affects the number of samples processed before the model is updated.
    BATCH_SIZE = 128
    # The number of layers in the transformer's encoder and decoder stacks.
    NUM_ENCODER_LAYERS = 3
    NUM_DECODER_LAYERS = 3
    # Instantiate the Seq2SeqTransformer model with the specified parameters.
    transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                    NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM, use_adaptive_layers=False)
    return transformer
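
The two factory functions above share the same hyperparameters. One possible way to keep them in a single place is a small configuration object, sketched below under the assumption that a dataclass fits the project; the class name `TransformerConfig` is hypothetical. The check reflects the requirement of multi-head attention that the embedding size be divisible by the number of heads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    emb_size: int = 512
    nhead: int = 8
    ffn_hid_dim: int = 512
    num_encoder_layers: int = 3
    num_decoder_layers: int = 3

    def __post_init__(self):
        # Each attention head works on an equal slice of the embedding.
        if self.emb_size % self.nhead != 0:
            raise ValueError("emb_size must be divisible by nhead")

    @property
    def head_dim(self) -> int:
        """Size of the slice of the embedding handled by each head."""
        return self.emb_size // self.nhead

config = TransformerConfig()
print(config.head_dim)  # 512 // 8
```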

# The Collation Function

We will now set up the collation function, which will convert a batch of raw strings with varying sizes into a batch of tensors of uniform sizes that can be directly fed into our model.

In [None]:
# Import the pad_sequence utility to pad sequences to the same length.
from torch.nn.utils.rnn import pad_sequence

# Defines a function that applies a series of transformations to the input text.
# This is useful for chaining operations like tokenization, numericalization, and adding special tokens.
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)  # Apply each transformation in sequence.
        return txt_input
    return func

# Function to add Beginning Of Sentence (BOS) and End Of Sentence (EOS) tokens around the tokenized input,
# and convert the sequence of token IDs into a PyTorch tensor.
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),  # Prepend BOS token.
                      torch.tensor(token_ids),  # Include the token IDs.
                      torch.tensor([EOS_IDX])))  # Append EOS token.

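# Dictionary to hold the transformations for both source and target languages.
# The vocabularies above were built from BPE tokens, so the same BPE tokenizer is applied here;
# the spaCy tokenizers in 'token_transform' would produce tokens that do not match those vocabularies.
def bpe_tokenize(text: str) -> List[str]:
    return bpe_tokenizer.encode(text).tokens

text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(bpe_tokenize,  # Apply BPE tokenization.
                                               vocab_transform[ln],  # Convert tokens to numerical IDs.
                                               tensor_transform)  # Add BOS/EOS tokens and convert to tensor.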

# Function to collate a batch of data points. This function is used by the DataLoader to combine individual
# data items into a batch.
def collate_fn(batch):
    src_batch, tgt_batch = [], []  # Lists to hold source and target sequences for the batch.
    for src_sample, tgt_sample in batch:  # Iterate over each data point in the batch.
        # Process the source and target samples, strip trailing newline characters, and apply text transformations.
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    # Pad the sequences in the batch to the same length and convert to tensors.
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch  # Return the processed batch.
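
The effect of the collation step can be illustrated without PyTorch. The sketch below wraps each sequence of token ids in BOS/EOS, pads to the longest sequence, and returns the batch in the sequence-first layout (one row per time step), which is the layout `pad_sequence` produces by default. The function name `collate_ids` and the toy ids are assumptions for illustration.

```python
def collate_ids(batch_ids, pad_idx=1, bos_idx=2, eos_idx=3):
    """Pad a batch of token-id lists and return it as max_len rows of batch_size ids."""
    seqs = [[bos_idx] + ids + [eos_idx] for ids in batch_ids]
    max_len = max(len(seq) for seq in seqs)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in seqs]
    # Transpose from (batch, max_len) to (max_len, batch).
    return [list(step) for step in zip(*padded)]

batch = collate_ids([[7, 8], [9]])
for step in batch:
    print(step)
```

The shorter sequence receives one padding id (1) at the end, so both columns have the same length.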


# Defining the Training Loop

We will now define the training and evaluation loops for our Seq2Seq transformer model. The training loop iterates over the training data, makes predictions, calculates the loss with the loss function, and updates the model parameters with the optimizer to minimize that loss. Every training step has a few key components.

First, zero the gradients with optimizer.zero_grad(). This gives every batch of data a clean slate, so gradients from earlier batches are not carried over and combined with the current one. The gradients must be zeroed before loss.backward() is called; in the code below this happens just before the forward pass that computes the logits and the loss.
Then, make a prediction by running the model on the batch.
Then, calculate the loss from that prediction.
Then, call loss.backward(), which computes the gradient of the loss with respect to every model weight.
Then, call optimizer.step(), which uses those gradients to update the weights and reduce the loss. This is how the model learns.
Finally, record some metrics, such as the running loss.
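
The order of these steps can be sketched with a single weight and no deep learning library. This is only an illustration under stated assumptions: ToyParameter and train_step are hypothetical names, the "model" is the one-parameter function prediction = w * x, and the gradient is written out by hand instead of being computed by loss.backward().

```python
class ToyParameter:
    """A single weight with a gradient buffer that accumulates, as PyTorch's does."""
    def __init__(self, value: float):
        self.value = value
        self.grad = 0.0

def train_step(w: ToyParameter, x: float, y: float, lr: float) -> float:
    w.grad = 0.0                          # 1. zero the gradient (optimizer.zero_grad())
    prediction = w.value * x              # 2. forward pass: make a prediction
    loss = (prediction - y) ** 2          # 3. calculate the loss (squared error)
    w.grad += 2 * (prediction - y) * x    # 4. backward pass: accumulate d(loss)/dw
    w.value -= lr * w.grad                # 5. optimizer step: move against the gradient
    return loss

w = ToyParameter(0.0)
losses = [train_step(w, x=2.0, y=4.0, lr=0.1) for _ in range(20)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}, weight {w.value:.4f}")
```

Because step 4 adds to the gradient buffer rather than overwriting it, skipping step 1 would mix the gradients of earlier batches into the current update.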

In [None]:
from torchtext.datasets import Multi30k
import torchtext
from torch.utils.data import IterableDataset

# Import necessary modules for data handling
from torch.utils.data import DataLoader

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# train_epoch accepts a train_iter argument, but for now it reloads Multi30k itself (see the TODO below).

def train_epoch(model, train_iter, optimizer, epoch, SRC_LANGUAGE, TGT_LANGUAGE, best_loss=float('inf')):
    # best_loss defaults to infinity so that any finite loss counts as the best so far.
    model.train()  # Set the model to training mode
    losses = 0
    BATCH_SIZE = 128

    print(torchtext.__version__)  # 0.16
    # Load the sentence pairs in the same direction as the model being trained, so that
    # collate_fn applies the source transform to the source sentences (this matches evaluate()).
    # NOTE: this overwrites the train_iter argument.
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    print(type(train_iter))
    # TODO: work out how to combine the other dataset with this train_iter.
    # Check whether train_iter is a list, a set, or some other kind of iterable.
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    n_batches = 0
    for src, tgt in train_dataloader:

        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        optimizer.zero_grad()
        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()
        losses += loss.item()
        n_batches +=1

    avg_loss = losses / n_batches

    # Check if the average loss of this epoch is the best so far
    if avg_loss < best_loss:
        best_loss = avg_loss  # Update the best loss with the current average loss
        # Save the model and any other components
        # Pick the checkpoint file by translation direction. The path is on Google Drive
        # so that the checkpoint is kept after the session ends.
        if SRC_LANGUAGE == 'de':
            checkpoint_path = '/content/drive/MyDrive/Etruscan_Project/model_best_forward.pth'
        else:
            checkpoint_path = '/content/drive/MyDrive/Etruscan_Project/model_best_backward.pth'
        torch.save({
            'epoch': epoch,
            # Every learnable weight of the model, so that it can be loaded back in later.
            'model_state_dict': model.state_dict(),
            # The optimizer state is only needed to resume training, not for inference (translation).
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': best_loss,
        }, checkpoint_path)  # Save to a .pth file

    print(f'Epoch {epoch}, Loss: {avg_loss}, Best Loss: {best_loss}')

    return avg_loss, best_loss  # Return the average loss and the best loss for tracking

class CombinedDataset(IterableDataset):
    def __init__(self, data_pipe1, data_pipe2):
        self.data_pipe1 = data_pipe1
        self.data_pipe2 = data_pipe2
    def __iter__(self):
        for item in self.data_pipe1:
            yield item
        for item in self.data_pipe2:
            yield item

# Function to evaluate the model's performance on the validation set

def evaluate(model):
    # Set the model to evaluation mode to disable dropout, batch normalization updates, etc.
    model.eval()

    # Accumulate the total loss and count the batches as we go
    losses = 0
    n_batches = 0

    BATCH_SIZE = 128

    # Load the validation dataset
    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Prepare data loader for validation data
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # No parameters are updated during evaluation, so gradient tracking can be switched off
    with torch.no_grad():
        # Iterate over each batch of validation data
        for src, tgt in val_dataloader:
            # Move source and target tensors to the appropriate device (GPU/CPU)
            src = src.to(DEVICE)
            tgt = tgt.to(DEVICE)

            # Prepare the input for the target by removing the last token
            tgt_input = tgt[:-1, :]

            # Create masks and padding masks for source and target inputs
            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

            # Forward pass: compute the predicted outputs (logits) from the model
            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

            # Prepare the target outputs by removing the first token (BOS)
            tgt_out = tgt[1:, :]
            # Compute the loss between the predicted logits and the actual target outputs
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            # Accumulate the loss
            losses += loss.item()
            n_batches += 1

    # Return the average loss over the validation batches. Counting batches inside the loop
    # avoids iterating over the whole dataloader a second time just to get its length.
    return losses / n_batches
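
Both loops above slice the target twice: tgt[:-1] is fed to the decoder and tgt[1:] is what the loss compares against. The following pure-Python sketch shows that shift on a single sequence. The index values and the name shift_for_teacher_forcing are assumptions made for this illustration; the notebook applies the same slicing along the first dimension of a (length, batch) tensor.

```python
from typing import List, Tuple

# Hypothetical special-token indices, used only in this illustration.
DEMO_PAD, DEMO_BOS, DEMO_EOS = 1, 2, 3

def shift_for_teacher_forcing(tgt: List[int]) -> Tuple[List[int], List[int]]:
    """Return (decoder input, expected output) for one target sequence."""
    tgt_input = tgt[:-1]   # everything except the last position
    tgt_out = tgt[1:]      # everything except the first position (BOS)
    return tgt_input, tgt_out

tgt = [DEMO_BOS, 7, 8, 9, DEMO_EOS, DEMO_PAD]
tgt_input, tgt_out = shift_for_teacher_forcing(tgt)

# At step i the decoder has seen tgt_input[: i + 1] and must predict tgt_out[i].
for seen, expected in zip(tgt_input, tgt_out):
    print(f"last token seen: {seen} -> token to predict: {expected}")

# Positions whose expected token is padding are skipped by the loss (ignore_index).
scored_positions = [tok for tok in tgt_out if tok != DEMO_PAD]
print("tokens that count toward the loss:", scored_positions)
```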


# Training the Model

Now we have all the hyperparameters, functions, and dictionaries needed to train our model! Let's set the number of epochs, use the greedy decoding function to take the most probable token at every timestep, and use the translate function to complete the translation!
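
The greedy strategy can be sketched without a neural network. In the toy below, the lookup table NEXT_TOKEN_SCORES and the function greedy_decode_toy are hypothetical stand-ins: the table scores the next token from the previous token only, whereas the real transformer conditions on the source sentence and on every token generated so far.

```python
from typing import Dict, List

# Hypothetical toy "model": for each previous token, a score for each possible next token.
NEXT_TOKEN_SCORES: Dict[str, Dict[str, float]] = {
    "<bos>": {"the": 0.6, "a": 0.3, "<eos>": 0.1},
    "the": {"boy": 0.7, "house": 0.2, "<eos>": 0.1},
    "boy": {"runs": 0.8, "<eos>": 0.2},
    "runs": {"<eos>": 0.9, "the": 0.1},
}

def greedy_decode_toy(scores: Dict[str, Dict[str, float]], max_len: int,
                      start_symbol: str = "<bos>", end_symbol: str = "<eos>") -> List[str]:
    """At every timestep keep only the highest-scoring next token."""
    tokens = [start_symbol]
    for _ in range(max_len - 1):
        candidates = scores[tokens[-1]]
        next_token = max(candidates, key=candidates.get)  # the greedy choice
        tokens.append(next_token)
        if next_token == end_symbol:
            break
    return tokens

decoded = greedy_decode_toy(NEXT_TOKEN_SCORES, max_len=10)
print(" ".join(decoded))
```

Greedy decoding never revisits an earlier choice, which makes it fast but means it can miss a sequence whose overall score is higher.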

In [None]:
import torch
import torch.nn as nn  # nn.init is used below to initialize the model parameters
from torchtext.datasets import Multi30k
from torch.utils.data.datapipes.iter import IterableWrapper
from itertools import chain
from timeit import default_timer as timer
import numpy as np

# The code below calls the training-loop functions defined above.
# An epoch is one full pass through the training data.

# This is the general trainer
def train_model(NUM_EPOCH, transformer, optimizer, SRC_LANGUAGE, TGT_LANGUAGE, patience=2):
    best_val_loss = float('inf')
    patience_counter = 0

    # Use the NUM_EPOCH argument, not the global NUM_EPOCHS, so callers control the epoch count
    for epoch in range(1, NUM_EPOCH + 1):
        start_time = timer()
        # Assume train_epoch function returns average and best loss for the epoch
        train_avg_loss, train_best_loss = train_epoch(transformer, train_iter, optimizer, epoch, SRC_LANGUAGE, TGT_LANGUAGE, best_val_loss)
        end_time = timer()
        val_loss = evaluate(transformer)

        print(f"Epoch: {epoch}, Train Avg loss: {train_avg_loss:.3f}, Train Best loss: {train_best_loss: .3f}, "
              f"Val loss: {val_loss:.3f}, Epoch time = {(end_time - start_time):.3f}s")

        # Early stopping logic
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0  # Reset patience counter
        else:
            patience_counter += 1  # Increment patience counter

        if patience_counter >= patience:
            print("Early stopping triggered")
            break

NUM_EPOCHS = 100

transformer_forward = create_German2English(vocab_transform)

transformer_backward = create_English2German(vocab_transform)

for transformer in (transformer_forward, transformer_backward):
    # Initialize the model's parameters using the Xavier uniform initialization method.
    # This initialization helps in keeping the signal from exploding or vanishing in deep networks.
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    # Move the model to the appropriate device (GPU or CPU) for computation efficiency.
    transformer = transformer.to(DEVICE)

#Here, we are trying to combine the datasets

# Load the Multi30k dataset
multi30k_train_iter = Multi30k(split='train', language_pair=('en', 'de'))


# In a nutshell, neural networks learn by minimizing the loss. To minimize the loss,
# you follow stochastic gradient descent: you take in a batch of data, evaluate the
# model on that batch and calculate the loss, and then, based on the loss, the network
# updates its weights, ideally increasing performance and reducing the loss further.
# This update is carried out by an optimizer. There are many optimizers; Adam is one of them.

# The loss function, loss_fn, was defined in the training-loop cell. CrossEntropyLoss is used for classification tasks.
# Its `ignore_index` parameter is set to PAD_IDX to exclude the padding tokens from the loss calculation.

for index, transformer in enumerate([transformer_forward, transformer_backward]):
    if (index == 0):
        SRC_LANGUAGE = 'de'
        TGT_LANGUAGE = 'en'
    else:
        SRC_LANGUAGE = 'en'
        TGT_LANGUAGE = 'de'

    optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
    train_model(NUM_EPOCHS, transformer, optimizer, SRC_LANGUAGE, TGT_LANGUAGE)

# TODO: this is where the back translation needs to happen. We would create a new back-translated dataset and integrate it into the training loop.

# Function to generate the output sequence using a greedy algorithm,
# i.e. an algorithm that takes the most probable token at every timestep.
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
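
The early-stopping rule inside train_model can be isolated and tried on a plain list of validation losses. This is a sketch under assumptions: the function name epochs_until_early_stop and the loss values are made up for illustration, and no model is trained.

```python
from typing import List, Tuple

def epochs_until_early_stop(val_losses: List[float], patience: int = 2) -> Tuple[int, float]:
    """Replay the patience logic: stop after `patience` consecutive epochs
    without an improvement in validation loss."""
    best_val_loss = float("inf")
    patience_counter = 0
    epochs_run = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0       # improvement: reset the counter
        else:
            patience_counter += 1      # no improvement this epoch
        if patience_counter >= patience:
            break                      # early stopping triggered
    return epochs_run, best_val_loss

# Hypothetical validation losses for seven epochs.
stopped_at, best = epochs_until_early_stop([3.0, 2.5, 2.6, 2.4, 2.45, 2.5, 2.3])
print(f"stopped after epoch {stopped_at}, best validation loss {best}")
```

With patience=2 the run ends before the last value in the list is ever seen, which is the trade-off of a small patience: training stops sooner but may miss a later improvement.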

# Back Translation

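This section implements back translation: each source sentence is translated with one model and then translated back with the other, and the resulting pairs are used as additional training data.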

In [None]:
!pip install torchtext

In [None]:
import torch
from torch.utils.data import DataLoader, IterableDataset
from torchtext.datasets import Multi30k
from torchtext.data.functional import to_map_style_dataset
from torch.nn.utils.rnn import pad_sequence
# Ensure you have the necessary imports for your translate functions and any other utilities you use

BATCH_SIZE = 128
SRC_LANGUAGE = 'en'
TGT_LANGUAGE = 'de'
# Define BOS_IDX, EOS_IDX, and PAD_IDX based on your vocabulary


# Plan for back translation:
#   for each (src, tgt) pair in the data:
#       output = translate(transformer_forward, src)
#       back_translation = translate(transformer_backward, output)
#       keep the sample (src, back_translation)

def back_translate(data_loader, transformer_forward, transformer_backward):
    # data_loader must yield raw (source sentence, target sentence) string pairs, because
    # translate() tokenizes its input itself; batches of padded tensors will not work.
    # transformer_forward must translate SRC_LANGUAGE -> TGT_LANGUAGE, and
    # transformer_backward must translate TGT_LANGUAGE -> SRC_LANGUAGE.
    global SRC_LANGUAGE, TGT_LANGUAGE
    src_lang, tgt_lang = SRC_LANGUAGE, TGT_LANGUAGE
    back_translated = []
    try:
        for src, tgt in data_loader:
            src = src.rstrip("\n")
            # translate() reads the global SRC_LANGUAGE/TGT_LANGUAGE, so point them
            # at the direction of the model that is about to be called.
            SRC_LANGUAGE, TGT_LANGUAGE = src_lang, tgt_lang
            output = translate(transformer_forward, src).strip()
            SRC_LANGUAGE, TGT_LANGUAGE = tgt_lang, src_lang
            back_translation = translate(transformer_backward, output).strip()
            back_translated.append((src, back_translation))
    finally:
        # Restore the globals for the rest of the notebook.
        SRC_LANGUAGE, TGT_LANGUAGE = src_lang, tgt_lang
    return back_translated
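
The round trip performed by back_translate can be illustrated with two toy word-for-word translators in place of the transformer models. Everything in this sketch is hypothetical: the two dictionaries, make_word_translator, and round_trip exist only to show how a sentence can come back slightly changed after passing through the other language.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical word tables standing in for the two trained models.
EN_TO_DE: Dict[str, str] = {"the": "das", "house": "haus", "home": "haus", "small": "klein"}
DE_TO_EN: Dict[str, str] = {"das": "the", "haus": "house", "klein": "small"}

def make_word_translator(table: Dict[str, str]) -> Callable[[str], str]:
    """Build a toy translator that replaces each word it knows and keeps the rest."""
    def translate_words(sentence: str) -> str:
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate_words

def round_trip(sentences: Iterable[str],
               forward: Callable[[str], str],
               backward: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Translate each sentence forward, then back, and pair it with the original."""
    pairs = []
    for sentence in sentences:
        pivot = forward(sentence)                   # source -> target language
        pairs.append((sentence, backward(pivot)))   # target -> source language
    return pairs

pairs = round_trip(["the small home"],
                   make_word_translator(EN_TO_DE),
                   make_word_translator(DE_TO_EN))
print(pairs)
```

Here "home" and "house" share one translation, so the sentence returns as a paraphrase of the original, which is the kind of variation the extra training pairs are meant to add.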

# Re-training

In [None]:
import torchtext

BATCH_SIZE = 128
train_iter = Multi30k(split='train', language_pair=('en', 'de'))

# Build the custom back-translated dataset (a list of (src, back_translation) string tuples).
# translate() works on raw sentences, so pass the raw sentence pairs rather than a DataLoader
# of padded tensors. train_iter yields (English, German) pairs, so the first model must
# translate en -> de (transformer_backward) and the second de -> en (transformer_forward).
backtranslated_data = back_translate(train_iter, transformer_backward, transformer_forward)

# Convert the custom dataset to a DataPipe
backtranslated_iter = IterableWrapper(backtranslated_data)

# Chain the two datasets with CombinedDataset (defined in the training-loop cell), which
# yields the (src, tgt) tuples of the first dataset followed by those of the second.
combined_dataset = CombinedDataset(train_iter, backtranslated_iter)
combine_dataloader = DataLoader(combined_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)
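
The idea behind CombinedDataset is to yield one dataset after the other, and it can be shown with plain Python iterables. In this sketch ChainedPairs and the two tiny lists are hypothetical; it also shows one reason to prefer a small class over itertools.chain: a chain object is used up after one pass, while an object with its own __iter__ can be iterated again on every epoch, as long as the underlying sources can be.

```python
from itertools import chain
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]

class ChainedPairs:
    """Yield every item of the first iterable, then every item of the second."""
    def __init__(self, first: Iterable[Pair], second: Iterable[Pair]):
        self.first = first
        self.second = second

    def __iter__(self) -> Iterator[Pair]:
        yield from self.first
        yield from self.second

# Hypothetical data: one original pair and one synthetic pair.
original = [("a man", "ein Mann")]
synthetic = [("a person", "ein Mann")]

combined = ChainedPairs(original, synthetic)
epoch_1 = list(combined)
epoch_2 = list(combined)        # a fresh pass: __iter__ is called again

one_shot = chain(original, synthetic)
first_pass = list(one_shot)
second_pass = list(one_shot)    # the chain object is already exhausted

print(len(epoch_1), len(epoch_2), len(first_pass), len(second_pass))
```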

In [None]:
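# Retrain the forward (German -> English) model. collate_fn reads the global
# SRC_LANGUAGE/TGT_LANGUAGE, so set them to match the model's direction.
SRC_LANGUAGE, TGT_LANGUAGE = 'de', 'en'
# The optimizer must hold the parameters of the model that is being retrained.
optimizer_backtranslation = torch.optim.Adam(transformer_forward.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

# The epoch count is passed positionally: a keyword argument may not come before positional ones.
train_model(1, transformer_forward, optimizer_backtranslation, SRC_LANGUAGE, TGT_LANGUAGE)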



# Evaluation with BLEU score

We will now evaluate the model with the BLEU score.

In [None]:
import torch

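# Load the saved checkpoints. map_location lets a checkpoint that was saved on a GPU
# be loaded on whatever device is available now.
checkpoint_forward = torch.load('/content/drive/MyDrive/Etruscan_Project/model_best_forward.pth', map_location=DEVICE)
checkpoint_backward = torch.load('/content/drive/MyDrive/Etruscan_Project/model_best_backward.pth', map_location=DEVICE)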

transformer_forward = create_German2English(vocab_transform)
transformer_backward = create_English2German(vocab_transform)

# Load the model state dictionary
transformer_forward.load_state_dict(checkpoint_forward['model_state_dict'])
transformer_backward.load_state_dict(checkpoint_backward['model_state_dict'])

transformer_forward.to(DEVICE)
transformer_backward.to(DEVICE)

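# translate() reads the global SRC_LANGUAGE/TGT_LANGUAGE, so set them to match each model's direction.
# transformer_forward translates German -> English.
SRC_LANGUAGE, TGT_LANGUAGE = 'de', 'en'
print(translate(transformer_forward, "Eine Gruppe von Menschen steht vor einem Iglu ."))

# transformer_backward translates English -> German.
SRC_LANGUAGE, TGT_LANGUAGE = 'en', 'de'
print(translate(transformer_backward, "The boy runs into the house"))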

# predictions = []
# references = []
# results = bleu.compute(predictions=predictions, references=references)
# print(results)
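
Until the predictions and references are collected, the idea behind the score can be shown with a simplified sentence-level BLEU written in plain Python. This is a sketch under assumptions, not the implementation in the evaluate library: sentence_bleu and ngram_counts are hypothetical names, tokens are split on whitespace, there is a single reference, and no smoothing is applied, so any missing n-gram order gives a score of 0.

```python
import math
from collections import Counter
from typing import List

def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every n-gram (as a tuple of tokens) in the sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(prediction: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        pred_counts = ngram_counts(pred_tokens, n)
        ref_counts = ngram_counts(ref_tokens, n)
        # Clip each predicted n-gram count by how often it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(overlap / total)
    # Penalize predictions that are shorter than the reference.
    if len(pred_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(pred_tokens))
    return brevity_penalty * math.exp(log_precision_sum / max_n)

reference = "the boy runs into a house"
exact = sentence_bleu("the boy runs into a house", reference)
close = sentence_bleu("the boy runs into the house", reference)
unrelated = sentence_bleu("a group of people stand outside", "the boy runs into a house .")
print(f"exact {exact:.3f}, close {close:.3f}, unrelated {unrelated:.3f}")
```

A library implementation would normally be preferred for reported results, since it scores a whole corpus at once and handles tokenization consistently.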