---
title: "Building a Transformer-based Translator: Part 1"
author: "Dominic Leon Culver"
date: "2024-10-14"
draft: true
categories: [transformers, machine-learning, deep-learning, model-implementation]
bibliography: references.bib
format:
    html:
        code-fold: true
---

Hello there! Welcome to my blog! This is my first post, and I am excited to share it with you. In this series, I'll be walking you through a project where I built and trained a **transformer-based translator** from scratch. It was a challenging (and occasionally very frustrating) project, but also highly rewarding, and I learned a ton from doing it.

I tackled this project for a few reasons.

1. **Deepening my understanding of Transformers**. I've used them in other projects (e.g. sentiment analysis with BERT), but I wanted to take a closer look under the hood and really understand how they work internally. 
2. **Improving my PyTorch skills**. I've built models in PyTorch before and loved its flexibility, but I wanted to push my skills further by building something more complex from scratch. 
3. **Aspiring to do research**. I am very interested in the current research landscape surrounding LLMs, and I'd love to eventually contribute to this area. To do that, a strong understanding of transformers and how they work is essential.

When I started this project, I thought it would be fairly straightforward. That is *not* what happened, so I will be breaking this series into three parts. In this post, I will cover the history and ideas around machine translation in general and discuss the motivations behind the attention mechanism. In the next post, I will dive into the specifics of the original *Attention is All You Need* [@Vaswani:2017aa] paper and walk through how to turn it into actual code. After that, I will discuss the training process, model evaluation, and what I'd do differently if I tackled this project again. I'll also talk about some of the services I used throughout the project.

# Introduction

In today's post we will look at the context in which transformers appeared. In particular, we will discuss the main ideas behind the transformer architecture before diving into the actual code. Today's post will focus on:

1. How to regard language translation as a machine learning task. 
2. General approaches to machine translation architecture. 
3. Tokenization.
4. The advent of the "attention mechanism".
5. Challenges with traditional machine translation architectures and the advantages of transformers.

By going through these points, I hope to give you enough context to understand what transformers are trying to do and why they garnered as much excitement as they did.

# Translation as a Machine Learning Task

Before touching transformers, it is natural to ask why we should believe that translating a sentence from one language into another can be solved using machine learning at all. Language is complicated, and different languages often express the same idea in very different ways. Take English and Japanese, for example. English generally uses Subject-Verb-Object (SVO) word order, e.g. "The boy throws the ball". Here "boy" is the subject, "throws" is the verb, and "ball" is the object. Japanese uses Subject-Object-Verb (SOV) order instead, so our sentence becomes:

>**Japanese Sentence**:  
>少年はボールを投げる。  
>"Shōnen wa bōru o nageru"  
>**English Word-by-Word**:  
>"Boy ball throws"

This example also shows that Japanese has no articles like "the" or "a". This is only a small sample of the differences that exist across human languages. Some are linguistic, like SVO versus SOV word order, while others are cultural: idioms, jokes, and proverbs often do not translate well. Suffice it to say, this makes modeling translation a difficult task. That being said, we can still make reasonable assumptions that help us decide whether translation is something a machine learning algorithm can solve.

First, we will assume that there is a source language `src_lang` and a target language `tgt_lang`, i.e. a language $X$ whose sentences we hope to translate into a language $Y$. As anyone who has learned another language can tell you, translation is not simply a matter of taking each word and finding its counterpart in the target language. Indeed, there is often no direct mapping. For instance, the English verb "to get" can be used in several different ways, e.g. "I am getting something from the store" or "I got a present". In German, these would be translated with two separate verbs: "holen" in the first case and "bekommen" in the second.

Despite these difficulties that exist in natural language, there is a long history in ML of using probabilities to predict the next word in a string. For example, 

> The sky was ...

The next word could be "blue" or "clear" or "cloudy", but probably not "wet" or "discrete". We would say that "blue" is more _probable_ than "discrete". In NLP we often think of the task of next-word prediction as approximating the probability
$$
    P(w_t | w_{t-1}, \ldots, w_0)
$$
i.e. the probability of the next word appearing given the previous words. In fact, [@jm3] define a _language model_ as a model which assigns a probability to an entire sentence. Perusing [@jm3], one sees that many of the models people have tried over the years essentially boil down to better and better ways of approximating these conditional probabilities.
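
To make this concrete, here is a minimal sketch of the simplest possible approximation: estimating the probability of the next word from counts, while only looking at the single previous word (a bigram model). The tiny corpus and the helper name `next_word_probs` are made up for illustration, and this is not the approach used later in the series.

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "the sky was blue",
    "the sky was clear",
    "the sky was blue today",
    "the road was wet",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Estimate P(next word | previous word) by relative frequency."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = next_word_probs("was")
print(probs)  # {'blue': 0.5, 'clear': 0.25, 'wet': 0.25}
```

Truncating the history to one word is a drastic simplification of $P(w_t | w_{t-1}, \ldots, w_0)$, but it shows the basic idea: a language model is something that turns context into a probability distribution over the next word.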

When dealing with the subject of translating from a source language to a target language, we can think of it in much the same way. We think of the source sentence as a sequence of words
$$
\mathbf{x} = (x_0, \ldots, x_S)
$$
and the target translation as a sequence of words
$$
\mathbf{y} = (y_0, \ldots, y_T).
$$
We can think of the translation task as a sequence of probability estimations: for each $t$ between 0 and $T$, we wish to predict the next word $y_t$ by choosing the word that maximizes the probability
$$
P(y_t | y_{t-1}, \ldots, y_0, x_S, \ldots, x_0).
$$
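
By the chain rule of probability, multiplying these per-word conditional probabilities over all $t$ gives the probability of the whole translation given the source sentence. The sketch below scores two candidate translations this way, summing logarithms instead of multiplying. The lookup table inside `toy_cond_prob` is entirely hypothetical and, for brevity, ignores the source sentence; in a real system this function would be a trained model.

```python
import math


def sequence_log_prob(cond_prob, source, target):
    """Sum log P(y_t | y_0..y_{t-1}, source) over the target sequence (chain rule)."""
    total = 0.0
    for t, word in enumerate(target):
        total += math.log(cond_prob(word, tuple(target[:t]), tuple(source)))
    return total


def toy_cond_prob(word, prefix, source):
    """Hypothetical hand-written conditional model, invented for illustration only."""
    table = {
        (): {"ich": 0.9, "bin": 0.1},
        ("ich",): {"bin": 0.8, "ich": 0.2},
        ("ich", "bin"): {"mathematiker": 0.7, "bin": 0.3},
    }
    # Anything not in the table gets a tiny probability instead of zero.
    return table.get(prefix, {}).get(word, 1e-6)


source = ["i", "am", "a", "mathematician"]
good = sequence_log_prob(toy_cond_prob, source, ["ich", "bin", "mathematiker"])
bad = sequence_log_prob(toy_cond_prob, source, ["bin", "ich", "mathematiker"])
print(good > bad)  # True
```

Working with log-probabilities is a common design choice because products of many small probabilities quickly underflow, while sums of logarithms stay well behaved.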


# Encoder-Decoder Architecture

We have established at this point that we can indeed use machine learning as a way of doing translation. In this section we describe a very common overarching architecture for this task, the _encoder-decoder_ architecture. This architecture was presented in [@Sutskever:2014aa], using RNNs rather than transformers. The discussion in that paper is excellent, but I will summarize it briefly.

As we've mentioned, the task of machine translation is to predict conditional probabilities. However, as the authors of [@Sutskever:2014aa] say:

> Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and
> outputs is known and fixed.

Here, dimensionality refers to the actual sequence length. When this paper came out, it was common for neural networks to work on sequence tasks where the input and the output have the same _length_, but we cannot assume that in machine translation. For example, "I am a mathematician" (four words) in German would be "Ich bin Mathematiker" (three words). To work around this, the authors suggested using two RNNs (LSTMs more specifically):

1. **Encoder**: this RNN is responsible for ingesting the source sentence and producing a _vector representation_ of it, known as the _context vector_.
2. **Decoder**: this RNN is responsible for taking in the context vector from the encoder and iteratively working through the words of the target sentence, predicting the next word from the context vector and the words produced so far.

It's easier to understand this with some formulas. The specific details of an RNN are not relevant for the moment. What matters is that an RNN is a kind of function $f(h, x)$: given a sequence of inputs $\mathbf{x} = (x_0, \ldots, x_T)$ and an initial state $h_{-1}$ (often the zero vector), we construct a sequence of _hidden state_ vectors $(h_0, \ldots, h_T)$ by
$$
h_t = f(h_{t-1}, x_t).
$$ {#eq-hidden-state-update}
The network typically learns in such a way that the vector $h_t$ encodes pertinent aspects of the sequence up to time $t$. An RNN by itself is good at predicting the next time step, and in conjunction with a feed-forward network it can be used to predict sequences $\mathbf{y} = (y_0, \ldots, y_T)$ of the _same_ length as $\mathbf{x}$.
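
The recurrence in @eq-hidden-state-update can be sketched in a few lines of plain Python. The particular cell used here, $f(h, x) = \tanh(W_h h + W_x x)$ with random untrained weights, is an assumption chosen for simplicity (the text above deliberately leaves $f$ unspecified), and the names `make_rnn_cell` and `run_rnn` are invented for this example.

```python
import math
import random


def make_rnn_cell(hidden_size, input_size, seed=0):
    """Build f(h, x) = tanh(W_h h + W_x x) with small random weights (an assumed, simple cell)."""
    rng = random.Random(seed)
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]

    def f(h, x):
        return [
            math.tanh(
                sum(w * h_j for w, h_j in zip(W_h[i], h))
                + sum(w * x_j for w, x_j in zip(W_x[i], x))
            )
            for i in range(hidden_size)
        ]

    return f


def run_rnn(f, inputs, h_init):
    """Apply h_t = f(h_{t-1}, x_t) along the sequence and collect every hidden state."""
    hidden_states = []
    h = h_init
    for x in inputs:
        h = f(h, x)
        hidden_states.append(h)
    return hidden_states


f = make_rnn_cell(hidden_size=3, input_size=2)
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # four made-up input vectors
states = run_rnn(f, inputs, h_init=[0.0, 0.0, 0.0])
print(len(states), len(states[-1]))  # 4 3
```

Note that there is exactly one hidden state per input, which is why this setup naturally produces outputs of the same length as the input.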

The encoder-decoder architecture proposed in [@Sutskever:2014aa] gets around this same-length restriction by using two RNNs, denoted $f_1$ and $f_2$. In [@Sutskever:2014aa] both are LSTMs (but again we needn't worry about this detail). Let $\mathbf{x} = (x_0, \ldots, x_T)$ and $\mathbf{y} = (y_0, \ldots, y_S)$ be two sequences, possibly of different lengths. The responsibility of the first RNN $f_1$ is to _encode_ the source sequence $\mathbf{x}$. This is done iteratively as in @eq-hidden-state-update, building up a sequence of hidden state vectors $(h_0^{\mathrm{enc}}, \ldots, h_T^{\mathrm{enc}})$ via
$$
h_t^{\mathrm{enc}} = f_1(h_{t-1}^{\mathrm{enc}}, x_t).
$$ {#eq-hidden-state-update-encoder}
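In [@Sutskever:2014aa] the output of the encoder is the final hidden state $c := h_T^{\mathrm{enc}}$, sometimes referred to as the _context vector_. To generate an output sequence $\mathbf{y}$, the decoder is initialized from $c$: a common setup is to set its initial hidden state $h_0^{\mathrm{dec}}$ to $c$ and to take $y_0$ to be a special start-of-sequence token. The second RNN $f_2$ then predicts the next term in the sequence $\mathbf{y}$ via equations such as
$$
h_s^{\mathrm{dec}} = f_2(h_{s-1}^{\mathrm{dec}}, y_{s-1}, c), \qquad y_s = \operatorname*{arg\,max} \ \mathrm{softmax}( W^{\mathrm{proj}} h_s^{\mathrm{dec}} )
$$ {#eq-decoder}
where $h_s^{\mathrm{dec}}$ is the hidden state vector produced by the second RNN and $W^{\mathrm{proj}}$ is a learned matrix that projects it onto scores over the target vocabulary. Depending on the variant, $c$ is used only to initialize the decoder or is also passed in at every step, as written here. The second RNN terminates its prediction once a particular token (usually denoted EOS for "end of sequence") is predicted.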
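
To see how the encoder and decoder fit together, here is a rough, untrained sketch of the whole loop in plain Python. Several things in it are assumptions made for illustration rather than details from [@Sutskever:2014aa]: both RNNs are simple $\tanh$ cells instead of LSTMs, the decoder starts from the hidden state $c$ with a `<sos>` start token, words are fed in as one-hot vectors, and the vocabulary, sizes, and function names are all invented. Because the weights are random, the "translation" it prints is meaningless; only the control flow matters.

```python
import math
import random

rng = random.Random(0)
VOCAB = ["<sos>", "<eos>", "ich", "bin", "mathematiker"]  # toy target vocabulary (made up)
HIDDEN = 4


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def make_cell(input_size):
    """Return f(h, x) = tanh(W_h h + W_x x), a simple stand-in for an LSTM."""
    W_h, W_x = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, input_size)

    def f(h, x):
        return [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_x, x))]

    return f


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode(f1, source_vectors):
    """Run the encoder and return its final hidden state, the context vector c."""
    h = [0.0] * HIDDEN
    for x in source_vectors:
        h = f1(h, x)
    return h


def greedy_decode(f2, W_proj, c, max_len=10):
    """Start from hidden state c and <sos>; pick the argmax word until <eos> or max_len."""
    h = c
    prev = VOCAB.index("<sos>")
    output = []
    for _ in range(max_len):
        # Decoder input: previous word (one-hot) concatenated with c.
        h = f2(h, one_hot(prev, len(VOCAB)) + c)
        probs = softmax(matvec(W_proj, h))
        prev = max(range(len(VOCAB)), key=lambda i: probs[i])
        if VOCAB[prev] == "<eos>":
            break
        output.append(VOCAB[prev])
    return output


f1 = make_cell(input_size=3)                    # encoder over 3-dim source vectors
f2 = make_cell(input_size=len(VOCAB) + HIDDEN)  # decoder sees previous word and c
W_proj = rand_matrix(len(VOCAB), HIDDEN)        # projects hidden state to vocabulary scores

source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]  # made-up source vectors
c = encode(f1, source)
translation = greedy_decode(f2, W_proj, c)
print(translation)
```

The `max_len` cap is a practical safeguard: an untrained (or badly trained) decoder may never predict `<eos>`, so the loop needs another way to stop. Notice also that the output length is decided by the decoder alone and is independent of the number of source vectors.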

While [@Sutskever:2014aa] were able to attain state-of-the-art results in neural translation at the time, people noticed certain limitations in their architecture. Namely, because the context vector $c$ has a fixed size, if the source sequence is very long compared to the dimension of $c$, then $c$ cannot encode enough detail about the input sequence to give adequate translations. This was observed, in particular, by [@Bahdanau:2014aa]. To address this issue, they introduced the attention mechanism, which we discuss in the next section.

# The Attention Mechanism

# Challenges with Traditional Architectures