# Machine translation Nahuatl - Spanish

Data from Axolotl parallel corpus: https://axolotl-corpus.mx/

COURSE PROJECT LT2326

Oct 2021

Klara Båstedt

### Part 1 - Data preparation

The data is a parallel corpus of Spanish and Nahuatl texts with almost 18,000 sentence pairs.

Nahuatl spelling is normalized with the elotl package (using its `sep` scheme): https://pypi.org/project/elotl/

In [1]:
import csv
import string
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from torchtext.data import Field, BucketIterator, TabularDataset
import numpy as np
import pandas as pd
import elotl.corpus
import elotl.nahuatl.orthography
import random

In [None]:
pip install elotl

In [2]:
hyperparameters = {'epochs':3,
                   'batch_size':16,
                   'embedding_size':128,
                   'hidden_size':1024,
                   'learning_rate':0.001,
                   'num_layers':2,
                   'dropout':0.5}

In [3]:
corpus = pd.read_csv("Axolotl.csv")

In [4]:
def normalize_nahuatl(x):
    n = elotl.nahuatl.orthography.Normalizer("sep")
    return n.normalize(x)

def remove_punct(x):
    # Strip ASCII punctuation plus the Spanish inverted question mark, then lowercase.
    # The set is built locally so that string.punctuation itself is never modified.
    exclude = string.punctuation + '¿'
    return x.translate(str.maketrans('', '', exclude)).lower()
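
As a quick check of the cleaning step, the sketch below applies the same idea to one of the corpus sentences. The helper name `strip_punct` is hypothetical; it mirrors `remove_punct` for illustration and does not replace it.

```python
import string

# ASCII punctuation plus the Spanish inverted question mark
EXCLUDE = string.punctuation + "¿"

def strip_punct(text):
    """Remove every character in EXCLUDE and lowercase the result."""
    return text.translate(str.maketrans("", "", EXCLUDE)).lower()

cleaned = strip_punct("¿Y lo viste?")
print(cleaned)
```

Accented letters such as `í` and `é` are untouched because they are not in `string.punctuation`.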

In [5]:
corpus['Nah'] = corpus['Nah'].apply(normalize_nahuatl)
corpus

Unnamed: 0,Esp,Nah
0,amanteca yoyauan quiere decir sus enemigos son...,amanteka toyauan kn iyaoan in akike in kanin o...
1,anomatia quiere decir No delante de mí se habl...,anomatia kn kitos neki nixpan in omito yauyutl...
2,Aquí estás hijo mío chaval ha quedado satisfec...,ka nikan tika nopiltsé xolé ka o toyolo on pac...
3,Aquí estás hijo mío hijito mío muchacho mío mi...,ka nikan tonka nopiltsé nopiltsiné notelpuchtl...
4,Aquí estás ¿qué es lo que piensas?,ka nikan tika tle tikmati?
...,...,...
17890,¿Y cómo lo voy a hacer yo solo? le dijo yo,¿uan kenijki nijchiuas na noseltik? kiilui
17891,¿Y es cierto eso?,¿uan melauak?
17892,¿Y lo viste?,uan tikitak?
17893,¿Y qué cosa y por qué le tocó?,uan tlaketl uan kenke kena ma kitokaro?


In [6]:
corpus['Esp'] = corpus['Esp'].apply(remove_punct)
corpus['Nah'] = corpus['Nah'].apply(remove_punct)
corpus

Unnamed: 0,Esp,Nah
0,amanteca yoyauan quiere decir sus enemigos son...,amanteka toyauan kn iyaoan in akike in kanin o...
1,anomatia quiere decir no delante de mí se habl...,anomatia kn kitos neki nixpan in omito yauyutl...
2,aquí estás hijo mío chaval ha quedado satisfec...,ka nikan tika nopiltsé xolé ka o toyolo on pac...
3,aquí estás hijo mío hijito mío muchacho mío mi...,ka nikan tonka nopiltsé nopiltsiné notelpuchtl...
4,aquí estás qué es lo que piensas,ka nikan tika tle tikmati
...,...,...
17890,y cómo lo voy a hacer yo solo le dijo yo,uan kenijki nijchiuas na noseltik kiilui
17891,y es cierto eso,uan melauak
17892,y lo viste,uan tikitak
17893,y qué cosa y por qué le tocó,uan tlaketl uan kenke kena ma kitokaro


In [7]:
corpus = shuffle(corpus)
corpus.reset_index(inplace=True, drop=True)

In [8]:
train_df = corpus[:16995]
val_df = corpus[16995:17445]
test_df = corpus[17445:]
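
The slice boundaries above are hard-coded for this corpus size. A minimal sketch of deriving them from the number of rows instead could look like this. The function name `split_bounds` and the 2.5% fractions are illustrative assumptions, not values taken from the notebook.

```python
def split_bounds(n_rows, val_frac=0.025, test_frac=0.025):
    """Return (train_end, val_end) slice boundaries for a shuffled dataset."""
    val_size = int(n_rows * val_frac)
    test_size = int(n_rows * test_frac)
    train_end = n_rows - val_size - test_size
    val_end = train_end + val_size
    return train_end, val_end

train_end, val_end = split_bounds(1000)
# rows [0:train_end] -> train, [train_end:val_end] -> val, [val_end:] -> test
print(train_end, val_end)
```

With a DataFrame the three parts would then be `corpus[:train_end]`, `corpus[train_end:val_end]` and `corpus[val_end:]`.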

In [9]:
train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
test_df.to_csv("test.csv", index=False)

In [10]:
train_df

Unnamed: 0,Esp,Nah
0,los quales por relacion dixeron que manifestav...,okitoke tikkakiltiya in justicia ipampa in omo...
1,se dirigieron hacia el rumbo del rostro del sol,in tonatiu ixkopa itstiake
2,en la ciudad de mexico a trece dias del mes de...,in ipan siudad de mexiko a trese dias del mes ...
3,y sabe que la dicha tierra los dichos gobernad...,au in axkan ka intlal inik moteiluia ka imaxka...
4,el nombre de este papa es juan pablo ii,itoka inin uei teopixkatajtsin inik ome ioanne...
...,...,...
16990,se vende en cuetzalan y también por aquí,monamaka kuesalan uan no nikauín
16991,y,au
16992,también entonces los mexicas derrotaron a los ...,iuan ikuak kinpeuke istaktlalokan tlaka
16993,era natural de españa donde nació en la villa ...,inin chane españa in onkan uel motlakatili ipa...


In [11]:
val_df

Unnamed: 0,Esp,Nah
16995,luego fueron creciendo las aguas y hubo inunda...,au san niman ik iu ueix makok in atl tlaapachi...
16996,cómo andan sus negocios bien gracias,kenin ya in monechikolkuali tlasojkamati
16997,la luna se situó delante del sol desapareció c...,ka in yeuatl metstli ixpan momanka in tonatiu ...
16998,así pues sólo estos dos regresaron como ya se ...,au sa omentin in ualmokuepato ye omito yeuatl ...
16999,en este año maxtlaton lloró mucho después se f...,au in ipan in xiuitl senka ye choka in maxtlat...
...,...,...
17440,mucho desea las flores mi corazón sólo me afli...,kinenejki xochitl ka nololo san kuikanentlamat...
17441,pero él no le hizo caso se fue y finalmente se...,amo tetlakaman san ka uis yekene ye poliutiu
17442,vendrán a hacernos saber algo motecuhzoma neza...,in kuix ok techmatikiujin moteuksomatsin in ne...
17443,cuando se levanta un temascal se construye su ...,ijkuak mokalketsa in temaskali kichiuiliaj iku...


In [12]:
test_df

Unnamed: 0,Esp,Nah
17445,yo francisco martin alguacil mayor nombrado po...,neuatl alguasil mayor fransisko martin nunbrad...
17446,es un cuento largo ya no lo sé bien,ueyi kuentos noikí ayok mero nijmati kuali
17447,se llama también quinehuayan recibe este nombr...,iua itokayokan kineuayan inik moteneua kineuay...
17448,en este año se cumplieron años desde su partida,in ipan in xiuitl onkan kichiua matlakpoualxiu...
17449,de mala manera fueron muertos y quien los mand...,amo kuali ik miktiloke euatl kimikti in askapo...
...,...,...
17890,las abejas insuperables trabajadores entre los...,in yoyolimej motekiajkokixtia intsalan sasayol...
17891,en este mismo año con el trato de westfalia en...,sanyeno ijkuak uestfalennenonotsaltika ompa te...
17892,aquí entrego a la tierra estas semillas de maíz,nikan nimaktia taltikpaktsin nejin semilas
17893,al amanecer los llevaron encadenados y los ata...,au sa ualatui in kiualuikake ualilpitiake kimo...


In [13]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [14]:
def get_data():
    whitespacer = lambda x: x.split(' ')

    SPANISH = Field(
        tokenize=whitespacer,
        lower=True,                   
        batch_first=False,
        init_token='<start>',
        eos_token='<end>')
    
    NAHUATL = Field(
        tokenize=whitespacer,
        lower=True,                   
        batch_first=False,
        init_token='<start>',
        eos_token='<end>')
    
    train, val, test = TabularDataset.splits(
                        path = './',
                        train = 'train.csv',
                        validation = 'val.csv',
                        test = 'test.csv',
                        format = 'csv',
                        fields = [('spanish', SPANISH), ('nahuatl', NAHUATL)],
                        skip_header = True)
    
    SPANISH.build_vocab(train, val)
    NAHUATL.build_vocab(train, val)

    
    train_iter = BucketIterator(
        train,                                                  
        batch_size=hyperparameters['batch_size'],
        sort_within_batch=True,
        sort_key=lambda x: (len(x.nahuatl)),
        shuffle=True,                                                  
        device=device
    )
    
    val_iter = BucketIterator(
        val,                                                  
        batch_size=hyperparameters['batch_size'],
        sort_within_batch=True,
        sort_key=lambda x: (len(x.nahuatl)),
        shuffle=True,                                                  
        device=device
    )
                
    test_iter = BucketIterator(
        test,                                                  
        batch_size=hyperparameters['batch_size'],
        sort_within_batch=True,
        sort_key=lambda x: (len(x.nahuatl)),
        shuffle=True,
        device=device
    )

    return train_iter, val_iter, test_iter, NAHUATL, SPANISH

SPANISH.build_vocab(train, val, max_size=10000, min_freq=2)
NAHUATL.build_vocab(train, val, max_size=10000, min_freq=2)

With min_freq=2, len(SPANISH.vocab) and len(NAHUATL.vocab) are 13982 and 16399. With min_freq=1 they grow to 27313 and 49302.
Nahuatl is agglutinative while Spanish is not, so Nahuatl has many more distinct word forms, and a large share of them occur only once. This explains why the Nahuatl vocabulary grows so much more when the threshold is lowered. I don't think min_freq=2 works well when one language is agglutinative: the network needs access to as many Nahuatl "words" as possible. For the same reason it is not a good idea to force the two vocabularies to the same size.
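
The effect of a frequency threshold can be illustrated without torchtext. The sketch below counts whitespace tokens in three cleaned Nahuatl sentences from the corpus preview and keeps those that reach `min_freq`. The function `build_vocab` here is a hypothetical stand-in, not the torchtext method, and it ignores special tokens such as `<start>` and `<end>`.

```python
from collections import Counter

def build_vocab(sentences, min_freq=1):
    """Collect whitespace tokens that occur at least min_freq times."""
    counts = Counter(tok for s in sentences for tok in s.split(" "))
    return {tok for tok, c in counts.items() if c >= min_freq}

# Three Nahuatl sentences as they appear after cleaning
nahuatl = ["uan melauak", "uan tikitak", "ka nikan tika tle tikmati"]

vocab_all = build_vocab(nahuatl, min_freq=1)
vocab_frequent = build_vocab(nahuatl, min_freq=2)
print(len(vocab_all), len(vocab_frequent))
```

Even in this tiny sample only `uan` survives the threshold; every other word form would be mapped to the unknown token.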

  

In [15]:
train_iter, val_iter, test_iter, NAHUATL, SPANISH = get_data()

In [16]:
for i, batch in enumerate(train_iter):
    print(batch.nahuatl)

tensor([[    2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2],
        [27419,   165,  1028,    97, 37163, 10770, 13766, 27540,   224,   163,
         20978, 40923, 14390,  3323,  6005, 13577],
        [    3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3]])
...

In [17]:
print(NAHUATL.vocab.stoi["auakatl"])

17579


In [18]:
print(NAHUATL.vocab.itos[17656])

auiyen


In [19]:
print(SPANISH.vocab.stoi["aguacate"])

4372


In [20]:
print(SPANISH.vocab.itos[5543])

admiraron


In [21]:
len(SPANISH.vocab)

27254

In [22]:
len(NAHUATL.vocab)

49185

### Part 2 - Building and training the model

In [None]:
# Try including validation in the training loop and stop training when the loss no longer decreases
# Try including an example sentence and see how its translation improves with each epoch
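
One possible shape for the stopping rule mentioned in the note is a small patience counter. This is only a sketch: the class name `EarlyStopper`, the patience value and the loss values are made up for illustration and are not results from this model.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            # New best loss: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, one per epoch, purely for illustration
losses = [5.1, 4.6, 4.4, 4.5, 4.7, 4.3]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(losses, start=1):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
print(stopped_at)
```

In a real training loop `val_loss` would be the average loss over `val_iter`, computed after each epoch with gradients disabled.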

In [23]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, num_layers, drop):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = nn.Dropout(drop)
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, dropout=drop)

    def forward(self, x):
        # shape of x: (length, batchsize)
        x1 = self.embedding(x)
        # shape of x: (length, batchsize, embsize)
        x2 = self.dropout(x1)
        output, (hidden, cell) = self.lstm(x2)

        return hidden, cell

In [24]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, num_layers, drop):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.num_layers = num_layers
        self.dropout = nn.Dropout(drop)
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, dropout=drop)
        self.fc = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, hidden, cell):
        # shape of x: (N,), but the LSTM expects (1, N): a sequence of length 1
        x = x.unsqueeze(0)
        x2 = self.embedding(x)
        x3 = self.dropout(x2)
        output, (hidden, cell) = self.lstm(x3, (hidden, cell))
        x4 = self.fc(output)
        x4 = x4.squeeze(0)
        return x4, hidden, cell

In [25]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, source, target, teacher_force_ratio=0.5):
        batch_size = source.shape[1]
        target_length = target.shape[0]
        target_vocab_size = self.decoder.vocab_size
        sentence = torch.zeros(target_length, batch_size, target_vocab_size).to(device)
        hidden, cell = self.encoder(source)
        x = target[0]
        
        for t in range(1, target_length):
            
            output, hidden, cell = self.decoder(x, hidden, cell)
            sentence[t] = output
            teacher_force = random.random() < teacher_force_ratio
            predicted_word = output.argmax(1) 
            x = target[t] if teacher_force else predicted_word
            
        return sentence

In [None]:
encoder = Encoder(len(NAHUATL.vocab), 
                  hyperparameters["embedding_size"], 
                  hyperparameters["hidden_size"], 
                  hyperparameters["num_layers"],
                  hyperparameters["dropout"]).to(device)

decoder = Decoder(len(SPANISH.vocab), 
                  hyperparameters["embedding_size"], 
                  hyperparameters["hidden_size"], 
                  hyperparameters["num_layers"], 
                  hyperparameters["dropout"]).to(device)
                            
seq2seq = Seq2Seq(encoder, decoder).to(device)

# Exclude padding positions from the loss
pad_idx = SPANISH.vocab.stoi[SPANISH.pad_token]
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = optim.Adam(
    seq2seq.parameters(),
    lr=hyperparameters['learning_rate']
)

# start training loop
seq2seq.train()
for epoch in range(hyperparameters['epochs']):
    total_loss = 0  # reset so the running average covers this epoch only
    for i, batch in enumerate(train_iter):
        source = batch.nahuatl.to(device)
        target = batch.spanish.to(device)
        optimizer.zero_grad()
        output = seq2seq(source, target)
        # shape (trglength, batchsize, outputdim); position 0 (<start>) is skipped
        loss = loss_fn(output[1:].reshape(-1, output.shape[2]), target[1:].reshape(-1))
        total_loss += loss.item()
        print(f"Loss in epoch {epoch+1} is: {total_loss/(i+1):.4f}", end='\r')
        loss.backward()
        optimizer.step()
    print()  # keep each epoch's final running average on its own line
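
Excluding padding from the loss means averaging only over target positions that hold a real token, which is what the `ignore_index` argument of `nn.CrossEntropyLoss` does. The pure-Python sketch below shows the arithmetic on a toy example. The function name `masked_cross_entropy`, the three-token vocabulary and the choice of index 0 as padding are assumptions made for illustration.

```python
import math

def masked_cross_entropy(log_probs, targets, pad_idx):
    """Mean negative log-likelihood over positions whose target is not pad_idx.

    log_probs: one list of log-probabilities over the vocabulary per position
    targets:   the gold token index for each position
    """
    losses = [-lp[t] for lp, t in zip(log_probs, targets) if t != pad_idx]
    return sum(losses) / len(losses)

# Toy predictions over a vocabulary of 3 tokens; index 0 is assumed to be padding
log_probs = [
    [math.log(0.2), math.log(0.3), math.log(0.5)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.9), math.log(0.05), math.log(0.05)],
]
targets = [2, 1, 0]  # the last position is padding and is left out

loss = masked_cross_entropy(log_probs, targets, pad_idx=0)
print(round(loss, 4))
```

Without the mask, the confidently predicted padding position would pull the average down and make the model look better than it is.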

In [None]:
torch.save(seq2seq.state_dict(), "|".join([f"{k}_{v}" for k, v in hyperparameters.items()]))

### Part 3 - Evaluating the model

In [None]:
# bleu score
# back translation
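
For the BLEU idea, a simplified sentence-level version can be sketched in plain Python: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. This is an unsmoothed, single-reference illustration. The names `ngrams` and `sentence_bleu` and the default `max_n=2` are assumptions, and reported scores are normally corpus-level BLEU from an established toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=2):
    """Unsmoothed BLEU for one reference/hypothesis pair of token lists."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: hypotheses shorter than the reference are penalized
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "y es cierto eso".split()
exact = sentence_bleu(reference, "y es cierto eso".split())
shorter = sentence_bleu(reference, "y es cierto".split())
print(exact, round(shorter, 4))
```

The shorter hypothesis has perfect n-gram precision but is still penalized for dropping a word, which is the purpose of the brevity penalty.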

In [None]:
seq2seq.eval()
total_loss = 0

with torch.no_grad():
    for i, batch in enumerate(test_iter):
        source = batch.nahuatl.to(device)
        target = batch.spanish.to(device)
        # no teacher forcing at evaluation time
        output = seq2seq(source, target, teacher_force_ratio=0)
        # shape (trglength, batchsize, outputdim)
        loss = loss_fn(output[1:].reshape(-1, output.shape[2]), target[1:].reshape(-1))
        total_loss += loss.item()
        print(f"Test loss is: {total_loss/(i+1):.4f}", end='\r')
        
    

### References

https://axolotl-corpus.mx/