**Exploring the Dataset**

We use the Multi30k dataset for English-German translation.

In [1]:
from torchtext.datasets import multi30k, Multi30k

multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

data = Multi30k()




In [2]:
train_data, val_data, test_data = data

In [3]:
train_data = list(train_data)
val_data = list(val_data)

In [4]:
train_data[0]

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Two young, White males are outside near many bushes.')

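**How do we turn a text corpus into fixed-length vectors?**

With Byte-Pair Encoding (BPE): each sentence is split into subword tokens, each token is mapped to an integer ID, and the resulting sequence is padded to a fixed length. Let's implement it.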
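To make the idea concrete, here is a minimal sketch of how BPE merge rules can be learned. It is not the `BytePairEncoder` used below (that class lives in a separate module and is not shown here). The function name `learn_bpe_merges` and the toy corpus are assumptions made for illustration. The sketch starts from single characters and repeatedly merges the most frequent adjacent pair of symbols.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of sentences."""
    # Represent each word as a tuple of symbols, starting from single characters.
    word_counts = Counter(
        tuple(word) for sentence in corpus for word in sentence.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_counts = Counter()
        for symbols, count in word_counts.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
    return merges


# Toy corpus, hypothetical and unrelated to Multi30k.
toy_corpus = ["low low low", "lower lowest"]
toy_merges = learn_bpe_merges(toy_corpus, num_merges=2)
print(toy_merges)
```

Each learned merge adds one new symbol to the vocabulary, so the number of merges controls the vocabulary size. A real encoder would also handle whitespace, special tokens, and the mapping from symbols to integer IDs.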

In [5]:
def extract_corpus_from_data(data, language, n_sentences):
    """Return the first n_sentences sentences in the given language."""
    # Each item in data is a (german, english) pair.
    index = 0 if language == "german" else 1
    return [pair[index] for pair in data[:n_sentences]]

N_SENTENCES = 1000
english_corpus = extract_corpus_from_data(train_data, "english", N_SENTENCES)
german_corpus = extract_corpus_from_data(train_data, "german", N_SENTENCES)

In [6]:
from byte_pair_encoding import BytePairEncoder

encoder_en = BytePairEncoder("english", max_vocab_size=300, use_start_token=True, use_end_token=True, use_padding_token=True, max_token_len=50)

In [11]:
english_corpus[:5]

['Two young, White males are outside near many bushes.',
 'Several men in hard hats are operating a giant pulley system.',
 'A little girl climbing into a wooden playhouse.',
 'A man in a blue shirt is standing on a ladder cleaning a window.',
 'Two men are at the stove preparing food.']

In [8]:
encoder_en.learn_vocabulary_from_corpus(english_corpus)

100%|██████████| 202/202 [00:59<00:00,  3.38it/s]


In [10]:
encoder_en.encode_corpus(["I like to eat apples."] * 2)

array([[  0,  41,   1, 252,  75, 100, 219,  69, 204,  65,  80,  80, 155,
        169,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,
         95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,
         95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95],
       [  0,  41,   1, 252,  75, 100, 219,  69, 204,  65,  80,  80, 155,
        169,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,
         95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95,
         95,  95,  95,  95,  95,  95,  95,  95,  95,  95,  95]])
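Each row above has 50 entries, matching `max_token_len=50`, and ends in a long run of the same ID, which looks like padding. The sketch below shows one way such fixed-length rows could be built. It is an assumption about the general technique, not the actual `BytePairEncoder` code. The helper name `pad_token_ids` and the special-token IDs are hypothetical.

```python
import numpy as np


def pad_token_ids(token_ids, max_len, start_id, end_id, pad_id):
    """Add start/end markers, then truncate or pad the sequence to max_len."""
    body = list(token_ids)[: max_len - 2]  # leave room for start and end markers
    framed = [start_id] + body + [end_id]
    framed += [pad_id] * (max_len - len(framed))
    return np.array(framed)


# Hypothetical IDs, chosen for illustration only.
row = pad_token_ids([7, 8, 9], max_len=8, start_id=0, end_id=1, pad_id=2)
batch = np.stack([row, row])  # equal-length rows stack into a 2-D array
print(batch)
```

Because every row has the same length, a batch of sentences can be stored as one rectangular integer array, which is the shape a sequence model expects as input.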

In [12]:
encoder_de = BytePairEncoder("german", max_vocab_size=300, use_start_token=True, use_end_token=True, use_padding_token=True, max_token_len=50)

In [13]:
german_corpus[:5]

['Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.',
 'Ein kleines Mädchen klettert in ein Spielhaus aus Holz.',
 'Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.',
 'Zwei Männer stehen am Herd und bereiten Essen zu.']

In [14]:
encoder_de.learn_vocabulary_from_corpus(german_corpus)

100%|██████████| 195/195 [01:21<00:00,  2.39it/s]
