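# Assignment: Implement an English-German Translation Bot
***
## [Assignment Goal]

Use PyTorch to implement an English-German translation bot. In the code below, German is the source language and English is the target.

## [Key Points]

*   Language data processing
*   Build the Encoder with an LSTM: EncoderLSTM
*   Build the Decoder with an LSTM: DecoderLSTM
*   Assemble the Sequence to Sequence model: Seq2Seq
*   Write the training function
*   Write the testing function

## [Question]

After running this example in Colab, rebuild the Encoder and Decoder using a BiLSTM.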


## Install spaCy

We'll use spaCy to tokenize our data. The cell below reinstalls spaCy at its latest version; the English and German models are downloaded in the sections that follow.

In [1]:
!pip uninstall spacy -y
!pip install -U spacy

Found existing installation: spacy 2.2.4
Uninstalling spacy-2.2.4:
  Successfully uninstalled spacy-2.2.4
Collecting spacy
  Downloading spacy-3.1.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)
Collecting spacy-legacy<3.1.0,>=3.0.7
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.0-py3-none-any.whl (42 kB)
Collecting catalogue<2.1.0,>=2.0.4
  Downloading catalogue-2.0.4-py3-none-any.whl (16 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.1-cp37-cp37m-manylinux2014_x86_64.whl (456 kB)

## Import the required modules

In [2]:
import jieba
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator, Example, Dataset
import numpy as np
import pandas as pd
import spacy
import random
from torchtext.data.metrics import bleu_score  # needed by the bleu() helper defined later
from pprint import pprint
from torch.utils.tensorboard import SummaryWriter
from torchsummary import summary

## Download the English language model

In [3]:
!mkdir ./data
!mkdir ./data/multi30k
!python -m spacy download en_core_web_sm
!ls ./data/multi30k -al
spacy_english = spacy.load("en_core_web_sm")
!ls ./data/multi30k -al

2021-07-25 02:38:53.002462: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
total 8
drwxr-xr-x 2 root root 4096 Jul 25 02:38 .
drwxr-xr-x 3 root root 4096 Jul 25 02:38 ..
total 8
drwxr-xr-x 2 root root 4096 Jul 25 02:38 .
drwxr-xr-x 3 root root 4096 Jul 25 02:38 ..


In [4]:
!ls -l data

total 4
drwxr-xr-x 2 root root 4096 Jul 25 02:38 multi30k


## Download the German language model

In [5]:
!python -m spacy download de_core_news_sm
spacy_de = spacy.load("de_core_news_sm")
!ls ./data/multi30k -al

2021-07-25 02:39:09.682296: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Collecting de-core-news-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.1.0/de_core_news_sm-3.1.0-py3-none-any.whl (18.8 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
total 8
drwxr-xr-x 2 root root 4096 Jul 25 02:38 .
drwxr-xr-x 3 root root 4096 Jul 25 02:38 ..


In [6]:
from torchtext.vocab import build_vocab_from_iterator

def tokenize_de(text):
  return [token.text for token in spacy_de.tokenizer(text)]

def tokenize_english(text):
  return [token.text for token in spacy_english.tokenizer(text)]

# def yield_tokens(iterator, ln):
#     for data_sample in iterator:

#       yield tokenize_de(data_sample[0])

### Sample Run ###

sample_text = "I love machine learning"
print(tokenize_english(sample_text))

german = Field(tokenize=tokenize_de, lower=True,
               init_token="<sos>", eos_token="<eos>")

english = Field(tokenize=tokenize_english, lower=True,
               init_token="<sos>", eos_token="<eos>")

# train_data, valid_data, test_data = Multi30k.splits(exts = (".en", ".en"),
#                                                    fields=(german, english))
train_examples = []
valid_examples = []
test_examples = []

train_iter, valid_iter, test_iter = Multi30k(split=('train', 'valid', 'test'), 
                                 language_pair=('de', 'en'))

for src, trg in train_iter:
    train_examples.append(Example.fromlist(data=[src, trg], 
                                                fields=[('src', german), 
                                                        ('trg', english)]))
    
for src, trg in valid_iter:
    valid_examples.append(Example.fromlist(data=[src, trg], 
                                                fields=[('src', german), 
                                                        ('trg', english)]))
for src, trg in test_iter:
    test_examples.append(Example.fromlist(data=[src, trg], 
                                                fields=[('src', german), 
                                                        ('trg', english)]))

train_data = Dataset(examples=train_examples, fields={'src':german, 'trg':english})
valid_data = Dataset(examples=valid_examples, fields={'src':german, 'trg':english})
test_data = Dataset(examples=test_examples, fields={'src':german, 'trg':english})

german.build_vocab(train_data, max_size=10000, min_freq=3)
english.build_vocab(train_data, max_size=10000, min_freq=3)

print(f"Unique tokens in source (german) vocabulary: {len(german.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(english.vocab)}")


['I', 'love', 'machine', 'learning']


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:01<00:00, 968kB/s]
validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 174kB/s]
mmt16_task1_test.tar.gz: 100%|██████████| 43.9k/43.9k [00:00<00:00, 159kB/s]


Unique tokens in source (german) vocabulary: 5374
Unique tokens in target (en) vocabulary: 4556
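
The effect of `min_freq` and `max_size` in `build_vocab` can be illustrated with a small pure-Python stand-in. This is only a sketch that mimics the behaviour visible in this notebook (special tokens first, then tokens by descending frequency); the helper names `build_vocab` and `numericalize` are hypothetical and this is not torchtext's actual implementation.

```python
from collections import Counter

# Same order as indices 0 to 3 in the english.vocab.stoi output
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab(tokenized_sentences, max_size=10000, min_freq=3):
    """Toy stand-in for Field.build_vocab: returns (itos, stoi)."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort alphabetically, then stably by descending frequency,
    # so that ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    itos = list(SPECIALS)
    for token, freq in ordered:
        # Stop at the first rare token or once max_size regular tokens are kept
        if freq < min_freq or len(itos) >= max_size + len(SPECIALS):
            break
        itos.append(token)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["a", "man", "walks"], ["a", "man", "runs"], ["a", "dog", "runs"]]
itos, stoi = build_vocab(corpus, max_size=10, min_freq=2)
print(itos)
print(numericalize(["<sos>", "a", "cat", "runs", "<eos>"], stoi))
```

Tokens seen fewer than `min_freq` times never enter the vocabulary, which is why the reported vocabulary sizes (5374 and 4556) are far below the `max_size` cap of 10000.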


In [7]:
english.vocab.stoi

defaultdict(<bound method Vocab._default_unk_index of <torchtext.legacy.vocab.Vocab object at 0x7fab29008610>>,
            {'<unk>': 0,
             '<pad>': 1,
             '<sos>': 2,
             '<eos>': 3,
             'a': 4,
             '.': 5,
             'in': 6,
             'the': 7,
             'on': 8,
             'man': 9,
             'is': 10,
             'and': 11,
             'of': 12,
             'with': 13,
             'woman': 14,
             ',': 15,
             'two': 16,
             'are': 17,
             'to': 18,
             'people': 19,
             'at': 20,
             'an': 21,
             'wearing': 22,
             'shirt': 23,
             'young': 24,
             'white': 25,
             'black': 26,
             'his': 27,
             'while': 28,
             'blue': 29,
             'men': 30,
             'red': 31,
             'sitting': 32,
             'girl': 33,
             'boy': 34,
             'dog': 35,
             

In [8]:
word_2_idx = dict(english.vocab.stoi)
# print(word_2_idx)
idx_2_word = {}
for k,v in word_2_idx.items():
  idx_2_word[v] = k

In [9]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

print(train_data[5].__dict__.keys())
pprint(train_data[5].__dict__.values())

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
dict_keys(['src', 'trg'])
dict_values([['ein', 'mann', 'in', 'grün', 'hält', 'eine', 'gitarre', ',', 'während', 'der', 'andere', 'mann', 'sein', 'hemd', 'ansieht', '.'], ['a', 'man', 'in', 'green', 'holds', 'a', 'guitar', 'while', 'the', 'other', 'man', 'observes', 'his', 'shirt', '.']])




In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), 
                                                                      batch_size = BATCH_SIZE, 
                                                                      sort_within_batch=True,
                                                                      sort_key=lambda x: len(x.src),
                                                                      device = device)
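
What the iterator does to each batch can be sketched without torchtext: sentences of similar length are grouped together, padded with the `<pad>` index, and laid out with time steps as rows and sentences as columns. The helpers `bucket_batches` and `pad_time_major` below are hypothetical names for illustration; the real `BucketIterator` also shuffles and builds its buckets from larger pools, which this sketch leaves out.

```python
PAD = 1  # index of <pad> in the vocabularies built above

def bucket_batches(sequences, batch_size):
    """Group sequences of similar length, longest first inside each batch."""
    ordered = sorted(sequences, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    return [sorted(batch, key=len, reverse=True) for batch in batches]

def pad_time_major(batch, pad_idx=PAD):
    """Pad to the longest sequence; rows are time steps, columns are sentences."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]
    return [list(step) for step in zip(*padded)]

# Four already-numericalized sentences (2 = <sos>, 3 = <eos>)
sentences = [[2, 4, 9, 3], [2, 7, 3], [2, 4, 9, 10, 8, 3], [2, 5, 6, 11, 3]]
batches = [pad_time_major(b) for b in bucket_batches(sentences, batch_size=2)]
for batch in batches:
    print(batch)
```

Grouping by length keeps the amount of padding per batch small, and the time-major layout is the `[sequence_length, batch_size]` shape that the encoder and decoder below expect.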

In [11]:
count = 0
max_len_eng = []
max_len_ger = []
for data in train_data:
  max_len_ger.append(len(data.src))
  max_len_eng.append(len(data.trg))
  if count < 10 :
    print("German - ",*data.src, " Length - ", len(data.src))
    print("English - ",*data.trg, " Length - ", len(data.trg))
    print()
  count += 1

print("Maximum Length of English sentence {} and German sentence {} in the dataset".format(max(max_len_eng),max(max_len_ger)))
print("Minimum Length of English sentence {} and German sentence {} in the dataset".format(min(max_len_eng),min(max_len_ger)))

German -  zwei junge weiße männer sind im freien in der nähe vieler büsche .  Length -  13
English -  two young , white males are outside near many bushes .  Length -  11

German -  mehrere männer mit schutzhelmen bedienen ein antriebsradsystem .  Length -  8
English -  several men in hard hats are operating a giant pulley system .  Length -  12

German -  ein kleines mädchen klettert in ein spielhaus aus holz .  Length -  10
English -  a little girl climbing into a wooden playhouse .  Length -  9

German -  ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster .  Length -  15
English -  a man in a blue shirt is standing on a ladder cleaning a window .  Length -  15

German -  zwei männer stehen am herd und bereiten essen zu .  Length -  10
English -  two men are at the stove preparing food .  Length -  9

German -  ein mann in grün hält eine gitarre , während der andere mann sein hemd ansieht .  Length -  16
English -  a man in green holds a guitar while the other


In [12]:
count = 0
for data in train_iterator:
  if count < 1 :
    print("Shapes", data.src.shape, data.trg.shape)
    print()
    print("German - ",*data.src, " Length - ", len(data.src))
    print()
    print("English - ",*data.trg, " Length - ", len(data.trg))
    temp_ger = data.src
    temp_eng = data.trg
    count += 1

Shapes torch.Size([22, 32]) torch.Size([27, 32])

German -  tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2], device='cuda:0') tensor([   5,    5,    5,    5,    5,   18,    5,  191,    5,    5,    5,    5,
           5,    5,    8,    8,    5, 2237,    5,    5,    5,   59,    7,    5,
          18,    5,   18,    5,    5,   43,    5,    5], device='cuda:0') tensor([  13,   66,   13,  150,    0,   45,  116,   73,   66,  177,    0,   13,
          25,    0,   16,   16,   13, 1180, 3736,   13,  130,    6,   14,   13,
        1035,   13,   30,  269,   12,   41,   13,   13], device='cuda:0') tensor([  10, 3157,    9,   32,  200,   10,   73,   53,   25,   25,   11,   12,
         126,    9,   29,    7,    7,   12,   11,   10,   13,  814,  113,    7,
           7,  519,    7,  151,   24,   52,    7,    7], device='cuda:0') tensor([   8,   25,   15,   12,    7,    5,   52,   21,   31,  184,  370,    6,
        2737,  221,   21,   


In [13]:
temp_eng_idx = (temp_eng).cpu().detach().numpy()
temp_ger_idx = (temp_ger).cpu().detach().numpy()

In [14]:
df_eng_idx = pd.DataFrame(data = temp_eng_idx, columns = [str("S_")+str(x) for x in np.arange(1, 33)])
df_eng_idx.index.name = 'Time Steps'
df_eng_idx.index = df_eng_idx.index + 1 
# df_eng_idx.to_csv('/content/idx.csv')
df_eng_idx

Time Steps,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,S_10,S_11,S_12,S_13,S_14,S_15,S_16,S_17,S_18,S_19,S_20,S_21,S_22,S_23,S_24,S_25,S_26,S_27,S_28,S_29,S_30,S_31,S_32
1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
2,4,53,4,4,4,16,74,837,4,24,4,4,0,38,4,4,4,58,4,4,21,176,4,4,417,4,176,4,4,48,4,4
3,9,122,9,25,986,50,19,12,53,33,491,9,33,616,14,14,9,956,0,9,115,10,87,9,353,9,17,105,9,19,9,9
4,11,33,36,35,184,11,17,19,33,455,0,8,13,54,36,6,6,2796,42,11,9,4,12,6,215,1373,16,8,273,474,6,6
5,4,6,6,179,6,4,36,32,10,7,6,4,122,537,20,4,4,7,0,4,252,9,19,4,6,4,0,7,8,8,26,4
6,14,44,43,51,4,9,83,124,8,0,4,157,42,553,4,29,1199,101,771,14,27,8,2121,81,26,24,30,25,7,7,147,26
7,17,81,12,18,101,17,4,4,4,12,31,241,97,15,882,11,23,956,6,17,441,4,74,10,620,34,15,11,185,39,11,339
8,167,1251,16,366,380,36,59,1090,435,118,23,232,137,74,439,52,15,1438,4,36,1590,1112,19,1584,11,6,46,29,13,4387,4,11
9,37,75,0,4,1567,8,1054,108,308,443,32,4,13,6,136,117,1899,51,197,71,49,198,189,27,25,4,12,237,27,877,52,4
10,1106,44,558,123,1500,4,416,6,28,847,6,118,698,231,1239,10,15,1916,296,18,4,45,17,394,15,67,155,996,394,65,1060,14




In [15]:
# Map every vocabulary index in the table back to its token
df_eng_word = df_eng_idx.replace(idx_2_word)
# df_eng_word.to_csv('/content/Words.csv')
df_eng_word

Time Steps,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,S_10,S_11,S_12,S_13,S_14,S_15,S_16,S_17,S_18,S_19,S_20,S_21,S_22,S_23,S_24,S_25,S_26,S_27,S_28,S_29,S_30,S_31,S_32
1,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>
2,a,little,a,a,a,two,some,lots,a,young,a,a,<unk>,group,a,a,a,as,a,a,an,there,a,a,2,a,there,a,a,three,a,a
3,man,blond,man,white,cloudy,women,people,of,little,girl,bearded,man,girl,class,woman,woman,man,i,<unk>,man,older,is,crowd,man,hockey,man,are,player,man,people,man,man
4,and,girl,standing,dog,day,and,are,people,girl,reads,<unk>,on,with,for,standing,in,in,overlook,-,and,man,a,of,in,players,helps,two,on,laying,lined,in,in
5,a,in,in,jumps,in,a,standing,sitting,is,the,in,a,blond,martial,at,a,a,the,<unk>,a,having,man,people,a,in,a,<unk>,the,on,on,black,a
6,woman,her,front,up,a,man,around,along,on,<unk>,a,bicycle,-,arts,a,blue,neon,city,photographer,woman,his,on,containing,jacket,black,young,men,white,the,the,pants,black
7,are,jacket,of,to,city,are,a,a,a,of,red,rides,hair,",",crosswalk,and,shirt,i,in,are,beard,a,some,is,gold,boy,",",and,ground,street,and,vest
8,each,sticking,two,catch,does,standing,large,low,playground,park,shirt,past,plays,some,which,green,",",pick,a,standing,shaved,bucking,people,raising,and,in,one,blue,with,shading,a,and
9,playing,out,<unk>,a,n't,on,inflatable,wall,swing,statue,sitting,a,with,in,has,dress,khakis,up,suit,next,by,horse,who,his,white,a,of,team,his,themselves,green,a
10,acoustic,her,prepares,soccer,keep,a,slide,in,while,;,in,park,bubbles,uniform,only,is,",",my,takes,to,a,holding,are,arm,",",hat,them,kicks,arm,from,polo,woman




## Encoder class built with an LSTM: EncoderLSTM



In [16]:
class EncoderLSTM(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
    super(EncoderLSTM, self).__init__()

    # Size of the one hot vectors that will be the input to the encoder
    #self.input_size = input_size

    # Output size of the word embedding NN
    #self.embedding_size = embedding_size

    # Dimension of the NN's inside the lstm cell/ (hs,cs)'s dimension.
    self.hidden_size = hidden_size

    # Number of layers in the lstm
    self.num_layers = num_layers

    # Regularization parameter
    self.dropout = nn.Dropout(p)
    self.tag = True

    # Shape --------------------> (5374, 300) [input size, embedding dims]
    self.embedding = nn.Embedding(input_size, embedding_size)
    
    # Arguments -----------> (300, 1024, 2) [embedding dims, hidden size, num layers]
    self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout = p)

  # Shape of x (26, 32) [Sequence_length, batch_size]
  def forward(self, x):

    # Shape -----------> (26, 32, 300) [Sequence_length , batch_size , embedding dims]
    embedding = self.dropout(self.embedding(x))
    
    # Shape --> outputs (26, 32, 1024) [Sequence_length , batch_size , hidden_size]
    # Shape --> (hs, cs) (2, 32, 1024) , (2, 32, 1024) [num_layers, batch_size, hidden_size]
    outputs, (hidden_state, cell_state) = self.LSTM(embedding)

    return hidden_state, cell_state

input_size_encoder = len(german.vocab)
encoder_embedding_size = 300
hidden_size = 1024
num_layers = 2
encoder_dropout = 0.5

encoder_lstm = EncoderLSTM(input_size_encoder, encoder_embedding_size,
                           hidden_size, num_layers, encoder_dropout).to(device)
print(encoder_lstm)

EncoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(5374, 300)
  (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
)
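
For the BiLSTM variant asked for in the assignment question, one possible encoder is sketched below. It is a minimal sketch under stated assumptions, not the reference solution: the class name `EncoderBiLSTM` and the two projection layers `fc_hidden` and `fc_cell` are hypothetical additions. Because a bidirectional LSTM returns twice as many final states, the forward and backward states of each layer are concatenated and projected back to `hidden_size`, so that a decoder like `DecoderLSTM` can consume them unchanged. The toy sizes are chosen only so the example runs quickly.

```python
import torch
import torch.nn as nn

class EncoderBiLSTM(nn.Module):
    """Hypothetical bidirectional variant of EncoderLSTM (not in the original notebook)."""

    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers,
                            dropout=p, bidirectional=True)
        # Project the concatenated forward/backward states back to hidden_size
        self.fc_hidden = nn.Linear(2 * hidden_size, hidden_size)
        self.fc_cell = nn.Linear(2 * hidden_size, hidden_size)

    def _merge(self, state, projection):
        # state: [num_layers * 2, batch, hidden] -> [num_layers, 2, batch, hidden]
        batch_size = state.shape[1]
        state = state.reshape(self.num_layers, 2, batch_size, self.hidden_size)
        # Concatenate both directions: [num_layers, batch, 2 * hidden]
        merged = torch.cat((state[:, 0], state[:, 1]), dim=2)
        return projection(merged)  # [num_layers, batch, hidden]

    def forward(self, x):
        # x: [sequence_length, batch_size]
        embedding = self.dropout(self.embedding(x))
        _, (hidden_state, cell_state) = self.LSTM(embedding)
        hidden_state = torch.tanh(self._merge(hidden_state, self.fc_hidden))
        cell_state = self._merge(cell_state, self.fc_cell)
        return hidden_state, cell_state

encoder = EncoderBiLSTM(input_size=50, embedding_size=8, hidden_size=16, num_layers=2, p=0.5)
tokens = torch.randint(0, 50, (7, 4))  # [sequence_length, batch_size]
hidden, cell = encoder(tokens)
print(hidden.shape, cell.shape)
```

Note that an autoregressive decoder generates the target left to right and cannot read tokens it has not produced yet, so in this sketch only the encoder is bidirectional.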


## Decoder class built with an LSTM: DecoderLSTM


In [17]:
class DecoderLSTM(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, num_layers, p, output_size):
    super(DecoderLSTM, self).__init__()

    # Size of the one hot vectors that will be the input to the decoder
    #self.input_size = input_size

    # Output size of the word embedding NN
    #self.embedding_size = embedding_size

    # Dimension of the NN's inside the lstm cell/ (hs,cs)'s dimension.
    self.hidden_size = hidden_size

    # Number of layers in the lstm
    self.num_layers = num_layers

    # Size of the one hot vectors that will be the output of the decoder (English Vocab Size)
    self.output_size = output_size

    # Regularization parameter
    self.dropout = nn.Dropout(p)

    # Shape --------------------> (4556, 300) [input size, embedding dims]
    self.embedding = nn.Embedding(input_size, embedding_size)

    # Arguments -----------> (300, 1024, 2) [embedding dims, hidden size, num layers]
    self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout = p)

    # Shape -----------> (1024, 4556) [hidden size, output size]
    self.fc = nn.Linear(hidden_size, output_size)

  # Shape of x (32) [batch_size]
  def forward(self, x, hidden_state, cell_state):

    # Shape of x (1, 32) [1, batch_size]
    x = x.unsqueeze(0)

    # Shape -----------> (1, 32, 300) [1, batch_size, embedding dims]
    embedding = self.dropout(self.embedding(x))

    # Shape --> outputs (1, 32, 1024) [1, batch_size , hidden_size]
    # Shape --> (hs, cs) (2, 32, 1024) , (2, 32, 1024) [num_layers, batch_size, hidden_size] (passing encoder's hs, cs - context vectors)
    outputs, (hidden_state, cell_state) = self.LSTM(embedding, (hidden_state, cell_state))

    # Shape --> predictions (1, 32, 4556) [ 1, batch_size , output_size]
    predictions = self.fc(outputs)

    # Shape --> predictions (32, 4556) [batch_size , output_size]
    predictions = predictions.squeeze(0)

    return predictions, hidden_state, cell_state

input_size_decoder = len(english.vocab)
decoder_embedding_size = 300
hidden_size = 1024
num_layers = 2
decoder_dropout = 0.5
output_size = len(english.vocab)

decoder_lstm = DecoderLSTM(input_size_decoder, decoder_embedding_size,
                           hidden_size, num_layers, decoder_dropout, output_size).to(device)
print(decoder_lstm)

DecoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(4556, 300)
  (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
  (fc): Linear(in_features=1024, out_features=4556, bias=True)
)


In [18]:
for batch in train_iterator:
  print(batch.src.shape)
  print(batch.trg.shape)
  break

x = batch.trg[1]
print(x)

torch.Size([13, 32])
torch.Size([17, 32])
tensor([   4,    4,   64,  491,    4,    4, 3212,   21,    4,    4,    4,    4,
           4,    4,    4,    4,    4,    4,   21,    7,    4,   16,    4,    4,
           4,    7,    4,    4,   74,   16,    4,    4], device='cuda:0')


## Sequence to Sequence class: Seq2Seq

In [19]:
class Seq2Seq(nn.Module):
  def __init__(self, Encoder_LSTM, Decoder_LSTM):
    super(Seq2Seq, self).__init__()
    self.Encoder_LSTM = Encoder_LSTM
    self.Decoder_LSTM = Decoder_LSTM

  def forward(self, source, target, tfr=0.5):
    # Shape - Source : (10, 32) [(Sentence length German + some padding), Number of Sentences]
    batch_size = source.shape[1]

    # Shape - Target : (14, 32) [(Sentence length English + some padding), Number of Sentences]
    target_len = target.shape[0]
    target_vocab_size = len(english.vocab)
    
    # Shape --> outputs (14, 32, 4556) [target_len, batch_size, target_vocab_size]
    outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)

    # Shape --> (hs, cs) (2, 32, 1024) ,(2, 32, 1024) [num_layers, batch_size, hidden_size] (contains encoder's hs, cs - context vectors)
    hidden_state, cell_state = self.Encoder_LSTM(source)

    # Shape of x (32 elements)
    x = target[0] # Trigger token <SOS>

    for i in range(1, target_len):
      # Shape --> output (32, 4556) [batch_size, target_vocab_size]
      output, hidden_state, cell_state = self.Decoder_LSTM(x, hidden_state, cell_state)
      outputs[i] = output
      best_guess = output.argmax(1) # 0th dimension is batch size, 1st dimension holds the scores over the target vocabulary
      x = target[i] if random.random() < tfr else best_guess # Teacher forcing: feed the ground-truth token with probability tfr, otherwise the model's own prediction

    # Shape --> outputs (14, 32, 4556); outputs[0] stays all zeros because decoding starts at step 1
    return outputs
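
The teacher-forcing decision inside the loop can be isolated in a few lines of plain Python. The helper `next_decoder_input` is a hypothetical name used only for this illustration; it mirrors the `random.random() < tfr` test in `forward`, where `tfr` is the teacher forcing ratio.

```python
import random

def next_decoder_input(gold_token, predicted_token, tfr, rng=random):
    """Return the gold token with probability tfr, otherwise the model's own guess."""
    return gold_token if rng.random() < tfr else predicted_token

rng = random.Random(0)  # seeded so repeated runs give the same choices
gold = ["a", "man", "in", "green"]
guess = ["the", "man", "on", "green"]

fed_by_ratio = {}
for tfr in (0.0, 0.5, 1.0):
    fed_by_ratio[tfr] = [next_decoder_input(g, p, tfr, rng) for g, p in zip(gold, guess)]
    print(tfr, fed_by_ratio[tfr])
```

With `tfr=1.0` the decoder always sees the reference sentence, and with `tfr=0.0` it always sees its own predictions, which is the situation at inference time when no reference is available.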


In [20]:
# Hyperparameters

learning_rate = 0.001
writer = SummaryWriter("runs/loss_plot")
step = 0

model = Seq2Seq(encoder_lstm, decoder_lstm).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

pad_idx = english.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
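
How `ignore_index` interacts with the `[target_len, batch_size, vocab_size]` output of `Seq2Seq` can be checked on toy tensors. The sketch below assumes a common pattern for such training steps (drop time step 0, then flatten); the training function itself is not shown in this section, so treat the reshaping as an assumption rather than the notebook's exact code.

```python
import torch
import torch.nn as nn

pad_idx = 1      # index of <pad>, as in the vocabulary above
vocab_size = 6   # toy vocabulary size for the example
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

torch.manual_seed(0)
outputs = torch.randn(4, 2, vocab_size)   # [target_len, batch_size, vocab_size]
target = torch.tensor([[2, 2],            # <sos> row, never predicted
                       [4, 5],
                       [3, 4],            # sentence 0 ends here with <eos>
                       [1, 3]])           # sentence 0 is padded at the last step

# Drop time step 0 and flatten so that each row is one token prediction
logits = outputs[1:].reshape(-1, vocab_size)   # [(target_len - 1) * batch_size, vocab_size]
labels = target[1:].reshape(-1)                # [(target_len - 1) * batch_size]
loss = criterion(logits, labels)

# The same value is obtained when the padded position is removed by hand
keep = labels != pad_idx
manual = nn.functional.cross_entropy(logits[keep], labels[keep])
print(loss.item(), manual.item())
```

Padded positions therefore contribute nothing to the loss or its gradient, so short sentences in a batch are not penalized for what the model emits after `<eos>`.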

In [21]:
model

Seq2Seq(
  (Encoder_LSTM): EncoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(5374, 300)
    (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
  )
  (Decoder_LSTM): DecoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(4556, 300)
    (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=1024, out_features=4556, bias=True)
  )
)

In [24]:
def translate_sentence(model, sentence, german, english, device, max_length=50):
    spacy_ger = spacy.load("de_core_news_sm")

    if isinstance(sentence, str):
        tokens = [token.text.lower() for token in spacy_ger(sentence)]
    else:
        tokens = [token.lower() for token in sentence]
    tokens.insert(0, german.init_token)
    tokens.append(german.eos_token)
    text_to_indices = [german.vocab.stoi[token] for token in tokens]
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    # Build encoder hidden, cell state
    with torch.no_grad():
        hidden, cell = model.Encoder_LSTM(sentence_tensor)

    outputs = [english.vocab.stoi["<sos>"]]

    for _ in range(max_length):
        previous_word = torch.LongTensor([outputs[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.Decoder_LSTM(previous_word, hidden, cell)
            best_guess = output.argmax(1).item()

        outputs.append(best_guess)

        # Stop as soon as the model predicts the end of the sentence
        if best_guess == english.vocab.stoi["<eos>"]:
            break

    translated_sentence = [english.vocab.itos[idx] for idx in outputs]
    return translated_sentence[1:]

# Function used to evaluate the model: BLEU
def bleu(data, model, german, english, device):
    targets = []
    outputs = []

    for example in data:
        src = vars(example)["src"]
        trg = vars(example)["trg"]

        prediction = translate_sentence(model, src, german, english, device)
        prediction = prediction[:-1]  # remove <eos> token

        targets.append([trg])
        outputs.append(prediction)

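    # bleu_score comes from torchtext.data.metrics; it is imported here because
    # the module-level import at the top of the notebook is commented out
    from torchtext.data.metrics import bleu_score
    return bleu_score(outputs, targets)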

def checkpoint_and_save(model, best_loss, epoch, optimizer, epoch_loss):
    print('saving')
    print()
    state = {
        'model': model,
        'best_loss': best_loss,
        'epoch': epoch,
        'rng_state': torch.get_rng_state(),
        'optimizer': optimizer.state_dict(),
    }
    torch.save(state, './checkpoint-NMT')
    torch.save(model.state_dict(),'./checkpoint-NMT-SD')
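
`checkpoint_and_save` is meant to run only when the loss improves, and training is meant to stop once no improvement has been seen for a number of epochs (the "no improvement in 10 epochs" rule). That bookkeeping can be sketched on its own in plain Python. The function `run_with_early_stopping` and the loss values are made up for illustration, and the sketch records which epochs would be saved instead of writing files.

```python
def run_with_early_stopping(epoch_losses, patience):
    """Return (best_epoch, best_loss, saved_epochs, stopped_epoch) for per-epoch losses."""
    best_loss = float("inf")
    best_epoch = -1
    saved_epochs = []
    stopped_epoch = None
    for epoch, loss in enumerate(epoch_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            saved_epochs.append(epoch)  # a checkpoint would be written here
        elif epoch - best_epoch >= patience:
            stopped_epoch = epoch  # no improvement for `patience` epochs
            break
    return best_epoch, best_loss, saved_epochs, stopped_epoch

losses = [4.0, 3.0, 3.5, 3.2, 3.1, 2.9]  # made-up per-epoch average losses
short_patience = run_with_early_stopping(losses, patience=3)
long_patience = run_with_early_stopping(losses, patience=10)
```

With `patience=3` the run stops at epoch index 4 and never reaches the later improvement, which shows the trade-off in choosing the patience value. Note that the improvement check and the patience check are sibling branches: if the patience check were nested inside the improvement branch, it could never fire.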

In [25]:
epoch_loss = 0.0
num_epochs = 100
best_loss = 999999
best_epoch = -1
sentence1 = "ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster"
ts1  = []

for epoch in range(num_epochs):
  print("Epoch - {} / {}".format(epoch+1, num_epochs))
  model.eval()
  translated_sentence1 = translate_sentence(model, sentence1, german, english, device, max_length=50)
  print(f"Translated example sentence 1: \n {translated_sentence1}")
  ts1.append(translated_sentence1)

  model.train(True)
  for batch_idx, batch in enumerate(train_iterator):
    src = batch.src.to(device)
    target = batch.trg.to(device)

    # Pass the source and target batches to the model's forward method
    output = model(src, target)
    output = output[1:].reshape(-1, output.shape[2])
    target = target[1:].reshape(-1)

    # Clear the accumulating gradients
    optimizer.zero_grad()

    # Calculate the loss value for this batch
    loss = criterion(output, target)

    # Calculate the gradients for weights & biases using back-propagation
    loss.backward()

    # Clip the gradient norm if it exceeds 1
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

    # Update the weights using the gradients computed by back-propagation
    optimizer.step()
    step += 1
    epoch_loss += loss.item()
    writer.add_scalar("Training loss", loss, global_step=step)

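  # Average the accumulated batch losses, then reset the accumulator for the next epoch
  mean_epoch_loss = epoch_loss / len(train_iterator)
  epoch_loss = 0.0

  if mean_epoch_loss < best_loss:
    best_loss = mean_epoch_loss
    best_epoch = epoch
    checkpoint_and_save(model, best_loss, epoch, optimizer, mean_epoch_loss)
  elif (epoch - best_epoch) >= 10:
    print("no improvement in 10 epochs, break")
    break
  # Note: this prints the loss of the last batch, not the epoch average
  print("Epoch_Loss - {}".format(loss.item()))
  print()

print(mean_epoch_loss)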

# score = bleu(test_data[1:100], model, german, english, device)
# print(f"Bleu score {score*100:.2f}")

Epoch - 1 / 100
Translated example sentence 1: 
 ['balanced', 'communicating', 'knight', 'fresh', 'fresh', 'balanced', 'participate', 'available', 'share', 'chest', 'being', 'being', 'being', 'shirted', 'cry', 'hotdog', 'hotdog', 'hotdog', 'cries', 'cries', 'toss', 'spectacular', 'unseen', 'sandal', 'sandal', 'sandal', 'sides', 'sides', 'paw', 'poles', 'poles', 'poles', 'chinatown', 'opening', 'kitchenaid', 'kitchenaid', 'kitchenaid', 'kitchenaid', 'chest', 'director', 'being', 'being', 'cattle', 'shirted', 'stares', 'lamp', 'removed', 'removed', 'fashionable', 'pauses']
saving

Epoch_Loss - 4.004133224487305

Epoch - 2 / 100
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'black', 'shirt', 'and', 'a', 'a', 'a', 'a', 'a', '.', '<eos>']
Epoch_Loss - 3.0022172927856445

Epoch - 3 / 100
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'sitting', 'in', 'a', 'chair', 'in', 'a', 'a', '.', '<eos>']
Epoch_Loss - 3.618008852005005

Epoch - 4 / 100
Transla