# Fudan PRML Assignment2: Machine Translation and Model Attack

![Google Translation](./img/google_translation.PNG)

*Your name and Student ID: 陈朦伊 19307110382*

*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet, and a .pdf report file) with your assignment submission.*

你好，欢迎来到第二次作业！    
Hello and welcome to the second assignment!



In this assignment, you will work on a **Chinese-to-English** machine translation (MT) task with neural networks. Unlike Assignment 1, this is a generation task in NLP. The bilingual parallel corpus you will work with is the News Commentary v13 dataset from the [Third Conference on Machine Translation (WMT18)](https://www.statmt.org/wmt18/translation-task.html). The dataset contains about 252,700 training samples, 2,000 validation samples, and 2,000 test samples, and the Chinese sentences have already been word-segmented. You have to design a Sequence-to-Sequence (Seq2Seq) model to complete this translation task. **An attention mechanism must be used in your model**, but you are free to design other modules with CNNs, RNNs, and so on. **You have to evaluate your model on the test set with the [Bilingual Evaluation Understudy (BLEU) score](https://en.wikipedia.org/wiki/BLEU) and [perplexity](https://en.wikipedia.org/wiki/Perplexity).** **In addition, you have to visualize the attention matrix to help you understand your model.**
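Perplexity is the exponential of the average negative log-likelihood per target token. The following is a minimal sketch of that relationship, assuming the loss is a natural-log cross-entropy summed over non-padding target tokens; the function name `perplexity` and the sample numbers are illustrative and not part of the assignment code.

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """Perplexity = exp(mean negative log-likelihood per target token).

    total_nll: sum of -log p(token) over all non-padding target tokens (natural log).
    n_tokens:  number of those tokens.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(total_nll / n_tokens)

# A model that assigns probability 1/4 to every reference token
# has perplexity 4, regardless of sentence length.
nll = -math.log(0.25) * 10
print(round(perplexity(nll, 10), 6))  # 4.0
```

If the training loop already reports an average per-token loss, its exponential gives the corresponding perplexity directly.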

After building your model, in the second part you have to **attack** it : ) AI safety is now a popular research topic, and many kinds of attack methods have been developed over the past few years. These methods include adversarial attacks, data poisoning attacks, training data leakage attacks, and so on [1]. In this assignment, **you only need to conduct one type of attack**. You can choose the simplest one, or just analyze which kinds of samples the model predicts incorrectly. The important thing is to understand the behavior of neural models.

You may use deep learning frameworks such as PaddlePaddle, PyTorch, or TensorFlow in your experiments, but not higher-level libraries such as Hugging Face. Please record the versions you use in the `./requirements.txt` file.

The following links may be useful:    
- *Machine Translation: Foundations and Models*, Tong Xiao and Jingbo Zhu, link: https://github.com/NiuTrans/MTBook
- PyTorch Seq2Seq Tutorial @ Ben Trevett, link: https://github.com/bentrevett/pytorch-seq2seq

Our exercises on the PaddlePaddle AI Studio platform will also be helpful.

## 1. Setup

Import the libraries and load the dataset here.

In [1]:
# setup code
%load_ext autoreload
%autoreload 2

In [2]:
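import os

import torch
import torch.nn as nn
import torch.optim as optim

# torchtext.legacy is only available in torchtext 0.9-0.11
import torchtext
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

import numpy as np

import random
import math
import time

# project modules used in the cells below (assumed to live next to this worksheet)
import data_loader
import embedding
import models
import trainer
import utils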


SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fc72041acf0>

In [25]:
print("Loading datasets...")
dataset_path = './dataset'
train_en_path = os.path.join(dataset_path, 'train', 'news-commentary-v13.zh-en.en')
train_zh_path = os.path.join(dataset_path, 'train', 'news-commentary-v13.zh-en.zh')
dev_en_path = os.path.join(dataset_path, 'dev', 'newsdev2017.tc.en')
dev_zh_path = os.path.join(dataset_path, 'dev', 'newsdev2017.tc.zh')
test_en_path = os.path.join(dataset_path, 'test', 'newstest2017.tc.en')
test_zh_path = os.path.join(dataset_path, 'test', 'newstest2017.tc.zh')

paths = {
    'train_en': train_en_path, 'train_zh': train_zh_path,
    'dev_en': dev_en_path, 'dev_zh': dev_zh_path,
    'test_en': test_en_path, 'test_zh': test_zh_path,
}
data_bundle = data_loader.DataBundle(paths)
# announce the processing step before it starts, not after it finishes
print("Processing datasets...")
data_bundle.process()

Loading datasets...
Processing datasets...


## 2. Exploratory Data Analysis (5 points)

You may have to explore the dataset and do some analysis first.

In [26]:
data_bundle.print_info()

Datasets:
	Train: 252777
	Dev: 2002
	Test: 2001
Vocabulary:
	En: 166192
	Zh: 93264


## 3. Methodology (50 points)

Build and evaluate your model here.

In [27]:
#en_embed = embedding.Embedding(vocab=data_bundle.get_vocab('en'), model_name='glove.6B.300d.txt', word_type='en')
#zh_embed = embedding.Embedding(vocab=data_bundle.get_vocab('zh'), model_name='cc.zh.300.vec', word_type='zh')
en_embed = embedding.Embedding(vocab=data_bundle.get_vocab('en'), model_name=None, word_type='en')
zh_embed = embedding.Embedding(vocab=data_bundle.get_vocab('zh'), model_name=None, word_type='zh')

Pretraining en embedding...
Pretraining zh embedding...


In [38]:
input_vocab = data_bundle.get_vocab('zh')
output_vocab = data_bundle.get_vocab('en')
input_var = data_bundle.get_dataset('train')['zh']
output_var = data_bundle.get_dataset('train')['en']
pairs = list(zip(input_var, output_var))
# sort pairs by source length, longest first
pairs.sort(key=lambda x:len(x[0].split()), reverse=True)
input_var, output_var = zip(*pairs)
# keep only the 1000 longest pairs as a small subset for a quick run
input_var = input_var[:1000]
output_var = output_var[:1000]

In [41]:
#batches = utils.batch2TrainData(input_vocab, output_vocab, input_var, output_var)
#input_var, lengths, output_var, mask, max_target_len = batches

model_name = 'nmt'
attn_model = 'dot'
batch_size = 50
hidden_size = 300
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
learning_rate = 0.0001
decoder_learning_ratio = 5.0
clip = 50.0
n_iteration = 100


encoder = models.EncoderRNN(hidden_size, zh_embed, encoder_n_layers, dropout)
# use the attn_model setting defined above instead of repeating the literal 'dot'
decoder = models.DecoderRNN(attn_model, en_embed, hidden_size, len(output_vocab), decoder_n_layers, dropout)
# the decoder uses a larger learning rate (learning_rate * decoder_learning_ratio)
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

encoder.train()
decoder.train()
#encoder_outputs, encoder_hidden = encoder(input_var, lengths)

DecoderRNN(
  (embedding): Embedding(
    (embedding): Embedding(166192, 300, padding_idx=0)
    (dropout_layer): Dropout(p=0, inplace=False)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (gru): GRU(300, 300, num_layers=2, dropout=0.1)
  (concat): Linear(in_features=600, out_features=300, bias=True)
  (out): Linear(in_features=300, out_features=166192, bias=True)
  (attn): Attn()
)
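
The decoder printed above contains an `attn` module configured with `'dot'` and a `concat` layer with 600 inputs, which is consistent with Luong-style attention: the context vector is concatenated with the GRU output before the output projection. The internals of `models.Attn` are not shown in this worksheet, so the following is only an assumed sketch of dot-product attention for one decoding step; `dot_attention` and `src_mask` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def dot_attention(decoder_state, encoder_outputs, src_mask=None):
    """Dot-product attention for one decoding step.

    decoder_state:   (batch, hidden)           current decoder output
    encoder_outputs: (src_len, batch, hidden)  sequence-first encoder outputs
    src_mask:        (batch, src_len) bool, True at real (non-padding) tokens
    Returns (context, weights) with shapes (batch, hidden) and (batch, src_len).
    """
    # score[b, t] = <decoder_state[b], encoder_outputs[t, b]>
    scores = torch.einsum("bh,tbh->bt", decoder_state, encoder_outputs)
    if src_mask is not None:
        # padding positions get zero weight after the softmax
        scores = scores.masked_fill(~src_mask, float("-inf"))
    weights = F.softmax(scores, dim=1)  # normalize over source positions
    context = torch.einsum("bt,tbh->bh", weights, encoder_outputs)
    return context, weights

torch.manual_seed(0)
enc = torch.randn(7, 2, 300)   # src_len=7, batch=2, hidden=300
dec = torch.randn(2, 300)
mask = torch.ones(2, 7, dtype=torch.bool)
mask[1, 5:] = False            # last two positions of the second sentence are padding
context, weights = dot_attention(dec, enc, mask)
print(context.shape, weights.sum(dim=1))
```

Collecting `weights` at every decoding step gives the (target length, source length) matrix that the attention visualization section asks for.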

In [42]:
trainer.trainIters(model_name, input_vocab, output_vocab, input_var, output_var, encoder, decoder,
                   encoder_optimizer, decoder_optimizer, en_embed, zh_embed, encoder_n_layers, decoder_n_layers, 
                   n_iteration, batch_size, clip, print_every=10)

Initializing...
Start training...
torch.Size([87, 50, 300])
torch.Size([109, 50, 300])
torch.Size([98, 50, 300])
torch.Size([92, 50, 300])
torch.Size([108, 50, 300])
torch.Size([108, 50, 300])
torch.Size([129, 50, 300])
torch.Size([115, 50, 300])
torch.Size([94, 50, 300])
torch.Size([98, 50, 300])
Iteration: 10; Percent complete: 10.0%; Average loss: 11.7894
torch.Size([109, 50, 300])
torch.Size([95, 50, 300])
torch.Size([101, 50, 300])
torch.Size([97, 50, 300])
torch.Size([98, 50, 300])
torch.Size([109, 50, 300])
torch.Size([110, 50, 300])
torch.Size([98, 50, 300])
torch.Size([98, 50, 300])
torch.Size([98, 50, 300])
Iteration: 20; Percent complete: 20.0%; Average loss: 9.7012
torch.Size([90, 50, 300])
torch.Size([89, 50, 300])
torch.Size([111, 50, 300])
torch.Size([96, 50, 300])
torch.Size([96, 50, 300])
torch.Size([88, 50, 300])
torch.Size([89, 50, 300])
torch.Size([94, 50, 300])
torch.Size([102, 50, 300])
torch.Size([111, 50, 300])
Iteration: 30; Percent complete: 30.0%; Average los

In [52]:
encoder.eval()
decoder.eval()
searcher = trainer.Searcher(encoder, decoder)
words = trainer.evaluate(encoder, decoder, searcher, input_vocab, output_vocab, "我 喜欢 你 啊 啊 啊 啊", 0)
print(words)

torch.Size([7, 1, 300])
the the the the the the the
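
A repeated-word output like the one above is the standard case that BLEU's clipped (modified) n-gram precision is designed to penalize. The following is a minimal sketch of unsmoothed, single-reference sentence-level BLEU; the function names are illustrative, and the reference sentence is a hypothetical example rather than data from the test set.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    """Modified n-gram precision: each hypothesis n-gram counts at most
    as often as it appears in the reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total

def sentence_bleu(hyp, ref, max_n=4):
    """Geometric mean of clipped precisions (n = 1..max_n) times the brevity penalty."""
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the the the the the the the".split()
ref = "the cat is on the mat".split()   # hypothetical reference, for illustration only
print(clipped_precision(hyp, ref, 1))   # 2/7: "the" is clipped to its 2 reference occurrences
print(sentence_bleu(hyp, ref))          # 0.0: no bigram of the hypothesis appears in the reference
```

Test-set BLEU is normally computed at corpus level by pooling n-gram counts over all sentences, so this sentence-level version is meant only to show the mechanics.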


## 4. Attention Visualization (10 points)

Visualize the attention matrix in your model here.
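
One common way to do this is a heatmap with source tokens on the x-axis and generated tokens on the y-axis. The following is a minimal sketch using matplotlib, assuming the attention weights have been collected into a (target length, source length) array; `plot_attention`, the output file name, and the random demo weights are all hypothetical. The demo uses romanized placeholder tokens because real Chinese labels need a CJK-capable font configured in matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_attention(weights, src_tokens, tgt_tokens, path="attention.png"):
    """Draw a (tgt_len, src_len) attention matrix as a heatmap and save it."""
    weights = np.asarray(weights)
    assert weights.shape == (len(tgt_tokens), len(src_tokens))
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(weights, cmap="viridis", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)
    ax.set_xlabel("source (zh)")
    ax.set_ylabel("target (en)")
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path

# Demo with made-up weights: each row is one decoding step and sums to 1.
rng = np.random.default_rng(0)
demo = rng.random((3, 3))
demo /= demo.sum(axis=1, keepdims=True)
saved = plot_attention(demo, ["wo", "xihuan", "ni"], ["i", "like", "you"])
print(saved)
```

A well-trained model usually shows a roughly diagonal band for this language pair with local reorderings, while near-uniform rows suggest the attention has not learned a useful alignment.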

## 5. Model Attack (30 points)

Attack your model here.
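
One of the simplest options is a black-box word-level perturbation attack: delete or swap tokens in the segmented source sentence and rank the edits by how much the translation changes. The following is a minimal sketch under that assumption; `perturbations`, `token_overlap`, `attack`, and the toy lookup-table translator are hypothetical stand-ins, and a real run would pass a function that wraps the trained model instead.

```python
def perturbations(tokens):
    """Yield (description, perturbed_tokens) for simple word-level edits."""
    for i in range(len(tokens)):
        yield f"delete[{i}]", tokens[:i] + tokens[i + 1:]
    for i in range(len(tokens) - 1):
        swapped = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
        yield f"swap[{i},{i + 1}]", swapped

def token_overlap(a, b):
    """Jaccard overlap between two token lists (1.0 means identical token sets)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def attack(translate, sentence):
    """Rank perturbations of a segmented sentence by how much they change the output."""
    tokens = sentence.split()
    base = translate(sentence).split()
    results = []
    for name, pert in perturbations(tokens):
        out = translate(" ".join(pert)).split()
        results.append((token_overlap(base, out), name, " ".join(pert)))
    return sorted(results)  # lowest overlap first = most damaging edit first

# Toy stand-in for the real model: a word-for-word lookup table.
toy_table = {"我": "i", "喜欢": "like", "你": "you"}

def toy_translate(sentence):
    return " ".join(toy_table.get(tok, "<unk>") for tok in sentence.split())

for overlap, name, pert in attack(toy_translate, "我 喜欢 你")[:3]:
    print(f"{overlap:.2f}  {name:10s} {pert}")
```

Edits that barely change the source meaning but sharply lower the overlap point to the inputs the model is most fragile on, which is the behavior this section asks you to analyze.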

## 6. Conclusion (5 points)

Write down your conclusion here.

## Reference

[1] Must-read Papers on Textual Adversarial Attack and Defense, GitHub: https://github.com/thunlp/TAADpapers