# Fudan PRML Assignment2: Machine Translation and Model Attack

![Google Translation](./img/google_translation.PNG)

*Your name and Student ID: 陈朦伊 19307110382*

*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet, and a .pdf report file) with your assignment submission.*

你好，欢迎来到第二次作业！    
Hello and welcome to the second assignment!



In this assignment, you will work on a **Chinese-to-English** machine translation (MT) task with neural networks. Unlike Assignment 1, this is a generation task in NLP. The bilingual parallel corpus you will work with is the News Commentary v13 dataset from the [Third Conference on Machine Translation (WMT18)](https://www.statmt.org/wmt18/translation-task.html). The dataset contains about 252,700 training samples, 2,000 validation samples, and 2,000 test samples, and the Chinese sentences have already been word-segmented. You have to design a Sequence-to-Sequence (Seq2Seq) model for this translation task. **Your model must use an attention mechanism**, but you are free to design the other modules with CNNs, RNNs, and so on. **You have to evaluate your model on the test set with the [Bilingual Evaluation Understudy (BLEU) score](https://en.wikipedia.org/wiki/BLEU) and [perplexity](https://en.wikipedia.org/wiki/Perplexity).** **In addition, you have to visualize the attention matrix to help you understand your model.**
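Perplexity, one of the two required metrics, is the exponential of the average negative log-likelihood that the model assigns to the reference tokens. The following is a minimal sketch rather than the course's reference implementation; the function name `perplexity` and the toy probabilities are assumptions made for illustration, and in practice the log-probabilities would come from your decoder's output on the test set (excluding padding).

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the reference tokens.

    `token_log_probs` is a hypothetical flat list with one entry per
    non-padding reference token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token (the cross-entropy in nats).
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every reference token probability 0.25
# is as uncertain as a uniform choice among four words.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

If your training loss is already the mean token-level cross-entropy in nats, the same quantity is simply `math.exp(loss)`.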

After building your model, in the second part you have to **attack** it : ) AI safety is now a popular research topic, and many kinds of attack methods have been developed over the past few years. These include adversarial attacks, data poisoning attacks, training data leakage attacks, and so on [1]. In this assignment, **you only need to conduct one type of attack**. You can choose the simplest one, or just analyze which kinds of samples the model predicts incorrectly. The important thing is to understand the behavior of neural models.

You may use deep learning frameworks such as Paddle, PyTorch, or TensorFlow in your experiments, but not higher-level libraries such as Hugging Face. Please record the versions you use in the './requirements.txt' file.

The following links may be useful:    
- *Machine Translation: Foundations and Models*, Tong Xiao and Jingbo Zhu, link: https://github.com/NiuTrans/MTBook
- PyTorch Seq2Seq Tutorial @ Ben Trevett, link: https://github.com/bentrevett/pytorch-seq2seq

The exercises on the PaddlePaddle AI Studio platform will also be helpful.

## 1. Setup

Import the libraries and load the dataset here.

In [1]:
# setup code
%load_ext autoreload
%autoreload 2

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext
from torchtext.legacy.data import Field, BucketIterator, Example

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import model.DataLoader as data_loader
import model.utils as utils
import model.Trainer2 as trainer
from model.Datasets import MyDataset
from model.models import Encoder, Decoder, Seq2Seq

import numpy as np

import random
import math
import time
import os


SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
dataset_path = './dataset'
train_en_path = os.path.join(dataset_path, 'train', 'news-commentary-v13.zh-en.en')
train_zh_path = os.path.join(dataset_path, 'train', 'news-commentary-v13.zh-en.zh')
dev_en_path = os.path.join(dataset_path, 'dev', 'newsdev2017.tc.en')
dev_zh_path = os.path.join(dataset_path, 'dev', 'newsdev2017.tc.zh')
test_en_path = os.path.join(dataset_path, 'test', 'newstest2017.tc.en')
test_zh_path = os.path.join(dataset_path, 'test', 'newstest2017.tc.zh')

paths = {'train_en': train_en_path, 'train_zh': train_zh_path, 'dev_en': dev_en_path, 'dev_zh': dev_zh_path, 'test_en': test_en_path, 'test_zh': test_zh_path}
data_bundle = data_loader.DataBundle(paths)
data_bundle.process()


## 2. Exploratory Data Analysis (5 points)

You may have to explore the dataset and do some analysis first.

In [4]:
data_bundle.print_info()

Datasets:
	Train: 252777
	Dev: 2002
	Test: 2001
Vocabulary:
	En: 166192
	Zh: 93264
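
Beyond corpus and vocabulary sizes, sentence-length statistics are a common next step, since they inform choices such as a maximum sequence length. The sketch below is illustrative only: `length_stats` is a hypothetical helper, not part of `model.DataLoader`, and the three toy sentences stand in for the real corpus.

```python
import math
from statistics import mean, median


def length_stats(sentences):
    """Token-count statistics for whitespace-segmented sentences."""
    lengths = sorted(len(sentence.split()) for sentence in sentences)
    if not lengths:
        raise ValueError("need at least one sentence")
    # Nearest-rank 95th percentile: the smallest length that covers
    # at least 95% of the sentences.
    p95 = lengths[math.ceil(0.95 * len(lengths)) - 1]
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": mean(lengths),
        "median": median(lengths),
        "p95": p95,
    }


# Toy stand-in for the segmented Chinese side of the corpus.
toy_sentences = ["我 爱 你", "今天 天气 很 好", "你好"]
print(length_stats(toy_sentences))
```

Assuming `data_bundle.get_dataset('train')['zh']` yields whitespace-segmented strings, as its use elsewhere in this notebook suggests, it could be passed in place of `toy_sentences`.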


## 3. Methodology (50 points)

Build and evaluate your model here.
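
For the evaluation part, BLEU combines clipped n-gram precisions (usually up to 4-grams) with a brevity penalty for translations that are too short. The following is a minimal sketch under simplifying assumptions: one reference per hypothesis, pre-tokenized input, and no smoothing. The function names are hypothetical, and scores may differ from those of standard tools, which apply their own tokenization.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on a mat".split()
hypothesis = "the cat sat on the mat".split()
print(corpus_bleu([hypothesis], [reference]))
```

Accumulating matches and totals over the whole test set before taking the geometric mean, rather than averaging per-sentence scores, avoids zero scores for short sentences that have no 4-gram match.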

In [34]:
SRC = Field(tokenize = utils.tokenize, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

TRG = Field(tokenize = utils.tokenize, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

train_data = MyDataset(data_bundle.get_dataset('train')['zh'][:1000], data_bundle.get_dataset('train')['en'][:1000], (SRC, TRG))
valid_data = MyDataset(data_bundle.get_dataset('dev')['zh'], data_bundle.get_dataset('dev')['en'], (SRC, TRG))
test_data = MyDataset(data_bundle.get_dataset('test')['zh'], data_bundle.get_dataset('test')['en'], (SRC, TRG))

trg_data, src_data = data_bundle.get_dataset('train')['en'][:5000], data_bundle.get_dataset('train')['zh'][:5000]

# Print the index of each training sentence with a token containing 朋友 ("friend")
for i, data in enumerate(src_data):
    words = data.split()
    for word in words:
        if word.find('朋友') != -1:
            print(i)

52
570
686
700
1251
1464
2773
3249
3535
3746
4586
4982
4987


## 4. Attention Visualization (10 points)

Visualize the attention matrix in your model here.
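
An attention matrix has one row per generated target token and one column per source token, and each row is a probability distribution over the source. The sketch below assumes softmax-normalized scores and uses made-up numbers; in a real run the weights would be returned by your decoder. Both `attention_matrix` and `plot_attention` are hypothetical helpers, not part of `model.models`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(score_rows):
    """Row i holds the attention of target token i over all source tokens."""
    return [softmax(row) for row in score_rows]


def plot_attention(weights, src_tokens, trg_tokens, path="attention.png"):
    """Save a heatmap of the attention weights to `path`."""
    # Imported here so the numeric helpers above run without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 6))
    image = ax.matshow(weights, cmap="bone")
    fig.colorbar(image)
    ax.set_xticks(range(len(src_tokens)))
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticklabels(trg_tokens)
    ax.set_xlabel("source tokens")
    ax.set_ylabel("target tokens")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


# Made-up alignment scores: 2 target tokens attending over 3 source tokens.
toy_scores = [[2.0, 0.5, 0.1], [0.1, 0.3, 2.5]]
weights = attention_matrix(toy_scores)
print([round(sum(row), 6) for row in weights])
```

Calling `plot_attention(weights, src_tokens, trg_tokens)` with the matching token lists writes the heatmap to disk. Chinese tick labels may need a font with CJK glyphs to render correctly.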

## 5. Model Attack (30 points)

Attack your model here.
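
One of the simplest attacks to try is a black-box input perturbation: make a small edit to the source sentence and check whether the translation changes. The sketch below is one possible starting point, not a prescribed method. The perturbation functions, the `attack` helper, and especially `toy_translate` (a word-by-word lookup standing in for a trained model) are all hypothetical.

```python
import random


def swap_adjacent(tokens, rng):
    """Return a copy with one random pair of neighbouring tokens swapped."""
    perturbed = list(tokens)
    if len(perturbed) < 2:
        return perturbed
    i = rng.randrange(len(perturbed) - 1)
    perturbed[i], perturbed[i + 1] = perturbed[i + 1], perturbed[i]
    return perturbed


def drop_token(tokens, rng):
    """Return a copy with one random token removed."""
    perturbed = list(tokens)
    if len(perturbed) < 2:
        return perturbed
    del perturbed[rng.randrange(len(perturbed))]
    return perturbed


def attack(sentence, translate, perturb, trials=5, seed=1234):
    """Collect perturbed inputs whose translation differs from the original."""
    rng = random.Random(seed)
    tokens = sentence.split()
    baseline = translate(sentence)
    changed = []
    for _ in range(trials):
        candidate = " ".join(perturb(tokens, rng))
        output = translate(candidate)
        if output != baseline:
            changed.append((candidate, output))
    return baseline, changed


# Hypothetical stand-in for the trained model: a word-by-word lookup.
TOY_LEXICON = {"我": "i", "的": "of", "朋友": "friend", "很": "very", "好": "good"}


def toy_translate(sentence):
    return " ".join(TOY_LEXICON.get(word, "<unk>") for word in sentence.split())


baseline, changed = attack("我 的 朋友 很 好", toy_translate, drop_token)
print(baseline)
print(len(changed), "of 5 perturbations changed the output")
```

A changed output alone does not show that the attack succeeded. Comparing BLEU against the reference before and after the perturbation gives a more meaningful measure of how much translation quality degrades.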

## 6. Conclusion (5 points)

Write down your conclusion here.

## Reference

[1] Must-read Papers on Textual Adversarial Attack and Defense, GitHub: https://github.com/thunlp/TAADpapers