# Coursework: Baseline Model

This notebook walks you step by step through the implementation of a simple baseline model to get you started on the coursework. There is one section for the English-German task and another for English-Chinese. Each section is standalone, so feel free to read only one of them. However, because the tasks require slightly different approaches, going through both sections may give you ideas for your chosen task, especially since each task processes English in a slightly different way.

Enjoy!

## A. English-German

### Importing Data

In [None]:
# Download and unzip the data
from os.path import exists
if not exists('ende_data.zip'):
    !wget -O ende_data.zip https://competitions.codalab.org/my/datasets/download/c748d2c0-d6be-4e36-9f12-ca0e88819c4d
    !unzip ende_data.zip

In [None]:
# Check the files
import io

#English-German
print("---EN-DE---")
print()

with open("./train.ende.src", "r") as ende_src:
  print("Source: ",ende_src.readline())
with open("./train.ende.mt", "r") as ende_mt:
  print("Translation: ",ende_mt.readline())
with open("./train.ende.scores", "r") as ende_scores:
  print("Score: ",ende_scores.readline())


---EN-DE---

Source:  José Ortega y Gasset visited Husserl at Freiburg in 1934.

Translation:  1934 besuchte José Ortega y Gasset Husserl in Freiburg.

Score:  1.1016968715664406



### Setting up GPU

In [None]:
import numpy as np
import random

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
# Fix GPU seeds
SEED = 9320

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True
    device = 'cuda:0'
else:
    device = 'cpu'

print('Device is', device)


# we fix the seeds to get consistent results before every training
# loop in what follows
def fix_seed(seed=234):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


fix_seed()

Device is cuda:0


### Computing Sentence Embeddings - FLAIR library


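For this baseline model, we will simply use pre-trained GloVe embeddings via the spaCy module: we compute the vector for each word and take the mean over all words in the sentence. We do the same for both source and translation sentences. The cells in this section swap in contextual embeddings loaded through the Flair library (ELMo for English, BERT for German), pooled into one vector per sentence. For Chinese tokenization and embeddings we will have to find other tools.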

This is a very simplistic approach so feel free to be more creative and play around with how the sentence embeddings are computed for example ;).

GloVe embeddings do not support the Chinese language so in the section of the English-Chinese task we will have to download pretrained Chinese embeddings from word2vec repositories.
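The "mean of the word vectors" idea described above can be sketched in a few lines of NumPy. This is only an illustration: `mean_pool` and the toy vectors are made up for the example and are not part of the notebook or of any embedding library.

```python
import numpy as np

def mean_pool(word_vectors):
    """Average a (num_words, dim) array of word vectors into one (dim,) vector."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    if word_vectors.ndim != 2 or word_vectors.shape[0] == 0:
        raise ValueError("expected a non-empty (num_words, dim) array")
    return word_vectors.mean(axis=0)

# Toy 3-dimensional "embeddings" for a three-word sentence (made-up numbers).
toy_sentence = [[1.0, 0.0, 2.0],
                [3.0, 2.0, 0.0],
                [2.0, 4.0, 1.0]]
sentence_vector = mean_pool(toy_sentence)
print(sentence_vector)  # [2. 2. 1.]
```

Every sentence ends up with a vector of the same size regardless of its length, which is what lets us feed the result to a standard regressor later on.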

In [None]:
import torch
!pip install flair
import flair
from flair.data import Sentence





In [None]:
# ELMoEmbeddings depends on allennlp, so install it before creating the embeddings
!pip install allennlp

from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings
from flair.embeddings import DocumentPoolEmbeddings

###########English Embeddings##########
# glove_embedding = WordEmbeddings('glove')
# character_embeddings = CharacterEmbeddings()
# bert_embedding = BertEmbeddings()
elmo_embedding = ELMoEmbeddings()

flair_forward_en  = FlairEmbeddings('news-forward-fast')
flair_backward_en = FlairEmbeddings('news-backward-fast')

###########German Embeddings###############

distillBERT_de = BertEmbeddings(bert_model_or_path="distilbert-base-german-cased")
BERT_de =  BertEmbeddings(bert_model_or_path="bert-base-german-cased")


#################MultiLingual Embeddings##########
# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')
bert_embedding2 = BertEmbeddings(bert_model_or_path="albert-base-v2")

# #Stack some embeddings:
# stacked_embeddings = StackedEmbeddings(
#     embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])




# document_embeddings = DocumentPoolEmbeddings(
#     embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

# #sentence = Sentence('The grass is green .')

# # just embed a sentence using the StackedEmbedding as you would with any single embedding.
# stacked_embeddings.embed(sentence)

# # now check out the embedded tokens.
# for token in sentence:
#     print(token)
#     print(token.embedding)




In [None]:
document_embeddings_de = DocumentPoolEmbeddings(
    embeddings=[BERT_de])


document_embeddings_en  = DocumentPoolEmbeddings(
    embeddings=[elmo_embedding])

# document_embeddings_de = DocumentPoolEmbeddings(
#     embeddings=[bert_embedding])


# document_embeddings_en  = DocumentPoolEmbeddings(
#     embeddings=[bert_embedding ])

In [None]:
from scipy.fftpack import dct

#### DCT Pooling
def DCT_pooling(sent, k=2):
    '''
    Calculates a sentence embedding by keeping the first k DCT coefficients
    computed along the word axis.
    input: sent -> np array [N,D] N = number of words, D = embedding dim
           k - how many coefficients to keep
    output: sentence embedding as a flat vector of length k * D
    '''
    num_words = sent.shape[0]
    # DCT along the word axis; n zero-pads sentences shorter than k words
    out = dct(sent, type=3, n=max(num_words, k), axis=0)
    # keep the first k coefficients and flatten them into one vector
    return out[:k].reshape(-1)
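To make the pooling idea concrete, here is a minimal NumPy-only sketch that keeps the first `k` coefficients of a DCT taken along the word axis. It assumes the unnormalized DCT-II basis, which is an assumption of this example (the cell above passes `type=3` to SciPy), and `dct2_pool` is a hypothetical helper name.

```python
import numpy as np

def dct2_pool(word_vectors, k=2):
    """Keep the first k DCT-II coefficients along the word axis.

    word_vectors: (num_words, dim) array. Returns a flat (k * dim,) vector.
    Sentences shorter than k words are zero-padded so the output size is fixed.
    """
    x = np.asarray(word_vectors, dtype=float)
    num_words, dim = x.shape
    if num_words < k:
        x = np.vstack([x, np.zeros((k - num_words, dim))])
        num_words = k
    positions = np.arange(num_words)                 # word index n
    freqs = np.arange(k)[:, None]                    # coefficient index, shape (k, 1)
    # Unnormalized DCT-II basis: cos(pi / N * (n + 0.5) * freq), shape (k, num_words)
    basis = np.cos(np.pi / num_words * (positions + 0.5) * freqs)
    coeffs = basis @ x                               # (k, dim)
    return coeffs.reshape(-1)

toy_sentence = np.array([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])
pooled = dct2_pool(toy_sentence, k=2)
print(pooled.shape)  # (4,)
```

With this basis the zeroth coefficient is simply the column sum, so `k=1` is mean pooling up to a scale factor; higher coefficients add coarse information about word order.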

We can now write our functions that will return the average embeddings for a sentence.

#### Pre-processing

In [None]:
!pip install nltk



In [None]:

from nltk.tokenize import RegexpTokenizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# from nltk.stem.cistem import Cistem
from nltk.corpus import stopwords

#downloading stopwords from the nltk package
nltk.download('stopwords') #stopwords dictionary, run once
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

stop_words_en = set(stopwords.words('english'))
stop_words_de = set(stopwords.words('german'))


tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()


def nltk2wn_tag(nltk_tag):
  if nltk_tag.startswith('J'):
    return wordnet.ADJ
  elif nltk_tag.startswith('V'):
    return wordnet.VERB
  elif nltk_tag.startswith('N'):
    return wordnet.NOUN
  elif nltk_tag.startswith('R'):
    return wordnet.ADV
  else:          
    return None

def lemmatize_sentence_en(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)
    res_words = []
    for word, tag in wn_tagged:
        if word not in stop_words_en:
            if tag is None:            
                res_words.append(word)
            else:
                res_words.append(lemmatizer.lemmatize(word, tag))
    return " ".join(res_words)



def lemmatize_sentence_de(sentence):
    '''
    input: sentence (list(tokens))
    return: lemmatized sentence list(tokens)
    '''
    stemmer = nltk.stem.cistem.Cistem()
    #Assumes tokenizing first
    
    return [stemmer.segment(token)[0] for token in sentence if token not in stop_words_de]
    
    
    
def tokenize_sentences(corpus):

    return [tokenizer.tokenize(s) for s in corpus]



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
import numpy as np
import spacy
import torch
from nltk.corpus import stopwords

# Stop-word lists (downloaded in the pre-processing cell above)
stop_words_en = set(stopwords.words('english'))
stop_words_de = set(stopwords.words('german'))

def get_sentence_emb(line,nlp,lang):
  text = line.strip()
  if lang == 'en':
    # Lemmatize and drop English stop words
    l = lemmatize_sentence_en(text)
  elif lang == 'de':
    # Drop German stop words. Split into words first: iterating over the
    # string directly would yield single characters instead of words.
    l = ' '.join([word for word in text.split() if word not in stop_words_de])
  else:
    raise ValueError("lang must be either en or de")

  # Fall back to the raw text if filtering removed every token
  sentence = Sentence(l if l else text)
  nlp.embed(sentence)
  return sentence.get_embedding()

def get_embeddings(f,nlp,lang):
  # Read all lines up front so the file is closed before embedding starts
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors  = []
  for count, l in enumerate(lines):
      vec = get_sentence_emb(l,nlp,lang)
      if vec is not None:
        sentences_vectors.append(vec.cpu().detach().numpy())
      else:
        print("didn't work :", l)
        sentences_vectors.append(0)
      # Progress indicator
      if count % 100 == 0:
            print(count)
  return sentences_vectors
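When filtering stop words, it matters whether the sentence is still a string or already a list of tokens. The sketch below uses a tiny made-up stop-word set (a stand-in for the NLTK list) and a hypothetical helper `remove_stop_words` to show the difference.

```python
def remove_stop_words(sentence, stop_words):
    """Drop stop words from a whitespace-tokenized sentence (case-insensitive)."""
    kept = [tok for tok in sentence.split() if tok.lower() not in stop_words]
    return " ".join(kept)

# Tiny stand-in for the NLTK German stop-word list (assumption for this example).
toy_stop_words = {"in", "der", "die", "das"}
text = "1934 besuchte Ortega Husserl in Freiburg."

filtered = remove_stop_words(text, toy_stop_words)
print(filtered)  # 1934 besuchte Ortega Husserl Freiburg.

# Pitfall: iterating over the string itself yields characters, not words,
# so nothing is filtered and every character gets separated by a space.
char_level = " ".join(ch for ch in "in Freiburg" if ch not in toy_stop_words)
print(char_level)  # i n   F r e i b u r g
```

Character-level output like the second line would reach the embedding model as a sequence of single-letter tokens, so it is worth printing one processed sentence before embedding a whole file.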


#### Getting Training and Validation Sets

We will now run the code for the English-German translations and get our training and validation sets ready for the regression task.


In [None]:
# import spacy

# nlp_de =spacy.load('de300')
# nlp_en =spacy.load('en300')

In [None]:
import torch


#EN-DE files
de_train_src = get_embeddings("./train.ende.src",document_embeddings_en,'en')
print("English dev Done")
de_train_mt = get_embeddings("./train.ende.mt",document_embeddings_de,'de')
print('German Done')
f_train_scores = open("./train.ende.scores",'r')
de_train_scores = f_train_scores.readlines()

de_val_src = get_embeddings("./dev.ende.src",document_embeddings_en,'en')
de_val_mt = get_embeddings("./dev.ende.mt",document_embeddings_de,'de')
f_val_scores = open("./dev.ende.scores",'r')
de_val_scores = f_val_scores.readlines()



0
100
...
6900
English dev Done
0
100
...
6900
German Done
0
100
...
900
0
100
...
900


In [None]:

#EN-DE
print(f"Training mt: {len(de_train_mt)} Training src: {len(de_train_src)}")
print()
print(f"Validation mt: {len(de_val_mt)} Validation src: {len(de_val_src)}")


Training mt: 7000 Training src: 7000

Validation mt: 1000 Validation src: 1000





### Computing embeddings - pre-trained BERT

In [None]:
!pip install transformers
import random

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


from torchtext import data, datasets
from torch.utils.data import DataLoader, TensorDataset, sampler

from transformers import BertTokenizer, BertConfig, BertModel, BertForSequenceClassification, AdamW



class BERTembedd(nn.Module):
  def __init__(self, batch_size=64):
    super().__init__()
    self.english_BERT = BertModel.from_pretrained('bert-base-uncased')
    self.german_BERT = BertModel.from_pretrained('bert-base-german-cased')
    self.english_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    self.german_tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')
    self.batch_size = batch_size


    if torch.cuda.is_available():
      torch.backends.cudnn.deterministic = True
      self.device = 'cuda:0'
    else:
      self.device = 'cpu'

    print('Device is', self.device)

  def prepare_data(self, f, lang, max_len=30):
    with open(f) as file:
      lines = file.readlines()
      if lang == "en":
        tokenizer = self.english_tokenizer
      elif lang == "de":
        tokenizer = self.german_tokenizer
      else:
        raise ValueError("lang must be either en or de")

    input_ids = torch.LongTensor(
      [tokenizer.encode(text, max_length=max_len, add_special_tokens=True, pad_to_max_length=True) for text in
       lines])
    tokenizer = None


    # Create attention masks: 0 where token_id is 0 (padding), 1 otherwise.
    attention_masks = (input_ids != 0).long()

    # Create dataset and dataloader
    dataset = list(zip(input_ids, attention_masks))
    dataloader = DataLoader(dataset, batch_size=self.batch_size)


    return dataloader


  def get_embeddings(self, f, lang, pooling_fcn=torch.mean):
    d = self.prepare_data(f, lang)

    if lang == "en":
      bert = self.english_BERT
    elif lang == "de":
      bert = self.german_BERT
    else:
      raise ValueError("lang must be de or en.")
    bert = bert.to(self.device)
    bert.eval()
    with torch.no_grad():
      results = []

      for x in d:
        sentences = x[0].to(self.device)
        masks = x[1].to(self.device)
        result = bert(input_ids=sentences, attention_mask=masks)[0] # -> (batch_size, sequence_length, hidden_size)
        # avg_pool1d pools over the last (hidden) dimension, so each token is
        # reduced to the mean of its hidden units: -> (batch_size, sequence_length)
        # pooled = F.max_pool1d(result, result.shape[2]).squeeze(-1)
        pooled = F.avg_pool1d(result, result.shape[2]).squeeze(-1)
        if pooling_fcn is not None:
          # Reduce over tokens (padding included): one scalar per sentence
          sentence_vectors = pooling_fcn(pooled, 1)
        else:
          sentence_vectors = pooled
        # print(sentence_vectors)
        results += list(sentence_vectors.cpu().numpy())

    return results


  def forward(self, f, lang, pooling_fcn=torch.mean):
    return self.get_embeddings(f, lang, pooling_fcn)


bertembedder = BERTembedd()
def get_bert_embeddings(f, lang, pooling_fcn=torch.mean):
  return bertembedder.forward(f, lang, pooling_fcn)


Device is cuda:0
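The cell above averages over the hidden dimension and then over tokens, which leaves a single number per sentence. A common alternative is to average over the token axis only, ignoring padding, so that each sentence keeps a full `hidden_size` vector. The sketch below shows that arithmetic in NumPy with made-up values; `masked_mean_pool` is a hypothetical helper, not part of `transformers`, and the same operations carry over to PyTorch tensors.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real (non-padding) tokens only.

    hidden_states:  (batch, seq_len, hidden) array
    attention_mask: (batch, seq_len) array of 0/1
    returns:        (batch, hidden) array
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                 # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)               # avoid division by zero
    return summed / counts

# Two toy "sentences", three token slots each, hidden size 2.
# The last slot of the first sentence is padding and must not affect the mean.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
              [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]]
toy_mask = [[1, 1, 0],
            [1, 1, 1]]
sentence_vectors = masked_mean_pool(toy_hidden, toy_mask)
print(sentence_vectors.shape)  # (2, 2)
```

Note that switching to vector-valued sentence embeddings changes the feature dimension seen by every regressor further down, so input sizes would need to follow.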


###### Bert embedding, getting training and validation with BERT:

In [None]:
#EN-DE files
de_train_src = get_bert_embeddings("./train.ende.src",'en', pooling_fcn=torch.mean)

de_train_mt = get_bert_embeddings("./train.ende.mt",'de', pooling_fcn=torch.mean)

f_train_scores = open("./train.ende.scores",'r')
de_train_scores = f_train_scores.readlines()

de_val_src = get_bert_embeddings("./dev.ende.src",'en',pooling_fcn=torch.mean)
de_val_mt = get_bert_embeddings("./dev.ende.mt",'de',pooling_fcn=torch.mean)
f_val_scores = open("./dev.ende.scores",'r')
de_val_scores = f_val_scores.readlines()

### Pytorch NN Regressor



##### Prepare training and validation set vectors:

In [None]:
#Put the features into a tensor [B, 2*D] - B sentences, D = embedding dim (source and translation concatenated)
import numpy as np
import torch

num_samples = len(de_train_src)
num_dims = len(de_train_src[0]) * 2
X_train = torch.zeros((num_samples,num_dims),dtype=torch.float)
for i in range(len(de_train_src)):
  en_vec = de_train_src[i]
  de_vec = de_train_mt[i]
  vec = np.concatenate((en_vec,de_vec))
  vec = torch.tensor(vec,dtype=torch.float).squeeze()

  X_train[i,:] = vec

X_train_de = X_train



# X_train= [np.array(de_train_src),np.array(de_train_mt)]
# X_train_de = np.array(X_train).transpose()


num_samples = len(de_val_src)
num_dims = len(de_val_src[0]) * 2

X_val = torch.zeros((num_samples,num_dims),dtype=torch.float)
for i in range(len(de_val_src)):
  en_vec = de_val_src[i]
  de_vec = de_val_mt[i]
  vec = np.concatenate((en_vec,de_vec))
  vec = torch.tensor(vec,dtype=torch.float).squeeze()
  X_val[i,:] = vec

X_val_de = X_val


# X_val = [np.array(de_val_src),np.array(de_val_mt)]
# X_val_de = np.array(X_val).transpose()

#Scores
print(de_train_scores[0])
# np.float was removed in NumPy 1.24; the built-in float is equivalent
de_train_scores = np.array(de_train_scores, dtype=float)
train_scores = torch.tensor(de_train_scores).type(torch.float)
y_train_de =train_scores

val_scores = np.array(de_val_scores,dtype=float)
val_scores = torch.tensor(val_scores, dtype=torch.float)
y_val_de =val_scores

1.1016968715664406
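The feature construction above boils down to placing the source and translation vectors of each sentence pair side by side and parsing the score lines as floats. A compact NumPy sketch of the same idea follows; `build_features` and `parse_scores` are hypothetical helper names and the numbers are made up.

```python
import numpy as np

def build_features(src_vectors, mt_vectors):
    """Concatenate per-sentence source and translation vectors row by row."""
    src = np.asarray(src_vectors, dtype=np.float32)
    mt = np.asarray(mt_vectors, dtype=np.float32)
    if src.shape[0] != mt.shape[0]:
        raise ValueError("source and translation must have the same number of sentences")
    # column_stack accepts both 1-D (one scalar per sentence) and 2-D inputs
    return np.column_stack([src, mt])

def parse_scores(lines):
    """Turn lines such as '1.10\\n' into a float32 array."""
    return np.array([float(line) for line in lines], dtype=np.float32)

toy_src = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_mt = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X = build_features(toy_src, toy_mt)
y = parse_scores(["1.5\n", "-0.25\n", "0.0\n"])
print(X.shape, y.shape)  # (3, 4) (3,)
```

The resulting arrays can be wrapped with `torch.from_numpy` if tensors are needed.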



PyTorch fully connected layers (FCL)


In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

from scipy.stats import pearsonr


# You should set a random seed to ensure that your results are reproducible.
torch.manual_seed(0)
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
if use_cuda:
    torch.cuda.manual_seed(0)
    
print("Using GPU: {}".format(use_cuda))

class OneHiddenLayerMNISTClassifier(nn.Module):
    # Define entities containing model weights in the constructor.
    def __init__(self, n_hidden):
        super().__init__()
        self.linear1 = nn.Linear(
            in_features=num_dims, out_features=n_hidden, bias=True
        )
        self.linear2 = nn.Linear(
            in_features=n_hidden, out_features=200, bias=True
        )

        self.linear3 = nn.Linear(
            in_features=200, out_features=100, bias=True
        )

        self.linear4 = nn.Linear(
            in_features=100, out_features=1, bias=True
        )


    # Then, all you need to do is implement a `forward` method to define the
    # computation that takes place on the forward pass. A corresponding
    # `backward` method, which computes gradients, is automatically defined!
    def forward(self, inputs):
        # torch.tanh replaces the deprecated F.tanh
        h = self.linear1(inputs)
        h = torch.tanh(h)
        h = self.linear2(h)
        h = torch.tanh(h)
        h = self.linear3(h)
        h = torch.tanh(h)
        h = self.linear4(h)
        
        return h


def train(model, train_loader, optimizer, epoch, log_interval=100):
    """
    A utility function that performs a basic training loop.

    For each batch in the training set, fetched using `train_loader`:
        - Zeroes the gradient used by `optimizer`
        - Performs forward pass through `model` on the given batch
        - Computes loss on batch
        - Performs backward pass
        - `optimizer` updates model parameters using computed gradient

    Prints the training loss on the current batch every `log_interval` batches.
    """
    for batch_idx, (inputs,targets) in enumerate(train_loader):
        # We need to send our batch to the device we are using. If this is not
        # done, it will default to using the CPU.
        inputs = inputs.to(device)

        targets = targets.to(device)
        
        # Zeroes the gradient used by `optimizer`; NOTE: if this is not done,
        # then gradients will be accumulated across batches!
        optimizer.zero_grad()

        # Performs forward pass through `model` on the given batch; equivalent
        # to `model.forward(inputs)`. Any information needed to compute
        # gradients is recorded automatically thanks to autograd running under the hood.
        outputs = model(inputs)

        # Computes loss on batch; `F.mse_loss` computes the mean squared error 
        #loss on batch.
 
        loss =  F.mse_loss(outputs.squeeze(),targets)

        # Performs backward pass; steps backward through the computation graph,
        # computing the gradient of the loss wrt model parameters.
        loss.backward()

        # `optimizer` updates model parameters using computed gradient.
        optimizer.step()

        # Prints the training loss on the current batch every `log_interval`
        # batches.
        if batch_idx % log_interval == 0:
            print(
                "Train Epoch: {:02d} -- Batch: {:03d} -- Loss: {:.4f}".format(
                    epoch,
                    batch_idx,
                    # Calling `loss.item()` returns the scalar loss as a Python
                    # number.
                    loss.item(),
                )
            )


def val(model, test_loader):
    """
    A utility function to compute the loss and Pearson correlation on a
    validation set by iterating through it using the provided `test_loader`
    and accumulating the loss and predictions over all batches.
    """
    test_loss = 0.0
    count = 0
    all_targets = []
    all_outputs = []
    # You should use the `torch.no_grad()` context when you want to perform a
    # forward pass but do not need gradients. This effectively disables
    # autograd and results in fewer resources being used to perform the forward
    # pass (since information needed to compute gradients is not logged).
    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs).squeeze(-1)
            # We use `reduction="sum"` to aggregate losses across batches using
            # summation instead of taking the mean - we will take the mean at
            # the end once we have accumulated all the losses.
            test_loss += F.mse_loss(outputs, targets, reduction="sum").item()
            count += targets.shape[0]
            all_targets.append(targets.cpu())
            all_outputs.append(outputs.cpu())

    # Pearson is computed once over the whole set rather than per batch
    pearson_score = pearsonr(torch.cat(all_targets).numpy(), torch.cat(all_outputs).numpy())
    print(f'Mean squared error: {test_loss / count}')
    print(f'Pearson score: {pearson_score[0]}')



def test(model,input):
    with torch.no_grad():
      return model(input)
    

Using GPU: True
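Pearson correlation is not additive across batches, so it should be computed once over all validation predictions rather than batch by batch. The pure-Python sketch below, with made-up numbers and a hypothetical `pearson` helper, shows how much the two can differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Two toy batches of (targets, predictions).
batches = [
    ([1.0, 2.0], [1.0, 2.0]),
    ([3.0, 4.0], [4.0, 3.0]),
]

# Accumulate everything first, then compute the correlation once.
all_targets, all_preds = [], []
for targets, preds in batches:
    all_targets.extend(targets)
    all_preds.extend(preds)

global_r = pearson(all_targets, all_preds)
last_batch_r = pearson(*batches[-1])
print(global_r, last_batch_r)  # 0.8 -1.0
```

With a validation loader whose batch size equals the size of the validation set there is only one batch, so both numbers coincide; with smaller batches they generally do not.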


Create data loaders


In [None]:
 # Create dataloaders
from torch.utils.data import TensorDataset
# train_dataset = TensorDataset(X_train, y_train)



train_loader = DataLoader(TensorDataset(X_train_de, y_train_de), batch_size=128, shuffle=True)
test_loader = DataLoader(TensorDataset(X_val_de, y_val_de), batch_size=1000, shuffle=False)




Training the model

In [None]:
model = OneHiddenLayerMNISTClassifier(n_hidden=100).to(device)

# Create instance of optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train-test loop
for epoch in range(150):
    train(model, train_loader, optimizer, epoch)

val(model, test_loader)



Train Epoch: 00 -- Batch: 000 -- Loss: 1.1142
Train Epoch: 01 -- Batch: 000 -- Loss: 0.9789
Train Epoch: 02 -- Batch: 000 -- Loss: 0.7936
Train Epoch: 03 -- Batch: 000 -- Loss: 0.9088
Train Epoch: 04 -- Batch: 000 -- Loss: 0.6705
Train Epoch: 05 -- Batch: 000 -- Loss: 0.4876
Train Epoch: 06 -- Batch: 000 -- Loss: 0.3257
Train Epoch: 07 -- Batch: 000 -- Loss: 0.9002
Train Epoch: 08 -- Batch: 000 -- Loss: 0.7124
Train Epoch: 09 -- Batch: 000 -- Loss: 0.7570
Train Epoch: 10 -- Batch: 000 -- Loss: 0.7919
Train Epoch: 11 -- Batch: 000 -- Loss: 1.2568
Train Epoch: 12 -- Batch: 000 -- Loss: 0.5428
Train Epoch: 13 -- Batch: 000 -- Loss: 0.2917
Train Epoch: 14 -- Batch: 000 -- Loss: 0.4030
Train Epoch: 15 -- Batch: 000 -- Loss: 0.8317
Train Epoch: 16 -- Batch: 000 -- Loss: 0.8784
Train Epoch: 17 -- Batch: 000 -- Loss: 0.8211
Train Epoch: 18 -- Batch: 000 -- Loss: 0.6996
Train Epoch: 19 -- Batch: 000 -- Loss: 0.6248
Train Epoch: 20 -- Batch: 000 -- Loss: 0.3495
Train Epoch: 21 -- Batch: 000 -- L

### Other regressors

Putting data in list

In [None]:
#Put the features into a 2-D array with one row per sentence pair
import numpy as np

# column_stack places the source and translation features side by side
X_train_de = np.column_stack((np.array(de_train_src), np.array(de_train_mt)))

X_val_de = np.column_stack((np.array(de_val_src), np.array(de_val_mt)))

#Scores
train_scores = np.array(de_train_scores).astype(float)
y_train_de =train_scores

val_scores = np.array(de_val_scores).astype(float)
y_val_de =val_scores

Define RMSE

In [None]:
import numpy as np

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
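As a quick sanity check of the metric, here is a small worked example with made-up numbers. It also shows a useful reference point: always predicting the mean of the targets gives an RMSE equal to their (population) standard deviation.

```python
import numpy as np

def rmse(predictions, targets):
    """Root mean squared error between two equally shaped arrays."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.sqrt(((predictions - targets) ** 2).mean())

toy_targets = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_predictions = np.array([1.5, 2.0, 2.5, 4.0, 6.0])

# Squared errors: 0.25, 0, 0.25, 0, 1 -> mean 0.3 -> sqrt is about 0.548
model_rmse = rmse(toy_predictions, toy_targets)

# Constant baseline: predict the mean target for every example.
baseline_rmse = rmse(np.full_like(toy_targets, toy_targets.mean()), toy_targets)

print(model_rmse, baseline_rmse)
```

A regressor whose RMSE is no better than this constant baseline has learned essentially nothing from the features, whatever its absolute RMSE looks like.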

#### SVM

SVMs have many parameters, such as the kernel and the regularization constant C. Here we will use C = 1 (the scikit-learn default) and compare kernels.

In [None]:
from sklearn.svm import SVR
from scipy.stats import pearsonr

for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_de, y_train_de)
    print(k)
    predictions = clf_t.predict(X_val_de)
    pearson = pearsonr(y_val_de, predictions)
    print(f'RMSE: {rmse(predictions,y_val_de)} Pearson {pearson[0]}')
    print()



linear
RMSE: 0.8815523248077068 Pearson 0.02316592901063899

poly
RMSE: 0.881270949278876 Pearson 0.025748763501295843

rbf
RMSE: 0.8811243881929659 Pearson 0.026738971503697792

sigmoid
RMSE: 23.256236575094878 Pearson 0.014059164328044008



#### Random Forest

In [None]:
# Import the model we are using

from sklearn.ensemble import RandomForestRegressor

for n in [100,500,1000,1500]:

  rf = RandomForestRegressor(n_estimators = n, random_state = 666)

  rf.fit(X_train_de, y_train_de);


  predictions = rf.predict(X_val_de)

  pearson = pearsonr(y_val_de, predictions)
  print('RMSE:', rmse(predictions,y_val_de))
  print(f"Pearson {pearson[0]} for n_estimators = {n}")


RMSE: 0.9171844096424996
Pearson 0.020763124093085897 for n_estimators = 100
RMSE: 0.9174139364155368
Pearson 0.007309205571025664 for n_estimators = 500
RMSE: 0.9165907358012123
Pearson 0.009060574412041517 for n_estimators = 1000
RMSE: 0.9168888726567898
Pearson 0.007834303184429967 for n_estimators = 1500


Here is a regressor built with KerasRegressor. It uses a neural network and a hyper-parameter search.


In [None]:
### Keras Regressor using Neural Networks ####

# This is a neural network regressor that we built using keras

# Import necessary packages
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

# Create random seed to replicate results (this seeds NumPy only)
seed = 1
np.random.seed(seed)

# Create lists that we can use for hyper-parameter search.
# The names avoid shadowing the torch `optimizer` and `optim` used earlier.
activations = ['sigmoid', 'tanh'] # activation functions used
layer_size_1 = [80, 100] # first layer size
layer_size_2 = [40, 100] # second layer size
optimizers = ['adam', 'sgd'] # two optimizers used
batch_sizes = [32,64] # different batch sizes

# Conduct hyper-parameter search using 'for loops' 

for item in activations:
  for size_1 in layer_size_1:
    for size_2 in layer_size_2:
      for opt in optimizers:
        for batch in batch_sizes:

          # Build the neural network architecture: two hidden layers and a
          # linear output layer, using the hyper-parameters of this iteration.
          def baseline_model():
              model = Sequential()
              model.add(Dense(size_1, input_dim=X_train_de.shape[1], activation=item))
              model.add(Dense(size_2, activation=item))
              model.add(Dense(1, activation = 'linear'))
              model.compile(loss='mse', optimizer=opt)
              return model

          # Create the estimator using the KerasRegressor function from Keras and our
          # model architecture
          estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=batch, verbose=False, validation_split = 0.2)

          # train the model with the training data
          estimator.fit(X_train_de, y_train_de)

          # Predict the model on the validation/test data
          predictions = estimator.predict(X_val_de)

          # compute pearson score for each hyper-parameter selection
          pearson = pearsonr(y_val_de, predictions)
          print('RMSE:', rmse(predictions,y_val_de))
          print(f"Pearson {pearson[0]}, activation = {item}, size_1 = {size_1}, size_2 = {size_2}, optim = {opt}, batch size = {batch}")







Using TensorFlow backend.










RMSE: 0.8645205190554102
Pearson -0.004148206212801797
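
The nested `for` loops in the cell above can also be flattened into a single loop over all combinations. The sketch below only enumerates the configurations and does not train anything; the dictionary `search_space` and the helper `iter_configs` are hypothetical names introduced here for illustration, with values copied from the lists in the cell above.

```python
from itertools import product

# Hypothetical search space mirroring the lists used in the Keras cell
search_space = {
    "activation": ["sigmoid", "tanh"],
    "layer_size_1": [80, 100],
    "layer_size_2": [40, 100],
    "optimizer": ["adam", "sgd"],
    "batch_size": [32, 64],
}


def iter_configs(space):
    """Yield one dict per combination of hyper-parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 2 * 2 * 2 * 2 * 2 = 32 combinations
print(configs[0])
```

Each `config` dict could then be passed to a model-building function, which keeps the loop body at a single indentation level.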


Here I test out a Kernel Ridge Regressor. It is similar to SVR, but it uses a squared-error loss instead of the epsilon-insensitive loss.


In [None]:
### Kernel Ridge Regressor ####

# import necessary packages to apply KernelRidge Regressor
from sklearn.kernel_ridge import KernelRidge

# Create list for kernel such that we can perform hyper-parameter search.
for k in ['linear','poly','rbf','sigmoid']:
  # Create Regressor
  kr = KernelRidge(alpha = 0.2, kernel = k)

  # Train the model on the training data
  kr.fit(X_train_de, y_train_de)

  # Predict outcome for the validation/test data
  predictions = kr.predict(X_val_de)

  # Compute and print pearson value for each hyper-parameter selection
  pearson = pearsonr(y_val_de, predictions)
  print('RMSE:', rmse(predictions,y_val_de))
  print(f"Pearson score for KernelRidge Regression: {pearson[0]} for kernel = {k}")


RMSE: 0.8638932073906487
Pearson score for KernelRidge Regression: 0.03584678337960797 for kernel = linear
RMSE: 0.8638899984186302
Pearson score for KernelRidge Regression: 0.03318011280305015 for kernel = poly
RMSE: 0.8638932466690457
Pearson score for KernelRidge Regression: 0.035468538582623894 for kernel = rbf
RMSE: 0.8638994229750283
Pearson score for KernelRidge Regression: 0.0387287287343936 for kernel = sigmoid


The next regressor I test is the Passive Aggressive Regressor from sklearn.

In [None]:
### Passive Aggressive Regressor ###

# import necessary packages for the Passive Aggressive regressor
from sklearn.linear_model import PassiveAggressiveRegressor

# create values for max iterations used for hyper-parameter search
max_iter = [100, 200, 300]

# conduct hyper-parameter search
for value in max_iter:

  # create regressor from library
  clf = PassiveAggressiveRegressor(max_iter = value, random_state = 0)

  # fit the model to the training data
  clf.fit(X_train_de, y_train_de)

  # predict outcomes for validation/test data (stored so the scores below
  # are computed from this model's predictions)
  predictions = clf.predict(X_val_de)

  # Print Pearson score for each hyper-parameter selection
  pearson = pearsonr(y_val_de, predictions)
  print('RMSE:', rmse(predictions,y_val_de))
  print(f"Pearson score for PA Regression: {pearson[0]}, max iteration is {value}")


RMSE: 0.8638994229750283
Pearson score for PA Regression: 0.0387287287343936


I also try the TheilSen regressor from sklearn. 

In [None]:
### TheilSen Regressor ###

# Import necessary packages for the TheilSen Regressor
from sklearn.linear_model import TheilSenRegressor

# create values for max iterations used for hyper-parameter search
max_iter = [100, 200, 300]

# conduct hyper-parameter search
for value in max_iter:

  # create regressor from library
  tsr = TheilSenRegressor(max_iter = value, random_state = 0)

  # fit the model to the training data
  tsr.fit(X_train_de, y_train_de)

  # predict outcomes for validation/test data (stored so the scores below
  # are computed from this model's predictions)
  predictions = tsr.predict(X_val_de)

  pearson = pearsonr(y_val_de, predictions)
  print('RMSE:', rmse(predictions,y_val_de))
  print(f"Pearson score for TSR Regression: {pearson[0]}, max iteration = {value}")


RMSE: 0.8638994229750283
Pearson score for TSR Regression: 0.0387287287343936


Next I test the Gradient Boosting regressor from sklearn.

In [None]:
# import necessary package from sklearn

from sklearn import ensemble

#### Gradient Boosting Regressor ####

# create lists for hyper-parameter search. Change learning rate, 
# max depth and n_estimator value

for lr in [0.001, 0.0005]:
  for max_depth in [1,2]:
    for n_estimator in [200,500]:

      # Put hyper-parameters in parameter dictionary
      params = {'n_estimators': n_estimator, 'max_depth': max_depth, 'min_samples_split': 2,
                'learning_rate': lr, 'loss': 'ls'}

      # Create Gradient Boosting Regressor
      clf = ensemble.GradientBoostingRegressor(**params)

      # Train the model on the training data
      clf.fit(X_train_de, y_train_de)

      # Predict outcomes on validation data
      predictions = clf.predict(X_val_de)

      # Compute and print pearson score for each hyper-parameter evaluation
      pearson = pearsonr(y_val_de, predictions)
      print('RMSE:', rmse(predictions,y_val_de))
      print(f"Pearson Gradient Boosting Regressor: {pearson[0]} with lr = {lr}, max_depth = {max_depth}, n_estimator = {n_estimator}")



RMSE: 0.8639040160750084
Pearson Gradient Boosting Regressor: nan with lr = 0.001, max_depth = 1, n_estimator = 200




RMSE: 0.8639071947381184
Pearson Gradient Boosting Regressor: nan with lr = 0.001, max_depth = 1, n_estimator = 500
RMSE: 0.8637188063364276
Pearson Gradient Boosting Regressor: 0.04102789914741822 with lr = 0.001, max_depth = 2, n_estimator = 200
RMSE: 0.8636124658386475
Pearson Gradient Boosting Regressor: 0.03539659808635272 with lr = 0.001, max_depth = 2, n_estimator = 500




RMSE: 0.8639027377528806
Pearson Gradient Boosting Regressor: nan with lr = 0.0005, max_depth = 1, n_estimator = 200




RMSE: 0.863904610309668
Pearson Gradient Boosting Regressor: nan with lr = 0.0005, max_depth = 1, n_estimator = 500
RMSE: 0.8637630797945611
Pearson Gradient Boosting Regressor: 0.050526866274125365 with lr = 0.0005, max_depth = 2, n_estimator = 200
RMSE: 0.863702224295285
Pearson Gradient Boosting Regressor: 0.03788722301305035 with lr = 0.0005, max_depth = 2, n_estimator = 500
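
A plausible reason for the `nan` Pearson values above is that the model predicted the same value for every validation sentence, in which case the correlation is undefined because the predictions have zero variance. This is an assumption, since the predictions themselves are not shown. The sketch below uses a plain-Python `pearson` helper (hypothetical, not the `pearsonr` used in the notebook) and made-up numbers to show the effect.

```python
import math


def pearson(xs, ys):
    """Plain-Python Pearson correlation; returns nan if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        # Zero variance in one input: the correlation is undefined
        return float("nan")
    return cov / denom


# Made-up labels and predictions, for illustration only
labels = [0.5, -1.0, 1.25, 0.0]
constant_preds = [0.5, 0.5, 0.5, 0.5]
varying_preds = [0.25, -0.75, 1.0, 0.125]

print(pearson(labels, constant_preds))  # nan
print(pearson(labels, varying_preds))
```

Checking `np.std(predictions)` before calling `pearsonr` is one way to confirm whether this is what happened for a given configuration.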


Here I use an MLP regressor from sklearn inside a pipeline that first standardises the features, and I run a hyper-parameter search over it. So far the best results have come from this model, at about 0.772.

In [None]:
# MLP regressor.... using neural networks to predict labels #

# import necessary packages from sklearn library for MLP regressor.
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Create list of scores so we can keep track of best Pearson score
list_of_scores = []

# initialise hyper-parameter search
for hidden_layer_size in [(100),(100,100)]:
  for activation in ['relu','logistic','tanh']:
    for learning_rate in [0.001]:
      for solver in ['adam','sgd']:
        
        # use pipeline to standardise the features and then fit the MLP regressor
        mlp = make_pipeline(StandardScaler(),
                            MLPRegressor(hidden_layer_sizes=hidden_layer_size,
                                         tol=1e-2, max_iter=500, random_state=0,
                                         early_stopping=False,
                                         learning_rate_init=learning_rate,
                                         activation=activation, solver=solver))


        # Train model on the training data
        mlp.fit(X_train_de, y_train_de)

        # Evaluate model on training/validation data
        predictions = mlp.predict(X_val_de)

        # Compute and print Pearson score for each hyper-parameter evaluation
        pearson = pearsonr(y_val_de, predictions)

        list_of_scores.append(pearson[0])

        print('RMSE:', rmse(predictions,y_val_de))
        print(f"mlp Regression Pearson: {pearson[0]} for hidden_layer_size = {hidden_layer_size}, activation = {activation}, learning_rate = {learning_rate}, solver = {solver}")

# Print highest pearson score
highest_pearson = max(list_of_scores)
print(f'highest pearson score is {highest_pearson}')


RMSE: 0.8627738173589881
mlp Regression Pearson: 0.04871467066315298 for hidden_layer_size = 100, activation = relu, learning_rate = 0.001, solver = adam
RMSE: 0.8640966324587362
mlp Regression Pearson: 0.01666724969930131 for hidden_layer_size = 100, activation = relu, learning_rate = 0.001, solver = sgd
RMSE: 0.8636248006571877
mlp Regression Pearson: 0.03147185820197343 for hidden_layer_size = 100, activation = logistic, learning_rate = 0.001, solver = adam
RMSE: 0.8638408517197841
mlp Regression Pearson: 0.023327163080269296 for hidden_layer_size = 100, activation = logistic, learning_rate = 0.001, solver = sgd
RMSE: 0.8660894003533958
mlp Regression Pearson: 0.00520628517702393 for hidden_layer_size = 100, activation = tanh, learning_rate = 0.001, solver = adam
RMSE: 0.8638373858946172
mlp Regression Pearson: 0.01875658845426788 for hidden_layer_size = 100, activation = tanh, learning_rate = 0.001, solver = sgd
RMSE: 0.8650959316659498
mlp Regression Pearson: 0.018673115547737158 

Here is a neural network regressor which I created from scratch using PyTorch. I have included a small hyper-parameter search.

### Writing Results

Here is our function to write the scores into a txt file. We can follow the <Method> <ID> <SCORE> template but having only the scores will work too.

In [None]:
import os

def writeScores(method_name,scores):
    # Write one score per line to predictions.txt. method_name is currently
    # unused because the file contains only the scores.
    fn = "predictions.txt"
    print("")
    with open(fn, 'w') as output_file:
        for x in scores:
            output_file.write(f"{x}\n")
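
For the `<Method> <ID> <SCORE>` template, a variant of the function could look like the sketch below. The function name `write_scores_with_ids`, the space separator and the 0-based row index used as the ID are assumptions; check the submission instructions for the exact format expected.

```python
import os
import tempfile


def write_scores_with_ids(method_name, scores, path="predictions.txt"):
    """Write one '<Method> <ID> <SCORE>' line per prediction.

    Assumes space-separated fields and 0-based row indices as IDs.
    """
    with open(path, "w") as output_file:
        for idx, score in enumerate(scores):
            output_file.write(f"{method_name} {idx} {float(score)}\n")
    return path


# Small demonstration with made-up scores, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.txt")
write_scores_with_ids("demo_method", [0.25, -1.5], path=demo_path)

with open(demo_path) as f:
    demo_lines = f.read().splitlines()
print(demo_lines)  # ['demo_method 0 0.25', 'demo_method 1 -1.5']
```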

In [None]:
#EN-DE

import numpy as np
import torch

# de_test_src = get_embeddings("./test.ende.src",document_embeddings_en,'en')
# de_test_mt = get_embeddings("./test.ende.mt",document_embeddings_de,'de')

de_test_src = get_bert_embeddings("./test.ende.src",'en',pooling_fcn=None)

de_test_mt = get_bert_embeddings("./test.ende.mt",'de',pooling_fcn=None)

# mlp instead of svr

# mlp = make_pipeline(StandardScaler(),
#                             MLPRegressor(hidden_layer_sizes=(100,100),
#                                         tol=1e-2, max_iter=500, random_state=0,early_stopping = False, learning_rate_init = 0.001, activation = 'tanh',solver = 'adam'))
# mlp.fit(X_train_de, y_train_de)

# predictions_de = mlp.predict(X_val_de)


In [None]:
num_samples = len(de_test_src)
num_dims = len(de_test_src[0]) * 2
X_test = torch.zeros((num_samples,num_dims),dtype=torch.float)
for i in range(len(de_test_src)):
  en_vec = de_test_src[i]
  de_vec = de_test_mt[i]
  vec = np.concatenate((en_vec,de_vec))
  vec = torch.tensor(vec,dtype=torch.float).squeeze()
  X_test[i,:] = vec

X_test_de = X_test


#Predict

predictions = test(model, X_test.to(device))







In [None]:
from google.colab import files
from zipfile import ZipFile


writeScores("korbi_bert",predictions.squeeze())

with ZipFile("en-de_kbert.zip","w") as newzip:
    newzip.write("predictions.txt")

files.download('en-de_kbert.zip')
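
Before downloading, it can be worth checking that the archive really contains the predictions file. The sketch below uses only the standard library; the helper name `zip_predictions` and the demo file are hypothetical, and it assumes the file should sit at the top level of the archive.

```python
import os
import tempfile
from zipfile import ZipFile


def zip_predictions(predictions_path, zip_path):
    """Zip a single predictions file and return the names stored in the archive."""
    with ZipFile(zip_path, "w") as archive:
        # arcname keeps only the file name, so no directories end up in the zip
        archive.write(predictions_path, arcname=os.path.basename(predictions_path))
    with ZipFile(zip_path) as archive:
        return archive.namelist()


# Demonstration in a temporary directory with a made-up predictions file
demo_dir = tempfile.mkdtemp()
demo_predictions = os.path.join(demo_dir, "predictions.txt")
with open(demo_predictions, "w") as f:
    f.write("0.25\n-1.5\n")

stored_names = zip_predictions(demo_predictions, os.path.join(demo_dir, "demo.zip"))
print(stored_names)  # ['predictions.txt']
```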


