# Using BERT on the Germeval Task 2017

## Subtask A) Relevance Classification

_This code provides the minimal functionality for setting up the training of a binary classification task using the __simpletransformers__ module._

# 1. Setup

Add a GPU by selecting the following in the menu:

`Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)`

In [1]:
import tensorflow as tf

# GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found it at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found it at: /device:GPU:0


In order for torch to use the GPU, we need to identify and specify the GPU as the device.

In [2]:
import torch

if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

GPU: Tesla P100-PCIE-16GB


In [3]:
!nvidia-smi

Sun Apr 19 13:38:36 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    33W / 250W |    353MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

## 1.2. Install the [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) module along with [apex](https://github.com/NVIDIA/apex)


In [4]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading https://files.pythonhosted.org/packages/a2/cd/184543483da9b6a5d23a6eb21ec2d0575716b12dcdfbdf2d87c6871bc31e/simpletransformers-0.24.8-py3-none-any.whl (151kB)
     |████████████████████████████████| 153kB 8.7MB/s 
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)
     |████████████████████████████████| 573kB 17.1MB/s 
Collecting tensorboardx
  Downloading https://files.pythonhosted.org/packages/35/f1/5843425495765c8c2dd0784a851a93ef204d314fc87bcc2bbb9f662a3ad1/tensorboardX-2.0-py2.py3-none-any.whl (195kB)
     |████████████████████████████████| 204kB 24.7MB/s 
Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/34/91/068aca8d60ce56dd9ba4506850e876aba5e66a6f2f29aa223224b50df0de/seqeval-0.0.12.tar.gz
Collecting tokenizers
  Downloading https://files.

In [5]:
%%writefile setup.sh

git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [None]:
!sh setup.sh

Cloning into 'apex'...
remote: Enumerating objects: 178, done.
remote: Counting objects: 100% (178/178), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 6349 (delta 127), reused 103 (delta 66), pack-reused 6171
Receiving objects: 100% (6349/6349), 13.68 MiB | 13.90 MiB/s, done.
Resolving deltas: 100% (4179/4179), done.
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-xp54324n
Created temporary directory: /tmp/pip-req-tracker-2hvupfo2
Created requirements tracker '/tmp/pip-req-tracker-2hvupfo2'
Created temporary directory: /tmp/pip-install-v00ezfq1
Processing ./apex
  Created temporary directory: /tmp/pip-req-build-2rivs1ho
  Added file:///content/apex to build tracker '/tmp/pip-req-tracker-2hvupfo2'
    Running setup.py (path:/tmp/pip-req-build-2rivs1ho/setup.py) egg_info for package from file:///content/apex
    Running command python setup.py egg_info
    torch.__version__  =  1.4.0
    running

# 2. Load the data

Go to the [Germeval 2017 homepage](https://sites.google.com/view/germeval2017-absa/data) and download the _train.tsv_ and _dev.tsv_ data sets (named _train_v1.4.tsv_ and _dev_v1.4.tsv_ in the cells below).

We will split _train.tsv_ into a training set and a validation set, while _dev.tsv_ serves as a held-out test set.

## 2.1. Upload to colab

Use the following command to open a window where you can upload the files to colab.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving dev_v1.4.tsv to dev_v1.4.tsv
Saving train_v1.4.tsv to train_v1.4.tsv


## 2.2. Preparation

We'll use pandas to prepare the training set and look at a few of its properties.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./train_v1.4.tsv", 
                 delimiter = "\t", 
                 header = None, 
                 names = ["id", "text", "relevance", "sentiment", "aspect:polarity"])
df["labels"] = np.where(df["relevance"] == True, 1, 0)

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# The data set has some missing values which have to be removed
df = df.dropna(axis = 0, subset = ["text", "labels"])
print('Number of training sentences (after removing NAs): {:,}\n'.format(df.shape[0]))

# Display random rows from the data.
df.sample(3)

Number of training sentences: 20,941

Number of training sentences (after removing NAs): 20,859



|  | id | text | relevance | sentiment | aspect:polarity | labels |
|---|---|---|---|---|---|---|
| 9695 | http://t.neuepresse.de/Nachrichten/Niedersachs... | Verkehr – Bericht: Bahn vor Jahren zu Strecken... | True | neutral | Zugfahrt#Streckennetz:neutral | 1 |
| 8369 | http://www.aida-weblounge.de/weblounge/bilder/... | Wo verkehrt diese Bahn ? Bin mal gespannt was ... | True | neutral | Allgemein#Haupt:neutral | 1 |
| 10606 | http://twitter.com/ziamtrash\_/statuses/671730... | @tesssss_a Meh. Ein weiterer Beweis dafür das ... | True | negative | Allgemein#Haupt:negative | 1 |


Keep only the _text_ and _labels_ columns; the model does not need the others.

In [None]:
df = df[["text","labels"]]

In [None]:
from sklearn.model_selection import train_test_split

train, validation = train_test_split(df, test_size = 0.1, random_state = 2020)

Set up the logger

In [None]:
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

Set up the model  
_For all arguments & their defaults, see [documentation](https://github.com/ThilinaRajapakse/simpletransformers#default-settings)_

In [None]:
from simpletransformers.classification import ClassificationModel

# Create a ClassificationModel
model = ClassificationModel("bert", "bert-base-german-cased", num_labels = 2, 
                            args={"overwrite_output_dir": True,
                                  "max_seq_length": 128,
                                  "train_batch_size": 128,
                                  "eval_batch_size": 128,
                                  "evaluate_during_training": True,
                                  "evaluate_during_training_steps": 100,
                                  "evaluate_during_training_verbose": False,
                                  "num_train_epochs": 4,
                                  "gradient_accumulation_steps": 1,
                                  "learning_rate": 4e-5,
                                  "adam_epsilon": 1e-8,
                                  "warmup_ratio": 0.06,
                                  "manual_seed": 2020,
                                  "save_eval_checkpoints": False}, 
                            use_cuda = True)
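
The batch size, epoch count, and warmup ratio together determine how many optimizer steps a run takes. The sketch below works through that arithmetic for the settings above. It assumes 18,773 training rows (what remains of the 20,859 cleaned rows after the 90/10 split); the variable names are illustrative, and the exact rounding of warmup steps inside simpletransformers may differ.

```python
import math

n_train = 18773                    # assumed: rows left for training after the 90/10 split
train_batch_size = 128
gradient_accumulation_steps = 1
num_train_epochs = 4
warmup_ratio = 0.06

# One optimizer step consumes batch_size * accumulation_steps examples.
batches_per_epoch = math.ceil(n_train / train_batch_size)
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

# Approximate number of steps during which the learning rate ramps up.
warmup_steps = round(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

Halving `train_batch_size` roughly doubles `total_steps`, so `evaluate_during_training_steps` may need adjusting whenever the batch size changes.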

Train the model

In [None]:
model.train_model(train_df = train, eval_df = validation)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(IntProgress(value=0, max=18773), HTML(value='')))


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


HBox(children=(IntProgress(value=0, description='Epoch', max=4, style=ProgressStyle(description_width='initial…

HBox(children=(IntProgress(value=0, description='Current iteration', max=147, style=ProgressStyle(description_…

Running loss: 0.729481



Running loss: 0.210101



Running loss: 0.295630


HBox(children=(IntProgress(value=0, description='Current iteration', max=147, style=ProgressStyle(description_…

Running loss: 0.072098


HBox(children=(IntProgress(value=0, description='Current iteration', max=147, style=ProgressStyle(description_…

Running loss: 0.117641


HBox(children=(IntProgress(value=0, description='Current iteration', max=147, style=ProgressStyle(description_…

Running loss: 0.004343



INFO:simpletransformers.classification.classification_model: Training of bert model complete. Saved to outputs/.


Evaluate the results

In [None]:
# Load the held-out test set into a pandas dataframe.
test = pd.read_csv("./dev_v1.4.tsv", 
                   delimiter = "\t", 
                   header = None, 
                   names = ["id", "text", "relevance", "sentiment", "aspect:polarity"])
test["labels"] = np.where(test["relevance"] == True, 1, 0)

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(test.shape[0]))

# The data set has some missing values which have to be removed
test = test.dropna(axis = 0, subset = ["text", "labels"])

test = test[["text","labels"]]

Number of test sentences: 2,584



In [None]:
result, model_outputs, wrong_predictions = model.eval_model(eval_df = test)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(IntProgress(value=0, max=2571), HTML(value='')))




HBox(children=(IntProgress(value=0, max=21), HTML(value='')))

INFO:simpletransformers.classification.classification_model:{'mcc': 0.8103766675686843, 'tp': 1969, 'tn': 444, 'fp': 78, 'fn': 80, 'eval_loss': 0.22829291969537735}





In [None]:
result

{'eval_loss': 0.22829291969537735,
 'fn': 80,
 'fp': 78,
 'mcc': 0.8103766675686843,
 'tn': 444,
 'tp': 1969}
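
The `result` dictionary reports only the confusion-matrix counts, the loss, and the Matthews correlation coefficient (MCC). As a minimal sketch, the more familiar metrics can be derived by hand from those counts. The numbers are copied from the output above; the majority-class baseline is an added point of comparison, not part of the original evaluation.

```python
import math

# Confusion-matrix counts copied from the result dictionary above.
tp, tn, fp, fn = 1969, 444, 78, 80
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient, recomputed from the counts.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Accuracy of always predicting the majority class ("relevant").
majority_baseline = (tp + fn) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f} "
      f"baseline={majority_baseline:.4f}")
```

Because about four in five test documents are relevant, accuracy alone flatters the model: roughly 0.94 against a baseline of roughly 0.80. The MCC accounts for this imbalance, which is presumably why the library reports it by default.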

## Task A)  
Play around with different parameters like maximum sequence length, batch size, etc.
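
One way to organise such experiments is to generate one argument dictionary per parameter combination. The following is only a sketch: the grid values are hypothetical, `base_args` repeats a few of the settings used earlier, and the actual training call is left out.

```python
from itertools import product

# Settings shared by every run (taken from the model cell above).
base_args = {
    "overwrite_output_dir": True,
    "num_train_epochs": 4,
    "learning_rate": 4e-5,
    "manual_seed": 2020,
}

# Hypothetical values to compare; adjust to the available GPU memory.
grid = {
    "max_seq_length": [64, 128, 256],
    "train_batch_size": [32, 64, 128],
}

# Build one args dictionary per combination of grid values.
keys = list(grid)
configs = [
    {**base_args, **dict(zip(keys, values))}
    for values in product(*(grid[k] for k in keys))
]

for args in configs:
    print(args["max_seq_length"], args["train_batch_size"])
    # Each `args` could be passed to ClassificationModel(..., args=args)
    # and trained as above; that step is omitted here.
```

Longer sequences and larger batches both increase GPU memory use, so the largest combinations may not fit on every accelerator.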

## Task B)  
Try multi-class classification by using the _sentiment_ variable from the data set.
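
As a possible starting point, the sentiment strings first need to become integer labels counting up from 0. The sketch below assumes three classes: the sample rows show "neutral" and "negative", and "positive" is assumed to be the third. `SENTIMENT_TO_LABEL` and `encode_sentiment` are illustrative names, not part of simpletransformers.

```python
# Assumed label set: the sample above shows "neutral" and "negative";
# "positive" is assumed to be the third class.
SENTIMENT_TO_LABEL = {"negative": 0, "neutral": 1, "positive": 2}


def encode_sentiment(sentiment):
    """Return the integer label for a sentiment string, or None if unknown."""
    if not isinstance(sentiment, str):
        return None  # e.g. missing values read as NaN
    return SENTIMENT_TO_LABEL.get(sentiment.strip().lower())


examples = ["neutral", "negative", "positive", float("nan")]
labels = [encode_sentiment(s) for s in examples]
print(labels)

# With pandas this would correspond to something like:
#   df["labels"] = df["sentiment"].map(encode_sentiment)
#   df = df.dropna(subset=["text", "labels"])
#   df["labels"] = df["labels"].astype(int)
# and the model would be created with num_labels=len(SENTIMENT_TO_LABEL).
```

Rows whose sentiment cannot be mapped end up without a label and are dropped, mirroring how missing values are handled in the binary task.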