# Using BERT on the Germeval Task 2017

## Subtask A) Relevance Classification

_This code provides the minimal functionality needed to set up the training of a binary classification task using the __simpletransformers__ module._

# 1. Setup

Add a GPU by going to the menu and:

`Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)`

In [None]:
import tensorflow as tf

# GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found it at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

In order for torch to use the GPU, we need to identify and specify the GPU as the device.

In [None]:
import torch

if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
!nvidia-smi

## 1.2. Install the [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) module along with [apex](https://github.com/NVIDIA/apex)


In [None]:
!pip install simpletransformers

In [None]:
%%writefile setup.sh

git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

In [None]:
!sh setup.sh

# 2. Load the data

Go to the [Germeval 2017 homepage](https://sites.google.com/view/germeval2017-absa/data) and download the _train.tsv_ and _dev.tsv_ data sets (the code below expects them under the names _train_v1.4.tsv_ and _dev_v1.4.tsv_).

We will split _train.tsv_ into a training set and a validation set, while _dev.tsv_ serves as a held-out test set.

## 2.1. Upload to Colab

Run the following cell to open a dialog where you can upload the files to Colab.

In [None]:
from google.colab import files
uploaded = files.upload()

## 2.2. Preparation

We'll use pandas to prepare the training set and look at a few of its properties.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./train_v1.4.tsv", 
                 delimiter = "\t", 
                 header = None, 
                 names = ["id", "text", "relevance", "sentiment", "aspect:polarity"])
df["labels"] = np.where(df["relevance"] == True, 1, 0)

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

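# The data set has some missing values which have to be removed.
# Check "relevance" rather than "labels": np.where maps a missing
# relevance to 0, so "labels" itself never contains NAs.
df = df.dropna(axis = 0, subset = ["text", "relevance"])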
print('Number of training sentences (after removing NAs): {:,}\n'.format(df.shape[0]))

# Display random rows from the data.
df.sample(3)

Keep only the _text_ and _labels_ columns

In [None]:
df = df[["text","labels"]]
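
Before splitting, it can be worth checking how balanced the two classes are. The following is a minimal sketch on a toy data frame; `toy_df` and its rows are invented for illustration and are not GermEval data.

```python
import pandas as pd

# Toy stand-in for the prepared training data (invented rows).
toy_df = pd.DataFrame({
    "text": ["Zug war pünktlich", "Schönes Wetter heute",
             "Bahn hat Verspätung", "Ticket zu teuer"],
    "labels": [1, 0, 1, 1],
})

# Relative frequency of each class, ordered by label value.
class_share = toy_df["labels"].value_counts(normalize=True).sort_index()
print(class_share)
```

Running the same two lines on `df` shows whether one class dominates, which helps to put the evaluation scores into perspective.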

In [None]:
from sklearn.model_selection import train_test_split

train, validation = train_test_split(df, test_size = 0.1, random_state = 2020)
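
If the classes turn out to be imbalanced, a stratified split keeps the class proportions the same in both parts. This is an optional variation on the split above, sketched on invented toy data; `toy_df`, `toy_train` and `toy_val` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 16 relevant and 4 irrelevant documents (invented).
toy_df = pd.DataFrame({
    "text": ["doc {}".format(i) for i in range(20)],
    "labels": [1] * 16 + [0] * 4,
})

# stratify= makes both parts keep the 80/20 class ratio of the full data.
toy_train, toy_val = train_test_split(
    toy_df, test_size=0.25, random_state=2020, stratify=toy_df["labels"]
)

print("share of label 1 in train:", toy_train["labels"].mean())
print("share of label 1 in validation:", toy_val["labels"].mean())
```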

Set up the logger

In [None]:
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

Set up the model  
_For all arguments & their defaults, see [documentation](https://github.com/ThilinaRajapakse/simpletransformers#default-settings)_

In [None]:
from simpletransformers.classification import ClassificationModel

# Create a ClassificationModel
model = ClassificationModel("bert", "bert-base-german-cased", num_labels = 2, 
                            args={"overwrite_output_dir": True,
                                  "max_seq_length": 128,
                                  "train_batch_size": 128,
                                  "eval_batch_size": 128,
                                  "evaluate_during_training": True,
                                  "evaluate_during_training_steps": 100,
                                  "evaluate_during_training_verbose": False,
                                  "num_train_epochs": 4,
                                  "gradient_accumulation_steps": 1,
                                  "learning_rate": 4e-5,
                                  "adam_epsilon": 1e-8,
                                  "warmup_ratio": 0.06,
                                  "manual_seed": 2020,
                                  "save_eval_checkpoints": False}, 
                            use_cuda = True)
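
To get a feeling for how the batch size, the number of epochs and the warmup ratio interact, the rough length of the training schedule can be worked out by hand. The sketch below is plain arithmetic, not the library's own code, so the exact rounding inside simpletransformers may differ; `schedule_summary` is a hypothetical helper and 18,000 is a made-up training set size.

```python
import math

def schedule_summary(n_examples, batch_size, epochs,
                     grad_accum=1, warmup_ratio=0.06):
    """Approximate number of optimizer updates and warmup updates."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = batches_per_epoch // grad_accum
    total_updates = updates_per_epoch * epochs
    warmup_updates = math.ceil(total_updates * warmup_ratio)
    return total_updates, warmup_updates

# Made-up data set size, combined with the settings used above.
total, warmup = schedule_summary(18_000, batch_size=128, epochs=4)
print("optimizer updates:", total)
print("warmup updates:", warmup)

# With an evaluation every 100 steps, this many evaluations fit into training.
print("evaluations during training:", total // 100)
```

With a batch size as large as 128, the whole run consists of only a few hundred updates, so an evaluation interval of 100 steps yields just a handful of intermediate evaluations.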

Train the model

In [None]:
model.train_model(train_df = train, eval_df = validation)

Evaluate the results

In [None]:
# Load the dataset into a pandas dataframe.
test = pd.read_csv("./dev_v1.4.tsv", 
                   delimiter = "\t", 
                   header = None, 
                   names = ["id", "text", "relevance", "sentiment", "aspect:polarity"])
test["labels"] = np.where(test["relevance"] == True, 1, 0)

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(test.shape[0]))

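# The data set has some missing values which have to be removed.
# Check "relevance" rather than "labels": np.where maps a missing
# relevance to 0, so "labels" itself never contains NAs.
test = test.dropna(axis = 0, subset = ["text", "relevance"])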

test = test[["text","labels"]]

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(eval_df = test)

In [None]:
result
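
Further metrics such as precision, recall and F1 can be derived from the raw outputs. The sketch below assumes that `model_outputs` is an array with one row per example and one score per class; it uses invented toy values (`toy_outputs`, `toy_labels`) in place of real model output.

```python
import numpy as np

# Invented stand-ins for model_outputs and the true labels.
toy_outputs = np.array([[2.0, -1.0],
                        [0.3, 1.2],
                        [-0.5, 0.9],
                        [1.5, 0.2],
                        [-1.0, 2.0]])
toy_labels = np.array([0, 1, 0, 1, 1])

# The predicted class is the column with the highest score.
preds = toy_outputs.argmax(axis=1)

# Confusion counts for the positive class (label 1).
tp = int(((preds == 1) & (toy_labels == 1)).sum())
fp = int(((preds == 1) & (toy_labels == 0)).sum())
fn = int(((preds == 0) & (toy_labels == 1)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = float((preds == toy_labels).mean())

print("precision:", precision, "recall:", recall, "f1:", f1, "accuracy:", accuracy)
```

On the real data, `toy_outputs` would be replaced by `model_outputs` and `toy_labels` by `test["labels"].to_numpy()`.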

## Task A)  
Experiment with different parameters such as the maximum sequence length or the batch size.
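
One way to organise such experiments is to enumerate the parameter combinations up front. The sketch below only builds the argument dictionaries; the grid values and the names `base_args`, `grid` and `configs` are illustrative choices, not recommendations.

```python
from itertools import product

# Settings shared by all runs (taken from the model set-up above).
base_args = {"overwrite_output_dir": True,
             "num_train_epochs": 4,
             "manual_seed": 2020}

# Illustrative values to try for each parameter.
grid = {"max_seq_length": [64, 128],
        "train_batch_size": [32, 128]}

# One args dictionary per combination of grid values.
configs = [
    {**base_args, **dict(zip(grid.keys(), values))}
    for values in product(*grid.values())
]

for cfg in configs:
    # Each cfg could be passed as args= when creating a ClassificationModel.
    print(cfg["max_seq_length"], cfg["train_batch_size"])
```

Training a fresh model for each entry and comparing the results on the validation set (not on the held-out test set) keeps the final evaluation unbiased.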

## Task B)  
Try multi-class classification by using the _sentiment_ variable from the data set.
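
As a starting point, the sentiment strings have to be turned into integer labels. The following is a minimal sketch on invented toy rows, assuming the _sentiment_ column holds category names such as "positive", "neutral" and "negative"; `toy_df` and `label_map` are hypothetical names.

```python
import pandas as pd

# Invented rows standing in for the loaded data set.
toy_df = pd.DataFrame({
    "text": ["Zug fiel aus", "Fahrplan geändert",
             "Sehr netter Schaffner", "Neue Strecke eröffnet"],
    "sentiment": ["negative", "neutral", "positive", "neutral"],
})

# Assign one integer per category, in alphabetical order.
classes = sorted(toy_df["sentiment"].dropna().unique())
label_map = {name: idx for idx, name in enumerate(classes)}
toy_df["labels"] = toy_df["sentiment"].map(label_map)

print(label_map)
print(toy_df[["text", "labels"]])

# The model would then be created with num_labels = len(label_map)
# instead of num_labels = 2.
```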