<a href="https://colab.research.google.com/github/crux82/mt-ganbert/blob/main/0_BERT_based_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT-based model in PyTorch
This notebook shows how to train a simple model without multi-task or generative adversarial learning. We train a single transformer, the Italian BERT-base model UmBERTo (https://github.com/musixmatchresearch/umberto), on one of the six tasks considered in our work, which are used for the recognition of abusive linguistic behaviors. The tasks are:

1.   HaSpeeDe: Hate Speech Recognition
2.   AMI A: Automatic Misogyny Identification (misogyny, not misogyny)
3.   AMI B: Automatic Misogyny Identification (misogyny_category: stereotype, sexual_harassment, discredit)
4.   DANKMEMEs: Hate Speech Recognition in meme sentences
5.   SENTIPOLC 1: Sentiment Polarity Classification (objective, subjective)
6.   SENTIPOLC 2: Sentiment Polarity Classification (polarity: positive, negative, neutral)

## Setup environment

In [None]:
#--------------------------------
#  Retrieve the github directory
#--------------------------------
!git clone https://github.com/crux82/mt-ganbert
%cd mt-ganbert/mttransformer/

# Install the required packages
!pip install -r requirements.txt
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install ekphrasis

## Import

In [None]:
from google.colab import drive
import pandas as pd
import csv
from sklearn.model_selection import train_test_split
import numpy as np
import random
import tensorflow as tf
import torch

# Get the GPU device name.
device_name = tf.test.gpu_device_name()
# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

## Run training

For each dataset, a dedicated script ("script_tsv.py") creates four files:

1.   taskName_task_def.yml, the configuration file of the task
2.   taskName_train.tsv, the training set of the task in .tsv format
3.   taskName_test.tsv, the test set of the task in .tsv format
4.   taskName_dev.tsv, the development set of the task in .tsv format
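
As a quick illustration of this naming scheme, the following minimal sketch lists the four file names expected for a task. The helper name `expected_files` is hypothetical and not part of the repository; "haspeede-TW" is the task name used later in this notebook.

```python
def expected_files(task_name):
    """Return the four file names generated for one task (hypothetical helper)."""
    return [
        f"{task_name}_task_def.yml",  # task configuration
        f"{task_name}_train.tsv",     # training set
        f"{task_name}_test.tsv",      # test set
        f"{task_name}_dev.tsv",       # development set
    ]


for name in expected_files("haspeede-TW"):
    print(name)
```

A list like this can be used to check that all four files are in place before moving on to the preprocessing step.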


The training set can consist of:

*   The full training dataset
*   100 examples of the original training dataset
*   200 examples of the original training dataset
*   500 examples of the original training dataset


The location of the .tsv files and of the config file of each task depends on the training-set cut you want to use. The paths are:

*   data/0/taskName_file (full training set)
*   data/100/no_gan/taskName_file (100 examples)
*   data/200/no_gan/taskName_file (200 examples)
*   data/500/no_gan/taskName_file (500 examples)

"no_gan" means that the BERT-based model is used, without generative adversarial learning.
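
The mapping from training-set cut to folder can be sketched as below. This is only an illustration under the layout listed above; the helper names `task_data_dir` and `task_def_path` are hypothetical and do not exist in the repository.

```python
from pathlib import PurePosixPath


def task_data_dir(n_examples=None):
    """Folder with the .tsv and config files for a training-set cut.

    n_examples=None stands for the full training set (data/0);
    100, 200 and 500 map to data/<n>/no_gan.
    """
    if n_examples is None:
        return PurePosixPath("data/0")
    if n_examples not in (100, 200, 500):
        raise ValueError("n_examples must be None, 100, 200 or 500")
    return PurePosixPath("data") / str(n_examples) / "no_gan"


def task_def_path(task_name, n_examples=None):
    """Path of the task_def file for a task and a training-set cut."""
    return task_data_dir(n_examples) / f"{task_name}_task_def.yml"


print(task_def_path("haspeede-TW"))       # full training set
print(task_def_path("haspeede-TW", 100))  # 100-example cut
```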

###**Tokenization and Conversion to JSON**

The training code reads tokenized data in JSON format, so "prepro_std.py" (a modified version of the script from https://github.com/namisan/mt-dnn) is used to tokenize the data in the .tsv files and convert it to JSON.

The args used in the script invocation are:

* --model: the model used to tokenize input sentences
* --root_dir: the folder from which to get the .tsv files
* --task_def: the task_def file of the task, which contains useful information for converting to .json files

The script is run on one task at a time.
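
If you need to preprocess several tasks or training-set cuts, the invocation can be assembled programmatically. The sketch below only builds the argument list from the three flags described above; the helper name `build_prepro_command` is hypothetical, and it assumes the task_def file sits directly inside the folder passed as --root_dir.

```python
import shlex


def build_prepro_command(task_name, root_dir,
                         model="Musixmatch/umberto-commoncrawl-cased-v1"):
    """Build the prepro_std.py argument list for one task (hypothetical helper)."""
    root = root_dir.rstrip("/")
    return [
        "python", "prepro_std.py",
        "--model", model,
        "--root_dir", root + "/",
        # Assumption: the task_def file lives directly in root_dir.
        "--task_def", f"{root}/{task_name}_task_def.yml",
    ]


cmd = build_prepro_command("haspeede-TW", "data/0/")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the folder that contains "prepro_std.py".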

In [None]:
# Edit --root_dir and --task_def depending on the task and the train set
!python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir data/0/ --task_def data/0/haspeede-TW_task_def.yml

###**Onboard your task into training!**

Training is run with the script "train.py" (a modified version of the script from https://github.com/namisan/mt-dnn).
The args used in the script invocation are:


*   --encoder_type: the transformer used to encode the sentences. Here it is set to "9", which corresponds to UmBERTo
*   --epochs: the number of training epochs
*   --task_def: the task_def file of the task
*   --data_dir: the folder from which to get the .json files
*   --init_checkpoint: the name of the transformer to be loaded, in this case "Musixmatch/umberto-commoncrawl-cased-v1"
*   --max_seq_len: the maximum length of a sequence that the BERT model can handle
*   --batch_size: the number of training examples in one forward/backward pass
*   --batch_size_eval: the batch size used for validation and test
*   --optimizer: the name of the optimizer to use
*   --train_datasets: the name of the task, i.e. the train file name without its "_train" suffix and file extension (e.g. haspeede-TW)
*   --test_datasets: the name of the task, i.e. the test file name without its "_test" suffix and file extension (e.g. haspeede-TW)
*   --learning_rate: the learning rate to use

The script is run on one task at a time.
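
To switch task or try different hyperparameters, it can help to keep the arguments in one place. The following sketch builds the "train.py" argument list from the flags listed above, with defaults copied from the invocation used in this notebook. The helper name `build_train_command` is hypothetical, and the "musixmatch_cased" subfolder for the reduced training sets is an assumption that should be checked against the output of the preprocessing step.

```python
import shlex


def build_train_command(task_name, data_root="data/0", **overrides):
    """Build the train.py argument list for one task (hypothetical helper)."""
    args = {
        "encoder_type": 9,  # 9 corresponds to UmBERTo
        "epochs": 10,
        "task_def": f"{data_root}/{task_name}_task_def.yml",
        # Assumption: the .json files are in a "musixmatch_cased" subfolder.
        "data_dir": f"{data_root}/musixmatch_cased/",
        "init_checkpoint": "Musixmatch/umberto-commoncrawl-cased-v1",
        "max_seq_len": 128,
        "batch_size": 16,
        "batch_size_eval": 16,
        "optimizer": "adamW",
        "train_datasets": task_name,
        "test_datasets": task_name,
        "learning_rate": "5e-5",
    }
    args.update(overrides)  # e.g. epochs=3 or learning_rate="2e-5"
    cmd = ["python", "train.py"]
    for name, value in args.items():
        cmd += [f"--{name}", str(value)]
    return cmd


print(shlex.join(build_train_command("haspeede-TW", epochs=3)))
```

Keyword overrides replace the defaults without changing the order of the flags, so the printed command stays easy to compare with the one in the cell that follows.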

In [None]:
# Edit --task_def, --data_dir, --train_datasets and --test_datasets depending on the task and the train set
!python train.py --encoder_type 9 --epochs 10 --task_def data/0/haspeede-TW_task_def.yml --data_dir data/0/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets haspeede-TW --test_datasets haspeede-TW --learning_rate "5e-5"