<a href="https://colab.research.google.com/github/crux82/mt-ganbert/blob/master/MT_GANBERT_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MT-GANBERT model in PyTorch
This notebook shows how to run the MT-GANBERT model (with UmBERTo) on the example tasks included in the git repository, which target the recognition of abusive linguistic behaviors.

The tasks are:


1.   HaSpeeDe: Hate Speech Recognition
2.   AMI A: Automatic Misogyny Identification (misogyny, not misogyny)
3.   AMI B: Automatic Misogyny Identification (misogyny_category: stereotype, sexual_harassment, discredit)
4.   DANKMEMEs: Hate Speech Recognition in MEME sentences
5.   SENTIPOLC 1: Sentiment Polarity Classification (objective, subjective)
6.   SENTIPOLC 2: Sentiment Polarity Classification (polarity: positive, negative, neutral)

In [None]:
# Edit the repository URL as needed; do not embed credentials in it.
!git clone --branch master https://github.com/crux82/mt-ganbert.git
%cd mt-ganbert/mttransformer/

!pip install -r requirements.txt
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install ekphrasis

Cloning into 'mt-ganbert'...
remote: Enumerating objects: 245, done.
remote: Counting objects: 100% (245/245), done.
remote: Compressing objects: 100% (164/164), done.
remote: Total 245 (delta 84), reused 235 (delta 78), pack-reused 0
Receiving objects: 100% (245/245), 4.96 MiB | 15.12 MiB/s, done.
Resolving deltas: 100% (84/84), done.
/content/mt-ganbert/mttransformer/mt-ganbert/mttransformer
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/base_command.py", line 180, in _main
    status = self.run(options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/req_command.py", line 199, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/commands/install.py", line 319, in run
    reqs, check_supported_wheels=not options.target_dir
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 128, in resolve
   

In [None]:
from google.colab import drive
import pandas as pd
import csv
from sklearn.model_selection import train_test_split
import numpy as np
import random
import tensorflow as tf
import torch

# Get the GPU device name.
device_name = tf.test.gpu_device_name()
# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

Found GPU at: /device:GPU:0
There are 1 GPU(s) available.
We will use the GPU: Tesla K80


For each dataset, a dedicated script (script_MT-model.py) creates 4 files:

1.   taskName_task_def.yml, the config file for the task
2.   taskName_train.tsv, the train set of the task in TSV format
3.   taskName_test.tsv, the test set of the task in TSV format
4.   taskName_dev.tsv, the dev set of the task in TSV format


The train set can consist of:

*   The whole train dataset
*   100 examples from the original train dataset
*   200 examples from the original train dataset
*   500 examples from the original train dataset
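
As a rough illustration of how such a reduced train set could be cut, the sketch below samples `k` rows without replacement and keeps the rest aside. The helper name `split_labeled_subset`, the fixed seed, and the uniform sampling are assumptions made for this example; script_MT-model.py may select examples differently.

```python
import random

def split_labeled_subset(rows, k=None, seed=42):
    """Return (subset, remainder) for a list of training rows.

    k=None keeps every row; otherwise k rows are drawn uniformly at random
    (hypothetical policy, not necessarily what script_MT-model.py does).
    """
    if k is None or k >= len(rows):
        return list(rows), []
    rng = random.Random(seed)                      # fixed seed for reproducibility
    chosen = set(rng.sample(range(len(rows)), k))  # indices of the sampled rows
    subset = [r for i, r in enumerate(rows) if i in chosen]
    remainder = [r for i, r in enumerate(rows) if i not in chosen]
    return subset, remainder

# Toy rows standing in for (id, text, label) entries of a train .tsv file.
rows = [(i, f"text {i}", i % 2) for i in range(2400)]
subset, remainder = split_labeled_subset(rows, 100)
print(len(subset), len(remainder))  # prints: 100 2300
```

Keeping the original row order inside the subset makes the cut easy to compare against the full file.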


The .tsv files and the config file of each task are stored under a path that depends on how many train examples you want to use:

*   data/0/taskName_file
*   data/100/no_gan/taskName_file or data/100/gan/taskName_file
*   data/200/no_gan/taskName_file or data/200/gan/taskName_file
*   data/500/no_gan/taskName_file or data/500/gan/taskName_file

Use no_gan if the model you want to use is BERT-based, and gan if it is GANBERT.
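
The path layout above can be captured in a small helper, sketched here. The function names `data_root` and `task_def_path` are hypothetical. Two points are assumptions: that `data/0` holds the whole train set, and that the multi-task config file name is the task names joined with underscores, which is inferred from the file names used in this notebook.

```python
from pathlib import PurePosixPath

def data_root(train_size, gan):
    """Folder with the .tsv and config files for a given train-set cut.

    train_size 0 is assumed to mean the whole train set (data/0);
    100, 200 and 500 add a gan/ or no_gan/ level.
    """
    if train_size not in (0, 100, 200, 500):
        raise ValueError(f"unsupported train size: {train_size}")
    root = PurePosixPath("data") / str(train_size)
    if train_size == 0:
        return root
    return root / ("gan" if gan else "no_gan")

def task_def_path(train_size, gan, tasks):
    """Config file path for one or more tasks (naming pattern is inferred)."""
    return data_root(train_size, gan) / ("_".join(tasks) + "_task_def.yml")

print(data_root(100, gan=False))                  # data/100/no_gan
print(task_def_path(0, True, ["haspeede-TW"]))    # data/0/haspeede-TW_task_def.yml
```

The returned values can be passed to the --root_dir and --task_def options of the commands that follow.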






**Tokenization and Convert to Json**

The training code reads tokenized data in JSON format. Use prepro_std.py to tokenize your data and convert it into JSON format.

In [None]:
#edit --root_dir and --task_def depending on the task and train set
!python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir data/"0"/ --task_def data/0/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Word statistics files not found!
Downloading... done!
Unpacking... done!
Reading english - 1grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/english/counts_1grams.txt
Reading english - 2grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/english/counts_2grams.txt
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
Downloading: 100% 794k/794k [00:00<00:00, 6.73MB/s]
Downloading: 100% 1.68M/1.68M [00:00<00:00, 10.0MB/s]
09/17/2021 05:16:50 Task haspeede-TW
09/17/2021 05:16:50 data/0/musixmatch_cased/haspeede-TW_train.json
09/17/2021 05:16:50 data/0/musixmatch_cased/haspeede-TW_dev.json
09/17/2021 05:16:51 data/0/musixmatch_cased/haspeede-TW_test.json
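
To check what the preprocessing step produced, the generated files can be inspected with a few lines of Python. This is only a sketch: it assumes each file stores one JSON object per line, and the helper name `peek_json_lines` is made up for this example. The field names are printed rather than assumed.

```python
import json
import os

def peek_json_lines(path, n=2):
    """Parse the first n non-empty lines of a file as JSON objects."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
            if len(records) == n:
                break
    return records

# Path taken from the log above; skipped if the file has not been generated.
path = "data/0/musixmatch_cased/haspeede-TW_train.json"
if os.path.exists(path):
    for record in peek_json_lines(path):
        print(sorted(record.keys()))
```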

**Onboard your task into training!**

--encoder_type 9: selects which BERT-style encoder is used to encode the sentences. Here, 9 corresponds to UmBERTo.

In [None]:
#edit --task_def, --data_dir, --train_datasets and --test_datasets depending on the task and train set
!python train.py --gan --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def data/0/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml --data_dir data/0/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets haspeede-TW,AMI2018A,AMI2018B,DANKMEMES2020,SENTIPOLC20161,SENTIPOLC20162 --test_datasets haspeede-TW,AMI2018A,AMI2018B,DANKMEMES2020,SENTIPOLC20161,SENTIPOLC20162 --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
09/17/2021 05:16:56 Launching the MT-DNN training
09/17/2021 05:16:56 Loading data/0/musixmatch_cased/haspeede-TW_train.json as task 0
Loaded 2400 samples out of 2400
Loaded 600 samples out of 600
Loaded 1000 samples out of 1000
09/17/2021 05:16:56 ####################
09/17/2021 05:16:56 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'data/0/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'data/0/haspeede-TW_task_def.yml', 'train_datasets': ['haspeede-TW'], 'test_datasets': ['haspeede-TW'], 'glue_format_on': False, 'mkd_opt': 0, 'do_padding': False, 'gan': False, 'num_hidden_layers_d': 1, 'num_hidden_layers_g': 1, 'noise_size': 100, 'epsilon': 1e-08, 'update_bert_opt': 0, 'multi_gpu_on': False, 'mem_cum_type': 'simple', 'answer_num_turn': 5, 'answe