## Imports and device

In [None]:
from google.colab import drive
import pandas as pd
import csv
from sklearn.model_selection import train_test_split
import numpy as np
import random
import tensorflow as tf
import torch

# Get the GPU device name.
device_name = tf.test.gpu_device_name()
# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

Found GPU at: /device:GPU:0
There are 1 GPU(s) available.
We will use the GPU: Tesla K80
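
The cell above raises `SystemError` as soon as TensorFlow reports no GPU, so the PyTorch CPU fallback is never reached on a CPU-only runtime. A minimal sketch of a non-fatal check is shown here. The helper name `pick_device_name` is hypothetical and not part of the notebook or the repository; it only reports which device string PyTorch would accept.

```python
def pick_device_name():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu".

    Hypothetical helper (not part of mt-ganbert): it never raises,
    so the notebook can keep running on a CPU-only runtime.
    """
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device_name()
print("Selected device:", device_name)
```

The returned string can be passed to `torch.device(...)` exactly as in the cell above.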


## Clone the Git repository and install requirements

In [None]:
!git clone --branch master https://github.com/crux82/mt-ganbert.git
%cd mt-ganbert/mttransformer/

!pip install -r requirements.txt
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install ekphrasis

Cloning into 'mt-ganbert'...
remote: Enumerating objects: 117, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (100/100), done.
remote: Total 117 (delta 14), reused 113 (delta 14), pack-reused 0
Receiving objects: 100% (117/117), 1.26 MiB | 8.69 MiB/s, done.
Resolving deltas: 100% (14/14), done.
/content/mt-ganbert/mttransformer
Collecting folium==0.2.1
  Downloading folium-0.2.1.tar.gz (69 kB)
Collecting colorlog
  Downloading colorlog-6.4.1-py2.py3-none-any.whl (11 kB)
Collecting boto3
  Downloading boto3-1.18.36-py3-none-any.whl (131 kB)
Collecting pytorch-pretrained-bert==v0.6.0
  Downloading pytorch_pretrained_bert-0.6.0-py3-none-any.whl (114 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinu

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp37-cp37m-linux_x86_64.whl (735.4 MB)
Collecting torchvision==0.8.2+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.8.2%2Bcu101-cp37-cp37m-linux_x86_64.whl (12.8 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0+cu102
    Uninstalling torch-1.9.0+cu102:
      Successfully uninstalled torch-1.9.0+cu102
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.10.0+cu102
    Uninstalling torchvision-0.10.0+cu102:
      Successfully uninstalled torchvision-0.10.0+cu102
ERROR: pip's dependency resolver does not currently take into account all the packages that ar

Collecting ekphrasis
  Downloading ekphrasis-0.5.1.tar.gz (80 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting ujson
  Downloading ujson-4.1.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (178 kB)
Collecting ftfy
  Downloading ftfy-6.0.3.tar.gz (64 kB)
Building wheels

## **BERT-based model**


Train the chosen model on the dataset of one Abusive Language Recognition task. The tasks are:


1.   HaSpeeDe: Hate Speech Recognition
2.   AMI A: Automatic Misogyny Identification (misogyny, not misogyny)
3.   AMI B: Automatic Misogyny Identification (misogyny_category: stereotype, sexual_harassment, discredit)
4.   DANKMEMEs: Hate Speech Recognition in meme sentences
5.   SENTIPOLC 1: Sentiment Polarity Classification (objective, subjective)
6.   SENTIPOLC 2: Sentiment Polarity Classification (polarity: positive, negative, neutral)

In [None]:
# Number of examples: 0, 100, 200, or 500

**Tokenization and Convert to Json**

The training code reads tokenized data in JSON format. Use `prepro_std.py` to tokenize your data and convert it to JSON.
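
The `prepro_std.py` call can be parameterized by task folder and number of examples. The sketch below only builds the command string. It assumes that the folders for 100, 200, and 500 examples follow the same `data/<task>/<n>/` layout as the `0` folder used in this notebook; the helper name `prepro_command` is hypothetical.

```python
import shlex

MODEL = "Musixmatch/umberto-commoncrawl-cased-v1"


def prepro_command(task_dir, n_examples, task_def_name, model=MODEL):
    """Build the prepro_std.py command line for one task folder.

    Assumption: every example count lives under data/<task_dir>/<n_examples>/,
    mirroring the data/HaSpeeDe/0/ folder used in this notebook.
    """
    root_dir = f"data/{task_dir}/{n_examples}/"
    args = [
        "python", "prepro_std.py",
        "--model", model,
        "--root_dir", root_dir,
        "--task_def", root_dir + task_def_name,
    ]
    # shlex.quote leaves plain paths untouched and quotes anything unsafe.
    return " ".join(shlex.quote(arg) for arg in args)


for n in (0, 100, 200, 500):
    print(prepro_command("HaSpeeDe", n, "haspeede-TW_task_def.yml"))
```

In a notebook cell, each printed line can be run by prefixing it with `!`.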

In [None]:
# Edit --root_dir and --task_def depending on the task
!python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir data/HaSpeeDe/0/ --task_def data/HaSpeeDe/0/haspeede-TW_task_def.yml

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Word statistics files not found!
Downloading... done!
Unpacking... done!
Reading english - 1grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/english/counts_1grams.txt
Reading english - 2grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/english/counts_2grams.txt
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
Downloading: 100% 794k/794k [00:00<00:00, 5.57MB/s]
Downloading: 100% 1.68M/1.68M [00:00<00:00, 8.33MB/s]
09/07/2021 12:58:37 Task haspeede-TW
09/07/2021 12:58:37 data/HaSpeeDe/0/musixmatch_cased/haspeede-TW_train.json
09/07/2021 12:58:38 data/HaSpeeDe/0/musixmatch_cased/haspeede-TW_dev.json
09/07/2021 12:58:38 data/HaSpeeDe/0/musixmatch_cased/haspeede-TW_test.json

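The log above lists the three JSON files written for the task. A quick way to check what they contain is to look at the field names of the first record. This is a minimal sketch, assuming each file holds one JSON object per line; that layout is not confirmed by the notebook. The helper name `peek_json_lines` is hypothetical.

```python
import json
from pathlib import Path


def peek_json_lines(path, limit=1):
    """Return the sorted field names of the first `limit` records.

    Assumption: the file stores one JSON object per line.
    """
    keys = []
    with Path(path).open(encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= limit:
                break
            record = json.loads(line)
            keys.append(sorted(record))
    return keys
```

For example, `peek_json_lines("data/HaSpeeDe/0/musixmatch_cased/haspeede-TW_train.json")` would list the fields of the first training record.
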
**Onboard your task into training!**

`--encoder_type 9` selects which BERT variant encodes the sentences. In this case, UmBERTo is used.
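
The training invocation can be assembled the same way, so that only the task-specific values change between runs. The sketch below uses only flags that appear in this notebook's `train.py` call. The helper name `train_command` is hypothetical, and the `musixmatch_cased/` subfolder is taken from the preprocessing log rather than from any documentation.

```python
import shlex

MODEL = "Musixmatch/umberto-commoncrawl-cased-v1"


def train_command(task_dir, n_examples, dataset, task_def_name,
                  encoder_type=9, epochs=10, max_seq_len=128,
                  batch_size=16, learning_rate="5e-5"):
    """Build the train.py command line for one task.

    Assumption: preprocessed data sits in
    data/<task_dir>/<n_examples>/musixmatch_cased/, as in the log output.
    """
    root = f"data/{task_dir}/{n_examples}/"
    args = [
        "python", "train.py",
        "--encoder_type", str(encoder_type),
        "--epochs", str(epochs),
        "--task_def", root + task_def_name,
        "--data_dir", root + "musixmatch_cased/",
        "--init_checkpoint", MODEL,
        "--bert_model_type", MODEL,
        "--max_seq_len", str(max_seq_len),
        "--batch_size", str(batch_size),
        "--batch_size_eval", str(batch_size),
        "--optimizer", "adamW",
        "--train_datasets", dataset,
        "--test_datasets", dataset,
        "--learning_rate", learning_rate,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


print(train_command("HaSpeeDe", 0, "haspeede-TW", "haspeede-TW_task_def.yml"))
```

Keeping the defaults in one place makes it harder to change `--task_def` for a new task while forgetting `--data_dir` or the dataset names.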

In [None]:
# Edit --task_def, --data_dir, --train_datasets and --test_datasets depending on the task
!python train.py --encoder_type 9 --epochs 10 --task_def data/HaSpeeDe/0/haspeede-TW_task_def.yml --data_dir data/HaSpeeDe/0/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets haspeede-TW --test_datasets haspeede-TW --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
09/07/2021 12:58:44 Launching the MT-DNN training
09/07/2021 12:58:44 Loading data/HaSpeeDe/0/musixmatch_cased/haspeede-TW_train.json as task 0
Loaded 2400 samples out of 2400
Loaded 600 samples out of 600
Loaded 1000 samples out of 1000
09/07/2021 12:58:44 ####################
09/07/2021 12:58:44 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'data/HaSpeeDe/0/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'data/HaSpeeDe/0/haspeede-TW_task_def.yml', 'train_datasets': ['haspeede-TW'], 'test_datasets': ['haspeede-TW'], 'glue_format_on': False, 'mkd_opt': 0, 'do_padding': False, 'gan': False, 'num_hidden_layers_d': 1, 'num_hidden_layers_g': 1, 'noise_size': 100, 'epsilon': 1e-08, 'update_bert_opt': 0, 'multi_gpu_on': False, 'mem_cum_type': 'simple', '