# Tutorial: Run Your Own Tasks in MT-GAN
This notebook shows how to run MT-GAN on the example tasks included in the git repository, which target the recognition of abusive language.

To compare performance, this notebook includes the following models:

1. A model based only on the BERT Transformer (BERT-based model)
2. A BERT-based model extended with Semi-Supervised Adversarial Learning (SS-GAN), called GANBERT
3. The MT-DNN model


## Imports and device

In [None]:
from google.colab import drive
import pandas as pd
import csv
from sklearn.model_selection import train_test_split
import numpy as np
import random
import tensorflow as tf
import torch

# Get the GPU device name.
device_name = tf.test.gpu_device_name()
# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

Found GPU at: /device:GPU:0
There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## Clone the git repository and install requirements

In [None]:
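# Replace USERNAME and PASSWORD with your own GitLab credentials; never store real credentials in a shared notebook.
!git clone http://USERNAME:PASSWORD@gitlab.revealsrl.it/croce/mttransformer.git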
%cd mttransformer/

!pip install -r requirements.txt
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install ekphrasis

Cloning into 'mttransformer'...
remote: Counting objects: 270, done.
remote: Compressing objects: 100% (266/266), done.
remote: Total 270 (delta 131), reused 0 (delta 0)
Receiving objects: 100% (270/270), 1.30 MiB | 1.56 MiB/s, done.
Resolving deltas: 100% (131/131), done.
/content/mttransformer
Collecting folium==0.2.1
  Downloading folium-0.2.1.tar.gz (69 kB)
Collecting colorlog
  Downloading colorlog-5.0.1-py2.py3-none-any.whl (10 kB)
Collecting boto3
  Downloading boto3-1.18.6-py3-none-any.whl (131 kB)
Collecting pytorch-pretrained-bert==v0.6.0
  Downloading pytorch_pretrained_bert-0.6.0-py3-none-any.whl (114 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp37-cp37m-linux_x86_64.whl (735.4 MB)
Collecting torchvision==0.8.2+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.8.2%2Bcu101-cp37-cp37m-linux_x86_64.whl (12.8 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0+cu102
    Uninstalling torch-1.9.0+cu102:
      Successfully uninstalled torch-1.9.0+cu102
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.10.0+cu102
    Uninstalling torchvision-0.10.0+cu102:
      Successfully uninstalled torchvision-0.10.0+cu102
ERROR: pip's dependency resolver does not currently take into account all the packages that are

Collecting ekphrasis
  Downloading ekphrasis-0.5.1.tar.gz (80 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting ujson
  Downloading ujson-4.0.2-cp37-cp37m-manylinux1_x86_64.whl (179 kB)
Collecting ftfy
  Downloading ftfy-6.0.3.tar.gz (64 kB)
Building wheels for collected pa

## Set the model type and the amount of training data


> Single-task model:


1. If apply_gan = False and number_labeled_examples = 0, the BERT-based model is trained on all the data available for the i-th task.
2. If apply_gan = False and number_labeled_examples = 100 (or 200 or 500), the BERT-based model (which has no GAN component) is trained on a reduced subset of the i-th task's data.
3. If apply_gan = True and number_labeled_examples = 100 (or 200 or 500), the GANBERT model is trained on number_labeled_examples labeled examples of the i-th task, with the rest of the dataset treated as unlabeled.

> Model trained simultaneously on all tasks (MT model):

1. If apply_gan = False and number_labeled_examples = 0, the MT-DNN model is trained on all the data available for each task.
2. If apply_gan = False and number_labeled_examples = 100 (or 200 or 500), the MT-DNN model is trained on a reduced subset of each task's data.
3. If apply_gan = True and number_labeled_examples = 100 (or 200 or 500), the MT-GAN model is trained on number_labeled_examples labeled examples of each task, with the rest of each dataset treated as unlabeled.
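
The six combinations listed above can be summarized in a small lookup function. This is only an illustrative sketch: `select_model` and its `multi_task` argument are hypothetical names that do not exist in the repository, and rejecting `apply_gan=True` with `number_labeled_examples=0` is an assumption, since that combination is not described above.

```python
def select_model(apply_gan: bool, number_labeled_examples: int, multi_task: bool) -> str:
    """Return the model name implied by the notebook settings (hypothetical helper)."""
    if apply_gan:
        # Assumption: the GAN variants need a labeled subset (100, 200 or 500 here).
        if number_labeled_examples == 0:
            raise ValueError("apply_gan=True expects number_labeled_examples > 0")
        return "MT-GAN" if multi_task else "GANBERT"
    # Without the GAN, the same model is trained on the full or the reduced dataset.
    return "MT-DNN" if multi_task else "BERT-based model"


for multi_task in (False, True):
    for apply_gan, n in ((False, 0), (False, 200), (True, 200)):
        print(multi_task, apply_gan, n, "->", select_model(apply_gan, n, multi_task))
```

Only `apply_gan` switches between model families; `number_labeled_examples` changes how much labeled data the chosen model sees.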








In [None]:
apply_gan=False
number_labeled_examples=200 #0-100-200-500
file_loaded=False
file_loaded2=False
file_loaded3=False
file_loaded4=False
file_loaded5=False
file_loaded6=False

## **Single-Task model**


> BERT-based model


> *GANBERT*




Each sub-block trains the chosen model on the dataset of one abusive-language recognition task. The tasks are:


1.   HaSpeeDe: Hate Speech Recognition
2.   AMI A: Automatic Misogyny Identification (misogynous, not misogynous)
3.   AMI B: Automatic Misogyny Identification (misogyny_category: stereotype, sexual_harassment, discredit)
4.   DANKMEMEs: Hate Speech Recognition in meme text
5.   SENTIPOLC 1: Sentiment Polarity Classification (objective, subjective)
6.   SENTIPOLC 2: Sentiment Polarity Classification (polarity: positive, negative, neutral)



### Task HaSpeeDe 

In [None]:
%cd tsv_files/

[Errno 2] No such file or directory: 'tsv_files/'
/content


Load the dataset into a DataFrame

In [None]:
file_loaded=True

tsv_haspeede_train = 'haspeede_TW-train.tsv'
tsv_haspeede_test = 'haspeede_TW-reference.tsv'

df_train = pd.read_csv(tsv_haspeede_train, delimiter='\t', names=('id','sentence','label'))
df_train = df_train[['id']+['label']+['sentence']]
df_test = pd.read_csv(tsv_haspeede_test, delimiter='\t', names=('id','sentence','label'))
df_test = df_test[['id']+['label']+['sentence']]

#split train dev
train_dataset, dev_dataset = train_test_split(df_train, test_size=0.2, shuffle = True)

#reduction
if number_labeled_examples!=0:
  # Sample the labeled subset; the remaining rows form the unlabeled pool.
  # Building a copy (instead of dropping in place) leaves train_dataset intact
  # and avoids the pandas SettingWithCopyWarning.
  labeled = train_dataset.sample(n=number_labeled_examples)
  unlabeled = train_dataset[~train_dataset['id'].isin(labeled['id'])].copy()

  #model with or without gan 
  if apply_gan:
    print("GANBERT")
    #dataset unlabeled with label -1
    unlabeled['label'] = -1
    train = pd.concat([labeled, unlabeled])
    dev = dev_dataset
    print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train),len(labeled), len(unlabeled)))
    print("Size of Dev dataset is {} ".format(len(dev)))
  else:
    print("BERT-based model, with reduction dataset")
    train = labeled
    dev = dev_dataset
    print("Size of Train dataset is {} ".format(len(labeled)))
    print("Size of Dev dataset is {} ".format(len(dev)))

else:
  print("BERT-based model")
  train = train_dataset
  dev = dev_dataset
  print("Size of Train dataset is {} ".format(len(train)))
  print("Size of Dev dataset is {} ".format(len(dev)))

NameError: ignored
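
The reduction step in the cell above (sample a labeled subset, then mark everything else as unlabeled with the label -1) can be sketched without pandas. This is a minimal illustration in plain Python, not the repository's code: `split_labeled_unlabeled`, the `seed` argument and the toy rows are hypothetical.

```python
import random

UNLABELED = -1  # label value the notebook assigns to unlabeled examples


def split_labeled_unlabeled(rows, n_labeled, seed=0):
    """Keep the labels of n_labeled sampled rows; mark every other row as unlabeled.

    rows is a list of (id, label, sentence) tuples. Returns (labeled, unlabeled).
    """
    rng = random.Random(seed)
    labeled = rng.sample(rows, n_labeled)
    labeled_ids = {row_id for row_id, _, _ in labeled}
    unlabeled = [(row_id, UNLABELED, sentence)
                 for row_id, _, sentence in rows
                 if row_id not in labeled_ids]
    return labeled, unlabeled


# Toy data standing in for the training split.
rows = [(i, i % 2, "sentence {}".format(i)) for i in range(10)]
labeled, unlabeled = split_labeled_unlabeled(rows, n_labeled=3)
train = labeled + unlabeled  # GANBERT trains on both parts
print(len(labeled), len(unlabeled), len(train))
```

For the BERT-based model with a reduced dataset, only `labeled` would be used as the training set and `unlabeled` would be discarded.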

In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

The code uses a suffix to distinguish the type of each set ("_train", "_dev" and "_test"). So:
1.   make sure your train set is named "TASK_train" (replace TASK with your task name);

2.   make sure the names of your dev set and test set end with "_dev" and "_test";
3.   add your task to the task definition config (task_def file):

  Here is an example task definition config:
  <pre>haspeede-TW:
    data_format: PremiseOnly
    enable_san: false
    labels:
    - contradiction
    - neutral
    - entailment
    metric_meta:
    - ACC
    loss: CeCriterion
    n_class: 3
    task_type: Classification</pre>

Choose the data format that matches your task. This notebook uses two data formats, one for each kind of model:
  1. "PremiseOnly" : single text, i.e. premise. Data format is "id" \t "label" \t "premise" .
  2. "Gan" : single text, i.e. premise, in the same "id" \t "label" \t "premise" layout; unlabeled examples carry the label -1.
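
As a rough illustration of the naming convention and the row layout described above, the following sketch writes one split to a tab-separated file and reads it back. The helper `write_split`, the task name "my-task" and the two rows are made up for the example; only the "TASK_split" naming and the id, label, premise column order come from this notebook.

```python
import csv
import os
import tempfile


def write_split(out_dir, task, split, rows):
    """Write (id, label, premise) rows to '<task>_<split>.tsv' and return the path."""
    if split not in ("train", "dev", "test"):
        raise ValueError("split must be 'train', 'dev' or 'test'")
    path = os.path.join(out_dir, "{}_{}.tsv".format(task, split))
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    return path


out_dir = tempfile.mkdtemp()
# Placeholder rows: the second one uses -1, the label given to unlabeled examples.
rows = [(1, 0, "first example"), (2, -1, "unlabeled example")]
path = write_split(out_dir, "my-task", "train", rows)

with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f, delimiter="\t"))
print(os.path.basename(path), read_back)
```

Reading the file back returns every field as a string, so labels need to be converted to integers again before use.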

enable_san: Set "true" if you would like to use a Stochastic Answer Network (SAN) for your task.

If you prefer using readable labels (text), you can specify what labels are there in your data set, under "labels" field.

For more details about metrics, refer to [data_utils/metrics.py](../data_utils/metrics.py).
  
You can choose the loss from the pre-defined losses in [mt_dnn/loss.py](../mt_dnn/loss.py); this applies to the BERT-based model and MT-DNN, while the GANBERT loss is defined inside the model. You can also implement a custom loss in this file and specify it in the task config.

Specify the task type of your own task by choosing one of:
    1. Classification
    2. Regression
    3. Ranking
    4. Span
    5. SeqenceLabeling
    6. MaskLM
  For more details, see [data_utils/task_def.py](../data_utils/task_def.py).
  
Also specify the total number of classes in your task under the "n_class" field.
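
The fields described above can be assembled into a task definition entry programmatically. The sketch below builds the YAML text by hand; `build_task_def` is a hypothetical helper, and the default metrics (F1MAC, ACC) and the CeCriterion loss are the values this notebook writes for its classification tasks, not a general recommendation.

```python
def build_task_def(task, n_class, apply_gan, metrics=("F1MAC", "ACC")):
    """Return the YAML text of one task entry (hypothetical helper)."""
    lines = [
        "{}:".format(task),
        # "Gan" for GANBERT / MT-GAN, "PremiseOnly" otherwise.
        "  data_format: {}".format("Gan" if apply_gan else "PremiseOnly"),
        "  enable_san: false",
        "  metric_meta:",
    ]
    lines += ["  - {}".format(metric) for metric in metrics]
    lines += [
        "  loss: CeCriterion",
        "  n_class: {}".format(n_class),
        "  task_type: Classification",
    ]
    return "\n".join(lines) + "\n"


print(build_task_def("haspeede-TW", n_class=2, apply_gan=False))
```

Writing the returned string to a file such as "haspeede-TW_task_def.yml" gives a config that can be passed through the --task_def option.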

In [None]:
#train
name_train = "haspeede-TW_train.tsv"
id_train = train.id 
label_train = train.label
sentence_train = train.sentence

#dev
name_dev = "haspeede-TW_dev.tsv"
id_dev = dev.id
label_dev = dev.label
sentence_dev = dev.sentence

#test
name_test = "haspeede-TW_test.tsv"
id_test = df_test.id
label_test = df_test.label
sentence_test = df_test.sentence

#task_def
name_file = 'haspeede-TW_task_def.yml'


# Write each split as a tab-separated file with the columns: id, label, sentence
for name, ids, labels, sentences in [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

task = "haspeede-TW:\n"

# Write the task definition (YAML) for this task
with open(name_file, 'w') as f:
    f.write(task)
    if apply_gan:
      f.write("  data_format: Gan\n")
    else:
      f.write("  data_format: PremiseOnly\n")
    f.write("  enable_san: false\n")
    f.write("  metric_meta:\n")
    f.write("  - F1MAC\n")
    f.write("  - ACC\n")
    f.write("  loss: CeCriterion\n")
    f.write("  n_class: 2\n")
    f.write("  task_type: Classification\n")

In [None]:
%cd ..
%cd ..

**Tokenize and convert to JSON**

The training code reads tokenized data in JSON format. Use "prepro_std.py" to tokenize your data and convert it to JSON. The tokenization comes in two variants:


*   For GANBERT, which also balances labeled and unlabeled data
*   For the BERT-based model



In [None]:
if apply_gan == True:
  print("GANBERT")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/haspeede-TW_task_def.yml
else:
  print("BERT-based model")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/haspeede-TW_task_def.yml

**Onboard your task into training!**

Add your task config to the overall training configuration. Again, there are two variants:

*   For GANBERT, which additionally specifies the number of layers of the Discriminator and of the Generator, the size of the noise vector and the epsilon
*   For the BERT-based model

--encoder_type 9 selects the BERT variant used to encode the sentences; in this case UmBERTo is used.
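
The two train.py invocations used for each task in this notebook differ only in the GAN flags, the number of epochs, the batch size and the learning rate. The sketch below assembles the argument list from those values; `build_train_command` is a hypothetical helper, and every flag and value is copied from the notebook's own commands rather than from a documented API.

```python
import shlex


def build_train_command(task, apply_gan,
                        model="Musixmatch/umberto-commoncrawl-cased-v1",
                        root="tsv_files/tsv_transformed"):
    """Assemble the train.py argument list used in this notebook (hypothetical helper)."""
    args = ["python", "train.py"]
    if apply_gan:
        # GAN-specific options: generator/discriminator depth, noise size, epsilon.
        args += ["--gan", "--num_hidden_layers_g", "3", "--num_hidden_layers_d", "0",
                 "--noise_size", "100", "--epsilon", "1e-8"]
    epochs, batch, lr = ("25", "64", "1e-5") if apply_gan else ("10", "16", "5e-5")
    args += [
        "--encoder_type", "9",
        "--epochs", epochs,
        "--task_def", "{}/{}_task_def.yml".format(root, task),
        "--data_dir", "{}/musixmatch_cased/".format(root),
        "--init_checkpoint", model,
        "--bert_model_type", model,
        "--max_seq_len", "128",
        "--batch_size", batch,
        "--batch_size_eval", batch,
        "--optimizer", "adamW",
        "--train_datasets", task,
        "--test_datasets", task,
        "--learning_rate", lr,
    ]
    return args


command = build_train_command("haspeede-TW", apply_gan=True)
print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the arguments in a list makes it easy to switch tasks without editing a long command string by hand.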

In [None]:
if apply_gan == True:
  print("GANBERT")
  !python train.py --gan --num_hidden_layers_g 3 --num_hidden_layers_d 0 --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/haspeede-TW_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets haspeede-TW --test_datasets haspeede-TW --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("BERT-based model")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/haspeede-TW_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets haspeede-TW --test_datasets haspeede-TW --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

### Task AMI A

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
file_loaded2=True

tsv_AMI2018_train = 'AMI2018_it_training.tsv'
tsv_AMI2018_test = 'AMI2018_it_testing.tsv'

df_train2 = pd.read_csv(tsv_AMI2018_train, delimiter='\t')
df_train2 = df_train2[['id']+['misogynous']+['text']]
df_test2 = pd.read_csv(tsv_AMI2018_test, delimiter='\t')
df_test2 = df_test2[['id']+['misogynous']+['text']]

#split train dev
train_dataset2, dev_dataset2 = train_test_split(df_train2, test_size=0.2, shuffle = True)

#reduction
if number_labeled_examples!=0:
  # Sample the labeled subset; the remaining rows form the unlabeled pool.
  # Building a copy (instead of dropping in place) leaves train_dataset2 intact
  # and avoids the pandas SettingWithCopyWarning.
  labeled2 = train_dataset2.sample(n=number_labeled_examples)
  unlabeled2 = train_dataset2[~train_dataset2['id'].isin(labeled2['id'])].copy()
  
  #model with or without gan 
  if apply_gan:
    print("GANBERT")
    #dataset unlabeled with label -1
    unlabeled2['misogynous'] = -1
    train2 = pd.concat([labeled2, unlabeled2])
    dev2 = dev_dataset2
    print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train2),len(labeled2), len(unlabeled2)))
    print("Size of Dev dataset is {} ".format(len(dev2)))
  else:
    print("BERT-based model, with reduction dataset")
    train2 = labeled2
    dev2 = dev_dataset2
    print("Size of Train dataset is {} ".format(len(labeled2)))
    print("Size of Dev dataset is {} ".format(len(dev2)))

else:
  print("BERT-based model")
  train2 = train_dataset2
  dev2 = dev_dataset2
  print("Size of Train dataset is {} ".format(len(train2)))
  print("Size of Dev dataset is {} ".format(len(dev2)))

BERT-based model, with reduction dataset
Size of Train dataset is 200 
Size of Dev dataset is 800 


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

mkdir: cannot create directory ‘tsv_transformed’: File exists
/content/mttransformer/tsv_files/tsv_transformed


In [None]:
#train
name_train = "AMI2018A_train.tsv"
id_train = train2.id
label_train = train2.misogynous
sentence_train = train2.text

#dev
name_dev = "AMI2018A_dev.tsv"
id_dev = dev2.id
label_dev = dev2.misogynous
sentence_dev = dev2.text

#test
name_test = "AMI2018A_test.tsv"
id_test = df_test2.id
label_test = df_test2.misogynous
sentence_test = df_test2.text

#task_def
name_file = 'AMI2018A_task_def.yml'

# Write each split as a tab-separated file with the columns: id, label, sentence
for name, ids, labels, sentences in [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

task = "AMI2018A:\n"

# Write the task definition (YAML) for this task
with open(name_file, 'w') as f:
    f.write(task)
    if apply_gan:
      f.write("  data_format: Gan\n")
    else:
      f.write("  data_format: PremiseOnly\n")
    f.write("  enable_san: false\n")
    f.write("  metric_meta:\n")
    f.write("  - F1MAC\n")
    f.write("  - ACC\n")
    f.write("  loss: CeCriterion\n")
    f.write("  n_class: 2\n")
    f.write("  task_type: Classification\n")

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer


In [None]:
if apply_gan == True:
  print("GANBERT")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/AMI2018A_task_def.yml
else:
  print("BERT-based model")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/AMI2018A_task_def.yml

BERT-based model
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2021-07-26 08:51:15.632842: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Reading english - 1grams ...
Reading english - 2grams ...
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
07/26/2021 08:51:27 Task AMI2018A
07/26/2021 08:51:27 tsv_files/tsv_transformed/musixmatch_cased/AMI2018A_train.json
07/26/2021 08:51:27 tsv_files/tsv_transformed/musixmatch_cased/AMI2018A_dev.json
07/26/2021 08:51:27 tsv_files/tsv_transformed/musixmatch_cased/AMI2018A_test.json

In [None]:
if apply_gan == True:
  print("GANBERT")
  !python train.py --gan --num_hidden_layers_d 0 --num_hidden_layers_g 3 --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/AMI2018A_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets AMI2018A --test_datasets AMI2018A --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("BERT-based model")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/AMI2018A_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets AMI2018A --test_datasets AMI2018A --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

BERT-based model
2021-07-26 08:51:29.881740: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/26/2021 08:51:32 Launching the MT-DNN training
07/26/2021 08:51:32 Loading tsv_files/tsv_transformed/musixmatch_cased/AMI2018A_train.json as task 0
Loaded 200 samples out of 200
Loaded 800 samples out of 800
Loaded 1000 samples out of 1000
07/26/2021 08:51:32 ####################
07/26/2021 08:51:32 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'tsv_files/tsv_transformed/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tsv_files/tsv_transformed/AMI2018A_task_def.yml', 'train_datasets': ['AMI2018A'], 'test_datasets': ['AMI2018A'], 'glue_format_on': False, 'mkd_opt': 0, 'do_padding': Fal

### Task AMI B

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
file_loaded3=True

tsv_AMI2018_train = 'AMI2018_it_training.tsv'
tsv_AMI2018_test = 'AMI2018_it_testing.tsv'

# Integer labels for the three retained categories
# ('dominance' and 'derailing' are deliberately left out).
category_to_label = {'stereotype': 0, 'sexual_harassment': 1, 'discredit': 2}

def select_categories(df_raw):
  # Keep misogynous rows in a retained category and encode the category as an integer.
  # Boolean masking replaces the row-by-row DataFrame.append loop (append is deprecated in pandas).
  mask = (df_raw['misogynous'] == 1) & df_raw['misogyny_category'].isin(list(category_to_label))
  df = df_raw.loc[mask, ['id', 'misogyny_category', 'text']].copy()
  df['misogyny_category'] = df['misogyny_category'].map(category_to_label)
  return df.reset_index(drop=True)

df_train3 = select_categories(pd.read_csv(tsv_AMI2018_train, delimiter='\t'))
df_test3 = select_categories(pd.read_csv(tsv_AMI2018_test, delimiter='\t'))

#split train dev
train_dataset3, dev_dataset3 = train_test_split(df_train3, test_size=0.2, shuffle = True)

#reduction
if number_labeled_examples!=0:
  # Sample the labeled subset; the remaining rows form the unlabeled pool.
  # Building a copy (instead of dropping in place) leaves train_dataset3 intact
  # and avoids the pandas SettingWithCopyWarning.
  labeled3 = train_dataset3.sample(n=number_labeled_examples)
  unlabeled3 = train_dataset3[~train_dataset3['id'].isin(labeled3['id'])].copy()

  #model with or without gan 
  if apply_gan:
    print("GANBERT")
    #dataset unlabeled with label -1
    unlabeled3['misogyny_category'] = -1
    train3 = pd.concat([labeled3, unlabeled3])
    dev3 = dev_dataset3
    print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train3),len(labeled3), len(unlabeled3)))
    print("Size of Dev dataset is {} ".format(len(dev3)))
  else:
    print("BERT-based model, with reduction dataset")
    train3 = labeled3
    dev3 = dev_dataset3
    print("Size of Train dataset is {} ".format(len(labeled3)))
    print("Size of Dev dataset is {} ".format(len(dev3)))

else:
  print("BERT-based model")
  train3 = train_dataset3
  dev3=dev_dataset3
  print("Size of Train dataset is {} ".format(len(train3)))
  print("Size of Dev dataset is {} ".format(len(dev3)))

BERT-based model, with reduction dataset
Size of Train dataset is 200 
Size of Dev dataset is 347 


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

mkdir: cannot create directory ‘tsv_transformed’: File exists
/content/mttransformer/tsv_files/tsv_transformed


In [None]:
#train
name_train = "AMI2018B_train.tsv"
id_train = train3.id
label_train = train3.misogyny_category
sentence_train = train3.text

#dev
name_dev = "AMI2018B_dev.tsv"
id_dev = dev3.id
label_dev = dev3.misogyny_category
sentence_dev = dev3.text

#test
name_test = "AMI2018B_test.tsv"
id_test = df_test3.id
label_test = df_test3.misogyny_category
sentence_test = df_test3.text

#task_def
name_file = 'AMI2018B_task_def.yml'


# Write each split as a tab-separated file with the columns: id, label, sentence
for name, ids, labels, sentences in [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

task = "AMI2018B:\n"

# Write the task definition (YAML) for this task
with open(name_file, 'w') as f:
    f.write(task)
    if apply_gan:
      f.write("  data_format: Gan\n")
    else:
      f.write("  data_format: PremiseOnly\n")
    f.write("  enable_san: false\n")
    f.write("  metric_meta:\n")
    f.write("  - F1MAC\n")
    f.write("  - ACC\n")
    f.write("  loss: CeCriterion\n")
    f.write("  n_class: 3\n")
    f.write("  task_type: Classification\n")

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer


In [None]:
if apply_gan == True:
  print("GANBERT")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/AMI2018B_task_def.yml
else:
  print("BERT-based model")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/AMI2018B_task_def.yml

BERT-based model
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2021-07-26 08:57:18.973721: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Reading english - 1grams ...
Reading english - 2grams ...
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
07/26/2021 08:57:36 Task AMI2018B
07/26/2021 08:57:36 tsv_files/tsv_transformed/musixmatch_cased/AMI2018B_train.json
07/26/2021 08:57:36 tsv_files/tsv_transformed/musixmatch_cased/AMI2018B_dev.json
07/26/2021 08:57:36 tsv_files/tsv_transformed/musixmatch_cased/AMI2018B_test.json

In [None]:
if apply_gan == True:
  print("GANBERT")
  !python train.py --gan --num_hidden_layers_d 0 --num_hidden_layers_g 3 --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/AMI2018B_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets AMI2018B --test_datasets AMI2018B --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("BERT-based model")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/AMI2018B_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets AMI2018B --test_datasets AMI2018B --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

BERT-based model
2021-07-26 08:57:38.774832: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/26/2021 08:57:40 Launching the MT-DNN training
07/26/2021 08:57:40 Loading tsv_files/tsv_transformed/musixmatch_cased/AMI2018B_train.json as task 0
Loaded 200 samples out of 200
Loaded 347 samples out of 347
Loaded 446 samples out of 446
07/26/2021 08:57:40 ####################
07/26/2021 08:57:40 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'tsv_files/tsv_transformed/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tsv_files/tsv_transformed/AMI2018B_task_def.yml', 'train_datasets': ['AMI2018B'], 'test_datasets': ['AMI2018B'], 'glue_format_on': False, 'mkd_opt': 0, 'do_padding': False

### Task DANKMEMEs

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
file_loaded4=True

tsv_DANKMEMES2020_train = 'dankmemes_task2_train.csv'
tsv_DANKMEMES2020_test = 'hate_test.csv'

df_train4 = pd.read_csv(tsv_DANKMEMES2020_train, delimiter=',')
df_train4 = df_train4[['File']+['Hate Speech']+['Text']]
df_test4 = pd.read_csv(tsv_DANKMEMES2020_test, delimiter=',')
df_test4 = df_test4[['File']+['Hate Speech']+['Text']]


#split train dev
train_dataset4, dev_dataset4 = train_test_split(df_train4, test_size=0.2, shuffle = True)

#reduction
if number_labeled_examples!=0:

  # Sample the labeled subset; the remaining training rows form the unlabeled pool.
  # Working on a copy avoids mutating train_dataset4 and the pandas SettingWithCopyWarning.
  if number_labeled_examples in (100, 200, 500):
    labeled4 = train_dataset4.sample(n=number_labeled_examples)
    unlabeled4 = train_dataset4[~train_dataset4['File'].isin(labeled4['File'])].copy()

  #model with or without gan 
  if apply_gan == True:
    print("GANBERT")
    #dataset unlabeled with label -1
    unlabeled4['Hate Speech'] = unlabeled4['Hate Speech'].replace(0,-1)
    unlabeled4['Hate Speech'] = unlabeled4['Hate Speech'].replace(1,-1)
    train4 = pd.concat([labeled4, unlabeled4])
    dev4 = dev_dataset4
    print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train4),len(labeled4), len(unlabeled4)))
    print("Size of Dev dataset is {} ".format(len(dev4)))
  else:
    print("BERT-based model, with reduction dataset")
    train4 = labeled4
    dev4 = dev_dataset4
    print("Size of Train dataset is {} ".format(len(labeled4)))
    print("Size of Dev dataset is {} ".format(len(dev4)))


else:
  print("BERT-based model")
  train4 = train_dataset4
  dev4=dev_dataset4
  print("Size of Train dataset is {} ".format(len(train4)))
  print("Size of Dev dataset is {} ".format(len(dev4)))


BERT-based model, with reduction dataset
Size of Train dataset is 200 
Size of Dev dataset is 160 


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
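
The reduction step above is repeated for every task with only the column names changing. A minimal sketch of a reusable helper follows, assuming each dataset has a column that identifies its rows. The names `split_labeled_unlabeled` and `seed` are hypothetical and not part of the repository; fixing the seed is an optional addition that makes the sampled subset reproducible.

```python
import pandas as pd


def split_labeled_unlabeled(train_df, id_col, n_labeled, seed=None):
    """Return (labeled, unlabeled) copies of train_df whose IDs do not overlap."""
    # Draw the labeled subset; random_state makes the draw repeatable.
    labeled = train_df.sample(n=n_labeled, random_state=seed)
    # Everything whose ID was not sampled goes to the unlabeled pool.
    unlabeled = train_df[~train_df[id_col].isin(labeled[id_col])].copy()
    return labeled, unlabeled


# Toy data using the DANKMEMES column names from this notebook.
toy = pd.DataFrame({
    "File": [f"{i}.jpg" for i in range(10)],
    "Hate Speech": [i % 2 for i in range(10)],
    "Text": ["example text"] * 10,
})
labeled, unlabeled = split_labeled_unlabeled(toy, "File", n_labeled=4, seed=0)
print(len(labeled), len(unlabeled))  # 4 6
```

Because the helper returns copies, later assignments such as setting the unlabeled labels to -1 do not touch the original training frame.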


In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

mkdir: cannot create directory ‘tsv_transformed’: File exists
/content/mttransformer/tsv_files/tsv_transformed


In [None]:
#train
name_train = "DANKMEMES2020_train.tsv"
id_train = train4.File
label_train = train4["Hate Speech"]
sentence_train = train4.Text

#dev
name_dev = "DANKMEMES2020_dev.tsv"
id_dev = dev4.File
label_dev = dev4["Hate Speech"]
sentence_dev = dev4.Text

#test
name_test = "DANKMEMES2020_test.tsv"
id_test = df_test4.File
label_test = df_test4["Hate Speech"]
sentence_test = df_test4.Text

#task_def
name_file = 'DANKMEMES2020_task_def.yml'


# newline='' lets the csv module control line endings, as its documentation recommends
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))


with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))


with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))


task = "DANKMEMES2020:\n"
  
with open(name_file, 'w') as f:

    f.write(task)
    if apply_gan == True:
      f.write("  data_format: Gan\n")
    else:
      f.write("  data_format: PremiseOnly\n")
    f.write("  enable_san: false\n")
    f.write("  metric_meta:\n")
    f.write("  - F1MAC\n") 
    f.write("  - ACC\n")
    f.write("  loss: CeCriterion\n")
    f.write("  n_class: 2\n")
    f.write("  task_type: Classification\n")
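
The task definition written above differs between tasks only in the task name, the number of classes and the data format. A minimal sketch that generates the same text from those three values is shown below; `task_def_text` is a hypothetical helper, and the field values are copied from the cell above rather than from the repository documentation.

```python
import os
import tempfile


def task_def_text(task_name, n_class, use_gan):
    """Build the YAML text of a single-task definition."""
    data_format = "Gan" if use_gan else "PremiseOnly"
    lines = [
        f"{task_name}:",
        f"  data_format: {data_format}",
        "  enable_san: false",
        "  metric_meta:",
        "  - F1MAC",
        "  - ACC",
        "  loss: CeCriterion",
        f"  n_class: {n_class}",
        "  task_type: Classification",
    ]
    return "\n".join(lines) + "\n"


text = task_def_text("DANKMEMES2020", n_class=2, use_gan=False)

# Write to a temporary directory so the sketch does not touch the repository.
path = os.path.join(tempfile.mkdtemp(), "DANKMEMES2020_task_def.yml")
with open(path, "w") as f:
    f.write(text)
print(text)
```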

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer


In [None]:
if apply_gan == True:
  print("GANBERT")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/DANKMEMES2020_task_def.yml
else:
  print("BERT-based model")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/DANKMEMES2020_task_def.yml

BERT-based model
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2021-07-26 09:03:11.695673: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Reading english - 1grams ...
Reading english - 2grams ...
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
07/26/2021 09:03:28 Task DANKMEMES2020
07/26/2021 09:03:28 tsv_files/tsv_transformed/musixmatch_cased/DANKMEMES2020_train.json
07/26/2021 09:03:28 tsv_files/tsv_transformed/musixmatch_cased/DANKMEMES2020_dev.json
07/26/2021 09:03:28 tsv_files/tsv_transformed/musixmatch_cased/DANKMEMES2020_test.json

In [None]:
if apply_gan == True:
  print("GANBERT")
  !python train.py --gan --num_hidden_layers_d 0 --num_hidden_layers_g 3 --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/DANKMEMES2020_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets DANKMEMES2020 --test_datasets DANKMEMES2020 --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("BERT-based model")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/DANKMEMES2020_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets DANKMEMES2020 --test_datasets DANKMEMES2020 --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

BERT-based model
2021-07-26 09:03:30.841804: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/26/2021 09:03:33 Launching the MT-DNN training
07/26/2021 09:03:33 Loading tsv_files/tsv_transformed/musixmatch_cased/DANKMEMES2020_train.json as task 0
Loaded 200 samples out of 200
Loaded 160 samples out of 160
Loaded 200 samples out of 200
07/26/2021 09:03:33 ####################
07/26/2021 09:03:33 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'tsv_files/tsv_transformed/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tsv_files/tsv_transformed/DANKMEMES2020_task_def.yml', 'train_datasets': ['DANKMEMES2020'], 'test_datasets': ['DANKMEMES2020'], 'glue_format_on': False, 'mkd_opt': 0,
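
The two `train.py` invocations above differ only in the GAN flags, the number of epochs, the batch size and the learning rate. A minimal sketch that assembles the argument list from a task name is given below. `train_command` is a hypothetical helper; every flag and value is copied from the cells of this notebook, and nothing here is checked against `train.py` itself.

```python
def train_command(task, use_gan):
    """Return the train.py argument list used in this notebook for one task."""
    prefix = "tsv_files/tsv_transformed/"
    model = "Musixmatch/umberto-commoncrawl-cased-v1"
    batch_size = "64" if use_gan else "16"
    args = ["python", "train.py"]
    if use_gan:
        # Extra flags that appear only in the GANBERT branch.
        args += ["--gan", "--num_hidden_layers_d", "0", "--num_hidden_layers_g", "3",
                 "--noise_size", "100", "--epsilon", "1e-8"]
    args += [
        "--encoder_type", "9",
        "--epochs", "25" if use_gan else "10",
        "--task_def", f"{prefix}{task}_task_def.yml",
        "--data_dir", f"{prefix}musixmatch_cased/",
        "--init_checkpoint", model,
        "--bert_model_type", model,
        "--max_seq_len", "128",
        "--batch_size", batch_size,
        "--batch_size_eval", batch_size,
        "--optimizer", "adamW",
        "--train_datasets", task,
        "--test_datasets", task,
        "--learning_rate", "1e-5" if use_gan else "5e-5",
    ]
    return args


print(" ".join(train_command("DANKMEMES2020", use_gan=False)))
```

The returned list can be passed to `subprocess.run(args, check=True)` from the repository root instead of using the `!python` cell magic.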

### Task SENTIPOLC 1

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
file_loaded5=True

tsv_SENTIPOLC2016_train = 'training_set_sentipolc16.csv'
tsv_SENTIPOLC2016_test = 'test_set_sentipolc16_gold2000.csv'

df_train5 = pd.read_csv(tsv_SENTIPOLC2016_train, delimiter=',')
df_train5 = df_train5[['idtwitter']+['subj']+['text']]

df_test5 = pd.read_csv(tsv_SENTIPOLC2016_test, delimiter=',')
df_test5 = df_test5[['idtwitter']+['subj']+['text']]

# Remove tabs from the text so they cannot be confused with the TSV delimiter.
# A single call is enough: replace() is a no-op when no tab is present.
df_train5 = df_train5.replace(to_replace='\t', value='', regex=True)


#split train dev 
train_dataset5, dev_dataset5 = train_test_split(df_train5, test_size=0.2, shuffle = True)

if number_labeled_examples!=0:

  # Sample the labeled subset; the remaining training rows form the unlabeled pool.
  # Working on a copy avoids mutating train_dataset5 and the pandas SettingWithCopyWarning.
  if number_labeled_examples in (100, 200, 500):
    labeled5 = train_dataset5.sample(n=number_labeled_examples)
    unlabeled5 = train_dataset5[~train_dataset5['idtwitter'].isin(labeled5['idtwitter'])].copy()
  
  #model with or without gan 
  if apply_gan == True:
    print("GANBERT")
    #dataset unlabeled with label -1
    unlabeled5['subj'] = unlabeled5['subj'].replace(0,-1)
    unlabeled5['subj'] = unlabeled5['subj'].replace(1,-1)
    train5 = pd.concat([labeled5, unlabeled5])
    dev5 = dev_dataset5
    print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train5),len(labeled5), len(unlabeled5)))
    print("Size of Dev dataset is {} ".format(len(dev5)))
  else:
    print("BERT-based model, with reduction dataset")
    train5 = labeled5
    dev5 = dev_dataset5
    print("Size of Train dataset is {} ".format(len(labeled5)))
    print("Size of Dev dataset is {} ".format(len(dev5)))

else:
  print("BERT-based model")
  train5 = train_dataset5
  dev5=dev_dataset5
  print("Size of Train dataset is {} ".format(len(train5)))
  print("Size of Dev dataset is {} ".format(len(dev5)))

BERT-based model, with reduction dataset
Size of Train dataset is 200 
Size of Dev dataset is 1482 


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
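
Tabs are stripped from the tweets because the files written in the next cells use the tab as column delimiter. The sketch below shows what the `csv` module does with a text that still contains a tab: it quotes the field, so a reader that simply splits each line on tabs would see four columns instead of three. Whether `prepro_std.py` splits lines this way is an assumption; the example only illustrates the encoding.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
# id, label, and a sentence that still contains a tab character
writer.writerow([1, 1, "ciao\tmondo"])
raw = buffer.getvalue()

print(repr(raw))
# A naive split on the delimiter breaks the quoted sentence in two.
print(raw.strip().split("\t"))
```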


In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

mkdir: cannot create directory ‘tsv_transformed’: File exists
/content/mttransformer/tsv_files/tsv_transformed


In [None]:
#train
name_train = "SENTIPOLC20161_train.tsv"
id_train = train5.idtwitter
label_train = train5.subj
sentence_train = train5.text

#dev
name_dev = "SENTIPOLC20161_dev.tsv"
id_dev = dev5.idtwitter
label_dev = dev5.subj
sentence_dev = dev5.text

#test
name_test = "SENTIPOLC20161_test.tsv"
id_test = df_test5.idtwitter
label_test = df_test5.subj
sentence_test = df_test5.text

#task_def
name_file = 'SENTIPOLC20161_task_def.yml'


# newline='' lets the csv module control line endings, as its documentation recommends
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))


with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))


with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))


task = "SENTIPOLC20161:\n"
  
with open(name_file, 'w') as f:

    f.write(task)
    if apply_gan == True:
      f.write("  data_format: Gan\n")
    else:
      f.write("  data_format: PremiseOnly\n")
    f.write("  enable_san: false\n")
    f.write("  metric_meta:\n")
    f.write("  - F1MAC\n")
    f.write("  - ACC\n")
    f.write("  loss: CeCriterion\n")
    f.write("  n_class: 2\n")
    f.write("  task_type: Classification\n")

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer


In [None]:
if apply_gan == True:
  print("GANBERT")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/SENTIPOLC20161_task_def.yml
else:
  print("BERT-based model")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/SENTIPOLC20161_task_def.yml

BERT-based model
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2021-07-26 09:08:56.169779: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Reading english - 1grams ...
Reading english - 2grams ...
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
07/26/2021 09:09:11 Task SENTIPOLC20161
07/26/2021 09:09:11 tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20161_train.json
07/26/2021 09:09:11 tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20161_dev.json
07/26/2021 09:09:11 tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20161_test.json

In [None]:
if apply_gan == True:
  print("GANBERT")
  !python train.py --gan --num_hidden_layers_d 0 --num_hidden_layers_g 3 --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/SENTIPOLC20161_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets SENTIPOLC20161 --test_datasets SENTIPOLC20161 --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("BERT-based model")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/SENTIPOLC20161_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets SENTIPOLC20161 --test_datasets SENTIPOLC20161 --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

BERT-based model
2021-07-26 09:09:14.512586: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/26/2021 09:09:16 Launching the MT-DNN training
07/26/2021 09:09:16 Loading tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20161_train.json as task 0
Loaded 200 samples out of 200
Loaded 1482 samples out of 1482
Loaded 2000 samples out of 2000
07/26/2021 09:09:16 ####################
07/26/2021 09:09:16 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'tsv_files/tsv_transformed/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tsv_files/tsv_transformed/SENTIPOLC20161_task_def.yml', 'train_datasets': ['SENTIPOLC20161'], 'test_datasets': ['SENTIPOLC20161'], 'glue_format_on': False, 'mkd_

### Task SENTIPOLC 2

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
file_loaded6=True

tsv_SENTIPOLC2016_train = 'training_set_sentipolc16.csv'
tsv_SENTIPOLC2016_test = 'test_set_sentipolc16_gold2000.csv'

df_train6 = pd.read_csv(tsv_SENTIPOLC2016_train, delimiter=',')

# Map (subj, opos, oneg) to a 3-class polarity label:
#   0 = subjective and positive only, 1 = subjective and negative only, 2 = neither positive nor negative.
# All other rows (e.g. both opos and oneg set) are dropped.
# DataFrame.append was removed in pandas 2.0, so the mapping is vectorized with np.select.
subj6 = df_train6['subj'] == 1
pos6 = (df_train6['opos'] == 1) & (df_train6['oneg'] == 0)
neg6 = (df_train6['opos'] == 0) & (df_train6['oneg'] == 1)
none6 = (df_train6['opos'] == 0) & (df_train6['oneg'] == 0)
df = df_train6[['idtwitter', 'text']].copy()
df['polarity'] = np.select([subj6 & pos6, subj6 & neg6, none6], [0, 1, 2], default=-1)
df = df[df['polarity'] != -1][['idtwitter', 'polarity', 'text']].reset_index(drop=True)

df_train6 = df

# Remove tabs from the text so they cannot be confused with the TSV delimiter.
# A single call is enough: replace() is a no-op when no tab is present.
df_train6 = df_train6.replace(to_replace='\t', value='', regex=True)

df_test6 = pd.read_csv(tsv_SENTIPOLC2016_test, delimiter=',')

# Polarity mapping for the test set:
#   0 = subjective and positive only, 1 = subjective and negative only, 2 = neither positive nor negative.
# All other rows are dropped. Vectorized with np.select because DataFrame.append was removed in pandas 2.0.
subj6 = df_test6['subj'] == 1
pos6 = (df_test6['opos'] == 1) & (df_test6['oneg'] == 0)
neg6 = (df_test6['opos'] == 0) & (df_test6['oneg'] == 1)
none6 = (df_test6['opos'] == 0) & (df_test6['oneg'] == 0)
df = df_test6[['idtwitter', 'text']].copy()
df['polarity'] = np.select([subj6 & pos6, subj6 & neg6, none6], [0, 1, 2], default=-1)
df = df[df['polarity'] != -1][['idtwitter', 'polarity', 'text']].reset_index(drop=True)

df_test6 = df

#split train dev
train_dataset6, dev_dataset6 = train_test_split(df_train6, test_size=0.2, shuffle = True)

#reduction
if number_labeled_examples!=0:

  # Sample the labeled subset; the remaining training rows form the unlabeled pool.
  # Working on a copy avoids mutating train_dataset6 and the pandas SettingWithCopyWarning.
  if number_labeled_examples in (100, 200, 500):
    labeled6 = train_dataset6.sample(n=number_labeled_examples)
    unlabeled6 = train_dataset6[~train_dataset6['idtwitter'].isin(labeled6['idtwitter'])].copy()
  
  #model with or without gan 
  if apply_gan == True:
    print("GANBERT")
    #dataset unlabeled with label -1
    unlabeled6['polarity'] = unlabeled6['polarity'].replace(0,-1)
    unlabeled6['polarity'] = unlabeled6['polarity'].replace(1,-1)
    unlabeled6['polarity'] = unlabeled6['polarity'].replace(2,-1)
    train6 = pd.concat([labeled6, unlabeled6])
    dev6 = dev_dataset6
    print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train6),len(labeled6), len(unlabeled6)))
    print("Size of Dev dataset is {} ".format(len(dev6)))
  else:
    print("BERT-based model, with reduction dataset")
    train6 = labeled6
    dev6 = dev_dataset6
    print("Size of Train dataset is {} ".format(len(labeled6)))
    print("Size of Dev dataset is {} ".format(len(dev6)))

else:
  print("BERT-based model")
  train6 = train_dataset6
  dev6=dev_dataset6
  print("Size of Train dataset is {} ".format(len(train6)))
  print("Size of Dev dataset is {} ".format(len(dev6)))

BERT-based model, with reduction dataset
Size of Train dataset is 200 
Size of Dev dataset is 1394 


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

mkdir: cannot create directory ‘tsv_transformed’: File exists
/content/mttransformer/tsv_files/tsv_transformed


In [None]:
#train
name_train = "SENTIPOLC20162_train.tsv"
id_train = train6.idtwitter
label_train = train6.polarity
sentence_train = train6.text

#dev
name_dev = "SENTIPOLC20162_dev.tsv"
id_dev = dev6.idtwitter
label_dev = dev6.polarity
sentence_dev = dev6.text

#test
name_test = "SENTIPOLC20162_test.tsv"
id_test = df_test6.idtwitter
label_test = df_test6.polarity
sentence_test = df_test6.text

#task_def
name_file = 'SENTIPOLC20162_task_def.yml'


# newline='' lets the csv module control line endings, as its documentation recommends
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))


with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))


with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))


task = "SENTIPOLC20162:\n"
  
with open(name_file, 'w') as f:

    f.write(task)
    if apply_gan == True:
      f.write("  data_format: Gan\n")
    else:
      f.write("  data_format: PremiseOnly\n")
    f.write("  enable_san: false\n")
    f.write("  metric_meta:\n")
    f.write("  - F1MAC\n")
    f.write("  - ACC\n")
    f.write("  loss: CeCriterion\n")
    f.write("  n_class: 3\n")
    f.write("  task_type: Classification\n")

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer


In [None]:
if apply_gan == True:
  print("GANBERT")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/SENTIPOLC20162_task_def.yml
else:
  print("BERT-based model")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/SENTIPOLC20162_task_def.yml

BERT-based model
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2021-07-26 09:15:17.563850: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Reading english - 1grams ...
Reading english - 2grams ...
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
07/26/2021 09:15:33 Task SENTIPOLC20162
07/26/2021 09:15:33 tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20162_train.json
07/26/2021 09:15:33 tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20162_dev.json
07/26/2021 09:15:34 tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20162_test.json

In [None]:
if apply_gan == True:
  print("GANBERT")
  !python train.py --gan --num_hidden_layers_d 0 --num_hidden_layers_g 3 --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/SENTIPOLC20162_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets SENTIPOLC20162 --test_datasets SENTIPOLC20162 --learning_rate "1e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("BERT-based model")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/SENTIPOLC20162_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets SENTIPOLC20162 --test_datasets SENTIPOLC20162 --learning_rate "5e-5" #--multi_gpu_on --grad_accumulation_step 4 --fp16 --grad_clipping 0 --global_grad_clipping 1

BERT-based model
2021-07-26 09:15:37.055273: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/26/2021 09:15:39 Launching the MT-DNN training
07/26/2021 09:15:39 Loading tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20162_train.json as task 0
Loaded 200 samples out of 200
Loaded 1394 samples out of 1394
Loaded 1964 samples out of 1964
07/26/2021 09:15:39 ####################
07/26/2021 09:15:39 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'Musixmatch/umberto-commoncrawl-cased-v1', 'data_dir': 'tsv_files/tsv_transformed/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tsv_files/tsv_transformed/SENTIPOLC20162_task_def.yml', 'train_datasets': ['SENTIPOLC20162'], 'test_datasets': ['SENTIPOLC20162'], 'glue_format_on': False, 'mkd_

## **MT model**


> MT-DNN


> MT-GAN



* The first block applies the balancing technique: it balances the labeled data for MT-DNN and the unlabeled data for MT-GAN.
* The second block runs the multi-task training of the chosen model. It has the same structure as the sub-blocks of the single-task model.
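
The balancing step in the first bullet is not spelled out in this notebook. One plausible reading, offered only as an assumption, is that the smaller task datasets are oversampled until every task contributes as many examples as the largest one. The sketch below illustrates that reading; `balance_tasks` is a hypothetical name, and the behavior of the repository's `--apply_balance` option may differ.

```python
import random


def balance_tasks(task_examples, seed=0):
    """Oversample every task (with replacement) up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(examples) for examples in task_examples.values())
    balanced = {}
    for task, examples in task_examples.items():
        # Draw the missing examples at random from the task's own data.
        extra = [rng.choice(examples) for _ in range(target - len(examples))]
        balanced[task] = list(examples) + extra
    return balanced


# Toy example: three tasks of different sizes.
toy = {
    "haspeede-TW": [f"h{i}" for i in range(6)],
    "AMI2018B": [f"a{i}" for i in range(4)],
    "DANKMEMES2020": [f"d{i}" for i in range(2)],
}
balanced = balance_tasks(toy)
print({task: len(examples) for task, examples in balanced.items()})
# {'haspeede-TW': 6, 'AMI2018B': 6, 'DANKMEMES2020': 6}
```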




### Loading files


> If the single-task model was not run, the dataset files have to be loaded here.


> If the files are already loaded but the GAN setting differs from the single-task run (GAN here but not there, or the reverse), the training sets have to be rebuilt here.

If the data are the same as for the single-task model, this block is not needed.
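
When only the GAN setting changes, the labeled and unlabeled subsets can be reused and just the training set has to be rebuilt. A minimal sketch of that rebuild follows; `build_train` is a hypothetical helper. It sets every unlabeled label to -1, which matches the notebook's replacement of 0 and 1 with -1 for binary tasks.

```python
import pandas as pd


def build_train(labeled, unlabeled, label_col, apply_gan):
    """Return the training frame for the GAN or the non-GAN setting."""
    if not apply_gan:
        # Without the GAN only the labeled examples are used.
        return labeled.copy()
    masked = unlabeled.copy()
    masked[label_col] = -1  # -1 marks an example as unlabeled
    return pd.concat([labeled, masked])


labeled = pd.DataFrame({"id": [1, 2], "label": [0, 1], "sentence": ["a", "b"]})
unlabeled = pd.DataFrame({"id": [3, 4, 5], "label": [1, 0, 1], "sentence": ["c", "d", "e"]})

train_gan = build_train(labeled, unlabeled, "label", apply_gan=True)
train_plain = build_train(labeled, unlabeled, "label", apply_gan=False)
print(len(train_gan), len(train_plain))  # 5 2
```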

In [None]:
# Set this only if apply_gan differs from the value used for the single-task model
apply_gan=True

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
tsv_haspeede_train = 'haspeede_TW-train.tsv'
tsv_haspeede_test = 'haspeede_TW-reference.tsv'
tsv_AMI2018_train = 'AMI2018_it_training.tsv'
tsv_AMI2018_test = 'AMI2018_it_testing.tsv'
tsv_DANKMEMES2020_train = 'dankmemes_task2_train.csv'
tsv_DANKMEMES2020_test = 'hate_test.csv'
tsv_SENTIPOLC2016_train = 'training_set_sentipolc16.csv'
tsv_SENTIPOLC2016_test = 'test_set_sentipolc16_gold2000.csv'

In [None]:
%cd tsv_transformed/

/content/mttransformer/tsv_files/tsv_transformed


In [None]:
if file_loaded==False:
  df_train = pd.read_csv(tsv_haspeede_train, delimiter='\t', names=('id','sentence','label'))
  df_train = df_train[['id']+['label']+['sentence']]
  df_test = pd.read_csv(tsv_haspeede_test, delimiter='\t', names=('id','sentence','label'))
  df_test = df_test[['id']+['label']+['sentence']]

  #split train dev
  train_dataset, dev_dataset = train_test_split(df_train, test_size=0.2, shuffle = True)

  #reduction
  if number_labeled_examples!=0:
    # Sample the labeled subset; the remaining training rows form the unlabeled pool.
    # Working on a copy avoids mutating train_dataset and the pandas SettingWithCopyWarning.
    if number_labeled_examples in (100, 200, 500):
      labeled = train_dataset.sample(n=number_labeled_examples)
      unlabeled = train_dataset[~train_dataset['id'].isin(labeled['id'])].copy()
    
    #model with or without gan 
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled['label'] = unlabeled['label'].replace(0,-1)
      unlabeled['label'] = unlabeled['label'].replace(1,-1)
      train = pd.concat([labeled, unlabeled])
      dev = dev_dataset
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train),len(labeled), len(unlabeled)))
      print("Size of Dev dataset is {} ".format(len(dev)))
    else:
      print("MT-DNN, with reduction dataset")
      train = labeled
      dev = dev_dataset
      print("Size of Train dataset is {} ".format(len(labeled)))
      print("Size of Dev dataset is {} ".format(len(dev)))

  else:
    print("MT-DNN")
    train = train_dataset
    dev = dev_dataset
    print("Size of Train dataset is {} ".format(len(train)))
    print("Size of Dev dataset is {} ".format(len(dev)))
else:
    print("no file loaded")
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled['label'] = unlabeled['label'].replace(0,-1)
      unlabeled['label'] = unlabeled['label'].replace(1,-1)
      train = pd.concat([labeled, unlabeled])
      dev = dev_dataset
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train),len(labeled), len(unlabeled)))
      print("Size of Dev dataset is {} ".format(len(dev)))
    else:
      print("MT-DNN, with reduction dataset")
      train = labeled
      dev = dev_dataset
      print("Size of Train dataset is {} ".format(len(labeled)))
      print("Size of Dev dataset is {} ".format(len(dev)))

no file loaded
MT-GAN
Size of Train dataset is 2400, with 200 labeled and 2200 not labeled 
Size of Dev dataset is 600 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [None]:
#train
name_train = "haspeede-TW_train.tsv"
id_train = train.id 
label_train = train.label
sentence_train = train.sentence

#dev
name_dev = "haspeede-TW_dev.tsv"
id_dev = dev.id
label_dev = dev.label
sentence_dev = dev.sentence

#test
name_test = "haspeede-TW_test.tsv"
id_test = df_test.id
label_test = df_test.label
sentence_test = df_test.sentence

# newline='' lets the csv module control line endings, as its documentation recommends
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))

with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))

with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))
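
The same three-column file (id, label, sentence) is written for every split of every task. A minimal sketch of a shared writer is shown below, with a round trip through `csv.reader` to check the result; `write_tsv` is a hypothetical helper, and the explicit UTF-8 encoding is an addition that the original cells do not set.

```python
import csv
import os
import tempfile


def write_tsv(path, ids, labels, sentences):
    """Write one row per example as id<TAB>label<TAB>sentence."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(zip(ids, labels, sentences))


# Write to a temporary directory so the sketch does not touch the repository.
path = os.path.join(tempfile.mkdtemp(), "toy_train.tsv")
write_tsv(path, [1, 2], [0, -1], ["prima frase", "seconda frase"])

with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))
print(rows)  # [['1', '0', 'prima frase'], ['2', '-1', 'seconda frase']]
```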

In [None]:
if file_loaded2==False:
  df_train2 = pd.read_csv(tsv_AMI2018_train, delimiter='\t')
  df_train2 = df_train2[['id']+['misogynous']+['text']]
  df_test2 = pd.read_csv(tsv_AMI2018_test, delimiter='\t')
  df_test2 = df_test2[['id']+['misogynous']+['text']]

  #split train dev
  train_dataset2, dev_dataset2 = train_test_split(df_train2, test_size=0.2, shuffle = True)

  #reduction
  if number_labeled_examples!=0:
    # Sample the labeled subset; the remaining training rows form the unlabeled pool.
    # Working on a copy avoids mutating train_dataset2 and the pandas SettingWithCopyWarning.
    if number_labeled_examples in (100, 200, 500):
      labeled2 = train_dataset2.sample(n=number_labeled_examples)
      unlabeled2 = train_dataset2[~train_dataset2['id'].isin(labeled2['id'])].copy()
    
    #model with or without gan 
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled2['misogynous'] = unlabeled2['misogynous'].replace(0,-1)
      unlabeled2['misogynous'] = unlabeled2['misogynous'].replace(1,-1)
      train2 = pd.concat([labeled2, unlabeled2])
      dev2 = dev_dataset2
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train2),len(labeled2), len(unlabeled2)))
      print("Size of Dev dataset is {} ".format(len(dev2)))
    else:
      print("MT-DNN, with reduction dataset")
      train2 = labeled2
      dev2 = dev_dataset2
      print("Size of Train dataset is {} ".format(len(labeled2)))
      print("Size of Dev dataset is {} ".format(len(dev2)))

  else:
    print("MT-DNN")
    train2 = train_dataset2
    dev2 = dev_dataset2
    print("Size of Train dataset is {} ".format(len(train2)))
    print("Size of Dev dataset is {} ".format(len(dev2)))
else:
  print("no file loaded")
  if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled2['misogynous'] = unlabeled2['misogynous'].replace(0,-1)
      unlabeled2['misogynous'] = unlabeled2['misogynous'].replace(1,-1)
      train2 = pd.concat([labeled2, unlabeled2])
      dev2 = dev_dataset2
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train2),len(labeled2), len(unlabeled2)))
      print("Size of Dev dataset is {} ".format(len(dev2)))
  else:
    print("MT-DNN, with reduction dataset")
    train2 = labeled2
    dev2 = dev_dataset2
    print("Size of Train dataset is {} ".format(len(labeled2)))
    print("Size of Dev dataset is {} ".format(len(dev2)))

no file loaded
MT-GAN
Size of Train dataset is 3200, with 200 labeled and 3000 not labeled 
Size of Dev dataset is 800 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
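
The message above is pandas' `SettingWithCopyWarning`. It is typically raised when a column is assigned on a DataFrame that pandas still tracks as a slice of another one, which is the case for the output of `train_test_split`. A minimal sketch on toy data follows, assuming the column names of this task and a hypothetical `random_state` (the notebook cells do not fix a seed): the labeled and unlabeled parts are built as explicit copies, and the unlabeled rows get the label -1.

```python
import pandas as pd

# Toy stand-in for the training split of a binary task (illustrative values only).
train_dataset = pd.DataFrame({
    "id": range(10),
    "misogynous": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "text": [f"tweet {i}" for i in range(10)],
})

number_labeled_examples = 4

# Labeled subset: a random sample of the requested size (the seed is an assumption).
labeled = train_dataset.sample(n=number_labeled_examples, random_state=0)

# Unlabeled subset: every remaining row, taken as an explicit copy so that
# assigning to it does not modify train_dataset.
unlabeled = train_dataset[~train_dataset["id"].isin(labeled["id"])].copy()

# Unlabeled examples are marked with the label -1.
unlabeled["misogynous"] = -1

train = pd.concat([labeled, unlabeled])
print(len(train), len(labeled), len(unlabeled))
```

Because `unlabeled` is a copy, the original labels in `train_dataset` stay intact and can still be used for evaluation or for a different labeled/unlabeled split.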


In [None]:
#train
name_train = "AMI2018A_train.tsv"
id_train = train2.id
label_train = train2.misogynous
sentence_train = train2.text

#dev
name_dev = "AMI2018A_dev.tsv"
id_dev = dev2.id
label_dev = dev2.misogynous
sentence_dev = dev2.text

#test
name_test = "AMI2018A_test.tsv"
id_test = df_test2.id
label_test = df_test2.misogynous
sentence_test = df_test2.text


# Write each split as a headerless TSV with columns: id, label, sentence
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))

with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))

with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))

In [None]:
if file_loaded3==False:
  # Keep only misogynous examples from the three retained categories and map each
  # category name to an integer label ('dominance' and 'derailing' are excluded).
  category_to_label = {'stereotype': 0, 'sexual_harassment': 1, 'discredit': 2}

  def select_categories(df_raw):
    keep = (df_raw['misogynous'] == 1) & df_raw['misogyny_category'].isin(list(category_to_label))
    df_sel = df_raw.loc[keep, ['id', 'misogyny_category', 'text']].reset_index(drop=True).copy()
    df_sel['misogyny_category'] = df_sel['misogyny_category'].map(category_to_label)
    return df_sel

  df_train3 = select_categories(pd.read_csv(tsv_AMI2018_train, delimiter='\t'))
  df_test3 = select_categories(pd.read_csv(tsv_AMI2018_test, delimiter='\t'))

  #split train dev
  train_dataset3, dev_dataset3 = train_test_split(df_train3, test_size=0.2, shuffle = True)

  #reduction
  if number_labeled_examples!=0:
    # Sample the labeled subset; every other training row becomes unlabeled.
    # unlabeled3 is built as an explicit copy, so relabeling it below
    # does not modify train_dataset3.
    labeled3 = train_dataset3.sample(n=number_labeled_examples)
    unlabeled3 = train_dataset3[~train_dataset3['id'].isin(labeled3['id'])].copy()

    #model with or without gan 
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(0,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(1,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(2,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(3,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(4,-1)
      train3 = pd.concat([labeled3, unlabeled3])
      dev3 = dev_dataset3
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train3),len(labeled3), len(unlabeled3)))
      print("Size of Dev dataset is {} ".format(len(dev3)))
    else:
      print("MT-DNN, with reduction dataset")
      train3 = labeled3
      dev3 = dev_dataset3
      print("Size of Train dataset is {} ".format(len(labeled3)))
      print("Size of Dev dataset is {} ".format(len(dev3)))

  else:
    print("MT-DNN")
    train3 = train_dataset3
    dev3=dev_dataset3
    print("Size of Train dataset is {} ".format(len(train3)))
    print("Size of Dev dataset is {} ".format(len(dev3)))
else:
  print("no file loaded")
  if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(0,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(1,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(2,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(3,-1)
      unlabeled3['misogyny_category'] = unlabeled3['misogyny_category'].replace(4,-1)
      train3 = pd.concat([labeled3, unlabeled3])
      dev3 = dev_dataset3
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train3),len(labeled3), len(unlabeled3)))
      print("Size of Dev dataset is {} ".format(len(dev3)))
  else:
    print("MT-DNN, with reduction dataset")
    train3 = labeled3
    dev3 = dev_dataset3
    print("Size of Train dataset is {} ".format(len(labeled3)))
    print("Size of Dev dataset is {} ".format(len(dev3)))

no file loaded
MT-GAN
Size of Train dataset is 1386, with 200 labeled and 1186 not labeled 
Size of Dev dataset is 347 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [None]:
#train
name_train = "AMI2018B_train.tsv"
id_train = train3.id
label_train = train3.misogyny_category
sentence_train = train3.text

#dev
name_dev = "AMI2018B_dev.tsv"
id_dev = dev3.id
label_dev = dev3.misogyny_category
sentence_dev = dev3.text

#test
name_test = "AMI2018B_test.tsv"
id_test = df_test3.id
label_test = df_test3.misogyny_category
sentence_test = df_test3.text


# Write each split as a headerless TSV with columns: id, label, sentence
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))

with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))

with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))

In [None]:
if file_loaded4==False:
  df_train4 = pd.read_csv(tsv_DANKMEMES2020_train, delimiter=',')
  df_train4 = df_train4[['File']+['Hate Speech']+['Text']]
  df_test4 = pd.read_csv(tsv_DANKMEMES2020_test, delimiter=',')
  df_test4 = df_test4[['File']+['Hate Speech']+['Text']]


  #split train dev
  train_dataset4, dev_dataset4 = train_test_split(df_train4, test_size=0.2, shuffle = True)

  #reduction
  if number_labeled_examples!=0:

    # Sample the labeled subset; every other training row becomes unlabeled.
    # unlabeled4 is built as an explicit copy, so relabeling it below
    # does not modify train_dataset4.
    labeled4 = train_dataset4.sample(n=number_labeled_examples)
    unlabeled4 = train_dataset4[~train_dataset4['File'].isin(labeled4['File'])].copy()

    #model with or without gan 
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled4['Hate Speech'] = unlabeled4['Hate Speech'].replace(0,-1)
      unlabeled4['Hate Speech'] = unlabeled4['Hate Speech'].replace(1,-1)
      train4 = pd.concat([labeled4, unlabeled4])
      dev4 = dev_dataset4
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train4),len(labeled4), len(unlabeled4)))
      print("Size of Dev dataset is {} ".format(len(dev4)))
    else:
      print("MT-DNN, with reduction dataset")
      train4 = labeled4
      dev4 = dev_dataset4
      print("Size of Train dataset is {} ".format(len(labeled4)))
      print("Size of Dev dataset is {} ".format(len(dev4)))


  else:
    print("MT-DNN")
    train4 = train_dataset4
    dev4=dev_dataset4
    print("Size of Train dataset is {} ".format(len(train4)))
    print("Size of Dev dataset is {} ".format(len(dev4)))
else:
  print("no file loaded")
  if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled4['Hate Speech'] = unlabeled4['Hate Speech'].replace(0,-1)
      unlabeled4['Hate Speech'] = unlabeled4['Hate Speech'].replace(1,-1)
      train4 = pd.concat([labeled4, unlabeled4])
      dev4 = dev_dataset4
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train4),len(labeled4), len(unlabeled4)))
      print("Size of Dev dataset is {} ".format(len(dev4)))
  else:
    print("MT-DNN, with reduction dataset")
    train4 = labeled4
    dev4 = dev_dataset4
    print("Size of Train dataset is {} ".format(len(labeled4)))
    print("Size of Dev dataset is {} ".format(len(dev4)))

no file loaded
MT-GAN
Size of Train dataset is 640, with 200 labeled and 440 not labeled 
Size of Dev dataset is 160 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [None]:
#train
name_train = "DANKMEMES2020_train.tsv"
id_train = train4.File
label_train = train4["Hate Speech"]
sentence_train = train4.Text

#dev
name_dev = "DANKMEMES2020_dev.tsv"
id_dev = dev4.File
label_dev = dev4["Hate Speech"]
sentence_dev = dev4.Text

#test
name_test = "DANKMEMES2020_test.tsv"
id_test = df_test4.File
label_test = df_test4["Hate Speech"]
sentence_test = df_test4.Text


# Write each split as a headerless TSV with columns: id, label, sentence
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))

with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))

with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))

In [None]:
if file_loaded5==False:
  df_train5 = pd.read_csv(tsv_SENTIPOLC2016_train, delimiter=',')
  df_train5 = df_train5[['idtwitter']+['subj']+['text']]

  df_test5 = pd.read_csv(tsv_SENTIPOLC2016_test, delimiter=',')
  df_test5 = df_test5[['idtwitter']+['subj']+['text']]

  # Strip tab characters so the text cannot clash with the TSV delimiter
  # (a single DataFrame-wide replace; no per-row loop is needed)
  df_train5 = df_train5.replace(to_replace='\t', value='', regex=True)


  #split train dev 
  train_dataset5, dev_dataset5 = train_test_split(df_train5, test_size=0.2, shuffle = True)

  if number_labeled_examples!=0:

    # Sample the labeled subset; every other training row becomes unlabeled.
    # unlabeled5 is built as an explicit copy, so relabeling it below
    # does not modify train_dataset5.
    labeled5 = train_dataset5.sample(n=number_labeled_examples)
    unlabeled5 = train_dataset5[~train_dataset5['idtwitter'].isin(labeled5['idtwitter'])].copy()
    
    #model with or without gan 
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled5['subj'] = unlabeled5['subj'].replace(0,-1)
      unlabeled5['subj'] = unlabeled5['subj'].replace(1,-1)
      train5 = pd.concat([labeled5, unlabeled5])
      dev5 = dev_dataset5
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train5),len(labeled5), len(unlabeled5)))
      print("Size of Dev dataset is {} ".format(len(dev5)))
    else:
      print("MT-DNN, with reduction dataset")
      train5 = labeled5
      dev5 = dev_dataset5
      print("Size of Train dataset is {} ".format(len(labeled5)))
      print("Size of Dev dataset is {} ".format(len(dev5)))

  else:
    print("MT-DNN")
    train5 = train_dataset5
    dev5=dev_dataset5
    print("Size of Train dataset is {} ".format(len(train5)))
    print("Size of Dev dataset is {} ".format(len(dev5)))
else:
  print("no file loaded")
  if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled5['subj'] = unlabeled5['subj'].replace(0,-1)
      unlabeled5['subj'] = unlabeled5['subj'].replace(1,-1)
      train5 = pd.concat([labeled5, unlabeled5])
      dev5 = dev_dataset5
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train5),len(labeled5), len(unlabeled5)))
      print("Size of Dev dataset is {} ".format(len(dev5)))
  else:
    print("MT-DNN, with reduction dataset")
    train5 = labeled5
    dev5 = dev_dataset5
    print("Size of Train dataset is {} ".format(len(labeled5)))
    print("Size of Dev dataset is {} ".format(len(dev5)))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


no file loaded
MT-GAN
Size of Train dataset is 5928, with 200 labeled and 5728 not labeled 
Size of Dev dataset is 1482 
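
Tabs are stripped from the tweets in the cell above because the output files use a tab as delimiter. The sketch below, which relies only on the standard `csv` module and an in-memory buffer, shows what happens otherwise: `csv.writer` quotes a field that contains the delimiter, so `csv.reader` still recovers three columns, whereas a reader that simply splits each line on tabs (an assumption about downstream code, which is not shown here) would see four.

```python
import csv
import io

row = (42, 1, "text with\ta tab")

buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerow(row)
line = buffer.getvalue()

# A naive split on the delimiter breaks the sentence into two fields.
naive_fields = line.rstrip("\r\n").split("\t")

# The csv module understands the quoting it produced and restores three fields.
parsed_fields = next(csv.reader(io.StringIO(line), delimiter="\t"))

# Removing the tab beforehand keeps both readers in agreement.
clean_row = (row[0], row[1], row[2].replace("\t", ""))
clean_buffer = io.StringIO()
csv.writer(clean_buffer, delimiter="\t").writerow(clean_row)
clean_fields = clean_buffer.getvalue().rstrip("\r\n").split("\t")

print(len(naive_fields), len(parsed_fields), len(clean_fields))
```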




In [None]:
#train
name_train = "SENTIPOLC20161_train.tsv"
id_train = train5.idtwitter
label_train = train5.subj
sentence_train = train5.text

#dev
name_dev = "SENTIPOLC20161_dev.tsv"
id_dev = dev5.idtwitter
label_dev = dev5.subj
sentence_dev = dev5.text

#test
name_test = "SENTIPOLC20161_test.tsv"
id_test = df_test5.idtwitter
label_test = df_test5.subj
sentence_test = df_test5.text


# Write each split as a headerless TSV with columns: id, label, sentence
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))

with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))

with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))

In [None]:
if file_loaded6==False:
  # Map the (subj, opos, oneg) annotations to a 3-class polarity label:
  #   0 when subj=1, opos=1, oneg=0
  #   1 when subj=1, opos=0, oneg=1
  #   2 when opos=0, oneg=0 (whatever the value of subj)
  # Every other combination is discarded.
  def to_polarity(df_raw):
    subjective = df_raw['subj'] == 1
    conditions = [
      subjective & (df_raw['opos'] == 1) & (df_raw['oneg'] == 0),
      subjective & (df_raw['opos'] == 0) & (df_raw['oneg'] == 1),
      (df_raw['opos'] == 0) & (df_raw['oneg'] == 0),
    ]
    # -1 is only a temporary marker for rows that match no condition
    polarity = np.select(conditions, [0, 1, 2], default=-1)
    df_pol = pd.DataFrame({'idtwitter': df_raw['idtwitter'], 'polarity': polarity, 'text': df_raw['text']})
    return df_pol[df_pol['polarity'] != -1].reset_index(drop=True).copy()

  df_train6 = to_polarity(pd.read_csv(tsv_SENTIPOLC2016_train, delimiter=','))

  # Strip tab characters so the text cannot clash with the TSV delimiter
  df_train6 = df_train6.replace(to_replace='\t', value='', regex=True)

  df_test6 = to_polarity(pd.read_csv(tsv_SENTIPOLC2016_test, delimiter=','))

  #split train dev
  train_dataset6, dev_dataset6 = train_test_split(df_train6, test_size=0.2, shuffle = True)

  #reduction
  if number_labeled_examples!=0:

    # Sample the labeled subset; every other training row becomes unlabeled.
    # unlabeled6 is built as an explicit copy, so relabeling it below
    # does not modify train_dataset6.
    labeled6 = train_dataset6.sample(n=number_labeled_examples)
    unlabeled6 = train_dataset6[~train_dataset6['idtwitter'].isin(labeled6['idtwitter'])].copy()
    
    #model with or without gan 
    if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled6['polarity'] = unlabeled6['polarity'].replace(0,-1)
      unlabeled6['polarity'] = unlabeled6['polarity'].replace(1,-1)
      unlabeled6['polarity'] = unlabeled6['polarity'].replace(2,-1)
      train6 = pd.concat([labeled6, unlabeled6])
      dev6 = dev_dataset6
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train6),len(labeled6), len(unlabeled6)))
      print("Size of Dev dataset is {} ".format(len(dev6)))
    else:
      print("MT-DNN, with reduction dataset")
      train6 = labeled6
      dev6 = dev_dataset6
      print("Size of Train dataset is {} ".format(len(labeled6)))
      print("Size of Dev dataset is {} ".format(len(dev6)))

  else:
    print("MT-DNN")
    train6 = train_dataset6
    dev6=dev_dataset6
    print("Size of Train dataset is {} ".format(len(train6)))
    print("Size of Dev dataset is {} ".format(len(dev6)))
else:
  print("no file loaded")
  if apply_gan == True:
      print("MT-GAN")
      #dataset unlabeled with label -1
      unlabeled6['polarity'] = unlabeled6['polarity'].replace(0,-1)
      unlabeled6['polarity'] = unlabeled6['polarity'].replace(1,-1)
      unlabeled6['polarity'] = unlabeled6['polarity'].replace(2,-1)
      train6 = pd.concat([labeled6, unlabeled6])
      dev6 = dev_dataset6
      print("Size of Train dataset is {}, with {} labeled and {} not labeled ".format(len(train6),len(labeled6), len(unlabeled6)))
      print("Size of Dev dataset is {} ".format(len(dev6)))
  else:
    print("MT-DNN, with reduction dataset")
    train6 = labeled6
    dev6 = dev_dataset6
    print("Size of Train dataset is {} ".format(len(labeled6)))
    print("Size of Dev dataset is {} ".format(len(dev6)))

no file loaded
MT-GAN
Size of Train dataset is 5576, with 200 labeled and 5376 not labeled 
Size of Dev dataset is 1394 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
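
The polarity label of this task is derived from three binary annotations (`subj`, `opos`, `oneg`). As a compact restatement of the rule applied in the cell above, the sketch below enumerates all eight combinations. The function name `polarity_label` is hypothetical, `None` stands for rows the notebook discards, and reading `opos`/`oneg` as positive/negative polarity flags is an interpretation of the column names rather than something stated in the notebook.

```python
from itertools import product


def polarity_label(subj, opos, oneg):
    """Return the class 0, 1 or 2, or None when the row is discarded."""
    if opos == 0 and oneg == 0:
        return 2  # no polarity flag set, whether or not the tweet is subjective
    if subj == 1 and opos == 1 and oneg == 0:
        return 0  # only the positive flag set on a subjective tweet
    if subj == 1 and opos == 0 and oneg == 1:
        return 1  # only the negative flag set on a subjective tweet
    return None  # both flags set, or a flag set on a non-subjective tweet


table = {combo: polarity_label(*combo) for combo in product([0, 1], repeat=3)}
for (subj, opos, oneg), label in table.items():
    print(f"subj={subj} opos={opos} oneg={oneg} -> {label}")
```

Four of the eight combinations are dropped, which is why the training set of this task is smaller than that of the subjectivity task built from the same file.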


In [None]:
#train
name_train = "SENTIPOLC20162_train.tsv"
id_train = train6.idtwitter
label_train = train6.polarity
sentence_train = train6.text

#dev
name_dev = "SENTIPOLC20162_dev.tsv"
id_dev = dev6.idtwitter
label_dev = dev6.polarity
sentence_dev = dev6.text

#test
name_test = "SENTIPOLC20162_test.tsv"
id_test = df_test6.idtwitter
label_test = df_test6.polarity
sentence_test = df_test6.text


# Write each split as a headerless TSV with columns: id, label, sentence
with open(name_train, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_train,label_train,sentence_train))

with open(name_dev, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_dev,label_dev,sentence_dev))

with open(name_test, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(zip(id_test,label_test,sentence_test))

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer


### Balancing for:

* MT-DNN, trained on the total dataset of each task
* MT-GAN
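
Balancing here means oversampling: the data of each task is extended to the size of the largest set (the unlabeled sets for MT-GAN, the full training sets for MT-DNN) by repeating its rows in order and wrapping around to the first row when the end is reached. A minimal sketch on toy data, assuming a hypothetical helper name `oversample_cyclic` and using positional indexing instead of row-by-row appends:

```python
import numpy as np
import pandas as pd


def oversample_cyclic(df, target_len):
    """Repeat the rows of df in order until it has target_len rows."""
    if len(df) == 0 or len(df) >= target_len:
        return df
    # Row positions 0, 1, ..., n-1, 0, 1, ... up to target_len entries.
    positions = np.arange(target_len) % len(df)
    return df.iloc[positions].reset_index(drop=True)


# Toy tasks of different sizes (illustrative values only).
tasks = {
    "task_a": pd.DataFrame({"id": [1, 2, 3, 4, 5], "label": [0, 1, 0, 1, 0]}),
    "task_b": pd.DataFrame({"id": [10, 11], "label": [1, 0]}),
}

max_len = max(len(df) for df in tasks.values())
balanced = {name: oversample_cyclic(df, max_len) for name, df in tasks.items()}

print({name: len(df) for name, df in balanced.items()})
print(balanced["task_b"]["id"].tolist())
```

Selecting all positions in one `iloc` call keeps the cost linear in the target length, whereas appending one row at a time copies the growing DataFrame on every step.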

In [None]:
if apply_gan== True:
  print("MT-GAN")
else:
  print("MT-DNN")
  # Without the GAN there is no unlabeled data: balance the full training sets instead
  unlabeled=train
  unlabeled2=train2
  unlabeled3=train3
  unlabeled4=train4
  unlabeled5=train5
  unlabeled6=train6

# Size of the largest set; smaller sets are oversampled up to this length
max_train_un = max(len(unlabeled), len(unlabeled2), len(unlabeled3), len(unlabeled4), len(unlabeled5), len(unlabeled6))
print(max_train_un)

5828


In [None]:
#double dataset

def oversample_cyclic(df_in, target_len):
  # Repeat the rows of df_in in order (0, 1, ..., n-1, 0, 1, ...) until it has target_len rows.
  # DataFrames that are empty or already long enough are returned unchanged.
  if len(df_in) == 0 or len(df_in) >= target_len:
    return df_in
  positions = np.arange(target_len) % len(df_in)
  return df_in.iloc[positions].reset_index(drop=True)

# Bring every task up to the size of the largest one
unlabeled = oversample_cyclic(unlabeled, max_train_un)
unlabeled2 = oversample_cyclic(unlabeled2, max_train_un)
unlabeled3 = oversample_cyclic(unlabeled3, max_train_un)
unlabeled4 = oversample_cyclic(unlabeled4, max_train_un)
unlabeled5 = oversample_cyclic(unlabeled5, max_train_un)
unlabeled6 = oversample_cyclic(unlabeled6, max_train_un)

if apply_gan==True:
  # MT-GAN: labeled examples followed by the balanced unlabeled ones
  train = pd.concat([labeled, unlabeled])
  train2 = pd.concat([labeled2, unlabeled2])
  train3 = pd.concat([labeled3, unlabeled3])
  train4 = pd.concat([labeled4, unlabeled4])
  train5 = pd.concat([labeled5, unlabeled5])
  train6 = pd.concat([labeled6, unlabeled6])
else:
  # MT-DNN: the balanced sets are the training sets
  train = unlabeled
  train2 = unlabeled2
  train3 = unlabeled3
  train4 = unlabeled4
  train5 = unlabeled5
  train6 = unlabeled6


In [None]:
%cd tsv_files/

In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

In [None]:
#train
name_train = "haspeede-TW_train.tsv"
id_train = train.id
label_train = train.label
sentence_train = train.sentence

#dev
name_dev = "haspeede-TW_dev.tsv"
id_dev = dev.id
label_dev = dev.label
sentence_dev = dev.sentence

#test
name_test = "haspeede-TW_test.tsv"
id_test = df_test.id
label_test = df_test.label
sentence_test = df_test.sentence


# Write each split as a headerless TSV with columns: id, label, sentence.
splits = [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]
for name, ids, labels, sentences in splits:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))
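
The cell above writes each split as a headerless, tab-separated file with three columns: id, label, sentence. As a hedged sanity check (not part of the original pipeline), a file of that shape can be read back with the same `csv` settings to confirm that every row has three fields. `check_tsv`, the demo path, and the sample rows are hypothetical and used only for illustration.

```python
import csv
import os
import tempfile

def check_tsv(path, n_columns=3):
    """Return the number of rows, raising if any row has an unexpected width."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f, delimiter='\t'))
    for line_no, row in enumerate(rows, start=1):
        if len(row) != n_columns:
            raise ValueError(f"{path}: row {line_no} has {len(row)} fields")
    return len(rows)

# Illustrative data only; the second sentence contains a tab on purpose.
sample = [("1", "0", "plain sentence"), ("2", "1", "sentence with\ta tab")]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_train.tsv")
with open(demo_path, 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(sample)

n_rows = check_tsv(demo_path)
print(n_rows)
```

Note that `csv.writer` quotes any field containing a tab, a quote character, or a newline. A reader that splits each line on tabs directly would see such rows differently; whether that matters depends on how `prepro_std.py` parses the files, which this notebook does not show.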

In [None]:
#train
name_train = "AMI2018A_train.tsv"
id_train = train2.id
label_train = train2.misogynous
sentence_train = train2.text

#dev
name_dev = "AMI2018A_dev.tsv"
id_dev = dev2.id
label_dev = dev2.misogynous
sentence_dev = dev2.text

#test
name_test = "AMI2018A_test.tsv"
id_test = df_test2.id
label_test = df_test2.misogynous
sentence_test = df_test2.text


# Write each split as a headerless TSV with columns: id, label, sentence.
splits = [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]
for name, ids, labels, sentences in splits:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

In [None]:
#train
name_train = "AMI2018B_train.tsv"
id_train = train3.id
label_train = train3.misogyny_category
sentence_train = train3.text

#dev
name_dev = "AMI2018B_dev.tsv"
id_dev = dev3.id
label_dev = dev3.misogyny_category
sentence_dev = dev3.text

#test
name_test = "AMI2018B_test.tsv"
id_test = df_test3.id
label_test = df_test3.misogyny_category
sentence_test = df_test3.text


# Write each split as a headerless TSV with columns: id, label, sentence.
splits = [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]
for name, ids, labels, sentences in splits:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

In [None]:
#train
name_train = "DANKMEMES2020_train.tsv"
id_train = train4.File
label_train = train4["Hate Speech"]
sentence_train = train4.Text

#dev
name_dev = "DANKMEMES2020_dev.tsv"
id_dev = dev4.File
label_dev = dev4["Hate Speech"]
sentence_dev = dev4.Text

#test
name_test = "DANKMEMES2020_test.tsv"
id_test = df_test4.File
label_test = df_test4["Hate Speech"]
sentence_test = df_test4.Text


# Write each split as a headerless TSV with columns: id, label, sentence.
splits = [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]
for name, ids, labels, sentences in splits:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

In [None]:
#train
name_train = "SENTIPOLC20161_train.tsv"
id_train = train5.idtwitter
label_train = train5.subj
sentence_train = train5.text

#dev
name_dev = "SENTIPOLC20161_dev.tsv"
id_dev = dev5.idtwitter
label_dev = dev5.subj
sentence_dev = dev5.text

#test
name_test = "SENTIPOLC20161_test.tsv"
id_test = df_test5.idtwitter
label_test = df_test5.subj
sentence_test = df_test5.text


# Write each split as a headerless TSV with columns: id, label, sentence.
splits = [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]
for name, ids, labels, sentences in splits:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

In [None]:
#train
name_train = "SENTIPOLC20162_train.tsv"
id_train = train6.idtwitter
label_train = train6.polarity
sentence_train = train6.text

#dev
name_dev = "SENTIPOLC20162_dev.tsv"
id_dev = dev6.idtwitter
label_dev = dev6.polarity
sentence_dev = dev6.text

#test
name_test = "SENTIPOLC20162_test.tsv"
id_test = df_test6.idtwitter
label_test = df_test6.polarity
sentence_test = df_test6.text


# Write each split as a headerless TSV with columns: id, label, sentence.
splits = [
    (name_train, id_train, label_train, sentence_train),
    (name_dev, id_dev, label_dev, sentence_dev),
    (name_test, id_test, label_test, sentence_test),
]
for name, ids, labels, sentences in splits:
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(zip(ids, labels, sentences))

In [None]:
%cd ..

### MTL

In [None]:
%cd tsv_files/

/content/mttransformer/tsv_files


In [None]:
!mkdir tsv_transformed
%cd tsv_transformed/

mkdir: cannot create directory ‘tsv_transformed’: File exists
/content/mttransformer/tsv_files/tsv_transformed


**Onboard your task into training!**
1. Add the configuration for your task to the overall task definition file shared by all tasks.

In [None]:
#task_def
name_file = "haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml"

# Number of classes per task, listed in the order the tasks are written.
n_classes = {
    "haspeede-TW": 2,
    "AMI2018A": 2,
    "AMI2018B": 3,
    "DANKMEMES2020": 2,
    "SENTIPOLC20161": 2,
    "SENTIPOLC20162": 3,
}

# MT-GAN uses the Gan data format; plain MT-DNN uses PremiseOnly.
data_format = "Gan" if apply_gan else "PremiseOnly"

with open(name_file, 'w') as f:
    for task, n_class in n_classes.items():
        f.write(f"{task}:\n")
        f.write(f"  data_format: {data_format}\n")
        f.write("  enable_san: false\n")
        f.write("  metric_meta:\n")
        f.write("  - F1MAC\n")
        f.write("  - ACC\n")
        f.write("  loss: CeCriterion\n")
        f.write(f"  n_class: {n_class}\n")
        f.write("  task_type: Classification\n")

In [None]:
%cd ..
%cd ..

/content/mttransformer/tsv_files
/content/mttransformer
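
Before preprocessing, it can help to confirm that every task named in the task definition has its three TSV files in the data folder. The sketch below is not part of the repository: `missing_tsv_files` is a hypothetical helper, and the `<task>_<split>.tsv` pattern is taken from the file names written by the cells earlier in this notebook.

```python
import os
import tempfile

def missing_tsv_files(root_dir, tasks, splits=("train", "dev", "test")):
    """List the expected <task>_<split>.tsv files that are absent from root_dir."""
    expected = [f"{task}_{split}.tsv" for task in tasks for split in splits]
    return [name for name in expected
            if not os.path.isfile(os.path.join(root_dir, name))]

# Demonstration on a throwaway directory with one file deliberately left out.
demo_dir = tempfile.mkdtemp()
for name in ("AMI2018A_train.tsv", "AMI2018A_dev.tsv"):
    open(os.path.join(demo_dir, name), 'w').close()

missing = missing_tsv_files(demo_dir, ["AMI2018A"])
print(missing)  # ['AMI2018A_test.tsv']
```

In this notebook the equivalent call would pass `"tsv_files/tsv_transformed/"` as `root_dir`, run from the repository root, together with the six task names.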


In [None]:
if apply_gan == True:
  print("MT-GAN")
  !python prepro_std.py --gan --apply_balance --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml
else:
  print("MT-DNN")
  !python prepro_std.py --model Musixmatch/umberto-commoncrawl-cased-v1 --root_dir tsv_files/tsv_transformed/ --task_def tsv_files/tsv_transformed/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml

MT-GAN
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2021-07-26 10:05:53.520100: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
  self.tok = re.compile(r"({})".format("|".join(pipeline)))
Reading english - 1grams ...
Reading english - 2grams ...
  regexes = {k.lower(): re.compile(self.expressions[k]) for k, v in
Reading english - 1grams ...
07/26/2021 10:06:05 Task haspeede-TW
07/26/2021 10:06:05 tsv_files/tsv_transformed/musixmatch_cased/haspeede-TW_train.json
labeled
200
unlabeled
2200
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
07/26/2021 10:06:07

2. To train your new task jointly with existing tasks, append your task name and test-set prefix to the train.py args: "--train_datasets EXISTING_TASKS,YOUR_NEW_TASK --test_datasets EXISTING_TASK_TEST_SETS,YOUR_NEW_TASK_SETS". For single-task fine-tuning, leave only your new task in the args.
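
Because the same comma-separated list appears in both flags, one way to keep them in sync is to build them from a single Python list. This is a minimal sketch, not code from the repository: `dataset_flags` is a hypothetical helper, and only the flag names `--train_datasets` and `--test_datasets` come from the step above.

```python
def dataset_flags(train_tasks, test_tasks=None):
    """Build the two dataset flags; test sets default to the training tasks."""
    test_tasks = train_tasks if test_tasks is None else test_tasks
    return (f"--train_datasets {','.join(train_tasks)} "
            f"--test_datasets {','.join(test_tasks)}")

# Joint training: existing tasks plus the new one.
existing = ["haspeede-TW", "AMI2018A"]
flags = dataset_flags(existing + ["YOUR_NEW_TASK"])
print(flags)

# Single-task fine-tuning: only the new task remains in both lists.
single = dataset_flags(["YOUR_NEW_TASK"])
print(single)
```

In a notebook cell, IPython expands `{flags}` inside a `!` command, so the string can be appended to the `train.py` call instead of being typed twice.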

In [None]:
if apply_gan == True:
  print("MT-GAN")
  !python train.py --gan --noise_size 100 --epsilon 1e-8 --encoder_type 9 --epochs 25 --task_def tsv_files/tsv_transformed/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 64 --batch_size_eval 64 --optimizer "adamW" --train_datasets haspeede-TW,AMI2018A,AMI2018B,DANKMEMES2020,SENTIPOLC20161,SENTIPOLC20162 --test_datasets haspeede-TW,AMI2018A,AMI2018B,DANKMEMES2020,SENTIPOLC20161,SENTIPOLC20162 --learning_rate "1e-5" --multi_gpu_on --grad_accumulation_step 4 #--fp16 --grad_clipping 0 --global_grad_clipping 1
else:
  print("MT-DNN")
  !python train.py --encoder_type 9 --epochs 10 --task_def tsv_files/tsv_transformed/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint Musixmatch/umberto-commoncrawl-cased-v1 --bert_model_type Musixmatch/umberto-commoncrawl-cased-v1 --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets haspeede-TW,AMI2018A,AMI2018B,DANKMEMES2020,SENTIPOLC20161,SENTIPOLC20162 --test_datasets haspeede-TW,AMI2018A,AMI2018B,DANKMEMES2020,SENTIPOLC20161,SENTIPOLC20162 --learning_rate "5e-5" --multi_gpu_on --grad_accumulation_step 4 #--fp16 --grad_clipping 0 --global_grad_clipping 1

MT-GAN
2021-07-26 10:06:24.256781: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/26/2021 10:06:26 Launching the MT-DNN training
07/26/2021 10:06:26 Loading tsv_files/tsv_transformed/musixmatch_cased/haspeede-TW_train.json as task 0
Loaded 2800 samples out of 2800
07/26/2021 10:06:26 Loading tsv_files/tsv_transformed/musixmatch_cased/AMI2018A_train.json as task 1
Loaded 3800 samples out of 3800
07/26/2021 10:06:26 Loading tsv_files/tsv_transformed/musixmatch_cased/AMI2018B_train.json as task 2
Loaded 1586 samples out of 1586
07/26/2021 10:06:26 Loading tsv_files/tsv_transformed/musixmatch_cased/DANKMEMES2020_train.json as task 3
Loaded 640 samples out of 640
07/26/2021 10:06:26 Loading tsv_files/tsv_transformed/musixmatch_cased/SENTIPOLC20161_train.json as task 4
Loaded 6528 samples out of 6528
07/26/2021 10:06:26 Loading ts

### Finetuning

**Use the model resulting from the previous training!**

> To load the best model from the MTL training:

<pre>--init_checkpoint .../model_0.pt</pre>

The model is located in the checkpoint folder.
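
To see which checkpoint files are available before choosing one, the folder can be listed from Python. This is a sketch under an assumption: only `model_0.pt` is named in this notebook, so the `model_<n>.pt` pattern for other files is a guess, and `list_checkpoints` is a hypothetical helper. The listing does not tell you which checkpoint performed best.

```python
import glob
import os
import re
import tempfile

def list_checkpoints(folder="checkpoint"):
    """Return model_<n>.pt paths in folder, sorted by the number n."""
    numbered = []
    for path in glob.glob(os.path.join(folder, "model_*.pt")):
        match = re.fullmatch(r"model_(\d+)\.pt", os.path.basename(path))
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]

# Demonstration on a throwaway folder with empty placeholder files.
demo = tempfile.mkdtemp()
for n in (10, 0, 2):
    open(os.path.join(demo, f"model_{n}.pt"), 'w').close()

names = [os.path.basename(p) for p in list_checkpoints(demo)]
print(names)  # ['model_0.pt', 'model_2.pt', 'model_10.pt']
```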

> To specify which task to fine-tune:

<pre>--task 0</pre>

Task indices start at 0, so 0 is the first task.
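
The MTL training log numbers the tasks in the order they are loaded ("as task 0", "as task 1", and so on), which in this notebook matches the order of `--train_datasets`. Assuming `--task` uses the same numbering, the index for a given task name can be looked up rather than counted by hand. `task_index` is a hypothetical helper for illustration.

```python
# Task order as passed to --train_datasets in the MTL run above.
mtl_tasks = ["haspeede-TW", "AMI2018A", "AMI2018B",
             "DANKMEMES2020", "SENTIPOLC20161", "SENTIPOLC20162"]

def task_index(name, tasks=mtl_tasks):
    """Return the zero-based position of a task name in the training order."""
    return tasks.index(name)

idx = task_index("AMI2018B")
print(idx)  # 2
```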

In [None]:
#finetuning
!python finetuning.py --finetuning --task 0 --epochs 5 --string haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162 --task_def tsv_files/tsv_transformed/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml --data_dir tsv_files/tsv_transformed/musixmatch_cased/ --init_checkpoint checkpoint/model_0.pt --max_seq_len 128 --batch_size 16 --batch_size_eval 16 --optimizer "adamW" --train_datasets haspeede-TW --test_datasets haspeede-TW --learning_rate "5e-5" --multi_gpu_on --grad_accumulation_step 4 #--fp16 --grad_clipping 0 --global_grad_clipping 1

2021-05-28 12:50:16.665257: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
05/28/2021 12:50:18 Launching the MT-DNN training
05/28/2021 12:50:18 Loading tsv_files/tsv_transformed/musixmatch_cased/haspeede-TW_train.json as task 0
Loaded 2700 samples out of 2700
Loaded 600 samples out of 600
Loaded 1000 samples out of 1000
05/28/2021 12:50:18 ####################
05/28/2021 12:50:18 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'checkpoint/model_0.pt', 'data_dir': 'tsv_files/tsv_transformed/musixmatch_cased/', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tsv_files/tsv_transformed/haspeede-TW_AMI2018A_AMI2018B_DANKMEMES2020_SENTIPOLC20161_SENTIPOLC20162_task_def.yml', 'train_datasets': ['haspeede-TW'], 'test_datasets': ['haspeede-TW'], 'glue_format_on