The procedure known as "data augmentation" consists of creating synthetic data from existing data, in our case the labeled data we already have.
This technique is relatively new in NLP projects.
The nlpaug module implements a range of data augmentation algorithms that could improve the performance of our models.

We will see how to use nlpaug to generate Twitter data and compare how a model performs with and without data augmentation techniques.
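
To make the idea concrete, here is a minimal sketch of synonym-based augmentation in plain Python. The `TOY_SYNONYMS` table and the `toy_synonym_augment` helper are hypothetical names made up for illustration; nlpaug takes its synonyms from WordNet and its internals differ from this sketch.

```python
import random

# Hypothetical toy synonym table; a real augmenter would use WordNet or embeddings.
TOY_SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
}

def toy_synonym_augment(text, max_swaps=1, seed=0):
    """Return a copy of `text` with up to `max_swaps` words replaced by a synonym."""
    rng = random.Random(seed)
    words = text.split()
    # Positions of words that have at least one known synonym
    candidates = [i for i, w in enumerate(words) if w.lower() in TOY_SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:max_swaps]:
        words[i] = rng.choice(TOY_SYNONYMS[words[i].lower()])
    return " ".join(words)

original = "a good movie"
augmented = toy_synonym_augment(original, max_swaps=2)
print(original, "->", augmented)
```

The augmented sentence keeps the label of the original one, which is what lets it be added to the training set as an extra example.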

# Install and import modules

In [1]:
# Install the most recent version of gensim.
# Otherwise, you may get the following error when running naw.WordEmbsAug():
# 'Word2VecKeyedVectors' object has no attribute 'index_to_key'
# see: https://stackoverflow.com/questions/71032760/word2veckeyedvectors-object-has-no-attribute-index-to-key
!pip install --upgrade gensim --quiet

In [2]:
# Import gensim.
# Note: You will need to restart the runtime in order to import the most recent version of gensim
import gensim
print(gensim.__version__)

4.3.2


In [3]:
# Install the transformers module in order to use their base models (e.g., BERT)
!pip install transformers --quiet

In [4]:
# Import transformers
import transformers

In [5]:
# Install the tokenizer needed by the back translation model
!pip install sacremoses --quiet

In [6]:
# Import the tokenizer
import sacremoses

In [7]:
# Install the nlpaug module
!pip install nlpaug --quiet

In [8]:
# Import the nlpaug module and its methods
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

In [10]:
# Import other modules
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
import torch
from torch import device
from sklearn.metrics import accuracy_score, f1_score

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [11]:

# Show all outputs of a cell in a jupyter notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# EXIST 2021 Twitter Sexism Dataset (Spanish)


In [12]:
# Mount Google drive to colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
path =('/content/drive/MyDrive/datasets')

df = pd.read_csv('/content/drive/MyDrive/Datasets a limpio /EXIST 2021 dataset_esp.csv')

In [14]:
df.head()

Unnamed: 0,test_case,id,source,language,text,task1,task2
0,EXIST2021,1,twitter,es,Nadie te va a tratar tan bien como un hombre q...,sexist,sexual-violence
1,EXIST2021,2,twitter,es,"@lindagisela74 Que rica putita obediente, afor...",sexist,stereotyping-dominance
2,EXIST2021,3,twitter,es,@BicireporteraDF Yo lo hice a los 18 años por ...,non-sexist,non-sexist
3,EXIST2021,4,twitter,es,las cosas q sueño son indicios de que yo enrea...,non-sexist,non-sexist
4,EXIST2021,5,twitter,es,"Pero a la niña le gustó desde que lo vió, así ...",non-sexist,non-sexist


In [15]:
columns_to_remove = ['test_case', 'id', 'source','language','task2']
df = df.rename(columns = {"task1":"label"}).drop(columns_to_remove ,axis=1)

In [16]:
df.head()

Unnamed: 0,text,label
0,Nadie te va a tratar tan bien como un hombre q...,sexist
1,"@lindagisela74 Que rica putita obediente, afor...",sexist
2,@BicireporteraDF Yo lo hice a los 18 años por ...,non-sexist
3,las cosas q sueño son indicios de que yo enrea...,non-sexist
4,"Pero a la niña le gustó desde que lo vió, así ...",non-sexist


In [17]:
# Convert the nominal values (non-sexist / sexist) to numeric values (0 / 1)
df['label'] = df['label'].replace(['non-sexist','sexist'],[0, 1])

In [18]:
# Important note: Check the integrity of the DataFrame to ensure that there are no missing values,
# which would disrupt training. Here, we simply drop any rows with missing values.
df = df.dropna()

In [19]:
# Take a look at the first five samples in the dataframe
pd.set_option('display.max_colwidth', None)
df.head()

Unnamed: 0,text,label
0,Nadie te va a tratar tan bien como un hombre que te lo quiere meter por primera vez.,1
1,"@lindagisela74 Que rica putita obediente, afortunado tu marido de tener una mujer como tú, saludos",1
2,@BicireporteraDF Yo lo hice a los 18 años por la carretera libre a Veracruz y ahora hay más carreteras veras que si puedes mujer,0
3,"las cosas q sueño son indicios de que yo enrealidad soy una lesbiana reprimida, no tengo ninguna duda d esto",0
4,"Pero a la niña le gustó desde que lo vió, así que me le dije, hola bien y tú? Ese día sólo podía ver un hombre guapo, moreno (como me gustan) con una barba que rodeaba su boca tan perfecta y unos ojazos que wow. Hasta las manos le vi ajjaja por que si me fijo en eso",0


In [20]:
# Split the data into 80% train, 10% validation, and 10% test using sklearn
from sklearn.model_selection import train_test_split
train_df, valtest_df = train_test_split(df, test_size = 0.2, random_state = 42)
val_df, test_df = train_test_split(valtest_df, test_size = 0.5, random_state = 42)
train_df.shape, val_df.shape, test_df.shape

((4560, 2), (570, 2), (571, 2))
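
The printed shapes follow from simple arithmetic. The sketch below assumes that `train_test_split` rounds the test portion up and gives the remainder to the train portion, which is consistent with the shapes above; `split_sizes` is a hypothetical helper, not part of sklearn.

```python
import math

def split_sizes(n_samples, test_size):
    """Train/test sizes for a float `test_size`, assuming the test size is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 4560 + 570 + 571  # rows after dropna(), taken from the shapes printed above
n_train, n_valtest = split_sizes(n_total, 0.2)
n_val, n_test = split_sizes(n_valtest, 0.5)
print(n_train, n_val, n_test)
```

The second split halves an odd number of rows (1141), which is why the test set ends up one row larger than the validation set.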

In [21]:
!pip install datasets



In [22]:
from datasets import DatasetDict, Dataset

In [23]:
# Convert the three DataFrames to three Datasets (Apache Arrow format)
dset_train = Dataset.from_pandas(train_df)
dset_val = Dataset.from_pandas(val_df)
dset_test = Dataset.from_pandas(test_df)

In [24]:
# Gather train, val, and test Datasets to have a single DatasetDict, which can be manipulated together
tweets = DatasetDict({
  'train': dset_train,
  'val': dset_val,
  'test': dset_test})
# Dataset.from_pandas will add an index column, which can be removed
tweets = tweets.remove_columns(["__index_level_0__"])

In [25]:
# Specify the model checkpoint
# Note: Find pre-trained models that support AutoModelForSequenceClassification:
# https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification
model_ckpt = 'Twitter/twhin-bert-base'

# Load the tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [26]:
# Define a function that tokenizes the dataset so that our model can process it:
def tokenize(batch):
  return tokenizer( batch["text"], padding = True,  max_length=128, truncation = True)
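
As a rough illustration of what `truncation = True` and `padding = True` do to a batch, here is a simplified sketch on lists of token ids. The `pad_and_truncate` helper is hypothetical and ignores special tokens, which a real tokenizer handles for us.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to `max_length`, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13, 14]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because the padding length depends on the longest sequence in the batch, the shapes are only uniform across the dataset if the whole dataset is tokenized as one batch.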

In [27]:
# Use GPU (cuda) if available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [28]:

# Define two dictionaries that convert between ids (0, 1) and labels (strings)
# Note: By adding label2id and id2label to our model's config,
# we will get friendlier labels in the inference API.
label2id = {}
id2label = {}
labels = ['non-sexist','sexist']
for i, label_name in enumerate(labels):
  label2id[label_name] = str(i)
  id2label[str(i)] = label_name

# Take a look at the two dictionaries
label2id, id2label, len(label2id)

({'non-sexist': '0', 'sexist': '1'}, {'0': 'non-sexist', '1': 'sexist'}, 2)

In [29]:

# Grab the pre-trained model with a classification head
model = AutoModelForSequenceClassification.from_pretrained(
  # The pretrained model
  model_ckpt,
  # Number of classes/labels
  num_labels = len(label2id),
  # A dictionary linking label to id
  label2id = label2id,
  # A dictionary linking id to label
  id2label = id2label).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Twitter/twhin-bert-base and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average = "weighted")
  acc = accuracy_score(labels, preds)
  return {"accuracy": acc, "f1": f1}
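
For readers who want to see what the two metrics compute, here is a plain-Python sketch of accuracy and weighted F1 on a toy example. The helper names are hypothetical; the notebook itself relies on sklearn's `accuracy_score` and `f1_score`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_cls / len(y_true)
    return total

y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

The weighted average gives more influence to the class with more true examples, which matters when the two classes are not balanced.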

In [31]:
! pip install accelerate -U



In [32]:
! pip install transformers[torch]



In [33]:
## Define the training arguments

# Batch size
batch_size = 8

# Number of epochs
num_epochs = 3

# Name of the model (to be uploaded to Huggingface)
model_name = f"{model_ckpt}-finetuned-exist2021"

# Specify the path to store the fine-tuned model
path_model = '/content/drive/MyDrive/Deep Learning Course/Models'

# Training argument
training_args = TrainingArguments(
  # Output directory
  # Note: All model checkpoints will be saved to the folder named `model_name`
  output_dir = os.path.join(path_model, model_name),
  # Number of epochs
  num_train_epochs = num_epochs,
  # Learning rate
  learning_rate = 2e-5,
  # Batch size for training and validation
  per_device_train_batch_size = batch_size,
  per_device_eval_batch_size = batch_size,
  # Weight decay for regularization
  weight_decay = 0.01,
  # Validate the model using the val set after each epoch
  evaluation_strategy = "epoch",
  # Load the best model at the end of training
  load_best_model_at_end = True,
  # Push to Huggingface Hub
  # It could be helpful to push the model to the Hub for sharing and for using pipeline(), but
  # pushing the model takes a very long time, so we choose not to do it here.
  push_to_hub = False,
  # Save model checkpoint after each epoch
  save_strategy = "epoch")
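
To get a feel for the training cost implied by these settings, the sketch below counts optimizer steps. It assumes a single device and no gradient accumulation, and `total_optimizer_steps` is a hypothetical helper rather than a transformers function.

```python
import math

def total_optimizer_steps(n_examples, batch_size, num_epochs):
    """Steps per epoch (a last partial batch still counts) times the number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * num_epochs

# Original train set (4560 tweets) versus a train set doubled by augmentation (9120)
print(total_optimizer_steps(4560, batch_size=8, num_epochs=3))
print(total_optimizer_steps(9120, batch_size=8, num_epochs=3))
```

Adding one augmented tweet per original tweet therefore roughly doubles the training time per epoch.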



# Initiate three alternative text augmentation strategies


In [34]:
# Synonym augmentation

In [35]:
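# Initiate the synonym augmentation
# Note: lang defaults to 'eng'; since the tweets are in Spanish,
# passing lang = 'spa' may be more appropriate.
aug_syn = naw.SynonymAug(
  aug_src = 'wordnet',
  aug_max = 3)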


In [36]:
## Initiate the contextual word embeddings (BERT, DistilBERT, RoBERTa or XLNet) augmentation
aug_emb = naw.ContextualWordEmbsAug(
  # Other models include 'distilbert-base-uncased', 'roberta-base', etc.
  model_path = 'distilbert-base-uncased',
  # You can also choose "insert"
  action = "substitute",
  # Use GPU
  device = 'cuda'
  )


In [37]:
# Initiate the back translation augmentation
aug_bt = naw.BackTranslationAug(
  # Translate Spanish to English
  from_model_name = 'Helsinki-NLP/opus-mt-es-en',
  # Translate English back to Spanish
  to_model_name = 'Helsinki-NLP/opus-mt-en-es',
  # Use GPU
  device = 'cuda')
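
Back translation can be thought of as composing two translators: one into a pivot language and one back. The sketch below uses tiny word-level lookup tables as hypothetical stand-ins for the two translation models; the tables, `make_translator` and `back_translate` are illustrative names only.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate `text` into a pivot language and back again."""
    return from_pivot(to_pivot(text))

# Hypothetical word-level "translators" standing in for real translation models.
ES_TO_EN = {"coche": "car", "rápido": "fast"}
EN_TO_ES = {"car": "auto", "fast": "veloz"}

def make_translator(table):
    # Words missing from the table pass through unchanged
    return lambda s: " ".join(table.get(w, w) for w in s.split())

paraphrase = back_translate(
    "coche rápido", make_translator(ES_TO_EN), make_translator(EN_TO_ES))
print(paraphrase)
```

The round trip returns a sentence with the same meaning but different wording, which is the property that makes it useful as an augmented training example.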


In [38]:
# Create a function to evaluate text augmentation on model performance on test set
def evaluate_aug(aug_strategy, n, train_df, dset_val, dset_test):

  # Create two lists to store augmented tweets and their corresponding labels
  augmented_tweets = []
  augmented_tweets_labels = []

  # Loop through the train set to create augmented tweets
  # Note: We create n augmented tweets per original tweet.
  for i in train_df.index:
    if aug_strategy == 'synonym':
      lst_augment = aug_syn.augment(train_df['text'].loc[i], n = n)
    elif aug_strategy == 'embedding':
      lst_augment = aug_emb.augment(train_df['text'].loc[i], n = n)
    else:
      lst_augment = aug_bt.augment(train_df['text'].loc[i], n = n)
    for augment in lst_augment:
      augmented_tweets.append(augment)
      augmented_tweets_labels.append(train_df['label'].loc[i])

  # Zip the two lists into a list of tuples
  augmented_tweets_labels = list(zip(augmented_tweets, augmented_tweets_labels))

  # Convert the list of tuples to a Pandas Dataframe.
  df_augmented_tweets_labels = pd.DataFrame(
    augmented_tweets_labels, columns = ['text', 'label'])

  # Vertically concat the train set with the augmented texts
  train_df_augmented = pd.concat([train_df, df_augmented_tweets_labels], axis = 0)



  # Convert the DataFrame to a Dataset (Apache Arrow format)
  dset_train_augmented = Dataset.from_pandas(train_df_augmented)

  # Gather train, val, and test Datasets to have a single DatasetDict,
  # which can be manipulated together
  tweets_augmented = DatasetDict({
    'train': dset_train_augmented,
    'val': dset_val,
    'test': dset_test})
  tweets_augmented = tweets_augmented.remove_columns(["__index_level_0__"])

  # Tokenize the tweets dataset
  tweets_augmented_encoded = tweets_augmented.map(
    tokenize,
    batched = True,
    batch_size = None)

  # Remove the text column from the encoded DatasetDict because the model does not use it.
  tweets_augmented_encoded = tweets_augmented_encoded.remove_columns(['text'])

  # Since the model expects tensors as inputs,
  # we will convert the input_ids and attention_mask columns to the "torch" format.
  tweets_augmented_encoded.set_format(
    "torch", columns = ["input_ids", "attention_mask", "label"])

  # Define the trainer
  trainer = Trainer(
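    # Model
    # Note: This is the global `model`, so successive calls to this function keep
    # fine-tuning the same weights; reload the checkpoint before each call for a fair comparison.
    model = model,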
    # Training argument
    args = training_args,
    # Metrics (f1 score and accuracy)
    compute_metrics = compute_metrics,
    # Train and val Datasets
    train_dataset = tweets_augmented_encoded["train"],
    eval_dataset = tweets_augmented_encoded["val"],
    # Tokenizer
    tokenizer = tokenizer)

  # Clean up the memory using the garbage cleaner
  gc.collect()
  torch.cuda.empty_cache()

  # Start the training process
  trainer.train()

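  # Use the model to predict the test set
  # Note: We use the test split passed to this function rather than the global tweets_encoded
  preds_output = trainer.predict(tweets_augmented_encoded["test"])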
  print(preds_output.metrics)

  # Remove all elements from the lists
  augmented_tweets.clear()
  augmented_tweets_labels.clear()
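
The data-building part of the function above can be isolated from the training part. The following is a minimal sketch of that step on plain (text, label) pairs; `build_augmented_set` and `toy_augment` are hypothetical names, and `toy_augment` only imitates the "text in, list of n texts out" shape that the function above expects from an augmenter.

```python
def build_augmented_set(examples, augment, n=1):
    """Return the original (text, label) pairs plus `n` augmented copies of each."""
    augmented = []
    for text, label in examples:
        for new_text in augment(text, n):
            # An augmented text inherits the label of the tweet it came from
            augmented.append((new_text, label))
    return list(examples) + augmented

def toy_augment(text, n):
    # Hypothetical augmenter: returns n trivially modified copies
    return [f"{text} (v{k})" for k in range(1, n + 1)]

train = [("tweet a", 0), ("tweet b", 1)]
train_aug = build_augmented_set(train, toy_augment, n=1)
print(len(train), "->", len(train_aug))
```

With n = 1 the training set doubles in size, while the validation and test sets are left untouched so that the evaluation stays comparable.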

In [39]:
import gc

In [40]:
tweets_encoded = tweets.map(
  tokenize,
  # Encode the tweets in batches
  batched = True,
  # Apply the tokenize function on the full dataset as a single batch
  # Note: With padding = True, each batch is padded to its longest sequence, so a single batch
  # ensures that the input tensors and attention masks have the same shape globally.
  # Alternatively, padding = "max_length" in the tokenize() function would ensure the same shape.
  batch_size = None)

Map:   0%|          | 0/4560 [00:00<?, ? examples/s]

Map:   0%|          | 0/570 [00:00<?, ? examples/s]

Map:   0%|          | 0/571 [00:00<?, ? examples/s]

In [41]:
# Evaluate the synonym text augmentation
score_synonym = evaluate_aug(
  aug_strategy = 'synonym',
  n = 1,
  train_df = train_df,
  dset_val = dset_val,
  dset_test = dset_test)
print(score_synonym)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Map:   0%|          | 0/9120 [00:00<?, ? examples/s]

Map:   0%|          | 0/570 [00:00<?, ? examples/s]

Map:   0%|          | 0/571 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4766,0.622602,0.794737,0.793084
2,0.3012,0.889719,0.777193,0.776695
3,0.1545,1.110689,0.773684,0.773178


{'test_loss': 0.5923130512237549, 'test_accuracy': 0.8126094570928196, 'test_f1': 0.8120771123265388, 'test_runtime': 5.5441, 'test_samples_per_second': 102.992, 'test_steps_per_second': 12.987}
None


In [42]:
# Evaluate the embedding text augmentation
score_embedding = evaluate_aug(
  aug_strategy = 'embedding',
  n = 1,
  train_df = train_df,
  dset_val = dset_val,
  dset_test = dset_test)
print(score_embedding)

Map:   0%|          | 0/9120 [00:00<?, ? examples/s]

Map:   0%|          | 0/570 [00:00<?, ? examples/s]

Map:   0%|          | 0/571 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3859,0.763606,0.782456,0.781588
2,0.3274,0.874804,0.768421,0.768398
3,0.208,1.17202,0.768421,0.768139


{'test_loss': 0.6831099390983582, 'test_accuracy': 0.7950963222416813, 'test_f1': 0.7948358810531897, 'test_runtime': 4.6975, 'test_samples_per_second': 121.555, 'test_steps_per_second': 15.327}
None


In [43]:
# Evaluate the back translation text augmentation
score_backtranslation = evaluate_aug(
  aug_strategy = 'backtranslation',
  n = 1,
  train_df = train_df,
  dset_val = dset_val,
  dset_test = dset_test)
print(score_backtranslation)

  self.pid = os.fork()
  self.pid = os.fork()


KeyboardInterrupt: 