### Laboratory made by:

- Ignacio Cano Navarro
- Angel Langdon Villamayor

In [3]:
!pip install transformers numpy torch scikit-learn emoji




In [None]:
!rm -rf logs

# Lab 9

In this lab we try to further improve our two models (one predicting toxicity, the other predicting the toxicity level) with two strategies. The first uses the toxicity model as a gate: only comments it flags as toxic are passed to the second model, which then classifies the level of toxicity. The second is a radical change: we take the first models we tried (SVM, RF...) and combine them in a stack (explained in more detail later) to see whether this strategy can improve on BETO's results.

# Toxicity

In this notebook we test how well BERT predicts the toxicity and toxicity_level variables. Let's start with toxicity.

In [4]:
import random

import numpy as np
import pandas as pd
import torch
from scipy.special import softmax
from sklearn.model_selection import train_test_split
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer, BertForSequenceClassification,
                          BertTokenizerFast, BertTokenizer, Trainer,
                          TrainingArguments)
from transformers.file_utils import (is_tf_available, is_torch_available,
                                     is_torch_tpu_available)

Let's define a function that sets a seed so that we get the same results across runs:



In [3]:
def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)
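As a quick sanity check (a minimal sketch, not part of the original notebook), re-seeding should reproduce the same pseudo-random draws. The helper `draw_sample` is a hypothetical name used only for illustration.

```python
import random

import numpy as np


def draw_sample(seed: int):
    """Seed `random` and `numpy`, then draw a few values from each."""
    random.seed(seed)
    np.random.seed(seed)
    return [random.random() for _ in range(3)], np.random.rand(3).tolist()


first = draw_sample(1)
second = draw_sample(1)
other = draw_sample(2)

print(first == second)  # same seed: the draws are identical
print(first == other)   # different seed: the draws differ
```

This matters later because `train_test_split` is called without `random_state`, in which case scikit-learn falls back to NumPy's global random state, so the split depends on the seed set here.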


We'll be using bert-base-spanish-wwm-cased because we found that the cased version achieved a slightly higher F1 score than the uncased version. We also found that the Spanish pretrained BERT base model worked better than the original BERT base model, which makes sense given that the comments are in Spanish.

In [None]:
model_name = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=False)


  


Data loading and split in train/test

In [5]:
df = pd.read_csv("train.csv")
#df_test = pd.read_csv("test.csv")
train_text = df["comment"].values
train_labels = df["toxicity"].values
train_texts, valid_texts, train_labels, valid_labels = train_test_split(list(train_text),
                                                                     list(train_labels),
                                                                     test_size = 0.1)
texts = list(df["comment"])

In [6]:
# max sequence length will be the average length of the texts (measured in characters)

leng = [len(txt) for txt in texts]
max_length = sum(leng)//len(leng)
max_length


206
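Note that `len(txt)` counts characters, whereas the tokenizer's `max_length` counts tokens. One possible alternative, sketched below under stated assumptions, is to pick the limit from the distribution of token counts. The function `suggest_max_length` is hypothetical, the whitespace split is only a stand-in for a real tokenizer, and the cap of 512 assumes the usual position limit of BERT base models.

```python
import math


def suggest_max_length(texts, tokenize=str.split, percentile=0.95, cap=512):
    """Return a token length that covers `percentile` of the texts.

    `tokenize` is any callable mapping a string to a list of tokens; the
    whitespace split used by default is only a stand-in for a real tokenizer.
    """
    lengths = sorted(len(tokenize(text)) for text in texts)
    # Index of the smallest length that covers the requested share of texts
    index = min(len(lengths) - 1, math.ceil(percentile * len(lengths)) - 1)
    return min(cap, lengths[max(index, 0)])


sample_texts = ["uno dos", "uno dos tres cuatro", "uno", "uno dos tres"]
print(suggest_max_length(sample_texts, percentile=0.75))
```

In the notebook one could pass the tokenizer's own `tokenize` method instead of `str.split`, so that the lengths are measured in the same units that `max_length` limits.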

Now, let's convert our texts to sequences of tokens

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)


The code below wraps our tokenized text data in a torch Dataset



In [None]:
class DetoxisDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:

# convert our tokenized data into a torch Dataset
train_dataset = DetoxisDataset(train_encodings, train_labels)
valid_dataset = DetoxisDataset(valid_encodings, valid_labels)


Now that we have our data prepared, let's download and load our BERT model and its pre-trained weights. Right now we're just trying to predict whether a comment is toxic or not so the number of labels is just 2.

In [None]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to("cuda")


Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchi

In [None]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {'accuracy': acc,}


After loading the model, let's choose the training parameters.

In [None]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=4,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    evaluation_strategy="steps",     # evaluate each `logging_steps`
    learning_rate = 0.00001
)
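To get a feel for what `warmup_steps=500` implies, here is a small sketch of a linear warmup followed by linear decay. It assumes the `Trainer` uses its default linear schedule, and it takes the total of 780 optimization steps from the training run of the toxicity model; `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=780):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (200, 500, 640, 780):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the learning rate stays below its peak for 500 of the 780 steps (about 64% of training), and a run with fewer than 500 steps in total never reaches the configured `learning_rate` at all.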


In [None]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)


In [None]:
# train the model
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
200,0.657,0.630134,0.648415,8.6086,40.308
400,0.5122,0.558628,0.731988,8.5354,40.654
600,0.4155,0.577997,0.740634,8.5573,40.55


TrainOutput(global_step=780, training_loss=0.4670776856251252, metrics={'train_runtime': 967.3852, 'train_samples_per_second': 0.806, 'total_flos': 1692331864908672.0, 'epoch': 4.0, 'init_mem_cpu_alloc_delta': 8192, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 14204928, 'train_mem_gpu_alloc_delta': 1780847104, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 2955052032})

In [None]:
trainer.evaluate()


{'epoch': 4.0,
 'eval_accuracy': 0.7319884726224783,
 'eval_loss': 0.5586280226707458,
 'eval_mem_cpu_alloc_delta': 0,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 170200064,
 'eval_runtime': 8.6957,
 'eval_samples_per_second': 39.905}

After training our model, we're going to create a function that will receive a text as input and will return the model's prediction (toxic or not) as output.

In [None]:
def get_prediction(text,max_length):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    return probs.argmax()


In [None]:
from sklearn.metrics import f1_score
pred = [int(get_prediction(text, max_length)) for text in valid_texts]
score = f1_score(valid_labels, pred, average='macro')
print(score)

0.6904258815795652


We can see that the F1 score is around 0.69, which we don't consider much of an improvement, taking into account that a much simpler model such as an SVM reached an F1 score of ~0.68 and that BERT uses far more resources and is much more time-consuming. This is also the best pretrained model we could find: other models such as DistilBERT, BERT-BASE, GPT-2, BETO-Cased and twitter-roberta-base-offensive were even worse than an SVM with a TF-IDF matrix as input.
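Since the training loop reports accuracy while the final comparison uses macro F1, it may help to see how the two differ. The following is a minimal pure-Python sketch of macro F1 on made-up labels; `macro_f1` is a hypothetical helper written for illustration, and in the notebook itself `sklearn.metrics.f1_score` is what is actually used.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up, imbalanced example: four non-toxic comments, two toxic ones
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Because every class contributes equally to the average, macro F1 penalizes mistakes on the minority class more than accuracy does, which is why the two numbers can diverge on imbalanced data.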

## Toxicity_levels

Now we're going to try to further improve the toxicity_levels model. We first use our previous model to decide whether a comment is toxic. If it flags the comment as toxic, we feed the comment to a new toxicity_levels predictor that focuses only on predicting the level of toxicity (excluding the possibility that the comment is not toxic, which the earlier model had to consider).
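The control flow of this two-stage idea can be sketched independently of BERT. In the sketch below, `cascade_predict`, `toy_is_toxic` and `toy_predict_level` are hypothetical names, and the two toy functions are trivial stand-ins for the fine-tuned models.

```python
def cascade_predict(text, is_toxic, predict_level):
    """Two-stage prediction: gate on toxicity, then estimate the level.

    `is_toxic` returns 0 or 1; `predict_level` returns 0, 1 or 2 because the
    levels model is trained on labels shifted down by one.
    """
    if is_toxic(text) == 0:
        return 0
    # Shift back to the original 1-3 scale
    return predict_level(text) + 1


# Hypothetical stand-ins for the two fine-tuned models
def toy_is_toxic(text):
    return int("idiota" in text)


def toy_predict_level(text):
    return min(2, text.count("!"))


print(cascade_predict("buenos días", toy_is_toxic, toy_predict_level))
print(cascade_predict("eres idiota!!", toy_is_toxic, toy_predict_level))
```

The cascade returns values on the original 0-3 scale, so any score computed from its output should use reference labels on that same scale.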

In [None]:
def replace(df_model):
  df_model = df_model.copy()
  df_model['toxicity_level'] = df_model['toxicity_level'].replace(1,0)
  df_model['toxicity_level'] = df_model['toxicity_level'].replace(2,1)
  df_model['toxicity_level'] = df_model['toxicity_level'].replace(3,2)
  return df_model

In [None]:
df = pd.read_csv("train.csv")
#df_test = pd.read_csv("test.csv")
df_model = df.copy()
# We are going to remove the rows with toxicity_level == 0
# Get the index labels of the rows whose toxicity_level is 0
indexNames = df_model[df_model['toxicity_level'] == 0].index
# Delete these rows from the dataframe
df_model.drop(indexNames , inplace=True)
df_model = replace(df_model)

train_text = df_model["comment"].values
train_labels = df_model["toxicity_level"].values
train_texts, valid_texts, train_labels, valid_labels = train_test_split(list(train_text),
                                                                     list(train_labels),
                                                                     test_size = 0.2)
texts = list(df_model["comment"])

In [None]:
df_model['toxicity_level'].unique()

array([0, 1, 2])

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)


train_dataset = DetoxisDataset(train_encodings, train_labels)
valid_dataset = DetoxisDataset(valid_encodings, valid_labels)


In [None]:
model_levels = BertForSequenceClassification.from_pretrained(model_name, num_labels=3).to("cuda")


Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchi

In [None]:
trainer = Trainer(
    model=model_levels,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

# train the model
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
200,0.8361,0.805472,0.665217,5.6493,40.713


TrainOutput(global_step=232, training_loss=0.8153631604951004, metrics={'train_runtime': 288.6245, 'train_samples_per_second': 0.804, 'total_flos': 497492567379648.0, 'epoch': 4.0, 'init_mem_cpu_alloc_delta': 4096, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 12288, 'train_mem_gpu_alloc_delta': 1781686784, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 2954110976})

In [None]:
def get_prediction_levels(text,max_length):
    labels = {0:1, 1:2, 2:3}
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    result = int(probs.argmax())
    if result == 0:
      return 0
    else:
      outputs = model_levels(**inputs)
      probs = outputs[0].softmax(1)
      result = int(probs.argmax())
      return labels[result]

In [None]:
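from sklearn.metrics import f1_score
# NOTE: valid_labels comes from the toxic-only split and uses the remapped
# labels (0-2), while get_prediction_levels returns the original levels (0-3),
# so the two label spaces are not aligned when this score is computed.
pred = [int(get_prediction_levels(text, max_length)) for text in valid_texts]
score = f1_score(valid_labels, pred, average='macro')
print(score)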

0.33712236058970646



We didn't expect the F1 score to be worse with this strategy than with two separate models. However, there's one last technique to test to see if we can improve the F1 score:

## Stacking

The point of stacking is to explore a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models, each capable of learning some part of the problem but not the whole problem space. So you build several different learners and use each of them to produce an intermediate prediction, one per learned model. Then you add a new model (the meta-learner) that learns the same target from those intermediate predictions.
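The mechanics can be sketched by hand on synthetic data. This is a simplified illustration under our own assumptions (two base models, positive-class probabilities as intermediate predictions, logistic regression as meta-learner), not the configuration used in this lab; it is roughly what scikit-learn's `StackingClassifier` automates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the mechanics
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Level 0: out-of-fold predictions, so the meta-learner never sees a
# prediction made by a model that was trained on that same sample
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1: the meta-learner is fitted on the intermediate predictions
meta_model = LogisticRegression().fit(meta_train, y_train)

# At prediction time the base models are refitted on the full training set
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])
stacked_pred = meta_model.predict(meta_test)
print(meta_train.shape, stacked_pred.shape)
```

Using out-of-fold predictions for the meta-learner is the key design choice: fitting it on in-sample predictions would let it simply trust whichever base model overfits the most.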


As we did in previous labs, we're going to apply the TF-IDF transformation. Then we'll use a stack of models (SVM, logistic regression, decision tree, random forest) plus a meta-learner, which will be the best classifier of the many we have tried before (SVM).

In [5]:
import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# necessary packages
nltk.download("stopwords")
nltk.download("punkt")


# Preprocessing
def delete_stop_words(comment):
    spanish_stopwords = stopwords.words("spanish")
    return " ".join([w for w in comment.split() if w not in spanish_stopwords])

def steam(text, stemmer):
    stemmed_text = [stemmer.stem(word) for word in word_tokenize(text)]
    return " ".join(stemmed_text)

def clean_text_column(df, col, stemmer):
    """Normalizes a string column to have a processed format
    Arguments:
      df (pd.DataFrame): the dataframe that contains the column to normalize
      col (str): the dataframe column to normalize
      stemmer (nltk.stem.SnowballStemmer): the stemmer to use for
          stemming the text
    Returns:
      The dataframe with the preprocessed column
    """
    df = df.copy() # copy the dataframe avoid modifying the original
    # Make the comments to lowercase 
    df[col] = df[col].str.lower()
    # Delete the stop words
    df[col] = [delete_stop_words(c) for c in df[col]]
    # Replace underscores and hyphens with spaces 
    df[col] = df[col].str.replace("_", " ")
    df[col] = df[col].str.replace("-", " ")
    # Create the regex to delete the urls, usernames and emojis
    urls = r'https?://[\S]+'
    users = r'@[\S]+'
    emojis = r'[\U00010000-\U0010ffff]'
    hashtags = r'\s#[\S]+'
    # Join the regex
    expr = f'''({"|".join([urls,
                           users,
                           hashtags,
                           emojis])})'''
    # Replace the urls, users and emojis with empty string
    df[col] = df[col].str.replace(expr, "", regex=True)                      
    # Get only the words of the text
    df[col] = df[col].str.findall(r"\w+").str.join(" ")
    # Delete the numbers
    df[col] = df[col].str.replace("[0-9]+", "",regex=True)
    # Stem the words of each text in the specified column (currently disabled)
    #df[col] = [steam(c, stemmer) for c in  df[col]]
    return df


# Initialize the stemmer for the Spanish language
stemmer = SnowballStemmer('spanish')
# read the data
df_original = pd.read_csv("train.csv") 
# Normalize the "comment" column
df = clean_text_column(df_original, "comment", stemmer)
df.head()


# Create a TF-IDF vectorizer that uses unigrams only
tfidf = TfidfVectorizer(ngram_range=(1,1))
# Fit with the comments 
features = tfidf.fit_transform(df["comment"])
# Get the feature extraction matrix
df_features = pd.DataFrame(features.todense(),
             columns= tfidf.get_feature_names())
# Print the first comment
print(df_original["comment"].iloc[0])
# Show the first row of the matrix, with columns sorted by TF-IDF weight
df_features.sort_values(by=0, axis=1, ascending=False).head(1)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Pensó: Zumo para restar.


Unnamed: 0,pensó,restar,zumo,opino,operativas,opinadores,opinamos,opinan,opinando,opinar,opine,opines,opinion,opiniones,opinión,opniones,operación,oponen,oportunas,oportunidad,oportunidades,oposicion,oposición,opresor,opresores,oprimidas,oprimido,oprimidos,operandi,operaciones,optas,onegetas,olía,olímpicamente,omiso,omite,omiten,omites,omitir,omnipresente,...,elijo,emigraban,emigrado,emigramos,emigran,emigrante,emigrantes,emigrar,emigraran,emigraron,emigré,emigró,embarcación,embarcaciones,embajadss,embajadas,elimina,eliminación,eliminar,eliminas,elimine,elisa,elite,eljueves,ella,ellas,elle,ello,ellos,elmundotoday,elpais,elplural,elsaltodiario,elíptica,em,ema,email,emanan,emanuel,útiles
0,0.630681,0.608146,0.482079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
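To see why the three words of the first comment receive different weights, here is a pure-Python sketch of TF-IDF on a made-up three-document corpus. It assumes the `TfidfVectorizer` defaults (smoothed IDF and L2-normalised rows) and uses a plain whitespace split instead of the vectorizer's own tokenizer; `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Made-up corpus: "zumo" appears in every document, the other words only once
toy_docs = ["pensó zumo restar", "zumo naranja", "opinión zumo"]
first_row = tfidf_rows(toy_docs)[0]
print(sorted(first_row.items(), key=lambda kv: -kv[1]))
```

A term that occurs in many documents gets a lower IDF and therefore a lower weight, which is consistent with "zumo" having the smallest of the three non-zero values in the row shown above.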


In [6]:
X,y = df_features, df_original['toxicity']
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=20)


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, make_scorer
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

In [8]:
svc = svm.SVC(class_weight='balanced')
svc_meta = svm.SVC(class_weight='balanced')
dt = DecisionTreeClassifier(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced')
lr = LogisticRegression(class_weight='balanced')
estimators = [('svc', svc),
               ("dt", dt),
               ("rf", rf),
               ("lr", lr)]

clf = StackingClassifier(
    estimators=estimators,
    final_estimator=svc_meta,
    n_jobs=-1
)
clf.fit(X_train, y_train)

StackingClassifier(cv=None,
                   estimators=[('svc',
                                SVC(C=1.0, break_ties=False, cache_size=200,
                                    class_weight='balanced', coef0=0.0,
                                    decision_function_shape='ovr', degree=3,
                                    gamma='scale', kernel='rbf', max_iter=-1,
                                    probability=False, random_state=None,
                                    shrinking=True, tol=0.001, verbose=False)),
                               ('dt',
                                DecisionTreeClassifier(ccp_alpha=0.0,
                                                       class_weight='balanced',
                                                       criter...
                                                   solver='lbfgs', tol=0.0001,
                                                   verbose=0,
                                                   warm_start=False))],
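All the estimators above use `class_weight='balanced'`. As a small worked example on made-up labels, the sketch below reproduces the weighting rule documented by scikit-learn, `n_samples / (n_classes * class_count)`; `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


toy_labels = [0] * 6 + [1] * 3 + [2] * 1  # made-up, imbalanced labels
print(balanced_class_weights(toy_labels))
```

Rarer classes receive proportionally larger weights, so errors on them cost more during training.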
         

In [9]:
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')
print(score)


0.672077922077922


In [10]:
X,y = df_features, df_original['toxicity_level']
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=20)

svc = svm.SVC(class_weight='balanced')
svc_meta = svm.SVC(class_weight='balanced')
dt = DecisionTreeClassifier(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced')
lr = LogisticRegression(class_weight='balanced')
estimators = [('svc', svc),
               ("dt", dt),
               ("rf", rf),
               ("lr", lr)]

clf = StackingClassifier(
    estimators=estimators,
    final_estimator=svc_meta,
    n_jobs=-1
)
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')
print(score)



0.3204497857988262


## Conclusion

After implementing the two strategies described above, we can see that the effort did not pay off, because neither of them was able to improve on the results from the last lab, that is, the results obtained with BETO. It would have been great to use GridSearchCV with the stacking strategy, but training took approximately an hour and a half without a grid search, so we couldn't afford the many additional hours of computing. After testing several strategies and seeing the results, we are convinced that BETO gives us the best F1 score, and that's why we chose its predictions.
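For reference, a grid search over a stack could look roughly like the sketch below. It runs on a small synthetic data set with a deliberately tiny grid, and the estimators and parameter values are illustrative assumptions rather than the setup used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Small synthetic data set, used only so that the search finishes quickly
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=SVC(class_weight="balanced"),
    cv=3,
)

# Nested parameters are addressed as <estimator name>__<parameter>
param_grid = {
    "dt__max_depth": [2, 4],
    "final_estimator__C": [0.1, 1.0],
}

search = GridSearchCV(stack, param_grid, scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_)
```

The cost multiplies quickly: every grid point is fitted once per outer fold, and each of those fits runs the stack's own internal cross-validation, which is why a full search was out of reach for a model that already takes an hour and a half to train once.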