# RoBERTa

In this notebook we use RoBERTa, an alternative version of BERT that is reported to outperform the original.

The idea is to use RoBERTa to build embeddings for the tweets, as we have been doing so far, and feed them into the following models:

* A CNN with three filters of different sizes, which we previously used with other embeddings.

* A simple RNN like the one in the `lstm-baseline` notebook.

We also take the features extracted by the dense layer of each model and use them as input to an SVM, which seems to give good results for this particular problem.

Finally, we build a pseudo-ensemble by averaging the predictions of the two networks described above.

In [None]:
!pip install transformers



In [None]:
from transformers import RobertaTokenizer, TFRobertaModel
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import int32
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score

from sklearn import svm
from tensorflow.keras.layers import Dense, Dropout, Input, GlobalMaxPooling1D, Conv1D, concatenate, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import LearningRateScheduler

In [None]:
def metrics(predictions, y_test):
    """Print the confusion-matrix counts plus precision, recall and F1 for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f'Verdaderos Negativos: {tn}')
    print(f'Falsos Negativos: {fn}')
    print(f'Verdaderos Positivos: {tp}')
    print(f'Falsos Positivos: {fp}')
    print()
    print(f'precision score: {precision_score(y_test, predictions)}')
    print(f'recall score: {recall_score(y_test, predictions)}')
    print(f'f1 score: {f1_score(y_test,  predictions)}')

In [None]:
# Credit to this notebook: https://www.kaggle.com/nmaguette/up-to-date-list-of-slangs-for-text-preprocessing
abbreviations = {
    "$" : " dollar ",
    "€" : " euro ",
    "4ao" : "for adults only",
    "a.m" : "before midday",
    "a3" : "anytime anywhere anyplace",
    "aamof" : "as a matter of fact",
    "acct" : "account",
    "adih" : "another day in hell",
    "afaic" : "as far as i am concerned",
    "afaict" : "as far as i can tell",
    "afaik" : "as far as i know",
    "afair" : "as far as i remember",
    "afk" : "away from keyboard",
    "app" : "application",
    "approx" : "approximately",
    "apps" : "applications",
    "asap" : "as soon as possible",
    "asl" : "age, sex, location",
    "atk" : "at the keyboard",
    "ave." : "avenue",
    "aymm" : "are you my mother",
    "ayor" : "at your own risk", 
    "b&b" : "bed and breakfast",
    "b+b" : "bed and breakfast",
    "b.c" : "before christ",
    "b2b" : "business to business",
    "b2c" : "business to customer",
    "b4" : "before",
    "b4n" : "bye for now",
    "b@u" : "back at you",
    "bae" : "before anyone else",
    "bak" : "back at keyboard",
    "bbbg" : "bye bye be good",
    "bbc" : "british broadcasting corporation",
    "bbias" : "be back in a second",
    "bbl" : "be back later",
    "bbs" : "be back soon",
    "be4" : "before",
    "bfn" : "bye for now",
    "blvd" : "boulevard",
    "bout" : "about",
    "brb" : "be right back",
    "bros" : "brothers",
    "brt" : "be right there",
    "bsaaw" : "big smile and a wink",
    "btw" : "by the way",
    "bwl" : "bursting with laughter",
    "c/o" : "care of",
    "cet" : "central european time",
    "cf" : "compare",
    "cia" : "central intelligence agency",
    "csl" : "can not stop laughing",
    "cu" : "see you",
    "cul8r" : "see you later",
    "cv" : "curriculum vitae",
    "cwot" : "complete waste of time",
    "cya" : "see you",
    "cyt" : "see you tomorrow",
    "dae" : "does anyone else",
    "dbmib" : "do not bother me i am busy",
    "diy" : "do it yourself",
    "dm" : "direct message",
    "dwh" : "during work hours",
    "e123" : "easy as one two three",
    "eet" : "eastern european time",
    "eg" : "example",
    "embm" : "early morning business meeting",
    "encl" : "enclosed",
    "encl." : "enclosed",
    "etc" : "and so on",
    "faq" : "frequently asked questions",
    "fawc" : "for anyone who cares",
    "fb" : "facebook",
    "fc" : "fingers crossed",
    "fig" : "figure",
    "fimh" : "forever in my heart", 
    "ft." : "feet",
    "ft" : "featuring",
    "ftl" : "for the loss",
    "ftw" : "for the win",
    "fwiw" : "for what it is worth",
    "fyi" : "for your information",
    "g9" : "genius",
    "gahoy" : "get a hold of yourself",
    "gal" : "get a life",
    "gcse" : "general certificate of secondary education",
    "gfn" : "gone for now",
    "gg" : "good game",
    "gl" : "good luck",
    "glhf" : "good luck have fun",
    "gmt" : "greenwich mean time",
    "gmta" : "great minds think alike",
    "gn" : "good night",
    "g.o.a.t" : "greatest of all time",
    "goat" : "greatest of all time",
    "goi" : "get over it",
    "gps" : "global positioning system",
    "gr8" : "great",
    "gratz" : "congratulations",
    "gyal" : "girl",
    "h&c" : "hot and cold",
    "hp" : "horsepower",
    "hr" : "hour",
    "hrh" : "his royal highness",
    "ht" : "height",
    "ibrb" : "i will be right back",
    "ic" : "i see",
    "icq" : "i seek you",
    "icymi" : "in case you missed it",
    "idc" : "i do not care",
    "idgadf" : "i do not give a damn fuck",
    "idgaf" : "i do not give a fuck",
    "idk" : "i do not know",
    "ie" : "that is",
    "i.e" : "that is",
    "ifyp" : "i feel your pain",
    "ig" : "instagram",
    "iirc" : "if i remember correctly",
    "ilu" : "i love you",
    "ily" : "i love you",
    "imho" : "in my humble opinion",
    "imo" : "in my opinion",
    "imu" : "i miss you",
    "iow" : "in other words",
    "irl" : "in real life",
    "j4f" : "just for fun",
    "jic" : "just in case",
    "jk" : "just kidding",
    "jsyk" : "just so you know",
    "l8r" : "later",
    "lb" : "pound",
    "lbs" : "pounds",
    "ldr" : "long distance relationship",
    "lmao" : "laugh my ass off",
    "lmfao" : "laugh my fucking ass off",
    "lol" : "laughing out loud",
    "ltd" : "limited",
    "ltns" : "long time no see",
    "m8" : "mate",
    "mf" : "motherfucker",
    "mfs" : "motherfuckers",
    "mfw" : "my face when",
    "mofo" : "motherfucker",
    "mph" : "miles per hour",
    "mr" : "mister",
    "mrw" : "my reaction when",
    "ms" : "miss",
    "mte" : "my thoughts exactly",
    "nagi" : "not a good idea",
    "nbc" : "national broadcasting company",
    "nbd" : "not big deal",
    "nfs" : "not for sale",
    "ngl" : "not going to lie",
    "nhs" : "national health service",
    "nrn" : "no reply necessary",
    "nsfl" : "not safe for life",
    "nsfw" : "not safe for work",
    "nth" : "nice to have",
    "nvr" : "never",
    "nyc" : "new york city",
    "oc" : "original content",
    "og" : "original",
    "ohp" : "overhead projector",
    "oic" : "oh i see",
    "omdb" : "over my dead body",
    "omg" : "oh my god",
    "omw" : "on my way",
    "p.a" : "per annum",
    "p.m" : "after midday",
    "pm" : "prime minister",
    "poc" : "people of color",
    "pov" : "point of view",
    "pp" : "pages",
    "ppl" : "people",
    "prw" : "parents are watching",
    "ps" : "postscript",
    "pt" : "point",
    "ptb" : "please text back",
    "pto" : "please turn over",
    "qpsa" : "what happens", #"que pasa",
    "ratchet" : "rude",
    "rbtl" : "read between the lines",
    "rlrt" : "real life retweet", 
    "rofl" : "rolling on the floor laughing",
    "roflol" : "rolling on the floor laughing out loud",
    "rotflmao" : "rolling on the floor laughing my ass off",
    "rt" : "retweet",
    "ruok" : "are you ok",
    "sfw" : "safe for work",
    "sk8" : "skate",
    "smh" : "shake my head",
    "sq" : "square",
    "srsly" : "seriously", 
    "ssdd" : "same stuff different day",
    "tbh" : "to be honest",
    "tbs" : "tablespoonful",
    "tbsp" : "tablespoonful",
    "tfw" : "that feeling when",
    "thks" : "thank you",
    "tho" : "though",
    "thx" : "thank you",
    "tia" : "thanks in advance",
    "til" : "today i learned",
    "tl;dr" : "too long i did not read",
    "tldr" : "too long i did not read",
    "tmb" : "tweet me back",
    "tntl" : "trying not to laugh",
    "ttyl" : "talk to you later",
    "u" : "you",
    "u2" : "you too",
    "u4e" : "yours for ever",
    "utc" : "coordinated universal time",
    "w/" : "with",
    "w/o" : "without",
    "w8" : "wait",
    "wassup" : "what is up",
    "wb" : "welcome back",
    "wtf" : "what the fuck",
    "wtg" : "way to go",
    "wtpa" : "where the party at",
    "wuf" : "where are you from",
    "wuzup" : "what is up",
    "wywh" : "wish you were here",
    "yd" : "yard",
    "ygtr" : "you got that right",
    "ynk" : "you never know",
    "zzz" : "sleeping bored and tired"
}


def convert_abbrev(word):
    # Case-insensitive lookup; unknown words are returned unchanged
    return abbreviations.get(word.lower(), word)

# This list of contractions also comes from a Kaggle notebook, which cites the following
# Stack Overflow post as its source: http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are",
"thx"   : "thanks",
"didnt" : "did not"
}


def remove_contractions(text):
    # Case-insensitive lookup; tokens that are not contractions are returned unchanged
    return contractions.get(text.lower(), text)

def clean_text(text):
    words = text.split(' ')
    words = [convert_abbrev(word) for word in words]
    words = [remove_contractions(word) for word in words]
    text = ' '.join([word for word in words if not word.startswith('@')])
    return text
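
To make the word-level behaviour of `clean_text` concrete, here is a minimal, self-contained sketch. It uses tiny stand-in dictionaries, and the names `demo_abbreviations`, `demo_contractions` and `demo_clean_text` are made up for this example. It illustrates that a lookup only hits when the whole space-separated token matches a key, so a token with trailing punctuation such as `idk,` is left untouched.

```python
# Hypothetical miniature versions of the lookup tables above.
demo_abbreviations = {"idk": "i do not know", "u": "you"}
demo_contractions = {"can't": "cannot"}

def demo_clean_text(text):
    """Expand abbreviations and contractions token by token, then drop @mentions."""
    words = text.split(' ')
    words = [demo_abbreviations.get(w.lower(), w) for w in words]
    words = [demo_contractions.get(w.lower(), w) for w in words]
    return ' '.join(w for w in words if not w.startswith('@'))

first = demo_clean_text("@bob idk if u can't come")
second = demo_clean_text("idk, maybe")
print(first)   # i do not know if you cannot come
print(second)  # idk, maybe
```

If punctuation-attached tokens matter for the task, stripping punctuation before the lookup would be one option, at the cost of altering the text RoBERTa sees.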

In [None]:
url_train = 'https://raw.githubusercontent.com/fsicardir/datos-tp2/master/dataset/train.csv?token=AFVAIUVCNNLG2DE4LNMEN2C7HMHQE'
url_test = 'https://raw.githubusercontent.com/fsicardir/datos-tp2/master/dataset/test.csv?token=AFVAIUWNQDPWBVOREJGS2727HMHPG'

df_train = pd.read_csv(url_train)
df_test = pd.read_csv(url_test)

# Remove URLs (http and https), keeping any text that follows them
df_train['text'] = df_train['text'].str.replace(r'https?://\S+', '', regex=True)
df_test['text'] = df_test['text'].str.replace(r'https?://\S+', '', regex=True)

df_train['text'] = df_train['text'].apply(clean_text)
df_test['text'] = df_test['text'].apply(clean_text)
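
When stripping URLs with a regular expression, the end of the pattern matters: a pattern ending in `.*` deletes everything from the URL to the end of the line, while `\S+` stops at the first whitespace. The following sketch compares both on a made-up tweet (the sample text and variable names are assumptions for illustration only).

```python
import re

sample = "Fire near the station http://t.co/abc stay safe"

# Greedy: removes the URL and everything after it on the line.
greedy = re.sub(r'https?://.*', '', sample)

# Bounded: removes only the run of non-whitespace characters that forms the URL.
bounded = re.sub(r'https?://\S+', '', sample)

print(repr(greedy))   # 'Fire near the station '
print(repr(bounded))  # 'Fire near the station  stay safe'
```

Since tweets often carry their link at the very end, both patterns frequently give the same result, but the bounded one is safer when text follows the link.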

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Encode every tweet as exactly 40 token ids: longer tweets are truncated, shorter ones padded.
train_text = np.array([tokenizer.encode(text,
                                        add_special_tokens=True,
                                        max_length=40, truncation=True,
                                        padding='max_length') for text in df_train["text"]])

test_text = np.array([tokenizer.encode(text,
                                       add_special_tokens=True,
                                       max_length=40, truncation=True,
                                       padding='max_length') for text in df_test["text"]])
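
The tokenizer call above yields fixed-length rows of 40 ids. As a rough, tokenizer-free sketch of what that padding and truncation amounts to, the hypothetical helper below wraps a list of token ids in the start and end markers and right-pads it. It assumes the `roberta-base` special-token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2); the content ids are made up.

```python
def encode_fixed_length(token_ids, max_length=40, bos_id=0, eos_id=2, pad_id=1):
    """Wrap token ids in <s> ... </s>, truncate to max_length, then right-pad."""
    body = token_ids[:max_length - 2]      # leave room for the two special tokens
    ids = [bos_id] + body + [eos_id]
    return ids + [pad_id] * (max_length - len(ids))

short = encode_fixed_length([100, 200, 300], max_length=8)
long = encode_fixed_length(list(range(10, 60)), max_length=8)
print(short)  # [0, 100, 200, 300, 2, 1, 1, 1]
print(long)   # [0, 10, 11, 12, 13, 14, 15, 2]
```

Note that no attention mask is built here or in the notebook, so the model also attends to the padding positions.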

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_text, df_train.target, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6090, 40), (1523, 40), (6090,), (1523,))

# RoBERTa + CNN with multiple filters

This is the same network we used in another notebook, where it gave acceptable results. Let's see whether it can do better with RoBERTa behind it.
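
As a rough sketch of the shapes involved in the multi-filter head defined in this section, the snippet below computes the output length and parameter count of each convolution, assuming stride 1 and `valid` padding (the Keras defaults for `Conv1D`). The helper names are made up for this example; the constants are the ones used in the notebook.

```python
SEQ_LEN, HIDDEN, FILTERS = 40, 768, 128   # sequence length, RoBERTa hidden size, filters per conv

def conv1d_valid_length(seq_len, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return seq_len - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

for k in (2, 3, 4):
    print(k, (conv1d_valid_length(SEQ_LEN, k), FILTERS), conv1d_params(k, HIDDEN, FILTERS))

pooled_features = 3 * FILTERS              # GlobalMaxPooling1D keeps one value per filter
dense_params = pooled_features * 64 + 64   # Dense(64) on the concatenated vector
print(pooled_features, dense_params)
```

Each branch therefore contributes 128 values regardless of its kernel size, and the dense layer sees a 384-dimensional vector.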

In [None]:
inputs = Input((40,), dtype=int32)
roberta_pretrained = TFRobertaModel.from_pretrained('roberta-base')
# Index 0 holds the per-token hidden states, shape (batch, 40, 768); the pooled output is not used.
sequence = roberta_pretrained(inputs)[0]

conv1 = Conv1D(128, kernel_size=3, activation='relu', name='conv_size_3')(sequence)
conv1 = GlobalMaxPooling1D()(conv1)

conv2 = Conv1D(128, kernel_size=4, activation='relu', name='conv_size_4')(sequence)
conv2 = GlobalMaxPooling1D()(conv2)

conv3 = Conv1D(128, kernel_size=2, activation='relu', name='conv_size_2')(sequence)
conv3 = GlobalMaxPooling1D()(conv3)

pooling = concatenate([conv1, conv2, conv3])

dense = Dense(64, activation='relu', name='dense_layer')(pooling)
dense = Dropout(0.5)(dense)

predictions = Dense(1, activation="sigmoid", name="predictions")(dense)
model = Model(inputs, predictions)

model.compile(optimizer='adam', loss="binary_crossentropy",  metrics=["accuracy"])

Some weights of the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFRobertaModel for predictions without further training.


In [None]:
def rate(epoch):
    return 1.5e-5/(epoch + 1)

scheduler = LearningRateScheduler(rate)
EPOCHS = 3
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[scheduler], epochs=EPOCHS, verbose=True)

Epoch 1/3
Epoch 2/3
Epoch 3/3
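
For reference, this sketch evaluates the `rate` schedule for each of the three epochs. It relies on Keras passing the 0-based epoch index to the function given to `LearningRateScheduler`; the `schedule` list is only for illustration.

```python
def rate(epoch):
    # Same decay as in the training cell: the base rate divided by the 1-based epoch number.
    return 1.5e-5 / (epoch + 1)

EPOCHS = 3
schedule = [rate(epoch) for epoch in range(EPOCHS)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch + 1}: lr = {lr:.2e}")
```

So the learning rate starts at 1.5e-5, halves for the second epoch and drops to a third of the initial value for the last one.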


In [None]:
preds = model.predict(X_test)

preds = [1 if x >= 0.5 else 0 for x in preds]
metrics(preds, y_test)

Verdaderos Negativos: 783
Falsos Negativos: 146
Verdaderos Positivos: 503
Falsos Positivos: 91

precision score: 0.8468013468013468
recall score: 0.7750385208012327
f1 score: 0.8093322606596943
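
The scores above follow directly from the four confusion-matrix counts. A short worked check, using only the numbers printed by this cell:

```python
tn, fn, tp, fp = 783, 146, 503, 91   # counts printed above

precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tn + fn + tp + fp)             # size of the validation split
print(precision, recall, f1)
```

The counts add up to the 1523 validation tweets, and the three ratios reproduce the reported precision, recall and F1.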


In [None]:
test_preds = model.predict(test_text)

In [None]:
test_preds = [1 if x >= 0.5 else 0 for x in test_preds]

df_test['target'] = test_preds

In [None]:
df_test[['id', 'target']].to_csv('roberta-cnn-multi-filter.csv', index=False)

In [None]:
# Save the model weights, since we liked its performance.
model.save_weights('roberta-cnn.h5')

# SVM on features extracted by the CNN

In [None]:
extractor = Model(inputs, model.get_layer('dense_layer').output)
features_train = extractor.predict(X_train)
features_val = extractor.predict(X_test)

In [None]:
svc = svm.SVC(probability=True, C=0.5, random_state=42)

svc.fit(features_train, y_train)

svc_preds = svc.predict(features_val)
metrics(svc_preds, y_test)

In [None]:
# These are dense-layer features for the Kaggle test set, not predictions
test_features = extractor.predict(test_text)
kaggle_preds = svc.predict(test_features)

In [None]:
df_test['target'] = kaggle_preds
df_test[['id', 'target']].to_csv('roberta-cnn-multi-filter-into-svc.csv', index=False)

# RoBERTa + simple LSTM

We tried this model without `return_sequences=True` and it gave worse results.

So we use max pooling to flatten the output of the LSTM layer, which gave decent results.
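
The difference between the two options can be sketched in plain Python. With `return_sequences=False` the LSTM layer returns only its output at the last time step, whereas with `return_sequences=True` followed by global max pooling each unit contributes its maximum over all time steps. The numbers below are made up and `global_max_pool` is a hypothetical helper, not a Keras function.

```python
# Made-up LSTM outputs for one tweet: 4 time steps, 3 units each.
lstm_outputs = [
    [0.1, -0.2, 0.5],
    [0.9, 0.0, -0.3],
    [-0.4, 0.7, 0.2],
    [0.2, 0.1, 0.1],
]

def global_max_pool(sequence):
    """Maximum of each unit over the time axis, like GlobalMaxPooling1D."""
    return [max(column) for column in zip(*sequence)]

last_step = lstm_outputs[-1]             # what return_sequences=False would hand on
pooled = global_max_pool(lstm_outputs)   # what the model in this section feeds to Dense

print(last_step)  # [0.2, 0.1, 0.1]
print(pooled)     # [0.9, 0.7, 0.5]
```

Pooling lets a strong signal from any position in the tweet reach the classifier, which may be one reason it worked better here.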

In [None]:
input_ids = Input((40,), dtype=int32)
roberta_pretrained = TFRobertaModel.from_pretrained('roberta-base')
# Index 0 holds the per-token hidden states, shape (batch, 40, 768); the pooled output is not used.
sequence = roberta_pretrained(input_ids)[0]

lstm = LSTM(units=128, return_sequences=True)(sequence)
lstm = GlobalMaxPooling1D()(lstm)
dense = Dense(32, activation='relu', name='dense_layer')(lstm)
dense = Dropout(0.5)(dense)
predictions = Dense(1, activation='sigmoid')(dense)

model_lstm = Model(input_ids, predictions)
model_lstm.compile(optimizer='adam', loss="binary_crossentropy",  metrics=["accuracy"])
model_lstm.summary()

Some weights of the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Model: "functional_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 40)]              0         
_________________________________________________________________
tf_roberta_model_5 (TFRobert ((None, 40, 768), (None,  124645632 
_________________________________________________________________
lstm_5 (LSTM)                (None, 40, 128)           459264    
_________________________________________________________________
global_max_pooling1d_6 (Glob (None, 128)               0         
_________________________________________________________________
dense_layer (Dense)          (None, 32)                4128      
_________________________________________________________________
dropout_233 (Dropout)        (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)               

In [None]:
# We also tried 2.5e-5, 0.5e-5 and 3.5e-5, but none of them beat this value.
def rate(epoch):
    return 1.5e-5/(epoch + 1)

# Updates the optimizer's learning rate at the start of each epoch.
scheduler = LearningRateScheduler(rate)
EPOCHS = 3
history = model_lstm.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[scheduler], epochs=EPOCHS, verbose=True)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
preds = model_lstm.predict(X_test)

preds = [1 if x >= 0.5 else 0 for x in preds]
metrics(preds, y_test)

Verdaderos Negativos: 817
Falsos Negativos: 163
Verdaderos Positivos: 486
Falsos Positivos: 57

precision score: 0.8950276243093923
recall score: 0.74884437596302
f1 score: 0.8154362416107384


In [None]:
test_preds = model_lstm.predict(test_text)
test_preds = [1 if x >= 0.5 else 0 for x in test_preds]

df_test['target'] = test_preds
df_test[['id', 'target']].to_csv('roberta-lstm.csv', index=False)

In [None]:
# Save the model weights, since we liked its performance.
model_lstm.save_weights('roberta-lstm.h5')

# SVC on the features extracted by the previous model

In [None]:
extractor = Model(input_ids, model_lstm.get_layer('dense_layer').output)
features_train = extractor.predict(X_train)
features_val = extractor.predict(X_test)

In [None]:
svc = svm.SVC(probability=True, C=0.5, random_state=42)

svc.fit(features_train, y_train)

svc_preds = svc.predict(features_val)
metrics(svc_preds, y_test)

Verdaderos Negativos: 807
Falsos Negativos: 158
Verdaderos Positivos: 491
Falsos Positivos: 67

precision score: 0.8799283154121864
recall score: 0.7565485362095532
f1 score: 0.8135874067937034


In [None]:
test_features = extractor.predict(test_text)
kaggle_preds = svc.predict(test_features)
df_test['target'] = kaggle_preds
df_test[['id', 'target']].to_csv('roberta-lstm-into-svc.csv', index=False)

# Averaging the CNN and LSTM predictions

In [None]:
cnn_preds = model.predict(test_text)
lstm_preds = model_lstm.predict(test_text)

In [None]:
# Weighted average of the predicted probabilities: 0.4 for the CNN, 0.6 for the LSTM
final_preds = 0.4 * cnn_preds + 0.6 * lstm_preds


In [None]:
final_preds = [1 if x >= 0.5 else 0 for x in final_preds]
final_preds[:15]

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
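
To illustrate how the 0.4 / 0.6 weighting behaves when the two networks disagree, here is a minimal sketch on made-up probabilities. The function name `weighted_vote` and its default arguments are assumptions for this example; the weights and the 0.5 threshold are the ones used above.

```python
def weighted_vote(p_cnn, p_lstm, w_cnn=0.4, w_lstm=0.6, threshold=0.5):
    """Blend two probabilities and threshold the result into a 0/1 label."""
    blended = w_cnn * p_cnn + w_lstm * p_lstm
    return 1 if blended >= threshold else 0

votes = [
    weighted_vote(0.7, 0.3),    # CNN says 1, LSTM says 0, blend 0.46
    weighted_vote(0.3, 0.7),    # CNN says 0, LSTM says 1, blend 0.54
    weighted_vote(0.9, 0.45),   # a confident CNN can still outvote a hesitant LSTM
]
print(votes)  # [0, 1, 1]
```

With symmetric disagreement the LSTM wins because of its larger weight, but since probabilities rather than labels are averaged, a sufficiently confident CNN can still flip the result.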

In [None]:
df_test['target'] = final_preds
df_test[['id', 'target']].to_csv('roberta-cnn-lstm-avg.csv', index=False)