# Error Analysis

In this notebook we run an error analysis on our neural model.

The idea is to see how the LSTM's neurons activate, when the gates saturate, and so on.
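
To make "saturation" concrete: a sigmoid gate is usually called saturated when its activation sits very close to 0 or 1. The sketch below measures that on random stand-in activations, not on the model trained in this notebook. The helper name `saturation_fractions` and the 0.1 / 0.9 thresholds are assumptions chosen only for illustration.

```python
import numpy as np


def saturation_fractions(gate_activations, low=0.1, high=0.9):
    """Return, per unit, the fraction of timesteps where a gate is saturated.

    gate_activations: array-like of shape (timesteps, units), values in [0, 1].
    low / high are hypothetical thresholds, not values used by any library.
    """
    gate_activations = np.asarray(gate_activations, dtype=float)
    closed = (gate_activations < low).mean(axis=0)   # gate almost closed
    opened = (gate_activations > high).mean(axis=0)  # gate almost open
    return closed, opened


# Stand-in data: sigmoid of random pre-activations, 40 timesteps x 4 units
rng = np.random.RandomState(2019)
gates = 1.0 / (1.0 + np.exp(-rng.normal(scale=3.0, size=(40, 4))))

closed, opened = saturation_fractions(gates)
for unit, (c, o) in enumerate(zip(closed, opened)):
    print("unit {}: closed {:.2f}, open {:.2f}".format(unit, c, o))
```

With real gate activations in place of `gates`, units whose two fractions add up to nearly 1 would be the ones behaving almost like binary switches.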


In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import csv
import numpy as np
import tensorflow as tf
import random
import torch

torch.manual_seed(2019)
np.random.seed(2019)
tf.random.set_random_seed(2019)
random.seed(2019)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

df_dev = pd.read_table("../../../data/es/dev_es.tsv", index_col="id", quoting=csv.QUOTE_NONE)
df_train = pd.read_table("../../../data/es/train_es.tsv", index_col="id", quoting=csv.QUOTE_NONE)
df_test = pd.read_table("../../../data/es/reference_es.tsv", header=None, 
                        names=["text", "HS", "TR", "AG"], quoting=csv.QUOTE_NONE)


text_train, y_train = df_train["text"], df_train["HS"]
text_dev, y_dev = df_dev["text"], df_dev["HS"]
text_test, y_test = df_test["text"], df_test["HS"]

print("Training instances: {}".format(len(df_train)))
print("Development instances: {}".format(len(df_dev)))
print("Test instances: {}".format(len(df_test)))


Training instances: 4500
Development instances: 500
Test instances: 1600


We load the dev data with our own annotations.

In [2]:
df_dev = pd.read_csv("dev_with_annotations.es.csv", index_col="id")

df_dev = df_dev[df_dev["text"].notnull()]


We load the fastText and ELMo models.


In [4]:
%%capture
from elmoformanylangs import Embedder
import fastText
import os


fasttext_model = fastText.load_model(os.path.expanduser("../../../WordVectors/UBA_w3_300.bin"))
elmo_embedder = Embedder("../../../models/elmo/es/")

In [5]:
elmo_embedder.sents2elmo(["esto es una prueba"]);

## GRU + Global Max Pooling

In [10]:
from hate.nn import ElmoModel
from keras.optimizers import Adam

max_length = 40

model = ElmoModel(
    max_length, fasttext_model=fasttext_model,
    elmo_embedder=elmo_embedder, 
    rnn_units=256, dropout=0.75,
    tokenize_args = {
        "preserve_case": False,
        "deaccent": False,
        "reduce_len": True,
        "strip_handles": True,
        "alpha_only": False,
        "stem": False
    }
)

#model.load_weights("../../../models/lstm_elmo.h5")


In [11]:

optimizer_args = {
    "lr": 0.0008,
    "decay": 0.01,
}

model.compile(loss='binary_crossentropy', 
              optimizer=Adam(**optimizer_args), 
              metrics=['accuracy'])

How does the tokenizer behave?

In [12]:
from nltk.tokenize import TweetTokenizer

nltk_tokenizer = TweetTokenizer(
    preserve_case=False, reduce_len=True, strip_handles=True)
tweet_prueba = "jajajaAJAjaj qué hdy culi4w @mauriciomacri #HashTag"

print("Our tokenizer: ", model._tokenizer.tokenize(tweet_prueba))
print("NLTK's: ", nltk_tokenizer.tokenize(tweet_prueba))

Our tokenizer:  ['jajajaajajaj', 'qué', 'hdy', 'culi', '4w', '#hashtag']
NLTK's:  ['jajajaajajaj', 'qué', 'hdy', 'culi', '4w', '#hashtag']
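
The output above shows three normalizations: lowercasing, removal of the `@` handle, and preserved hashtags. The sketch below is a rough pure-Python approximation of those steps plus length reduction. It is not the tokenizer used by the model: the regular expressions are simplified assumptions, the name `normalize_tweet` is hypothetical, and the whitespace split does not reproduce token-level behaviour such as `culi4w` becoming `culi`, `4w`.

```python
import re

# Simplified patterns, assumed for illustration only
HANDLE_RE = re.compile(r"(?<!\w)@\w+")   # an @mention not glued to a word
REPEAT_RE = re.compile(r"(.)\1{2,}")     # a character repeated 3+ times


def normalize_tweet(text):
    """Drop handles, cap character runs at three, lowercase, split on spaces."""
    text = HANDLE_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1\1", text)
    return text.lower().split()


print(normalize_tweet("Holaaaaaa @usuario qué TAL #HashTag"))
```

Capping runs at three characters keeps "holaaa" distinct from "hola" while collapsing arbitrarily long elongations into a single form.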


In [13]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

checkpointer = ModelCheckpoint('/tmp/lstm_model.h5', save_best_only=True, monitor='val_acc', verbose=1)
early_stopper = EarlyStopping(monitor='val_loss', patience=15)


model.fit(text_train, y_train, 
          callbacks=[checkpointer, early_stopper],
          validation_data=(text_dev, y_dev), epochs=100, batch_size=32)


Train on 4500 samples, validate on 500 samples
Epoch 1/100

Epoch 00001: val_acc improved from -inf to 0.70800, saving model to /tmp/lstm_model.h5
Epoch 2/100

Epoch 00002: val_acc improved from 0.70800 to 0.76400, saving model to /tmp/lstm_model.h5
Epoch 3/100

Epoch 00003: val_acc improved from 0.76400 to 0.79400, saving model to /tmp/lstm_model.h5
Epoch 4/100

Epoch 00004: val_acc did not improve from 0.79400
Epoch 5/100

Epoch 00005: val_acc did not improve from 0.79400
Epoch 6/100

Epoch 00006: val_acc improved from 0.79400 to 0.79600, saving model to /tmp/lstm_model.h5
Epoch 7/100

Epoch 00007: val_acc improved from 0.79600 to 0.80200, saving model to /tmp/lstm_model.h5
Epoch 8/100

Epoch 00008: val_acc improved from 0.80200 to 0.81200, saving model to /tmp/lstm_model.h5
Epoch 9/100

Epoch 00009: val_acc did not improve from 0.81200
Epoch 10/100

Epoch 00010: val_acc did not improve from 0.81200
Epoch 11/100

Epoch 00011: val_acc did not improve from 0.81200
Epoch 12/100

Epoch 0

<keras.callbacks.History at 0x7f08e2394cf8>
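
Note that the two callbacks watch different quantities: the checkpoint keeps the weights with the best `val_acc`, while early stopping waits on `val_loss`. The sketch below simulates that interaction in plain Python on hypothetical metric values (not the ones logged above). `simulate_callbacks` is an illustrative helper, not a Keras function, and it assumes the default `min_delta` of 0.

```python
def simulate_callbacks(val_acc, val_loss, patience):
    """Return (epoch whose weights are kept, epoch at which training stops).

    Epochs are 1-indexed. Mirrors save_best_only on accuracy and
    patience-based early stopping on loss.
    """
    best_acc, best_acc_epoch = float("-inf"), None
    best_loss, wait = float("inf"), 0
    for epoch, (acc, loss) in enumerate(zip(val_acc, val_loss), start=1):
        if acc > best_acc:        # checkpoint: overwrite only on improvement
            best_acc, best_acc_epoch = acc, epoch
        if loss < best_loss:      # early stopping: reset the counter
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:  # no loss improvement for `patience` epochs
                return best_acc_epoch, epoch
    return best_acc_epoch, len(val_acc)


# Hypothetical validation curves
val_acc = [0.70, 0.76, 0.75, 0.78, 0.77]
val_loss = [0.60, 0.55, 0.56, 0.57, 0.58]
print(simulate_callbacks(val_acc, val_loss, patience=3))
```

In this toy run the kept weights come from an epoch after the best validation loss, which is worth remembering when comparing the reported loss and accuracy.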

In [14]:
from hate.utils import print_evaluation
print("biGRU + MaxPool1D - Elmo+Embeddings -- \n\n")
print("Evaluation on dev")

model.load_weights(checkpointer.filepath)

print_evaluation(model, text_dev, y_dev)
print("\nEvaluation on test")

print_evaluation(model, text_test, y_test)

biGRU + MaxPool1D - Elmo+Embeddings -- 


Evaluation on dev
Loss           : 0.4686
Accuracy       : 0.8280
Precision(1)   : 0.8091
Precision(0)   : 0.8429
Precision(avg) : 0.8260

Recall(1)      : 0.8018
Recall(0)      : 0.8489
Recall(avg)    : 0.8254

F1(1)          : 0.8054
F1(0)          : 0.8459
F1(avg)        : 0.8257

Evaluation on test
Loss           : 0.5343
Accuracy       : 0.7306
Precision(1)   : 0.6514
Precision(0)   : 0.8053
Precision(avg) : 0.7283

Recall(1)      : 0.7530
Recall(0)      : 0.7170
Recall(avg)    : 0.7350

F1(1)          : 0.6985
F1(0)          : 0.7586
F1(avg)        : 0.7286
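
The averaged figures in the report are consistent with an unweighted mean of the two per-class values. As a worked example of how such a report can be derived from confusion counts, here is a small sketch. The counts are hypothetical and `binary_report` is an illustrative helper, not the `print_evaluation` function from `hate.utils`.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def binary_report(tp, fp, fn, tn):
    """Per-class and macro-averaged precision, recall and F1 for labels 1 and 0."""
    precision = {1: tp / (tp + fp), 0: tn / (tn + fn)}
    recall = {1: tp / (tp + fn), 0: tn / (tn + fp)}
    f1_score = {c: f1(precision[c], recall[c]) for c in (0, 1)}
    report = {"precision": precision, "recall": recall, "f1": f1_score}
    for scores in report.values():
        scores["avg"] = (scores[0] + scores[1]) / 2  # unweighted (macro) mean
    return report


# Hypothetical counts, not taken from this notebook's model
report = binary_report(tp=40, fp=10, fn=20, tn=30)
for metric in ("precision", "recall", "f1"):
    for key in (1, 0, "avg"):
        print("{}({}): {:.4f}".format(metric, key, report[metric][key]))
```

The macro average treats both classes equally regardless of their size, which matters here because the hateful class is the minority.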


## Error Analysis

Let's look at the tweets with the largest errors.

In [None]:
# Predict on the annotated dev texts so the scores line up with df_dev's rows
# (df_dev was reloaded and filtered above, so text_dev may no longer match it)
df_dev["proba"] = model.predict(df_dev["text"])
df_dev["PROFANITY"] = 0

# Correct predictions, using 0.5 as the decision threshold
true_positives = df_dev[(df_dev["HS"] == 1) & (df_dev["proba"] >= 0.5)].copy()
true_negatives = df_dev[(df_dev["HS"] == 0) & (df_dev["proba"] < 0.5)].copy()

# Errors, sorted so that the most confident mistakes come first.
# The threshold is >= 0.5, matching the one used for true positives.
false_positives = df_dev[(df_dev["HS"] == 0) & (df_dev["proba"] >= 0.5)].copy()
false_positives.sort_values("proba", ascending=False, inplace=True)

false_negatives = df_dev[(df_dev["HS"] == 1) & (df_dev["proba"] < 0.5)].copy()
false_negatives.sort_values("proba", ascending=True, inplace=True)

conf_matrix = pd.DataFrame([
    {"real": "hs=1", "pred_true": len(true_positives), "pred_false": len(false_negatives)},
    {"real": "hs=0", "pred_true": len(false_positives), "pred_false": len(true_negatives)}
])

conf_matrix.set_index("real", inplace=True)

print("False negatives: {}".format(len(false_negatives)))
print("False positives: {}".format(len(false_positives)))

conf_matrix[["pred_true", "pred_false"]]
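
As a cross-check on the hand-built matrix, the same counts can be obtained with `pd.crosstab`, and the errors can be ranked by how far the score is from the gold label. The sketch below runs on a toy DataFrame with made-up scores. The column names mirror the ones above, but `toy`, `pred` and `error` are names introduced only for this example.

```python
import pandas as pd

# Toy data: gold labels and made-up model scores
toy = pd.DataFrame({
    "HS":    [1, 1, 0, 0, 1, 0],
    "proba": [0.9, 0.2, 0.7, 0.1, 0.5, 0.4],
})

# Same 0.5 threshold as in the cell above
toy["pred"] = (toy["proba"] >= 0.5).astype(int)

# Rows are gold labels, columns are predictions
conf = pd.crosstab(toy["HS"], toy["pred"], rownames=["real"], colnames=["pred"])
print(conf)

# Misclassified rows, largest |score - label| first
toy["error"] = (toy["proba"] - toy["HS"]).abs()
worst = toy[toy["pred"] != toy["HS"]].sort_values("error", ascending=False)
print(worst)
```

Ranking by a single `error` column gives one list for both error types, which can be handy before splitting into false positives and false negatives.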

In [None]:
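from keras.models import Model

# elmo_input, emb_input and rnn_layer must be the network's input tensors and
# the RNN output tensor; they are not defined anywhere in this notebook
lstm_output_model = Model(inputs=[elmo_input, emb_input],
                          outputs=[rnn_layer])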

In [None]:

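# X_dev and X_emb_dev (the preprocessed ELMo and embedding inputs) are not
# defined in this notebook; np.newaxis adds the batch dimension for one tweet
ret = lstm_output_model.predict([X_dev[0][np.newaxis, ...],
                                 X_emb_dev[0][np.newaxis, ...]])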


In [None]:
ret[:, :, 0]
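
The slice above reads the activation of unit 0 at every timestep. To inspect it, each value needs to be paired with the token at that timestep. The sketch below does this on random stand-in data. It assumes the sequence is padded at the end, so token `i` sits at timestep `i`; with padding at the start the slice would have to be taken from the tail instead. The names `tokens`, `activations` and `activations_by_token` are hypothetical.

```python
import numpy as np

tokens = ["esto", "es", "una", "prueba"]  # hypothetical tokenized tweet
max_length, units = 6, 3

# Stand-in for the RNN outputs, shape (batch, timesteps, units)
rng = np.random.RandomState(0)
activations = np.tanh(rng.normal(size=(1, max_length, units)))


def activations_by_token(tokens, activations, unit):
    """Pair each token with one unit's activation at its timestep.

    Assumes a batch of one tweet and padding at the end of the sequence.
    """
    per_step = activations[0, :len(tokens), unit]
    return list(zip(tokens, per_step.tolist()))


for tok, act in activations_by_token(tokens, activations, unit=0):
    print("{:>10s} {:+.3f}".format(tok, act))
```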

In [None]:

# tokens_dev is not defined in this notebook and the loop body was never written
for tok in tokens_dev[0]:
    pass

In [None]:
cols = df_dev.columns
cols = cols.difference(["proba"])

df_dev[cols].to_csv("dev_with_annotations.es.csv")

## Proportion of Aggressive Tweets

In [None]:
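# AG == 1 marks aggressive tweets (the original counted AG == 0 by mistake)
print("Proportion of aggressive tweets:", sum(df_dev["AG"] == 1) / len(df_dev))

# Restrict to hateful tweets before correlating aggressiveness and target
hs = df_dev[df_dev["HS"] == 1]

print("AG - TR correlation:", hs["AG"].corr(hs["TR"]))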
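
Since `AG` and `TR` are both binary, the Pearson correlation computed above coincides with the phi coefficient of their 2x2 contingency table. The sketch below computes it directly from the four cell counts on made-up labels. `phi_coefficient` is an illustrative helper, and the toy lists are not taken from the dataset.

```python
from math import sqrt


def phi_coefficient(xs, ys):
    """Phi coefficient of two equally long sequences of 0/1 labels."""
    pairs = list(zip(xs, ys))
    n11 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n10 = sum(1 for x, y in pairs if x == 1 and y == 0)
    n01 = sum(1 for x, y in pairs if x == 0 and y == 1)
    n00 = sum(1 for x, y in pairs if x == 0 and y == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (n11 * n00 - n10 * n01) / denom


# Made-up labels for eight tweets
ag = [1, 1, 0, 0, 1, 0, 1, 0]
tr = [1, 0, 0, 0, 1, 0, 1, 1]
print(phi_coefficient(ag, tr))
```

Reading the correlation as a function of the four cell counts makes it easier to see which combination (for example aggressive and targeted) drives the value.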

## False Negatives

We are going to label profanity. I count as profanity any word in vulgar use (puta, perra, zorra, coño, negratas, musulmonos), but not words that mark racist discourse without being vulgar (negro, subsahariano).

In [None]:
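# Lowercase stems, matched as substrings of the lowercased tweet text
profane_words = [
    "sudaca", "puta", "polla", "perra", "zorra", "coño", "orto", "morra",
    "negrata", "pelotuda", "moromierda", "guarr",
]
# Flag a tweet as profane as soon as any stem appears in it
for idx, t in df_dev.iterrows():
    for w in profane_words:
        if w in t["text"].lower():
            df_dev.loc[idx, "PROFANITY"] = 1
            break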
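
Plain substring matching also fires when a stem appears inside an unrelated word, for example "orto" inside "corto". One possible refinement, sketched below, is to require the stem to start a word while still allowing any suffix. This is an alternative to the loop above, not what the notebook does, and it has its own trade-offs: it would miss compounds where the stem is not at the start, and a word such as "ortografía" would still match. `build_profanity_regex` and the shortened stem list are hypothetical.

```python
import re


def build_profanity_regex(stems):
    """Compile a pattern matching words that begin with any of the stems."""
    alternatives = "|".join(re.escape(s) for s in stems)
    # \b anchors the stem at a word start; \w* allows gender/plural suffixes
    return re.compile(r"\b(?:{})\w*".format(alternatives), flags=re.IGNORECASE)


stems = ["puta", "orto", "guarr"]  # shortened list, for illustration
pattern = build_profanity_regex(stems)

print(bool(pattern.search("sos una GUARRA")))   # stem at the start of a word
print(bool(pattern.search("me quedó corto")))   # "orto" only inside "corto"
print("orto" in "me quedó corto")               # plain substring check
```

Either way, a manual pass over the flagged and unflagged tweets remains necessary, as the hand-set label in the next cell suggests.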
    

In [None]:
df_dev.loc[24529, "PROFANITY"] = 1

In [None]:
print(df_dev[df_dev["PROFANITY"] == 0].shape[0])
df_dev[df_dev["PROFANITY"] == 0][169:]

In [None]:
pd.set_option('max_colwidth', 400)



print("Total = ", len(false_negatives))
print("Non-AG ({}) AG ({})".format(sum(false_negatives["AG"] == 0), sum(false_negatives["AG"] == 1)))
false_negatives[["text", "proba", "HS", "AG", "PROFANITY"]]

## False Positives

In [None]:
pd.set_option('max_colwidth', 300)



print("Total = ", len(false_positives))
print("Non-AG ({}) AG ({})".format(sum(false_positives["AG"] == 0), sum(false_positives["AG"] == 1)))
false_positives[["text", "proba", "HS", "AG"]]