## **Import libraries**

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
import tensorflow as tf

## **Load data**

In [15]:
df = pd.read_csv('labeled_data.csv')

In [16]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [17]:
df.head()

| | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


## **Data preprocessing**

In [18]:
# Keep only the text and label columns, drop rows with missing values,
# then cast the labels to integers (casting first would fail on NaN labels)
df = df[['tweet', 'class']].dropna().astype({'class': int})

In [19]:
# Split into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['tweet'], df['class'], test_size=0.2, random_state=42
)
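The split above is purely random, and the classes are strongly imbalanced (the test set reported in the evaluation holds 290, 3832 and 835 examples per class). `train_test_split` accepts a `stratify` argument that keeps class proportions equal in both sets. The sketch below illustrates what stratification does using NumPy only; the `stratified_split` helper and the toy labels are hypothetical and not part of the original notebook.

```python
import numpy as np

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved.

    Hypothetical helper for illustration only; in practice,
    train_test_split(..., stratify=labels) handles this.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the test share
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

# Toy labels with a similar kind of imbalance (made-up data)
toy_labels = np.array([0] * 10 + [1] * 70 + [2] * 20)
train_idx, test_idx = stratified_split(toy_labels)

# Every class contributes 20% of its examples to the test set
print(np.bincount(toy_labels[test_idx]))
print(np.bincount(toy_labels[train_idx]))
```

Note that switching the notebook's split to a stratified one would change which tweets end up in the test set, so the numbers reported below would no longer match exactly.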

## **Tokenization with Hugging Face**

In [20]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [21]:
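def tokenize_data(texts, labels, tokenizer, max_len=128):
    """Tokenize each text and return (input_ids, attention_masks, labels) as tensors."""
    input_ids, attention_masks, label_list = [], [], []
    for text, label in zip(texts, labels):
        # Pad or truncate every tweet to exactly max_len tokens
        tokenized = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="tf"
        )
        # With return_tensors="tf", each encoding has shape (1, max_len)
        input_ids.append(tokenized['input_ids'])
        attention_masks.append(tokenized['attention_mask'])
        label_list.append(label)

    # Stacking the per-example tensors yields shape (num_samples, 1, max_len)
    return (
        tf.convert_to_tensor(input_ids),
        tf.convert_to_tensor(attention_masks),
        tf.convert_to_tensor(label_list)
    )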

In [22]:
max_len = 128
train_inputs, train_masks, train_labels = tokenize_data(train_texts, train_labels, tokenizer, max_len)
test_inputs, test_masks, test_labels = tokenize_data(test_texts, test_labels, tokenizer, max_len)

In [None]:
print(train_inputs.shape)  # Expected shape: (num_samples, seq_length)
print(train_masks.shape)   # Expected shape: (num_samples, seq_length)

(19826, 1, 128)
(19826, 1, 128)


In [None]:
train_inputs = tf.squeeze(train_inputs, axis=1)
train_masks = tf.squeeze(train_masks, axis=1)
test_inputs = tf.squeeze(test_inputs, axis=1)
test_masks = tf.squeeze(test_masks, axis=1)

In [27]:
print(train_inputs.shape)  # Expected shape: (num_samples, seq_length)
print(train_masks.shape)   # Expected shape: (num_samples, seq_length)

(19826, 128)
(19826, 128)
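
The extra axis seen in the first shape check comes from `return_tensors="tf"`: each encoding is returned as a batch of one, with shape `(1, max_len)`, so stacking N of them gives `(N, 1, max_len)`. The NumPy sketch below reproduces this with placeholder arrays (no tokenizer involved, and the sample count is made up) and shows that squeezing axis 1 restores `(N, max_len)`.

```python
import numpy as np

max_len = 128
num_samples = 5  # made-up size, for illustration only

# Stand-in for the per-example encodings: each one is a batch of size 1
per_example = [np.zeros((1, max_len), dtype=np.int32) for _ in range(num_samples)]

# Stacking keeps the singleton batch axis of every example
stacked = np.stack(per_example)
print(stacked.shape)

# Removing axis 1 gives the (num_samples, seq_length) layout the model expects
squeezed = np.squeeze(stacked, axis=1)
print(squeezed.shape)
```

Calling the tokenizer once on the whole list of texts, instead of once per text, should avoid the extra axis altogether, although that variant was not run in this notebook.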


## **Load the pre-trained BERT model**

In [28]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Configure the optimizer, loss, and metric
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Train the model**

In [29]:
batch_size = 32
epochs = 3

history = model.fit(
    [train_inputs, train_masks],
    train_labels,
    validation_data=([test_inputs, test_masks], test_labels),
    batch_size=batch_size,
    epochs=epochs
)

Epoch 1/3

Epoch 2/3
Epoch 3/3
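
With classes this imbalanced, one common option is to weight the loss per class. The sketch below computes "balanced" weights, `n_samples / (n_classes * count)`, the same formula scikit-learn uses for `class_weight="balanced"`. Keras `Model.fit` has a `class_weight` argument that takes a dictionary of this form. The `balanced_class_weights` helper and the toy labels are hypothetical, and whether weighting actually improves this model was not tested here.

```python
import numpy as np

def balanced_class_weights(labels, num_classes):
    """Map each class index to n_samples / (n_classes * class_count).

    Hypothetical helper; assumes every class appears at least once.
    """
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=num_classes)
    weights = len(labels) / (num_classes * counts)
    return {cls: float(w) for cls, w in enumerate(weights)}

# Toy labels (made-up data): the rare class 0 receives the largest weight
toy_labels = np.array([0] * 10 + [1] * 70 + [2] * 20)
class_weight = balanced_class_weights(toy_labels, num_classes=3)
print(class_weight)
```

In the notebook, the dictionary would be built from the training labels and passed as `class_weight=class_weight` in the `model.fit` call.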


## **Evaluate the model**

In [30]:
preds = model.predict([test_inputs, test_masks])
pred_classes = np.argmax(preds.logits, axis=1)

print("Classification Report:")
print(classification_report(test_labels, pred_classes, target_names=['Hate Speech', 'Offensive Language', 'Neither']))

print("Confusion Matrix:")
print(confusion_matrix(test_labels, pred_classes))

Classification Report:
                    precision    recall  f1-score   support

       Hate Speech       0.52      0.31      0.39       290
Offensive Language       0.92      0.97      0.95      3832
           Neither       0.90      0.83      0.86       835

          accuracy                           0.91      4957
         macro avg       0.78      0.70      0.73      4957
      weighted avg       0.90      0.91      0.90      4957

Confusion Matrix:
[[  90  188   12]
 [  56 3714   62]
 [  28  118  689]]
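
The per-class numbers in the report can be recovered directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes precision, recall, F1 and accuracy with NumPy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [90, 188, 12],
    [56, 3714, 62],
    [28, 118, 689],
])
class_names = ['Hate Speech', 'Offensive Language', 'Neither']

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as this class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in this class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name:>18}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The first row explains the weak Hate Speech recall: of the 290 hate speech tweets in the test set, 188 were predicted as Offensive Language and only 90 were classified correctly.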
