# Written Natural Language Processing
## Graded Assignment 2
Francisco Javier Mercader Martínez

[Link to the generated files](https://upct-my.sharepoint.com/:u:/g/personal/franciscojavier_mercader_edu_upct_es/EVmgALHYQ3JIoaaXrV2jjI4Bn8B2T2y_-mKJCO_XCrM8gw?e=VFEAdP)

# Installing dependencies

In [1]:
!wget https://raw.githubusercontent.com/franjavi-upct-es/cid-upct/refs/heads/main/Pr%C3%A1cticas/3%C2%BA%20Curso/2%C2%BA%20Cuatrimestre/PNLE/Practica%202%20Evaluable/requirements.txt

--2025-05-06 09:59:25--  https://raw.githubusercontent.com/franjavi-upct-es/cid-upct/refs/heads/main/Pr%C3%A1cticas/3%C2%BA%20Curso/2%C2%BA%20Cuatrimestre/PNLE/Practica%202%20Evaluable/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 259 [text/plain]
Saving to: ‘requirements.txt’


2025-05-06 09:59:26 (3.80 MB/s) - ‘requirements.txt’ saved [259/259]



In [2]:
!pip install -r requirements.txt

Collecting praw (from -r requirements.txt (line 2))
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting fasttext (from -r requirements.txt (line 9))
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting datasets (from -r requirements.txt (line 18))
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting prawcore<3,>=2.4 (from praw->-r requirements.txt (line 2))
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw->-r requirements.txt (line 2))
  Downloading update_checker-0.18.0-py3-none-any.

In [3]:
!pip install torchvision
!pip install hf_xet
!pip install --upgrade transformers

Collecting hf_xet
  Downloading hf_xet-1.1.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (494 bytes)
Downloading hf_xet-1.1.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53.6 MB)
Installing collected packages: hf_xet
Successfully installed hf_xet-1.1.0


# 1. Imports and global configuration

**Description**

The following libraries are imported for:
- File operations, JSON, and regular expressions.
- Access to the Reddit API (`praw`).
- Numerical processing (`numpy`).
- Date handling.
- Models and utilities from `scikit-learn`, `fasttext`, `Hugging Face`, and `SentenceTransformers`.
- Stopwords and a stemmer from `nltk`.

The cell also defines the subreddits and the number of threads and comments to extract, and creates the `data` folder where the JSON files are stored.

In [4]:
import os
import json
import re
import praw
import numpy as np
from datetime import datetime
import fasttext
import warnings
import tempfile

# scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hugging Face
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    pipeline
)
from datasets import Dataset

# SBERT and similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# NLTK for lexical cleaning
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

import nltk
nltk.download('stopwords')

# Basic configuration
SUBREDDITS = ['technology', 'programming', 'machinelearning',
              'datascience', 'computerscience', 'gadgets']
THREADS_PER_SUB = 20
COMMENTS_PER_THREAD = 50
JSON_DIR = 'data'
os.makedirs(JSON_DIR, exist_ok=True)

STOPWORDS = set(stopwords.words('english')) | set(stopwords.words('spanish'))
STEMMER = SnowballStemmer('spanish')

warnings.filterwarnings('ignore')
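# Credentials are read from the environment instead of being hard-coded.
# REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumed variable names.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='pln-practica-2025',
    check_for_async=False
)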

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# 2. Text preprocessing functions
## 2.1 Timestamp conversion

Converts UNIX seconds to a human-readable format.

In [5]:
from datetime import timezone

def convertir_fecha(utc_timestamp):
    """
    Convert a UNIX timestamp to 'YYYY-MM-DD HH:MM:SS' in UTC.
    """
    return datetime.fromtimestamp(utc_timestamp, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
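
A quick check of the conversion on known epoch values can be useful. `datetime.fromtimestamp` uses the machine's local time zone unless a `tz` argument is given, so the sketch below pins UTC explicitly. `to_utc_string` is a hypothetical name used only for this illustration.

```python
from datetime import datetime, timezone

def to_utc_string(epoch_seconds):
    """Format epoch seconds as 'YYYY-MM-DD HH:MM:SS' in UTC (hypothetical helper)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

print(to_utc_string(0))      # 1970-01-01 00:00:00
print(to_utc_string(86400))  # 1970-01-02 00:00:00
```

Pinning the time zone keeps the stored dates identical regardless of where the notebook runs.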

## 2.2 Lexical cleaning
- Removes URLs, mentions, and Markdown links.
- Tokenizes, then filters stopwords and very short tokens.
- Applies stemming.

In [6]:
LEXICAL_PATTERN = re.compile(r"http\S+|www\.\S+|\[.*?\]\(.*?\)|@[A-Za-z0-9_]+")

def limpiar_texto(texto):
    """
    - Removes URLs, mentions, and Markdown links.
    - Lowercases the text and tokenizes it into words.
    - Filters stopwords and tokens shorter than 3 characters.
    - Applies stemming (Spanish Snowball stemmer, as configured in STEMMER).
    """
    texto_limpio = LEXICAL_PATTERN.sub('', texto)
    tokens = re.findall(r"\b\w+\b", texto_limpio.lower())
    procesados = [
        STEMMER.stem(tok)
        for tok in tokens
        if tok not in STOPWORDS and len(tok) > 2
    ]
    return " ".join(procesados)
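
To see what the cleaning steps do to a concrete string, the sketch below reuses the same regular expression and token filter on a made-up sentence. It is a simplified illustration: the stopword set is a tiny stand-in for the NLTK lists, stemming is left out so the example needs only the standard library, and `clean_without_stemming` is a hypothetical name.

```python
import re

# Same pattern as LEXICAL_PATTERN: URLs, Markdown links, and @mentions.
PATTERN = re.compile(r"http\S+|www\.\S+|\[.*?\]\(.*?\)|@[A-Za-z0-9_]+")
# Tiny stand-in stopword list (assumption; the notebook uses NLTK's lists).
TOY_STOPWORDS = {"or", "for", "the"}

def clean_without_stemming(text):
    """Strip links and mentions, lowercase, tokenize, and drop short or stop tokens."""
    stripped = PATTERN.sub('', text)
    tokens = re.findall(r"\b\w+\b", stripped.lower())
    return " ".join(t for t in tokens if t not in TOY_STOPWORDS and len(t) > 2)

sample = "Check [the docs](https://example.com) or www.example.org, thanks @some_user for the tip!"
print(clean_without_stemming(sample))  # check thanks tip
```

Note that `www\.\S+` also swallows punctuation attached to the URL, such as the trailing comma in the sample.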

# 3. Corpus extraction and storage
Iterates over each subreddit, extracts the "hot" threads and up to 50 comments per thread, preprocesses the post body and the comment text, and saves one JSON file per subreddit at `data/corpus_<sr>.json`.

In [7]:
def extraer_y_guardar_corpus():
    corpus = {}
    for sr in SUBREDDITS:
        hilos = []
        for post in reddit.subreddit(sr).hot(limit=THREADS_PER_SUB):
            hilo = {
                'title': post.title,
                'flair': post.link_flair_text,
                'author': str(post.author),
                'date': convertir_fecha(post.created_utc),
                'score': post.score,
                'description': limpiar_texto(post.selftext),
                'comments': []
            }
            post.comments.replace_more(limit=0)
            for c in post.comments.list()[:COMMENTS_PER_THREAD]:
                hilo['comments'].append({
                    'user': str(c.author),
                    'comment': limpiar_texto(c.body),
                    'score': c.score,
                    'date': convertir_fecha(c.created_utc)
                })
            hilos.append(hilo)
        ruta = os.path.join(JSON_DIR, f'corpus_{sr}.json')
        with open(ruta, 'w', encoding='utf-8') as f:
            json.dump(hilos, f, ensure_ascii=False, indent=4)
        corpus[sr] = hilos
    return corpus

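# 4. Data preparation and classification
## 4.1 Thread-level split
Avoids information leakage by keeping all comments of a thread on the same side: the first 14 threads of each subreddit go to training and the remaining ones (6 when 20 threads are retrieved) go to validation.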

In [8]:
def split_by_thread(corpus):
    X_train, y_train, X_val, y_val = [], [], [], []
    for sr, threads in corpus.items():
        train_threads, val_threads = threads[:14], threads[14:]
        for th in train_threads:
            for c in th['comments']:
                X_train.append(c['comment']); y_train.append(sr)
        for th in val_threads:
            for c in th['comments']:
                X_val.append(c['comment']); y_val.append(sr)
    return X_train, y_train, X_val, y_val
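
The property that matters here is that no thread contributes comments to both sets. A minimal sketch on a toy corpus, assuming a simplified structure (lists of comment strings per thread) and a hypothetical `split_threads` helper, shows how this can be checked:

```python
def split_threads(corpus, n_train):
    """Thread-level split: the first n_train threads per label go to training."""
    train, val = [], []
    for label, threads in corpus.items():
        for i, thread in enumerate(threads):
            target = train if i < n_train else val
            # Keep the thread index so the split can be audited afterwards.
            target.extend((comment, label, i) for comment in thread)
    return train, val

toy = {
    'a': [['a0-c0', 'a0-c1'], ['a1-c0'], ['a2-c0', 'a2-c1']],
    'b': [['b0-c0'], ['b1-c0', 'b1-c1'], ['b2-c0']],
}
train, val = split_threads(toy, n_train=2)
train_threads = {(label, i) for _, label, i in train}
val_threads = {(label, i) for _, label, i in val}
print(train_threads & val_threads)  # set()
```

An empty intersection confirms that every thread lands entirely in one set.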


## 4.2 Baseline models
1. **TF-IDF + RandomForest**
2. **FastText embeddings + linear SVM**

In [9]:
def train_baselines(X_train, y_train, X_val, y_val):
    # TF-IDF + Random Forest
    vec = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
    X_tr_vec = vec.fit_transform(X_train)
    X_val_vec = vec.transform(X_val)
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_tr_vec, y_train)
    preds_rf = rf.predict(X_val_vec)
    print("--- TF-IDF + Random Forest ---")
    print(classification_report(y_val, preds_rf))
    print(confusion_matrix(y_val, preds_rf))

    # fastText (official library) + SVM
    with tempfile.NamedTemporaryFile('w+', delete=False, encoding='utf-8') as tmp:
        for doc in X_train:
            tmp.write(doc + '\n')
        tmp_path = tmp.name
    ft_model = fasttext.train_unsupervised(tmp_path, model='skipgram', dim=100, ws=5, minCount=2)

    def embed_docs(docs):
        return np.vstack([ft_model.get_sentence_vector(doc) for doc in docs])

    X_tr_ft = embed_docs(X_train)
    X_val_ft = embed_docs(X_val)
    svm = SVC(kernel='linear', probability=True)
    svm.fit(X_tr_ft, y_train)
    preds_svm = svm.predict(X_val_ft)
    print("--- fastText oficial + SVM ---")
    print(classification_report(y_val, preds_svm))
    print(confusion_matrix(y_val, preds_svm))

## 4.3 Fine-tuning with Transformers

Uses multilingual BERT for classification. If a fine-tuned model already exists in `finetune/`, it is loaded; otherwise the model is trained and saved there.

In [10]:
def train_transformer(X_train, y_train, X_val, y_val):
    from pathlib import Path
    finetune_dir = 'finetune'
    if Path(finetune_dir).exists():
        print("Loading previously fine-tuned model…")
        model = AutoModelForSequenceClassification.from_pretrained(finetune_dir)
        tokenizer = AutoTokenizer.from_pretrained(finetune_dir)
    else:
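        print("Fine-tuning the pretrained model…")
        # Note: both sets are merged and re-split at random by comment here,
        # so comments from one thread can land on both sides of this split.
        datos = {'text': X_train + X_val, 'label': y_train + y_val}
        ds = Dataset.from_dict(datos).class_encode_column('label')
        train_ds, val_ds = ds.train_test_split(
            test_size=len(X_val)/len(ds)
        ).values()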

        tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')
        def tokenize_fn(batch):
            return tokenizer(
                batch['text'], padding='max_length',
                truncation=True, max_length=128
            )
        train_ds = train_ds.map(tokenize_fn, batched=True) \
                         .rename_column('label','labels')
        val_ds   = val_ds.map(tokenize_fn, batched=True)   \
                         .rename_column('label','labels')

        collator = DataCollatorWithPadding(tokenizer)
        model = AutoModelForSequenceClassification.from_pretrained(
            'bert-base-multilingual-uncased',
            num_labels=len(set(y_train))
        )
        args = TrainingArguments(
            output_dir=finetune_dir, num_train_epochs=3,
            per_device_train_batch_size=16, per_gpu_eval_batch_size=16,
            eval_strategy='epoch', save_strategy='epoch',
            load_best_model_at_end=True
        )
        trainer = Trainer(
            model=model, args=args,
            train_dataset=train_ds, eval_dataset=val_ds,
            tokenizer=tokenizer, data_collator=collator
        )
        trainer.train()
        model.save_pretrained(finetune_dir)
        tokenizer.save_pretrained(finetune_dir)

    # Evaluation
    datos = {'text': X_val, 'label': y_val}
    val_ds = Dataset.from_dict(datos).class_encode_column('label')
    val_ds = val_ds.map(
        lambda b: tokenizer(
            b['text'], padding='max_length',
            truncation=True, max_length=128
        ), batched=True
    ).rename_column('label','labels')
    res = Trainer(model=model).predict(val_ds)
    preds = np.argmax(res.predictions, axis=1)
    print('--- BERT Fine-Tuning ---')
    print(classification_report(val_ds['labels'], preds))
    print(confusion_matrix(val_ds['labels'], preds))

# 5. Thread similarity
## 5.1 FastText
One sentence vector per thread, computed over the thread's concatenated comments, compared with cosine similarity.

In [11]:
def buscar_hilos_similares_fasttext(corpus, top_k=5):
    # Write a temporary file with one thread per line
    with tempfile.NamedTemporaryFile('w+', delete=False, encoding='utf-8') as tmp:
        for threads in corpus.values():
            for hilo in threads:
                comments_text = ' '.join(c['comment'] for c in hilo['comments'])
                tmp.write(comments_text + '\n')
        tmp_path = tmp.name

    ft_model = fasttext.train_unsupervised(tmp_path, model='skipgram', dim=100, ws=5, minCount=2)
    ids, vectors = [], []
    for sr, threads in corpus.items():
        for idx, hilo in enumerate(threads):
            ids.append((sr, idx))
            text = ' '.join(c['comment'] for c in hilo['comments'])
            vectors.append(ft_model.get_sentence_vector(text))
    sims = cosine_similarity(vectors)
    similares = {
        ids[i]: [
            (ids[j], float(sims[i][j]))
            for j in np.argsort(sims[i])[-top_k-1:-1][::-1]
        ]
        for i in range(len(ids))
    }
    return similares
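
The ranking step can be illustrated without any trained model. The slice `[-top_k-1:-1]` used above assumes that each thread's similarity with itself sorts last, which may not hold for a thread without comments. The sketch below, in plain Python with made-up vectors and hypothetical helper names, excludes the query index explicitly instead.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_neighbours(vectors, i, k):
    """Indices of the k vectors most similar to vectors[i], skipping i itself."""
    scored = [(cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]

toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k_neighbours(toy_vectors, 0, 2))  # [1, 3]
```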

## 5.2 SBERT
Embeddings of title + comments with SentenceTransformers.

In [12]:
def buscar_hilos_similares_sbert(corpus, model_name='all-MiniLM-L6-v2', top_k=5):
    model = SentenceTransformer(model_name)
    ids, texts = [], []
    for sr, threads in corpus.items():
        for idx, hilo in enumerate(threads):
            ids.append((sr, idx))
            combined = hilo['title'] + ' ' + ' '.join(
                c['comment'] for c in hilo['comments']
            )
            texts.append(combined)
    embs = model.encode(texts)
    sims = cosine_similarity(embs)
    similares = {
        ids[i]: [
            (ids[j], float(sims[i][j]))
            for j in np.argsort(sims[i])[-top_k-1:-1][::-1]
        ]
        for i in range(len(ids))
    }
    return similares


# 6. Sentiment analysis and automatic summarization
## 6.1 Sentiment and emotion
Hugging Face pipelines for sentiment (`finiteautomata/beto-sentiment-analysis`) and emotion (`pysentimiento/robertuito-emotion-analysis`).

In [13]:
def analisis_sentimiento(corpus):
    sent_pipe = pipeline(
        'sentiment-analysis',
        model='finiteautomata/beto-sentiment-analysis',
        truncation=True, max_length=128
    )
    emo_pipe = pipeline(
        'text-classification',
        model='pysentimiento/robertuito-emotion-analysis',
        return_all_scores=True,
        truncation=True, max_length=128
    )
    for threads in corpus.values():
        for hilo in threads:
            for c in hilo['comments']:
                text = c['comment'][:512]
                s = sent_pipe(text)[0]
                e = emo_pipe(text)[0]
                c['sentiment'] = s['label']
                c['sentiment_score'] = s['score']
                c['emotion'] = {item['label']: item['score'] for item in e}
    return corpus
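
Each comment ends up with a sentiment label, its score, and a dictionary of emotion scores. As a small sketch of how that structure might be consumed afterwards, the snippet below picks the highest-scoring emotion. The labels and numbers are hypothetical, and `dominant_emotion` is an illustrative helper, not part of the notebook.

```python
# Hypothetical values, shaped like the fields stored on each comment.
comment = {
    'sentiment': 'NEU',
    'sentiment_score': 0.91,
    'emotion': {'others': 0.62, 'joy': 0.25, 'anger': 0.08, 'sadness': 0.05},
}

def dominant_emotion(c):
    """Return the (label, score) pair with the highest emotion score."""
    return max(c['emotion'].items(), key=lambda kv: kv[1])

print(dominant_emotion(comment))  # ('others', 0.62)
```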


## 6.2 Pretrained and zero-shot summarization
- **mT5 multilingual XLSum**
- **Flan-T5 small**

In [14]:
def resumen_preentrenado(corpus, model_name='csebuetnlp/mT5_multilingual_XLSum'):
    from transformers import AutoModelForSeq2SeqLM
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    for threads in corpus.values():
        for hilo in threads:
            inp = hilo['title'] + ": " + hilo['description']
            tokens = tokenizer(inp, return_tensors='pt',
                               truncation=True, max_length=512)
            out = model.generate(**tokens, max_length=100, num_beams=4)
            hilo['summary_pretrained'] = tokenizer.decode(
                out[0], skip_special_tokens=True
            )
    return corpus

def resumen_zero_shot(corpus, model_name='google/flan-t5-small'):
    zsl_pipe = pipeline('text2text-generation', model=model_name)
    for threads in corpus.values():
        for hilo in threads:
            prompt = f"Resume: {hilo['title']}. {hilo['description']}"
            gen = zsl_pipe(prompt, max_length=100)[0]
            hilo['summary_zero_shot'] = gen['generated_text']
    return corpus


# 7. Inappropriate content detection
## 7.1 Zero-shot + Chain-of-thought
Uses BART-MNLI for classification and Flan-T5 to explain the reasoning.

In [15]:
def deteccion_inapropiado(corpus, zsl_batch_size=32, cot_batch_size=16):
    import torch

    # ── 1. PyTorch accelerations ──────────────────────────────
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32    = True
    torch.backends.cudnn.benchmark     = True

    # ── 2. Initialize pipelines on GPU with FP16 ──────────────
    zsl_cls = pipeline(
        'zero-shot-classification',
        model='facebook/bart-large-mnli',
        device=0,
        torch_dtype=torch.float16,
        batch_size=zsl_batch_size
    )
    cot_pipe = pipeline(
        'text2text-generation',
        model='google/flan-t5-small',
        device=0,
        torch_dtype=torch.float16,
        batch_size=cot_batch_size
    )
    labels = ['apropiado', 'inapropiado']

    # ── 3. Collect comments and pre-assign empty ones ─────────
    pending_texts = []
    metadata = []   # tuples (sr, hilo_idx, comment_idx)
    for sr, threads in corpus.items():
        for hilo_idx, hilo in enumerate(threads):
            for c_idx, c in enumerate(hilo['comments']):
                txt = c['comment'].strip()
                if not txt:
                    c['zs_label']  = 'apropiado'
                    c['zs_score']  = 1.0
                    c['cot_output']= "Comentario vacío, asumido como apropiado."
                else:
                    pending_texts.append(txt)
                    metadata.append((sr, hilo_idx, c_idx))

    # ── 4. Batched zero-shot classification ───────────────────
    if pending_texts:
        zsl_outs = zsl_cls(pending_texts, candidate_labels=labels)
        if isinstance(zsl_outs, dict):
            zsl_outs = [zsl_outs]
        for out, (sr, hi, ci) in zip(zsl_outs, metadata):
            c = corpus[sr][hi]['comments'][ci]
            c['zs_label'] = out['labels'][0]
            c['zs_score'] = out['scores'][0]

    # ── 5. Batched chain-of-thought generation ────────────────
    cot_prompts = [
        f"Evalúa si este comentario contiene lenguaje inapropiado. "
        f"Primero explica tu razonamiento y luego clasifica. Comentario: {text}"
        for text in pending_texts
    ]
    if cot_prompts:
        cot_outs = cot_pipe(cot_prompts, max_new_tokens=50)
        for out, (sr, hi, ci) in zip(cot_outs, metadata):
            corpus[sr][hi]['comments'][ci]['cot_output'] = out['generated_text']

    return corpus
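
The function relies on a collect, batch, write-back pattern: non-empty comments are gathered together with their positions, classified in one batched call, and the results are mapped back by position. A minimal sketch of that pattern follows, with a stand-in classifier (`classify_in_batch` is an assumption made for illustration, not a real pipeline):

```python
def classify_in_batch(texts):
    """Stand-in for a batched pipeline call (assumption): flags texts containing 'bad'."""
    return [{'label': 'inapropiado' if 'bad' in t else 'apropiado'} for t in texts]

toy_corpus = {'sub': [{'comments': [
    {'comment': 'fine'}, {'comment': '  '}, {'comment': 'bad word'},
]}]}

pending, positions = [], []
for sr, threads in toy_corpus.items():
    for ti, thread in enumerate(threads):
        for ci, c in enumerate(thread['comments']):
            text = c['comment'].strip()
            if text:
                pending.append(text)
                positions.append((sr, ti, ci))
            else:
                c['label'] = 'apropiado'  # empty comments get a default label

# One batched call; outputs keep the input order, so zip restores the mapping.
for out, (sr, ti, ci) in zip(classify_in_batch(pending), positions):
    toy_corpus[sr][ti]['comments'][ci]['label'] = out['label']

labels = [c['label'] for c in toy_corpus['sub'][0]['comments']]
print(labels)  # ['apropiado', 'apropiado', 'inapropiado']
```

Keeping the positions in a parallel list avoids sending empty strings to the model while still labelling every comment.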

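## 7.2 Few-Shot for r/OpinionesPolemicas
Injects hand-written examples into the prompt. Only threads from `OpinionesPolemicas` are processed; since that subreddit is not in `SUBREDDITS`, the function returns an empty dictionary with the configuration above.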

In [16]:
FEW_SHOT_EXAMPLES = [
    ('Este comentario es ofensivo y soez','inapropiado'),
    ('¡Me encanta esta publicación!','apropiado'),
    ('Qué horror, no soporto esto.','apropiado'),
]

def deteccion_inapropiado_fsl(corpus):
    zsl_pipe = pipeline('zero-shot-classification',
                       model='facebook/bart-large-mnli')
    prompt_fsl = 'Clasifica como apropiado o inapropiado:\n' + "".join([
        f'Ejemplo: {ex[0]} -> {ex[1]}.\n' for ex in FEW_SHOT_EXAMPLES
    ])
    results = {}
    for sr, threads in corpus.items():
        if sr != 'OpinionesPolemicas':
            continue
        for idx, hilo in enumerate(threads[:10]):
            for c in hilo['comments']:
                text = f"{prompt_fsl}Comentario: {c['comment']} ->"
                r = zsl_pipe(text, candidate_labels=['apropiado','inapropiado'])
                results[(sr, idx, c['date'])] = {
                    'label_zsl': r['labels'][0],
                    'score': r['scores'][0]
                }
    return results
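
It may help to see the text that is actually sent to the classifier. The sketch below rebuilds the prompt with the same format strings; `build_prompt` is a hypothetical helper and the comment text is made up.

```python
EXAMPLES = [
    ('Este comentario es ofensivo y soez', 'inapropiado'),
    ('¡Me encanta esta publicación!', 'apropiado'),
]

def build_prompt(examples, comment):
    """Concatenate the instruction, the labelled examples, and the comment to classify."""
    header = 'Clasifica como apropiado o inapropiado:\n'
    shots = "".join(f'Ejemplo: {text} -> {label}.\n' for text, label in examples)
    return f"{header}{shots}Comentario: {comment} ->"

prompt = build_prompt(EXAMPLES, 'texto de prueba')
print(prompt)
```

The whole string, examples included, becomes the sequence that the zero-shot pipeline scores against the candidate labels.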

# 8. Running the code

In [17]:
# 1) Extraction
corpus = extraer_y_guardar_corpus()

In [18]:
# 2) Classification
import wandb
wandb.require("legacy-service")
X_train, y_train, X_val, y_val = split_by_thread(corpus)
train_baselines(X_train, y_train, X_val, y_val)
# W&B asks for an API key interactively during training; it is not stored in the notebook.
train_transformer(X_train, y_train, X_val, y_val)

--- TF-IDF + Random Forest ---
                 precision    recall  f1-score   support

computerscience       0.23      0.03      0.05        99
    datascience       0.58      0.65      0.61       146
        gadgets       0.49      0.59      0.54       239
machinelearning       0.00      0.00      0.00        25
    programming       0.52      0.15      0.23        74
     technology       0.35      0.51      0.41       185

       accuracy                           0.45       768
      macro avg       0.36      0.32      0.31       768
   weighted avg       0.43      0.45      0.41       768

[[  3  11  36   0   4  45]
 [  6  95  17   3   2  23]
 [  0  15 142   0   3  79]
 [  0   5  12   0   1   7]
 [  4  12  21   0  11  26]
 [  0  26  63   1   0  95]]
--- fastText oficial + SVM ---
                 precision    recall  f1-score   support

computerscience       0.00      0.00      0.00        99
    datascience       0.00      0.00      0.00       146
        gadgets       0.33    

[Progress bars omitted. They report 2698 examples cast to class labels, split into 1930 for training and 768 for validation.]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: franciscojavier-mercader (franciscojavier-mercader-upct-universidad-polit-cnica-de) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 1.226535 |
| 2 | No log | 1.026511 |
| 3 | No log | 1.07111 |


Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future version. Using `--per_device_eval_batch_size` is preferred.


--- BERT Fine-Tuning ---
              precision    recall  f1-score   support

           0       0.79      0.71      0.74        99
           1       0.85      0.87      0.86       146
           2       0.84      0.85      0.85       239
           3       0.00      0.00      0.00        25
           4       0.67      0.47      0.56        74
           5       0.70      0.89      0.78       185

    accuracy                           0.78       768
   macro avg       0.64      0.63      0.63       768
weighted avg       0.76      0.78      0.76       768

[[ 70   6  12   0   5   6]
 [  1 127   2   0   2  14]
 [  2   1 204   0   6  26]
 [  7   6   3   0   4   5]
 [  9   6   5   0  35  19]
 [  0   4  17   0   0 164]]
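
The figures in the report can be recomputed from the confusion matrix, which helps when reading it. In the sketch below the matrix is copied from the output above, with rows as true classes and columns as predicted classes (the scikit-learn convention). The supports (99, 146, 239, 25, 74, 185) match the alphabetical class order of the earlier reports, which suggests class 3 is `machinelearning`.

```python
matrix = [
    [70, 6, 12, 0, 5, 6],
    [1, 127, 2, 0, 2, 14],
    [2, 1, 204, 0, 6, 26],
    [7, 6, 3, 0, 4, 5],
    [9, 6, 5, 0, 35, 19],
    [0, 4, 17, 0, 0, 164],
]
n = len(matrix)
total = sum(sum(row) for row in matrix)

# Accuracy: correct predictions (diagonal) over all examples.
accuracy = sum(matrix[i][i] for i in range(n)) / total
# Recall per class: diagonal over the row sum (true examples of the class).
recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
# Precision per class: diagonal over the column sum (predictions of the class).
col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
precision = [matrix[c][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]

print(round(accuracy, 2))  # 0.78
print([round(r, 2) for r in recall])
print([round(p, 2) for p in precision])
```

Column 3 sums to zero, meaning the model never predicts that class, which is why its precision and recall are both 0.00.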


In [19]:
# 3) Similarity
sims_ft = buscar_hilos_similares_fasttext(corpus)
sims_sbert = buscar_hilos_similares_sbert(corpus)


In [20]:
# 4) Sentiment and emotion
corpus = analisis_sentimiento(corpus)


Device set to use cuda:0



Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [21]:
# 5) Summaries
corpus = resumen_preentrenado(corpus)
corpus = resumen_zero_shot(corpus)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (552 > 512). Running this sequence through the model will result in indexing errors


In [22]:
# 6) Inappropriate content detection
corpus = deteccion_inapropiado(corpus)
inap_fsl = deteccion_inapropiado_fsl(corpus)


Device set to use cuda:0
Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (1196 > 512). Running this sequence through the model will result in indexing errors
Device set to use cuda:0


In [23]:
# Save the extra analysis results
with open(os.path.join(JSON_DIR,'analysis_extras.json'),
          'w', encoding='utf-8') as f:
    # JSON object keys must be strings, so tuple keys are converted with str()
    sims_ft_str_keys = {str(key): value for key, value in sims_ft.items()}
    sims_sbert_str_keys = {str(key): value for key, value in sims_sbert.items()}
    inap_fsl_str_keys = {str(key): value for key, value in inap_fsl.items()}
    json.dump({
        'sims_ft': sims_ft_str_keys,
        'sims_sbert': sims_sbert_str_keys,
        'inap_fsl': inap_fsl_str_keys
    }, f, ensure_ascii=False, indent=4)
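
The key conversion is needed because `json` rejects tuple keys. A short standard-library sketch with made-up similarity values shows the failure and the round trip after converting keys with `str()`:

```python
import json

# Hypothetical values, shaped like the similarity dictionaries above.
sims = {('technology', 0): [(('gadgets', 3), 0.81)]}

try:
    json.dumps(sims)
except TypeError as err:
    print('tuple keys rejected:', err)

as_text = json.dumps({str(k): v for k, v in sims.items()})
restored = json.loads(as_text)
print(restored)
```

After loading, keys are strings such as `"('technology', 0)"` and the inner tuples come back as lists, so any code reading the file has to parse them accordingly.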