# Prepare RPP news classification
This notebook:
1. Loads the 50 news items from `data/rpp_source.json`.
2. Generates the numbered block used to request classification from an LLM (0-World, 1-Sports, 2-Business, 3-Science/Tech).
3. Saves the classified result to `data/rpp_classified.json`.
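
Step 2 refers to a numbered block for prompting an LLM. The following is a minimal sketch of how such a block might be assembled, assuming each news item is a dict with `title` and `description` keys. The function name `build_numbered_block`, the prompt wording, and the `|` separator are illustrative assumptions, not part of the original notebook.

```python
# Label ids as listed in step 2 of this notebook.
LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Tech"}

def build_numbered_block(items):
    """Return a prompt with a label legend followed by one numbered line per item.

    Hypothetical helper: the exact prompt wording is an assumption.
    """
    legend = ", ".join(f"{i}-{name}" for i, name in LABELS.items())
    header = f"Classify each news item with one label id ({legend})."
    lines = [
        f"{n}. {item.get('title', '').strip()} | {item.get('description', '').strip()}"
        for n, item in enumerate(items, start=1)
    ]
    return "\n".join([header, *lines])

# Placeholder items, not real RPP data.
sample = [
    {"title": "Title A", "description": "Description A"},
    {"title": "Title B", "description": "Description B"},
]
prompt_block = build_numbered_block(sample)
print(prompt_block)
```

Numbering the items from 1 lets the model's answer be matched back to each row by position.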

In [1]:
# === Cell 0: sync the repo path in Colab ===
import os, sys

In [2]:
# 1. Remove any previous clone
!rm -rf /content/News_Classification-lab_LovatonValeria_PizarroSebastian

# 2. Clone the full repo from GitHub
!git clone https://github.com/ValLovaton/News_Classification-lab_LovatonValeria_PizarroSebastian
%cd /content/News_Classification-lab_LovatonValeria_PizarroSebastian

# 3. Show the directory structure
print("\n📂 Estructura inicial:")
!ls -R | head -40

# 4. Make the repo root importable and ensure the data folder exists
sys.path.append(os.path.abspath("."))
os.makedirs("data", exist_ok=True)

print("\n✅ Ubicación actual:", os.getcwd())


Cloning into 'News_Classification-lab_LovatonValeria_PizarroSebastian'...
remote: Enumerating objects: 97, done.
remote: Counting objects: 100% (97/97), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 97 (delta 42), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (97/97), 94.68 KiB | 3.51 MiB/s, done.
Resolving deltas: 100% (42/42), done.
/content/News_Classification-lab_LovatonValeria_PizarroSebastian

📂 Estructura inicial:
.:
data
notebooks
outputs
README.md
requirements.txt
src

./data:
rpp_classified.json
rpp_source.json

./notebooks:
agnews_train_eval.ipynb
agnews_train_eval_step2_roberta.ipynb
agnews_train_eval_step3_compare.ipynb
bonus_prepare_rpp.ipynb
bonus_prepare_rpp.ipynbbonus_prepare_rpp.ipynb

./outputs:

./src:
train_eval.py

✅ Ubicación actual: /content/News_Classification-lab_LovatonValeria_PizarroSebastian
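
The setup cell above assumes a Colab runtime (`/content/...` paths and `!`/`%` magics). As a hedged sketch, the check below uses the common idiom of looking for the `google.colab` module to decide whether the clone step applies. The helper name `in_colab` is an illustrative assumption and does not appear in the original notebook.

```python
import sys

def in_colab():
    """Return True when the google.colab module has been loaded in this session."""
    return "google.colab" in sys.modules

if in_colab():
    print("Running in Colab: clone the repo and change into it.")
else:
    print("Not in Colab: assuming the current directory is already the repo root.")
```

Guarding the clone this way would let the same notebook run locally without deleting or re-cloning anything.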


In [3]:
import json, pandas as pd, os

# 1. Load the 50 news items from the RSS feed
with open('data/rpp_source.json', 'r', encoding='utf-8') as f:
    news = json.load(f)
df = pd.DataFrame(news)
print(f'Noticias cargadas: {len(df)}')
df.head(2)

Noticias cargadas: 50


Unnamed: 0,title,description,link,date_published
0,Ollanta Humala se pronuncia tras anulación de ...,"El expresidente Humala, sentenciado a 15 años ...",https://rpp.pe/politica/judiciales/ollanta-hum...,"Mon, 20 Oct 2025 22:19:10 -0500"
1,"Vecina de San Isidro considera que hubo ""tardí...","Cynthia Yamamoto, vocera de 'Peruanos de a pie...",https://rpp.pe/lima/actualidad/san-isidro-veci...,"Mon, 20 Oct 2025 22:01:43 -0500"
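
The `date_published` values shown above follow the RFC 2822 style used in RSS feeds. If those strings need to be sorted or filtered, one possible approach is to parse them with the standard library, as sketched below. The helper name `parse_rss_date` is an assumption added for illustration.

```python
from email.utils import parsedate_to_datetime

def parse_rss_date(value):
    """Parse an RFC 2822 date string into a timezone-aware datetime."""
    return parsedate_to_datetime(value)

# Date string taken from the first row of the preview above.
published = parse_rss_date("Mon, 20 Oct 2025 22:19:10 -0500")
print(published.isoformat())  # 2025-10-20T22:19:10-05:00
```

The result keeps the `-05:00` offset, so comparisons between items remain correct across time zones.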


In [4]:
# === Fix version conflicts (safe in Colab) ===
!pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 -q
!pip install transformers accelerate -q
import torch, transformers
print("✅ Versiones sincronizadas:")
print("Torch:", torch.__version__)
print("Transformers:", transformers.__version__)


✅ Versiones sincronizadas:
Torch: 2.8.0+cu128
Transformers: 4.57.1


In [5]:
from transformers import pipeline

candidate_labels = ["World", "Sports", "Business", "Science/Technology"]
label_to_id = { "World":0, "Sports":1, "Business":2, "Science/Technology":3 }

# Multilingual model (zero-shot)
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
)

preds = []
for it in news:
    text = f"{it['title']} — {it['description']}"
    result = classifier(text, candidate_labels=candidate_labels, multi_label=False)
    label = result["labels"][0]
    preds.append(label_to_id[label])

df["label_llm"] = preds
df.to_json("data/rpp_classified.json", orient="records", indent=2, force_ascii=False)
print(f"✅ File saved to data/rpp_classified.json with {len(df)} labels generated by an LLM (open-source).")
df.head(5)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu


KeyboardInterrupt: 
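
The run above was interrupted while classifying on CPU, so no labels were written. The sketch below isolates the two pure steps of the loop (building the input text and mapping a result to a label id) so they can be checked without loading the model. The helper names `build_text` and `result_to_id` and the hand-made `fake_result` are assumptions for illustration. The result shape (a dict with `labels` and `scores`) matches what the loop above reads from the zero-shot pipeline.

```python
LABEL_TO_ID = {"World": 0, "Sports": 1, "Business": 2, "Science/Technology": 3}

def build_text(item):
    """Join title and description, tolerating a missing or empty field."""
    parts = [item.get("title") or "", item.get("description") or ""]
    return " - ".join(p.strip() for p in parts if p.strip())

def result_to_id(result):
    """Map a zero-shot result dict to the integer id of its top-scoring label."""
    best_label, _ = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return LABEL_TO_ID[best_label]

# Hand-made result (not real model output) in the shape the loop above expects.
fake_result = {
    "sequence": "Title A - Description A",
    "labels": ["Sports", "World", "Business", "Science/Technology"],
    "scores": [0.70, 0.20, 0.06, 0.04],
}

text = build_text({"title": "Title A", "description": None})
label_id = result_to_id(fake_result)
print(text, label_id)
```

With helpers like these, the texts could be built up front and passed to `classifier` as a list in a single call, which may reduce per-item overhead. Selecting a GPU runtime in Colab would likely shorten the run as well.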