<a href="https://colab.research.google.com/github/componavt/sns4human/blob/main/src/vk/topics/Finetune_XLM_RoBERTa_Topic_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Inference with Fine-Tuned Model from Hugging Face Hub

This script loads a fine-tuned XLM-RoBERTa topic-classification model from the Hugging Face Hub. It then classifies new, unlabeled Russian social media posts and saves each post with its predicted topic and confidence score to `labeled_predictions.csv`.
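
To make the "predicted topic and confidence score" concrete, here is a minimal sketch of how raw model outputs (logits) become a label and a score. It assumes the default single-label setup, where the classification pipeline applies a softmax over the logits. The helper names, the three-topic list, and the logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, class_names):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best_idx = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best_idx], probs[best_idx]


# Hypothetical logits for a toy three-topic model (not real model output)
toy_classes = ["музыка", "поэзия", "праздник"]
toy_logits = [0.2, 2.1, -1.0]

label, score = top_prediction(toy_logits, toy_classes)
print(label, round(score, 4))
```

With 24 topics the probability mass is spread over many classes, so even the winning score can be well below 1.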

```python
# Required libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import torch
import numpy as np
import pandas as pd
import requests
from io import StringIO
from sklearn.preprocessing import LabelEncoder

# Parameters
model_repo = "componavt/xlm-roberta-base-topic-classification-2025"
max_length = 512

# Load tokenizer and model from Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained(model_repo)
tokenizer = AutoTokenizer.from_pretrained(model_repo)

# Recreate the label encoder. The order must match the class indices used in training.
# classes_ has to be a NumPy array: inverse_transform indexes it with an index array,
# which a plain Python list does not support.
# Note: LabelEncoder.fit() stores classes in sorted order, so check this list
# against the encoder used at training time.
label_encoder = LabelEncoder()
label_encoder.classes_ = np.array([
    "благоустройство", "война", "выставка", "день рождения и юбилей",
    "еда кухня", "здравоохранение", "карельский язык", "конкурс", "литература",
    "мероприятия проекты", "музей. экскурсия", "музыка", "образование", "поэзия",
    "праздник", "природа", "социальная сфера", "традиция", "фестиваль", "фольклор",
    "этнокультура", "язык", "вепсы", "Эпос_Калевала"
])

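# Load unlabeled posts
unlabeled_posts = "speechvepkar.csv"
posts_url = f"https://raw.githubusercontent.com/componavt/sns4human/refs/heads/main/data/vk/posts/{unlabeled_posts}"
test_response = requests.get(posts_url, timeout=30)
test_response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
test_response.encoding = "utf-8"  # Decode the response body as UTF-8
test_df = pd.read_csv(StringIO(test_response.text), delimiter=',')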

# Set up classification pipeline
pipe = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    device=0 if torch.cuda.is_available() else -1,
    truncation=True,
    max_length=max_length,
    padding=True
)

# Predict topic with score
results = []
for _, row in test_df.iterrows():
    text = row['text']
    if not isinstance(text, str) or not text.strip():
        continue  # Skip invalid text entries

    preds = pipe(text)[0]  # list of dicts [{label: 'LABEL_0', score: ...}, ...]
    best = max(preds, key=lambda x: x['score'])
    class_id = int(best['label'].split('_')[-1])  # e.g. 'LABEL_7' -> 7
    topic_label = label_encoder.inverse_transform([class_id])[0]
    results.append({
        "id": row.get('id', None),
        "topic": topic_label,
        "relatedness": round(best['score'], 4),
        "text": text,
        "date": row.get('date'),
        "likes": row.get('likes'),
        "reposts": row.get('reposts'),
        "views": row.get('views'),
    })

# Save labeled results
pd.DataFrame(results).to_csv("labeled_predictions.csv", index=False, encoding="utf-8")
```
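
Because the script keeps only the top-scoring topic for each post, low-confidence predictions land in the output next to confident ones. One possible follow-up, sketched below, is to split the results by score so that uncertain posts can be reviewed manually. The function name, the `0.5` threshold, and the sample rows are assumptions made for illustration. Only the `relatedness` key comes from the script above.

```python
def split_by_confidence(rows, threshold=0.5, score_key="relatedness"):
    """Split prediction dicts into (confident, uncertain) lists by score."""
    confident, uncertain = [], []
    for row in rows:
        if row[score_key] >= threshold:
            confident.append(row)
        else:
            uncertain.append(row)
    return confident, uncertain


# Hypothetical rows shaped like the entries in `results` (not real predictions)
sample_rows = [
    {"id": 1, "topic": "музыка", "relatedness": 0.91},
    {"id": 2, "topic": "язык", "relatedness": 0.34},
    {"id": 3, "topic": "фестиваль", "relatedness": 0.50},
]

confident, uncertain = split_by_confidence(sample_rows)
print(len(confident), "confident;", len(uncertain), "to review")
```

In the notebook, the same call could be applied to `results` before saving. A suitable threshold depends on the data and is best chosen by inspecting a sample of predictions.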
