<a href="https://colab.research.google.com/github/componavt/sns4human/blob/main/src/vk/topics/Finetune_XLM_RoBERTa_Topic_HuggingFace_Top-N-posts_checkpoints.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Social Media Posts Topic Classification 🗂️🔍

## English ☕

This script performs automated topic classification of social media posts using a fine-tuned XLM-RoBERTa model. Key features include:

- ⚡ **Batched inference** for high-speed classification
- 📌 **Automatic checkpointing** to resume progress after interruptions  
- 🎯 **Top-k filtering** to save only the most relevant posts per topic

### Implementation Details

The script uses our pre-trained [`xlm-roberta-base-topic-classification-2025`](https://huggingface.co/componavt/xlm-roberta-base-topic-classification-2025) model from Hugging Face. It processes CSV files from the [`sns4human/data/vk/posts/`](https://github.com/componavt/sns4human/tree/main/data/vk/posts) directory.

For each post:
1. Classifies content into expert-defined topics
2. Calculates a confidence score (0-100%)
3. Maintains leaderboards of top posts per topic
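
Steps 1 and 2 can be sketched without loading the model. The snippet below is a minimal illustration, assuming the pipeline returns one `{"label", "score"}` entry per class with labels named `LABEL_<index>`, and that class indices follow the sorted order of the topic names. `mock_result`, `topics`, and `decode_prediction` are hypothetical names used only for this example.

```python
# Hypothetical pipeline output for a single post (one entry per class).
mock_result = [
    {"label": "LABEL_0", "score": 0.08},
    {"label": "LABEL_1", "score": 0.71},
    {"label": "LABEL_2", "score": 0.21},
]

# Illustrative subset of topics; indices are assumed to follow sorted order.
topics = sorted(["музыка", "природа", "язык"])


def decode_prediction(result, classes):
    """Return (topic name, confidence) for the highest-scoring class."""
    best = max(result, key=lambda item: item["score"])
    index = int(best["label"].replace("LABEL_", ""))
    return classes[index], round(best["score"], 4)


topic, confidence = decode_prediction(mock_result, topics)
print(f"{topic}: {confidence:.2%}")
```

The confidence is the score of the winning class, so it is reported as a value between 0 and 1 and formatted as a percentage only for display.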

Results are saved as separate CSV files (one per topic) containing only the highest-confidence posts. Real-time progress is displayed in the notebook.
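
The per-topic leaderboard can be kept in a bounded min-heap, so the weakest retained post is always at the root and is cheap to evict. The sketch below is a simplified illustration of that idea. `update_leaderboard`, `boards`, and the sample data are hypothetical and not part of the script.

```python
import heapq
from collections import defaultdict
from itertools import count


def update_leaderboard(boards, topic, score, record, k, tie_breaker):
    """Keep at most k highest-scoring records per topic in a min-heap."""
    heap = boards[topic]
    # The running counter breaks ties so that dicts are never compared.
    entry = (score, next(tie_breaker), record)
    if len(heap) < k:
        heapq.heappush(heap, entry)
    elif score > heap[0][0]:
        # Push the new entry and drop the current lowest-scoring one.
        heapq.heappushpop(heap, entry)


boards = defaultdict(list)
tie_breaker = count()
samples = [
    ("музыка", 0.91, "post A"),
    ("музыка", 0.55, "post B"),
    ("музыка", 0.97, "post C"),
    ("природа", 0.80, "post D"),
]
for sample_topic, sample_score, text in samples:
    update_leaderboard(boards, sample_topic, sample_score, {"text": text},
                       k=2, tie_breaker=tie_breaker)

# Highest confidence first, as in the exported CSV files.
top_music = [rec["text"] for _, _, rec in sorted(boards["музыка"], reverse=True)]
print(top_music)
```

With `k=2`, the lowest-scoring post in the "музыка" topic is evicted once a third, stronger post arrives.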

> **Note:** The script detects whether a GPU is available and sets the batch size accordingly (32 on GPU, 8 on CPU).
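
The checkpointing listed among the key features can be illustrated with a small JSON round trip. This is a sketch under stated assumptions, not the script's exact code: `save_state`, `load_state`, and the sample entries are hypothetical names, and the state is assumed to consist of a set of processed file names plus per-topic heaps of `(score, uid, record)` tuples.

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict


def save_state(processed, heaps, path):
    """Write processed file names and per-topic heaps to a JSON file."""
    payload = {
        "processed_files": sorted(processed),
        "top_k_heap": {topic: list(entries) for topic, entries in heaps.items()},
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False)


def load_state(path):
    """Restore the state, or return an empty one if no checkpoint exists."""
    if not os.path.exists(path):
        return set(), defaultdict(list)
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    heaps = defaultdict(list)
    for topic, items in data["top_k_heap"].items():
        # JSON stores tuples as lists, so rebuild tuples while re-heapifying.
        for score, uid, record in items:
            heapq.heappush(heaps[topic], (score, uid, record))
    return set(data["processed_files"]), heaps


heaps = defaultdict(list)
heapq.heappush(heaps["музыка"], (0.91, 0, {"text": "post A"}))
heapq.heappush(heaps["музыка"], (0.55, 1, {"text": "post B"}))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    save_state({"a.csv"}, heaps, path)
    done, restored = load_state(path)

# Continue the tie-breaker ids after the largest restored one.
next_uid = 1 + max(uid for entries in restored.values() for _, uid, _ in entries)
print(done, restored["музыка"][0], next_uid)
```

Resuming the tie-breaker counter after the largest restored id keeps every heap entry unique, so two entries with equal scores never fall back to comparing the record dictionaries.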

## Summary (translated from Russian) 🫖

The script performs topic classification of posts. Key features: ⚡ batched processing for speed, 📌 automatic checkpoints for resuming after failures, and 🎯 selection of the top-k posts for each topic.

The script loads the fine-tuned [XLM-RoBERTa](https://huggingface.co/componavt/xlm-roberta-base-topic-classification-2025) model (`xlm-roberta-base-topic-classification-2025`) from Hugging Face for topic classification. It reads CSV files from the [`./data/vk/posts/`](https://github.com/componavt/sns4human/tree/main/data/vk/posts) directory. Posts are classified into expert-defined topics as follows: each post is assigned a topic and a confidence score is computed. For each topic, only the `k` highest-confidence posts are saved to a separate CSV file.

In [7]:
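import torch  # needed here to pick the batch size

# Number of CSV files to process (change as needed)
n_files = 999  # e.g. 2 or 999

# Number of top-scoring posts to keep per topic (the k in top-k)
k_top = 1000  # e.g. 10, 30, or 1000

# Print a progress line only when the row counter is a multiple of this value
print_every_n = 100

# Define model path (Hugging Face repo)
model_repo = "componavt/xlm-roberta-base-topic-classification-2025"

# Larger batches on GPU, smaller on CPU
batch_size = 32 if torch.cuda.is_available() else 8

# File that stores processed file names and the per-topic heaps
checkpoint_path = "checkpoint.json"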

In [8]:
import heapq
import json
import os
from collections import defaultdict

# Load or initialize checkpoint
def load_checkpoint(path):
    if not os.path.exists(path):
        return set(), defaultdict(list)
    with open(path, "r") as f:
        data = json.load(f)
        heap = defaultdict(list)
        for topic, items in data["top_k_heap"].items():
            for score, uid, record in items:
                heapq.heappush(heap[topic], (score, uid, record))
        return set(data["processed_files"]), heap

def save_checkpoint(processed, heap, path):
    serializable = {
        "processed_files": list(processed),
        "top_k_heap": {
            topic: [(score, uid, rec) for score, uid, rec in entries]
            for topic, entries in heap.items()
        }
    }
    with open(path, "w") as f:
        json.dump(serializable, f)

# Batched inference
def process_batch(texts):
    try:
        return pipe(texts, batch_size=len(texts))
    except Exception as e:
        print(f"Batch error: {e}")
        return [None] * len(texts)

In [None]:
### Inference with Fine-Tuned Model from Hugging Face Hub

!pip install -U transformers pandas scikit-learn

import os
import torch
import json
import pandas as pd
import heapq
from collections import defaultdict
from itertools import count
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
from sklearn.preprocessing import LabelEncoder

# Model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_repo)
tokenizer = AutoTokenizer.from_pretrained(model_repo)

# Set model to GPU if available
device = 0 if torch.cuda.is_available() else -1
pipe = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    device=device,
    truncation=True,
    max_length=512,
    padding=True
)

# Define the topic list (same as during training)
topic_labels = [
    "благоустройство", "вепсы", "война", "выставка",
    "день рождения и юбилей", "еда кухня", "карельский язык",
    "конкурс", "мероприятия проекты", "музей. экскурсия",
    "музыка", "образование", "праздник", "природа",
    "поэзия Калевала литература", "социальная сфера",
    "традиция", "фестиваль", "этнокультура фольклор",
    "язык", "Эпос_Калевала", "поэзия", "литература"
]

# Fit label encoder for inverse mapping
label_encoder = LabelEncoder()
label_encoder.fit(topic_labels)

# Clone the GitHub repo and collect CSVs
if not os.path.exists("sns4human"):
    !git clone https://github.com/componavt/sns4human.git
posts_dir = "sns4human/data/vk/posts"

csv_files = [f for f in os.listdir(posts_dir) if f.endswith(".csv")][:n_files]

# Dictionary of min-heaps for each topic
processed_files, top_k_heap = load_checkpoint(checkpoint_path)

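# Tie-breaker for heap entries with equal scores. Continue after any ids
# restored from the checkpoint so tuple comparison never reaches the dicts.
unique_counter = count(
    1 + max((uid for entries in top_k_heap.values() for _, uid, _ in entries), default=-1)
)
row_counter = 0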

for file in tqdm(csv_files, desc="Processing files"):
    if file in processed_files:
        continue

    df = pd.read_csv(os.path.join(posts_dir, file), encoding="utf-8")
    texts, rows = [], []

    for _, row in df.iterrows():
        text = row.get("text", "")
        if isinstance(text, str) and text.strip():
            texts.append(text)
            rows.append(row)

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_rows = rows[i:i+batch_size]
        batch_results = process_batch(batch)

        for result, row in zip(batch_results, batch_rows):
            row_counter += 1
            if not result:
                continue

            try:
                best = max(result, key=lambda x: x["score"])
                pred_label = int(best["label"].replace("LABEL_", ""))
                topic = label_encoder.inverse_transform([pred_label])[0]
                score = round(best["score"], 4)
                heap = top_k_heap[topic]

                record = {
                    "topic": topic,
                    "relatedness": score,
                    "text": row.get("text"),
                    "date": row.get("date"),
                    "group": row.get("group"),
                    "likes": row.get("likes"),
                    "reposts": row.get("reposts"),
                    "views": row.get("views")
                }

                # Report progress only on every print_every_n-th row
                show = row_counter % print_every_n == 0

                if len(heap) < k_top:
                    # Heap not full yet: always keep the post
                    heapq.heappush(heap, (score, next(unique_counter), record))
                    if show:
                        print(f"1: {topic} (confidence: {score:.2%}) | {record['text'][:50]}...")
                elif score > heap[0][0]:
                    # Heap full: replace the current lowest-scoring post
                    heapq.heappushpop(heap, (score, next(unique_counter), record))
                    if show:
                        print(f"2: {topic} (confidence: {score:.2%}) | {record['text'][:50]}...")
            except Exception as e:
                print(f"Error in row: {e}")

    processed_files.add(file)
    save_checkpoint(processed_files, top_k_heap, checkpoint_path)

# Save final results
for topic, heap in top_k_heap.items():
    sorted_records = [rec for _, _, rec in sorted(heap, key=lambda x: -x[0])]
    pd.DataFrame(sorted_records).to_csv(f"{topic}.csv", index=False, encoding="utf-8")