# Kazakh → Russian CEFR Pipeline (Notebook)

This notebook installs all dependencies, downloads a slice of the KazParC corpus, builds silver CEFR labels, and runs the text-level predictor. It works on CPU or GPU runtimes (e.g., Google Colab, Kaggle, or local) and only relies on internet access for model/data downloads.

In [9]:
import os
import pathlib
import sys

PROJECT_ROOT = pathlib.Path.cwd()

PROJECT_ROOT

PosixPath('/Users/galymzhantore/Documents/cefr-kk-ru')

In [10]:
import platform
import torch

print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

Python: 3.10.18
Torch: 2.5.1
CUDA available: False


## 1. Download Kazakh–Russian parallel sentences

We grab the first 10k sentence pairs for a quick demo. Set `split="train"` for the full dataset once you’re ready for a longer run.

In [11]:
from src.data.download_parallel import save_kz_ru

PARALLEL_PATH = save_kz_ru(split="train[:100]", out_dir="data/parallel", out_name="kazparc_kz_ru.csv")
PARALLEL_PATH

Saved: data/parallel/kazparc_kz_ru.csv rows: 100


## 2. Build silver word-level CEFR labels

Uses the mutual soft-aligner with fallback handling for very long sentences (those are skipped with a warning). Expect the first run to download the alignment model.

In [13]:
from src.align.mutual_align import EmbeddingAligner
from src.pipeline.build_silver_labels import main as build_silver_labels

# Use GPU explicitly if available
aligner_device = "cuda" if torch.cuda.is_available() else "cpu"
custom_aligner = EmbeddingAligner(device=aligner_device)

SILVER_PATH = build_silver_labels(parallel_csv=PARALLEL_PATH, aligner=custom_aligner)
SILVER_PATH

TypeError: expected str, bytes or os.PathLike object, not NoneType

In [None]:
import pandas as pd

silver_df = pd.read_csv(SILVER_PATH)
silver_df.head()

## 3. Run text-level CEFR prediction

Translate a sample sentence, align phrases, and aggregate CEFR levels.

In [None]:
from src.domain.services import TextCefrPipeline, TranslationService, AlignmentService, CefrScorer
from src.data.repositories import RussianCefrRepository
from src.translation.translator import get_translator

translator = get_translator(device=aligner_device)
translation_service = TranslationService(translator)
alignment_service = AlignmentService(custom_aligner)
scorer = CefrScorer(RussianCefrRepository())

pipeline = TextCefrPipeline(
    translation_service=translation_service,
    alignment_service=alignment_service,
    scorer=scorer,
)

sample_text = "Ол қыста ауылға барған еді"
prediction = pipeline.predict(sample_text)
prediction.to_dict()

In [None]:
pd.DataFrame(
    {
        "kazakh_phrase": [p.kazakh_phrase for p in prediction.phrase_alignments],
        "russian_token": [p.russian_token for p in prediction.phrase_alignments],
        "kazakh_span": [p.kazakh_span for p in prediction.phrase_alignments],
        "russian_index": [p.russian_index for p in prediction.phrase_alignments],
    }
)

## 4. (Optional) Train the word-level CEFR classifier

Requires the silver labels generated above. Training on CPU is slow; GPU recommended.

In [None]:
# Uncomment to fine-tune the word-level classifier
# !python -m src.models.train_word

## Next steps

- Increase the dataset slice to the full `train` split when you are ready for production data.
- Persist outputs to cloud storage (e.g., Google Drive) if running in Colab.
- Build a CLI or API on top of `TextCefrPipeline` for integration into larger systems.