
CloudEerie/ScandiProb


ScandiProb: Hybrid Language ID Classifier

Code for fine-tuning ScandiBERT (by Snæbjarnarson, et al.) into a hybrid language ID text classifier that reports an independent probability for each of four labels: Norwegian, Swedish, Danish, and Non-Scandinavian / None of the Above. The code was written primarily in Kaggle Notebooks (scandiprob.ipynb), committed to GitHub with each iteration, and finally hosted on Hugging Face. It was an undergraduate final project for a Spring 2026 NLP course at the University of Alaska Fairbanks. You can use the resulting web app here.

The program turns ScandiBERT into a multilabel classifier, trained on a limited amount of the OPUS-100 translation corpus (6000 samples per label) and a tiny handwritten multilabel dataset, and combined with regex-enforced heuristics. It achieves a macro-F1 score of ~93% on the OPUS-100 test set and ~84% on the comprehensive SLIDE eval set, with a fraction of the training data used in the 2025 SLIDE paper.
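
For reference, macro-F1 averages the per-label F1 scores with equal weight, so a weak label pulls the score down as much as a strong one lifts it. The sketch below is a plain-Python illustration, assuming multilabel targets and predictions encoded as 0/1 lists. The label order and the `macro_f1` name are illustrative and not taken from the notebook.

```python
# Assumed label order for illustration only.
LABELS = ["norwegian", "swedish", "danish", "non_scandinavian"]

def macro_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighting every label equally."""
    scores = []
    for j in range(len(LABELS)):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # A label with no positives and no predictions scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# One text is both Norwegian and Danish, but the prediction misses Danish.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
score = macro_f1(y_true, y_pred)
```

In this toy case three labels score 1.0 and Danish scores 0.0, so the macro average is 0.75 even though only one label in one text was wrong.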

(Kaggle Notebooks | Hugging Face Model Page)

Project Progression / Rubric

The project was completed in iterations, from Part D to Part A, as follows.

Part D: Zero-Shot/Few-Shot Norwegian Classification

  • Load ScandiBERT into Kaggle Notebooks. Attach a classification head to it, using a sigmoid function to output the probability between 0 and 1 that a given text is Norwegian or not.
  • Implement a bare-bones negative class with just regex and the equation 1 - p_n, where p_n is the Norwegian probability.
  • Run tests to see how successful the model is with just pretraining and just one language.
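
The Part D steps can be sketched without the model itself. This is a minimal illustration, assuming the classification head emits one raw logit per text. The regex rule (text containing no letters is treated as non-Norwegian) and the function names are hypothetical stand-ins, not the notebook's actual pattern.

```python
import math
import re

# Hypothetical rule: text with no letters at all cannot be Norwegian.
HAS_LETTER = re.compile(r"[^\W\d_]")

def sigmoid(x):
    """Numerically stable logistic function, mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def norwegian_and_negative(text, logit):
    """Return (p_n, p_neg) for one text, given the head's raw logit."""
    if not HAS_LETTER.search(text):
        return 0.0, 1.0  # the regex rule overrides the model
    p_n = sigmoid(logit)
    return p_n, 1.0 - p_n  # negative class as 1 - p_n
```

With a single language, the negative class is simply the complement of the Norwegian probability, which is why this setup stops working once more labels are added.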

Part C: Training for Norwegian, Swedish, and the Negative Class

  • Properly train the model on limited labeled corpora of Norwegian and Swedish, plus a limited non-Scandinavian corpus.
  • Adjust the classification head to now return three independent probabilities for whether a text is in Norwegian, Swedish, or non-Scandinavian.
  • Run new tests for all three classes.
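
"Independent probabilities" means each label gets its own sigmoid rather than sharing one softmax, so the outputs need not sum to 1. The sketch below shows that idea together with the binary cross-entropy loss usually paired with it. It is an assumption-laden illustration in plain Python: the label order and function names are hypothetical, and the notebook's actual head and loss may differ.

```python
import math

# Assumed label order for illustration only.
LABELS = ["norwegian", "swedish", "non_scandinavian"]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def label_probabilities(logits):
    """Apply one sigmoid per label; the results are not normalized."""
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}

def binary_cross_entropy(logits, targets):
    """Mean per-label BCE computed directly from logits (stable form)."""
    losses = [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

probs = label_probabilities([0.0, 0.0, 0.0])
loss = binary_cross_entropy([0.0, 0.0, 0.0], [1, 0, 0])
```

Here every label comes out at 0.5, for a total of 1.5, which a softmax could never produce. In the Hugging Face `transformers` library, setting `problem_type="multi_label_classification"` on a sequence classification model's config selects a loss of this kind; whether the notebook relies on that setting is not stated here.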

Part B: Training for Danish

  • Properly train the model on extensive labeled corpora of Norwegian, Swedish, and Danish.
  • Adjust the classification head to now return an independent probability for Danish in addition to existing classes. Note that Danish is likely the hardest class to implement properly, due to significant overlap with Norwegian Bokmål.

Part A: Options for Expansion

Complete two of the following:

Italicized items were not pursued or completed.

  • Implement a proper UI for the final program. Have the user input/upload their text and have it output a bar graph for the language probabilities.
  • Retrain the Norwegian class to have multiple subclasses for Bokmål, Nynorsk, and "Miscellaneous Norwegian Dialect". ScandiBERT is only pretrained on Bokmål and Nynorsk and does not label them as part of Norwegian at-large. NorDial is a corpus of written Norwegian dialects and will help to make the Norwegian class as accurate as possible.
  • After the model outputs the probabilities, write a function that goes character-by-character and raises or lowers each language's score based on linguistic heuristics, such as the word-final -d in Danish that is dropped in Norwegian Bokmål. Does this meaningfully improve accuracy? This may be necessary to some extent to distinguish between Danish and Bokmål.
  • Implement classes and probabilities for the insular Scandinavian languages of Icelandic and Faroese, as ScandiBERT is pretrained on them.
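
The character-by-character heuristic idea above can be sketched as a post-processing step. This is a hypothetical example, not the project's rule set: it uses only the orthographic fact that Swedish writes ä and ö where Danish and Norwegian write æ and ø, and the `apply_heuristics` name and `step` size are assumptions. The word-final -d rule mentioned in the list is not implemented here.

```python
def apply_heuristics(text, probs, step=0.05):
    """Nudge label scores by `step` per telltale character, clamped to [0, 1]."""
    adjusted = dict(probs)  # leave the caller's dict untouched
    for ch in text.lower():
        if ch in "äö":
            deltas = {"swedish": step, "danish": -step, "norwegian": -step}
        elif ch in "æø":
            deltas = {"swedish": -step, "danish": step, "norwegian": step}
        else:
            continue
        for label, delta in deltas.items():
            if label in adjusted:
                adjusted[label] = min(1.0, max(0.0, adjusted[label] + delta))
    return adjusted

model_output = {"swedish": 0.5, "danish": 0.5, "norwegian": 0.5}
swedish_like = apply_heuristics("smörgås", model_output, step=0.1)
danish_like = apply_heuristics("rødgrød", model_output, step=0.1)
```

Keeping the adjustment small and clamped treats these letters as a nudge rather than a verdict, since names and loanwords can carry them into any of the languages. Note that æ and ø do not separate Danish from Norwegian at all, which is where a rule like the -d ending would be needed.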

References

Formal references can be found in the raw README.md of the Hugging Face model page.

Base Model: ScandiBERT by Snæbjarnarson, et al., 2023.

Training, Validation, and Test Data: OPUS-100 by Zhang, et al.

Final Validation Dataset for Comparison: SLIDE by Fedorova, et al., 2025.

General References on Project Methodology

Multi-label Scandinavian Language Identification (SLIDE) by Fedorova, et al., 2025.

Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese by Snæbjarnarson, et al., 2023.

Discriminating between Similar Nordic Languages by Haas & Derczynski, 2021.
