ScandiProb: Hybrid Language ID Classifier

Code for fine-tuning ScandiBERT (Snæbjarnarson et al.) into a hybrid language ID text classifier that reports independent probabilities that a given text is Norwegian, Swedish, Danish, or Non-Scandinavian / None of the Above. The code was written primarily in Kaggle Notebooks (scandiprob.ipynb), committed to GitHub with each iteration, and finally hosted on Hugging Face. It was an undergraduate final project for a Spring 2026 NLP course at the University of Alaska Fairbanks. You can use the resulting web app here.

The program turns ScandiBERT into a multilabel classifier, trained on a limited portion of the OPUS-100 translation corpus (6000 samples per label) plus a tiny handwritten multilabel dataset, and combined with regex-enforced heuristics. It achieves ~93% macro-F1 on the OPUS-100 test set and ~84% macro-F1 on the comprehensive SLIDE eval set, with a fraction of the training data used in the 2025 SLIDE paper.
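For readers unfamiliar with the metric, here is a minimal sketch of how macro-F1 can be computed for multilabel predictions: F1 is computed per label and then averaged with equal weight. The label order, the helper names, and the toy data are assumptions for illustration; this is not the notebook's evaluation code.

```python
# Label order is an assumption made for this illustration.
LABELS = ["norwegian", "swedish", "danish", "non_scandinavian"]


def f1_for_label(y_true, y_pred, index):
    """F1 for one label, treating that column as a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t[index] == 1 and p[index] == 1)
    fp = sum(1 for t, p in pairs if t[index] == 0 and p[index] == 1)
    fn = sum(1 for t, p in pairs if t[index] == 1 and p[index] == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1_for_label(y_true, y_pred, i) for i in range(len(LABELS))]
    return sum(scores) / len(scores)


# Toy multi-hot data (made up): the third text is valid Norwegian AND Danish,
# but the prediction misses the Danish label.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

print(macro_f1(y_true, y_pred))
```

Because every label counts equally, a single weak class (Danish in this toy data) pulls the average down regardless of how many samples it has.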

(Kaggle Notebooks | Hugging Face Model Page)

Project Progression / Rubric

The project was completed in iterations, from Part D to Part A, as follows.

Part D: Zero-Shot/Few-Shot Norwegian Classification

Load ScandiBERT into Kaggle Notebooks. Attach a classification head to it, using a sigmoid function to output the probability between 0 and 1 that a given text is Norwegian or not.
Implement a bare-bones negative class using only regex and the equation (1 - p_n), where p_n is the Norwegian probability.
Run tests to see how successful the model is with just pretraining and just one language.
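The Part D steps can be sketched as follows. The sigmoid and the (1 - p_n) negative class come from the description above; the specific regex rule (text with no Latin-script letters cannot be Norwegian) and the function names are assumptions made for illustration, and the logit stands in for the output of the classification head.

```python
import math
import re

# Hypothetical regex rule: text without any Latin-script letter is treated
# as not Norwegian. The project's actual regex may differ.
LATIN_LETTER = re.compile(r"[A-Za-zÆØÅæøå]")


def sigmoid(logit):
    """Squash a raw logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def norwegian_and_negative(text, logit):
    """Return (p_n, 1 - p_n) for a text and its Norwegian logit."""
    if not LATIN_LETTER.search(text):
        p_n = 0.0  # the regex rule overrides the model
    else:
        p_n = sigmoid(logit)
    return p_n, 1.0 - p_n


print(norwegian_and_negative("Jeg heter Ola", 0.0))
print(norwegian_and_negative("Привет", 3.0))
```

With a single output, the negative class carries no information of its own; it is only the complement of the Norwegian probability.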

Part C: Training for Norwegian, Swedish, and the Negative Class

Properly train the model on limited labeled corpora of Norwegian and Swedish, plus a limited non-Scandinavian corpus.
Adjust the classification head to now return three independent probabilities for whether a text is in Norwegian, Swedish, or non-Scandinavian.
Run new tests for all three classes.
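To illustrate what "independent probabilities" means here, the sketch below applies a sigmoid to each logit separately (rather than a softmax across them) and scores the result with binary cross-entropy against a multi-hot target. The logit values, the label order, and the choice of binary cross-entropy as the loss are assumptions; the source does not state which loss was used.

```python
import math

# Label order is an assumption made for this illustration.
LABELS = ["norwegian", "swedish", "non_scandinavian"]


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def independent_probs(logits):
    """One sigmoid per label, so the probabilities need not sum to 1."""
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean per-label binary cross-entropy against a multi-hot target."""
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


# Made-up logits for a text that looks both Norwegian and Swedish.
probs = independent_probs([2.0, 1.5, -3.0])
print(probs)
print(sum(probs.values()))  # can exceed 1, unlike a softmax

good = binary_cross_entropy(list(probs.values()), [1, 1, 0])
bad = binary_cross_entropy(list(probs.values()), [0, 0, 1])
print(good < bad)
```

Keeping the outputs independent is what lets a single text score highly for more than one language at once.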

Part B: Training for Danish

Properly train the model on extensive labeled corpora of Norwegian, Swedish, and Danish.
Adjust the classification head to now return an independent probability for Danish in addition to existing classes. Note that Danish is likely the hardest class to implement properly, due to significant overlap with Norwegian Bokmål.
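Because of the Danish/Bokmål overlap, a short text may legitimately score high for both languages. One way to turn the four independent probabilities into a set of predicted labels is a per-label threshold, sketched below. The 0.5 threshold, the fallback to the single most probable label, and the example probabilities are all assumptions for illustration, not values taken from the project.

```python
THRESHOLD = 0.5  # hypothetical cut-off, not taken from the project


def predicted_labels(probs, threshold=THRESHOLD):
    """Return every label at or above the threshold.

    If no label passes, fall back to the single most probable label
    (an assumed tie-breaking rule).
    """
    chosen = [label for label, p in probs.items() if p >= threshold]
    return chosen if chosen else [max(probs, key=probs.get)]


# Made-up probabilities for a text that is plausible in both languages.
ambiguous = {
    "norwegian": 0.81,
    "swedish": 0.07,
    "danish": 0.77,
    "non_scandinavian": 0.02,
}
print(predicted_labels(ambiguous))

# Made-up probabilities where nothing clears the threshold.
uncertain = {
    "norwegian": 0.30,
    "swedish": 0.10,
    "danish": 0.40,
    "non_scandinavian": 0.20,
}
print(predicted_labels(uncertain))
```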

Part A: Options for Expansion

Complete two of the following:

Italicized items weren't pursued or completed.

Implement a proper UI for the final program. Have the user input/upload their text and have it output a bar graph for the language probabilities.
Retrain the Norwegian class to have multiple subclasses for Bokmål, Nynorsk, and "Miscellaneous Norwegian Dialect". ScandiBERT is only pretrained on Bokmål and Nynorsk and does not label them as part of Norwegian at-large. NorDial is a corpus of written Norwegian dialects and will help to make the Norwegian class as accurate as possible.
After the model outputs the probabilities, write a function that goes character-by-character and adds to or subtracts from each language's score based on linguistic heuristics, such as the Danish word-final -d that is dropped in Norwegian Bokmål. Does this meaningfully improve accuracy? This may be necessary to some extent to distinguish between Danish and Bokmål.
Implement classes and probabilities for the insular Scandinavian languages of Icelandic and Faroese, as ScandiBERT is pretrained on them.
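The character-by-character heuristic option could look roughly like the sketch below. It uses a simpler cue than the word-final -d example: Swedish writes ä and ö where Danish and Norwegian write æ and ø. The step size, the clamping to [0, 1], and the function names are assumptions for illustration, not the project's implementation.

```python
STEP = 0.05  # hypothetical adjustment per matching character

SWEDISH_CHARS = set("äöÄÖ")
DANO_NORWEGIAN_CHARS = set("æøÆØ")


def clamp(p):
    """Keep an adjusted score inside the valid probability range."""
    return max(0.0, min(1.0, p))


def adjust_scores(text, probs):
    """Nudge model probabilities using orthographic cues in the text."""
    adjusted = dict(probs)  # leave the caller's dict untouched
    for ch in text:
        if ch in SWEDISH_CHARS:
            adjusted["swedish"] += STEP
            adjusted["norwegian"] -= STEP
            adjusted["danish"] -= STEP
        elif ch in DANO_NORWEGIAN_CHARS:
            adjusted["swedish"] -= STEP
            adjusted["norwegian"] += STEP
            adjusted["danish"] += STEP
    return {label: clamp(p) for label, p in adjusted.items()}


# Made-up starting probabilities for a Swedish sentence.
before = {
    "norwegian": 0.5,
    "swedish": 0.5,
    "danish": 0.5,
    "non_scandinavian": 0.1,
}
after = adjust_scores("Jag älskar öl", before)
print(after)
```

A cue like this separates Swedish from the other two but does nothing for the Danish/Bokmål pair, which is where word-level rules such as the -d ending would have to come in.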

References

Formal references can be found in the README.md of the raw Hugging Face model.

Base Model: ScandiBERT by Snæbjarnarson, et al., 2023.

Training, Validation, and Test Data: OPUS-100 by Zhang, et al.

Final Validation Dataset for Comparison: SLIDE by Fedorova, et al., 2025.

General References on Project Methodology

Multi-label Scandinavian Language Identification (SLIDE) by Fedorova, et al., 2025.

Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese by Snæbjarnarson, et al., 2023.

Discriminating between Similar Nordic Languages by Haas & Derczynski, 2021.

