# NLP Sources for Norwegian

## Introduction
Norwegian is a Germanic language of the Indo-European family. Although it is spoken by only about 5 million people, plenty of natural language processing tools support it. NLP for Norwegian faces several problems, including two written norms (Bokmål and Nynorsk), which call for different stopword lists (Bjerke-Lindstrøm, 2017), and dialectal differences in both writing and speech that affect syntax, morphology, and the lexicon. In this handout I give an overview of **Spacy** and **Word2Vec**, two tools that can be used for Norwegian NLP. To explore them I used a tiny self-prepared corpus (5 texts) in Norwegian Bokmål from the website *forskning.no* (news in arts and science) and the novel *Den lykkelige alder* by Sigrid Undset.

I also tried **polyglot** and **NLTK**, but ran into serious issues with both. Polyglot is currently unavailable because of problems with the *pycld2* package, and NLTK does not include Norwegian Bokmål in its inventory, although people use external corpora to work with it; see Bjerke-Lindstrøm (2017).

## Spacy for Norwegian

Spacy provides three pipelines for Norwegian NLP: *nb_core_news_lg*, the heaviest one, with the best accuracy for all components; *nb_core_news_md*; and *nb_core_news_sm*, the smallest one, which is used below.

To illustrate the use of Spacy, I'm going to do the following: 
1. Extract all the proper names for people, locations, and organizations.
2. Tokenize, lemmatize, and extract part-of-speech tags for words.
3. Build a syntactic tree for one sentence.

In [52]:
!pip install spacy
! python -m spacy download nb_core_news_sm

import spacy

nlp_norsk = spacy.load("nb_core_news_sm")

Collecting nb-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.8.0/nb_core_news_sm-3.8.0-py3-none-any.whl (12.5 MB)
[+] Download and installation successful
You can now load the package via spacy.load('nb_core_news_sm')


In [54]:
# Read the corpus and flatten line breaks into spaces.
with open("Downloads/forskningno.txt", "r", encoding="utf-8") as forskning_file:
    forskning_no = forskning_file.read()
forskning_no = forskning_no.replace("\n", " ")

In [56]:
doc = nlp_norsk(forskning_no)

people = []
organizations = []
locations = []
for ent in doc.ents:
    if ent.label_ == "PER":
        people.append(ent.text)
    elif ent.label_ == "ORG":
        organizations.append(ent.text)
    elif ent.label_ == "LOC" or ent.label_ == "GPE_LOC":
        locations.append(ent.text)
print(people)
print(organizations)
print(locations)

['Christian Løchsen Rødsrud', 'Rødsrud', 'Haugen', 'Rødsrud', 'Ellen Marie Næss', 'Frühholz', 'Frühholz', 'Ehecatl', 'Ehecatl', 'Frühholz', 'Frühholz', 'Frühholz', 'Frühholz', 'Joda', 'Trygve B. Broch', 'Broch', 'Trygve B. Broch', 'Erlend Moe', 'Ingeborg Busterud Flagstad', 'Flagstad', 'Kjell Tore Hovik', 'Psykolog Kjell Tore Hovik', 'Erlend Moe', 'Annonse Minner', 'Einar Øverenget', 'Einar Øverenget', 'Lars Bjarne Mythen', 'Eystein Victor Våpenstad', 'Johann Sebastian Bachs', 'Eystein Victor Våpenstad', 'Våpenstad', 'Våpenstad', 'Ingeborg Busterud Flagstad', 'Trygve B. Broch', 'Broch', 'Karin Boson', 'Boson', 'Boson', 'Boson', 'Forskning', 'Karin Boson', 'Boson', 'Boson', 'Karin Boson', 'Karin Boson', 'Tore Bonsaksen', 'Tore Bonsaksen', 'Bonsaksen', 'Tore Bonsaksen', 'Tore Bonsaksen', 'Bonsaksen', 'Bonsaksen', 'Tore Bonsaksen', 'Bonsaksen', 'Tore Bonsaksen']
['Riksantikvaren', 'Riksantikvaren', 'Riksantikvaren', 'Kulturhistorisk museum', 'Sascha Frühholz', 'psykologisk institutt ved U

The results show that Spacy identifies proper names fairly well. It still makes some mistakes (e.g. *Haugen* here is not a surname but the common noun 'the pile', referring to the burial mound, and Sascha Frühholz is a person, not an organization), but overall the output is fairly accurate. It even recognized the completely non-Norwegian *Ehecatl*, the name of the Aztec god of wind.
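
Because every mention is collected, the lists above contain many repeats. A minimal sketch of collapsing them into mention counts with the standard library; `sample_people` is a hand-copied excerpt of the output above, and `count_mentions` is a hypothetical helper, not part of Spacy.

```python
from collections import Counter

def count_mentions(mentions):
    """Return (entity, count) pairs, most frequent first."""
    return Counter(mentions).most_common()

# Hand-copied excerpt of the PER output above (illustrative only).
sample_people = ["Karin Boson", "Boson", "Boson", "Tore Bonsaksen",
                 "Bonsaksen", "Tore Bonsaksen", "Karin Boson"]

for name, count in count_mentions(sample_people):
    print(f"{name}: {count}")
```

Full names and surname-only mentions are still counted separately here; merging them would need an extra heuristic.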

In [58]:
forskning_tokenization = []

for token in doc:
    text = token.text
    lemma = token.lemma_
    pos = token.pos_
    one_word_info = [text, lemma, pos]
    forskning_tokenization.append(one_word_info)
    
print(forskning_tokenization[:100])

[['Den', 'den', 'DET'], ['svære', 'svær', 'ADJ'], ['haugen', 'haug', 'NOUN'], [',', '$,', 'PUNCT'], ['kalt', 'kalle', 'VERB'], ['Karnilshaugen', 'Karnilshaugen', 'PROPN'], [',', '$,', 'PUNCT'], ['er', 'være', 'AUX'], ['en', 'en', 'DET'], ['menneskeskapt', 'menneskeskapt', 'ADJ'], ['gravhaug', 'gravhaug', 'NOUN'], ['fra', 'fra', 'ADP'], ['tidsperioden', 'tidsperiode', 'NOUN'], ['før', 'før', 'ADP'], ['vikingtiden', 'vikingtid', 'NOUN'], [',', '$,', 'PUNCT'], ['opplyser', 'opplyse', 'VERB'], ['Riksantikvaren', 'Riksantikvaren', 'PROPN'], ['i', 'i', 'ADP'], ['en', 'en', 'DET'], ['pressemelding', 'pressemelding', 'NOUN'], ['.', '$.', 'PUNCT'], [' ', ' ', 'SPACE'], ['Graven', 'Graven', 'PROPN'], ['ligger', 'ligge', 'VERB'], ['på', 'på', 'ADP'], ['en', 'en', 'DET'], ['gård', 'gård', 'NOUN'], ['i', 'i', 'ADP'], ['Gloppen', 'Gloppen', 'PROPN'], ['kommune', 'kommune', 'NOUN'], ['i', 'i', 'ADP'], ['Vestland', 'Vestland', 'PROPN'], ['.', '$.', 'PUNCT'], ['Den', 'den', 'PRON'], ['ble', 'bli', 'AUX

The lemmatization and part-of-speech tagging are very accurate. The model even succeeded in lemmatizing irregular verbs: *ble* is the past form of *bli* 'become', and *er* is the present form of *være* 'be'.
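
The `SPACE` token in the output above most likely comes from replacing each newline with a space, which leaves runs of consecutive spaces behind. A small sketch, assuming it is acceptable to collapse all whitespace before the text reaches the pipeline; `normalize_whitespace` is a hypothetical helper name and `raw` is an invented sample string.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Invented sample with a blank line and a stray line break.
raw = "en pressemelding.\n\nGraven ligger på en gård\n i Gloppen kommune."
print(normalize_whitespace(raw))
```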

In [60]:
from spacy import displacy

# Naive sentence split on ". " (good enough for picking one example sentence).
forskning_sentences = forskning_no.split(". ")
doc_1 = nlp_norsk(forskning_sentences[1])

displacy.render(doc_1, style='dep', jupyter=True)

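The structure of a simple sentence seems to be correct, although I am not sure it is a good idea to treat nouns as heads of prepositions. This appears to follow the Universal Dependencies convention, in which the preposition attaches to the noun as a `case` dependent.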
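
Splitting on `". "` drops the final period of each sentence and cuts names with initials, such as *Trygve B. Broch*, in two. Spacy's own `doc.sents` is one alternative; the regular-expression sketch below is a dependency-free one. The pattern, the helper name `split_sentences`, and the sample sentence are my own assumptions and cover only simple cases.

```python
import re

# Split at whitespace that follows ., ! or ? and precedes a capital letter,
# unless the period belongs to a single-letter initial such as "B.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\s[A-ZÆØÅ]\.)(?<=[.!?])\s+(?=[A-ZÆØÅ])")

def split_sentences(text):
    """Return the sentences in text, keeping their final punctuation."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

# Invented sample sentence containing a name with an initial.
sample = "Det sier Trygve B. Broch. Han er forsker."
print(split_sentences(sample))
```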

## Word2Vec for Norwegian

I also tried to train a Word2Vec model for Norwegian, but ran into several problems. The main one is that the large freely available Norwegian texts were written before the modern Bokmål orthography was standardized, so their lemmatization is not entirely successful, while recent Bokmål texts are not openly accessible. Nevertheless, I tried it to see whether it works.

In [62]:
!pip install gensim 



In [64]:
import logging
import warnings

import gensim
from gensim.models import word2vec

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

In [68]:
# Read the novel and flatten line breaks into spaces.
with open("Downloads/denlykkeligealder.txt", "r", encoding="utf-8") as den_lykkelige_alder_file:
    den_lykkelige_alder = den_lykkelige_alder_file.read()
den_lykkelige_alder = den_lykkelige_alder.replace("\n", " ")


den_lykkelige_alder_å = den_lykkelige_alder.replace("aa", "å")

preprocessed_den_lykkelige_alder = []

doc_alder = nlp_norsk(den_lykkelige_alder_å)

# Keep the lemma of every token except punctuation.
for token in doc_alder:
    if token.pos_ != "PUNCT":
        preprocessed_den_lykkelige_alder.append(token.lemma_)

preprocessed_den_lykkelige_alder = " ".join(preprocessed_den_lykkelige_alder) 

print(preprocessed_den_lykkelige_alder)

with open("preprocessed_den_lykkelige_alder.txt", "w", encoding = "utf-8") as file_1: 
    file_1.write(preprocessed_den_lykkelige_alder)

sigrid Undset       den lykkelig alder       bokselskap.no   Oslo 2021       bokselskap.no Oslo 2021   Sigrid Undset den lykkelig alder   tekst i bokselskap.no følge 1. utgave 1908 Oslo Aschehoug digitalisering være basere på fil motta fra Nasjonalbiblioteket nb.no   ISBN 978-82-8319-621-4 bokselskap.no   978-82-8319-622-1 epub 978-82-8319-623-8 mobi   tekst være laste ned fra bokselskap.no           ET HALVT DUSIN LOMMETØRKLÆR      3 Torleif og Helge sad på kjøkkentrappe og gnog på hver sin skalk   ha du sidde igjen du da å fy for skam Bildit ha sidde igjen på skole   men kjær barn mit ha du sidde igjen spurgte fru Iversen ud af kjøkkenvindu   Neida sagde Bildit fort hun stå og daske skolevæske ind mod lægg det være en skolevæske af strie med brodere blade og B. I. på i rød uldgarn jeg havde følge med Sossa Ødegård jeg jeg følge hende hjem   Jasså du være du med i hos Ødegårds også kansk sagde fru Iversen interesser   nei i   jam komme ind og få af dig på ben da Bildit du skulle ikke 

The only orthographic update I applied during preprocessing is replacing "aa" with "å", since this is the most recent change in Norwegian orthography and is easy to apply to large texts. Even so, some older forms are not lemmatized correctly, e.g. *fik* (Bokmål: *fikk*, lemma: *få*), so the preprocessing can hardly be called successful.
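
Beyond the "aa" rule, older spellings could in principle be mapped to their modern equivalents word by word before lemmatization. A rough sketch of that idea, not what was done in this notebook: `OLD_TO_NEW` is a small hand-made sample based on forms that show up in the outputs here, and `modernize` is a hypothetical helper. A real mapping would need a far larger lexicon.

```python
import re

# Small hand-made sample of older spellings and their modern Bokmål
# equivalents (illustrative, far from complete).
OLD_TO_NEW = {
    "fik": "fikk", "blev": "ble", "sagde": "sa", "havde": "hadde",
    "af": "av", "ud": "ut", "mig": "meg", "dig": "deg", "sig": "seg",
    "hende": "henne", "kunde": "kunne", "skulde": "skulle",
}

def modernize(text, mapping=OLD_TO_NEW):
    """Replace whole words found in mapping, preserving an initial capital."""
    def swap(match):
        word = match.group(0)
        new = mapping.get(word.lower())
        if new is None:
            return word
        return new.capitalize() if word[0].isupper() else new
    return re.sub(r"\w+", swap, text)

print(modernize("Hun fik brevet af hende"))
```

Matching whole words avoids rewriting substrings inside longer words, which a plain `str.replace` would do.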

In [70]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
file_dla = "preprocessed_den_lykkelige_alder.txt"
data_dla = gensim.models.word2vec.LineSentence(file_dla)
model_dla = gensim.models.Word2Vec(data_dla, vector_size=300, window=5, min_count=2)

2024-12-22 22:38:22,811 : INFO : collecting all words and their counts
2024-12-22 22:38:22,832 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-12-22 22:38:22,847 : INFO : collected 6359 word types from a corpus of 58622 raw words and 6 sentences
2024-12-22 22:38:22,848 : INFO : Creating a fresh vocabulary
2024-12-22 22:38:22,859 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 2523 unique words (39.68% of original 6359, drops 3836)', 'datetime': '2024-12-22T22:38:22.859831', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'prepare_vocab'}
2024-12-22 22:38:22,860 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 54786 word corpus (93.46% of original 58622, drops 3836)', 'datetime': '2024-12-22T22:38:22.860832', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (mai
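
The log reports only 6 sentences for 58,622 words. `LineSentence` treats each line of the file as one sentence, and the preprocessing above joined the whole novel into a single line; as far as I know, gensim then cuts overlong lines into chunks of 10,000 words by default, which would explain the count. A sketch of writing one sentence per line instead, assuming the lemmas have already been grouped by sentence (for example via `doc.sents`); `write_sentences` and the sample data are hypothetical.

```python
import os
import tempfile

def write_sentences(sentences, path):
    """Write one whitespace-separated sentence per line, skipping empty ones."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            if tokens:
                handle.write(" ".join(tokens) + "\n")
                written += 1
    return written

# Invented sample: each inner list holds the lemmas of one sentence.
sample_sentences = [["hun", "stå", "og", "daske", "skolevæske"], [], ["nei"]]
sample_path = os.path.join(tempfile.mkdtemp(), "sentences.txt")
print(write_sentences(sample_sentences, sample_path))
```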

Next, I decided to check the similarities between words and find the words most similar to some adjectives. To avoid querying words that are rare in the corpus, I first print the vocabulary.

In [72]:
model_dla.wv.key_to_index.keys()

dict_keys(['og', 'være', 'det', 'hun', 'jeg', 'i', 'en', 'ikke', 'han', 'du', 'på', 'så', 'de', 'at', 'til', 'med', 'der', 'som', 'den', 'af', 'ha', 'have', 'sig', 'hende', 'for', 'da', 'om', 'komme', 'men', 'dig', 'mig', 'se', 'ved', 'gå', 'nu', 'nog', 'kunne', 'gik', 'ud', 'Edele', 'få', 'vi', 'skulle', 'skulde', 'bare', 'stå', 'kunde', 'jo', 'sagde', 'sin', 'gjøre', 'vilde', 'op', 'når', 'ind', 'hans', 'over', 'havde', 'ja', 'fra', 'lid', 'tro', 'hvor', 'Kristian', 'Uni', 'blev', 'måtte', 'man', 'uni', 'hvad', 'ligge', 'bli', 'si', 'god', 'ville', 'selv', 'her', 'liten', 'nei', 'hendes', 'vær', 'synes', 'aldrig', 'De', 'mod', 'sad', 'hel', 'sammen', 'igjen', 'all', 'tænke', 'eft', 'også', 'Dyrssen', 'Bildit', 'ned', 'sån', 'tog', 'fik', 'min', 'din', 'hjem', 'Tom', 'end', 'liv', 'snakke', 'annen', 'dag', 'under', 'aa', 'elske', 'nok', 'engang', 'ta', 'meget', 'edele', 'hånd', 'vel', 'gang', 'hver', 'le', 'mere', 'føle', 'mens', 'by', 'før', 'mange', 'hos', 'barn', 'alt', 'bort', 'ga

In [74]:
model_dla.wv.most_similar("dårlig", topn=3)

[('deilig', 0.9958454966545105),
 ('mange', 0.9957602024078369),
 ('også', 0.9957500696182251)]

I was pleasantly surprised that the word most similar to *dårlig* 'bad' is *deilig* 'lovely'. However, the other two words, *mange* 'many' and *også* 'also', have nothing in common with 'bad'. Moreover, all three scores are nearly identical.

In [76]:
model_dla.wv.most_similar("god", topn=3)

[('jo', 0.9998985528945923),
 ('bli', 0.9998976588249207),
 ('vi', 0.9998897314071655)]

The words most similar to *god* 'good' do not make sense: *jo* 'after all' (a modal particle), *bli* 'become', and *vi* 'we'.

In [79]:
model_dla.wv.similarity("god", "liten")

0.99946225

In [81]:
model_dla.wv.similarity("god", "dårlig")

0.9955387

In [83]:
model_dla.wv.similarity("god", "elektrisk")

0.9889947

Again, according to the model, *liten* 'small' is closer to *god* 'good' than *dårlig* 'bad' is. *Elektrisk* 'electric' is further from 'good' than either, but all three scores lie above 0.98, so the differences are very small.
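
`similarity` returns the cosine of the angle between two word vectors, so scores near 1 for almost every pair suggest that the vectors all point in roughly the same direction, plausibly because the model was trained on fewer than 60,000 words. A minimal pure-Python sketch of the measure, with made-up three-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: the first two are nearly parallel, the third is orthogonal
# to the first.
vec_a = [1.0, 1.0, 1.0]
vec_b = [1.0, 1.0, 0.9]
vec_c = [1.0, -1.0, 0.0]

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```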

## Discussion

In this project, I tried out some tools for natural language processing in Norwegian. As we have seen, Norwegian is not the easiest language for NLP because of its non-unified orthography and the lack of large open-access modern texts. At the same time, Spacy analyzes modern Norwegian Bokmål extremely well, and Word2Vec occasionally yields useful data. I believe that the following problems have to be solved in order to enhance the use of NLP tools for Norwegian:

1. Orthography unifier: a tool that converts texts in the old orthography into modern Bokmål.
2. Dialect orthography processing: a tool that helps to analyze the data written with dialect orthography peculiarities, e. g. *æ* instead of *e* for Northern dialects. 

## References

Bjerke-Lindstrøm, B. (2017). *Teaching NLTK Norwegian* (Master's thesis). UiO. [Text](https://www.duo.uio.no/bitstream/handle/10852/59276/11/Teaching_NLTK_Norwegian.pdf)

Paliwal, R., Fedorova, M., & Hjertaker, O. (2019). *Norwegian NLP Resources*. MIT. [Github](https://github.com/web64/norwegian-nlp-resources?tab=readme-ov-file)