In [158]:
import spacy
import nb_core_news_sm
import json
from seaborn import heatmap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [121]:
data_3class = {}
for name in ["train", "dev", "test"]:
    with open(f"norec_sentence/3class/{name}.json") as infile:
        data_3class[name] = json.load(infile)

data_binary = {}
for name in ["train", "dev", "test"]:
    with open(f"norec_sentence/binary/{name}.json") as infile:
        data_binary[name] = json.load(infile)


The data from the three JSON files named train, dev, and test are loaded and collected in the two dictionaries data_3class and data_binary.

Dataset cloned from GitHub: https://github.com/ltgoslo/norec_sentence

Kutuzov, A., Barnes, J., Velldal, E., Øvrelid, L., & Oepen, S. (2021). Large-Scale Contextualised Language Modelling for Norwegian. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

Øvrelid, L., Mæhlum, P., Barnes, J., & Velldal, E. (2020). A Fine-grained Sentiment Dataset for Norwegian. Proceedings of the 12th Edition of the Language Resources and Evaluation Conference. Marseille, France, 2020.

The NoReC dataset is a collection of reviews. Because most reviews come with a rating, the text can easily be mapped to sentiment.

In [122]:
data_3class['train']

[{'sent_id': '201911-01-01', 'text': 'Philips 190G6', 'label': 'Neutral'},
 {'sent_id': '201911-02-01',
  'text': 'Med integrerte høyttalere som på ingen måte er diskret plassert , og med en stor subwoofer inkludert , da snakker vi om en gutteskjerm .',
  'label': 'Neutral'},
 {'sent_id': '201911-02-02',
  'text': 'Eller bedrar skinnet ?',
  'label': 'Negative'},
 {'sent_id': '201911-03-01',
  'text': 'De fleste skjermer har et diskret design , med smale rammer og slank fot .',
  'label': 'Neutral'},
 {'sent_id': '201911-03-02',
  'text': 'Men 190G6 fra Philips er en helt annen historie .',
  'label': 'Neutral'},
 {'sent_id': '201911-03-03',
  'text': 'Den har et utseende som krever oppmerksomhet , med glinsende svart ramme , glansbelegg på skjermflaten og store sølvfargede sidepaneler med fire innfelte høyttalere med svart deksel .',
  'label': 'Neutral'},
 {'sent_id': '201911-04-01', 'text': 'LES OGSÅ :', 'label': 'Neutral'},
 {'sent_id': '201911-05-01',
  'text': 'Foten har en stor 

In [123]:
for sent in data_3class['train']:
    print(sent['text'])

Philips 190G6
Med integrerte høyttalere som på ingen måte er diskret plassert , og med en stor subwoofer inkludert , da snakker vi om en gutteskjerm .
Eller bedrar skinnet ?
De fleste skjermer har et diskret design , med smale rammer og slank fot .
Men 190G6 fra Philips er en helt annen historie .
Den har et utseende som krever oppmerksomhet , med glinsende svart ramme , glansbelegg på skjermflaten og store sølvfargede sidepaneler med fire innfelte høyttalere med svart deksel .
LES OGSÅ :
Foten har en stor og blank søyle , og det er store knapper og blå lys .
Baksiden er sort , blank og skinnende , med et deksel som skjuler kontakter og kabler .
De fire høyttalerbrønnene stikker tydelig ut - her er det ikke snakk om å gjemme noe .
Likegyldig er det uansett vanskelig å være .
God betjening
I midten finner vi volumknappen , som roterer fritt .
Nivået leser du av på skjermen , det dukker opp en skala så snart du skrur på knappen .
Til venstre finner vi en inngangsvelger , som bestemmer om

From these last three cells we see that our data is organized as lists of dictionaries, so data_3class and data_binary are each a dictionary of lists of dictionaries.
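To make this structure concrete, the sketch below rebuilds a tiny stand-in for data_3class from three records copied from the output above and counts the labels. The name `sample` and the single-split dictionary are only for illustration; they are not part of the notebook.

```python
from collections import Counter

# Hypothetical stand-in for data_3class: one split holding three records
# copied from the printed output above.
sample = {
    "train": [
        {"sent_id": "201911-01-01", "text": "Philips 190G6", "label": "Neutral"},
        {"sent_id": "201911-02-02", "text": "Eller bedrar skinnet ?", "label": "Negative"},
        {"sent_id": "201911-04-01", "text": "LES OGSÅ :", "label": "Neutral"},
    ]
}

# Outer dict (split name) -> list of sentences -> dict with one sentence's fields.
first = sample["train"][0]
label_counts = Counter(record["label"] for record in sample["train"])

print(sorted(first.keys()))
print(label_counts)
```

Running the same Counter expression on data_3class['train'] would give the real class balance, which is worth knowing before comparing accuracies.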

We use a bag-of-words representation to organize the text into easily computable vectors.
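As a rough illustration of the bag-of-words idea, here is a minimal pure-Python sketch. It is not the vectorizer used in this notebook: the `tokenize` and `to_vector` helpers are made up for the example, and the tokenization is much simpler than what scikit-learn does.

```python
from collections import Counter

# Two sentences taken from the training output above.
sentences = ["Eller bedrar skinnet ?", "God betjening"]

def tokenize(sentence):
    # Simplified tokenizer (an assumption): lowercase, split on whitespace,
    # and keep only alphanumeric tokens, which drops the "?".
    return [tok for tok in sentence.lower().split() if tok.isalnum()]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for s in sentences for tok in tokenize(s)})

def to_vector(sentence):
    # One count per vocabulary word; word order in the sentence is discarded.
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(s) for s in sentences]
print(vocabulary)
print(vectors)
```

Every sentence becomes a vector of the same length, which is what the classifiers further down need as input.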

In [124]:
text_3class = {'train':[x['text'] for x in data_3class['train']], 
               'test':[x['text'] for x in data_3class['test']]}

text_binary = {'train':[x['text'] for x in data_binary['train']],
                'test':[x['text'] for x in data_binary['test']]}

labels_3class ={'train':[x['label'] for x in data_3class['train']],
                'test':[x['label'] for x in data_3class['test']]}

labels_binary ={'train':[x['label'] for x in data_binary['train']],
                'test':[x['label'] for x in data_binary['test']]}

As the original data comes as dictionaries of lists of sentences, each of which is itself a dictionary, we chose to restructure the data into parallel lists of texts and labels for our own purposes.
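The four comprehensions above follow the same pattern, so they could also be written once as a helper. The sketch below uses a hypothetical function name, `split_fields`, and a toy dataset; it is one possible way to do the restructuring, not the code the results in this notebook were produced with.

```python
def split_fields(dataset, splits=("train", "test")):
    """Return (texts, labels): two dicts keyed by split name."""
    texts = {name: [rec["text"] for rec in dataset[name]] for name in splits}
    labels = {name: [rec["label"] for rec in dataset[name]] for name in splits}
    return texts, labels

# Toy dataset with the same shape as data_3class (records are made up).
toy = {
    "train": [{"sent_id": "a-1", "text": "God betjening", "label": "Positive"}],
    "test": [{"sent_id": "b-1", "text": "Eller bedrar skinnet ?", "label": "Negative"}],
}

toy_texts, toy_labels = split_fields(toy)
print(toy_texts["train"], toy_labels["train"])
```

With the real data this would be called as split_fields(data_3class) and split_fields(data_binary).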

In [125]:
print(labels_3class['train'])

['Neutral', 'Neutral', 'Negative', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Positive', 'Neutral', 'Neutral', 'Positive', 'Neutral', 'Neutral', 'Neutral', 'Positive', 'Neutral', 'Neutral', 'Negative', 'Neutral', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Negative', 'Neutral', 'Neutral', 'Neutral', 'Negative', 'Negative', 'Neutral', 'Negative', 'Neutral', 'Positive', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Positive', 'Positive', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Positive', 'Neutral', 'Positive', 'Neutral', 'Neutral', 'Positive', 'Neutral', 'Positive', 'Positive', 'Negative', 'Neutral', 'Neutral', 'Negative', 'Neutral', 'Negative', 'Positive', 'Negative', 'Neutral', 'Positive', 'Positive', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Negative', 'Neutral', 'Neutral', 'Negative', 'Neutral', 'Neutral', 'Neutral'

In [126]:
!python -m spacy download nb_core_news_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting nb-core-news-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/nb_core_news_sm-3.5.0/nb_core_news_sm-3.5.0-py3-none-any.whl (12.5 MB)
     12.5/12.5 MB 12.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('nb_core_news_sm')


In [127]:
# load spacy language model
nb_nlp = spacy.load('nb_core_news_sm')
en_nlp = spacy.load('en_core_web_sm')

def lemmatizer(list_sentc:list)->list:
    """takes a list of sentences and returns a list of lemmatized sentences

    Args:
        list_sentc (list): sentences

    Returns:
        list: lemmatized sentences
    """
    result = []
    for sentence in list_sentc:
        doc = nb_nlp(sentence)
        sentence = ' '.join([token.lemma_ for token in doc])
        doc = en_nlp(sentence)
        result.append(' '.join([token.lemma_ for token in doc]))

    return result

test = TfidfVectorizer(use_idf=False).fit(lemmatizer(text_3class['train']))
compare = TfidfVectorizer(use_idf=False).fit(text_3class['train'])

print(1-len(test.get_feature_names_out())/len(compare.get_feature_names_out()))

0.18661988753784786


The lemmatizer reduces the complexity of the text by turning words into their base form, also known as the lemma. We see that the vocabulary shrinks by about 19 percent, which seems like a small reduction in complexity...
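The printed number is one minus the ratio of the two vocabulary sizes. The toy sketch below shows the same calculation on a handful of words; the `lemma_of` lookup is a made-up stand-in for the spaCy model, so the resulting figure says nothing about the real data.

```python
# Toy vocabulary with several inflected forms of the same words.
raw_vocab = {"skjerm", "skjermer", "skjermen", "høyttaler", "høyttalere", "stor", "store"}

# Hypothetical lemma lookup standing in for the spaCy lemmatizer.
lemma_of = {
    "skjermer": "skjerm",
    "skjermen": "skjerm",
    "høyttalere": "høyttaler",
    "store": "stor",
}

# Words without an entry are assumed to already be in their base form.
lemmatized_vocab = {lemma_of.get(word, word) for word in raw_vocab}

# Same formula as in the cell above: the share of vocabulary entries removed.
reduction = 1 - len(lemmatized_vocab) / len(raw_vocab)
print(len(raw_vocab), len(lemmatized_vocab), round(reduction, 3))
```

The reduction is large here because every toy word has inflected variants; in the real vocabulary many entries are numbers, names, or words that occur in only one form.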

In [128]:
print(*compare.get_feature_names_out()[::20], sep='\n')

000
11b
1450
178
1920
1960
1981
2000
21
252
2x
344
3g
46
525d
6000
76
999
aborten
actionbabe
acts
adventures
afterlife
aid
akselerasjonsdata
aktiverer
albert
aldersgrensen
ali
allan
allmennpublikum
alsace
alvor
ambisiøs
amerikaniseres
anakondaene
andakt
andspace
angrepet
animasjonsvenner
anmeldere
anonym
ansiktsuttrykk
ante
antydninger
appetittlig
arbeidende
arcade
arif
armstrong
arten
arve
assi
atomfrykt
audun
autistiske
autoritær
avdøde
avhengige
avmakten
avslått
avventende
bacalao
bak
bakke
bakverk
ballen
bandene
banker
barer
barnehagens
barneværelset
baserte
batterikapasitet
bearnaisen
bedrar
befinner
begravelse
begår
behovet
bekjentskap
belastningen
bemerkelsesverdige
benytte
bergen
beroende
berøringsskjerm
beskutt
bestemmer
består
betjeningen
betydningen
beveget
bi
bikker
bildet
billetter
biografi
bit
bjørgen
blair
bleke
blindsone
blogg
bluegrass
blågrå
bo
bokhandel
boligmagasin
bone
bordene
borteste
bowies
brann
breakdown
brekkes
brikke
britt
browser
brukernavn
brune
bryggeriet


Here we see a possible explanation: roughly the first 360 vocabulary entries are numbers, and there are Nynorsk ("New Norwegian") words, which we do not have a spaCy language model for.
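One way to check the claim about numbers directly, rather than reading it off every 20th entry, is to count the tokens that start with a digit. The sketch below uses a small sample copied from the output above; the variable names are only for illustration.

```python
# A few entries copied from the printed vocabulary above.
sample_vocab = ["000", "11b", "1450", "178", "2x", "3g", "aborten", "actionbabe"]

# Tokens that begin with a digit (includes mixed tokens such as "11b").
starts_with_digit = [tok for tok in sample_vocab if tok[0].isdigit()]

# Tokens made up of digits only.
purely_numeric = [tok for tok in sample_vocab if tok.isdigit()]

share = len(starts_with_digit) / len(sample_vocab)
print(len(starts_with_digit), len(purely_numeric), share)
```

Replacing sample_vocab with compare.get_feature_names_out() would give the count for the full vocabulary.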

In [129]:
print(*test.get_feature_names_out()[::20], sep='\n')

000
11år
1500
1808
1931
1964
1987
2005
23
270
31
364
4200
50
58
700
8x
aarøs
abu
actionsekvense
adrian
aftenbladet
aids
aksent
aktør
albumslipp
alexander
allah
allsangbok
altfor
amatøraft
amerika
anakonda
andre
ang
aniello
anmassende
anonym
anstrengte
antivirusprogram
appellere
arbeide
arden
arketype
arrangemente
arv
assi
atomvåpen
aulaen
autobots
avant
avgårde
avmålt
avsporet
axxe
bader
bakgrunnsvokal
baktepp
balloon
banebrytende
barbossa
barnefilm
barnevennlig
basert
batteritid
beatles
bedriftsei
beggar
behendig
bekjentskap
bellamy
benny
berettige
bernt
besitt
beste
betale
between
beverdokka
bibelver
bilde
billedfront
bind
bisqu
bjørn
blande
blikktak
blogg
blue
blåkopi
bobla
bokstavel
bomb
boost
borte
bow
brann
breakdown
brendan
brikke
broderskapet
brudekjole
bruksdag
bryggekafé
brødrene
buldrende
burried
buyers
byrjar
bærekraftig
bøyleklemme
canadisk
carrie
cd
century
chastain
chorizo
clank
cluelesse
collins
comprendo
cooper
county
cripple
cv
dagligdag
dampet
dansbar
dark
datt
dean


Here we see that a lot of the words are already in their base form, which might also explain why lemmatization removes so few words. An explanation for this might be that when writing a review you usually use the base form or the past tense of words.

Let's use GridSearchCV to find the best combination of techniques and parameters for MultinomialNB.

In [130]:
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', MultinomialNB())])

parameters = {'clf__alpha': [0.1, 0.2, 0.4, 0.8, 1.6, 3.2],
              'vect__use_idf': [True, False],
              'vect__ngram_range': [(1,1), (1,2)]}

# Combine the training and test sets to test for the highest possible score.
# GridSearchCV splits the data up again so that it can score after training on part of it.
# Concatenating into new lists keeps text_3class and labels_3class unchanged,
# so rerunning this cell does not append the test set a second time.
text1 = text_3class['train'] + text_3class['test']

text2 = lemmatizer(text1)

labels = labels_3class['train'] + labels_3class['test']

test1 = GridSearchCV(pipe, parameters, n_jobs=-1)
test1.fit(text1, labels)
test2 = GridSearchCV(pipe, parameters, n_jobs=-1)
test2.fit(text2, labels)

score1 = test1.best_score_
score2 = test2.best_score_

parms1 = test1.best_params_
parms2 = test2.best_params_

print(score1, parms1)
print(score2, parms2)

0.6159053698746243 {'clf__alpha': 0.1, 'vect__ngram_range': (1, 1), 'vect__use_idf': False}
0.620602197133162 {'clf__alpha': 0.1, 'vect__ngram_range': (1, 1), 'vect__use_idf': False}


When testing, we see that we get slightly higher performance if we lemmatize. We also see that the best alpha is 0.1, the smallest value in the grid, so let's try a range of smaller alphas.
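For a sense of how much work the search above does, the sketch below enumerates the same parameter grid with itertools. It assumes GridSearchCV's default of 5-fold cross-validation (the default when cv is not given in scikit-learn 0.22 and later); the final refit on all the data is not counted.

```python
from itertools import product

# Same grid as in the MultinomialNB cell above.
parameters = {
    "clf__alpha": [0.1, 0.2, 0.4, 0.8, 1.6, 3.2],
    "vect__use_idf": [True, False],
    "vect__ngram_range": [(1, 1), (1, 2)],
}

# Every combination of one value per parameter is one candidate pipeline.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_folds = 5  # assumed: the GridSearchCV default when cv is not specified
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Each candidate is trained and scored once per fold, and best_score_ is the mean score over the folds for the best candidate.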

In [131]:
text = text2
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', MultinomialNB())])

parameters = {'clf__alpha': [0.01, 0.02, 0.05, 0.1],
              'vect__use_idf': [True, False],
              'vect__ngram_range': [(1,1), (1,2)]}

test = GridSearchCV(pipe, parameters, n_jobs=-1)
test.fit(text, labels)

score = test.best_score_
parms = test.best_params_

print(score, parms)

0.620602197133162 {'clf__alpha': 0.1, 'vect__ngram_range': (1, 1), 'vect__use_idf': False}


same best parameters!

In [132]:
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', BernoulliNB())])

parameters = {'clf__alpha': [0.05, 0.08, 0.1, 0.2, 0.5],
              'vect__use_idf': [True, False],
              'vect__ngram_range': [(1,1), (1,2)]}

test = GridSearchCV(pipe, parameters, n_jobs=-1)
test.fit(text, labels)

score = test.best_score_
parms = test.best_params_

print(score, parms)

0.6094599087363053 {'clf__alpha': 0.5, 'vect__ngram_range': (1, 1), 'vect__use_idf': True}


BernoulliNB performed slightly worse (about 0.609 versus 0.621 for MultinomialNB).
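One thing to keep in mind when reading the best parameters for BernoulliNB: with its default binarize=0.0 it only looks at whether a feature is greater than zero, so the use_idf setting should not change what the classifier sees. The sketch below illustrates this on two sentences from the data; it is a small check under that assumption, not part of the original experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two sentences taken from the training output above.
sentences = [
    "Foten har en stor og blank søyle",
    "Baksiden er sort , blank og skinnende",
]

# Same vocabulary and same non-zero positions, only the weights differ.
with_idf = TfidfVectorizer(use_idf=True).fit_transform(sentences).toarray()
without_idf = TfidfVectorizer(use_idf=False).fit_transform(sentences).toarray()

# Thresholding at 0 keeps only presence or absence of each term.
presence_with_idf = (with_idf > 0).tolist()
presence_without_idf = (without_idf > 0).tolist()

same_presence = presence_with_idf == presence_without_idf
print(same_presence)
```

If the presence patterns are identical, the two use_idf settings tie in the grid search, and the reported value is simply the first of the tied candidates.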

In [133]:
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', RandomForestClassifier(n_jobs=-1, n_estimators=100))])

parameters = {'clf__max_depth': [1, 2, 5, 10, 20, 40, 80, 160],
              'vect__use_idf': [True, False],
              'vect__ngram_range': [(1,1), (1,2)]}

test = GridSearchCV(pipe, parameters, n_jobs=-1)
test.fit(text, labels)

score = test.best_score_
parms = test.best_params_

print(score, parms)

0.5994107552682549 {'clf__max_depth': 160, 'vect__ngram_range': (1, 1), 'vect__use_idf': True}


The random forest scores best with idf weighting enabled, and the best max_depth is the largest value in the grid (160). It might need even deeper trees, and maybe more estimators.

In [153]:
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', RandomForestClassifier(n_jobs=-1))])

parameters = {'clf__max_depth': [100, 200, 400, None],
              'clf__n_estimators': [50, 100, 200]}

test = GridSearchCV(pipe, parameters, n_jobs=-1)
test.fit(text, labels)

time = test.cv_results_['mean_fit_time']
scores = test.cv_results_['mean_test_score']
parameters = test.cv_results_['params']

for i in range(0, len(time)):
    print(f'time: {time[i]} \tscore: {scores[i]} \tparameters: {parameters[i]}')

time: 5.387199068069458 	score: 0.5897978649428632 	parameters: {'clf__max_depth': 100, 'clf__n_estimators': 50}
time: 9.206300830841064 	score: 0.5897968502386048 	parameters: {'clf__max_depth': 100, 'clf__n_estimators': 100}
time: 20.128989267349244 	score: 0.5986456085688789 	parameters: {'clf__max_depth': 100, 'clf__n_estimators': 200}
time: 8.44985055923462 	score: 0.5978805812464747 	parameters: {'clf__max_depth': 200, 'clf__n_estimators': 50}
time: 12.896164608001708 	score: 0.5953682928794621 	parameters: {'clf__max_depth': 200, 'clf__n_estimators': 100}
time: 26.008536720275877 	score: 0.5951499523984326 	parameters: {'clf__max_depth': 200, 'clf__n_estimators': 200}
time: 8.616584873199463 	score: 0.5961324845630653 	parameters: {'clf__max_depth': 400, 'clf__n_estimators': 50}
time: 15.275428199768067 	score: 0.5993005703234817 	parameters: {'clf__max_depth': 400, 'clf__n_estimators': 100}
time: 28.717516374588012 	score: 0.5978817153277046 	parameters: {'clf__max_depth': 400,

We see that doubling the number of estimators close to doubles the time it takes to fit, while the resulting score is less than one percentage point better than with half the number of estimators. Random forest probably performs badly because the data is very sparse.
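The trade-off can be read off the max_depth=100 rows directly. The sketch below hard-codes those three rows (rounded from the output above) and computes the time ratio and score gain for each doubling; the variable names are only for illustration.

```python
# (n_estimators, mean_fit_time in seconds, mean_test_score),
# rounded from the max_depth=100 rows printed above.
rows = [
    (50, 5.387, 0.58980),
    (100, 9.206, 0.58980),
    (200, 20.129, 0.59865),
]

comparisons = []
for (n_a, time_a, score_a), (n_b, time_b, score_b) in zip(rows, rows[1:]):
    time_ratio = time_b / time_a           # how much slower the larger forest is
    gain_points = (score_b - score_a) * 100  # score gain in percentage points
    comparisons.append((n_a, n_b, time_ratio, gain_points))
    print(f"{n_a} -> {n_b} estimators: {time_ratio:.2f}x time, {gain_points:+.2f} points")
```

These are single cross-validation runs without a fixed random_state, so differences of this size could also be noise.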

In [164]:
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', LogisticRegression(max_iter=500, n_jobs=-1))])

params = {'clf__C': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, params, n_jobs=-1)
grid.fit(text, labels)

score = grid.best_score_
parms = grid.best_params_

print(score, parms)

0.6348028041650624 {'clf__C': 1}


In [177]:
pipe = Pipeline([('vect', TfidfVectorizer()),
                ('clf', LinearSVC(max_iter=8000))])

params = {'clf__C': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, params, n_jobs=-1)
grid.fit(text, labels)

score = grid.best_score_
parms = grid.best_params_

print(score, parms)

0.628794680562146 {'clf__C': 0.1}
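To compare the classifiers tried so far, the best cross-validation scores printed in the cells above can be collected in one place. The values below are copied by hand and rounded to four decimals; for the random forest the better of the two searches is used.

```python
# Best mean cross-validation accuracy per classifier, rounded from the outputs above.
best_scores = {
    "MultinomialNB": 0.6206,
    "BernoulliNB": 0.6095,
    "RandomForestClassifier": 0.5994,
    "LogisticRegression": 0.6348,
    "LinearSVC": 0.6288,
}

# Sort from best to worst score.
ranking = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:<24} {score:.4f}")
```

The spread between the best and worst classifier is only about 3.5 percentage points, and the outputs above do not show the variation across folds, so the ranking should be read with some caution.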


CountVectorizer tokenizes, builds a vocabulary, and encodes the text. In other words, it separates sentences into lists of words, removes Norwegian stop words (because we supply a Norwegian stop-word list), and builds dictionaries of the frequency of each word in each sentence. The stop words used here are downloaded from: https://raw.githubusercontent.com/stopwords-iso/stopwords-no/master/stopwords-no.json

The transform function turns the dictionaries from "CountVectorizer" into vectors, and the "get_feature_names_out" function returns the words in the two vocabularies as a list in alphabetical order. We see that the vectors built on lemmatized sentences are smaller.

The cross-validation score for the 3-class dataset is already quite low, so I chose not to use stop-word removal or a minimum document frequency limit, as they resulted in a drop in accuracy. The binary data did not drop in accuracy when implementing these methods. I believe this is due to the curse of dimensionality: the complexity introduced by using three sentiment classes instead of two means that we need exponentially more data to train our machine learning algorithm.

We still need to test neural network accuracy and find the optimal classifier for binary sentiment analysis.
