In this assignment you will be asked to extend the work by Gatti et al by checking whether form-meaning mappings learned on a different yet related language to that considered in the original study still capture the perceived valence of pseudowords. To do this you will be asked to engage with several different resources and adapt the pipeline following the instructions. Along the way, you will be asked to answer a few questions.

You need to submit the complete notebook in .ipynb format, with intermediate outputs visible. The notebook should be named as follows:

CL2025_groupN_assignment.ipynb

where N is the group number. Submissions in the wrong format or with names not adhering to the guidelines will not be evaluated.

In [None]:
# the code has been tested using the psycho-embeddings library to extract representations from LLMs.
!git clone https://github.com/MilaNLProc/psycho-embeddings.git
%cd psycho-embeddings
!pip install datasets
!pip install fasttext
!pip install pyreadr
!pip install enchant
!pip install pyenchant


Cloning into 'psycho-embeddings'...
remote: Enumerating objects: 199, done.
remote: Counting objects: 100% (199/199), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 199 (delta 105), reused 141 (delta 53), pack-reused 0 (from 0)
Receiving objects: 100% (199/199), 67.91 KiB | 574.00 KiB/s, done.
Resolving deltas: 100% (105/105), done.
/content/psycho-embeddings
Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.4/73.4 kB 2.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext 

In [None]:
# the solution to the assignment has been obtained using these packages.

import nltk
import numpy as np
import pandas as pd
import fasttext as ft
import pickle as pkl
import fasttext.util
from tqdm import tqdm
from collections import defaultdict
from transformers import AutoTokenizer
from psycho_embeddings import ContextualizedEmbedder

GroupViT models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version.Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.
TAPAS models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version. Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.


**Task 1** (*10 points available, see breakdown per task below*)

You should replicate the main design in the paper *Valence without meaning* by Gatti and colleagues (2024), using estimates collected for Dutch word valence to train linear regression models and apply them to predict the valence of English pseudowords from Gatti and colleagues.

In detail, to train your regression models, you should use the dataset by Speed and Brysbaert (2024), which contains crowd-sourced valence ratings for approximately 24,000 Dutch words (use the metadata to identify the relevant columns). See the paper *Ratings of valence, arousal, happiness, anger, fear, sadness, disgust, and surprise for 24,000 Dutch words* by Speed and Brysbaert (2024). Use the dataset available at this link: https://osf.io/h76zj.

You should train a letter unigram model and a bigram model. Each model should be trained on Dutch words only.
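
As a rough illustration of what "letter unigram" and "bigram" features can look like, the sketch below encodes a string as a vector of n-gram counts. It is a minimal example, not the reference solution: the helper names `char_ngrams` and `count_vector`, the `#` boundary marker, and the two-word toy vocabulary are all assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(word, n, boundary="#"):
    """Return the list of character n-grams in `word`.

    For n > 1 the word is padded with a boundary marker so that
    word-initial and word-final letters get their own n-grams.
    """
    padded = boundary + word + boundary if n > 1 else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def count_vector(word, n, index):
    """Encode `word` as a list of n-gram counts ordered by `index`."""
    counts = Counter(char_ngrams(word, n))
    vector = [0] * len(index)
    for gram, count in counts.items():
        if gram in index:  # n-grams unseen in the training words are ignored
            vector[index[gram]] = count
    return vector

# Toy "training" vocabulary (hypothetical, for illustration only)
train_words = ["mama", "papa"]
all_bigrams = sorted({g for w in train_words for g in char_ngrams(w, 2)})
bigram_index = {gram: i for i, gram in enumerate(all_bigrams)}

print(all_bigrams)
print(count_vector("pama", 2, bigram_index))
```

Building the n-gram inventory from the training words only means that n-grams appearing solely in the pseudowords contribute nothing to the prediction, which keeps the feature space identical at training and test time.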

Pay attention to one issue though: pseudowords created for English may be valid words in Dutch: therefore, you should first filter the list of pseudowords against a large store of Dutch words. To do so, use the words in the Dutch prevalence lexicon available in this OSF repository: https://osf.io/9zymw/. Essentially, you need to exclude any pseudoword that happens to be a word for which a prevalence estimate is available, whatever the prevalence is.
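
The filtering step described above can be sketched as a case-insensitive set lookup. This is only an illustration under stated assumptions: `split_by_lexicon` is a hypothetical helper, and the toy lists stand in for the real pseudoword list and the `word` column of the prevalence lexicon.

```python
def split_by_lexicon(candidates, lexicon_words):
    """Split candidates into (kept, removed) by case-insensitive lexicon lookup."""
    # Case-fold the lexicon once; a set gives constant-time membership tests.
    lexicon = {str(word).casefold() for word in lexicon_words}
    kept, removed = [], []
    for candidate in candidates:
        if candidate.casefold() in lexicon:
            removed.append(candidate)   # attested in the lexicon: exclude
        else:
            kept.append(candidate)      # not attested: keep as a pseudoword
    return kept, removed

# Toy data (hypothetical); the capitalised entry shows why case folding matters
toy_lexicon = ["T-shirt", "Pimpen", "aai"]
toy_pseudowords = ["abhert", "pimpen", "acoed"]

kept, removed = split_by_lexicon(toy_pseudowords, toy_lexicon)
print(kept)
print(removed)
```

Note that the prevalence value itself is never inspected: membership in the lexicon is enough to exclude a string, as the task requires.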

Each code block indicates how many points are available and how they are attributed.

In [None]:
from google.colab import drive
#drive.mount('/content/drive')

In [None]:
# read in the pseudowords from Gatti and colleagues, as well as the valence ratings for 24,000 Dutch words from Speed and Brysbaert (2024)
# show the first 5 lines of each dataset.
# 1 point for identifying the correct files and correctly loading their content

import pyreadr
import pandas as pd

original = pyreadr.read_r('/content/data_pseudovalence.RData')
dutch_lexicon = pd.read_csv('/content/prevalence_netherlands.csv', sep="\t")
twentyfour_thousand = pd.read_excel('/content/SpeedBrysbaertEmotionNorms.xlsx')



In [None]:
print(dutch_lexicon.head())
print(dutch_lexicon['word'].head())

      word  n.obs  irt.prevalence  z.irt.prevalence  prevalence  z.prevalence
0  T-shirt    324        0.986622          2.215053    0.978395      1.689888
1    aagje    303        0.907405          1.324941    0.877888      1.075808
2     aagt    324        0.169817         -0.954888    0.188272     -0.827920
3      aai    335        0.993290          2.472451    0.988060      1.794794
4  aaibaar    333        0.996284          2.676802    0.990991      1.830889
0    T-shirt
1      aagje
2       aagt
3        aai
4    aaibaar
Name: word, dtype: object


In [None]:
print(twentyfour_thousand.head())

   Word   Arousal   Valence ValenceCategory ValenceVsNeutral  Happiness  \
0  mama  2.812500  4.000000        positive         valenced   3.300000   
1    ja  2.823529  3.894737        positive         valenced   3.818182   
2  papa  2.562500  3.722222        positive         valenced   4.142857   
3   nee  2.928571  2.350000        negative          neutral   1.000000   
4  kaka  3.357143  2.050000        negative          neutral   1.090909   

      Anger      Fear   Sadness   Disgust  ...  Length Nsyl  N_phonemes  \
0  1.000000  1.000000  1.100000  1.000000  ...       4    2           4   
1  1.090909  1.181818  1.181818  1.000000  ...       2    1           2   
2  1.142857  1.000000  1.000000  1.000000  ...       4    2           4   
3  1.727273  1.363636  1.454545  1.363636  ...       3    1           2   
4  1.454545  1.181818  1.000000  4.727273  ...       4    2           4   

        PoS  OLD20       AoA      DLP_RT   DLP_Acc   DCP_RT DCP_Acc  
0         N   1.55  2.044257

In [None]:
pseudovalence_df = list(original.values())[1]
pseudovalence_df.head()

# index 1 selects 'data_2', which holds the pseudowords; index 0 ('data_fin') gave the wrong columns

Unnamed: 0,X,pseudoword,Value,predicted_valence,predictedL_valence,predictedL_Bi_valence,predicted_Dim_valence,predictedL_Dim_valence,predictedBi_Dim_valence,predictedBi_valence,LDist,Ortho_VAL,Semant_Neigh,SDist,Semant_VAL
0,1,abhert,0.452501,7.414814,5.116167,6.444633,6.783771,6.630497,7.414814,6.444633,2,4.655714,ordinary,0.558492,5.05
1,2,abhict,0.434171,8.233714,5.059183,6.509936,7.366068,7.377534,8.233714,6.509936,2,3.093333,cardigan,0.622202,5.95
2,3,acleat,0.527803,5.552468,5.262971,5.245826,5.268643,5.396114,5.552468,5.245826,1,4.24,solarium,0.57515,6.1
3,4,acmure,0.604889,8.71464,5.120029,6.562896,7.680827,7.58323,7.80991,5.414532,2,5.885,bad,0.570299,3.24
4,5,acoed,0.53899,7.340002,5.115652,5.309727,7.105662,7.024771,7.340002,5.309727,1,5.68,girl,0.499035,7.15


In [None]:
print(original.keys())
pseudowords0 = original['data_2']['pseudoword']
pseudowords1 = original['data_3']['pseudoword']
pseudowords0 = pseudowords0.tolist()
pseudowords1 = pseudowords1.tolist()
print(pseudowords0[:5])
print('next data:')
print(pseudowords1[:5])
print(original['data_2'].head())


odict_keys(['data_fin', 'data_2', 'data_3', '.Random.seed', 'Count', 'comb_2', 'comb_3'])
['abhert', 'abhict', 'acleat', 'acmure', 'acoed']
next data:
['acleat', 'acmure', 'acraw', 'adlor', 'adpite']
   X pseudoword     Value  predicted_valence  predictedL_valence  \
0  1     abhert  0.452501           7.414814            5.116167   
1  2     abhict  0.434171           8.233714            5.059183   
2  3     acleat  0.527803           5.552468            5.262971   
3  4     acmure  0.604889           8.714640            5.120029   
4  5      acoed  0.538990           7.340002            5.115652   

   predictedL_Bi_valence  predicted_Dim_valence  predictedL_Dim_valence  \
0               6.444633               6.783771                6.630497   
1               6.509936               7.366068                7.377534   
2               5.245826               5.268643                5.396114   
3               6.562896               7.680827                7.583230   
4               

In [None]:
# filter out pseudowords that happen to be valid Dutch words (mind case folding!)
# show the set of pseudowords filtered out.
# 1 point for applying the correct filtering

# Each pseudoword is looked up in the lower-cased Dutch prevalence lexicon;
# any match is treated as a Dutch word and excluded.
# Only one pseudoword turns out to be a Dutch word: 'pimpen'.

dutch_words = []          # pseudowords that are also Dutch words
pseudoword_complete = []  # pseudowords that survive the filter

# Lower-case the lexicon once; the set gives constant-time membership tests
dutch_lexicon_lower = [str(ele).lower() for ele in dutch_lexicon["word"]]
dutch_lexicon_set = set(dutch_lexicon_lower)

for i in pseudowords0:
  if i.lower() not in dutch_lexicon_set:
    pseudoword_complete.append(i)
  else:
    dutch_words.append(i)

print("there are", len(dutch_lexicon_lower), "words in the lexicon")
print("pseudoword database length is", len(pseudowords0))
print(len(pseudoword_complete), "are pseudowords")
print("and the words are", pseudoword_complete)
print("there are", len(dutch_words), "dutch words")
dutch_words



there are 54319 words in the lexicon
pseudoword database length is 1500
1499 are pseudowords
and the words are ['abhert', 'abhict', 'acleat', 'acmure', 'acoed', 'acoy', 'acraw', 'adeb', 'adlor', 'adpite', 'adrord', 'aercup', 'aflo', 'aflouse', 'afruist', 'aftot', 'aftul', 'afuke', 'agind', 'akiype', 'akiyse', 'akiysm', 'alinch', 'alproubt', 'amesse', 'amle', 'ampgrair', 'ancil', 'anneenths', 'ansey', 'apeted', 'apgen', 'apgents', 'apgert', 'apgor', 'apgra', 'apjoled', 'apjorm', 'appite', 'apuss', 'areese', 'arepent', 'arfin', 'arreges', 'artaits', 'aruds', 'arvol', 'arwarts', 'asarps', 'ascel', 'atben', 'atelecks', 'atrur', 'atryr', 'atsty', 'attice', 'atux', 'avoed', 'avol', 'awgits', 'awslonts', 'awturps', 'axnur', 'axswan', 'axswe', 'axude', 'axwas', 'aymupt', 'baflew', 'balal', 'balrims', 'bancid', 'baper', 'bapet', 'bapey', 'bapger', 'bapgion', 'bapgy', 'bapil', 'barought', 'barrocts', 'basaves', 'bavelts', 'baxswing', 'bayfy', 'beenish', 'belorse', 'belsot', 'bengeer', 'beplalds'

['pimpen']

In [None]:
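# NOTE: `predicted_valence` is a model prediction from the original study, not a
# human rating; if observed ratings are available (possibly the `Value` column,
# to be checked against the dataset documentation), they should be used instead.
true_valence_pseudo_word = original['data_2']['predicted_valence']
print(len(true_valence_pseudo_word))
# drop the entry for 'pimpen', which the Dutch-lexicon filter removed
pimpen_idx = pseudowords0.index('pimpen')
print(pimpen_idx)
true_valence_pseudo_word = true_valence_pseudo_word.drop(index=pimpen_idx)
print(len(true_valence_pseudo_word))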

1500
900
1499


In [None]:
# encode Dutch words and pseudowords from Gatti et al as uni- and bi-gram vectors
# show the uni-gram and bi-gram encoding of the pseudoword ampgrair
# 2 points for correctly encoding the target strings as uni- and bi-gram vectors

In [None]:
import numpy as np
from collections import Counter

def extract_ngram_features(training_words):
    """Extract unique uni-grams and bi-grams from training data"""
    unigrams = set()
    bigrams = set()

    for word in training_words:
        word_str = str(word)
        # Add word boundary markers
        word_with_boundaries = '#' + word_str + '#'

        # Extract unigrams
        for char in word_str:
            unigrams.add(char)

        # Extract bigrams
        for i in range(len(word_with_boundaries) - 1):
            bigram = word_with_boundaries[i:i+2]
            bigrams.add(bigram)

    unigram_list = sorted(list(unigrams))
    bigram_list = sorted(list(bigrams))

    return unigram_list, bigram_list

def encode_ngrams(word, unigram_features, bigram_features):
    """Encode a word using the extracted n-gram features"""
    # Create feature vectors
    uni_vector = np.zeros(len(unigram_features))
    bi_vector = np.zeros(len(bigram_features))
    word_str = str(word)

    # Unigram encoding
    for char in word_str:
        if char in unigram_features:
            idx = unigram_features.index(char)
            uni_vector[idx] += 1

    # Bigram encoding with word boundaries
    word_with_boundaries = '#' + word_str + '#'
    for i in range(len(word_with_boundaries) - 1):
        bigram = word_with_boundaries[i:i+2]
        if bigram in bigram_features:
            idx = bigram_features.index(bigram)
            bi_vector[idx] += 1

    return uni_vector, bi_vector

# The training data is the list of Dutch words from Speed and Brysbaert
training_data = twentyfour_thousand['Word'].tolist()

# Extract features from training data
unigram_features, bigram_features = extract_ngram_features(training_data)

print(f"Number of unique unigrams: {len(unigram_features)}")
print(f"Unigrams: {unigram_features}")
print(f"\nNumber of unique bigrams: {len(bigram_features)}")
print(f"Sample bigrams: {bigram_features[:10]}...")

# Encode the pseudoword "ampgrair"
target_word = "ampgrair"
uni_result, bi_result = encode_ngrams(target_word, unigram_features, bigram_features)

print(f"\n\nEncoding for '{target_word}':")
print(f"Uni-gram vector shape: {uni_result.shape}")
print(f"Uni-gram encoding: {uni_result}")

print(f"\nBi-gram vector shape: {bi_result.shape}")
print(f"Bi-gram encoding: {bi_result}")

# Show which features are active
print(f"\nActive unigrams in '{target_word}':")
for i, count in enumerate(uni_result):
    if count > 0:
        print(f"  '{unigram_features[i]}': {int(count)} times")

print(f"\nActive bigrams in '{target_word}' (with boundaries):")
word_with_boundaries = '#' + target_word + '#'
for i, count in enumerate(bi_result):
    if count > 0:
        print(f"  '{bigram_features[i]}': {int(count)} times")

Number of unique unigrams: 34
Unigrams: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'è', 'é', 'ê', 'ë', 'î', 'ï', 'ö', 'ü']

Number of unique bigrams: 656
Sample bigrams: ['#a', '#b', '#c', '#d', '#e', '#f', '#g', '#h', '#i', '#j']...


Encoding for 'ampgrair':
Uni-gram vector shape: (34,)
Uni-gram encoding: [2. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 2. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Bi-gram vector shape: (656,)
Bi-gram encoding: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.

In [None]:
twentyfour_thousand_words = twentyfour_thousand['Word'].tolist()

In [None]:
# Encode all Dutch words and store the results
uni_vectors_twentyfour_thousand = {}
bi_vectors_twentyfour_thousand = {}

for word in twentyfour_thousand_words:
    uni_vector, bi_vector = encode_ngrams(word, unigram_features, bigram_features)
    uni_vectors_twentyfour_thousand[word] = uni_vector
    bi_vectors_twentyfour_thousand[word] = bi_vector

# Encode all filtered pseudowords and store the results
uni_vectors_pseudowords = {}
bi_vectors_pseudowords = {}

for word in pseudoword_complete:
    uni_vector, bi_vector = encode_ngrams(word, unigram_features, bigram_features)
    uni_vectors_pseudowords[word] = uni_vector
    bi_vectors_pseudowords[word] = bi_vector

print("Finished encoding Dutch words and pseudowords.")

Finished encoding Dutch words and pseudowords.


In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np
# use word valence estimates from Speed and Brysbaert (2024) to train
# - a uni-gram model
# - a bi-gram model
# 2 points for correctly trained models


twentyfour_thousand_valence = twentyfour_thousand['Valence']

# Uni-gram training data
X_uni = np.array([uni_vectors_twentyfour_thousand[word] for word in twentyfour_thousand_words])
y = np.array(twentyfour_thousand_valence)

# Train uni-gram model
uni_model = LinearRegression()
uni_model.fit(X_uni, y)

# Bi-gram training data
X_bi = np.array([bi_vectors_twentyfour_thousand[word] for word in twentyfour_thousand_words])

# Train bi-gram model
bi_model = LinearRegression()
bi_model.fit(X_bi, y)


In [None]:
# apply trained models to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same models back onto the training set to see how well they predict the valence of words in Speed and Brysbaert (2024).
# 2 points for correctly applied models



In [None]:


pseudowords = list(uni_vectors_pseudowords.keys())
X_pseudo_uni = np.array([uni_vectors_pseudowords[pw] for pw in pseudowords])
X_pseudo_bi = np.array([bi_vectors_pseudowords[pw] for pw in pseudowords])

# Predict using uni-gram and bi-gram models
pseudo_pred_uni = uni_model.predict(X_pseudo_uni)
pseudo_pred_bi = bi_model.predict(X_pseudo_bi)

# Create DataFrame for pseudoword predictions
pseudo_results = pd.DataFrame({
    'pseudoword': pseudowords,
    'predicted_valence_uni': pseudo_pred_uni,
    'predicted_valence_bi': pseudo_pred_bi,
})

pseudo_results.head()

Unnamed: 0,pseudoword,predicted_valence_uni,predicted_valence_bi
0,abhert,2.919647,3.325024
1,abhict,2.932709,3.132259
2,acleat,3.001939,3.247477
3,acmure,2.991462,3.151264
4,acoed,2.983596,3.112954


In [None]:
# Predict valence for the Dutch training words
X_TFT_uni = np.array([uni_vectors_twentyfour_thousand[pw] for pw in twentyfour_thousand_words])
X_TFT_bi = np.array([bi_vectors_twentyfour_thousand[pw] for pw in twentyfour_thousand_words])

# Predict using uni-gram and bi-gram models
TFT_pred_uni = uni_model.predict(X_TFT_uni)
TFT_pred_bi = bi_model.predict(X_TFT_bi)

# Create DataFrame for Dutch word predictions
TFT_results = pd.DataFrame({
    'dutch word': twentyfour_thousand_words,
    'predicted_valence_uni': TFT_pred_uni,
    'predicted_valence_bi': TFT_pred_bi,
    'actual_valence': twentyfour_thousand_valence
})

TFT_results.head()

Unnamed: 0,dutch word,predicted_valence_uni,predicted_valence_bi,actual_valence
0,mama,2.973566,3.154043,4.0
1,ja,3.048731,3.020741,3.894737
2,papa,2.996438,3.214637,3.722222
3,nee,2.989986,2.865388,2.35
4,kaka,2.888163,2.985265,2.05


In [None]:
# compute the Spearman correlation coefficients between true valence and predicted valence under both uni- and bi-gram models for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# Then show both correlation coefficients.
# 2 points for the correct Spearman correlation coefficients (rounded to the third decimal place)
from scipy.stats import spearmanr
# Spearman correlation
rho_uni_dutch, _ = spearmanr(twentyfour_thousand_valence, TFT_pred_uni)
rho_bi_dutch, _ = spearmanr(twentyfour_thousand_valence, TFT_pred_bi)

print(f"Spearman correlation (Dutch words) — Uni-gram: {rho_uni_dutch:.3f}")
print(f"Spearman correlation (Dutch words) — Bi-gram: {rho_bi_dutch:.3f}")

# Correlations for pseudowords are needed as well. No observed valence was identified,
# so the `predicted_valence` column of the original dataset serves as the reference.
rho_uni_pseudo, _ = spearmanr(true_valence_pseudo_word, pseudo_pred_uni)
rho_bi_pseudo, _ = spearmanr(true_valence_pseudo_word, pseudo_pred_bi)

print(f"Spearman correlation (Pseudowords) — Uni-gram: {rho_uni_pseudo:.3f}")
print(f"Spearman correlation (Pseudowords) — Bi-gram: {rho_bi_pseudo:.3f}")


Spearman correlation (Dutch words) — Uni-gram: 0.090
Spearman correlation (Dutch words) — Bi-gram: 0.328
Spearman correlation (Pseudowords) — Uni-gram: 0.327
Spearman correlation (Pseudowords) — Bi-gram: 0.341


**Task 2** (*8 points available, see breakdown below*)

Again following Gatti and colleagues, you should encode the target strings (pseudowords and Dutch words from Speed and Brysbaert) as fastText embeddings, train a multiple regression model on Dutch words and apply it to the pseudowords in Gatti et al. You should finally report the Spearman correlation coefficient between observed and predicted valence for both words and pseudowords.
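
For reference, the Spearman coefficient requested here is the Pearson correlation computed on ranks rather than on raw values. The sketch below spells this out in plain Python so the steps are visible; the helper names and the five toy rating pairs are assumptions for illustration, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy observed and predicted ratings (hypothetical values)
observed = [4.0, 3.9, 3.7, 2.35, 2.05]
predicted = [3.15, 3.02, 3.21, 2.87, 2.99]
print(round(spearman(observed, predicted), 3))
```

Because only ranks matter, the coefficient is unaffected by the scale of the predictions, which is useful when the training ratings and the pseudoword ratings were collected on different scales.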

You should use the pre-trained fastText model for Dutch, available at this page: https://fasttext.cc/docs/en/crawl-vectors.html

Finally, you should answer two questions about the fastText model (see below).

In [None]:
!pip install fasttext



In [None]:
import fasttext.util

fasttext.util.download_model('nl', if_exists='ignore')

ft = fasttext.load_model('cc.nl.300.bin')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz



What is the dimensionality of the pre-trained Dutch fastText embeddings? (*1 point for the correct answer*)

In [None]:
embedding_dimension = ft.get_dimension()
print(f"The dimensionality of the pre-trained Dutch fastText embeddings is {embedding_dimension}.")

The dimensionality of the pre-trained Dutch fastText embeddings is 300.


In [None]:
def encode_corpus_fasttext(corpus, ft_model, mapping=None):
    # if no mapping is provided, then use all dimensions of the model
    if mapping is None:
        dim = ft_model.get_dimension()
        mapping = list(range(dim))

    # create a feature matrix of the appropriate dimensionality
    X = np.zeros((len(corpus), len(mapping)))

    for i, instance in enumerate(corpus):
        vec = ft_model.get_word_vector(instance)
        X[i] = vec
    return X, mapping


# Encoding Dutch words and pseudowords as fastText embeddings

# Real Dutch words from Speed & Brysbaert
dutch_words = twentyfour_thousand['Word'].tolist()
ft_dutch_words, _ = encode_corpus_fasttext(dutch_words, ft)

# Pseudowords filtered from Gatti et al. set
pseudowords = pseudoword_complete  # this is the already filtered list
ft_pseudowords, _ = encode_corpus_fasttext(pseudowords, ft)

print("Dutch words embeddings shape: ", ft_dutch_words.shape)
print("Pseudowords embeddings shape: ", ft_pseudowords.shape)

print("Embedding of 'speelplaats':", ft.get_word_vector('speelplaats')[:20])
print("Embedding of 'danchunk':", ft.get_word_vector('danchunk')[:20])

# Show subwords used in 'speelplaats'
subwords, _ = ft.get_subwords('speelplaats')
print("Subwords of 'speelplaats':", subwords)


Dutch words embeddings shape:  (23986, 300)
Pseudowords embeddings shape:  (1499, 300)
Embedding of 'speelplaats': [ 0.0253247  -0.00634261  0.02746305 -0.04024595  0.04888906  0.00660965
 -0.04152017 -0.01824508 -0.00645641  0.00093806  0.0708492  -0.03291791
  0.00263817 -0.02825846 -0.02188046 -0.03188037 -0.01846142 -0.02203094
 -0.01883078 -0.00259199]
Embedding of 'danchunk': [-0.00592199  0.00097547  0.05925412  0.00053251 -0.00386978 -0.02089076
 -0.02829577  0.00972911 -0.02510111 -0.11454885 -0.02695064  0.01551034
  0.02384409  0.01009528  0.04545438  0.00997385 -0.00474529  0.02524533
  0.02430548 -0.02851078]
Subwords of 'speelplaats': ['speelplaats', '<spee', 'speel', 'peelp', 'eelpl', 'elpla', 'lplaa', 'plaat', 'laats', 'aats>']


What minimum and maximum n-gram size was specified for training this fastText model? (*1 point for the correct answer*)

In [None]:
long_word = next(word for word in ft.words if len(word) >= 10 and word.isalpha())

# Get subwords from FastText
subwords, _ = ft.get_subwords(long_word)

# Keep only subwords without boundary markers (< or >) and drop the full word itself
clean_subwords = [s for s in subwords if s.isalpha() and s != long_word]
subword_lengths = [len(s) for s in clean_subwords]

print(f"The minimum n-gram size is {min(subword_lengths)}, and the maximum is {max(subword_lengths)}.")

The minimum n-gram size is 5, and the maximum is 5.
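
To see why every listed subword has length 5, it can help to reproduce the character n-gram extraction by hand. The sketch below is a simplified approximation of fastText's behaviour, not its actual implementation: the function name `fasttext_style_ngrams` is hypothetical, and the real library has extra rules (for example for multi-byte characters) that are ignored here.

```python
def fasttext_style_ngrams(word, minn, maxn):
    """Return character n-grams of `word`, wrapped in < and > boundary markers,
    for every size n with minn <= n <= maxn."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

grams = fasttext_style_ngrams("speelplaats", 5, 5)
print(grams)
print({len(gram) for gram in grams})
```

With `minn = maxn = 5` this yields the same nine 5-grams that `get_subwords('speelplaats')` printed earlier in the notebook (the tenth entry there is the full word itself).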


Using fastText's built-in functions instead of multiple loops.


In [None]:

# Get min and max n-gram sizes
min_n = ft.f.getArgs().minn
max_n = ft.f.getArgs().maxn

print(f"Min n-gram size: {min_n}")
print(f"Max n-gram size: {max_n}")

Min n-gram size: 5
Max n-gram size: 5


In [None]:
# encode Dutch words and pseudowords as fastText embeddings
# show the first 20 values of the embedding of the word 'speelplaats' and of the pseudoword 'danchunk'
# 2 points for correctly encoding words and pseudowords with fastText

embedding_real = ft.get_word_vector('speelplaats')
embedding_fake = ft.get_word_vector('danchunk')

print(embedding_real[:20])
print(embedding_fake[:20])

[ 0.0253247  -0.00634261  0.02746305 -0.04024595  0.04888906  0.00660965
 -0.04152017 -0.01824508 -0.00645641  0.00093806  0.0708492  -0.03291791
  0.00263817 -0.02825846 -0.02188046 -0.03188037 -0.01846142 -0.02203094
 -0.01883078 -0.00259199]
[-0.00592199  0.00097547  0.05925412  0.00053251 -0.00386978 -0.02089076
 -0.02829577  0.00972911 -0.02510111 -0.11454885 -0.02695064  0.01551034
  0.02384409  0.01009528  0.04545438  0.00997385 -0.00474529  0.02524533
  0.02430548 -0.02851078]


In [None]:

dutch_embeddings_dict = {}
pseudo_embeddings_dict = {}

# Loop over the list of words/pseudowords
for word in twentyfour_thousand_words:
    embedding = ft.get_word_vector(word)
    dutch_embeddings_dict[word] = embedding

# Loop over the list of words/pseudowords
for word in pseudowords:
    embedding = ft.get_word_vector(word)
    pseudo_embeddings_dict[word] = embedding
# Print the first 20 values of the embeddings for 'speelplaats' and 'danchunk'
print("Embedding for 'speelplaats' (first 20 values):", dutch_embeddings_dict['speelplaats'][:20])
print("Embedding for 'danchunk' (first 20 values):", pseudo_embeddings_dict['danchunk'][:20])


Embedding for 'speelplaats' (first 20 values): [ 0.0253247  -0.00634261  0.02746305 -0.04024595  0.04888906  0.00660965
 -0.04152017 -0.01824508 -0.00645641  0.00093806  0.0708492  -0.03291791
  0.00263817 -0.02825846 -0.02188046 -0.03188037 -0.01846142 -0.02203094
 -0.01883078 -0.00259199]
Embedding for 'danchunk' (first 20 values): [-0.00592199  0.00097547  0.05925412  0.00053251 -0.00386978 -0.02089076
 -0.02829577  0.00972911 -0.02510111 -0.11454885 -0.02695064  0.01551034
  0.02384409  0.01009528  0.04545438  0.00997385 -0.00474529  0.02524533
  0.02430548 -0.02851078]


In [None]:
# train regression model on word valence
# 1 point for correctly training the regression model

X_dutch_embedding = np.array([dutch_embeddings_dict[word] for word in twentyfour_thousand_words])
y_dutch_embedding = np.array(twentyfour_thousand_valence)
embed_model = LinearRegression()
embed_model.fit(X_dutch_embedding, y_dutch_embedding)


In [None]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

X_pseudo_embedding = np.array([pseudo_embeddings_dict[pw] for pw in pseudowords])
prediction_pseudo_words = embed_model.predict(X_pseudo_embedding )
prediction_dutch_words = embed_model.predict(X_dutch_embedding)

In [None]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# Then show the correlation coefficient.
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

# Spearman correlation
rho_dutch_embedding, _ = spearmanr(twentyfour_thousand_valence, prediction_dutch_words)
rho_pseudo_embedding, _ = spearmanr(true_valence_pseudo_word, prediction_pseudo_words)

print(f"Spearman correlation (Dutch words) — embedding model: {rho_dutch_embedding:.3f}")
print(f"Spearman correlation (Pseudo words) — embedding model: {rho_pseudo_embedding:.3f}")


Spearman correlation (Dutch words) — embedding model: 0.724
Spearman correlation (Pseudo words) — embedding model: 0.176


**Task 3** (*6 points available, see breakdown below*)

Now you are asked to extend the work by Gatti et al by also considering the representations learned by a transformer-based model, namely *RobBERT v2* (https://huggingface.co/pdelobelle/robbert-v2-dutch-base). You should follow the same pipeline as for the previous models, encoding both Dutch words from Speed and Brysbaert (2024) and the pseudowords from Gatti et al using the embedding of each string at layer 0, before positional information is factored in. If a string consists of multiple tokens, average the embeddings of all tokens to produce the embedding of the whole string. Then train a multiple regression model on the valence of Dutch words, apply it to the pseudowords, and compute the Spearman correlation between observed and predicted ratings.
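
The averaging step can be illustrated independently of any model. The sketch below uses plain Python lists in place of tensors; the function `mean_pool`, the 2-dimensional toy vectors, and the decision to mask out special and padding tokens are all assumptions for illustration, since the task text does not say whether special tokens should be part of the average.

```python
def mean_pool(token_vectors, keep_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, keep_mask) if keep]
    if not kept:
        raise ValueError("no tokens left to average")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical 2-dimensional embeddings for the token sequence
# [<s>, subword_1, subword_2, </s>, <pad>]
token_vectors = [[9.0, 9.0], [1.0, 2.0], [3.0, 6.0], [9.0, 9.0], [0.0, 0.0]]
subword_mask = [0, 1, 1, 0, 0]  # 1 = real subword, 0 = special or padding token

print(mean_pool(token_vectors, subword_mask))
```

Whichever tokens are kept, the same rule should be applied to both the Dutch words and the pseudowords so that the regression sees comparable inputs.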

Use the HuggingFace model card for RobBERT v2 to check how to access it.

I recommend saving the embeddings to file once you have generated them and you know they are correct: embedding thousands of strings takes some time, and you don't want to have to do it again. For the same reason, develop your code by considering only a small fraction of the words and pseudowords, in order to quickly see if something is wrong. Only when you are positive it works, embed all strings.
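
One possible way to follow this advice is a small load-or-compute helper built on `pickle`. The sketch below is an assumption-laden example: `load_or_compute`, `fake_embed`, and the file name are hypothetical, and the stand-in function returns made-up vectors instead of calling a model.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute_fn):
    """Return the object pickled at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute_fn()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def fake_embed():
    # Stand-in for an expensive embedding step (hypothetical values)
    calls.append(1)
    return {"mama": [0.1, 0.2], "papa": [0.3, 0.4]}

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings_subset.pkl")
first = load_or_compute(cache_path, fake_embed)   # computes and writes the cache
second = load_or_compute(cache_path, fake_embed)  # reads the cache instead
print(first == second, len(calls))
```

During development the same helper can be pointed at a short slice of the word list (with a different file name), so a full run only happens once the pipeline is known to work.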

In [None]:
# load and instantiate the right model
# 1 point for loading the right model
# RobertaModel is the bare encoder; no classification head is needed to extract embeddings
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaModel.from_pretrained("pdelobelle/robbert-v2-dutch-base")


loading file vocab.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/tokenizer_config.json
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/tokenizer.json
loading file chat_template.jinja from c

In [None]:
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel

In [None]:
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def chunks(lst, n):
    """Chunks a list into equal chunks containing n elements. Returns a list of lists."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]

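def get_embeddings(words):
    """Return one mean-pooled layer-0 embedding per input string."""
    inputs = tokenizer(words, return_tensors='pt', padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the output of the embedding layer, shape: (batch_size, seq_len, hidden_size).
    # NOTE: in the HuggingFace RoBERTa implementation this output already includes
    # position embeddings; the position-free token embeddings are exposed by
    # model.get_input_embeddings().
    token_embeddings = outputs.hidden_states[0]
    attention_mask = inputs['attention_mask'].unsqueeze(-1)

    # Average token embeddings across non-padding tokens
    # (the special tokens <s> and </s> are part of this average)
    summed = (token_embeddings * attention_mask).sum(1)
    counts = attention_mask.sum(1)
    avg_embeddings = (summed / counts).cpu().numpy()

    return avg_embeddings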

# Encode in batches
batch_size = 32
realword_chunks = chunks(twentyfour_thousand_words, batch_size)
pseudoword_chunks = chunks(pseudoword_complete, batch_size)

realword_embeddings = []
pseudoword_embeddings = []

for chunk in realword_chunks:
    realword_embeddings.extend(get_embeddings(chunk))

for chunk in pseudoword_chunks:
    pseudoword_embeddings.extend(get_embeddings(chunk))
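
The pooling step in `get_embeddings` can be illustrated without a model. The following is a minimal sketch in plain Python, assuming toy two-dimensional token vectors; the function name `masked_mean` and the example values are hypothetical and only mirror the masked sum and count used in the cell above.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vec)]
            count += 1
    return [s / count for s in summed]

# Toy input (hypothetical values): two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result, which is why the attention mask is applied before summing.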



In [None]:
# Check shapes
print("Real word embeddings shape:", np.array(realword_embeddings).shape)
print("Pseudoword embeddings shape:", np.array(pseudoword_embeddings).shape)

# Print first 20 embedding values for example words
sample_real_word = twentyfour_thousand_words[0]
sample_pseudoword = pseudoword_complete[0]

print(f"\nFirst 20 values for '{sample_real_word}':")
print(realword_embeddings[0][:20])

print(f"\nFirst 20 values for '{sample_pseudoword}':")
print(pseudoword_embeddings[0][:20])


Real word embeddings shape: (23986, 768)
Pseudoword embeddings shape: (1499, 768)

First 20 values for 'mama':
[ 0.10573046 -0.13243617  0.02660742 -0.16841179 -0.27212244  0.22280149
  0.34766036  0.1087931   0.18475425 -0.15579909 -0.16617611  0.39199668
  0.14786334  0.31454724 -0.08547205  0.09090855  0.2049371   0.33787522
 -0.08932306 -0.07955239]

First 20 values for 'abhert':
[-0.12733552 -0.12629339  0.19149742 -0.11284536 -0.15879726  0.19802031
 -0.13981844  0.39876682 -0.27620262  0.26281267  0.16537903  0.29070553
  0.01715505  0.4492181  -0.06860778  0.26204032  0.1455008   0.24524692
 -0.26653197  0.02474991]


In [None]:
# Check and print 'miauwen'
if 'miauwen' in twentyfour_thousand_words:
    index_miauwen = twentyfour_thousand_words.index('miauwen')
    embedding_miauwen = realword_embeddings[index_miauwen]
    print(f"'miauwen' found at index {index_miauwen}")
    print("First 20 embedding values for 'miauwen':")
    print(embedding_miauwen[:20])
else:
    print("'miauwen' not found in real word list.")

# Check and print 'lixthless'
if 'lixthless' in pseudoword_complete:
    index_lixthless = pseudoword_complete.index('lixthless')
    embedding_lixthless = pseudoword_embeddings[index_lixthless]
    print(f"\n'lixthless' found at index {index_lixthless}")
    print("First 20 embedding values for 'lixthless':")
    print(embedding_lixthless[:20])
else:
    print("'lixthless' not found in pseudoword list.")


'miauwen' found at index 274
First 20 embedding values for 'miauwen':
[-0.10084925 -0.21053834 -0.04177775 -0.08898214 -0.23452406  0.45046386
  0.40555865 -0.174505    0.3688112   0.06516795  0.13631444 -0.10772562
  0.11235957  0.12476961 -0.23446035  0.4449307   0.14150427 -0.07074475
 -0.14118971 -0.1270578 ]

'lixthless' found at index 694
First 20 embedding values for 'lixthless':
[-0.20995182  0.34993732 -0.17283982  0.08660037 -0.24632312  0.32064924
  0.252103   -0.02056611 -0.00912013  0.15606797  0.08206394  0.16542049
 -0.3391408   0.48972556 -0.25270566  0.06884155  0.4151193  -0.08710929
  0.0304805  -0.21037637]


In [None]:
# train regression model on word valence estimates from Speed and Brysbaert (2024)
# 1 point for correctly training the regression model

X_train = realword_embeddings
y_train = twentyfour_thousand["Valence"].tolist()

lm = LinearRegression()
lm.fit(X_train, y_train)

In [None]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

from scipy.stats import spearmanr

# Applying trained model to pseudowords
X_test = pseudoword_embeddings
y_test = true_valence_pseudo_word

y_pred = lm.predict(X_test)  # predicted valence for pseudowords
y_pred_train = lm.predict(X_train)  # predicted valence for real words

In [None]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient.
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

# Spearman correlation
from scipy.stats import spearmanr

corr_real, _ = spearmanr(y_train, y_pred_train)
print("Spearman correlation for real words (training set): {:.3f}".format(corr_real))

corr_pseudo, _ = spearmanr(y_test, y_pred)
print("Spearman correlation for pseudowords: {:.3f}".format(corr_pseudo))



Spearman correlation for real words (training set): 0.518
Spearman correlation for pseudowords: 0.185
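
As a reminder of what the coefficient measures, Spearman's ρ is the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration, not the `scipy` implementation; the helper names and the toy scores are assumptions made for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores: the relation is monotone but far from linear.
true_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [10.0, 20.0, 25.0, 400.0, 401.0]
rho = spearman(true_scores, predicted)
r = pearson(true_scores, predicted)
print(round(rho, 3), r < rho)
```

Because only the ordering matters, ρ reaches its maximum here even though the linear (Pearson) correlation on the raw values is lower.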




**Task 4** (*16 points available, 4 for each question*)

Answer the following questions.

**4a.** Describe the performance of each featurization, comparing
- the performance of the same model between the training and test set
- the performance of different models on the training set
- the performance of different models on the test set

(*4 points available, max 150 words*)

**Same model (training [Dutch words] vs. test [Pseudowords]):**
Uni-gram performs poorly on the training set (ρ = 0.090) but improves on pseudowords (ρ = 0.327), likely due to letter-level cues matching affective proxies. Bi-gram performs moderately on Dutch (ρ = 0.328) and slightly better on pseudowords (ρ = 0.341), indicating stable but modest generalization. FastText excels on Dutch (ρ = 0.724) but drops sharply on pseudowords (ρ = 0.176), suggesting strong lexical representation but weak transfer. RobBERT also shows moderate performance on Dutch (ρ = 0.518) and declines on pseudowords (ρ = 0.184), likely due to the model's reliance on meaningful context.

**Training set comparison:**
FastText (ρ = 0.724) outperforms RobBERT (ρ = 0.518), bi-gram (ρ = 0.328), and uni-gram (ρ = 0.090), highlighting the advantage of rich subword embeddings for real-word valence prediction.

Test set comparison:
Bi-gram (ρ = 0.341) leads slightly, followed by uni-gram (ρ = 0.327), RobBERT (ρ = 0.184), and FastText (ρ = 0.176), suggesting simpler models better align with pseudoword surface patterns.



**4b.** Compare the correlations you found when training uni-gram, bi-gram, and fastText models on Dutch words and the correlations of similar models trained on English data as reported by Gatti and colleagues; summarize the most important similarities and differences.

(*4 points available, max 150 words*)

On real Dutch words, the uni-gram and bi-gram models achieved modest valence correlations (ρ = 0.090 and ρ = 0.328), closely matching Gatti et al.’s English-trained letter (ρ = 0.11) and bigram (ρ = 0.33) models. FastText achieved a high correlation in both studies (ρ = 0.724 for Dutch; ρ = 0.79 for English), confirming its strong semantic generalization across languages. The key difference concerns pseudowords: Gatti et al.’s best-performing model was letter-only, outperforming bigrams and embeddings, whereas on our test set the Dutch-trained bigram model (ρ = 0.341) slightly outperformed the unigram model (ρ = 0.327), suggesting that Dutch speakers may rely more on subword co-occurrence patterns. Overall, the studies agree that surface-form cues matter more than semantic ones in pseudoword valence, but the specific form-based predictors may vary across languages.

**4c.** Do you think the performance of the fastText featurization would change if you were to use different n-grams? Would you make them smaller or larger? Justify your answer.

(*4 points available, max 150 words*)

The performance of the fastText featurization could change with different n-gram sizes, because character n-grams of different lengths capture subword information at different granularities. Smaller n-grams focus on finer-grained subword patterns, potentially improving performance for short words or pseudowords with subtle letter-based variations, but they may miss longer recurring patterns. Larger n-grams span more of each word, which could benefit longer words or those with complex morphological structures, but they risk overfitting or generating sparse representations for rare patterns.
For this task, given our findings that fastText underperformed on pseudowords (ρ = 0.176), slightly shortening the n-grams might improve performance by increasing overlap with subword patterns common in pseudowords. This is supported by Gatti et al.’s supplementary results, where altering n-gram lengths had a measurable (though limited) effect on prediction quality.
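
To make the role of n-gram length concrete, here is a simplified sketch of character n-gram extraction with `<` and `>` word-boundary markers, in the style fastText uses for subwords. The function `char_ngrams` and the example words are hypothetical; the defaults of 3 and 6 follow fastText's documented `minn`/`maxn` defaults, which may differ from the settings of the pretrained vectors used in this notebook.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# Short n-grams for a short word.
short = char_ngrams("kat", min_n=2, max_n=3)
print(short)  # ['<k', 'ka', 'at', 't>', '<ka', 'kat', 'at>']

# Overlap between a pseudoword and a real word shrinks as n-grams get longer.
overlap_small = set(char_ngrams("pewbin", 2, 3)) & set(char_ngrams("begin", 2, 3))
overlap_large = set(char_ngrams("pewbin", 5, 6)) & set(char_ngrams("begin", 5, 6))
print(len(overlap_small), len(overlap_large))
```

With shorter n-grams a pseudoword shares more units with real words, which is the mechanism the answer above appeals to.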

**4d.** Do you think that training the same models on uni-grams, bi-grams, fastText and transformer-based embeddings but using valence ratings for Finnish (a language which uses the same alphabet as English but is not an Indo-European language) words would yield a similar pattern of results? Justify your answer.

(*4 points available, max 150 words*)

Training models on uni-grams, bi-grams, fastText, and transformer-based embeddings with Finnish valence ratings is unlikely to replicate the Dutch results. Morphologically, Finnish differs vastly from Dutch despite both using the 'English' alphabet. Finnish’s complex, longer words may weaken uni- and bi-gram models, as they struggle to capture multi-morphemic structures, likely yielding lower correlations than Dutch. FastText, using subword embeddings, should perform better, but optimal n-gram sizes may differ, potentially altering its 0.724 correlation. Transformer-based embeddings (e.g., Finnish BERT) could outperform others by modeling contextual nuances, but success depends on robust pretraining. Cultural differences in valence perception may further diverge results. While fastText and transformers may remain strong, the pattern of correlations will likely differ due to linguistic and cultural factors.


**Task 5** (*3 points available*)

Compute the average Levenshtein Distance (aLD) between each pseudoword and the 20 words at the smallest edit distance from it. Consider the set of words you used to filter out pseudowords that happen to be valid Dutch words (the file is available in this OSF repository: https://osf.io/9zymw/) to retrieve the 20 words at the smallest edit distance.

In [None]:
# compute the average Levenshtein distance from each pseudoword to the words used to filter out pseudowords.
# Show the aLD estimate for the pseudowords 'nedukes', 'pewbin', and 'vibcines'
# 3 points for correctly computing aLD for pseudowords
!pip install Levenshtein

Collecting Levenshtein
  Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m161.7/161.7 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m67.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.27.1 rapidfuzz-3.13.0


In [None]:
import numpy as np
import Levenshtein

# Function to compute average Levenshtein distance
def compute_aLD(pseudoword, word_list, top_n=20):
    distances = [(word, Levenshtein.distance(pseudoword, word)) for word in word_list]
    distances.sort(key=lambda x: x[1])  # sort by distance
    top_20 = distances[:top_n]
    avg_ld = np.mean([dist for word, dist in top_20])
    return avg_ld

# aLD for all pseudowords
ald_values = {pw: compute_aLD(pw, twentyfour_thousand) for pw in pseudoword_complete}

for target in ['nedukes', 'pewbin', 'vibcines']:
    ald = ald_values.get(target)
    if ald is not None:
        print(f"Average Levenshtein Distance for '{target}': {ald:.3f}")
    else:
        print(f"'{target}' not found in pseudoword list.")


Average Levenshtein Distance for 'nedukes': 6.800
Average Levenshtein Distance for 'pewbin': 6.500
Average Levenshtein Distance for 'vibcines': 7.500
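
For reference, the aLD computation can be sketched without external packages. This is a minimal illustration on a hypothetical toy lexicon, not the cell above: `levenshtein`, `average_ld`, and the word list are assumptions for the example. It assumes `words` is a plain list of strings; if the lexicon lives in a pandas DataFrame, iterating the frame itself yields column labels, so the word column would need to be passed explicitly.

```python
import heapq

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def average_ld(target, words, top_n=20):
    """Mean edit distance from target to its top_n nearest words."""
    distances = [levenshtein(target, w) for w in words]
    nearest = heapq.nsmallest(top_n, distances)
    return sum(nearest) / len(nearest)

# Hypothetical toy lexicon; the real word list is far larger.
toy_lexicon = ["kat", "kar", "bak", "boom", "huis"]
toy_ald = average_ld("kaz", toy_lexicon, top_n=3)
print(round(toy_ald, 3))  # 1.333
```

`heapq.nsmallest` avoids sorting the full list of distances when only the `top_n` smallest are needed.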


**Task 6** (*3 points available*)

For each pseudoword, record the number of tokens into which RobBERT v2 splits it.

In [None]:
# record the number of tokens in which RobBERT divides each pseudoword
# show the number of tokens for the pseudowords 'yuxwas', 'skibfy', and 'errords'
# 3 points for correctly mapping pseudowords to number of tokens
dictionary_robbert_tokencount = {}
for pseudoword in pseudoword_complete:
  dictionary_robbert_tokencount[pseudoword] = len(tokenizer.tokenize(pseudoword))

print("yuxwas:", dictionary_robbert_tokencount["yuxwas"])
print("skibfy:", dictionary_robbert_tokencount["skibfy"])
print("errords:", dictionary_robbert_tokencount["errords"])

yuxwas: 3
skibfy: 4
errords: 3


**Task 7** (*5 points available, see breakdown below*)

Compute the residuals of the predicted valence under the four regressors trained and applied in tasks 1 to 3. Then, correlate the residuals from all four models with aLD. Finally, correlate the residuals from the RobBERT v2 model with the number of tokens into which each pseudoword is split. Use Pearson's correlation coefficient.

In [None]:
# compute the residuals from all four regression models fitted before
# 1 point available for correctly computing residuals
residuals_embed = true_valence_pseudo_word - embed_model.predict(X_pseudo_embedding)
residuals_robbert = true_valence_pseudo_word - lm.predict(X_test)
residuals_uni = true_valence_pseudo_word - uni_model.predict(X_pseudo_uni)
residuals_bi = true_valence_pseudo_word - bi_model.predict(X_pseudo_bi)

print(residuals_embed.describe(), end="\n\n")
print(residuals_robbert.describe(), end="\n\n")
print(residuals_uni.describe(), end="\n\n")
print(residuals_bi.describe(), end="\n\n")

count    1499.000000
mean        3.185445
std         1.446554
min        -0.417114
25%         1.985112
50%         3.162839
75%         4.412358
max         6.841151
Name: predicted_valence, dtype: float64

count    1499.000000
mean        3.235827
std         1.439069
min        -0.671984
25%         2.110548
50%         3.188955
75%         4.440921
max         6.345326
Name: predicted_valence, dtype: float64

count    1499.000000
mean        3.200384
std         1.444841
min        -0.675607
25%         2.001129
50%         3.200173
75%         4.438290
max         6.139644
Name: predicted_valence, dtype: float64

count    1499.000000
mean        3.148647
std         1.381874
min        -0.697697
25%         2.033229
50%         3.137965
75%         4.321047
max         7.634713
Name: predicted_valence, dtype: float64



In [None]:
print(len(residuals_robbert))
print(len(ald_values.values()))

1499
1499


In [None]:
# Pearson's correlation between residuals and average LD for all models,
# as well as the correlation between RobBERT v2 residuals and the number of tokens in which each pseudoword
#  is encoded by the RobBERT v2 model.
# Finally print all correlation coefficients
# 4 points for the correct correlation coefficients

from scipy.stats import pearsonr
import numpy as np

pseudowords = list(ald_values.keys())  
ald_list = [ald_values[p] for p in pseudowords]
robbert_tokens_list = [dictionary_robbert_tokencount[p] for p in pseudowords]

# Pearson's correlation between residuals and aLD for all models
correlation_embed, _ = pearsonr(residuals_embed, ald_list)
correlation_robbert, _ = pearsonr(residuals_robbert, ald_list)
correlation_uni, _ = pearsonr(residuals_uni, ald_list)
correlation_bi, _ = pearsonr(residuals_bi, ald_list)

# Pearson's correlation between RobBERT residuals and number of tokens
correlation_robbert_tokens, _ = pearsonr(residuals_robbert, robbert_tokens_list)

print(f"Correlation between embed residuals and aLD: {correlation_embed:.4f}")
print(f"Correlation between RobBERT residuals and aLD: {correlation_robbert:.4f}")
print(f"Correlation between uni residuals and aLD: {correlation_uni:.4f}")
print(f"Correlation between bi residuals and aLD: {correlation_bi:.4f}")
print(f"Correlation between RobBERT residuals and number of tokens: {correlation_robbert_tokens:.4f}")




Correlation between embed residuals and aLD: -0.3621
Correlation between RobBERT residuals and aLD: -0.3483
Correlation between uni residuals and aLD: -0.3579
Correlation between bi residuals and aLD: -0.3986
Correlation between RobBERT residuals and number of tokens: -0.2369
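
When reading these coefficients it helps to keep signed and absolute residuals apart. The sketch below uses hypothetical toy values (not the notebook's data) to show that a negative Pearson correlation between signed residuals and a predictor does not by itself say whether the size of the errors grows or shrinks; the dictionaries and the `pearson` helper are assumptions for the example.

```python
def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical ratings, predictions, and aLD values keyed by item.
true_valence = {"aaa": 5.0, "bbb": 4.0, "ccc": 3.0, "ddd": 2.0}
predicted = {"aaa": 4.0, "bbb": 4.0, "ccc": 4.0, "ddd": 4.0}
ald = {"aaa": 1.0, "bbb": 2.0, "ccc": 3.0, "ddd": 4.0}

# Align all three by key so the lists share one item order.
items = sorted(true_valence)
signed = [true_valence[w] - predicted[w] for w in items]  # true minus predicted
absolute = [abs(r) for r in signed]
ald_list = [ald[w] for w in items]

r_signed = pearson(signed, ald_list)
r_abs = pearson(absolute, ald_list)
print(round(r_signed, 3), round(r_abs, 3))
```

In this toy case the signed residuals fall with aLD while the absolute residuals rise, so the two correlations have opposite signs; which case applies to real data depends on where the residuals sit relative to zero.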


**Task 8** What is the relation between the errors each model made and aLD? What about the number of tokens (limited to the RobBERT v2 model)?

(*4 points available, max 150 words*)

The residuals (true minus predicted valence) correlate negatively with aLD in all four models, so residuals decrease as aLD increases, that is, as pseudowords differ more from existing words. Because the residuals are almost all positive (means around 3.2, minima just below zero), the models underpredict valence on average, and lower residuals are mostly closer to zero. Hence the underprediction tends to shrink, rather than grow, for pseudowords with higher aLD.

Among the models, the bigram model shows the strongest negative correlation (-0.3986), suggesting its residuals vary most systematically with aLD.

For the RobBERT v2 model, the negative correlation between residuals and the number of tokens (-0.2369) means that pseudowords split into more tokens tend to have lower residuals, while those encoded into fewer tokens tend to have higher ones. This relation is weaker than the one with aLD.

Overall, both higher aLD and more fragmented tokenization are associated with lower signed residuals, so the models' errors depend systematically on how word-like a pseudoword is.
