In this assignment you will extend the work by Gatti et al. by checking whether form-meaning mappings learned on a different yet related language (Dutch rather than English) still capture the perceived valence of pseudowords. To do this you will work with several different resources and adapt the pipeline following the instructions. Along the way, you will be asked to answer a few questions.

You need to submit the complete notebook in .ipynb format, with intermediate outputs visible. The notebook should be named as follows:

CL2025_groupN_assignment.ipynb

where N is the group number. Submissions in the wrong format or with names not adhering to the guidelines will not be evaluated.

Indicate group members' names, student numbers, and contributions below:
- 1. 
- 2.
- 3.
- 4.
- 5.

In [1]:
# the code has been tested using the psycho-embeddings library to extract representations from LLMs. You can also use other libraries,
# as long as you make sure that you are producing the correct output.
!git clone https://github.com/MilaNLProc/psycho-embeddings.git
%cd psycho-embeddings
!pip install datasets




In [2]:
%pip install nltk
%pip install fasttext
%pip install psycho_embeddings 

# Needed to import Rdata file 
%pip install pyreadr

# Needed to read Brysbaert valence excel sheet
%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.

In [3]:
# the solution to the assignment has been obtained using these packages.
# you're free to use other packages though: consider this as an indication, not a prescription.
import nltk
import numpy as np
import pandas as pd
import fasttext as ft
import pickle as pkl
import fasttext.util
from tqdm import tqdm
from collections import defaultdict
from transformers import AutoTokenizer
from psycho_embeddings import ContextualizedEmbedder




**Task 1** (*10 points available, see breakdown per task below*)

You should replicate the main design in the paper *Valence without meaning* by Gatti and colleagues (2024), using estimates collected for Dutch word valence to train linear regression models and apply them to predict the valence of English pseudowords from Gatti and colleagues.

In detail, to train your regression models, you should use the dataset by Speed and Brysbaert (2024) containing crowd-sourced valence ratings (use the metadata to identify the relevant columns) collected for approximately 24,000 Dutch words. See the paper *Ratings of valence, arousal, happiness, anger, fear, sadness, disgust, and surprise for 24,000 Dutch words* by Speed and Brysbaert (2024).

You should train a letter unigram model and a bigram model. Each model should be trained on Dutch words only.

Pay attention to one issue, though: pseudowords created for English may be valid words in Dutch. You should therefore first filter the list of pseudowords against a large store of Dutch words. To do so, use the words in the Dutch prevalence lexicon available in this OSF repository: https://osf.io/9zymw/. Essentially, you need to exclude any pseudoword that happens to be a word for which a prevalence estimate is available, whatever the prevalence is.

Each code block indicates how many points are available and how they are attributed.

In [4]:
# read in the pseudowords from Gatti and colleagues, 
# as well as the valence ratings for 24,000 Dutch words from Speed and Brysbaert (2024)
# show the first 5 lines of each dataset.
# 1 point for identifying the correct files and correctly loading their content

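import pyreadr
import pandas as pd

# Using pyreadr to import the .Rdata file from Gatti et al.
gatti_result = pyreadr.read_r("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/data/data_pseudovalence.Rdata")
# The file may hold several objects: list them and check that the one selected below
# is the pseudoword table (the preview shows English words such as 'aardvark').
print(gatti_result.keys())
gatti_df = list(gatti_result.values())[0]

print("Gatti et al. pseudoword valence dataset:")
print(gatti_df.head())

# Importing the Speed & Brysbaert (2024) valence ratings for Dutch words
valence_df = pd.read_excel("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/data/BrysbaertValence.xlsx")
print("\nSpeed & Brysbaert Dutch valence dataset:")
print(valence_df.head())

# Importing the Dutch prevalence lexicon, used only to filter the pseudowords.
# The name speed_df is kept because later cells refer to it.
speed_df = pd.read_csv("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/data/prevalence_netherlands.csv", sep="\t")
print("\nDutch prevalence lexicon:")
print(speed_df.head())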





Gatti et al. pseudoword valence dataset:
             Valence  predicted_val  predicted_valL  predicted_valL_BI  \
rownames                                                                 
aardvark        6.26       6.392012        4.920180           6.410768   
abalone         5.30       4.756492        5.284912           5.115389   
abandon         2.84       4.260055        5.001226           5.479860   
abandonment     2.63       4.196807        5.022504           5.334364   
abbey           5.85       6.123953        5.147159           5.162931   

             predicted_valDIM  predicted_valL_DIM  predicted_valBI  \
rownames                                                             
aardvark             5.772722            5.774341         6.410768   
abalone              4.728264            4.858120         5.115389   
abandon              3.978241            3.987623         5.479860   
abandonment          3.833330            3.828077         5.334364   
abbey               

In [5]:
print(speed_df.columns)


Index(['word', 'n.obs', 'irt.prevalence', 'z.irt.prevalence', 'prevalence',
       'z.prevalence'],
      dtype='object')


In [6]:
# filter out pseudowords that happen to be valid Dutch words (mind case folding!)
# show the set of pseudowords filtered out.
# 1 point for applying the correct filtering


# Gatti pseudowords (row names)
gatti_words = gatti_df.index.str.lower()

# Dutch words from the prevalence lexicon (loaded into speed_df)
dutch_words = set(speed_df['word'].str.lower())

# Converting to set & filtering out overlapping words (pseudowords that are valid Dutch words)
filtered_out = sorted(set(gatti_words).intersection(dutch_words))

# Filtering Gatti pseudowords that are in the real Dutch words set
gatti_filtered_df = gatti_df[~gatti_df.index.str.lower().isin(dutch_words)]


print("Pseudowords that were filtered out:")
print(filtered_out)


Pseudowords that were filtered out:
['abandon', 'abdomen', 'abject', 'abracadabra', 'abrupt', 'absence', 'absent', 'abstract', 'absurd', 'abundant', 'accent', 'accept', 'accident', 'account', 'accountant', 'ace', 'acid', 'acne', 'acquit', 'acre', 'act', 'activist', 'actor', 'ad', 'adder', 'addict', 'adept', 'adolescent', 'adrenaline', 'adult', 'advocate', 'aerobics', 'affect', 'affidavit', 'affront', 'aftershave', 'agenda', 'agent', 'aids', 'air', 'airbag', 'airstrip', 'alarm', 'albino', 'album', 'alcohol', 'alert', 'alfalfa', 'algebra', 'alias', 'alibi', 'allegro', 'alligator', 'allure', 'alpine', 'alter', 'altimeter', 'alumnus', 'amaretto', 'amateur', 'amber', 'ambrosia', 'ambulance', 'ammonia', 'ammonium', 'amulet', 'amuse', 'amusement', 'anaconda', 'anagram', 'anarchist', 'angel', 'angina', 'angora', 'angst', 'anklet', 'annex', 'anti', 'antichrist', 'anus', 'aorta', 'apache', 'apex', 'apparent', 'appendage', 'appendicitis', 'appendix', 'appetizer', 'aquarium', 'arcade', 'architect'

### Version 1 (kept commented out; possibly incorrect)

In [7]:
# # encode Dutch words and pseudowords from Gatti et al as uni- and bi-gram vectors
# # show the uni-gram and bi-gram encoding of the pseudoword ampgrair
# # 2 points for correctly encoding the target strings as uni- and bi-gram vectors

# from collections import Counter
# import numpy as np

# # Step 1: Create a list of Dutch words from the valence dataset
# valence_df = pd.read_excel("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/data/BrysbaertValence.xlsx")

# # Show the first few rows
# print(valence_df.head())

# # Show all column names
# print(valence_df.columns)

# # Clean: Keep only rows marked as known words (RemoveUnknown == 1)
# valence_df = valence_df[valence_df["RemoveUnknown"] == 1]

# # Keep only the columns we care about
# valence_df = valence_df[["Word", "Valence"]]

# # Normalize: lowercase the words
# valence_df["Word"] = valence_df["Word"].str.lower()

# # Final check
# print(valence_df.head())

# dutch_words = valence_df['Word'].dropna().astype(str).str.lower().tolist()

# # Step 2: Extract all unigrams and bigrams in the corpus
# def get_ngrams(word: str, n: int):
#     return [word[i:i+n] for i in range(len(word) - n + 1)]

# # Collect all unigrams and bigrams from the corpus
# unigrams = set()
# bigrams = set()
# for word in dutch_words:
#     unigrams.update(get_ngrams(word, 1))
#     bigrams.update(get_ngrams(word, 2))

# # Create sorted vocabularies (to lock feature order)
# unigram_vocab = sorted(unigrams)
# bigram_vocab = sorted(bigrams)

# # Map each n-gram to a vector index
# unigram_index = {gram: i for i, gram in enumerate(unigram_vocab)}
# bigram_index = {gram: i for i, gram in enumerate(bigram_vocab)}

# # Step 3: Define encoding functions
# def encode_word_ngrams(word: str, index_map: dict, n: int) -> np.ndarray:
#     """Encodes a word as a sparse n-gram vector using a given n-gram index map."""
#     vec = np.zeros(len(index_map))
#     ngrams = get_ngrams(word.lower(), n)
#     counts = Counter(ngrams)
#     for gram, count in counts.items():
#         if gram in index_map:
#             vec[index_map[gram]] = count
#     return vec



# # Encode the pseudoword "ampgrair"
# word = "ampgrair"

# uni_vec = encode_word_ngrams(word, unigram_index, 1)
# bi_vec = encode_word_ngrams(word, bigram_index, 2)

# # Preview the non-zero parts
# def print_vector(vec, vocab):
#     found = False
#     for i, v in enumerate(vec):
#         if v != 0:
#             print(f"{vocab[i]}: {int(v)}")
#             found = True
#     if not found:
#         print("(No known n-grams found in vocabulary)")

# print("Unigram vector for 'ampgrair':")
# print_vector(uni_vec, unigram_vocab)

# print("\nBigram vector for 'ampgrair':")
# print_vector(bi_vec, bigram_vocab)


### Version 2 

In [8]:
# # encode Dutch words and pseudowords from Gatti et al as uni- and bi-gram vectors
# # show the uni-gram and bi-gram encoding of the pseudoword ampgrair
# # 2 points for correctly encoding the target strings as uni- and bi-gram vectors


import numpy as np
from collections import Counter

# Function to get character n-grams
def get_ngrams(word: str, n: int):
    return [word[i:i+n] for i in range(len(word) - n + 1)]

valence_df = pd.read_excel("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/data/BrysbaertValence.xlsx")
# Collect all words for vocabulary (Dutch + pseudowords)
dutch_words = valence_df["Word"].astype(str).str.lower().tolist()
pseudowords = gatti_filtered_df.index.str.lower().tolist()
all_words = dutch_words + pseudowords

# Build unigram and bigram vocabularies
unigrams = sorted(set(c for w in all_words for c in get_ngrams(w, 1)))
bigrams = sorted(set(b for w in all_words for b in get_ngrams(w, 2)))

# Index maps
unigram_index = {gram: i for i, gram in enumerate(unigrams)}
bigram_index = {gram: i for i, gram in enumerate(bigrams)}

# N-gram encoding function
def encode_word_ngrams(word: str, index_map: dict, n: int) -> np.ndarray:
    vec = np.zeros(len(index_map))
    ngrams = get_ngrams(word.lower(), n)
    counts = Counter(ngrams)
    for gram, count in counts.items():
        if gram in index_map:
            vec[index_map[gram]] = count
    return vec

# Helper to print non-zero entries
def print_vector(vec, vocab):
    found = False
    for i, v in enumerate(vec):
        if v != 0:
            print(f"{vocab[i]}: {int(v)}")
            found = True
    if not found:
        print("(No known n-grams found in vocabulary)")

# Encoding 'ampgrair'
word = "ampgrair"
print("Unigram vector for 'ampgrair':")
uni_vec = encode_word_ngrams(word, unigram_index, 1)
print_vector(uni_vec, unigrams)

print("\nBigram vector for 'ampgrair':")
bi_vec = encode_word_ngrams(word, bigram_index, 2)
print_vector(bi_vec, bigrams)


Unigram vector for 'ampgrair':
a: 2
g: 1
i: 1
m: 1
p: 1
r: 2

Bigram vector for 'ampgrair':
ai: 1
am: 1
gr: 1
ir: 1
mp: 1
pg: 1
ra: 1




In [28]:
# use word valence estimates from Speed and Brysbaert (2024) to train
# - a uni-gram model
# - a bi-gram model
# 2 points for correctly trained models

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error



valence_df = pd.read_excel("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/data/BrysbaertValence.xlsx")

# Create the uni-gram and bi-gram features for each word using the previously defined encoding function
# Note: adjust the column names if necessary. Here we assume the words are in the column "Word"
# str(w) guards against non-string cells (e.g. a word parsed as a number or a missing value)
valence_df["unigram_features"] = valence_df["Word"].apply(lambda w: encode_word_ngrams(str(w), unigram_index, 1))
valence_df["bigram_features"] = valence_df["Word"].apply(lambda w: encode_word_ngrams(str(w), bigram_index, 2))

# Convert the features into numpy arrays for modeling
X_uni = np.vstack(valence_df["unigram_features"].values)
X_bi = np.vstack(valence_df["bigram_features"].values)
y = valence_df["Valence"].values  # Adjust if the column name is different

# Split the data into training and testing sets (one call, so all arrays share the same row split)
X_uni_train, X_uni_test, X_bi_train, X_bi_test, y_train, y_test = train_test_split(
    X_uni, X_bi, y, test_size=0.2, random_state=42
)

# Train a uni-gram model using Linear Regression
uni_model = LinearRegression()
uni_model.fit(X_uni_train, y_train)

# Train a bi-gram model using Linear Regression
bi_model = LinearRegression()
bi_model.fit(X_bi_train, y_train)

# Evaluate both models on their respective test sets
y_uni_pred = uni_model.predict(X_uni_test)
y_bi_pred = bi_model.predict(X_bi_test)

uni_mse = mean_squared_error(y_test, y_uni_pred)
bi_mse = mean_squared_error(y_test, y_bi_pred)

print("Uni-gram model Mean Squared Error:", uni_mse)
print("Bi-gram model Mean Squared Error:", bi_mse)


Uni-gram model Mean Squared Error: 0.42390078742333853
Bi-gram model Mean Squared Error: 2.7532474695153363e+22
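
An error of the order of 1e22 usually signals a numerical problem rather than a merely weak model. One possible cause, offered here as an assumption and not as a diagnosis of this notebook, is an ill-conditioned design matrix: with thousands of sparse bi-gram columns, some are constant or linearly dependent within the training split. A quick check on a toy matrix, using a hypothetical helper `design_matrix_report`:

```python
import numpy as np


def design_matrix_report(X):
    """Summarise column count, numerical rank, and constant columns of a design matrix."""
    X = np.asarray(X, dtype=float)
    return {
        "n_columns": X.shape[1],
        "rank": int(np.linalg.matrix_rank(X)),
        "constant_columns": int(np.sum(X.std(axis=0) == 0)),
    }


# Toy matrix (assumption): the third column is all zeros, as for an n-gram
# that never occurs in the training words.
toy_X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [2, 0, 0]])
report = design_matrix_report(toy_X)
print(report)
```

If the rank is lower than the number of columns, the ordinary least squares coefficients are not uniquely determined, and predictions on unseen rows can become arbitrarily large.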


In [11]:
# apply trained models to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same models back onto the training set to see how well they predict the valence of words in Speed and Brysbaert (2024).
# 2 points for correctly applied models


# First we encode the pseudowords so they work with the models
pseudowords = gatti_filtered_df.index.str.lower().tolist()
pseudoword_uni_features = np.vstack([encode_word_ngrams(w, unigram_index, 1) for w in pseudowords])
pseudoword_bi_features = np.vstack([encode_word_ngrams(w, bigram_index, 2) for w in pseudowords])

# Then we apply the trained models on the uni- and bi-gram features
pw_valence_uni = uni_model.predict(pseudoword_uni_features)
pw_valence_bi = bi_model.predict(pseudoword_bi_features)

# Applying models to training set -> predict valence of words in Speed and Brysbaert (2024)
speed_valence_uni = uni_model.predict(X_uni_train)
speed_valence_bi = bi_model.predict(X_bi_train)



In [12]:
# compute the Spearman correlation coefficients between true valence and predicted valence under both uni- and bi-gram models for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show both correlation coefficients.
# 2 points for the correct Spearman correlation coefficients (rounded to the third decimal place)

import scipy.stats

spearman_uni_train, p_val_uni_train = scipy.stats.spearmanr(y_train, speed_valence_uni)
spearman_bi_train, p_val_bi_train = scipy.stats.spearmanr(y_train, speed_valence_bi)


true_pseudoword_valence = gatti_filtered_df["Valence"].values
spearman_uni_pw, p_val_uni_pw = scipy.stats.spearmanr(true_pseudoword_valence, pw_valence_uni)
spearman_bi_pw, p_val_bi_pw = scipy.stats.spearmanr(true_pseudoword_valence, pw_valence_bi)


print("Training words - uni-gram Spearman correlation:", round(spearman_uni_train, 3))
print("Training words - bi-gram Spearman correlation:", round(spearman_bi_train, 3))
print("Pseudowords - uni-gram Spearman correlation:", round(spearman_uni_pw, 3))
print("Pseudowords - bi-gram Spearman correlation:", round(spearman_bi_pw, 3))


Training words - uni-gram Spearman correlation: 0.098
Training words - bi-gram Spearman correlation: 0.327
Pseudowords - uni-gram Spearman correlation: 0.049
Pseudowords - bi-gram Spearman correlation: 0.049


**Task 2** (*8 points available, see breakdown below*)

Again following Gatti and colleagues, you should encode the target strings (pseudowords and Dutch words from Speed and Brysbaert) as fastText embeddings, train a multiple regression model on Dutch words and apply it to the pseudowords in Gatti et al. You should finally report the Spearman correlation coefficient between observed and predicted valence for both words and pseudowords.

You should use the pre-trained fastText model for Dutch, available at this page: https://fasttext.cc/docs/en/crawl-vectors.html

Finally, you should answer two questions about the fastText model (see below).

In [29]:
%pip install gensim



In [None]:
# load the fastText model
# 1 point for correctly loading the appropriate fastText model
from gensim.models import KeyedVectors

# Note: the .vec file stores whole-word vectors only, so strings outside its
# vocabulary (such as pseudowords) raise a KeyError when looked up.
ft_dutch_model = KeyedVectors.load_word2vec_format("/Users/bramdewaal/Desktop/Uni/VSC/CL/Group Assignment/cc.nl.300.vec.gz", binary=False)

# Checking if 'huis' is in the model
print('huis' in ft_dutch_model)

# Checking the vector representation for 'huis'
print(ft_dutch_model['huis'])




True
[-0.0213 -0.0391  0.0677  0.0304 -0.048   0.0472 -0.0011 -0.0132  0.0328
 -0.0447 -0.0006  0.0119  0.0267  0.0311  0.0165 -0.0883 -0.0669  0.1147
  0.0342 -0.0505  0.011   0.0914  0.0319  0.0157 -0.0684 -0.045   0.0083
  0.021   0.0024  0.0326  0.0225 -0.0943 -0.0057 -0.0929  0.0377  0.0309
  0.009   0.0455  0.0524 -0.0686  0.0159 -0.2064 -0.0215 -0.0543 -0.0482
  0.0352 -0.0042  0.0118 -0.0643 -0.0157  0.0179  0.0572  0.094   0.0348
 -0.0006 -0.0306  0.0057 -0.0461 -0.0689  0.0268  0.0446  0.0193 -0.145
  0.0084  0.0695 -0.0026  0.0162 -0.1211 -0.0066  0.0668 -0.0199 -0.0007
 -0.0977  0.0244  0.0486 -0.0185 -0.0491  0.0343 -0.0124 -0.0414  0.0128
 -0.0964 -0.0023 -0.0662  0.0128 -0.0283 -0.0403  0.0565  0.0372 -0.0346
  0.0049  0.0493 -0.0534  0.0414 -0.0077  0.0033  0.0162  0.0019 -0.0302
 -0.0832  0.0952  0.0043 -0.04   -0.1043 -0.0862 -0.0148 -0.013   0.037
 -0.1954 -0.0126  0.0244  0.0609  0.0312 -0.0162  0.1114 -0.0281 -0.022
  0.0318  0.0679  0.0072 -0.042   0.1186 -0.0333 

In [None]:
print("Dimensionality of pre-trained Dutch fastText embeddings:", ft_dutch_model.vector_size) # Information also available on fastText documentation website: https://fasttext.cc/docs/en/crawl-vectors.html




Dimensionality of pre-trained Dutch fastText embeddings: 300


**What is the dimensionality of the pre-trained Dutch fastText embeddings? (*1 point for the correct answer*)**

- The dimensionality of the pre-trained Dutch fastText embeddings is 300


**What minimum and maximum n-gram size was specified for training this fastText model? (*1 point for the correct answer*)**

- According to the documentation (https://fasttext.cc/docs/en/crawl-vectors.html), the model was trained with character n-grams of length 5, so the minimum and maximum n-gram sizes are both 5.

In [None]:
# encode Dutch words and pseudowords as fastText embeddings
# show the first 20 values of the embedding of the word 'speelplaats' and of the pseudoword 'danchunk'
# 2 points for correctly encoding words and pseudowords with fastText
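
Pseudowords such as 'danchunk' are unlikely to be in the fastText vocabulary, so they can only be embedded by a model that keeps subword information. The sketch below assumes the binary model `cc.nl.300.bin` from the fastText page has been downloaded locally (the path is a placeholder) and uses a hypothetical helper `embed_strings`. The stub class is not fastText: it only stands in for the real model so the helper can be exercised without the multi-gigabyte file.

```python
import numpy as np


def embed_strings(model, strings):
    """Stack one vector per string; `model` must expose get_word_vector(str)."""
    return np.vstack([model.get_word_vector(s.lower()) for s in strings])


class _StubModel:
    """Stand-in for a fastText model (assumption, for demonstration only)."""

    def get_word_vector(self, s):
        # Returns a 4-dimensional vector filled with the string length.
        return np.full(4, float(len(s)))


demo = embed_strings(_StubModel(), ["speelplaats", "danchunk"])
print(demo.shape)

# With the real model (assumed local path, not executed here):
# import fasttext
# ft_model = fasttext.load_model("cc.nl.300.bin")
# vectors = embed_strings(ft_model, ["speelplaats", "danchunk"])
# print(vectors[:, :20])  # first 20 values of each embedding
```

Unlike a lookup in the `.vec` file, `get_word_vector` composes a vector from character n-grams when the string is out of vocabulary.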



In [15]:
# train regression model on word valence
# 1 point for correctly training the regression model

In [16]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

In [17]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient.
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)
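
The train, apply, and correlate steps are the same for every featurization, so they can be wrapped in one function. This is a sketch under stated assumptions: `fit_and_score` is a hypothetical helper, and the synthetic, noise-free data below only demonstrates the call pattern, not real valence ratings.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression


def fit_and_score(X_words, y_words, X_pseudo, y_pseudo):
    """Fit on words, then return the model and Spearman's rho for words and pseudowords."""
    model = LinearRegression().fit(X_words, y_words)
    rho_words = spearmanr(y_words, model.predict(X_words))[0]
    rho_pseudo = spearmanr(y_pseudo, model.predict(X_pseudo))[0]
    return model, round(float(rho_words), 3), round(float(rho_pseudo), 3)


# Synthetic stand-ins for embeddings and ratings (assumption).
rng = np.random.default_rng(0)
true_weights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_words_demo = rng.normal(size=(200, 5))
y_words_demo = X_words_demo @ true_weights
X_pseudo_demo = rng.normal(size=(50, 5))
y_pseudo_demo = X_pseudo_demo @ true_weights

model_demo, rho_words_demo, rho_pseudo_demo = fit_and_score(
    X_words_demo, y_words_demo, X_pseudo_demo, y_pseudo_demo
)
print(rho_words_demo, rho_pseudo_demo)
```

Passing the fastText matrices in place of the synthetic arrays would yield the two coefficients requested above, already rounded to three decimals.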

**Task 3** (*6 points available, see breakdown below*)

Now you are asked to extend the work by Gatti et al. by also considering the representations learned by a transformer-based model, namely *RobBERT v2* (https://huggingface.co/pdelobelle/robbert-v2-dutch-base). You should follow the same pipeline as for the previous models, encoding both Dutch words from Speed and Brysbaert (2024) and the pseudowords from Gatti et al. using the embedding of each string at layer 0, before positional information is factored in. If a string consists of multiple tokens, average the embeddings of all tokens to produce the embedding of the whole string. Then train a multiple regression model on the valence of Dutch words, apply it to the pseudowords, and compute the Spearman correlation between observed and predicted ratings.

Use the HuggingFace model card for RobBERT v2 to check how to access it.
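
Averaging the token embeddings of a padded batch requires ignoring the padding positions. The sketch below shows that masked mean in NumPy with a hypothetical helper `masked_mean`. The commented lines indicate one way the layer-0 token embeddings might be obtained with the `transformers` library; whether to exclude special tokens is an assumption to check against the assignment.

```python
import numpy as np


def masked_mean(token_embeddings, mask):
    """Average token vectors per string, ignoring positions where mask == 0."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)  # (batch, tokens, dim)
    mask = np.asarray(mask, dtype=float)[..., None]               # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                 # avoid division by zero
    return summed / counts


# Toy batch (assumption): two strings, up to three tokens, two dimensions.
toy_embeddings = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]],
                           [[2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]])
toy_mask = [[1, 1, 0],
            [1, 0, 0]]
pooled = masked_mean(toy_embeddings, toy_mask)
print(pooled)

# Possible use with the real model (not executed here):
# from transformers import AutoTokenizer, AutoModel
# tok = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
# mdl = AutoModel.from_pretrained("pdelobelle/robbert-v2-dutch-base")
# enc = tok(batch, add_special_tokens=False, padding=True, return_tensors="pt")
# static = mdl.get_input_embeddings()(enc["input_ids"])  # token embeddings, no positions
# pooled = masked_mean(static.detach().numpy(), enc["attention_mask"].numpy())
```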

I recommend saving the embeddings to file once you have generated them and you know they are correct: embedding thousands of strings takes some time, and you don't want to have to do it again. For the same reason, develop your code on only a small fraction of the words and pseudowords, so that you can quickly see if something is wrong. Only when you are confident that it works should you embed all strings.
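
One way to follow this advice is a small load-or-compute wrapper around the embedding step. The helper name `load_or_compute`, the file name, and the dummy embedding function are all illustrative assumptions.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the pickled object at `path` if it exists, else compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def slow_embedding_step():
    """Dummy stand-in for the expensive embedding call (assumption)."""
    calls.append(1)
    return {"miauwen": [0.1, 0.2]}


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings_demo.pkl")
first = load_or_compute(cache_path, slow_embedding_step)   # computes and saves
second = load_or_compute(cache_path, slow_embedding_step)  # loads from disk
print(len(calls))
```

The second call reads the file instead of recomputing, which is the behaviour wanted when re-running the notebook.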

In [18]:
# load and instantiate the right model
# 1 point for loading the right model

In [19]:
# encode the words and pseudowords using RobBERT v2. I've used the free GPU runtime on COLAB to speed things up,
# but in this case you need to batch the words and pseudowords. You can use the function below to create batches
# but you will have to pay attention to how you store embeddings.
# show the first 20 values of the embedding of the word 'miauwen' and of the pseudoword 'lixthless'
# 2 points for correctly encoding words and pseudowords

def chunks(lst, n):
    """Split a list into consecutive chunks of n elements (the last chunk may be shorter). Returns a list of lists."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]

In [20]:
# train regression model on word valence estimates from Speed and Brysbaert (2024)
# 1 point for correctly training the regression model

In [21]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

In [22]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

**Task 4** (*16 points available, 4 for each question*)

Answer the following questions.

**4a.** Describe the performance of each featurization, comparing
- the performance of the same model between the training and test set
- the performance of different models on the training set
- the performance of different models on the test set

(*4 points available, max 150 words*)

*type your answer here*

**4b.** Compare the correlations you found when training uni-gram, bi-gram, and fastText models on Dutch words and the correlations of similar models trained on English data as reported by Gatti and colleagues; summarize the most important similarities and differences.

(*4 points available, max 150 words*)

*type your answer here*

**4c.** Do you think the performance of the fastText featurization would change if you were to use different n-grams? Would you make them smaller or larger? Justify your answer.

(*4 points available, max 150 words*)

*type your answer here*

**4d.** Do you think that training the same models on uni-grams, bi-grams, fastText and transformer-based embeddings but using valence ratings for Finnish words (Finnish uses the same alphabet as English but is not an Indo-European language) would yield a similar pattern of results? Justify your answer.

(*4 points available, max 150 words*)

*type your answer here*

**Task 5** (*3 points available*)

Compute the average Levenshtein Distance (aLD) between each pseudoword and the 20 words at the smallest edit distance from it. Consider the set of words you used to filter out pseudowords that happen to be valid Dutch words (the file is available in this OSF repository: https://osf.io/9zymw/) to retrieve the 20 words at the smallest edit distance.

In [23]:
# compute the average Levenshtein distance from each pseudoword to the words used to filter out pseudowords.
# Show the aLD estimate for the pseudowords 'nedukes', 'pewbin', and 'vibcines'
# 3 points for correctly computing aLD for pseudowords
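
The aLD computation can be sketched with a standard dynamic-programming edit distance and a partial sort over the lexicon. The helper names `levenshtein` and `average_ld` and the toy lexicon are assumptions for illustration; on the full prevalence lexicon a pure-Python loop like this is likely to be slow.

```python
import heapq


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def average_ld(target, lexicon, k=20):
    """Mean edit distance from `target` to its k closest entries in `lexicon`."""
    distances = (levenshtein(target, word) for word in lexicon)
    nearest = heapq.nsmallest(k, distances)
    return sum(nearest) / len(nearest)


# Toy lexicon (assumption); the assignment uses the Dutch prevalence lexicon and k=20.
toy_lexicon = ["cat", "bat", "cart", "dog"]
toy_ald = average_ld("cat", toy_lexicon, k=2)
print(toy_ald)
```

In the toy example the target itself is in the lexicon, which contributes a distance of 0. That should not happen for the filtered pseudowords, since any pseudoword found in the lexicon has already been removed.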

**Task 6** (*3 points available*)

For each pseudoword, record the number of tokens in which RobBERT v2 encodes it.

In [24]:
# record the number of tokens in which RobBERT divides each pseudoword
# show the number of tokens for the pseudowords 'yuxwas', 'skibfy', and 'errords'
# 3 points for correctly mapping pseudowords to number of tokens
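
Counting tokens only needs the tokenizer, not the model. The sketch below uses a hypothetical helper `count_tokens` that works with any object exposing a `tokenize` method. The stub tokenizer is not RobBERT's: it simply cuts strings into three-character pieces, so its counts are placeholders and not the expected answers.

```python
def count_tokens(tokenizer, strings):
    """Map each string to the number of tokens the tokenizer splits it into."""
    return {s: len(tokenizer.tokenize(s)) for s in strings}


class _StubTokenizer:
    """Stand-in tokenizer (assumption): splits text into 3-character pieces."""

    def tokenize(self, text):
        return [text[i:i + 3] for i in range(0, len(text), 3)]


demo_counts = count_tokens(_StubTokenizer(), ["yuxwas", "skibfy", "errords"])
print(demo_counts)

# With the real tokenizer (not executed here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
# print(count_tokens(tok, ["yuxwas", "skibfy", "errords"]))
```

`tokenize` does not add special tokens, so the count reflects only the sub-word pieces of the string. Byte-level BPE tokenizers can split a string differently depending on whether it is preceded by a space, so it is worth tokenizing the pseudowords the same way here as in the embedding step.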

**Task 7** (*5 points available, see breakdown below*)

Compute the residuals of the predicted valence under the four regressors trained and applied in Tasks 1 to 3 (uni-gram, bi-gram, fastText, and RobBERT v2). Then, correlate the residuals from all four models with aLD. Finally, correlate the residuals from the RobBERT v2 model with the number of tokens into which each pseudoword is split. Use Pearson's correlation coefficient.

In [25]:
# compute the residuals from all four regression models fitted before
# 1 point available for correctly computing residuals

In [26]:
# compute the Pearson's correlation between residuals and average LD for all models,
# as well as the correlation between RobBERT v2 residuals and the number of tokens in which each pseudoword
#    is encoded by the RobBERT v2 model.
# show all correlation coefficients
# 4 points for the correct correlation coefficients
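
Residuals and their correlation with a predictor can be sketched as follows. The sign convention (observed minus predicted) and the toy numbers are assumptions; if the assignment intends absolute errors, `np.abs` would need to be applied to the residuals first.

```python
import numpy as np
from scipy.stats import pearsonr


def residuals(observed, predicted):
    """Signed residuals: observed rating minus predicted rating."""
    return np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)


# Toy values standing in for pseudoword ratings, predictions, and aLD (assumption).
toy_observed = np.array([5.0, 4.0, 6.0, 3.0])
toy_predicted = np.array([4.5, 4.5, 5.0, 4.0])
toy_ald_values = np.array([1.0, 2.0, 3.0, 4.0])

toy_residuals = residuals(toy_observed, toy_predicted)
r_value, p_value = pearsonr(toy_residuals, toy_ald_values)
print(round(float(r_value), 3))
```

The same two lines can be repeated for each of the four models, and once more with the token counts in place of aLD for the RobBERT v2 residuals.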

**Task 8** What is the relation between the errors each model made and aLD? What about the number of tokens (limited to the RobBERT v2 model)?

(*4 points available, max 150 words*)

*type your answer here*