In this assignment you will extend the work of Gatti et al. by checking whether form-meaning mappings learned on a language that is different from, yet related to, the one considered in the original study still capture the perceived valence of pseudowords. To do this, you will engage with several resources and adapt the pipeline following the instructions. Along the way, you will be asked to answer a few questions.

You need to submit the complete notebook in .ipynb format, with intermediate outputs visible. The notebook should be named as follows:

CL2025_groupN_assignment.ipynb

where N is the group number. Submissions in the wrong format or with names not adhering to the guidelines will not be evaluated.

Indicate group members' names, student numbers, and contributions below:
- 1.
- 2.
- 3.
- 4.
- 5.

I suggest that we use "##" for comments instead, to distinguish between our and the original comments, but idk if there is any standard for this -Frey

To do:
- get everyone setup and comfortable with git(hub)
- check if frey found the correct files

In [None]:
## This block has been modified to work for local usage 

# the code has been tested using the psycho-embeddings library to extract representations from LLMs. You can also use other libraries,
# as long as you make sure that you are producing the correct output.

## SETUP:
## Make sure you have Git and Pip installed then run the following commands in the command line (make sure you are an administrator of course)

## git clone https://github.com/MilaNLProc/psycho-embeddings.git
## pip install datasets
## pip install fasttext
## pip install nltk
%cd psycho-embeddings

[WinError 2] The system cannot find the file specified: 'psycho-embeddings'
c:\Users\etrus\Documents\CSAI\year3\CL\CL_group8


In [9]:
# the solution to the assignment has been obtained using these packages.
# you're free to use other packages though: consider this as an indication, not a prescription.
import nltk
import numpy as np
import pandas as pd
import fasttext as ft
import pickle as pkl
import fasttext.util
from tqdm import tqdm
from collections import defaultdict
from transformers import AutoTokenizer
from psycho_embeddings import ContextualizedEmbedder

## psycho_embeddings raises warnings about tensorflow versions, but these are the same as in the lecture 5 notebook
## so I assume they can be ignored

ModuleNotFoundError: No module named 'fasttext'

**Task 1** (*10 points available, see breakdown per task below*)

You should replicate the main design in the paper *Valence without meaning* by Gatti and colleagues (2024), using estimates collected for Dutch word valence to train linear regression models and apply them to predict the valence of English pseudowords from Gatti and colleagues.

In detail, to train your regression models, you should use the dataset by Speed and Brysbaert (2024) containing crowd-sourced valence ratings (use the metadata to identify the relevant columns) collected for approximately 24,000 Dutch words. See the paper *Ratings of valence, arousal, happiness, anger, fear, sadness, disgust, and surprise for 24,000 Dutch words* by Speed and Brysbaert (2024).

You should train a letter unigram model and a bigram model. Each model should be trained on Dutch words only.

Pay attention to one issue, though: pseudowords created for English may be valid words in Dutch. You should therefore first filter the list of pseudowords against a large store of Dutch words. To do so, use the words in the Dutch prevalence lexicon available in this OSF repository: https://osf.io/9zymw/. Essentially, you need to exclude any pseudoword that matches a word for which a prevalence estimate is available, whatever its prevalence.
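The filtering step can be sketched on toy data as follows. Both lists are made up for illustration and are not taken from the real files; the point is only that each side is case-folded before the membership test, so that capitalisation differences do not hide a match.

```python
# Toy data: both lists are invented for illustration, not read from the real files.
pseudowords = ["abhert", "Pimpen", "zilk", "acoed"]
dutch_lexicon = ["pimpen", "fiets", "Zilk"]

# Case-fold the lexicon once so membership tests ignore capitalisation.
lexicon_folded = {word.casefold() for word in dutch_lexicon}

# A pseudoword is dropped if its case-folded form appears in the lexicon at all,
# regardless of how prevalent the matching word is.
removed = [p for p in pseudowords if p.casefold() in lexicon_folded]
kept = [p for p in pseudowords if p.casefold() not in lexicon_folded]

print("removed:", removed)
print("kept:", kept)
```

A set is used for the lexicon because the real prevalence lists are large and set lookups stay fast regardless of their size.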

Each code block indicates how many points are available and how they are attributed.

In [4]:
# read in the pseudowords from Gatti and colleagues, as well as the valence ratings for 24,000 Dutch words from Speed and Brysbaert (2024)
# show the first 5 lines of each dataset.
# 1 point for identifying the correct files and correctly loading their content

In [11]:
## getting the pseudowords from Gatti -Frey

## I am using a pandas dataframe (df)
## pseudowords_Gatti is called comb_2 in the Rdata file

df_Gatti = pd.read_csv("pseudowords_Gatti.csv")
del df_Gatti['Unnamed: 0']
df_Gatti.head(5)

,Word,Value1,Value2
0,abhert,0.473009,0.406491
1,abhict,0.375453,0.472723
2,acleat,0.58384,0.496628
3,acmure,0.607354,0.597101
4,acoed,0.526847,0.551518


In [12]:
## Getting the word valence from Speed -Frey

## I have uploaded the file as wordvalence_Speed.csv after I have saved "All_Valence" as a csv
df_Speed = pd.read_csv("wordvalence_Speed.csv")
df_Speed.head(5)

Unnamed: 0,List,Participant,Word,Valence,Unknown,RemoveParticipant
0,Lijst 5,Lijst 5_PP1,aai,5.0,0,0
1,Lijst 5,Lijst 5_PP11,aai,3.0,0,0
2,Lijst 5,Lijst 5_PP12,aai,3.0,0,0
3,Lijst 5,Lijst 5_PP2,aai,3.0,0,0
4,Lijst 5,Lijst 5_PP3,aai,4.0,0,0
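
The rows shown above are individual participant ratings, so a regression target of one valence value per word still has to be derived from them. A minimal sketch of that aggregation, assuming that a 1 in `Unknown` or `RemoveParticipant` marks a rating to discard (this reading of the flag columns is an assumption and should be checked against the dataset's metadata); the rows are made up:

```python
from collections import defaultdict

# Hypothetical rows mimicking the columns shown above; the values are invented.
rows = [
    {"Word": "aai", "Valence": 5.0, "Unknown": 0, "RemoveParticipant": 0},
    {"Word": "aai", "Valence": 3.0, "Unknown": 0, "RemoveParticipant": 0},
    {"Word": "aai", "Valence": 1.0, "Unknown": 0, "RemoveParticipant": 1},
    {"Word": "bom", "Valence": 1.0, "Unknown": 0, "RemoveParticipant": 0},
    {"Word": "bom", "Valence": None, "Unknown": 1, "RemoveParticipant": 0},
]

def mean_valence_per_word(rows):
    """Average the retained ratings of each word (exclusion rule is an assumption)."""
    ratings = defaultdict(list)
    for row in rows:
        # Assumption: a 1 in either flag column marks a rating to be discarded.
        if row["Unknown"] or row["RemoveParticipant"] or row["Valence"] is None:
            continue
        ratings[row["Word"]].append(row["Valence"])
    return {word: sum(values) / len(values) for word, values in ratings.items()}

word_valence = mean_valence_per_word(rows)
print(word_valence)
```

On the actual DataFrame the same idea would be a boolean filter followed by `groupby("Word")["Valence"].mean()`.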


In [7]:
# filter out pseudowords that happen to be valid Dutch words (mind case folding!)
# show the set of pseudowords filtered out.
# 1 point for applying the correct filtering

In [14]:
## Combining the file for the prevalence of Dutch words and Belgian words -Frey
df_prevalence_Netherlands = pd.read_csv("prevalence_netherlands.csv" , sep = "\t")
df_prevalence_Belgium = pd.read_csv("prevalence_belgium.csv", sep= "\t")

df_prevalence_combined = pd.concat([df_prevalence_Belgium, df_prevalence_Netherlands], join= "outer") #outer: takes union

## Checking if they all have a prevalence value (they do)
## (comparing with != None is always True element-wise, so missing values are tested with notna())
df_prevalence_combined_filtered = df_prevalence_combined[df_prevalence_combined["prevalence"].notna()]

## Saving only the column with words as a Series (lower-cased, so the comparison ignores case)
df_Dutch_words = df_prevalence_combined_filtered.word.str.lower()
df_Dutch_words = df_Dutch_words.drop_duplicates()

## Saving only the column with pseudowords from pseudowords_Gatti as a Series
df_pseudowords = df_Gatti.Word

## Apply filtering here: compare lower-cased pseudowords against the lower-cased Dutch words
is_dutch_word = df_pseudowords.str.lower().isin(df_Dutch_words)
print(df_pseudowords[is_dutch_word])
df_pseudowords = df_pseudowords[~is_dutch_word]

900    pimpen
Name: Word, dtype: object


In [9]:
# encode Dutch words and pseudowords from Gatti et al as uni- and bi-gram vectors
# show the uni-gram and bi-gram encoding of the pseudoword ampgrair
# 2 points for correctly encoding the target strings as uni- and bi-gram vectors

In [17]:
## this is simplified code from the notebook of class one. Permission has been given to use it with the correct reference.

def ngram_featurizer(s, n):

    string_boundary = ["letter"]*(n-1)                        # necessary to encode features such as 'this string begins/ends with this specific symbol'
    s = string_boundary + list(s) + string_boundary

    return [tuple(s[i:i+n]) for i in range(len(s)-n+1)]       # this is where the n-gram featurization actually happens.


In [None]:
## this is code from the notebook of class one. Permission has been given to use it with the correct reference.

# This function encodes all strings in a corpus as frequency counts over n-grams
def encode_corpus(corpus, n, mapping=None):

    """
    Takes in
      - a list of strings,
      - an integer indicating the n-grams size,
      - a dictionary mapping ngrams to numerical indices. If no dictionary is
          passed, one is created inside the function.
    The function outputs a 2d NumPy array with as many rows as there are strings in
    the input list, and the mapping from ngrams to indices, representing the columns
    of the NumPy array.
    """

    if not mapping:
        all_ngrams = set()
        for instance in corpus:
            # get a comprehensive set of all n-grams in the corpus
            all_ngrams = all_ngrams.union(
                set(ngram_featurizer(instance, n))
                )

        # map each n-gram to an integer which will index the feature matrix
        mapping = {ngram: i for i, ngram in enumerate(sorted(all_ngrams))}

    # create a feature matrix of the appropriate dimensionality
    X = np.zeros((len(corpus), len(mapping)))
    for i, instance in enumerate(corpus):
        for ngram in ngram_featurizer(instance, n):
            try:
                # access the right column given the n-gram being processed
                X[i, mapping[ngram]] += 1
            except KeyError:
                # if the current n-gram is new, skip it
                pass

    return X, mapping

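## The n-gram-to-column mapping is fitted on the Dutch training words and then reused for the
## pseudowords, so both feature matrices share the same columns (unseen n-grams are skipped).
dutch_words = df_Speed.Word.dropna().astype(str).unique().tolist()
X_words_unigram, mapping_unigram = encode_corpus(dutch_words, 1)
X_words_bigram, mapping_bigram = encode_corpus(dutch_words, 2)
feature_matrix_unigram, _ = encode_corpus(list(df_pseudowords), 1, mapping=mapping_unigram)
feature_matrix_bigram, _ = encode_corpus(list(df_pseudowords), 2, mapping=mapping_bigram)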

In [12]:
## WORK IN PROGRESS vv

In [34]:
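## mapping all the items to their vector here
## encode_corpus expects a list of strings, so each pseudoword is passed as a one-item list
## (mapping the function over the Series directly hands it bare strings, which it reads character by character)

from functools import partial

encode_corpus_unigram = partial(encode_corpus, n=1, mapping=mapping_unigram)
pseudoword_vectors_unigram = {word: encode_corpus_unigram([word])[0][0] for word in df_pseudowords}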

print(df_pseudowords)

0       abhert
1       abhict
2       acleat
3       acmure
4        acoed
         ...  
1495     zauze
1496     zerow
1497      zilk
1498    zohels
1499    zokils
Name: Word, Length: 1499, dtype: object


In [14]:
## testing that ampgrair can be encoded

# feature_matrix_ampgrair_unigram, _ = encode_corpus(['ampgrair'], 1, mapping=mapping_unigram)
# print(feature_matrix_ampgrair_unigram)

# print("")
# feature_matrix_ampgrair_bigram, _ = encode_corpus(['ampgrair'], 2, mapping=mapping_bigram)
# print(feature_matrix_ampgrair_bigram)

In [15]:
# use word valence estimates from Speed and Brysbaert (2024) to train
# - a uni-gram model
# - a bi-gram model
# 2 points for correctly trained models
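
One way to fit such a model is ordinary least squares on the n-gram count matrix. The sketch below uses `np.linalg.lstsq` with an explicit intercept column; the helper names `fit_linear_regression` and `predict` are hypothetical, the toy data are invented, and scikit-learn's `LinearRegression` would be an equally valid choice.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept, solved via np.linalg.lstsq."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient acts as the intercept.
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])
    coefficients, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coefficients

def predict(coefficients, X):
    """Apply fitted coefficients (intercept first) to a feature matrix."""
    X = np.asarray(X, dtype=float)
    return coefficients[0] + X @ coefficients[1:]

# Toy feature matrix (rows = strings, columns = n-gram counts) and toy ratings,
# generated as 1 + 2*x1 + 1*x2 so the fit can be checked by eye.
X_train = [[1, 0], [0, 1], [1, 1], [2, 0]]
y_train = [3.0, 2.0, 4.0, 5.0]

coefficients = fit_linear_regression(X_train, y_train)
y_pred_train = predict(coefficients, X_train)   # predictions on the training set
y_pred_new = predict(coefficients, [[0, 2]])    # predictions on unseen items
print(coefficients, y_pred_new)
```

The same fitted coefficients are applied to both the training words and the pseudowords, which is why the two feature matrices need identical columns.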

In [16]:
# apply trained models to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same models back onto the training set to see how well they predict the valence of words in Speed and Brysbaert (2024).
# 2 points for correctly applied models

In [17]:
# compute the Spearman correlation coefficients between true valence and predicted valence under both uni- and bi-gram models for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show both correlation coefficients.
# 2 points for the correct Spearman correlation coefficients (rounded to the third decimal place)
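
For reference, Spearman's coefficient is the Pearson correlation computed on ranks, with tied values sharing the mean of their ranks. The pure-Python sketch below makes that explicit on invented numbers; in the notebook itself `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.1, 0.4, 0.3, 0.9, 2.5]
rho = spearman(observed, predicted)
print(round(rho, 3))
```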

**Task 2** (*8 points available, see breakdown below*)

Again following Gatti and colleagues, you should encode the target strings (pseudowords and Dutch words from Speed and Brysbaert) as fastText embeddings, train a multiple regression model on Dutch words and apply it to the pseudowords in Gatti et al. You should finally report the Spearman correlation coefficient between observed and predicted valence for both words and pseudowords.

You should use the pre-trained fastText model for Dutch, available at this page: https://fasttext.cc/docs/en/crawl-vectors.html

Finally, you should answer two questions about the fastText model (see below).

In [18]:
# load the fastText model
# 1 point for correctly loading the appropriate fastText model

What is the dimensionality of the pre-trained Dutch fastText embeddings? (*1 point for the correct answer*)

What minimum and maximum n-gram size was specified for training this fastText model? (*1 point for the correct answer*)
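
As background for this question: fastText represents a word through its character n-grams, extracted after wrapping the word in `<` and `>` boundary markers. The sketch below is a simplified illustration of that extraction only (the real implementation additionally hashes n-grams into a fixed number of buckets). The values of `min_n` and `max_n` are placeholders, not the settings of the pre-trained model.

```python
def char_ngrams(word, min_n, max_n):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# min_n and max_n are illustrative placeholders, not the pre-trained model's settings.
example_ngrams = char_ngrams("zilk", 3, 4)
print(example_ngrams)
```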

In [19]:
# encode Dutch words and pseudowords as fastText embeddings
# show the first 20 values of the embedding of the word 'speelplaats' and of the pseudoword 'danchunk'
# 2 points for correctly encoding words and pseudowords with fastText
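
A hedged sketch of the encoding step follows. `embed_strings` is a hypothetical helper that only relies on the model exposing `get_word_vector`, which the fastText Python bindings provide; `ToyModel` is a stand-in so the example runs without the library or the model file, and the path in the comment is a placeholder.

```python
def embed_strings(model, strings):
    """Map each string to its vector using the model's get_word_vector method."""
    return {s: list(model.get_word_vector(s)) for s in strings}

class ToyModel:
    """Stand-in for a loaded fastText model, used only to make this sketch runnable."""
    def get_word_vector(self, word):
        # Fake 3-dimensional "embedding": length, vowel count, consonant count.
        vowels = sum(ch in "aeiou" for ch in word)
        return [float(len(word)), float(vowels), float(len(word) - vowels)]

# With the real library this would be something like:
#   import fasttext
#   model = fasttext.load_model("path/to/dutch_model.bin")  # placeholder path
model = ToyModel()
embeddings = embed_strings(model, ["speelplaats", "danchunk"])
print(embeddings["danchunk"][:20])
```

Because a fastText `.bin` model composes vectors from subword n-grams, it can return an embedding for strings it never saw in training, which is what makes it usable for pseudowords.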

In [20]:
# train regression model on word valence
# 1 point for correctly training the regression model

In [21]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

In [22]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient.
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

**Task 3** (*6 points available, see breakdown below*)

Now you are asked to extend the work by Gatti et al. by also considering the representations learned by a transformer-based model, namely *RobBERT v2* (https://huggingface.co/pdelobelle/robbert-v2-dutch-base). You should follow the same pipeline as for the previous models, encoding both Dutch words from Speed and Brysbaert (2024) and the pseudowords from Gatti et al. using the embedding of each string at layer 0, before positional information is factored in. If a string consists of multiple tokens, average the embeddings of all tokens to produce the embedding of the whole string. Then train a multiple regression model on the valence of Dutch words, apply it to the pseudowords, and compute the Spearman correlation between observed and predicted ratings.

Use the HuggingFace model card for RobBERT v2 to check how to access it.

I recommend saving the embeddings to file once you have generated them and you know they are correct: embedding thousands of strings takes some time, and you don't want to have to do it again. For the same reason, develop your code on only a small fraction of the words and pseudowords, so that you can quickly see if something is wrong. Only when you are positive it works, embed all strings.
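
The two ideas above (averaging token embeddings and caching the result) can be sketched with the standard library alone. The token vectors are invented, `mean_pool` is a hypothetical helper, and the file name is a placeholder. Whether special tokens such as `<s>` and `</s>` are left out of the average is a design decision this sketch does not settle.

```python
import os
import pickle
import tempfile

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one string-level vector."""
    n_tokens = len(token_vectors)
    return [sum(dims) / n_tokens for dims in zip(*token_vectors)]

# Made-up 3-dimensional token embeddings for a string split into two tokens.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
string_vector = mean_pool(token_vectors)

# Cache the result so the slow embedding step does not have to be repeated.
# A temporary directory is used here only to keep the example self-cleaning.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "embeddings_demo.pkl")  # placeholder file name
    with open(cache_path, "wb") as handle:
        pickle.dump({"example": string_vector}, handle)
    with open(cache_path, "rb") as handle:
        reloaded = pickle.load(handle)

print(string_vector, reloaded)
```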

In [23]:
# load and instantiate the right model
# 1 point for loading the right model

In [24]:
# encode the words and pseudowords using RobBERT v2. I've used the free GPU runtime on COLAB to speed things up,
# but in this case you need to batch the words and pseudowords. You can use the function below to create batches
# but you will have to pay attention to how you store embeddings.
# show the first 20 values of the embedding of the word 'miauwen' and of the pseudoword 'lixthless'
# 2 points for correctly encoding words and pseudowords

def chunks(lst, n):

    """Chunks a list into consecutive chunks of n elements (the last chunk may be shorter). Returns a list of lists."""

    chunked = []
    for i in range(0, len(lst), n):
        chunked.append(lst[i:i + n])
    return chunked
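
A minimal sketch of how batches and stored embeddings can be kept aligned: each batch's vectors are paired with the strings they came from and written into a dictionary keyed by string. `embed_batch` is a hypothetical stand-in for the real model call, and `chunks` is restated compactly only so the example is self-contained.

```python
def chunks(lst, n):
    """Same behaviour as the helper above: consecutive slices of at most n items."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]

def embed_batch(batch):
    """Hypothetical stand-in for the real model call: one vector per input string."""
    return [[float(len(s))] for s in batch]

strings = ["miauwen", "lixthless", "aai", "zilk", "abhert"]
embeddings = {}
for batch in chunks(strings, 2):
    vectors = embed_batch(batch)
    # zip keeps each vector paired with the string it was computed from
    for s, vector in zip(batch, vectors):
        embeddings[s] = vector

print(embeddings)
```

Keying by string rather than by position avoids off-by-one mistakes when the last batch is shorter than the others.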


In [25]:
# train regression model on word valence estimates from Speed and Brysbaert (2024)
# 1 point for correctly training the regression model

In [26]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

In [27]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

**Task 4** (*16 points available, 4 for each question*)

Answer the following questions.

**4a.** Describe the performance of each featurization, comparing
- the performance of the same model on the training and test sets
- the performance of different models on the training set
- the performance of different models on the test set

(*4 points available, max 150 words*)

*type your answer here*

**4b.** Compare the correlations you found when training uni-gram, bi-gram, and fastText models on Dutch words and the correlations of similar models trained on English data as reported by Gatti and colleagues; summarize the most important similarities and differences.

(*4 points available, max 150 words*)

*type your answer here*

**4c.** Do you think the performance of the fastText featurization would change if you were to use different n-grams? Would you make them smaller or larger? Justify your answer.

(*4 points available, max 150 words*)

*type your answer here*

**4d.** Do you think that training the same models on uni-grams, bi-grams, fastText and transformer-based embeddings, but using valence ratings for Finnish words (Finnish uses the same alphabet as English but is not an Indo-European language), would yield a similar pattern of results? Justify your answer.

(*4 points available, max 150 words*)

*type your answer here*

**Task 5** (*3 points available*)

Compute the average Levenshtein Distance (aLD) between each pseudoword and the 20 words at the smallest edit distance from it. Consider the set of words you used to filter out pseudowords that happen to be valid Dutch words (the file is available in this OSF repository: https://osf.io/9zymw/) to retrieve the 20 words at the smallest edit distance.
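
A minimal sketch of the computation, assuming unit costs for insertions, deletions and substitutions. The lexicon is a tiny invented list, so `k` is reduced to 3 in the example call; the helper names are hypothetical, and `nltk.edit_distance` could replace the hand-written distance function.

```python
import heapq

def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]

def average_ld(target, lexicon, k=20):
    """Mean distance from target to its k nearest lexicon entries."""
    distances = (levenshtein(target, word) for word in lexicon)
    nearest = heapq.nsmallest(k, distances)
    return sum(nearest) / len(nearest)

# Tiny made-up lexicon; k is 3 here only because the toy list is so short.
toy_lexicon = ["kat", "kap", "mat", "fiets", "boom"]
toy_ald = average_ld("kas", toy_lexicon, k=3)
print(toy_ald)
```

With roughly 1,500 pseudowords and a large lexicon, a pure-Python loop like this will be slow, so caching the results is advisable.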

In [28]:
# compute the average Levenshtein distance from each pseudoword to the words used to filter out pseudowords.
# Show the aLD estimate for the pseudowords 'nedukes', 'pewbin', and 'vibcines'
# 3 points for correctly computing aLD for pseudowords

**Task 6** (*3 points available*)

For each pseudoword, record the number of tokens into which RobBERT v2 splits it.
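
A hedged sketch of the counting step: `count_tokens` is a hypothetical helper that only needs a tokenizer with a `tokenize` method, as Hugging Face tokenizers have. `ToyTokenizer` is a stand-in so the example runs without downloading the model, and its splits are not RobBERT's.

```python
def count_tokens(tokenizer, strings):
    """Number of subword tokens per string, as returned by tokenizer.tokenize."""
    return {s: len(tokenizer.tokenize(s)) for s in strings}

class ToyTokenizer:
    """Stand-in tokenizer that splits a string into pieces of up to 3 characters."""
    def tokenize(self, text):
        return [text[i:i + 3] for i in range(0, len(text), 3)]

# With the real library this would be something like:
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
tokenizer = ToyTokenizer()
token_counts = count_tokens(tokenizer, ["yuxwas", "skibfy", "errords"])
print(token_counts)
```

With byte-level BPE tokenizers, a string can be split differently depending on whether it is preceded by a space, so it is worth checking that the counts match the tokenization actually used when embedding.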

In [29]:
# record the number of tokens into which RobBERT divides each pseudoword
# show the number of tokens for the pseudowords 'yuxwas', 'skibfy', and 'errords'
# 3 points for correctly mapping pseudowords to number of tokens

**Task 7** (*5 points available, see breakdown below*)

Compute the residuals of the predicted valence under the four regressors trained and applied in tasks 1 to 3. Then, correlate the residuals from all four models with aLD. Finally, correlate the residuals from the RobBERT v2 model with the number of tokens into which each pseudoword is split. Use Pearson's correlation coefficient.
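
The computation can be sketched as follows, taking a residual to be the observed value minus the predicted one. Whether signed or absolute residuals are intended is not stated here, so the signed version is an assumption. All numbers are invented, and `scipy.stats.pearsonr` could replace the hand-written `pearson`.

```python
def residuals(observed, predicted):
    """Residual = observed value minus the model's prediction (signed; an assumption)."""
    return [o - p for o, p in zip(observed, predicted)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented values for four pseudowords.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 4.0, 7.0, 6.0]
ald = [1.0, 2.0, 3.0, 5.0]

res = residuals(observed, predicted)
r = pearson(res, ald)
print(res, round(r, 3))
```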

In [30]:
# compute the residuals from all four regression models fitted before
# 1 point available for correctly computing residuals

In [31]:
# compute Pearson's correlation between residuals and average LD for all models,
# as well as the correlation between RobBERT v2 residuals and the number of tokens in which each pseudoword
#    is encoded by the RobBERT v2 model.
# show all correlation coefficients
# 4 points for the correct correlation coefficients

**Task 8** What is the relation between the errors each model made and aLD? What about the number of tokens (limited to the RobBERT v2 model)?

(*4 points available, max 150 words*)

*type your answer here*