# CL Fall School 2024 in Passau: Multimodal NLP
Carina Silberer and Hsiu-Yu Yang, University of Stuttgart

---

# Lab 3: Word Similarity Estimation

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

In [None]:
conda install pytorch torchvision -c pytorch-nightly

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr

In [None]:
try: 
    import pandas
except ModuleNotFoundError:
    #!conda update -n base -c defaults conda
    !conda install --yes pandas

In [None]:
from PIL import Image
import requests
import torch

import operator
import os
import json
import pickle

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Exercise: Word Similarity Estimation
Word similarity and relatedness datasets have long been used to intrinsically evaluate distributional representations of word meaning. The standard evaluation metric for such datasets is the [Spearman correlation coefficient (Spearman's $\rho$)](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient). 
It is computed between the human-elicited scores and your model's estimated scores.
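
To make the metric concrete, here is a minimal pure-Python sketch of what Spearman's $\rho$ computes: both score lists are converted to ranks, and the Pearson correlation of those ranks is returned. The helper names (`average_ranks`, `pearson`, `spearman_rho`) and the toy scores are illustrative assumptions and not part of the lab code; the lab itself uses `scipy.stats.spearmanr`.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the group while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: human ratings (scale 0-10) and model cosine similarities
human = [9.2, 7.5, 3.1, 1.0, 5.4]
model = [0.81, 0.77, 0.40, 0.35, 0.20]

rho = spearman_rho(human, model)
print(round(rho, 3))  # 0.7
```

Because only the ranks matter, the two score lists do not need to be on the same scale: any strictly increasing transformation of the model scores leaves $\rho$ unchanged.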

The goal of this exercise is to compare three classes of models on the word similarity task: a `pure language model`, a `pure vision model`, and a `vision-language model`.

### Dataset: SimLex-999
We will use the word pairs and human similarity judgements of SimLex-999.
Download the dataset either from the course's GitHub space (under `data/`) or from the website (https://fh295.github.io/simlex.html); the filename is `SimLex-999/SimLex-999.txt`. Also check the description of the dataset in the `README`. The relevant data for this assignment are provided in the columns `word1`, `word2`, `POS`, `SimLex999` (scale 0-10), and `concQ` (derived from concreteness ratings (scale 1-7) for the individual words of a pair).

In [None]:
sim_data = pandas.read_csv("data/SimLex-999/SimLex-999.txt", sep="\t")

In [None]:
# the first 10 entries in SimLex-999
sim_data.head(10)

### Methodology
We load the models and prepare them and the vocabulary, and use Spearman's $\rho$ to measure the correlation between the human-elicited similarity judgements and the model's estimated similarity scores. 

To cleanly disentangle the contribution of the respective modality, ensure the selected models have similar backbone/architecture. 
For example, ViLT's vision backbone is ViT. Ideally, we would also use ViLT's textual backbone, but since that was trained from scratch, we use the commonly used linguistic encoder BERT.

* `Language model`: [BERT-base](https://huggingface.co/google-bert/bert-base-uncased) (**BERT**)
* `Vision model`: [ViT](https://huggingface.co/docs/transformers/model_doc/vit#overview) (**ViT**)
* `Vision-language model`: [ViLT](https://huggingface.co/docs/transformers/model_doc/vilt) (**ViLT**)


##### Procedure:
1. Step 1: (For vision-based models) Prepare visual input for the words in the SimLex-999 dataset.
2. Step 2: Load and prepare the models.
3. Step 3: Use the models to extract the representations of the words and images.
4. Step 4: Calculate the similarity scores.
5. Step 5: Use Spearman's $\rho$ to measure the correlation between the human-elicited similarity judgements and the model's estimated similarity scores.

### Step 1: Loading the images
We need visual representations for the target concepts (words) in the form of images. 
In practice, one can search for the images online to match a word in the unimodal dataset.
(*note: in distribution*)

In this lab, let's try to use the most common image-caption dataset for vision-based model pretraining, Conceptual Captions ([CC3M](https://ai.google.com/research/ConceptualCaptions/download)), and retrieve images from there showing our target words as their visual representation.
In the `data/CC3M` folder, you will find a file called `topic2image.json`, which gives a list of image ids for each topic (i.e., word). `imageID2url.json`, in turn, gives the corresponding urls.
Each topic word has **at most 10 images**, sorted by their so-called tf-idf values (the higher the value, the more strongly the image is associated with the word).
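
The lookup from a word to image URLs can be sketched as follows. The two dictionaries are hypothetical miniature stand-ins for `topic2image.json` and `imageID2url.json`: the sketch assumes each entry is a pair whose second element is the image id (that is how `get_images` reads the file) and whose first element is a score. The real files may differ in detail, and `urls_for_word` is an illustrative helper, not part of the lab code.

```python
# Hypothetical miniature versions of the two lookup files (assumed structure)
topic2image = {
    "cat": [[0.91, "img_001"], [0.75, "img_007"], [0.40, "img_042"]],
}
imageID2url = {
    "img_001": "https://example.com/cat_a.jpg",
    "img_007": "https://example.com/cat_b.jpg",
    "img_042": "https://example.com/cat_c.jpg",
}


def urls_for_word(word, num=3):
    """Return the URLs of the first `num` listed images for `word` (empty list if unknown)."""
    entries = topic2image.get(word, [])              # unknown words yield no entries
    image_ids = [entry[1] for entry in entries[:num]]
    return [imageID2url[i] for i in image_ids if i in imageID2url]


print(urls_for_word("cat", num=2))
print(urls_for_word("unicorn"))  # []
```

Using `.get(word, [])` keeps the lookup from failing on words that have no images in CC3M.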

**To save time, the images that are available for SimLex-999 have already been downloaded. You will find them in a Google Drive folder (see Slack). Download them, put them under `data/CC3M/`, and unzip them.**

In [None]:
import pandas
import json

In [None]:
datadir = "data/CC3M/"
imgname = os.listdir(os.path.join(datadir, "images/"))[0]
print(imgname)

In [None]:
# We load the words for which we found images in CC3M
with open(os.path.join(datadir, "words_w_imgs.txt")) as f:
    wordsinCC3M = [w.strip() for w in f.readlines()]

print(f'Num of words in CC3M: {len(wordsinCC3M)}')

In [None]:
# Let's check the subset of words from SimLex999 that have images in CC3M for analysis
sim_data = pandas.read_csv(os.path.join("data/SimLex-999", "SimLex-999.txt"), sep="\t")

# Check how many words are overlapped between SimLex999 and CC3M
sim_words = set(sim_data['word1']).union(set(sim_data['word2']))
overlapped_words = sim_words.intersection(set(wordsinCC3M))
print(len(overlapped_words), 'overlapped words between SimLex999 and CC3M')
print('Words:', list(overlapped_words)[:5])

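# For the similarity analysis, we keep only the rows where both words have corresponding images
# (.copy() gives an independent DataFrame, so adding columns later does not trigger pandas' SettingWithCopyWarning)
new_sim_data = sim_data[sim_data['word1'].isin(overlapped_words) & sim_data['word2'].isin(overlapped_words)].copy()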
print('Num of word pairs that have images for analysis:', len(new_sim_data))

In [None]:
new_sim_data

We need images to represent a concept visually; we load them with the function `get_images`. <br/>
*Remark on the function `get_images`:* If an image for a word is not yet in your local data folder and `download=True`, the function downloads it via its URL and saves it locally. Otherwise, it loads the image directly from the data folder.

In [None]:
# Helper function: load up to `num` images for a word, optionally downloading missing ones
def get_images(word, num=3, download=False):
    images = []
    image_IDs, imageID2url, next_id = None, None, 0
    for i in range(num):
        filepath = os.path.join(datadir, "images", word + str(i) + ".jpg")
        if os.path.exists(filepath):
            # convert() returns a new in-memory image, so the file can be closed again
            with Image.open(filepath) as img:
                images.append(img.convert('RGB'))
        elif download:
            if image_IDs is None:
                # Load the lookup tables only once, and only if a download is needed
                with open(os.path.join(datadir, "topic2image.json")) as f:
                    topic2image = json.load(f)
                with open(os.path.join(datadir, "imageID2url.json")) as f:
                    imageID2url = json.load(f)
                image_IDs = [item[1] for item in topic2image.get(word, [])]
            # Try the remaining candidate ids until one download succeeds for slot i
            while next_id < len(image_IDs):
                url = imageID2url.get(image_IDs[next_id])
                next_id += 1
                try:
                    img = Image.open(requests.get(url, stream=True, timeout=10).raw).convert('RGB')
                    img.save(filepath)
                    images.append(img)
                    break
                except Exception:
                    print(f'Failed to load image {url}')
    #print(f'Loaded {len(images)} images for "{word}"')
    return images

### Step 2: Loading the models
* We will use ViT to encode images
* To encode text, we will use BERT
* Finally, to encode image + text, i.e., to get a visual--linguistic representation, we will use ViLT

#### Load the visual encoder model (ViT)
We will use [ViT-32](https://huggingface.co/google/vit-base-patch32-224-in21k), the vision transformer model we discussed in class. It was pre-trained on ImageNet-21k, which is a huge collection of images labeled with the objects they show.

In [None]:
# Load VIT 
# We need a processor to read in images in pixel values
from transformers import ViTImageProcessor
from transformers import ViTModel

image_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch32-224-in21k')
vit_model = ViTModel.from_pretrained('google/vit-base-patch32-224-in21k')

#### Load the text encoder model (BERT)
We will use [BERT-base](https://huggingface.co/google-bert/bert-base-uncased) to encode text. It is no longer a state-of-the-art representation model, but it has been used to initialise the textual encoder of many VL models (for example, ViLBERT and VisualBERT). That is, these models use BERT as the starting point for embedding (representing) their textual input.

In [None]:
from transformers import BertTokenizer
from transformers import BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)
text_model = BertModel.from_pretrained("bert-base-uncased")

#### Load the visual--linguistic model (ViLT)
To represent jointly image+text, we will encode the two modalities using the VL model [ViLT](https://huggingface.co/docs/transformers/model_doc/vilt).

In [None]:
from transformers import ViltProcessor
from transformers import ViltModel

mm_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm", clean_up_tokenization_spaces=True)
mm_model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

### Step 3a: Loading and preparing the data
We need to load the images and the words.

In [None]:
# prepare image and text
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats with two remote controls"
image

In [None]:
images = get_images("cat")
text = "cat"
images[0]

### Step 3b: Encoding the images and words

#### Encode the image --> feature vector

In [None]:
# Get the pixel values of the images to prepare the model input
# (the inputs are moved to whichever device the model currently lives on)
inputs = image_processor(images=images, return_tensors="pt").to(vit_model.device)

with torch.no_grad():
    outputs = vit_model(**inputs)
    # pooler_output is derived from the last layer's [CLS] token; we use it as the representation of the whole image
    pooler_output = outputs.pooler_output
    print(pooler_output.shape) # (number of images, hidden size): one row per image loaded for the word
    
    # Get the average of the image embeddings as the representation for a word (if there were more than one image for the text)
    image_embedding = pooler_output.mean(dim=0).unsqueeze(0)
    print(f'Averaged image representation for "{text}":', image_embedding.shape) # batch size: 1, hidden size: 768 

#### Encode the text --> feature vector

In [None]:
encoded_input = tokenizer(text, return_tensors='pt').to(text_model.device)

with torch.no_grad():
    outputs = text_model(**encoded_input)
    pooler_output = outputs.pooler_output
    print(f'Representation size for input "{text}":', pooler_output.shape)

#### Encode the image + word --> multimodal feature vector

In [None]:
prompt = "A picture of a {}."
words4vilt = len(images)*[prompt.format(text)]

In [None]:
# ViLT multimodal embeddings
inputs = mm_processor(images, words4vilt, return_tensors="pt").to(mm_model.device)
with torch.no_grad():
    outputs = mm_model(**inputs)
#last_hidden_states = outputs.last_hidden_state
mm_pooler_output = outputs.pooler_output
print(mm_pooler_output.shape)

# Get the average of the image-text embeddings as the representation for a text+image (if there were more than one image for the text)
mm_embedding = mm_pooler_output.mean(dim=0).unsqueeze(0)
print(f'Averaged multimodal representation for "{text}" and its images:', mm_embedding.shape) # batch size: 1, hidden size: 768 

---

**Let's have a helper function that gives us the embedding of the respective modalities:**

In [None]:
# Note: you can adapt the following code on a bigger batch size, but be aware of the memory usage
def get_representation(model, input_processer, device, inputs, modality="Text"):
    model.to(device)  # make sure the model lives on the same device as its inputs
    returned_embedding = None
    if modality == "Text":
        input2model = input_processer([inputs['word']], return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**input2model)
            last_hidden_states = outputs.last_hidden_state[:, :-1, :] # disregard the final [SEP] token (the leading [CLS] token is kept)
            # we average the hidden states of the tokens (if > 1) to get the representation of the word
            word_embedding = last_hidden_states.mean(dim=1)
            returned_embedding = word_embedding
    elif modality == "Image":
        in_images = inputs['images']
        #in_images = [img.convert('RGB') for img in inputs['images']]
        input2model = input_processer(images=in_images, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**input2model)
            pooler_output = outputs.pooler_output
            # get the average representation of the images
            image_embedding = pooler_output.mean(dim=0).unsqueeze(0)
            returned_embedding = image_embedding
    elif modality == "Image+Text":
        prompt = "A picture of a {}."
        text = inputs["word"]
        in_images = inputs['images']
        #in_images = [img.convert('RGB') for img in inputs['images']]
        words4vilt = len(in_images)*[prompt.format(text)] # we have three images, so we input the word for each image, i.e., have a list that contains the target word three times
        input2model = input_processer(in_images, words4vilt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**input2model)
            #last_hidden_states = outputs.last_hidden_state
            pooler_output = outputs.pooler_output

        # Get the average of the image-text embeddings as the representation for a text+image (if there were more than one image for the text)
        returned_embedding = pooler_output.mean(dim=0).unsqueeze(0)
        
    return returned_embedding.detach().cpu().numpy()
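
The token averaging in the `Text` branch can be illustrated on a tiny array. This is a NumPy sketch with made-up numbers (the lab code operates on torch tensors, but the slicing and averaging work the same way); the `pieces_only` variant at the end is an assumption shown only for comparison and is not what `get_representation` does.

```python
import numpy as np

# Hypothetical last-layer output for one input: 4 tokens, hidden size 3
last_hidden_state = np.array([[
    [1.0, 1.0, 1.0],   # [CLS]
    [2.0, 0.0, 4.0],   # first word piece
    [4.0, 2.0, 0.0],   # second word piece
    [9.0, 9.0, 9.0],   # [SEP]
]])                     # shape: (batch=1, tokens=4, hidden=3)

without_sep = last_hidden_state[:, :-1, :]   # drop the final token, as in the Text branch
word_embedding = without_sep.mean(axis=1)    # average over the token axis
print(word_embedding.shape)                  # (1, 3)

# Variant (assumption, for comparison only): average the word pieces without [CLS] and [SEP]
pieces_only = last_hidden_state[:, 1:-1, :].mean(axis=1)
print(pieces_only)                           # [[3. 1. 2.]]
```

Which tokens enter the average is a design choice; if you change it, embeddings extracted earlier (for example, from a saved pickle file) are no longer comparable with new ones.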
    

In [None]:
# Test the function
images = get_images("fast")
inputs = {"word": "fast", "images": images}

print('BERT:', get_representation(text_model, tokenizer, device, inputs, modality="Text").shape)
print('ViT Base:', get_representation(vit_model, image_processor, device, inputs, modality="Image").shape)
print('ViLT:', get_representation(mm_model, mm_processor, device, inputs, modality="Image+Text").shape)


### Step 4: Compute the Similarity

In [None]:
# Compute the similarity score
from sklearn.metrics.pairwise import cosine_similarity

# Compute the similarity predictions with the model for a word pair
# Take an example row
row=4
word1, word2 = new_sim_data.iloc[row]['word1'], new_sim_data.iloc[row]['word2']
print(word1, word2)

# Get the images for the words
num_of_images = 3
word1_images = get_images(word1, num=num_of_images)
word2_images = get_images(word2, num=num_of_images)
if len(word1_images) < 1 or len(word2_images) < 1:
    print("Not enough images.")

# Get the representations for the words
# Alternatively, you can extract the embeddings beforehand for all words and save it in a pickle file for computation efficiency

# Visual model
inputs = {"word": word1, "images": word1_images}
word1_embedding = get_representation(vit_model, image_processor, device, inputs, modality="Image") # 0.12
inputs = {"word": word2, "images": word2_images}
word2_embedding = get_representation(vit_model, image_processor, device, inputs, modality="Image")

# Compute the similarity score
similarity_score = cosine_similarity(word1_embedding, word2_embedding)
print(f'Visual model: Similarity score between "{word1}" and "{word2}":', similarity_score[0][0])

# Textual model
inputs = {"word": word1, "images": word1_images}
word1_embedding = get_representation(text_model, tokenizer, device, inputs, modality="Text") # 0.12
inputs = {"word": word2, "images": word2_images}
word2_embedding = get_representation(text_model, tokenizer, device, inputs, modality="Text")

# Compute the similarity score
similarity_score = cosine_similarity(word1_embedding, word2_embedding)
print(f'Textual model: Similarity score between "{word1}" and "{word2}":', similarity_score[0][0])

# Multimodal model
inputs = {"word": word1, "images": word1_images}
word1_embedding = get_representation(mm_model, mm_processor, device, inputs, modality="Image+Text") # 0.59
inputs = {"word": word2, "images": word2_images}
word2_embedding = get_representation(mm_model, mm_processor, device, inputs, modality="Image+Text")

# Compute the similarity score
similarity_score = cosine_similarity(word1_embedding, word2_embedding)
print(f'Multimodal model: Similarity score between "{word1}" and "{word2}":', similarity_score[0][0])
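
For reference, the value that `cosine_similarity` returns for a pair of vectors is their dot product divided by the product of their lengths. The following NumPy sketch uses made-up vectors; the helper name `cosine` is illustrative and not part of the lab code.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two 1-D vectors: dot product over the product of the norms."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice as long
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(round(cosine(a, b), 6))  # 1.0
print(round(cosine(a, c), 6))  # 0.0
```

The score depends only on the direction of the vectors, not on their length, which is why `a` and `b` come out as maximally similar.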

## Putting it all together: Computing the similarities for all word pairs

Below are some helper functions which do what was shown above, just in condensed form and for all word pairs.

In [None]:
def get_model_inputs(word, num_of_images=3):
    """ For a given word, extract <num_of_images> images that show (presumably) the object denoted by the word.
    """
    word_images = get_images(word, num=num_of_images)
    model_inputs = {"word": word, "images": word_images}
    return model_inputs

def calculate_similiarity_score(word1, word2, type, embeddings):
    word1_embedding = embeddings.get(word1)[type]
    word2_embedding = embeddings.get(word2)[type]
    similarity_score = cosine_similarity(word1_embedding, word2_embedding)
    return similarity_score[0][0]

In [None]:
# Prepare model inputs from words in the dataset
new_sim_data.loc[:,'word1_model_inputs'] = new_sim_data['word1'].apply(get_model_inputs)
new_sim_data.loc[:,'word2_model_inputs'] = new_sim_data['word2'].apply(get_model_inputs)

It's more efficient to extract the representations for all the words once and save them in a pickle file. The code below does that, and saves the embeddings under `my_lab_embeddings.pkl`. However, you can also directly load the embeddings in the code below (else-block; the file is called `03_embeddings.pkl`, see the github repo).

In [None]:
import pickle
# set to True if you want to extract the embeddings yourself
extract_embeddings = False

# ------Streamline the pipeline------
# Prepare model inputs from words in the dataset
#new_sim_data['word1_model_inputs'] = new_sim_data['word1'].apply(get_model_inputs)
#new_sim_data['word2_model_inputs'] = new_sim_data['word2'].apply(get_model_inputs)

if extract_embeddings:
    # Extract the embeddings and save them
    all_reprs = {}
    for i in range(len(new_sim_data)):
        for col in ['word1_model_inputs', 'word2_model_inputs']:
            model_inputs = new_sim_data.iloc[i][col]
            word = model_inputs['word']
            text_based_representation = get_representation(text_model, tokenizer, device, model_inputs, modality="Text")
            image_based_representation = get_representation(vit_model, image_processor, device, model_inputs, modality="Image")
            image_text_based_representation = get_representation(mm_model, mm_processor, device, model_inputs, modality="Image+Text")
            all_reprs[word] = {"text_based": text_based_representation, 
                                "image_based": image_based_representation, 
                                "image_text_based": image_text_based_representation}
    
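    # keep the freshly extracted embeddings available for the cells below
    embeddings = all_reprs

    save_to_file=True
    if save_to_file:
        # save the embeddings using pickle
        with open('my_lab_embeddings.pkl', 'wb') as f:
            print('Saving the embeddings to my_lab_embeddings.pkl')
            pickle.dump(all_reprs, f)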
else:
    # Read the embeddings from the pickle file
    with open(f'03_embeddings.pkl', 'rb') as f:
        embeddings = pickle.load(f)
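
The extract-once-then-reuse idea can also be written as a small "load or compute" helper. This is a sketch under assumptions: `load_or_compute` and `toy_compute` are hypothetical names, and the toy dictionary merely stands in for the expensive model calls.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Load cached embeddings from `path` if it exists; otherwise compute and cache them."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    embeddings = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


calls = []  # records how often the expensive step actually runs


def toy_compute():
    # stand-in for the expensive model calls
    calls.append(1)
    return {"cat": {"text_based": [0.1, 0.2]}, "dog": {"text_based": [0.3, 0.4]}}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy_embeddings.pkl")
    first = load_or_compute(cache_path, toy_compute)    # computes and writes the cache
    second = load_or_compute(cache_path, toy_compute)   # reads the cache, no recomputation

print(first == second, len(calls))  # True 1
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.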

In [None]:

# Calculate the similarity scores for all word pairs
new_sim_data.loc[:,'text_based_similarity_score'] = new_sim_data.apply(lambda x: calculate_similiarity_score(x['word1'], x['word2'], 'text_based', embeddings), axis=1)
new_sim_data.loc[:,'image_based_similarity_score'] = new_sim_data.apply(lambda x: calculate_similiarity_score(x['word1'], x['word2'], 'image_based', embeddings), axis=1)
new_sim_data.loc[:,'image_text_based_similarity_score'] = new_sim_data.apply(lambda x: calculate_similiarity_score(x['word1'], x['word2'], 'image_text_based', embeddings), axis=1)

### Step 5: Measuring Spearman's $\rho$
Now that we have obtained cosine similarities for all word pairs from the three different models (using the representations they extracted from the text and the images), we can measure to what extent the scores (i.e., the models) account for human judgements of word similarity. 
We do this by computing Spearman's rank correlation between the ranking induced by a model's scores and the ranking induced by the (averaged) similarity judgements elicited from humans. 

In [None]:
from scipy.stats import spearmanr

In [None]:
# Compute the similarity predictions with the model for the test word pairs,
# and compare them against the human ratings, using Spearman's rho:
rho_text = spearmanr(new_sim_data["SimLex999"], new_sim_data["text_based_similarity_score"])
rho_vis = spearmanr(new_sim_data["SimLex999"], new_sim_data["image_based_similarity_score"])
rho_vt = spearmanr(new_sim_data["SimLex999"], new_sim_data["image_text_based_similarity_score"])

print("Spearman's rho for the textual model: {} (p-value: {})".format(rho_text.correlation, rho_text.pvalue))
print("Spearman's rho for the visual model: {} (p-value: {})".format(rho_vis.correlation, rho_vis.pvalue))
print("Spearman's rho for the vl model: {} (p-value: {})".format(rho_vt.correlation, rho_vt.pvalue))

In [None]:
# Helper function for the analyses below
def subset_spearman(subset, pred_column='text_based_similarity_score'):
    return spearmanr(subset["SimLex999"], subset[pred_column])

#### Random Baseline
It is useful to compare a model against a baseline. We'll use a random baseline, that assigns a random similarity value (between 0 and 1) to each test word pair. 
Evaluate it on the word similarity task measuring Spearman's $\rho$.

In [None]:
# random baseline
# numpy.random.rand produces floats in [0.0, 1.0); the exact range does not
# matter for Spearman's rho, which only compares ranks
import numpy
rand_vals = numpy.random.rand(len(new_sim_data))
new_sim_data.loc[:,"random"] = rand_vals

In [None]:
new_sim_data.head()
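
To see what to expect from such a baseline, here is a self-contained sketch on synthetic data. The `gold` array is a hypothetical stand-in for the `SimLex999` column, and the seed and sample size are arbitrary choices.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(seed=0)        # fixed seed so the run is repeatable
gold = rng.uniform(0, 10, size=1000)       # hypothetical stand-in for the SimLex999 column
random_scores = rng.random(size=1000)      # random "similarities" in [0, 1)

rho, pvalue = spearmanr(gold, random_scores)
print(f"rho={rho:.3f}, p-value={pvalue:.3f}")
```

Since the random scores carry no information about the gold scores, $\rho$ should be close to 0. In the notebook itself, the baseline can be evaluated with `subset_spearman(new_sim_data, pred_column='random')`.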

## Exercises: Analysis

### Evaluation: Error Analysis
For error analysis, we can look at the word pairs with the largest delta between the gold score and the similarity score predicted by a model (by default, the text model). Analogously to our evaluation metric (Spearman's $\rho$), we base our comparison on ranks. We normalise both rankings to make their levels comparable. <br/>

(Link to a pandas' `Series` method we will use for that:
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.rank.html#pandas-series-rank)

In [None]:
def _normalised_ranking(sim_list):
    ranks = pandas.Series(sim_list).rank(method='dense')  #'average')
    return ranks / ranks.sum()

def error_analysis(predictions, reference_scores, word_pairs):
    predictions_rank = _normalised_ranking(predictions)
    refscore_rank = _normalised_ranking(reference_scores)
    error =  abs(predictions_rank - refscore_rank)
    return sorted(zip(word_pairs, error, predictions, reference_scores), key=operator.itemgetter(1), reverse=True)
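
The effect of the `method` argument of `Series.rank` is easiest to see on a toy example with a tie. The scores below are made up; the last step mirrors the normalisation used in `_normalised_ranking`.

```python
import pandas

scores = pandas.Series([0.2, 0.9, 0.2, 0.5])

dense = scores.rank(method="dense")      # ties share a rank, no gaps: 1, 3, 1, 2
average = scores.rank(method="average")  # ties get the mean of their positions: 1.5, 4, 1.5, 3

# Normalise as in `_normalised_ranking`, so that the ranking sums to 1
dense_norm = dense / dense.sum()

print(dense.tolist())    # [1.0, 3.0, 1.0, 2.0]
print(average.tolist())  # [1.5, 4.0, 1.5, 3.0]
```

With `dense` ranks, the largest rank equals the number of distinct values, so two columns with different numbers of ties end up on different rank scales before normalisation. This is worth keeping in mind when interpreting the rank differences.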

In [None]:
# Get the ten worst and ten best predictions, respectively
def analyze(subset, pred_column='text_based_similarity_score', target_column="SimLex999"):
    rank_diffs = error_analysis(
        subset[pred_column],
        subset[target_column],
        [f'{first}-{second}' for first, second in subset[["word1", "word2"]].to_numpy()]
    )
    print("Ten worst predictions (ranks differ the most): ", rank_diffs[:10])
    print("\nTen best predictions (ranks differ the least): ", rank_diffs[-10:])

analyze(new_sim_data)

### Quantitative Results
Compare the correlation coefficients you obtained for the models. Make sure that the models have the same underlying vocabulary which should comprise only those word pairs for which each model has representations. 

In [None]:
print(subset_spearman(new_sim_data, pred_column='text_based_similarity_score'))
analyze(new_sim_data, pred_column='text_based_similarity_score')  # analyze() prints its results and returns None

### Qualitative Analysis (Error Analysis)
1. Evaluate the models on a subset of word pairs
   1. Separately for each word category (column `POS`), i.e., nouns, verbs, adjectives
   2. Separately for abstract and concrete word pairs (column `concQ`). Do this for each of the 4 quartiles (from abstract to concrete).
2. Inspect the worst and best predictions. 

***Qualitative Analysis 1A:***

In [None]:
# adjectives only:
adjective_subset = new_sim_data[new_sim_data["POS"] == "A"]

In [None]:
# Spearman's rho between text model and humans for adjectives only:
subset_spearman(adjective_subset, pred_column='text_based_similarity_score')

In [None]:
# text model (BERT):
analyze(adjective_subset, pred_column='text_based_similarity_score')

In [None]:
# TODO: Your code here for the other models

In [None]:
# verbs only:
verb_subset = new_sim_data[new_sim_data["POS"] == "V"]

# TODO: Your code here for verbs

In [None]:
# nouns only:
noun_subset = new_sim_data[new_sim_data["POS"] == "N"]
subset_spearman(noun_subset)

# TODO: Your code here for nouns

***Qualitative Analysis 1B:***

In [None]:
first_quartile = new_sim_data[new_sim_data["concQ"] == 1]

In [None]:
print(subset_spearman(first_quartile, pred_column='text_based_similarity_score'))
analyze(first_quartile, pred_column='text_based_similarity_score')  # analyze() prints its results and returns None

In [None]:
# TODO: Your code here for the other models

**Task: Evaluate the models also on the second, third and fourth quartile.**

In [None]:
# TODO: your code here