# CL Fall School 2024 in Passau: Multimodal NLP
Carina Silberer and Hsiu-Yu Yang, University of Stuttgart

---

# Lab 3: Word Similarity Estimation

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git

In [2]:
conda install pytorch torchvision -c pytorch-nightly

Channels:
 - pytorch-nightly
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr

In [2]:
try: 
    import pandas
except ModuleNotFoundError:
    #!conda update -n base -c defaults conda
    !conda install --yes pandas

In [3]:
from PIL import Image
import requests
import torch

import operator
import os
import json
import pickle

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [None]:
# Load ViT
# We need an image processor to convert images into pixel values
from transformers import ViTImageProcessor
from transformers import ViTModel

image_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch32-224-in21k')
vit_model = ViTModel.from_pretrained('google/vit-base-patch32-224-in21k')

In [None]:
from transformers import BertTokenizer
from transformers import BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)
text_model = BertModel.from_pretrained("bert-base-uncased")

In [None]:
from transformers import ViltProcessor
from transformers import ViltModel

mm_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm", clean_up_tokenization_spaces=True)
mm_model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

# Exercise: Word Similarity Estimation
Word similarity and relatedness datasets have long been used to intrinsically evaluate distributional representations of word meaning. The standard evaluation metric for such datasets is the [Spearman correlation coefficient (Spearman's $\rho$)](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient). 
It is computed between the human-elicited scores and your model's estimated scores.
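
As a minimal sketch of how this metric is computed, the cell below applies `scipy.stats.spearmanr` (imported above) to a handful of scores. The human scores are SimLex-999 ratings for a few adjective pairs; the model scores are made up purely for illustration and do not come from any real model.

```python
from scipy.stats import spearmanr

# Human similarity ratings for five SimLex-999 pairs:
# old/new, smart/intelligent, hard/difficult, happy/cheerful, hard/easy
human_scores = [1.58, 9.20, 8.77, 9.55, 0.95]

# Hypothetical model scores for the same pairs (invented for this example)
model_scores = [0.30, 0.85, 0.90, 0.80, 0.10]

# spearmanr returns the correlation coefficient and a p-value
rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman's rho = {rho:.3f}")

# Spearman's rho depends only on the ranks, so a monotonic rescaling
# of the model scores leaves it unchanged
rho_cubed, _ = spearmanr(human_scores, [s ** 3 for s in model_scores])
print(f"After cubing the model scores: {rho_cubed:.3f}")
```

Because only the rank order matters, the model's scores do not need to be on the same 0-10 scale as the human ratings.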

The goal of this exercise is to compare three classes of models on the word similarity task: a `pure language model`, a `pure vision model`, and a `vision-language model`.

### Dataset: SimLex-999
We will use the word pairs and human similarity judgements of SimLex-999.
Download the dataset either from the course's GitHub space (under `data/`) or from the website (https://fh295.github.io/simlex.html); the file is `SimLex-999/SimLex-999.txt`. Also check the description of the dataset in the `README`. The relevant data for this assignment are in the columns `word1`, `word2`, `POS`, `SimLex999` (scale 0-10), and `concQ` (derived from concreteness ratings (scale 1-7) for the individual words of a pair).
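
The following sketch shows one way to keep only the relevant columns and to filter pairs by part of speech and `concQ`. To stay self-contained it reads three rows copied from `SimLex-999.txt` out of an in-memory string rather than the real file; the variable names (`sample`, `sim_sample`, `relevant`, `adjectives_q1`) are illustrative choices, not part of the lab.

```python
import io
import pandas

# Three tab-separated rows copied from SimLex-999.txt, standing in for the full file
sample = (
    "word1\tword2\tPOS\tSimLex999\tconc(w1)\tconc(w2)\tconcQ\t"
    "Assoc(USF)\tSimAssoc333\tSD(SimLex)\n"
    "old\tnew\tA\t1.58\t2.72\t2.81\t2\t7.25\t1\t0.41\n"
    "smart\tintelligent\tA\t9.2\t1.75\t2.46\t1\t7.11\t1\t0.67\n"
    "hard\tdifficult\tA\t8.77\t3.76\t2.21\t2\t5.94\t1\t1.19\n"
)
sim_sample = pandas.read_csv(io.StringIO(sample), sep="\t")

# Keep only the columns needed for this assignment
relevant = sim_sample[["word1", "word2", "POS", "SimLex999", "concQ"]]

# Example filter: adjective pairs with concQ equal to 1
adjectives_q1 = relevant[(relevant["POS"] == "A") & (relevant["concQ"] == 1)]
print(adjectives_q1)
```

The same selection and filtering should carry over unchanged to the full DataFrame once the file has been loaded with `pandas.read_csv(..., sep="\t")`.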

In [5]:
sim_data = pandas.read_csv("data/SimLex-999/SimLex-999.txt", sep="\t")

In [6]:
# the first 10 entries in SimLex-999
sim_data.head(10)

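| | word1 | word2 | POS | SimLex999 | conc(w1) | conc(w2) | concQ | Assoc(USF) | SimAssoc333 | SD(SimLex) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | old | new | A | 1.58 | 2.72 | 2.81 | 2 | 7.25 | 1 | 0.41 |
| 1 | smart | intelligent | A | 9.20 | 1.75 | 2.46 | 1 | 7.11 | 1 | 0.67 |
| 2 | hard | difficult | A | 8.77 | 3.76 | 2.21 | 2 | 5.94 | 1 | 1.19 |
| 3 | happy | cheerful | A | 9.55 | 2.56 | 2.34 | 1 | 5.85 | 1 | 2.18 |
| 4 | hard | easy | A | 0.95 | 3.76 | 2.07 | 2 | 5.82 | 1 | 0.93 |
| 5 | fast | rapid | A | 8.75 | 3.32 | 3.07 | 2 | 5.66 | 1 | 1.68 |
| 6 | happy | glad | A | 9.17 | 2.56 | 2.36 | 1 | 5.49 | 1 | 1.59 |
| 7 | short | long | A | 1.23 | 3.61 | 3.18 | 2 | 5.36 | 1 | 1.58 |
| 8 | stupid | dumb | A | 9.58 | 1.75 | 2.36 | 1 | 5.26 | 1 | 1.48 |
| 9 | weird | strange | A | 8.93 | 1.59 | 1.86 | 1 | 4.26 | 1 | 1.30 |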


### Methodology
We load the models and prepare them and the vocabulary, and use Spearman's $\rho$ to measure the correlation between the human-elicited similarity judgements and the model's estimated similarity scores. 

To cleanly disentangle the contribution of each modality, ensure that the selected models have similar backbones/architectures.
For example, ViLT's vision backbone is ViT. Ideally, we would also use ViLT's textual backbone, but since that was trained from scratch, we use the widely used text encoder BERT instead.

* `Language model`: [BERT-base](https://huggingface.co/google-bert/bert-base-uncased) (**BERT**)
* `Vision model`: [ViT](https://huggingface.co/docs/transformers/model_doc/vit#overview) (**ViT**)
* `Vision-language model`: [ViLT](https://huggingface.co/docs/transformers/model_doc/vilt) (**ViLT**)


##### Procedure:
1. Step 1: (For vision-based models) Prepare the visual input for the words in the SimLex-999 dataset.
2. Step 2: Load and prepare the models.
3. Step 3: Use the models to extract the representations of the words and images.
4. Step 4: Calculate similarity scores from these representations.
5. Step 5: Use Spearman's $\rho$ to measure the correlation between the human-elicited similarity judgements and the model's estimated similarity scores.
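
Steps 4 and 5 can be sketched as follows, assuming that Step 3 has already produced one vector per word. The three-dimensional embeddings below are hypothetical placeholders rather than real model outputs, and the helper `cosine` is written out with NumPy only for illustration; `cosine_similarity` from `sklearn.metrics.pairwise`, imported at the top of this lab, could be used instead.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy embeddings standing in for the representations from Step 3
embeddings = {
    "old": np.array([1.0, 0.0, 0.0]),
    "new": np.array([0.0, 1.0, 0.0]),
    "smart": np.array([1.0, 1.0, 0.0]),
    "intelligent": np.array([1.0, 0.9, 0.1]),
    "hard": np.array([0.0, 1.0, 1.0]),
    "difficult": np.array([0.2, 1.0, 0.5]),
}

# (word1, word2, human similarity score) triples from SimLex-999
pairs = [
    ("old", "new", 1.58),
    ("smart", "intelligent", 9.2),
    ("hard", "difficult", 8.77),
]

# Step 4: one model similarity score per word pair
model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
human_scores = [score for _, _, score in pairs]

# Step 5: rank correlation between human and model scores
rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman's rho = {rho:.3f}")
```

With real models, only the `embeddings` dictionary would change: it would hold the BERT, ViT, or ViLT representations, so the same scoring code can be reused for all three model classes.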