Artificial-Memory-Lab/imageability

Instructions

Installation

Set up your virtual environment first using Python 3.12. Then install this package and its dependencies with:

pip install --upgrade pip
pip install -e .

We have also released to HuggingFace the raw text datasets and all embeddings necessary to reproduce our experiments. You can download the files individually from our repo, or clone the entire dataset repo with the following commands.

brew install git-lfs  # macOS; on other platforms, install git-lfs with your package manager
git lfs install
git clone git@hf.co:datasets/artificial-memory-lab/imageability-geometry-data

Preparing Embedding Collections

The Neighborhood Stability Measure (NSM) operates on a collection of semantic embeddings of text. Naturally, we need to use a semantic embedding model to transform a text collection into embeddings.

We have included a script in the repo to do just that. At the risk of stating the obvious, you must embed the text collection as well as the "queries" (i.e., terms whose imageability or concreteness we wish to predict) using the same model.

Here is an example of how you can embed the MS COCO captions, assuming that the HuggingFace data is located in the current directory.

python imageability/scripts/embed.py \
       --input imageability-geometry-data/raw_data/coco_captions_2015.raw \
       --output <path-to-embedding-collection> \
       --model [allminilm_l6|gte_base|gte_large] \
       --device [cpu|cuda|mps]

Some embedding models are expensive to run and may take a few hours on larger collections. To make our experiments easier to reproduce, we have uploaded the embedding collections to HuggingFace, so you can skip the embedding step and download the collections directly. See the section above for instructions.

Computing the NSM

The second and final step is to execute the script that computes the NSM for a set of query words against an embedding collection. The script executes three stages:

  • ANN indexing: The script first indexes the entire embedding collection for a clustering-based ANN search algorithm implemented with the Faiss library. If this proves too expensive, you can download the index directly from HuggingFace.
  • Parameter tuning: Using a validation set of queries, it finds the optimal number of neighbors to analyze.
  • NSM computation: Using the optimal value found in the previous step, it computes the NSM for the remaining (eval) queries and reports the Spearman correlation coefficient.
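The neighborhood retrieval behind these stages can be sketched as follows. The repo uses a clustering-based Faiss index for speed on large collections; this illustrative stand-in (all names hypothetical) uses exact brute-force search over unit-normalized embeddings, which returns the same neighbors an exact cosine-similarity index would.

```python
import numpy as np

def nearest_neighbors(collection: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Exact k-NN by cosine similarity. The repo's script uses a
    clustering-based Faiss ANN index for the same purpose; this
    brute-force version is only an illustrative stand-in."""
    # Unit-normalize so the dot product equals cosine similarity.
    c = collection / np.linalg.norm(collection, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = q @ c.T                       # (num_queries, num_docs)
    # Indices of the k most similar collection items per query.
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384))      # e.g. all-MiniLM-L6 embeddings are 384-d
words = rng.normal(size=(5, 384))
neighbors = nearest_neighbors(docs, words, k=64)
print(neighbors.shape)                   # (5, 64)
```

The NSM itself is then computed from each query's retrieved neighborhood; see the script for its exact definition.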

Here is an example of how you may run the entire procedure:

python imageability/scripts/measure_density.py \
       --output <path-to-output-directory> \
       --prefix demo \  # Generated figures will have this prefix in their name.
       --algorithm nsm \
       --collections imageability-geometry-data/allminilm-l6-v2/coco_captions_2015.npy \
       --queries imageability-geometry-data/allminilm-l6-v2/image_word_score.npy \
       --gt imageability-geometry-data/raw_data/image_word_score.tsv \
       --nn_min 64 --nn_max 4096 --nn_step 64 --validation 0.2
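Conceptually, the tuning and evaluation stages reduce to sweeping the neighborhood size over `[nn_min, nn_max]` in steps of `nn_step`, keeping the value with the best Spearman correlation on the validation split, and reporting the correlation on the held-out eval queries. A minimal sketch (function and variable names are hypothetical; `score_fn` stands in for the NSM computation itself, whose definition lives in the repo):

```python
from scipy.stats import spearmanr

def tune_and_evaluate(score_fn, val_queries, val_gt, eval_queries, eval_gt,
                      nn_min=64, nn_max=4096, nn_step=64):
    """Pick the neighborhood size k with the best Spearman correlation on
    the validation queries, then report the correlation on the eval queries.
    `score_fn(queries, k)` is a hypothetical stand-in for the NSM itself."""
    grid = range(nn_min, nn_max + 1, nn_step)
    # Validation sweep: correlation between predicted scores and ground truth.
    best_k = max(grid, key=lambda k: spearmanr(score_fn(val_queries, k), val_gt)[0])
    # Final evaluation on the held-out queries with the tuned k.
    rho = spearmanr(score_fn(eval_queries, best_k), eval_gt)[0]
    return best_k, rho
```

With `--validation 0.2`, one fifth of the queries would play the role of `val_queries` and the rest would be evaluated.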
