Multi-modal features toolkit in Python


Multi-modal features toolkit in Python, developed at the University of Cambridge Computer Laboratory. The aim of this toolkit is to make it easier for researchers to use multi-modal features. Both image and sound (i.e., visual and auditory representations) are supported.

The following models are currently available:

  1. CNN: Convolutional neural network representations for images
  2. BoVW: Bag-of-visual-words for images, using DSIFT local descriptors
  3. BoAW: Bag-of-audio-words for sound files, using MFCC local descriptors
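
The two bag-of-words models share the same basic idea: each local descriptor is assigned to its nearest cluster centroid, and the file is represented by the resulting counts. The following is a minimal sketch of that quantization step in plain Python. It is not the toolkit's own implementation; the function names and the toy centroids are made up for illustration, and in practice the centroids would come from clustering real DSIFT or MFCC descriptors.

```python
def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to the descriptor (squared Euclidean)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def bag_of_words(descriptors, centroids):
    """Count how many local descriptors fall into each cluster."""
    histogram = [0] * len(centroids)
    for descriptor in descriptors:
        histogram[nearest_centroid(descriptor, centroids)] += 1
    return histogram


# Toy data (hypothetical): two centroids, three 2-d "descriptors".
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
toy_histogram = bag_of_words(toy_descriptors, toy_centroids)
```

Whether and how the toolkit normalizes these counts is not specified here, so the sketch returns raw counts.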

Getting started

The following dependencies need to be installed: numpy, scipy, scikit-learn and yaml. You may need to install Pillow for reading images from files into arrays. If you want to use the CNN model, you will also need to install Caffe. For BoAW you will need to install librosa as well.

Installing the main dependencies on Ubuntu:

sudo apt-get install build-essential python-dev python-setuptools \
                python-numpy python-scipy python-sklearn python-yaml
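
To check which of the dependencies listed above are importable, a small script along these lines may help. It is only a sketch: the split into required and optional modules is our reading of the paragraph above, and the module names are the usual import names of those packages (for example, Pillow is imported as PIL and scikit-learn as sklearn).

```python
import importlib.util

# Import names for the packages mentioned above (grouping is an assumption).
REQUIRED = ["numpy", "scipy", "sklearn", "yaml"]
OPTIONAL = {
    "PIL": "reading images from files (Pillow)",
    "caffe": "CNN model",
    "librosa": "BoAW model",
}


def find_missing(modules):
    """Return the module names that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in find_missing(REQUIRED):
        print("missing required module: " + name)
    for name in find_missing(OPTIONAL):
        print("missing optional module: " + name + " (" + OPTIONAL[name] + ")")
```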


If you use this toolkit in your work, please cite the following paper:

D. Kiela (2016). MMFEAT: A Toolkit for Extracting Multi-Modal Features. Proceedings of ACL 2016: System Demonstrations, Berlin, Germany.


The toolkit comes with two tools that do not require any knowledge of Python and that can be run from the command line.

The first tool is the miner, for mining images or sound files. Before you can use the miner you need to acquire API keys from Google, Bing, FreeSound or Flickr and set them in miner.yaml (see miner-example.yaml for an example). ImageNet does not require an API key. The query_file argument should point to a file that contains a list of queries, one query per line.

Usage: [-h] [-n NUM_FILES]
                {bing,google,freesound,flickr,imagenet} query_file data_dir


# Get 10 images per query term from Bing and store in a data directory
python -n 10 bing list_of_queries.txt ./img_data_dir
# Get 100 sound files per query term from FreeSound and store in a data directory
python -n 100 freesound list_of_queries.txt ./sound_data_dir
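
The query file is plain text with one query per line, so it can be written by hand or generated. As a minimal sketch (the helper names are ours, not part of the toolkit), the following writes and reads such a file:

```python
from pathlib import Path


def write_query_file(queries, path):
    """Write one query per line, skipping empty entries."""
    cleaned = [q.strip() for q in queries if q.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_query_file(path):
    """Read the queries back, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, `write_query_file(["elephant", "happiness"], "list_of_queries.txt")` produces a file that can be passed as the query_file argument.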

The second tool extracts representations from a data directory. The data directory needs to contain an index file (index.pkl), which the miner generates automatically or which you can construct manually.

Usage: [-h] [-gpu] [-k K] [-c CENTROIDS] [-o {pickle,json,csv}]
                  [-s SAMPLE_FILES] [-m {vgg,alexnet}] [-v]
                  {boaw,bovw,cnn} data_dir out_file


# Extract BoVW representations with k=100, sampling 10% for clustering, and store as a Python pickle.
python -k 100 -s 0.1 bovw ./img_data_dir ./output_vectors.pkl
# Extract CNN representations, using an AlexNet on a GPU, and store as a JSON file.
python -gpu -o json cnn ./img_data_dir ./output_vectors.json
# Extract BoAW representation with k=300, sampling 50% for clustering, and store as a CSV file.
python -k 300 -s 0.5 -o csv boaw ./sound_data_dir ./output_vectors.csv
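
The output file can then be read back in Python. The sketch below picks a loader based on the file extension used in the examples above. It makes no assumption about the internal layout of the data, which is not described here: you get back whatever object the pickle or JSON file contains, or a list of rows for CSV. The function name is ours.

```python
import csv
import json
import pickle
from pathlib import Path


def load_vectors(path):
    """Load an output file written as pickle (.pkl), JSON (.json) or CSV (.csv)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pkl":
        # Only unpickle files that you created yourself or otherwise trust.
        with path.open("rb") as handle:
            return pickle.load(handle)
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.reader(handle))
    raise ValueError("unsupported output format: " + suffix)
```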

To extract layers from the CNN you need to tell the toolkit where it can find Caffe. For example (run this, or simply add to your ~/.bashrc):

export CAFFE_ROOT_PATH="/usr/local/caffe/"
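
If you want to verify the setting before starting a long extraction run, a pre-flight check like the following can be used. This is only a sketch of such a check; how the toolkit itself reads the variable is not shown here, and the function name is hypothetical.

```python
import os
from pathlib import Path


def caffe_root(env=None):
    """Return the directory named by CAFFE_ROOT_PATH, or raise if it is unusable."""
    env = os.environ if env is None else env
    value = env.get("CAFFE_ROOT_PATH")
    if not value:
        raise RuntimeError("CAFFE_ROOT_PATH is not set")
    root = Path(value).expanduser()
    if not root.is_dir():
        raise RuntimeError("CAFFE_ROOT_PATH is not a directory: " + str(root))
    return root
```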


1. Similarity and relatedness (1-simrel)

The demo downloads images from either Google or Bing and creates BoVW or CNN representations. It then evaluates similarity and relatedness (i.e., Spearman correlation with human similarity ratings) on the well-known MEN and SimLex-999 datasets. See, e.g., Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics.
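
The evaluation step can be illustrated with a small self-contained sketch. It assumes that the model's score for a word pair is the cosine similarity of the two representations (a common choice, but an assumption here) and computes the Spearman correlation as the Pearson correlation of average ranks. The function names and the toy data are ours; this is not the demo's code and the numbers are not from MEN or SimLex-999.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, gold):
    """Correlate cosine similarities with human ratings.

    vectors: dict mapping word -> vector (assumed layout)
    gold: iterable of (word1, word2, human_rating); pairs with a missing word are skipped
    Returns (spearman_rho, number_of_pairs_used).
    """
    predicted, ratings = [], []
    for word1, word2, rating in gold:
        if word1 in vectors and word2 in vectors:
            predicted.append(cosine(vectors[word1], vectors[word2]))
            ratings.append(rating)
    return spearman(predicted, ratings), len(ratings)
```

If SciPy is installed, `scipy.stats.spearmanr` computes the same coefficient and also returns a p-value.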

2. ESP Game dataset (2-esp)

The demo downloads the ESP Game dataset sample and extracts it. It then builds an index from the label lookup and obtains BoVW or CNN representations for the thumbnail images. The representations are stored in a file for later use.

3. Matlab interfacing (3-matlab)

A simple demo to show that you can get local descriptors from Matlab and load them. This means you can use VLFeat or other libraries for getting descriptors (for instance, PHOW) as well.

4. Music instrument clustering (4-instruments)

The demo downloads sound files for eight instruments from two classes and obtains auditory representations. It then clusters the representations and reports the outcomes. See Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception.

5. Image dispersion scores (5-dispersion)

The demo downloads images for "elephant" and "happiness" and calculates the image dispersion scores of these concepts. See Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More.
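
As a rough illustration of the measure, image dispersion is commonly defined as the average pairwise cosine distance between the image representations of a concept. The sketch below follows that definition. Whether the demo computes it in exactly this way is an assumption, and the function names are ours.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm


def image_dispersion(vectors):
    """Average cosine distance over all unordered pairs of image vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("at least two image vectors are needed")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Under this definition, a concept whose images all look alike gets a dispersion close to 0, while a concept with very varied images gets a higher score.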

6. Image search plot (6-searchplot)

A simple plotting demo of images returned by various search engines. Requires matplotlib.

7. CNN layers (7-cnnlayers)

Shows how you can extract different layers from the CNN models.

8. ImageNet (8-imagenet)

Uses ImageNet to retrieve images for provided synsets. This requires NLTK and the NLTK WordNet corpus to be installed.