This repository contains:
- Instructions on how to get the OCR extraction (text) and labels through a request form.
- Dataset download script and baselines
Please make sure you understand the data source here.
The dataset contains memes that may be offensive to readers. Please see the Dataset Statement section of our paper here to understand the risks before you proceed!
To request the dataset (labels and extracted texts), please fill out the following form.
After submitting the required info, you will see a link to a folder containing the datasets in a zip format and the password to uncompress the files.
Note: this dataset can only be used for non-commercial, research purposes.
Don't hesitate to report an issue if something is broken, or if you have further questions or feedback. Email: figmemes22 AT gmail DOT com
https://www.ukp.tu-darmstadt.de/
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
We present the code we used to run the different baseline models.
Install all Python requirements listed in requirements.txt (check here to see how to install PyTorch on your system).
Tested with torch==1.10.0 and transformers==4.17.0; newer versions will likely work, but there are no guarantees.
You can install the requirements like this:
pip install --upgrade pip
pip install -r requirements.txt
Data (FigMemes):
The script expects the data to reside in {data_root} (specified as an input argument when running the script). The {data_root} must contain:
- figmemes_annotations.tsv with the labels and style annotations,
- figmemes_ocrs.tsv with the OCR captions,
- a {name}_split.tsv that assigns each example to the train, validation, or test set, as well as (optionally) a cluster for OOD experiments.
CLIP and image-only models expect the images in {data_root}/images.
You can download the images following these instructions.
Data (other):
You need to obtain the images and labels for the other datasets (Memotion 2, MAMI) yourself.
The script again expects a {name}_split.tsv that re-splits the data (NOTE: we append the image folder name to the image name to differentiate between same-name images from different original splits).
The script also expects our style annotation TSVs in {data_root}/style_labels.
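The split-file handling above can be sketched as follows. This is purely illustrative: the actual column layout of {name}_split.tsv is defined by the repository's data-loading code; the "img" and "split" column names are assumptions, and the exact way the folder name is combined with the image name may differ.

```python
import csv
import io

# Illustrative sketch only: the real column layout of {name}_split.tsv is
# defined by the repository's data-loading code. We assume "img" and "split"
# columns for demonstration. Per the note above, the image folder name is
# combined with the file name so that same-name images from different
# original splits stay distinct (the "{folder}_{name}" form is our guess).
def write_split_tsv(rows, fh):
    """rows: iterable of (folder, image_name, split) tuples."""
    writer = csv.DictWriter(fh, fieldnames=["img", "split"], delimiter="\t")
    writer.writeheader()
    for folder, img, split in rows:
        writer.writerow({"img": f"{folder}_{img}", "split": split})

buf = io.StringIO()
write_split_tsv([("memotion2_train", "1.jpg", "train")], buf)
```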
Image Features: If you use VinVL or BERT+CLIP, you need to pre-extract the image features.
See here for a description.
Feature files have to be in {data_root}/features.
VinVL checkpoint: You have to download the VinVL checkpoint manually. See here for more information. You can run
azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa' base --recursive
and then you will find the model in vqa/base/checkpoint-2000000. This is not a VQA-finetuned model but the checkpoint after pretraining (as can be seen from the instructions here: https://github.com/microsoft/Oscar/blob/master/VinVL_MODEL_ZOO.md).
Our code uses the Hugging Face Trainer class, so most config values are defined there.
Image-only and multimodal CLIP:
python -u run_classification_img.py \
--linear_probe False \ #Toggle linear probe or full finetuning
--model_name_or_path "RN50x4" \ #CLIP model name or ConvNext model name
--clip_text True \ #False to use image-only CLIP
--class_weights True \ #False to not up-weight positive examples
--data_root "/path/to" \
--split text_cluster \ #Name of the split file: {split}_split.tsv
--train_split $train_cluster \ #For OOD experiments: name of style or cluster to train on
--test_split "cluster_0,cluster_1,no_text" \ #For OOD experiments: name of styles or clusters to test on
--validation_split $train_cluster \ #For OOD experiments: style/cluster for dev set
--metric_for_best_model "f1" \
--lr_scheduler_type linear \
--greater_is_better True \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 128 \
--num_train_epochs 20 \
--warmup_ratio 0.00 \
--dataloader_num_workers 8 \
--gradient_accumulation_steps 1 \
--max_grad_norm 1.0 \
--learning_rate 2e-5 \
--weight_decay 0.05 \
--output_dir $output \
--do_predict \
--do_train \
--do_eval \
--evaluation_strategy epoch \
--save_strategy epoch \
--save_total_limit 2 \
--load_best_model_at_end True \
--logging_steps 25 \
--seed $seed \
--fp16 \
--report_to all \
--all_feature_type ""
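The --class_weights flag up-weights positive examples. One common scheme for multi-label classification (a sketch, not necessarily the exact formula the repository uses) sets each class's positive weight to #negatives / #positives, e.g. as the pos_weight argument of torch.nn.BCEWithLogitsLoss:

```python
# Sketch of one common way to up-weight positive examples in multi-label
# classification. Illustrative only; the repository's exact weighting
# formula may differ.
def positive_class_weights(labels):
    """labels: list of multi-hot lists, one per example.
    Returns one weight per class: (#negatives / #positives)."""
    n = len(labels)
    num_classes = len(labels[0])
    weights = []
    for c in range(num_classes):
        pos = sum(row[c] for row in labels)
        neg = n - pos
        weights.append(neg / pos if pos else 1.0)
    return weights
```

Rare positive classes thus contribute proportionally more to the loss, which matters for imbalanced figurative-language labels.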
Text-only or VinVL/Bert+CLIP:
python -u run_classification.py \
--all_feature_type "clip-RN50x4-6,vinvl-obj36" \ # "" for text-only or all used image features (avoids redundant dataset creation with different features)
--feature_type vinvl-obj36 \ # Features to use for model. "" for text-only
--visual_feat_dim 2054 \ #image feature dimension; 2054 for VinVL, 2560 for RN50x4
#--no_vision \ # Toggle to use no image features for text-only
--data_root "/path/to" \
--class_weights True \ #False to not up-weight positive examples
--model_name_or_path "bert-base-uncased" \
--split text_cluster \ #Name of the split file: {split}_split.tsv
--train_split $train_cluster \ #For OOD experiments: name of style or cluster to train on
--test_split "cluster_0,cluster_1,no_text" \ #For OOD experiments: name of styles or clusters to test on
--validation_split $train_cluster \ #For OOD experiments: style/cluster for dev set
--metric_for_best_model "f1" \
--greater_is_better True \
--lr_scheduler_type linear \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 128 \
--num_train_epochs 20 \
--warmup_ratio 0.00 \
--dataloader_num_workers 8 \
--gradient_accumulation_steps 1 \
--max_grad_norm 1.0 \
--learning_rate 3e-5 \
--weight_decay 0.05 \
--output_dir $output \
--do_predict \
--do_train \
--do_eval \
--evaluation_strategy epoch \
--save_strategy epoch \
--save_total_limit 2 \
--load_best_model_at_end True \
--logging_steps 25 \
--seed $seed \
--report_to all \
--fp16
CLIP-MM-OOD: This is identical to CLIP above except that the script is changed and one option is added to set the learning rate for the second (full fine-tune) stage.
python -u run_classification_img_twostage.py \
--learning_rate2 2e-5 \
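Conceptually, the two-stage run trains only the classifier head first (the linear-probe stage) and then unfreezes everything for a full fine-tune at --learning_rate2. A hypothetical sketch of that plan (the function and field names are illustrative, not the repository's actual API):

```python
# Conceptual sketch of the two-stage schedule; names are illustrative,
# not the repository's actual API.
def two_stage_plan(probe_lr, finetune_lr):
    """Stage 1: linear probe (head only). Stage 2: full fine-tune."""
    return [
        {"stage": 1, "trainable": ["head"], "lr": probe_lr},
        {"stage": 2, "trainable": ["backbone", "head"], "lr": finetune_lr},
    ]
```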
If you find this repository helpful, feel free to cite the following publication:
@inproceedings{liu-etal-2022-figmemes,
title = "{F}ig{M}emes: A Dataset for Figurative Language Identification in Politically-Opinionated Memes",
author = "Liu, Chen and
Geigle, Gregor and
Krebs, Robin and
Gurevych, Iryna",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.476",
pages = "7069--7086",
}