Automatic Generation of Topic labels

This repository contains the source code and data used for the paper:

Automatic Generation of Topic Labels (2020). Areej Alokaili, Nikolaos Aletras and Mark Stevenson. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), July 25–30, 2020, Virtual Event, China. Pre-print

(A) Install required libraries

Python 3.6.9 is used.

  1. TensorFlow V2
  2. NumPy
  3. scikit-learn
  4. ipykernel

The libraries below are needed for evaluation only; you can skip them if you want to use a different evaluation metric than BERTScore:

  1. sentencepiece
  2. transformers
  3. bert-score
  4. matplotlib
  5. pandas

Run pip install -r requirements.txt to install all required libraries.
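For reference, a requirements.txt covering the libraries listed above might look like the following (no versions are pinned here beyond TensorFlow 2, since exact versions are not stated; adjust to your environment):

```text
tensorflow>=2.0
numpy
scikit-learn
ipykernel
# evaluation only
sentencepiece
transformers
bert-score
matplotlib
pandas
```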

(B) Training

To run the model (data are processed and ready, only training is needed):

  • Navigate to topic_labelling/

    1. To train the model with inputs = top-30 terms from a Wikipedia article and outputs = Wikipedia titles:
       python -m 'bigru_bahdanau_attention' -d 'wiki_tfidf'
    2. To train the model with inputs = first-30 words from a Wikipedia article and outputs = Wikipedia titles (refer to the paper for details):
       python -m 'bigru_bahdanau_attention' -d 'wiki_sent'
  • Training stops when no improvement is recorded; all checkpoints are saved in training_checkpoints/data_name/.

  • Training options are detailed in the code, or run:

python -m 'bigru_bahdanau_attention' -h
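The model combines a bidirectional GRU encoder with Bahdanau (additive) attention. As a rough illustration of the attention step only, here is a NumPy sketch of the additive scoring function; this is not the repository's implementation, and all names, shapes, and weights below are made up for the example:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(enc_outputs, dec_state, W1, W2, v):
    """Additive attention: score_i = v . tanh(W1 @ h_i + W2 @ s)."""
    # enc_outputs: (T, H) encoder hidden states; dec_state: (H,) decoder state
    scores = np.tanh(enc_outputs @ W1.T + dec_state @ W2.T) @ v  # (T,) alignment scores
    weights = softmax(scores)                                    # attention distribution over timesteps
    context = weights @ enc_outputs                              # (H,) weighted sum of encoder states
    return context, weights

# toy example with random weights
rng = np.random.default_rng(0)
T, H, A = 5, 8, 4          # timesteps, hidden size, attention units (arbitrary)
enc = rng.normal(size=(T, H))
s = rng.normal(size=H)
W1 = rng.normal(size=(A, H))
W2 = rng.normal(size=(A, H))
v = rng.normal(size=A)
context, weights = bahdanau_attention(enc, s, W1, W2, v)
```

The attention weights form a probability distribution over encoder timesteps, and the context vector is the corresponding weighted average of encoder states, which the decoder consumes when generating each output token.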

(C) Inference (generate titles/labels)

  1. Generate TITLES for a subset of Wikipedia articles (1000 articles):
python -m 'bigru_bahdanau_attention' -s 1000 -d 'wiki_tfidf' --load 'NAME_OF_CHECKPOINT'

Replace NAME_OF_CHECKPOINT with the name of your checkpoint. For example: python -m 'bigru_bahdanau_attention' -d 'wiki_tfidf' --load bigru_bahdanau_attention_e_1_valloss_2.19_-2

  2. Generate LABELS for bhatia_topics:
python -m 'bigru_bahdanau_attention' -s 1000 -d 'wiki_tfidf' --load 'NAME_OF_CHECKPOINT' -te 'bhatia_topics'

  3. Generate LABELS for bhatia_topics_tfidf:
python -m 'bigru_bahdanau_attention' -s 1000 -d 'wiki_tfidf' --load 'NAME_OF_CHECKPOINT' -te 'bhatia_topics_tfidf'

  4. Predictions, golds, and topics will be stored in results/data_name/ as:

    • [model_name]_pred.out
    • [model_name]_gold.out
    • [model_name]_topics.out
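The three output files are line-aligned: line i of the _pred.out file corresponds to line i of the _gold.out file. A minimal sketch of loading them into (prediction, gold) pairs; the helper name and the demo file contents are invented for illustration:

```python
import os
import tempfile

def load_pairs(pred_path, gold_path):
    """Read line-aligned prediction and gold files into (pred, gold) pairs."""
    with open(pred_path, encoding="utf-8") as fp, open(gold_path, encoding="utf-8") as fg:
        preds = [line.strip() for line in fp]
        golds = [line.strip() for line in fg]
    assert len(preds) == len(golds), "pred/gold files must be line-aligned"
    return list(zip(preds, golds))

# demo with temporary files standing in for results/data_name/[model_name]_pred.out etc.
d = tempfile.mkdtemp()
with open(os.path.join(d, "model_pred.out"), "w", encoding="utf-8") as f:
    f.write("solar energy\nworld war ii\n")
with open(os.path.join(d, "model_gold.out"), "w", encoding="utf-8") as f:
    f.write("renewable energy\nsecond world war\n")
pairs = load_pairs(os.path.join(d, "model_pred.out"),
                   os.path.join(d, "model_gold.out"))
```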

(D) Evaluation

  1. To measure the similarity between predicted and gold labels, run the evaluation script:
python -g results/path_to_gold_file.out -p results/path_to_predict_file.out
  2. The output includes precision (P), recall (R) and F-score (F).
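The F reported by BERTScore is the harmonic mean of its precision and recall. A one-line sketch of that relationship (plain arithmetic, not the bert-score library itself):

```python
def f_score(p, r):
    """Harmonic mean of precision and recall, as reported by BERTScore."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# e.g. P = 0.80, R = 0.60 gives F = 0.96 / 1.40
f = f_score(0.80, 0.60)
```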

Repository hierarchy

  • code to train the labelling network.

  • code to generate new titles/labels.

  • the neural network structure is defined here.

  • contains helper methods needed throughout the system.

  • extract_additional_terms_for_topics.ipynb: a notebook showing the steps taken to filter topic/label pairs based on the overall human rating and to match them to similar documents to extract additional terms for bhatia_topics_tfidf.

  • script to compute pairwise BERTScore between predicted titles/labels and gold titles/labels.

  • data

    1. wiki_tfidf: contains preprocessed files that are ready to be passed to the model.
    2. wiki_sent: contains preprocessed files that are ready to be passed to the model.
    3. bhatia_topics: contains a csv file with two columns (column 1: topic labels; column 2: topic's top 10 terms).
    4. bhatia_topics_tfidf: contains a csv file with two columns (column 1: topic labels; column 2: topic's top 10 terms + 20 terms from similar documents; the 20 terms are extracted using extract_additional_topic_terms.ipynb).
  • results: this is where the model's outputs are saved as text files.

  • training_checkpoints: model checkpoints are saved here.
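The bhatia_topics files are two-column CSVs of (topic label, top terms). A minimal stdlib sketch for loading such a file; the delimiter, lack of header row, and sample rows below are assumptions, not taken from the actual data:

```python
import csv
import io

def load_topics(csv_text):
    """Parse a two-column CSV of (topic label, space-separated topic terms) rows."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(label, terms.split()) for label, terms in reader]

# toy stand-in for bhatia_topics.csv
sample = ("space exploration,nasa shuttle launch orbit mission\n"
          "stock market,shares trading index price investor\n")
topics = load_topics(sample)
```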

