The Arabic Keyphrase Extraction Corpus

This repository contains the Arabic Keyphrase Extraction Corpus (AKEC) built by Muhammad Helmy, Marco Basaldella, Eddy Maddalena, Stefano Mizzaro and Gianluca Demartini.

The corpus and the process we used to build it are described in detail in the paper "Towards Building a Standard Dataset for Arabic Keyphrase Extraction Evaluation", presented at the 20th International Conference on Asian Language Processing (IALP 2016), held in Tainan, Taiwan, from November 21 to 23, 2016.

You can find a brief statistical overview of the Corpus in the docs folder, or you can see it online at this link.

The corpus

The corpus consists of 160 Arabic documents and their keyphrases. We selected the documents from a variety of sources, and we collected the keyphrases through a large-scale crowdsourcing experiment.

The repository is structured as follows.

├── docs 
├── documents 
│   ├── pure
│   └── raw
└── keyphrases
    ├── sort_frequency
    └── sort_lm

The docs folder contains the statistical analysis mentioned above.

The documents folder contains the documents. We provide the documents in their original form (plus some formatting) in the raw folder and in pure form, i.e. with diacritics removed, in the pure folder.
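To illustrate what "pure" means here: Arabic diacritics (harakat) are Unicode combining marks, so they can be stripped with standard Unicode tooling. The snippet below is only a sketch of the idea, not necessarily the exact procedure we used to produce the pure folder:

```python
import unicodedata

def strip_diacritics(text):
    # Arabic diacritics (fatha, kasra, damma, sukun, shadda, tanwin, ...)
    # all have Unicode category "Mn" (nonspacing combining mark);
    # dropping every such character leaves the bare letters.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(strip_diacritics("كِتَابٌ"))  # prints the bare word: كتاب
```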

The keyphrases folder contains the crowd-assigned keyphrases. We provide four files: two contain the keyphrases ranked by their frequency among the crowd workers' selections (in the sort_frequency folder), and the other two contain the keyphrases ranked using a simple language model generated from the crowd selections (in the sort_lm folder). For each ranking, we provide the keyphrases both in their pure form (pure.txt) and in their lemmatized form (lemmatized.txt).

Test/train split

We divided the corpus into 100 documents for training and 60 documents for testing. The filenames of the training and testing documents are contained in train.ids and test.ids respectively.

We provide a convenient shell script that does the dirty job of getting the training/testing documents and putting them into two folders called, unsurprisingly, train and test. To split the documents in their original form, navigate to the folder where you downloaded the corpus and run the script with the raw argument; to split the documents in their pure form, pass pure instead of raw.
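If you would rather do the split programmatically, the equivalent logic is a few lines of Python. This is a sketch, assuming train.ids and test.ids list one document filename per line:

```python
import os
import shutil

def split_corpus(form="raw"):
    # Copy each document listed in train.ids / test.ids from
    # documents/<form>/ into a train/ or test/ folder.
    # form is "raw" (original documents) or "pure" (diacritics removed).
    for split in ("train", "test"):
        os.makedirs(split, exist_ok=True)
        with open(split + ".ids") as ids:
            for name in ids.read().split():
                shutil.copy(os.path.join("documents", form, name), split)
```

Run split_corpus("raw") or split_corpus("pure") from the corpus root folder.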


If you use our dataset, please cite the reference paper:

    @inproceedings{helmy2016akec,
        author={Helmy, Muhammad and Basaldella, Marco and Maddalena, Eddy and Mizzaro, Stefano and Demartini, Gianluca},
        title={Towards Building a Standard Dataset for Arabic Keyphrase Extraction Evaluation},
        booktitle={Proceedings of the 20th International Conference on Asian Language Processing (IALP 2016)},
        year={2016},
        address={Tainan (Taiwan)}
    }

To perform NLP in Arabic, we used the AraMorph software, which is a Java port of the Buckwalter Arabic Morphological Analyzer.

We selected the documents for our corpus from 4 different sources:


Please be aware that some of the resources we used to assemble our corpus are licensed for research purposes only. For this reason, we also make our corpus available for research purposes only.