A Multilingual Dataset For Cross-lingual News Recommendation

andreeaiana/xMIND

xMIND

CC BY-NC-SA 4.0

Description

xMIND is a large-scale multilingual news dataset for multi- and cross-lingual news recommendation. It is derived from the English MIND (https://msnews.github.io/) dataset using open-source neural machine translation (NLLB 3.3B). xMIND contains 130K news articles translated into 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. The goal of xMIND is to serve as a benchmark dataset for news recommendation and to foster broader research into multilingual and cross-lingual news recommendation for speakers of both high- and low-resource languages.

The table below summarizes information about each language included in xMIND, according to the following criteria:

  • Code: the three-letter ISO 639-3 code of the language;
  • Language: the language name from WALS;
  • Script: the English name of the script;
  • Macro-area, Family, and Genus: the macro-area, language family, and genus from WALS and Glottolog;
  • Res.: the classification of the language as low-resource or high-resource.

| Code | Language | Script | Macro-area | Family | Genus | Res. |
|------|----------|--------|------------|--------|-------|------|
| SWH | Swahili | Latin | Africa | Niger-Congo | Bantu | high |
| SOM | Somali | Latin | Africa | Afro-Asiatic | Lowland East Cushitic | low |
| CMN | Mandarin Chinese | Han | Eurasia | Sino-Tibetan | Sinitic | high |
| JPN | Japanese | Japanese | Eurasia | Japonic | Japanesic | high |
| TUR | Turkish | Latin | Eurasia | Altaic | Turkic | high |
| TAM | Tamil | Tamil | Eurasia | Dravidian | Dravidian | low |
| VIE | Vietnamese | Latin | Eurasia | Austro-Asiatic | Vietic | high |
| THA | Thai | Thai | Eurasia | Tai-Kadai | Kam-Tai | high |
| RON | Romanian | Latin | Eurasia | Indo-European | Romance | high |
| FIN | Finnish | Latin | Eurasia | Uralic | Finnic | high |
| KAT | Georgian | Georgian | Eurasia | Kartvelic | Georgian-Zan | low |
| HAT | Haitian Creole | Latin | North-America | Indo-European | Creoles and Pidgins | low |
| IND | Indonesian | Latin | Papunesia | Austronesian | Malayo-Sumbawan | high |
| GRN | Guarani | Latin | South-America | Tupian | Maweti-Guarani | low |

Download

The xMIND dataset is free to download for research purposes.

We release xMIND in two versions, corresponding to the original splits of MIND: xMINDsmall (training and validation sets) and xMINDlarge (training, validation, and test sets).

The zip-compressed TSV file containing the translated news, for each language and each split, can be downloaded from xMIND.

Automatically download

The download script automatically downloads the dataset for the chosen language, dataset size, and dataset split. By default, the script downloads the zipped dataset, extracts the TSV news file, and deletes the zip file.

The following commands can be used to choose which dataset version to download:

  • Download xMIND for all languages, all dataset sizes, all dataset splits (default setting):

        python download.py
  • Download only one or more languages:

        python download.py --languages {language_1} {language_2}

    Use the ISO 639-3 code of the language from the table above to choose a specific language.

  • Download only one or more dataset sizes:

        python download.py --sizes {dataset_size_1} {dataset_size_2}

    Supported dataset sizes: large or small.

  • Download only one or more dataset splits:

        python download.py --splits {dataset_split_1} {dataset_split_2} {dataset_split_3}

    Supported dataset splits: train, dev, or test.

  • Download without extracting the zipped file:

        python download.py --extract_archive 
  • Download without deleting the zipped file:

        python download.py --clean_archive 
  • The downloaded dataset is by default stored in a newly created directory called xmIND. Change the destination directory as follows:

        python download.py --dst_dir 'my_folder' 
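
The options above can presumably be combined in one call. The following is a minimal sketch that only assembles the command line from the flags documented in this README. The helper name `build_download_command` is hypothetical and not part of the repository, and whether the script expects upper- or lower-case language codes is not stated here, so the codes are written as they appear in the table.

```python
import sys


def build_download_command(languages=None, sizes=None, splits=None, dst_dir=None):
    """Assemble the argv list for download.py from the documented options.

    Hypothetical helper: only the flag names come from the README.
    Options left as None are omitted, so the script's defaults apply.
    """
    cmd = [sys.executable, "download.py"]
    if languages:
        cmd += ["--languages", *languages]
    if sizes:
        cmd += ["--sizes", *sizes]
    if splits:
        cmd += ["--splits", *splits]
    if dst_dir:
        cmd += ["--dst_dir", dst_dir]
    return cmd


# Example: Romanian and Finnish, small version, train and dev splits.
cmd = build_download_command(
    languages=["RON", "FIN"],
    sizes=["small"],
    splits=["train", "dev"],
    dst_dir="my_folder",
)
print(" ".join(cmd[1:]))
# prints: download.py --languages RON FIN --sizes small --splits train dev --dst_dir my_folder
```

To actually start the download from Python, the list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `download.py`.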

Data Format

Each news.tsv file contains the translated news; it has 3 columns, separated by the tab symbol:

  • nid: News ID of the article, identical to the article's news ID in the MIND dataset.
  • title: The title of the news translated into the target language.
  • abstract: The abstract of the news (when provided in the original MIND dataset) translated into the target language.

An example for Romanian (RON) is shown below:

| nid | title | abstract |
|-----|-------|----------|
| N49265 | Aceste reţete cu sos de afine sunt perfecte pentru cina de Ziua Recunoştinţei. | Nu vei mai vrea niciodată versiunea cumpărată din magazin. |
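
A file in this format can be read with Python's standard `csv` module. The sketch below is one possible reader, not the loader shipped with the dataset. The function name `read_xmind_news` is hypothetical, and because this README does not say whether `news.tsv` starts with a header row, that is left as a parameter.

```python
import csv
import io


def read_xmind_news(handle, has_header=True):
    """Parse a tab-separated news stream into {nid: (title, abstract)}.

    Assumption: the presence of a header row is not documented, so it is
    controlled by `has_header`. Missing abstracts become empty strings.
    """
    # QUOTE_NONE keeps quotation marks inside titles as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    news = {}
    for i, row in enumerate(reader):
        if not row or (i == 0 and has_header):
            continue
        row = row + [""] * (3 - len(row))  # pad rows without an abstract
        nid, title, abstract = row[0], row[1], row[2]
        news[nid] = (title, abstract)
    return news


# The Romanian example from above, written as tab-separated text.
sample = (
    "nid\ttitle\tabstract\n"
    "N49265\tAceste reţete cu sos de afine sunt perfecte pentru cina de "
    "Ziua Recunoştinţei.\tNu vei mai vrea niciodată versiunea cumpărată "
    "din magazin.\n"
)
news = read_xmind_news(io.StringIO(sample))
title, abstract = news["N49265"]
print(title)
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object.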

Integration with MIND

The news in xMIND can be easily combined with the corresponding source news in English from the MIND dataset based on the unique news IDs. This should help researchers use xMIND in conjunction with the additional news annotations (e.g., categories, subcategories, named entities) and user behavior information provided in MIND.

To facilitate a seamless integration of xMIND with the MIND data, we provide scripts for loading the dataset and constructing bilingual user consumption patterns in the NewsRecLib library.
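
Independently of NewsRecLib, the join on news IDs can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the loading code from NewsRecLib. It assumes the MIND `news.tsv` layout described in the MIND documentation (eight tab-separated columns, no header row), and the helper names `read_tsv` and `merge_on_nid` are hypothetical. The sample rows are toy placeholders, not real dataset entries.

```python
import csv
import io

# Column layout of MIND's news.tsv (assumed from the MIND documentation).
MIND_COLUMNS = ["nid", "category", "subcategory", "title", "abstract",
                "url", "title_entities", "abstract_entities"]
XMIND_COLUMNS = ["nid", "title", "abstract"]


def read_tsv(handle, columns, skip_header=False):
    """Read tab-separated rows into dicts keyed by the given column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [dict(zip(columns, row)) for row in reader if row]
    return rows[1:] if skip_header else rows


def merge_on_nid(mind_rows, xmind_rows, lang):
    """Attach the translated title and abstract to MIND rows with the same nid."""
    translated = {row["nid"]: row for row in xmind_rows}
    merged = []
    for row in mind_rows:
        target = translated.get(row["nid"])
        if target is None:
            continue  # no translation available for this article
        merged.append({
            **row,
            f"title_{lang}": target.get("title", ""),
            f"abstract_{lang}": target.get("abstract", ""),
        })
    return merged


# Toy placeholder rows, only to show the shape of the data.
mind_sample = (
    "N0\tcat\tsubcat\tEnglish title\tEnglish abstract\turl\t[]\t[]\n"
    "N1\tcat\tsubcat\tUntranslated title\tUntranslated abstract\turl\t[]\t[]\n"
)
xmind_sample = "nid\ttitle\tabstract\nN0\tTranslated title\tTranslated abstract\n"

mind_rows = read_tsv(io.StringIO(mind_sample), MIND_COLUMNS)
xmind_rows = read_tsv(io.StringIO(xmind_sample), XMIND_COLUMNS, skip_header=True)
merged = merge_on_nid(mind_rows, xmind_rows, lang="ron")
print(merged[0]["category"], merged[0]["title_ron"])
```

Keeping the English fields and adding language-suffixed columns leaves the MIND annotations (categories, subcategories, entities) available next to the translated text, which is the combination the paragraph above describes.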

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

If you intend to use, adapt, or share xMIND, particularly together with additional news and click behavior information from the original MIND dataset, please read and reference the Microsoft Research License Terms of MIND.

Citation

If you use xMIND, please cite the following publication:

@misc{iana2024mind,
      title={MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation}, 
      author={Andreea Iana and Goran Glavaš and Heiko Paulheim},
      year={2024},
      eprint={2403.17876},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Also consider citing the following:

@inproceedings{wu2020mind,
  title={Mind: A large-scale dataset for news recommendation},
  author={Wu, Fangzhao and Qiao, Ying and Chen, Jiun-Hung and Wu, Chuhan and Qi, Tao and Lian, Jianxun and Liu, Danyang and Xie, Xing and Gao, Jianfeng and Wu, Winnie and others},
  booktitle={Proceedings of the 58th annual meeting of the association for computational linguistics},
  pages={3597--3606},
  year={2020}
}
