# Sources

| Used | Dataset Name | Reliable Language Tags | Domain | Related Links | License |
| --- | --- | --- | --- | --- | --- |
| Yes, (V1, V2, V3) | Wikipedia | Yes, however! | Articles | HF_Wikipedia, dumps.wikipedia, wili_2018, Leipzig | CC BY-NC-SA 3.0 |
| Yes, (V1, V2, V3) | SETIMES | Yes | News | Opus_SETIMES | CC-BY-SA 3.0 |
| Yes, (V1, V2, V3) | Tatoeba | Yes, mostly | Crowdsourcing | Tatoeba | CC-BY |
| Yes, (V1, V2, V3) | Global Voices | Yes | News stories | Opus_GlobalVoices | CC BY 3.0 |
| Yes, (V1, V2, V3) | XL-Sum | Yes | BBC News | HF_xlsum, Github_xlsum | CC BY-NC-SA 4.0 |
| Yes, (V1, V2, V3) | Leipzig - News and Newscrawl | Yes | News | Leipzig | CC BY-NC-SA 3.0 |
| Yes, (V1, V2, metadata check for V3) | NLLB_seed | Yes | Professionally translated sentences (Wikipedia domain) | NLLB_Seed | CC-BY-SA 4.0 |
| Yes, (V1, V2, clean for V3) | MT-560v1 | ?, the OpenLID version is much cleaner. | Multiple domains | OpenLID compilation | Apache License 2.0 |
| Yes, (V1, V2, V3) | Autshumato | Yes | Government domain | HF_autshumato | CC Attribution 2.5 South Africa License |
| Yes, (V1, V2, metadata check for V3) | Open Bibles | Yes, but some closely related languages (or languages with similar names) might be confused. | Bible versions | 1000Langs, PBC, CorpusCrawler, PNG, Open-Bibles, Bible.is, ebible, biblenlp-corpus, JHUBC, bible.com | Mostly CC BY-NC-ND |
| Yes (partly), (V1, V2, more cleaning for V3) | JW | Yes, except sign language codes | New World Bible | Masakhane-Mt, JW, JW300 | Usage for your own personal and non-commercial purposes is permitted. However, distribution is not allowed. |
| Yes (partly), (V1, V2, V3) | LTI | ? | Multiple domains | whatlang | Custom license / partly open |
| Yes (partly), (V1, V2, delete some languages for V3) | Arabic (DART, SHAMI, TSAC, PADIC, AOC, Arabic Dialects Dataset, MADAR) | ? | Multiple domains | IADD, MADAR, Arabic Dialects | Multiple open licenses |
| Yes, (V1, V2, V3) | Persian (TEP, MIZAN) | Yes | Literature, subtitles | TEP, MIZAN | TEP: GNU General Public License, MIZAN: CC BY 4.0 |
| Yes, (V1, V2, V3) | TIL Corpus | ? | Multiple domains | TIL-MT | CC BY-NC-SA 4.0 |
| Yes, (V1, V2, V3) | bho-resources | Yes | News | bho-resources | CC BY-NC-SA 4.0 |
| Yes, (V1, V2, V3) | Guaraní Parallel Set | Yes | News | Guaraní Parallel Set | No explicit license |
| Yes, (V1, V2, V3) | HKCanCor | Yes | Transcribed conversations | hkcancor | CC BY 4.0 |
| Yes, (V2, V3) | ai4d challenge (Nyanja) | Yes | News | ai4d-malawi-news | No explicit license |
| Yes, (V2, V3) | Wanca 2016 | ? | Web | wanca2016 | CC BY |
| No, (V2, delete for V3) | smugri | No, after training the V2 model we found that this data is not clean. | News | smugri-data | CC BY 4.0 |
| No, (V2, delete for V3) | finno-ugric | ? | ? | finno-ugric-train | CC BY 4.0 |
| Yes, (V2, V3) | smugri-flores | Yes | Human translation | smugri-flores-testset | CC BY 4.0 |
| Yes, (V2, V3) | Abkhaz National Corpus | Yes | Grammatically annotated text (linguistics, literary studies, history, political and social sciences) | Abkhaz National Corpus, abkhaz_text | Public domain (cc0-1.0) |
| Yes, (V2, V3) | Luo News (Radio Ramogi) | Yes | News | Luo | CC BY 4.0 |
| Yes, (V2, more metadata checks for V3) | Lyrics | Yes, but there might be some issues with the translation part. | Song lyrics | lyricstranslate | Copyright issues prevent distributing the original lyrics. Lyrics on Lyricstranslate.com are licensed through Musixmatch. Translations on Lyricstranslate.com belong to their authors. |
| Yes, (V2, V3) | GlotSparse | Yes | News and articles | HF_GlotSparse, Github_GlotSparse | Public domain (cc0-1.0) |
| Yes, (V2, more metadata checks for V3) | GlotStoryBook | Yes | Storybooks | HF_GlotStoryBook, Github_GlotStoryBook | Public domain (cc0-1.0) |
| Yes (partly), (V2, more cleaning for V3) | Universal Dependencies v2.12 | Yes, mostly | Multiple domains | UD | CC family |
| Yes, (V2, V3) | CommonVoice v11 | Yes, mostly | Crowdsourcing | CommonVoice v11 | Public domain (cc0-1.0) |
| Yes, (V2, V3) | GOUD.MA | ?, only the headlines are definitely written in Moroccan Darija, and some noise exists. | News | Goud-sum | No explicit license |
| Yes, (V2, V3) | Vuk'uzenzele | Yes | Government domain | vukuzenzele-monolingual | License for Data - CC BY 4.0 |
| Yes, (V2, V3) | Masakhanews | Yes | News | masakhane/masakhanews | CC 4.0 Non-Commercial |
| Yes, (V2, V3) | AfriQA | Yes | Human translation | masakhane/afriqa | CC 4.0 Non-Commercial |
| Todo | African News Corpus | Yes | News | African News Corpus | Non-Commercial Government Licence |
| Todo | AfriSenti | ?, location tags converted to language tags, followed by annotation | Tweets | AfriSenti | CC BY 4.0 |
| Todo | Bambara Dataset - Sentiment | ?, followed by annotation | CommonCrawl | Bambara Dataset | No explicit license |
| Todo | TUNIZI Dataset - Sentiment | ?, followed by annotation | YouTube video comments | TUNIZI Dataset, HF_TUNIZI | No explicit license |
| Todo | CTAB | ?, followed by manual verification | Facebook public pages | Zenodo_CTAB | CC BY 4.0 |
| Todo | Open Subtitles | ? | Movie subtitles | OpenSubtitles | |
| Todo | GNOME | Yes | Human translation | opus_GNOME, HF_opus_GNOME, gnome/releases | |
| Todo | KDE4 | Yes | Human translation | opus_kde4, HF_kde4, kde4 | |
| Todo | Ubuntu | Yes | Human translation | HF_opus_ubuntu | |
| Todo | Web Inventory of Transcribed & Translated (WIT) Ted Talks | ? | TED talks | ted_talks_iwslt | |
| Todo | igbo_monolingual | Yes | News, radio, books | igbo_monolingual | |
| Todo | QADI | ?, location tags can be converted to language tags. However, we cannot deny the existence of such a resource, even if some level of noise exists. Arabic dialects are close to each other, and even news websites might be written in Standard Arabic. | Tweets | QADI Tweet IDs | Apache License 2.0 |
| Todo | Mot | Yes | VOA News | mot | MIT License |
| Todo | gov-za | Yes | Government domain | gov-za-monolingual | License for Data - CC BY 4.0 |
| Yes, (V3) | NusaT, NusaP, NusaX | Yes | Human translation | indonlp, nusa-writes, NusaX | CC-BY-SA 4.0 or Apache License 2.0 |
| Yes, (V3) | MT Shared Task American NLP | | | americasnlp2024 | |
| Yes, (V3) | ShaShiYaYi | | | multilingual-data-peru | |
| No, (delete for V3, it's code-switched) | Hinglish collection | | | english-to-hinglish | |
| Yes, (V3) | Indic corpora | | | In22Conv, In22Gen | CC-BY-SA 4.0 |
| Yes, (V3) | Yue parallel text | | | yue-cmn-eng | |
| Todo | Dakshina | | | dakshina | |
| Yes, (V3) | Bhasha-Abhijnaanam | | | Bhasha-Abhijnaanam | |
| Yes, (V3) | Bloom Library | | | bloom-lm | CC-BY-(NC?-ND?-SA?)-4.0 |