What languages are included in the catalogue?

ggdupont edited this page Nov 14, 2021 · 4 revisions

While the catalogue is being built as part of the BigScience effort, which focuses on a predefined set of languages, it will continue to be maintained and aims to be of general use beyond the scope of this project. As such, we welcome contributions of entries corresponding to any language.

BigScience Languages

The BigScience effort chose to focus on the following languages and language groups based on a combination of demographic and geographic coverage and the availability and first-hand knowledge of BigScience participants. We especially invite contributions for entries corresponding to these languages and language groups:

  • African languages of the Niger-Congo family, e.g. Swahili and other Bantu languages
  • Arabic
  • Basque
  • Catalan
  • Chinese
  • English
  • French
  • Indic languages, including Bengali, Hindi, Urdu
  • Indonesian
  • Portuguese
  • Spanish
  • Vietnamese

We also welcome contributions of programming language data to test a large-scale model's ability to learn their distribution.

If you choose African languages, Arabic, or Indic languages for your entry, a further drop-down menu will appear to let you select the specific language, language variety, or dialect: the full list can be found here.

We also recommend adding free-text comments about the language variety whenever possible (for example, language variety information not covered by the above selection), as this will help others navigate the catalogue!

Other Languages

If the language, or one of the languages, corresponding to your entry is absent from the above list, you can bring up a menu with a broader selection (all languages that have a BCP-47 code) by checking the Show other languages box in the Languages and Locations section of the form.
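
The selection flow described on this page can be summarised in a short Python sketch. It is only an illustration: the names `BIGSCIENCE_LANGUAGES`, `HAS_VARIETY_MENU`, and `selection_steps` are hypothetical, the labels are shortened from the list above, and nothing here is taken from the form's actual code.

```python
# Illustrative only: labels are abbreviated from the wiki list, and the
# names below are hypothetical, not part of the catalogue form itself.
BIGSCIENCE_LANGUAGES = {
    "African languages", "Arabic", "Basque", "Catalan", "Chinese",
    "English", "French", "Indic languages", "Indonesian",
    "Portuguese", "Spanish", "Vietnamese",
}

# Groups for which the page says a further drop-down menu appears.
HAS_VARIETY_MENU = {"African languages", "Arabic", "Indic languages"}


def selection_steps(choice):
    """Return the steps a contributor would follow for a given choice."""
    if choice not in BIGSCIENCE_LANGUAGES:
        # Not in the main list: fall back to the broader BCP-47 selection.
        return [
            "check the 'Show other languages' box",
            "pick the language from the broader BCP-47 selection",
        ]
    steps = ["select '{}' in the main list".format(choice)]
    if choice in HAS_VARIETY_MENU:
        steps.append("pick the specific language, variety, or dialect")
    return steps


if __name__ == "__main__":
    for name in ("Basque", "Arabic", "Welsh"):
        print(name, "->", selection_steps(name))
```

In every case, a free-text comment about the language variety can still be added alongside the selection.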