Add a European Benchmark #361

Open
32 tasks done
Tracked by #366
KennethEnevoldsen opened this issue Apr 13, 2024 · 7 comments

Comments

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Apr 13, 2024

Add a benchmark for the European languages. This issue gives an overview of the status.

The EU has 24 official languages:

Germanic Languages

  • Danish - dan
  • English - eng
  • German - deu
  • Dutch - nld
  • Swedish - swe

Romance Languages

  • French - fra
  • Italian - ita
  • Portuguese - por
  • Spanish - spa
  • Romanian - ron

Slavic Languages

  • Bulgarian - bul
  • Croatian - hrv
  • Czech - ces
  • Polish - pol
  • Slovak - slk
  • Slovenian - slv

Baltic Languages

  • Latvian - lav
  • Lithuanian - lit

Finno-Ugric Languages

  • Estonian - est (2 tasks)
  • Finnish - fin (covered with only 1 task outside of bitext and translations)
  • Hungarian - hun (1 task)

Other Indo-European Languages

  • Greek - ell
  • Irish - gle (Celtic Language)

Non-Indo-European Languages

  • Maltese - mlt (Semitic Language)

In addition to this list, we might add the following (feel free to add languages that I might have missed):

  • Basque (although it is not an official EU language, it is officially recognized within Spain)
  • Norwegian Nynorsk (official language of Norway, which is in the Schengen Area but not in the EU)
  • Norwegian Bokmål (official language of Norway, which is in the Schengen Area but not in the EU)
  • Icelandic (official language of Iceland, also in the Schengen Area but not in the EU)
  • Albanian (official language of Albania, a candidate for EU membership and part of the Schengen visa regime)
  • Serbian (official language of Serbia, another candidate for EU membership and part of Schengen visa policies)
  • Macedonian (official language of North Macedonia, an EU candidate country and part of the Schengen visa system)
  • Romani (recognized minority language in numerous European countries)

Note: I haven't checked off languages that are covered only by bitext tasks or translated tasks.

@x-tabdeveloping
Contributor

We might get reasonable coverage of Macedonian by including Bulgarian; they are pretty much as close as Bokmål and Danish. The same goes for Serbian and Croatian (maybe also Slovenian): up until the nineties, Serbo-Croatian was considered one language. Perhaps the reason to include both would be that Serbian is written in Cyrillic script (but then we can make sure the model understands that by having Bulgarian, Russian or Ukrainian).

@x-tabdeveloping
Contributor

We might also consider some minority languages and regional dialects for the sake of fairness and ethics. A handful of examples I can think of:

  • Romani is a minority language in a lot of European countries, and Roma make up a sizeable portion of the population. I might be able to scrape together some resources, as the Hungarian Roma community is quite large.
  • Schwiizerdütsch as it is very different from mainstream Hochdeutsch
  • Sønderjysk, we can hopefully find some resources for that
  • Frisian, as about half a million people have it as their mother tongue

@isaac-chung
Collaborator

Something from EURLEX like https://huggingface.co/datasets/coastalcph/multi_eurlex or https://huggingface.co/datasets/ddrg/super_eurlex is promising. Though for the classification task, MTEB doesn't seem to support multi-label classification, right?

@KennethEnevoldsen
Contributor Author

No, sadly it does not support it (would love a PR on it though).

@PierreMesure

Great job, Kenneth et al.! We might be able to contribute with some Swedish data later this Spring. We embed several types of Swedish administrative documents to enable semantic search in them and we're planning on improving our evaluation pipeline this year. Will try to contribute back.

I'm not from Sápmi, but I think this is the right place to mention that the Sami languages should be included at some point. I haven't heard of any Nordic LLM initiative regarding these languages, but I hope one emerges soon with a strong link to the people still using them.

@KennethEnevoldsen
Contributor Author

Thanks for the addition, @PierreMesure; if you know of any Sami datasets, feel free to add them or create an issue so that other contributors can add them.

@Rysias
Contributor

Rysias commented Apr 22, 2024

I think that with my recent PR, Albanian and Latvian have some representation, at least for a clustering task.
