Multilinguality in GIANT #2

Open
silviaegt opened this issue Apr 11, 2021 · 0 comments
silviaegt commented Apr 11, 2021

Dear BeelGroup Team (cc @GrennanM),

I just read your paper and I'm very excited about this development; it is brilliant!
To understand it better, could you share how many languages are present in the GIANT corpus? Also, is there a way to know which domains are predominant? I saw an "articletype" column, but wasn't sure what the numbers mean.
Thanks!

UPDATE: In case it is useful to you, I ran a language detection algorithm on one part of your corpus (881,206 instances), restricted to a single citation style (style "0"). The language distribution was as follows:

| lang | n |
| --- | --- |
| en | 73563 |
| de | 3518 |
| fr | 1453 |
| pt | 640 |
| es | 504 |
| id | 142 |
| ru | 136 |
| it | 131 |
| nl | 118 |
| tr | 93 |
| pl | 86 |
| da | 71 |
| fi | 71 |
| gl | 70 |
| no | 69 |
| ar | 54 |
| el | 52 |
| ms | 29 |
| af | 25 |
| ca | 25 |
| zh | 23 |
| hu | 21 |
| ja | 20 |
| lt | 20 |
| uk | 19 |
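
For anyone who wants to reproduce a tally like the one above, here is a minimal sketch of the counting step. It is not the exact script behind these numbers: the function name `language_distribution` and the stand-in `toy_detect` are hypothetical, and the detector is passed in as a parameter, so any function that maps a string to a language code can be plugged in.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def language_distribution(
    citations: Iterable[str],
    detect: Callable[[str], str],
) -> List[Tuple[str, int]]:
    """Count detected language codes, most frequent first.

    `detect` is any callable that returns a language code for a string.
    Strings the detector fails on are counted under "unknown".
    """
    counts: Counter = Counter()
    for text in citations:
        try:
            counts[detect(text)] += 1
        except Exception:
            # Many detectors raise on empty or feature-less input.
            counts["unknown"] += 1
    return counts.most_common()


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector, for illustration only."""
    if not text.strip():
        raise ValueError("no text to classify")
    return "de" if "und" in text.lower().split() else "en"


if __name__ == "__main__":
    sample = [
        "Smith, J. and Doe, A. (2019). A study of citations.",
        "Müller, K. und Schmidt, L. (2018). Eine Studie.",
        "",
    ]
    for lang, n in language_distribution(sample, toy_detect):
        print(f"{lang} | {n}")
```

In practice `toy_detect` would be replaced by a real language identification function, and `citations` would be the citation strings for one style read from a part of the corpus. Short strings such as citations tend to be harder to classify than full paragraphs, so the smaller counts are probably best treated as approximate.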
