Script and results on eurlex #21
Hi @Glaciohound, I'm trying to figure out what could go wrong. Here are the results for two of the runs (seeds 1 and 2) with
Across all five seeds, the model in our experiments stopped after [10, 10, 13, 10, 11] epochs, so the model in the log you presented is severely under-trained (under-fit): it barely "learned" to resolve the most frequent classes (~60-70% micro-F1) and does much worse on the rest of the infrequent classes (~30% macro-F1).
Can you please attach the log from training the model for 20 epochs? I'll try to find time to rerun the experiments with the latest version of the code, but it's very unlikely that there is a bug, since others have already replicated the experiments with very similar results.
Thank you, I will give it a try ^_^
Hi @Glaciohound, there was a major bug in data loading and in the label list under consideration. In the released HuggingFace data loader, all 127 labels are pre-defined, lexically ordered by their EUROVOC IDs (https://github.com/huggingface/datasets/blob/1529bdca496d2180bc2af6e1607dd0708438b873/datasets/lex_glue/lex_glue.py#L48). Then, as you mentioned, the EUR-LEX training script considers the first 100 labels instead of the most frequent ones based on the training label distribution (`lex-glue/experiments/eurlex.py`, line 214 at commit d640bfc).
In the original experiments we used custom data loaders; we later built and released the HuggingFace data loader without noticing this "stealthy" bug.

**Permanent bug fix:** I have already made a pull request to fix this issue in the data loader (huggingface/datasets#5048).

**Temporary bug fix:** Until that lands, early next week, you can also replicate the results by manually defining the label list based on the 100 most frequent labels, by replacing the line at `lex-glue/experiments/eurlex.py`, line 214 (commit d640bfc)
with this line of code:

```python
labels = [119, 120, 114, 90, 28, 29, 30, 82, 87, 8, 44, 31, 33, 94, 22, 14, 52, 91, 92, 13, 89, 86, 118, 93, 12, 68, 83,
          98, 11, 7, 32, 115, 96, 79, 116, 106, 81, 75, 117, 112, 59, 6, 77, 95, 72, 108, 60, 99, 74, 24, 27, 34, 58,
          66, 84, 61, 16, 107, 20, 43, 97, 105, 76, 67, 80, 57, 63, 37, 36, 85, 5, 109, 69, 38, 78, 39, 49, 23, 42, 100,
          17, 70, 9, 51, 113, 103, 102, 110, 0, 41, 111, 101, 35, 64, 10, 121, 21, 26, 71, 122]
```
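For illustration, here is a minimal sketch of how a "100 most frequent labels" list like the one above can be derived from a training label distribution, rather than taking the first 100 lexically ordered IDs. The toy `train_labels` data and `TOP_K` value are hypothetical; real code would read the label lists from the HuggingFace `lex_glue`/`eurlex` training split.

```python
from collections import Counter

# Hypothetical toy data: each document carries a list of integer label IDs,
# as in the EUR-LEX multi-label setup.
train_labels = [
    [119, 120, 28],
    [119, 90, 28],
    [120, 28],
    [44],
]

TOP_K = 2  # LexGLUE keeps the 100 most frequent labels; 2 keeps the toy small

# Count how often each label occurs across the training set, then keep the
# TOP_K most frequent IDs -- frequency order, not lexical order.
counts = Counter(label for doc in train_labels for label in doc)
labels = [label for label, _ in counts.most_common(TOP_K)]

print(labels)  # label 28 occurs 3 times, label 119 twice
```

With the toy data above this yields `[28, 119]`; applied to the actual EUR-LEX training split, the same idea produces a frequency-ordered list like the one posted in the fix.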
Hello! Thanks for this great repository. I have tried experiments on many of its subtasks and it works beautifully.
Now the problem is: when I try to reproduce the results on EUR-LEX using `run_eurlex.sh`, it fails to give results similar to (or anywhere near) the ones in the paper. (I tried changing the model to `legal-base-uncased`, and changing the number of epochs from 2 to 20, but these attempts failed too.) Can you help look into this and give some suggestions?
A more detailed log for one of the five seeds is as follows: