header shape mis-match #4
We assume that the vocabulary is fixed and that its size is known in advance; in your case, the vocabulary changes at test time. Here are the solutions:

PS: The default value of force_header is already True.
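To illustrate the assumption with a toy sketch (not this repo's actual code): the vocabulary, and therefore the feature dimension, is frozen on the training data, and anything unseen at test time maps to a reserved OOV id rather than growing the vocabulary.

```python
# Toy sketch of a fixed vocabulary: the feature dimension is decided once,
# on the training data; unseen test-time tokens map to a reserved OOV id.
train_tokens = ["law", "court", "ruling"]
vocab = {tok: i for i, tok in enumerate(sorted(set(train_tokens)))}
OOV = len(vocab)  # reserved id; total feature dimension is len(vocab) + 1

def encode(tokens):
    return [vocab.get(t, OOV) for t in tokens]

print(encode(["court", "unseen-word"]))  # -> [0, 3]: the vocabulary never grows
```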
Thanks for the info. It sounds like the header mismatch issue is not a big problem, then, and does not affect training much.
Yes, things should be fine in general. However, note that you may want to look into sub-words to reduce the number of OOV tokens.
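For example, a minimal sub-word sketch with SentencePiece (just one option; the path and vocab size below are made up):

```python
# Minimal sketch: train a sub-word model so that unseen words decompose
# into known pieces instead of becoming whole-word OOV tokens.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_corpus.txt",  # hypothetical path, one sentence per line
    model_prefix="subword",
    vocab_size=8000,
)
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("anticonstitutional", out_type=str))  # unseen word -> known pieces
```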
When running in train mode, I get a header shape mis-match error during the surrogate task (clustering). I've tested this on the EURlex-4k dataset and on my own custom dataset.
After some investigation, I believe it is caused by words that do not occur in the train data but do appear in the test data (and are therefore part of the vocabulary). When the sparse feature file is read, the maximum index found is not necessarily equal to the number of features declared in the header, which throws the shape-mismatch error.
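To make this concrete, here is a rough sketch using scikit-learn's SVMlight reader (not this repo's actual loader; the file name is made up): when the dimensionality is inferred from the data it follows the largest index actually seen, whereas passing n_features pins it to the declared vocabulary size.

```python
# Sketch of the failure mode with sklearn's SVMlight reader (not this repo's
# loader): inferred dimensionality follows the largest feature index seen.
from sklearn.datasets import load_svmlight_file

# Say the vocabulary/header declares 500 features, but the highest-indexed
# words never occur in the training split.
X_inferred, y = load_svmlight_file("train.svm")                # dim = max index seen
X_forced, y = load_svmlight_file("train.svm", n_features=500)  # dim = 500, as declared

print(X_inferred.shape[1], X_forced.shape[1])  # these can differ -> shape mis-match
```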
When the exact same data is provided for train and test, the header shape mis-match error is not raised.
Here is the output:
There is also a similar, possibly related error during evaluation for the extreme task.
The model loaded from the checkpoint has slightly different dimensions than the current model:
I think this comes from when valid labels are selected: some criterion creates a difference of 2 (regardless of dataset, the size mismatch is always 2).
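My guess at the mechanism, as a toy sketch (the actual selection criteria in the code may differ): if "valid" labels are those with at least one positive example in the split being used, labels without support get dropped, shrinking the label dimension relative to the checkpoint.

```python
# Toy sketch: filtering to labels that have at least one positive example
# shrinks the label dimension, producing a checkpoint/model size mismatch.
import numpy as np
from scipy import sparse

Y = sparse.random(100, 10, density=0.01, format="csc", random_state=0)  # toy label matrix
support = Y.getnnz(axis=0)           # positives per label (column)
valid = np.flatnonzero(support > 0)  # assumed criterion: keep labels with support
print(Y.shape[1], len(valid))        # can differ when some columns are empty
```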
I'm not sure what the best fix would be. I see the force_header option is there, so maybe it should be the default. During evaluation I can bypass the issue by manually updating the number of labels in params.json, but this doesn't seem like a real solution.
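For completeness, the manual workaround as a script (the key name "num_labels" and the direction of the off-by-2 adjustment are my assumptions; the actual key in params.json may differ):

```python
# The evaluation workaround described above: patch the label count stored in
# params.json so it matches what the checkpoint reports.
import json

with open("params.json") as f:
    params = json.load(f)

# "num_labels" is an assumed key name; the -2 assumes the checkpoint is the
# smaller of the two sizes.
params["num_labels"] = params["num_labels"] - 2
with open("params.json", "w") as f:
    json.dump(params, f, indent=2)
```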