
header shape mis-match #4

Closed
cairomo opened this issue Jul 9, 2021 · 3 comments

@cairomo

cairomo commented Jul 9, 2021

When running in train mode, it gives a header shape mismatch error during the surrogate task (clustering). I've tested this on the EURlex-4k dataset and my own custom dataset.

After some investigation, I believe it might be due to words that do not occur in the train data (but are in the test data and therefore part of the vocabulary). When the sparse feature file is read, the maximum index it finds is not necessarily the same as the number of features given in the header, which triggers the shape mismatch warning.

When the exact same data is provided for train and test, the header shape mismatch error is not raised.

Here is the output:

Loading training data.
Surrogate mapping is not None, mapping labels
Loading validation data.
/home/chmo/.local/lib/python3.7/site-packages/xclib-0.97-py3.7-linux-x86_64.egg/xclib/data/data_utils.py:263: UserWarning: Header mis-match from inferred shape!
  warnings.warn("Header mis-match from inferred shape!")
Surrogate mapping is not None, mapping labels

There is also a similar, possibly related error during evaluation for the extreme task. The model loaded from the checkpoint has slightly different dimensions than the current model:

Error(s) in loading state_dict for DeepXMLf:
	size mismatch for classifier.weight: copying a param with shape torch.Size([17965, 300]) from checkpoint, the shape in current model is torch.Size([17963, 300]). 

I think this comes from when valid labels are selected: some criterion creates a difference of 2 (regardless of the dataset, the size mismatch is always 2).
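
For debugging, here is a small sketch (the file name and checkpoint layout are assumptions; adjust to your run) that compares the checkpoint shapes against a freshly constructed model before calling `load_state_dict`, to see exactly which parameters disagree:

```python
import torch

ckpt = torch.load("model_checkpoint.pt", map_location="cpu")  # hypothetical path
state = ckpt.get("state_dict", ckpt)  # some checkpoints wrap the weights

# `model` is the freshly built DeepXMLf instance used for evaluation.
model_state = model.state_dict()
for name, saved in state.items():
    current = model_state.get(name)
    if current is not None and tuple(current.shape) != tuple(saved.shape):
        print(f"{name}: checkpoint {tuple(saved.shape)} vs model {tuple(current.shape)}")
```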

I'm not sure what the best fix would be. I see the force_header option is there, so maybe it should be the default. During evaluation, I can bypass the issue by manually updating the number of labels in params.json, but this doesn't seem like a real solution.

@kunaldahiya
Collaborator

We assume that the vocabulary is fixed and that its size is known in advance. In your case, the vocabulary is changing at test time. Here are the solutions:

  1. Ignore the extra tokens during test time (see the sketch after this list).

  2. Change the header of the train data file to reflect the overall number of tokens (during training). You'll need to change the initialization accordingly, i.e., you can set those embeddings to zero or change the vocabulary file used to compute FastText embeddings. Please note that Astec will not train embeddings for those tokens since no training points are available for them (also sketched below).
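
Minimal sketches of the two options (the function names are hypothetical, and they assume the features are scipy sparse matrices and the pre-trained embeddings a NumPy array; this is not code from the repo):

```python
import numpy as np
from scipy.sparse import diags

def ignore_unseen_tokens(X_train, X_test):
    """Option 1 (sketch): zero out test-time features for tokens that never
    occur in the training split, so only the trained vocabulary is used."""
    seen = np.asarray((X_train != 0).sum(axis=0)).ravel() > 0
    # Multiplying by a 0/1 diagonal mask keeps the shape and vocabulary unchanged.
    return X_test @ diags(seen.astype(X_test.dtype))

def pad_embeddings(emb, total_vocab_size):
    """Option 2 (sketch): after changing the train-file header to the overall
    token count, extend the pre-trained embedding matrix with zero rows for
    tokens that have no training occurrences (they will not be updated)."""
    extra = total_vocab_size - emb.shape[0]
    return np.vstack([emb, np.zeros((extra, emb.shape[1]), dtype=emb.dtype)])
```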

PS: The default value of force_header is already True.

@cairomo
Author

cairomo commented Jul 28, 2021

Thanks for the info. So it sounds like the header mismatch issue is not a big problem and does not affect training too much?

@kunaldahiya
Collaborator

Yes, things should be fine in general. However, please note that you may want to look into sub-words to reduce the number of OOV tokens.
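
For example, a sub-word model such as FastText composes vectors for unseen tokens from character n-grams. A minimal sketch with gensim (not part of this repo; the toy corpus is illustrative only):

```python
from gensim.models import FastText

# Tiny illustrative corpus; in practice, train on your raw text.
sentences = [
    ["extreme", "classification", "with", "label", "features"],
    ["sparse", "feature", "vectors", "for", "classification"],
]

# FastText builds word vectors from character n-grams (min_n..max_n),
# so a token unseen at training time still gets an embedding.
model = FastText(sentences=sentences, vector_size=32, min_n=3, max_n=5,
                 min_count=1, epochs=10)

print("classifications" in model.wv.key_to_index)  # False: out of vocabulary
print(model.wv["classifications"].shape)           # (32,) via sub-word n-grams
```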
