
header shape mis-match #4

Closed
cairomo opened this issue Jul 9, 2021 · 3 comments

@cairomo

cairomo commented Jul 9, 2021

When running in train mode, it gives a header shape mismatch error during the surrogate task (clustering). I've tested this on the EURlex-4k dataset and my own custom dataset.

After some investigation, I believe it might be due to words that do not occur in the train data (but are in the test data and therefore part of the vocabulary). When the sparse feature file is read, the maximum index it finds is not necessarily the same as the number of features given in the header, which triggers the shape mismatch warning.

When the exact same data is provided for train and test, the header shape mismatch error is not raised.

Here is the output:

Loading training data.
Surrogate mapping is not None, mapping labels
Loading validation data.
/home/chmo/.local/lib/python3.7/site-packages/xclib-0.97-py3.7-linux-x86_64.egg/xclib/data/data_utils.py:263: UserWarning: Header mis-match from inferred shape!
  warnings.warn("Header mis-match from inferred shape!")
Surrogate mapping is not None, mapping labels

There is also a similar, possibly related error during evaluation for the extreme task. The model loaded from the checkpoint has slightly different dimensions than the current model:

Error(s) in loading state_dict for DeepXMLf:
	size mismatch for classifier.weight: copying a param with shape torch.Size([17965, 300]) from checkpoint, the shape in current model is torch.Size([17963, 300]). 

I think this comes from when valid labels are selected: some criterion creates a difference of 2 (regardless of the dataset, the size mismatch is always 2).
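
For debugging, here is a small sketch (the file name and checkpoint layout are assumptions; adjust to your run) that compares the checkpoint shapes against a freshly constructed model before calling `load_state_dict`, to see exactly which parameters disagree:

```python
import torch

ckpt = torch.load("model_checkpoint.pt", map_location="cpu")  # hypothetical path
state = ckpt.get("state_dict", ckpt)  # some checkpoints wrap the weights

# `model` is the freshly built DeepXMLf instance used for evaluation.
model_state = model.state_dict()
for name, saved in state.items():
    current = model_state.get(name)
    if current is not None and tuple(current.shape) != tuple(saved.shape):
        print(f"{name}: checkpoint {tuple(saved.shape)} vs model {tuple(current.shape)}")
```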

I'm not sure what the best fix would be. I see the force_header option is there, so maybe it should be the default. During evaluation, I can bypass the issue by manually updating the number of labels in params.json, but this doesn't seem like a real solution.

@kunaldahiya
Collaborator

We assume that the vocabulary is fixed and that its size is known in advance. In your case, the vocabulary is changing at test time. Here are the solutions:

  1. Ignore the extra tokens during test time (see the sketch after this list).

  2. Change the header of the train data file to reflect the overall number of tokens (during training). You'll need to change the initialization accordingly, i.e., you can set those embeddings to zero or change the vocabulary file used to compute FastText embeddings. Please note that Astec will not train embeddings for those tokens since no training points are available for them (also sketched below).
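
Minimal sketches of the two options (the function names are hypothetical, and they assume the features are scipy sparse matrices and the pre-trained embeddings a NumPy array; this is not code from the repo):

```python
import numpy as np
from scipy.sparse import diags

def ignore_unseen_tokens(X_train, X_test):
    """Option 1 (sketch): zero out test-time features for tokens that never
    occur in the training split, so only the trained vocabulary is used."""
    seen = np.asarray((X_train != 0).sum(axis=0)).ravel() > 0
    # Multiplying by a 0/1 diagonal mask keeps the shape and vocabulary unchanged.
    return X_test @ diags(seen.astype(X_test.dtype))

def pad_embeddings(emb, total_vocab_size):
    """Option 2 (sketch): after changing the train-file header to the overall
    token count, extend the pre-trained embedding matrix with zero rows for
    tokens that have no training occurrences (they will not be updated)."""
    extra = total_vocab_size - emb.shape[0]
    return np.vstack([emb, np.zeros((extra, emb.shape[1]), dtype=emb.dtype)])
```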

PS: The default value of force_header is already True.

@cairomo
Author

cairomo commented Jul 28, 2021

Thanks for the info. So it sounds like the header mismatch issue is not a big problem and does not affect training too much?

@kunaldahiya
Collaborator

Yes, things should be fine in general. However, please note that you may want to look into sub-words to reduce the number of OOV tokens.
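
For example, a sub-word model such as FastText composes vectors for unseen tokens from character n-grams. A minimal sketch with gensim (not part of this repo; the toy corpus is illustrative only):

```python
from gensim.models import FastText

# Tiny illustrative corpus; in practice, train on your raw text.
sentences = [
    ["extreme", "classification", "with", "label", "features"],
    ["sparse", "feature", "vectors", "for", "classification"],
]

# FastText builds word vectors from character n-grams (min_n..max_n),
# so a token unseen at training time still gets an embedding.
model = FastText(sentences=sentences, vector_size=32, min_n=3, max_n=5,
                 min_count=1, epochs=10)

print("classifications" in model.wv.key_to_index)  # False: out of vocabulary
print(model.wv["classifications"].shape)           # (32,) via sub-word n-grams
```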
