Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AttributeError while using DataSilo._load_data for loading data from dict #373

Closed
lucalila opened this issue May 26, 2020 · 5 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@lucalila
Copy link

lucalila commented May 26, 2020

Describe the bug
I'm using the multilabel text classification prediction head for training a custom dataset. Earlier when using the FARM example code with my data, everything worked fine. Now, when trying to re-train in a new environment with the same dataset, I'm facing the following issue:

Error message

Train epoch 1/1 (Cur. train loss: 0.0000):   0%|          | 0/3124 [00:00<?, ?it/s]Traceback (most recent call last):

  File "<ipython-input-4-04631879caec>", line 1, in <module>
    model_output = trainer.train()

  File "\farm\farm\train.py", line 216, in train
    per_sample_loss = model.logits_to_loss(logits=logits, **batch)

  File "\farm\farm\modeling\adaptive_model.py", line 143, in logits_to_loss
    all_losses = self.logits_to_loss_per_head(logits, **kwargs)

  File "\farm\farm\modeling\adaptive_model.py", line 130, in logits_to_loss_per_head
    all_losses.append(head.logits_to_loss(logits=logits_for_one_head, **kwargs))

  File "\farm\farm\modeling\prediction_head.py", line 389, in logits_to_loss
    label_ids = kwargs.get(self.label_tensor_name).to(dtype=torch.float)

AttributeError: 'NoneType' object has no attribute 'to'

To Reproduce
What I changed is basically the following: I'm not loading my dataset from a local dir anymore but instead I'm passing a dict based on the suggestion in #127

Any ideas what I'm missing? Thanks for any help.

## dicts = [{'text': 'example text', 'label': 'my_label_1'}]

processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=40,
                                        data_dir=None,
                                        train_filename=None,
                                        label_list=label_list,
                                        metric=metric,
                                        dev_filename=None,
                                        test_filename=None,
                                        dev_split=0.0,
                                        label_column_name="label")

data_silo = DataSilo(
    processor=processor,
    batch_size=batch_size,
    automatic_loading=False)

data_silo._load_data(train_dicts=dicts)

language_model = LanguageModel.load(pretrained_model_name_or_path=lang_model, language="multilingual")
prediction_head =  MultiLabelTextClassificationHead(num_labels=len(label_list),
                                                        layer_dims=[768,14])
## just using example code here:

model = AdaptiveModel(
        language_model=language_model,
        #language="multilingual",
        prediction_heads=[prediction_head],
        embeds_dropout_prob=0.1,
        lm_output_types=["per_sequence"],
        device=device)

model, optimizer, lr_schedule = initialize_optimizer(
        model=model,
        learning_rate=3e-5,
        device=device,
        n_batches=len(data_silo.loaders["train"]),
        n_epochs=n_epochs)

trainer = Trainer(
        optimizer=optimizer,
        data_silo=data_silo,
        epochs=n_epochs,
        n_gpu=n_gpu,
        lr_schedule=lr_schedule,
        evaluate_every=evaluate_every,
        device=device,
        model=model)

model_output = trainer.train()

Additional context
Seems like my data_silo is missing some information. E.g., in the first version I had:

data_silo.tensor_names
Out[50]: ['input_ids', 'padding_mask', 'segment_ids', 'text_classification_label_ids']

When loading from dict, I've got:

data_silo.tensor_names
Out[48]: ['input_ids', 'padding_mask', 'segment_ids']

System:

  • OS: Windows 10
  • GPU/CPU: CPU
  • FARM version: 0.3.2 and 0.4.2 (same behavior)
@lucalila lucalila added the bug Something isn't working label May 26, 2020
@Timoeller
Copy link
Contributor

Hey @lucalila thanks for using FARM. Lets figure this out, but I need more information, since it is a unique use I havent seen before.

Could you describe the need for parsing the data as dicts inside the data_silo with data_silo._load_data(train_dicts=dicts)? Couldnt you create a file out of it, at least for training the model?

@Timoeller Timoeller self-assigned this May 27, 2020
@lucalila
Copy link
Author

Hi @Timoeller , thank you for the quick reply! Sure, you're right, I can use the file option as well.

Currently, I'm preprocessing my file(s) and the idea was to pass the data straight into a datasilo instead of saving the preprocessed file and load it again - just for convenience. To do so, I came across the automatic_loading=False parameter in the datasilo constructor and found the load from dictionaries link/option in the comments. That's why I'm bypassing the data_dir and train_filename parameters in the processor as suggested. However, I seem to be doing something wrong.

@Timoeller
Copy link
Contributor

Hey @lucalila you are totally right: data_silo._load_data(train_dicts=dicts) should work on dicts as well. But there is a little issue with naming the keys inside the dict.

Thanks for pointing out the missing label tensor name btw. That's how I realized the name conversion through the parameter "label_column_name" in the TextClassificationProcessor is not applied when the data silo loads dicts directly. So you have to name your label key "text_classification_label" as in
dicts = [{'text': 'example text', 'text_classification_label': 'my_label_1'}]

Could you report back if that works for your case?

@lucalila
Copy link
Author

lucalila commented Jun 4, 2020

Nice, this is working for me! Thanks a lot!!

@lucalila lucalila closed this as completed Jun 4, 2020
@Timoeller Timoeller changed the title DataSilo loading data from dict AttributeError while using DataSilo._load_data for loading data from dict Jun 4, 2020
@lucalila
Copy link
Author

lucalila commented Jun 4, 2020

Sorry, not sure what I did this morning, but with the same config as above and the new label key, I'm here now:

Train epoch 1/1 (Cur. train loss: 0.0000):   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):

  File "<ipython-input-23-04631879caec>", line 1, in <module>
    model_output = trainer.train()

  File "\farm\farm\train.py", line 216, in train
    per_sample_loss = model.logits_to_loss(logits=logits, **batch)

  File "\farm\farm\modeling\adaptive_model.py", line 143, in logits_to_loss
    all_losses = self.logits_to_loss_per_head(logits, **kwargs)

  File "\farm\farm\modeling\adaptive_model.py", line 130, in logits_to_loss_per_head
    all_losses.append(head.logits_to_loss(logits=logits_for_one_head, **kwargs))

  File "\farm\farm\modeling\prediction_head.py", line 390, in logits_to_loss
    loss = self.loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1, self.num_labels))

RuntimeError: shape '[-1, 14]' is invalid for input of size 18

Maybe you can help one more time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants