Inference tensors cannot be saved for backward. #2144
Thanks for pointing out this bug to us. This is an issue in the NoopTranslator when using the PyTorch engine. We will take a look and determine the best fix. For now, you can adjust your code as follows to work around the exception:
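A minimal sketch of this kind of workaround (modeled on the rank-classification tutorial discussed later in this thread; the variable names and layer sizes are assumptions, not the exact snippet): instead of calling a Predictor inside a lambda, which runs in inference mode and produces tensors that cannot be saved for backward, add the pretrained block itself to the SequentialBlock:

```java
// 'embedding' is the ZooModel<NDList, NDList> loaded from the model zoo.
// Adding its block directly keeps the forward pass inside the training
// graph, so no inference-mode tensors are created.
Block embeddingBlock = embedding.getBlock();
SequentialBlock classifier =
        new SequentialBlock()
                .add(embeddingBlock)
                .add(Linear.builder().setUnits(768).build())
                .add(Activation::relu)
                .add(Dropout.builder().optRate(0.2f).build())
                .add(Linear.builder().setUnits(5).build()); // 5 output classes
model.setBlock(classifier);
```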
The problem comes from:
@frankfliu Thanks for helping. I have tried your solution, and a new problem comes up when initializing the parameters:
Exception in thread "main" java.lang.NullPointerException: No parameter shape has been set
at java.util.Objects.requireNonNull(Objects.java:228)
at ai.djl.nn.Parameter.initialize(Parameter.java:181)
at ai.djl.nn.AbstractBaseBlock.initialize(AbstractBaseBlock.java:182)
at ai.djl.nn.SequentialBlock.initializeChildBlocks(SequentialBlock.java:224)
at ai.djl.nn.AbstractBaseBlock.initialize(AbstractBaseBlock.java:184)
at ai.djl.training.Trainer.initialize(Trainer.java:117)
at cn.amberdata.misc.djl.rankcls.Main.main(Main.java:72)
It seems that DJL does not find a suitable implementation for the function.
@siddvenk The code in the tutorial works fine with your solution, thanks a lot.
I'm at exactly the same point. I cannot follow the fix for #2173, because it freezes the original DistilBERT model, while I need to fine-tune all the layers. @frankfliu's setup seemed good to me, since it just puts the model block in the sequence, and nothing seems to prevent it from taking part in the backward pass. But @SuperMaskv's last NullPointerException blocks the initialize. It correctly initializes the input of the SequentialBlock, then it finds a PtSymbolBlock.
Here we have some issues:
and they are initialized (but it checks for array == null, not shape == null; maybe one is a consequence of the other, but it is a different check, and shape == null will be the problem, as we'll see in a while). But later, in AbstractBaseBlock.initialize(), in the loop:
it finds that the first parameter has a null shape, so it throws the error.
Since I understand that shapes describe the input/output of every layer, it seems to me that they should be set correctly somewhere, if it's true (and it is, as the AmazonRating example shows) that the same block works when frozen. Maybe there is a way to reuse that code (wherever it may be)?
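For reference, the check that throws is in ai.djl.nn.Parameter.initialize; the following is a paraphrase from memory of that era's source, not an exact copy, but it shows why a missing shape fails here even though the earlier array == null check passed:

```java
// Paraphrase of ai.djl.nn.Parameter.initialize(...) (not exact source).
public void initialize(NDManager manager, DataType dataType) {
    Objects.requireNonNull(initializer, "No initializer has been set");
    // The line in the stack trace: initialization needs a shape, which is
    // a different condition from the earlier array == null check.
    Objects.requireNonNull(shape, "No parameter shape has been set");
    if (!isInitialized()) { // i.e. array == null
        array = initializer.initialize(manager, shape, dataType);
    }
}
```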
@adepase Could you provide the code you have tried, especially your implementation of not freezing the embedding layer? The problems you ran into here are not too different from the example. Please take a look at the PR below to see if it works for you.
@KexinFeng Sure, it's almost the same as the one provided earlier by @SuperMaskv, with the incremental corrections, but let me put it here to have a refreshed and summarized view:
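A sketch of the kind of setup under discussion (modeled on the rank-classification tutorial; the model URL and the options are assumptions rather than the exact posted code):

```java
// Load the traced DistilBERT used by the tutorial-era examples
// (the URL is an assumption; substitute your own model source).
Criteria<NDList, NDList> criteria =
        Criteria.builder()
                .optApplication(Application.NLP.WORD_EMBEDDING)
                .setTypes(NDList.class, NDList.class)
                .optModelUrls(
                        "https://resources.djl.ai/test-models/traced_distilbert_wikipedia_uncased.zip")
                .optEngine("PyTorch")
                .optProgress(new ProgressBar())
                .build();
ZooModel<NDList, NDList> embedding = criteria.loadModel();
// embedding.getBlock() is then placed at the head of the SequentialBlock
// classifier, as in the workaround sketched earlier in this thread.
```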
I looked at your suggested example, but sadly I cannot spot any logical differences, so I'm hoping for some help from you.
@KexinFeng Do we need your #2203 PR to succeed in fine-tuning the model?
I'm trying to reproduce the results from https://aclanthology.org/2021.naacl-industry.38.pdf for DistilBERT and CLINC150: my hyperparameters don't match those of the paper (they are really very different!), but that's probably because I didn't understand that I was using a frozen block (as in the AmazonExample, also at the beginning of this post).
For your information, there is probably another minor issue in the current DJL framework (I'll check it again before opening it; it is not so important for the current topic, but it seems useful to highlight here for those reading) that gets in the way of reproducing the cited paper: it modifies the input sentence lengths, while BERT models are not trained for a fixed length, so I expect a possible degradation. Indeed, with a post-training workaround that pads the same input sentences used for the test up to a predefined length (only if the sentence is shorter than that length) and runs them through the final model manually (inference, not training), you obtain better accuracy than without the padding.
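A toy sketch of that padding workaround (the pad token and the fixed length are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Pad a token list up to a fixed length before inference; "[PAD]" and
// maxLength are assumptions, adjust to the vocabulary actually used.
static List<String> padTokens(List<String> tokens, int maxLength) {
    List<String> padded = new ArrayList<>(tokens);
    while (padded.size() < maxLength) {
        padded.add("[PAD]");
    }
    return padded;
}
```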
@adepase Another heads-up: you might need to assign a smaller learning rate to the pre-trained layer (i.e. the embedding layer), which makes sure it is "fine" tuning. This is demonstrated in the example as well. By the way, the major difference in the #2203 PR is that the model now consists of a Lambda block, then the DistilBERT, then the following classifier layers. Previously, the Lambda block and the DistilBERT were together in one layer.
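A sketch of how per-parameter learning rates can be set up in DJL with FixedPerVarTracker, as the transfer-learning examples do (the concrete rates and the embeddingBlock variable are assumptions):

```java
// Smaller learning rate for the pre-trained block, larger for new layers.
FixedPerVarTracker.Builder trackerBuilder =
        FixedPerVarTracker.builder().setDefaultValue(0.001f); // new classifier layers
for (Pair<String, Parameter> param : embeddingBlock.getParameters()) {
    trackerBuilder.put(param.getValue().getId(), 0.0001f); // pre-trained layers
}
Optimizer optimizer =
        Adam.builder().optLearningRateTracker(trackerBuilder.build()).build();
```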
In reply to #2144 (comment): @adepase I see that you are trying to reproduce the results in https://aclanthology.org/2021.naacl-industry.38.pdf for DistilBERT and CLINC150. In this paper the best accuracy for this task is 85.7%. Is it already better than "with a frozen BERT you can reach about 91% accuracy"? Maybe you can first experiment with the unfrozen BERT, as you planned, and then see if it gets better. By the way, you mentioned "hyperparameters don't match (are really very different!) to those of the paper". Even when unfreezing of the layers is enabled, it is still important to make sure that the hyperparameters, as well as the model and the pre-processing, agree with the paper, in order to reproduce the results or make a sensible comparison.
Mmm, no: the 85.7% you mention appears in the difficult case (table 3), but DistilBERT in the full case scores 96.3%, and that's what I'm trying to emulate now and where I reach 91%, so I'm missing about 5%. Moreover, on tests with few examples (e.g. Curekart full, see table 7, where BERT reaches 83.6%) the gap is even greater: frozen DistilBERT reaches about 29%, which is absolutely not comparable, and probably explainable only by the missing full training of all layers, i.e. by DistilBERT being frozen. So experimenting with the unfrozen model is a must.
Obviously, it's important to use the same hyperparameters in the unfrozen case, but in the frozen one the paper's hyperparameters perform absolutely worse than the values you see in the code; that's why you see those values there. My note in the previous comment was only meant to explain that difference, not to highlight an issue. Still, I'm not sure I understand what I should do to make the unfrozen case work.
@KexinFeng Sorry, I'm not sure I understand. Let's go point by point:
So, should I wait for the next release, or is it enough to take the two classes you modified and put the new versions first in the classpath (i.e., in my project) to test whether this fixes the NullPointerException that @SuperMaskv and I ran into (as in his last comment)?
At this time, more than in the improvement of the metric (I understand you're referring to accuracy; let me know if I misunderstood, please), I'm interested in understanding how to fine-tune all the layers (i.e. how to have an unfrozen DistilBERT, if I understand correctly), not in the metric itself (well, yes, it's the final result, but I'm trying to understand the correct setup to avoid the NullPointerException shown in the last comment of @SuperMaskv).
OK, I see that there is a different learning rate tracker in the example; I'll try it, thank you. But I expect it only to improve the results, not to avoid the NullPointerException.
Sorry, I don't understand this point at all. Can you rephrase it, please? Should I add a Lambda block before DistilBERT? Thank you again for your support.
The PR mentioned has been merged; you can now fetch the latest snapshot version. To unfreeze the DistilBERT layer, you can use the "trainParam" option, as in the lines shown below.
djl/examples/src/main/java/ai/djl/examples/training/transferlearning/TrainAmazonReviewRanking.java, lines 84 to 92 in 97004d6
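From memory of those lines (treat this as a sketch rather than an exact copy), the key part is the "trainParam" option on the criteria:

```java
Criteria<NDList, NDList> criteria =
        Criteria.builder()
                .optApplication(Application.NLP.WORD_EMBEDDING)
                .setTypes(NDList.class, NDList.class)
                .optModelUrls(modelUrls)
                .optEngine("PyTorch")
                // Load the PyTorch model with trainable parameters,
                // i.e. the embedding is not frozen:
                .optOption("trainParam", "true")
                .optProgress(new ProgressBar())
                .build();
```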
Then, in your code, after making the same change, the embeddingBlock will be tuned during training. But I noticed another minor thing: one line in your code differs from the corresponding line in the example.
adamw #2206
@KexinFeng Thank you for your further effort in #2206.
No more NullPointerException (fine, thank you again).
I debugged and found that 0.20.0 does not yet contain your merged updates. :( Then I figured it out: I saw the AmazonReview updates and understood how to change the code. The training started and behaved very differently from before (with the old hyperparameters; now I'm changing them to follow the paper, hoping that's enough). Thank you so much again for your impressively fast answers.
@KexinFeng Just trying the first hyperparameter setup, I already reached 94.17%, which is not the 96.3% of the paper, but I'm confident I can find a better setup with more work; also, the code is using Adam and not AdamW (I expect a further minor improvement from that).
I'm glad to know that the answers helped! By the way, the newest snapshot version is not in the 0.20.0 release; it is accessed by adding the snapshot repository to your build.
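For Maven, that normally means something like the following (a sketch assuming the standard Sonatype snapshot repository that the DJL docs describe; the -SNAPSHOT version string is an assumption):

```xml
<!-- Snapshot repository for DJL nightly builds (sketch). -->
<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    </repository>
</repositories>
<!-- Then reference the DJL artifacts with a -SNAPSHOT version,
     e.g. 0.21.0-SNAPSHOT. -->
```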
Currently, the …
In my previous case, everything worked with the last snapshot (even AdamW) with the DistilBERT model. I almost reached the results I expected (well, with different parameters, but I think the original paper I already mentioned isn't so clear: it leaves many possible interpretations, at least for me, as I'm not so good at interpreting all the models).
Then using the following code:
I get the following error:
I understand (well, probably...) from https://huggingface.co/docs/transformers/model_doc/bert#transformers.models.bert.modeling_bert.BertForPreTrainingOutput that it depends on the fact that bert-base returns two outputs (prediction_logits and seq_relationship_logits, I suppose). But how can I select the correct output to match my 768-unit Linear block right after the BERT model?
Well, OK: in the previous code the modelDir seems not to match the physical directory where I saved the model, but that's just incomplete cleanup of the code; obviously they match, or I would have received an error earlier.
It looks like this requires some model-level design and debugging. Also, the output of BERT is decided by the model, and the subsequent blocks need to be designed accordingly.
Regarding how to select the correct output, consider using the Lambda block or other kinds of blocks in DJL to build it.
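A sketch of that approach (assuming the output you want is the first element of the returned NDList; which index corresponds to prediction_logits versus the hidden states depends on how the model was traced, so verify it for your model):

```java
// bertBlock is the loaded bert-base block, which returns several outputs.
SequentialBlock classifier =
        new SequentialBlock()
                .add(bertBlock)
                // Keep only the output the following layers expect;
                // index 0 here is an assumption, check your model.
                .add(new LambdaBlock(list -> new NDList(list.get(0))))
                .add(Linear.builder().setUnits(768).build());
```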
Well... I didn't open this issue. It seems to me that the original reason for opening it has been solved, but I think @SuperMaskv should be the one to say that. Anyway, I tried the Lambda block and ran into another issue; I'm opening it separately.
I don't know if this is a bug. I'm trying to follow the official tutorial using the PyTorch engine.
Here is my code and exception.
And here are my dependencies.
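For context, a typical DJL + PyTorch dependency set of that era looks like the following sketch (versions and scopes are assumptions, not the reporter's actual list):

```xml
<!-- Sketch of a typical DJL + PyTorch engine dependency set. -->
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>api</artifactId>
    <version>0.20.0</version>
</dependency>
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-engine</artifactId>
    <version>0.20.0</version>
    <scope>runtime</scope>
</dependency>
```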