
[ktrain 0.25.4] Possible Memory Leak in preprocessor/hf_convert_examples #351

Closed
RAbraham opened this issue Apr 5, 2021 · 3 comments
Labels: user question (Further information is requested)

RAbraham commented Apr 5, 2021

Hi,
I'm investigating a memory leak in our application, and one of the signals I'm seeing is that preprocessor/hf_convert_examples may be leaking memory at line 383:

    return  TransformerDataset(np.array(features_list), np.array(labels))

Our code calls the following methods on ktrain:

class Bert:
    def __init__(self, model_name=None):
        self.predictor = ktrain.load_predictor(..)

    def predict(self, batch_string):
        res = self.predictor.predict(batch_string, return_proba=True)
        output = ..  # post-processing on res
        return output

I am using tracemalloc, running the above code in a for loop, and capturing and printing a memory snapshot after every iteration. Almost all other objects show stable memory usage, except for the ktrain allocation below, whose usage grows with every iteration (first iteration 248 KiB, last iteration 992 KiB):

../ktrain/text/preprocessor.py:383: size=248 KiB, count=17, average=14.6 KiB
../ktrain/text/preprocessor.py:383: size=496 KiB, count=32, average=15.5 KiB
../ktrain/text/preprocessor.py:383: size=744 KiB, count=49, average=15.2 KiB
../ktrain/text/preprocessor.py:383: size=992 KiB, count=67, average=14.8 KiB
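
For reference, the measurement loop was roughly of the following shape (a simplified sketch with placeholder names such as sample_batch, not the exact production code):

    # Simplified sketch of the tracemalloc measurement loop described above;
    # Bert and sample_batch are placeholders, not the exact production code.
    import tracemalloc

    tracemalloc.start()
    bert = Bert()
    for i in range(10):
        bert.predict(sample_batch)
        snapshot = tracemalloc.take_snapshot()
        # Print the top allocation sites, grouped by file and line number.
        for stat in snapshot.statistics("lineno")[:10]:
            print(stat)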

Could this be a memory leak? If so, is there anything I can do in my code to prevent it for now?

Additional info:

  • If this is a memory leak, it is larger in a lower version (i.e. 0.21.4), but I'm reporting against the latest version I can upgrade to. I can't upgrade to 0.26 right now. In fact, I'd prefer to stick with 0.21.4 for now if there are any workarounds.
amaiya added the user question label on Apr 5, 2021

amaiya (Owner) commented Apr 5, 2021

Hi @RAbraham

I wasn't able to reproduce this using the latest versions of ktrain and transformers and TensorFlow 2.3.1.

But if you look at preprocessor.py, it is not building any sort of cache or anything else that would cause a memory leak. It could be something related to your deployment setup. I'm not sure which version of TensorFlow you're using, but if there really is a memory leak, it may be in lower-level TensorFlow code (e.g., tf.data.Dataset, which is used by preprocessor). Also, the hf_convert_examples function and other portions of preprocessor.py have not changed for several versions now. One of the main differences across the ktrain versions you're testing is the version of transformers, so if the leak changes across versions, another possibility is that it is an issue with an older version of transformers. Like I said, I wasn't able to reproduce it with the latest ktrain, which uses transformers>=4.0.

One easy way to possibly address this issue is to deploy your model using ONNX, which allows you to deploy WITHOUT the need for TensorFlow/PyTorch/ktrain. Please see the example ONNX notebook, which shows you how to convert your ktrain-trained transformers model to ONNX. This allows you to deploy ktrain models with much smaller memory/storage footprints. I have done this using both Heroku and AWS Lambda, and it is quite efficient.
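
For reference, once the model has been exported to ONNX (the export steps are in the notebook mentioned above), inference can look roughly like the sketch below. The model path, tokenizer name, and max length are placeholders, and the exact graph input names depend on how the model was exported:

    # Sketch of ONNX-based inference with onnxruntime; "model.onnx",
    # "distilbert-base-uncased", and maxlen are placeholder values.
    import numpy as np
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    session = ort.InferenceSession("model.onnx")

    def predict_proba(texts, maxlen=128):
        enc = tokenizer(texts, padding=True, truncation=True,
                        max_length=maxlen, return_tensors="np")
        # Feed only the inputs the exported graph actually expects.
        input_names = {i.name for i in session.get_inputs()}
        feed = {k: np.asarray(v) for k, v in enc.items() if k in input_names}
        logits = session.run(None, feed)[0]
        # Softmax over the class dimension to get probabilities.
        exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)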

amaiya closed this as completed on Apr 5, 2021

amaiya (Owner) commented Apr 6, 2021

It may or may not be related to this TensorFlow issue. From the thread, a workaround is to use del and gc.collect().
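A rough sketch of that workaround applied to the wrapper class above (whether it actually helps depends on where the memory is being held):

    # Sketch of the del + gc.collect() workaround from the linked thread;
    # `batches` and the Bert wrapper are the placeholders used earlier.
    import gc

    bert = Bert()
    for batch in batches:
        probs = bert.predict(batch)
        # ... consume probs ...
        del probs
        gc.collect()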

However, as I mentioned before, the better solution would be to deploy your model using ONNX.

RAbraham (Author) commented Apr 6, 2021

Thank you very much for your detailed investigation of this issue 🙏
Sorry for not mentioning it earlier, but I'm on TF 2.2 and ktrain 0.25.4.
I did try del and gc.collect(), but that didn't change anything.
The ONNX recommendation is quite valuable!
I'll try out your suggestions. Thank you.
