PipeModeDataset leads to infinite loop / memory exhaust when re-using dataset with tf.keras #46
Hello @fmannhardt, apologies for the late response. Let me look into this and I'll try to respond as soon as possible. This looks like it may require some dedicated investigation time to root cause. Let me speak with my team about it. Thank you for your patience!
Thanks. If you need more information, let me know.
Hi @fmannhardt, I've noticed that you are passing

Would you mind sharing a complete example allowing us to reproduce the issue?

Thanks for using SageMaker
Márcio
I will try to set up a complete example including data. Based on the log messages, the issue appears on the second use. So it hangs on the call to
Thanks @fmannhardt, a minimal example will be very helpful in diagnosing the issue.
Same issue, here is an example of the code:
I can remove
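For context, the general shape of such an input pipeline (a rough sketch; the feature names, shapes, and batch size below are assumptions, not something from this comment) looks like this:

```python
# Rough sketch only: the feature names, shapes, and batch size are assumptions,
# not taken from this comment.
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def parse(record):
    # Hypothetical TFRecord schema, for illustration.
    parsed = tf.parse_single_example(record, features={
        "data": tf.FixedLenFeature([784], tf.float32),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return parsed["data"], parsed["label"]

# Streams records from the SageMaker Pipe-mode channel instead of reading files.
ds = PipeModeDataset(channel="train", record_format="TFRecord")
ds = ds.map(parse).batch(64).repeat()
```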
I think I'm seeing the same behaviour as @kafka399, but I'm not sure whether it's the same as this parent issue or should be tracked separately. For me the hang is at 0% CPU utilization and static memory consumption - it looks more like a deadlock than an infinite loop.

Setup

My script creates the datasets and runs two rounds of training roughly like this:

```python
ds_train = PipeModeDataset(channel="train") \
    .repeat(args.epochs) \
    .batch(2) \
    .map(data.get_tf_parse_mapper(args.data_shape, randomize=True)) \
    .batch(args.batch_size) \
    .map(data.get_tf_train_batch_mapper(args.batch_size, args.data_shape, args.num_classes))

ds_val = ...  # pretty much the same, with a different channel name

# Do the pre-training:
train_model.fit(
    ds_train,
    epochs=args.epochs_stabilize,
    initial_epoch=0,
    shuffle=False,
    steps_per_epoch=args.num_samples_train // args.batch_size,
    validation_data=ds_val,
    validation_steps=args.num_samples_validation // args.batch_size,
    verbose=2,
)

# [Unfreeze some layers] then recompile the model:
train_model.compile(
    optimizer=Adam(lr=1e-4),
)

# Train for remaining epochs:
train_model.fit(
    ds_train,
    callbacks=train_callbacks,
    epochs=args.epochs,
    initial_epoch=args.epochs_stabilize,
    shuffle=False,
    steps_per_epoch=args.num_samples_train // args.batch_size,
    validation_data=ds_val,
    validation_steps=args.num_samples_validation // args.batch_size,
    verbose=2,
)
```

Findings

On the SM TensorFlow container v1.15.2 (which I was originally targeting), the code ran (both training rounds) as long as I either removed the

I noticed in the README that PipeModeDataset only advertises support for TF v1.7-1.12, so I tried dropping back to TF v1.12. This fixed the issue, but only if I made sure my datasets were exact multiples of the batch size - otherwise it froze for me in the first epoch of the second

So my asks would be:
Edit: Some additional observations:
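On the exact-multiple-of-batch-size point above, a quick worked sketch of the arithmetic (the sample count and batch size are made up for illustration, not taken from this issue):

```python
# Sketch of the arithmetic: each epoch should consume a whole number of batches,
# i.e. steps_per_epoch * batch_size should equal the number of records in the channel.
num_samples_train = 9920   # assumed value, not from this issue
batch_size = 64

steps_per_epoch = num_samples_train // batch_size            # 155
print(steps_per_epoch * batch_size == num_samples_train)     # True: no partial batch

# With 10000 samples instead, 10000 // 64 == 156 full batches would leave
# 16 records unread in the pipe each epoch - the "not an exact multiple"
# situation the comment above associates with the freeze on TF 1.12.
```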
Here's my attempt at a full, minimal-ish reproducible example on TF 1.15, with MNIST digit classification:

It's adapted from a workshop, so to run it you need to:
On TF v1.15 (as per Git) the training freezes on the first epoch with 0% CPU/GPU utilization. On TF v1.12 (taking care to make sure the batch size is a factor of both the number of training samples and the number of test samples), the training completes successfully.
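For the MNIST setup described above, the divisibility constraint is easy to check with a small sketch (the candidate batch sizes are arbitrary):

```python
# Sketch: check which batch sizes divide both standard MNIST splits exactly,
# per the observation that this avoided the freeze on TF v1.12.
num_train, num_test = 60000, 10000   # standard MNIST split sizes

for candidate in (32, 50, 100, 125, 200, 250):
    ok = num_train % candidate == 0 and num_test % candidate == 0
    print(candidate, "divides both splits" if ok else "leaves a partial batch")
# 50, 100, 125, 200 and 250 divide both; 32 leaves a partial test batch (10000 % 32 == 16).
```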
I'm also having similar struggles here. The warning I get is:

Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Same issue here. In my case, I am trying to train my model on a CPU machine with 6 GB of RAM. Could it be that the memory is not adequate for the model training? Same error:

GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
I have tried several configurations to use PipeModeDataset together with tf.keras, and I run into trouble re-using the same dataset (e.g. validation) in both `fit` and `evaluate`. It seems that on the second call the SageMaker instance exhausts all available GPU memory and goes into some kind of loop.

This is my current training script (I will try to strip it down further); it works perfectly when using `File` mode but fails on the `evaluate` call when executed in `Pipe` mode:
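A minimal hypothetical sketch of the pattern described here - a single PipeModeDataset reused for validation during `fit` and again in a later `evaluate` call - might look like this (not the actual training script; the model, TFRecord schema, and step counts are assumptions):

```python
# Hypothetical sketch only: the model, the TFRecord schema, and the step counts
# are invented for illustration and are not the script from this issue.
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def parse(record):
    parsed = tf.parse_single_example(record, features={
        "data": tf.FixedLenFeature([784], tf.float32),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return parsed["data"], parsed["label"]

ds_train = PipeModeDataset(channel="train", record_format="TFRecord") \
    .map(parse).batch(64).repeat()
ds_val = PipeModeDataset(channel="validation", record_format="TFRecord") \
    .map(parse).batch(64).repeat()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(ds_train, epochs=5, steps_per_epoch=100,
          validation_data=ds_val, validation_steps=20)

# Re-using ds_val here (after fit has already consumed it) is where the hang /
# GPU memory exhaustion is reported to occur in Pipe mode.
model.evaluate(ds_val, steps=20)
```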
I tried with framework versions `v1.13` and `v1.14`; both show the same behaviour, and this seems to be related to re-using the dataset after `model.fit` is done. If I don't call `model.evaluate`, then everything is fine.

Unfortunately there is not much logging output except for this warning: