
rebatch API produces a "Check failed: limit <= dim0_size" error #44

Closed
liurcme opened this issue Mar 25, 2022 · 2 comments · Fixed by #46

liurcme commented Mar 25, 2022

Current behavior

After rebatch(), the data iterator's get_next() produces an error:

F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)

Expected behavior

No error.

System information

  • OS Platform and Distribution: Ubuntu 18.04.5 LTS
  • TensorFlow version: 1.15.0
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1
  • RAM: 94G
  • GPU model and memory: Tesla T4, 16G

Code to reproduce

Step 1: Generate a parquet file by running the following code:

import pandas as pd
import random

data_list = []
for i in range(1, 10000):  # note: yields 9,999 rows, not 10,000
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")
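Before loading the file with HybridBackend, it can help to confirm the schema and row count that were actually written. A small sketch using pyarrow (the engine pandas typically uses for to_parquet); this is not part of the original report:

import pyarrow.parquet as pq

pf = pq.ParquetFile("parquet_sample_file.parquet")
print(pf.schema_arrow)       # expect int_feature: int64, array_feature: list<int64>
print(pf.metadata.num_rows)  # 9999 -- range(1, 10000) yields 9,999 rows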

Step 2: Load the parquet file generated in step 1 with HybridBackend:

import tensorflow as tf
import hybridbackend.tensorflow as hb


filenames_ds = tf.data.Dataset.from_tensor_slices(
    ['file1.snappy.parquet', 'file2.snappy.parquet', ..., 'fileN.snappy.parquet'])

hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("feature1", tf.int64, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("feature2", tf.float32, ragged_rank=1))
hb_fields.append(hb.data.DataFrame.Field("feature3", tf.int64, ragged_rank=1))

ds = filenames_ds.apply(hb.data.read_parquet(8192, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds = ds.apply(hb.data.rebatch(8192, fields=hb_fields))

it = ds.make_one_shot_iterator()
item = it.get_next()

batch_size_dict = {}
with tf.Session() as sess:
    print("======  start ======")
    while True:
        try:
            batch = sess.run(item)
            batch_size = len(batch['feature1'])  # dim-0 size of one of the declared fields
            batch_size_dict[batch_size] = batch_size_dict.get(batch_size, 0) + 1
        except tf.errors.OutOfRangeError:
            break

Running the above code in a python3 shell throws the following error:

F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)
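The failed check matches the CHECK_LE(limit, dim0_size) bounds test in tensorflow::Tensor::Slice, i.e. rebatch asks to slice one row past the end of a buffered tensor. Since the sample data has 9,999 rows, which is not a multiple of the 8,192 batch size, this points at the handling of the final partial batch. A quick way to inspect the remainder (a hypothetical diagnostic, assuming the Step 1 file; not part of the original report):

import pyarrow.parquet as pq

num_rows = pq.ParquetFile("parquet_sample_file.parquet").metadata.num_rows
print(num_rows)         # 9999
print(num_rows % 8192)  # 1807 -- the last batch is smaller than the rest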

Willing to contribute

Yes

2sin18 (Collaborator) commented Mar 25, 2022

Thanks for reporting. Can you provide a sample file for reproducing this issue?

liurcme (Author) commented Mar 28, 2022

> Thanks for reporting. Can you provide a sample file for reproducing this issue?

(1) Generate a parquet file by running the following code:

import pandas as pd
import random

data_list = []
for i in range(1, 10000):  # note: yields 9,999 rows, not 10,000
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")

(2) Loading the generated parquet file with HybridBackend reproduces the issue:

import tensorflow as tf
import hybridbackend.tensorflow as hb

filenames_ds = tf.data.Dataset.from_tensor_slices(["parquet_sample_file.parquet"])

hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("int_feature", tf.int64, ragged_rank=0))
# hb_fields.append(hb.data.DataFrame.Field("float_feature", tf.float32, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("array_feature", tf.int64, ragged_rank=1))

ds = filenames_ds.apply(hb.data.read_parquet(100, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds = ds.apply(hb.data.rebatch(100, fields=hb_fields)).repeat(30)

it = ds.make_one_shot_iterator()
item = it.get_next()
with tf.Session() as sess:
    print("======  start ======")
    while True:
        try:
            a = sess.run(item)
        except tf.errors.OutOfRangeError:
            break
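Until the fix lands (this issue is marked as fixed by #46), a possible workaround, assuming the crash is indeed triggered by the final partial batch, is to generate a row count that divides evenly by the batch size so rebatch never emits a remainder batch. This only adjusts the sample data and is not a fix:

# Hypothetical workaround sketch: 10,000 rows divides evenly by the
# batch size of 100, so no partial final batch is produced.
data_list = []
for i in range(10000):  # range(10000), not range(1, 10000)
    int_feature = random.randint(1, 100)
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])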

@2sin18 2sin18 self-assigned this Apr 9, 2022
@2sin18 2sin18 added the bug Something isn't working label Apr 9, 2022