### Input Pipeline

Because we're using TensorFlow, we can make use of the `tf.data.Dataset` API. First, we'll load our NumPy binaries from file:

In [2]:
import numpy as np

with open('/content/Digikala-comments/files/movie-xids.npy', 'rb') as f:
    Xids = np.load(f, allow_pickle=True)
with open('/content/Digikala-comments/files/movie-xmask.npy', 'rb') as f:
    Xmask = np.load(f, allow_pickle=True)
with open('/content/Digikala-comments/files/movie-labels.npy', 'rb') as f:
    labels = np.load(f, allow_pickle=True)

We can combine these three arrays into a single TensorFlow dataset using `from_tensor_slices`:

In [3]:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

dataset.take(1)

<TakeDataset element_spec=(TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(6,), dtype=tf.float64, name=None))>

To restructure each sample, we can `map` a function over the dataset like so:

In [4]:
def map_func(input_ids, masks, labels):
    # we convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

# then we use the dataset map method to apply this transformation
dataset = dataset.map(map_func)

dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int64, name=None)}, TensorSpec(shape=(6,), dtype=tf.float64, name=None))>
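Note that `dataset.take(1)` on its own only displays the element spec. To look at actual values, the dataset has to be iterated. The following is a minimal sketch on toy arrays; the names `toy_ids`, `toy_mask`, and `toy_labels` and their small shapes are assumptions made for illustration and stand in for `Xids`, `Xmask`, and `labels`.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for Xids, Xmask and labels (shapes are illustrative assumptions)
toy_ids = np.arange(12, dtype=np.int64).reshape(4, 3)
toy_mask = np.ones((4, 3), dtype=np.int64)
toy_labels = np.eye(4, 2, dtype=np.float64)

def map_func(input_ids, masks, labels):
    # same restructuring as above: (ids, mask, labels) -> ({...}, labels)
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

toy_ds = tf.data.Dataset.from_tensor_slices((toy_ids, toy_mask, toy_labels)).map(map_func)

# iterating yields real tensors rather than the element_spec summary
for features, label in toy_ds.take(1):
    first_ids = features['input_ids'].numpy()
    first_mask = features['attention_mask'].numpy()
    first_label = label.numpy()
    print(first_ids, first_mask, first_label)
```

The same loop can be run against the real `dataset` to spot-check a sample before batching.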

Now we can see that our dataset sample format has changed. Next, we need to shuffle our data and batch it. We will use a batch size of 50 and drop any samples that don't fit evenly into batches of 50.

In [5]:
batch_size = 50

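# reshuffle_each_iteration=False keeps the shuffled order fixed across passes,
# so the take/skip split below does not mix training and validation samples
dataset = dataset.shuffle(1000, reshuffle_each_iteration=False).batch(batch_size, drop_remainder=True)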

dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(50, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(50, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(50, 6), dtype=tf.float64, name=None))>

Now our dataset samples are organized into batches of 50. The final step is to split our data into training and validation sets. For this we use the `take` and `skip` methods, creating a 90-10 split.

In [6]:
split = 0.9

# calculate how many batches must be taken to create a 90% training set
size = int((Xids.shape[0] / batch_size) * split)

size

3228
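To make the arithmetic concrete, here is a small sketch in plain Python. The helper `split_batches` and the sample count of 1,030 are hypothetical, chosen only to illustrate the calculation; they are not the real dataset size.

```python
def split_batches(n_samples, batch_size, split):
    """Return (train_batches, val_batches) for a take/skip split."""
    # drop_remainder=True discards the final partial batch
    total_batches = n_samples // batch_size
    # same formula as in the cell above
    train_batches = int((n_samples / batch_size) * split)
    # skip(train_batches) leaves whatever batches remain
    val_batches = total_batches - train_batches
    return train_batches, val_batches

# hypothetical sample count, for illustration only
train_batches, val_batches = split_batches(n_samples=1_030, batch_size=50, split=0.9)
print(train_batches, val_batches)
```

Because the partial batch is dropped, the validation set ends up with the remaining whole batches rather than exactly 10% of the samples.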

In [7]:
train_ds = dataset.take(size)
val_ds = dataset.skip(size)
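Because `take` and `skip` each iterate over the shuffled dataset separately, the two sets are only disjoint when the shuffle produces the same order on every pass. The sketch below compares the two settings of the `reshuffle_each_iteration` argument of `shuffle` on a toy range dataset; the helper name `split_overlap` and the toy sizes are assumptions for illustration.

```python
import tensorflow as tf

def split_overlap(reshuffle):
    """Count elements that land in both halves of a take/skip split."""
    ds = tf.data.Dataset.range(100).shuffle(100, reshuffle_each_iteration=reshuffle)
    # each of these iterates the shuffled dataset independently
    train = set(ds.take(90).as_numpy_iterator())
    val = set(ds.skip(90).as_numpy_iterator())
    return len(train & val)

# fixed order: the two halves should not share any elements
print(split_overlap(False))
# default behaviour: a fresh order per pass, so some overlap is very likely
print(split_overlap(True))
```

If the second number is non-zero, validation samples are also appearing in the training set, which would make validation metrics look better than they should.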

Our two datasets are now ready to be used as model inputs. We can save both to file using `tf.data.experimental.save`:

In [None]:
# tf.data.experimental.save(train_ds, 'train')
# tf.data.experimental.save(val_ds, 'val')

In the next notebook we will load these files using `tf.data.experimental.load`, which requires us to define the `element_spec`: the structure, shape, and dtype of each dataset element. To find our dataset's element spec we can write:

In [8]:
train_ds.element_spec

({'attention_mask': TensorSpec(shape=(50, 512), dtype=tf.int64, name=None),
  'input_ids': TensorSpec(shape=(50, 512), dtype=tf.int64, name=None)},
 TensorSpec(shape=(50, 6), dtype=tf.float64, name=None))

In [9]:
val_ds.element_spec == train_ds.element_spec

True

We will be using this tuple when loading our data in the next notebook.

In [11]:
# ds = tf.data.experimental.load('train', element_spec=train_ds.element_spec)