In [1]:
import numpy as np

with open("movie-xids.npy", "rb") as f:
    Xids = np.load(f, allow_pickle=True)
with open("movie-xmask.npy", "rb") as f:
    Xmask = np.load(f, allow_pickle=True)
with open("movie-labels.npy", "rb") as f:
    labels = np.load(f, allow_pickle=True)

Because we're using TensorFlow, we can wrap these arrays in a `tf.data.Dataset` object.

In [2]:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))
dataset.take(1)

<TakeDataset shapes: ((50,), (50,), (5,)), types: (tf.int32, tf.int32, tf.float64)>
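To make the slicing behaviour concrete, here is a small NumPy-only sketch of what `from_tensor_slices` does conceptually: it pairs up row `i` of each array into one sample. The toy arrays and the `slices` helper are stand-ins invented for illustration; this is not the real data, nor how TensorFlow implements it.

```python
import numpy as np

# Toy stand-ins for the real arrays: 3 samples, sequence length 4, 5 classes.
Xids_demo = np.arange(12, dtype=np.int32).reshape(3, 4)
Xmask_demo = np.ones((3, 4), dtype=np.int32)
labels_demo = np.eye(5, dtype=np.float64)[[0, 2, 4]]


def slices(*arrays):
    """Yield one tuple per row, mimicking slicing along the first axis."""
    n = arrays[0].shape[0]
    # Every array must hold the same number of samples.
    if any(a.shape[0] != n for a in arrays):
        raise ValueError("arrays disagree on the number of samples")
    for i in range(n):
        yield tuple(a[i] for a in arrays)


first = next(slices(Xids_demo, Xmask_demo, labels_demo))
print([t.shape for t in first])  # [(4,), (4,), (5,)]
```

The per-sample shapes have the same pattern as the `((50,), (50,), (5,))` reported above, just with a shorter toy sequence length.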

Each sample in our dataset is a tuple containing a single `Xids`, `Xmask`, and `labels` tensor. However, when feeding data into our model we need a two-item tuple in the format **(\<inputs>, \<outputs>)**. Because we have two input tensors, we pass **\<inputs>** as a dictionary:


    {
        'input_ids': <input_id_tensor>,
        'attention_mask': <mask_tensor>
    }

To rearrange the dataset, we can `map` a function that modifies the format of each sample.

In [3]:
def map_func(input_ids, masks, labels):
    # we convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {"input_ids": input_ids,
            "attention_mask": masks}, labels

In [4]:
dataset = dataset.map(map_func)
dataset.take(1)

<TakeDataset shapes: ({input_ids: (50,), attention_mask: (50,)}, (5,)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>
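Since `map_func` is plain Python, its effect on one sample can be checked without TensorFlow. The sketch below calls it on NumPy stand-ins. The six-token sequence, the mask rule (non-zero ID means a real token, assuming padding ID 0), and the one-hot label are all assumptions made for illustration.

```python
import numpy as np


def map_func(input_ids, masks, labels):
    # (ids, mask, labels) -> ({"input_ids": ..., "attention_mask": ...}, labels)
    return {"input_ids": input_ids, "attention_mask": masks}, labels


# Toy sample: four real tokens followed by two padding positions (ID 0 assumed).
ids = np.array([101, 138, 1326, 102, 0, 0], dtype=np.int32)
mask = (ids != 0).astype(np.int32)  # 1 for real tokens, 0 for padding
label = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # hypothetical one-hot label

inputs, target = map_func(ids, mask, label)
print(sorted(inputs))             # ['attention_mask', 'input_ids']
print(inputs["attention_mask"])   # [1 1 1 1 0 0]
```

The result is a two-item tuple whose first item is a dictionary holding both input tensors, which is the structure shown in the dataset summary above.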

Now we can see that the sample format has changed. Next, we need to shuffle and batch our data. We will use a batch size of 16 and drop any leftover samples that don't fill a final batch of 16.

In [5]:
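batch_size = 16
# reshuffle_each_iteration=False keeps the shuffled order fixed, so a later
# take/skip split of this dataset yields non-overlapping subsets
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False).batch(
    batch_size, drop_remainder=True
)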

dataset.take(1)

<TakeDataset shapes: ({input_ids: (16, 50), attention_mask: (16, 50)}, (16, 5)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>
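The effect of `drop_remainder=True` on the batch count comes down to integer division. A minimal sketch follows; the `batch_counts` helper and the sample count of 100 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def batch_counts(n_samples, batch_size):
    """Return (full batches kept, leftover samples dropped)."""
    full_batches, leftover = divmod(n_samples, batch_size)
    return full_batches, leftover


# 100 samples in batches of 16: six full batches, four samples dropped.
print(batch_counts(100, 16))  # (6, 4)
```

Dropping the partial batch is what lets every batch report a fixed leading dimension of 16, rather than an unknown one.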

Now our dataset samples are organized into batches of 16. The final step is to split our data into training and validation sets. For this we use the `take` and `skip` methods, creating a 90-10 split. Because the dataset is already batched, both methods count batches rather than individual samples.

In [6]:
Xids[2]

array([ 101,  138, 1326,  102,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0])

In [7]:
split = 0.9
size = int((Xids.shape[0] / batch_size) * split)
size

8778

In [8]:
train_ds = dataset.take(size)
val_ds = dataset.skip(size)

len(dataset), len(train_ds), len(val_ds)

(9753, 8778, 975)
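The three lengths above follow from simple arithmetic on the batch count. The sketch below reproduces them using the same formula as the `size` cell. The sample count of 156060 is an assumption chosen because it yields these figures; the notebook itself never prints `Xids.shape[0]`. The `split_sizes` helper is likewise hypothetical.

```python
def split_sizes(n_samples, batch_size, split):
    """Return (total batches, training batches, validation batches)."""
    # Full batches that survive drop_remainder=True.
    n_batches = n_samples // batch_size
    # Same formula as the `size` cell: float division, scale, then truncate.
    train = int((n_samples / batch_size) * split)
    return n_batches, train, n_batches - train


print(split_sizes(156060, 16, 0.9))  # (9753, 8778, 975)
```

Note that the training size is derived from the unrounded quotient (9753.75 here), so it can differ by one batch from `int(len(dataset) * split)`.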

Our two datasets are now fully prepared as model inputs. We can save both to file using `tf.data.experimental.save`.

In [9]:
tf.data.experimental.save(train_ds, "train")
tf.data.experimental.save(val_ds, "val")

We will load these files later using `tf.data.experimental.load`, which requires us to pass the dataset's `element_spec`: a description of the structure, shape, and dtype of each element.

In [10]:
train_ds.element_spec

({'input_ids': TensorSpec(shape=(16, 50), dtype=tf.int32, name=None),
  'attention_mask': TensorSpec(shape=(16, 50), dtype=tf.int32, name=None)},
 TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))

In [11]:
val_ds.element_spec

({'input_ids': TensorSpec(shape=(16, 50), dtype=tf.int32, name=None),
  'attention_mask': TensorSpec(shape=(16, 50), dtype=tf.int32, name=None)},
 TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))

In [12]:
train_ds.element_spec == val_ds.element_spec

True