# SNLI GloVe TFRecord

This code converts the SNLI dataset into TFRecord files. The dataset is preprocessed before it is saved; the preprocessing steps are listed below.

Text Preprocessing

*   Convert to lower case
*   Tokenize with the default NLTK tokenizer
*   Convert tokens to IDs using the GloVe embedding vocabulary

Label Preprocessing

*   Entailment is mapped to 1; every other label is mapped to 0
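
The steps above can be sketched in plain Python. This is only an illustration, not the kotoba implementation: whitespace splitting stands in for the NLTK tokenizer, and the toy vocabulary, the function names, and the choice of ID 1 for unknown tokens are assumptions made for the example.

```python
def preprocess_text(sentence, vocab, unk_id=1):
    """Lower-case a sentence, tokenize it, and map each token to an integer ID."""
    # Whitespace splitting is a stand-in for the NLTK tokenizer (assumption).
    tokens = sentence.lower().split()
    # Tokens missing from the vocabulary fall back to the unknown-token ID.
    return [vocab.get(token, unk_id) for token in tokens]


def preprocess_label(label):
    """Map 'entailment' to 1 and any other label to 0."""
    return int(label == 'entailment')


# Toy vocabulary (hypothetical); real IDs would come from the GloVe file.
toy_vocab = {'<PAD>': 0, '<UNK>': 1, 'a': 2, 'dog': 3, 'runs': 4}

ids = preprocess_text('A dog RUNS fast', toy_vocab)
labels = [preprocess_label(l) for l in ('entailment', 'contradiction')]
```

Here `fast` is not in the toy vocabulary, so it is mapped to the unknown-token ID.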

## Import all required modules

```snli``` is the utility Python script used for reading the SNLI dataset.

In [1]:
import os
import snli
import tensorflow as tf
import kotoba as kt
import narau as nr
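
The `snli` script itself is not shown in this notebook. As a rough idea of what reading an SNLI jsonl file involves, here is a minimal sketch using only the standard library. The function name `read_snli_jsonl` and its `keep_labels` parameter are hypothetical, and the sample records are toy data; the field names `gold_label`, `sentence1`, and `sentence2` are the ones used later in this notebook.

```python
import json
import os
import tempfile


def read_snli_jsonl(path, keep_labels=None):
    """Read one JSON object per line, optionally keeping only some gold labels."""
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if keep_labels is None or record['gold_label'] in keep_labels:
                rows.append(record)
    return rows


# Toy records; SNLI marks pairs without annotator consensus with the label '-'.
sample = [
    {'gold_label': 'entailment', 'sentence1': 'A dog runs.', 'sentence2': 'An animal moves.'},
    {'gold_label': '-', 'sentence1': 'A man sits.', 'sentence2': 'A man waits.'},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.jsonl')
    with open(sample_path, 'w', encoding='utf-8') as f:
        for record in sample:
            f.write(json.dumps(record) + '\n')
    rows = read_snli_jsonl(sample_path, keep_labels=('contradiction', 'entailment'))
```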

## Define all paths as constants

*   ```SRC_*```: SNLI dataset jsonl files
*   ```DST_*```: TFRecord save location
*   ```EMBEDDING_FILE```: Location of the GloVe embedding file

In [2]:
SRC_DATA_PATH = os.path.expanduser('~/Documents/data/snli_1.0')
SRC_TRAIN_PATH = os.path.join(SRC_DATA_PATH, 'snli_1.0_train.jsonl')
SRC_DEV_PATH = os.path.join(SRC_DATA_PATH, 'snli_1.0_dev.jsonl')
SRC_TEST_PATH = os.path.join(SRC_DATA_PATH, 'snli_1.0_test.jsonl')

DST_DATA_PATH = os.path.join('data', 'glove.6B')
DST_TRAIN_PATH = os.path.join(DST_DATA_PATH, 'train.tfrecord')
DST_DEV_PATH = os.path.join(DST_DATA_PATH, 'dev.tfrecord')
DST_TEST_PATH = os.path.join(DST_DATA_PATH, 'test.tfrecord')

EMBEDDING_FILE = os.path.expanduser('~/Documents/data/glove.6B/glove.6B.100d.txt')

## Define a pipeline to process the text

Kotoba is used to declare pipelines. The GloVe file is loaded first, and then the pipeline is created. The loaded embedding is passed to the last stage of the pipeline, which maps tokens to IDs.

In [7]:
embedding = kt.embedding.Embedding.from_glove_file(EMBEDDING_FILE, ['<PAD>', '<UNK>'], 1)

text_pipeline = kt.Pipeline([
    kt.LowerCase(),
    kt.tokenizer.NLTKTokenizer(),
    kt.embedding.EmbedTokenToID(embedding),
])
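
For readers unfamiliar with the GloVe text format, each line holds a token followed by its vector components, separated by spaces. The sketch below builds a token-to-ID vocabulary from such lines with the special tokens placed first. It is not kotoba's loader: the function name `load_glove`, the zero vectors for the special tokens, and the toy two-dimensional input are assumptions for illustration.

```python
import io


def load_glove(lines, special_tokens=('<PAD>', '<UNK>')):
    """Build (vocab, vectors) from GloVe-style lines: 'token v1 v2 ... vN'."""
    # Special tokens take the first IDs (assumed layout).
    vocab = {token: i for i, token in enumerate(special_tokens)}
    vectors = []
    dim = None
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        token, values = parts[0], [float(v) for v in parts[1:]]
        if dim is None:
            # The vector size is known after the first line; give the
            # special tokens zero vectors of that size (assumption).
            dim = len(values)
            vectors = [[0.0] * dim for _ in special_tokens]
        vocab[token] = len(vectors)
        vectors.append(values)
    return vocab, vectors


# Toy stand-in for a GloVe file with 2-dimensional vectors.
toy_file = io.StringIO('the 0.1 0.2\ndog 0.3 0.4\n')
vocab, vectors = load_glove(toy_file)
```

With this layout, row `i` of `vectors` is the embedding for the token whose ID is `i`, so the list can be used directly as an embedding matrix.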

## Define a pipeline to process the labels

Since there is no predefined kotoba preprocessor for our labels, a custom function is passed to ```kt.MapItems``` to build the label pipeline.

In [None]:
label_pipeline = kt.Pipeline([
    kt.MapItems(lambda x: int(x == 'entailment'))
])

## Combine and complete the pipeline

To make it easier, a single pipeline is created that goes from the file name to the desired output. It reuses the text and label pipelines defined above.

In [None]:
data_pipeline = kt.Pipeline([
    kt.MapItems(lambda x: snli.read_file(x, label_filters=('contradiction', 'entailment'))),
    kt.MapItems(lambda x: (x['sentence1'],
                           x['sentence2'],
                           x['gold_label'])),
    kt.Transpose2D(),
    kt.HorizontalPipeline([
        text_pipeline,
        text_pipeline,
        label_pipeline,
    ])
])
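
The transpose step turns a list of (sentence1, sentence2, label) rows into three columns, and the horizontal stage then applies one sub-pipeline per column. The plain-Python sketch below shows that idea. The helper names are hypothetical, and the behaviour of `kt.Transpose2D` and `kt.HorizontalPipeline` is inferred from their names and usage here, not taken from kotoba's documentation.

```python
def transpose_2d(rows):
    """Turn a list of rows into a list of columns."""
    return [list(column) for column in zip(*rows)]


def apply_horizontal(columns, functions):
    """Apply the i-th function to the i-th column."""
    return [fn(column) for fn, column in zip(functions, columns)]


def lower_all(sentences):
    # Simplified stand-in for the text pipeline.
    return [s.lower() for s in sentences]


def to_binary(labels):
    # Same rule as the label pipeline: entailment -> 1, otherwise 0.
    return [int(label == 'entailment') for label in labels]


# Toy rows in the (sentence1, sentence2, gold_label) order used above.
rows = [
    ('A dog runs', 'An animal moves', 'entailment'),
    ('A dog runs', 'A cat sleeps', 'contradiction'),
]

columns = transpose_2d(rows)
x1, x2, y = apply_horizontal(columns, [lower_all, lower_all, to_binary])
```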

## Pipeline to TFRecord

The pipeline provides ```x1```, ```x2```, and ```y``` for training the network. To convert these to TFRecords, the ```narau``` helpers are used.

TFRecords are just protobufs with a defined format. Below are references that one can use to study them.

*   Protocol Buffers: https://developers.google.com/protocol-buffers/
*   Sequence Example: https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample
*   TensorFlow Example proto definition: https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/core/example/example.proto

In [None]:
def convert_to_tfrecord(src_path, dst_path):
    x1, x2, y = data_pipeline.transform(src_path)
    ex = nr.example.SequenceExample(
        nr.example.FeatureLists({
            'x1': nr.example.Int64FeatureList(x1),
            'x2': nr.example.Int64FeatureList(x2),
            'y': nr.example.FloatFeatureList(y),
        })
    )
    nr.example.save_example(ex, dst_path)
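
To make the protobuf structure concrete, the sketch below builds a comparable `SequenceExample` directly with the `tf.train` protobuf classes and round-trips it through serialization. It is not how `narau` works internally, which is not shown here. The helper names are hypothetical, and the layout (one `Feature` per sentence, one single-value `Feature` per label) is an assumption chosen to mirror the feature-list names used above.

```python
import tensorflow as tf


def int64_feature_list(sequences):
    """One Feature per sequence of integer token IDs."""
    return tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=seq))
        for seq in sequences
    ])


def float_feature_list(values):
    """One single-value Feature per label."""
    return tf.train.FeatureList(feature=[
        tf.train.Feature(float_list=tf.train.FloatList(value=[float(v)]))
        for v in values
    ])


def build_sequence_example(x1, x2, y):
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={
            'x1': int64_feature_list(x1),
            'x2': int64_feature_list(x2),
            'y': float_feature_list(y),
        })
    )


# Toy token IDs and labels.
example = build_sequence_example([[2, 3, 4], [5, 6]], [[7, 8], [9]], [1, 0])
serialized = example.SerializeToString()

# Parsing the bytes back gives an equivalent message.
restored = tf.train.SequenceExample.FromString(serialized)
```

The `serialized` bytes are what a TFRecord writer such as `tf.io.TFRecordWriter` (`tf.python_io.TFRecordWriter` in older TensorFlow 1.x releases) stores as one record.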

## Processing the files

With the pipelines and the conversion function defined, ```convert_to_tfrecord``` can be called on each split to transform the data into the TFRecord format.

In [None]:
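# Make sure the output directory exists before writing the TFRecord files
os.makedirs(DST_DATA_PATH, exist_ok=True)
convert_to_tfrecord(SRC_TRAIN_PATH, DST_TRAIN_PATH)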
convert_to_tfrecord(SRC_TEST_PATH, DST_TEST_PATH)
convert_to_tfrecord(SRC_DEV_PATH, DST_DEV_PATH)