# Amazon 2 Label Glove TFRecord

This notebook converts the Amazon reviews dataset into TFRecord files. The dataset is preprocessed before it is saved; the preprocessing steps are listed below.

Text Preprocessing

*   Convert to lower case
*   Tokenize with the default NLTK tokenizer
*   Convert tokens to IDs using the GloVe vocabulary

Label Preprocessing

*   A rating of 5 becomes label 1 (positive) and a rating of 1 becomes label 0 (negative)
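
The steps above can be sketched in plain Python. This is only an illustration: the regular-expression tokenizer stands in for the NLTK tokenizer, and the toy ```VOCAB``` table and the helper names are hypothetical, not part of the notebook.

```python
import re

# Toy vocabulary standing in for the GloVe vocabulary (hypothetical values).
VOCAB = {'<PAD>': 0, '<UNK>': 1, 'a': 2, 'great': 3, 'book': 4, '!': 5}


def preprocess_text(text, vocab, unk_id=1):
    """Lower-case the text, split it into tokens, and map tokens to IDs."""
    # Simple stand-in tokenizer: runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Tokens missing from the vocabulary fall back to the <UNK> ID.
    return [vocab.get(token, unk_id) for token in tokens]


def preprocess_label(rating):
    """Map a rating of 5 to 1 (positive) and any other rating to 0."""
    return int(rating == 5)


print(preprocess_text('A GREAT book!', VOCAB))   # [2, 3, 4, 5]
print(preprocess_label(5), preprocess_label(1))  # 1 0
```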

## Import all required modules

The ```amazon_review``` module is the utility Python script used for reading the Amazon reviews dataset.

In [1]:
import os
import amazon_review
import tensorflow as tf
import kotoba as kt
import narau as nr
from sklearn.model_selection import train_test_split

## Define data related information as constants

*   ```SRC_*```: Amazon reviews dataset export configuration
*   ```DST_*```: TFRecord save location
*   ```EMBEDDING_FILE```: Location of the GloVe embedding file

In [2]:
SRC_DATA_PATH = os.path.expanduser('~/Documents/data/reviews_Books_5.json.gz')
SRC_LABELS = (1, 5)
SRC_COUNT = 50000
SRC_MIN_LENGTH = 1
SRC_MAX_LENGTH = 1000
SRC_TEST_SPLIT = 0.1
SRC_DEV_SPLIT = 0.1

DST_DATA_PATH = os.path.join('data', '2_label', 'glove.6B')
DST_TRAIN_PATH = os.path.join(DST_DATA_PATH, 'train.tfrecord')
DST_DEV_PATH = os.path.join(DST_DATA_PATH, 'dev.tfrecord')
DST_TEST_PATH = os.path.join(DST_DATA_PATH, 'test.tfrecord')

EMBEDDING_FILE = os.path.expanduser('~/Documents/data/glove.6B/glove.6B.100d.txt')

## Reading the dataset
Call the ```read``` function from the ```amazon_review``` module to read the Amazon review file. The reviews are filtered by length, and the requested count is divided evenly among the labels.

In [3]:
count_per_label = SRC_COUNT // len(SRC_LABELS)
filter_length = lambda x: amazon_review.length_between(x, SRC_MIN_LENGTH, SRC_MAX_LENGTH)
data = amazon_review.read(SRC_DATA_PATH, SRC_LABELS, count_per_label, filter_length)
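
The ```amazon_review``` module itself is not shown here. As a rough sketch of what such a reader might do, the code below assumes the file holds one JSON object per line and that length is measured in whitespace-separated words; both are assumptions, and ```read_reviews``` and this ```length_between``` are hypothetical stand-ins rather than the module's real functions.

```python
import gzip
import json


def length_between(review, min_length, max_length):
    """Assumed meaning: the word count of the review lies in [min, max]."""
    return min_length <= len(review['reviewText'].split()) <= max_length


def read_reviews(path, labels, count_per_label, keep):
    """Collect up to count_per_label reviews for each rating in labels."""
    remaining = {label: count_per_label for label in labels}
    reviews = []
    with gzip.open(path, 'rt', encoding='utf-8') as src:
        for line in src:
            review = json.loads(line)
            rating = review['overall']  # e.g. 5.0, which matches the key 5
            if remaining.get(rating, 0) > 0 and keep(review):
                reviews.append(review)
                remaining[rating] -= 1
            # Stop early once every label has reached its quota.
            if not any(remaining.values()):
                break
    return reviews
```

Capping each label separately keeps the two classes balanced, which is what dividing ```SRC_COUNT``` by the number of labels suggests.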

## Define a pipeline to process the text

Kotoba is used to declare pipelines. The GloVe file is loaded first, and the resulting embedding is passed to the last stage of the pipeline, which converts tokens to IDs.

In [4]:
embedding = kt.embedding.Embedding.from_glove_file(EMBEDDING_FILE, ['<PAD>', '<UNK>'], 1)

text_pipeline = kt.Pipeline([
    kt.LowerCase(),
    kt.tokenizer.NLTKTokenizer(),
    kt.embedding.EmbedTokenToID(embedding),
])
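
For readers unfamiliar with the GloVe text format, each line holds a word followed by its space-separated vector values. The sketch below builds a token-to-ID table from such lines. It is not kotoba's implementation: ```load_glove``` is a hypothetical helper, the zero vectors for the special tokens are one common choice, and treating index ```1``` as the fallback ID for unknown tokens is an assumption about the third argument above.

```python
import io


def load_glove(lines, special_tokens):
    """Build a token-to-ID table and a vector list from GloVe-formatted lines."""
    token_to_id = {token: idx for idx, token in enumerate(special_tokens)}
    vectors = []
    for line in lines:
        word, *values = line.rstrip('\n').split(' ')
        # Pretrained words are numbered after the special tokens.
        token_to_id[word] = len(special_tokens) + len(vectors)
        vectors.append([float(value) for value in values])
    dim = len(vectors[0]) if vectors else 0
    # Special tokens have no pretrained vector; zeros are used here.
    vectors = [[0.0] * dim for _ in special_tokens] + vectors
    return token_to_id, vectors


# Two-dimensional toy embedding in the same layout as a GloVe file.
sample = io.StringIO('the 0.1 0.2\nbook 0.3 0.4\n')
token_to_id, vectors = load_glove(sample, ['<PAD>', '<UNK>'])

unk_id = token_to_id['<UNK>']
ids = [token_to_id.get(token, unk_id) for token in ['the', 'novel', 'book']]
print(ids)  # [2, 1, 3]
```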

## Define a pipeline to process the labels

Since kotoba has no predefined preprocessor for these labels, a custom function is passed to ```kt.MapItems``` to build the pipeline. It maps a rating of 5 to 1 and any other rating to 0.

In [5]:
label_pipeline = kt.Pipeline([
    kt.MapItems(lambda x: int(x==5))
])

## Combine and complete the pipeline

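To make the processing easier, a single pipeline from the raw review records up to the desired output is created. It extracts the review text and rating from each record, transposes the pairs into one list of texts and one list of ratings, and then applies the text and label pipelines created earlier to their respective lists.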

In [6]:
data_pipeline = kt.Pipeline([
    kt.MapItems(lambda x: (x['reviewText'],
                           x['overall'])),
    kt.Transpose2D(),
    kt.HorizontalPipeline([
        text_pipeline,
        label_pipeline,
    ])
])
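
The data flow of this combined pipeline can be illustrated without kotoba. In the sketch below, ```transpose_2d``` and ```horizontal``` are hypothetical plain-Python stand-ins that mirror what ```kt.Transpose2D``` and ```kt.HorizontalPipeline``` appear to do; the library's actual behavior may differ.

```python
def transpose_2d(rows):
    """Turn a list of (text, rating) rows into [texts, ratings] columns."""
    return [list(column) for column in zip(*rows)]


def horizontal(columns, transforms):
    """Apply the i-th transform to the i-th column."""
    return [transform(column) for column, transform in zip(columns, transforms)]


records = [
    {'reviewText': 'Loved it', 'overall': 5.0},
    {'reviewText': 'Awful', 'overall': 1.0},
]

# Step 1: keep only the fields of interest from each record.
rows = [(record['reviewText'], record['overall']) for record in records]

# Steps 2 and 3: transpose into columns, then process each column separately.
texts, labels = horizontal(
    transpose_2d(rows),
    [
        lambda column: [text.lower() for text in column],
        lambda column: [int(rating == 5) for rating in column],
    ],
)
print(texts)   # ['loved it', 'awful']
print(labels)  # [1, 0]
```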

## Transform the data with the pipeline
Call the ```transform``` method of ```data_pipeline``` to process the data read from the Amazon reviews.

In [7]:
ids, labels = data_pipeline.transform(data)

## Divide the dataset to different sets
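The dataset is divided into training, development, and testing sets using the ```train_test_split``` function from scikit-learn. Since the function can only split a dataset into two parts, it is called twice. The second call operates on the data that remains after the test set is removed, so the development fraction is rescaled to ```SRC_DEV_SPLIT/(1-SRC_TEST_SPLIT)``` to keep the development set at ```SRC_DEV_SPLIT``` of the whole dataset.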

In [8]:
train_ids, test_ids, train_labels, test_labels = train_test_split(ids, labels, test_size=SRC_TEST_SPLIT, 
                                                                  random_state=0, stratify=labels)
train_ids, dev_ids, train_labels, dev_labels = train_test_split(train_ids, train_labels, 
                                                                test_size=SRC_DEV_SPLIT/(1-SRC_TEST_SPLIT),
                                                                random_state=0, stratify=train_labels)
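
The rescaling of the development fraction can be checked with a little arithmetic. The sketch below uses a hypothetical ```split_sizes``` helper and assumes the full ```SRC_COUNT``` of 50000 reviews was read; scikit-learn applies its own rounding, so the real set sizes may differ by one.

```python
def split_sizes(total, test_split, dev_split):
    """Return approximate (train, dev, test) sizes for a two-step split."""
    n_test = round(total * test_split)
    remaining = total - n_test
    # The dev fraction is relative to what is left after removing the test set.
    dev_fraction = dev_split / (1 - test_split)
    n_dev = round(remaining * dev_fraction)
    return remaining - n_dev, n_dev, n_test


print(split_sizes(50000, 0.1, 0.1))  # (40000, 5000, 5000)
```

Without the rescaling, the second split would take 10% of the remaining 45000 reviews, giving a development set of 4500 instead of 5000.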

## Sort the training set based on length
Bucketing groups sequences of similar length into the same batch, which reduces padding and speeds up NLP training. To make this easy, the training set is sorted by sequence length here. Sorting ahead of time gives finer control over the buckets than TensorFlow's bucketing functions.

Reference:
*   Bucket Sequence by Length: https://www.tensorflow.org/api_docs/python/tf/contrib/data/bucket_by_sequence_length

In [9]:
train_indices = list(range(len(train_ids)))
train_indices.sort(key=lambda x: len(train_ids[x]))
train_ids = [train_ids[idx] for idx in train_indices]
train_labels = [train_labels[idx] for idx in train_indices]
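
To show why sorting by length helps, the sketch below cuts a length-sorted dataset into consecutive batches and pads each batch only up to its own longest sequence. The ```batches_by_length``` helper is hypothetical, and using ```0``` as the padding ID assumes ```<PAD>``` is the first special token in the vocabulary.

```python
def batches_by_length(sequences, labels, batch_size, pad_id=0):
    """Yield (padded_sequences, labels) batches of similar-length sequences."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        # The chunk is sorted, so its last item is the longest one.
        longest = len(sequences[chunk[-1]])
        padded = [
            sequences[i] + [pad_id] * (longest - len(sequences[i]))
            for i in chunk
        ]
        yield padded, [labels[i] for i in chunk]


ids = [[5, 6, 7], [8], [9, 10], [11, 12, 13, 14]]
labels = [1, 0, 1, 0]
for batch in batches_by_length(ids, labels, batch_size=2):
    print(batch)
# ([[8, 0], [9, 10]], [0, 1])
# ([[5, 6, 7, 0], [11, 12, 13, 14]], [1, 0])
```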

## Data to TFRecord

To convert the ids and labels to TFRecords, the ```narau``` helpers are used.

A TFRecord file stores a sequence of serialized records, typically protocol buffer messages such as ```SequenceExample```. The references below are useful for studying the format.

*   Protocol Buffers: https://developers.google.com/protocol-buffers/
*   Sequence Example: https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample
*   Tensorflow Examples Code: https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/core/example/example.proto

In [10]:
def convert_to_tfrecord(text, score, dst_path):
    ex = nr.example.SequenceExample(
        nr.example.FeatureLists({
            'text': nr.example.Int64FeatureList(text),
            'score': nr.example.FloatFeatureList(score),
        })
    )
    nr.example.save_example(ex, dst_path)

## Conversion to TFRecord

Call ```convert_to_tfrecord``` to write each dataset split to its own TFRecord file.

In [11]:
convert_to_tfrecord(train_ids, train_labels, DST_TRAIN_PATH)
convert_to_tfrecord(test_ids, test_labels, DST_TEST_PATH)
convert_to_tfrecord(dev_ids, dev_labels, DST_DEV_PATH)