# Text classification with preprocessed text: Movie reviews
## TensorFlow Tutorial
In this tutorial I will use TensorFlow and Keras to perform binary classification: labeling movie reviews as positive or negative based on the text of the review.

### Data
I'll use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. Both sets are balanced, meaning each contains an equal number of positive and negative reviews.

In [4]:
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

print(tf.__version__)

1.14.0


### Downloading the Data
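The IMDB movie reviews dataset comes packaged in `tfds`. It has already been preprocessed so that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific entry in a dictionary. In the `subwords8k` version used here, those entries are subwords (whole words or pieces of words) drawn from a vocabulary of roughly 8,000 items.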

In [5]:
(train_data, test_data), info = tfds.load(
    # Use the version pre-encoded with an ~8k subword vocabulary.
    'imdb_reviews/subwords8k',
    # Return the train/test datasets as a tuple.
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    # Return (example, label) pairs from the dataset (instead of a dictionary).
    as_supervised=True,
    # Also return the `info` structure.
    with_info=True)

Downloading and preparing dataset imdb_reviews (80.23 MiB) to C:\Users\deand\tensorflow_datasets\imdb_reviews\subwords8k\0.1.0...
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


Dataset imdb_reviews downloaded and prepared to C:\Users\deand\tensorflow_datasets\imdb_reviews\subwords8k\0.1.0. Subsequent calls will reuse this data.
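
To make the idea of "reviews as sequences of integers" concrete, here is a minimal sketch in plain Python. It is only an illustration: `toy_vocab`, `word_to_id` and `encode_review` are hypothetical names I made up, the vocabulary is five hand-picked words, and it works on whole words for simplicity, whereas the dataset above uses subwords.

```python
# Toy illustration only: a hypothetical five-word vocabulary,
# not the vocabulary shipped with the IMDB dataset.
toy_vocab = ["the", "movie", "was", "great", "boring"]

# Map each word to an integer id. Ids start at 1 here so that 0 stays
# free (for example for padding); that is a choice made for this sketch.
word_to_id = {word: idx + 1 for idx, word in enumerate(toy_vocab)}


def encode_review(text):
    """Turn a review into a list of integer ids, one per known word."""
    return [word_to_id[word] for word in text.lower().split()]


encoded_review = encode_review("The movie was great")
print(encoded_review)
```

A word that is missing from `toy_vocab` would raise a `KeyError` in this sketch. Subword vocabularies largely avoid that problem, because an unknown word can still be split into smaller known pieces.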




### The encoder
The dataset info includes the text encoder (`tfds.features.text.SubwordTextEncoder`).

In [6]:
encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))

Vocabulary size: 8185


The text encoder will reversibly encode any string:

In [8]:
sample_string = 'TensorFlow is great'

# Encode the string into a list of subword ids.
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

# Decode the ids back into text.
original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))

# The round trip should reproduce the input exactly.
assert original_string == sample_string

Encoded string is [6307, 2327, 4043, 4265, 9, 526]
The original string: "TensorFlow is great"
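
To give some intuition for why a subword encoding can be reversible for any string, here is a small self-contained sketch. It is not the algorithm used by `SubwordTextEncoder`; it is a toy greedy longest-match scheme that I am assuming for illustration, with a hand-written vocabulary (`TOY_SUBWORDS`) and a per-character fallback. All names in it are hypothetical.

```python
# Hypothetical toy vocabulary; a real encoder learns its subwords from the corpus.
TOY_SUBWORDS = ["Tensor", "Flow", " is", " great", "Ten", "sor"]

# Ids 1..len(TOY_SUBWORDS) are subwords and 0 is left unused.
# Anything at or above CHAR_OFFSET encodes a single character.
CHAR_OFFSET = len(TOY_SUBWORDS) + 1


def toy_encode(text):
    """Greedy longest-match subword encoding with a per-character fallback."""
    ids = []
    pos = 0
    # Try longer subwords first so that "Tensor" wins over "Ten".
    candidates = sorted(TOY_SUBWORDS, key=len, reverse=True)
    while pos < len(text):
        for piece in candidates:
            if text.startswith(piece, pos):
                ids.append(TOY_SUBWORDS.index(piece) + 1)
                pos += len(piece)
                break
        else:
            # No subword matched: fall back to the character's code point.
            ids.append(CHAR_OFFSET + ord(text[pos]))
            pos += 1
    return ids


def toy_decode(ids):
    """Invert toy_encode by concatenating the piece behind each id."""
    pieces = []
    for token_id in ids:
        if token_id < CHAR_OFFSET:
            pieces.append(TOY_SUBWORDS[token_id - 1])
        else:
            pieces.append(chr(token_id - CHAR_OFFSET))
    return "".join(pieces)


sample = "TensorFlow is great!"
toy_ids = toy_encode(sample)
print(toy_ids)
print(toy_decode(toy_ids))
```

Because every character either belongs to a matched subword or gets its own fallback id, decoding always reconstructs the input in this sketch, including text the vocabulary has never seen.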
