# Text Classification from Scratch

https://keras.io/examples/nlp/text_classification_from_scratch/

## Setup

In [None]:
import tensorflow as tf
import numpy as np

## Load the data: IMDB movie review sentiment classification

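Run these commands to download the data and extract it into the `./data` folder. The `train/unsup` folder holds unlabeled reviews, so it is removed to leave only the `pos` and `neg` classes:
```bash
$ mkdir -p data
$ curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
$ tar -xf aclImdb_v1.tar.gz -C data
$ rm -r data/aclImdb/train/unsup
```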

In [None]:
batch_size = 32

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'data/aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=2405,
)

raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'data/aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=2405,
)

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'data/aclImdb/test',
    batch_size=batch_size,
)

print(f'Number of batches in raw_train_ds: {tf.data.experimental.cardinality(raw_train_ds)}')
print(f'Number of batches in raw_val_ds: {tf.data.experimental.cardinality(raw_val_ds)}')
print(f'Number of batches in raw_test_ds: {tf.data.experimental.cardinality(raw_test_ds)}')

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
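The batch counts above follow from the file counts and the batch size. A minimal check of the arithmetic, assuming the final partial batch is kept (which matches the printed numbers); `file_counts` and `batch_counts` are names introduced here for illustration:

```python
import math

batch_size = 32

# File counts taken from the "Using ... files" / "Found ... files" messages above
file_counts = {"train": 20_000, "val": 5_000, "test": 25_000}

# Each dataset yields ceil(n / batch_size) batches; the last one may be smaller
batch_counts = {name: math.ceil(n / batch_size) for name, n in file_counts.items()}
print(batch_counts)
```

With `validation_split=0.2`, the 25,000 training files split into 20,000 for training and 5,000 for validation, which is where the first two counts come from.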


In [None]:
# Preview a few examples
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        #print(text_batch.numpy()[i].decode('latin1'))
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])
        print()

b'We\'re talking about a low budget film, and it\'s understandable that there are some weaknesses (no spoilers: one sudden explosives expert and one meaningless alcoholic); but in general the story keeps you interested, most of the characters are likable and there are some original situations. <br /><br />I really like films that surprise you with some people that are not who they want you to believe and then twist and turn the plot ... I applaud this one on that. <br /><br />If you know what I mean, try to see also "Nueve Reinas" (Nine Queens) a film from Argentina.'
1

b"I've seen this movie, when I was traveling in Brazil. I found it difficult to really understand Brazilian culture and society, because it has so many regional and class differences. To see this movie in Sao Paulo itself was a revelation. It shows something of the everyday life of many Brazilians. On the other side, it is sometimes a little bit over-dramatized. And that's the only negative comment I have on this film.

## Prepare the data

In [None]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import string
import re

In [None]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(
        stripped_html, f'[{re.escape(string.punctuation)}]', ''
    )
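To see what the three steps do, here is a plain-Python mirror of `custom_standardization` that uses only the standard library. It is a sketch for inspecting the behaviour on a single string, not the TensorFlow op itself; `standardize_py` and the sample review are hypothetical and introduced only for illustration.

```python
import re
import string


def standardize_py(text):
    """Plain-Python mirror of custom_standardization (illustration only)."""
    lowercase = text.lower()
    # Replace the literal HTML line break with a space
    stripped_html = lowercase.replace("<br />", " ")
    # Remove every ASCII punctuation character
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)


sample = "Great film!<br />Loved it."
cleaned = standardize_py(sample)
print(cleaned)
```

Note that punctuation is deleted rather than replaced with a space, so a contraction such as `it's` becomes the single token `its`.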

In [None]:
# Model constants:
max_features = 20_000
embedding_dim = 128
sequence_length = 500

In [None]:
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length,
)

In [None]:
text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
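Conceptually, `adapt` builds a frequency-ranked vocabulary, and the layer then maps each token to its index, padding or truncating to `output_sequence_length`. The sketch below imitates that in plain Python under simplifying assumptions: whitespace tokenization, index 0 reserved for padding, and index 1 reserved for out-of-vocabulary tokens. `build_vocab`, `vectorize`, and the toy data are hypothetical helpers, and tie-breaking between equally frequent tokens may differ from the real layer.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Rank tokens by frequency, keeping room for two reserved indices."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    # Index 0 is padding and index 1 is out-of-vocabulary, so real tokens start at 2
    return {token: index + 2 for index, token in enumerate(most_common)}


def vectorize(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["good film", "bad film", "good good plot"]
toy_vocab = build_vocab(toy_texts, max_tokens=4)
toy_ids = vectorize("good film twist", toy_vocab, sequence_length=5)
print(toy_vocab)
print(toy_ids)
```

Here `twist` was never seen during vocabulary building, so it maps to the out-of-vocabulary index, and the two trailing zeros are padding.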

## Two options to vectorize the data

**Option 1: Make it part of the model,** so as to obtain a model that
preprocesses raw strings, like this:

```python
text_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = tf.keras.layers.Embedding(max_features, embedding_dim)(x)
...
```

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices,
then feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on
GPU. So if you're training the model on GPU, you probably want to go with
this option to get the best performance.

If we were to export our model to production, we'd ship a model that accepts
raw strings as input, like in the code snippet for option 1 above. This can
be done after training. We do this in the last section.

In [None]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


In [None]:
# Vectorize the data
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [None]:
# Do async prefetching / buffering of the data for best performance on GPU
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
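The idea behind prefetching is that a producer prepares upcoming batches while the consumer is still busy with the current one. The sketch below illustrates that pattern with a background thread and a bounded queue. It is a conceptual illustration only, not how `tf.data` implements `prefetch`; the `prefetch` function and the toy generator are hypothetical.

```python
import queue
import threading


def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the producer has an item ready
        if item is sentinel:
            return
        yield item


# Toy "preprocessing" pipeline: squares are computed in the background thread
batches = list(prefetch((i * i for i in range(5)), buffer_size=2))
print(batches)
```

The `buffer_size` bound plays the same role as `buffer_size=10` above: it caps how far ahead the producer may run, which limits memory use.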

## Build a model

We use a simple 1D convnet whose first layer is an `Embedding` layer.

In [None]:
from tensorflow.keras import layers

In [None]:
# An integer input for vocab indices
inputs = tf.keras.Input(shape=(None,), dtype='int64')

In [None]:
# Next, we add a layer to map those vocab indices into a space of
# dimensionality 'embedding_dim'
x = layers.Embedding(max_features, embedding_dim)(inputs)
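An embedding layer is essentially a trainable lookup table: each integer index selects one row of a `(vocabulary size, embedding_dim)` matrix. The NumPy sketch below shows the resulting shapes with toy sizes. The random table stands in for learned weights, and `toy_vocab_size`, `toy_embedding_dim`, `embedding_table`, and `token_ids` are hypothetical names chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the notebook uses max_features = 20_000 and embedding_dim = 128
toy_vocab_size, toy_embedding_dim = 10, 4

# One row per vocabulary index (random values stand in for learned weights)
embedding_table = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

# A batch of 2 padded sequences of length 5 (0 is the padding index)
token_ids = np.array([[2, 3, 1, 0, 0], [5, 0, 0, 0, 0]])

# Integer-array indexing performs the lookup for every position at once
embedded = embedding_table[token_ids]
print(embedded.shape)
```

The output has shape `(batch, sequence_length, embedding_dim)`, which is the kind of input a 1D convolution over the sequence axis expects.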

# Deep Learning with Python, 3e

In [1]:
from keras.datasets import mnist

In [2]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
