In [57]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Input
from tensorflow.keras import Model

# Text Processing

Recurrent architectures (like the LSTM and GRU) are often used to process sequences of text. But neural networks can't understand strings; they need numbers. So how do we turn words into numbers?

## Standardization

We often want to change our string to make its characters more "standard". For example, we can make everything lowercase so that `"The"` and `"the"` aren't counted as different tokens, or strip accents so that `"zoe"` and `"zöe"` aren't either. We may also want to remove punctuation (unless we want to count punctuation marks as their own tokens).
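
To make the idea concrete, here is a minimal pure-Python sketch of standardization. The function name `standardize` and the choice to strip accents with `unicodedata` are assumptions made for illustration; Keras' built-in `"lower_and_strip_punctuation"` option lowercases and removes punctuation but is not claimed to do the accent step.

```python
import string
import unicodedata


def standardize(text: str) -> str:
    """Lowercase, strip accents, and remove ASCII punctuation."""
    text = text.lower()
    # Decompose accented characters (e.g. "ö" becomes "o" plus a combining
    # mark), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


print(standardize("The"))
print(standardize("Zöe!"))
```

After this step `"The"` and `"the"` are the same string, so they will end up as the same token.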

## Tokenization

Then we need to break our now-standardized sequence down into tokens. Tokens can be characters (like in our Pride and Prejudice LSTM from last classwork), words (most common), n-grams, or even parts of words!
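
As a rough sketch of the different token types, the helpers below (hypothetical names, plain Python, splitting on whitespace only) produce character tokens, word tokens, and word n-grams from an already-standardized line.

```python
def char_tokens(text):
    """Every character is its own token."""
    return list(text)


def word_tokens(text):
    """Split on whitespace so every word is a token."""
    return text.split()


def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one n-gram token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


line = "fox in socks"
words = word_tokens(line)
print(char_tokens("fox"))
print(words)
print(ngrams(words, 2))
```

Sub-word tokenizers need a learned vocabulary, so they are left out of this sketch.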


Let's try to process this `text` by hand, then with a `TextVectorization()` layer from keras.

In [58]:
text = '''Fox
Socks
Box
Knox

Knox in box.
Fox in socks.

Knox on fox in socks in box.

Socks on Knox and Knox in box.

Fox in socks on box on Knox.

Chicks with bricks come.
Chicks with blocks come.
Chicks with bricks and blocks and clocks come.

Look, sir.  Look, sir.  Mr. Knox, sir.
Let's do tricks with bricks and blocks, sir.
Let's do tricks with chicks and clocks, sir.

First, I'll make a quick trick brick stack.
Then I'll make a quick trick block stack.

You can make a quick trick chick stack.
You can make a quick trick clock stack.

And here's a new trick, Mr. Knox....
Socks on chicks and chicks on fox.
Fox on clocks on bricks and blocks.
Bricks and blocks on Knox on box.

Now we come to ticks and tocks, sir.
Try to say this Mr. Knox, sir....

Clocks on fox tick.
Clocks on Knox tock.
Six sick bricks tick.
Six sick chicks tock.

Please, sir.  I don't like this trick, sir.
My tongue isn't quick or slick, sir.
I get all those ticks and clocks, sir, 
mixed up with the chicks and tocks, sir.
I can't do it, Mr. Fox, sir.

I'm so sorry, Mr. Knox, sir.

Here's an easy game to play.
Here's an easy thing to say....

New socks.
Two socks.
Whose socks?
Sue's socks.

Who sews whose socks?
Sue sews Sue's socks.

Who sees who sew whose new socks, sir?
You see Sue sew Sue's new socks, sir.

That's not easy, Mr. Fox, sir.

Who comes? ...
Crow comes.
Slow Joe Crow comes.

Who sews crow's clothes?
Sue sews crow's clothes.
Slow Joe Crow sews whose clothes?
Sue's clothes.

Sue sews socks of fox in socks now.

Slow Joe Crow sews Knox in box now.

Sue sews rose on Slow Joe Crow's clothes.
Fox sews hose on Slow Joe Crow's nose.

Hose goes.
Rose grows.
Nose hose goes some.
Crow's rose grows some.

Mr. Fox!
I hate this game, sir.
This game makes my tongue quite lame, sir.

Mr. Knox, sir, what a shame, sir.

We'll find something new to do now.
Here is lots of new blue goo now.
New goo.  Blue goo.
Gooey.  Gooey.
Blue goo.  New goo.
Gluey. Gluey.

Gooey goo for chewy chewing!
That's what that Goo-Goose is doing.
Do you choose to chew goo, too, sir?
If, sir, you, sir, choose to chew, sir, 
with the Goo-Goose, chew, sir.
Do, sir.

Mr. Fox, sir, 
I won't do it.  
I can't say.  
I won't chew it.

Very well, sir.
Step this way.
We'll find another game to play.

Bim comes.
Ben comes.
Bim brings Ben broom.
Ben brings Bim broom.

Ben bends Bim's broom.
Bim bends Ben's broom.
Bim's bends.
Ben's bends.
Ben's bent broom breaks.
Bim's bent broom breaks.

Ben's band.  Bim's band.
Big bands.  Pig bands.

Bim and Ben lead bands with brooms.
Ben's band bangs and Bim's band booms.

Pig band!  Boom band!
Big band!  Broom band!
My poor mouth can't say that.  No, sir.
My poor mouth is much too slow, sir.

Well then... bring your mouth this way.
I'll find it something it can say.

Luke Luck likes lakes.
Luke's duck likes lakes.
Luke Luck licks lakes.
Luck's duck licks lakes.

Duck takes licks in lakes Luke Luck likes.
Luke Luck takes licks in lakes duck likes.

I can't blab such blibber blubber!
My tongue isn't make of rubber.

Mr. Knox.  Now come now.  Come now.
You don't have to be so dumb now....

Try to say this, Mr. Knox, please....

Through three cheese trees three free fleas flew.
While these fleas flew, freezy breeze blew.
Freezy breeze made these three trees freeze.
Freezy trees made these trees' cheese freeze.
That's what made these three free fleas sneeze.

Stop it!  Stop it!
That's enough, sir.
I can't say such silly stuff, sir.

Very well, then, Mr. Knox, sir.

Let's have a little talk about tweetle beetles....

What do you know about tweetle beetles?  Well...

When tweetle beetles fight, 
it's called a tweetle beetle battle.

And when they battle in a puddle, 
it's a tweetle beetle puddle battle.

AND when tweetle beetles battle with paddles in a puddle, 
they call it a tweetle beetle puddle paddle battle.

AND...

When beetles battle beetles in a puddle paddle battle 
and the beetle battle puddle is a puddle in a bottle...
...they call this a tweetle beetle bottle puddle paddle battle muddle.

AND...

When beetles fight these battles in a bottle with their paddles 
and the bottle's on a poodle and the poodle's eating noodles...
...they call this a muddle puddle tweetle poodle beetle noodle 
bottle paddle battle.

AND...

Now wait a minute, Mr. Socks Fox!

When a fox is in the bottle where the tweetle beetles battle 
with their paddles in a puddle on a noodle-eating poodle, 
THIS is what they call...

...a tweetle beetle noodle poodle bottled paddled 
muddled duddled fuddled wuddled fox in socks, sir!

Fox in socks, our game is done, sir.
Thank you for a lot of fun, sir.
'''

In [64]:
# standardize 

import re
import string
from tensorflow.keras.layers import TextVectorization

# put text into list of lines
text_list = text.split("\n")
text_list = [sub for sub in text_list if len(sub) > 0]

# make everything lowercase
text_list = tf.strings.lower(text_list)

# replace all punctuation with nothing
text_list = tf.strings.regex_replace(text_list,
            f"[{re.escape(string.punctuation)}]", "")

# split into word level tokens
text_list = tf.strings.split(text_list)

# get vocabulary
vocab = np.unique(np.hstack(text_list))

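# index 0 is reserved for padding and index 1 for unknown words,
# the same convention TextVectorization uses; keys are bytes so
# that every entry can be decoded the same way
vocab_d = {0: b"", 1: b"[UNK]"}
for i, j in enumerate(vocab):
    vocab_d[i+2] = j

vocab_inv_d = {v: k for k, v in vocab_d.items()}

vocab_inv_d

{b'': 0,
 b'[UNK]': 1,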
 b'a': 2,
 b'about': 3,
 b'all': 4,
 b'an': 5,
 b'and': 6,
 b'another': 7,
 b'band': 8,
 b'bands': 9,
 b'bangs': 10,
 b'battle': 11,
 b'battles': 12,
 b'be': 13,
 b'beetle': 14,
 b'beetles': 15,
 b'ben': 16,
 b'bends': 17,
 b'bens': 18,
 b'bent': 19,
 b'big': 20,
 b'bim': 21,
 b'bims': 22,
 b'blab': 23,
 b'blew': 24,
 b'blibber': 25,
 b'block': 26,
 b'blocks': 27,
 b'blubber': 28,
 b'blue': 29,
 b'boom': 30,
 b'booms': 31,
 b'bottle': 32,
 b'bottled': 33,
 b'bottles': 34,
 b'box': 35,
 b'breaks': 36,
 b'breeze': 37,
 b'brick': 38,
 b'bricks': 39,
 b'bring': 40,
 b'brings': 41,
 b'broom': 42,
 b'brooms': 43,
 b'call': 44,
 b'called': 45,
 b'can': 46,
 b'cant': 47,
 b'cheese': 48,
 b'chew': 49,
 b'chewing': 50,
 b'chewy': 51,
 b'chick': 52,
 b'chicks': 53,
 b'choose': 54,
 b'clock': 55,
 b'clocks': 56,
 b'clothes': 57,
 b'come': 58,
 b'comes': 59,
 b'crow': 60,
 b'crows': 61,
 b'do': 62,
 b'doing': 63,
 b'done': 64,
 b'dont': 65,
 b'duck': 66,
 b'duddled': 67,
 b'du

In [66]:
s = "The Fox Battles Chelsea"
# make everything lowercase
s = tf.strings.lower(s)

# replace all punctuation with nothing
s = tf.strings.regex_replace(s,
            f"[{re.escape(string.punctuation)}]", "")

# split into word level tokens
s = tf.strings.split(s)

encoding = [vocab_inv_d[word] if word in vocab_inv_d else 1 for word in s.numpy()]

encoding

[191, 78, 12, 1]

In [56]:
decoding = [vocab_d[i].decode() for i in encoding]

" ".join(decoding)

'the fox battles [UNK]'

# TensorFlow

Now that we understand what's going on, let's do all of this using TensorFlow.

In [62]:
# put text into list of lines
text_list = text.split("\n")
text_list = [sub for sub in text_list if len(sub) > 0]

text_vectorization = TextVectorization(
    output_mode = "int",
    standardize = "lower_and_strip_punctuation",
    split = "whitespace"
)

text_vectorization.adapt(text_list)
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'sir',
 'a',
 'and',
 'socks',
 'in',
 'knox',
 'fox',
 'on',
 'mr',
 'with',
 'tweetle',
 'battle',
 'to',
 'this',
 'puddle',
 'now',
 'sews',
 'i',
 'you',
 'new',
 'it',
 'do',
 'chicks',
 'beetles',
 'band',
 'the',
 'say',
 'is',
 'goo',
 'broom',
 'box',
 'beetle',
 'when',
 'well',
 'trick',
 'slow',
 'lakes',
 'come',
 'clocks',
 'bricks',
 'who',
 'what',
 'they',
 'these',
 'sue',
 'quick',
 'my',
 'make',
 'joe',
 'game',
 'crows',
 'comes',
 'clothes',
 'cant',
 'bottle',
 'blocks',
 'bims',
 'bim',
 'bens',
 'ben',
 'whose',
 'trees',
 'three',
 'thats',
 'sues',
 'stack',
 'poodle',
 'paddle',
 'of',
 'luke',
 'luck',
 'likes',
 'licks',
 'duck',
 'crow',
 'chew',
 'call',
 'bends',
 'tongue',
 'then',
 'rose',
 'paddles',
 'mouth',
 'made',
 'lets',
 'ill',
 'hose',
 'heres',
 'gooey',
 'freezy',
 'fleas',
 'find',
 'easy',
 'can',
 'blue',
 'bands',
 'wont',
 'way',
 'very',
 'try',
 'tricks',
 'too',
 'tocks',
 'tock',
 'ticks',
 'tick',
 'their',
 'th

In [63]:
sentence = text_vectorization("The Fox Battles Chelsea")
sentence

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 27,   8, 227,   1])>

In [67]:
decode = dict(enumerate(text_vectorization.get_vocabulary()))
decode

{0: '',
 1: '[UNK]',
 2: 'sir',
 3: 'a',
 4: 'and',
 5: 'socks',
 6: 'in',
 7: 'knox',
 8: 'fox',
 9: 'on',
 10: 'mr',
 11: 'with',
 12: 'tweetle',
 13: 'battle',
 14: 'to',
 15: 'this',
 16: 'puddle',
 17: 'now',
 18: 'sews',
 19: 'i',
 20: 'you',
 21: 'new',
 22: 'it',
 23: 'do',
 24: 'chicks',
 25: 'beetles',
 26: 'band',
 27: 'the',
 28: 'say',
 29: 'is',
 30: 'goo',
 31: 'broom',
 32: 'box',
 33: 'beetle',
 34: 'when',
 35: 'well',
 36: 'trick',
 37: 'slow',
 38: 'lakes',
 39: 'come',
 40: 'clocks',
 41: 'bricks',
 42: 'who',
 43: 'what',
 44: 'they',
 45: 'these',
 46: 'sue',
 47: 'quick',
 48: 'my',
 49: 'make',
 50: 'joe',
 51: 'game',
 52: 'crows',
 53: 'comes',
 54: 'clothes',
 55: 'cant',
 56: 'bottle',
 57: 'blocks',
 58: 'bims',
 59: 'bim',
 60: 'bens',
 61: 'ben',
 62: 'whose',
 63: 'trees',
 64: 'three',
 65: 'thats',
 66: 'sues',
 67: 'stack',
 68: 'poodle',
 69: 'paddle',
 70: 'of',
 71: 'luke',
 72: 'luck',
 73: 'likes',
 74: 'licks',
 75: 'duck',
 76: 'crow',
 77: 'c

In [27]:
decoded_sentence = " ".join(decode[int(i)] for i in sentence)

decoded_sentence

'the fox battles [UNK]'
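
Notice that the same sentence got different integers from the two approaches. Our hand-built vocabulary was in alphabetical order, while the vocabulary printed by `get_vocabulary()` starts with the most common words (`'sir'`, `'a'`, `'and'`, ...). The sketch below imitates that frequency ordering in plain Python. The helper names are made up for this example, and breaking ties alphabetically is an arbitrary choice here, not necessarily what the layer does.

```python
from collections import Counter


def build_vocab(lines):
    """Return a vocabulary with padding and unknown slots, most frequent words first."""
    counts = Counter(word for line in lines for word in line.split())
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ["", "[UNK]"] + ordered


def encode(sentence, vocab):
    """Map each word to its index, falling back to 1 ([UNK]) for unseen words."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in sentence.split()]


lines = ["fox in socks", "knox in box", "fox in socks on box on knox"]
vocab = build_vocab(lines)
print(vocab)
print(encode("fox in clocks", vocab))
```

Frequent words get small indices, which is why `max_tokens` can keep the most common words just by truncating the list.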

When we process text, we can either run this processing step in a `tf.data` pipeline before the data reaches the model, or make text processing part of the model itself (as a layer).

However, text tokenization cannot take advantage of GPUs, so it runs on the CPU. If you include it as a layer in your network, every training step has to wait for the tokenization layer to finish on the CPU before the results are handed to the GPU to run the rest of the model, which can slow things down.

Let's process the IMDB reviews dataset the first way, in the data pipeline.

In [70]:
# load in data from URL
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# unzip file
!tar -xf aclImdb_v1.tar.gz

# remove extra files that are used for unsupervised learning
!rm -r aclImdb/train/unsup

# show us one file
!cat aclImdb/train/pos/4077_10.txt


I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, be

In [72]:
# import os, pathlib, shutil, random

# base_dir = pathlib.Path("aclImdb")
# val_dir = base_dir / "val"
# train_dir = base_dir / "train"
# for category in ("neg", "pos"):
#     os.makedirs(val_dir / category)
#     files = os.listdir(train_dir / category)
#     random.Random(1337).shuffle(files)
#     num_val_samples = int(0.2 * len(files))
#     val_files = files[-num_val_samples:]
#     for fname in val_files:
#         shutil.move(train_dir / category / fname,
#                     val_dir / category / fname)

In [73]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 25000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [74]:
for inputs, targets in train_ds:
    # print(targets)
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'I saw this on TV the other night\xc2\x85 or rather I flicked over to another channel every so often to watch infomercials when I couldn\'t stand watching it any longer. It was bad. Really, really bad. Not "so bad it\'s good" just flat out bad. How did it get funded? Who thought this was a good idea? An actor friend of mine auditioned and was told he wasn\'t good enough to play a bad guy but I think what they meant was "save yourself and runaway from this steaming pile of @#$%." I bet the rest of the cast had been given the option. To be fair the acting was hard to judge because of the appalling fake American ascents. The shooting was dullllllllllll. The action was awkward and stilted. The dialog was inane. By far the saddest thing was ship. In real life the Interislander ferry is a shabby boat and on film it doesn\'t scrub up well. Instead of trying very unsuc

## 1-grams

Let's train our text classification model on 1-grams (single words as tokens) and see how it does.
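
The next cell uses `output_mode="multi_hot"`. As a rough picture of what that encoding means, here is a plain-Python sketch with a toy five-word vocabulary. The function name and the use of slot 0 for unknown words are conventions chosen for this example.

```python
def multi_hot(tokens, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in tokens."""
    index = {word: i for i, word in enumerate(vocab)}
    vector = [0] * len(vocab)
    for tok in tokens:
        # Words outside the vocabulary all land in slot 0 (the unknown slot).
        vector[index.get(tok, 0)] = 1
    return vector


vocab = ["[UNK]", "bad", "good", "movie", "really"]
print(multi_hot("really really bad movie".split(), vocab))
print(multi_hot("great movie".split(), vocab))
```

The vector has one slot per vocabulary word, so word order and repeat counts are thrown away; only presence or absence is kept.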

In [75]:
# take in raw text, standardize and tokenize
text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
)

text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

# create binary 1-grams
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


In [76]:
from tensorflow import keras
from tensorflow.keras import layers

# define a model building function so that we can
# use the same architecture over and over with different
# forms of pre-processing

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

In [77]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.880


In [78]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.887


In [79]:
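# "count" mode records how many times each n-gram occurs.
# This layer is shown only for comparison; it is overwritten below.
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)
# "tf_idf" mode weights those counts by how rare each n-gram is
# across the training reviews; this is the layer we actually use.
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)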

text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.874
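
To see what the `"count"` and `"tf_idf"` output modes are getting at, here is a small plain-Python sketch. The function names are hypothetical, and the smoothed IDF formula `log(1 + N / (1 + df))` is one common variant picked for illustration; the exact weighting Keras applies may differ.

```python
import math
from collections import Counter


def count_vector(tokens, vocab):
    """How many times each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency: rarer words get larger weights."""
    n_docs = len(docs)
    weights = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)  # number of documents containing w
        weights.append(math.log(1 + n_docs / (1 + df)))
    return weights


def tfidf_vector(tokens, vocab, idf):
    """Scale each count by the word's IDF weight."""
    return [c * w for c, w in zip(count_vector(tokens, vocab), idf)]


docs = [d.split() for d in ["the movie was good",
                            "the movie was bad",
                            "the plot was bad bad"]]
vocab = ["the", "movie", "good", "bad"]
idf = idf_weights(docs, vocab)
print(count_vector(docs[2], vocab))
print(tfidf_vector(docs[2], vocab, idf))
```

A word like "the" that shows up in every document gets the smallest weight, so its count matters less than the count of a rarer word like "good".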


# Using TextVectorization Layer after Training

Once we're done training our model, we may want to create a NEW model with an added text vectorization layer, so that we can feed raw text into the model and still make predictions.

To do that, we use the functional API from Keras. First, we create an input layer that expects a single raw string per sample.

Then we send that text through the `text_vectorization` layer we created, which standardizes, tokenizes, and encodes it.

Then we feed this processed output into our trained `model` to make an actual prediction.

By wrapping all of these steps in a new model, `inference_model`, we've created a useful model object that does both our pre-processing and our predictions, without making pre-processing a bottleneck during each iteration of training.



In [80]:
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)

In [81]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

97.86 percent positive


# On Your Own

Build a simple text classification model using this email data (on Canvas).

Upload the .zip file to your working directory and then run the following code:

In [None]:
!unzip emails.zip

In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("Data")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("Crime", "Entertainment"):
    os.makedirs(val_dir / category)
    os.makedirs(train_dir / category)
    files = os.listdir(base_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    train_files = files[:-num_val_samples]  # everything not held out for validation
    for fname in val_files:
        shutil.move(base_dir / category / fname,
                    val_dir / category / fname)
    for fname in train_files:
        shutil.move(base_dir / category / fname,
                    train_dir / category / fname)

In [None]:
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory(
    "Data/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "Data/val", batch_size=batch_size
)


In [82]:
# take in raw text, standardize and tokenize
text_vectorization = TextVectorization(
    max_tokens=7500,
    output_mode="multi_hot",
)

text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

# create binary 1-grams
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
for i in binary_1gram_train_ds.take(1):
  print(i)

# Embedding Layers
We also learned about word embeddings, which are dense, low-dimensional vectors that represent the semantic meaning of words.

<img src="https://drive.google.com/uc?export=view&id=1Ef4ZAxOuifpeodt8AZbQISNbOWQ_Bvu9" alt="Q" width = "400"/>

<img src="https://drive.google.com/uc?export=view&id=1WRx3J3bc95lcJAObQ7FwVCdvbEX0jHn3" alt="Q" width = "400"/>

While we can use pre-trained word embeddings (such as GloVe and word2vec; see tutorial [here](https://keras.io/examples/nlp/pretrained_word_embeddings/)), we can also learn our own embeddings as a part of our models.

We can do so using Keras' [`Embedding()`](https://keras.io/api/layers/core_layers/embedding/) layer. 

## Feedback Phrases Sentiment

Let's look at a really simple example which is trying to classify different feedback phrases as positive or negative.

To process our text we:

- turn each word into an integer *index*, which stands in for a one-hot encoding of that word

(For example if the word `"hello"` was represented as `[0,0,0,0,0,1,0]` we'd store the index `5` instead of the whole vector `[0,0,0,0,0,1,0]`. Keras' `one_hot()` assigns these indices by hashing each word into the range `[1, vocab_size)`, so two different words can occasionally share an index.)

- pad the sequences so that they're all the same length. If a sequence is too short, we add `0`s at the end until it's the right length
- feed the sequences into an `Embedding()` layer.
    - `input_dim`: the number of distinct indices the layer can look up, i.e. the length of our one-hot vectors; ours will be `vocab_size`
    - `output_dim`: the dimension of our embedded vectors
    - `input_length`: the length of all our sequences (if it's the same for each sequence)

- flatten our embeddings into a single vector that can be fed to `Dense()`
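
The steps above can be sketched without Keras to show what the shapes look like. Everything here is illustrative: `pad_post` is a made-up helper (it truncates at the end, which is a choice of this sketch), and the embedding table is filled with random numbers where a real `Embedding()` layer would hold learned weights.

```python
import random


def pad_post(seq, max_length, pad_value=0):
    """Right-pad (or cut) a list of word indices to exactly max_length."""
    return seq[:max_length] + [pad_value] * (max_length - len(seq))


vocab_size, embed_dim, max_length = 60, 8, 4
rng = random.Random(0)

# The embedding table has one row of embed_dim numbers per possible index,
# so it holds vocab_size * embed_dim values in total.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

padded = pad_post([32, 42], max_length)       # indices for a two-word phrase
embedded = [table[i] for i in padded]         # one row per position: 4 x 8
flat = [x for row in embedded for x in row]   # flattened to 32 numbers

print(padded)
print(len(embedded), len(embedded[0]))
print(len(flat))
```

Looking up row `i` gives the same result as multiplying a one-hot vector for `i` by the table, which is why storing only the index is enough. The table size, 60 × 8 = 480, is the parameter count of the embedding layer in the model summary below.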


In [84]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding


# define documents
docs = ['Well done!',
 'Good work',
 'Great effort',
 'nice work',
 'Excellent!',
 'Weak',
 'Poor effort!',
 'not good',
 'poor work',
 'Could have done better.',
 'Gorgeous job',
 'Better luck next time']

# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0,1,0])

# pre-process

# integer encode the documents
vocab_size = 60
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print("\nOne Hot Indices------------------------------------")
print(encoded_docs)
print("---------------------------------------------------\n")

# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print("\nDocuments with padding to make them the same length")
print(padded_docs)
print("---------------------------------------------------\n")

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print("\nModel Summary------------------------------------")
model.summary()  # summary() prints the table itself and returns None
print("---------------------------------------------------\n")
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


One Hot Indices------------------------------------
[[32, 42], [39, 21], [23, 35], [29, 21], [8], [38], [20, 35], [26, 39], [20, 21], [5, 48, 42, 56], [12, 11], [56, 59, 47, 32]]
---------------------------------------------------


Documents with padding to make them the same length
[[32 42  0  0]
 [39 21  0  0]
 [23 35  0  0]
 [29 21  0  0]
 [ 8  0  0  0]
 [38  0  0  0]
 [20 35  0  0]
 [26 39  0  0]
 [20 21  0  0]
 [ 5 48 42 56]
 [12 11  0  0]
 [56 59 47 32]]
---------------------------------------------------


Model Summary------------------------------------
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 8)              480       
                                                                 
 flatten (Flatten)           (None, 32)                0         
                                                                 
 dense_6 (Den

## Pride and Prejudice vs. Frankenstein

Now, let's use the same Pride and Prejudice text we used last week, as well as the text of Frankenstein (on Canvas), to build a model that can classify whether a snippet of 20 words is from Pride and Prejudice or Frankenstein.

First, we need to load in our two text files, get rid of punctuation, and split them into word lists.

In [119]:
import re
# load text, strip punctuation, convert to lowercase, and split into words
filename = "pandp.txt"
raw_text_p = open(filename, 'r', encoding='utf-8').read()
raw_text_p = re.sub(r'[^\w\s]', '', raw_text_p)
raw_text_p = raw_text_p.lower().split()


filename = "frankenstein.txt"
raw_text_f = open(filename, 'r', encoding='utf-8').read()
raw_text_f = re.sub(r'[^\w\s]', '', raw_text_f)
raw_text_f = raw_text_f.lower().split()


Then we'll create a dictionary that maps our word tokens into indices for one hot vectors.
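
One way to build such a dictionary is to take the union of the two books' word sets, so that each distinct word gets exactly one index. The sketch below uses two toy word lists and a hypothetical helper name; it is an illustration, not the code this notebook runs.

```python
def build_word_index(*word_lists):
    """Map every distinct word across all the lists to one integer index."""
    vocab = sorted(set().union(*word_lists))
    return {word: i for i, word in enumerate(vocab)}


book_a = "it is a truth universally acknowledged".split()
book_b = "it was on a dreary night".split()
word_to_int = build_word_index(book_a, book_b)

print(len(word_to_int))
print(len(set(book_a)) + len(set(book_b)))
```

The union has 10 distinct words, while adding the two per-book vocabulary sizes gives 12, because "it" and "a" appear in both lists.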

In [120]:

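# create mapping of unique words to integers (the names say "char",
# but each token here is a whole word)
# note: words that appear in both books are listed twice, so n_vocab
# overcounts the distinct words; it is still a safe upper bound on the indices
chars = sorted(list(set(raw_text_p))) + sorted(list(set(raw_text_f)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# text info (both counts are in words)
n_chars = len(raw_text_p) 
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)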



Total Characters:  121529
Total Vocab:  14118


Lastly, we'll pull data from each file by creating a bunch of overlapping sequences of 20 words from each text.
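
The windowing itself is easier to see on a tiny example. This sketch (with a made-up helper name, `windows`) uses the same `range(0, n - seq_length, step)` bounds as the loops in the next cell, so a text of `n` words yields `n - seq_length` windows when the step is 1.

```python
def windows(words, seq_length, step=1):
    """Slide a window of seq_length words across the text."""
    return [words[i:i + seq_length]
            for i in range(0, len(words) - seq_length, step)]


words = "fox in socks on box on knox".split()
for w in windows(words, seq_length=3):
    print(w)
```

Each window overlaps the previous one by all but one word, which is why the number of patterns is almost as large as the number of words in the book.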

In [121]:

# prepare the dataset: windows of words encoded as integers, labeled by book
seq_length = 20 # 20 words as input
dataX_p = []
dataY_p = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text_p[i:i + seq_length] # window of seq_length words
    dataX_p.append([char_to_int[char] for char in seq_in])
    dataY_p.append(0) # label 0 = Pride and Prejudice
n_patterns = len(dataX_p)
print("Total Patterns PandP: ", n_patterns)


# text info (counts are in words)
n_chars = len(raw_text_f) 
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# prepare the dataset: windows of words encoded as integers, labeled by book
seq_length = 20 # 20 words as input
dataX_f = []
dataY_f = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text_f[i:i + seq_length] # window of seq_length words
    dataX_f.append([char_to_int[char] for char in seq_in])
    dataY_f.append(1) # label 1 = Frankenstein
n_patterns = len(dataX_f)
print("Total Patterns Frank: ", n_patterns)

Total Patterns PandP:  121509
Total Characters:  74941
Total Vocab:  14118
Total Patterns Frank:  74921


In [122]:
# combine the pandp and frank training data
dataX = dataX_p + dataX_f
dataY = dataY_p + dataY_f

Now you can use this data to build a model that predicts whether a snippet of text is from Pride and Prejudice or Frankenstein! Use `Embedding`, `Dense`, and `Flatten` layers to build a model similar to the one in the previous section, but using this data.
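
One thing to watch for before fitting: `dataX` holds every Pride and Prejudice window first and every Frankenstein window after it. Keras' `validation_split` takes the last fraction of the data before any shuffling, so an unshuffled split would validate on Frankenstein only. A minimal sketch of shuffling inputs and labels together and then splitting is shown below; the helper name, the 20% fraction, and the seed are all arbitrary choices for illustration.

```python
import random


def shuffled_split(X, y, val_fraction=0.2, seed=1337):
    """Shuffle X and y with the same permutation, then split off a validation set."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_val = int(val_fraction * len(order))
    val_idx, train_idx = order[:n_val], order[n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return ((pick(X, train_idx), pick(y, train_idx)),
            (pick(X, val_idx), pick(y, val_idx)))


# Toy data: the first five samples are class 0, the last five are class 1.
X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5
(train_X, train_y), (val_X, val_y) = shuffled_split(X, y)
print(len(train_X), len(val_X))
```

The resulting lists can be converted with `np.array(...)` before being passed to `model.fit`.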

In [124]:
# define the model

# compile the model

# summarize the model

# fit the model
