<a href="https://colab.research.google.com/github/drewpager/deep-learning/blob/main/11_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import string

class Vectorizer:
  def standardize(self, text):
    text = text.lower()
    return "".join(char for char in text if char not in string.punctuation)

  def tokenize(self, text):
    text = self.standardize(text)
    return text.split()
  
  def make_vocabulary(self, dataset):
    self.vocabulary = {"": 0, "[UNK]": 1}
    for text in dataset:
      # tokenize() already standardizes the text
      tokens = self.tokenize(text)
      for token in tokens:
        if token not in self.vocabulary:
          self.vocabulary[token] = len(self.vocabulary)
    # Build the reverse mapping once, after the vocabulary is complete
    self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())
  
  def encode(self, text):
    # tokenize() standardizes first; unknown tokens map to index 1 ("[UNK]")
    tokens = self.tokenize(text)
    return [self.vocabulary.get(token, 1) for token in tokens]

  def decode(self, int_sequence):
    return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",           
]

vectorizer.make_vocabulary(dataset)



In [2]:
test_sentence = "I write, rewrite, and still rewrite again."
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

[2, 3, 5, 7, 1, 5, 6]
i write rewrite and [UNK] rewrite again


In [3]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

text_vectorization = layers.TextVectorization(output_mode="int")
text_vectorization.adapt(dataset)
vocabulary = text_vectorization.get_vocabulary()
encode_sent = text_vectorization(test_sentence)
print(encode_sent)
inverse_sent = dict(enumerate(vocabulary))
decode_sent = " ".join(inverse_sent[int(i)] for i in encode_sent)
print(decode_sent)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
i write rewrite and [UNK] rewrite again
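
The two encodings differ (`[2, 3, 5, 7, 1, 5, 6]` versus `[7, 3, 5, 9, 1, 5, 10]`) because the hand-written `Vectorizer` assigns indices in order of first appearance, while `TextVectorization` orders its vocabulary by token frequency. The pure-Python sketch below reproduces the Keras indices for this toy dataset. The tie-breaking rule (reverse alphabetical among equally frequent tokens) is an assumption inferred from the output above, and `simple_tokenize` is a hypothetical helper, not part of Keras.

```python
import string
from collections import Counter

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]

def simple_tokenize(text):
    # Hypothetical helper: lowercase, strip punctuation, split on whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()

counts = Counter(tok for text in dataset for tok in simple_tokenize(text))

# Most frequent first; ties broken in reverse alphabetical order
# (assumption inferred from the TextVectorization output, not a documented rule)
ordered = sorted(counts, key=lambda tok: (counts[tok], tok), reverse=True)

# Index 0 is reserved for padding and index 1 for unknown tokens
vocab = ["", "[UNK]"] + ordered
token_to_index = {tok: i for i, tok in enumerate(vocab)}

test_sentence = "I write, rewrite, and still rewrite again."
encoded = [token_to_index.get(tok, 1) for tok in simple_tokenize(test_sentence)]
print(encoded)  # [7, 3, 5, 9, 1, 5, 10]
```

Only "erase" occurs twice in the dataset, so it takes the first free index (2); every other token occurs once and falls back to the tie-breaking order.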


**IMDB Review Sentiment Analysis**

In [4]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  17.1M      0  0:00:04  0:00:04 --:--:-- 17.1M


In [5]:
!rm -r aclImdb/train/unsup

In [6]:
!cat /content/aclImdb/train/pos/10009_9.txt

When I first read Armistead Maupins story I was taken in by the human drama displayed by Gabriel No one and those he cares about and loves. That being said, we have now been given the film version of an excellent story and are expected to see past the gloss of Hollywood...<br /><br />Writer Armistead Maupin and director Patrick Stettner have truly succeeded! <br /><br />With just the right amount of restraint Robin Williams captures the fragile essence of Gabriel and lets us see his struggle with issues of trust both in his personnel life(Jess) and the world around him(Donna).<br /><br />As we are introduced to the players in this drama we are reminded that nothing is ever as it seems and that the smallest event can change our lives irrevocably. The request to review a book written by a young man turns into a life changing event that helps Gabriel find the strength within himself to carry on and move forward.<br /><br />It's to bad that most people will avoid this film. I only say that

In [7]:
import os, shutil, pathlib, random 

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("pos", "neg"):
  os.makedirs(val_dir / category)
  files = os.listdir(train_dir / category)
  random.Random(1337).shuffle(files)
  num_val_samples = int(0.2 * len(files))
  val_files = files[-num_val_samples:]
  for fname in val_files:
    shutil.move(train_dir / category / fname,
                val_dir / category / fname)
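
One caveat about the split above: `os.listdir` returns entries in arbitrary order, so a seeded shuffle alone may not select the same validation files on every machine. A minimal sketch of one way to make the selection independent of listing order is to sort the names before shuffling. `pick_validation_files` is a hypothetical helper that works on plain lists of names and does not touch the file system.

```python
import random

def pick_validation_files(filenames, fraction=0.2, seed=1337):
    """Return (train_files, val_files); the result does not depend on input order."""
    files = sorted(filenames)            # fix the starting order first
    random.Random(seed).shuffle(files)   # then apply the seeded shuffle
    num_val = int(fraction * len(files))
    if num_val == 0:
        # Guard: files[-0:] would return the whole list
        return files, []
    return files[:-num_val], files[-num_val:]

# Made-up file names, used only for illustration
names = [f"{i}_{i % 10}.txt" for i in range(50)]
train_a, val_a = pick_validation_files(names)
train_b, val_b = pick_validation_files(list(reversed(names)))

print(len(train_a), len(val_a))  # 40 10
print(val_a == val_b)            # True
```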


In [8]:
from tensorflow import keras
batch_size = 32

train_ds = keras.preprocessing.text_dataset_from_directory("/content/aclImdb/train", batch_size=batch_size)
val_ds = keras.preprocessing.text_dataset_from_directory("/content/aclImdb/val", batch_size=batch_size)
test_ds = keras.preprocessing.text_dataset_from_directory("/content/aclImdb/test",batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [13]:
for inputs, targets in train_ds:
  print(inputs.shape)
  print(inputs.dtype)
  print(targets.shape)
  print(targets.dtype)
  print(inputs[0])
  print(targets[0])
  break

(32,)
<dtype: 'string'>
(32,)
<dtype: 'int32'>
tf.Tensor(1, shape=(), dtype=int32)


In [21]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

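text_vectorization = TextVectorization(
    max_tokens=20000,     # cap the vocabulary at 20,000 entries
    ngrams=2,             # generate unigrams and bigrams
    output_mode="count"   # each vector entry counts occurrences of one n-gram
)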

text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

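# Despite the binary_1gram_* names, these datasets hold bigram count vectors
# (ngrams=2 and output_mode="count" in the TextVectorization layer).
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))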
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
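
To make the `ngrams=2`, `output_mode="count"` setting concrete, here is a rough pure-Python sketch of what ends up being counted for a single sentence. It is an illustration only: `ngram_counts` is a hypothetical helper, and it skips the punctuation stripping, vocabulary cap, and fixed-length vector layout that the Keras layer handles.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count all n-grams of size 1..n in a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return Counter(grams)

counts = ngram_counts("the cat sat on the mat")
print(counts["the"])      # 2
print(counts["the cat"])  # 1
print(len(counts))        # 10
```

In the real layer, each of these n-grams that falls within the `max_tokens` vocabulary gets its own position in the 20,000-dimensional input vector, and the stored value is its count.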

In [19]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dims=16):
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dims, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
  return model

In [20]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), epochs=10, callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test Accuracy: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.895
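
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. A small sketch of that arithmetic, with `dense_params` as a hypothetical helper:

```python
def dense_params(input_dim, units):
    # Weight matrix (input_dim x units) plus one bias per unit
    return input_dim * units + units

hidden = dense_params(20000, 16)  # Dense(16) on the 20,000-dimensional input
output = dense_params(16, 1)      # Dense(1) on the 16 hidden units
print(hidden, output, hidden + output)  # 320016 17 320033
```

`Dropout` adds no parameters, which is why the total matches the two `Dense` layers alone.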
