**Plan**

**1. Text preprocessing with Keras**

**2. Using Keras for sentiment analysis and text classification**
   
**3. Keras NLP library**
  * Classification with BERT
  * Generation with GPT








# **Text preprocessing with Keras**

Text preprocessing is a crucial step in natural language processing (NLP) and machine learning workflows. Keras provides utilities for tokenization, padding, and encoding; this section walks through each of them on a small example.

**1. Text Tokenization**

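Tokenization converts text into sequences of tokens, each represented by an integer. Keras provides the Tokenizer class for this. Note that the TensorFlow documentation marks Tokenizer as deprecated and recommends the TextVectorization layer for new code; it is used here because it makes each step easy to inspect.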

In [None]:
! pip install --upgrade keras

In [69]:
import keras
keras.__version__

'3.4.1'

In [71]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Sample text data
texts = [
    "I love machine learning.",
    "Keras makes text preprocessing easy.",
    "Let's preprocess some text!"
]

# Initialize and fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Convert text to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)
print("Sequences:", sequences)

# Retrieve the word index
word_index = tokenizer.word_index
print("Word Index:", word_index)

Sequences: [[2, 3, 4, 5], [6, 7, 1, 8, 9], [10, 11, 12, 1]]
Word Index: {'text': 1, 'i': 2, 'love': 3, 'machine': 4, 'learning': 5, 'keras': 6, 'makes': 7, 'preprocessing': 8, 'easy': 9, "let's": 10, 'preprocess': 11, 'some': 12}


* fit_on_texts(texts): Updates the tokenizer's internal vocabulary based on the provided texts.
* texts_to_sequences(texts): Converts the texts into sequences of integers based on the word index.
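To make the two calls above less of a black box, here is a minimal pure-Python sketch of what they do with default settings: lowercase the text, replace filtered punctuation with spaces, rank words by frequency, and look each word up. The helper names (`split_words`, `build_word_index`, `to_sequences`) are hypothetical and this is an approximation of the behaviour, not the Keras implementation.

```python
from collections import Counter

# Characters the Keras Tokenizer filters out by default (the apostrophe is kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_STRIP = str.maketrans({ch: " " for ch in FILTERS})


def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_STRIP).split()


def build_word_index(texts):
    """Rank words by frequency (ties keep first-seen order); index 0 stays unused."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Map each known word to its index; unknown words are dropped."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


texts = [
    "I love machine learning.",
    "Keras makes text preprocessing easy.",
    "Let's preprocess some text!",
]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
print(sequences)  # [[2, 3, 4, 5], [6, 7, 1, 8, 9], [10, 11, 12, 1]]
```

The word "text" gets index 1 because it is the only word that appears twice; index 0 is left free so that it can later serve as the padding value.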

**2. Padding Sequences**

Padding ensures that all sequences have the same length. This is necessary for feeding data into many machine learning models.

In [72]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Padding sequences
padded_sequences = pad_sequences(sequences, padding='post')
print("Padded Sequences:", padded_sequences)


Padded Sequences: [[ 2  3  4  5  0]
 [ 6  7  1  8  9]
 [10 11 12  1  0]]


* pad_sequences(sequences, padding='post'): Pads sequences so they all have the same length, by default that of the longest sequence. The padding argument specifies whether zeros are added at the beginning ('pre', the default) or the end ('post') of each sequence.
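The difference between 'pre' and 'post', and what happens when maxlen is shorter than a sequence, can be illustrated with a small pure-Python sketch. The function `pad` is a hypothetical stand-in that mirrors the documented defaults of pad_sequences (padding='pre', truncating='pre', value=0); it returns plain lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate each sequence to maxlen (default: the longest sequence)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[2, 3, 4, 5], [6, 7, 1, 8, 9], [10, 11, 12, 1]]
post = pad(sequences, padding="post")
pre = pad(sequences)            # default: zeros go in front
short = pad(sequences, maxlen=3)  # default truncation keeps the last 3 tokens

print(post)   # [[2, 3, 4, 5, 0], [6, 7, 1, 8, 9], [10, 11, 12, 1, 0]]
print(pre)    # [[0, 2, 3, 4, 5], [6, 7, 1, 8, 9], [0, 10, 11, 12, 1]]
print(short)  # [[3, 4, 5], [1, 8, 9], [11, 12, 1]]
```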

**3. Text Vectorization**

In addition to tokenization and padding, Keras offers the TextVectorization layer (tf.keras.layers.TextVectorization), which standardizes, tokenizes, and pads text inside a single layer. The cell below does not use that layer; it performs the equivalent steps manually with Tokenizer and pad_sequences.

In [73]:
import numpy as np

# Sample text data
texts = [
    "I love machine learning.",
    "Keras makes text preprocessing easy.",
    "Let's preprocess some text!"
]

# Tokenization with Keras
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

# Determine maximum sequence length
max_len = max(len(seq) for seq in sequences)

# Pad sequences to ensure uniform length
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

# Convert to numpy array
padded_sequences_array = np.array(padded_sequences)
print("Padded Sequences Array:", padded_sequences_array)


Padded Sequences Array: [[ 2  3  4  5  0]
 [ 6  7  1  8  9]
 [10 11 12  1  0]]


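* maxlen: Specifies the maximum length of sequences. Longer sequences are truncated (from the beginning by default, controlled by the truncating argument), and shorter ones are padded.
* np.array: pad_sequences already returns a NumPy array, so this call only makes a copy; it is shown here for explicitness.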
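For comparison, the same preprocessing can be sketched with the TextVectorization layer itself. This is a minimal example that assumes Keras with the TensorFlow backend is installed; the value 5 for output_sequence_length is chosen only to match the longest sample sentence.

```python
from keras import layers

texts = [
    "I love machine learning.",
    "Keras makes text preprocessing easy.",
    "Let's preprocess some text!",
]

# output_sequence_length pads or truncates every row to 5 tokens.
vectorizer = layers.TextVectorization(output_mode="int", output_sequence_length=5)
vectorizer.adapt(texts)  # builds the vocabulary from the sample texts

vocabulary = vectorizer.get_vocabulary()
vectorized = vectorizer(texts)

print(vocabulary[:3])    # padding token, out-of-vocabulary token, most frequent word
print(vectorized.shape)
```

Two differences from Tokenizer are worth keeping in mind: the layer reserves index 0 for padding and index 1 for unknown words, and its default standardization also strips apostrophes, so "Let's" becomes "lets".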

**Summary**

* **Tokenization:** Use the Tokenizer class to convert text to sequences of integers.
* **Padding:** Use pad_sequences to ensure all sequences have the same length.
* **Vectorization:** Convert sequences to a uniform shape for model input.

# **Using Keras for sentiment analysis and text classification**

In [75]:
from keras.datasets import imdb
# Import from keras directly so the cell does not mix keras and tensorflow.keras
from keras.utils import pad_sequences

# Load the IMDB dataset
num_words = 10000  # Consider only the top 10,000 words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)

# Pad sequences to ensure uniform input length
max_len = 500
x_train_padded = pad_sequences(x_train, maxlen=max_len)
x_test_padded = pad_sequences(x_test, maxlen=max_len)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [76]:
x_train_padded.shape

(25000, 500)

In [78]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# Keras 3 deprecates input_length; the sequence length is inferred from the input data
model.add(Embedding(input_dim=num_words, output_dim=128))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))  # For binary classification

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])




In [None]:
# Train the model
history = model.fit(x_train_padded, y_train, epochs=5, batch_size=32, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test_padded, y_test)
print(f"Test Accuracy: {test_acc}")

In [None]:
# Predicting on new data
predictions = model.predict(x_test_padded)

# Convert predictions to binary labels
predicted_labels = (predictions > 0.5).astype(int)

# **Keras NLP library [Link](https://keras.io/keras_nlp/)**

**<h2>Classification with BERT</h2>**

In [None]:
! pip install --upgrade keras-cv
! pip install --upgrade keras-nlp
# ! pip install --upgrade keras

In [82]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import keras_nlp
import tensorflow_datasets as tfds

In [83]:
imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=16,
)


Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [84]:
item = next(iter(imdb_train))
print(item)

(<tf.Tensor: shape=(16,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell

In [None]:
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    num_classes=2,
)
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])


**<h2>Generation with GPT [Link](https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/)</h2>**

In [85]:
# To speed up training and generation, use a preprocessor with a sequence
# length of 128 instead of the full 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/metadata.json...


100%|██████████| 141/141 [00:00<00:00, 100kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/tokenizer.json...


100%|██████████| 448/448 [00:00<00:00, 261kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/vocabulary.json...


100%|██████████| 0.99M/0.99M [00:00<00:00, 16.5MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/merges.txt...


100%|██████████| 446k/446k [00:00<00:00, 9.24MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/config.json...


100%|██████████| 484/484 [00:00<00:00, 493kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/model.weights.h5...


100%|██████████| 475M/475M [00:08<00:00, 60.4MB/s]


In [86]:
gpt2_lm.summary()

In [88]:
preprocessor.get_config()

{'name': 'gpt2_causal_lm_preprocessor',
 'trainable': True,
 'dtype': {'module': 'keras',
  'class_name': 'DTypePolicy',
  'config': {'name': 'float32'},
  'registered_name': None},
 'tokenizer': {'module': 'keras_nlp.src.models.gpt2.gpt2_tokenizer',
  'class_name': 'GPT2Tokenizer',
  'config': {'name': 'gpt2_tokenizer',
   'trainable': True,
   'dtype': {'module': 'keras',
    'class_name': 'DTypePolicy',
    'config': {'name': 'int32'},
    'registered_name': None},
   'sequence_length': None,
   'add_prefix_space': False},
  'registered_name': 'keras_nlp>GPT2Tokenizer'},
 'sequence_length': 128,
 'add_start_token': True,
 'add_end_token': True}

In [89]:
gpt2_lm.backbone.summary()

In [None]:
import time

start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


**<h2>Fine-tune on the Reddit TIFU dataset</h2>**

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf

reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)

In [None]:
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break

In [None]:
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

In [None]:


train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)
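
The "linearly decaying learning rate" in the cell above is PolynomialDecay with its default power of 1. The sketch below reproduces the documented formula in plain Python so the schedule can be inspected without training. The function name `polynomial_decay` is hypothetical, and `decay_steps = 500` assumes the dataset yields 500 batches for one epoch.

```python
def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0, power=1.0):
    """Learning rate after `step` optimizer updates (no cycling)."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


decay_steps = 500  # assumed: 500 batches * 1 epoch
schedule = {
    step: polynomial_decay(step, 5e-5, decay_steps)
    for step in (0, 250, 500, 1000)
}
for step, lr in schedule.items():
    print(step, lr)
```

The rate starts at 5e-5, is halved at the midpoint, and reaches zero on the last training step, which is why decay_steps is tied to the number of batches times the number of epochs.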


In [None]:
start = time.time()

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

In [None]:
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)
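
The difference between the two samplers can be illustrated on a toy score vector. The following pure-Python sketch shows the idea only and is not the KerasNLP implementation; the helper names (`greedy_pick`, `top_k_pick`) and the logits are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_pick(logits, k, rng):
    """Sample among the k highest-scoring tokens, weighted by softmax probability."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # toy scores for a 5-token vocabulary
rng = random.Random(0)

greedy = [greedy_pick(logits) for _ in range(5)]
sampled = [top_k_pick(logits, k=3, rng=rng) for _ in range(5)]

print(greedy)   # the same index every time
print(sampled)  # indices drawn from the three best tokens
```

Because greedy decoding always takes the single best token, identical contexts lead to identical continuations, which is one reason it tends to loop. Top-k sampling keeps some randomness while still excluding unlikely tokens.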
