**Keras for Beginners: Implementing a Recurrent Neural Network**

https://victorzhou.com/blog/keras-rnn-tutorial/

**The Problem: Classifying Movie Reviews**

In [None]:
!wget https://victorzhou.com/movie-reviews-dataset.zip

--2024-12-24 09:45:37--  https://victorzhou.com/movie-reviews-dataset.zip
Resolving victorzhou.com (victorzhou.com)... 104.21.72.186, 172.67.153.220, 2606:4700:3035::6815:48ba, ...
Connecting to victorzhou.com (victorzhou.com)|104.21.72.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62951389 (60M) [application/zip]
Saving to: ‘movie-reviews-dataset.zip.1’


2024-12-24 09:45:38 (243 MB/s) - ‘movie-reviews-dataset.zip.1’ saved [62951389/62951389]



In [None]:
!unzip -n movie-reviews-dataset.zip

Archive:  movie-reviews-dataset.zip

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
from tensorflow.keras.preprocessing import text_dataset_from_directory

## Exploratory Data Analysis (EDA)

A quick look at how the dataset is laid out on disk and how the reviews are loaded into batches of (text, label) pairs.

In [None]:



DATASET_DIR='/content/movie-reviews-dataset/train'
os.listdir(DATASET_DIR)


['pos', 'neg', '.DS_Store']
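The listing above shows a stray `.DS_Store` entry next to the two class folders. A minimal sketch for counting review files per class while skipping hidden entries follows. `count_reviews_per_class` is a hypothetical helper, not part of the original notebook, and it assumes each class is a subdirectory containing one file per review.

```python
import os


def count_reviews_per_class(dataset_dir):
    """Return {class_name: file_count} for each visible subdirectory."""
    counts = {}
    for entry in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, entry)
        # Skip hidden entries such as .DS_Store and anything that is not a folder.
        if entry.startswith(".") or not os.path.isdir(class_dir):
            continue
        # Count only visible files inside the class folder.
        counts[entry] = sum(
            1 for name in os.listdir(class_dir) if not name.startswith(".")
        )
    return counts
```

Calling `count_reviews_per_class(DATASET_DIR)` would give a quick check of class balance before training.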

In [None]:
# These absolute paths match where the archive was unzipped above.
# Change them if your copy of the dataset lives elsewhere.
train_data = text_dataset_from_directory('/content/movie-reviews-dataset/train')
test_data = text_dataset_from_directory('/content/movie-reviews-dataset/test')

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace

def prepare_data(directory):
  # Load batches of (text, label) pairs, then replace the HTML line
  # breaks that appear in the raw reviews with spaces.
  data = text_dataset_from_directory(directory)
  return data.map(
    lambda text, label: (regex_replace(text, '<br />', ' '), label),
  )

train_data = prepare_data('/content/movie-reviews-dataset/train')
test_data = prepare_data('/content/movie-reviews-dataset/test')

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
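To see what the `<br />` replacement does to a single review, here is a plain-Python analogue. It uses the standard `re` module rather than TensorFlow, and `strip_br_tags` is a hypothetical name introduced only for illustration.

```python
import re

# Matches the literal HTML line-break tag found in the raw reviews.
BR_TAG = re.compile(r"<br />")


def strip_br_tags(review):
    # Replace every occurrence of the tag with a single space.
    return BR_TAG.sub(" ", review)


print(strip_br_tags("Great film.<br /><br />Loved it."))
```

Two consecutive tags become two spaces, which is harmless here because tokenization later splits on whitespace.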


## Data Preprocessing

With the `<br />` tags replaced, print one review and its label from the first training batch to confirm the data loaded as expected.

In [None]:
for text_batch, label_batch in train_data.take(1):
  print(text_batch.numpy()[0])
  print(label_batch.numpy()[0]) # 0 = negative, 1 = positive

b"I thought the movie (especially the plot) needs a lot of work. The elements of the movie remains westernized and untrue to the attempt of trying to produce an eastern feel in the movie. I'll give three out of many of the flaws of the movie:  First, when Shen told Wendy that he would help her study the history of China, I was really happy that the audience would receive some information about Chinese history; but it turns out that the movie did not exactly show Wendy actually studying Chinese history; yet instead, the movie only shows Wendy practicing the method of remembering what she had studied, which frustrated and put me in dismay.  Second, which really bothered me, is how the characters kept mentioning about moon cakes -- moon cakes this and moon cakes that and how good it tastes. Yet they didn't really mention the real significance of it. The only they they talked about that had any relevance to the moon cake was the Autumn Festival, which they did not explain or go in depth. T

**Building the Model**


## Model Architecture

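The model is a Keras `Sequential` stack built one layer at a time: a string input, a `TextVectorization` layer that turns each review into integer token indices, an `Embedding` layer, an `LSTM` layer, and two `Dense` layers ending in a single sigmoid output.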

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input

model = Sequential()
model.add(Input(shape=(1,), dtype="string"))

In [None]:
from tensorflow.keras.layers import TextVectorization

max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
  # Max vocab size. Any words outside of the max_tokens most common ones
  # will be treated the same way: as "out of vocabulary" (OOV) tokens.
  max_tokens=max_tokens,
  # Output integer indices, one per string token
  output_mode="int",
  # Always pad or truncate to exactly this many tokens
  output_sequence_length=max_len,
)

In [None]:
# Call adapt(), which fits the TextVectorization layer to our text dataset.
# This is when the max_tokens most common words (i.e. the vocabulary) are selected.
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)
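As a rough mental model of what the layer does in `"int"` mode, the sketch below builds a vocabulary and maps text to fixed-length index lists in plain Python. It is a simplified imitation, not the Keras implementation: the helper names are hypothetical, and it assumes index 0 is padding, index 1 is the OOV token, and standardization is lowercasing plus punctuation stripping.

```python
import re
import string
from collections import Counter

PAD, OOV = "", "[UNK]"  # assumed reserved tokens at indices 0 and 1


def standardize(text):
    # Lowercase and drop ASCII punctuation.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocab(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Two slots are reserved, so at most max_tokens - 2 words are kept.
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return [PAD, OOV] + kept


def vectorize(text, vocab, max_len):
    index = {word: i for i, word in enumerate(vocab)}
    # Unknown words map to the OOV index 1.
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:max_len]                       # truncate long inputs
    return ids + [0] * (max_len - len(ids))   # pad short inputs with 0


vocab = build_vocab(["good good good film", "good film film", "bad"], max_tokens=4)
print(vocab)
print(vectorize("Good film, bad!", vocab, max_len=5))
```

With `max_tokens=4` only the two most frequent words survive, so "bad" falls back to the OOV index.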

In [None]:
from tensorflow.keras.layers import Embedding

# First add the TextVectorization layer that was adapted above.
model.add(vectorize_layer)

# TextVectorization already counts its padding and out-of-vocabulary (OOV)
# tokens within max_tokens, so its indices fall in [0, max_tokens).
# The extra row from max_tokens + 1 goes unused but is harmless.
model.add(Embedding(max_tokens + 1, 128))

**The Recurrent Layer**

In [None]:
from tensorflow.keras.layers import LSTM

# 64 is the "units" parameter, which is the
# dimensionality of the output space.
model.add(LSTM(64))

In [None]:
from tensorflow.keras.layers import Dense

model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
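To get a feel for the size of the stack built so far, the sketch below counts trainable parameters by hand. It assumes the usual formulas for embedding, LSTM, and dense layers with biases enabled; the helper functions are illustrative, and the notebook itself never prints a model summary to compare against.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


max_tokens, embed_dim, lstm_units = 1000, 128, 64
counts = {
    "embedding": embedding_params(max_tokens + 1, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense_relu": dense_params(lstm_units, 64),
    "dense_sigmoid": dense_params(64, 1),
}
total = sum(counts.values())
print(counts)
print(total)  # 181761
```

Most of the parameters sit in the embedding table, which is why `max_tokens` and the embedding width are the first knobs to consider when resizing the model.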

**Compiling and Training the Model**

## Training the Model

Compile the model with the Adam optimizer and binary cross-entropy loss, then train it for 10 epochs on the training set while tracking accuracy.

In [None]:
model.compile(
  optimizer='adam',
  loss='binary_crossentropy',
  metrics=['accuracy'],
)
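The `binary_crossentropy` loss chosen above can be written out for a single prediction in a few lines. This is a hand-rolled sketch for intuition, not the Keras implementation; the clipping constant `eps` is an assumption made here to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip the prediction so neither log() call receives zero.
    p = min(max(y_pred, eps), 1.0 - eps)
    # y_true is 1 for a positive review and 0 for a negative one.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_crossentropy(1, 0.9))
print(binary_crossentropy(1, 0.1))
```

Training pushes the sigmoid output toward 1 for positive reviews and toward 0 for negative ones because those are the predictions that minimize this quantity.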

In [None]:
model.fit(train_data, epochs=10)

Epoch 1/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 59s 72ms/step - accuracy: 0.5768 - loss: 0.6717
Epoch 2/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 80s 69ms/step - accuracy: 0.6848 - loss: 0.5875
Epoch 3/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 54s 69ms/step - accuracy: 0.7927 - loss: 0.4408
Epoch 4/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 53s 68ms/step - accuracy: 0.8102 - loss: 0.4140
Epoch 5/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 52s 67ms/step - accuracy: 0.7903 - loss: 0.4403
Epoch 6/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 83s 68ms/step - accuracy: 0.8124 - loss: 0.4119
Epoch 7/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 53s 68ms/step - accuracy: 0.8278 - loss: 0.3811
Epoch 8/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 52s 66ms/step - accuracy: 0.8346 - loss: 0.3650
Epoch 9/10
782/782

<keras.src.callbacks.history.History at 0x7d90943627a0>
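The `782` steps per epoch in the training log follow from the dataset size and the batch size. The arithmetic below assumes the loader's default batch size of 32, since the notebook does not set one explicitly.

```python
import math

num_reviews = 25000  # "Found 25000 files" in the loader output
batch_size = 32      # assumed default of text_dataset_from_directory

steps_per_epoch = math.ceil(num_reviews / batch_size)
last_batch_size = num_reviews - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)  # 782
print(last_batch_size)  # 8
```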

## Evaluation and Insights

Run the trained model on two hand-written reviews. The sigmoid output is a score between 0 and 1: values near 1 indicate a positive review and values near 0 a negative one.

In [None]:
print(model.predict(tf.data.Dataset.from_tensor_slices([
  "i loved it! highly recommend it to anyone and everyone looking for a great movie to watch.",
]).batch(1))) # Batch the dataset
print(model.predict(tf.data.Dataset.from_tensor_slices([
  "this was awful! i hated it so much, nobody should watch this. the acting was terrible, the music was terrible, overall it was just bad.",
]).batch(1))) # Batch the dataset

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 518ms/step
[[0.98522633]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step
[[0.00559015]]
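To turn these scores into class labels, one common choice is to threshold at 0.5. The sketch below applies that rule to the two scores printed above; `to_label` and the 0.5 cutoff are assumptions added for illustration, not part of the original notebook.

```python
def to_label(score, threshold=0.5):
    # Scores at or above the threshold count as positive (label 1 in the dataset).
    return "positive" if score >= threshold else "negative"


scores = [0.98522633, 0.00559015]  # the two predictions shown above
labels = [to_label(s) for s in scores]
print(labels)  # ['positive', 'negative']
```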
