### **Purpose**

This notebook builds a deep learning model using **LSTM** (Long Short-Term Memory) to perform **sentiment analysis** on the **IMDB movie reviews dataset**, classifying each review as either **positive (1)** or **negative (0)**.

### **Importing Libraries**

The following code imports the essential libraries for building and working with deep learning models in TensorFlow.

tensorflow and keras are used to define, train, and evaluate neural networks, while numpy is a library for numerical computations, often used to handle arrays and matrices, which are fundamental to machine learning tasks.

These libraries provide the tools necessary to build, optimize, and work with deep learning models efficiently.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [None]:
# Load the built-in IMDB dataset
imdb=keras.datasets.imdb

The above code loads the built-in IMDB dataset from Keras, which is a collection of movie reviews labeled as either positive or negative.

The dataset is commonly used for sentiment analysis tasks. Assigning keras.datasets.imdb to the variable imdb gives easy access to the dataset's training and test sets, which are used here to train and evaluate a binary classification model.

In [None]:
# Set the vocabulary size and maximum sequence length
vocab_size=10000
max_length=250

This code sets two important parameters for preprocessing the IMDB dataset.

vocab_size=10000 means that only the top 10,000 most frequent words in the dataset will be considered as features, effectively limiting the vocabulary to the most common terms.

max_length=250 specifies that the movie reviews will be padded or truncated to a fixed length of 250 words. This ensures that all input sequences have the same length, which is necessary for feeding data into a neural network.
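To make the effect of vocab_size concrete, here is a minimal plain-Python sketch of the capping step. It assumes the documented default of imdb.load_data in which out-of-vocabulary words are replaced by the index 2 (oov_char=2). The helper name cap_vocabulary and the toy review are hypothetical and are not part of Keras.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [idx if idx < num_words else oov_char for idx in sequence]

# Toy review encoded as word indices (made-up values for illustration).
# 12045 and 10000 fall outside a 10,000-word vocabulary.
review = [1, 14, 22, 12045, 9999, 10000]
capped = cap_vocabulary(review, num_words=10000)
print(capped)  # [1, 14, 22, 2, 9999, 2]
```

Rare words are therefore not dropped from a review; they are collapsed into a single "unknown word" index, which keeps the embedding table small.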

### **Loading Dataset**

The following code loads the IMDB dataset for training and testing, limiting the vocabulary to the top 10,000 most frequent words, as specified by num_words=vocab_size.

The imdb.load_data() function returns two tuples: (x_train, y_train) for the training set and (x_test, y_test) for the test set. x_train and x_test contain the tokenized movie reviews (represented as sequences of integers corresponding to word indices), while y_train and y_test contain the labels (0 for negative reviews, 1 for positive reviews).

In [None]:
# Load the dataset
(x_train,y_train),(x_test,y_test)=imdb.load_data(num_words=vocab_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1m17464789/17464789[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step


This prepares the data for training and evaluating a machine learning model.

In [None]:
# Pad the sequences to have the same length
x_train=keras.preprocessing.sequence.pad_sequences(x_train,maxlen=max_length)
x_test=keras.preprocessing.sequence.pad_sequences(x_test,maxlen=max_length)

This code ensures that all input sequences (movie reviews) have the same length by padding or truncating them to a fixed size of 250 words, as specified by maxlen=max_length.

The keras.preprocessing.sequence.pad_sequences() function is applied to both the training data (x_train) and the test data (x_test).

If a review is shorter than 250 words, it is padded with zeros at the beginning. If it is longer, it is truncated: with the default setting (truncating='pre'), words are dropped from the beginning, so the last 250 words are kept. This step is crucial because neural networks require inputs of consistent size for efficient processing.
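The padding and truncation behavior can be sketched in plain Python. This is only an illustration of the pad_sequences defaults (padding='pre', truncating='pre'), not the Keras implementation. The helper name pad_or_truncate is hypothetical.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults: padding='pre', truncating='pre'."""
    if len(sequence) >= maxlen:
        # 'pre' truncation drops items from the start, keeping the last maxlen items.
        return sequence[-maxlen:]
    # 'pre' padding places the fill value in front of the sequence.
    return [value] * (maxlen - len(sequence)) + sequence

short_review = pad_or_truncate([5, 6, 7], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)  # [0, 0, 5, 6, 7]
print(long_review)   # [3, 4, 5, 6, 7]
```

To keep the beginning of each review instead, pad_sequences accepts truncating='post'.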

### **Building The LSTM Model**

In [None]:
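# Build the LSTM model
model=keras.Sequential([
    keras.layers.Embedding(vocab_size,32),      # map each word index to a 32-dimensional vector
    keras.layers.LSTM(32),                      # 32 LSTM units read the review one word at a time
    keras.layers.Dense(1,activation='sigmoid')  # single output: probability that the review is positive
  ])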

This code builds a simple LSTM (Long Short-Term Memory) model using Keras for sentiment analysis. The model consists of three layers:

1. **Embedding Layer:** The Embedding(vocab_size, 32) layer transforms the integer-encoded words in the input sequences into dense vectors of size 32. It maps each of the 10,000 word indices (defined by vocab_size) to a trainable 32-dimensional vector, so the model can learn a representation for each word during training.

2. **LSTM Layer:** The LSTM(32) layer is a type of recurrent neural network (RNN) designed to capture long-term dependencies in sequences. It contains 32 units, which help the model learn patterns in the sequence of words in the reviews.

3. **Dense Layer:** The Dense(1, activation='sigmoid') layer is the output layer. It has a single neuron with a sigmoid activation function, which is used for binary classification (positive or negative review).

This model architecture is designed to process sequences of text and classify them into two categories: positive or negative.
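As a rough check on the size of this architecture, the number of trainable parameters per layer can be worked out by hand. The sketch below assumes standard layer definitions with bias terms enabled (the Keras defaults). The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four weight blocks (input, forget, and output gates plus the candidate
    # cell state), each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input per unit, plus one bias per unit.
    return input_dim * units + units

counts = {
    "embedding": embedding_params(10000, 32),
    "lstm": lstm_params(32, 32),
    "dense": dense_params(32, 1),
}
total = sum(counts.values())
print(counts)  # {'embedding': 320000, 'lstm': 8320, 'dense': 33}
print(total)   # 328353
```

Almost all of the parameters sit in the embedding table. These figures can be compared with the output of model.summary() once the model has been built.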

### **Compiling The Model**

In [None]:
# Compile the model
model.compile(optimizer='adam',            # adaptive learning-rate optimizer
              loss='binary_crossentropy',  # loss for two-class (0/1) targets
              metrics=['accuracy'])        # report classification accuracy
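To give a feel for what the binary_crossentropy loss measures, here is a minimal plain-Python sketch for a single prediction. It follows the textbook formula; the clipping constant eps is an assumption added to avoid log(0), and the function name is hypothetical rather than a Keras call.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Cross-entropy between a 0/1 label and a predicted probability."""
    # Clip the probability so that log(0) never occurs.
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

confident_right = binary_crossentropy(1, 0.9)  # small loss
confident_wrong = binary_crossentropy(1, 0.1)  # large loss
print(confident_right, confident_wrong)
```

The loss grows sharply when the model is confident and wrong, which is why the loss can rise even while accuracy stays roughly constant.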

### **Training The Model**

This code trains the LSTM model on the training data (x_train and y_train) for 10 epochs with a batch size of 32. The validation_split=0.2 parameter means that 20% of the training data will be used for validation during training. The model will be evaluated on this validation data after each epoch to monitor its performance and generalization ability.

The fit() method returns a history object, which stores the training and validation metrics (such as accuracy and loss) over the epochs. This allows you to track how the model's performance evolves over time.
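The split and batch arithmetic can be checked by hand. The sketch below assumes the 25,000 training reviews of the Keras IMDB dataset. The helper name is hypothetical, and Keras's exact rounding may differ for sizes that do not divide evenly.

```python
import math

def split_and_steps(num_samples, validation_split, batch_size):
    """Return (training samples, validation samples, batches per epoch)."""
    num_val = int(num_samples * validation_split)
    num_train = num_samples - num_val
    steps_per_epoch = math.ceil(num_train / batch_size)
    return num_train, num_val, steps_per_epoch

result = split_and_steps(25000, 0.2, 32)
print(result)  # (20000, 5000, 625)
```

The 625 batches per epoch correspond to the 625/625 counter in the training log. Keras takes the validation samples from the end of the arrays passed to fit(), before any shuffling.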

In [None]:
# Compile the model
model.compile(optimizer='adam',
               loss='binary_crossentropy',
               metrics=['accuracy'])

# Train the model
history = model.fit(x_train,y_train,
                  epochs=10,
                  batch_size=32,
                  validation_split=0.2)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 59s 90ms/step - accuracy: 0.6770 - loss: 0.5678 - val_accuracy: 0.8506 - val_loss: 0.3609
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 91ms/step - accuracy: 0.8900 - loss: 0.2795 - val_accuracy: 0.8530 - val_loss: 0.3660
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 81s 90ms/step - accuracy: 0.9277 - loss: 0.2003 - val_accuracy: 0.8602 - val_loss: 0.3286
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 91ms/step - accuracy: 0.9443 - loss: 0.1580 - val_accuracy: 0.8610 - val_loss: 0.3445
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 92ms/step - accuracy: 0.9598 - loss: 0.1195 - val_accuracy: 0.8584 - val_loss: 0.3573
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 81s 90ms/step - accuracy: 0.9400 - loss: 0.1575 - val_accuracy: 0.8596 - val_loss: 0.4435
Epoch 7/10
[output truncated]

This output shows the training progress over 10 epochs for the LSTM model. Each line corresponds to an epoch and provides several metrics:

1. **accuracy:** The model's accuracy on the training data for that epoch.

2. **loss:** The training loss, which measures the difference between the model's predictions and the actual labels. Lower values are better.

3. **val_accuracy:** The model's accuracy on the validation set after that epoch. This helps monitor how well the model generalizes to unseen data.

4. **val_loss:** The validation loss, which measures how well the model's predictions match the actual labels in the validation set.

**Key points from the output:**

- Epoch 1 reports a training accuracy of 67.70% and a validation accuracy of 85.06%.

- Epoch 2 sees a significant improvement in training accuracy (89.00%) and a slight increase in validation accuracy (85.30%).

- Training accuracy keeps rising to 95.98% by Epoch 5 and dips to 94.00% in Epoch 6, while validation accuracy stays within a narrow band, peaking at 86.10% in Epoch 4 and standing at 85.96% in Epoch 6. The log shown above is cut off after Epoch 6, so later epochs are not covered here.

- The validation loss starts at 0.3609 in Epoch 1, reaches its lowest value of 0.3286 in Epoch 3, and then climbs to 0.4435 by Epoch 6, suggesting some overfitting (the model fits the training data well but may struggle to generalize on validation data).

Overall, the model shows strong training performance, but the rising validation loss and the flat validation accuracy suggest overfitting from around Epoch 3 onward.
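One simple way to act on this observation is to find the epoch with the lowest validation loss. The sketch below uses the val_loss values for Epochs 1 to 6 copied from the training log above. With the real history object the same list would come from history.history['val_loss']. The helper name best_epoch is hypothetical.

```python
# Validation losses for Epochs 1-6, copied from the training log.
val_losses = [0.3609, 0.3660, 0.3286, 0.3445, 0.3573, 0.4435]

def best_epoch(losses):
    """Return the 1-based epoch number with the lowest validation loss."""
    best_index = min(range(len(losses)), key=lambda i: losses[i])
    return best_index + 1

print(best_epoch(val_losses))  # 3
```

Keras can automate this idea with the keras.callbacks.EarlyStopping callback (for example monitor='val_loss' with restore_best_weights=True), which stops training once the monitored value stops improving. Whether that helps here would need to be checked by rerunning the training.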

### **Evaluation**

This code evaluates the trained LSTM model on the test data (x_test and y_test) using the evaluate() function. The function returns two values: test_loss and test_acc.

- **test_loss** represents how well the model's predictions match the true labels on the test set, measured using the loss function chosen in compile() (binary cross-entropy).

- **test_acc** represents the model's accuracy on the test set, i.e., the proportion of correctly classified reviews.

Finally, it prints the test accuracy (test_acc), which provides an indication of how well the model performs on unseen data.

In [None]:
# evaluate the model
test_loss,test_acc=model.evaluate(x_test,y_test)
print('Test accuracy:',test_acc)

782/782 ━━━━━━━━━━━━━━━━━━━━ 22s 28ms/step - accuracy: 0.8464 - loss: 0.5608
Test accuracy: 0.8469200134277344


This output indicates that the model has been evaluated on the test set. The 782/782 shows that the test data was processed in 782 batches, and the evaluation took 22 seconds at an average of 28 milliseconds per step.

- **The accuracy:** 0.8464 indicates that the model achieved an accuracy of approximately 84.64% on the test data during the evaluation.

- **The loss:** 0.5608 represents the value of the loss function (binary cross-entropy), which quantifies how far the model's predictions are from the actual labels.

Finally, the printed Test accuracy: 0.8469 confirms that the model's accuracy on the test set is approximately 84.69%. The model classifies most reviews correctly, but the test loss of 0.5608 is well above the lowest validation loss seen during training (0.3286), which is consistent with the overfitting noted earlier.
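For intuition, the accuracy figure can be reproduced in miniature: each sigmoid output is thresholded at 0.5 and compared with the true label. The sketch below is plain Python with made-up labels and probabilities. It also checks the batch count, assuming the 25,000 test reviews of the Keras IMDB dataset and the default evaluation batch size of 32.

```python
import math

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == label) for pred, label in zip(predictions, y_true))
    return correct / len(y_true)

# Hypothetical labels and model outputs, for illustration only.
labels = [1, 0, 1, 1, 0]
probabilities = [0.92, 0.30, 0.45, 0.71, 0.08]
toy_accuracy = binary_accuracy(labels, probabilities)
print(toy_accuracy)  # 0.8

# Number of evaluation batches: 25,000 reviews in batches of 32.
num_batches = math.ceil(25000 / 32)
print(num_batches)  # 782
```

The third toy review is positive but receives a probability below 0.5, so it counts as the single error out of five.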