# Sentiment Analysis on Movie Reviews (NLP Project)

In this project, I train a model on IMDB movie reviews and use it to predict whether a review is positive or negative.


## Dataset

In [26]:
import tensorflow as tf
import tensorflow_datasets as tfds # For IMDB dataset
import numpy as np

In [27]:
# IMDB Dataset

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

# Split data into train and test

train_data = imdb['train']
test_data = imdb['test']

# as_supervised=True returns each example as a (text, label) pair,
# where the label is 0 for a negative review and 1 for a positive one.

In [28]:
# For converting dataset to numpy arrays

train_sentences = []
train_labels = []

test_sentences = []
test_labels = []

In [29]:
# Read the training and test data and add it to the lists.
# Each review arrives as a bytes tensor, so decode it to a regular string.

for s, l in train_data:
    train_sentences.append(s.numpy().decode("utf-8"))
    train_labels.append(l.numpy())

for s, l in test_data:
    test_sentences.append(s.numpy().decode("utf-8"))
    test_labels.append(l.numpy())
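
Each review comes out of `s.numpy()` as a `bytes` object, so how it is turned into text matters for the tokenizer later on. A minimal sketch in plain Python (the sample bytes are made up for illustration) of the difference between `str()` and `.decode()`:

```python
# Hypothetical review text, stored as bytes the way s.numpy() returns it.
raw = b"This movie wasn't bad."

as_repr = str(raw)             # keeps the b"..." wrapper as literal characters
as_text = raw.decode("utf-8")  # yields the actual review text

print(as_repr)  # b"This movie wasn't bad."
print(as_text)  # This movie wasn't bad.
```

With `str()`, the leading `b` and the surrounding quotes become part of the string that is later passed to the vectorizer; decoding avoids that.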

In [30]:
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)

# Convert the labels (consisting of 0s and 1s) into a numpy array using np.array().
# This way, we can perform faster processing during model training.

In [31]:
# Print on the screen how many training and test examples we have.

print(f"Dataset Loaded: {len(train_sentences)} training samples, {len(test_sentences)} test samples.")

Dataset Loaded: 25000 training samples, 25000 test samples.


**Summary of the Stage**:

- We downloaded the IMDB dataset with TensorFlow Datasets.

- We collected the reviews in Python lists and converted the labels to NumPy arrays.

- We checked the sizes of the training and test sets.


**Next Up: Tokenization**

## Tokenization

In [34]:
import tensorflow_text as tf_text  # Optional: the TextVectorization layer below is part of Keras and does not need it

In [37]:
# TextVectorization layer

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000,  # Cap the vocabulary at 10,000 entries (most frequent words, plus the padding and OOV tokens)
    output_sequence_length=200  # Pad or truncate every review to exactly 200 tokens
)

In [38]:
# Adapt tokenizer to training data

vectorizer.adapt(train_sentences)

In [39]:
# Example

sample_text = ["The movie was fantastic! I loved it."]
sample_tokenized = vectorizer(sample_text)

print(sample_tokenized)

tf.Tensor(
[[  2  18  14 771  11 438   9   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]], shape=(1, 200), dtype=int64)


In [40]:
# Tokenize all training and testing data

train_sequences = vectorizer(np.array(train_sentences))
test_sequences = vectorizer(np.array(test_sentences))

In [41]:
# Let's convert TensorFlow tensors to NumPy arrays

train_padded = np.array(train_sequences)
test_padded = np.array(test_sequences)

In [42]:
# Check shapes

print(f"Train Padded Shape: {train_padded.shape}")
print(f"Test Padded Shape: {test_padded.shape}")

Train Padded Shape: (25000, 200)
Test Padded Shape: (25000, 200)


**Summary of the Stage**:

- We converted words to integer ids with TextVectorization.
- We fixed the length of every review to 200 tokens (padding short reviews, truncating long ones).
- The model will now see texts as sequences of numbers.
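
To make the steps above concrete, here is a rough pure-Python sketch of what a text-vectorization step does conceptually. It is not the Keras implementation: the helper names (`standardize`, `build_vocab`, `vectorize`) and the tiny corpus are hypothetical, and it only mirrors the layer's default behaviour of lowercasing, stripping punctuation, splitting on whitespace, and reserving id 0 for padding and id 1 for out-of-vocabulary words.

```python
import string
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved ids, mirroring the layer's default behaviour


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts, max_tokens):
    """Map the most frequent words to ids 2, 3, ... (0 and 1 are reserved)."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(kept, start=2)}


def vectorize(text, vocab, length):
    """Convert text to exactly `length` ids, truncating or zero-padding."""
    ids = [vocab.get(word, OOV_ID) for word in standardize(text).split()]
    ids = ids[:length]
    return ids + [PAD_ID] * (length - len(ids))


# Toy corpus (made up); word frequencies: the=4, movie=3, was=2, great=1, bad=1
corpus = ["the movie was great", "the movie was bad", "the movie", "the"]
vocab = build_vocab(corpus, max_tokens=5)
encoded = vectorize("The movie was fantastic!", vocab, length=6)

print(vocab)    # {'the': 2, 'movie': 3, 'was': 4}
print(encoded)  # [2, 3, 4, 1, 0, 0]
```

The unknown word "fantastic" maps to the OOV id 1, and the trailing zeros are padding, which is the same pattern visible in the `shape=(1, 200)` tensor printed earlier.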


**Next Up: An LSTM (Long Short-Term Memory) based sentiment analysis model**

## An LSTM (Long Short-Term Memory) Based Sentiment Analysis Model

In [54]:
# Create the LSTM model

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200,)),
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),  # sequence length comes from the Input layer
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

- Embedding Layer → Maps each word id to a trainable 16-dimensional vector.
- Bidirectional LSTM → Reads the sequence in both directions, so each position gets context from the words before and after it. The first layer returns the full sequence so that a second LSTM can be stacked on top.
- Dense Layer (ReLU) → A small hidden layer that combines the LSTM features before the output.
- Dense Layer (Sigmoid) → Output layer; returns a sentiment score between 0 and 1.
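
As a sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and the default concatenation of the two directions in `Bidirectional`; the helper functions are hypothetical, and the total should be compared against what `model.summary()` reports.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights of a fully connected layer: kernel plus bias."""
    return input_dim * units + units


vocab_size, embed_dim = 10000, 16

embedding = vocab_size * embed_dim
bilstm_1 = 2 * lstm_params(embed_dim, 64)  # forward + backward LSTM(64)
bilstm_2 = 2 * lstm_params(2 * 64, 32)     # input is the concatenated 128-dim output
dense_hidden = dense_params(2 * 32, 16)    # input is the concatenated 64-dim output
dense_out = dense_params(16, 1)

total = embedding + bilstm_1 + bilstm_2 + dense_hidden + dense_out
print(embedding, bilstm_1, bilstm_2, dense_hidden, dense_out, total)
# 160000 41472 41216 1040 17 243745
```

About two thirds of the parameters sit in the embedding table, which is worth keeping in mind when changing `max_tokens` or the embedding size.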

In [55]:
# Compile model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary

model.summary()
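
For reference, the `binary_crossentropy` loss used in `compile` penalizes confident wrong predictions much more than hesitant ones. A minimal per-example sketch in plain Python (the clipping constant `eps` is an assumption for numerical safety, not necessarily the value Keras uses):

```python
import math


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is the predicted probability of the positive class."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


for p in (0.9, 0.5, 0.1):
    print(f"label=1, p={p}: loss={binary_crossentropy(1, p):.4f}")
# label=1, p=0.9: loss=0.1054
# label=1, p=0.5: loss=0.6931
# label=1, p=0.1: loss=2.3026
```

A model that always outputs 0.5 has a loss of ln 2 ≈ 0.693, which is a useful baseline when reading the loss values of the first training epoch.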

In [56]:
# Train the model

num_epochs = 5

history = model.fit(train_padded, train_labels, epochs=num_epochs, validation_data=(test_padded, test_labels))

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 340s 386ms/step - accuracy: 0.5534 - loss: 0.6746 - val_accuracy: 0.7211 - val_loss: 0.5907
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 331s 423ms/step - accuracy: 0.7931 - loss: 0.4583 - val_accuracy: 0.8197 - val_loss: 0.4104
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 376s 416ms/step - accuracy: 0.8671 - loss: 0.3323 - val_accuracy: 0.8244 - val_loss: 0.4138
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 330s 422ms/step - accuracy: 0.8958 - loss: 0.2694 - val_accuracy: 0.8360 - val_loss: 0.4117
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 378s 416ms/step - accuracy: 0.9186 - loss: 0.2216 - val_accuracy: 0.8340 - val_loss: 0.3956


In [57]:
# Evaluate the model (80% or more desired)

loss, acc = model.evaluate(test_padded, test_labels)
print(f"Test Accuracy: {acc:.4f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 62s 79ms/step - accuracy: 0.8359 - loss: 0.3943
Test Accuracy: 0.8340
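
The `782/782` counter in the training and evaluation logs is the number of batches per pass over the data. A quick check, assuming the Keras default batch size of 32 (no `batch_size` was passed to `fit` or `evaluate`):

```python
import math

num_samples = 25000  # size of the training set (and of the test set)
batch_size = 32      # Keras default when no batch_size is given

steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch = num_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch)  # 782 8
```

So each epoch runs 781 full batches plus one final batch of 8 reviews.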


**Test Accuracy: 0.8340**

The model reached 83.4% accuracy on the test set, which is a solid baseline for sentiment analysis.

However, the following can be implemented for improvement:


- Increase the Embedding Size:
```
tf.keras.layers.Embedding(input_dim=10000, output_dim=32)
```
- Make the LSTM Layers Deeper:
```
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True))
```
- Prevent Overfitting with Dropout:
```
tf.keras.layers.Dropout(0.3)
```
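
The dropout suggestion can be motivated from the training log itself: training accuracy keeps rising while validation accuracy levels off around 0.83. A small sketch that quantifies this, using only the numbers printed during training (note that the training accuracy Keras prints is averaged over the epoch, so the gap is approximate):

```python
# (train_accuracy, val_accuracy, val_loss) per epoch, copied from the training log
log = [
    (0.5534, 0.7211, 0.5907),
    (0.7931, 0.8197, 0.4104),
    (0.8671, 0.8244, 0.4138),
    (0.8958, 0.8360, 0.4117),
    (0.9186, 0.8340, 0.3956),
]

for epoch, (train_acc, val_acc, val_loss) in enumerate(log, start=1):
    gap = train_acc - val_acc
    print(f"epoch {epoch}: train-val accuracy gap = {gap:+.4f}")

# Epoch with the lowest validation loss (1-based)
best_epoch = min(range(len(log)), key=lambda i: log[i][2]) + 1
print(f"lowest val_loss at epoch {best_epoch}")
```

The gap turns positive from epoch 3 onward and grows to about 8 percentage points by epoch 5, which is the usual sign that regularization such as dropout may help.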
**But for now I want to leave it like this and move on.**

**In this step:**

- We created a sentiment analysis model based on LSTM.

- We trained the model with train_padded data.

- We calculated the accuracy of the model on the test data.


**Next Up: Testing the Model with Real Comments**

## Testing the Model with Real Comments

In [58]:
# Testing with a sample user text

sample_text = ["The movie was absolutely amazing, I loved it!"]

# Vectorizing

sample_sequence = vectorizer(sample_text)
sample_padded = tf.cast(sample_sequence, tf.int32)  # cast the int64 token ids to int32

In [59]:
# Get model prediction

prediction = model.predict(sample_padded)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 608ms/step


In [60]:
# Print the result

print(f"Sentiment Score: {prediction[0][0]:.4f}")

if prediction[0][0] > 0.5:
    print("Prediction: Positive Comment 🙂")
else:
    print("Prediction: Negative Comment 😔")

Sentiment Score: 0.9544
Prediction: Positive Comment 🙂


In [61]:
# More examples
# (separate variable names, so the IMDB test set loaded earlier is not overwritten)

example_sentences = [
    "I really enjoyed this movie, it was fantastic!",
    "The plot was very boring and the acting was terrible.",
    "One of the best movies I have ever seen.",
    "I would never watch this movie again, complete waste of time!",
    "Not bad, but could have been better.",
    "An absolute masterpiece! Highly recommend it.",
    "I fell asleep while watching, not entertaining at all."
]

# Tokenize

example_sequences = vectorizer(example_sentences)
example_padded = tf.cast(example_sequences, tf.int32)

# Prediction

predictions = model.predict(example_padded)

# Printing

for i, text in enumerate(example_sentences):
    score = predictions[i][0]
    sentiment = "Positive 🙂" if score > 0.5 else "Negative 😔"
    print(f"Comment: {text}\nPrediction: {sentiment} (Score: {score:.4f})\n")

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 807ms/step
Comment: I really enjoyed this movie, it was fantastic!
Prediction: Positive 🙂 (Score: 0.9531)

Comment: The plot was very boring and the acting was terrible.
Prediction: Negative 😔 (Score: 0.1333)

Comment: One of the best movies I have ever seen.
Prediction: Positive 🙂 (Score: 0.9287)

Comment: I would never watch this movie again, complete waste of time!
Prediction: Negative 😔 (Score: 0.2672)

Comment: Not bad, but could have been better.
Prediction: Positive 🙂 (Score: 0.6611)

Comment: An absolute masterpiece! Highly recommend it.
Prediction: Positive 🙂 (Score: 0.9638)

Comment: I fell asleep while watching, not entertaining at all.
Prediction: Negative 😔 (Score: 0.4440)



**Summary of the Stage**:

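- We tested the model on hand-written sample comments.
- We ran it on a single comment and on a batch of comments.
- The predicted labels matched the intended sentiment of the clearly positive and clearly negative comments; the mixed comment "Not bad, but could have been better." was scored as positive (0.6611).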
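
Some of the scores above sit fairly close to the 0.5 threshold. One possible way to surface that, sketched here in plain Python, is to flag predictions that fall within a band around the threshold. The helper `label_with_margin` and the band width of 0.2 are hypothetical choices for illustration, not tuned values.

```python
# Scores copied from the predictions printed above.
scores = {
    "I really enjoyed this movie, it was fantastic!": 0.9531,
    "Not bad, but could have been better.": 0.6611,
    "I fell asleep while watching, not entertaining at all.": 0.4440,
}


def label_with_margin(score, threshold=0.5, band=0.2):
    """Return the label plus a flag for scores within `band` of the threshold.

    `band` is a hypothetical cut-off chosen for illustration, not a tuned value.
    """
    label = "Positive" if score > threshold else "Negative"
    uncertain = abs(score - threshold) < band
    return label, uncertain


for text, score in scores.items():
    label, uncertain = label_with_margin(score)
    note = " (low confidence)" if uncertain else ""
    print(f"{score:.4f} -> {label}{note}: {text}")
```

With this band, the first comment is reported as a confident positive, while the other two are flagged as low confidence.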

**Next Up: Deploying the Model (Publishing as API)**

## Deploying the Model (Publishing as API)

In [63]:
# Save the model

from google.colab import drive
drive.mount('/content/drive')

# Save the model to Drive
model.save('/content/drive/MyDrive/sentiment_model.keras')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [72]:
# Libraries

!pip install flask flask-ngrok pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Downloading pyngrok-7.2.3-py3-none-any.whl (23 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.3


In [None]:
import os

# Ngrok token

os.environ["NGROK_AUTH_TOKEN"] = input("Enter your ngrok auth token: ")

In [93]:
!ngrok authtoken $NGROK_AUTH_TOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [94]:
# Load trained model

model = tf.keras.models.load_model("/content/drive/MyDrive/sentiment_model.keras", compile=False)

print("Model loaded successfully")

Model loaded successfully


In [None]:
# Flask API

from flask import Flask, request, jsonify
from pyngrok import ngrok  # pyngrok opens the tunnel below, so flask_ngrok is not needed

# Start Flask

app = Flask(__name__)

# ngrok connection

public_url = ngrok.connect(5000)
print(f"Public URL: {public_url}")

# API Endpoint: /predict

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()  # Retrieve the JSON data sent by the user
    text = data["text"]  # Extract the text from the JSON request

    # Tokenize the text

    sequence = vectorizer([text])
    padded_sequence = tf.cast(sequence, tf.int32)

    # Get the model's prediction

    prediction = model.predict(padded_sequence)[0][0]

    # Determine the sentiment

    sentiment = "Positive" if prediction > 0.5 else "Negative"

    # Return the response as JSON

    return jsonify({"text": text, "sentiment": sentiment, "score": float(prediction)})

app.run(port=5000)
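
The endpoint above reads `data["text"]` directly, so a request without a JSON body or without a `text` field would fail with a server error. A minimal sketch of a validation step in plain Python; `parse_payload` is a hypothetical helper, not part of Flask or of the notebook:

```python
def parse_payload(data):
    """Return (text, error); exactly one of the two is None.

    Hypothetical helper, not part of Flask or of the notebook above.
    """
    if not isinstance(data, dict):
        return None, "Request body must be a JSON object."
    text = data.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, 'Field "text" must be a non-empty string.'
    return text.strip(), None


print(parse_payload({"text": "Great movie!"}))    # ('Great movie!', None)
print(parse_payload({"review": "Great movie!"}))  # missing "text" field
print(parse_payload(None))                        # no JSON body at all
```

Inside `predict()`, one option would be to call `request.get_json(silent=True)`, pass the result to this helper, and return `jsonify({"error": error}), 400` when an error is set.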

In [None]:
import requests

# URL (use the public URL that ngrok printed for the current session)

url = "https://aaef-35-189-190-47.ngrok-free.app/predict"

# Test

data = {"text": "This movie was absolutely amazing! I loved it."}

# POST

response = requests.post(url, json=data)

# Response

print(response.json())

**Summary of the Stage**:

- We deployed the trained sentiment analysis model as a Flask API.

- We used ngrok to make the API accessible from external sources, allowing real-time sentiment predictions.

- The API successfully receives text input, processes it, and returns a classification result.

# Conclusion

In this project, we built a sentiment analysis model using a bidirectional LSTM network with TensorFlow. The model was trained on IMDB movie reviews, tokenized with the TextVectorization layer, and evaluated on the held-out test set. After it reached 83.4% test accuracy, we deployed the model as a Flask API, making it accessible for real-time predictions.

To integrate the model with an external interface, we used ngrok, allowing API access from anywhere. The API successfully receives text input, processes it, and returns a sentiment prediction.

This project demonstrates the complete pipeline of a machine learning model, from training and evaluation to deployment as a web service. Future improvements could include fine-tuning with Transformer models, deploying on a cloud platform, and integrating with a web-based user interface for a more interactive experience.