<a href="https://colab.research.google.com/github/Wezz-git/AI-samples/blob/main/(NLP)_Sentiment_Analysis_on_text_(Movie_reviews).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Business Problem:

You're a data scientist at a movie studio. Your boss asks, "Our new movie just came out. I need to know if people love it or hate it. Can you build a model that reads thousands of online reviews and tells us if they are Positive or Negative?"

This is a Natural Language Processing (NLP) task known as sentiment analysis: we train a model to read a piece of text and classify it as positive or negative.

The Model:

We will build a simple neural network (a Sequential model) using TensorFlow and Keras.

In [None]:

# We're going to use a famous "toy" dataset that is built directly into the keras library: the IMDB Movie Reviews dataset.
# This dataset contains 50,000 movie reviews, pre-labeled as "Positive" (1) or "Negative" (0).

import tensorflow as tf
from tensorflow.keras.datasets import imdb
import numpy as np

# 1 - Load the dataset
# Keep only the 10,000 most common words; rarer words are replaced by the "unknown" index (2)

num_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

# 2 - Check what we got

print("-- Data Loaded --")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

print("\n--- Our First Training Review (Raw) ---")
print(X_train[0])

print("\n--- Our First Training Label ---")
print(y_train[0])



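Standardize the length of every review. Reviews shorter than 256 tokens are filled up with zeros (padding), and longer ones are cut down to 256 tokens (truncation).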

In [2]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define the length

max_length = 256

print(" Padding sequences..")

# This is the "standardizing" step

X_train_padded = pad_sequences(X_train, maxlen=max_length, padding='post')
X_test_padded = pad_sequences(X_test, maxlen=max_length, padding='post')

# Check that it worked

print(f"\nOriginal length of first review: {len(X_train[0])}")
print(f"Padded length of first review: {len(X_train_padded[0])}")

print(f"\nOriginal length of second review: {len(X_train[1])}")
print(f"Padded length of second review: {len(X_train_padded[1])}")

 Padding sequences..

Original length of first review: 218
Padded length of first review: 256

Original length of second review: 189
Padded length of second review: 256
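
To make the behaviour concrete, here is a small pure-Python sketch of what `pad_sequences(..., padding='post')` does to a single sequence. The function name `pad_post` is hypothetical. One detail worth knowing: since the notebook does not set `truncating`, Keras uses its default of `'pre'`, so a review longer than `maxlen` loses its beginning, not its end.

```python
def pad_post(sequence, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') for one sequence."""
    if len(sequence) >= maxlen:
        # Default truncating='pre': keep only the last `maxlen` tokens.
        return list(sequence[-maxlen:])
    # padding='post': append the fill value after the real tokens.
    return list(sequence) + [value] * (maxlen - len(sequence))


short = pad_post([11, 12, 13], maxlen=5)
long = pad_post([11, 12, 13, 14, 15, 16], maxlen=4)
print(short)
print(long)
```

Passing `truncating='post'` to `pad_sequences` would keep the start of long reviews instead.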


Build the "Brain" - the model that will learn to read the reviews.

This is the "recipe" for a basic (but powerful) text-classification model.
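
Before the Keras code, it may help to see what the embedding and pooling steps compute. The sketch below is an illustration only: it uses a hypothetical 2-dimensional embedding table with made-up values, whereas the real model learns 10,000 rows of 16 numbers each. The names `embedding_table` and `embed_and_average` are assumptions for this example.

```python
# Hypothetical 2-dimensional embeddings for a 4-token vocabulary (ID 0 = padding).
# In Keras every row, including row 0, is learned during training.
embedding_table = {
    0: [0.0, 0.0],
    1: [0.9, 0.1],
    2: [-0.8, 0.2],
    3: [0.7, -0.3],
}


def embed_and_average(token_ids):
    """Look up one vector per token, then average them position by position."""
    vectors = [embedding_table[t] for t in token_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# A two-word review padded to length 4.
summary = embed_and_average([1, 3, 0, 0])
print(summary)
```

Note that the padded positions are averaged in as well. The model in this notebook behaves the same way, because the Embedding layer is created without `mask_zero=True`.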

In [3]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAvgPool1D, Dense

# 1 - Define the hyperparameters
vocab_size = 10000                    # 10,000-word vocabulary
embedding_dim = 16                    # size of the vector learned for each word
max_length = 256                      # 256-word review length

# 2 - Build the model

model = Sequential([
    # Layer 1 - "Embedding" Layer
    # Layer LEARNS the 'meaning' of words (e.g., 'awful' ends up close to 'bad')
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),

    # Layer 2 - "Pooling" Layer
    # Layer READS all word meanings and averages them into one summary
    GlobalAvgPool1D(),

    # Layer 3 - "Decision" Layer
    # Standard Neural Network layer that learns complex patterns
    Dense(16, activation='relu'),

    # Layer 4 - "Output" Layer
    # One neuron that outputs a single number between 0 (Negative) and 1 (Positive)
    Dense(1, activation='sigmoid')
])

# 3 - Compile the model

# Give the model its instructions
# optimizer='adam' : how to learn
# loss='binary_crossentropy' : how to measure error for a Yes/No problem
# metrics=['accuracy'] : the "report card" we want to see

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# 4 - print the model summary
# This shows the "blueprint" of the model just built
model.summary()
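
The parameter counts that a model summary reports can be checked by hand. The arithmetic below is a sketch based only on the layer sizes defined in this cell; the variable names are chosen for the illustration.

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 16

# Embedding: one 16-number vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Global average pooling has nothing to learn.
pooling_params = 0
# Dense(16): a 16x16 weight matrix plus 16 biases.
hidden_params = embedding_dim * hidden_units + hidden_units
# Dense(1): 16 weights plus 1 bias.
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Almost all of the trainable parameters sit in the embedding table, which is typical for small text classifiers of this shape.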



Train the "Brain" - the model now needs to be fit (trained) on the padded data.
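
Training adjusts the weights to reduce the binary cross-entropy loss chosen in the compile step. As a minimal sketch of that formula for a single review, the helper below computes the loss for one label and one predicted probability. The function name and the clipping constant `eps` are assumptions for this illustration, not a copy of the Keras internals.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: y_true is 0 or 1, y_pred is a probability."""
    # Clip the prediction so log(0) can never occur.
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


# A positive review (label 1) scored with high and with low confidence.
confident_right = binary_crossentropy(1, 0.95)
confident_wrong = binary_crossentropy(1, 0.05)
print(round(confident_right, 4), round(confident_wrong, 4))
```

A confident wrong answer is punished far more heavily than a confident right answer is rewarded, which is why this loss suits Yes/No problems.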

In [8]:
# 1 - Train the model

# Train on the padded data for 20 "epochs" (full passes over the training data)
# Set aside 20% (0.2) of the training data for validation
# We can watch it learn in real time.

print("Training the model..")
history = model.fit(
    X_train_padded, y_train,
    validation_split=0.2,
    epochs=20,
    batch_size=512,
    verbose=1
)
print("Training complete!")

# 2 - Evaluate the model
# Test the model on the 'X_test_padded' data it hasn't seen

loss, accuracy = model.evaluate(X_test_padded, y_test)
print("Evaluation complete!")

print("\n-- Final Model Performance --")
print(f"Test Accuracy: {accuracy * 100:.2f}%")
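
Two things in the log printed for this cell can be checked with a little arithmetic. First, the `40/40` step counter follows from the split and batch size, given that the Keras IMDB split provides 25,000 training reviews. Second, the validation loss in the first six logged epochs never improves on its first value while training accuracy sits near 99%, which usually points to overfitting (and suggests the cell may have been re-run on an already-trained model). The sketch below reproduces the step count and shows where a simple patience rule would have stopped. The function `early_stop_epoch` and the patience value are assumptions for this illustration.

```python
import math

# 25,000 training reviews, 20% held out for validation, batches of 512.
train_samples = 25000
validation_split = 0.2
batch_size = 512

val_samples = int(train_samples * validation_split)
fit_samples = train_samples - val_samples
steps_per_epoch = math.ceil(fit_samples / batch_size)
print(fit_samples, val_samples, steps_per_epoch)


def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss values copied from the first six epochs of the log.
logged_val_loss = [0.4685, 0.4720, 0.4711, 0.4785, 0.4879, 0.5094]
stop_at = early_stop_epoch(logged_val_loss, patience=3)
print(stop_at)
```

In Keras the same idea is available as the `EarlyStopping` callback in `tensorflow.keras.callbacks`, passed to `model.fit` through the `callbacks` argument.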

Training the model..
Epoch 1/20
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.9896 - loss: 0.0422 - val_accuracy: 0.8760 - val_loss: 0.4685
Epoch 2/20
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9901 - loss: 0.0434 - val_accuracy: 0.8764 - val_loss: 0.4720
Epoch 3/20
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9915 - loss: 0.0381 - val_accuracy: 0.8776 - val_loss: 0.4711
Epoch 4/20
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9930 - loss: 0.0387 - val_accuracy: 0.8752 - val_loss: 0.4785
Epoch 5/20
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9923 - loss: 0.0364 - val_accuracy: 0.8782 - val_loss: 0.4879
Epoch 6/20
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9917 - loss: 0.0377 - val_accuracy: 0.8722 - val_loss: 0.5094
Epoch 7/20
[1m40/4