# CinemaScope: Automated Sentiment Analysis of IMDb Movie Reviews for Strategic Entertainment Marketing

## Business Problem Statement
### Background:
In the entertainment industry, especially in film production and distribution, understanding audience sentiment is crucial for both marketing strategies and content creation. Movie reviews, as expressed by viewers on platforms like IMDb, social media, and other review websites, offer a wealth of data that can provide insights into audience preferences and perceptions. However, manually analyzing these reviews is time-consuming, subjective, and inefficient, especially given the volume of data generated with each movie release.

### Problem Statement:
To enhance strategic decision-making and improve customer engagement, our company seeks to automate the process of sentiment analysis on movie reviews. The goal is to develop a machine learning model that can accurately classify the sentiment of movie reviews as positive or negative. This will enable us to quickly gauge public opinion of new releases, identify shifts in viewer preferences, and adjust marketing strategies accordingly.

In [1]:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

In [12]:
# Function for loading the data
def load_d():
    # Load the IMDb dataset
    train_data, test_data, info = tfds.load('imdb_reviews', split=['train', 'test'], as_supervised=True, with_info=True)
    return train_data, test_data, info

In [3]:
# Data Preprocessing
def preprocess_data(train_data, test_data, max_length=256):
    # Preprocessing the dataset: convert to padded sequences
    train_sentences = []
    train_labels = []
    
    test_sentences = []
    test_labels = []
    
    # Decode each review from bytes to text and collect its integer label
    for sentence, label in train_data:
        train_sentences.append(sentence.numpy().decode('utf8'))
        train_labels.append(label.numpy())
    
    for sentence, label in test_data:
        test_sentences.append(sentence.numpy().decode('utf8'))
        test_labels.append(label.numpy())

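    # Build the vocabulary from the training text only; words outside the
    # num_words limit are mapped to the "<OOV>" token
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="<OOV>")
    tokenizer.fit_on_texts(train_sentences)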

    # Convert sentences to sequences
    train_sequences = tokenizer.texts_to_sequences(train_sentences)
    test_sequences = tokenizer.texts_to_sequences(test_sentences)

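    # Pad or truncate every sequence to max_length tokens. Padding uses the
    # pad_sequences default ('pre', zeros at the front); truncating='post'
    # keeps the first max_length tokens of longer reviews.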
    train_padded = pad_sequences(train_sequences, maxlen=max_length, truncating='post')
    test_padded = pad_sequences(test_sequences, maxlen=max_length, truncating='post')

    # Convert labels to numpy arrays
    train_labels = np.array(train_labels)
    test_labels = np.array(test_labels)

    return train_padded, test_padded, train_labels, test_labels
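
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences(..., maxlen=max_length, truncating='post')` does when `padding` is left at its default of `'pre'`: longer sequences keep their first `maxlen` tokens, and shorter ones are filled with zeros at the front. The helper name `pad_pre_truncate_post` and the token ids are hypothetical and exist only for illustration; the notebook itself keeps using the Keras function.

```python
def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(padding='pre', truncating='post') for lists of ints."""
    padded = []
    for seq in sequences:
        kept = list(seq)[:maxlen]               # 'post' truncation drops tokens from the end
        fill = [value] * (maxlen - len(kept))   # 'pre' padding puts the fill value at the front
        padded.append(fill + kept)
    return padded


# Made-up token ids: one short review and one long review
example = pad_pre_truncate_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(example)  # [[0, 5, 8, 2], [7, 1, 9, 4]]
```

With this combination the zeros sit before the review text, so the last timesteps the LSTM sees are real tokens rather than padding.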

In [4]:
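# Model Architecture
def build_model(vocab_size=10000, embedding_dim=128, rnn_units=64):
    model = Sequential([
        # Map each token id to a dense vector of size embedding_dim
        Embedding(vocab_size, embedding_dim),
        # return_sequences=True passes the full sequence to the next LSTM
        LSTM(rnn_units, return_sequences=True),
        LSTM(rnn_units),
        # Single sigmoid unit: probability that the review is positive
        Dense(1, activation='sigmoid')
    ])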
    model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    return model
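
As a sanity check on the architecture, the trainable parameter count can be worked out by hand. The sketch below assumes the Keras defaults used above (bias terms enabled, standard LSTM with four gates) and the default arguments of `build_model`; the helper function names are hypothetical. The total should agree with what `model.summary()` reports once the model is built.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per vocabulary entry
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


vocab_size, embedding_dim, rnn_units = 10000, 128, 64

layer_params = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "lstm_1": lstm_params(embedding_dim, rnn_units),
    "lstm_2": lstm_params(rnn_units, rnn_units),
    "dense": dense_params(rnn_units, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count:,}")
print(f"total: {total_params:,}")  # total: 1,362,497
```

Most of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` are the first knobs to turn if the model needs to shrink.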

In [13]:
train_data, test_data, info = load_d()

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\USER\tensorflow_datasets\imdb_reviews\plain_text\1.0.0...


RecursionError: maximum recursion depth exceeded while calling a Python object

In [7]:
train_padded, test_padded, train_labels, test_labels = preprocess_data(train_data, test_data)

NameError: name 'train_data' is not defined

In [None]:
model = build_model()
model.summary()

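# Train the model. The test split doubles as validation data here, so
# val_accuracy is not an independent estimate of generalization.
history = model.fit(train_padded, train_labels, epochs=10, validation_data=(test_padded, test_labels), verbose=1)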

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')
plt.show()



In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(test_padded, test_labels, verbose=1)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")
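
To connect the evaluation back to the business goal of gauging public opinion on a release, the sigmoid outputs can be thresholded into positive/negative labels and aggregated. The following is a minimal sketch, not part of the original notebook: `summarize_sentiment`, the 0.5 threshold, and the example scores are assumptions made for illustration. In the notebook the scores would come from something like `model.predict(test_padded).ravel()`, where label 1 corresponds to a positive review.

```python
def summarize_sentiment(probabilities, threshold=0.5):
    """Turn predicted positive-class probabilities into labels and a positive share."""
    labels = ["positive" if p >= threshold else "negative" for p in probabilities]
    positive_share = labels.count("positive") / len(labels) if labels else 0.0
    return labels, positive_share


# Made-up probabilities standing in for model output on four reviews
scores = [0.91, 0.12, 0.66, 0.48]
labels, share = summarize_sentiment(scores)
print(labels)  # ['positive', 'negative', 'positive', 'negative']
print(f"Share of positive reviews: {share:.0%}")  # Share of positive reviews: 50%
```

Tracking this share per title over time is one simple way to surface the shifts in viewer sentiment described in the problem statement.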
