# Spoiler Detection from Movie Reviews
This notebook trains a multilayer perceptron (MLP) model on movie reviews from the IMDB spoiler dataset (available [here](https://www.kaggle.com/datasets/rmisra/imdb-spoiler-dataset/data)).
[spaCy](https://spacy.io/) is used for NLP tasks such as tokenization, lemmatization, part-of-speech tagging and entity recognition.
[Scikit-learn](https://scikit-learn.org/) is used for Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and for basic ML tasks such as scaling, train/test splitting, model creation and training.

## Data loading and preprocessing
### Imports

In [1]:
import pandas as pd
import spacy
import concurrent.futures  # the futures submodule must be imported explicitly
import pickle
from time import time
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from numpy import hstack

### Define a function that keeps only the rows with valid release dates

In [2]:
def filter_valid_release_dates(dataframe):
    """
    Keep only the rows whose release date string has exactly 10 characters
    (for example, a full YYYY-MM-DD date).
    :param dataframe: dataframe that contains the movie details
                      along with their release date
    :return: a copy of the dataframe restricted to rows with a valid release date
    """
    mask = (dataframe['release_date'].str.len() == 10)
    # Return a copy so that later column assignments do not act on a slice
    return dataframe.loc[mask].copy()
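
As a quick check of the filter, here is a minimal sketch on a hypothetical toy dataframe. The movie IDs and dates are made up for illustration, and the function is restated compactly so the snippet runs on its own. It assumes release dates are stored as strings.

```python
import pandas as pd


def filter_valid_release_dates(dataframe):
    """Keep rows whose release_date string is exactly 10 characters long."""
    mask = dataframe['release_date'].str.len() == 10
    return dataframe.loc[mask]


# Hypothetical toy data: only the first row carries a full 10-character date
toy_movies = pd.DataFrame({
    'movie_id': ['tt001', 'tt002', 'tt003'],
    'release_date': ['1994-10-14', '1994-10', '1994'],
})

filtered = filter_valid_release_dates(toy_movies)
print(filtered['movie_id'].tolist())
```

Note that the length check is only a proxy for validity: any 10-character string passes, so malformed dates of that length would still reach `pd.to_datetime`.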

### Load data

In [3]:
df_reviews = pd.read_json('data/IMDB_reviews.json', lines=True)
df_movies = pd.read_json('data/IMDB_movie_details.json', lines=True)

# Filter the dataframe to extract rows with valid movie release dates
df_movies = filter_valid_release_dates(df_movies)

### Merge the dataframes and keep only the required columns

In [4]:
# Convert date columns to datetime format
df_reviews['review_date'] = pd.to_datetime(df_reviews['review_date'])
df_movies['release_date'] = pd.to_datetime(df_movies['release_date'])

# Merge the two dataframes on movie_id
merged_df = pd.merge(df_reviews, df_movies, on='movie_id')

# Calculate days since release for each review
merged_df['days_since_release'] = (merged_df['review_date'] - merged_df['release_date']).dt.days

merged_df['review_combined'] = merged_df['review_summary'].astype(str) + ' ' + merged_df['review_text'].astype(str)

# Keep the date difference, the combined text column and the label column
selected_columns = ['days_since_release', 'review_combined', 'is_spoiler']
subset_df = merged_df[selected_columns]

print(subset_df.head())

# Convert the subsetted data to CSV and store it
subset_df.to_csv('data/review_with_labels_withDateDiff.csv', index=False)

   days_since_release                                    review_combined  \
0                4137  A classic piece of unforgettable film-making. ...   
1                2154  Simply amazing. The best film of the 90's. The...   
2                2485  The best story ever told on film I believe tha...   
3                2879  Busy dying or busy living? **Yes, there are SP...   
4                3506  Great story, wondrously told and acted At the ...   

   is_spoiler  
0        True  
1        True  
2        True  
3        True  
4        True  
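
To make the `days_since_release` arithmetic concrete, the following sketch repeats the merge and date subtraction on hypothetical toy rows. The IDs and dates are invented and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
reviews = pd.DataFrame({
    'movie_id': ['tt001', 'tt001'],
    'review_date': ['1994-10-24', '1995-10-14'],
})
movies = pd.DataFrame({
    'movie_id': ['tt001'],
    'release_date': ['1994-10-14'],
})

reviews['review_date'] = pd.to_datetime(reviews['review_date'])
movies['release_date'] = pd.to_datetime(movies['release_date'])

# Inner merge on movie_id, then subtract the two datetime columns
merged = pd.merge(reviews, movies, on='movie_id')
merged['days_since_release'] = (merged['review_date'] - merged['release_date']).dt.days

print(merged['days_since_release'].tolist())
```

A review dated before the release date would yield a negative value here, which is worth keeping in mind when the column is scaled later.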


### Preprocess data
Only the first 2000 samples are used because of limited computational resources. Adjust the number of samples to match the resources you have available.

In [5]:
# Start the timer
start_time = time()
absolute_start_time = start_time

df = pd.read_csv('data/review_with_labels_withDateDiff.csv')

df = df[:2000]
print(len(df))

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 1: Reading the CSV file using pandas DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


######################################################################

# Step 2: Extract the columns with labels and text

# Start the timer
start_time = time()

labels = df['is_spoiler'].tolist()
texts = df['review_combined'].tolist()
days_since_review = df['days_since_release'].tolist()

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 2: Extracting the columns with labels and text DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


######################################################################

# Step 3: Text preprocessing with spaCy

# Start the timer
start_time = time()

nlp = spacy.load('en_core_web_sm')

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 3: Loading the spacy en_core_web_sm DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


######################################################################

# Step 4: Apply text preprocessing to each text

# Function to preprocess text and include tokens, POS tags, and named entities
def preprocess(text, days):
    doc = nlp(text)
    pos_counts = doc.count_by(spacy.attrs.POS)
    entity_counts = doc.count_by(spacy.attrs.ENT_TYPE)

    processed_text = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    tokens = ' '.join(processed_text)

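    # Extract counts of some common POS tags and entity types as features.
    # Note: count_by(ENT_TYPE) counts tokens, so a two-token PERSON entity adds 2.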
    pos_ner_counts = {
        'NOUN_count': pos_counts.get(spacy.symbols.NOUN, 0),
        'VERB_count': pos_counts.get(spacy.symbols.VERB, 0),
        'ADJ_count': pos_counts.get(spacy.symbols.ADJ, 0),
        'PERSON_count': entity_counts.get(spacy.symbols.PERSON, 0),
        'ORG_count': entity_counts.get(spacy.symbols.ORG, 0),
        'GPE_count': entity_counts.get(spacy.symbols.GPE, 0),
        'DATE_count': entity_counts.get(spacy.symbols.DATE, 0)
    }

    # Feature order: days, the POS/entity counts, then the lemmatized text
    features = [days, *pos_ner_counts.values(), tokens]

    return features


# Start the timer
start_time = time()

# PARALLELIZATION CODE FOR SPACY

# Specify the maximum number of worker threads
max_workers = 2  # Set the number of desired worker threads
processed_texts = []
# Create a ThreadPoolExecutor with the specified number of worker threads
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    # Submit the text processing tasks to the executor
    future_results = [executor.submit(preprocess, text, days) for text, days in zip(texts, days_since_review)]

    # Collect the results in submission order so that each feature row
    # stays aligned with its entry in `labels`
    for future in future_results:
        processed_doc = future.result()
        processed_texts.append(processed_doc)

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 4: Apply text preprocessing to each text DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")
######################################################################


# Step 4.5: Dump the preprocessed data using Pickle

# Save the processed_texts list to a file using Pickle
with open('data/processed_texts_BEFORE_SCALING_full_with_pos_ner_counts_datediff.pkl', 'wb') as f:
    pickle.dump(processed_texts, f)

# Create a DataFrame from the list of extracted features
df = pd.DataFrame(processed_texts, columns=['Days', 'NOUN_count', 'VERB_count', 'ADJ_count', 'PERSON_count', 'ORG_count', 'GPE_count', 'DATE_count', 'tokens'])

# Normalize the features
scaler = MinMaxScaler()
normalized_counts = scaler.fit_transform(df[['Days','NOUN_count', 'VERB_count', 'ADJ_count', 'PERSON_count', 'ORG_count', 'GPE_count', 'DATE_count']])

# Replace the original counts with the normalized counts
df[['Days','NOUN_count', 'VERB_count', 'ADJ_count', 'PERSON_count', 'ORG_count', 'GPE_count', 'DATE_count']] = normalized_counts

# Now, df contains the original columns with the normalized counts

######################################################################
# Step 5: Dump the preprocessed data using Pickle
# Start the timer
start_time = time()

# Save the normalized feature DataFrame to a file using Pickle
with open('data/processed_texts_full_with_pos_ner_counts_datediff.pkl', 'wb') as f:
    pickle.dump(df, f)

# Save the labels list to a file using Pickle
with open('data/labels_full_with_pos_ner_count_datediff.pkl', 'wb') as f:
    pickle.dump(labels, f)

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 5: Dump the preprocessed data using Pickle DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


2000

Step 1: Reading the CSV file using pandas DONE!
Elapsed time: 00:00:07

Step 2: Extracting the columns with labels and text DONE!
Elapsed time: 00:00:00

Step 3: Loading the spacy en_core_web_sm DONE!
Elapsed time: 00:00:00

Step 4: Apply text preprocessing to each text DONE!
Elapsed time: 00:00:32

Step 5: Dump the preprocessed data using Pickle DONE!
Elapsed time: 00:00:00
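
The feature rows produced in Step 4 are later paired with `labels` by position, so the order in which thread results are collected matters. The sketch below uses a hypothetical `slow_square` function (not part of the notebook) to contrast `as_completed`, which yields futures as they finish, with `executor.map`, which yields results in input order.

```python
import concurrent.futures
import time


def slow_square(n):
    # Smaller inputs sleep longer, so tasks tend to finish in reverse order
    time.sleep(0.05 * (3 - n))
    return n * n


inputs = [0, 1, 2]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_square, n) for n in inputs]
    # as_completed yields futures in completion order, not submission order
    completion_order = [f.result() for f in concurrent.futures.as_completed(futures)]
    # executor.map yields results in the same order as the inputs
    input_order = list(executor.map(slow_square, inputs))

print("as_completed:", completion_order)
print("executor.map:", input_order)
```

Either `executor.map` or iterating over the list of futures in submission order keeps results aligned with a separately stored label list.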


In [6]:
with open('data/processed_texts_full_with_pos_ner_counts_datediff.pkl', 'rb') as f:
    processed_texts = pickle.load(f)

with open('data/labels_full_with_pos_ner_count_datediff.pkl', 'rb') as f:
    labels = pickle.load(f)

print("\nRead texts and labels using Pickle DONE!")


Read texts and labels using Pickle DONE!
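
Most cells in this notebook repeat the same elapsed-time boilerplate after every step. One way to factor it out is a small context manager. `step_timer` and `format_elapsed` are hypothetical helper names introduced here for illustration; they are not part of the notebook.

```python
import time
from contextlib import contextmanager


def format_elapsed(seconds):
    """Format a duration in seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


@contextmanager
def step_timer(label):
    """Time the enclosed block and print the label with the elapsed time."""
    start = time.time()
    yield
    # Reached only if the block finished without raising
    print(f"\n{label} DONE!")
    print(f"Elapsed time: {format_elapsed(time.time() - start)}")


# Usage: wrap any step of the pipeline
with step_timer("Step 2: Extracting the columns with labels and text"):
    total = sum(range(1000))
```

The printed messages keep the same format as the notebook's existing output, so logs from both styles remain comparable.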


### Split the data into training and testing sets with an 80:20 ratio

In [7]:
# Step 6: Splitting data into training and testing sets

# Start the timer
start_time = time()

X = processed_texts
y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 6: Splitting data into training and testing sets DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


Step 6: Splitting data into training and testing sets DONE!
Elapsed time: 00:00:00
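
In the preprocessing cell, `MinMaxScaler` is fitted on all rows before this split, so the minima and maxima of the test rows influence the scaled training features. A possible leakage-free variant, sketched below on hypothetical random data, splits first and fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical count features and balanced labels, for illustration only
rng = np.random.default_rng(42)
features = rng.integers(0, 50, size=(20, 3)).astype(float)
labels = [True, False] * 10

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print(X_train_scaled.min(), X_train_scaled.max())
print(sum(y_test), "of", len(y_test), "test labels are spoilers")
```

With this ordering, scaled test values can fall slightly outside the [0, 1] range, which is expected when a test row exceeds the training extremes.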


### Perform TF-IDF vectorization and save the results to pickle files

In [8]:
# Step 7: Vectorization with TfidfVectorizer

# Start the timer
start_time = time()

vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.9, max_features=100000)
X_train_tokens_vectorized = vectorizer.fit_transform(X_train['tokens'])
X_test_tokens_vectorized = vectorizer.transform(X_test['tokens'])
# print("Shape before expanding:",X_train_tokens_vectorized.shape)

# Converting to numpy array
X_train_additional_features = X_train[['Days', 'NOUN_count', 'VERB_count', 'ADJ_count', 'PERSON_count', 'ORG_count', 'GPE_count', 'DATE_count']].to_numpy()
X_test_additional_features = X_test[['Days', 'NOUN_count', 'VERB_count', 'ADJ_count', 'PERSON_count', 'ORG_count', 'GPE_count', 'DATE_count']].to_numpy()

# print("Shape of vectorized tokens:", X_train_tokens_vectorized.shape)
# print("Shape of additional features:", X_train_additional_features.shape)

# Use numpy's hstack to combine the TF-IDF matrices (converted to dense arrays) with the additional feature matrices
X_train_combined = hstack((X_train_tokens_vectorized.toarray(), X_train_additional_features))
X_test_combined = hstack((X_test_tokens_vectorized.toarray(), X_test_additional_features))

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 7: Vectorization with TfidfVectorizer DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")

# Step 7.1: Dump X_train_vectorized, y_train, X_test_vectorized, y_test using Pickle

# Start the timer
start_time = time()

with open('data/X_train_vectorized_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'wb') as f:
    pickle.dump(X_train_combined, f)

with open('data/y_train_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'wb') as f:
    pickle.dump(y_train, f)

with open('data/X_test_vectorized_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'wb') as f:
    pickle.dump(X_test_combined, f)

with open('data/y_test_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'wb') as f:
    pickle.dump(y_test, f)

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 7.1: Dump X_train_vectorized, y_train, X_test_vectorized, y_test using Pickle DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


Step 7: Vectorization with TfidfVectorizer DONE!
Elapsed time: 00:00:01

Step 7.1: Dump X_train_vectorized, y_train, X_test_vectorized, y_test using Pickle DONE!
Elapsed time: 00:00:01
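
The cell above converts the TF-IDF matrices to dense arrays before stacking, which can need a lot of memory with up to 100,000 features. A possible alternative, sketched here on hypothetical toy documents, keeps everything sparse with `scipy.sparse.hstack`. It is imported under the alias `sparse_hstack` to avoid clashing with the numpy `hstack` used in the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack as sparse_hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents and two extra numeric features per document
docs = [
    "the hero dies at the end",
    "great acting and music",
    "the twist ending surprised me",
]
extra = np.array([[0.1, 0.5], [0.3, 0.2], [0.9, 0.7]])

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf = vectorizer.fit_transform(docs)  # scipy sparse matrix

# Stack the sparse TF-IDF columns and the extra features without densifying
combined = sparse_hstack([tfidf, csr_matrix(extra)]).tocsr()

print(combined.shape)
```

scikit-learn's `MLPClassifier` accepts sparse input in `fit` and `predict`, so a matrix built this way could be passed to the classifier directly.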


### Load the vectorized data and perform MLP classification
Each review is classified as either 'spoiler' or 'not spoiler'.

In [9]:
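# NOTE: this banner says XGBOOST, but the model trained in this cell is scikit-learn's MLPClassifier
print("USING TOKENS+DATE+POS+NER, NGRAM: 1,3, MODEL: XGBOOST")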

# Step 7.9: Read X_train_vectorized, y_train, X_test_vectorized, y_test using Pickle

# Start the timer AGAIN!
start_time = time()
absolute_start_time = start_time

with open('data/X_train_vectorized_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'rb') as f:
    X_train_vectorized = pickle.load(f)

with open('data/y_train_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'rb') as f:
    y_train = pickle.load(f)

with open('data/X_test_vectorized_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'rb') as f:
    X_test_vectorized = pickle.load(f)

with open('data/y_test_full_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'rb') as f:
    y_test = pickle.load(f)

print("\nStep 7.9: Read X_train_vectorized, y_train, X_test_vectorized, y_test using Pickle DONE!")


######################################################################

# Start the timer
start_time = time()

classifier = MLPClassifier(hidden_layer_sizes=(100,100), activation='relu', solver='adam', random_state=42, verbose=1)

classifier.fit(X_train_vectorized, y_train)
y_pred = classifier.predict(X_test_vectorized)

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
# NOTE: the message below says SVC, but the classifier fitted above is an MLPClassifier
print("\nStep 8: Classification with SVC DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")

######################################################################
# Step 9: Evaluation

# Start the timer
start_time = time()

accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification = classification_report(y_test, y_pred)

# Calculate the elapsed time
elapsed_time = time() - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 9: Evaluation DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")

######################################################################

# Step 10: Display the results
# Start the timer
start_time = time()

print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(confusion)
print("Classification Report:")
print(classification)

# Calculate the elapsed time
absolute_end_time = time()
elapsed_time = absolute_end_time - start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print("\nStep 10: Display the results DONE!")
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")


######################################################################

# Step 11: Dump y_pred, y_pred_proba using Pickle

# Start the timer
start_time = time()

with open('data/y_pred_xgboost_200est_LR02_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'wb') as f:
    pickle.dump(y_pred, f)

print("\nStep 11: Dump y_pred using Pickle DONE!")
print("calculating y_pred proba")
# * Disable this step if needed
y_pred_proba = classifier.predict_proba(X_test_vectorized)

with open('data/y_pred_proba_xgboost_200est_LR02_ngram13_maxdf9_pos_ner_counts_strat_feat_100k.pkl', 'wb') as f:
    pickle.dump(y_pred_proba, f)

print("\nStep 11: Dump y_pred_proba using Pickle DONE!")


######################################################################


# TOTAL TIME CALCULATION:

# Calculate the elapsed time
elapsed_time = absolute_end_time - absolute_start_time
# Convert elapsed time to hours, minutes, and seconds
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
# Display the elapsed time
print(f"\nTOTAL ELAPSED TIME: {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}")

USING TOKENS+DATE+POS+NER, NGRAM: 1,3, MODEL: XGBOOST

Step 7.9: Read X_train_vectorized, y_train, X_test_vectorized, y_test using Pickle DONE!
Iteration 1, loss = 0.69324372
Iteration 2, loss = 0.62046751
Iteration 3, loss = 0.49461294
Iteration 4, loss = 0.34296172
Iteration 5, loss = 0.20572467
Iteration 6, loss = 0.11481623
Iteration 7, loss = 0.06674227
Iteration 8, loss = 0.04659192
Iteration 9, loss = 0.03579284
Iteration 10, loss = 0.03274321
Iteration 11, loss = 0.03121158
Iteration 12, loss = 0.02828942
Iteration 13, loss = 0.02511771
Iteration 14, loss = 0.02248600
Iteration 15, loss = 0.02087546
Iteration 16, loss = 0.01888110
Iteration 17, loss = 0.01800053
Iteration 18, loss = 0.01690353
Iteration 19, loss = 0.01579055
Iteration 20, loss = 0.01477849
Iteration 21, loss = 0.01337572
Iteration 22, loss = 0.01458336
Iteration 23, loss = 0.01395171
Iteration 24, loss = 0.01285453
Iteration 25, loss = 0.01226380
Iteration 26, loss = 0.01170837
Iteration 27, loss = 0.00991508
I