
# Natural Language Processing CW - Task 2: Text Classification

We'll start by importing all the relevant libraries and loading the data that will be used across the notebook:

In [1]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
import nltk
import time
from nltk.stem import PorterStemmer
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Load the training dataset:

In [7]:
# Load your training data
train_data = pd.read_csv('./data/Training-dataset.csv')

Load the validation dataset:

In [8]:
# Load your validation data
val_data = pd.read_csv('./data/Task-2-validation-dataset.csv')

Load the test dataset:

In [9]:
test_data = pd.read_csv('./data/Task-2-test-dataset.csv')

# Naive Bayes

For multi-label classification, the first method we use is Naive Bayes:

Define a preprocessing function similar to the one used in Task 1:

In [10]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Tokenization
    tokens = word_tokenize(text)

    # Stem each token, then lemmatize the stemmed form
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    tokens = [lemmatizer.lemmatize(stemmer.stem(token)) for token in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)

Next, we apply the preprocess_text function to the plot_synopsis column of the training data and select the nine label columns. We then fit a TF-IDF vectorizer on the preprocessed text and train a multinomial Naive Bayes classifier wrapped in a multi-output classifier, which fits one binary classifier per label.

In [29]:
start = time.time()
# Preprocess the plot_synopsis column
text = train_data['plot_synopsis'].apply(preprocess_text)

# create a list of columns representing labels
labels = train_data[['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']]

In [30]:

# Create a TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1, 3))

# Fit the vectorizer on the training text and transform it
train_tfidf = vectorizer.fit_transform(text)

# Train a Multinomial Naive Bayes classifier as a multi-output classifier
classifier = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)
classifier.fit(train_tfidf, labels)

end = time.time()
time_el = end - start
print(f'time: {time_el}')

time: 204.9425265789032
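
The TF-IDF weighting used above can be illustrated on a toy corpus. The following is a minimal sketch in plain Python, assuming scikit-learn's documented defaults (smoothed IDF and L2 normalisation); it uses whitespace tokenization, unigrams only, and no stop-word removal, so it is a simplification of the vectorizer configured above. The helper name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical toy synopses
toy_docs = [
    "detective hunts killer",
    "killer robot hunts detective",
    "robot falls in love",
]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    weight, term = max(zip(row, vocab))
    print(f"{doc!r}: highest-weighted term = {term!r}")
```

Terms that occur in fewer documents receive a larger IDF, so within one synopsis a rare word outweighs a common one with the same count.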


To assess the performance of the model, we apply the same preprocessing to the validation dataset loaded earlier and then predict its labels.

In [31]:
start = time.time()
# Preprocess the plot_synopsis column
val_text = val_data['plot_synopsis'].apply(preprocess_text)

# Transform the validation data using the same TfidfVectorizer
val_tfidf = vectorizer.transform(val_text)

# Make predictions on the validation set
val_predictions = classifier.predict(val_tfidf)
end = time.time()
time_el = end - start
print(f'time: {time_el}')

time: 25.55899167060852


Save the results in a csv file:

In [32]:
# Save the results to a CSV file
val_results_df = pd.DataFrame({
    'ID': val_data['ID'],
    'comedy': val_predictions[:, 0],
    'cult': val_predictions[:, 1],
    'flashback': val_predictions[:, 2],
    'historical': val_predictions[:, 3],
    'murder': val_predictions[:, 4],
    'revenge': val_predictions[:, 5],
    'romantic': val_predictions[:, 6],
    'scifi': val_predictions[:, 7],
    'violence': val_predictions[:, 8],
})

val_results_df.to_csv('10693727-Task2-method-a-validation.csv', index=False, header=False)
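
The nine per-column assignments above follow one pattern, so the rows could also be generated from a single list of label names. The sketch below shows that idea using only the standard-library `csv` module instead of pandas, and writes to an in-memory buffer rather than a file. The helper `results_rows`, the toy IDs, and the toy predictions are hypothetical.

```python
import csv
import io

LABELS = ['comedy', 'cult', 'flashback', 'historical', 'murder',
          'revenge', 'romantic', 'scifi', 'violence']


def results_rows(ids, predictions):
    """Yield one row per sample: the ID followed by one 0/1 value per label."""
    for sample_id, row in zip(ids, predictions):
        if len(row) != len(LABELS):
            raise ValueError("expected one prediction per label")
        yield [sample_id, *row]


# Hypothetical toy data standing in for val_data['ID'] and val_predictions
toy_ids = ["a1", "b2"]
toy_predictions = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(results_rows(toy_ids, toy_predictions))  # no header row, as above
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the label order in one place reduces the risk of a column index drifting out of step with its label name.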

**Results:**



### Testing:

Now that the model has been trained on the training dataset and evaluated on the validation dataset, we can run it on unseen data. We apply the same steps to the test dataset loaded earlier and save the predictions in a CSV file.

In [33]:
#start time
start = time.time()
# Preprocess the plot_synopsis column
test_text = test_data['plot_synopsis'].apply(preprocess_text)

# Transform the test data using the same TfidfVectorizer
test_tfidf = vectorizer.transform(test_text)

# Make predictions on the test set
test_predictions = classifier.predict(test_tfidf)
#end time
end = time.time()
#print the time elapsed
elapsed_time = end - start
print(f'Time taken to test the model: {elapsed_time} seconds')

Time taken to test the model: 25.674399375915527 seconds


Save the results:

In [34]:
# Save the results to a CSV file
test_results_df = pd.DataFrame({
    'ID': test_data['ID'],
    'comedy': test_predictions[:, 0],
    'cult': test_predictions[:, 1],
    'flashback': test_predictions[:, 2],
    'historical': test_predictions[:, 3],
    'murder': test_predictions[:, 4],
    'revenge': test_predictions[:, 5],
    'romantic': test_predictions[:, 6],
    'scifi': test_predictions[:, 7],
    'violence': test_predictions[:, 8],
})

test_results_df.to_csv('10693727-Task2-method-a.csv', index=False, header=False)

# Bi-LSTM

The first step is to preprocess the text and select the label columns:

In [15]:
start = time.time()
# Preprocess the plot_synopsis column
train_data['plot_synopsis'] = train_data['plot_synopsis'].apply(preprocess_text)

text = train_data['plot_synopsis']
labels = train_data[['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']]
# Convert labels to 0 or 1
labels_binary = labels.applymap(lambda x: 1 if x == 1 else 0)

We tokenize the text and pad the resulting sequences to a fixed length:

In [16]:
# Tokenize and pad sequences
max_words = 10000
max_len = 200
tokenizer = Tokenizer(num_words=max_words, lower=True, oov_token="<OOV>")
tokenizer.fit_on_texts(text)
text_seq = tokenizer.texts_to_sequences(text)
text_padded = pad_sequences(text_seq, maxlen=max_len)
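
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence, so a synopsis longer than `max_len` keeps only its last `max_len` tokens. The following plain-Python sketch mimics that default behaviour for illustration; `pad_pre` is a hypothetical helper, not part of Keras, and the toy sequences are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic pad_sequences with padding='pre' and truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncation keeps the LAST maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad on the left
    return padded


# Hypothetical token-index sequences: one shorter and one longer than maxlen
toy_sequences = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)  # [[0, 5, 8, 2], [9, 4, 3, 6]]
```

If the opening of a synopsis matters more than its ending, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.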

Build the model and compile it:

In [17]:
# Build the bidirectional LSTM model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=100, input_length=max_len))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(9, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
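
To get a feel for the size of this network, the trainable parameters of each layer can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and that the bidirectional wrapper concatenates the forward and backward outputs, which is its default merge mode. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


max_words, embed_dim, lstm_units, n_labels = 10000, 100, 64, 9

embedding = embedding_params(max_words, embed_dim)
bilstm = 2 * lstm_params(embed_dim, lstm_units)   # forward + backward LSTM
dense = dense_params(2 * lstm_units, n_labels)    # input is the concatenated 128-d output
total = embedding + bilstm + dense

print(embedding, bilstm, dense, total)  # 1000000 84480 1161 1085641
```

Under these assumptions the embedding layer accounts for most of the parameters, which `model.summary()` can confirm.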

Now, train the model:

In [18]:
model.fit(text_padded, labels_binary, epochs=7, batch_size=32)
end = time.time()
time_el = end - start
print(f'time: {time_el}')

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
time: 261.75799441337585


Now that the model is built and trained, we preprocess the validation data loaded earlier and tokenize it:

In [23]:
start = time.time()
# Preprocess the plot_synopsis column
val_data['plot_synopsis'] = val_data['plot_synopsis'].apply(preprocess_text)

# Assuming 'plot_synopsis' is the column containing plot synopses
val = val_data['plot_synopsis']

# Tokenize and pad sequences for the validation data
val_seq = tokenizer.texts_to_sequences(val)
val_padded = pad_sequences(val_seq, maxlen=max_len)


Now we predict the labels on the validation set:

In [24]:
# Make predictions on the validation set
val_predictions = model.predict(val_padded)

# Convert predictions to 0 or 1
val_predictions_binary = (val_predictions > 0.5).astype(int)

end = time.time()

time_el = end - start
print(f'time: {time_el}')

time: 24.730493783950806
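
Because each of the nine sigmoid outputs is thresholded independently, a synopsis can receive several labels or none at all. The sketch below illustrates this on made-up probabilities; `decode_labels` is a hypothetical helper and the values are not model output.

```python
LABELS = ['comedy', 'cult', 'flashback', 'historical', 'murder',
          'revenge', 'romantic', 'scifi', 'violence']


def decode_labels(probabilities, threshold=0.5):
    """Return the names of the labels whose probability exceeds the threshold."""
    return [name for name, p in zip(LABELS, probabilities) if p > threshold]


# Hypothetical sigmoid outputs for two synopses
toy_probs = [
    [0.91, 0.10, 0.05, 0.02, 0.64, 0.30, 0.12, 0.01, 0.55],
    [0.20, 0.10, 0.40, 0.05, 0.30, 0.10, 0.45, 0.02, 0.15],
]
decoded = [decode_labels(row) for row in toy_probs]
print(decoded)  # [['comedy', 'murder', 'violence'], []]
```

The second row shows that a fixed 0.5 cut-off can leave a synopsis with no label, which lowers recall; tuning the threshold on the validation set is one possible remedy.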


Now save the validation results to a csv file:

In [25]:
# Save the results to a CSV file
val_results_df = pd.DataFrame({
    'ID': val_data['ID'],  # Assuming 'ID' is the identifier column of the validation data
    'comedy': val_predictions_binary[:, 0],
    'cult': val_predictions_binary[:, 1],
    'flashback': val_predictions_binary[:, 2],
    'historical': val_predictions_binary[:, 3],
    'murder': val_predictions_binary[:, 4],
    'revenge': val_predictions_binary[:, 5],
    'romantic': val_predictions_binary[:, 6],
    'scifi': val_predictions_binary[:, 7],
    'violence': val_predictions_binary[:, 8],
})

val_results_df.to_csv('10693727-Task2-method-b-validation.csv', index=False, header=False)

In [26]:
#precision: 43
# recall: 42.4
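
The precision and recall noted above are recorded without the code that produced them. As an illustration only, the sketch below shows one way such figures could be computed for multi-label predictions, using micro-averaging over every (sample, label) cell. The averaging scheme is an assumption, since the notebook does not say which one was used, and the toy arrays are hypothetical.

```python
def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall over all (sample, label) cells."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1      # predicted and correct
            elif p == 1 and t == 0:
                fp += 1      # predicted but wrong
            elif p == 0 and t == 1:
                fn += 1      # missed label
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 3 samples, 3 labels
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]

precision, recall = micro_precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```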

### Testing:

As with Naive Bayes, now that the Bi-LSTM has been trained on the training dataset and evaluated on the validation dataset, we run it on unseen data. We apply the same steps to the test dataset and save the predictions in a CSV file.

In [27]:
#start time
start = time.time()
# Preprocess the plot_synopsis column
test_data['plot_synopsis'] = test_data['plot_synopsis'].apply(preprocess_text)

# Assuming 'plot_synopsis' is the column containing plot synopses
test = test_data['plot_synopsis']

# Tokenize and pad sequences for test data
test_seq = tokenizer.texts_to_sequences(test)
test_padded = pad_sequences(test_seq, maxlen=max_len)

# Make predictions on the test set
test_predictions = model.predict(test_padded)

# Convert predictions to 0 or 1
test_predictions_binary = (test_predictions > 0.5).astype(int)

#end time
end = time.time()
#print the time elapsed
elapsed_time = end - start
print(f'Time taken to test the model: {elapsed_time} seconds')

Time taken to test the model: 27.002696752548218 seconds


Save the results:

In [28]:
# Save the results to a CSV file
test_results_df = pd.DataFrame({
    'ID': test_data['ID'],
    'comedy': test_predictions_binary[:, 0],
    'cult': test_predictions_binary[:, 1],
    'flashback': test_predictions_binary[:, 2],
    'historical': test_predictions_binary[:, 3],
    'murder': test_predictions_binary[:, 4],
    'revenge': test_predictions_binary[:, 5],
    'romantic': test_predictions_binary[:, 6],
    'scifi': test_predictions_binary[:, 7],
    'violence': test_predictions_binary[:, 8],
})

test_results_df.to_csv('10693727-Task2-method-b.csv', index=False, header=False)