# Required assignment 11.1: Naïve Bayes in Python

The naive Bayes classifier can be effectively used to classify movie reviews in the NLTK Movie Reviews data set. This task involves identifying the sentiment (positive or negative) expressed in text. This data set is a standard for building and evaluating sentiment analysis models.

In [1]:
#Import necessary libraries
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

import nltk
from nltk.corpus import movie_reviews
import random
import pandas as pd

In [2]:
#Download required NLTK data (run once)
nltk.download('movie_reviews',quiet = True)

True

In [3]:
#Load the movie reviews data
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

print(f"Number of documents: {len(documents)}")
print(f"Example document words: {documents[0][0][:20]}")
print(f"Example document category: {documents[0][1]}")

Number of documents: 2000
Example document words: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']
Example document category: neg


## Question 1

- Load the `documents` into a data frame `df`.

- Name the columns [`review`, `sentiment`].

In [4]:
###GRADED
df = ...
# YOUR CODE HERE
#raise NotImplementedError()

# Build a DataFrame with the required columns: [review, sentiment]
df = pd.DataFrame(documents, columns=['review', 'sentiment'])

print(df.head())


                                              review sentiment
0  [plot, :, two, teen, couples, go, to, a, churc...       neg
1  [the, happy, bastard, ', s, quick, movie, revi...       neg
2  [it, is, movies, like, these, that, make, a, j...       neg
3  [", quest, for, camelot, ", is, warner, bros, ...       neg
4  [synopsis, :, a, mentally, unstable, man, unde...       neg



Randomly shuffle the documents in place by using `random.shuffle()`.


In [5]:
#Shuffle the documents for randomness
random.shuffle(documents)


To build a machine learning model for text classification, such as identifying positive or negative movie reviews, relevant features must be extracted from raw text. 

Using NLTK, a frequency distribution is computed over all words in the data set, with each word normalised to lowercase for consistency. The 2,000 most frequent words are then selected as the model’s feature set. This method is widely used in text classification tasks, including spam detection, where efficient filtering and feature selection are essential.
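The counting and selection step can be sketched on a toy token list. This is a minimal illustration, not the graded solution: it uses `collections.Counter` (which `nltk.FreqDist` subclasses) so that it runs without NLTK, and the names `tokens`, `freq` and `top_features` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus standing in for movie_reviews.words()
tokens = ["The", "film", "is", "great", ",", "the", "cast", "is", "great", "."]

# Count lower-cased tokens so that "The" and "the" share one entry
freq = Counter(t.lower() for t in tokens)

# Keep only the tokens (not the counts) of the 3 most frequent entries
top_features = [word for word, _ in freq.most_common(3)]

print(freq.most_common(3))
print(top_features)
```

In the assignment the same pattern is applied with 2,000 features instead of 3. Note that punctuation tokens are counted like any other word, which is why they appear among the most frequent entries.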

## Question 2

- Use `FreqDist` from `nltk` to compute word frequencies and normalise each word to lowercase with `.lower()`. 

- Iterate over `movie_reviews.words()` and assign the resulting frequency distribution to `all_words`.

- From `all_words`, take the 2,000 most frequent words as features and assign them to `word_features`.

- Use `most_common` to get the 10 most frequent words from `all_words`, assign the result to `ans2` and display it.

In [7]:
###GRADED
all_words = ...
word_features = ...
ans2 = ...
# YOUR CODE HERE
#raise NotImplementedError()

# 1 - Frequency distribution of lower-cased tokens
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# 2 - Top 2,000 words as features (just the tokens, not the counts)
word_features = [w for (w, _) in all_words.most_common(2000)]

# 3 - Top 10 most common words (token, count) pairs
ans2 = all_words.most_common(10)

print("The most common words are ", ans2)


The most common words are  [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]


The `document_features` function converts a text document into a structured feature vector by checking for the presence of a predefined list of important words (typically, the most frequent terms in the data set). 

It normalises the document’s words to lowercase, uses a set for efficient lookups and returns a dictionary in which each key corresponds to a word and the value is a Boolean that indicates whether that word appears in the document. 

This consistent representation allows models such as naive Bayes to classify text based on the presence or absence of key terms.
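As a small illustration of this representation, the sketch below applies the same logic to a three-word vocabulary. The names `toy_features` and `toy_document_features` are hypothetical and exist only for this example; the vocabulary is passed in as an argument rather than read from a global.

```python
# Hypothetical three-word vocabulary standing in for the 2,000-word feature list
toy_features = ["great", "boring", "plot"]

def toy_document_features(document, vocabulary):
    """Map a tokenised document to {'contains(word)': bool} for each vocabulary word."""
    document_words = set(w.lower() for w in document)  # set gives fast membership tests
    return {f"contains({word})": (word in document_words) for word in vocabulary}

example = toy_document_features(["A", "GREAT", "plot", "!"], toy_features)
print(example)
# {'contains(great)': True, 'contains(boring)': False, 'contains(plot)': True}
```

Every document yields a dictionary with the same keys, regardless of its length, which is what makes the representation fixed-size.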

In [8]:
def document_features(document):
    document_words = set(w.lower() for w in document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features


This step prepares the data set for training and evaluation by representing each document as a fixed set of features paired with its sentiment label. This format is compatible with machine learning models such as the naive Bayes classifier.

## Question 3

- Use a list comprehension to iterate over the documents, each paired with its corresponding class label, and apply the `document_features()` function to each document. 

- Store the resulting features-label pairs in a variable named `featuresets`.

In [9]:
###GRADED
featuresets = ...
# YOUR CODE HERE
#raise NotImplementedError()

# Pair each review's feature dictionary with its sentiment label and collect the pairs in featuresets
featuresets = [(document_features(review), sentiment) for (review, sentiment) in documents]

print(featuresets[0])


({'contains(,)': True, 'contains(the)': True, 'contains(.)': True, 'contains(a)': True, 'contains(and)': True, 'contains(of)': True, 'contains(to)': True, "contains(')": True, 'contains(is)': True, 'contains(in)': True, 'contains(s)': True, 'contains(")': True, 'contains(it)': True, 'contains(that)': True, 'contains(-)': True, 'contains())': True, 'contains(()': True, 'contains(as)': True, 'contains(with)': True, 'contains(for)': True, 'contains(his)': True, 'contains(this)': True, 'contains(film)': True, 'contains(i)': True, 'contains(he)': True, 'contains(but)': True, 'contains(on)': True, 'contains(are)': True, 'contains(t)': True, 'contains(by)': True, 'contains(be)': True, 'contains(one)': False, 'contains(movie)': False, 'contains(an)': True, 'contains(who)': True, 'contains(not)': True, 'contains(you)': True, 'contains(from)': False, 'contains(at)': True, 'contains(was)': True, 'contains(have)': True, 'contains(they)': True, 'contains(has)': True, 'contains(her)': True, 'contain

The data set of 2,000 documents is split into:

- 1,800 documents for training (`featuresets[:1800]`), which is 90 per cent of the data

- 200 documents for testing (`featuresets[1800:]`), which is the remaining 10 per cent of the data

This is a common practice to train the model on most of the data and evaluate its performance on unseen data to estimate how well it generalises.
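The split can be sketched on placeholder data as follows. This is an illustrative sketch only: `toy_featuresets` is a hypothetical stand-in for `featuresets`, and the seeded `random.Random(42)` generator is an assumption added here to make the shuffle repeatable (the notebook itself shuffles `documents` without a seed).

```python
import random

# Hypothetical stand-in for featuresets: 2,000 (features, label) pairs
toy_featuresets = [({"contains(x)": i % 2 == 0}, "pos" if i % 2 == 0 else "neg")
                   for i in range(2000)]

# A dedicated, seeded generator makes the shuffle (and so the split) repeatable
rng = random.Random(42)
rng.shuffle(toy_featuresets)

# Integer arithmetic avoids floating-point rounding when computing the cut point
split_point = len(toy_featuresets) * 9 // 10
toy_train = toy_featuresets[:split_point]
toy_test = toy_featuresets[split_point:]

print(len(toy_train), len(toy_test))
```

Shuffling before slicing matters here because the corpus is loaded category by category; without it, the last 200 items would all share one label.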

## Question 4

- Create the `train_set` and `test_set` by using 90 per cent of the data for training and 10 per cent for testing.

- There is a total of 2,000 feature-label pairs in `featuresets`. Use the first 1,800 as `train_set` and the remaining 200 as `test_set`.

In [10]:
###GRADED
train_set = ...
test_set = ...

# YOUR CODE HERE
#raise NotImplementedError()


# 90/10 split
#Use the first 1,800 feature–label pairs for training and the remaining 200 for testing:
train_set = featuresets[:1800]
test_set  = featuresets[1800:]

print(len(train_set), len(test_set))


1800 200


Using a 90/10 split provides a good balance between having sufficient data for training and keeping a meaningful portion for testing, especially with small data sets. It’s a practical approach rather than a strict rule.

NLTK classifiers, such as `NaiveBayesClassifier`, require data in the form of (features, label) tuples. Libraries such as scikit-learn instead separate features (X) and labels (y). Because `featuresets` is already a list of such tuples, splitting it into `train_set` and `test_set` is straightforward and the result is directly compatible with NLTK’s training and evaluation methods.
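The difference between the two layouts can be seen by unzipping a list of pairs. The two-item `toy_featuresets` list below is hypothetical and only illustrates the shape of the data.

```python
# Hypothetical miniature list in NLTK's (features, label) format
toy_featuresets = [
    ({"contains(great)": True, "contains(boring)": False}, "pos"),
    ({"contains(great)": False, "contains(boring)": True}, "neg"),
]

# NLTK consumes the pairs directly; an X/y-style API needs them separated first
X, y = zip(*toy_featuresets)

print(list(y))
print(X[0])
```

With NLTK no such unzipping is needed, so a plain slice of the list is already a valid training or test set.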

To train the classifier, use `NaiveBayesClassifier` from NLTK: https://www.nltk.org/_modules/nltk/classify/naivebayes.html.
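To give a feel for what training computes, here is a minimal from-scratch sketch of naive Bayes over Boolean features. It is not NLTK's implementation: the function names are hypothetical, and the add-one smoothing is an assumption of this sketch (NLTK's estimator and its handling of unseen features differ in detail).

```python
import math
from collections import Counter, defaultdict

def train_toy_nb(labelled_featuresets):
    """Count how often each label occurs and how often each feature value occurs per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    for features, label in labelled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify_toy_nb(features, label_counts, value_counts):
    """Return the label with the highest smoothed log-probability score."""
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior of the label
        for name, value in features.items():
            # Add-one smoothing over the two Boolean values (assumption of this sketch)
            count = value_counts[(label, name)][value]
            score += math.log((count + 1) / (n_label + 2))
        scores[label] = score
    return max(scores, key=scores.get)

toy_train = [
    ({"contains(great)": True, "contains(boring)": False}, "pos"),
    ({"contains(great)": True, "contains(boring)": False}, "pos"),
    ({"contains(great)": False, "contains(boring)": True}, "neg"),
    ({"contains(great)": False, "contains(boring)": True}, "neg"),
]
model = train_toy_nb(toy_train)
prediction = classify_toy_nb({"contains(great)": True, "contains(boring)": False}, *model)
print(prediction)
```

The "naive" part is the sum inside the loop: each feature contributes independently to the score, given the label.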

## Question 5

- Use `nltk.NaiveBayesClassifier.train` on the `train_set`.

- Assign the output to `classifier`.

In [11]:
###GRADED
classifier = ...

# YOUR CODE HERE
#raise NotImplementedError()

from nltk import NaiveBayesClassifier

# Train the classifier on the training set
classifier = NaiveBayesClassifier.train(train_set)

# Show the 10 most informative features
# (the method prints the table itself and returns None, so it is not wrapped in print())
classifier.show_most_informative_features(10)


Most Informative Features
   contains(outstanding) = True              pos : neg    =     12.6 : 1.0
   contains(wonderfully) = True              pos : neg    =      8.6 : 1.0
         contains(mulan) = True              pos : neg    =      8.2 : 1.0
        contains(seagal) = True              neg : pos    =      7.9 : 1.0
         contains(damon) = True              pos : neg    =      6.6 : 1.0
        contains(wasted) = True              neg : pos    =      6.5 : 1.0
        contains(poorly) = True              neg : pos    =      5.4 : 1.0
           contains(era) = True              pos : neg    =      5.1 : 1.0
         contains(flynt) = True              pos : neg    =      4.9 : 1.0
    contains(ridiculous) = True              neg : pos    =      4.9 : 1.0


## Question 6

- Compute the classifier’s accuracy on the `test_set` by using `nltk.classify.accuracy()`.

- Assign it to `accuracy`.

In [12]:
###GRADED
accuracy = ...

# YOUR CODE HERE
#raise NotImplementedError()
accuracy = nltk.classify.accuracy(classifier, test_set)

print(f'Accuracy: {accuracy:.2f}')



Accuracy: 0.78


An accuracy of around 78 per cent using a basic naive Bayes classifier on the NLTK movie reviews data set is a solid starting point, although the exact figure varies between runs because the shuffle above is not seeded. However, there’s room for improvement.
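Accuracy here is simply the fraction of test documents whose predicted label matches the true label. A minimal sketch with made-up labels (the function `toy_accuracy` and the two lists are hypothetical, not output from the classifier above):

```python
def toy_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted_labels = ["pos", "neg", "pos", "pos"]
actual_labels = ["pos", "neg", "neg", "pos"]

score = toy_accuracy(predicted_labels, actual_labels)
print(f"Accuracy: {score:.2f}")  # Accuracy: 0.75
```

With only 200 test documents, each single prediction changes the score by 0.5 percentage points, so small differences between runs or models should be read with caution.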

To build a more robust sentiment classifier, the workflow involves importing essential libraries, preprocessing and splitting the data, and vectorising the text. Several models, such as naive Bayes, logistic regression and support vector machines (SVMs), are trained and evaluated. You will learn more about the other models in later modules.

A `VotingClassifier` ensemble then combines their predictions by majority vote, which can improve accuracy and stability, although it is not guaranteed to beat the best individual model. This approach shows how ensemble methods can be used to make text classification models more reliable.
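The idea behind hard (majority) voting can be sketched without scikit-learn. The function `hard_vote` and the three prediction lists are hypothetical; this illustrates the voting rule only, not how `VotingClassifier` is implemented.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """For each sample, return the label predicted by the most models."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

# Hypothetical predictions from three models on the same three samples
nb_preds = ["pos", "neg", "neg"]
lr_preds = ["pos", "pos", "neg"]
svm_preds = ["neg", "pos", "pos"]

ensemble_preds = hard_vote([nb_preds, lr_preds, svm_preds])
print(ensemble_preds)  # ['pos', 'pos', 'neg']
```

With three models and two labels there is never a tie. The vote only helps when the models make different mistakes; if they tend to be wrong on the same documents, the ensemble inherits those errors.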


In [13]:
import nltk
from nltk.corpus import movie_reviews
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score


#Separate the text and the labels
texts = [" ".join(doc) for doc, label in documents] # Join words into a string
labels = [label for doc, label in documents]

#Split the data into train and test sets (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=42)

#Vectorise the text data into word count features
vectorizer = CountVectorizer(stop_words='english', max_features=3000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

#Define the classifiers to try
classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC()
}

#Train and evaluate each classifier
for name, clf in classifiers.items():
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.4f}")

Multinomial Naive Bayes Accuracy: 0.8000
Logistic Regression Accuracy: 0.8150
Linear SVM Accuracy: 0.8100


In [14]:
from sklearn.ensemble import VotingClassifier
#Create a VotingClassifier ensemble
ensemble = VotingClassifier(
    estimators=[
        ('nb', classifiers["Multinomial Naive Bayes"]),
        ('lr', classifiers["Logistic Regression"]),
        ('svm', classifiers["Linear SVM"])
        ],
    voting='hard'  # majority vote; 'soft' voting needs predict_proba, which LinearSVC does not provide
)

#Train the ensemble
ensemble.fit(X_train_vec, y_train)

#Predict on the test set
y_pred = ensemble.predict(X_test_vec)

#Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Ensemble model accuracy: {accuracy:.4f}')

Ensemble model accuracy: 0.8000
