Week 10 Document Classification Assignment
By Evan McLaughlin and Vladimir Nimchenko

~~~
It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.
~~~

For our assignment, we elected to build a classifier that labels IMDb movie reviews as either positive or negative. We found a suitable dataset for this exercise on Kaggle, linked below.


https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

In [98]:
import nltk
import random
import pandas as pd
import io
import requests

In [99]:
# First we set the random seed, then define a helper that tokenizes a review on whitespace,
# normalizes each token (lowercase, strip leading/trailing . , ? !), and chunks the tokens
# into segments of at most segment_length tokens
random.seed(269)
def preprocess_and_segment(text, segment_length):
    tokens = text.split()
    tokens = [token.lower().strip('.,?!') for token in tokens]
    segments = [tokens[i:i + segment_length] for i in range(0, len(tokens), segment_length)]
    return segments

In [100]:
# build a bag-of-words feature dict: one boolean contains(word) entry per candidate word
def document_features(segment, word_features):
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in segment)
    return features
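
As a small usage sketch of the two helpers above (the sample sentence and the three-word vocabulary are made up for illustration): preprocess_and_segment returns a list of segments, each of which is itself a list of tokens. document_features does a membership test, so it needs one flat token list at a time; handing it the whole nested result makes every feature False.

```python
def preprocess_and_segment(text, segment_length):
    tokens = [token.lower().strip('.,?!') for token in text.split()]
    return [tokens[i:i + segment_length] for i in range(0, len(tokens), segment_length)]

def document_features(segment, word_features):
    return {'contains({})'.format(word): (word in segment) for word in word_features}

# Hypothetical toy inputs, not taken from the IMDb data
sample = "A great film. Truly GREAT acting, weak plot!"
vocab = ["great", "weak", "boring"]

segments = preprocess_and_segment(sample, segment_length=4)
print(segments)

# Passing the nested list: each word is compared against whole lists, so nothing matches
nested_features = document_features(segments, vocab)

# Flattening the segments back into one token list gives review-level features
flat_tokens = [token for seg in segments for token in seg]
review_features = document_features(flat_tokens, vocab)

print(nested_features)
print(review_features)
```

An alternative is to call document_features once per segment and treat each segment as its own labeled example.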

In [101]:

# next we write our function to train and test our classifier
def train_and_test_classifier(featuresets, train_size):
    random.shuffle(featuresets)
    train_set, test_set = featuresets[:train_size], featuresets[train_size:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    return classifier, accuracy
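
One side effect worth knowing: random.shuffle reorders the list it is given, so featuresets is shuffled in place for the caller as well. If that is not wanted, a split helper along these lines leaves the input untouched. split_featuresets and its seed parameter are hypothetical names for this sketch, not part of the notebook or of NLTK.

```python
import random

def split_featuresets(featuresets, train_size, seed=269):
    """Return (train_set, test_set) without modifying the input list."""
    rng = random.Random(seed)      # local generator, independent of the global seed
    shuffled = list(featuresets)   # shallow copy so the caller's order is preserved
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

# Toy (features, label) pairs for illustration only
toy = [({'contains(good)': i % 2 == 0}, 'positive' if i % 2 == 0 else 'negative')
       for i in range(10)]
original_order = list(toy)

train_set, test_set = split_featuresets(toy, train_size=3)
print(len(train_set), len(test_set))
```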


The segment_length and word_features_count variables below control the length of each text segment and the number of word features extracted for classification. Both affect runtime as well as accuracy, and suitable values depend on the characteristics of the dataset and the requirements of the classification task, so we can tune them to find the settings that yield the best classification accuracy.
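
A sweep over the two settings could be organized as below. This is only a sketch: evaluate_settings is a hypothetical placeholder that stands in for rebuilding the featuresets and calling train_and_test_classifier, and the candidate values and the toy scoring function are arbitrary illustrations, not results.

```python
from itertools import product

def sweep(evaluate_settings, segment_lengths, feature_counts):
    """Try every combination and return (best_accuracy, best_settings, all_results)."""
    results = {}
    for seg_len, n_features in product(segment_lengths, feature_counts):
        results[(seg_len, n_features)] = evaluate_settings(seg_len, n_features)
    best_settings = max(results, key=results.get)
    return results[best_settings], best_settings, results

# Placeholder scorer so the sketch runs on its own; in the notebook this would
# build featuresets with the given settings and return the measured accuracy.
def toy_evaluate(segment_length, word_features_count):
    return 1.0 / (1 + abs(segment_length - 500) + abs(word_features_count - 1000))

best_accuracy, best_settings, results = sweep(
    toy_evaluate, [250, 500, 2000], [500, 1000, 2000]
)
print(best_settings, best_accuracy)
```

Ideally the accuracy used to choose the settings comes from a separate validation split rather than from the final test set.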


In [102]:
# now we can kick off our code. Let's set our segment length and word features counts at 2000 to begin and read in the data

segment_length = 2000
word_features_count = 2000
url = "https://raw.githubusercontent.com/evanmclaughlin/DATA-620/main/imdb_reviews.v3.csv"
response = requests.get(url)
data = io.StringIO(response.content.decode('utf-8'))
imdb_reviews = pd.read_csv(data)

In [103]:
# We next must preprocess and combine the words for feature extraction

all_segments = [(preprocess_and_segment(review, segment_length), label) for review, label in zip(imdb_reviews['review'], imdb_reviews['sentiment'])]
all_words = [word.lower() for review in imdb_reviews['review'] for word in review.split()]
word_frequencies = nltk.FreqDist(all_words)
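# Note: keys() follows insertion order, so this slice takes the first
# word_features_count distinct words encountered, not the most frequent ones
word_features = list(word_frequencies.keys())[:word_features_count]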
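
Because list(word_frequencies.keys())[:n] follows insertion order, it picks the first n distinct words seen rather than the n most frequent. A frequency-ordered selection might look like the sketch below. It uses collections.Counter from the standard library on a made-up three-review corpus; nltk.FreqDist is a Counter subclass and offers the same most_common method. The normalize helper is a hypothetical name. It applies the same lowercasing and punctuation stripping as preprocess_and_segment, so that the vocabulary matches the tokens checked later.

```python
from collections import Counter

def normalize(token):
    # Same normalization as preprocess_and_segment: lowercase, strip . , ? !
    return token.lower().strip('.,?!')

# Toy corpus for illustration only
reviews = [
    "Great movie, great cast!",
    "Boring plot. Great soundtrack.",
    "Boring, boring, boring.",
]

all_words = [normalize(tok) for review in reviews for tok in review.split()]
word_frequencies = Counter(all_words)

first_seen = list(word_frequencies.keys())[:2]                   # insertion order
most_frequent = [w for w, _ in word_frequencies.most_common(2)]  # frequency order

print(first_seen)
print(most_frequent)
```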

In [104]:
# Next, we extract document features for each review and train and test our classifier,
# using 30% of the data for training and the remaining 70% for testing
featuresets = []
for segment, label in all_segments:
    features = document_features(segment, word_features)
    featuresets.append((features, label))
    
train_size = int(len(featuresets) * 0.30) 
classifier, accuracy = train_and_test_classifier(featuresets, train_size)

In [105]:
# Let's check out the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.5087419232231091


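This is basically a coin flip, and a poor starting point. One cause is visible in the code: preprocess_and_segment returns a list of token lists, and that nested list is passed straight to document_features, so the membership test (word in segment) compares each word against whole lists and is always False. Every review therefore gets an identical feature set, and the classifier has nothing to work with beyond the label frequencies in the training set. To improve our classifier, we can consider several options, including a TF-IDF vectorizer, which captures the significance of terms within individual documents and across the dataset. It weights words by their importance, prioritizing terms that contribute to sentiment while mitigating the impact of common words. This makes TF-IDF a good choice for sentiment analysis, where correctly identifying key terms is critical for prediction success.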
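
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on a made-up three-document corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default in scikit-learn's TfidfVectorizer, but leaves out the L2 normalization that the vectorizer applies afterwards, so the numbers are illustrative rather than identical to scikit-learn's output. The tfidf helper is a hypothetical name for this sketch.

```python
import math
from collections import Counter

# Toy corpus for illustration only
docs = [
    "the movie was wonderful",
    "the movie was dreadful",
    "the plot was thin",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw term count times smoothed inverse document frequency (no normalization)."""
    counts = Counter(tokens)
    return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Words that occur in every document ("the", "was") receive the lowest weight, while "wonderful", which appears in only one document, receives the highest.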

In [106]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# after splitting the data into train-test, we generate a "pipeline" with the TF-IDF vectorizer and random forest classifier 
X_train, X_test, y_train, y_test = train_test_split(imdb_reviews['review'], imdb_reviews['sentiment'], test_size=0.3, random_state=269)
pipeline = make_pipeline(TfidfVectorizer(max_features=2000), RandomForestClassifier(n_estimators=100, random_state=269))

# Next, we train the model and predict on the test set
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8279379157427937


Accuracy improved substantially, from about 0.51 to about 0.83. TF-IDF gives more weight to terms that are distinctive for a document, which improves the classifier's discriminative power. Setting max_features=2000 limits the vocabulary to the 2000 most frequent terms in the corpus, which keeps the feature space small and may help the model generalize to unseen data. The Random Forest classifier can additionally capture interactions between terms. Note that the two runs are not a controlled comparison: they also differ in the classifier and in the share of data used for training (30% in the first run, 70% here), so the gain cannot be attributed to TF-IDF alone. Finally, the pipeline simplifies the workflow by bundling vectorization and model training into a single object.