# **Lab 05: Text Classification and Performance Analysis**

## Text Classification

### Importing required modules

*Regular expressions*

* `re` is Python's built-in module for regular expressions.

* It supports pattern matching in strings, for example to remove punctuation or match words.

* If the input is "This! is, some# text." and every non-word character is replaced with a space, the output is "This  is  some  text ".
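
The substitution described above can be reproduced with a single `re.sub` call. This is a minimal sketch; the variable names `text` and `cleaned` are only for illustration.

```python
import re

text = "This! is, some# text."

# \W matches any character that is not a letter, digit, or underscore,
# so each punctuation mark is replaced by one space.
cleaned = re.sub(r"\W", " ", text)
print(repr(cleaned))  # 'This  is  some  text '

# Collapsing runs of whitespace afterwards gives a tidier string.
collapsed = re.sub(r"\s+", " ", cleaned).strip()
print(repr(collapsed))  # 'This is some text'
```

The double spaces in the first result come from a punctuation mark followed by an existing space, which is why the preprocessing later collapses whitespace as a separate step.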

*from sklearn.datasets import load_files*

* This function loads a dataset from folders.
* It automatically assigns labels based on folder names.
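
To see how folder names become labels, the sketch below builds a throwaway directory with two subfolders and loads it. The folder layout, file names, and review texts are made up for illustration and are not the lab's dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one subfolder per class, one text file per document.
toy_reviews = {
    "neg": ["dull plot", "weak acting"],
    "pos": ["great film", "loved it"],
}

with tempfile.TemporaryDirectory() as root:
    for label, texts in toy_reviews.items():
        folder = Path(root) / label
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"review_{i}.txt").write_text(text, encoding="utf-8")

    # Content is read while the temporary directory still exists.
    toy = load_files(root)

print(toy.target_names)      # class names taken from the folder names
print(type(toy.data[0]))     # documents are bytes unless an encoding is passed
print(toy.target.tolist())   # integer label per document (order is shuffled)
```

Because `load_files` returns raw bytes by default, each document has to be decoded before any string processing.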

*from nltk.corpus import stopwords*

* Stop words are common words such as "the", "is", "in", and "on".
* These words don't carry meaningful sentiment, so we often remove them during text preprocessing.
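
Stop-word removal is just a membership test against a word list. The sketch below uses a small hand-written set, `toy_stop_words`, as a stand-in for the NLTK list so that it runs without downloading any data; the sentence is also made up.

```python
# Hypothetical mini stop list; the lab itself uses stopwords.words('english').
toy_stop_words = {"the", "is", "in", "on", "but"}

tokens = "the plot is thin but the acting is superb".split()

# Keep only the tokens that are not in the stop list.
kept = [token for token in tokens if token not in toy_stop_words]
print(kept)  # ['plot', 'thin', 'acting', 'superb']
```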

*from nltk.stem import WordNetLemmatizer*

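* Lemmatization = converting a word to its base form (dictionary form).
* Example: "running" → "run" (as a verb), "better" → "good" (as an adjective). `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so these results require the matching part-of-speech tag.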



In [2]:
import re # Regular expressions
from sklearn.datasets import load_files # For loading dataset folders
import nltk # Natural Language Toolkit

from nltk.corpus import stopwords # Stop words
from nltk.stem import WordNetLemmatizer # Lemmatization

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Loading data

In [3]:
from google.colab import drive
drive.mount('/content/drive')


# Instantiate lemmatizer (needed for later)
lemmatizer = WordNetLemmatizer()

# movie_data = load_files(r"txt_sentoken")
movie_data = load_files('/content/drive/My Drive/CO544/movie_reviews/')
X, y = movie_data.data, movie_data.target

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Show basic summary information
print(f"Number of documents: {len(X)}")
print(f"Number of labels: {len(y)}")
print(f"Target names (classes): {movie_data.target_names}")

# Show a sample file (before decoding)
print("\nFirst document (raw bytes):")
print(X[0][:500]) # show first 500 bytes

# Decode and print a preview
print("\nFirst document (decoded):")
print(X[0].decode('utf-8')[:500]) # show first 500 characters

# Check label of first document
print(f"\nLabel of first document: {y[0]}")

Number of documents: 2017
Number of labels: 2017
Target names (classes): ['neg', 'pos']

First document (raw bytes):
b"my opinion on a film can be easily swayed by the presence of actors i love . \ni love ralph fiennes . \ni love uma thurman . \ni love sean connery . \nhell , i'm even a big fan of jim broadbent and fiona shaw . \ni saw the fantastic preview for the avengers nearly eight months ago , and i've been eagerly awaiting the film ever since . \na few months into the summer , however , i noticed that its release date had been changed a few times , and that it had ended up in the mid-august dumping ground . \nth"

First document (decoded):
my opinion on a film can be easily swayed by the presence of actors i love . 
i love ralph fiennes . 
i love uma thurman . 
i love sean connery . 
hell , i'm even a big fan of jim broadbent and fiona shaw . 
i saw the fantastic preview for the avengers nearly eight months ago , and i've been eagerly awaiting the film ever since . 
a few months

### Data preprocessing

In [5]:
documents = []

for i in range(len(X)):
  # 1. Decode from bytes to string
  document = X[i].decode('utf-8')
  # Optional sanity check: compare the raw bytes with the decoded string.
  # print(X[0]) # Before decoding (bytes)
  # print(X[0].decode('utf-8')) # After decoding

  # 2. Apply regex substitutions
  document = re.sub(r'\W', ' ', document) # replace non-word characters with spaces
  document = re.sub(r'^[a-zA-Z]\s+', ' ', document) # single character at the start (^ unescaped so it anchors)
  document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document) # single characters in the middle
  document = re.sub(r'\d+', ' ', document) # remove numbers
  document = re.sub(r'\s+', ' ', document) # collapse multiple spaces into one

  # 3. Lowercase
  document = document.lower()

  # 4. Tokenize
  document = document.split()

  # 5. Lemmatize
  document = [lemmatizer.lemmatize(word) for word in document]

  # 6. Rejoin tokens into one string (CountVectorizer expects strings by default)
  document = ' '.join(document)

  # 7. Append to new list
  documents.append(document)
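
The cleaning steps in the loop can be tried on a single string to see what each substitution does. This is a minimal sketch, assuming a hypothetical helper `clean_text`; it leaves out lemmatization so that it runs without NLTK data, and it writes the start-of-string pattern with an unescaped `^` so that it anchors to the beginning of the text.

```python
import re

def clean_text(text: str) -> str:
    """Apply the regex cleaning steps to one already-decoded document."""
    text = re.sub(r"\W", " ", text)               # non-word characters -> spaces
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)     # single letter at the start
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)   # single letters in the middle
    text = re.sub(r"\d+", " ", text)              # digits
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    # Lowercase, then split and rejoin to drop leading/trailing spaces.
    return " ".join(text.lower().split())

sample = "I saw 2 films; a GREAT one!"
print(clean_text(sample))  # saw films great one
```

Wrapping the steps in a function also makes it easy to reuse the same cleaning on new reviews at prediction time.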

### Convert text into numbers

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

vectorizer = CountVectorizer(
  max_features=1500,
  min_df=7,
  max_df=0.8,
  stop_words=stopwords.words('english')
)

X_vectors = vectorizer.fit_transform(documents).toarray()

# To check the shape and vocabulary:
print(X_vectors.shape) # (number_of_documents, number_of_features)
print(vectorizer.get_feature_names_out()) # List of feature words

(2017, 1500)
['ability' 'able' 'absolutely' ... 'york' 'young' 'younger']


### Text Classification

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_vectors, y, test_size=0.2, random_state=0
)

# Logistic Regression model
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()

# Train the model
log_reg.fit(X_train, y_train)

# Make the predictions on testing data
predictions = log_reg.predict(X_test)
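
The size of the test set follows from `test_size=0.2` and the 2017 documents. The short calculation below assumes the test count is rounded up, which is consistent with the support of 404 shown in the classification report.

```python
import math

n_samples = 2017   # number of documents loaded earlier
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the fractional test size is rounded up to a whole document.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)  # 404 1613
```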

### Evaluating Model Performance

In [9]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))
print("\nAccuracy:")
print(accuracy_score(y_test, predictions))

Confusion Matrix:
[[152  42]
 [ 41 169]]

Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.78      0.79       194
           1       0.80      0.80      0.80       210

    accuracy                           0.79       404
   macro avg       0.79      0.79      0.79       404
weighted avg       0.79      0.79      0.79       404


Accuracy:
0.7945544554455446
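
The figures in the report can be recomputed by hand from the confusion matrix, where rows are true classes and columns are predicted classes. The helper names `precision`, `recall`, and `f1` below are only for this sketch.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[152, 42],
      [41, 169]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cls: int) -> float:
    # Correct predictions of cls divided by all predictions of cls (column sum).
    return cm[cls][cls] / (cm[0][cls] + cm[1][cls])

def recall(cls: int) -> float:
    # Correct predictions of cls divided by all true members of cls (row sum).
    return cm[cls][cls] / sum(cm[cls])

def f1(cls: int) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.7946
for cls in (0, 1):
    print(cls, f"{precision(cls):.2f}", f"{recall(cls):.2f}", f"{f1(cls):.2f}")
```

For class 0 this gives precision 152/193, recall 152/194, and for class 1 precision 169/211, recall 169/210, matching the rounded values in the report.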


## Train three additional classifiers

### Import Required Models

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

### Train All Models

In [11]:
# Logistic Regression (refit here so that all four models are trained in one cell)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_pred = log_reg.predict(X_test)

# Random Forest
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

# SVM
svm = SVC(kernel='linear')  # Linear kernel for text data
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

# Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
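
The same fit-and-predict pattern can be written as a loop over a dictionary of models. The sketch below is an alternative layout, not the lab's code: it runs on randomly generated count data (`toy_X`, `toy_y`) rather than the movie reviews, and it sets `max_iter=1000` for logistic regression as an assumption to avoid convergence warnings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy stand-in for X_vectors / y: random non-negative counts, NOT the real data.
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=200)
toy_X = rng.poisson(lam=1.0, size=(200, 20))
toy_X[:, 0] += 3 * toy_y  # make the first feature informative about the label

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
}

toy_predictions = {}
for name, model in models.items():
    model.fit(Xtr, ytr)                        # train on the training split
    toy_predictions[name] = model.predict(Xte) # predict on the held-out split
    print(f"{name}: {accuracy_score(yte, toy_predictions[name]):.3f}")
```

Keeping the predictions in one dictionary makes it straightforward to pass each entry to an evaluation function in turn.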


### Evaluate All Models

In [12]:
def evaluate_model(name, y_test, y_pred):
    print(f"--- {name} ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))


evaluate_model("Logistic Regression", y_test, log_pred)
evaluate_model("Random Forest", y_test, rf_pred)
evaluate_model("Support Vector Machine", y_test, svm_pred)
evaluate_model("Naive Bayes", y_test, nb_pred)

--- Logistic Regression ---
Accuracy: 0.7945544554455446
Confusion Matrix:
 [[152  42]
 [ 41 169]]
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.78      0.79       194
           1       0.80      0.80      0.80       210

    accuracy                           0.79       404
   macro avg       0.79      0.79      0.79       404
weighted avg       0.79      0.79      0.79       404

--- Random Forest ---
Accuracy: 0.8193069306930693
Confusion Matrix:
 [[164  30]
 [ 43 167]]
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.85      0.82       194
           1       0.85      0.80      0.82       210

    accuracy                           0.82       404
   macro avg       0.82      0.82      0.82       404
weighted avg       0.82      0.82      0.82       404

--- Support Vector Machine ---
Accuracy: 0.754950495049505
Confusion Matrix:
 [[150  44]
 [ 55 155]]
Class