This notebook explores text classification, introducing a majority class baseline and analyzing the effect of hyperparameter choices on accuracy. The goal is to understand how different settings in feature extraction and model regularization impact performance.

In [None]:
# Import necessary libraries
import sys # Provides access to system-specific parameters and functions
from collections import Counter # A dictionary subclass for counting hashable objects
from sklearn.feature_extraction.text import CountVectorizer # Converts text to a matrix of token counts
from sklearn import preprocessing # Provides tools for data preprocessing, like label encoding
from sklearn import linear_model # Contains linear models like Logistic Regression
import pandas as pd # A library for data manipulation and analysis, used here for plotting
import numpy as np # A fundamental package for scientific computing with Python

### 1. Data Loading

First, we define a function to read our data. The data is expected to be in a Tab-Separated Values (`.tsv`) format, with each line containing a label and the corresponding text.

In [None]:
# Define a function to read data from a file
def read_data(filename):
    # Initialize empty lists to store the text data (features) and labels
    X=[]
    Y=[]
    # Open the specified file with UTF-8 encoding to handle various characters
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file
        for line in file:
            # Strip trailing whitespace (including the newline), then split the line on the tab character ("\t")
            cols=line.rstrip().split("\t")
            # The first column is the label
            label=cols[0]
            # The second column is the text
            text=cols[1]
            # The sample text data is already tokenized (words are separated by spaces).
            # If your data is not tokenized, you would do it here.
            
            # Append the text to the feature list X
            X.append(text)
            # Append the label to the label list Y
            Y.append(label)
    # Return the lists of features (X) and labels (Y)
    return X, Y
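
`read_data` assumes every line has both a label and a text column, so a blank or truncated line would raise an `IndexError` at `cols[1]`. The sketch below is one possible lenient variant, not part of the original notebook: the name `read_data_lenient`, the choice to take a file object rather than a filename, and the sample lines are all assumptions made for illustration.

```python
import io

def read_data_lenient(file_obj):
    """Parse 'label<TAB>text' lines, skipping lines that lack a text column."""
    X, Y = [], []
    skipped = 0
    for line in file_obj:
        # Strip only the line terminator so tabs inside the line are preserved
        cols = line.rstrip("\r\n").split("\t")
        # A usable line needs a label and a non-empty text column
        if len(cols) < 2 or not cols[1].strip():
            skipped += 1
            continue
        Y.append(cols[0])
        X.append(cols[1])
    return X, Y, skipped

# Made-up sample in the expected format; the middle line is malformed
sample = io.StringIO("pos\tgreat movie !\nneg\nneg\tterrible plot .\n")
X, Y, skipped = read_data_lenient(sample)
print(X, Y, skipped)
```

Counting the skipped lines, rather than dropping them silently, makes it easier to notice when a data file is not in the expected format.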

Next, we specify the directory where the training, development, and test data files are located. You'll need to change this path to match the location on your own system.

In [None]:
# Change this to the directory with your data.
# The directory should contain train.tsv, dev.tsv, and test.tsv files.
directory="../data/text_classification_sample_data"

Using the function and path defined above, we now load the training and development (validation) datasets into memory.

In [None]:
# Read the training data from 'train.tsv'
trainX, trainY = read_data("%s/train.tsv" % directory)
# Read the development (validation) data from 'dev.tsv'
devX, devY = read_data("%s/dev.tsv" % directory)

### 2. Majority Class Baseline

Baselines are crucial for understanding how well a model is performing. A simple yet effective baseline is the **majority class baseline**. This method predicts the most frequent label from the **training data** for every single example in the development set. If our sophisticated model can't beat this simple baseline, it's not very useful.

In [None]:
# Define a function to calculate the majority class baseline accuracy
def majority_class(trainY, devY):
    # Use Counter to count the occurrences of each label in the training data
    label_counts = Counter(trainY)
    # Find the most common label and its count, then extract just the label
    majority_label = label_counts.most_common(1)[0][0]
    
    # Create a list of predictions by repeating the majority label for every item in the dev set
    predictions = [majority_label] * len(devY)
    
    # Calculate accuracy by comparing predictions to the true dev labels
    # This sums up all instances where the prediction matches the true label (evaluates to 1)
    correct = sum(p == y for p, y in zip(predictions, devY))
    # Accuracy is the number of correct predictions divided by the total number of predictions
    accuracy = correct / len(devY)
    
    # Return the list of predictions and the calculated accuracy
    return predictions, accuracy

Now, let's execute the `majority_class` function to see what our baseline accuracy is.

In [None]:
# Calculate the majority class baseline predictions (p) and accuracy (a)
p, a = majority_class(trainY, devY)
# Report the baseline accuracy on the dev set
print("Majority class baseline accuracy: %.3f" % a)
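
To make the arithmetic concrete, here is a small worked example with made-up labels (the `toy_` names and values are illustrative assumptions, not data from this notebook). It shows that the baseline accuracy is simply the share of dev examples that carry the training set's majority label.

```python
from collections import Counter

# Made-up labels, for illustration only
toy_trainY = ["pos", "pos", "pos", "neg"]
toy_devY = ["pos", "neg", "neg", "pos", "pos"]

# The majority label is chosen from the training labels only
toy_majority = Counter(toy_trainY).most_common(1)[0][0]

# Baseline accuracy equals the share of dev examples carrying that label
toy_accuracy = sum(y == toy_majority for y in toy_devY) / len(toy_devY)
print(toy_majority, toy_accuracy)
```

Here "pos" is the majority training label and three of the five dev labels are "pos", so the baseline accuracy is 0.6. If the dev set has a different label distribution from the training set, the baseline can fall below 50% even for two classes.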

### 3. Hyperparameter Tuning (Grid Search)

While scikit-learn's `GridSearchCV` is a powerful tool, building our own grid search gives us more control and helps in understanding the process. We'll explore how different parameter settings for `CountVectorizer` and `LogisticRegression` affect accuracy.
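
For comparison, the following is a rough sketch of how a similar search could be expressed with `GridSearchCV`. It is not the approach this notebook takes. The toy data, the step names `vec` and `clf`, and the small parameter values are assumptions for illustration; `PredefinedSplit` is used so that the search scores on a fixed dev split instead of cross-validation folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline

# Made-up, pre-tokenized toy data standing in for trainX/trainY and devX/devY
toy_trainX = ["good fun movie", "great fun plot", "bad dull movie", "awful dull plot"]
toy_trainY = ["pos", "pos", "neg", "neg"]
toy_devX = ["great movie", "awful movie"]
toy_devY = ["pos", "neg"]

# -1 marks rows that are only ever used for training; 0 marks the single validation fold
test_fold = [-1] * len(toy_trainX) + [0] * len(toy_devX)

# Chain the vectorizer and classifier so both can be tuned together
pipeline = Pipeline([
    ("vec", CountVectorizer(analyzer=str.split, binary=True)),
    ("clf", LogisticRegression(solver="lbfgs")),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {"vec__max_features": [2, 5], "clf__C": [0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=PredefinedSplit(test_fold), scoring="accuracy")
search.fit(toy_trainX + toy_devX, toy_trainY + toy_devY)
print(search.best_params_, search.best_score_)
```

Note that with the default `refit=True`, `GridSearchCV` retrains the best configuration on all rows it was given, including the dev rows, which differs from the manual loops in this notebook.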

First, we'll test a single hyperparameter: `max_features` from `CountVectorizer`, which limits the vocabulary size to the most frequent words.
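
As a rough pure-Python illustration of what vocabulary truncation and binary features mean, the sketch below keeps the `k` tokens with the highest total count and marks their presence in each text. It is a simplified stand-in, not scikit-learn's implementation; the helper names `top_k_vocab` and `binary_vector` and the toy corpus are assumptions.

```python
from collections import Counter

def top_k_vocab(texts, k):
    """Keep the k tokens with the highest total count across all texts."""
    counts = Counter(tok for text in texts for tok in text.split())
    # Sort the surviving tokens so the column order is deterministic
    return sorted(tok for tok, _ in counts.most_common(k))

def binary_vector(text, vocab):
    """Return 1 for each vocabulary token that occurs in the text, else 0."""
    present = set(text.split())
    return [1 if tok in present else 0 for tok in vocab]

# Made-up, pre-tokenized toy corpus
toy_texts = ["the movie was good", "the movie was bad", "the plot"]
vocab = top_k_vocab(toy_texts, 3)
print(vocab)
print([binary_vector(t, vocab) for t in toy_texts])
```

With `k=3`, the words "good", "bad", and "plot" fall out of the vocabulary, so the first two texts become indistinguishable. This is the trade-off the experiment below measures: a smaller vocabulary gives the model fewer parameters to fit but discards potentially informative words.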

In [None]:
# Initialize empty lists to store the accuracy scores and parameter names for each trial
scores=[]
names=[]

# Define a list of values for the max_features hyperparameter to test
feat_vals=[50, 100, 500, 1000, 5000, 10000, 50000]

# Initialize LabelEncoder to convert string labels (e.g., 'positive', 'negative') into integers (e.g., 1, 0)
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the mapping
le.fit(trainY)
# Transform the training and dev labels into their integer representations
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)

# Initialize a counter for tracking the number of trials
idx=0

# Loop through each value in our list of feature values
for feat_val in feat_vals:

    # Initialize CountVectorizer.
    # max_features: keep only the 'feat_val' most frequent tokens in the training data.
    # analyzer=str.split: split text on whitespace (assumes pre-tokenized text).
    # lowercase=False, strip_accents=None: leave the text unchanged. With a callable
    #   analyzer, scikit-learn skips its own preprocessing, so these are set for explicitness.
    # binary=True: use 1s and 0s for word presence, not word counts.
    vectorizer = CountVectorizer(max_features=feat_val, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)

    # Fit the vectorizer on the training text and transform it into a feature matrix
    X_train = vectorizer.fit_transform(trainX)
    # Transform the dev text using the same fitted vectorizer
    X_dev = vectorizer.transform(devX)

    # Print the progress of the grid search (1-based, so the last trial reads "7 of 7")
    print("%s of %s trials" % (idx + 1, len(feat_vals)))

    # Initialize a Logistic Regression model, spelling out scikit-learn's default settings explicitly
    # C=1.0: inverse of regularization strength.
    # solver='lbfgs': algorithm to use in the optimization problem.
    # penalty='l2': use L2 regularization.
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2')
    # Train the model on the vectorized training data and encoded labels
    logreg.fit(X_train, Y_train)
    # Calculate the accuracy on the dev set and append it to our scores list
    scores.append(logreg.score(X_dev, Y_dev))
    # Create a descriptive name for this trial and append it to our names list
    names.append("feat_value:%s" % (feat_val))
    # Increment the trial counter
    idx+=1

Now we will use `pandas` to create a DataFrame from our results and plot them in a bar chart to easily compare the performance for different numbers of features.

In [None]:
# Create a pandas DataFrame to store and display the results neatly
pd_results=pd.DataFrame({"value":names, "accuracy":scores})
# Plot the results as a bar chart to visualize the relationship between the number of features and accuracy
pd_results.plot.bar(x='value', y='accuracy', figsize=(14,6))
# Display the DataFrame
pd_results

### 4. Tuning Multiple Hyperparameters

Some parameters interact with each other. For example, the optimal number of features might depend on the regularization strength of the model. Here, we'll perform a grid search on two interacting hyperparameters:
1.  **`max_features`** from `CountVectorizer`.
2.  **`C`** (inverse of regularization strength) from `LogisticRegression`. A smaller `C` value means stronger regularization.
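
To see why a smaller `C` means stronger regularization, it helps to look at the objective. For binary labels $y_i \in \{-1, +1\}$, the L2-penalized logistic regression objective described in the scikit-learn user guide can be written, up to an overall scaling, as:

$$
\min_{w,\, b} \; \frac{1}{2} w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \,(x_i^\top w + b)\right)\right)
$$

Here $w$ and $b$ are the weights and intercept, and $x_i$ is the feature vector of example $i$. `C` multiplies the data-fit term, so as `C` shrinks the penalty $\frac{1}{2} w^\top w$ dominates and the weights are pulled toward zero. As `C` grows, the model is freer to fit the training data closely, which matters more when `max_features` is large and there are many weights to fit.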

In [None]:
# Re-initialize empty lists to store the new results
scores=[]
names=[]

# Define the list of values for max_features to test
feat_vals=[50, 100, 500, 1000, 5000, 10000, 50000]
# Define the list of values for the regularization parameter C to test
C_values=[0.001, 0.1, 1, 5, 10]

# Re-create and fit the LabelEncoder so this cell can run independently of the previous one
le = preprocessing.LabelEncoder()
le.fit(trainY)
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)

# Initialize a trial counter
idx=0

# Outer loop: iterate through each max_features value
for feat_val in feat_vals:

    # Create the feature matrix for the current feat_val
    vectorizer = CountVectorizer(max_features=feat_val, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)
    X_train = vectorizer.fit_transform(trainX)
    X_dev = vectorizer.transform(devX)

    # Inner loop: iterate through each C value for the current feature set
    for C_val in C_values:
        
        # Print the progress of the grid search (1-based trial number out of the total grid size)
        print("%s of %s trials" % (idx + 1, len(feat_vals)*len(C_values)))

        # Initialize the Logistic Regression model with the current C_val
        logreg = linear_model.LogisticRegression(C=C_val, solver='lbfgs', penalty='l2')
        # Train the model
        logreg.fit(X_train, Y_train)
        # Score the model and append the accuracy to the scores list
        scores.append(logreg.score(X_dev, Y_dev))
        # Create a descriptive name including both hyperparameter values and append it
        names.append("feat_value:%s-C:%s" % (feat_val, C_val))
        # Increment the trial counter
        idx+=1
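
The nested loops above visit every pair of values in the grid. As a small sketch of the same enumeration, `itertools.product` yields the pairs in the same order and makes the total trial count easy to check:

```python
from itertools import product

feat_vals = [50, 100, 500, 1000, 5000, 10000, 50000]
C_values = [0.001, 0.1, 1, 5, 10]

# product() yields pairs in the same order as the nested loops:
# every C value for the first max_features value, then the next, and so on
grid = list(product(feat_vals, C_values))
print(len(grid))
print(grid[:3])
```

This gives 7 × 5 = 35 trials. The nested-loop form is still the better fit for the actual search: it fits the vectorizer once per `max_features` value, whereas a flat loop over the pairs would refit it for every `C` value unless the feature matrices were cached.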

Finally, let's visualize the results from the two-parameter grid search to find the best combination of `max_features` and `C`.

In [None]:
# Create a pandas DataFrame from the results of the two-parameter grid search
pd_results=pd.DataFrame({"value":names, "accuracy":scores})
# Plot the results as a bar chart. This will help identify the best-performing combination.
pd_results.plot.bar(x='value', y='accuracy', figsize=(14,6))
# Display the results DataFrame
pd_results
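
With 35 bars, reading the best combination off the chart can be difficult. One simple way to pick it out programmatically is sketched below in plain Python. The `toy_names` and `toy_scores` values are placeholders for illustration only, not results from this notebook; in practice you would pass the `names` and `scores` lists built by the grid search.

```python
# Placeholder values for illustration only; not results from this notebook
toy_names = ["feat_value:50-C:0.1", "feat_value:50-C:1", "feat_value:100-C:0.1"]
toy_scores = [0.61, 0.64, 0.63]

# Pair each score with its trial name and take the pair with the highest score
best_score, best_name = max(zip(toy_scores, toy_names))
print(best_name, best_score)
```

With the DataFrame already built, `pd_results.sort_values("accuracy", ascending=False).head()` shows the same information along with the runners-up. Because this choice is made on the dev set, the selected setting should then be evaluated once on the held-out `test.tsv` to get an unbiased estimate.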