# Class 2: Early text classification

**Topics**

* Naive Bayes
* Logistic Regression
* The bag-of-words vector representation

**Reading**
* [Jurafsky \& Martin Chapter 4: Naive Bayes, Text Classification, and Sentiment](https://web.stanford.edu/~jurafsky/slp3/4.pdf)
* [Jurafsky \& Martin Chapter 5: Logistic Regression](https://web.stanford.edu/~jurafsky/slp3/5.pdf)


## 1. Basic sentiment analysis of movie reviews using Naive Bayes

In the next few cells, we'll implement basic sentiment analysis of movie reviews using Naive Bayes and the TF-IDF-weighted bag-of-words feature representation. We'll then examine the top-ranked features by class learned by the classifier.
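To make the feature representation concrete, here is a minimal pure-Python sketch of a TF-IDF-weighted bag-of-words on a toy corpus. The corpus, the helper name `tfidf_vectors`, whitespace tokenization, and the smoothed IDF formula are all assumptions chosen for illustration; `TfidfVectorizer` uses its own tokenizer and also normalizes each vector, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed variant for this sketch)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # bag of words: word order is discarded
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors

toy_docs = ["a great great movie", "a boring movie"]
vocab, vectors = tfidf_vectors(toy_docs)
print(vocab)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Words that occur in every document ("a", "movie") get the smallest IDF, while the repeated, document-specific word "great" gets the largest weight in the first vector.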

In [None]:
import nltk
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report


nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

from nltk.corpus import movie_reviews

# Load the movie reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents (seeded so that reruns produce the same order)
import random
random.seed(42)
random.shuffle(documents)

# Convert list of words back to strings for TF-IDF Vectorizer
document_texts = [" ".join(d) for (d, c) in documents]
document_sentiments = [c for (d, c) in documents]

" ".join(documents[0][0]), documents[0][1]

`MultinomialNB` implements the Naive Bayes algorithm for multinomial models.
It learns the likelihood of seeing certain words in positive versus negative reviews and uses this information to predict the sentiment of new reviews.

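1.  **Assumed Distribution:** It assumes that the features follow a multinomial distribution. In the context of text, this means it models the counts of words in a document. Here we pass TF-IDF weights rather than raw counts; fractional values like these are not true counts, but they often work well in practice.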
2.  **Naive Assumption:** It makes the "naive" assumption that the features are conditionally independent given the class label. This means that the presence or absence of one word does not affect the presence or absence of another word, given that we know the sentiment (e.g., positive or negative). While this assumption is often violated in real-world text, Naive Bayes still performs surprisingly well in many text classification tasks.
3.  **Probability Calculation:** For each class (e.g., positive or negative), the model learns the probability of each feature (word) appearing in documents belonging to that class. This is typically done by counting the occurrences of each word in the training data for each class and applying smoothing to handle words that might not appear in all classes.
4.  **Classification:** To classify a new document, the model calculates the probability of the document belonging to each class, based on the learned feature probabilities and the prior probability of each class. It then assigns the document to the class with the highest probability.
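The four steps above can be sketched from scratch in a few lines. This is a minimal illustration on a made-up four-document corpus, not the `MultinomialNB` implementation; the function names `train_nb` and `predict_nb`, the toy data, and the choice of add-one smoothing (`alpha=1.0`) are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    # Prior: fraction of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Smoothing adds alpha to every word count so no probability is zero
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior (unknown words are skipped)."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

toy_docs = [["great", "fun", "film"], ["great", "acting"],
            ["boring", "film"], ["dull", "boring", "plot"]]
toy_labels = ["pos", "pos", "neg", "neg"]
log_prior, log_like = train_nb(toy_docs, toy_labels)
print(predict_nb(["great", "fun"], log_prior, log_like))
print(predict_nb(["boring", "plot"], log_prior, log_like))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long documents.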



In [None]:
# Split data into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(document_texts, document_sentiments, test_size=0.25, random_state=42)

# Use TfidfVectorizer for feature extraction
tfidf_vectorizer = TfidfVectorizer(max_features=3000) # You can adjust max_features
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)


# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

## 1.1 Get most-informative features

Now let's take a look at the most informative features for the `pos` and `neg` classes. In Multinomial Naive Bayes, the log likelihood ratio of a feature for a given class (e.g., positive sentiment) compared to all other classes (e.g., negative sentiment) tells us how much more likely that feature is to appear in the positive class compared to the negative class.

A higher positive log likelihood ratio for a feature means that the feature is much more likely to be present in positive reviews than in negative reviews, making it a strong indicator of positive sentiment. Conversely, a large negative log likelihood ratio indicates a feature that is much more likely to be found in negative reviews, making it a strong indicator of negative sentiment.

By sorting features based on the absolute value of this log likelihood ratio, we can identify the features that have the strongest association with either the positive or negative class, thus indicating their importance in distinguishing between the sentiments.
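In symbols, writing $\hat{P}(w \mid c)$ for the smoothed probability of word $w$ under class $c$ (notation introduced here for illustration), the quantity being ranked is:

$$ \text{LLR}(w) = \log \hat{P}(w \mid \text{pos}) - \log \hat{P}(w \mid \text{neg}) = \log \frac{\hat{P}(w \mid \text{pos})}{\hat{P}(w \mid \text{neg})} $$

As a made-up example, a word with $\hat{P}(w \mid \text{pos}) = 0.004$ and $\hat{P}(w \mid \text{neg}) = 0.001$ has $\text{LLR}(w) = \log 4 \approx 1.39$, meaning it is four times as likely under the positive class.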



In [None]:
# Get the log probabilities of features for each class
# (classifier.classes_ is sorted, so index 0 is 'neg' and index 1 is 'pos')
neg_prob_features = classifier.feature_log_prob_[0]
pos_prob_features = classifier.feature_log_prob_[1]

# Get the feature names from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame to compare probabilities
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'neg_prob': neg_prob_features,
    'pos_prob': pos_prob_features
})

# Calculate the difference in log probabilities (log likelihood ratio)
feature_importance['log_likelihood_ratio'] = pos_prob_features - neg_prob_features

# Sort by the absolute value of the log likelihood ratio to find the most informative features
feature_importance['abs_log_likelihood_ratio'] = abs(feature_importance['log_likelihood_ratio'])
most_informative_features = feature_importance.sort_values(by='abs_log_likelihood_ratio', ascending=False).head(20)

print("\nMost Informative Features:")
display(most_informative_features[['feature', 'log_likelihood_ratio']])

In [None]:
# Show the most informative features for the positive class
most_informative_positive_features = feature_importance.sort_values(by='log_likelihood_ratio', ascending=False).head(20)

print("\nMost Informative Features for Positive Class:")
display(most_informative_positive_features[['feature', 'log_likelihood_ratio']])

In [None]:
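# Show the most informative features for the negative class
# (sorted ascending so the most negative log likelihood ratios come first)
most_informative_negative_features = feature_importance.sort_values(by='log_likelihood_ratio', ascending=True).head(20)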

print("\nMost Informative Features for Negative Class:")
display(most_informative_negative_features[['feature', 'log_likelihood_ratio']])

## 2. Multiclass classification of articles using Logistic Regression

In the next few cells, we'll implement a multiclass text classification model using Logistic Regression and the 20 Newsgroups dataset, a commonly used dataset for multiclass classification problems.


In [None]:
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups dataset
# Select 4 categories for a 4-class classification setting
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Inspect the dataset
print("Number of training documents:", len(newsgroups_train.data))
print("Number of testing documents:", len(newsgroups_test.data))
print("Number of categories:", len(newsgroups_train.target_names))
print("Categories:", newsgroups_train.target_names)
print("\nFirst document in training set:")
print(newsgroups_train.data[0])
print("\nTarget of the first document:", newsgroups_train.target[0])
print("Target name of the first document:", newsgroups_train.target_names[newsgroups_train.target[0]])

## 2.1 Per-topic distribution in dataset

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Get the category names
category_names_train = newsgroups_train.target_names
category_names_test = newsgroups_test.target_names

# Count the occurrences of each category in the training data
category_counts_train = np.bincount(newsgroups_train.target)

# Count the occurrences of each category in the testing data
category_counts_test = np.bincount(newsgroups_test.target)

# Create a bar plot of the category distribution for the training data
plt.figure(figsize=(8, 6))
sns.barplot(x=category_names_train, y=category_counts_train)
plt.title('Distribution of News Categories in Training Data')
plt.xlabel('Category')
plt.ylabel('Number of Documents')
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show()

# Create a bar plot of the category distribution for the test data
plt.figure(figsize=(8, 6))
sns.barplot(x=category_names_test, y=category_counts_test)
plt.title('Distribution of News Categories in Testing Data')
plt.xlabel('Category')
plt.ylabel('Number of Documents')
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
# We can adjust parameters like max_features, stop_words, etc.
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')

# Fit the vectorizer on the training data and transform the training data
X_train = vectorizer.fit_transform(newsgroups_train.data)

# Transform the testing data
X_test = vectorizer.transform(newsgroups_test.data)

# Get the target variables
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Print the shape of the resulting matrices to verify
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

## 2.2 Sigmoid, softmax, and cross-entropy loss: the core components of logistic regression


### 2.2.1 The sigmoid function

The sigmoid function, also known as the logistic function, is a crucial component in binary logistic regression. It's defined as:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

where $z$ is the input, typically a linear combination of features and weights: $z = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$.

The sigmoid function takes any real-valued input and squashes it to a value between 0 and 1. In binary logistic regression, this output is interpreted as the probability of the input belonging to the positive class (e.g., probability of a review being positive).

- If $z$ is large positive, $\sigma(z)$ approaches 1.
- If $z$ is large negative, $\sigma(z)$ approaches 0.
- If $z$ is 0, $\sigma(z)$ is 0.5.

This property makes the sigmoid function ideal for modeling probabilities, as probabilities must be between 0 and 1.
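The three limiting cases listed above can be checked numerically. The snippet below is a small sketch using only the standard library; the particular input values are arbitrary choices.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate the function at a few negative, zero, and positive inputs
for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z = {z:+.1f}  ->  sigmoid(z) = {sigmoid(z):.4f}")
```

The outputs are symmetric around 0.5: `sigmoid(-z)` equals `1 - sigmoid(z)`, so a score of $-z$ for the positive class is the same as a score of $z$ for the negative class.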

### 2.2.2 The softmax function

While the sigmoid function is used for binary classification, the **softmax function** is its generalization for multiclass classification. It's used to convert a vector of arbitrary real values into a probability distribution over multiple classes. The softmax function is defined as:

$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$

where $z = (z_1, z_2, \dots, z_K)$ is the input vector (the output of the linear model for each of the $K$ classes), and $z_i$ is the input for class $i$.

The softmax function:
- Takes a vector of $K$ real numbers as input.
- Exponentiates each element to make them non-negative.
- Divides each exponentiated element by the sum of all exponentiated elements, ensuring that the output values sum to 1.

The output of the softmax function is a vector of $K$ probabilities, where each element represents the probability of the input belonging to a specific class. This makes it suitable for multiclass classification problems where we want to predict the probability of an instance belonging to each of the available classes.

#### 2.2.2.1 Visualizing the softmax function

Let's visualize this with a simple example. Suppose we have an input vector representing the scores for three classes: [1.0, 2.0, 3.0].

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def softmax(z):
    """Compute softmax scores for each class in z."""
    exp_z = np.exp(z - np.max(z)) # Subtract max for numerical stability
    return exp_z / np.sum(exp_z)

# Example input vector (e.g., scores for 3 classes)
input_scores = np.array([1.0, 2.0, 3.0])
softmax_output = softmax(input_scores)

print("Input Scores:", input_scores)
print("Softmax Output (Probabilities):", softmax_output)

# Visualize the input scores and softmax output
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Plot Input Scores
axes[0].bar(range(len(input_scores)), input_scores, color='skyblue')
axes[0].set_title('Input Scores')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Score')
axes[0].set_xticks(range(len(input_scores)))
axes[0].set_xticklabels([f'Class {i+1}' for i in range(len(input_scores))])
axes[0].grid(axis='y', linestyle='--')

# Plot Softmax Output
axes[1].bar(range(len(softmax_output)), softmax_output, color='lightcoral')
axes[1].set_title('Softmax Output (Probabilities)')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Probability')
axes[1].set_xticks(range(len(softmax_output)))
axes[1].set_xticklabels([f'Class {i+1}' for i in range(len(softmax_output))])
axes[1].set_ylim(0, 1) # Probabilities are between 0 and 1
axes[1].grid(axis='y', linestyle='--')

plt.tight_layout()
plt.show()

### 2.2.3 The cross-entropy loss function

The **cross-entropy loss function**, also known as logarithmic loss, is a standard loss function used in classification problems, particularly with models that output probability distributions like logistic regression and neural networks. It measures the difference between the predicted probability distribution and the true class distribution.

For a single instance with $K$ classes, where $y_i$ is a binary indicator (1 if the true class is $i$, 0 otherwise) and $p_i$ is the predicted probability of the instance belonging to class $i$, the cross-entropy loss is defined as:

$$ H(y, p) = -\sum_{i=1}^{K} y_i \log(p_i) $$

In the case of a true class $j$, $y_j=1$ and $y_i=0$ for $i \neq j$. The formula simplifies to:

$$ H(y, p) = -\log(p_j) $$

The goal during training is to minimize this loss. Minimizing $-\log(p_j)$ is equivalent to maximizing $\log(p_j)$, which in turn is equivalent to maximizing $p_j$, the predicted probability of the true class.

Cross-entropy loss penalizes confident wrong predictions heavily. If the model assigns a low probability to the true class, the $-\log(p_j)$ term is large, and it grows without bound as $p_j$ approaches 0. Conversely, if the model assigns a high probability to the true class, the loss is small. This pushes the model toward probability estimates that reflect its actual confidence, although minimizing cross-entropy does not by itself guarantee well-calibrated probabilities.
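A few hand-picked predictions illustrate how the penalty grows. This is a small sketch; the three probability vectors and the helper name `cross_entropy` are made up for illustration, and the true class is assumed to be index 1 in every case.

```python
import math

def cross_entropy(true_index, probs):
    """Cross-entropy for one example with a one-hot true label: -log(p_true)."""
    return -math.log(probs[true_index])

confident_right = [0.05, 0.90, 0.05]
uncertain = [0.30, 0.40, 0.30]
confident_wrong = [0.90, 0.05, 0.05]

for name, probs in [("confident and right", confident_right),
                    ("uncertain", uncertain),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name:>20}: loss = {cross_entropy(1, probs):.3f}")
```

The confidently wrong prediction incurs a loss many times larger than the confidently right one, even though both put 0.90 on a single class.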

### 2.2.4 Review of stochastic gradient descent with cross-entropy loss

See Jurafsky & Martin, Section 5.6.

Our goal is to find the set of weights which minimizes this loss function, averaged over all examples. Gradient descent is a method that finds a minimum of a function by figuring out in which direction (in the space of the parameters θ) the function’s slope is rising the most steeply, and moving in the opposite direction.

The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself in all directions, find the direction where the ground is sloping the steepest, and walk downhill in that direction.

For logistic regression, this loss function is conveniently convex. A convex function has at most one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum. (By contrast, the loss for multi-layer neural networks is non-convex, and gradient descent may get stuck in local minima.)

Below is a simplified example of cross-entropy loss and SGD. We assume a simple binary classification scenario with one feature (x), and we try to learn a single weight (w), with no bias for simplicity. The prediction is sigmoid(w * x).
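For the one-weight setup just described, the loss and its gradient can be written out directly, which may help when reading the update step in the simulation. With prediction $\hat{y} = \sigma(wx)$ and learning rate $\eta$ (a symbol introduced here for illustration), the per-example loss, its derivative, and the SGD update are:

$$ L(w) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] $$

$$ \frac{\partial L}{\partial w} = (\hat{y} - y)\,x $$

$$ w \leftarrow w - \eta\,(\sigma(wx) - y)\,x $$

As a worked step, start from $w = -2$ with $\eta = 0.5$ and the example $(x, y) = (1, 1)$. Then $\sigma(-2) \approx 0.119$, so the gradient is about $(0.119 - 1) \cdot 1 = -0.881$, and the updated weight is about $-2 - 0.5 \cdot (-0.881) = -1.560$. The weight moves toward positive values, which raises the predicted probability for this positive example.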



In [None]:
import numpy as np
import plotly.graph_objects as go

# --- Simplified Example with Cross-Entropy Loss and SGD ---

# Let's assume a simple binary classification scenario with one feature (x)
# and we are trying to learn a single weight (w) with no bias for simplicity.
# The prediction is sigmoid(w * x)

# Example Data (Feature x and True Label y)
# Data point 1: x=1, y=1 (Positive class)
# Data point 2: x=2, y=0 (Negative class)
X = np.array([1.0, 2.0])
y_true = np.array([1.0, 0.0]) # True labels

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Cross-Entropy Loss function for a single data point
def cross_entropy_loss(y_true, y_pred):
    # Avoid log(0) by clipping predictions
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)

# Total Cross-Entropy Loss over all data points
def total_cross_entropy_loss(w, X, y_true):
    total_loss = 0
    for i in range(len(X)):
        z = w * X[i]
        y_pred = sigmoid(z)
        total_loss += cross_entropy_loss(y_true[i], y_pred)
    return total_loss / len(X) # Average loss

# Gradient of the Cross-Entropy Loss with respect to the weight (w) for a single data point
# For sigmoid(wx), the derivative of loss with respect to w is (y_pred - y_true) * x
def gradient_cross_entropy(w, x, y_true):
    y_pred = sigmoid(w * x)
    return (y_pred - y_true) * x

# Simulate Stochastic Gradient Descent
def stochastic_gradient_descent_ce(initial_w, learning_rate, n_iterations, X, y_true):
    w_values = [initial_w]
    loss_values = [total_cross_entropy_loss(initial_w, X, y_true)]

    for i in range(n_iterations):
        # In SGD, we pick one data point (or a mini-batch) randomly
        # For this simple illustration, let's cycle through the data points
        data_index = i % len(X)
        x_i = X[data_index]
        y_true_i = y_true[data_index]

        # Calculate gradient using a single data point
        grad = gradient_cross_entropy(w_values[-1], x_i, y_true_i)

        # Update the weight
        new_w = w_values[-1] - learning_rate * grad
        w_values.append(new_w)
        loss_values.append(total_cross_entropy_loss(new_w, X, y_true)) # Calculate total loss for visualization

    return w_values, loss_values

# Set parameters for SGD
initial_w = -2.0 # Initial weight
learning_rate = 0.5 # Learning rate
n_iterations = 300 # Number of iterations

# Run SGD simulation
w_steps, loss_steps = stochastic_gradient_descent_ce(initial_w, learning_rate, n_iterations, X, y_true)

# Generate w values for plotting the total loss function
w_plot = np.linspace(-3, 4, 100)
loss_plot = [total_cross_entropy_loss(w, X, y_true) for w in w_plot]

# Create interactive plot using Plotly
fig = go.Figure()

# Add the loss function curve
fig.add_trace(go.Scatter(x=w_plot, y=loss_plot,
                         mode='lines',
                         name='Average Cross-Entropy Loss'))

# Add the SGD steps as points with hover information
fig.add_trace(go.Scatter(x=w_steps, y=loss_steps,
                         mode='markers+lines',
                         name='SGD Steps',
                         marker=dict(size=8),
                         hovertemplate='Weight: %{x:.4f}<br>Loss: %{y:.4f}<extra></extra>')) # Customize hover info

# Update layout
fig.update_layout(
    title='Illustration of Stochastic Gradient Descent with Cross-Entropy Loss',
    xaxis_title='Weight Value (w)',
    yaxis_title='Average Loss',
    hovermode='x unified' # Show hover info for all traces at a given x-value
)

fig.show()

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the multinomial logistic regression model.
# We use the 'sag' (Stochastic Average Gradient) solver, a variant of stochastic
# gradient descent that scales well to large datasets.
# No 'multi_class' argument is needed: with this solver, scikit-learn fits a
# multinomial (softmax) model by default when there are more than two classes.
model = LogisticRegression(solver='sag', random_state=42, n_jobs=-1)

# Train the model
model.fit(X_train, y_train)

print("Model training complete.")

## 2.4 Evaluating the model

We will use common classification metrics like accuracy, precision, recall, and F1-score.
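As a reminder of what these metrics measure, here is a minimal sketch that computes them by hand for one class of a toy three-class problem. The labels, the variable names `toy_true` and `toy_pred`, and the helper `precision_recall_f1` are made up for illustration; `classification_report` does the equivalent for every class.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never occurs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2, 2]

# Accuracy: fraction of predictions that match the true label
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
p, r, f1 = precision_recall_f1(toy_true, toy_pred, positive=1)
print(f"accuracy = {toy_accuracy:.3f}")
print(f"class 1: precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```

Precision asks how many of the documents predicted as a class truly belong to it, recall asks how many of the class's documents were found, and F1 is the harmonic mean of the two.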


In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=newsgroups_test.target_names)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=newsgroups_test.target_names, yticklabels=newsgroups_test.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

## 2.5 Examine the most informative per-class features

To determine the features that provided the best discriminative signal, we can simply look at the model coefficients.

In [None]:
# Get the model coefficients
# The coefficients are stored in the coef_ attribute of the LogisticRegression model
# coef_ is an array of shape (n_classes, n_features)
coefficients = model.coef_

# Get the feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the class names
class_names = newsgroups_train.target_names

print("Most Informative Features for Each Class (Logistic Regression):")

# Iterate through each class to find the top features
for i, class_name in enumerate(class_names):
    # Get the coefficients for the current class
    class_coefficients = coefficients[i]

    # Create a DataFrame to associate features with their coefficients for this class
    feature_coefficient_df = pd.DataFrame({
        'feature': feature_names,
        'coefficient': class_coefficients
    })

    # Sort features by the absolute value of their coefficients to find the most important
    feature_coefficient_df['abs_coefficient'] = abs(feature_coefficient_df['coefficient'])
    most_informative_class_features = feature_coefficient_df.sort_values(by='abs_coefficient', ascending=False).head(20)

    print(f"\n--- Class: {class_name} ---")
    display(most_informative_class_features[['feature', 'coefficient']])