<img src="data/images/div/lecture-notebook-header.png" />

# Logistic Regression

Logistic Regression is a statistical method used for binary classification tasks, predicting the probability of a binary outcome based on input features. Despite its name, it's used for classification, not regression. Here's how Logistic Regression works and its application in text document classification:

* **Algorithm Overview:** Logistic Regression models the relationship between a set of independent variables (features) and a binary dependent variable (the target class). It uses the logistic function (also called the sigmoid function) to map input features to a probability score between 0 and 1, representing the likelihood of belonging to a particular class.

* **Text Document Classification:** In the context of text classification, each document is represented as a numerical feature vector. This vector can be constructed using various methods, such as:
    * *Bag-of-Words (BoW):* Counts the frequency of words in each document, creating a numerical representation.
    * *TF-IDF (Term Frequency-Inverse Document Frequency):* Weights words based on their frequency in a document relative to their frequency across all documents, capturing word importance.
    * *Word Embeddings:* Dense, lower-dimensional representations of words that capture semantic relationships and context.


* **Model Training:** Once text documents are transformed into feature vectors, Logistic Regression fits a linear decision boundary to distinguish between classes based on these features. It learns the weights (coefficients) for each feature and the bias term to make predictions.

* **Prediction:** During prediction, Logistic Regression uses these learned weights and the logistic (sigmoid) function to calculate the probability of a document belonging to a particular class. A threshold (usually 0.5) is applied to these probabilities to classify documents into the respective classes.

* **Regularization:** Logistic Regression models can be regularized to prevent overfitting by adding penalty terms to the objective function. Regularization techniques like L1 or L2 regularization help control model complexity and improve generalization to new data.
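The prediction step described in the list above can be sketched in a few lines of NumPy. This is only an illustration: the weights, the bias, and the two document vectors are made-up values, not parameters learned from the dataset used in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters for a model with 3 features (illustrative values only)
weights = np.array([1.5, -2.0, 0.5])
bias = -0.25

# Two hypothetical document vectors (e.g., TF-IDF values)
X_docs = np.array([
    [0.8, 0.1, 0.3],
    [0.1, 0.9, 0.0],
])

# Linear signal followed by the sigmoid gives P(class = 1 | document)
probs = sigmoid(X_docs @ weights + bias)

# Apply the usual threshold of 0.5 to obtain class labels
preds = (probs >= 0.5).astype(int)

print(probs)
print(preds)
```

The first document has a positive linear signal and is assigned to class 1; the second has a negative linear signal and is assigned to class 0.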

Logistic Regression is often considered a baseline model for text classification tasks due to its simplicity, interpretability, and efficiency. It's particularly useful when the relationship between features and classes is relatively linear or when there's a need for a straightforward probabilistic interpretation of the results.

Overall, Logistic Regression is a versatile and widely used method in text document classification, providing a probabilistic framework to predict class probabilities based on textual features, aiding in tasks like sentiment analysis, document categorization, or spam detection.

## Setting up the Notebook

### Import Required Packages

In [None]:
import numpy as np
import pandas as pd
import time

import matplotlib.pyplot as plt

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import metrics
from sklearn.pipeline import Pipeline

from tqdm import tqdm

---

## Preparing the Data

For this notebook, we use a simple dataset for sentiment classification. This dataset consists of 10,662 sentences, where 50% of the sentences are labeled 1 (positive), and 50% of the sentences are labeled -1 (negative).

### Loading Sentence/Label Pairs from File

In [None]:
sentences, labels = [], []

with open("data/datasets/sentence-polarities/sentence-polarities.csv") as file:
    for line in file:
        line = line.strip()
        sentence, label = line.split("\t")
        sentences.append(sentence)
        labels.append(int((int(label)+1)/2))
        
print("Total number of sentences: {}".format(len(sentences)))
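The label conversion in the loop above maps the original labels -1 and 1 to 0 and 1. A minimal sketch of that mapping follows; the helper name `to_binary_label` and the example line are hypothetical and not taken from the dataset file.

```python
def to_binary_label(raw_label):
    """Map a polarity label in {-1, 1} (given as a string) to {0, 1}."""
    return (int(raw_label) + 1) // 2

# A hypothetical tab-separated line in the same format as the dataset file
line = "a charming and often affecting journey\t1\n"
sentence, label = line.strip().split("\t")

print(sentence, to_binary_label(label))
print(to_binary_label("-1"), to_binary_label("1"))
```

Using 0 and 1 as class labels matters later in this notebook, since the Binary Cross Entropy loss expects targets in that range.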

### Create Training & Test Set

To evaluate any classifier, we need to split our dataset into a training and a test set. With the method `train_test_split()` this is very easy to do; this method also shuffles the dataset by default, which is important for this example, since the dataset file is ordered with all positive sentences coming first. In the example below, we set the size of the test set to 20%.


In [None]:
# Split sentences and labels into training and test set with a test set size of 20%
sentences_train, sentences_test, labels_train, labels_test = train_test_split(sentences, labels, test_size=0.2, random_state=42)

# We can directly convert the numerical class labels from lists to numpy arrays
y_train = np.asarray(labels_train)
y_test = np.asarray(labels_test)

print("Size of training set: {}".format(len(sentences_train)))
print("Size of test set: {}".format(len(sentences_test)))
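To see why shuffling matters for a file that lists all positive sentences first, here is a small sketch on toy data. The `stratify` argument is an optional addition that the cell above does not use; it is shown only as one possible way to keep the class ratio identical in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy data ordered like the dataset file: all positives first, then all negatives
toy_sentences = ["pos {}".format(i) for i in range(5)] + ["neg {}".format(i) for i in range(5)]
toy_labels = [1] * 5 + [0] * 5

# Without shuffling, the test set is simply the tail of the list (negatives only)
_, _, _, y_tail = train_test_split(toy_sentences, toy_labels, test_size=0.2, shuffle=False)

# With shuffling and stratification, both classes appear in the test set
_, _, y_tr, y_te = train_test_split(
    toy_sentences, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(list(y_tail))
print(sorted(y_te))
```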

---

## Training & Testing a Logistic Classifier

Let's first have a look at how to train a Logistic Regression classifier with the minimum number of steps. For this, we pick some reasonable values for the vectorizer and use the default values of the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier.

In [None]:
# Create Document-Term Matrix (unigrams only, limited to the 10,000 most frequent terms)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=10000)

X_train = tfidf_vectorizer.fit_transform(sentences_train)
X_test = tfidf_vectorizer.transform(sentences_test)

Using the training data, we can train a Logistic Regression classifier with a single line of code.

In [None]:
model = LogisticRegression().fit(X_train, y_train)

Once trained, we can predict the class labels for the document vectors in our test set.

In [None]:
y_pred = model.predict(X_test)

`y_pred` now contains the 2,133 predicted labels that we can compare with the ground truth labels from the test set. scikit-learn provides methods to easily calculate all the important metrics we covered in the lecture. Since we only have two class labels (i.e., binary classification), we do not have to set the `average` parameter to indicate micro or macro averaging.

In [None]:
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)

print("Precision: {:.3f}".format(precision))
print("Recall:    {:.3f}".format(recall))
print("F1 score:  {:.3f}".format(f1))
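As a sanity check on what these three scores mean in the binary case, the sketch below computes them by hand from the counts of true positives, false positives, and false negatives, and compares them with scikit-learn. The toy labels are illustrative values, not predictions from the model above.

```python
import numpy as np
from sklearn import metrics

# Toy ground truth and predictions (illustrative values only)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Count the outcomes with respect to the positive class (label 1)
tp = np.sum((y_hat == 1) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
fn = np.sum((y_hat == 0) & (y_true == 1))

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, metrics.precision_score(y_true, y_hat))
print(recall_manual, metrics.recall_score(y_true, y_hat))
print(f1_manual, metrics.f1_score(y_true, y_hat))
```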

scikit-learn also provides the method `classification_report()` for a more detailed description of the results, showing the precision, recall, and f1 scores broken down by class.

In [None]:
print(metrics.classification_report(y_test, y_pred))

## Hyperparameter Tuning

Similar to the Naive Bayes classifier, Logistic Regression does not offer hyperparameters that are very fundamental to the algorithm. If you check the documentation of [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), you will see that it already performs L2 regularization by default. An important parameter is the `solver`. Note that the class constructor does not provide any parameter for a learning rate. This is because Logistic Regression can be solved much faster than with basic Gradient Descent: the loss function is convex, so more optimized techniques can be applied. The reason we cover Logistic Regression with Gradient Descent in the lecture is that it is a more general approach that is directly applicable to neural networks.
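The strength of the regularization is controlled by the parameter `C`, which is the inverse of the regularization strength. The following sketch, on a synthetic dataset generated with `make_classification` (an assumption made only for illustration), suggests how a small `C` shrinks the learned weights compared to a large `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic toy data, used only to illustrate the effect of C
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Small C = strong regularization; large C = weak regularization
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X_toy, y_toy)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X_toy, y_toy)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)

print("Weight norm with C=0.01:  {:.3f}".format(norm_strong))
print("Weight norm with C=100.0: {:.3f}".format(norm_weak))
```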


### Selecting the Best Maximum N-Gram Size

Like in the notebook for the Naive Bayes Classifier, let's evaluate the effect of different maximum n-gram sizes on the result (i.e., the f1 scores).


In [None]:
min_ngram_size = 1
max_ngram_size = 5

# Number of n-gram sizes to evaluate (both bounds are inclusive)
num_runs = max_ngram_size - min_ngram_size + 1

# List to keep track of all results as (ngram, mean f1 score) tuples
results = []

with tqdm(total=num_runs) as pbar:
    for i, ngram in enumerate(range(min_ngram_size, max_ngram_size+1)):
        # Create Document-Term Matrix for different n-gram sizes
        tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngram), max_features=20000)
        X_train = tfidf_vectorizer.fit_transform(sentences_train)
        X_test = tfidf_vectorizer.transform(sentences_test)
        # Train & test model using cross validation
        model = LogisticRegression()
        scores = cross_val_score(model, X_train, y_train, cv=10, scoring="f1")
        mean_score = np.mean(scores)
        results.append((ngram, mean_score))
        pbar.update(1)

With the f1 scores for the different values for `max_ngram_size`, we can quickly plot those results.

In [None]:
plt.figure()
plt.plot([s[0] for s in results], [s[1] for s in results], lw=3)
font_axes = {'family':'serif','color':'black','size':16}
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Max N-Gram Size", fontdict=font_axes)
plt.ylabel("F1 Score", fontdict=font_axes)
plt.tight_layout()
plt.show()

While the best value for the maximum n-gram size is at 2, keep in mind that the f1 score actually doesn't change too much; see the scale of the y-axis. The main reason for this is that, for example, a maximum n-gram size of 3 still contains all unigrams and bigrams. This is the most common approach in practice. However, feel free to also set `min_ngram_size` to a larger value than 1 and see how it affects the results.

Of course, all these results and observations only hold true for this specific data set and might significantly differ for other ones.


## Pipelines & Grid Search

Hyperparameter tuning is an important step, but the previous example has shown that it can be quite tedious. Note that we tried every value from a predefined set for a single parameter, which required one loop. If we wanted to tune $N$ parameters at the same time and try all combinations of their values, we would need $N$ nested loops. Luckily, scikit-learn makes this much easier using [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Since the parameters we would like to tune refer to 2 different components -- the vectorizer and the classifier -- we also need a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to combine both components into a single model. Let's do this first:


In [None]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression()),
])

Now we can define the search space by providing the set of values for each hyperparameter we want to consider. Note how the identifier of each parameter is a combination of the step name in the pipeline (here: `tfidf` and `logreg`) and the name of the parameter in the respective class, joined by a double underscore. For example, `tfidf__max_features` refers to the `max_features` parameter of the `TfidfVectorizer`.

In [None]:
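parameters = {
    'tfidf__max_features': (5000, 10000),
    'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'logreg__penalty': ("l2", "elasticnet"),
    # l1_ratio is required for "elasticnet"; it is ignored (with a warning) for "l2"
    'logreg__l1_ratio': (0.5,),
    'logreg__fit_intercept': (True, False),
    'logreg__C': (1.0, 10.0),
    # "saga" is the only scikit-learn solver that supports the elastic-net penalty
    'logreg__solver': ('saga',)
}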
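The naming scheme for pipeline parameters can be inspected with `get_params()` and used with `set_params()`. The sketch below rebuilds a pipeline with the same two steps under the hypothetical name `demo_pipeline`, so that it does not interfere with the pipeline defined above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression()),
])

# All tunable parameters are addressed as <step name>__<parameter name>
param_names = sorted(demo_pipeline.get_params().keys())
print([name for name in param_names if name in ('tfidf__max_features', 'logreg__C')])

# The same identifiers work with set_params()
demo_pipeline.set_params(tfidf__ngram_range=(1, 2), logreg__C=10.0)
print(demo_pipeline.named_steps['tfidf'].ngram_range)
print(demo_pipeline.named_steps['logreg'].C)
```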


Now we can use `GridSearchCV` to check all possible combinations of parameter values. Of course, we kept the number of possible values rather small to avoid overly long run times here. Note that for any parameter not listed above (e.g., `min_df` of the vectorizer) the default value is used.

In [None]:
%%time

grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1, cv=5)

grid_search = grid_search.fit(sentences_train, labels_train)
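To get a feeling for the cost of such a search before running it, the number of parameter combinations can be counted with `ParameterGrid`. The grid below is a stand-in with the same number of candidate values per parameter as in this notebook; `demo_grid` is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

# A grid with 2 x 3 x 2 x 2 x 2 candidate values (illustrative stand-in)
demo_grid = {
    'tfidf__max_features': (5000, 10000),
    'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'logreg__penalty': ('l2', 'elasticnet'),
    'logreg__fit_intercept': (True, False),
    'logreg__C': (1.0, 10.0),
}

num_candidates = len(ParameterGrid(demo_grid))
num_folds = 5
num_fits = num_candidates * num_folds

print("Parameter combinations: {}".format(num_candidates))
print("Model fits with {}-fold cross validation: {}".format(num_folds, num_fits))
```

Each additional parameter multiplies the number of fits, which is why the sets of values are kept small here.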

Once the `GridSearchCV` has checked all possible parameter combinations, we can read out the best combination as follows:

In [None]:
print(grid_search.best_params_)

With these best parameter values -- note that those might not really be the best values as we selected just some alternatives for this example -- we compute the final scores by vectorizing our data and training a Logistic Regression classifier using those parameters. We now train the classifier on the complete training data and evaluate it on the test data. Note that we use the test data only this one time, for the final results.

In [None]:
# Parameter values are hardcoded here; adjust them to match the output of grid_search.best_params_
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000, max_df=0.75)

X_train = tfidf_vectorizer.fit_transform(sentences_train)
X_test = tfidf_vectorizer.transform(sentences_test)

model = LogisticRegression(C=10.0, fit_intercept=False).fit(X_train, y_train)
y_pred = model.predict(X_test)

precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)

print("Precision: {:.3f}".format(precision))
print("Recall:    {:.3f}".format(recall))
print("F1 score:  {:.3f}".format(f1))
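As an alternative to copying the best parameter values by hand, `GridSearchCV` refits the best pipeline on the complete training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below shows this on a tiny made-up corpus; the documents, labels, and the reduced grid are assumptions chosen only to keep the example self-contained and fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus, only to keep the sketch self-contained
docs_train = ["great movie", "wonderful acting", "loved it", "really great fun",
              "terrible movie", "awful acting", "hated it", "really boring mess"]
y_docs_train = [1, 1, 1, 1, 0, 0, 0, 0]
docs_test = ["great fun", "awful mess"]

demo_search = GridSearchCV(
    Pipeline([('tfidf', TfidfVectorizer()), ('logreg', LogisticRegression())]),
    {'tfidf__ngram_range': ((1, 1), (1, 2)), 'logreg__C': (1.0, 10.0)},
    cv=2,
)
demo_search.fit(docs_train, y_docs_train)

# best_estimator_ is the whole pipeline (vectorizer + classifier),
# retrained on all training documents with the best parameter combination
best_pipeline = demo_search.best_estimator_
demo_pred = best_pipeline.predict(docs_test)

print(demo_search.best_params_)
print(demo_pred)
```

Because the fitted pipeline includes the vectorizer, it can be applied directly to raw sentences.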

---

## Logistic Regression with PyTorch

We have seen in the lecture that Logistic Regression forms the building block of neural networks: the neuron. This means we can implement Logistic Regression as a single-neuron network using PyTorch. While this is not meaningful in practice -- since Logistic Regression can be solved much more efficiently than with Gradient Descent / Backpropagation -- we can use this idea for a basic introduction to PyTorch.


### Required Packages

In [None]:
import torch
import torch.nn as nn

from torch.utils.data import TensorDataset, DataLoader

### Checking/Setting the Device

PyTorch allows us to train neural networks on a supported GPU to significantly speed up the training process. If you have a supported GPU, feel free to utilize it. However, for this notebook it's certainly not needed as our dataset is small and our network model is very simple. In fact, the training is likely faster on the CPU here, since initializing memory on the GPU and moving the data to the GPU involves some overhead.


In [None]:
use_cuda = torch.cuda.is_available()
# Force the CPU for this small example; comment out the next line to use an available GPU
use_cuda = False
device = torch.device("cuda:0" if use_cuda else "cpu")

print("Available device: {}".format(device))

### Create Tensors

Both `X_train` and `X_test` are the matrices containing the document vectors of our training and test set. However, right now, they are sparse matrices, i.e., representations that only store the non-zero values. For further use with PyTorch, we have to perform 2 additional steps:

* Convert the sparse representation to a dense (i.e., full/normal) representation using `.todense()`; the output is a dense NumPy matrix

* Convert the NumPy data to tensors. `Tensor` is the data object used by PyTorch; tensors look, feel, and handle basically the same as NumPy arrays.

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)

X_train = tfidf_vectorizer.fit_transform(sentences_train)
X_test = tfidf_vectorizer.transform(sentences_test)

X_train = torch.Tensor(X_train.todense())
X_test = torch.Tensor(X_test.todense())

y_train = torch.Tensor(y_train).float()
y_test = torch.Tensor(y_test).float()
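The two conversion steps can be tried out in isolation on a small hand-made sparse matrix. Note that this sketch uses `.toarray()` and `torch.tensor(...)` instead of the `.todense()` and `torch.Tensor(...)` calls in the cell above; this is a substitution made here for illustration, since `.toarray()` returns a plain `numpy.ndarray` and `torch.tensor(...)` lets us state the data type explicitly.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix

# A small sparse matrix standing in for a TF-IDF document-term matrix
X_sparse = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.3],
], dtype=np.float64))

# Only the non-zero entries are stored explicitly
print(X_sparse.nnz)

# Step 1: sparse -> dense NumPy array
X_dense = X_sparse.toarray()

# Step 2: NumPy array -> PyTorch tensor (copied, with an explicit 32-bit float type)
X_tensor = torch.tensor(X_dense, dtype=torch.float32)

print(tuple(X_tensor.shape), X_tensor.dtype)
```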

### Create PyTorch Datasets

Training a neural network is usually not done by computing the gradient w.r.t. the whole dataset, as most of the time the dataset is way too large to fit into memory. It might also slow down training since the gradients w.r.t. the whole dataset can be very small (although this could be addressed by increasing the learning rate). In practice, the training is basically always done using batches, i.e., much smaller subsets of the data. While we can take `X_train` and `X_test` and implement our own loops to create batches, PyTorch comes with a series of convenient utility classes to simplify things.

The first utility class we use is the [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class; more specifically, since `Dataset` is an abstract class, we use the [`TensorDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) class to wrap our tensors and make each sample retrievable by indexing the tensors along the first dimension. This will be used by the data loaders below.


In [None]:
dataset_train = TensorDataset(X_train, y_train)
dataset_test = TensorDataset(X_test, y_test)

### Create Data Loaders

The [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class takes a `Dataset` object as input and handles splitting the dataset into batches. As such, a data loader also has `batch_size` as an input parameter. In the following, we use a batch size of 64, although you can easily go higher since we are dealing with only sentences.


In [None]:
batch_size = 64

loader_train = DataLoader(dataset_train, batch_size=batch_size)
loader_test = DataLoader(dataset_test, batch_size=batch_size)
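To see what a data loader actually yields, the following sketch wraps 10 made-up samples and iterates over them with a batch size of 4. All names ending in `_demo` are hypothetical and unrelated to the tensors created above.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 10 hypothetical samples with 4 features each, plus binary labels
X_demo = torch.arange(40, dtype=torch.float32).reshape(10, 4)
y_demo = torch.tensor([0.0, 1.0] * 5)

demo_dataset = TensorDataset(X_demo, y_demo)
demo_loader = DataLoader(demo_dataset, batch_size=4)

# Each iteration yields one batch of features together with the matching labels
batch_shapes = [(tuple(X_b.shape), tuple(y_b.shape)) for X_b, y_b in demo_loader]

print(len(demo_dataset), len(demo_loader))
print(batch_shapes)
```

The last batch is smaller than the others, since 10 samples do not divide evenly into batches of 4.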

### Create Model

The code cell below implements Logistic Regression as a neural network in PyTorch. Recall that we first compute the linear signal, which is the weighted sum of the input values. This is accomplished by the [`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layer. The main parameters of this class are the input size (i.e., the number of features / vocabulary size) and the output size, which is just 1 for Logistic Regression. The internal parameters of `nn.Linear` are our weights $\theta$, which also include the bias $\theta_0$. Since `nn.Linear` only gives us our linear signal, we need to push it through the Sigmoid function to get proper probabilities. PyTorch offers the `nn.Sigmoid` layer for this (PyTorch also calls activation functions layers).

So far, we only defined our layers. The `forward()` method is the default method that is executed when a class instance evaluates an input. In this method, simply speaking, the input is pushed through all the layers. The `squeeze()` method is needed here to reshape `out` from the shape `(num_inputs, 1)` to just `(num_inputs,)`. Ensuring the proper shapes of all the tensors to avoid errors is just something to get used to when working with PyTorch or other frameworks.


In [None]:
class MyLogisticRegression(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Define fully connected (i.e., linear) layer
        self.fc = nn.Linear(self.vocab_size, 1)
        # Define sigmoid layer
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, X):
        out = self.fc(X)
        out = self.sigmoid(out)
        return out.squeeze()

To "visualize" the network, we can create and print an instance of the class `MyLogisticRegression`. Of course, due to its simplicity -- we only have 1 neuron after all -- the output does not look very exciting. The command `.to(device)` "moves" the instance to the selected device (e.g., the CPU or GPU). In general, both the model and the data need to reside on the same device. So if the model will be on the GPU, we also need to move the data to the GPU later. Consistently using `.to(device)` on the model and the data (see below) ensures that there will be no mismatch.

In [None]:
# Create the model
classifier = MyLogisticRegression(X_train.shape[1]).to(device)

print(classifier)
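To make the shapes and the number of trainable parameters concrete, the sketch below rebuilds the same two layers outside of a class for a hypothetical vocabulary of only 5 features. The names `linear_layer`, `sigmoid_layer`, and `X_small` are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Same structure as the model above: one linear layer with a single output, then a sigmoid
vocab_size_demo = 5
linear_layer = nn.Linear(vocab_size_demo, 1)
sigmoid_layer = nn.Sigmoid()

# One weight per feature plus one bias term
num_params = sum(p.numel() for p in linear_layer.parameters())

# A batch of 3 random input vectors
X_small = torch.rand(3, vocab_size_demo)
out = sigmoid_layer(linear_layer(X_small))
out_squeezed = out.squeeze()

print(num_params)
print(tuple(out.shape), tuple(out_squeezed.shape))
```

The output before `squeeze()` has one column per output neuron, which is why the extra dimension of size 1 needs to be removed to match the shape of the labels.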

### Train & Evaluate Model

With the data prepared and the model architecture defined, we can now train and evaluate a model to build our sentiment classifier.


#### Evaluate

The code cell below implements the method `evaluate()` to, well, evaluate our model. Apart from the model itself, the method also receives the data loader as input parameter. This allows us later to use both `loader_train` and `loader_test` to compute the f1 score over the training and test data using the same method.

The method is very generic and is not specific to the dataset. It simply loops over all batches of the data loader, computes the predicted probabilities, thresholds these probabilities to derive the predicted class labels, and compares the predictions with the ground truth to return the f1 score. This means the method could be used "as is" for other binary classification tasks, or easily be adapted to tasks with more than 2 classes.


In [None]:
def evaluate(model, loader):

    # Set model to "eval" mode (not needed here, but a good practice)
    model.eval()

    # Collect predictions and ground truth for all samples across all batches
    y_pred, y_test = [], []

    with tqdm(total=len(loader)) as pbar:
        
        # Loop over each batch in the data loader
        for X_batch, y_batch in loader:

            # Move data to device
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            # Push batch through network to get the probability for each sample in the batch
            probs = model(X_batch)                
            
            # Apply the 0.5 threshold to derive the predicted class labels
            y_batch_pred = (probs>0.5).float()

            # Add predictions and ground truth for current batch
            y_test += y_batch.detach().cpu().tolist()
            y_pred += y_batch_pred.detach().cpu().tolist()

            pbar.update(1)

    # Set model to "train" mode (not needed here, but a good practice)
    model.train()            
            
    # Return the f1 score as the output result
    return metrics.f1_score(y_test, y_pred)

For a quick test, let's evaluate the newly created model. Of course, we didn't train our model, but it will still make predictions based on the initial weights.


In [None]:
print(evaluate(classifier, loader_test))

### Train Model (and evaluate after each epoch)

Similar to the method `evaluate()`, we also implement a method `train()` to wrap all the steps required for training. This has the advantage that we can simply call `train()` multiple times to proceed with the training. Apart from the model, this method has the following input parameters:

* `loader_train` and `loader_test`: this allows us to compute the f1 score over the training data and the test data after each epoch; we can later visualize the changes in the f1 scores

* `optimizer`: the optimizer specifies how the computed gradients are used to update the weights; in the lecture, we only covered the basic Stochastic Gradient Descent, but there are much more efficient alternatives available

* `criterion`: this is the loss function; "criterion" is just very common terminology in the PyTorch documentation and tutorials

* `num_epochs`: the number of epochs -- i.e., the number of times we want to train over all samples in our dataset

The heart of the method is the snippet described as PyTorch Magic. It consists of the following 3 lines of code:

* `optimizer.zero_grad()`: Before computing the gradients for a batch, we have to reset the gradients left over from the previous batch

* `loss.backward()`: Calculating all gradients using backpropagation

* `optimizer.step()`: Update all weights using the gradients and the method of the specific optimizer


In [None]:
def train(model, loader_train, loader_test, optimizer, criterion, num_epochs):

    losses, f1_train, f1_test = [], [], []
    
    # Set model to "train" mode (not needed here, but a good practice)
    model.train()

    # Run all epochs
    for epoch in range(1, num_epochs+1):

        # Initialize epoch loss (cumulative loss for all batches)
        epoch_loss = 0.0

        with tqdm(total=len(loader_train)) as pbar:

            for X_batch, y_batch in loader_train:

                X_batch, y_batch = X_batch.to(device), y_batch.to(device)

                # Use the model passed as argument (not the global variable `classifier`)
                probs = model(X_batch)

                # Calculate loss
                loss = criterion(probs, y_batch)

                ### Pytorch magic! ###
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Keep track of overall epoch loss
                epoch_loss += loss.item()

                pbar.update(1)
                
        losses.append(epoch_loss)
        
        f1_tr = evaluate(model, loader_train)
        f1_te = evaluate(model, loader_test)
        
        f1_train.append(f1_tr)
        f1_test.append(f1_te)

        print("Loss:\t{:.3f}, f1 train: {:.3f}, f1 test: {:.3f} (epoch {})".format(epoch_loss, f1_tr, f1_te, epoch))
        
    return losses, f1_train, f1_test        
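To see why the gradients have to be reset for each batch, consider the following minimal sketch with a single trainable weight. PyTorch adds newly computed gradients to the ones already stored, so without a reset the gradients of consecutive batches would accumulate. The variable names are hypothetical, and the in-place `w.grad.zero_()` is used here as a stand-in that has the same effect for this example as `optimizer.zero_grad()`.

```python
import torch

# A single trainable weight and a trivial "loss" whose gradient w.r.t. w is 2.0
w = torch.tensor([1.0], requires_grad=True)

(2.0 * w).sum().backward()
grad_first = w.grad.clone()

# A second backward pass without a reset adds to the stored gradient
(2.0 * w).sum().backward()
grad_accumulated = w.grad.clone()

# After resetting the gradient, the next backward pass starts fresh
w.grad.zero_()
(2.0 * w).sum().backward()
grad_after_reset = w.grad.clone()

print(grad_first.item(), grad_accumulated.item(), grad_after_reset.item())
```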

Before we can actually train the model, we need to instantiate the `criterion` (i.e., the loss function) and the `optimizer`. Conveniently, PyTorch provides the Binary Cross Entropy loss as the layer [`nn.BCELoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html). For the optimizer, we pick the widely used [`Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer. Adam is an extension of Stochastic Gradient Descent that is used across many deep learning applications, such as computer vision and natural language processing. It was first introduced in 2014, and its name is derived from "adaptive moment estimation": the optimizer uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.

In [None]:
# Define optimizer
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.01)
# Define loss function
criterion = nn.BCELoss()
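To connect `nn.BCELoss` with the Binary Cross Entropy formula, the sketch below compares the built-in loss with a manual computation on three made-up probabilities and targets. With its default settings, `nn.BCELoss` returns the mean over all samples in the batch.

```python
import torch
import torch.nn as nn

# Hypothetical predicted probabilities and ground-truth labels for 3 samples
probs_demo = torch.tensor([0.9, 0.2, 0.6])
targets_demo = torch.tensor([1.0, 0.0, 1.0])

# Built-in Binary Cross Entropy (mean over the batch by default)
loss_builtin = nn.BCELoss()(probs_demo, targets_demo)

# Manual computation: -[y * log(p) + (1 - y) * log(1 - p)], averaged over the batch
loss_manual = -(targets_demo * torch.log(probs_demo)
                + (1 - targets_demo) * torch.log(1 - probs_demo)).mean()

print(loss_builtin.item(), loss_manual.item())
```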

Now we finally have everything in place to train the model. For this, we only need to call the `train()` method in the code cell below. Note that you can run the code cell multiple times to continue the training for a further 10 epochs. Each epoch will print 3 progress bars:

* training over training set

* evaluating over training set

* evaluating over test set

After each epoch, a print statement will show the current loss as well as the latest f1 scores for the training and test set.


In [None]:
num_epochs = 10

losses, f1_train, f1_test = train(classifier, loader_train, loader_test, optimizer, criterion, num_epochs)

Since the method `train()` returns the losses and f1 scores for each epoch, we can use this data to visualize how the loss and the f1 scores change over time, i.e., after each epoch. The code cell below creates the corresponding plot.

In [None]:
x = list(range(1, len(losses)+1))

# Convert losses to numpy array
losses = np.asarray(losses)
# Normalize losses so they match the scale in the plot (we are only interested in the trend of the losses!)
losses = losses/np.max(losses)
plt.figure()

plt.plot(x, losses, lw=3)
plt.plot(x, f1_train, lw=3)
plt.plot(x, f1_test, lw=3)

font_axes = {'family':'serif','color':'black','size':16}

plt.gca().set_xticks(x)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.xlabel("Epoch", fontdict=font_axes)
plt.ylabel("F1 Score", fontdict=font_axes)
plt.legend(['Loss', 'F1 (train)', 'F1 (test)'], loc='lower left', fontsize=16)
plt.tight_layout()
plt.show()

From the plot, we can observe several things

* The loss goes down! This essentially means that our model is learning. This can be very useful as an incorrectly implemented model may not throw an error but not train properly (i.e., the loss not going down). So it's a good sanity check when implementing and using a new model.

* The f1 score over the training data is getting close to 1.0 (at least after more epochs). This means that the model learns to correctly predict the label for all training samples. Of course, this is not what we are interested in but (a) again tells us that the model is generally learning, and (b) this might give us insights if our model is overfitting.

* The f1 score over the test data quickly plateaus after very few epochs.

**Important:** This plot showing the trends for the loss and f1 scores (or other metrics) can look very different depending on the dataset, the network architecture, and the hyperparameters. While in this simple example all trends seem to smoothly converge, this is not the case in general. For example, the f1 score for the test data might see a much steeper drop after peaking, which would even more clearly indicate the model is starting to overfit.


---

### Summary

Logistic Regression is a foundational algorithm in machine learning, primarily used for binary classification tasks, including text document classification. In the context of text analysis, Logistic Regression operates by mapping text features—such as word frequencies, TF-IDF values, or word embeddings—into a probability score representing the likelihood of a document belonging to a specific class.

For text document classification, each document undergoes a transformation into a numerical feature representation. This transformation captures the essence of the text, enabling Logistic Regression to learn a linear decision boundary that separates different classes based on these features. During training, Logistic Regression learns the weights (coefficients) for each feature and a bias term. It fits a linear model to distinguish between classes, utilizing the sigmoid function to produce probability scores. These probabilities are then thresholded to assign documents to their respective classes.

Logistic Regression's strengths lie in its simplicity, interpretability, and computational efficiency. It serves as a baseline model for text classification tasks, especially when the relationships between features and classes are relatively linear or when there's a need for probabilistic interpretations of classification results. Overall, Logistic Regression provides a straightforward and effective means of predicting class probabilities for text documents based on their textual features, making it a valuable tool in tasks such as sentiment analysis, document categorization, and spam filtering in natural language processing applications.
