# Sentiment Analysis of text documents using TbNB

This example shows how TbNB can be used to classify documents by sentiment using a Bag of Words (BoW) approach. The demo encodes the features as a sparse document-term matrix and demonstrates how to train and use an (iterative) Threshold-Based Naive Bayes model.

#### Setup

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from TbNB import TbNB  
import numpy as np 
from datasets import load_dataset
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
import time

### Import data and perform train/test split

Here we use the IMDB sentiment dataset, which contains labelled reviews. We split it into training and test sets (25k samples each) and separate the independent variable (the review text) from the dependent variable (the sentiment label).

In [4]:
dataset = load_dataset("imdb")
train = dataset["train"].shuffle(seed=42)
test = dataset["test"]


In [5]:
X_train_text = train["text"]
y_train = np.array(train["label"])
X_test_text = test["text"]
y_test = np.array(test["label"])


#### Text Vectorization

We use scikit-learn's CountVectorizer to remove stop words and build a BoW document-term matrix. With binary=False, as set in the cell below, each entry is a term count; setting binary=True would record only word presence/absence within each document. The vectorizer is fitted on the training data and then used to transform both training and test data. As the output indicates, CountVectorizer returns a SciPy sparse matrix by default, a format well suited to BoW data because only nonzero entries are stored, which keeps computations fast. TbNB also accepts other formats for X, such as numpy.ndarray or a pandas DataFrame; these are converted internally into sparse matrices.

In [6]:
vectorizer = CountVectorizer(binary=False, stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

In [7]:
type(X_train)

scipy.sparse._csr.csr_matrix

#### Initialize and train the TbNB Model

We instantiate the model with iterative=True, so calling fit automatically estimates the class priors and applies the iterative optimization algorithm described in Romano, M., Zammarchi, G., & Conversano, C. (2024). The .fit() method returns the fitted model, so it can be chained directly with .predict().


In [8]:
model = TbNB(iterative=True)
y_pred = model.fit(X_train, y_train).predict(X_test)
print("Predicted labels:", y_pred)

Predicted labels: [0 0 0 ... 1 0 0]
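
For intuition, the general idea of a threshold-based Naive Bayes classifier can be sketched on a toy matrix: each word gets a log-likelihood-ratio score, a document's score is the sum over the words it contains, and the label depends on whether that sum reaches a threshold. This is a simplified illustration under stated assumptions, not the TbNB package's implementation; the published method may differ in scoring, smoothing, and threshold selection. The helper names `fit_word_scores` and `pick_threshold` are hypothetical.

```python
import numpy as np

def fit_word_scores(X, y, alpha=1.0):
    """Smoothed log-likelihood ratio log[P(w|pos) / P(w|neg)] for each word."""
    X = (np.asarray(X) > 0).astype(float)   # reduce counts to presence/absence
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    p_pos = (pos.sum(axis=0) + alpha) / (pos.shape[0] + 2 * alpha)
    p_neg = (neg.sum(axis=0) + alpha) / (neg.shape[0] + 2 * alpha)
    return np.log(p_pos) - np.log(p_neg)

def pick_threshold(doc_scores, y):
    """Return the observed score that maximizes training accuracy as cut-off."""
    candidates = np.unique(doc_scores)
    accuracies = [np.mean((doc_scores >= t) == (y == 1)) for t in candidates]
    return candidates[int(np.argmax(accuracies))]

# Toy data (assumed). Columns: "great", "film", "plot", "awful".
X_toy = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 1],
                  [0, 1, 0, 1]])
y_toy = np.array([1, 1, 0, 0])

word_scores = fit_word_scores(X_toy, y_toy)
doc_scores = X_toy @ word_scores          # sum of scores of the words present
tau = pick_threshold(doc_scores, y_toy)
toy_pred = (doc_scores >= tau).astype(int)
print("word scores:", word_scores)
print("threshold:", tau)
print("predictions:", toy_pred)
```

In this sketch the threshold is tuned on the training scores themselves; a more careful procedure would select it on held-out data.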


#### Evaluate Performance and post-hoc analysis

We can evaluate the model's accuracy and other metrics using standard scikit-learn functions, and inspect the learned attributes through the TbNB class.

In [9]:
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.84      0.82      0.83     12500
    Positive       0.83      0.85      0.84     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000

Confusion matrix:
 [[10306  2194]
 [ 1901 10599]]
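
As a sanity check, the headline metrics can be recomputed by hand from the confusion matrix above. The sketch below relies on scikit-learn's convention that rows are true classes and columns are predicted classes, with labels in sorted order (0 = Negative, 1 = Positive).

```python
# Confusion matrix reported above: [[TN, FP], [FN, TP]].
tn, fp = 10306, 2194
fn, tp = 1901, 10599

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of true positives, how many are recovered
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy      = {accuracy:.4f}")
print(f"precision (+) = {precision_pos:.4f}")
print(f"recall (+)    = {recall_pos:.4f}")
print(f"F1 (+)        = {f1_pos:.4f}")
```

Rounded to two decimals, these values agree with the Positive row and the accuracy line of the classification report.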


### Inspect
Once the model is trained, its learned parameters can be accessed as attributes using dot notation.

In [10]:
print("Decision threshold (threshold_):", model.threshold_)
print("Number of features (words):", model.n_features_in_)
print("Review scores (lambda_scores_):")
print(model.lambda_scores_[:10])


Decision threshold (threshold_): -2.6851164191490824
Number of features (words): 74538
Review scores (lambda_scores_):
[-0.36425597 -0.34267913 -0.69306718 -1.09845229 -0.69306718  0.69306718
 -1.38605435  0.69306718  0.69306718  0.87490845]
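
If the score array has one entry per vocabulary term, aligned with the vectorizer's column order (an assumption; the source does not state the shape or meaning of `lambda_scores_`), the most influential words could be listed by sorting. The sketch below uses made-up stand-in arrays rather than the fitted model, so the names and numbers are purely illustrative.

```python
import numpy as np

# Stand-ins (assumed) for vectorizer.get_feature_names_out() and a
# per-word score array; the values are invented for illustration.
feature_names = np.array(["awful", "boring", "film", "great", "wonderful"])
scores = np.array([-1.39, -0.69, 0.02, 0.87, 1.10])

order = np.argsort(scores)                           # ascending by score
most_negative = feature_names[order[:2]].tolist()
most_positive = feature_names[order[::-1][:2]].tolist()

print("Most negative words:", most_negative)
print("Most positive words:", most_positive)
```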


#### Benchmark



In [11]:
results = []

def benchmark_model(name, clf, X_train, X_test, y_train, y_test, variant):
    """Run a benchmark and append the metrics to the global `results` list."""
    
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - t0

    t0 = time.time()
    preds = clf.predict(X_test)
    pred_time = time.time() - t0

    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="binary")


    results.append({
        "Vectorizer": variant,
        "Model": name,
        "Accuracy": acc,
        "F1-score": f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": pred_time
    })


vectorizers = {
    "Simple": CountVectorizer(binary=False, stop_words="english"),
    "N_grams": CountVectorizer(binary=False, stop_words="english", ngram_range=(1,2))
}

for variant, vectorizer in vectorizers.items():

    print(f"\n==============================")
    print(f"  Running Vectorizer: {variant}")
    print(f"==============================")

    X_train_vec = vectorizer.fit_transform(X_train_text)
    X_test_vec = vectorizer.transform(X_test_text)

    benchmark_model(
        "TbNB",
        TbNB(iterative=False),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )
    
    benchmark_model(
        "iTbNB",
        TbNB(iterative=True),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )

    benchmark_model(
        "BernoulliNB",
        BernoulliNB(),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )

    benchmark_model(
        "MultinomialNB",
        MultinomialNB(),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )

    benchmark_model(
        "ComplementNB",
        ComplementNB(),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )



df = pd.DataFrame(results).sort_values(by=["Vectorizer", "Accuracy"], ascending=[True, False])
print("\n\n=== FINAL RESULTS ===")
print(df.to_string(index=False))



  Running Vectorizer: Simple

  Running Vectorizer: N_grams


=== FINAL RESULTS ===
Vectorizer         Model  Accuracy  F1-score  Train Time (s)  Predict Time (s)
   N_grams         iTbNB   0.85932  0.859853        0.759851          0.030522
   N_grams MultinomialNB   0.85300  0.846151        0.070803          0.040640
   N_grams  ComplementNB   0.85300  0.846151        0.081839          0.058662
   N_grams          TbNB   0.85224  0.860835        0.798481          0.030292
   N_grams   BernoulliNB   0.81460  0.788886        0.103747          0.144499
    Simple         iTbNB   0.83620  0.838097        0.234481          0.006187
    Simple          TbNB   0.83584  0.836767        0.119916          0.007392
    Simple MultinomialNB   0.83192  0.822895        0.011478          0.016750
    Simple  ComplementNB   0.83192  0.822895        0.017665          0.015688
    Simple   BernoulliNB   0.81540  0.800312        0.029250          0.028783
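
The timings above come from a single `time.time()` measurement per call, and several are only a few hundredths of a second, so they can be noisy. One possible refinement, sketched here as an assumption rather than as part of the original benchmark, is to repeat each call and keep the best time using `time.perf_counter()`. The helper name `time_call` is hypothetical.

```python
import time

def time_call(fn, *args, repeats=5):
    """Call fn(*args) `repeats` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()          # monotonic, high-resolution clock
        result = fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best, result

# Demonstration on a cheap built-in call.
elapsed, total = time_call(sum, range(100_000))
print(f"best of 5 runs: {elapsed:.6f} s")
```

In the benchmark, `time_call(clf.fit, X_train, y_train)` would refit the classifier on every repetition, which is acceptable for timing but multiplies the total run time.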


In [13]:
import time
from preprocessing.nltk_pipeline import TextPreprocessor
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB

results = []

def benchmark_model(name, pipeline, X_train, X_test, y_train, y_test, variant):
    """Run a benchmark and append the metrics to the global `results` list."""

    # Train
    t0 = time.time()
    pipeline.fit(X_train, y_train)
    train_time = time.time() - t0

    # Predict
    t0 = time.time()
    preds = pipeline.predict(X_test)
    pred_time = time.time() - t0

    # Metrics
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="binary")

    results.append({
        "Vectorizer": variant,
        "Model": name,
        "Accuracy": acc,
        "F1-score": f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": pred_time
    })


vectorizers = {
    "Simple": CountVectorizer(binary=False),
    "N_grams": CountVectorizer(binary=False, ngram_range=(1,2), min_df=3),
}


preprocessor = TextPreprocessor(
    language="english",
    remove_html=False,
    remove_urls=False,
    lower=True,
    expand_contr=True,
    remove_punct=True,
    remove_sw=True,
    stem=False
)

X_train_clean = preprocessor.fit_transform(X_train_text)
X_test_clean  = preprocessor.transform(X_test_text)


for variant, vectorizer in vectorizers.items():

    print(f"\n==============================")
    print(f"  Running Vectorizer: {variant}")
    print(f"==============================")

    X_train_vec = vectorizer.fit_transform(X_train_clean)
    X_test_vec = vectorizer.transform(X_test_clean)

    benchmark_model("TbNB", TbNB(iterative=False),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("iTbNB", TbNB(iterative=True),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("BernoulliNB", BernoulliNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("MultinomialNB", MultinomialNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("ComplementNB", ComplementNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)



df = pd.DataFrame(results).sort_values(by=["Vectorizer", "Accuracy"], ascending=[True, False])
print("\n\n=== FINAL RESULTS ===")
print(df.to_string(index=False))



  Running Vectorizer: Simple

  Running Vectorizer: N_grams


=== FINAL RESULTS ===
Vectorizer         Model  Accuracy  F1-score  Train Time (s)  Predict Time (s)
   N_grams MultinomialNB   0.86612  0.862597        0.023634          0.035228
   N_grams  ComplementNB   0.86612  0.862597        0.021204          0.026975
   N_grams   BernoulliNB   0.86248  0.857557        0.055611          0.036396
   N_grams          TbNB   0.86200  0.856894        0.200127          0.010080
   N_grams         iTbNB   0.85876  0.851832        0.244095          0.013560
    Simple          TbNB   0.84032  0.842064        0.132292          0.007102
    Simple         iTbNB   0.84008  0.842784        0.167935          0.008002
    Simple MultinomialNB   0.83520  0.826629        0.022400          0.077565
    Simple  ComplementNB   0.83520  0.826629        0.024256          0.031025
    Simple   BernoulliNB   0.81868  0.804452        0.102576          0.184468

