# Sentiment Analysis of text documents using TbNB

This example shows how TbNB can be used to classify documents by sentiment using a Bag of Words (BoW) approach. The demo encodes the features as a sparse document-term matrix and walks through the procedure for training and using an (iterative) Threshold-Based Naive Bayes model.

#### Setup

In [1]:
import time

import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.pipeline import Pipeline
from TbNB import TbNB

#### Import data and perform train/test split

Here we use the Yelp polarity dataset, a binary sentiment dataset of reviews. We draw a random sample of 100,000 reviews from the training split and keep the full test split (38,000 reviews), then separate the review text (independent variable) from the labels (dependent variable).

In [2]:
dataset = load_dataset("yelp_polarity")
train = dataset["train"].shuffle(seed=42).select(range(100_000))
test = dataset["test"]


In [3]:
X_train_text = train["text"]
y_train = np.array(train["label"])
X_test_text = test["text"]
y_test = np.array(test["label"])


#### Text Vectorization

We use scikit-learn's CountVectorizer to remove stopwords and build a BoW document-term matrix recording which words occur in each document (with binary=False, as in the cell below, the entries are occurrence counts rather than 0/1 indicators). The vectorizer is fitted on the training data only and is then used to transform both the training and test data. As the output shows, CountVectorizer returns a SciPy sparse matrix by default, a format well suited to BoW data because it stores only the non-zero entries, which keeps memory use low and computations fast. TbNB also accepts other formats for X, such as numpy.ndarray or pandas DataFrame; these are converted internally into sparse matrices.

In [4]:
vectorizer = CountVectorizer(binary=False, stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

In [5]:
type(X_train)

scipy.sparse._csr.csr_matrix
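
To make the difference between count and presence/absence encodings concrete, here is a small sketch on a made-up three-document corpus. The documents are illustrative only and are not taken from the dataset; the `binary` parameter is the same CountVectorizer option used above.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
docs = ["good good food", "bad service", "good service"]

counts = CountVectorizer(binary=False).fit_transform(docs)
presence = CountVectorizer(binary=True).fit_transform(docs)

# Both outputs are SciPy sparse matrices with one row per document.
print(sparse.issparse(counts), counts.shape)

# The vocabulary is sorted alphabetically: bad, food, good, service.
print(counts.toarray()[0])    # "good" is counted twice in the first document
print(presence.toarray()[0])  # binary=True caps every non-zero entry at 1
```

Which encoding is appropriate depends on the model: count-based models such as MultinomialNB use the counts, while presence-based models only need to know whether a word occurs.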

#### Initialize and train the TbNB Model

We instantiate the model with iterative=True, so that calling fit automatically estimates the class priors and runs the iterative optimization algorithm described in Romano, M., Zammarchi, G., & Conversano, C. (2024). The .fit() method returns the fitted model, so a call to .predict() can be chained directly onto it.


In [6]:
model = TbNB(iterative=True)
y_pred = model.fit(X_train, y_train).predict(X_test)
print("Predicted labels:", y_pred)

Predicted labels: [0 0 1 ... 0 0 0]
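
For intuition about what a threshold-based Naive Bayes classifier does, the sketch below implements a deliberately simplified version in NumPy: each word contributes a log-likelihood ratio depending on whether it is present or absent, the contributions are summed into a document score, and the score is compared against a threshold chosen to maximise training accuracy. This is an assumption-laden toy, not the TbNB package's implementation and not the iterative procedure from the paper; the function names `fit_threshold_nb` and `predict_threshold_nb`, the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
import numpy as np

def fit_threshold_nb(X, y, alpha=1.0):
    """Toy threshold-based Naive Bayes on a dense 0/1 matrix (illustrative only)."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    # Laplace-smoothed probability that each word appears in a positive / negative document.
    p_pos = (pos.sum(axis=0) + alpha) / (len(pos) + 2 * alpha)
    p_neg = (neg.sum(axis=0) + alpha) / (len(neg) + 2 * alpha)
    # Log-likelihood ratio contributed by a word when it is present and when it is absent.
    present = np.log(p_pos / p_neg)
    absent = np.log((1 - p_pos) / (1 - p_neg))
    scores = X @ present + (1 - X) @ absent
    # Candidate thresholds: below all scores, midpoints between distinct scores, above all scores.
    s = np.unique(scores)
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2, [s[-1] + 1.0]))
    # Keep the candidate with the highest training accuracy.
    accs = [np.mean((scores > t).astype(int) == y) for t in candidates]
    tau = candidates[int(np.argmax(accs))]
    return present, absent, tau

def predict_threshold_nb(X, present, absent, tau):
    """Label a document positive (1) when its summed score exceeds tau."""
    X = (np.asarray(X) > 0).astype(float)
    return ((X @ present + (1 - X) @ absent) > tau).astype(int)

# Toy data: 4 documents, 3 words; the first word appears only in positive documents.
X_toy = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
y_toy = np.array([1, 1, 0, 0])
present, absent, tau = fit_threshold_nb(X_toy, y_toy)
print(predict_threshold_nb(X_toy, present, absent, tau))
```

Selecting the threshold on the training data keeps the sketch short; a real implementation would more plausibly tune it with cross-validation or held-out data.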


#### Evaluate Performance and post-hoc analysis

We can evaluate the model's accuracy and other metrics using standard scikit-learn functions, as well as inspect learned attributes using the TbNB class.

In [7]:
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.80      0.91      0.85     19000
    Positive       0.90      0.77      0.83     19000

    accuracy                           0.84     38000
   macro avg       0.85      0.84      0.84     38000
weighted avg       0.85      0.84      0.84     38000

Confusion matrix:
 [[17378  1622]
 [ 4407 14593]]
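
As a sanity check, the figures in the classification report can be recomputed directly from the confusion matrix above. The sketch relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[17378, 1622],
               [4407, 14593]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct predictions / true members of each class
precision = cm.diagonal() / cm.sum(axis=0)  # correct predictions / predicted members of each class
accuracy = cm.trace() / cm.sum()

for name, p, r in zip(["Negative", "Positive"], precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The values agree with the report: the model recovers most negative reviews (recall 0.91) but misses 4,407 of the 19,000 positive ones (recall 0.77).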


#### Inspect
Once the model is trained, its learned parameters can be accessed as attributes using dot notation.

In [8]:
print("Decision threshold (threshold_):", model.threshold_)
print("Number of features (words):", model.n_features_in_)
print("Review scores (lambda_scores_):")
print(model.lambda_scores_[:10])


Decision threshold (threshold_): 5.400460149305644
Number of features (words): 95365
Review scores (lambda_scores_):
[-0.85856938 -0.90751354 -0.68476772 -0.68476772  0.70148664 -0.68476772
  0.70148664 -0.68476772  0.70148664 -0.68476772]


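#### Benchmark

We compare TbNB and its iterative variant (iTbNB) against scikit-learn's BernoulliNB, MultinomialNB, and ComplementNB on unigram features ("Simple") and on unigram plus bigram features ("N_grams"), recording accuracy, F1-score, and training and prediction times.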



In [9]:
results = []

def benchmark_model(name, clf, X_train, X_test, y_train, y_test, variant):
    """Run a benchmark and append the results to the global list."""
    
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - t0

    t0 = time.time()
    preds = clf.predict(X_test)
    pred_time = time.time() - t0

    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="binary")


    results.append({
        "Vectorizer": variant,
        "Model": name,
        "Accuracy": acc,
        "F1-score": f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": pred_time
    })


vectorizers = {
    "Simple": CountVectorizer(binary=False, stop_words="english"),
    "N_grams": CountVectorizer(binary=False, stop_words="english", ngram_range=(1,2))
}

for variant, vectorizer in vectorizers.items():

    print(f"\n==============================")
    print(f"  Running Vectorizer: {variant}")
    print(f"==============================")

    X_train_vec = vectorizer.fit_transform(X_train_text)
    X_test_vec = vectorizer.transform(X_test_text)

    benchmark_model(
        f"TbNB",
        TbNB(iterative=False),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )
    
    benchmark_model(
        f"iTbNB",
        TbNB(iterative=True),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )

    benchmark_model(
        "BernoulliNB",
        BernoulliNB(),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )

    benchmark_model(
        "MultinomialNB",
        MultinomialNB(),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )

    benchmark_model(
        "ComplementNB",
        ComplementNB(),
        X_train_vec,
        X_test_vec,
        y_train,
        y_test,
        variant
    )



df = pd.DataFrame(results).sort_values(by=["Vectorizer", "Accuracy"], ascending=[True, False])
print("\n\n=== FINAL RESULTS ===")
print(df.to_string(index=False))



  Running Vectorizer: Simple

  Running Vectorizer: N_grams


=== FINAL RESULTS ===
Vectorizer         Model  Accuracy  F1-score  Train Time (s)  Predict Time (s)
   N_grams MultinomialNB  0.898211  0.896760        0.163508          0.050890
   N_grams  ComplementNB  0.898184  0.896758        0.232112          0.104144
   N_grams          TbNB  0.875237  0.872790        2.416957          0.061976
   N_grams         iTbNB  0.870000  0.857997        2.202677          0.101810
   N_grams   BernoulliNB  0.806816  0.830934        0.294372          0.152062
    Simple  ComplementNB  0.875079  0.875964        0.028610          0.015275
    Simple MultinomialNB  0.874947  0.875791        0.021799          0.018259
    Simple          TbNB  0.842789  0.834065        0.279139          0.005782
    Simple         iTbNB  0.841342  0.828795        0.630690          0.010306
    Simple   BernoulliNB  0.806789  0.822777        0.076596          0.027639
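
Many of the timings above are well under a second and come from a single run, so they are sensitive to noise. One way to make such measurements steadier is to use `time.perf_counter()` and repeat each run on a fresh estimator, keeping the minimum. The helper `time_fit_predict` and the stand-in `MajorityClassifier` below are hypothetical names introduced for illustration and are not part of the benchmark above.

```python
import time
from collections import Counter

class MajorityClassifier:
    """Hypothetical stand-in estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)

def time_fit_predict(make_clf, X_train, y_train, X_test, repeats=5):
    """Return the minimum fit and predict times (in seconds) over several fresh runs."""
    fit_times, pred_times = [], []
    for _ in range(repeats):
        clf = make_clf()  # fresh estimator for every repetition
        t0 = time.perf_counter()
        clf.fit(X_train, y_train)
        fit_times.append(time.perf_counter() - t0)
        t0 = time.perf_counter()
        clf.predict(X_test)
        pred_times.append(time.perf_counter() - t0)
    return min(fit_times), min(pred_times)

fit_s, pred_s = time_fit_predict(MajorityClassifier, [[0], [1], [1]], [0, 1, 1], [[0], [1]])
print(f"fit: {fit_s:.6f}s, predict: {pred_s:.6f}s")
```

Any object with `fit` and `predict` methods can be passed through a zero-argument factory, for example `lambda: TbNB(iterative=True)` with the vectorized matrices used in the benchmark.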


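The same benchmark is then repeated after cleaning the text with a separate preprocessing step (lowercasing, contraction expansion, punctuation and stopword removal, stemming) instead of relying on CountVectorizer's built-in stopword list.

In [14]: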
import time
from preprocessing.nltk_pipeline import TextPreprocessor
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB

results = []

def benchmark_model(name, pipeline, X_train, X_test, y_train, y_test, variant):
    """Run a benchmark and append the results to the global list."""

    # Train
    t0 = time.time()
    pipeline.fit(X_train, y_train)
    train_time = time.time() - t0

    # Predict
    t0 = time.time()
    preds = pipeline.predict(X_test)
    pred_time = time.time() - t0

    # Metrics
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="binary")

    results.append({
        "Vectorizer": variant,
        "Model": name,
        "Accuracy": acc,
        "F1-score": f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": pred_time
    })

vectorizers = {
    "Simple": CountVectorizer(binary=False),
    "N_grams": CountVectorizer(binary=False, ngram_range=(1,2)),
}


preprocessor = TextPreprocessor(
    language="english",
    remove_html=False,
    remove_urls=False,
    lower=True,
    expand_contr=True,
    remove_punct=True,
    remove_sw=True,
    stem=True
)

X_train_clean = preprocessor.fit_transform(X_train_text)
X_test_clean  = preprocessor.transform(X_test_text)


for variant, vectorizer in vectorizers.items():

    print(f"\n==============================")
    print(f"  Running Vectorizer: {variant}")
    print(f"==============================")

    X_train_vec = vectorizer.fit_transform(X_train_clean)
    X_test_vec = vectorizer.transform(X_test_clean)

    benchmark_model("TbNB", TbNB(iterative=False),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("iTbNB", TbNB(iterative=True),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("BernoulliNB", BernoulliNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("MultinomialNB", MultinomialNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("ComplementNB", ComplementNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)



df = pd.DataFrame(results).sort_values(by=["Vectorizer", "Accuracy"], ascending=[True, False])
print("\n\n=== FINAL RESULTS ===")
print(df.to_string(index=False))



  Running Vectorizer: Simple

  Running Vectorizer: N_grams


=== FINAL RESULTS ===
Vectorizer         Model  Accuracy  F1-score  Train Time (s)  Predict Time (s)
   N_grams MultinomialNB  0.897868  0.896393        0.257513          0.113158
   N_grams  ComplementNB  0.897842  0.896381        0.350720          0.098244
   N_grams          TbNB  0.863237  0.852236        2.072327          0.081712
   N_grams         iTbNB  0.854289  0.837066        2.542964          0.133765
   N_grams   BernoulliNB  0.805105  0.828271        0.347211          0.136305
    Simple MultinomialNB  0.871500  0.872042        0.025441          0.011376
    Simple  ComplementNB  0.871289  0.871880        0.041409          0.012785
    Simple          TbNB  0.821342  0.818277        0.367823          0.008380
    Simple         iTbNB  0.820816  0.819145        0.697839          0.005453
    Simple   BernoulliNB  0.791789  0.809165        0.063587          0.027049
