# Sentiment Analysis of Text Documents Using TbNB

This example shows how TbNB can classify documents by sentiment using a Bag of Words (BoW) approach. It encodes the features as a sparse document-term matrix and walks through the procedure for training and using an (iterative) Threshold-Based Naive Bayes model.

#### Setup

In [1]:
import time

import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB

from TbNB import TbNB

#### Import data and perform train/test split

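Here we use the Amazon Polarity sentiment dataset of product reviews. We draw a random sample of 200,000 reviews from the training split (shuffled with a fixed seed) and keep the full test split of 400,000 reviews. For each split we separate the review text (independent variable) from the sentiment label (dependent variable).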

In [2]:
dataset = load_dataset("amazon_polarity")
train = dataset["train"].shuffle(seed=42).select(range(200000)) 
test = dataset["test"]

X_train_text = train["content"]
y_train = np.array(train["label"])
X_test_text = test["content"]
y_test = np.array(test["label"])




#### Text Vectorization

We use scikit-learn's CountVectorizer to remove stopwords and build a BoW document-term matrix. Note that the cell below passes binary=False, so the entries are raw term counts; pass binary=True if a strict word presence/absence encoding is wanted. The vectorizer is fitted on the training data only and then used to transform both the training and the test data. As the output shows, CountVectorizer returns a SciPy sparse matrix by default, a format well suited to BoW data because it stores only the nonzero entries and keeps computations fast. TbNB also accepts other formats for X, such as numpy.ndarray or a pandas DataFrame; these are converted internally into sparse matrices.

In [3]:
vectorizer = CountVectorizer(binary=False, stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

In [4]:
type(X_train)

scipy.sparse._csr.csr_matrix
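
To make the difference between a count encoding and a presence/absence encoding concrete, here is a minimal dependency-free sketch. The helper `bag_of_words` is hypothetical and uses plain whitespace tokenization, so it only approximates what CountVectorizer does (no stopword removal, no sparse storage).

```python
from collections import Counter


def bag_of_words(docs, binary=False):
    """Build a dense document-term matrix from whitespace-tokenized documents.

    Hypothetical helper for illustration; it is not part of scikit-learn or TbNB.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in the corpus.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        if binary:
            # Presence/absence: 1 if the word occurs at least once.
            rows.append([1 if counts[w] > 0 else 0 for w in vocab])
        else:
            # Raw term counts.
            rows.append([counts[w] for w in vocab])
    return vocab, rows


docs = ["good good product", "bad product"]
vocab, counts = bag_of_words(docs, binary=False)
_, presence = bag_of_words(docs, binary=True)

print(vocab)     # ['bad', 'good', 'product']
print(counts)    # [[0, 2, 1], [1, 0, 1]]
print(presence)  # [[0, 1, 1], [1, 0, 1]]
```

The two matrices differ only where a word repeats within a document, which is exactly what the `binary` flag of CountVectorizer controls.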

#### Initialize and train the TbNB Model

We instantiate the model with iterative=True, so calling fit automatically estimates the class priors and applies the iterative optimization algorithm described in Romano, M., Zammarchi, G., & Conversano, C. (2024). Because .fit() returns the fitted model, the call to .predict() can be chained directly onto it.


In [5]:
model = TbNB(iterative=True)
y_pred = model.fit(X_train, y_train).predict(X_test)
print("Predicted labels:", y_pred)

Predicted labels: [1 1 0 ... 0 1 0]
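
The general idea of a threshold-based classifier, and the reason `fit(...).predict(...)` can be chained, can be sketched with a toy class. This is an illustration under simplifying assumptions, not the TbNB package's implementation nor the exact algorithm of the cited paper: `ToyThresholdNB` is a hypothetical name, word scores are Laplace-smoothed log-ratios of word presence, and the threshold is simply the candidate that maximizes training accuracy.

```python
import math


class ToyThresholdNB:
    """Illustrative threshold-based scorer; NOT the TbNB implementation."""

    def fit(self, docs, labels):
        pos = [set(d.split()) for d, y in zip(docs, labels) if y == 1]
        neg = [set(d.split()) for d, y in zip(docs, labels) if y == 0]
        vocab = set().union(*pos, *neg)
        # Laplace-smoothed log-ratio of word presence in positive vs negative docs.
        self.word_scores_ = {
            w: math.log((sum(w in d for d in pos) + 1) / (len(pos) + 2))
            - math.log((sum(w in d for d in neg) + 1) / (len(neg) + 2))
            for w in vocab
        }
        # Candidate thresholds: midpoints between consecutive document scores.
        scores = sorted({self.score(d) for d in docs})
        candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])] or [0.0]

        def n_correct(t):
            return sum((self.score(d) > t) == (y == 1) for d, y in zip(docs, labels))

        self.threshold_ = max(candidates, key=n_correct)
        return self  # returning self is what allows fit(...).predict(...)

    def score(self, doc):
        # A document's score is the sum of the scores of the distinct words it contains.
        return sum(self.word_scores_.get(w, 0.0) for w in set(doc.split()))

    def predict(self, docs):
        return [1 if self.score(d) > self.threshold_ else 0 for d in docs]


train_docs = [
    "great product love it",
    "love this great value",
    "terrible product broke",
    "awful terrible waste",
]
train_labels = [1, 1, 0, 0]

model_toy = ToyThresholdNB().fit(train_docs, train_labels)
preds_toy = model_toy.predict(["love it great", "terrible waste"])
print(preds_toy)  # [1, 0]
```

A document is labeled positive when its summed word score exceeds the learned threshold, so the threshold acts as a tunable decision boundary rather than a fixed cut at zero.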


#### Evaluate Performance and post-hoc analysis

We can evaluate the model's accuracy and other metrics using standard scikit-learn functions, and inspect the learned attributes exposed by the TbNB class.

In [6]:
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.80      0.86      0.83    200000
    Positive       0.85      0.79      0.82    200000

    accuracy                           0.82    400000
   macro avg       0.83      0.82      0.82    400000
weighted avg       0.83      0.82      0.82    400000

Confusion matrix:
 [[172459  27541]
 [ 42649 157351]]
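
As a sanity check, the headline metrics can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels, with label 0 = Negative first) and treats Positive as the class of interest.

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions.
tn, fp = 172459, 27541   # true Negative row
fn, tp = 42649, 157351   # true Positive row

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are recovered
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8245 precision=0.8510 recall=0.7868 f1=0.8176
```

These values agree with the rounded figures in the classification report for the Positive class.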


#### Inspect

Once the model is trained, its learned parameters can be accessed as attributes using dot notation.

In [7]:
print("Decision threshold (threshold_):", model.threshold_)
print("Number of features (words):", model.n_features_in_)
print("Review scores (lambda_scores_):")
print(model.lambda_scores_[:10])


Decision threshold (threshold_): 0.6318971818258774
Number of features (words): 167178
Review scores (lambda_scores_):
[-0.90055648 -0.00290313  0.6990669  -0.68719737 -0.68719737  0.69907699
 -0.68719737 -0.68719737 -0.68719737 -1.09265251]


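#### Benchmark

We compare TbNB (iterative=False) and iTbNB (iterative=True) against scikit-learn's BernoulliNB, MultinomialNB, and ComplementNB under two vectorizer configurations (unigrams only, and unigrams plus bigrams), recording accuracy, F1-score, and training and prediction times.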



In [8]:
results = []

def benchmark_model(name, clf, X_train, X_test, y_train, y_test, variant):
    """Run one benchmark and append its metrics to the global results list."""

    # Time the training step.
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - t0

    # Time prediction on the test set.
    t0 = time.time()
    preds = clf.predict(X_test)
    pred_time = time.time() - t0

    # F1 is computed for the positive class (label 1).
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="binary")

    results.append({
        "Vectorizer": variant,
        "Model": name,
        "Accuracy": acc,
        "F1-score": f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": pred_time
    })


vectorizers = {
    "Simple": CountVectorizer(binary=False, stop_words="english"),
    "N_grams": CountVectorizer(binary=False, stop_words="english", ngram_range=(1,2))
}

for variant, vectorizer in vectorizers.items():

    print("\n==============================")
    print(f"  Running Vectorizer: {variant}")
    print("==============================")

    # Fit the vocabulary on the training text only, then reuse it for the test text.
    X_train_vec = vectorizer.fit_transform(X_train_text)
    X_test_vec = vectorizer.transform(X_test_text)

    # A fresh estimator is created for every vectorizer variant.
    models = {
        "TbNB": TbNB(iterative=False),
        "iTbNB": TbNB(iterative=True),
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "ComplementNB": ComplementNB(),
    }
    for name, clf in models.items():
        benchmark_model(name, clf, X_train_vec, X_test_vec, y_train, y_test, variant)


df = pd.DataFrame(results).sort_values(by=["Vectorizer", "Accuracy"], ascending=[True, False])
print("\n\n=== FINAL RESULTS ===")
print(df.to_string(index=False))



  Running Vectorizer: Simple

  Running Vectorizer: N_grams


=== FINAL RESULTS ===
Vectorizer         Model  Accuracy  F1-score  Train Time (s)  Predict Time (s)
   N_grams  ComplementNB  0.853507  0.850691        0.184513          0.221479
   N_grams MultinomialNB  0.853495  0.850651        0.177454          0.236389
   N_grams          TbNB  0.852685  0.850187        2.161524          0.184067
   N_grams         iTbNB  0.851615  0.846856        2.638373          0.227796
   N_grams   BernoulliNB  0.847565  0.853112        0.261557          0.345673
    Simple   BernoulliNB  0.825523  0.826738        0.079414          0.125324
    Simple  ComplementNB  0.825483  0.822505        0.026182          0.058849
    Simple MultinomialNB  0.825473  0.822435        0.026290          0.060626
    Simple          TbNB  0.825348  0.820287        0.435855          0.027804
    Simple         iTbNB  0.824525  0.817637        0.889264          0.057638
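
Each timing above comes from a single run measured with `time.time()`, and several are only a few hundredths of a second, so small differences may fall within run-to-run noise. One possible refinement, sketched here with hypothetical helpers `timed` and `best_of`, is to use the monotonic high-resolution clock `time.perf_counter()` and keep the best of several repeats.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def best_of(fn, *args, repeats=5, **kwargs):
    """Return the minimum elapsed time over several runs to damp noise."""
    return min(timed(fn, *args, **kwargs)[1] for _ in range(repeats))


# Stand-in workload; in the benchmark this would be clf.fit or clf.predict.
total, elapsed = timed(sum, range(100_000))
fastest = best_of(sum, range(100_000), repeats=3)

print(total)  # 4999950000
print(f"single run: {elapsed:.6f}s, best of 3: {fastest:.6f}s")
```

Repeating `fit` requires a fresh estimator per run, so for training times the callable passed to `best_of` would need to construct the model as well.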


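The second benchmark repeats the comparison after cleaning the text with TextPreprocessor, imported from preprocessing.nltk_pipeline, configured to lowercase the text, expand contractions, and remove punctuation and stopwords. Stopword removal therefore moves out of CountVectorizer, and the N-gram vectorizer additionally drops terms that appear in fewer than three documents (min_df=3).

In [9]: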
from preprocessing.nltk_pipeline import TextPreprocessor

# Reset the global results list so this run is reported separately.
results = []

def benchmark_model(name, clf, X_train, X_test, y_train, y_test, variant):
    """Run one benchmark and append its metrics to the global results list."""

    # Train
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - t0

    # Predict
    t0 = time.time()
    preds = clf.predict(X_test)
    pred_time = time.time() - t0

    # Metrics (F1 is computed for the positive class, label 1)
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="binary")

    results.append({
        "Vectorizer": variant,
        "Model": name,
        "Accuracy": acc,
        "F1-score": f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": pred_time
    })



vectorizers = {
    "Simple": CountVectorizer(binary=False),
    "N_grams": CountVectorizer(binary=False, ngram_range=(1,2), min_df=3),
}


preprocessor = TextPreprocessor(
    language="english",
    remove_html=False,
    remove_urls=False,
    lower=True,
    expand_contr=True,
    remove_punct=True,
    remove_sw=True,
    stem=False
)

X_train_clean = preprocessor.fit_transform(X_train_text)
X_test_clean  = preprocessor.transform(X_test_text)


for variant, vectorizer in vectorizers.items():

    print("\n==============================")
    print(f"  Running Vectorizer: {variant}")
    print("==============================")

    X_train_vec = vectorizer.fit_transform(X_train_clean)
    X_test_vec = vectorizer.transform(X_test_clean)

    benchmark_model("TbNB", TbNB(iterative=False),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("iTbNB", TbNB(iterative=True),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("BernoulliNB", BernoulliNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("MultinomialNB", MultinomialNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)

    benchmark_model("ComplementNB", ComplementNB(),
                    X_train_vec, X_test_vec, y_train, y_test, variant)



df = pd.DataFrame(results).sort_values(by=["Vectorizer", "Accuracy"], ascending=[True, False])
print("\n\n=== FINAL RESULTS ===")
print(df.to_string(index=False))



  Running Vectorizer: Simple

  Running Vectorizer: N_grams


=== FINAL RESULTS ===
Vectorizer         Model  Accuracy  F1-score  Train Time (s)  Predict Time (s)
   N_grams  ComplementNB  0.860850  0.861387        0.049492          0.253213
   N_grams MultinomialNB  0.860842  0.861359        0.105410          0.144594
   N_grams          TbNB  0.858625  0.855103        0.869709          0.130540
   N_grams         iTbNB  0.858225  0.854132        1.413474          0.154806
   N_grams   BernoulliNB  0.854710  0.858948        0.116639          0.209586
    Simple  ComplementNB  0.828322  0.825390        0.028737          0.065646
    Simple MultinomialNB  0.828275  0.825272        0.031811          0.069199
    Simple          TbNB  0.828202  0.825308        0.440209          0.039290
    Simple         iTbNB  0.827965  0.823353        0.953476          0.074540
    Simple   BernoulliNB  0.827245  0.828834        0.093275          0.123117
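
To read the two result tables side by side, the accuracy change per model can be computed directly from the reported numbers. The sketch below copies the accuracies from the tables above; the names `raw`, `cleaned`, and `gains` are illustrative. Keep in mind that the second N_grams configuration also adds min_df=3, so its differences are not attributable to text cleaning alone.

```python
# Accuracies copied from the first (raw text) and second (preprocessed) tables.
raw = {
    ("N_grams", "ComplementNB"): 0.853507,
    ("N_grams", "MultinomialNB"): 0.853495,
    ("N_grams", "TbNB"): 0.852685,
    ("N_grams", "iTbNB"): 0.851615,
    ("N_grams", "BernoulliNB"): 0.847565,
    ("Simple", "BernoulliNB"): 0.825523,
    ("Simple", "ComplementNB"): 0.825483,
    ("Simple", "MultinomialNB"): 0.825473,
    ("Simple", "TbNB"): 0.825348,
    ("Simple", "iTbNB"): 0.824525,
}
cleaned = {
    ("N_grams", "ComplementNB"): 0.860850,
    ("N_grams", "MultinomialNB"): 0.860842,
    ("N_grams", "TbNB"): 0.858625,
    ("N_grams", "iTbNB"): 0.858225,
    ("N_grams", "BernoulliNB"): 0.854710,
    ("Simple", "ComplementNB"): 0.828322,
    ("Simple", "MultinomialNB"): 0.828275,
    ("Simple", "TbNB"): 0.828202,
    ("Simple", "iTbNB"): 0.827965,
    ("Simple", "BernoulliNB"): 0.827245,
}

# Accuracy change for each (vectorizer, model) pair.
gains = {key: cleaned[key] - raw[key] for key in raw}

# Print the pairs from largest to smallest improvement.
for (variant, model_name), gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{variant:8s} {model_name:14s} {gain:+.6f}")
```

Every pair improves in the second run, with larger gains for the N_grams configuration than for the Simple one.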
