# Comparison of classifiers on texts under the naive Bayes assumption

In this laboratory session we compare, on a dataset, two probabilistic models for categorical data:
1. multivariate Bernoulli
2. multinomial
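
As a reminder of what distinguishes the two models, their class-conditional likelihoods under the naive Bayes assumption can be written as below. The notation is introduced here for illustration and is an assumption, not part of the original text: $V$ is the number of features, $x_j$ is the count of element $j$ in a document, $b_j = 1$ if $x_j > 0$ and $0$ otherwise, $p_{jc}$ is the probability that element $j$ appears at least once in a document of class $c$, and $\theta_{jc}$ is the probability of drawing element $j$ in class $c$, with $\sum_j \theta_{jc} = 1$.

$$
P_{\text{Bernoulli}}(x \mid c) = \prod_{j=1}^{V} p_{jc}^{\,b_j}\,(1 - p_{jc})^{\,1 - b_j}
$$

$$
P_{\text{multinomial}}(x \mid c) = \frac{\left(\sum_{j} x_j\right)!}{\prod_{j} x_j!}\;\prod_{j=1}^{V} \theta_{jc}^{\,x_j}
$$

The Bernoulli model only looks at presence or absence and gives an explicit factor to every absent element, while the multinomial model uses the counts and is unaffected by elements that do not occur in the document.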

We use a dataset of Twitter messages labelled with emotions (Joy vs Sadness).

The program below loads the data from a file.

The data are loaded into a matrix X using a sparse matrix representation, in order to save space and time.
The matrix is built from three "parallel" arrays (the coordinate, or triplet, form accepted by the `csr_matrix` constructor), which hold the values of the matrix cells that are different from zero and the indices of those cells.
The arrays are called `data`, `row` and `col`:

- data[i] stores the value of the nonzero matrix cell #i, whose indices are row[i] and col[i],
- row[i] stores the row index of cell #i,
- col[i] stores the column index of cell #i.
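
The role of the three parallel arrays can be illustrated with a minimal pure-Python sketch. The helper `triplets_to_dense` and the toy values are hypothetical and only for illustration; the notebook itself passes the same kind of triplets to `csr_matrix((data, (row, col)), shape=...)`, which likewise sums entries that share the same coordinates.

```python
def triplets_to_dense(data, row, col, shape):
    """Rebuild a dense matrix (list of lists) from (data, row, col) triplets."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for value, r, c in zip(data, row, col):
        # Entries that share the same (row, col) are summed
        dense[r][c] += value
    return dense


# Toy 3x4 matrix with four nonzero cells (made-up values)
data = [2, 1, 3, 1]
row = [0, 0, 1, 2]
col = [0, 2, 1, 3]

dense = triplets_to_dense(data, row, col, shape=(3, 4))
for dense_row in dense:
    print(dense_row)
```

Only four values and eight indices are stored for a matrix of twelve cells; the saving grows with the fraction of zero cells, which is very large for document-term matrices.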


The data file is in CSV format.
Each Twitter message has been preprocessed by a natural language processing pipeline that removed stop words and replaced each interesting document element with an integer identifier.
The interesting document elements may be words, emoji or emoticons. An element can occur several times in the same document, and it is identified by the same integer (named "element_id" in the program) across all documents. This "element_id" is used as the column index of the data matrix.

Each row of the CSV file reports the content of one document (a Twitter message). It consists of a list of pairs of integers, followed by a string that is the label of the document ("Joy" or "Sadness").
The first number of each pair is the identifier of a document element (the "element_id");
the second number of the pair is the count (frequency) of that element in that document.
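
The record format described above can be sketched with a small parser for a single line. The function `parse_line` and the sample record are hypothetical and made up for illustration; the only details taken from the notebook are the comma separator and the trailing label.

```python
def parse_line(line):
    """Split one CSV record into ({element_id: count}, label)."""
    fields = [f.strip() for f in line.strip().split(",")]
    label = fields[-1]      # last field: "Joy" or "Sadness"
    numbers = fields[:-1]   # alternating element_id, count
    if len(numbers) % 2 != 0:
        raise ValueError("expected (element_id, count) pairs before the label")
    counts = {}
    for k in range(0, len(numbers), 2):
        element_id = int(numbers[k])
        counts[element_id] = counts.get(element_id, 0) + int(numbers[k + 1])
    return counts, label


# Made-up record: element 12 once, element 305 twice, element 7 once
counts, label = parse_line("12,1,305,2,7,1,Joy")
print(counts, label)
```

Raising an error on an odd number of integers is a design choice of this sketch; the loading program in the notebook instead reports the problem and moves on.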

The dataset has:

tot_n_docs (rows in the file) = n_rows = 11981

n_features (total number of distinct elements in the corpus) = 11288



The following program reads the data file and loads the matrix in sparse form using the scipy.sparse library.

In [None]:
import numpy as np
from scipy.sparse import csr_matrix

class_labels = ["Joy", "Sadness"]
n_features = 11288  # Number of matrix columns (distinct features in the corpus)
n_rows = 11981  # Number of matrix rows (tweets in the dataset)
n_elements = 71474  # Number of nonzero elements to load (sparse)

# Path and file name
path_training="./sample_data/"
file_name = "joy_sadness6000.txt"

# Arrays that back the sparse matrix
row = np.empty(n_elements, dtype=int)
col = np.empty(n_elements, dtype=int)
data = np.empty(n_elements, dtype=int)

row_n = 0  # Current row counter
cur_el = 0  # Current position in the sparse arrays
twitter_labels = []  # Text labels of the tweets (Joy/Sadness)
twitter_target = []  # Numeric labels (0=Joy, 1=Sadness)

# Read the file
with open(path_training + file_name, "r") as fi:
    for line_number, line in enumerate(fi):
        el_list = line.strip().split(',')  # Fields of the current line
        l = len(el_list)

        if l < 2:  # The line is empty or malformed
            print(f"Error: malformed line {line_number} -> {line}")
            continue

        # The last field is the label (Joy/Sadness)
        class_name = el_list[-1].strip()
        twitter_labels.append(class_name)
        twitter_target.append(0 if class_name == class_labels[0] else 1)

        i = 0  # Index over the fields of the tweet
        while i < (l - 1):  # The last field is the label, so it is excluded
            try:
                element_id = int(el_list[i]) - 1  # Element ID, made 0-based
                i += 1
                value_cell = int(el_list[i])  # Number of occurrences of the element
                i += 1

                # Check that the indices are within range
                if row_n >= n_rows:
                    print(f"Error: row_n={row_n} exceeds n_rows={n_rows} (line {line_number})")
                    break
                if cur_el >= n_elements:
                    print(f"Error: cur_el={cur_el} exceeds n_elements={n_elements} (line {line_number})")
                    break
                if element_id < 0 or element_id >= n_features:
                    print(f"Error: element_id={element_id} out of range (0-{n_features-1}) (line {line_number})")
                    continue  # Ignore the out-of-range value

                # Store the cell in the arrays that back the sparse matrix
                row[cur_el] = row_n
                col[cur_el] = element_id
                data[cur_el] = value_cell
                cur_el += 1

            except ValueError as e:
                print(f"Conversion error on line {line_number}: {line}")
                print(f"Error detail: {e}")
                # Skip the rest of the line: if the element_id conversion fails,
                # i is never advanced and the loop would not terminate
                break

        row_n += 1  # Move to the next row

# Build the sparse matrix, then convert it to a dense array
twitter_data = csr_matrix((data[:cur_el], (row[:cur_el], col[:cur_el])), shape=(n_rows, n_features)).toarray()

# Final output
print("Resulting matrix:")
print(twitter_data)
print("Labels:", twitter_labels[:10])  # Print only the first 10
print("Numeric target:", twitter_target[:10])

Resulting matrix:
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]
Labels: ['Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy']
Numeric target: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Write a program in the following cell that splits the data matrix into a training set and a test set (by random selection) and predicts the class (Joy/Sadness) of the messages on the basis of their words.
Consider the two possible models:
multivariate Bernoulli and multinomial.
Find the accuracy of the models and test whether the observed differences are significant.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score
from scipy.stats import ttest_rel

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(twitter_data, twitter_target, test_size=0.2, random_state=42)

# Create the models
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()

# Train the models
bernoulli_nb.fit(X_train, y_train)
multinomial_nb.fit(X_train, y_train)

# Compute the predictions on the test set
y_pred_bernoulli = bernoulli_nb.predict(X_test)
y_pred_multinomial = multinomial_nb.predict(X_test)

# Compute the accuracy on the test set
accuracy_bernoulli = accuracy_score(y_test, y_pred_bernoulli)
accuracy_multinomial = accuracy_score(y_test, y_pred_multinomial)

# 10-fold cross-validation on the training set
bernoulli_score = cross_val_score(bernoulli_nb, X_train, y_train, cv=10)
multinomial_score = cross_val_score(multinomial_nb, X_train, y_train, cv=10)

# Paired t-test on the per-fold scores of the two models
stat, p_value = ttest_rel(bernoulli_score, multinomial_score)

# Output
print(f"BernoulliNB Accuracy: {accuracy_bernoulli:.4f}")
print(f"MultinomialNB Accuracy: {accuracy_multinomial:.4f}")
print(f"Cross-validation on Bernoulli model: {bernoulli_score.mean():.4f}")
print(f"Cross-validation on multinomial model: {multinomial_score.mean():.4f}")
print(f"Paired t-test statistic: {stat:.4f}, p-value: {p_value:.4f}")

# Interpret the p-value at the 5% significance level
if p_value < 0.05:
    print("The difference between the two models is statistically significant.")
else:
    print("No significant difference found between the models.")

BernoulliNB Accuracy: 0.9458
MultinomialNB Accuracy: 0.9433
Cross-validation on Bernoulli model: 0.9536
Cross-validation on multinomial model: 0.9513
Paired t-test statistic: 1.7921, p-value: 0.1067
No significant difference found between the models.
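
The paired t-test above treats the 10 fold scores as independent samples, which holds only approximately because the training folds overlap. As a possible cross-check, the two models can also be compared directly on the held-out test set with an exact McNemar test, which looks only at the messages on which the two classifiers disagree. This is a different technique from the one used in the cell above; the function `mcnemar_exact` and the toy labels are hypothetical and only for illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value): b counts cases where only model A is right,
    c counts cases where only model B is right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)
    c = sum(1 for t, a, m in triples if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # The models never disagree on correctness
    # Under H0 the discordant cases split as Binomial(n, 0.5)
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up labels and predictions, only to show the call
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
pred_a = [0, 0, 1, 1, 0, 1, 1, 0]
pred_b = [0, 1, 1, 0, 1, 1, 1, 0]
b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(f"b={b}, c={c}, p-value={p:.4f}")
```

With the variables of the cell above, the call would be `mcnemar_exact(y_test, y_pred_bernoulli, y_pred_multinomial)`.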
