# Naive Bayes for classification
- Example source: https://medium.com/datos-y-ciencia/algoritmos-naive-bayes-fudamentos-e-implementaci%C3%B3n-4bcb24b307f
- Dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
- Documentation: https://scikit-learn.org/stable/modules/naive_bayes.html
    - MultinomialNB: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn-naive-bayes-multinomialnb
    - GaussianNB: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
- Goal: classify SMS messages as spam or ham (legitimate)

Vectorization: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

## 1. Load libraries and data
### Libraries

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Models
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

# Preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Model evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Data
Collection of SMS messages labeled as spam or ham (legitimate).

In [2]:
# The CSV file is expected in the same folder as this notebook
dataframe = pd.read_csv(r"spam.csv", sep=',', encoding='latin-1')
# Drop the extra unnamed columns, which are not used here
dataframe = dataframe.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
# Rename the two remaining columns
dataframe = dataframe.set_axis(['label', 'sms_message'], axis=1)
dataframe.head(10)

|  | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


## 2. Initial look at the data
### General overview

In [3]:
# Number of messages per category
print(dataframe.groupby('label').size())
# Proportion of messages per category
print(dataframe.groupby('label').size()/len(dataframe))

label
ham     4825
spam     747
dtype: int64
label
ham     0.865937
spam    0.134063
dtype: float64
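
The class counts above give a useful reference point for the accuracy scores reported later: a classifier that always predicts ham would already be right about 86.6% of the time. A minimal sketch of that arithmetic, using the counts printed above (the variable names are illustrative and not part of the original notebook):

```python
# Class counts copied from the output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Accuracy of a trivial classifier that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Total messages: {total}")
print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should clearly beat this baseline on the test set.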


### Data preprocessing
We convert the labels to binary values: 0 represents 'ham' and 1 represents 'spam'.

In [4]:
# Keep a copy of the original text labels so the mapping can be checked
dataframe['label1'] = dataframe['label']
# Map the text labels to 0 (ham) and 1 (spam)
dataframe['label'] = dataframe.label.map({'ham':0, 'spam':1})
dataframe.head()

|  | label | sms_message | label1 |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | 0 | Ok lar... Joking wif u oni... | ham |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | 0 | U dun say so early hor... U c already then say... | ham |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | ham |


## 3. Preparing the DataFrame

### Train-Test split

In [5]:
# Split the dataset into train and test sets (default test_size of 0.25)
X_train, X_test, y_train, y_test = train_test_split(dataframe['sms_message'], dataframe['label'], random_state=1)
print('Number of rows in the total set: {}'.format(dataframe.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [6]:
X_train.head()

710     Height of Confidence: All the Aeronautics prof...
3740                                        2/2 146tf150p
2711    Wen ur lovable bcums angry wid u, dnt take it ...
3155                    Long time. You remember me today.
3748    Dear Voucher Holder 2 claim your 1st class air...
Name: sms_message, dtype: object

In [7]:
y_train.head()

710     0
3740    1
2711    0
3155    0
3748    1
Name: label, dtype: int64

### Bag of Words (BoW)
- Vectorization: the idea is to take a piece of text and count how often each word occurs in it.
- Frequency matrix: a set of documents can be converted into a matrix in which each document is a row and each word (token) is a column; the value at (row, column) is the number of times that token occurs in that document.

In [8]:
# Create the CountVectorizer
count_vector = CountVectorizer()
# Learn the vocabulary from the training data and return the count matrix
training_data = count_vector.fit_transform(X_train)
# Vocabulary (tokens)
print(count_vector.get_feature_names_out())
# Full count matrix as a dense array
print(training_data.toarray())

['00' '000' '000pes' ... 'ûïharry' 'ûò' 'ûówell']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [9]:
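# Transform the test data with the vocabulary learned from the training set and return the count matrix
testing_data = count_vector.transform(X_test)
# Vocabulary (tokens), unchanged from the training step
print(count_vector.get_feature_names_out())
# Full count matrix as a dense array
print(testing_data.toarray())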

['00' '000' '000pes' ... 'ûïharry' 'ûò' 'ûówell']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


## 4. Model

### 4.1 MultinomialNB model
#### Training

In [10]:
# Training
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB()
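
To make the fitting step less of a black box, here is a rough sketch of what `MultinomialNB` estimates with its default smoothing (`alpha=1.0`): a log prior per class and a smoothed log probability per token and class. The count matrix and labels below are toy values invented for illustration, and the manual computation is a simplified reconstruction, not the library's actual code.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (4 documents x 3 tokens) and labels, invented for illustration
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 2, 3],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # additive (Laplace) smoothing, the scikit-learn default

classes = np.unique(y)
# Log prior: fraction of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Smoothed token counts per class, normalised into per-class token probabilities
token_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(token_counts / token_counts.sum(axis=1, keepdims=True))

# Joint log score of each document under each class; predict the larger one
joint = X @ log_likelihood.T + log_prior
manual_pred = classes[joint.argmax(axis=1)]

model = MultinomialNB(alpha=alpha).fit(X, y)
print(np.allclose(model.feature_log_prob_, log_likelihood))
print(np.array_equal(model.predict(X), manual_pred))
```

Smoothing matters here because a token that never occurs in one class would otherwise get probability zero and veto that class for any message containing it.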

In [11]:
# Score: mean accuracy on the training set
naive_bayes.score(training_data, y_train)

0.9937784158889686

#### Prediction

In [12]:
# Apply the model to the test data
predicciones = naive_bayes.predict(testing_data)
predicciones

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [13]:
# Class probabilities; columns follow the order of the class labels 0 (ham), 1 (spam)
print("classification: [ham,spam]")
naive_bayes.predict_proba(testing_data)

classification: [ham,spam]


array([[9.99867733e-01, 1.32267211e-04],
       [9.99804615e-01, 1.95384688e-04],
       [9.99997809e-01, 2.19058026e-06],
       ...,
       [9.99759900e-01, 2.40100285e-04],
       [9.99998225e-01, 1.77530175e-06],
       [9.99991340e-01, 8.66005064e-06]])
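
The column order of `predict_proba` does not have to be remembered: it follows the fitted model's `classes_` attribute. A small sketch on toy counts (invented for illustration) showing that each row sums to 1 and that `predict` returns the class with the highest probability:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix and labels (0 = ham, 1 = spam), invented for illustration
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 2, 3],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])

model = MultinomialNB().fit(X, y)
proba = model.predict_proba(X)

print(model.classes_)      # column order of predict_proba
print(proba.sum(axis=1))   # each row is a probability distribution

# Picking the most probable column reproduces predict()
labels_from_proba = model.classes_[proba.argmax(axis=1)]
print(np.array_equal(labels_from_proba, model.predict(X)))
```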

In [14]:
# Score: mean accuracy on the test set
naive_bayes.score(testing_data, y_test)

0.9856424982053122

#### Results table

In [15]:
tabla = pd.DataFrame(X_test)
# Rename the column
tabla = tabla.set_axis(['sms_message'], axis=1)
# Add the true labels
tabla['spam'] = y_test
# Add the predicted labels
tabla['spam prediction'] = predicciones
# True when the prediction matches the true label
tabla['T/F'] = tabla['spam']==tabla['spam prediction']
tabla

|  | sms_message | spam | spam prediction | T/F |
|---|---|---|---|---|
| 1078 | Convey my regards to him | 0 | 0 | True |
| 4028 | [Û_] anyway, many good evenings to u! s | 0 | 0 | True |
| 958 | My sort code is and acc no is . The bank is n... | 0 | 0 | True |
| 4642 | Sorry i din lock my keypad. | 0 | 0 | True |
| 4674 | Hi babe its Chloe, how r u? I was smashed on s... | 1 | 0 | False |
| ... | ... | ... | ... | ... |
| 3207 | Oops my phone died and I didn't even know. Yea... | 0 | 0 | True |
| 4655 | K, I'll work something out | 0 | 0 | True |
| 1140 | Oh:)as usual vijay film or its different? | 0 | 0 | True |
| 1793 | You bad girl. I can still remember them | 0 | 0 | True |


In [16]:
# Number of correct (True) and incorrect (False) predictions
print(tabla['T/F'].value_counts())

# Percentage of correct (True) and incorrect (False) predictions
print(((tabla['T/F'].value_counts())/len(tabla))*100)

True     1373
False      20
Name: T/F, dtype: int64
True     98.56425
False     1.43575
Name: T/F, dtype: float64


### 4.2 GaussianNB model
#### Training

In [17]:
# Convert the training data to a dense array, since GaussianNB expects dense input
training_data = training_data.toarray()
# Fit the Gaussian model
clf = GaussianNB()
clf.fit(training_data, y_train)

GaussianNB()

In [18]:
# Score: mean accuracy on the training set
clf.score(training_data, y_train)

0.9535774108638431

#### Prediction

In [19]:
# Convert the test data to a dense array
testing_data = testing_data.toarray()
# Apply the model to the test data
predicciones = clf.predict(testing_data)
predicciones

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [20]:
# Class probabilities; columns follow the order of the class labels 0 (ham), 1 (spam)
print("classification: [ham,spam]")
clf.predict_proba(testing_data)

classification: [ham,spam]


array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])

In [21]:
# Score: mean accuracy on the test set
clf.score(testing_data, y_test)

0.9095477386934674
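
The four accuracy values reported so far can be put side by side. The sketch below only copies the numbers printed in the cells above into a small table and computes the train/test gap; nothing is re-trained, and the table layout is an illustrative choice. One plausible reading is that MultinomialNB suits this task better because it models word counts directly, whereas GaussianNB assumes each feature is normally distributed within each class, which is a rough fit for sparse counts.

```python
import pandas as pd

# Accuracy values copied from the cell outputs above
scores = pd.DataFrame(
    {
        "train_accuracy": [0.9937784158889686, 0.9535774108638431],
        "test_accuracy": [0.9856424982053122, 0.9095477386934674],
    },
    index=["MultinomialNB", "GaussianNB"],
)
# Difference between training and test accuracy for each model
scores["gap"] = scores["train_accuracy"] - scores["test_accuracy"]
print(scores.round(4))
```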

## 5. Model evaluation

#### Confusion matrix

In [22]:
# Confusion matrix for the MultinomialNB model (rows: true class, columns: predicted class)
pred = naive_bayes.predict(testing_data)
print("confusion_matrix")
print(confusion_matrix(y_test, pred))

confusion_matrix
[[1205    8]
 [  12  168]]


#### Metrics

In [23]:
# Accuracy: (TP+TN)/(TP+FP+TN+FN)
accuracy = accuracy_score(y_test, pred)
print('Accuracy score: ', format(accuracy))

Accuracy score:  0.9856424982053122


In [24]:
# Precision: TP/(TP+FP)
precision = precision_score(y_test, pred)
print('Precision score: ', format(precision))

Precision score:  0.9545454545454546


In [25]:
# Recall: TP/(TP+FN)
recall = recall_score(y_test, pred)
print('Recall score: ', format(recall))

Recall score:  0.9333333333333333


In [26]:
# F1: 2*((precision*recall)/(precision+recall))
f1 = f1_score(y_test, pred)
print('F1 score: ', format(f1))

F1 score:  0.9438202247191012
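
As a cross-check, the four metrics can be recomputed by hand from the confusion matrix printed earlier, treating spam (label 1) as the positive class. This is a minimal sketch that hard-codes the matrix values from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: true class (0 = ham, 1 = spam), columns: predicted class
cm = np.array([[1205, 8],
               [12, 168]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # (TP+TN)/(TP+FP+TN+FN)
precision = tp / (tp + fp)             # share of predicted spam that is spam
recall = tp / (tp + fn)                # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

The 8 false positives are legitimate messages flagged as spam, and the 12 false negatives are spam messages that got through; together they account for the 20 misclassified rows counted in the results table.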
