In [1]:
import pandas as pd
import numpy as np

## Naive Bayes exercise: Spam or ham

We are going to build our own *spam* filter from the SMS messages contained in the *smsspamcollection* zip file.

For this exercise we will use the *Naive Bayes* classifier which, as you already know, is a probabilistic classifier. Before starting, you should read the *readme* file provided with the dataset.
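
As a reminder of what "probabilistic classifier" means here, the following is a minimal from-scratch sketch of a multinomial Naive Bayes with add-one (Laplace) smoothing. It is not the code used in this notebook: the function names `train_nb` and `predict_nb` and the four-message toy corpus are hypothetical and only serve to illustrate the idea of choosing the class with the highest posterior score.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and words per class (docs are token lists)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w not in vocab:
                continue  # words never seen in training are ignored
            # add-one smoothing avoids zero probabilities
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus, not taken from the SMS dataset
docs = [["free", "prize", "call"],
        ["win", "free", "cash"],
        ["see", "you", "tonight"],
        ["call", "me", "tonight"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["free", "cash", "prize"]))
print(predict_nb(model, ["see", "you", "tonight"]))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.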

Following an idea suggested by a classmate, I computed several features based on the length of the messages and of the words they contain. I did not get good results, but the computation process may still be interesting to see.

In [2]:
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('data/SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])

print("Observations: " + str(df.shape[0]))
df.head()

Observations: 5572


,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


We balance the dataset by downsampling the majority class (*ham*) to the size of the minority class (*spam*).

In [3]:
from sklearn.utils import resample
balanceo = True
if balanceo:
    # Split the two classes
    df_majority = df[df.label=="ham"]
    df_minority = df[df.label=="spam"]

    df_majority_downsampled = resample(df_majority,
                                       replace=False,                # sample without replacement
                                       n_samples=len(df_minority),   # match the minority class (747)
                                       random_state=33)

    # Combine both classes again
    df_downsampled = pd.concat([df_majority_downsampled, df_minority])

    # Show the class counts of the new dataset
    print(df_downsampled.label.value_counts())
    df = df_downsampled

ham     747
spam    747
Name: label, dtype: int64


We configure the vectorizer used to obtain the tokens.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(stop_words="english",
                               token_pattern=r'\b[^_\d\W]+\b',  # tokens made of letters only (no digits, no underscores)
                               min_df=0.001,
                               strip_accents='unicode')
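
To see what this `token_pattern` keeps and drops, here is a small sketch that applies the same regular expression with Python's `re` module. It assumes that the tokenizer built from the vectorizer amounts to `findall` with this pattern; the helper name `tokenize` is hypothetical. The two sample strings are prefixes of messages shown later in the notebook output.

```python
import re

# Same pattern as the one passed to CountVectorizer above
TOKEN_PATTERN = re.compile(r'\b[^_\d\W]+\b')

def tokenize(text):
    """Hypothetical helper: return every run of letters delimited by word boundaries."""
    return TOKEN_PATTERN.findall(text)

# Standalone digits and punctuation are dropped, case is preserved
print(tokenize("THANX 4 PUTTIN DA FONE DOWN ON ME!!"))

# Tokens that mix digits and letters (e.g. "3MOBILE") are dropped entirely,
# because no word boundary exists between the digit and the letters
print(tokenize("ASKED 3MOBILE IF 0870 CHATLINES"))
```

Note that the token lists displayed later in the notebook still contain stop words such as "a" and "to", which suggests that `stop_words` and `min_df` play no role when only the tokenizer is used.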


We compute 2 different features:
* The length of each message once it has been *tokenized* (its number of tokens).
* The mean word length.

In [5]:
# Tokenizing function
funct = count_vector.build_tokenizer()

# Mean token length of a list of tokens (0 for an empty list)
def averageLen(lst):
    lengths = [len(i) for i in lst]
    return 0 if len(lengths) == 0 else (float(sum(lengths)) / len(lengths))

# Create new dataframe columns by applying the functions above
df["sms_tokens"] = df.sms_message.apply(funct)
df["sms_length"] = df.sms_tokens.apply(len)  # len is the Python built-in
df["mean_word"] = df["sms_tokens"].apply(averageLen)
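
As a worked example of both features, the sketch below recomputes them by hand for a single token list, the one shown for row 3199 in the dataframe output. The function name `average_len` is a hypothetical stand-in for `averageLen`.

```python
def average_len(tokens):
    """Mean length of the tokens, or 0 when the list is empty."""
    lengths = [len(t) for t in tokens]
    return sum(lengths) / len(lengths) if lengths else 0

# Tokens of "7 lor... Change 2 suntec... Wat time u coming?"
tokens = ["lor", "Change", "suntec", "Wat", "time", "u", "coming"]

sms_length = len(tokens)         # number of tokens in the message
mean_word = average_len(tokens)  # (3 + 6 + 6 + 3 + 4 + 1 + 6) / 7 = 29 / 7

print(sms_length, round(mean_word, 6))  # 7 4.142857
print(average_len([]))                  # 0, a message with no valid tokens
```

The empty-list case matters in practice: a message made only of digits or symbols produces no tokens, and the guard avoids a division by zero.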

In [6]:
df  # inspect the dataframe

,label,sms_message,sms_tokens,sms_length,mean_word
2598,ham,"Got fujitsu, ibm, hp, toshiba... Got a lot of ...","[Got, fujitsu, ibm, hp, toshiba, Got, a, lot, ...",13,3.384615
3199,ham,7 lor... Change 2 suntec... Wat time u coming?,"[lor, Change, suntec, Wat, time, u, coming]",7,4.142857
4605,ham,THANX 4 PUTTIN DA FONE DOWN ON ME!!,"[THANX, PUTTIN, DA, FONE, DOWN, ON, ME]",7,3.571429
954,ham,Also remember to get dobby's bowl from your car,"[Also, remember, to, get, dobby, s, bowl, from...",10,3.800000
1402,ham,Kaiez... Enjoy ur tuition... Gee... Thk e seco...,"[Kaiez, Enjoy, ur, tuition, Gee, Thk, e, secon...",19,3.473684
...,...,...,...,...,...
5537,spam,Want explicit SEX in 30 secs? Ring 02073162414...,"[Want, explicit, SEX, in, secs, Ring, now, Cos...",11,4.090909
5540,spam,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...,"[ASKED, IF, CHATLINES, INCLU, IN, FREE, MINS, ...",26,3.692308
5547,spam,Had your contract mobile 11 Mnths? Latest Moto...,"[Had, your, contract, mobile, Mnths, Latest, M...",26,4.730769
5566,spam,REMINDER FROM O2: To get 2.50 pounds free call...,"[REMINDER, FROM, To, get, pounds, free, call, ...",25,4.440000


Looking at the summary statistics, neither of the computed features takes very different values in the two classes.

In [7]:
df.groupby("label").describe()

,sms_length,sms_length,sms_length,sms_length,sms_length,sms_length,sms_length,sms_length,mean_word,mean_word,mean_word,mean_word,mean_word,mean_word,mean_word,mean_word
label,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
ham,747.0,14.323963,10.72703,1.0,7.0,11.0,19.0,117.0,747.0,3.676963,0.690065,1.0,3.272727,3.625,4.0,7.333333
spam,747.0,21.176707,5.507815,0.0,18.0,22.0,25.0,33.0,747.0,4.253469,0.668874,0.0,3.809524,4.2,4.658333,12.0


Here we create the training and test sets using only one of the features (`mean_word`).

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['mean_word'],
                                                    df['label'],
                                                    random_state=1)

print("Our original set contains", df.shape[0], "observations")
print("Our training set contains", X_train.shape[0], "observations")
print("Our testing set contains", X_test.shape[0], "observations")


Our original set contains 1494 observations
Our training set contains 1120 observations
Our testing set contains 374 observations
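
The sizes above follow from the default split. The short sketch below reproduces the arithmetic, assuming scikit-learn's documented default of `test_size=0.25` when neither `test_size` nor `train_size` is given, and assuming the test count is rounded up.

```python
import math

n_samples = 1494   # rows in the balanced dataframe
test_size = 0.25   # assumed scikit-learn default for train_test_split

n_test = math.ceil(test_size * n_samples)  # 373.5 rounded up
n_train = n_samples - n_test               # the remainder goes to training

print(n_train, n_test)  # 1120 374
```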


This plot shows how the selected feature is distributed in each class.

In [9]:
import matplotlib.pyplot as plt

spam = X_train[y_train == "spam"]
ham = X_train[y_train == "ham"]

plt.plot(spam, [0]*spam.shape[0], 'ro');
plt.plot(ham, [1]*ham.shape[0], 'bx');

In [10]:
# Prepare the data for the classifier: scikit-learn expects a 2-D array of shape (n_samples, n_features)
X_train = np.asarray(X_train).reshape(-1, 1)
X_test = np.asarray(X_test).reshape(-1, 1)


Because the problem is so simple (a single numeric feature), we will classify it using logistic regression.

In [20]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(C=1.0) # instantiate the classifier
logistic.fit(X_train, y_train) #train the classifier on the training set



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
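
With a single feature, a logistic regression effectively learns one threshold on `mean_word`. The sketch below illustrates this with hypothetical parameters `w` and `b`; they are NOT the values fitted in this notebook (those would be read from `logistic.coef_` and `logistic.intercept_`), and the sketch assumes *spam* is the positive class.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters chosen for illustration only
w, b = 2.0, -8.0

def p_spam(mean_word):
    """Estimated probability of spam for a given mean word length."""
    return sigmoid(w * mean_word + b)

# The decision boundary is where the probability equals 0.5, i.e. w*x + b = 0
threshold = -b / w

print(threshold)    # 4.0
print(p_spam(3.5))  # below 0.5, so predicted ham
print(p_spam(4.5))  # above 0.5, so predicted spam
```

Every message whose mean word length lies above the threshold is labelled spam and every other message ham, which is why overlapping class distributions limit the attainable accuracy.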

In [21]:
predictions = logistic.predict(X_test) #predict using the model on the testing set

## Summary metrics for our classifier

We will use the `classification_report` function.

[Documentation](https://scikit-learn.org/stable/modules/classes.html#classification-metrics)

In [22]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.67      0.71      0.69       187
        spam       0.69      0.65      0.67       187

    accuracy                           0.68       374
   macro avg       0.68      0.68      0.68       374
weighted avg       0.68      0.68      0.68       374
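
To make the columns of the report concrete, the following sketch computes precision, recall and F1 by hand for one class. The function name `per_class_metrics` and the eight labels are hypothetical toy data, not the predictions of this notebook.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy labels
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]

for label in ("ham", "spam"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision answers "of the messages predicted as this class, how many really belong to it?", while recall answers "of the messages that belong to this class, how many did we find?".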



These results show how looking at a dataset creatively can produce ideas that let us classify it in very simple ways.

Even so, if we compare the results numerically (an accuracy of 0.68 here), it should be noted that the classic approach to the problem gives better values.
