In [14]:
import pandas as pd

balanceo = True  # This flag enables balancing of the dataset

## Naive Bayes exercise: Spam or ham

We are going to build our own *spam* filter from the SMS messages contained in the *smsspamcollection* zip file.

For this exercise we will use the *Naive Bayes* classifier which, as you already know, is a probabilistic classifier. Before you start, read the *readme* file that ships with the dataset.
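
As a reminder of what "probabilistic" means here, the standard Naive Bayes decision rule can be written as below. The symbols $c$ (class), $w_i$ (the $i$-th word of a message) and $n$ (number of words in the message) are notation introduced for this note; they do not come from the dataset's readme.

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The "naive" part is the assumption that the words of a message are conditionally independent given its class.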

In [2]:
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('data/SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])

print("Observations: " + str(df.shape[0]))
df.head()

Observations: 5572


| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Exercise 0
How many observations are there in the dataset?

How many are spam?

How many are normal messages (ham)?

In [3]:
df.groupby("label").describe()

| label | count | unique | top | freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 653 | Please call our customer service representativ... | 4 |
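
The answers to Exercise 0 can be read off the `count` column of the table above. The short sketch below only redoes that arithmetic (the dictionary `counts` is a name chosen for this illustration) to make the class imbalance explicit.

```python
# Class counts copied from the `count` column of the table above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
for label, n in counts.items():
    # Share of each class in the full dataset.
    print(f"{label}: {n} messages ({n / total:.1%} of {total})")
```

Roughly one message in seven is spam, which is the motivation for the balancing step that follows.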


**Balancing the dataset**

In [4]:
from sklearn.utils import resample

if balanceo:
    # Split the two classes
    df_majority = df[df.label == "ham"]
    df_minority = df[df.label == "spam"]

    # Downsample the majority class to the size of the minority class
    df_majority_downsampled = resample(df_majority,
                                       replace=False,               # sample without replacement
                                       n_samples=len(df_minority),  # match the minority class (747)
                                       random_state=33)             # reproducible draw

    # Combine both classes
    df_downsampled = pd.concat([df_majority_downsampled, df_minority])

    # Show the class counts of the new dataset
    print(df_downsampled.label.value_counts())
    df = df_downsampled

ham     747
spam    747
Name: label, dtype: int64
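
The idea behind the downsampling step can be sketched with the standard library alone. The data below is made up for illustration, and `random.Random.sample` stands in for `sklearn.utils.resample`; it is not what the notebook actually calls.

```python
import random
from collections import Counter

# Toy, made-up labelled data: 8 "ham" and 3 "spam" messages.
data = ([("ham", f"ham message {i}") for i in range(8)]
        + [("spam", f"spam message {i}") for i in range(3)])

majority = [row for row in data if row[0] == "ham"]
minority = [row for row in data if row[0] == "spam"]

# Draw, without replacement, as many majority rows as there are minority rows.
rng = random.Random(33)  # fixed seed so the draw is reproducible
majority_downsampled = rng.sample(majority, k=len(minority))

balanced = majority_downsampled + minority
print(Counter(label for label, _ in balanced))
```

The price of this approach in the notebook is that 4825 - 747 = 4078 ham messages are discarded before training.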


### Exercise 1
First we need to clean the data. Some recommended steps are:

* [Convert the text to lowercase](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html)
* [Remove punctuation](https://www.geeksforgeeks.org/string-punctuation-in-python)
* [Split the remaining text into words](https://docs.python.org/3.7/library/stdtypes.html#str.split)
* Remove numbers and words that are very short or carry no meaning on their own.
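
The four steps above could be sketched for a single message as follows. The function `clean_message` and its `min_length` threshold are hypothetical names chosen for this illustration, and the sample text is a shortened fragment based on the spam message shown earlier.

```python
import string

def clean_message(text, min_length=3):
    """Lowercase, strip punctuation, split into words, drop short or numeric tokens."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    # Keep purely alphabetic words of at least `min_length` characters.
    return [w for w in words if w.isalpha() and len(w) >= min_length]

print(clean_message("Free entry in 2 a wkly comp to win FA Cup!"))
# ['free', 'entry', 'wkly', 'comp', 'win', 'cup']
```

Note that stripping punctuation this way also turns contractions such as "don't" into "dont"; whether that is acceptable depends on the task.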


In [5]:
df.sms_message.str.lower().head(10)

2598    got fujitsu, ibm, hp, toshiba... got a lot of ...
3199       7 lor... change 2 suntec... wat time u coming?
4605                  thanx 4 puttin da fone down on me!!
954       also remember to get dobby's bowl from your car
1402    kaiez... enjoy ur tuition... gee... thk e seco...
1389       oh k.i think most of wi and nz players unsold.
1339    aight sorry i take ten years to shower. what's...
1238                    is ur paper in e morn or aft tmr?
1036    hello baby, did you get back to your mom's ? a...
3738    plz note: if anyone calling from a mobile co. ...
Name: sms_message, dtype: object

Finally, we would need to count how often each word appears before we can apply the classifier.
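
As a rough sketch of that counting step, the snippet below builds a small word-count matrix by hand with `collections.Counter`. The two messages are made up and assumed to be already cleaned.

```python
from collections import Counter

# Two made-up, already-cleaned messages (lists of words).
messages = [
    ["free", "entry", "win", "cup"],
    ["win", "cash", "now", "win"],
]

# One Counter per message gives the per-message word frequencies.
per_message = [Counter(words) for words in messages]
# The sorted set of all words is the vocabulary (the matrix columns).
vocabulary = sorted({word for words in messages for word in words})

# Count matrix: one row per message, one column per vocabulary word.
matrix = [[counts[word] for word in vocabulary] for counts in per_message]

print(vocabulary)  # ['cash', 'cup', 'entry', 'free', 'now', 'win']
for row in matrix:
    print(row)     # [0, 1, 1, 1, 0, 1] then [1, 0, 0, 0, 1, 2]
```

Each row is the bag-of-words representation of one message: word order is lost and only the counts remain.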

### Exercise 2

Doing all of this by hand is tedious; scikit-learn provides the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class for it. Use this class to prepare your data.


**Note:** To discard numbers, pass the following parameter: `token_pattern=r'\b[^\d\W]+\b'`
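
To see what this pattern keeps, it can be tried directly with `re.findall` on a made-up, lowercased message. This is only an approximation of what happens inside `CountVectorizer`, which as far as we know applies the pattern in a similar way after lowercasing.

```python
import re

text = "win 1000 cash now, txt 2day!"

# Pattern from the note above: runs of word characters that are not digits,
# delimited by word boundaries.
pattern = r'\b[^\d\W]+\b'

print(re.findall(pattern, text))
# ['win', 'cash', 'now', 'txt']
```

Tokens that mix digits and letters, such as "2day", are dropped entirely rather than trimmed to "day", because there is no word boundary between the digit and the letters. The code cell that follows uses a variant of the pattern that also excludes underscores.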

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
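# Remove English stop words, ignore digits and underscores, and drop very rare terms
count_vector = CountVectorizer(stop_words="english",
                               token_pattern=r'\b[^_\d\W]+\b',
                               min_df=0.001,
                               strip_accents='unicode')

count_vector.fit(df.sms_message)  # learn the vocabulary from the messages
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vector.get_feature_names_out()[:10]  # first 10 terms (not displayed: not the cell's last expression)
print(len(count_vector.get_feature_names_out()))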

1500


In [7]:
doc_array = count_vector.transform(df.sms_message)

## Showing the matrix we have created


In [8]:
doc_array_transformed = doc_array.toarray()
frequency_matrix = pd.DataFrame(doc_array_transformed, columns=count_vector.get_feature_names_out())
frequency_matrix.head(10)

Unnamed: 0,abiola,able,abt,abta,ac,accept,access,accident,account,aco,...,year,years,yer,yes,yesterday,yo,yr,yrs,yup,zed
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)

print("Our original set contains", df.shape[0], "observations")
print("Our training set contains", X_train.shape[0], "observations")
print("Our testing set contains", X_test.shape[0], "observations")

Our original set contains 1494 observations
Our training set contains 1120 observations
Our testing set contains 374 observations
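
The 1120 / 374 split above follows from the default test fraction of `train_test_split`. The arithmetic below is only a sketch: the rounding rule (round the test size up, give the remainder to training) is our understanding of scikit-learn's behaviour, not something stated in this notebook.

```python
import math

n_samples = 1494   # size of the balanced dataset
test_size = 0.25   # default used when neither test_size nor train_size is given

# Assumption: the test size is rounded up and the training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1120 374
```

Passing `test_size` explicitly would make the split ratio visible in the code instead of relying on the default.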


In [10]:
train = count_vector.fit_transform(X_train)
test = count_vector.transform(X_test)

## Applying Naive Bayes

As we have already seen, all scikit-learn classifiers are used in the same way: create the estimator, `fit` it on the training data, and `predict` on new data.

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
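
To make clear what the classifier computes, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a tiny made-up corpus. It is an illustration of the technique, not scikit-learn's implementation; the names `log_posterior` and `predict`, and the choice to skip unseen words, are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (label, list of words).
train = [
    ("spam", ["win", "cash", "now"]),
    ("spam", ["free", "cash", "prize"]),
    ("ham",  ["see", "you", "now"]),
    ("ham",  ["call", "you", "later"]),
]

vocabulary = {w for _, words in train for w in words}
class_docs = Counter(label for label, _ in train)  # number of documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for label, words in train:
    word_counts[label].update(words)

def log_posterior(words, label, alpha=1.0):
    """Unnormalised log P(label | words) with add-alpha smoothing."""
    log_prob = math.log(class_docs[label] / len(train))  # log prior
    total = sum(word_counts[label].values())
    for w in words:
        if w not in vocabulary:
            continue  # assumption: ignore words never seen during training
        likelihood = (word_counts[label][w] + alpha) / (total + alpha * len(vocabulary))
        log_prob += math.log(likelihood)
    return log_prob

def predict(words):
    """Return the class with the highest unnormalised log posterior."""
    return max(class_docs, key=lambda label: log_posterior(words, label))

print(predict(["free", "cash"]))          # spam
print(predict(["see", "you", "later"]))   # ham
```

Working with log probabilities avoids numerical underflow when many small likelihoods are multiplied, and the smoothing keeps a word that is absent from one class from forcing that class's probability to zero.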

In [11]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()  # create the classifier
naive_bayes.fit(train, y_train);  # train the classifier on the training set

In [12]:
predictions = naive_bayes.predict(test)  # predict labels for the test set

## Showing summary metrics for our classifier

We will use the `classification_report` function.

[Documentation](https://scikit-learn.org/stable/modules/classes.html#classification-metrics)

In [13]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.95      0.98      0.97       187
        spam       0.98      0.95      0.96       187

    accuracy                           0.97       374
   macro avg       0.97      0.97      0.97       374
weighted avg       0.97      0.97      0.97       374
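
For reference, the per-class columns of the report can be reproduced by hand. The sketch below uses made-up labels (not the notebook's predictions) and a hypothetical helper `report_for` to show how precision, recall and F1 are defined for one class.

```python
# Made-up true and predicted labels, only to illustrate the definitions.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]

def report_for(label):
    """Return (precision, recall, f1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in ("ham", "spam"):
    precision, recall, f1 = report_for(label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a spam filter, precision on the spam class tells us how many flagged messages really were spam, while recall tells us how much of the spam was caught.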

