<h1>Dataset</h1>

Open the smsspamcollection dataset in your preferred text editor.

* The columns in the dataset currently have no names and, as you can see, there are 2 columns.

* The first column takes two values: 'ham', which means the message is not spam, and 'spam', which means the message is spam.

* The second column is the text content of the SMS message being classified.
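As a rough illustration of the file layout described above, the sketch below parses two made-up lines in the same `label<TAB>message` format using only the standard library. The sample text and the `label_to_int` name are assumptions for illustration; they are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample in the same "label<TAB>message" layout as the dataset file.
sample = "ham\tOk, see you at five\nspam\tWin a free prize now\n"

# Same numeric encoding used later in this notebook: ham -> 0, spam -> 1.
label_to_int = {"ham": 0, "spam": 1}

rows = []
# QUOTE_NONE keeps any quotation marks inside a message as literal text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for label, message in reader:
    rows.append((label_to_int[label], message))

print(rows)
```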

In [0]:
import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('smsspamcollection/SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

# Print the first 5 rows
df.head()

# Map the text labels to numeric labels; this is good practice when building supervised models
df['label'] = df.label.map({'ham':0, 'spam':1})
print(df.shape) # returns (rows, columns)
df.head()

(5572, 2)


| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


<h2>Building the bag of words with scikit-learn</h2>

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

**Data preprocessing with scikit-learn's CountVectorizer()**

Building a bag of words involves converting all of our text to lower case and removing all punctuation marks. `CountVectorizer()` has parameters that take care of these steps for us:

* `lowercase = True`
    
    The `lowercase` parameter has a default value of `True` which converts all of our text to its lower case form.


* `token_pattern = (?u)\b\w\w+\b`
    
    The `token_pattern` parameter defaults to the regular expression `(?u)\b\w\w+\b`, which treats punctuation marks as delimiters and accepts alphanumeric strings of two or more characters as individual tokens or words.


* `stop_words`

    The `stop_words` parameter, if set to `'english'`, removes all words from our document set that match the list of English stop words defined in scikit-learn. Because our dataset is small and consists of short SMS messages rather than longer texts such as e-mail, we will not set this parameter.
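The effect of `lowercase` and `token_pattern` can be sketched with the standard `re` module. This is only an approximation of the preprocessing, not scikit-learn's actual code path, and `tokenize` is a hypothetical helper name.

```python
import re

# The default token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then extract tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

# Punctuation acts as a delimiter and case is folded.
print(tokenize("Hello, how are you!"))

# Single-character tokens such as "U" and "c" are dropped.
print(tokenize("U c already then say..."))
```

Note that one-letter SMS abbreviations never reach the vocabulary under the default pattern.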

You can take a look at all the parameter values of your `count_vector` object by simply printing out the object as follows:

In [0]:
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


<h3>CountVectorizer example</h3>

In [0]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

count_vector.fit(documents)
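# Returns the corpus vocabulary. In scikit-learn 1.2 and later, call
# get_feature_names_out() instead; get_feature_names() was removed.
count_vector.get_feature_names()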


['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [0]:
# Document-term matrix
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [0]:
# Easier to read as a pandas DataFrame
frequency_matrix = pd.DataFrame(doc_array, 
                                columns = count_vector.get_feature_names())
frequency_matrix

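| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |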
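To make the frequency matrix less of a black box, the same counts can be rebuilt with `collections.Counter`. This is a minimal sketch that assumes the default lowercasing and token pattern; `tokenize`, `vocabulary`, and `matrix` are illustrative names.

```python
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of token frequencies per document.
token_counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted(set().union(*token_counts))

# One row per document, one column per vocabulary word (missing words count 0).
matrix = [[counts[word] for word in vocabulary] for counts in token_counts]

print(vocabulary)
for row in matrix:
    print(row)
```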


<h2>Step 1: Split the dataset into training and test sets</h2>

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))


Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
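The 4179/1393 split reflects the default behaviour of `train_test_split`, which holds out 25% of the rows when no size is given. The sketch below shows the general idea with the standard library only. `simple_split` is a hypothetical helper, rounding the test size up is an assumption of this sketch, and it will not select the same rows as scikit-learn does with `random_state=1`.

```python
import math
import random

def simple_split(items, test_fraction=0.25, seed=1):
    """Shuffle indices and hold out a fraction for testing (hypothetical helper)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    # Assumption: round the test size up to a whole number of rows.
    n_test = math.ceil(len(items) * test_fraction)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

# Stand-in for the 5572 messages: just their row numbers.
train, test = simple_split(list(range(5572)))
print(len(train), len(test))
```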


<h2>Step 2: Build the bag-of-words model for our dataset</h2>

In [0]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
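Fitting only on the training data means the vocabulary is fixed before the test messages are seen, so words that appear only in the test set have no column and are ignored. A small sketch of that behaviour, using made-up messages and hypothetical helper names:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

train_docs = ["Call me now", "Win money now"]
test_docs = ["Call me tomorrow"]  # "tomorrow" never appears in training

# "Fit": learn the vocabulary from the training documents only.
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(doc):
    # "Transform": count only words that already have a column.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

print(vocabulary)
print(transform(test_docs[0]))
```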

<h2>Step 3: Train the Naive Bayes classifier</h2>

In [0]:
from sklearn.naive_bayes import MultinomialNB
# See the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
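The idea behind a multinomial Naive Bayes model with `alpha=1.0` (Laplace smoothing) and priors learned from the data can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the training messages are invented, and all names (`log_likelihood`, `predict`, and so on) are hypothetical.

```python
import math
from collections import Counter

# Toy training data (hypothetical): token lists with labels 0 = ham, 1 = spam.
train = [(["call", "me", "now"], 0),
         (["see", "you", "tomorrow"], 0),
         (["win", "money", "now"], 1),
         (["win", "free", "money"], 1)]
alpha = 1.0  # Laplace smoothing strength

vocabulary = sorted({tok for tokens, _ in train for tok in tokens})
classes = sorted({label for _, label in train})

# Log prior: fraction of training messages that belong to each class.
log_prior = {c: math.log(sum(1 for _, y in train if y == c) / len(train))
             for c in classes}

# Word counts accumulated per class.
word_counts = {c: Counter() for c in classes}
for tokens, label in train:
    word_counts[label].update(tokens)

def log_likelihood(word, c):
    # Smoothed estimate of P(word | class).
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocabulary)))

def predict(tokens):
    # Score each class in log space; words outside the vocabulary are skipped.
    scores = {c: log_prior[c] + sum(log_likelihood(t, c) for t in tokens if t in vocabulary)
              for c in classes}
    return max(scores, key=scores.get)

print(predict(["win", "money"]))
print(predict(["call", "me"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.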

<h2>Step 4: Evaluate the model</h2>

In [0]:
# First, use our model to make predictions on the test set
predictions = naive_bayes.predict(testing_data)

In [0]:
# Now build the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

array([[1203,    5],
       [  11,  174]], dtype=int64)

In [0]:
# In terms of TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
(tn,fp,fn,tp)

(1203, 5, 11, 174)
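The four counts can also be tallied by hand, which makes the layout of the matrix explicit: rows are true labels and columns are predicted labels, with spam (1) as the positive class. The label lists below are made-up toy values, not the notebook's predictions.

```python
# Hypothetical toy labels: 1 = spam (positive class), 0 = ham (negative class).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham predicted as ham
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham predicted as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as ham
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam

# Same layout as the matrix above: [[TN, FP], [FN, TP]].
matrix = [[tn, fp], [fn, tp]]
print(matrix)
```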

### Evaluation Metrics ###

**Accuracy** measures how often the classifier makes the correct prediction. It is the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

**Precision** tells us what proportion of the messages we classified as spam actually were spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, whether or not that classification was correct). In other words, it is the ratio

`[True Positives/(True Positives + False Positives)]`

**Recall (sensitivity)** tells us what proportion of the messages that actually were spam we classified as spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all messages that actually were spam. In other words, it is the ratio

`[True Positives/(True Positives + False Negatives)]`

**F1 score** is the harmonic mean of precision and recall:

`[2 * Precision * Recall/(Precision + Recall)]`
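As a quick check of these definitions, the metrics can be computed directly from the confusion-matrix counts reported earlier, (TN, FP, FN, TP) = (1203, 5, 11, 174). The F1 formula used here, the harmonic mean of precision and recall, is the standard definition.

```python
# Counts taken from the confusion matrix shown earlier in this notebook.
tn, fp, fn, tp = 1203, 5, 11, 174

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.9885
print(f"Precision: {precision:.4f}")  # 0.9721
print(f"Recall:    {recall:.4f}")     # 0.9405
print(f"F1:        {f1:.4f}")         # 0.9560
```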

In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562


<h2>Implement your own version of the Naive Bayes classifier</h2>

1. You may use scikit-learn to build the bag of words and the vocabulary. Implement your own version of the Naive Bayes classifier; do not use the scikit-learn implementation.
2. Use the same dataset used in this exercise.
3. Compare the results (precision, recall, accuracy) obtained with and without scikit-learn. Are they different? What do you think causes the difference? (I expect substantive answers.)