# Naive Bayes
The goal of this notebook is to experiment with the Naive Bayes classifier in its different implementations. We use the Gaussian classifier for multiclass prediction on the Iris dataset and the multinomial classifier for classifying text messages.


## Iris Dataset

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sns
 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

from sklearn import datasets

Utility function to visualize a confusion matrix.

In [0]:
def print_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14):
    """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap.
    
    Arguments
    ---------
    confusion_matrix: numpy.ndarray
        The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix. 
        Similarly constructed ndarrays can also be used.
    class_names: list
        An ordered list of class names, in the order they index the given confusion matrix.
    figsize: tuple
        A 2-long tuple, the first value determining the horizontal size of the output figure,
        the second determining the vertical size. Defaults to (10,7).
    fontsize: int
        Font size for axes labels. Defaults to 14.
        
    Returns
    -------
    None
        The heatmap is drawn on a new matplotlib figure, which the notebook displays inline.
        
    Reference
    ---------
    https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823
    
    """
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names, 
    )
    fig = plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

### Loading and exploring the dataset

In [0]:
iris = datasets.load_iris()
iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
iris_df.head()

In [0]:
iris_df.describe()

In [0]:
iris_df.hist()
plt.show()

### Splitting into training and test sets, normalization

In [0]:
X = iris_df[iris['feature_names']].values
y = iris_df['target'].values
 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Training and evaluating a Gaussian classifier

In [0]:
gnb = GaussianNB()
model = gnb.fit(X_train, y_train)
print('Accuracy of GaussianNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GaussianNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))
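
To make the model less of a black box, the prediction rule of `GaussianNB` can be reproduced by hand: each class gets a prior plus one independent Gaussian per feature, and the predicted class is the one with the highest log joint probability. The sketch below is only an illustration under stated assumptions: the `_demo` names and `manual_predict` are hypothetical, it refits a separate model on unscaled Iris data, and it reads the variances from `var_` (called `sigma_` in older scikit-learn releases).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(X_demo, y_demo, random_state=0)
gnb_demo = GaussianNB().fit(Xtr_demo, ytr_demo)

# Per-class feature means and variances estimated during fit.
means = gnb_demo.theta_
# Assumption: `var_` in recent scikit-learn, `sigma_` in older releases.
variances = gnb_demo.var_ if hasattr(gnb_demo, "var_") else gnb_demo.sigma_

def manual_predict(X):
    """Pick the class maximizing log P(c) + sum_j log N(x_j | mean_cj, var_cj)."""
    log_prior = np.log(gnb_demo.class_prior_)
    diff = X[:, None, :] - means[None, :, :]  # shape (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    return gnb_demo.classes_[np.argmax(log_prior + log_lik, axis=1)]

manual_pred = manual_predict(Xte_demo)
print((manual_pred == gnb_demo.predict(Xte_demo)).all())
```

The "naive" part is the sum over features: the per-feature log densities are simply added, which amounts to assuming the features are independent given the class.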

In [0]:
pred = gnb.predict(X_test)
print_confusion_matrix(confusion_matrix(y_test, pred),["Setosa","Versicolor","Virginica"])

In [0]:
pred_t = gnb.predict(X_train)
print_confusion_matrix(confusion_matrix(y_train, pred_t),["Setosa","Versicolor","Virginica"])

In [0]:
print(classification_report(y_test, pred))

In [0]:
print(classification_report(y_train, pred_t))

## Automated SMS spam filtering

Adapted from https://radimrehurek.com/data_science_python/.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import pandas as pd
import sklearn
import numpy as np
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

### Step 1: Load data, look around

Skipping the *real* first step (fleshing out the specs and working out what we actually want to do, which is often highly non-trivial in practice!), let's download the dataset we'll be using in this demo. Go to https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection and download the zip file. Unzip it under a `data` subdirectory. You should see a file called `SMSSpamCollection`, about 0.5MB in size:

```bash
$ ls -l data
total 1352
-rw-r--r--@ 1 kofola  staff  477907 Mar 15  2011 SMSSpamCollection
-rw-r--r--@ 1 kofola  staff    5868 Apr 18  2011 readme
-rw-r-----@ 1 kofola  staff  203415 Dec  1 15:30 smsspamcollection.zip
```

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

In [0]:
!unzip smsspamcollection.zip

This file contains **a collection of more than 5 thousand SMS phone messages** (see the `readme` file for more info):

In [0]:
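# Read the file line by line; the context manager closes it afterwards.
with open('SMSSpamCollection', encoding='utf-8') as f:
    messages = [line.rstrip() for line in f]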
len(messages)

A collection of texts is sometimes called a "corpus". Let's print the first ten messages in this SMS corpus:

In [0]:
for message_no, message in enumerate(messages[:10]):
    print(message_no, message)

We see that this is a [TSV](http://en.wikipedia.org/wiki/Tab-separated_values) ("tab separated values") file, where the first column is a label saying whether the given message is a normal message ("ham") or "spam". The second column is the message itself.
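
As a minimal sketch of that layout, each line can be split on its first tab to recover the label and the text. The two sample lines and the `split_line` helper are made up for illustration and are not taken from the dataset.

```python
# Two made-up lines in the same "label<TAB>message" layout as the file.
sample_lines = [
    "ham\tSee you at lunch?",
    "spam\tWINNER!! Claim your prize now",
]

def split_line(line):
    """Split on the first tab only, so a tab inside the message is preserved."""
    label, text = line.split("\t", 1)
    return label, text

parsed = [split_line(line) for line in sample_lines]
print(parsed)
```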

This corpus will be our labeled training set. Using these ham/spam examples, we'll **train a machine learning model to learn to discriminate between ham/spam automatically**. Then, with a trained model, we'll be able to **classify arbitrary unlabeled messages** as ham or spam.

[![](http://radimrehurek.com/data_science_python/plot_ML_flow_chart_11.png)](http://www.astroml.org/sklearn_tutorial/general_concepts.html#supervised-learning-model-fit-x-y)

Instead of parsing TSV (or CSV, or Excel...) files by hand, we can use Python's `pandas` library to do the work for us:

In [0]:
messages = pd.read_csv('SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
messages.head()

With `pandas`, we can also view aggregate statistics easily:

In [0]:
messages.groupby('label').describe().T

How long are the messages?

In [0]:
messages['length'] = messages['message'].str.len()
messages.head()

In [0]:
messages.length.plot(bins=20, kind='hist');

In [0]:
messages.length.describe()

What is that super long message?

In [0]:
print(list(messages.message[messages.length > 900]))

In [0]:
print(list(messages[messages.length > 900].index))

Is there any difference in message length between spam and ham?

In [0]:
messages.hist(column='length', by='label', bins=50, figsize=(12,4));

Good fun, but how do we make a computer understand the plain text messages themselves? Or can it understand such malformed gibberish at all?

### Step 2: Data to vectors

Now we'll convert each message, a raw text string at this point, into a vector that machine learning models can understand.

Doing that requires essentially three steps, in the bag-of-words model:

1. counting how many times a word occurs in each message (term frequency)
2. weighting the counts, so that frequent tokens get lower weight (inverse document frequency)
3. normalizing the vectors to unit length, to abstract from the original text length (L2 norm)

Each vector has as many dimensions as there are unique words in the SMS corpus.
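
The three steps can be sketched on a toy corpus. The snippet below is only an illustration: `toy_corpus` is made up, and it relies on scikit-learn's default settings, under which `CountVectorizer` followed by `TfidfTransformer` should give the same matrix as `TfidfVectorizer` alone.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical three-message corpus with 8 distinct words.
toy_corpus = [
    "free prize call now",
    "call me later",
    "see you later",
]

# Step 1: raw term counts, one row per message and one column per word.
counts = CountVectorizer().fit_transform(toy_corpus)
# Steps 2 and 3: IDF weighting followed by L2 normalization (the defaults).
tfidf_two_step = TfidfTransformer().fit_transform(counts)
# The same three steps wrapped in a single object.
tfidf_one_step = TfidfVectorizer().fit_transform(toy_corpus)

same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
row_norms = np.linalg.norm(tfidf_one_step.toarray(), axis=1)
print(counts.shape, same, row_norms)
```

Every row has unit length after step 3, which is what makes short and long messages comparable.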

`TfidfVectorizer` performs all three steps, turning the raw messages into a TF-IDF corpus at once:

In [0]:
vectorizer = TfidfVectorizer()
sms_tfidf = vectorizer.fit_transform(messages['message'].values)

print(sms_tfidf.shape)

### Step 3: Training a model, detecting spam

With messages represented as vectors, we can finally train our spam/ham classifier. This part is pretty straightforward, and there are many libraries that implement the training algorithms.
The `sklearn.naive_bayes` module includes implementations of:
- GaussianNB
- MultinomialNB 
- BernoulliNB

#### Which classifier class should we use?

#### When are the other two NB versions used?

In [0]:
classifier = MultinomialNB()
targets = messages['label'].values
clf = classifier.fit(sms_tfidf, targets)

Let's try classifying a couple of example messages:

In [0]:
examples = ['Free entry in 3 a wkly comp', 'Hello my friend']
example_vector = vectorizer.transform(examples)
predictions = classifier.predict(example_vector)

print(predictions)

Hooray! You can try it with your own texts, too.

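A natural question to ask is: how many messages do we classify correctly overall? Note that the cell below scores the model on the same messages it was trained on, so the figures are optimistic estimates of performance on unseen messages.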

In [0]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

all_predictions = clf.predict(sms_tfidf)
accuracy = accuracy_score(messages['label'], all_predictions)
cm = confusion_matrix(messages['label'], all_predictions)
statistics = classification_report(messages['label'], all_predictions)

print('Accuracy: %.4f\n' % accuracy)
print(statistics)

In [0]:
print_confusion_matrix(cm,["ham", "spam"])
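
For a less optimistic estimate, the model can be scored on messages it has not seen during training. The following is a minimal sketch on a made-up toy corpus, not a result for the SMS data: `toy_texts`, `toy_labels` and `spam_pipeline` are hypothetical names, and the 25% test split is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus: 6 spam-like and 6 ham-like messages.
toy_texts = [
    "free prize call now", "win cash prize today", "urgent claim your free prize",
    "free entry win cash", "call now to claim cash", "win a free phone today",
    "see you at lunch", "are you coming home", "call me when you arrive",
    "lunch at home today", "see you later", "coming home for lunch",
]
toy_labels = ["spam"] * 6 + ["ham"] * 6

text_train, text_test, label_train, label_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# The pipeline fits the vectorizer on the training texts only, so no
# vocabulary or IDF statistics leak in from the held-out messages.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(text_train, label_train)

held_out_predictions = spam_pipeline.predict(text_test)
held_out_accuracy = accuracy_score(label_test, held_out_predictions)
print(held_out_accuracy)
```

The same pattern should carry over to the SMS corpus by passing `messages['message']` and `messages['label']` to `train_test_split`.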

#### By default, MultinomialNB uses additive Laplace smoothing (alpha = 1). Change the classifier to use Lidstone smoothing (alpha < 1). Explain the new results, comparing them with the default version.

In [0]:
classifier = MultinomialNB(alpha=0.01)
targets = messages['label'].values
clf = classifier.fit(sms_tfidf, targets)

all_predictions = clf.predict(sms_tfidf)
accuracy = accuracy_score(messages['label'], all_predictions)
cm = confusion_matrix(messages['label'], all_predictions)
statistics = classification_report(messages['label'], all_predictions)

print('Accuracy: %.4f\n' % accuracy)
print(statistics)

In [0]:
print_confusion_matrix(cm,["ham", "spam"])
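
To see what `alpha` does, the smoothed per-class term probabilities can be computed by hand as `(N_ci + alpha) / (N_c + alpha * n_features)` and compared with the `feature_log_prob_` attribute of a fitted `MultinomialNB`. This is a sketch on a tiny made-up count matrix; `X_counts`, `y_counts` and `smoothed_log_prob` are hypothetical names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents x 3 terms, two classes.
X_counts = np.array([[2, 1, 0], [1, 0, 0], [0, 1, 3], [0, 0, 2]])
y_counts = np.array([0, 0, 1, 1])

def smoothed_log_prob(alpha):
    """log((N_ci + alpha) / (N_c + alpha * n_features)) for each class c and term i."""
    per_class = np.array([X_counts[y_counts == c].sum(axis=0) for c in (0, 1)])
    smoothed = per_class + alpha
    return np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

for alpha in (1.0, 0.01):
    nb = MultinomialNB(alpha=alpha).fit(X_counts, y_counts)
    matches = np.allclose(nb.feature_log_prob_, smoothed_log_prob(alpha))
    # Probability given to the third term in class 0, where it never occurs.
    unseen = np.exp(smoothed_log_prob(alpha))[0, 2]
    print(alpha, matches, unseen)
```

A smaller `alpha` gives unseen terms far less probability mass, so the model trusts the observed counts more and fits the training data more closely.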