<a href="https://colab.research.google.com/github/davrodrod/algoritmosIA/blob/master/NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes classifier

Source: https://scikit-learn.org/stable/modules/naive_bayes.html

There are several variants. The first is Gaussian Naive Bayes, which assumes that the features X follow a Gaussian distribution within each class (the per-class mean and standard deviation are estimated automatically from the training data).

In [0]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 75 points : 4
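
To make the Gaussian assumption concrete, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation: the helper names (`fit_gaussian_nb`, `log_gaussian`, `predict`), the toy data and the small variance floor are all assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))  # one tuple of values per feature
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [mean(c) for c in cols],
            # small floor (an assumption) avoids dividing by zero on constant features
            "var": [pvariance(c) + 1e-9 for c in cols],
        }
    return model

def log_gaussian(x, mu, var):
    """Log of the normal density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, x):
    """Return the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for label, p in model.items():
        scores[label] = math.log(p["prior"]) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, p["mean"], p["var"])
        )
    return max(scores, key=scores.get)

# Toy data (made up): two features, two classes
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
pred_a = predict(model, [1.1, 2.0])
pred_b = predict(model, [3.1, 0.5])
print(pred_a, pred_b)
```

The "naive" part is the sum over features inside `predict`: each feature contributes its own independent Gaussian term given the class.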


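Other variants:

- Multinomial Naive Bayes (MultinomialNB). Models the features as counts drawn from a multinomial distribution. For each class y it estimates a parameter vector theta_y = (theta_y1, ..., theta_yn), where theta_yi is the probability of feature i appearing in a sample of class y. It is commonly used with word counts in text classification.

- Complement Naive Bayes (ComplementNB). An adaptation of MNB that is particularly suited to imbalanced datasets.

- Bernoulli Naive Bayes (BernoulliNB). The data follow multivariate Bernoulli distributions, i.e. each feature is a binary variable.

- Categorical Naive Bayes (CategoricalNB). Each feature follows its own categorical distribution.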

If the dataset is too large to process at once, these classifiers provide a partial_fit method for training incrementally, one batch at a time.
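
As a rough sketch of how incremental training might look, the snippet below feeds the same iris training split to `GaussianNB.partial_fit` in small chunks. The batch size of 25 is an arbitrary choice for illustration, and on a real out-of-memory dataset the batches would come from disk rather than from an in-memory array.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gnb = GaussianNB()
classes = np.unique(y)  # all labels must be declared on the first partial_fit call
batch_size = 25         # arbitrary chunk size, chosen only for illustration

for start in range(0, len(X_train), batch_size):
    stop = start + batch_size
    # each call updates the running per-class statistics with one more batch
    gnb.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)

y_pred = gnb.predict(X_test)
accuracy = (y_pred == y_test).mean()
print("Accuracy after incremental training: %.3f" % accuracy)
```

The `classes` argument matters because an early batch may not contain every label, and the model needs to know the full label set up front.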


# Use for text processing

Example, source: https://community.alteryx.com/t5/Data-Science-Blog/Naive-Bayes-in-Python/ba-p/138424

For our example, we'll try to classify whether a Wikipedia page is about a dinosaur or a cryptid (an animal from cryptozoology; think Loch Ness Monster or Bigfoot).

We'll be using the text of each Wikipedia article as features. What we'd expect is that certain words like "sighting" or "hoax" would be more commonly found in articles about cryptozoology, while words like "fossil" would be more commonly found in articles about dinosaurs.

 

We'll do some basic word-tokenization to count the occurrences of each word and then calculate conditional probabilities for each word as it pertains to our 2 categories.


## Tokenizing and counting
 

First things first. We need to turn our files full of text into something a little more mathy. The simplest way to do this is to take the bag of words approach. That just means we'll be counting how many times each word appears in each document. We'll also perform a little text normalization by removing punctuation and lowercasing the text (this means "Hello," and "hello" will now be considered the same word).

 

Once we've cleaned the text, we need a way to delineate words. A simple approach is to just use a good ol' regex that splits on any run of non-word characters (whitespace and punctuation): \W+.

In [0]:
import re
import string
from prettytable import PrettyTable

def remove_punctuation(s):
    "see http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python"
    table = str.maketrans("","", string.punctuation)
    return s.translate(table)

def tokenize(text):
    text = remove_punctuation(text)
    text = text.lower()
    # raw string so that \W is passed to the regex engine unchanged
    return re.split(r"\W+", text)

def count_words(words):
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0.0) + 1.0
    return wc

s = "Hello my name, is Greg. My favorite food is pizza."
count_words(tokenize(s))

{'favorite': 1.0,
 'food': 1.0,
 'greg': 1.0,
 'hello': 1.0,
 'is': 2.0,
 'my': 2.0,
 'name': 1.0,
 'pizza': 1.0}
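
For comparison, the same bag-of-words count can be sketched with the standard library's `collections.Counter`. This is an alternative, not the original author's code: it extracts words with `re.findall` instead of splitting, so it never yields empty-string tokens, but it also treats an apostrophe as a separator ("don't" becomes "don" and "t"), unlike the punctuation-stripping version above.

```python
import re
from collections import Counter

def tokenize(text):
    # \w+ matches runs of letters, digits and underscores, so whitespace
    # and punctuation are dropped and no empty strings are produced
    return re.findall(r"\w+", text.lower())

s = "Hello my name, is Greg. My favorite food is pizza."
counts = Counter(tokenize(s))

# the most frequent words come first
print(counts.most_common(3))
```

`Counter` returns 0 for missing words, which removes the need for `wc.get(word, 0.0)`.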

## Calculating our probabilities
 

So now that we can count words, let's get cooking. The code below is going to do the following:

 

- open each document
- label it as either "crypto" or "dino" and keep track of how many of each label there are (priors)
- count the words for the document
- add those counts to the vocab, or a corpus level word count
- add those counts to the word_counts, for a category level word count

In [0]:
!pip install sh
from sh import find

# This is the training step

# setup some structures to store our data
vocab = {}
word_counts = {
    "crypto": {},
    "dino": {}
}
priors = {
    "crypto": 0.,
    "dino": 0.
}
docs = []
for f in find("sample_data"):
    f = f.strip()
    if not f.endswith(".txt"):
        # skip non .txt files
        continue
    elif "cryptid" in f:
        category = "crypto"
    else:
        category = "dino"
    docs.append((category, f))
    # ok time to start counting stuff...
    priors[category] += 1
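    # read the file and close it promptly; UTF-8 is assumed for the article text
    with open(f, encoding="utf-8") as fh:
        text = fh.read()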
    words = tokenize(text)
    counts = count_words(words)
    for word, count in counts.items():
        # if we haven't seen a word yet, let's add it to our dictionaries with a count of 0
        if word not in vocab:
            vocab[word] = 0.0 # use 0.0 here so Python does "correct" math
        if word not in word_counts[category]:
            word_counts[category][word] = 0.0
        vocab[word] += count
        word_counts[category][word] += count
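
The cell above relies on the third-party `sh` package, which wraps the Unix `find` command. A portable way to walk and label the files, using only the standard library, might look like the sketch below. The helper name `iter_labeled_docs` and the demo file names are hypothetical, and the labeling rule checks the path relative to the root rather than the full path.

```python
import tempfile
from pathlib import Path

def iter_labeled_docs(root):
    """Yield (category, path) for every .txt file found under root.

    Mirrors the labeling rule used above: a path containing "cryptid"
    is labeled "crypto", anything else is labeled "dino".
    """
    root = Path(root)
    for path in sorted(root.rglob("*.txt")):
        relative = str(path.relative_to(root))
        category = "crypto" if "cryptid" in relative else "dino"
        yield category, path

# Demo on a throwaway directory; these file names are made up
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "cryptid").mkdir()
    (base / "dinosaur").mkdir()
    (base / "cryptid" / "Yeti.txt").write_text("apelike cryptid", encoding="utf-8")
    (base / "dinosaur" / "Tyrannosaurus.txt").write_text("fossil", encoding="utf-8")
    (base / "dinosaur" / "notes.md").write_text("ignored", encoding="utf-8")
    labeled = [(category, path.name) for category, path in iter_labeled_docs(base)]

print(labeled)
```

Each yielded pair can then go through the same counting loop as in the training cell.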



In [0]:
new_doc = open("sample_data/Yeti.txt").read()
words = tokenize(new_doc)
counts = count_words(words)
print(counts)

{'the': 12.0, 'yeti': 4.0, 'ˈjɛti3': 1.0, 'or': 1.0, 'abominable': 1.0, 'snowman': 1.0, 'nepali': 1.0, 'ह': 1.0, 'मम': 1.0, 'नव': 1.0, 'lit': 1.0, 'mountain': 1.0, 'man': 1.0, 'is': 2.0, 'an': 3.0, 'apelike': 1.0, 'cryptid': 1.0, 'taller': 1.0, 'than': 2.0, 'average': 1.0, 'human': 1.0, 'that': 2.0, 'said': 1.0, 'to': 5.0, 'inhabit': 1.0, 'himalayan': 1.0, 'region': 2.0, 'of': 8.0, 'nepal': 1.0, 'and': 4.0, 'tibet4': 1.0, 'names': 1.0, 'mehteh': 1.0, 'are': 2.0, 'commonly': 1.0, 'used': 1.0, 'by': 1.0, 'people': 1.0, 'indigenous': 1.0, 'part': 1.0, 'their': 1.0, 'history': 1.0, 'mythology': 1.0, 'stories': 1.0, 'first': 1.0, 'emerged': 1.0, 'as': 2.0, 'a': 4.0, 'facet': 1.0, 'western': 1.0, 'popular': 1.0, 'culture': 1.0, 'in': 2.0, '19th': 1.0, 'century': 1.0, 'scientific': 1.0, 'community': 1.0, 'generally': 1.0, 'regards': 1.0, 'legend': 1.0, 'given': 1.0, 'lack': 1.0, 'conclusive': 1.0, 'evidence5': 1.0, 'but': 1.0, 'it': 1.0, 'remains': 1.0, 'one': 1.0, 'most': 1.0, 'famous': 1.0,

Alright, we've got our counts. Now we'll calculate P(word|category) for each word and multiply these conditional probabilities together to get a score for P(category|set of words). To prevent numerical errors, we're going to perform the operations in log space: multiplying many small probabilities quickly underflows floating-point precision, whereas adding their logs, using log(a*b) = log(a) + log(b), stays well within range.
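
As a quick aside, the underflow problem is easy to reproduce. The sketch below uses 100 made-up word probabilities of 1e-5 each: their direct product is too small for a float and collapses to zero, while the sum of their logs is an ordinary number.

```python
import math

probs = [1e-5] * 100  # 100 made-up word probabilities

# direct multiplication: the true value is 1e-500, below the float range
product = 1.0
for p in probs:
    product *= p

# log space: add the logs instead of multiplying the probabilities
log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # finite, roughly -1151.3
```

Two categories that both underflow to 0.0 cannot be compared, but their log scores still can.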

In [0]:
import math

prior_dino = (priors["dino"] / sum(priors.values()))
prior_crypto = (priors["crypto"] / sum(priors.values()))

log_prob_crypto = 0.0
log_prob_dino = 0.0
for w, cnt in counts.items():
    # skip words that we haven't seen before, or words of 3 letters or fewer
    if w not in vocab or len(w) <= 3:
        continue
    # calculate the probability that the word occurs at all
    p_word = vocab[w] / sum(vocab.values())
    # for both categories, calculate P(word|category), or the probability a 
    # word will appear, given that we know that the document is <category>
    p_w_given_dino = word_counts["dino"].get(w, 0.0) / sum(word_counts["dino"].values())
    p_w_given_crypto = word_counts["crypto"].get(w, 0.0) / sum(word_counts["crypto"].values())
    # add new probability to our running total: log_prob_<category>. if the probability 
    # is 0 (i.e. the word never appears for the category), then skip it
    if p_w_given_dino > 0:
        log_prob_dino += math.log(cnt * p_w_given_dino / p_word)
    if p_w_given_crypto > 0:
        log_prob_crypto += math.log(cnt * p_w_given_crypto / p_word)

# print out the results; we need to go from logspace back to "regular" space,
# so we take the EXP of the log_prob (don't believe me? try this: math.exp(math.log(10) + math.log(3)))
print("Score(dino)  :", math.exp(log_prob_dino + math.log(prior_dino)))
print("Score(crypto):", math.exp(log_prob_crypto + math.log(prior_crypto)))

Score(dino)  : 5498324743.005088
Score(crypto): 5.457176039492645
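
The cell above skips any word with zero probability in a category and divides by P(word), so the scores it prints are not normalized probabilities. A common alternative, sketched below, is add-one (Laplace) smoothing, which gives unseen words a small non-zero probability instead. This swaps in a different technique from the one the source uses; the function names and the four toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def train(docs):
    """docs is a list of (category, list_of_tokens) pairs."""
    priors = Counter()
    word_counts = {}
    vocab = set()
    for category, tokens in docs:
        priors[category] += 1
        word_counts.setdefault(category, Counter()).update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def log_posteriors(tokens, priors, word_counts, vocab, alpha=1.0):
    """Unnormalized log P(category | tokens) with add-alpha smoothing."""
    n_docs = sum(priors.values())
    scores = {}
    for category, counts in word_counts.items():
        # the denominator includes alpha once per vocabulary word
        total = sum(counts.values()) + alpha * len(vocab)
        score = math.log(priors[category] / n_docs)
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen during training
            score += math.log((counts[w] + alpha) / total)
        scores[category] = score
    return scores

# Toy training documents (made up)
docs = [
    ("crypto", "sighting hoax legend sighting".split()),
    ("crypto", "legend footprint hoax".split()),
    ("dino", "fossil bone cretaceous fossil".split()),
    ("dino", "fossil species bone".split()),
]
priors, word_counts, vocab = train(docs)
scores = log_posteriors("hoax sighting fossil".split(), priors, word_counts, vocab)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the scores stay in log space, the categories are compared directly without calling `math.exp`, which avoids underflow on long documents.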
