<a href="https://colab.research.google.com/github/dfalci/sandbox/blob/master/naivebayesdozero.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bayes' Theorem

### Probability of <b>A</b> occurring given the evidence <b>B</b>

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

### What is the probability that a message is spam given the word "grátis" (free)?

$$P(spam|gratis) = \frac{P(gratis|spam) P(spam)}{P(gratis)}$$ 

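Evidence (how often the word occurs overall):
$$P(gratis) = \frac{count(gratis)}{n_{words}}$$

Prior (how often a message is spam):
$$P(spam) = \frac{count(spam)}{n_{examples}}$$

Likelihood (how often the word occurs among the words of spam messages):
$$P(gratis|spam) = \frac{count(gratis, spam)}{n_{words, spam}}$$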
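As a quick numeric check of the formula above, the sketch below plugs made-up counts into Bayes' theorem. All counts and variable names are hypothetical and chosen only for illustration; to keep the arithmetic simple, every probability here is estimated per message (does the message contain the word or not).

```python
# Hypothetical counts, chosen only for illustration
n_messages = 100          # total messages
n_spam = 40               # messages labelled spam
n_gratis_in_spam = 30     # spam messages containing "gratis"
n_gratis_in_ham = 5       # non-spam messages containing "gratis"

p_spam = n_spam / n_messages                                   # prior P(spam)
p_gratis_given_spam = n_gratis_in_spam / n_spam                # likelihood P(gratis|spam)
p_gratis = (n_gratis_in_spam + n_gratis_in_ham) / n_messages   # evidence P(gratis)

# Bayes' theorem: posterior = likelihood * prior / evidence
p_spam_given_gratis = p_gratis_given_spam * p_spam / p_gratis
print(round(p_spam_given_gratis, 4))  # 0.8571
```

The result equals 30/35, the share of "gratis" messages that are spam, which is what counting directly would give.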

## With multiple words, this becomes:

$$P(spam|você, acaba, de, ganhar) =\frac{P(você,acaba,de,ganhar|spam)P(spam)}{P(você,acaba,de,ganhar)} $$

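Under the naive assumption that the words are conditionally independent given the class, the likelihood factorizes into per-word terms:

$$P(você,acaba,de,ganhar|spam) = P(você|spam)P(acaba|spam)P(de|spam)P(ganhar|spam)$$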
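If the words are treated as independent given the class (the naive assumption), the numerator becomes the prior times a product of per-word probabilities. Multiplying many small numbers underflows quickly, which is why the implementation further down sums logarithms instead. The sketch below illustrates this with made-up values; `p_word_given_spam` and `p_spam` are hypothetical.

```python
import math

# Hypothetical per-word likelihoods P(word | spam) and prior P(spam)
p_word_given_spam = {"voce": 0.05, "acaba": 0.01, "de": 0.08, "ganhar": 0.02}
p_spam = 0.4

words = ["voce", "acaba", "de", "ganhar"]

# Direct product: P(spam) * prod P(w | spam) (unnormalized posterior score)
product = p_spam
for w in words:
    product *= p_word_given_spam[w]

# Same quantity in log-space: log P(spam) + sum log P(w | spam)
log_score = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in words)
print(product, math.exp(log_score))  # the two values agree up to rounding

# For a long message the direct product underflows to 0.0; the log-sum stays finite
long_message = words * 200
underflow = p_spam
for w in long_message:
    underflow *= p_word_given_spam[w]
long_log_score = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in long_message)
print(underflow, long_log_score)
```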



In [0]:
!sudo pip install unidecode
import numpy as np
import re
from sklearn.datasets import fetch_20newsgroups
import unidecode
from functools import reduce

categories=['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
newsgroups_train=fetch_20newsgroups(subset='train',categories=categories)
newsgroups_test=fetch_20newsgroups(subset='test', categories=categories)

x_train=newsgroups_train.data
y_train=newsgroups_train.target

x_test=newsgroups_test.data
y_test=newsgroups_test.target

print('Number of training records:', len(x_train))
print('Number of test records:', len(x_test))

def normalizar(str_arg):
    v = unidecode.unidecode(str_arg)  # strip accents
    v = re.sub(r'[^a-z\s]+', '', v, flags=re.IGNORECASE)  # drop everything that is not a letter or whitespace
    v = re.sub(r'\s+', ' ', v)  # collapse runs of whitespace into a single space
    v = v.lower()  # lowercase
    return v

Number of training records: 2257
Number of test records: 1502
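To make the effect of `normalizar` concrete, here is a standalone sketch of the same steps. It assumes the standard-library `unicodedata` module as a stand-in for `unidecode` so that it runs without extra packages (the two can differ on characters other than plain accented Latin letters); the function name `normalize_sketch` is hypothetical.

```python
import re
import unicodedata

def normalize_sketch(text):
    # Strip accents: decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    v = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    v = re.sub(r"[^a-z\s]+", "", v, flags=re.IGNORECASE)  # keep only letters and whitespace
    v = re.sub(r"\s+", " ", v)                            # collapse runs of whitespace
    return v.lower()

print(normalize_sketch("Você   acaba de GANHAR!!!").split())
# ['voce', 'acaba', 'de', 'ganhar']
```

Digits and punctuation are dropped entirely, so tokens such as prices or e-mail addresses lose information at this step.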


In [0]:
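class Estatistica(object):
    """Word counts and probabilities for a single class."""

    def __init__(self):
        self.w2qt = {}      # word -> number of occurrences in this class
        self.w2pb = {}      # word -> P(word | class)
        self.pa = 0         # log prior, log P(class)
        self.qt = 0         # number of documents in this class
        self.qt_tokens = 0  # total number of tokens in this class

    def calc_probabilidade_classe(self, qt_total):
        # log prior: share of the training documents that belong to this class
        self.pa = np.log(self.qt/float(qt_total))
        # likelihood of each word: its share of the tokens seen in this class
        for w in self.w2qt:
            self.w2pb[w] = self.w2qt[w]/float(self.qt_tokens)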

    def add_words(self, words):
        self.qt += 1
        for w in words:
            self.__add_w(w)

    def __add_w(self, w):
        self.w2qt[w] = self.w2qt.get(w, 0) + 1
        self.qt_tokens += 1


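    def get_px(self, words):
        # log P(words | class) under the naive independence assumption;
        # a word never seen in this class falls back to 1/qt_tokens.
        # sum() also handles an empty word list (it returns 0).
        return sum(np.log(self.w2pb.get(w, 1/float(self.qt_tokens))) for w in words)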

class NB(object):

    def __init__(self, classes, preprocessamento):
        self.classes = classes
        self.quantidade_classes = len(classes)
        self.estatisticas = {}
        self.preprocessamento = preprocessamento
        for rotulo in self.classes:
            self.estatisticas[rotulo] = Estatistica()


    def treinar(self, sentencas, respostas):
        self.total_amostra = len(sentencas)
        self.geral = Estatistica()
        for i in range(0, len(sentencas)):
            wds = self.preprocessamento(sentencas[i]).split()
            l = respostas[i]
            self.estatisticas[l].add_words(wds)
            self.geral.add_words(wds)
        for l in self.classes:
            self.estatisticas[l].calc_probabilidade_classe(self.total_amostra)
        self.geral.calc_probabilidade_classe(self.total_amostra)


    def prever(self, sentenca):
        wds = self.preprocessamento(sentenca).split()
        resultados = {}
        for classe in self.classes:
            # p(a|b) = p(b|a) * p(a) / p(b); p(b) is the same for every class,
            # so it is dropped and the score is log p(b|a) + log p(a)
            obj = self.estatisticas[classe]
            resultados[classe] = (obj.get_px(wds) + obj.pa)
        return self.__get_melhor_resultado(resultados), resultados
    
    
    def __get_melhor_resultado(self, resultados):
        invertido = [(value, key) for key, value in resultados.items()]
        return max(invertido)[1]
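`get_px` falls back to `1/qt_tokens` for a word the class has never seen. A common alternative, which is not what this notebook does, is Laplace (add-one) smoothing, where every vocabulary word receives one extra count. The sketch below is a minimal illustration under that assumption; `laplace_log_prob`, `counts`, and `vocab_size` are hypothetical names, and the counts are made up.

```python
import math

def laplace_log_prob(word, counts, vocab_size):
    """log P(word | class) with add-one smoothing (hypothetical helper)."""
    total_tokens = sum(counts.values())
    return math.log((counts.get(word, 0) + 1) / (total_tokens + vocab_size))

# Toy per-class word counts, made up for illustration
counts = {"gol": 7, "trave": 4, "passe": 3}
vocab_size = 5  # assumed size of the full vocabulary across all classes

seen = laplace_log_prob("gol", counts, vocab_size)         # log(8 / 19)
unseen = laplace_log_prob("dinheiro", counts, vocab_size)  # log(1 / 19)
print(seen, unseen)
```

With this scheme the smoothed probabilities over the whole vocabulary still sum to 1, which the `1/qt_tokens` fallback does not guarantee.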


# Let's take a look at some of the data:



In [0]:

for i in range(20):
    print('\n\n\n*** NEW MESSAGE ***\n')
    print("""CLASS : {}\n\n{}""".format(y_test[i], x_test[i]))




*** NEW MESSAGE ***

CLASS : 2

From: brian@ucsd.edu (Brian Kantor)
Subject: Re: HELP for Kidney Stones ..............
Organization: The Avant-Garde of the Now, Ltd.
Lines: 12
NNTP-Posting-Host: ucsd.edu

As I recall from my bout with kidney stones, there isn't any
medication that can do anything about them except relieve the pain.

Either they pass, or they have to be broken up with sound, or they have
to be extracted surgically.

When I was in, the X-ray tech happened to mention that she'd had kidney
stones and children, and the childbirth hurt less.

Demerol worked, although I nearly got arrested on my way home when I barfed
all over the police car parked just outside the ER.
	- Brian




*** NEW MESSAGE ***

CLASS : 2

From: rind@enterprise.bih.harvard.edu (David Rind)
Subject: Re: Candida(yeast) Bloom, Fact or Fiction
Organization: Beth Israel Hospital, Harvard Medical School, Boston Mass., USA
Lines: 37
NNTP-Posting-Host: enterprise.bih.harvard.edu

In article <1993Apr26

In [0]:
rotulos = np.unique(y_train)
print(rotulos)
nb = NB(rotulos, normalizar)
nb.treinar(x_train, y_train)



print(nb.prever('I bought a new PC with an excellent gpu.'))
print(nb.prever('The route and extent of exposure are important in determining the potential for secondary contamination. Victims who were exposed only to gas or vapor and have no gross deposition of the material on their clothing or skin are not likely to carry significant amounts of chemical beyond the Hot Zone and are not likely to pose risks of secondary contamination to response personnel.'))
print(nb.prever('I need to take my medicine'))

[0 1 2 3]
(1, {0: -72.45472670782844, 1: -65.10543409051742, 2: -67.21223115860131, 3: -71.5856782286971})
(2, {0: -479.65686085399904, 1: -493.4378496652693, 2: -463.1366503716267, 3: -490.28001839677177})
(2, {0: -41.405703178849755, 1: -40.428789140843925, 2: -36.58814622652386, 3: -40.38887042161673})
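The dictionaries printed above hold unnormalized log-scores, log P(b|a) + log P(a), not probabilities. Because P(b) is the same for every class, exponentiating the scores and dividing by their sum recovers the posterior. A minimal sketch using the scores printed for the first sentence; the helper name `log_scores_to_posterior` is hypothetical, and the maximum is subtracted before exponentiating to avoid underflow.

```python
import math

def log_scores_to_posterior(log_scores):
    """Turn per-class log-scores into probabilities that sum to 1."""
    m = max(log_scores.values())
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}  # shift for stability
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Log-scores copied from the first prediction above
scores = {0: -72.45472670782844, 1: -65.10543409051742,
          2: -67.21223115860131, 3: -71.5856782286971}

posterior = log_scores_to_posterior(scores)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

The class with the highest log-score is also the one with the highest posterior, so this step changes how the result is read, not which class is chosen.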


In [0]:
acertos = 0
for sentenca, esperado in zip(x_test, y_test):
    previsao, _ = nb.prever(sentenca)
    if previsao == esperado:
        acertos += 1

# accuracy: fraction of test messages classified correctly
print(acertos / len(y_test))

0.918774966711052
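Overall accuracy hides which classes get confused with each other. A small sketch of a per-class breakdown follows; it uses made-up label lists rather than the notebook's predictions, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2, 0, 3]

counts = confusion_counts(y_true, y_pred)
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(y_true)
print(accuracy)  # 0.75

# Off-diagonal pairs (true != predicted) show where the classifier goes wrong
for (t, p), n in sorted(counts.items()):
    print("true", t, "predicted", p, "count", n)
```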


In [0]:
# Toy corpus: the first five sentences are about sports (esporte),
# the last four about investing (investimento)
x_train = [
    'fazer gol marcou trave gol gol gol trave tempo',
    'marcou passe passe assistencia trave gol hoje',
    'passe assistencia artilheiro jogo partida momento',
    'jogo partida grande grande meteu grande assistencia gol',
    'investimento partida jogo gol trave marcou',

    'investimento dinheiro grande investimento grande',
    'valor investimento dinheiro grande investimento',
    'fazer investimento economico valor momento',
    'tempo momento hoje dinheiro economia investimento'
]
y_train = [
    'esporte',
    'esporte',
    'esporte',
    'esporte',
    'esporte',
    'investimento',
    'investimento',
    'investimento',
    'investimento',
]

x_test = [
    'grande momento marcou gol',
    'grande momento investimento valor',
    'trave gol hoje'
]

outros_rotulos = np.unique(y_train)
print(outros_rotulos)
nb2 = NB(outros_rotulos, normalizar)

nb2.treinar(x_train, y_train)
for sentenca in x_test:
    rotulo, pontuacoes = nb2.prever(sentenca)
    # predicted label and per-class log-scores, rounded to two decimals
    print(rotulo, {str(c): round(float(p), 2) for c, p in pontuacoes.items()})

['esporte' 'investimento']
esporte {'esporte': -10.78, 'investimento': -11.2}
investimento {'esporte': -13.82, 'investimento': -8.71}
esporte {'esporte': -8.01, 'investimento': -9.94}
