# MULTINOMIAL NAIVE BAYES - Scikit-Learn

### Multinomial Naive Bayes 

The Multinomial Naive Bayes classifier is suited to classification with discrete features (for example, word counts for text classification). The multinomial distribution normally requires integer feature counts. In practice, however, fractional counts such as tf-idf may also work.
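To make the idea concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with additive (Laplace) smoothing. It is an illustration only, not scikit-learn's implementation; the function names, the four-word toy vocabulary, and the toy labels are all hypothetical.

```python
import math
from collections import defaultdict


def train_multinomial_nb(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods from count vectors."""
    n_features = len(count_rows[0])
    classes = sorted(set(labels))
    class_doc_counts = defaultdict(int)
    feature_totals = {c: [0.0] * n_features for c in classes}
    for row, label in zip(count_rows, labels):
        class_doc_counts[label] += 1
        for j, value in enumerate(row):
            feature_totals[label][j] += value
    log_prior = {c: math.log(class_doc_counts[c] / len(labels)) for c in classes}
    log_likelihood = {}
    for c in classes:
        # alpha is added to every feature so unseen words never get probability zero
        total = sum(feature_totals[c]) + alpha * n_features
        log_likelihood[c] = [math.log((v + alpha) / total) for v in feature_totals[c]]
    return classes, log_prior, log_likelihood


def predict_multinomial_nb(model, row):
    """Return the class with the highest log posterior score for one count vector."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)


# Hypothetical toy vocabulary: ["god", "love", "opengl", "gpu"]
toy_counts = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 1, 1, 2]]
toy_labels = ["religion", "religion", "graphics", "graphics"]
toy_model = train_multinomial_nb(toy_counts, toy_labels)

toy_prediction = predict_multinomial_nb(toy_model, [1, 1, 0, 0])
print(toy_prediction)
```

Because the score is a weighted sum of log likelihoods, the same prediction function also accepts fractional inputs such as `[0.5, 0.5, 0, 0]`, which is why tf-idf features can work in practice.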

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

In [5]:
# Categories
categorias = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]

In [6]:
twenty_train = fetch_20newsgroups(subset="train", categories=categorias, shuffle=True, random_state=42)

In [9]:
# Classes
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [10]:
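# len() of the returned Bunch counts its fields, not the documents;
# use len(twenty_train.data) for the number of documents
len(twenty_train)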

5

In [12]:
# Viewing some of the data (features): the first three lines of the first document
print("\n".join(twenty_train.data[0].split("\n")[:3]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton


In [13]:
# Viewing the target of the first document
print(twenty_train.target_names[twenty_train.target[0]])

comp.graphics


In [14]:
# scikit-learn stores the labels as an array of integers, for speed
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2], dtype=int64)

In [16]:
# Viewing the class names of the first 10 records
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


### BAG OF WORDS

In [25]:
# Tokenizing: split each document into word tokens and count them
count_vect = CountVectorizer() # builds the vocabulary and the vectors of token counts
X_train_counts = count_vect.fit_transform(twenty_train.data) # fit_transform learns the vocabulary and returns the document-term count matrix
count_vect.vocabulary_.get(u'algorithm') # column index of the term 'algorithm' (not displayed, since it is not the last expression)
X_train_counts.shape

# The shape is (number of documents, vocabulary size) of the matrix of OCCURRENCES

(2257, 35788)
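A bag of words can be sketched without scikit-learn to show where the two numbers in the shape come from. The helper below is hypothetical; it only mimics the defaults of `CountVectorizer` as I understand them (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) and is not its actual implementation.

```python
import re
from collections import Counter


def toy_count_vectorize(docs):
    """Return (vocabulary, rows): a term-to-column mapping and one count vector per document."""
    token_pattern = re.compile(r"\b\w\w+\b")  # tokens with at least two word characters
    tokenized = [token_pattern.findall(doc.lower()) for doc in docs]
    all_terms = sorted({term for tokens in tokenized for term in tokens})
    vocabulary = {term: idx for idx, term in enumerate(all_terms)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, n in Counter(tokens).items():
            row[vocabulary[term]] = n
        rows.append(row)
    return vocabulary, rows


toy_docs = ["God is love", "OpenGL on the GPU is fast, fast"]
toy_vocabulary, toy_rows = toy_count_vectorize(toy_docs)

# Analogous to X_train_counts.shape: (number of documents, vocabulary size)
toy_shape = (len(toy_rows), len(toy_vocabulary))
print(toy_shape)
print(toy_vocabulary)
print(toy_rows)
```

Each row is one document and each column one vocabulary term, so 2257 documents with 35788 distinct terms give the shape shown above. The real matrix is stored sparse because most entries are zero.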

In [20]:
# From occurrences to frequencies: Term Frequency times Inverse Document Frequency (tf-idf); use_idf=False keeps only the normalized term frequencies
tf_transformer = TfidfTransformer(use_idf = False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_counts

<2257x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 365886 stored elements in Compressed Sparse Row format>

In [21]:
# Same two steps as the previous cell, combined in a single fit_transform call, now with idf weighting enabled (the default)
tfdif_transformer = TfidfTransformer()
X_train_tfidf = tfdif_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
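The weighting applied by the transformer can be sketched by hand. The function below is a hypothetical illustration that assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is my understanding of the `TfidfTransformer` defaults; the toy count matrix is made up.

```python
import math


def toy_tfidf(count_rows):
    """Weight raw counts by a smoothed idf, then scale each row to unit L2 norm."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq]
    result = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Hypothetical counts: 3 documents, 3 terms; term 0 appears in every document
toy_counts = [[2, 1, 0], [1, 0, 0], [1, 0, 3]]
toy_weights = toy_tfidf(toy_counts)
for row in toy_weights:
    print([round(v, 3) for v in row])
```

The shape is unchanged by the transformation: only the values are reweighted, so terms that occur in few documents count for more than terms that occur everywhere.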

In [22]:
# Creating the Multinomial model
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [23]:
clf

In [26]:
# Making predictions
docs_new = ["God is love", "OpenGL on the GPU is fast"]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfdif_transformer.transform(X_new_counts)

previsao = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, previsao):
    print(f"{doc} => {twenty_train.target_names[category]}")

God is love => soc.religion.christian
OpenGL on the GPU is fast => comp.graphics


In [27]:
# Creating a Pipeline: a composite classifier
# Vectorizer => transformer => classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

In [28]:
# Fit
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [29]:
# Model accuracy on the test set
twenty_test = fetch_20newsgroups(subset='test', categories = categorias, shuffle = True, random_state = 42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.8348868175765646

In [30]:
# Metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names = twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



In [31]:
# Confusion Matrix
metrics.confusion_matrix(twenty_test.target, predicted)

array([[192,   2,   6, 119],
       [  2, 347,   4,  36],
       [  2,  11, 322,  61],
       [  2,   2,   1, 393]], dtype=int64)
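The figures in the classification report can be recomputed from the confusion matrix above, where rows are the true classes and columns the predicted ones. This is a plain-Python sketch for checking the numbers; the variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
confusion = [
    [192, 2, 6, 119],
    [2, 347, 4, 36],
    [2, 11, 322, 61],
    [2, 2, 1, 393],
]
class_names = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

precision = {}
recall = {}
for i, name in enumerate(class_names):
    true_pos = confusion[i][i]
    predicted_as_i = sum(confusion[r][i] for r in range(n_classes))  # column sum
    actually_i = sum(confusion[i])  # row sum, the "support" column of the report
    precision[name] = true_pos / predicted_as_i
    recall[name] = true_pos / actually_i
    print(f"{name}: precision={precision[name]:.2f} recall={recall[name]:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The last column explains the weak spots in the report: 119 `alt.atheism` documents were predicted as `soc.religion.christian`, which lowers the recall of the former and the precision of the latter.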

#### Trying to improve the model's accuracy

In [42]:
# Parameters for GridSearchCV
parameters = {'vect__ngram_range': [(1,1), (1,2)],
              'clf__alpha': (1e-2, 1e-3),}
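A grid search tries every combination of the listed values. The sketch below expands the same dictionary with `itertools.product` to show the candidates; it illustrates the idea and is not how `GridSearchCV` is implemented. The fold count `n_folds = 5` is an assumption about the default cross-validation of the scikit-learn version in use.

```python
from itertools import product

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (1e-2, 1e-3)}

# One candidate per element of the Cartesian product of the value lists
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

for candidate in candidates:
    print(candidate)

n_folds = 5  # assumed default number of cross-validation folds
n_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

The `step__parameter` naming routes each value to a pipeline step: `vect__ngram_range` goes to the `vect` step and `clf__alpha` to the `clf` step.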

In [43]:
# GridSearchCV
gs_clf = GridSearchCV(text_clf, parameters, n_jobs = -1)

In [44]:
# Fit on a subset: the first 400 training documents
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

In [45]:
# Test prediction
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

In [46]:
# Best mean cross-validated score
gs_clf.best_score_

0.9349999999999999

In [47]:
# Checking the best parameters found by the grid search
for param_name in sorted(parameters.keys()):
    print(f"{param_name} => {gs_clf.best_params_[param_name]}")

clf__alpha => 0.01
vect__ngram_range => (1, 2)
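The selected `ngram_range` of `(1, 2)` means the vectorizer counts single words and pairs of consecutive words. A minimal sketch of that extraction follows; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams whose length n lies within ngram_range (inclusive)."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


toy_tokens = ["god", "is", "love"]
unigrams_only = word_ngrams(toy_tokens, (1, 1))
unigrams_and_bigrams = word_ngrams(toy_tokens, (1, 2))
print(unigrams_only)
print(unigrams_and_bigrams)
```

Bigrams preserve some word order that single-word counts discard, at the cost of a much larger vocabulary.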


End