<h1>Bayesian models with Naive Bayes</h1>

Naive Bayes models are used mainly for classification. They are called “naive”<br>
because they assume the attributes are independent of one another given the class, which is rarely true<br>
but works surprisingly well in practice.<br><br>

The goal is to classify instances into categories based on the conditional probabilities of their attributes.
The model picks the most probable class given the characteristics of the input.
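
The selection rule described above can be written compactly. In the notation assumed here for illustration, $y$ is a candidate class, $x_1, \dots, x_n$ are the attribute values of the instance, and the product form is what the independence assumption buys:

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
        = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right]
```

The second form is the one usually computed in practice, since summing logarithms avoids numerical underflow when many small probabilities are multiplied.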

<h3>Prerequisites</h3>
<ul>
    <li>Categorical or numerical data</li>
    <li>There are Naive Bayes variants for different types of data:
        <ul>
            <li>GaussianNB: for continuous data (assumes a normal distribution).</li>
            <li>MultinomialNB: for counts (e.g. text).</li>
            <li>BernoulliNB: for boolean variables (presence/absence of a feature).</li>
        </ul>
    </li>
    <li>Independence between the features (the model's hypothesis):
        <ul>
            <li>Although unrealistic, this assumption simplifies the computation and still gives good results.</li>
        </ul>
    </li>
    <li>Supervised classification:
        <ul>
            <li>The model needs labeled data to be trained.</li>
        </ul>
    </li>
</ul>
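
To make the independence hypothesis concrete, here is a minimal from-scratch sketch of a Bernoulli-style Naive Bayes classifier in plain Python. It is not scikit-learn's implementation. The function names, the toy data, and the smoothing constant `alpha` are all assumptions chosen for illustration. The point is that each feature contributes its own log-probability, and the contributions are simply added.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and per-feature P(x_i = 1 | class) with Laplace smoothing."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for i, value in enumerate(row):
            feature_counts[label][i] += value
    model = {}
    for label, count in class_counts.items():
        log_prior = math.log(count / len(y))
        # Smoothed estimate of P(x_i = 1 | label); alpha avoids zero probabilities.
        probs = [(feature_counts[label][i] + alpha) / (count + 2 * alpha)
                 for i in range(n_features)]
        model[label] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, (log_prior, probs) in model.items():
        score = log_prior
        # Independence assumption: per-feature log-likelihoods simply add up.
        for value, p in zip(row, probs):
            score += math.log(p if value == 1 else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: columns are [contains "free", contains "meeting"].
X_toy = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y_toy = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_bernoulli_nb(X_toy, y_toy)
print(predict(model, [1, 0]))
print(predict(model, [0, 1]))
```

Training only needs one pass over the data to collect counts, which is why these models are cheap to fit.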

<h3>Advantages</h3>
<ul>
    <li>Very fast to train and to predict.</li>
    <li>Efficient even with large volumes of data.</li>
    <li>Robust to noisy data.</li>
    <li>Needs little training data.</li>
</ul>

<h3>Limitations</h3>
<ul>
    <li>The independence assumption can limit performance.</li>
    <li>Little flexibility to capture complex relationships.</li>
    <li>Often less accurate than models such as Random Forest or Gradient Boosting.</li>
</ul>

<h3>Most common applications</h3>
<ul>
    <li>Natural language processing (NLP):
        <ul>
            <li>E-mail classification: spam vs. not spam.</li>
            <li>Sentiment analysis: detecting the polarity of texts.</li>
            <li>Topic classification: grouping texts by subject.</li>
        </ul>
    </li>
    <li>Marketing:
        <ul>
            <li>Customer segmentation: predicting behavior based on interactions.</li>
            <li>Simple recommendations.</li>
        </ul>
    </li>
    <li>Healthcare:
        <ul>
            <li>Quick diagnosis based on symptoms (with tabular datasets).</li>
        </ul>
    </li>
    <li>Security:
        <ul>
            <li>Simple fraud detection.</li>
            <li>Automated content filtering.</li>
        </ul>
    </li>
</ul>



In [3]:

import numpy as np
import pandas as pd
import urllib
import sklearn

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB


<h2>Loading the dataset</h2>

In [5]:
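# Spambase dataset from the UCI Machine Learning Repository.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
# The .names file is documentation rather than a CSV; comment='|' drops its comment text.
# Note: the first parsed entry ("1") is not a feature name, so the column labels in
# df end up shifted one position relative to the data they sit above.
names = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names", comment='|', header=None)
df = pd.read_csv(url, names=names[0].tolist())
data = df.values  # plain NumPy array; columns are selected by position from here on
print(data[0])
df.head()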

[  0.      0.64    0.64    0.      0.32    0.      0.      0.      0.
   0.      0.      0.64    0.      0.      0.      0.32    0.      1.29
   1.93    0.      0.96    0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.778   0.      0.
   3.756  61.    278.      1.   ]


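<table>
    <tr><th></th><th>1</th><th>word_freq_make: continuous.</th><th>word_freq_address: continuous.</th><th>word_freq_all: continuous.</th><th>word_freq_3d: continuous.</th><th>word_freq_our: continuous.</th><th>word_freq_over: continuous.</th><th>word_freq_remove: continuous.</th><th>word_freq_internet: continuous.</th><th>word_freq_order: continuous.</th><th>...</th><th>word_freq_conference: continuous.</th><th>char_freq_;: continuous.</th><th>char_freq_(: continuous.</th><th>char_freq_[: continuous.</th><th>char_freq_!: continuous.</th><th>char_freq_$: continuous.</th><th>char_freq_#: continuous.</th><th>capital_run_length_average: continuous.</th><th>capital_run_length_longest: continuous.</th><th>capital_run_length_total: continuous.</th></tr>
    <tr><th>0</th><td>0.0</td><td>0.64</td><td>0.64</td><td>0.0</td><td>0.32</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.778</td><td>0.0</td><td>0.0</td><td>3.756</td><td>61</td><td>278</td><td>1</td></tr>
    <tr><th>1</th><td>0.21</td><td>0.28</td><td>0.5</td><td>0.0</td><td>0.14</td><td>0.28</td><td>0.21</td><td>0.07</td><td>0.0</td><td>0.94</td><td>...</td><td>0.0</td><td>0.132</td><td>0.0</td><td>0.372</td><td>0.18</td><td>0.048</td><td>5.114</td><td>101</td><td>1028</td><td>1</td></tr>
    <tr><th>2</th><td>0.06</td><td>0.0</td><td>0.71</td><td>0.0</td><td>1.23</td><td>0.19</td><td>0.19</td><td>0.12</td><td>0.64</td><td>0.25</td><td>...</td><td>0.01</td><td>0.143</td><td>0.0</td><td>0.276</td><td>0.184</td><td>0.01</td><td>9.821</td><td>485</td><td>2259</td><td>1</td></tr>
    <tr><th>3</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.63</td><td>0.0</td><td>0.31</td><td>0.63</td><td>0.31</td><td>0.63</td><td>...</td><td>0.0</td><td>0.137</td><td>0.0</td><td>0.137</td><td>0.0</td><td>0.0</td><td>3.537</td><td>40</td><td>191</td><td>1</td></tr>
    <tr><th>4</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.63</td><td>0.0</td><td>0.31</td><td>0.63</td><td>0.31</td><td>0.63</td><td>...</td><td>0.0</td><td>0.135</td><td>0.0</td><td>0.135</td><td>0.0</td><td>0.0</td><td>3.537</td><td>40</td><td>191</td><td>1</td></tr>
</table>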


In [6]:

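X = data[:, :48]  # first 48 columns: the word-frequency features
y = data[:, -1]   # last column: class label (1 = spam, 0 = not spam)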


<h2>Splitting the data into training and test sets</h2>

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)


<h2>Bernoulli model</h2>

In [10]:
# binarize is a numeric threshold (features above it become 1); True acts as 1.0 here.
bernNB = BernoulliNB(binarize=1.0)
bernNB.fit(X_train, y_train)
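
`BernoulliNB` models binary features, so its `binarize` argument is a threshold: values strictly greater than it are mapped to 1 and everything else to 0. The sketch below mimics that thresholding with NumPy. The `binarize` helper and the frequency values are made up for illustration and are not taken from the dataset. It suggests why the threshold matters when most features are small percentages.

```python
import numpy as np

# Hypothetical word-frequency percentages for one message.
freqs = np.array([0.0, 0.05, 0.32, 0.64, 1.29, 1.93])


def binarize(values, threshold):
    """Map values strictly above the threshold to 1 and the rest to 0."""
    return (values > threshold).astype(int)


print(binarize(freqs, 1.0))  # only frequencies above 1.0 count as "present"
print(binarize(freqs, 0.1))  # much smaller frequencies already count as "present"
```

With a high threshold most words look absent, so the model sees far less signal per message.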

<h3>Evaluating the model</h3>

In [12]:
y_pred = bernNB.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

Accuracy:  0.8577633007600435
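
Accuracy alone does not show which kind of error the classifier makes, and in spam filtering a legitimate e-mail flagged as spam is often the more costly mistake. As a sketch, the `sklearn.metrics` functions below break the errors down. The label arrays are made up for illustration and are not this notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0]
y_hat = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("Precision:", precision_score(y_true, y_hat))  # tp / (tp + fp)
print("Recall:", recall_score(y_true, y_hat))        # tp / (tp + fn)
```

The same calls can be applied to `y_test` and `y_pred` from the cell above.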


<h2>Multinomial model</h2>

In [None]:
multNB = MultinomialNB()
multNB.fit(X_train, y_train)

<h3>Evaluating the model</h3>

In [27]:
y_pred = multNB.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

Accuracy:  0.8816503800217155


<h2>Gaussian model</h2>

In [None]:
gaussNB = GaussianNB()
gaussNB.fit(X_train, y_train)

<h3>Evaluating the model</h3>

In [None]:
y_pred = gaussNB.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.8197611292073833


<h2>Optimized Bernoulli model</h2>

In [None]:
# Lower threshold: a word counts as present once its frequency exceeds 0.1.
bernNB = BernoulliNB(binarize=0.1)
bernNB.fit(X_train, y_train)

<h3>Evaluating the model</h3>

In [None]:
y_pred = bernNB.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.9109663409337676
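
The gain from lowering the threshold suggests `binarize` is worth tuning. One possible way to choose it without consulting the test set is cross-validation on the training data. The sketch below is an illustration, not the procedure used in this notebook. It assumes a synthetic stand-in dataset (`X_demo`, `y_demo`) and an arbitrary list of candidate thresholds; in the notebook, `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the training split (an assumption, not the spambase data):
# non-negative, skewed "frequencies" whose scale depends on the class.
rng = np.random.default_rng(17)
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.exponential(scale=0.2, size=(400, 10)) * (1 + 2 * y_demo[:, None])

# Candidate thresholds are illustrative; cross-validation keeps the test set untouched.
thresholds = [0.05, 0.1, 0.25, 0.5, 1.0]
cv_scores = {
    t: cross_val_score(BernoulliNB(binarize=t), X_demo, y_demo, cv=5).mean()
    for t in thresholds
}

best_threshold = max(cv_scores, key=cv_scores.get)
for t, score in cv_scores.items():
    print(f"binarize={t}: mean CV accuracy = {score:.3f}")
print("Best threshold:", best_threshold)
```

Once a threshold is chosen this way, the model is refit on the full training split and scored once on the test split.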
