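# Naive Bayes Classifier

The Naive Bayes classifier is a statistical classification method based on Bayes' theorem, used for classification tasks in machine learning. It is called "naive" because it assumes that the features are conditionally independent within each class. This assumption often does not hold in practice, yet the classifier usually still gives good results.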

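## Bayes' Theorem

Bayes' theorem describes how to update our belief about an event once some evidence is known. Its mathematical form is:

$$
P(C|X) = \frac{P(X|C) P(C)}{P(X)}
$$

where:
- $P(C|X)$: the conditional probability of class $C$ given features $X$, i.e. the posterior probability.
- $P(X|C)$: the probability of observing features $X$ given class $C$, i.e. the likelihood.
- $P(C)$: the prior probability of class $C$.
- $P(X)$: the total probability of observing features $X$, usually treated as a normalizing constant.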

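### Example

A restaurant served 100 guests one day: 60 women and 40 men. The women ordered 48 desserts and the men ordered 20 (assume each guest ordered at most one dessert).

A guest has ordered a dessert. What is the probability $P(\text{female}|\text{dessert})$ that this guest is a woman?

$P(\text{female}) = 0.6$, $P(\text{dessert}) = 0.68$, $P(\text{dessert}|\text{female}) = 0.8$, $P(\text{dessert}|\text{male}) = 0.5$

$P(\text{female}|\text{dessert}) = P(\text{female}) \cdot P(\text{dessert}|\text{female}) / P(\text{dessert}) = 0.6 \times 0.8 / 0.68 \approx 70.59\%$

$P(\text{male}|\text{dessert}) = P(\text{male}) \cdot P(\text{dessert}|\text{male}) / P(\text{dessert}) = 0.4 \times 0.5 / 0.68 \approx 29.41\%$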
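
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and it assumes each guest orders at most one dessert, so that dessert counts can be read as guest counts.

```python
# Counts from the restaurant example (assumes at most one dessert per guest).
n_female, n_male = 60, 40
dessert_female, dessert_male = 48, 20
n_total = n_female + n_male

p_female = n_female / n_total                          # prior P(female)
p_male = n_male / n_total                              # prior P(male)
p_dessert_given_female = dessert_female / n_female     # likelihood P(dessert | female)
p_dessert_given_male = dessert_male / n_male           # likelihood P(dessert | male)
p_dessert = (dessert_female + dessert_male) / n_total  # evidence P(dessert)

# Bayes' theorem: posterior = prior * likelihood / evidence
p_female_given_dessert = p_female * p_dessert_given_female / p_dessert
p_male_given_dessert = p_male * p_dessert_given_male / p_dessert

print(f"P(female | dessert) = {p_female_given_dessert:.2%}")
print(f"P(male | dessert)   = {p_male_given_dessert:.2%}")
```

The two posteriors add up to 1, because every guest in this example is counted as either female or male.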

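## The Feature Independence Assumption

In Naive Bayes, all features are assumed to be mutually independent given the class. This assumption simplifies the computation: the class-conditional likelihood factors into a product of per-feature conditional probabilities, so the posterior satisfies:

$$
P(C|X_1, X_2, \dots, X_n) \propto P(C) \prod_{i=1}^{n} P(X_i|C)
$$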
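
To make the factorization concrete, the sketch below scores each class as prior times the product of per-feature likelihoods, then normalizes the scores so they sum to 1. The class names, feature names, and probability values are all hypothetical and chosen only for illustration. For brevity, only features observed as present contribute a factor.

```python
import math

# Hypothetical probability tables for a two-class, two-feature problem.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"contains_link": 0.7, "has_greeting": 0.2},
    "ham": {"contains_link": 0.1, "has_greeting": 0.6},
}


def naive_bayes_posterior(observed_features):
    """Return P(C | observed features) under the conditional-independence assumption."""
    # Unnormalized score per class: P(C) * prod_i P(X_i | C)
    scores = {
        c: priors[c] * math.prod(likelihoods[c][f] for f in observed_features)
        for c in priors
    }
    total = sum(scores.values())  # plays the role of the normalizing constant P(X)
    return {c: score / total for c, score in scores.items()}


posterior = naive_bayes_posterior(["contains_link", "has_greeting"])
print(posterior)
```

Normalizing over the classes replaces an explicit computation of $P(X)$, which is why the formula is usually written with $\propto$.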

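## Advantages

1. **Simple**: the mathematical model is simple, and easy to implement and understand.
2. **Efficient**: it runs quickly even when the feature space is large.
3. **Works with small datasets**: it performs well on small datasets.
4. **Tolerates missing data**: the model can still run when some features are missing.

## Disadvantages

1. **Assumes feature independence**: in practice features are often related, which hurts the model's accuracy.
2. **Sensitive to correlated features**: redundant or correlated features are effectively counted more than once, which can degrade performance.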

## Applications

Naive Bayes classifiers are widely used in:
- Text classification (e.g. spam filtering, sentiment analysis)
- Medical diagnosis
- Recommender systems



# Applying Bayes' Theorem to a Classifier

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

data = pd.read_csv("../data/mushrooms.csv")
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


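Given a record with $\text{cap-shape} = x$, what is the probability $P(\text{class}_e|\text{cap-shape}_x)$ that its class is 'e'?

$P(\text{class}_e|\text{cap-shape}_x) = P(\text{class}_e) \cdot P(\text{cap-shape}_x|\text{class}_e) / P(\text{cap-shape}_x)$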

In [6]:
p1 = (data['class']=='e').mean()
print('P(class_e) =', p1)
p2 = ((data['class']=='e')&(data['cap-shape']=='x')).sum() / (data['class']=='e').sum()
print('P(cap-shape_x|class_e) =', p2)
p3 = (data['cap-shape']=='x').sum() / data.shape[0]
print('P(cap-shape_x) =', p3)
print('P(class_e|cap-shape_x) =', p1*p2/p3)

P(class_e) = 0.517971442639094
P(cap-shape_x|class_e) = 0.4629277566539924
P(cap-shape_x) = 0.4500246184145741
P(class_e|cap-shape_x) = 0.5328227571115973


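Given a record with $\text{population} = s$, what is the probability $P(\text{class}_e|\text{population}_s)$ that its class is 'e'?

$P(\text{class}_e|\text{population}_s) = P(\text{class}_e) \cdot P(\text{population}_s|\text{class}_e) / P(\text{population}_s)$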

In [4]:
p1 = (data['class']=='e').sum() / data.shape[0]
print('P(class_e) =', p1)
p2 = ((data['class']=='e')&(data['population']=='s')).sum() / (data['class']=='e').sum()
print('P(population_s|class_e) =', p2)
p3 = (data['population']=='s').sum() / data.shape[0]
print('P(population_s) =', p3)
print('P(class_e|population_s) =', p1*p2/p3)

P(class_e) = 0.517971442639094
P(population_s|class_e) = 0.20912547528517111
P(population_s) = 0.1536189069423929
P(class_e|population_s) = 0.7051282051282052


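Given a record that satisfies `condition T = (cap-shape = x & population = s)`, stored in the code as the variable `條件T`, what is the probability $P(\text{class}_e|T)$ that its class is 'e'?

$P(\text{class}_e|T) = P(\text{class}_e) \cdot P(T|\text{class}_e) / P(T)$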

In [5]:
條件T = ((data['cap-shape']=='x')&(data['population']=='s'))
p1 = (data['class']=='e').sum() / data.shape[0]
print('P(class_e) =', p1)
p2 = ((data['class']=='e')&條件T).sum() / (data['class']=='e').sum()
print('P(條件T|class_e) =', p2)
p3 = 條件T.sum() / data.shape[0]
print('P(條件T) =', p3)
print('P(class_e|條件T) =', p1*p2/p3)

P(class_e) = 0.517971442639094
P(條件T|class_e) = 0.09885931558935361
P(條件T) = 0.07976366322008863
P(class_e|條件T) = 0.6419753086419753
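
Because every term in this computation is a relative frequency from the same table, the Bayes formula collapses to a direct count ratio: the share of class 'e' among the rows that satisfy the condition. The sketch below checks this on a tiny made-up DataFrame (not the mushroom data). The rows and the names `toy` and `cond_t` are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up sample that only mimics the mushroom column names.
toy = pd.DataFrame({
    "class":      ["e", "e", "e", "p", "p", "e", "p", "e"],
    "cap-shape":  ["x", "x", "b", "x", "x", "x", "b", "x"],
    "population": ["s", "n", "s", "s", "s", "s", "n", "n"],
})

is_e = toy["class"] == "e"
cond_t = (toy["cap-shape"] == "x") & (toy["population"] == "s")

prior = is_e.mean()                              # P(class_e)
likelihood = (is_e & cond_t).sum() / is_e.sum()  # P(T | class_e)
evidence = cond_t.mean()                         # P(T)
via_bayes = prior * likelihood / evidence

# The same quantity read directly from the rows that satisfy T.
direct = is_e[cond_t].mean()

print(via_bayes, direct)
```

The direct ratio needs enough rows matching the full condition. With many features that count shrinks quickly, which is one motivation for the factorized form.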


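# Naive Bayes Classifier

In Naive Bayes, all features are assumed to be mutually independent given the class. This assumption simplifies the computation: the class-conditional likelihood factors into a product of per-feature conditional probabilities, so the posterior satisfies: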

$P(C|X_1, X_2, \dots, X_n) \propto P(C) \prod_{i=1}^{n} P(X_i|C)$


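Here we first reduce the equation to three variables:

$P(A|BC) = P(A) \cdot P(BC|A) / P(BC)$

If $B$ and $C$ are independent of each other, both marginally and conditionally on $A$, then $P(BC) = P(B) \cdot P(C)$ and $P(BC|A) = P(B|A) \cdot P(C|A)$. Zero correlation alone does not guarantee independence. Naive Bayes itself only assumes the conditional form, so factoring the denominator is an additional approximation.

The equation can then be rewritten as

$P(A|BC) = P(A) \cdot P(B|A) \cdot P(C|A) / (P(B) \cdot P(C))$

So, if the input columns are $X_1,\dots,X_n$ and the predicted output is $Y$, we can precompute the prior $P(Y)$ together with all of $P(X_1|Y),\dots,P(X_n|Y)$ and $P(X_1),\dots,P(X_n)$ from the training data.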
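
One possible way to precompute these tables with pandas is sketched below. `pd.crosstab(..., normalize="index")` turns counts into row-wise proportions. The DataFrame `toy` is made up and the names `priors`, `likelihood_tables`, and `marginals` are assumptions for illustration. No smoothing is applied.

```python
import pandas as pd

# Made-up sample; column names mirror the mushroom data for readability.
toy = pd.DataFrame({
    "class":      ["e", "e", "e", "p", "p", "e", "p", "e"],
    "cap-shape":  ["x", "x", "b", "x", "x", "x", "b", "x"],
    "population": ["s", "n", "s", "s", "s", "s", "n", "n"],
})
features = ["cap-shape", "population"]

# P(Y): class priors.
priors = toy["class"].value_counts(normalize=True)

# P(X_i | Y): one table per feature, rows = class, columns = feature value.
likelihood_tables = {
    f: pd.crosstab(toy["class"], toy[f], normalize="index") for f in features
}

# P(X_i): marginal frequency of each feature value.
marginals = {f: toy[f].value_counts(normalize=True) for f in features}

print(likelihood_tables["cap-shape"])
```

Each row of a likelihood table sums to 1, since it is a distribution over that feature's values within one class.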

In [7]:
p1 = (data['class']=='e').sum() / data.shape[0]
print('P(class_e) =', p1)
p2 = ((data['class']=='e')&(data['cap-shape']=='x')).sum() / (data['class']=='e').sum()
print('P(cap-shape_x|class_e) =', p2)
p3 = (data['cap-shape']=='x').sum() / data.shape[0]
print('P(cap-shape_x) =', p3)
p4 = ((data['class']=='e')&(data['population']=='s')).sum() / (data['class']=='e').sum()
print('P(population_s|class_e) =', p4)
p5 = (data['population']=='s').sum() / data.shape[0]
print('P(population_s) =', p5)
print('P(class_e|cap-shape_x&population_s) =', p1*p2*p4/(p3*p5))

P(class_e) = 0.517971442639094
P(cap-shape_x|class_e) = 0.4629277566539924
P(cap-shape_x) = 0.4500246184145741
P(population_s|class_e) = 0.20912547528517111
P(population_s) = 0.1536189069423929
P(class_e|cap-shape_x&population_s) = 0.7253456917611263
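
This value (about 0.7253) differs from the exact 0.6420 computed earlier, because the two features are not truly independent in this dataset. A common alternative to dividing by $P(B) \cdot P(C)$ is to compute the numerator for every class and normalize, which guarantees that the posteriors sum to 1. The sketch below shows that variant on a small made-up DataFrame. The data and names are assumptions, and no smoothing is applied, so a zero count would zero out a class.

```python
import pandas as pd

# Made-up sample; column names mirror the mushroom data for readability.
toy = pd.DataFrame({
    "class":      ["e", "e", "e", "p", "p", "e", "p", "e"],
    "cap-shape":  ["x", "x", "b", "x", "x", "x", "b", "x"],
    "population": ["s", "n", "s", "s", "s", "s", "n", "n"],
})
query = {"cap-shape": "x", "population": "s"}

scores = {}
for label, group in toy.groupby("class"):
    score = len(group) / len(toy)                 # prior P(class)
    for column, value in query.items():
        score *= (group[column] == value).mean()  # P(feature = value | class)
    scores[label] = score

# Normalize over the classes instead of dividing by P(B) * P(C).
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}
print(posterior)
```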


### Using scikit-learn

In [8]:
from sklearn.preprocessing import LabelEncoder
encoded_data = data.apply(LabelEncoder().fit_transform)
x = encoded_data.drop('class',axis=1)
y = encoded_data['class']
encoded_data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1


In [9]:
data = pd.read_csv("../data/mushrooms.csv")
encoded_data = pd.get_dummies(data)
encoded_data.head()

Unnamed: 0,class_e,class_p,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,False,True,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,False,False,True,False
1,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False
2,True,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,False,True,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,False,False,True,False
4,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False


In [10]:
encoded_data.columns

Index(['class_e', 'class_p', 'cap-shape_b', 'cap-shape_c', 'cap-shape_f',
       'cap-shape_k', 'cap-shape_s', 'cap-shape_x', 'cap-surface_f',
       'cap-surface_g',
       ...
       'population_s', 'population_v', 'population_y', 'habitat_d',
       'habitat_g', 'habitat_l', 'habitat_m', 'habitat_p', 'habitat_u',
       'habitat_w'],
      dtype='object', length=119)

In [11]:
x = encoded_data.drop(['class_e','class_p'],axis=1)
x.columns

Index(['cap-shape_b', 'cap-shape_c', 'cap-shape_f', 'cap-shape_k',
       'cap-shape_s', 'cap-shape_x', 'cap-surface_f', 'cap-surface_g',
       'cap-surface_s', 'cap-surface_y',
       ...
       'population_s', 'population_v', 'population_y', 'habitat_d',
       'habitat_g', 'habitat_l', 'habitat_m', 'habitat_p', 'habitat_u',
       'habitat_w'],
      dtype='object', length=117)

In [12]:
y = encoded_data['class_e']

### For the different Naive Bayes variants, see
https://scikit-learn.org/stable/modules/naive_bayes.html

In [24]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import CategoricalNB
scores = cross_val_score(MultinomialNB(),x,y,cv=5,scoring='accuracy')
print('MultinomialNB',scores.mean())
scores = cross_val_score(GaussianNB(),x,y,cv=5,scoring='accuracy')
print('GaussianNB',scores.mean())
scores = cross_val_score(BernoulliNB(),x,y,cv=5,scoring='accuracy')
print('BernoulliNB',scores.mean())
scores = cross_val_score(ComplementNB(),x,y,cv=5,scoring='accuracy')
print('ComplementNB',scores.mean())
# scores = cross_val_score(CategoricalNB(),x,y,cv=5,scoring='accuracy')
# print('CategoricalNB',scores.mean())

MultinomialNB 0.8289848427434634
GaussianNB 0.8516481242895036
BernoulliNB 0.8164312239484653
ComplementNB 0.8296002273588481
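
The `CategoricalNB` lines are left commented out, and the notebook does not say why. One relevant point is that `CategoricalNB` expects each column to hold non-negative integer category codes rather than one-hot columns. The sketch below shows one way it could be used, with `OrdinalEncoder` in front of it. It runs on a made-up DataFrame, not the mushroom data, and is not the author's setup.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample; column names mirror the mushroom data for readability.
toy = pd.DataFrame({
    "class":      ["e", "e", "e", "p", "p", "e", "p", "e"],
    "cap-shape":  ["x", "x", "b", "x", "x", "x", "b", "x"],
    "population": ["s", "n", "s", "s", "s", "s", "n", "n"],
})
X_toy = toy[["cap-shape", "population"]]
y_toy = toy["class"]

# OrdinalEncoder maps each category to an integer code per column;
# CategoricalNB then estimates P(feature value | class) with Laplace smoothing.
model = make_pipeline(OrdinalEncoder(), CategoricalNB(alpha=1.0))
model.fit(X_toy, y_toy)

query = pd.DataFrame({"cap-shape": ["x"], "population": ["s"]})
proba = model.predict_proba(query)[0]
print(dict(zip(model[-1].classes_, proba)))
```

With cross-validation, a category that is absent from a training fold can cause trouble at prediction time, so the encoder's and classifier's handling of unseen categories would need attention on real data.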


In [21]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(x)
pca_x = pca.transform(x)
# scores = cross_val_score(MultinomialNB(),pca_x,y,cv=5,scoring='accuracy')
# print('MultinomialNB',scores.mean())
scores = cross_val_score(GaussianNB(),pca_x,y,cv=5,scoring='accuracy')
print('GaussianNB',scores.mean())
scores = cross_val_score(BernoulliNB(),pca_x,y,cv=5,scoring='accuracy')
print('BernoulliNB',scores.mean())
# scores = cross_val_score(ComplementNB(),pca_x,y,cv=5,scoring='accuracy')
# print('ComplementNB',scores.mean())
# scores = cross_val_score(CategoricalNB(),pca_x,y,cv=5,scoring='accuracy')
# print('CategoricalNB',scores.mean())

GaussianNB 0.6243314134141721
BernoulliNB 0.7856838196286472
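
`MultinomialNB` and `ComplementNB` are commented out in this cell. A plausible reason (an assumption, the notebook does not state it) is that PCA centers the data, so the transformed features contain negative values, which these count-based estimators reject. The sketch below reproduces that behavior on synthetic data. The names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(40, 6)).astype(float)  # synthetic 0/1 features
y_demo = rng.integers(0, 2, size=40)                     # synthetic labels

X_pca = PCA().fit_transform(X_demo)
has_negative = bool((X_pca < 0).any())  # centered projections include negatives
print("PCA output has negative values:", has_negative)

try:
    MultinomialNB().fit(X_pca, y_demo)
    rejected = False
except ValueError as err:
    rejected = True
    print("MultinomialNB rejected the input:", err)
```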


In [19]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(x,y)
lda_x = lda.transform(x)
scores = cross_val_score(GaussianNB(),lda_x,y,cv=5,scoring='accuracy')
print('GaussianNB',scores.mean())
scores = cross_val_score(BernoulliNB(),lda_x,y,cv=5,scoring='accuracy')
print('BernoulliNB',scores.mean())

GaussianNB 0.9996307692307692
BernoulliNB 0.9995076923076922


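▲ The result above is too good to be believable. The likely cause is that LDA was fit on the entire dataset, so information from the validation folds (including their labels) leaked into the transform.

▼ Switch to fitting LDA on the training folds only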

In [26]:
from sklearn.pipeline import make_pipeline

GaussianNB_model = make_pipeline(LinearDiscriminantAnalysis(),GaussianNB())
BernoulliNB_model = make_pipeline(LinearDiscriminantAnalysis(),BernoulliNB())


scores = cross_val_score(GaussianNB_model,x,y,cv=5,scoring='accuracy')
print('GaussianNB',scores.mean())
scores = cross_val_score(BernoulliNB_model,x,y,cv=5,scoring='accuracy')
print('BernoulliNB',scores.mean())


GaussianNB 0.9466754831375521
BernoulliNB 0.9576360742705571
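
Placing LDA inside a pipeline means it is refit on the training portion of every cross-validation split. Written out by hand, the procedure looks roughly like the sketch below. It uses synthetic data from `make_classification` rather than the mushroom data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    lda = LinearDiscriminantAnalysis()
    # Fit the projection on the training fold only, then apply it to both parts.
    X_train = lda.fit_transform(X_demo[train_idx], y_demo[train_idx])
    X_test = lda.transform(X_demo[test_idx])
    clf = GaussianNB().fit(X_train, y_demo[train_idx])
    fold_scores.append(clf.score(X_test, y_demo[test_idx]))

print(np.mean(fold_scores))
```

The held-out fold never influences the fitted projection, so the score reflects performance on data that the whole preprocessing chain has not seen.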
