# Chapter 4: Naive Bayes

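1. Naive Bayes is a typical generative learning method. A generative method learns the joint probability distribution $P(X,Y)$ from the training data and then derives the posterior distribution $P(Y|X)$. Concretely, the training data are used to estimate $P(X|Y)$ and $P(Y)$, which together give the joint distribution:

$$P(X,Y)=P(Y)P(X|Y)$$

The probabilities can be estimated by maximum likelihood estimation or by Bayesian estimation.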

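2. The basic assumption of naive Bayes is conditional independence:

$$\begin{aligned} P(X=x | Y=c_{k}) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right) \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \end{aligned}$$

This is a strong assumption. It greatly reduces the number of conditional probabilities the model has to estimate, which simplifies both learning and prediction. Naive Bayes is therefore efficient and easy to implement; its drawback is that the classification accuracy is not necessarily high.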

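3. Naive Bayes classifies by applying Bayes' theorem to the learned joint probability model:

$$P(Y | X)=\frac{P(X, Y)}{P(X)}=\frac{P(Y) P(X | Y)}{\sum_{Y} P(Y) P(X | Y)}$$

The input $x$ is assigned to the class $y$ with the largest posterior probability:

$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)$$

Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss.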
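
The equivalence can be seen from a short derivation. As a sketch, write $L$ for the 0-1 loss and $f$ for the classifier (notation introduced here for illustration). The expected risk is minimized by minimizing the conditional risk at each $x$:

$$\begin{aligned} f(x) &= \arg\min_{y} \sum_{k=1}^{K} L(c_k, y)\, P(Y=c_k | X=x) \\ &= \arg\min_{y} \sum_{c_k \neq y} P(Y=c_k | X=x) \\ &= \arg\min_{y} \left(1 - P(Y=y | X=x)\right) \\ &= \arg\max_{y} P(Y=y | X=x) \end{aligned}$$

The second line uses $L(c_k, y)=1$ when $c_k \neq y$ and $0$ otherwise, and the third uses the fact that the posterior probabilities sum to 1.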

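4. Parameter estimation for naive Bayes: a common strategy for estimating the class-conditional probabilities is to first assume a particular form of probability distribution and then estimate its parameters from the training samples.
   - Maximum likelihood estimation (MLE):  
   The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is:
   $$P(Y=c_k) = \frac {\sum_{i=1}^N I(y_i=c_k) }{N}, \quad k=1, 2, \cdots, K$$
   Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$ is:
   $$P(X^{(j)}=a_{jl}|Y=c_k) = \frac {\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl}, y_i=c_k)}{\sum_{i=1}^{N}I(y_i = c_k)}$$
   $$j = 1, 2, \cdots, n; \quad l=1, 2, \cdots, S_j; \quad k=1, 2, \cdots, K$$
   where $S_j$ is the number of values the $j$-th feature can take.
   - Bayesian estimation: maximum likelihood estimation can yield a probability estimate of 0, which zeroes out the whole product of conditional probabilities. Bayesian estimation avoids this. The Bayesian estimate of the conditional probability is:
   $$P_\lambda(X^{(j)}=a_{jl}|Y=c_k) = \frac {\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl}, y_i=c_k) + \lambda}{\sum_{i=1}^{N}I(y_i = c_k) + S_j \lambda}$$
   where $\lambda \geq 0$. The usual choice is $\lambda = 1$, which is known as Laplace smoothing. The Bayesian estimate of the prior probability is:
   $$P_\lambda(Y=c_k) = \frac {\sum_{i=1}^N I(y_i=c_k) + \lambda}{N + K\lambda}, \quad k=1, 2, \cdots, K$$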
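
The two estimators above can be sketched for categorical features as follows. This is a minimal illustration, not a reference implementation: the function name `fit_categorical_nb`, the parameter `lam` (standing for $\lambda$) and the toy data are all made up for this example, and the set of values of each feature is assumed to be exactly the values seen in the data. Setting `lam=0` gives the maximum likelihood estimate, and `lam=1` gives Laplace smoothing.

```python
from collections import Counter

def fit_categorical_nb(X, y, lam=1.0):
    """Estimate priors and conditional probabilities with smoothing constant lam.

    X: list of samples, each a list of categorical feature values.
    y: list of class labels.
    """
    N = len(y)
    classes = sorted(set(y))
    K = len(classes)
    n_features = len(X[0])
    # Values each feature takes in the data; S_j = len(values[j])
    values = [sorted({x[j] for x in X}) for j in range(n_features)]

    class_count = Counter(y)
    prior = {c: (class_count[c] + lam) / (N + K * lam) for c in classes}

    # joint[(j, a, c)] = number of samples with x^(j) = a and y = c
    joint = Counter((j, x[j], c) for x, c in zip(X, y) for j in range(n_features))
    cond = {
        (j, a, c): (joint[(j, a, c)] + lam) / (class_count[c] + len(values[j]) * lam)
        for j in range(n_features) for a in values[j] for c in classes
    }
    return prior, cond

# Toy data (hypothetical): two categorical features, labels in {-1, 1}
X_toy = [[1, "S"], [1, "M"], [2, "M"], [2, "S"], [3, "L"], [3, "L"]]
y_toy = [-1, -1, 1, 1, 1, -1]

prior, cond = fit_categorical_nb(X_toy, y_toy, lam=1.0)     # Laplace smoothing
prior_ml, cond_ml = fit_categorical_nb(X_toy, y_toy, lam=0)  # maximum likelihood
print(cond_ml[(0, 1, 1)], cond[(0, 1, 1)])  # zero under MLE, positive when smoothed
```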
Models:

- Gaussian model
- Multinomial model
- Bernoulli model

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from collections import Counter
import math
np.set_printoptions(precision=4, threshold=15,suppress=True)
pd.options.display.max_rows = 20

In [3]:
iris = load_iris()
# iris

In [4]:
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [5]:
y_train

array([2, 1, 1, ..., 1, 0, 2])

In [6]:
X_test[0], y_test[0]

(array([6.2, 2.8, 4.8, 1.8]), 2)

In [7]:
np.std(X_test, axis=0)

array([0.7993, 0.3899, 1.6904, 0.742 ])

In [8]:
np.std(X_test[:, 0])

0.7993392332899363

Reference: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

## GaussianNB: Gaussian Naive Bayes

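The likelihood of each feature is assumed to be Gaussian.

Class-conditional likelihood of feature value $x_i$ given class $y_k$ (a density rather than a probability mass function, because the features are continuous):
$$P(x_i | y_k)=\frac{1}{\sqrt{2\pi\sigma^2_{y_k}}}\exp\left(-\frac{(x_i-\mu_{y_k})^2}{2\sigma^2_{y_k}}\right)$$
Probability density function (PDF) of the Gaussian distribution:
$$f(x;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}} \, \exp \left( -\frac{(x- \mu)^2}{2\sigma^2} \right)$$
Mean (expected value): $\mu$

Variance: $\sigma^2=\frac{\sum(X-\mu)^2}{N}$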

In [9]:
# Assume the 4 features are conditionally independent given the class and each is Gaussian.
# Use the training set to compute the per-feature (mean, variance) for y = 0, 1 and 2;
# the class likelihood is the product of the per-feature densities.
# Plug the test data into the density, compute the 3 class likelihoods and take the largest.
# Note: the class prior P(Y) is not included, so this is a maximum-likelihood decision.
class NaiveBayes_1:
    def __init__(self):
        self.model = {}  # {label: [mean, var], ...}

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        for label in np.unique(y):
            # Select rows by label value, so labels need not be 0..K-1
            mean = np.mean(X[y == label], axis=0)
            var = np.var(X[y == label], axis=0)
            self.model[label] = [mean, var]

    def gaussian_probability(self, x, mean, var):
        # Gaussian density of every feature, multiplied across features
        # x: 2D array-like (n_samples, n_features)
        p = 1 / np.sqrt(2 * var * np.pi) * np.exp(-np.square(x - mean) / (2 * var))
        # return np.sum(np.log(p), axis=1)  # log-space alternative: sum of logs
        return np.prod(p, axis=1)  # product over features (conditional independence)

    def predict(self, x):
        labels = np.array(list(self.model.keys()))
        probs = self.predict_proba(x)
        return labels[np.argmax(probs, axis=1)]

    def predict_proba(self, x):
        temp = []
        for (mean, var) in self.model.values():
            p = self.gaussian_probability(x, mean, var)
            temp.append(p)
        probs = np.stack(temp, axis=1)  # (n_samples, n_labels), unnormalized
        return probs

    def score(self, X_test, y_test):
        return np.sum(self.predict(X_test) == y_test) / len(X_test)
        

In [10]:
model = NaiveBayes_1()
model.fit(X_train, y_train)
model.model

{0: [array([5.0727, 3.5182, 1.4788, 0.2485]),
  array([0.1244, 0.1603, 0.0314, 0.0092])],
 1: [array([5.9844, 2.8   , 4.3094, 1.3469]),
  array([0.2819, 0.09  , 0.2677, 0.05  ])],
 2: [array([6.5825, 2.9475, 5.555 , 2.0025]),
  array([0.4134, 0.0885, 0.2815, 0.0722])]}

In [11]:
model.predict([[4.4,  3.2,  1.3,  0.2], [5.4,  2.7,  4.3,  1.2]])

array([0, 1])

In [12]:
model.score(X_test, y_test)

1.0

In [13]:
model.predict_proba(X_test)  # 未归一化

array([[0.    , 0.1036, 0.1882],
       [0.4151, 0.    , 0.    ],
       [0.    , 0.    , 0.0003],
       ...,
       [0.    , 0.3824, 0.    ],
       [3.1193, 0.    , 0.    ],
       [0.0097, 0.    , 0.    ]])
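
The values above are unnormalized class likelihoods and do not include the prior $P(Y=c_k)$. A possible way to turn them into posteriors is to work in log space, add the log prior, and normalize each row. The sketch below assumes per-class means, variances and priors are already available as arrays; the function name `gaussian_nb_posterior`, its argument names and the example parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_nb_posterior(X, means, variances, priors):
    """Normalized posteriors P(Y=c_k | x), computed in log space.

    X: (n_samples, n_features)
    means, variances: (n_classes, n_features)
    priors: (n_classes,)
    """
    X = np.asarray(X, dtype=float)[:, None, :]  # (n_samples, 1, n_features)
    # Log of the Gaussian density for every sample, class and feature
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + (X - means) ** 2 / variances)
    # log P(Y=c_k) + sum_j log p(x^(j) | Y=c_k)
    joint = np.log(priors) + log_pdf.sum(axis=2)
    # Subtract the row maximum before exponentiating to avoid underflow
    joint -= joint.max(axis=1, keepdims=True)
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)

# Made-up parameters: two classes, two features, equal priors
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])

post = gaussian_nb_posterior([[0.0, 0.0], [1.5, 1.5], [3.0, 3.0]],
                             means, variances, priors)
print(post)  # each row sums to 1
```

Summing log densities instead of multiplying densities keeps the computation stable when there are many features, since a product of many small numbers can underflow to 0.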

In [14]:
class NaiveBayes:
    def __init__(self):
        self.model = None

    # Mean (expected value)
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # Standard deviation (square root of the population variance)
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # Summarize X_train: (mean, stdev) of each feature column
    def summarize(self, train_data):
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # Compute the mean and standard deviation of each feature, separately per class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {
            label: self.summarize(value)
            for label, value in data.items()
        }
        return 'gaussianNB train done!'

    # Likelihood of one sample under each class (product of per-feature densities)
    def calculate_probabilities(self, input_data):
        # summaries:{0.0: [(5.0, 0.37),(3.42, 0.40)], 1.0: [(5.8, 0.449),(2.7, 0.27)]}
        # input_data:[1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # Predicted class: the label with the largest likelihood
    def predict(self, X_test):
        # {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        label = sorted(
            self.calculate_probabilities(X_test).items(),
            key=lambda x: x[-1])[-1][0]
        return label

    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1

        return right / float(len(X_test))

In [15]:
model = NaiveBayes()

In [16]:
model.fit(X_train, y_train)

'gaussianNB train done!'

In [17]:
model.model

{0: [(5.072727272727273, 0.3527147764109446),
  (3.518181818181818, 0.4003442045211395),
  (1.478787878787879, 0.17711077813578732),
  (0.24848484848484853, 0.09573072120564434)],
 1: [(5.984375, 0.5309833889821789),
  (2.7999999999999994, 0.3),
  (4.309375, 0.5174211141565447),
  (1.346875, 0.22358496008229176)],
 2: [(6.5825000000000005, 0.6429959175609127),
  (2.9475, 0.2974789908548165),
  (5.554999999999999, 0.5305421755148219),
  (2.0025, 0.2687819748420642)]}

In [18]:
print(model.predict([4.4,  3.2,  1.3,  0.2]))

0


In [19]:
model.score(X_test, y_test)

1.0

### scikit-learn examples
**Gaussian naive Bayes classifier**

In [25]:
GaussianNB?

In [20]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf

GaussianNB()

In [21]:
model = clf.fit(X_train, y_train)

In [22]:
model.predict_proba([[4.4,  3.2,  1.3,  0.2], [5.4,  2.7,  4.3,  1.2]])  # raw predicted probabilities, not calibrated

array([[1.    , 0.    , 0.    ],
       [0.    , 0.9998, 0.0002]])

In [23]:
model.score(X_test, y_test)

1.0

In [24]:
model.predict_proba(X_test)

array([[0.    , 0.3057, 0.6943],
       [1.    , 0.    , 0.    ],
       [0.    , 0.    , 1.    ],
       ...,
       [0.    , 1.    , 0.    ],
       [1.    , 0.    , 0.    ],
       [1.    , 0.    , 0.    ]])

Specifying a prior probability for each class:

In [26]:
clf = GaussianNB(priors=[0.25, 0.25, 0.5])

In [27]:
model = clf.fit(X_train, y_train)

In [28]:
model.predict([[4.4,  3.2,  1.3,  0.2], [5.4,  2.7,  4.3,  1.2]])

array([0, 1])

In [29]:
model.score(X_test, y_test)

1.0

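**Calibrating predicted probabilities**

Note: the raw predicted probabilities from Gaussian naive Bayes (the output of `predict_proba`) are not calibrated, so they should not be taken at face value. To obtain useful predicted probabilities, we need to calibrate them with isotonic regression or a related method.

Class probabilities are a common and useful part of machine learning models. In `scikit-learn`, most learning algorithms let us inspect the predicted class-membership probabilities with `predict_proba`. This is useful when, for example, we only want to predict a class if the model assigns it a probability above 90%. However, some models, including naive Bayes classifiers, output probabilities that do not reflect real-world frequencies. That is, `predict_proba` may say an observation has a 0.70 chance of belonging to a class when the true chance is 0.10 or 0.99. In naive Bayes in particular, the ranking of the predicted probabilities across target classes is valid, but the raw predicted probabilities tend to be pushed toward the extremes 0 and 1.

To obtain meaningful predicted probabilities, we perform what is called calibration. In `scikit-learn`, the `CalibratedClassifierCV` class creates well-calibrated predicted probabilities using k-fold cross-validation. In each fold, the training portion is used to fit the model and the held-out portion is used to calibrate the predicted probabilities. The returned predicted probabilities are the average over the k folds.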

CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv='warn')
```
Probability calibration with isotonic regression or sigmoid.
Parameters
----------
base_estimator : instance BaseEstimator
   The classifier to calibrate
method : 'sigmoid' or 'isotonic'
cv : integer, cross-validation generator, iterable or "prefit", optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the default 3-fold cross-validation,
    - integer, to specify the number of folds.
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.
```

In [32]:
from sklearn.calibration import CalibratedClassifierCV

clf = GaussianNB()
clf_sigmoid = CalibratedClassifierCV(clf, cv=3, method='sigmoid')
clf_sigmoid.fit(X_train, y_train)

CalibratedClassifierCV(base_estimator=GaussianNB(), cv=3)

In [33]:
clf_sigmoid.score(X_test, y_test)

1.0

In [34]:
new_observation = [[4.4,  3.2,  1.3,  0.2], [5.4,  2.7,  4.3,  1.2]]
# Inspect the calibrated probabilities
clf_sigmoid.predict_proba(new_observation)

array([[0.9005, 0.0453, 0.0542],
       [0.0268, 0.9149, 0.0583]])
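
One way to check whether predicted probabilities are calibrated is to group predictions into probability bins and compare the mean predicted probability in each bin with the observed fraction of positives. The sketch below is a minimal illustration with made-up data; the function name `reliability_table` and its arguments are hypothetical. scikit-learn offers `sklearn.calibration.calibration_curve` for a similar purpose.

```python
import numpy as np

def reliability_table(y_true, p_pred, n_bins=5):
    """Compare mean predicted probability with observed frequency per bin.

    y_true: 0/1 array, 1 if the sample belongs to the class of interest.
    p_pred: predicted probability of that class for each sample.
    Returns (mean_predicted, observed_frequency, count) for each non-empty bin.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index of each prediction; p = 1.0 falls into the last bin
    idx = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((p_pred[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

# Made-up predictions: the model says 0.9 for four samples, but only three are positive
y_demo = [0, 0, 1, 1, 1, 0]
p_demo = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
rows = reliability_table(y_demo, p_demo)
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

For a well-calibrated model the two columns should be close in every bin; large gaps suggest that calibration is worth applying.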

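**Bernoulli naive Bayes**  
The Bernoulli naive Bayes classifier assumes that all features are binary, taking only two values (for example, nominal categorical features that have been one-hot encoded).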
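
To make the model concrete, here is a from-scratch sketch of Bernoulli naive Bayes under stated assumptions: each feature is 0 or 1, the per-class probability of a 1 is estimated with additive smoothing (the Bayesian estimate with $S_j = 2$), and priors are taken from class frequencies. The function names, the parameter `alpha` and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate theta[c, j] = P(x_j = 1 | class c) with additive smoothing."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    counts = np.array([(y == c).sum() for c in classes])         # samples per class
    ones = np.array([X[y == c].sum(axis=0) for c in classes])    # count of 1s per class and feature
    theta = (ones + alpha) / (counts[:, None] + 2 * alpha)       # each feature has 2 values
    log_prior = np.log(counts / counts.sum())
    return classes, log_prior, theta

def predict_bernoulli_nb(X, classes, log_prior, theta):
    X = np.asarray(X)
    # log P(x | c) = sum_j [ x_j * log(theta_cj) + (1 - x_j) * log(1 - theta_cj) ]
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(log_prior + log_lik, axis=1)]

# Toy data (made up): class 1 tends to have the first feature on, class 0 the last
X_toy = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

classes, log_prior, theta = fit_bernoulli_nb(X_toy, y_toy)
pred = predict_bernoulli_nb([[1, 1, 0], [0, 0, 1]], classes, log_prior, theta)
print(theta)
print(pred)
```

Unlike a multinomial model, the Bernoulli likelihood also penalizes features that are absent, through the $(1 - x_j)\log(1 - \theta_{cj})$ term.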

In [37]:
from sklearn.naive_bayes import BernoulliNB

X = np.random.randint(2, size=(100, 3))
X

array([[0, 1, 1],
       [0, 1, 1],
       [1, 1, 0],
       ...,
       [0, 1, 0],
       [1, 1, 1],
       [1, 1, 0]])

In [38]:
# Create a binary target vector
y = np.random.randint(2, size=(100, 1)).ravel()
y

array([1, 0, 0, ..., 1, 0, 1])

In [39]:
# Create a Bernoulli naive Bayes object with a prior probability for each class
clf = BernoulliNB(class_prior=[0.5, 0.5])

In [40]:
model = clf.fit(X, y)

In [41]:
target = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 1]]

In [42]:
model.predict(target)

array([1, 1, 0, 0, 1, 0, 0, 1])

In [43]:
# Note: these priors sum to 0.75 rather than 1; class 1 gets twice the weight of class 0
clf = BernoulliNB(class_prior=[0.25, 0.5])
model = clf.fit(X, y)
model.predict(target)

array([1, 1, 1, 1, 1, 1, 1, 1])