# 朴素贝叶斯

朴素，体现在X特征之间相互独立

通过训练数据集学习联合概率分布  
$$P(X,Y)=P(Y)P(X|Y)$$

先验概率分布，($c_k$为不同类别)
$$P(Y=c_k),k=1,2,3,...,K$$  

条件概率分布（独立性）   
$$
\begin{align}
P(X=x|Y=c_k)&=P(X^{(1)}=x^{(n)},...,X^{(n)}=x^{(n)}|Y=c_k) \\
&=\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_k)
\end{align}$$

贝叶斯分类时，对给定x，通过学习到的模型计算后验概率$P(Y=c_k|X=x)$    
(全概率公式)

$$\begin{align}
P(Y=c_k|X=x)&=\frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_k P(X=x|Y=c_k)P(Y=c_k)} \\
&=\frac{P(Y+c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_k P(Y+c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_k)}
\end{align}$$

由于不同类别的分母是相同的，可以求得后验概率最大的y类别

$$y=\underset {c_k} argmax P(Y=c_k) \prod_j P(X^{(j)}=x^{(j)}|Y=c_k)$$  



如果特征为离散变量，可以一个个统计求不同值的概率  
如果特征为连续变量，则假设特征符合高斯分布

$$P(x_i|y_k)=\frac{1}{\sqrt{2\pi\sigma_{yk}^2}}exp(-\frac{(x_i-\mu_{yk})^2}{2\sigma_{yk}^2})$$

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [2]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
df.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [3]:
X = np.array(df.iloc[:, :-1])
y = np.array(df.iloc[:, -1])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)




In [4]:
class NaiveBayes:
    def __init__(self):
        self.model = {}
        self.prior = {}
    # 期望，均值
    def mean(self, X):
        return sum(X) / float(len(X))
    
    # 标准差
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x-avg,2) for x in X]) / float(len(X)))
    
    # 计算每个特征的属性
    def summarize(self, X_train):
        return [(self.mean(i), self.stdev(i)) for i in zip(*X_train)]
    
    def fit(self, X, y):
        data = {label:[] for label in list(set(y))}
        # 将类别相同的数据放在一起
        # 在之后计算中已经加入了y=ck的条件
        for f, label in zip(X, y):
            data[label].append(f)
        # 先验概率
        for label, f in data.items():
            self.prior[label] = len(f)/len(X)
            
        # 计算出每一类的各个特征的属性
        # 每一类不同特征的均值，方差
        self.model = {label: self.summarize(value) for label, value in data.items()}

    # 计算高斯分布概率
    def gaussian_probability(self, x, mean, stdev):
        a = 1/math.sqrt(2*math.pi)*stdev
        b = math.exp(-(math.pow(x-mean, 2) / (2*math.pow(stdev, 2))))
        return a*b
        
    # 计算测试集不同类别的概率
    def calcluate_probability(self, X_test):
        probabilities = {}
        for index, x in enumerate(X_test):
            prob = {}
            for label, value in self.model.items():
#                 prob[label] = 1
                prob[label] = self.prior[label]  # 加入先验概率
                for i in range(len(value)):
                    mean, stdev = value[i]
                    prob[label] *= self.gaussian_probability(x[i], mean, stdev)
            # 将可能的类别按照概率大小降序排序
            probabilities[index] = sorted(prob.items(), key=lambda x:x[1], reverse=True)
        return probabilities
                
    def predict(self, X_test):
        probabilities = self.calcluate_probability(X_test)
        res = []
        for i in range(len(X_test)):
            res.append(probabilities[i][0][0])
        return res
    
    # 计算正确率
    def score(self, y_pred, y_test):
        count = 0
        for i, j in zip(y_pred, y_test):
            if i==j:
                count+=1
        print('Accuracy: ',count/len(y_test))
        

In [5]:
model = NaiveBayes()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.score(y_pred, y_test)

Accuracy:  0.9666666666666667


### sklearn

In [6]:
from sklearn.naive_bayes import GaussianNB

In [8]:
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.score(y_pred, y_test)

Accuracy:  0.9666666666666667
