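### 1. Bayesian fundamentals
Conditional probability:
$$P(A|B)= \frac{P(AB)}{P(B)}=\frac{P(A)P(B|A)}{P(B)}$$
Law of total probability, where the events $B_1,\dots,B_n$ partition the sample space:
$$P(A)=\sum_{i=1}^{n}P(A|B_i)P(B_i)$$
Bayes' formula:
$$P(B_j|A) = \frac{P(A|B_j)P(B_j)}{P(A)}=\frac{P(A|B_j)P(B_j)}{\sum_{i=1}^{n}P(A|B_i)P(B_i)}$$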
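
As a small numeric illustration of the total probability and Bayes formulas, the sketch below applies them to a spam filter that looks at a single word. The prior and both likelihoods are made-up numbers chosen for the example, not values measured from any dataset.

```python
# Hypothetical inputs (assumptions for illustration only)
p_spam = 0.2                 # P(B_1): prior probability that a message is spam
p_ham = 1 - p_spam           # P(B_2): prior probability that a message is ham
p_free_given_spam = 0.30     # P(A|B_1): the word "free" appears in a spam message
p_free_given_ham = 0.02      # P(A|B_2): the word "free" appears in a ham message

# Law of total probability: P(A) = sum_i P(A|B_i) P(B_i)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' formula: P(B_1|A) = P(A|B_1) P(B_1) / P(A)
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(f"P(free) = {p_free:.3f}")
print(f"P(spam | free) = {p_spam_given_free:.3f}")
```

With these assumed numbers, seeing "free" raises the probability of spam from the prior 0.2 to roughly 0.79.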
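Maximum likelihood estimation. Maximizing the posterior over the hypotheses $A_i$ lets us drop the constant $P(D)$, and it reduces to maximizing the likelihood when the prior $P(A_i)$ is the same for every $i$:
$$\max_i P(A_i|D)=\max_i\frac{P(D|A_i)P(A_i)}{P(D)}\Rightarrow\max_i P(D|A_i)P(A_i)\Rightarrow\max_i P(D|A_i)$$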
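
The last step of this chain only holds when every hypothesis has the same prior. The sketch below, using three hypothetical hypotheses with invented likelihoods and priors, shows the posterior-maximizing and likelihood-maximizing choices agreeing under a uniform prior and diverging under a skewed one.

```python
# Hypothetical likelihoods P(D|A_i) for three candidate hypotheses
likelihood = {"A1": 0.2, "A2": 0.5, "A3": 0.3}

def map_estimate(prior):
    """Return the hypothesis maximizing P(D|A_i) * P(A_i); P(D) is a shared constant."""
    return max(likelihood, key=lambda h: likelihood[h] * prior[h])

# Maximum likelihood ignores the prior entirely
mle = max(likelihood, key=likelihood.get)

uniform_prior = {"A1": 1 / 3, "A2": 1 / 3, "A3": 1 / 3}
skewed_prior = {"A1": 0.7, "A2": 0.1, "A3": 0.2}

print(mle)                          # A2
print(map_estimate(uniform_prior))  # A2, same as the maximum likelihood choice
print(map_estimate(skewed_prior))   # A1, because 0.2 * 0.7 beats 0.5 * 0.1
```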
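Bayesian network (fully connected), in which every pair of nodes is joined by an edge. By the chain rule, the joint distribution factorizes as:
$$P(a,b,c)=P(c|a,b)P(b|a)P(a)$$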
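
The chain-rule factorization $P(a,b,c)=P(c|a,b)P(b|a)P(a)$ of a fully connected network can be checked numerically. The sketch below does so on a hypothetical joint distribution over three binary variables, using exact fractions so the comparison is not affected by rounding; the weights are arbitrary.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three binary variables (weights 1..8 sum to 36)
outcomes = list(product([0, 1], repeat=3))
joint = {abc: Fraction(w, 36) for abc, w in zip(outcomes, range(1, 9))}

NAMES = ("a", "b", "c")

def marginal(**fixed):
    """Sum the joint over every outcome that agrees with the fixed variables."""
    return sum(
        p for abc, p in joint.items()
        if all(abc[NAMES.index(k)] == v for k, v in fixed.items())
    )

def chain_rule(a, b, c):
    """Evaluate P(c|a,b) * P(b|a) * P(a) from the marginals of the joint."""
    p_a = marginal(a=a)
    p_b_given_a = marginal(a=a, b=b) / p_a
    p_c_given_ab = joint[(a, b, c)] / marginal(a=a, b=b)
    return p_c_given_ab * p_b_given_a * p_a

factorization_holds = all(chain_rule(*abc) == joint[abc] for abc in outcomes)
print(factorization_holds)
```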

### 2. Classifying text with naive Bayes

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics
from IPython.display import Image

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
data=pd.read_csv('./data/Datasets5477/spam.csv',encoding='latin-1')
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [2]:
data=data.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)
data.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# Count the ham (non-spam) and spam messages
data.v1.value_counts()

ham     4825
spam     747
Name: v1, dtype: int64

In [4]:
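# Tokenize: lowercase each message and keep only runs of 2 to 100 letters, dropping digits, punctuation, and single characters
data.v2=data.apply(lambda x:re.findall('[a-z]{2,100}',str(x.v2).lower()),axis=1)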
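
To make the effect of this pattern concrete, here is the same regular expression applied to a single string. The sample text is the visible start of one of the preview rows; `tokenize` is a helper name introduced only for this illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of 2 to 100 ASCII letters."""
    return re.findall('[a-z]{2,100}', str(text).lower())

tokens = tokenize("U dun say so early hor... U c already")
print(tokens)  # ['dun', 'say', 'so', 'early', 'hor', 'already']
```

The single-letter words "U" and "c" and the punctuation are dropped, and every remaining token is lowercase.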

In [5]:
ham_words=[]
spam_words=[]
# Collect every token from the ham and spam messages into two flat lists
for label, words in zip(data.v1, data.v2):
    if label=='ham':
        ham_words.extend(words)
    else:
        spam_words.extend(words)
# Ham (non-spam) messages
# Count the 30 most frequent words
ham=Counter(ham_words).most_common(30)
ham_word_frequency=pd.DataFrame(ham)
ham_word_frequency=ham_word_frequency.rename(columns={0:"non_spam",1:'count'})
# ham=data.apply(lambda x:re.findall('[a-z]{2,100}',str(x.v2).lower()),axis=1)
# Spam messages
spam=Counter(spam_words).most_common(30)
spam_word_frequency=pd.DataFrame(spam)
spam_word_frequency=spam_word_frequency.rename(columns={0:"spam",1:'count'})
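
For readers unfamiliar with `Counter.most_common`, the sketch below shows what it returns on a toy token list. The words are invented for the example, not taken from the dataset.

```python
from collections import Counter

# Toy token list (hypothetical)
words = ["to", "call", "to", "free", "call", "to"]

# most_common(n) returns the n most frequent items as (item, count) pairs, highest count first
top_two = Counter(words).most_common(2)
print(top_two)  # [('to', 3), ('call', 2)]
```

Each pair becomes one row when the list is passed to `pd.DataFrame`, which is why the resulting frames have two columns.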

In [6]:
ham_word_frequency

| | non_spam | count |
|---|---|---|
| 0 | you | 1944 |
| 1 | to | 1554 |
| 2 | the | 1126 |
| 3 | and | 857 |
| 4 | in | 820 |
| 5 | me | 772 |
| 6 | my | 750 |
| 7 | is | 732 |
| 8 | it | 713 |
| 9 | that | 558 |


In [7]:
spam_word_frequency

| | spam | count |
|---|---|---|
| 0 | to | 688 |
| 1 | call | 370 |
| 2 | you | 299 |
| 3 | your | 264 |
| 4 | free | 228 |
| 5 | the | 206 |
| 6 | for | 204 |
| 7 | now | 203 |
| 8 | or | 192 |
| 9 | txt | 170 |


In [8]:
# # Exclude English stop words
# # Bag-of-words model
# f = feature_extraction.text.CountVectorizer(stop_words = 'english')
# X = f.fit_transform(data["v2"])
# np.shape(X)

In [15]:
# Exclude English stop words
# Bag-of-words model; each word in the combined list is treated as its own document
f = feature_extraction.text.CountVectorizer(stop_words = 'english')
X = f.fit_transform(ham_words+spam_words)
np.shape(X)

(78103, 7414)
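
The shape printed here has one row per input string and one column per vocabulary word. The plain-Python sketch below builds the same kind of count matrix for a toy corpus to show where those two dimensions come from. It is a conceptual illustration only, not how `CountVectorizer` is implemented, and it skips stop-word removal; the documents are invented.

```python
from collections import Counter

# Toy corpus (hypothetical): each string is one document
docs = ["free entry win", "ok lar", "win free free"]

# Vocabulary: every distinct token, sorted so each word gets a fixed column
vocab = sorted({word for doc in docs for word in doc.split()})

# Count matrix: one row per document, one column per vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocab])

print(vocab)         # ['entry', 'free', 'lar', 'ok', 'win']
print(count_matrix)  # [[1, 1, 0, 0, 1], [0, 0, 1, 1, 0], [0, 2, 0, 0, 1]]
print((len(count_matrix), len(vocab)))  # (3, 5)
```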

In [12]:
Y = f.fit_transform(spam_words)
np.shape(Y)

(15741, 1903)

In [18]:
data.v1=data.v1.map({'spam':1,'ham':0})
X_train, X_test, y_train, y_test = model_selection.train_test_split(data.v2, data['v1'], test_size=0.30, random_state=42)
print([np.shape(X_train), np.shape(X_test)])

[(3900,), (1672,)]


In [10]:
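# Multinomial naive Bayes: sweep the smoothing parameter alpha and store each model's scores in arrays
# MultinomialNB needs a numeric feature matrix; fitting on the raw token lists raises
# "ValueError: setting an array element with a sequence."
# Build bag-of-words counts, learning the vocabulary from the training split only
# (run this cell once after the split, since it overwrites X_train and X_test)
vectorizer = feature_extraction.text.CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train.apply(' '.join))
X_test = vectorizer.transform(X_test.apply(' '.join))

list_alpha = np.arange(1/100000, 20, 0.11)
score_train = np.zeros(len(list_alpha))
score_test = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test= np.zeros(len(list_alpha))
for count, alpha in enumerate(list_alpha):
    bayes = naive_bayes.MultinomialNB(alpha=alpha)
    bayes.fit(X_train, y_train)
    y_pred = bayes.predict(X_test)  # predict once, reuse for both metrics
    score_train[count] = bayes.score(X_train, y_train)
    score_test[count]= bayes.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, y_pred)
    precision_test[count] = metrics.precision_score(y_test, y_pred)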
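
The loop in this cell sweeps the `alpha` smoothing parameter of `MultinomialNB`. To show what that parameter does, here is a minimal from-scratch sketch of multinomial naive Bayes with additive smoothing. It is an illustration under simplifying assumptions (tiny invented training set, out-of-vocabulary words ignored), not scikit-learn's implementation; `train_nb` and `predict_nb` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha=1.0):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs; alpha must be > 0."""
    class_docs = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        # Additive smoothing: a word unseen in class c still gets probability alpha / denom
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior; out-of-vocabulary words are ignored."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training set (hypothetical messages, already tokenized)
samples = [
    (["free", "win", "free"], "spam"),
    (["call", "free"], "spam"),
    (["ok", "call"], "ham"),
    (["see", "you"], "ham"),
    (["ok", "you"], "ham"),
]
model = train_nb(samples, alpha=1.0)
print(predict_nb(model, ["free", "call"]))  # spam
print(predict_nb(model, ["ok", "you"]))     # ham
```

Without smoothing, a word that never occurs in one class would give that class a likelihood of zero, and the logarithm would be undefined. This is consistent with the sweep starting at a small positive `alpha` rather than at zero.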

In [None]:
# One row per alpha, one column per metric (a plain ndarray is enough; np.matrix is not needed)
matrix = np.c_[list_alpha, score_train, score_test, recall_test, precision_test]
models = pd.DataFrame(data = matrix, columns = 
             ['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=10)

In [None]:
# Find the model with the highest test precision
best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]

In [None]:
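# Among the models with perfect test precision, pick the one with the highest test accuracy
# (idxmax raises an error if no model reaches a precision of exactly 1)
best_index = models[models['Test Precision']==1]['Test Accuracy'].idxmax()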
bayes = naive_bayes.MultinomialNB(alpha=list_alpha[best_index])
bayes.fit(X_train, y_train)
models.iloc[best_index, :]
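
The selection rule in this cell (keep only models with a test precision of 1, then take the most accurate) can be sketched without pandas as follows. The result rows are hypothetical numbers, and `pick_best` is a helper name introduced for illustration; unlike the `idxmax` call, it returns `None` instead of raising when no model qualifies.

```python
# Hypothetical sweep results, one dict per candidate model
results = [
    {"alpha": 0.1, "test_accuracy": 0.97, "test_precision": 0.95},
    {"alpha": 5.0, "test_accuracy": 0.96, "test_precision": 1.00},
    {"alpha": 9.0, "test_accuracy": 0.94, "test_precision": 1.00},
]

def pick_best(rows):
    """Among rows with perfect precision, return the most accurate; None if there are none."""
    perfect = [r for r in rows if r["test_precision"] == 1.0]
    if not perfect:
        return None
    return max(perfect, key=lambda r: r["test_accuracy"])

best = pick_best(results)
print(best["alpha"])  # 5.0
```

Note that the first row has the highest accuracy overall but is excluded, because this rule puts precision ahead of accuracy.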

In [None]:
# Confusion matrix
m_confusion_test = metrics.confusion_matrix(y_test, bayes.predict(X_test))
pd.DataFrame(data = m_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])
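
The accuracy, precision, and recall recorded in the sweep can all be read off a confusion matrix laid out this way, with actual classes as rows and predicted classes as columns. The sketch below does the arithmetic on hypothetical counts; they are not results from this notebook.

```python
# Toy confusion matrix (hypothetical counts): rows are actual classes, columns are predictions
#             predicted 0  predicted 1
confusion = [[90, 2],    # actual 0 (ham):  true negatives, false positives
             [3, 5]]     # actual 1 (spam): false negatives, true positives

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # of the messages flagged as spam, the share that are spam
recall = tp / (tp + fn)                     # of the actual spam messages, the share that were flagged

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

A precision of 1 means the top-right cell is zero: no ham message is flagged as spam.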

In [None]:
# Repeat the sweep with a support vector classifier, varying the penalty parameter C
from sklearn import svm  # svm is not among the imports in the first cell

list_C = np.arange(500, 2000, 100) #100000
score_train = np.zeros(len(list_C))
score_test = np.zeros(len(list_C))
recall_test = np.zeros(len(list_C))
precision_test= np.zeros(len(list_C))
for count, C in enumerate(list_C):
    svc = svm.SVC(C=C)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)  # predict once, reuse for both metrics
    score_train[count] = svc.score(X_train, y_train)
    score_test[count]= svc.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, y_pred)
    precision_test[count] = metrics.precision_score(y_test, y_pred)

In [None]:
# One row per C, one column per metric (a plain ndarray is enough; np.matrix is not needed)
matrix = np.c_[list_C, score_train, score_test, recall_test, precision_test]
models = pd.DataFrame(data = matrix, columns = 
             ['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=10)

In [None]:
best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]

In [None]:
models[models['Test Precision']==1].head(n=5)