# 1607104130 鲍骞月 Naive-Bayes垃圾邮件分类

## 伯努利模型实现    
* 1. 准备数据   
    * 1.1 加载数据   
    * 1.2 对字符串进行划分（正则表达式），切分出单词    
    * 1.3 根据切分出的单词构成词表（无重复）
* 2. 特征选择      
    * 以词表中的每一个单词在一个样本中是否出现作为特征     
    * 词袋模型特征    
* 3. 训练模型    
    * 训练数据集划分（train_test_splite）    
    * 计算每个样本的词向量    
    * 根据词向量计算概率    
    * 根据概率结果得出分类结果    
* 4. 测试模型    
    * 使用测试数据集计算分类错误率     

## 1. 准备数据   

In [60]:
import numpy as np 
import re 
import math 

def textParse(res_string):
    # 使用正则表达式切分单词
    listOfTokens = re.split(r'\W*',res_string)
    # 过滤小于两个字符的单词，同时将单词全部小写   
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

In [61]:
# 加载数据 
docList = []
classList = []
fullText = []  

# 读取文件  
for i in list(range(1, 26)):
    # 加载垃圾邮件数据   
    wordList = textParse(open('./email/spam/%d.txt' % i, encoding="ISO-8859-1").read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    # 加载正常邮件数据
    wordList = textParse(open('./email/ham/%d.txt' % i, encoding="ISO-8859-1").read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)

In [63]:
# 构造词表   
def createVocabList(dataSet):
    return list(set([y for x in dataSet for y in x]))  

vocabList = createVocabList(docList)

## 2. 特征计算      

Bag of Word  

In [64]:
def bagOfWords2VecMN(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

## 3. 训练模型       

### Train—Test—Split    
从50个邮件中，随机挑选出40个作为训练集,10个做测试集  

In [65]:
np.random.seed(777)  # 设定随机种子

trainingSet = list(range(len(wordList)))    
testSet = []
for i in list(range(10)):
    # 随机
    randIndex = int(np.random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del(trainingSet[randIndex])   
# 构建训练样本矩阵    
trainMat = []
trainClasses = []
for docIndex in trainingSet:
    trainMat.append(bagOfWords2VecMN(vocabList,docList[docIndex]))
    trainClasses.append(classList[docIndex])

### 通过词向量计算概率    

文档 d 属于类别 c 的概率计算如下（多项式模型）：
$$P(c|d)=\frac{P(d|c)P(c)}{P(d)} \propto P(d|c)P(c) = P(c)\prod_{1\leq k \leq n_{d}}P(t_{k} | c) $$
$n_{d}$ 是文档的长度（词条个数）     
$P(t_{k}|c)$  是词项  $t_{k}$ 出现在类别 c 中文档的频率      
$P(t_{k}|c)$  度量的是当 c 是正确类别时  $t_{k}$  的贡献     
p(c)是类别 c 的先验概率

In [66]:
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix) # 句子，样本数
    numWords = len(trainMatrix[0])  # 词向量长度
    pAbusive = sum(trainCategory)/float(numTrainDocs) # 类别为1的样本占比
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]        # 向量：样本汇总，所有样本加到一个词向量长度的向量
            p1Denom += sum(trainMatrix[i]) # 标量：所有单词计数
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num/p1Denom)   # 向量，训练数据中归为1类的样本统计出的，词向量中每个元素出现的概率
    p0Vect = np.log(p0Num/p0Denom)  
    return p0Vect, p1Vect, pAbusive

In [67]:
## 计算概率   
p0V,p1V,pSpam = trainNB0(np.array(trainMat),np.array(trainClasses))

In [68]:
# 分类函数    
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # 使用log防止出现0
    p1 = sum(vec2Classify*p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify*p0Vec) + np.log(1-pClass1)
    return 1 if p1 > p0 else 0 

## 4. 测试模型  

In [70]:
errorCount = 0
for docIndex in testSet:
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(np.array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
        errorCount += 1
print("the error rate is : ", errorCount/len(testSet))

the error rate is :  0.6


  """


## Sklearn中伯努利贝叶斯分类实现   

In [71]:
import pandas as pd
from sklearn.feature_extraction import text
import glob

# ham
files = "./email/ham/*.txt"
ham_lst = glob.glob(files)
labels = [0]*len(ham_lst)
files = "./email/spam/*.txt"
spam_lst = glob.glob(files)
lst = ham_lst + spam_lst
labels.extend([1]*len(spam_lst))
cVct = text.CountVectorizer(input="filename",encoding="latin_1")
X = cVct.fit_transform(lst).todense().A
y = labels

In [72]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
bnb = BernoulliNB()

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
bnb.fit(X_train,y_train)
y_pred = bnb.predict(X_test)
print("error rate is %1.2f%%" % (100*np.mean(y_pred != y_test)))

error rate is 0.00%
