## Text classification with Python

The data is the Fudan Chinese text classification dataset, [download link](https://pan.baidu.com/s/1833mT2rhL6gBMlM0KnmyKg), password: zyxa

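Text classification steps:

1) Segment each document into words and write it out as a space-separated document

2) Compute the TF-IDF matrix of the training documents with fit_transform

3) Train a model: use the TF-IDF matrix as features and each document's category as the label, with naive Bayes or another classification algorithm

4) Compute the TF-IDF values of the test data with transform(test.contents)

5) Predict from the test TF-IDF values with model.predict(test.tfidf)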
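
The five steps above can be sketched end to end on a toy corpus. This is a minimal sketch, not the notebook's actual pipeline: the documents, labels and variable names are made up for illustration, and the text is assumed to be already segmented into space-separated words (step 1).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1 (assumed done): documents are already segmented, words separated by spaces
train_docs = [
    "足球 比赛 进球 球队",
    "球队 教练 比赛 冠军",
    "股票 市场 上涨 投资",
    "银行 投资 利率 市场",
]
train_labels = ["sports", "sports", "economy", "economy"]
test_docs = ["球队 赢得 比赛", "市场 利率 下降"]

# Step 2: learn the vocabulary and IDF weights on the training set only
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_docs)

# Step 3: train a naive Bayes classifier on the TF-IDF features
model = MultinomialNB()
model.fit(train_tfidf, train_labels)

# Step 4: reuse the fitted vectorizer on the test set (transform, not fit_transform)
test_tfidf = vectorizer.transform(test_docs)

# Step 5: predict a category for each test document
predicted = model.predict(test_tfidf)
print(list(predicted))
```

Calling `transform` rather than `fit_transform` on the test data keeps the test columns aligned with the vocabulary learned from the training set.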

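### The TF-IDF algorithm

TF-IDF (Term Frequency-Inverse Document Frequency) is commonly used to represent text as numeric features.

The main idea is to use the frequency of key terms in a document to measure how close the document is to a topic, or to compute the similarity between documents.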




In [36]:
import jieba
title = "安徽落实落实发展城市建设"
words = jieba.cut(title)
words = list(words)
print(words)

['安徽', '落实', '落实', '发展', '城市', '建设']


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
change_title = " ".join(words)

# min_df=0 is rejected by recent scikit-learn releases; 1 is the smallest valid document count
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=1.0, min_df=1)
x = vectorizer.fit_transform([change_title])

for key,value in vectorizer.vocabulary_.items():
    print(key,x[0,value])



安徽 0.3816141458138271
落实 0.6461289150464732
发展 0.3816141458138271
城市 0.3816141458138271
建设 0.3816141458138271


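Calculation steps

1) Compute TF (how often a word occurs in a single document or sentence)

TF = number of occurrences of the word / total number of words

2) Compute IDF (how informative the word is)

IDF = log(total number of documents / (number of documents containing the word + 1))

The formula captures the idea that a word occurring in nearly every sentence or document

is a common word and carries little information, so its IDF is small. (The +1 prevents division by zero in case a word occurs in no document.)

3) TF-IDF
TF-IDF = TF * IDF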
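
The three formulas above can be written directly in plain Python. This is a minimal sketch under the textbook definitions just given; the helper names `tf`, `idf` and `tf_idf` and the four-document toy corpus are illustrative assumptions.

```python
import math

def tf(term, doc_tokens):
    # TF = occurrences of the term / total number of words in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # IDF = log(number of documents / (documents containing the term + 1))
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus of already segmented documents
corpus = [
    ["安徽", "落实", "落实", "发展", "城市", "建设"],
    ["城市", "建设", "规划"],
    ["体育", "比赛"],
    ["经济", "发展"],
]

for term in ["落实", "城市"]:
    print(term, tf_idf(term, corpus[0], corpus))
```

"落实" occurs twice in the first document and in no other, so it scores higher than "城市", which occurs once and also appears in a second document. Note that with the +1 in the denominator, a word present in every document gets a slightly negative IDF.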



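scikit-learn computes TF-IDF slightly differently

tf = number of times the word occurs in the document (the raw count, not divided by the document length)
If sublinear_tf=True, then tf = 1 + ln(tf)

idf = ln((total number of documents + smooth) / (number of documents containing the word + smooth)) + 1, with smooth = 1 (the default, smooth_idf=True)

By default the TF-IDF vector of each document is L2-normalized before it is returned
tf-idf = norm(tf * idf)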

In the example above
"落实": tf = 1 + ln(2) = 1 + 0.6931471806 = 1.6931471806
idf = ln((1+1)/(1+1)) + 1 = 1, with smooth = 1
tf-idf = norm(1.6931471806)

Normalization: 1.6931471806 / (1.6931471806^2 + 1^2 + 1^2 + 1^2 + 1^2)^0.5 = 1.6931471806 / 2.62044793407038122 = 0.646128915055377
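
The hand calculation above can be checked with a few lines of plain Python. This is a sketch that re-implements the sublinear tf, smoothed idf and L2 normalization formulas for the single example sentence; it does not call scikit-learn, and the variable names are illustrative.

```python
import math
from collections import Counter

tokens = ["安徽", "落实", "落实", "发展", "城市", "建设"]
n_docs = 1  # the example corpus contains a single document

weights = {}
for term, count in Counter(tokens).items():
    tf = 1 + math.log(count)                        # sublinear_tf=True
    df = 1                                          # every term occurs in the only document
    idf = math.log((n_docs + 1) / (df + 1)) + 1     # smoothed idf, equals 1 here
    weights[term] = tf * idf

# L2 normalization: divide by the Euclidean length of the document vector
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {term: w / norm for term, w in weights.items()}

for term, value in tfidf.items():
    print(term, value)
```

The values agree with the vectorizer output shown earlier: about 0.646 for "落实" and about 0.382 for each of the other four words.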


vectorizer.fit_transform returns a matrix: each row represents one sentence or document, and each column holds the TF-IDF value of one word.
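
A small sketch of that row and column layout, using two made-up documents. The column of a word is looked up through `vocabulary_`, the same attribute used in the example above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["城市 建设 发展", "体育 比赛"]
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(docs)

# Two documents and five distinct words give a 2 x 5 sparse matrix
print(x.shape)

# vocabulary_ maps each word to its column index
column = vectorizer.vocabulary_["城市"]
print(x[0, column])  # non-zero: the word occurs in the first document
print(x[1, column])  # zero: the word does not occur in the second document
```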



### Word segmentation

Segmentation uses jieba.

The segmented output is saved to files so that documents do not have to be segmented again.

In [43]:
import sys  
import os  
import jieba
import re


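# Raw strings keep the Windows backslashes from being read as escape sequences
datasetdir = r"D:\my\ml\data\ch_document_category"
changedatadir = r"D:\my\ml\data\changedatadir"
stopwordpath = r"D:\my\ml\data\ch_stopword.txt"
# A set makes the per-word stopword lookup fast
with open(stopwordpath, 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}
## Data layout: each directory is one category
## Each text file starts with metadata such as author, title and serial number
## Preprocess each article before segmentation: remove spaces, newlines, serial number, publication date, etc.

def pre_handler_content(content):
    # Keep only the title and the body when the expected header fields are present
    patten = r"【 标  题 】(.*?)\r?\n【 正  文 】([\s\S]*)"
    match = re.search(patten,content, re.M|re.I)
    if match is not None:
        g = match.groups()
        if len(g) == 2:
            content = "".join(g)
    content = re.sub(r"[\r\n\t\s＊]", "", content)
    return content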
    
    
def split_world_and_tostr(content):
    words = jieba.cut(content)
    savewords = []
    for word in words:
        if word not in stopwords:
            savewords.append(word)
    if len(savewords)>0:
        return " ".join(savewords)
    return ''


## List the category subdirectories
catelist = os.listdir(datasetdir)


def makesplit():
    for dirname in catelist:
        fulldirname = os.path.join(datasetdir,dirname)
        newdirname = os.path.join(changedatadir,dirname)
        if not os.path.exists(newdirname):  # create the output directory for segmented text if it does not exist
            os.makedirs(newdirname) 
        file_list = os.listdir(fulldirname)
        for filename in file_list:
            with open(os.path.join(fulldirname,filename), "rb") as f:
                content = f.read()
            # The corpus is GBK-encoded; bytes that cannot be decoded are dropped
            content = content.decode('gbk', 'ignore')
            content = pre_handler_content(content)
            change_str = split_world_and_tostr(content)
            if len(change_str)>0:
                # Write GBK explicitly, because the loading step decodes these files as GBK
                with open(os.path.join(newdirname,filename), "w", encoding="gbk", errors="ignore") as f:
                    f.write(change_str)

    

makesplit()
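
`makesplit()` re-segments every file each time it runs. A minimal sketch of skipping documents whose segmented output already exists is shown below. The helper `segment_to_cache` and the whitespace tokenizer standing in for `jieba.cut` are hypothetical names introduced only for this illustration.

```python
import os
import tempfile

def segment_to_cache(text, cache_path, tokenize):
    """Return space-separated tokens, reusing cache_path if it already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()
    segmented = " ".join(tokenize(text))
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(segmented)
    return segmented

calls = []

def toy_tokenize(text):
    # Hypothetical stand-in for jieba.cut: splits on whitespace and records each call
    calls.append(text)
    return text.split()

with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, "doc1.txt")
    first = segment_to_cache("城市  建设   发展", cache_path, toy_tokenize)
    second = segment_to_cache("城市  建设   发展", cache_path, toy_tokenize)

print(first)
print(len(calls))  # the tokenizer ran only for the first call
```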

### 结构化表示-构建词向量空间

In [44]:
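# sklearn.datasets.base is no longer importable in recent scikit-learn releases; Bunch is available from sklearn.utils
from sklearn import utils as base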
import random

def load_data_to_learn():
    bunchtrain = base.Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunchtest = base.Bunch(target_name=[], label=[], filenames=[], contents=[])
    catelist = os.listdir(changedatadir)
    for dirname in catelist:
        fulldirname = os.path.join(changedatadir,dirname)
        for filename in os.listdir(fulldirname):
            fullfilename = os.path.join(fulldirname,filename)
            with open(fullfilename, "rb") as f:
                content = f.read().decode('gbk', 'ignore')
            # Send roughly 70% of the files to the training set and the rest to the test set
            target = bunchtrain if random.random() < 0.7 else bunchtest
            target.label.append(dirname)
            target.target_name.append(dirname)
            target.filenames.append(fullfilename)
            target.contents.append(content)

    return (bunchtrain,bunchtest)
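
Because the split relies on unseeded `random.random()`, the training and test sets differ on every run. A minimal sketch of a reproducible variant follows; the function `split_train_test`, its `seed` parameter and the generated file names are assumptions for illustration, not part of the notebook.

```python
import random

def split_train_test(items, train_ratio=0.7, seed=42):
    # A dedicated, seeded generator makes the split repeatable across runs
    rng = random.Random(seed)
    train, test = [], []
    for item in items:
        (train if rng.random() < train_ratio else test).append(item)
    return train, test

filenames = ["doc%03d.txt" % i for i in range(100)]
train_a, test_a = split_train_test(filenames)
train_b, test_b = split_train_test(filenames)

print(len(train_a), len(test_a))
print(train_a == train_b)  # the same seed gives the same split
```

As in the original function, the 70/30 proportion is only approximate, since each file is assigned independently.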

    
    





In [46]:
import _pickle as cPickle

bunch,bunchtest = load_data_to_learn()

rumtimepath = "D:\\my\\ml\\data\\runtime"
with open(os.path.join(rumtimepath,"bunch.dat"), "wb+") as file_obj:  
    cPickle.dump(bunch, file_obj)
    print("Finished building the training set text object!")

with open(os.path.join(rumtimepath,"bunchtest.dat"), "wb+") as file_obj:  
    cPickle.dump(bunchtest, file_obj)
    print("Finished building the test set text object!")

Finished building the training set text object!
Finished building the test set text object!
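
The point of dumping these objects is that a later session can load them instead of re-reading every file. A minimal round-trip sketch, using a plain dict as a stand-in for the Bunch and a temporary directory instead of the notebook's runtime path (both are assumptions for illustration):

```python
import os
import pickle
import tempfile

# Plain dict standing in for the Bunch built by load_data_to_learn()
toy_bunch = {
    "label": ["C11-Space", "C39-Sports"],
    "filenames": ["a.txt", "b.txt"],
    "contents": ["卫星 发射", "足球 比赛"],
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bunch.dat")
    with open(path, "wb") as f:
        pickle.dump(toy_bunch, f)
    # Later: load the saved object back instead of rebuilding it
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == toy_bunch)
```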


### tf-idf

In [47]:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfspace = base.Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_



with open(os.path.join(rumtimepath,"tfidfspace.dat"), "wb+") as file_obj:  
    cPickle.dump(tfidfspace, file_obj)  
    print("Finished building the TF-IDF space!")


Finished building the TF-IDF space!


In [48]:
feature_path = os.path.join(rumtimepath,"feture.dat")
with open(feature_path, 'wb') as fw:
    cPickle.dump(vectorizer.vocabulary_, fw)


### Naive Bayes classification

In [50]:
from  sklearn.feature_extraction.text import CountVectorizer
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=cPickle.load(open(feature_path, "rb")))
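
A vocabulary restored this way fixes the column layout but does not carry the fitted IDF weights. One option, sketched here under the assumption that the whole fitted vectorizer can be pickled alongside the other runtime files, is to save and reload the `TfidfVectorizer` itself. The toy documents are made up for illustration.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["城市 建设 发展", "体育 比赛 球队", "经济 发展 市场"]
vectorizer = TfidfVectorizer(sublinear_tf=True)
vectorizer.fit(train_docs)

# Serialize the fitted vectorizer: vocabulary and IDF weights travel together
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

new_doc = ["城市 经济 发展"]
original_row = vectorizer.transform(new_doc).toarray()
restored_row = restored.transform(new_doc).toarray()
print((original_row == restored_row).all())
```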


In [53]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
import random

### Training and test features and labels
traindata_x = tfidfspace.tdm
traindata_y = bunch.label


testdata_x = vectorizer.transform(bunchtest.contents)
testdata_y = bunchtest.label
# countnum = len(tfidfspace.label)
# for i in range(0,countnum-1):
#     r = random.random()
#     if r<0.3:
#         testdata_x = tfidfspace.tdm
#         testdata_y = tfidfspace.label
#     else:
#         traindata_x = tfidfspace.tdm
#         traindata_y = tfidfspace.label





In [54]:
alpha_list = [0.00035,0.0006,0.001,0.01]
for al in alpha_list:
    clf = MultinomialNB(alpha=al)                                  
    accs=cross_val_score(clf,traindata_x ,traindata_y,cv=10,scoring='accuracy')
    print('Cross-validation accuracy:',accs)

Cross-validation accuracy: [0.73417722 0.79220779 0.73611111 0.81944444 0.81428571 0.82089552
 0.78125    0.7704918  0.7704918  0.78333333]
Cross-validation accuracy: [0.75949367 0.79220779 0.75       0.84722222 0.85714286 0.82089552
 0.796875   0.78688525 0.73770492 0.76666667]
Cross-validation accuracy: [0.75949367 0.79220779 0.76388889 0.84722222 0.85714286 0.80597015
 0.796875   0.78688525 0.75409836 0.75      ]
Cross-validation accuracy: [0.75949367 0.80519481 0.77777778 0.80555556 0.77142857 0.80597015
 0.8125     0.78688525 0.7704918  0.76666667]
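
The per-fold scores are easier to compare as means. The sketch below copies the fold accuracies printed above into a dict and picks the alpha with the highest mean; the names `cv_scores`, `mean_scores` and `best_alpha` are illustrative.

```python
from statistics import mean

# Fold accuracies copied from the cross-validation output above
cv_scores = {
    0.00035: [0.73417722, 0.79220779, 0.73611111, 0.81944444, 0.81428571,
              0.82089552, 0.78125, 0.7704918, 0.7704918, 0.78333333],
    0.0006: [0.75949367, 0.79220779, 0.75, 0.84722222, 0.85714286,
             0.82089552, 0.796875, 0.78688525, 0.73770492, 0.76666667],
    0.001: [0.75949367, 0.79220779, 0.76388889, 0.84722222, 0.85714286,
            0.80597015, 0.796875, 0.78688525, 0.75409836, 0.75],
    0.01: [0.75949367, 0.80519481, 0.77777778, 0.80555556, 0.77142857,
           0.80597015, 0.8125, 0.78688525, 0.7704918, 0.76666667],
}

mean_scores = {alpha: mean(scores) for alpha, scores in cv_scores.items()}
best_alpha = max(mean_scores, key=mean_scores.get)

for alpha, score in mean_scores.items():
    print(alpha, round(score, 4))
print("best alpha:", best_alpha)
```

The highest mean belongs to alpha=0.0006, the value used to fit the final classifier, although its lead over alpha=0.001 is very small.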


In [55]:
clf = MultinomialNB(alpha=0.0006)
clf.fit(traindata_x ,traindata_y)

MultinomialNB(alpha=0.0006, class_prior=None, fit_prior=True)

In [61]:
## Prediction results

predicted = clf.predict(testdata_x)

count = 0
for flabel,filename,expct_cate in zip(testdata_y,bunchtest.filenames,predicted):
    if flabel != expct_cate:
        count = count+1
        print("%s actual: %s -- predicted: %s" % (filename,flabel,expct_cate))
print("Prediction finished: %d test documents, %d misclassified" % (len(testdata_y),count))
 
# Compute classification metrics (weighted averages over all categories):
from sklearn import metrics
def metrics_result(actual, predict):
    print("precision:%.3f" % metrics.precision_score(actual, predict,average='weighted'))
    print("recall:%.3f" % metrics.recall_score(actual, predict,average='weighted'))
    print("f1-score:%.3f" % metrics.f1_score(actual, predict,average='weighted'))

metrics_result(testdata_y, predicted)



D:\my\ml\data\changedatadir\C11-Space\C11-Space0086.txt actual: C11-Space -- predicted: C39-Sports
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics10.txt actual: C16-Electronics -- predicted: C32-Agriculture
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics20.txt actual: C16-Electronics -- predicted: C34-Economy
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics22.txt actual: C16-Electronics -- predicted: C34-Economy
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics30.txt actual: C16-Electronics -- predicted: C34-Economy
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics38.txt actual: C16-Electronics -- predicted: C34-Economy
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics47.txt actual: C16-Electronics -- predicted: C34-Economy
D:\my\ml\data\changedatadir\C16-Electronics\C16-Electronics55.txt actual: C16-Electronics -- predicted: C11-Space
D:\my\ml\data\changedatadir\C17-Communication\C17-Communication19.txt actual: C17-Communication -- predicted: C34-Economy
D:\my\ml\data\changedatadir\C17-Communication\