# Naive Bayes

When there is enough data and it is varied enough, naive Bayes usually reaches quite good accuracy on this task.

## Preparing the data

In [16]:
import jieba
import pandas as pd

In [17]:
df_technology = pd.read_csv("./data/technology_news.csv", encoding="utf-8")
df_technology = df_technology.dropna()

In [18]:
df_car = pd.read_csv("./data/car_news.csv", encoding="utf-8")
df_car = df_car.dropna()

In [19]:
df_entertainment = pd.read_csv("./data/entertainment_news.csv", encoding="utf-8")
df_entertainment = df_entertainment.dropna()

In [20]:
df_military = pd.read_csv("./data/military_news.csv", encoding="utf-8")
df_military = df_military.dropna()

In [21]:
df_sports = pd.read_csv("./data/sports_news.csv", encoding="utf-8")
df_sports = df_sports.dropna()
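
The five loading cells above repeat the same two steps. As a minimal sketch, they could be folded into one helper. `load_contents` is a hypothetical name that does not appear in the original notebook, and it assumes every CSV has a `content` column, as the later cells imply.

```python
import os
import pandas as pd

def load_contents(data_dir, category):
    """Read <data_dir>/<category>_news.csv, drop rows with missing values,
    and return the `content` column as a list of strings."""
    path = os.path.join(data_dir, f"{category}_news.csv")
    df = pd.read_csv(path, encoding="utf-8").dropna()
    return df["content"].tolist()
```

Under these assumptions, `load_contents("./data", "car")` would return the same list as `df_car.content.values.tolist()`.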

In [22]:
# Keep a fixed slice of paragraphs from each category as plain Python lists
technology = df_technology.content.values.tolist()[1000:21000]
car = df_car.content.values.tolist()[1000:21000]
entertainment = df_entertainment.content.values.tolist()[1000:20000]
military = df_military.content.values.tolist()[1000:20000]
sports = df_sports.content.values.tolist()[:20000]

In [23]:
technology[12]

'\u3000\u3000现在家里都拉了网线，都能无线上网，一定要帮他们先登上WiFi，另外，老人不懂得流量是什么，也不知道如何开关，控制流量，所以设置好流量上限很重要，免得不小心点开了视频或者下载，电话费就大发了。'

In [24]:
car[100]

'\u3000\u3000截至发稿时，人人车给出的处理方案仍旧是检修车辆。王先生则认为，车辆在购买时就存在问题，但交易平台并未能检测出来。因此，王先生希望对方退款。王先生称，他将找专业机构对车辆进行鉴定，并通过法律途径维护自己的权益。J256'

In [25]:
entertainment[10]

'\u3000\u3000不想拼赢但不想输给自己'

In [26]:
military[10]

'\xa0\xa0\xa0\xa0上世纪70年代，陆军某部“大功三连”因“煤油灯下学毛著”而享誉军内外。40多年来，这个优良传统一直在延续，连队成为闻名全军的思想工作模范连、基层建设模范连、科学发展模范连。'

In [27]:
sports[10]

'\u3000\u3000据统计，2016年仅在中国田径协会注册的马拉松赛事便达到了328场，继续呈现出爆发式增长的态势，2015年，这个数字还仅仅停留在134场。如果算上未在中国田协注册的纯“民间”赛事，国内全年的路跑赛事还要更多。'

## Word segmentation and Chinese text processing

### Stop words

In [28]:
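# quoting=3 is csv.QUOTE_NONE, so quote characters in the file are read literally
stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep="\t", names=["stopword"], encoding="utf-8")
# A set gives constant-time membership tests when filtering tokens
stopwords = set(stopwords.stopword.values)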

## Removing stop words

In [29]:
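def preprocess_text(content_lines, sentences, category):
    """Segment each line with jieba and append (space-joined tokens, category) to sentences."""
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            # Drop single-character tokens (mostly punctuation and particles)
            segs = [seg for seg in segs if len(seg) > 1]
            # Drop stop words
            segs = [seg for seg in segs if seg not in stopwords]
            sentences.append((" ".join(segs), category))
        except Exception:
            # Print the offending line and move on to the next one
            print(line)
            continue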
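
To see what the two filters do without running jieba, here is a small sketch on a hand-segmented token list. The tokens and the stop-word set are made up for illustration, and `filter_tokens` is a hypothetical helper that mirrors the filtering inside `preprocess_text`.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens longer than one character that are not stop words."""
    return [tok for tok in tokens if len(tok) > 1 and tok not in stopwords]

# Hand-segmented tokens standing in for the output of jieba.lcut (illustrative only)
tokens = ["老人", "不", "懂得", "流量", "是", "什么", "，"]
toy_stopwords = {"什么", "懂得"}

line = " ".join(filter_tokens(tokens, toy_stopwords))
print(line)  # 老人 流量
```

The space-joined string is the form that `CountVectorizer` later splits back into tokens.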

In [None]:
sentences = []

preprocess_text(technology, sentences, "technology")
preprocess_text(car, sentences, "car")
preprocess_text(entertainment, sentences, "entertainment")
preprocess_text(military, sentences, "military")
preprocess_text(sports, sentences, "sports")

## Generating the training set

In [None]:
import random
random.shuffle(sentences)
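
`random.shuffle` reorders the list in place and produces a different order on each run. If a reproducible order is wanted, one option is a dedicated seeded generator. The sample tuples and the seed value below are arbitrary placeholders.

```python
import random

samples = [("text a", "car"), ("text b", "sports"), ("text c", "military")]

# A private generator keeps the seed from affecting other uses of `random`
rng = random.Random(1111)
shuffled = samples[:]   # copy so the original order is preserved
rng.shuffle(shuffled)   # shuffles in place, like random.shuffle

# Seeding a new generator with the same value reproduces the same order
again = samples[:]
random.Random(1111).shuffle(again)
```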

In [None]:
for sentence in sentences[:5]:
    print(sentence[0], sentence[1])

## Generating the test set

In [None]:
from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1111)
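
With no `test_size` given, `train_test_split` holds out 25% of the samples. The categories may not all contribute the same number of paragraphs, so a stratified split is one way to keep the class proportions the same in both parts. A toy sketch on invented data:

```python
from sklearn.model_selection import train_test_split

# Invented documents and labels, two balanced classes
x = [f"doc {i}" for i in range(20)]
y = ["car"] * 10 + ["sports"] * 10

# stratify=y asks for the same class mix in the train and test parts
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=1111, stratify=y
)
print(len(x_train), len(x_test))  # 15 5
```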

## Extracting bag-of-words features

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="word", max_features=4000)
vec.fit(x_train)
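
To make the bag-of-words step concrete, here is a small sketch on two invented, already segmented documents. With `analyzer="word"` and the default token pattern, `CountVectorizer` keeps only tokens of two or more word characters, which is why the space-joined output of the preprocessing step works as input.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["手机 网络 流量", "汽车 发动机 汽车"]  # already segmented, space-joined

toy_vec = CountVectorizer(analyzer="word", max_features=4000)
counts = toy_vec.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index in the count matrix
vocab = toy_vec.vocabulary_
car_count = counts[1, vocab["汽车"]]  # "汽车" occurs twice in the second document
print(sorted(vocab), car_count)
```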

## Training the naive Bayes classifier

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
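
Once trained, the classifier can label new text, provided that text goes through the same segmentation and vectorization. A minimal sketch on invented data, using scikit-learn's `make_pipeline` to bundle the vectorizer and the classifier (the notebook itself keeps them separate):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, already segmented training documents
train_docs = ["比赛 球队 进球", "球队 教练 比赛", "发动机 车辆 检修", "车辆 购买 退款"]
train_labels = ["sports", "sports", "car", "car"]

# The pipeline vectorizes the text, then fits or applies the classifier
model = make_pipeline(CountVectorizer(analyzer="word"), MultinomialNB())
model.fit(train_docs, train_labels)

pred = model.predict(["球队 进球"])[0]
print(pred)  # sports
```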

## Accuracy

In [None]:
classifier.score(vec.transform(x_test), y_test)
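
For a classifier, `score` returns the mean accuracy, which is a single number across all five categories. A confusion matrix shows which categories get mixed up. The labels below are invented to keep the sketch self-contained.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels
y_true = ["car", "car", "sports", "sports", "military"]
y_pred = ["car", "sports", "sports", "sports", "military"]

labels = ["car", "military", "sports"]
acc = accuracy_score(y_true, y_pred)
# Rows are true labels, columns are predicted labels, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(acc)
print(cm)
```

In the notebook, `y_true` would be `y_test` and `y_pred` would be `classifier.predict(vec.transform(x_test))`.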

## Extracting 2-gram and 3-gram count features, with a somewhat larger vocabulary

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

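# ngram_range=(1, 4) extracts unigrams through 4-grams; max_features keeps the 20,000 most frequent
vec = CountVectorizer(analyzer="word", ngram_range=(1, 4), max_features=20000)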
vec.fit(x_train)
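
To show what `ngram_range` adds to the vocabulary, here is a small sketch on one invented, already segmented sentence. It uses `(1, 3)` to keep the output short.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vec = CountVectorizer(analyzer="word", ngram_range=(1, 3))
toy_vec.fit(["马拉松 赛事 增长"])

# Sort by n-gram length (number of spaces), then alphabetically
ngrams = sorted(toy_vec.vocabulary_, key=lambda g: (g.count(" "), g))
for g in ngrams:
    print(g)
```

The vocabulary holds the three single words, the two adjacent pairs, and the one triple. After `vec` is refit with a new `ngram_range`, the classifier has to be trained again on `vec.transform(x_train)`, because the feature columns have changed.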