# Sentiment Analysis (Naive Bayes)

Build a classifier that predicts the sentiment of a movie review (positive, neutral, or negative) from the review text.

In [1]:
import pandas as pd

In [2]:
pos = pd.read_csv('comments_sets/pre_pos.txt',header = None)
mid = pd.read_csv('comments_sets/pre_mid.txt',header = None)
neg = pd.read_csv('comments_sets/pre_neg.txt',header = None)

In [3]:
pos['label'] = 1   # positive review label
mid['label'] = 0   # neutral review label
neg['label'] = -1  # negative review label

In [4]:
print('Positive reviews:', len(pos))
print('Neutral reviews:', len(mid))
print('Negative reviews:', len(neg))

Positive reviews: 196761
Neutral reviews: 14836
Negative reviews: 24729


In [5]:
pos.head()

Unnamed: 0,0,label
0,['好看'],1
1,"['终局', '开局', '宇宙', '第一季', '结尾', '第二季', '引子', '...",1
2,"['六个', '字']",1
3,"['钢铁', '侠凉']",1
4,"['还行', '还行']",1


In [6]:
mid.head()

Unnamed: 0,0,label
0,[],0
1,['服务'],0
2,"['情怀', '支撑']",0
3,['困'],0
4,"['剧情', '特效']",0


In [7]:
neg.head()

Unnamed: 0,0,label
0,"['老套', '剧情', '开头', '结尾']",-1
1,['垃圾'],-1
2,"['花', '钱', '娃去']",-1
3,"['对象', '黄了', '简单']",-1
4,"['感觉', '特技', '没什么', '可看']",-1
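The printed rows suggest that column `0` stores each comment as the string form of a Python list of tokens. The following is a minimal sketch, assuming that format holds for every row, of turning such a string back into a token list with `ast.literal_eval`. The helper name `parse_tokens` is hypothetical and not part of the notebook.

```python
import ast

def parse_tokens(cell):
    """Parse a stringified token list such as "['剧情', '特效']" into a list of str."""
    tokens = ast.literal_eval(cell)
    # Guard against cells that parse to something other than a list.
    if not isinstance(tokens, list):
        raise ValueError(f"expected a list, got {type(tokens).__name__}")
    return [str(t) for t in tokens]

# Sample cells copied from the rows printed above.
sample_cells = ["['老套', '剧情', '开头', '结尾']", "['垃圾']", "[]"]
parsed = [parse_tokens(c) for c in sample_cells]

# Space-joined form, convenient for whitespace-based tokenizers.
space_joined = [" ".join(tokens) for tokens in parsed]
print(parsed)
print(space_joined)
```

Empty lists such as the first neutral row parse to `[]`, so they can be counted or filtered out before feature extraction if desired.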


In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
combi = pd.concat([pos, mid, neg], ignore_index=True)
combi.columns = ['comments','label']

In [9]:
combi.head()

Unnamed: 0,comments,label
0,['好看'],1
1,"['终局', '开局', '宇宙', '第一季', '结尾', '第二季', '引子', '...",1
2,"['六个', '字']",1
3,"['钢铁', '侠凉']",1
4,"['还行', '还行']",1


In [10]:
print('Total number of reviews:', len(combi))

Total number of reviews: 236326
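The three classes are far from balanced, which matters when reading the accuracy figure later on. A small sketch of the class shares, using only the counts printed earlier in this notebook (the dictionary keys are illustrative names):

```python
# Class counts as printed by the notebook above.
counts = {"positive": 196761, "neutral": 14836, "negative": 24729}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("total:", total)
```

Positive reviews account for roughly five out of every six comments, so a model that always predicts "positive" would already score high on plain accuracy.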


In [11]:
x = combi['comments']
y = combi['label']

In [12]:
# Extract bag-of-words (BoW) features from the comments
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer()
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(x)

In [13]:
# Check the number of extracted features
bow.shape

(236326, 50566)

The vectorizer extracts 50,566 features, so each comment is represented as a 50,566-dimensional vector of token counts.
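One detail worth knowing: `CountVectorizer`'s default `token_pattern` only keeps tokens of two or more word characters, so single-character tokens such as `'困'` or `'钱'` are dropped. The sketch below shows the effect on a few rows copied from the tables above and contrasts it with a pattern that keeps single characters. Whether keeping them would improve this classifier is an open question that the notebook does not test.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Rows copied from the printed samples above (stringified token lists).
docs = ["['花', '钱', '娃去']", "['困']", "['剧情', '特效']"]

# Default settings, as used in the notebook: tokens need 2+ word characters.
default_vec = CountVectorizer()
bow_default = default_vec.fit_transform(docs)

# Alternative pattern (an assumption, not the notebook's choice) that also
# keeps single-character tokens.
keep_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow_all = keep_single.fit_transform(docs)

print(sorted(default_vec.vocabulary_), bow_default.shape)
print(sorted(keep_single.vocabulary_), bow_all.shape)
```

With the default pattern, the second comment becomes an all-zero vector, so the classifier can only fall back on class priors for it.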

In [14]:
# Split into training and test sets (70% / 30%)
from sklearn.model_selection import train_test_split
xtrain_bow, xtest_bow, ytrain, ytest = train_test_split(bow,combi['label'], random_state=42, test_size=0.3)

# Classify with multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Train the Naive Bayes classifier on the BoW features of the training set
clf = nb.fit(xtrain_bow,ytrain)

# Predict labels from the BoW features of the test set
y_pred = clf.predict(xtest_bow) 
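In the cells above, the vectorizer is fitted on all comments before the split, so the vocabulary also contains words that occur only in the test set. A hedged alternative, not the notebook's method, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that the vocabulary is learned from the training portion only. The toy data below is hypothetical, and `stratify` is added to keep class proportions equal in both parts. Results on the real data would differ from the numbers reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four comments per class, in the same
# stringified-list format as the notebook's comments column.
texts = [
    "['好看', '精彩']", "['精彩', '推荐']", "['好看', '推荐']", "['值得', '好看']",
    "['还行', '一般']", "['一般', '普通']", "['还行', '普通']", "['凑合', '一般']",
    "['垃圾', '无聊']", "['无聊', '老套']", "['垃圾', '老套']", "['失望', '无聊']",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, -1, -1, -1, -1]

# Split the raw text first, keeping class proportions in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(x_train, y_train)
toy_pred = model.predict(x_test)
print(list(zip(y_test, toy_pred)))
```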

In [15]:
print('Training set size:', len(ytrain))
print('Test set size:', len(ytest))

Training set size: 165428
Test set size: 70898


In [16]:
# scikit-learn's evaluation metrics module
from sklearn import metrics
# Accuracy
accuracy = metrics.accuracy_score(ytest, y_pred)
print('Naive Bayes accuracy on the test set: ' + str(accuracy))

Naive Bayes accuracy on the test set: 0.870222009083472


In [17]:
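# Confusion matrix (rows: true labels, columns: predicted labels, both in sorted order -1, 0, 1)
print('Confusion matrix:\n' + str(metrics.confusion_matrix(ytest, y_pred)))

Confusion matrix: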
[[ 4604   410  2388]
 [ 1443   519  2623]
 [ 1560   777 56574]]


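On the test set, the Naive Bayes classifier recognizes positive reviews very well (56,574 of 58,911 correct, about 96% recall) and negative reviews moderately well (4,604 of 7,402, about 62%), but it rarely recognizes neutral reviews (519 of 4,585, about 11%). Because positive reviews make up roughly 83% of the data, the overall accuracy of 0.87 is driven mostly by the majority class.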
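The per-class picture can be read directly off the confusion matrix. The sketch below recomputes recall, precision, overall accuracy, and a majority-class baseline from the matrix printed above. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels in sorted label order; the class names are illustrative.

```python
import numpy as np

# Confusion matrix as printed above; rows are true labels, columns are
# predicted labels, both in sorted order (-1, 0, 1).
cm = np.array([[4604, 410, 2388],
               [1443, 519, 2623],
               [1560, 777, 56574]])
class_names = ["negative", "neutral", "positive"]

recall = np.diag(cm) / cm.sum(axis=1)     # correct / true count per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count per class
accuracy = np.trace(cm) / cm.sum()
# Accuracy of always predicting the most frequent true class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f} majority baseline={majority_baseline:.4f}")
```

On the fitted model itself, `metrics.classification_report(ytest, y_pred)` reports the same per-class precision and recall without manual arithmetic.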
