Sentiment analysis of Chinese text using the snownlp library and a naive Bayes classifier

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Load the data
data = pd.read_csv('data1.csv')
data.head()

Unnamed: 0,comment,star
0,口味：不知道是我口高了，还是这家真不怎么样。??我感觉口味确实很一般很一般。上菜相当快，我敢...,2
1,菜品丰富质量好，服务也不错！很喜欢！,4
2,说真的，不晓得有人排队的理由，香精香精香精香精，拜拜！,2
3,菜量实惠，上菜还算比较快，疙瘩汤喝出了秋日的暖意，烧茄子吃出了大阪烧的味道，想吃土豆片也是口...,5
4,先说我算是娜娜家风荷园开业就一直在这里吃??每次出去回来总想吃一回??有时觉得外面的西式简餐...,4


In [2]:
# Distinct rating values
data['star'].unique()

array([2, 4, 5, 1], dtype=int64)

In [3]:
# Convert the star rating to a binary label (1 = positive, 0 = negative)
def label(star):
    if star > 3:
        return 1
    else:
        return 0
    
data['star_label'] = data.star.apply(label)
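
The rule above sends 4 and 5 stars to the positive class and 1 and 2 stars to the negative class. A small check of that mapping, using only the rating values printed by `data['star'].unique()`; note that a 3-star rating does not occur in this dataset, but the rule would label it negative if it did:

```python
def label(star):
    # Same rule as the notebook cell: strictly more than 3 stars is positive.
    return 1 if star > 3 else 0

observed = [2, 4, 5, 1]  # values printed by data['star'].unique()
mapping = {star: label(star) for star in sorted(observed)}
print(mapping)

# A 3-star rating is absent from the data, but would fall in the negative class.
print(label(3))
```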

We use a third-party library, snownlp, which performs sentiment analysis on text directly and returns the probability that the text is positive.

In [4]:
# Demonstrate snownlp on two short sentences
from snownlp import SnowNLP

text1 = '这个东西不错'    # "This thing is pretty good"
text2 = '这个东西很垃圾'  # "This thing is rubbish"

s1 = SnowNLP(text1)
s2 = SnowNLP(text2)

print(s1.sentiments,s2.sentiments)

0.8623218777387431 0.21406279508712744


In [5]:
# Label each comment with snownlp: positive if the score is at least 0.6
def snownlp_result(comment_words):
    s = SnowNLP(comment_words)
    if s.sentiments >= 0.6:
        return 1
    else:
        return 0
    
data['sentiment'] = data.comment.apply(snownlp_result)
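
The 0.6 cut-off is a choice, not something snownlp prescribes. One way to check it is to measure agreement with the star-derived labels at several cut-offs. The sketch below does this in plain Python; the scores and labels are made up for illustration and are not taken from `data1.csv`:

```python
def agreement(scores, labels, threshold):
    """Fraction of items whose thresholded score matches the reference label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical positivity scores and reference labels, for illustration only.
scores = [0.95, 0.86, 0.70, 0.55, 0.40, 0.21, 0.65, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for t in (0.4, 0.5, 0.6, 0.7):
    print(t, agreement(scores, labels, t))
```

With the real data, `scores` would be the `SnowNLP(comment).sentiments` values and `labels` the `star_label` column.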

In [6]:
# Fraction of rows where star_label and sentiment agree
print((data[data['star_label']==data['sentiment']].shape[0])/len(data))

0.763
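
The agreement rate is an accuracy figure, and accuracy alone does not show which kind of error dominates. A minimal sketch of a confusion-count breakdown in plain Python follows; the two label lists are hypothetical and stand in for `star_label` and `sentiment`:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tp": counts[(1, 1)],  # positive, predicted positive
        "fn": counts[(1, 0)],  # positive, predicted negative
        "fp": counts[(0, 1)],  # negative, predicted positive
        "tn": counts[(0, 0)],  # negative, predicted negative
    }

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

c = confusion_counts(y_true, y_pred)
accuracy = (c["tp"] + c["tn"]) / len(y_true)
print(c, accuracy)
```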


Predicting sentiment with naive Bayes

In [8]:
# Segment each comment into words with jieba, joined by spaces
import jieba

def chinese_jieba_cut(mytext):
    return " ".join(jieba.cut(mytext))

data['cut_comment'] = data.comment.apply(chinese_jieba_cut)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\pr\AppData\Local\Temp\jieba.cache
Loading model cost 0.777 seconds.
Prefix dict has been built successfully.


In [9]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X = data['cut_comment']
y = data.star_label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
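
For intuition, a seeded hold-out split can be sketched in plain Python as below. This is not scikit-learn's implementation, and `holdout_split` is a hypothetical helper: the same seed value will not reproduce the rows that `train_test_split(..., random_state=22)` selects. It only shows the idea of a reproducible shuffle followed by an 80/20 cut.

```python
import math
import random

def holdout_split(items, test_size=0.2, seed=22):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))  # test size rounded up
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

items = list(range(10))
train, test = holdout_split(items)
print(len(train), len(test))
```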

In [10]:
# Vectorize the text (bag of words)
from sklearn.feature_extraction.text import CountVectorizer

def get_custom_stopwords(stopwords_file):
    # Read one stop word per line. The file is opened with the platform's
    # default encoding; pass encoding=... to open() if your copy differs.
    with open(stopwords_file) as f:
        return f.read().split('\n')

stopwords_file = '哈工大停用词表.txt'
stopwords = get_custom_stopwords(stopwords_file)

vect = CountVectorizer(max_df = 0.8, 
                       min_df = 3, 
                       token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b', 
                       stop_words=frozenset(stopwords))

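# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
vect_pd = pd.DataFrame(vect.fit_transform(X_train).toarray(), columns=vect.get_feature_names_out())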
vect_pd.head()



Unnamed: 0,ipad,ok,ps,wifi,一下,一个个,一个半,一个多,一人,一份,...,麻烦,麻辣,麻酱,麻麻,黄瓜,黄盖,黄色,黏糊糊,黑椒,默默
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
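
The columns above are what survives three filters: the `token_pattern` regex, the stop word list, and the `min_df` / `max_df` document-frequency bounds. The sketch below imitates those filters in plain Python to show their effect. It is an approximation for illustration, not scikit-learn's code; the documents are invented, and `min_df=2` is used only because the toy corpus is tiny (the notebook uses 3).

```python
import re
from collections import Counter

# Same pattern as the notebook: 2+ word characters, not starting with a digit.
TOKEN_PATTERN = re.compile(r"(?u)\b[^\d\W]\w+\b")

def tokenize(doc, stop_words=frozenset()):
    """Lowercase, extract tokens with the pattern, then drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]

def build_vocabulary(docs, min_df=3, max_df=0.8, stop_words=frozenset()):
    """Keep terms found in at least min_df documents and at most max_df of all documents."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc, stop_words)))  # count each term once per document
    high = max_df * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= high)

# Hypothetical space-joined comments, shaped like the jieba output.
docs = [
    "味道 不错 服务 好 wifi",
    "味道 一般 服务 差",
    "味道 不错 上菜 快 2 人",
    "味道 很 好 服务 不错",
    "味道 一般 3d",
]

print(tokenize(docs[0]))
print(build_vocabulary(docs, min_df=2, max_df=0.8))
```

Single-character words such as 好 and tokens starting with a digit never match the pattern, 味道 is dropped because it appears in every document, and wifi is dropped because it appears in only one.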


In [11]:
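# Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Fit the vectorizer and the classifier on the training set only
X_train_vect = vect.fit_transform(X_train)
nb.fit(X_train_vect, y_train)
train_score = nb.score(X_train_vect, y_train)
print(train_score)  # training accuracy

# Reuse the fitted vocabulary for the test set
X_test_vect = vect.transform(X_test)
print(nb.score(X_test_vect, y_test))  # test accuracy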

0.899375
0.8275
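
To make the model less of a black box, here is a minimal multinomial naive Bayes with Laplace smoothing in plain Python. It is a sketch of the technique, not scikit-learn's implementation; `train_nb` and `predict_nb` are hypothetical helpers, the toy documents are invented, and `alpha=1.0` mirrors the `MultinomialNB` default.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns (log_prior, log_likelihood, vocabulary)."""
    vocab = {t for doc in docs for t in doc}
    class_sizes = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        token_counts[y].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_sizes.items()}
    log_lik = {}
    for c in class_sizes:
        # Laplace smoothing: add alpha to every count so unseen words get non-zero probability.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik, vocab

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; out-of-vocabulary tokens are ignored."""
    log_prior, log_lik, vocab = model
    known = [t for t in doc if t in vocab]
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in known) for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical tokenized comments: 1 = positive, 0 = negative.
docs = [["味道", "不错"], ["服务", "不错"], ["味道", "一般"], ["服务", "差劲"]]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, ["不错"]))
print(predict_nb(model, ["一般", "差劲"]))
```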


In [12]:
# Predict on the full dataset (training and test rows together)
X_vec = vect.transform(X)
nb_result = nb.predict(X_vec)
data['nb_result'] = nb_result

In [13]:
# Fraction of rows where star_label and nb_result agree (training rows included)
print((data[data['star_label']==data['nb_result']].shape[0])/len(data))

0.885
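
This figure is computed over all rows, most of which the classifier was trained on. Assuming the split is exactly 80/20, it should equal the size-weighted mean of the training and test accuracies printed earlier, which the arithmetic below checks:

```python
# Scores printed by the notebook above
train_acc = 0.899375  # nb.score on the training set
test_acc = 0.8275     # nb.score on the test set
test_size = 0.2       # as passed to train_test_split

# Assumption: the split is exactly 80/20, so the full-data accuracy
# is the size-weighted mean of the two scores.
combined = (1 - test_size) * train_acc + test_size * test_acc
print(round(combined, 3))
```

The held-out score of 0.8275 is therefore the better estimate of how the classifier performs on comments it has not seen.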
