# Quick sentiment analysis of review data with SnowNLP

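SnowNLP mainly supports Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, pinyin conversion, traditional-to-simplified Chinese conversion, keyword extraction, summarization, sentence splitting, and text similarity.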

In [1]:
from snownlp import SnowNLP

## Testing a positive JD.com review

In [2]:
SnowNLP("本本已收到，体验还是很好，功能方面我不了解，只看外观还是很不错很薄，很轻，也有质感。").sentiments

0.999950702449061

## Neutral review

In [3]:
SnowNLP("屏幕分辨率一般，送了个极丑的鼠标。").sentiments

0.03251402883400323

## Negative review

In [4]:
SnowNLP("很差的一次购物体验，细节做得极差了，还有发热有点严重啊，散热不行，用起来就是烫得厉害，很垃圾！！！").sentiments

0.0036849517156107847
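
`sentiments` is the estimated probability that the text is positive, so a value near 1 reads as positive and a value near 0 as negative. A minimal sketch of turning such a score into a coarse label follows; the function name `label_from_score` and the 0.4 / 0.6 thresholds are assumptions chosen for illustration, not part of SnowNLP.

```python
def label_from_score(score, low=0.4, high=0.6):
    """Map a positive-class probability in [0, 1] to a coarse label.

    The thresholds are hypothetical and should be tuned on labeled data.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "neutral"

# Scores copied from the three cells above.
scores = [0.999950702449061, 0.03251402883400323, 0.0036849517156107847]
labels = [label_from_score(s) for s in scores]
print(labels)
```

With these thresholds the neutral review, which scored about 0.03, is labeled negative, so the default model does not separate neutral from negative reviews in this example.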

# Training and saving a custom model

In [7]:
from snownlp import sentiment

sentiment.train("data/neg.txt", "data/pos.txt")
sentiment.save("sentiment.marshal")

## Testing

### Positive review

In [8]:
sentiment.classify("")

0.6089407099697889

### Negative review

In [9]:
sentiment.classify("标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.")

0.271552418168417

# Computing sentiment scores from a labeled sentiment lexicon

In [17]:
import pandas as pd
import jieba

## Loading the BosonNLP sentiment lexicon

In [12]:
df = pd.read_table("data/BosonNLP_sentiment_score.txt", sep=" ", names=["key", "score"])
df[:5]

|   | key | score |
|---|---|---|
| 0 | 最尼玛 | -6.704 |
| 1 | 扰民 | -6.497564 |
| 2 | fuck... | -6.329634 |
| 3 | RNM | -6.218613 |
| 4 | wcnmlgb | -5.9671 |


In [13]:
key = df["key"].values.tolist()
score = df["score"].values.tolist()

## Segmenting with jieba

In [18]:
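def getscore(line):
    segs = jieba.lcut(line)  # segment the sentence into words
    # look up the score of every word that appears in the lexicon
    score_list = [score[key.index(x)] for x in segs if x in key]
    return sum(score_list)  # total sentiment score of the sentence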
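
`getscore` scans the whole `key` list for every token, which gets slow on a large lexicon. A minimal sketch of the same sum with a dict lookup follows. The toy lexicon, its scores, and the name `score_tokens` are made up for illustration, and the tokens are passed in already segmented so the example runs without jieba.

```python
# Toy lexicon with made-up scores, standing in for the BosonNLP file.
toy_lexicon = {"开心": 2.5, "很好": 1.5, "下雨": -0.5}

def score_tokens(tokens, lexicon):
    """Sum the lexicon scores of the tokens; unknown tokens contribute 0."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

tokens = ["今天", "天气", "很好", "，", "我", "很", "开心"]
total = score_tokens(tokens, toy_lexicon)
print(total)
```

In the notebook the dict could be built with `dict(zip(key, score))`. If the lexicon contains duplicate words, that keeps the last score for a word, whereas `key.index` returns the first, so the two versions may differ slightly.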

## Scoring sentences

In [20]:
line = "今天天气很好，我很开心"
round(getscore(line),2)

5.26

In [21]:
line = "今天下雨，心情也受到影响。"
round(getscore(line),2)

-0.96

# Plotting a sentiment tree

# Sentiment classification of Guba (stock forum) data

In [10]:
import pandas as pd
import numpy as np
import jieba
import random
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
from keras.models import model_from_json
from keras.utils import np_utils
import matplotlib.pyplot as plt

Using TensorFlow backend.


## Chinese corpus

In [14]:
stopwords = pd.read_csv("data/stopwords.txt",index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
stopwords = stopwords['stopword'].values

In [15]:
df_data1 = pd.read_csv("data/data1.csv", encoding='utf-8')
df_data1.head()

|   | Id | title | time | content | replay | all_replay | time.1 | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 334 | 蝴蝶效应啊，国家队就不能伸出援手吗 | 2018/6/22 9:34 | 蝴蝶效应啊，国家队就不能伸出援手吗 | （2） | \r\n 救不起呀\r\n ... | 2018-06-22 09:45:00 | 0 |
| 1 | 341 | 根据港股跌幅计算，中兴通讯今天明天必开板。 | 2018/6/22 9:42 | 根据港股跌幅计算，中兴通讯今天明天必开板。 | （3） | \r\n 看看还有多少封单，别忽悠了，... | 2018-06-22 09:47:31 | 0 |
| 2 | 344 | 博傻开始 | 2018/6/22 9:35 | 今天半仓，明天低开全仓 | （25） | \r\n 今天打不开吧？这么大的压单\... | 2018-06-22 09:40:28 | 0 |
| 3 | 345 | 窒息 | 2018/6/22 11:58 | 110万资金惨遭七连跌的杀戮，只剩下51万，赤裸裸的屠杀 | （2） | \r\n 能剩1万算你牛！\r\n ... | 2018-06-22 15:38:35 | 0 |
| 4 | 346 | 亏大了一一一被平仓一一一也平不掉 | 2018/6/22 12:46 | 000063：5万自己的加5万荣资的，现在都是卷商的还卖不掉，还在吹我加钱吖 | （3） | \r\n 偷鸡不成蚀把米啊，你想着赚更... | 2018-06-22 12:59:25 | 0 |


In [20]:
# Drop rows that contain missing values
df_data1.dropna(inplace=True)

# Extract the text and the label
data_1 = df_data1.loc[:, ['content', 'label']]

# Split the corpus by label: 0 = negative, 1 = neutral, 2 = positive
data_label_0 = data_1.loc[data_1['label'] == 0, :]
data_label_1 = data_1.loc[data_1['label'] == 1, :]
data_label_2 = data_1.loc[data_1['label'] == 2, :]

In [21]:
data_label_0[:5]

|   | content | label |
|---|---|---|
| 0 | 蝴蝶效应啊，国家队就不能伸出援手吗 | 0 |
| 1 | 根据港股跌幅计算，中兴通讯今天明天必开板。 | 0 |
| 2 | 今天半仓，明天低开全仓 | 0 |
| 3 | 110万资金惨遭七连跌的杀戮，只剩下51万，赤裸裸的屠杀 | 0 |
| 4 | 000063：5万自己的加5万荣资的，现在都是卷商的还卖不掉，还在吹我加钱吖 | 0 |


## Word segmentation

In [24]:
# Segment each line and collect (text, label) pairs
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = filter(lambda x: len(x) > 1, segs)  # drop single-character tokens
            segs = [v for v in segs if not str(v).isdigit()]  # drop purely numeric tokens
            segs = list(filter(lambda x: x.strip(), segs))  # drop whitespace-only tokens
            segs = filter(lambda x: x not in stopwords, segs)  # drop stopwords
            temp = " ".join(segs)
            if len(temp) > 1:
                sentences.append((temp, category))
        except Exception:
            # Print the line that failed and move on
            print(line)
            continue
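
The four filters in `preprocess_text` can be checked in isolation on a token list that is already segmented. The sketch below assumes a hypothetical helper `clean_tokens` and a made-up stopword set. It uses a `set` for the stopwords, which makes the membership test fast, whereas the notebook keeps them in a NumPy array.

```python
def clean_tokens(tokens, stopwords):
    """Apply the same filters as preprocess_text to an already segmented line."""
    kept = []
    for tok in tokens:
        tok = str(tok)
        if len(tok) <= 1:      # drop single-character tokens
            continue
        if tok.isdigit():      # drop purely numeric tokens
            continue
        if not tok.strip():    # drop whitespace-only tokens
            continue
        if tok in stopwords:   # drop stopwords
            continue
        kept.append(tok)
    return kept

# Made-up stopwords and tokens, for illustration only.
toy_stopwords = {"现在", "都是"}
tokens = ["110", "资金", "惨遭", "的", "  ", "现在", "杀戮"]
cleaned = clean_tokens(tokens, toy_stopwords)
print(" ".join(cleaned))
```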

## Complex rules

In [25]:
# Get the text of each class
data_label_0_content = data_label_0['content'].values.tolist()
data_label_1_content = data_label_1['content'].values.tolist()
data_label_2_content = data_label_2['content'].values.tolist()

# Build the labeled data
sentences = []
preprocess_text(data_label_0_content, sentences, 0)
preprocess_text(data_label_1_content, sentences, 1)
preprocess_text(data_label_2_content, sentences, 2)

# Shuffle so that the classes are mixed before splitting
random.shuffle(sentences)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.882 seconds.
Prefix dict has been built succesfully.


In [26]:
# Split the original dataset into a training set and a test set with sklearn's built-in split function.
from sklearn.model_selection import train_test_split

x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,random_state=1234)

## Feature vectors

In [27]:
# Extract bag-of-words features from the text
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    analyzer='word',  # tokenize into words (the text is already space-separated)
    max_features=4000,  # keep the 4000 most frequent terms
)
vec.fit(x_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=4000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
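
To make the bag-of-words step concrete, here is a simplified pure-Python sketch of what fitting a vocabulary and counting terms amounts to. It splits on whitespace only and breaks frequency ties alphabetically, both of which are my own simplifications rather than `CountVectorizer`'s exact behavior. The helper names `fit_vocabulary` and `transform` and the three toy documents are made up.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Return {term: column index} for the most frequent terms."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    # Columns are assigned in sorted term order.
    return {term: idx for idx, term in enumerate(sorted(t for t, _ in top))}

def transform(docs, vocab):
    """Turn each document into a row of term counts over the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

docs = ["股票 下跌 下跌", "股票 上涨", "资金 下跌"]
vocab = fit_vocabulary(docs, max_features=3)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Each row has one column per vocabulary term, so the row length equals the vocabulary size and carries no word-order information.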

## Building the model

### Defining the model parameters

In [28]:
# Set the hyperparameters
max_features = 5001
maxlen = 100
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 10
nclasses = 3

### Converting to arrays and encoding the labels

In [29]:
x_train = vec.transform(x_train)
x_test = vec.transform(x_test)
x_train = x_train.toarray()
x_test = x_test.toarray()
y_train = np_utils.to_categorical(y_train, nclasses)
y_test = np_utils.to_categorical(y_test, nclasses)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (1912, 100)
x_test shape: (820, 100)
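
Note that the rows passed to `pad_sequences` here are 4000-column count vectors. With its default settings `pad_sequences` truncates from the front, so only the last 100 columns survive, and the values are term counts rather than word ids. An `Embedding` layer is normally fed sequences of integer word ids. The sketch below shows one way to build them in pure Python. It is an alternative under my own assumptions, not the notebook's method, and `build_word_index` and `to_padded_ids` are hypothetical names.

```python
from collections import Counter

def build_word_index(docs, max_features):
    """Assign ids 1..max_features-1 to the most frequent words; 0 is kept for padding."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked[:max_features - 1])}

def to_padded_ids(doc, word_index, maxlen):
    """Convert one segmented document to a fixed-length id list, left-padded with 0."""
    ids = [word_index[tok] for tok in doc.split() if tok in word_index]
    ids = ids[-maxlen:]                       # keep the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids    # pad on the left

docs = ["股票 下跌 下跌", "股票 上涨"]
word_index = build_word_index(docs, max_features=10)
padded = [to_padded_ids(d, word_index, maxlen=5) for d in docs]
print(word_index)
print(padded)
```

Unlike count vectors, these sequences preserve word order, which is what a `Conv1D` layer with `kernel_size = 3` slides over.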


### Defining a class that plots the loss curves:

In [30]:
import matplotlib.pyplot as plt

%matplotlib inline

In [31]:
class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = {'batch':[], 'epoch':[]}
        self.accuracy = {'batch':[], 'epoch':[]}
        self.val_loss = {'batch':[], 'epoch':[]}
        self.val_acc = {'batch':[], 'epoch':[]}

    def on_batch_end(self, batch, logs={}):
        self.losses['batch'].append(logs.get('loss'))
        self.accuracy['batch'].append(logs.get('acc'))
        self.val_loss['batch'].append(logs.get('val_loss'))
        self.val_acc['batch'].append(logs.get('val_acc'))

    def on_epoch_end(self, epoch, logs={}):
        self.losses['epoch'].append(logs.get('loss'))
        self.accuracy['epoch'].append(logs.get('acc'))
        self.val_loss['epoch'].append(logs.get('val_loss'))
        self.val_acc['epoch'].append(logs.get('val_acc'))

    def loss_plot(self, loss_type):
        iters = range(len(self.losses[loss_type]))
        plt.figure()
        # acc
        plt.plot(iters, self.accuracy[loss_type], 'r', label='train acc')
        # loss
        plt.plot(iters, self.losses[loss_type], 'g', label='train loss')
        if loss_type == 'epoch':
            # val_acc
            plt.plot(iters, self.val_acc[loss_type], 'b', label='val acc')
            # val_loss
            plt.plot(iters, self.val_loss[loss_type], 'k', label='val loss')
        plt.grid(True)
        plt.xlabel(loss_type)
        plt.ylabel('acc-loss')
        plt.legend(loc="upper right")
        plt.show()

### Training the model

In [32]:
history = LossHistory()
print('Build model...')
model = Sequential()

model.add(Embedding(max_features,
                        embedding_dims,
                        input_length=maxlen))
model.add(Dropout(0.5))
model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu',
                     strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Dense(nclasses))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_test, y_test), callbacks=[history])

Build model...
Train on 1912 samples, validate on 820 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x17fd0b00>
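
As a sanity check on the architecture, the layer sizes can be worked out by hand from the hyperparameters set earlier. The sketch below is plain arithmetic and assumes the standard parameter counts for `Embedding`, `Conv1D`, and `Dense` layers (weights plus one bias per output unit). The totals should agree with `model.summary()`, but they were not taken from the notebook's output.

```python
max_features, maxlen = 5001, 100
embedding_dims, filters, kernel_size = 50, 250, 3
hidden_dims, nclasses = 250, 3

# Conv1D with padding='valid' and strides=1 shortens the sequence by kernel_size - 1.
conv_steps = maxlen - kernel_size + 1

params = {
    "embedding": max_features * embedding_dims,                  # one vector per id
    "conv1d": kernel_size * embedding_dims * filters + filters,  # kernels + biases
    "dense_hidden": filters * hidden_dims + hidden_dims,         # after global max pooling
    "dense_output": hidden_dims * nclasses + nclasses,
}
total_params = sum(params.values())
print(conv_steps, params, total_params)
```

The `Dropout`, `GlobalMaxPooling1D`, and `Activation` layers add no trainable parameters, and most of the parameters sit in the embedding table.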

## Sentiment analysis