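## Introduction
- More than 50,000 highly polarized movie reviews collected from the internet, split into about 44,000 reviews for training and about 10,000 for testing.
- URL: `http://jingsai.julyedu.com/v/25820185621452395/dataset.jhtml`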

`Training set`:  
Two columns: the first is the class label (0 = negative review, 1 = positive review) and the second is the review text.

| label | comment |
| --- | --- |



`Test set`:  
A single column containing the review text.

| comment |
| --- |


`Target format`:  
A single column holding the predicted class label for each test-set review.

| label |
| --- |
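
The target format can be sketched with the standard library alone. This is a minimal illustration, assuming the submission is a plain CSV with a single `label` header; the helper name `write_submission` is hypothetical and not part of the notebook.

```python
import csv
import io

def write_submission(labels, handle):
    """Write one 0/1 label per row under a single `label` header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["label"])
    for label in labels:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([label])

# Write three example predictions into an in-memory buffer instead of a file.
buffer = io.StringIO()
write_submission([1, 0, 1], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Note that no index column is written, so the file has exactly one column.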

## Dataset

In [105]:
import pandas as pd
import numpy as np

import jieba                                                      # Chinese word segmentation

from gensim.models import Word2Vec                                # word vectors

from sklearn.model_selection import train_test_split              # train/validation split

from keras.preprocessing import sequence                          # sequence padding

from keras import Sequential                                      # sequential model container
from keras.layers import Embedding, LSTM, Dense, Dropout          # embedding, LSTM, fully connected and dropout layers

from keras.optimizers import SGD                                  # SGD optimizer
from keras.optimizers import RMSprop, Adam                        # RMSprop and Adam optimizers

### Preview

In [106]:
data_train = pd.read_csv('../Data/train_data.csv')

data_train.head()

Unnamed: 0,label,comment
0,0,国王的工作就是读几句稿子啊
1,0,小朋友看嫌复杂大朋友看想快进的尴尬人物似曾相识唯一的泪点也是复制黏贴
2,0,一个非常丰富传奇的故事拍得这么浅薄大家是怎么给出五星的好奇
3,0,渣男阿飞浪费时间不推荐
4,0,一个原力青年一心跟着大师学习想成为结果半道突然发现导师其实是个打鼓像在撸管的丑比


In [107]:
data_test = pd.read_csv('../Data/test_data.csv')

data_test.head()

Unnamed: 0,comment
0,画面很美但台词实在是太太太矫情不是十几岁少年的正常青涩哪怕故作的沧桑而像没文化的油腻中年对键...
1,这片子好看好看好看好看重要的问题问四遍竟然是前百失望致幻的嗨药是弱者逃避现实的选择你觉得现实...
2,老电影有老电影说不出来的美感即使是近三个小时的电影细节做的仍非精致七武士的故事人类百科
3,拯救世人的疾病和灾难对人世间的所有痛苦感同身受最后却遭受孤独误解蒙冤走上电椅十字架这其实是在...
4,非常非常反感人与机器之间所谓的感情很傻


### Word segmentation

In [108]:
data_train['comment'][4], data_train['label'][4]

('一个原力青年一心跟着大师学习想成为结果半道突然发现导师其实是个打鼓像在撸管的丑比', 0)

In [109]:
# Tune the dictionary frequency so that 渣男 is kept as a single token
jieba.suggest_freq("渣男",tune=True)

2
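
`jieba.suggest_freq(..., tune=True)` adjusts a word's dictionary frequency so that it can be cut as one token. jieba itself builds a word graph from a prefix dictionary, picks the most probable path with dynamic programming, and falls back to an HMM for unknown words. The sketch below swaps in a much simpler technique, forward maximum matching, only to illustrate why adding a word to the dictionary changes the segmentation. The function name and the tiny dictionary are assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match-first segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

sentence = "渣男阿飞浪费时间不推荐"
base_dict = {"阿飞", "浪费时间", "浪费", "时间", "推荐"}  # hypothetical toy dictionary

without_word = forward_max_match(sentence, base_dict)
with_word = forward_max_match(sentence, base_dict | {"渣男"})
print(without_word)  # ['渣', '男', '阿飞', '浪费时间', '不', '推荐']
print(with_word)     # ['渣男', '阿飞', '浪费时间', '不', '推荐']
```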

In [110]:
# Segment the training set
lst_commentWords = []
for sen in data_train['comment']:
    lst_commentWords.append(list(jieba.cut(sen)))
    
df_trainWords = pd.DataFrame(columns=['label', 'com_words'])
df_trainWords['label'] = data_train['label']
df_trainWords['com_words'] = lst_commentWords

df_trainWords.head()

Unnamed: 0,label,com_words
0,0,"[国王, 的, 工作, 就是, 读, 几句, 稿子, 啊]"
1,0,"[小朋友, 看, 嫌, 复杂, 大, 朋友, 看想, 快进, 的, 尴尬, 人物, 似曾相识..."
2,0,"[一个, 非常, 丰富, 传奇, 的, 故事, 拍, 得, 这么, 浅薄, 大家, 是, 怎..."
3,0,"[渣男, 阿飞, 浪费时间, 不, 推荐]"
4,0,"[一个, 原力, 青年, 一心, 跟着, 大师, 学习, 想, 成为, 结果, 半道, 突然..."


In [111]:
# Segment the test set
lst_commentWords = []
for sen in data_test['comment']:
    lst_commentWords.append(list(jieba.cut(sen)))
    
df_testWords = pd.DataFrame(columns=['com_words'])
df_testWords['com_words'] = lst_commentWords

df_testWords.head()

Unnamed: 0,com_words
0,"[画面, 很美, 但, 台词, 实在, 是, 太太, 太, 矫情, 不是, 十几岁, 少年,..."
1,"[这, 片子, 好看, 好看, 好看, 好看, 重要, 的, 问题, 问, 四遍, 竟然, ..."
2,"[老电影, 有, 老电影, 说不出来, 的, 美感, 即使, 是, 近, 三个, 小时, 的..."
3,"[拯救, 世人, 的, 疾病, 和, 灾难, 对, 人世间, 的, 所有, 痛苦, 感同身受..."
4,"[非常, 非常, 反感, 人, 与, 机器, 之间, 所谓, 的, 感情, 很傻]"


### Word vectors

In [112]:
%%time
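# Train 50-dimensional word vectors on the training-set tokens
# (gensim 3.x API; gensim 4 renamed `size` to `vector_size`)
model_train = Word2Vec(df_trainWords['com_words'], size=50)

# Vocabulary dict: word -> Vocab entry holding the word's count and index
vocab_train = model_train.wv.vocab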

print(type(model_train), type(vocab_train))

<class 'gensim.models.word2vec.Word2Vec'> <class 'dict'>
Wall time: 6.11 s


In [113]:
%%time
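# Train 50-dimensional word vectors on the test-set tokens
model_test = Word2Vec(df_testWords['com_words'], size=50)

# Vocabulary dict: word -> Vocab entry holding the word's count and index
vocab_test = model_test.wv.vocab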

print(type(model_test), type(vocab_test))

<class 'gensim.models.word2vec.Word2Vec'> <class 'dict'>
Wall time: 1.36 s


In [114]:
print(str(vocab_train['国王']))
print(str(vocab_test['国王']))

Vocab(count:50, index:2601, sample_int:4294967296)
Vocab(count:10, index:2873, sample_int:4294967296)
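
Each `Vocab` entry pairs a word's corpus frequency (`count`) with its position in the vocabulary (`index`). With gensim's defaults, words seen fewer than `min_count=5` times are dropped and the rest are ordered by descending frequency, so index 0 is the most frequent word. The following is a minimal pure-Python sketch of that bookkeeping, not gensim's implementation; `build_vocab` and the toy corpus are made up for illustration, and ties are broken alphabetically here purely for determinism.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Map each kept word to its count and index, most frequent word first."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by descending count; ties are broken by the word itself.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return {word: {"count": count, "index": idx} for idx, (word, count) in enumerate(kept)}

docs = [["的", "电影", "好看"], ["的", "电影"], ["的", "剧情", "好看"], ["的"]]
vocab = build_vocab(docs, min_count=2)
print(vocab["的"])        # {'count': 4, 'index': 0}
print("剧情" in vocab)    # False: it appears only once
```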


In [115]:
# from datetime import datetime
# t = datetime.now()
# s = datetime.strftime(t, '%m%d')
# print(s)

# # Export the word-vector models
# model_train.save('../Model/model_%s_50train.bin'%s)
# model_test.save('../Model/model_%s_50test.bin'%s)

### Splitting the dataset

In [116]:
X, y = df_trainWords['com_words'], df_trainWords['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [117]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((30847,), (13221,), (30847,), (13221,))
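
The split sizes follow from the row count: the two parts add up to 44,068 rows, and with `test_size=0.3` scikit-learn rounds the test part up. The sketch below reproduces only that arithmetic with a seeded shuffle; `split_indices` is a hypothetical helper, and its permutation will not match the one `train_test_split(random_state=1)` draws.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut off ceil(test_size * n) of them as the test part."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(44068, test_size=0.3, seed=1)
print(len(train_idx), len(test_idx))  # 30847 13221
```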

## RNN-LSTM (56.27%)

### Word-ID conversion

In [14]:
%%time
# Helper: convert each tokenized sentence into a list of vocabulary ids
def getWordIndexs(l_tW, voc_t):
    t_l = []
    for sen in l_tW:
        l = []
        for w in sen:
            # Words missing from the vocabulary (rarer than Word2Vec's min_count) are skipped
            if w in voc_t:
                l.append(voc_t[w].index)
        t_l.append(l)
    return t_l

# test
l = getWordIndexs(list(X_train)[:100], vocab_train)
print(len(list(X_train)[0]), len(list(X_train)[1]), len(l[0]), len(l[1]))

26 78 26 78
Wall time: 65 ms


In [15]:
%%time
# Convert the training split to id lists
X_train_id = getWordIndexs(list(X_train), vocab_train)

Wall time: 17.9 s


In [16]:
%%time
# Convert the validation split to id lists
X_test_id = getWordIndexs(list(X_test), vocab_train)

Wall time: 7.62 s


In [17]:
%%time
# Convert the competition test set to id lists
test_id = getWordIndexs(list(df_testWords['com_words']), vocab_train)

Wall time: 5.7 s


In [18]:
X_train.shape[0], X_test.shape[0], df_testWords.shape[0]

(30847, 13221, 9999)

In [19]:
print(len(X_train_id), len(X_train_id[0]))
print(len(X_test_id), len(X_test_id[0]))
print(len(test_id), len(test_id[0]))

30847 26
13221 19
9999 72
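
One thing to watch when ids come straight from a frequency-ranked vocabulary: they start at 0, which is also the default fill value of `pad_sequences`, so padding and the most frequent word become indistinguishable. A common convention, shown here as an assumption rather than as what this notebook does, is to reserve 0 for padding and 1 for out-of-vocabulary words and shift real ids by 2. The names `PAD_ID`, `OOV_ID` and `encode` are hypothetical.

```python
PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for words missing from the vocabulary

def encode(tokens, word_to_index):
    """Convert tokens to ids, shifting real ids by 2 to free the reserved values."""
    return [word_to_index[t] + 2 if t in word_to_index else OOV_ID for t in tokens]

word_to_index = {"的": 0, "电影": 1, "好看": 2}  # hypothetical frequency-ranked vocabulary
encoded = encode(["这部", "电影", "的", "剧情", "好看"], word_to_index)
print(encoded)  # [1, 3, 2, 1, 4]
```

Under this scheme the sequence keeps its original length, and an embedding layer would need `len(word_to_index) + 2` rows.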


### Sequence padding

In [20]:
len(X_train_id), len(X_test_id), len(test_id)

(30847, 13221, 9999)

In [21]:
# Pad (or truncate) every sequence to length 500
max_words = 500

X_train_pad = sequence.pad_sequences(X_train_id, maxlen=max_words)
X_test_pad = sequence.pad_sequences(X_test_id, maxlen=max_words)
test_pad = sequence.pad_sequences(test_id, maxlen=max_words)

X_train_pad.shape, X_test_pad.shape, test_pad.shape

((30847, 500), (13221, 500), (9999, 500))
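
By default `pad_sequences` pads at the front with zeros and, for sequences longer than `maxlen`, also truncates from the front, keeping the last `maxlen` ids. The pure-Python sketch below mimics that default behaviour on plain lists for `maxlen >= 1`; `pad_pre` is an illustrative name, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                             # drop ids from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # fill the front
    return padded

rows = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(rows)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `maxlen=500` and reviews of a few dozen tokens, most of each row is padding.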

In [27]:
X_train_pad_p = X_train_pad[:30800]
X_test_pad_p = X_test_pad[:13200]


y_train_p = y_train[:30800]
y_test_p = y_test[:13200]

X_train_pad_p.shape, X_test_pad_p.shape, y_train_p.shape, y_test_p.shape

((30800, 500), (13200, 500), (30800,), (13200,))

### RNN network

In [28]:
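# Design the RNN model for sentiment analysis
## The input is a sequence of at most max_words word ids; the output is a binary sentiment label (0 or 1).


embedding_size = 200
# NOTE: Embedding's first argument (input_dim) is meant to be the vocabulary size, i.e. larger than
# the largest word id; the row count used here only works while every id stays below it.
input_size = len(X_train_pad_p)
print(input_size)

model=Sequential()

model.add(Embedding(input_size, embedding_size, input_length=max_words))
model.add(LSTM(200))

model.add(Dense(50, activation='relu'))
# model.add(Dropout(0.05))


# model.add(TimeDistributed(Dense(8)))
# model.add(TimeDistributed(Dropout(0.2)))

# A single-unit binary output needs a sigmoid; softmax over one unit always returns 1.0
model.add(Dense(1, activation='sigmoid'))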

model.summary()

30800
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 200)          6160000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 200)               320800    
_________________________________________________________________
dense_3 (Dense)              (None, 50)                10050     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 51        
Total params: 6,490,901
Trainable params: 6,490,901
Non-trainable params: 0
_________________________________________________________________
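
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas (an embedding table, an LSTM with four gates, and dense layers with biases) and the layer sizes shown in the summary; the helper functions are written for this check and are not Keras APIs.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per id.
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(30800, 200),  # embedding
    lstm_params(200, 200),         # LSTM(200) on 200-dimensional inputs
    dense_params(200, 50),         # Dense(50)
    dense_params(50, 1),           # Dense(1)
]
total_params = sum(counts)
print(counts, total_params)  # [6160000, 320800, 10050, 51] 6490901
```

About 95% of the parameters sit in the embedding table, so its first dimension dominates the model size.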


In [29]:
# Compile the model by specifying the loss function, the optimizer and the metrics to report during training
# Loss candidates: categorical_crossentropy, binary_crossentropy

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

# rmsprop = RMSprop(lr=0.01)
# model.compile(loss='cross_entropy', optimizer=rmsprop, metrics=['accuracy'])

In [31]:
# Train
## Two key training parameters must be set: the batch size and the number of training epochs;
## together with the model architecture they determine the total training time
model.fit(X_train_pad_p, y_train_p, validation_data=(X_test_pad_p, y_test_p), batch_size=200, epochs=2)

Instructions for updating:
Use tf.cast instead.
Train on 30800 samples, validate on 13200 samples
Epoch 1/2

KeyboardInterrupt: 
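
How batch size and epoch count translate into work can be estimated before training starts. A minimal sketch, using the 30,800 training samples reported in the log and a hypothetical helper `training_steps`:

```python
import math

def training_steps(n_samples, batch_size, epochs):
    """Return (batches per epoch, total weight updates)."""
    per_epoch = math.ceil(n_samples / batch_size)
    return per_epoch, per_epoch * epochs

steps_per_epoch, total_steps = training_steps(30800, batch_size=200, epochs=2)
print(steps_per_epoch, total_steps)  # 154 308
```

Each of those steps runs the LSTM over 500 time steps per sample, which is why shortening `max_words` is usually the cheapest way to reduce training time here.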

In [109]:
%%time
result = model.predict_classes(test_pad)
result[:10]

array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])
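
An all-ones prediction array like the one above is what a softmax over a single output unit produces: the one value is normalised against itself, so the output is 1.0 for every input and thresholding at 0.5 always yields class 1. A sigmoid does not have this problem. The following standard-library sketch shows the difference on a few arbitrary logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [-3.0, 0.0, 2.5]
softmax_outputs = [softmax([x])[0] for x in logits]  # one output unit per sample
sigmoid_outputs = [sigmoid(x) for x in logits]
print(softmax_outputs)  # [1.0, 1.0, 1.0]
print(sigmoid_outputs)
```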

In [112]:
pd.DataFrame(result, columns=['label']).to_csv('../Result/res_0624_5627.csv', sep=',', index=False)

## tokenize-naive Bayes

In [34]:
from sklearn.feature_extraction.text import CountVectorizer           # BoW

from sklearn.naive_bayes import MultinomialNB                         # naive Bayes

In [33]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((30847,), (13221,), (30800,), (13200,))

In [39]:
type(list(X_train)), list(X_train)[0]

(list,
 ['我们',
  '再也',
  '回不去',
  '了',
  '再',
  '多',
  '的',
  '深情',
  '也',
  '会',
  '被',
  '这',
  '年',
  '的',
  '时间',
  '打败',
  '的',
  '最后',
  '的',
  '妥协',
  '是',
  '累',
  '了',
  '想',
  '停下',
  '了'])

### BoW

In [76]:
def getSingleLst(lst_cW):
    l = []
    for lst in lst_cW:
        for s in lst:
            l.append(s)
    return l

'''Test Method'''
doc = ['国王的工作就是读几句稿子啊', '一个非常丰富传奇的故事拍得这么浅薄大家是怎么给出五星的好奇', '渣男阿飞浪费时间不推荐']
cw = []
for sen in doc:
    cw.append(list(jieba.cut(sen)))
print(list(cw))

l_s = getSingleLst(cw)
print(l_s, len(l_s))

[['国王', '的', '工作', '就是', '读', '几句', '稿子', '啊'], ['一个', '非常', '丰富', '传奇', '的', '故事', '拍', '得', '这么', '浅薄', '大家', '是', '怎么', '给出', '五星', '的', '好奇'], ['渣男', '阿飞', '浪费时间', '不', '推荐']]
['国王', '的', '工作', '就是', '读', '几句', '稿子', '啊', '一个', '非常', '丰富', '传奇', '的', '故事', '拍', '得', '这么', '浅薄', '大家', '是', '怎么', '给出', '五星', '的', '好奇', '渣男', '阿飞', '浪费时间', '不', '推荐'] 30


In [122]:
X_train.shape, X_test.shape, df_testWords['com_words'].shape

((30847,), (13221,), (9999,))

In [124]:
count_vector = CountVectorizer()

# Fit the vocabulary on every token of the segmented training data
lst_training = list(df_trainWords['com_words'])
count_vector.fit(getSingleLst(lst_training))

# Transform the training split into a count matrix
training_data = count_vector.transform([' '.join(l) for l in X_train])

# Transform the validation split into a count matrix
testing_data = count_vector.transform([' '.join(l) for l in X_test])

# Transform the competition test set into a count matrix
target_date = count_vector.transform([' '.join(l) for l in list(df_testWords['com_words'])])

In [125]:
training_data.shape, testing_data.shape, target_date.shape

((30847, 59425), (13221, 59425), (9999, 59425))
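
Because the segmented tokens are joined with spaces and tokenized again, the vectorizer's own token pattern decides which words are counted. With the default `token_pattern` of `CountVectorizer`, which only keeps tokens of two or more word characters, single-character words such as 不 are dropped. The sketch below applies that same regular expression with the standard library; `bag_of_words` is a hypothetical helper and not how scikit-learn builds its matrix internally.

```python
import re
from collections import Counter

# CountVectorizer's default token_pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(tokens):
    """Join pre-segmented tokens with spaces, re-tokenize, and count occurrences."""
    return Counter(TOKEN_PATTERN.findall(" ".join(tokens)))

bow = bag_of_words(["渣男", "阿飞", "浪费时间", "不", "推荐"])
print(sorted(bow))
print("不" in bow)  # False
```

If single-character words matter for sentiment (negations often are single characters), passing a pattern such as `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` would keep them.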

### naive Bayes

In [131]:
X_train.shape, X_test.shape

((30847,), (13221,))

In [132]:
y_train.shape, y_test.shape

((30847,), (13221,))

In [133]:
naive_bayes = MultinomialNB()

# Train the naive Bayes classifier
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [134]:
naive_bayes.score(testing_data, y_test)

0.8085621359957643
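
The score above is the mean accuracy on the held-out split. For intuition, the following is a minimal pure-Python sketch of multinomial naive Bayes with Laplace smoothing (`alpha=1.0`, the same value shown in the estimator's repr). It is a toy illustration on made-up reviews, not scikit-learn's implementation, and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log score; unseen words are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"]
        )
    return max(model, key=score)

train_docs = [["好看", "推荐"], ["精彩", "好看"], ["浪费时间", "尴尬"], ["尴尬", "失望"]]
train_labels = [1, 1, 0, 0]
nb_model = train_nb(train_docs, train_labels)
predictions = [predict_nb(nb_model, d) for d in (["好看", "精彩"], ["失望", "浪费时间"])]
print(predictions)  # [1, 0]
```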

In [138]:
result = naive_bayes.predict(target_date)

result

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

In [139]:
df_result = pd.DataFrame(result, columns=['label'])

df_result.to_csv('../Result/nB_0624_8086.csv', sep=',', index=False)

df_result.head()

Unnamed: 0,label
0,1
1,1
2,1
3,1
4,0


### Accuracy