### Step1:读取评论文件

加载评论信息文件, 并获取所有评论内容及情感

In [142]:
import csv
import numpy as np

def loadCommentFile(file_name):
    all_sentences = []
    
    with open(file_name, 'r', encoding='utf-8') as fp:                                    #读取文件
        reader = csv.reader(fp)
        
        #读取迭代器内的所有评论信息
        all_sentences = np.array([[comment[0],comment[1]] for comment in reader])         #保存格式[['评论1','评论情感标签']]         
    
    print('Step1:read {} comments in file...'.format(len(all_sentences)))
    return all_sentences                                                                  #返回numpy数组形式的所有评论内容及对应标签

### Step2: 去除重复评论信息

去除评论数据集中的重复评论信息

In [152]:
import pandas as pd

def removeSameComment(all_sentences):
    
    data = pd.DataFrame(all_sentences)
    same_sentence_num = data.duplicated().sum()                                           #统计重复的评论内容个数
    
    if same_sentence_num > 0:
        print('Step2:remove {} of same comments...'.format(same_sentence_num))
        data = data.drop_duplicates()                                                     #删除重复的评论内容  
    
    return data.values                                                                   #返回numpy数组形式的评论内容信息


### Step3: 使用jieba库进行分词操作

使用jieba库对中文评论语句及逆行切分操作

In [153]:
import jieba

def getAllWords(all_sentences):
    all_words = []
    
    for sentence in all_sentences:
        words = jieba.lcut(sentence[0])                                                  #将评论切词，并存放所有切分后的评论语句
        all_words.append(words)
    
    print('Step3:jieba cut successfully...')
    return np.array(all_words)

### Step4: 去除停用词

去除评论词语中已经停用的词语

In [154]:
def removeStopWords(file_name, all_words):
    stop_words = []
    with open(file_name, 'r', encoding='utf-8') as fp:                      #读取所有停用词
        stop_words = fp.read().split('\n')                                   #存到stop_words列表中(以换行符切分)
    
    for sentence in all_words:                                              #双重循环去除评论中的停用词
        for word in sentence:
            if word in stop_words:
                sentence.remove(word)
    
    print('Step4:remove stop-words successfully...')
    return np.array(all_words)                                              #以numpy数组返回


### Step5: 生成词典

将所有评论信息中存在的词语生成一个词典

In [155]:
def getDictionary(all_words):
    dictionary = []
    
    for sentence in all_words:
        for word in sentence:
            
            if word not in dictionary:
                dictionary.append(word)                                     #将所有评论中出现的词语存入词典
    
    print('Step5:{} words in total...'.format(len(dictionary)))
    return dictionary

### Step6: 生成one-hot编码(暂未用到)

把所有评论中的词语信息都进行编码，编码形式为one-hot

In [156]:
def getOneHot(dictionary):
    one_hots = []
    
    for index,word in enumerate(dictionary):              #使用one-hot编码把出现的词语转化为向量
        one_hot = np.zeros(len(dictionary))
        one_hot[index] = 1
        
        one_hots.append(one_hot)
    
    print('Step6:one-hot encoding successfully...')
    return np.array(one_hots)
    

### Step7: Word2vec编码

将评论的词语信息以word2vec的方式进行编码，并存储(这里直接使用gensim库)

In [168]:
from gensim.models import Word2Vec

def getWord2Vec(all_words):
    
    #调用Word2Vec模型，将所有词语信息转化为向量
    model = Word2Vec(all_words, sg=0, vector_size=300, window=5, min_count=1, epochs=7, negative=10)
    model.save('word2vec_model')
    
    print('word2vec encoding successfully...')
    return model

### Step8:封装数据预处理过程

封装所有的操作过程，最后结果返回所有**预处理过后的评论数组，word2vec模型，词语词典**

In [169]:
def getData():
    all_sentences = loadCommentFile('../datasets/comments.csv')
    all_sentences = removeSameComment(all_sentences)
    target = all_sentences[:,1]
    all_words = getAllWords(all_sentences)
    all_words = removeStopWords('../file/cn-stopwords.txt', all_words)
    dictionary = getDictionary(all_words)
    one_hots = getOneHot(dictionary)
    
    print('get all data successfully...')
    
    return all_words, target, dictionary

### Step9:构建评论语句向量

求和评论中每个词语的word_vector，然后取平均值，即为评论语句的向量

In [170]:
def getSentenceVec(all_words, word2vec_model):
    sentences_vector = []
    
    for sentence in all_words:
        
        sentence_vector = np.zeros(word2vec_model.wv.vector_size)
        
        #取出评论中每个单词的向量累加
        for word in sentence:
            sentence_vector += word2vec_model.wv.get_vector(word)

        #取最终结果的平均值，作为评论语句的向量，并添加到评论向量列表中
        sentences_vector.append(sentence_vector/len(sentence))
    
    #返回numpy类型的评论列表
    return np.array(sentences_vector)

### Step10: 拆分数据集为训练集与测试集

调用数据预处理的封装函数进行数据预处理，将每个词语使用word2vec模型转化为向量，并将所有评论转化为向量

然后对数据集进行切分为数据集与测试集

In [171]:
from sklearn.model_selection import train_test_split         #引入拆分训练集与测试集的方法

all_words, target, dictionary = getData()                     #获取评论的分词形式列表、对应标签、词典

word2vec_model = getWord2Vec(all_words)                       #训练Word2Vec模型
word2vec_model.save('word2vec_model')                         #保存文件

#将每一句评论信息转化为对应的评论向量
sentences_vector = getSentenceVec(all_words, word2vec_model)

#拆分数据集为训练集与测试集
X_train, X_test, y_train, y_test = train_test_split(sentences_vector, target)
print('train_test_split successfully!')

Step1:read 13499 comments in file...
Step2:remove 1016 of same comments...
Step3:jieba cut successfully...


  # This is added back by InteractiveShellApp.init_path()


Step4:remove stop-words successfully...
Step5:15834 words in total...
Step6:one-hot encoding successfully...
get all data successfully...
word2vec encoding successfully...
train_test_split successfully!


### Step11:训练多种监督模型，准备评论情感预测

#### KNN算法模型的训练与评估

In [221]:
from sklearn.neighbors import KNeighborsClassifier               #引入KNN算法模型
import time

start = time.time()

knn = KNeighborsClassifier(n_neighbors=5)                         #K近邻模型训练中，n邻居个数越多越欠拟合，越少越过拟合
knn.fit(X_train, y_train)                                         #训练KNN模型

end = time.time()
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 0.014011383056640625 seconds


In [222]:
start = time.time()

train_score = knn.score(X_train, y_train)
test_score = knn.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.9629352702414015
Test score:0.9487343800064082
It is take 4.137312889099121 seconds


#### 逻辑回归模型的训练与评估

In [240]:
from sklearn.linear_model import LogisticRegression              #引入LogisticRegression逻辑回归模型

import time

start = time.time()

#弱正则化对应过拟合，强正则化对应欠拟合
logistic_regression = LogisticRegression(C=50)                    #在逻辑回归中，参数C控制正则化强弱，C越大正则化越弱，C越小正则化越强
logistic_regression.fit(X_train, y_train)                         #训练逻辑回归模型

end = time.time()
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间


It is take 1.5031111240386963 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [241]:
start = time.time()

train_score = logistic_regression.score(X_train, y_train)
test_score = logistic_regression.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.967955565050203
Test score:0.962191605254726
It is take 0.032012939453125 seconds


#### 支持向量机分类器的训练与评估

In [259]:
from sklearn.svm import SVC                                      #引入支持向量机分类器


start = time.time()

svc = SVC(gamma=0.1, C=100)                                       #gamma控制类别的相似度程度，越小越好，C控制正则化程度，适中即可
svc.fit(X_train, y_train)                                         #训练支持向量机分类器

end = time.time()

print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 3.914299249649048 seconds


In [258]:
start = time.time()

train_score = svc.score(X_train, y_train)
test_score = svc.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.9948728904080325
Test score:0.9724447292534444
It is take 3.5872702598571777 seconds


#### 伯努利贝叶斯模型的训练与评估

In [182]:
from sklearn.naive_bayes import BernoulliNB                      #引入伯努利贝叶斯模型
start = time.time()
bernoulli_bayes = BernoulliNB()
bernoulli_bayes.fit(X_train, y_train)                             #训练伯努利贝叶斯模型
end = time.time()

print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 0.09700345993041992 seconds


In [183]:
start = time.time()

train_score = bernoulli_bayes.score(X_train, y_train)
test_score = bernoulli_bayes.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.913052766502884
Test score:0.9160525472604935
It is take 0.11000585556030273 seconds


#### 决策树模型的训练与评估

In [270]:
from sklearn.tree import DecisionTreeClassifier                  #引入决策树模型
start = time.time()   
decision_tree = DecisionTreeClassifier(max_depth=5)               #设置决策树的最大深度为5，避免出现过拟合现象           
decision_tree.fit(X_train, y_train)                               #训练多项式贝叶斯模型
end = time.time()

print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 1.640122652053833 seconds


In [271]:
start = time.time()

train_score = decision_tree.score(X_train, y_train)
test_score = decision_tree.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.9573809015167699
Test score:0.93784043575777
It is take 0.031012535095214844 seconds


#### 随机森林模型的训练与评估

In [277]:
from sklearn.ensemble import RandomForestClassifier              #引入随机森林模型
start = time.time()
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)                               #训练随机森林模型
end = time.time()

print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 19.40948247909546 seconds


In [278]:
start = time.time()

train_score = random_forest.score(X_train, y_train)
test_score = random_forest.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:1.0
Test score:0.9615507850048062
It is take 0.35802292823791504 seconds


#### 梯度提升分类器模型的训练与评估

In [279]:
from sklearn.ensemble import GradientBoostingClassifier          #引入梯度提升分类树模型
start = time.time()
gradient_boosting_tree = GradientBoostingClassifier()
gradient_boosting_tree.fit(X_train, y_train)                      #训练梯度提升分类树模型
end = time.time()

print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 307.464230298996 seconds


In [280]:
start = time.time()

train_score = gradient_boosting_tree.score(X_train, y_train)
test_score = gradient_boosting_tree.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.9818414868617816
Test score:0.9602691445049664
It is take 0.31026268005371094 seconds


#### MLP神经网络多层感知机模型的训练与评估

In [193]:
from sklearn.neural_network import MLPClassifier                 #引入神经网络多层感知机模型
start = time.time()
mlp = MLPClassifier()
mlp.fit(X_train, y_train)                                        #训练神经网络MLP多层感知机模型
end = time.time()


print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

It is take 23.577603578567505 seconds




In [195]:
start = time.time()

train_score = mlp.score(X_train, y_train)
test_score = mlp.score(X_test, y_test)

end = time.time()

print('Train score:{}'.format(train_score))
print('Test score:{}'.format(test_score))
print('It is take {} seconds'.format(end-start))                  #输出训练所需的时间

Train score:0.9900662251655629
Test score:0.975008010253124
It is take 0.07800674438476562 seconds


### Step12: 使用模型随机预测评论情感

取出测试集中的100条测试数据，使用**支持向量机分类器(SVC)**进行评论预测，并输出结果

In [301]:
import random

dic_len = len(dictionary)

start = random.randint(100,2400)

#从测试集中随机抽取50条数据，准备测试
X_data = X_test[start:start+50]
y_data = y_test[start:start+50]

success_test = 0

#对 50 条评论信息进行预测
for sequence_index in range(len(X_data)):
    
    #找到该评论在数组中的位置
    loc = np.where((sentences_vector == X_data[sequence_index]).all(axis=1))
    
    #输出该评论语句
    print('/'.join(all_words[loc[0][0]]))
    
    
    res = svc.predict([X_data[sequence_index]])                #使用支持向量机进行预测
    

    #0 代表好评， 1代表差评
    if res == '0':             
        print('Predict result : 好评', end='\t')              #输出好评
    else:              
        print('Predict result : 差评', end='\t')             #否则输出差评
    
    #实际该评论的结果
    if y_data[sequence_index] == '0':
        print('Actual results: 好评', end='\t')
    else:   
        print('Actual results: 差评', end='\t')
        
    #判断是否预测正确
    if res == y_data[sequence_index]:
        print('Predict success!', end='\t')
        success_test += 1
    else:
        print('Predict fail!', end='\t')


    print('\n\n')

print('本次测试预测准确度为: {}'.format(success_test/50))

学大/数据/大学/狗/听/有人/说/MacBook/适合/工科/买来/一段时间/感觉/还好/性能/很强/不愧/M1/ /Pro/没买/Max/版本/感觉/不到/32/核/GPU/16G/内存/我/够用/了/。
Predict result : 好评	Actual results: 好评	Predict success!	


开机/划痕/心情/郁闷/提醒/这种/退换货/还是/官网/买
Predict result : 差评	Actual results: 差评	Predict success!	


外形/外观/好看/ /选/颜色/对/屏幕/音效/屏幕/120hz/刷新/太爽了/ /用回/xr/真/卡/拍照/效果/拍照/清晰/运行/速度/速度/快/待机时间/玩/一把/游戏/掉电/ /续航/不错/，/其他/特色/充电/速度/快/ /23/分钟/37/ /真的/很快/了
Predict result : 好评	Actual results: 好评	Predict success!	


完美/机/听筒/缝隙/屏幕显示/正常/拍照/重点/京东/发货/真的/太牛/超级/21/号/下单/已经/错过/前两天/抢购/情况/25/号/下午/收到/货/?/?/以为/好久/我/手机/壳/没/买/快递/员/打电话/知道/又/火速/京/东京/造买/了/手机/壳/没想到/壳/不错/。/我/仔细/看过/镜头/没有/灰尘/边框/没有/划痕/绝大多数/机子/没有/，/让/知道/都/有/发出/来显/的/问题/，/外形/外观/完美/，/拍照/效果/无敌/，/运行/速度/超级
Predict result : 好评	Actual results: 好评	Predict success!	


买来/疫情/在家/办公/外观/时尚/大气/颜色/深/我心/自重/轻/出差/携带/发货/的/久/一点/但/值得/另外/，/小妹/导购/15/态度/很/好/，/赞/！
Predict result : 好评	Actual results: 好评	Predict success!	


苹果/生产/技术/工艺/真的/太赞/熟悉/开机/声音/干脆/铝合金/打造/金属/机身/科技/感/十足/16/寸/大家伙/有点/重/但是/接受/毕竟/屏蔽/啊/运行/流畅/很棒/感觉/京东/物流/真的/，/快递/小哥/服务态度/很/不错/，

可见，支持向量机模型对结果的预测准确度达到98%，也较为不错

### 表格对比各机器学习模型在情感识别方面的性能

前提，评论文本信息已经完成一系列的预处理操作，如

- 评论去重
- 评论分词
- 去除停用词
- one-hot词向量(效率低，未用到)
- word2vec词向量

现在列举各机器学习模型对于评论情感分析的性能

|  模型名称   | 训练耗时  | 测试耗时  | 训练集预测精度  | 测试集预测精度  |
|  :----  | :----:  | :----:  | :----:  | :----:  |
| KNN模型  | 0.01401s | 4.13731s | 96.3% | 94.9% |
| 逻辑回归模型  | 1.50311s | 0.03201s | 96.8% | 96.2% |
| 支持向量机模型  | 3.91430s | 3.58727s | 99.5% | 97.2% |
| 伯努利贝叶斯模型  | 0.09700s | 0.11001s | 91.3% | 91.6% |
| 决策树模型  | 1.64012s | 0.03101s | 95.7% | 93.8% |
| 随机森林模型  | 19.40948s | 0.35802s | 100% | 96.2% |
| 梯度提升分类树模型  | 307.46423s | 0.31026s | 98.2% | 96.0% |
| 神经网络MLP多层感知器模型  | 23.57760s | 0.07800 | 99.0% | 97.5% |

### 小结

本次统计测试结果不一定完全准确，因为没有涉及到细节的调参问题或者数据缩放问题

尤其是随机森林模型，过拟合较为严重，后续若有新的学习进展，会陆续更新改进，若有小伙伴们有疑问或指导，欢迎下方留言