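## Preprocessing the corpus for Word2Vec
- Segment the text with jieba
- Stop-word removal (skipped)
- Most NLP pipelines remove stop words once the text has been segmented. Word2Vec, however, learns from context, and that context may itself consist of stop words, so stop-word removal can be skipped here.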
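To make the point about context concrete, here is a minimal sketch in plain Python (no gensim) of the context window around one word, with and without stop-word filtering. The sample sentence, the stop list, and the helper name `context_window` are made up for illustration; they are not taken from the corpus.

```python
def context_window(tokens, index, window=2):
    """Return the tokens within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

# Hypothetical segmented sentence and stop list, for illustration only.
tokens = "这 本 书 的 内容 很 不错".split()
stop_words = {"这", "的", "很"}

target = "内容"
with_stops = context_window(tokens, tokens.index(target))

# Filtering first changes which words end up next to the target.
filtered = [t for t in tokens if t not in stop_words]
without_stops = context_window(filtered, filtered.index(target))

print(with_stops)     # context still contains the stop words
print(without_stops)  # context after filtering
```

Filtering shortens the sentence, so words that were originally far apart become neighbours. Whether that helps or hurts depends on the task, which is why leaving the stop words in is a reasonable default here.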

In [13]:
import gc
gc.collect()

5884

In [3]:
import jieba

# Segment a string with jieba and join the tokens with spaces
def separate(sentence):
    sep_text = jieba.cut(sentence.strip())
    sep_text_str = ' '.join(sep_text)
    return sep_text_str

In [4]:
with open(r'C:\Users\Zedd\Desktop\yf_amazon_20w.txt', 'r', encoding='utf-8') as f:
    data = f.read()
    # data = data.replace('\n','')  # strip line breaks
    
separated_data = separate(data)

with open(r'C:\Users\Zedd\Desktop\yf_amazon_processed.txt', 'w', encoding='utf-8') as f:
    f.write(separated_data)

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\Zedd\AppData\Local\Temp\jieba.cache
Loading model cost 0.685 seconds.
Prefix dict has been built successfully.


---

In [None]:
with open(r'C:\Users\Zedd\Desktop\yf_amazon_20w.txt', 'r', encoding='utf-8') as f:
    # Read the file line by line, stripping surrounding whitespace
    lss = [line.strip() for line in f]

import pandas as pd
import swifter
dff = pd.DataFrame({'content': lss})
dff['sep'] = dff.content.swifter.apply(separate)

content_sep = '\n'.join(dff['sep'])  # join the segmented rows, one row per line

with open(r'C:\Users\Zedd\Desktop\yf_amazon_processed.txt', 'w', encoding='utf-8') as f:
    f.write(content_sep)
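If memory is tight, the same result can be produced without holding the whole corpus in a list or DataFrame. The sketch below streams the input one line at a time. `segment_file` is a hypothetical helper, and its `segment` argument is assumed to be any callable that turns a raw line into a space-separated string, such as the `separate` function defined earlier.

```python
def segment_file(src_path, dst_path, segment):
    """Segment `src_path` line by line and write the result to `dst_path`.

    `segment` maps one raw line to a space-separated string of tokens.
    Returns the number of non-empty lines written.
    """
    written = 0
    with open(src_path, 'r', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            dst.write(segment(line) + '\n')
            written += 1
    return written
```

In this notebook it would be called as `segment_file(input_path, output_path, separate)`, with the two paths used in the cells above.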

****

## Training a Word2Vec model to obtain word vectors

In [5]:
from gensim.models import word2vec
import logging

def train_w2v():   # train Word2Vec
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)   # configure logging
    sentence = word2vec.Text8Corpus(r'C:\Users\Zedd\Desktop\yf_amazon_processed.txt')   # load the corpus
    # `size` is the gensim 3.x parameter name; gensim 4.0 renamed it to `vector_size`
    model = word2vec.Word2Vec(sentence, sg=1, size=100, window=5, min_count=1, negative=3, sample=0.001, hs=1, workers=40)   # train a skip-gram (sg=1) model
    model.save(r'./w2v_model')   # save the model
    # model = word2vec.Word2Vec.load(r'')   # the matching way to load it

def get_w2v(wordlist):   # look up word vectors
    model = word2vec.Word2Vec.load(r'./w2v_model')   # load the model
    vecs = []
    for word in wordlist:
        word = word.replace('\n', '')
        vecs.append(model.wv[word])   # vectors are stored on model.wv
    return vecs

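word2vec.Word2Vec() parameters
- sg=1 selects the skip-gram algorithm, which is sensitive to low-frequency words; the default, sg=0, selects CBOW.
- size is the dimensionality of the word vectors (the width of the hidden layer, not the number of layers). Larger values use more memory and slow training; values between 100 and 200 are typical. gensim 4.0 renamed this parameter to vector_size.
- window is the maximum distance between the current word and the predicted word within a sentence. The window is shrunk at random during training: with window=3, each target word uses 3-b words on either side, where b is drawn at random from 0 to 2.
- min_count filters the vocabulary: words that occur fewer than min_count times are ignored. The default is 5.
- negative and sample can be tuned according to the training results. sample is the threshold above which higher-frequency words are randomly downsampled; the default is 1e-3.
- negative: if > 0, negative sampling is used, and the value sets how many noise words are drawn.
- hs=1 enables hierarchical softmax. With hs=0 (the default) and negative > 0, negative sampling is used instead. This parameter therefore chooses between the two Word2Vec training methods.
- workers is the number of worker threads. It only takes effect when Cython is installed; otherwise training runs on a single core.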
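The effect of `window` is easier to see in code. The following is a minimal sketch, in plain Python, of how skip-gram (target, context) pairs could be generated with a randomly shrunk window. It assumes the shrinking scheme of the reference word2vec implementation, where each target draws b from 0 to window-1 and uses window-b words on each side. `skipgram_pairs` and the toy tokens are illustrative names, not gensim API.

```python
import random

def skipgram_pairs(tokens, window, rng):
    """Return (target, context) pairs using a randomly shrunk window.

    For each target position, b is drawn from 0..window-1 and the
    effective window is window - b words on each side.
    """
    pairs = []
    for pos, target in enumerate(tokens):
        b = rng.randrange(window)          # 0 <= b <= window - 1
        effective = window - b             # 1 <= effective <= window
        start = max(0, pos - effective)
        end = min(len(tokens), pos + effective + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:             # a word is not its own context
                pairs.append((target, tokens[ctx_pos]))
    return pairs

# Toy sentence, for illustration only.
tokens = ["w0", "w1", "w2", "w3", "w4", "w5", "w6"]
pairs = skipgram_pairs(tokens, window=3, rng=random.Random(0))
print(len(pairs), pairs[:4])
```

Because the effective window is never smaller than 1, immediate neighbours are always paired, while words at distance 3 are paired only some of the time. Nearby words therefore carry more weight on average.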

In [None]:
train_w2v()  # not enough memory; do not run

In [None]:
get_w2v(wordlist)

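## Using the Word2Vec model
- Find the words closest to a given word vector
- Measure how similar two word vectors are
- Find the word that does not belong with the others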

In [None]:
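# Find the words closest to a given word vector
model = word2vec.Word2Vec.load(r'./w2v_model')   # load the model saved by train_w2v()
req_count = 5
for key in model.wv.similar_by_word('沙瑞金', topn=100):
    if len(key[0]) == 3:   # keep only three-character words
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break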

In [None]:
# Measure how similar two word vectors are
print(model.wv.similarity('沙瑞金', '高育良'))
print(model.wv.similarity('李达康', '王大路'))

In [None]:
# Find the word that does not belong with the others
print(model.wv.doesnt_match("沙瑞金 高育良 李达康 刘庆祝".split()))
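All three queries above come down to cosine similarity between word vectors. The sketch below reproduces the idea in plain Python on hypothetical 2-D vectors. It is not gensim's implementation, which is vectorized. The odd-one-out rule (pick the word least similar to the mean of the normalized vectors) is my understanding of how `doesnt_match` behaves and should be read as an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]

def odd_one_out(words, vectors):
    """Pick the word least similar to the mean of the unit-normalized vectors."""
    units = {}
    for w in words:
        norm = math.sqrt(sum(a * a for a in vectors[w]))
        units[w] = [a / norm for a in vectors[w]]
    dim = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(units[w], mean))

# Hypothetical toy vectors, for illustration only.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.8, 0.3], "d": [0.0, 1.0]}

print(most_similar("a", toy))
print(cosine(toy["a"], toy["b"]))
print(odd_one_out(list(toy), toy))
```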

---

**<font size=4 color=blue>XGBoost multiclass classification</font>**
- objective='multi:softmax'
- num_class=?
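With `multi:softmax`, the model produces one raw score per class and returns the class with the highest softmax probability (`multi:softprob` returns the probabilities themselves). The sketch below shows that last step in plain Python. The raw scores are made up and `predict_class` is a hypothetical helper, not part of XGBoost.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Return the index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical raw scores for one sample with num_class = 3.
raw_scores = [0.2, 1.5, -0.3]
probs = softmax(raw_scores)
print(probs, predict_class(raw_scores))
```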

In [8]:
import pymysql
import pandas as pd

db = pymysql.connect(host = 'localhost',user = 'root',password = '123456',db = 'zeddhzm',charset='utf8')   
cur = db.cursor()   
sql = "SELECT * FROM jdcomment_320_cleansed_3_get_dummies"
cur.execute(sql)
result = cur.fetchall()
df = pd.DataFrame(list(result),columns=['ID','CONTENT','CREATIONTIME','USEFULVOTECOUNT','CONTENT_LENGTH','REPLYCOUNT','IMAGECOUNT','SCORE','plus','not_plus','time_delta','COMPLETENESS','subjectivity','whether_useful'])
cur.close()
db.close()

dff = df.copy()

In [None]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import plot_importance
from matplotlib import pyplot

In [None]:
all_x = dff[['CONTENT_LENGTH','REPLYCOUNT','IMAGECOUNT','SCORE','plus','not_plus','time_delta','COMPLETENESS','subjectivity']]
all_y = dff['whether_useful']
x_train,x_test,y_train,y_test = train_test_split(all_x, all_y, train_size=0.7, random_state=30)

In [None]:
xgb_0 = XGBClassifier(random_state=30, eval_metric='auc')   # recent XGBoost versions expect eval_metric here, not in fit()
xgb_0.fit(x_train, y_train)
y_pred = xgb_0.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score
print ('Accuracy: %.5f' % accuracy_score(y_test, y_pred))

from sklearn.metrics import precision_score
print('Precision: ', precision_score(y_test, y_pred, labels=None, pos_label=0, average='binary'))

from sklearn.metrics import recall_score
print('Recall: ', recall_score(y_test, y_pred, labels=None, pos_label=0, average='binary', sample_weight=None))

from sklearn.metrics import f1_score 
print('F1 score: ', f1_score(y_test, y_pred, labels=None, pos_label=0, average='binary', sample_weight=None))

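from sklearn.metrics import roc_auc_score
# ROC AUC is a ranking metric, so score with the predicted probability of class 1 rather than hard labels
y_score = xgb_0.predict_proba(x_test)[:, 1]
print('roc_auc_score: ', roc_auc_score(y_test, y_score))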
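Because the calls above pass `pos_label=0`, class 0 is treated as the positive class. The sketch below recomputes precision, recall and F1 from raw counts to make that explicit. The labels are made up and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, pos_label=0):
    """Return (precision, recall, f1) with `pos_label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 1, 0, 0, 1]
print(binary_metrics(y_true_demo, y_pred_demo, pos_label=0))
```

Switching `pos_label` to 1 on the same data generally gives different numbers, so the choice should match whichever class the analysis cares about.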