# Word2Vec

Another way to generate word vectors (word embeddings) is Word2Vec.

Word2Vec includes two training models: the CBOW model and the Skip-gram model.
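The difference between the two models can be illustrated by the training examples each one draws from a sentence. The following is a minimal sketch, not gensim's implementation: the helper names `cbow_pairs` and `skipgram_pairs` and the fixed, symmetric window are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word): skip-gram predicts each context word from the center."""
    for context, center in cbow_pairs(tokens, window):
        for word in context:
            yield center, word


tokens = ["gave", "dad", "gag", "gift"]
cbow = list(cbow_pairs(tokens, window=1))
skipgram = list(skipgram_pairs(tokens, window=1))
print(cbow)       # one example per position, each with a list of context words
print(skipgram)   # one example per (center, context) pair
```

CBOW produces one example per word position, while skip-gram produces one per center/context pair, which is one reason skip-gram training tends to be slower.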

## CBOW MODEL

In practice, however, the input does not have to be a one-hot encoding; it can also be a randomly generated N-dimensional vector.

<table>
<td> 
<img src="oneHot.jpg"> <br>
</td> 
<td> 
<img src="avgVec.jpg"> <br>
</td> 
</table>


<table>
<td> 
<img src="fc.jpg"> <br>
</td> 
<td> 
<img src="softMax.jpg"> <br>
</td> 
</table>
<caption><center> **Figure 1**: Feature vector of a review</center></caption>
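Judging by their file names, the four images cover a one-hot input, an averaged vector, a fully connected layer, and a softmax. A forward pass through those stages might look like the NumPy sketch below. The toy vocabulary, the dimensions, and the weight names `W_in` and `W_out` are assumptions for illustration, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gave", "dad", "gag", "gift", "kick"]
V, N = len(vocab), 3                      # vocabulary size and embedding dimension (toy values)

W_in = rng.normal(size=(V, N))            # input embeddings, randomly initialized
W_out = rng.normal(size=(N, V))           # weights of the fully connected output layer


def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


context_ids = [0, 2]                      # context words "gave" and "gag"

# Multiplying a one-hot vector by W_in simply selects one row of W_in.
looked_up = np.stack([one_hot(i, V) @ W_in for i in context_ids])
assert np.allclose(looked_up, W_in[context_ids])

hidden = looked_up.mean(axis=0)           # average the context vectors
scores = hidden @ W_out                   # fully connected layer
probs = np.exp(scores - scores.max())     # numerically stable softmax
probs /= probs.sum()

print("predicted center word:", vocab[int(probs.argmax())])
```

Because the one-hot product is just a row lookup, starting from randomly initialized N-dimensional vectors is equivalent to starting from one-hot inputs.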

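# Extract the data
The first column is the rating, from 1 to 5\
The second column is the review title\
The third column is the review body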

In [141]:
import pandas as pd
from collections import Counter
from tqdm import tqdm
test = pd.read_csv('test.csv',header=None)
test.head()

Unnamed: 0,0,1,2
0,1,mens ultrasheer,"This model may be ok for sedentary types, but ..."
1,4,Surprisingly delightful,This is a fast read filled with unexpected hum...
2,2,"Works, but not as advertised",I bought one of these chargers..the instructio...
3,2,Oh dear,I was excited to find a book ostensibly about ...
4,2,Incorrect disc!,"I am a big JVC fan, but I do not like this mod..."


In [142]:
train = pd.read_csv('train.csv',header=None)
train.head()

Unnamed: 0,0,1,2
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...


## Inspect the data

In [143]:
import numpy as np
test_data=test[2]
test_label=test[0]
print('size of test data is ',len(test_data))
print('the score of: "',test_data[1],'"is ',test_label[1])

size of test data is  650000
the score of: " This is a fast read filled with unexpected humour and profound insights into the art of politics and policy. In brief, it is sly, wry, and wise. "is  4


In [144]:
train_data=train[2]
train_label=train[0]
print('size of train data is ',len(train_data))
print('the score of: "',train_data.iloc[0],'"is ',train_label.iloc[0])

size of train data is  3000000
the score of: " Gave this to my dad for a gag gift after directing "Nunsense," he got a reall kick out of it! "is  3


## Clean the data
Keep only alphabetic words, convert all letters to lowercase, and remove stop words.

In [145]:
import re
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')
def cleanText(raw_text,remove_stopwords=True,sentences=False):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    text=letters_only.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
    if sentences:
        return " ".join(text)
    else:
        return text
text=cleanText(train_data.iloc[0],sentences=True)
print("Before processing: ",train_data.iloc[0])
print("After processing: ",text)

Before processing:  Gave this to my dad for a gag gift after directing "Nunsense," he got a reall kick out of it!
After processing:  gave dad gag gift directing nunsense got reall kick


## Resample a subset of the data for the experiment

In [146]:
import numpy as np
t_data,t_label=[],[]
for i in range(30000):
    idx=np.random.randint(len(train_data))
    t_data.append(train_data[idx])
    t_label.append(train_label[idx])

test,label=[],[]
for i in range(1000):
    idx=np.random.randint(len(test_data))
    test.append(test_data[idx])
    label.append(test_label[idx])

len(t_label)
# t_data,t_label=train_data[:],train_label[:]
# test,label=test_data[:],test_label[:]

30000
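The loop above draws indices with replacement from an unseeded generator, so each run selects a different subset. A seeded variant would make the experiment repeatable. The helper name `sample_with_replacement` and the toy lists below are hypothetical stand-ins for `train_data` and `train_label`.

```python
import numpy as np


def sample_with_replacement(data, labels, k, seed=0):
    """Draw k aligned (data, label) pairs uniformly at random, with replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=k)   # the same indices are used for both lists
    return [data[i] for i in idx], [labels[i] for i in idx]


reviews = ["bad", "okay", "fine", "good", "great"]   # toy stand-ins for train_data
scores = [1, 2, 3, 4, 5]                             # toy stand-ins for train_label

sub_reviews, sub_scores = sample_with_replacement(reviews, scores, k=8, seed=42)
print(list(zip(sub_reviews, sub_scores)))
```

Calling the function again with the same seed returns the same subset, which makes accuracy figures comparable between runs.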

In [147]:
train_clean,test_clean=[],[]
for i in tqdm(range(len(t_data))):
    train_clean.append(cleanText(t_data[i],remove_stopwords=True,sentences=True))
for i in tqdm(range(len(test))):
    test_clean.append(cleanText(test[i],remove_stopwords=True,sentences=True))

100%|██████████| 30000/30000 [00:05<00:00, 5408.64it/s]
100%|██████████| 1000/1000 [00:00<00:00, 5819.43it/s]


In [148]:
import sys
sys.path.append('/Users/user/miniconda3/lib/python3.8/site-packages')

## Build the Word2Vec model with the gensim library

In [149]:
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

## Tokenize the training set and collect the sentences

In [150]:
def parseSent(review,sentences,remove_stopwords=True):
    # Tokenize the review and append the token list to sentences
    temp=cleanText(review, remove_stopwords=remove_stopwords,sentences=False)
    sentences.append(temp)
    return sentences


# Parse each review in the training set into sentences
sentences = []
for review in train_clean:
    parseSent(review, sentences)


print('%d parsed sentence in the training set\n'  %len(sentences))

print('Original review : \n',t_data[19])
print()
print('Cleaned review : \n', train_clean[19])
print()
print('Show a parsed sentence in the training set : \n',  sentences[19])


30000 parsed sentence in the training set

Original review : 
 This comes in two peices and you connect them. It is supper cheap and plastic. But what can you expect for the price. On the other side it works and for the price I would by more and recommend them. However, i would not recommend putting belts on it, Just ties. Also you have to make sure you keep it balanced or it will hang very crooked. So when you hang ties on it make sure you spread them out to balance it out. Other then that i recommend it. 3.5 / 5

Cleaned review : 
 comes two peices connect supper cheap plastic expect price side works price would recommend however would recommend putting belts ties also make sure keep balanced hang crooked hang ties make sure spread balance recommend

Show a parsed sentence in the training set : 
 ['comes', 'two', 'peices', 'connect', 'supper', 'cheap', 'plastic', 'expect', 'price', 'side', 'works', 'price', 'would', 'recommend', 'however', 'would', 'recommend', 'putting', 'belts', 't

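## Train the Word2Vec model

### Parameter tuning
sg: selects the training algorithm. The default, 0, corresponds to CBOW; sg=1 uses skip-gram, which trains more slowly but works better for rare words.

size: the dimensionality of the feature vectors; the default is 100 (gensim 4.0 renamed this parameter to vector_size).

window: the window size, i.e. the maximum distance between the current word and the predicted word within a sentence. A typical choice is 5 for CBOW and 10 for skip-gram.

min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded; the default is 5.

sample: the threshold that controls random downsampling of high-frequency words. The default is 1e-3 and the useful range is (0, 1e-5).

workers: the number of worker threads used for training.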
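To make `min_count` and `sample` more concrete, here is a minimal sketch of what the two parameters do. The helper names are hypothetical, and the keep-probability formula is an assumption based on the original word2vec C implementation; it may differ in detail from what a given gensim version does.

```python
from collections import Counter
from math import sqrt


def truncate_vocab(sentences, min_count=5):
    """Keep only words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


def keep_probability(count, total, sample=1e-3):
    """Chance of keeping one occurrence of a word during subsampling (assumed formula)."""
    threshold = sample * total
    return min(1.0, (sqrt(count / threshold) + 1) * threshold / count)


toy_sentences = [["book", "good"]] * 6 + [["book", "rare"]]
vocab_counts = truncate_vocab(toy_sentences, min_count=5)
print(vocab_counts)                       # 'rare' occurs only once and is dropped

total = 1_000_000                         # hypothetical corpus size in words
print(keep_probability(50_000, total))    # a very frequent word is often skipped
print(keep_probability(500, total))       # an infrequent word is always kept
```

A smaller `sample` lowers the threshold and therefore discards more occurrences of frequent words.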

In [151]:
# Fit parsed sentences to Word2Vec model 
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

num_features = 300  #embedding dimension  

print("Training Word2Vec model ...\n")
w2v = Word2Vec(sentences, workers=10, size=num_features, min_count = 5,\
                 window = 5, sample = 1e-3 , hs=1,sg=1)
w2v.init_sims(replace=True)
# w2v.save("w2v_300features_10minwordcounts_10context") #save trained word2vec model

Training Word2Vec model ...



In [152]:
print("Number of words in the vocabulary list : %d \n" %len(w2v.wv.index2word)) 
print("Show first 10 words in the vocabulary list: \n", w2v.wv.index2word[0:10])
print()
print('Part of vector of the word "book": \n',w2v.wv['book'][:5])
print()
print('The size of words'' vector : \n',w2v.wv['book'].shape)

Number of words in the vocabulary list : 15527 

Show first 10 words in the vocabulary list: 
 ['book', 'one', 'like', 'good', 'would', 'great', 'get', 'read', 'time', 'really']

Part of vector of the word "book": 
 [-0.04047463  0.08282275 -0.04423171  0.0444067   0.03222555]

The size of words vector : 
 (300,)




## Average the vectors of all in-vocabulary words in a review to get the review's vector

In [153]:
# Transform a tokenized review into a single feature vector
def makeFeatureVec(review, model, num_features):
    featureVec = np.zeros(num_features)
    nwords = 0
    index2word_set = set(model.wv.index2word)  # vocabulary as a set for fast lookup
    for word in review:
        if word in index2word_set:
            nwords += 1
            # Add this word's vector to the review's running sum
            featureVec += model.wv[word]
    # Divide by the number of known words; a review with none stays all zeros
    if nwords > 0:
        featureVec /= nwords
    return featureVec
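The averaging step can be checked on a toy example without a trained model. The sketch below uses a plain dictionary of hypothetical 2-dimensional vectors in place of the gensim model; `average_vector` is an illustrative stand-in for `makeFeatureVec`, not a replacement for it.

```python
import numpy as np


def average_vector(tokens, vectors, num_features):
    """Mean of the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)


toy_vectors = {                           # hypothetical 2-dimensional embeddings
    "cheap": np.array([1.0, 0.0]),
    "plastic": np.array([0.0, 1.0]),
}

print(average_vector(["cheap", "plastic", "peices"], toy_vectors, 2))  # "peices" is skipped
print(average_vector(["peices"], toy_vectors, 2))                      # no known words
```

Out-of-vocabulary words such as the misspelled "peices" do not contribute to the average, and a review made up only of such words maps to the zero vector.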

## Collect the review vectors for the whole dataset

In [154]:
def getAvgFeatureVecs(reviews, model, num_features):
    '''
    Transform all reviews to feature vectors using makeFeatureVec()
    '''
    reviewFeatureVecs = []
    for review in reviews:
        reviewFeatureVecs.append(makeFeatureVec(review, model,num_features))
    return reviewFeatureVecs

## Convert the training-set and test-set reviews to vectors

In [155]:
# Get feature vectors for training set
X_train_cleaned = sentences
trainVector = getAvgFeatureVecs(X_train_cleaned, w2v, num_features)
print('Training set :',len(trainVector),' feature vectors with',len(trainVector[0]) ,'dimensions')


# Get feature vectors for test set
X_test_cleaned = []
for review in tqdm(test_clean):
    X_test_cleaned.append(cleanText(review, remove_stopwords=True,sentences=False))
testVector = getAvgFeatureVecs(X_test_cleaned, w2v, num_features)
print('Testing set :',len(testVector),' feature vectors with',len(testVector[0]) ,'dimensions')

100%|██████████| 1000/1000 [00:00<00:00, 6272.54it/s]

Training set : 30000  feature vectors with 300 dimensions





Testing set : 1000  feature vectors with 300 dimensions


In [156]:
print('Original review : \n',t_data[19])
print()
print('Part of vector : \n',trainVector[19][:10])
print()
print('Size of vector : \n',trainVector[19].shape)

Original review : 
 This comes in two peices and you connect them. It is supper cheap and plastic. But what can you expect for the price. On the other side it works and for the price I would by more and recommend them. However, i would not recommend putting belts on it, Just ties. Also you have to make sure you keep it balanced or it will hang very crooked. So when you hang ties on it make sure you spread them out to balance it out. Other then that i recommend it. 3.5 / 5

Part of vector : 
 [-0.02030254  0.02650279  0.01736031  0.03576423  0.02310178  0.01488558
 -0.02948713  0.02083334  0.01179845 -0.03702418]

Size of vector : 
 (300,)


## Train a random forest on the resulting vectors

In [115]:
from sklearn.ensemble import RandomForestClassifier
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100,max_depth=17)
rf.fit(trainVector, t_label)
predictions = rf.predict(testVector)
predictions[:3]

array([4, 5, 3])

In [170]:
def W2Vaccuracy(pred,label):
    # Binary accuracy in percent: ratings >= 4 count as positive, ratings < 4 as negative
    right=0
    for p,a in zip(pred,label):
        if p>=4 and a>=4:
            right+=1
        elif p<4 and a<4:
            right+=1
    return 100*right/len(label)
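Note that `W2Vaccuracy` counts a prediction as correct whenever it falls on the same side of the 4-star threshold as the true rating, so it is more lenient than exact 5-class accuracy. The toy comparison below shows the gap; the function names and the sample ratings are made up for illustration.

```python
def exact_accuracy(pred, label):
    """Percentage of predictions that match the 1-5 rating exactly."""
    return 100 * sum(p == a for p, a in zip(pred, label)) / len(label)


def binarized_accuracy(pred, label, threshold=4):
    """Percentage of predictions on the same side of the threshold as the label."""
    same_side = sum((p >= threshold) == (a >= threshold) for p, a in zip(pred, label))
    return 100 * same_side / len(label)


toy_pred = [4, 5, 3, 1, 2]
toy_label = [5, 5, 1, 4, 2]

print(exact_accuracy(toy_pred, toy_label))      # 40.0
print(binarized_accuracy(toy_pred, toy_label))  # 80.0
```

The accuracy figures reported in this notebook are of the second, binarized kind.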

In [117]:
accuracy=W2Vaccuracy(predictions,label)
print('accuracy= ',accuracy,'%')

accuracy=  74.6 %


In [124]:
from sklearn.ensemble import GradientBoostingClassifier
print('Begin training....')
gb=GradientBoostingClassifier(max_depth=5)
gb.fit(trainVector[:5000], t_label[:5000])
print('Finish training....')
predictions_gb=gb.predict(testVector)
predictions_gb[:3]

Begin training....
Finish training....


array([4, 5, 2])

In [127]:
accuracy_gb=W2Vaccuracy(predictions_gb,label)
print('accuracy= ',accuracy_gb,'%')

accuracy=  72.2 %


In [178]:
from sklearn.model_selection import GridSearchCV  # Performing grid search
import lightgbm as lgb

In [209]:
parameters = {
              'max_depth': [15, 20, 25, 30, 35],
              'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
              'cat_smooth': [1, 10, 15, 20, 35]
}
gbm = lgb.LGBMClassifier(boosting_type='gbdt',
                         verbose = -1,
                         learning_rate = 0.01,
                         num_leaves = 35)

In [None]:
# # Create a LightGBM classifier instance
# clf = lgb.LGBMClassifier(boosting_type='gbdt',
#                          verbose = -1,
#                          learning_rate = 0.01,
#                          num_leaves = 35)
# # Fit on the training data
# clf = clf.fit(trainVector, t_label)
# # Predict
# y_pred = clf.predict(testVector)


gsearch = GridSearchCV(gbm, param_grid=parameters, scoring='accuracy', cv=3)
gsearch.fit(trainVector, t_label)

print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
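Before launching the search it helps to estimate its cost. The sketch below enumerates the same grid with the standard library only; it does not call scikit-learn, and `cv_folds` mirrors the `cv=3` argument used above.

```python
from itertools import product

param_grid = {
    "max_depth": [15, 20, 25, 30, 35],
    "learning_rate": [0.01, 0.02, 0.05, 0.1, 0.15],
    "cat_smooth": [1, 10, 15, 20, 35],
}
cv_folds = 3

# Build every combination of parameter values as a dictionary.
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

With 125 combinations and 3 folds, the search trains 375 models, plus one final refit on the full training set under `GridSearchCV`'s default settings, so on 30000 reviews with 300 features it may take a long time.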

In [200]:
y_pred[:3]

array([2, 1, 5])

In [201]:
accuracy=W2Vaccuracy(y_pred,label)
print('accuracy= ',accuracy,'%')

accuracy=  73.9 %
