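- How to load pretrained word2vec-style vectors (here, GloVe) into a program for convenient use
    - Input: a text file that stores one vector per word
    - Output: word2vec, word2index, index2word
- Simple classifier: classify a sentence from the average of its word vectors
    - Preprocessing: convert everything to lowercase
    - Build a simple model and train it
- Train an LSTM-based model and predict results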

In [1]:
import numpy as np
import emoji

## 1 Load the word matrix
- The file-reading pattern used here is broadly useful and worth mastering.

In [2]:
%%bash
# pip install emoji
head -n 2 data/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392


In [3]:

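def read_glove_vecs(glove_file):
    # Build the word -> vector, word -> index, and index -> word lookups
    word_dict={}
    words=set()
    # GloVe files are UTF-8; be explicit so the platform default encoding is not used
    with open(glove_file,'r',encoding='utf-8') as f:
        for l in f:
            line=l.strip().split()
            cur_word=line[0]
            words.add(cur_word)
            word_dict[cur_word]=np.array(line[1:],dtype=np.float32)
    # Note: set iteration order can differ between runs, so these indices are only
    # consistent within one session; iterate over sorted(words) for stable indices
    index2word={k:v for k,v in enumerate(words)}
    word2index={v:k for k,v in enumerate(words)}
    return word2index,index2word,word_dict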


In [4]:
word_to_index,index_to_word,word_to_vec_map=read_glove_vecs("./data/glove.6B.50d.txt")

In [5]:
word = "cucumber"
index = 79898
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])

the index of cucumber in the vocabulary is 120586
the 79898th word in the vocabulary is shurugwi
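
Because `read_glove_vecs` enumerates a Python `set`, the index assigned to a word can differ from one interpreter run to the next. Below is a minimal sketch of a reproducible alternative, assuming sorted vocabulary order is acceptable. The function `build_vocab`, the returned embedding matrix, and the tiny in-memory sample are hypothetical stand-ins for illustration, not part of the notebook.

```python
import io
import numpy as np

# Hypothetical 3-dimensional sample in GloVe text format (word followed by floats).
SAMPLE = """the 0.1 0.2 0.3
cat 0.4 0.5 0.6
sat 0.7 0.8 0.9
"""

def build_vocab(lines):
    """Parse GloVe-style lines into lookup tables with stable, sorted indices."""
    word_to_vec = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float32)
    vocab = sorted(word_to_vec)  # sorted order is identical on every run
    word_to_index = {w: i for i, w in enumerate(vocab)}
    index_to_word = {i: w for i, w in enumerate(vocab)}
    # Row i of the matrix holds the vector of the word with index i.
    emb_matrix = np.vstack([word_to_vec[w] for w in vocab])
    return word_to_index, index_to_word, word_to_vec, emb_matrix

word_to_index, index_to_word, word_to_vec, emb_matrix = build_vocab(io.StringIO(SAMPLE))
print(word_to_index)
print(emb_matrix.shape)
```

Passing an open file handle instead of the `io.StringIO` object would apply the same parsing to a real vector file.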


## 2 Build a sentence vector
- Lowercase the sentence, split it into words, then average the word vectors.

In [6]:
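def sentence_to_avg(sentence,word_to_vec_map):
    # Lowercase and split the sentence into words
    s=sentence.lower().split()
    avg_list=[]
    for w in s:
        # Skip words that have no pretrained vector
        if w in word_to_vec_map:
            avg_list.append(word_to_vec_map[w])
    # Stack the word vectors and average them element-wise
    avg_vec=np.vstack(avg_list).mean(axis=0)
    return avg_vec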

In [7]:
s="Morrocan couscous is my favorite dish"

print(sentence_to_avg(s,word_to_vec_map))

[-0.00800501  0.56370836 -0.50427336  0.258865    0.5513111   0.03104983
 -0.21013719  0.16893934 -0.09590267  0.141784   -0.15708967  0.18525869
  0.6495786   0.3837112   0.21102166  0.11301666  0.02613968  0.26037768
  0.05820667 -0.01578168 -0.12078834 -0.02471267  0.41284552  0.5152061
  0.38756165 -0.89866096 -0.535145    0.33501163  0.68806934 -0.2156265
  1.797155    0.10476933 -0.36775336  0.75078493  0.10282584  0.348925
 -0.27262834  0.66767997 -0.10706166 -0.283635    0.5958012   0.28747332
 -0.33666348  0.23393816  0.34349182  0.178405    0.1166155  -0.076433
  0.14454171  0.09808666]
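
Averaging is easiest to check on toy vectors. Here is a minimal sketch, assuming that out-of-vocabulary words are skipped and that a zero vector is returned when no word is known. The names `toy_vectors` and `average_vector` and the `dim` parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" chosen so the mean is easy to verify.
toy_vectors = {
    "i": np.array([1.0, 0.0]),
    "love": np.array([0.0, 1.0]),
    "you": np.array([1.0, 1.0]),
}

def average_vector(sentence, word_to_vec_map, dim):
    """Average the vectors of the known words in a sentence."""
    words = sentence.lower().split()
    known = [word_to_vec_map[w] for w in words if w in word_to_vec_map]
    if not known:
        return np.zeros(dim)  # assumption: fall back to zeros if nothing is known
    return np.mean(known, axis=0)

avg = average_vector("I love you zzz", toy_vectors, dim=2)
print(avg)  # "zzz" is ignored; the other three vectors are averaged
```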


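## 3 Build a simple classification model
- Start by stating what the input is and what the output is.
    - If there are several candidate outputs, which one is chosen?
- Convert each sentence's word vectors into a single averaged vector (the model input)
- Convert the class labels to one-hot vectors
- Build a single-layer softmax model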
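
The input/output question above can be made concrete: the input is one averaged vector per sentence, the output is one probability per class, and the predicted class is the one with the highest probability. Below is a minimal NumPy sketch, assuming the column-per-example layout used in this notebook. The random `W` and `X` are placeholders, not trained or real values.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax: each column is one example."""
    z = z - z.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
n_x, n_y, m = 50, 5, 4           # vector size, number of classes, number of sentences
X = rng.normal(size=(n_x, m))    # placeholder for averaged sentence vectors
W = rng.normal(size=(n_y, n_x))  # placeholder for the weight matrix

probs = softmax(W @ X)           # shape (n_y, m): one probability column per sentence
pred = probs.argmax(axis=0)      # shape (m,): the most probable class per sentence
print(probs.shape, pred.shape)
```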

In [8]:
%%bash
head  ./data/emojify_data.csv

French macaroon is so tasty,4,,
work is horrible,3,,
I am upset,3,, [3]
throw the ball,1,, [2]
Good joke,2,,
what is your favorite baseball game,1,,
I cooked meat,4,,
stop messing around,3,,
I want chinese food,4,,
Let us go play baseball,1,,


In [9]:
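import csv

def read_csv(path):
    # Column 0 is the sentence and column 1 the integer label; extra columns are ignored
    X=[]
    Y=[]
    with open(path,newline='',encoding='utf-8') as f:
        # csv.reader handles quoted fields, which a plain split(',') would break
        for row in csv.reader(f):
            if len(row)<2:
                continue    # skip blank or malformed lines
            X.append(row[0])
            Y.append(row[1])
    X=np.array(X)
    Y=np.array(Y,dtype=int)
    return X,Y
X_train,Y_train=read_csv("./data/emojify_data.csv")
X_test,Y_test=read_csv("./data/tesss.csv")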


In [10]:
# Word count of the sentence with the most characters (a proxy for the longest sentence in words)
maxLen=len(max(X_train,key=len).split())
print(maxLen)

10
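
`max(X_train, key=len)` selects the sentence with the most characters, which is not guaranteed to be the sentence with the most words. Here is a minimal sketch that counts words directly. The two sample sentences are made up to show the difference.

```python
sentences = [
    "extraordinarily unbelievable",  # many characters, 2 words
    "I am so in to it",              # fewer characters, 6 words
]

# Longest by characters, then count its words.
by_chars = len(max(sentences, key=len).split())
# Largest word count over all sentences.
by_words = max(len(s.split()) for s in sentences)
print(by_chars, by_words)  # 2 6
```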


In [11]:
emoji_dictionary = {"0": "\u2764\uFE0F",    # :heart: prints a black instead of red heart depending on the font
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

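def label_to_emoji(label):
    # Map an integer label to its emoji character.
    # emoji >= 2.0 removed use_aliases; pass language='alias' there instead.
    return emoji.emojize(emoji_dictionary[str(label)],use_aliases=True)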
    


In [12]:
index=2
print(X_train[index],label_to_emoji(Y_train[index]))

I am upset 😞


In [13]:
for k,v in emoji_dictionary.items():
    print(emoji.emojize(v,use_aliases=True))

❤️
⚾
😄
😞
🍴


In [14]:
def convert_to_one_hot(y,C):
    return np.eye(C)[y]


In [27]:
Y_oh_train=convert_to_one_hot(Y_train,C=5).transpose()
Y_oh_test=convert_to_one_hot(Y_test,C=5).transpose()
print(Y_oh_train.shape)
print(Y_oh_test.shape)

(5, 183)
(5, 56)


In [32]:
index = 50
print(Y_train[index], "is converted into one hot", Y_oh_train[:,index])

2 is converted into one hot [0. 0. 1. 0. 0.]
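
`np.eye(C)[y]` works because indexing the identity matrix with a label selects that label's row. Below is a small sketch of the round trip with made-up labels, where `argmax` recovers the original labels. The notebook additionally transposes the result so that each example is a column.

```python
import numpy as np

labels = np.array([2, 0, 4])         # made-up class labels
one_hot = np.eye(5)[labels]          # shape (3, 5): one row per label
recovered = one_hot.argmax(axis=1)   # position of the 1 in each row
print(one_hot)
print(recovered)
```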


### Build the model
- Write it once in TensorFlow and once in Keras

- X has shape (n_x, None)


In [50]:
import tensorflow as tf    # TensorFlow 1.x API (placeholders, sessions, tf.contrib)

num_iterations=2

# One column per sentence: shape (n_x, m)
X_train_vector=np.transpose(np.vstack([sentence_to_avg(i,word_to_vec_map)  for i in X_train]))
X_test_vector=np.transpose(np.vstack([sentence_to_avg(i,word_to_vec_map)  for i in X_test]))

n_x=X_train_vector.shape[0]
n_y=5

# Build the graph
tf.reset_default_graph()
X=tf.placeholder(shape=(n_x,None),name='X',dtype=tf.float32)
Y=tf.placeholder(shape=(n_y,None),name='Y',dtype=tf.float32)
W=tf.get_variable(shape=(n_y,n_x),name='W',dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer())
Z=tf.matmul(W,X)

# softmax_cross_entropy_with_logits normalizes over the last axis, so transpose
# to (m, n_y) to put the classes last, then average the per-example losses
loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf.transpose(Y),logits=tf.transpose(Z)))
opt=tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

# Run the graph
init=tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for i_iter in range(num_iterations):
        _,l=sess.run([opt,loss],feed_dict={X:X_train_vector,Y:Y_oh_train})
        print("iteration",i_iter,"loss",l)


In [34]:
X_train_vector.shape

(50, 183)

In [35]:
Y_oh_train.shape

(5, 183)
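
With inputs of shape (50, 183) and one-hot labels of shape (5, 183), the same single-layer softmax model can also be trained without a framework, which makes the shapes and the gradient explicit. Below is a minimal NumPy sketch that uses plain gradient descent rather than Adam, on synthetic data standing in for the sentence vectors. The function `train_softmax` and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax: each column is one example."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def train_softmax(X, Y_oh, num_iterations=200, learning_rate=0.5):
    """X: (n_x, m) inputs, Y_oh: (n_y, m) one-hot labels. Returns W and the loss history."""
    n_x, m = X.shape
    n_y = Y_oh.shape[0]
    W = np.zeros((n_y, n_x))
    losses = []
    for _ in range(num_iterations):
        P = softmax(W @ X)                                  # (n_y, m) class probabilities
        losses.append(-np.mean(np.sum(Y_oh * np.log(P + 1e-12), axis=0)))
        grad = (P - Y_oh) @ X.T / m                         # gradient of the mean cross-entropy
        W -= learning_rate * grad
    return W, losses

# Synthetic data with the same shapes as the notebook's training set.
rng = np.random.default_rng(0)
n_x, n_y, m = 50, 5, 183
true_W = rng.normal(size=(n_y, n_x))
X = rng.normal(size=(n_x, m))
y = (true_W @ X).argmax(axis=0)      # labels produced by a hidden linear model
Y_oh = np.eye(n_y)[y].T              # (n_y, m), one column per example

W, losses = train_softmax(X, Y_oh)
accuracy = np.mean(softmax(W @ X).argmax(axis=0) == y)
print(losses[0], losses[-1], accuracy)
```

The first recorded loss equals ln 5, because zero weights give a uniform distribution over the five classes. A loss that falls from that value is a quick sanity check that the gradient has the right sign.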

In [20]:
Z

<tf.Tensor 'MatMul:0' shape=(5, ?) dtype=float32>

In [21]:
tf.transpose(X)

<tf.Tensor 'transpose_1:0' shape=(50, ?) dtype=float32>

In [22]:
W

<tf.Variable 'W:0' shape=(5, 50) dtype=float32_ref>