# Embedding Layer

***Author: Z.Wang***

***Date: 2021-11-03***

# Table of Contents
* [Why should we have to do Embedding?](#Why-should-we-have-to-do-Embedding?)
* [Embedding History](#Embedding-History)
* [How should we do Embedding?](#How-should-we-do-Embedding?)
    * [Category Features](#Category-Features)
    * [Numerical Features](#Numerical-Features)

# Content


A neural network is made up of neurons, and each neuron is a linear or nonlinear function.

* Linear:

  $w_1*x_1 + w_2*x_2 + w_3*x_3 +b$
  
  
* Nonlinear:
  
  $sigmoid(w_1*x_1 + w_2*x_2 + w_3*x_3 +b) = \frac{1}{1+e^{-(w_1*x_1 + w_2*x_2 + w_3*x_3 +b)}}$
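
The two formulas above can be sketched in a few lines of NumPy. The input, weight, and bias values are made up for illustration and do not come from any model in this notebook.

```python
import numpy as np

def linear_neuron(x, w, b):
    """Weighted sum of the inputs plus a bias."""
    return np.dot(w, x) + b

def sigmoid_neuron(x, w, b):
    """The same weighted sum passed through the sigmoid function."""
    z = linear_neuron(x, w, b)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only (assumptions, not taken from the notebook)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
b = 0.2

linear_out = linear_neuron(x, w, b)
sigmoid_out = sigmoid_neuron(x, w, b)
print(linear_out, sigmoid_out)
```

The sigmoid output always lies strictly between 0 and 1, whereas the linear output is unbounded.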
  
  
Encode / Transform


## Why should we have to do Embedding?



## Embedding History

## How should we do Embedding?

Prepare the data:

In [51]:
import numpy as np
import pandas as pd

np.random.seed(1024)

batch_size = 10
company = ['A', 'B', 'C']
gender = ['M', 'F']
color = ['yellow', 'white', 'black']
habit = ['traveling', 'swimming', 'running', 'cycling']

data = pd.DataFrame({
    'company': [company[i] for i in np.random.randint(0, 3, batch_size)],
    'gender': [gender[i] for i in np.random.randint(0, 2, batch_size)],
    'color': [color[i] for i in np.random.randint(0, 3, batch_size)],
    'habit': [','.join(habit[np.random.randint(0, i):i]) for i in np.random.randint(1, 5, batch_size)],
    'age': np.random.randint(18, 60, batch_size)
})

data

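| | company | gender | color | habit | age |
|---|---|---|---|---|---|
| 0 | B | M | white | swimming,running,cycling | 48 |
| 1 | B | F | yellow | traveling | 30 |
| 2 | A | M | white | traveling,swimming | 34 |
| 3 | B | F | white | running,cycling | 35 |
| 4 | B | F | yellow | running | 27 |
| 5 | B | F | white | traveling | 45 |
| 6 | A | F | black | traveling | 44 |
| 7 | B | F | black | swimming | 21 |
| 8 | A | M | white | traveling | 39 |
| 9 | C | F | black | cycling | 31 |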


### Category Features

* ### Single-valued categorical features & one_hot

In [2]:
company_onehot = pd.get_dummies(data['company'])
company_onehot

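| | A | B | C |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 |
| 9 | 0 | 0 | 1 |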


Create the embedding table:

In [3]:
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
tf.disable_eager_execution()
tf.set_random_seed(1024)

company_vocab_size = company_onehot.shape[1]
embed_dim = 4
company_embedding_table = tf.get_variable(name='company_embedding_table',
                                  shape=(company_vocab_size, embed_dim),
                                  initializer=tf.truncated_normal_initializer(0, 0.01))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    company_embed_table = sess.run(company_embedding_table)
    
company_embed_table

Instructions for updating:
non-resource variables are not supported in the long term


array([[ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [-0.00067203,  0.00156488, -0.01019037,  0.01614569]],
      dtype=float32)

Multiply the one-hot encoding of the `company` field by the embedding table to get the dense vectors:

In [4]:
company_embed = tf.matmul(tf.cast(company_onehot.values, tf.float32), company_embed_table)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(tf.cast(company_onehot.values, tf.float32)))
    print(sess.run(company_embedding_table))
    company_embeded = sess.run(company_embed)
    
company_embeded

[[0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
[[ 0.00990909 -0.0041791  -0.00641351 -0.00594623]
 [ 0.00055747 -0.00159252 -0.00749974 -0.00555932]
 [-0.00067203  0.00156488 -0.01019037  0.01614569]]


array([[ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [-0.00067203,  0.00156488, -0.01019037,  0.01614569]],
      dtype=float32)

In [5]:
company_embeded.shape

(10, 4)

**Best Practice** 

Convert the one-hot encoding of the `company` field into integer ids:

In [6]:
company_index = [item[1] for item in np.argwhere(company_onehot.values==1)]
company_index = np.reshape(company_index, (batch_size, 1))
company_index

array([[1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [2]])
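
The ids can also be obtained without building the one-hot matrix first. The sketch below is one possible approach using `np.unique`; the `company_values` array is copied by hand from the sample data above, and the variable names are illustrative.

```python
import numpy as np

# Values copied from the `company` column of the sample data above
company_values = np.array(['B', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'C'])

# np.unique returns the sorted vocabulary and, with return_inverse=True,
# the position of every value inside that vocabulary
vocab, company_ids = np.unique(company_values, return_inverse=True)
company_ids = company_ids.reshape(-1, 1)  # shape (batch_size, 1)

print(vocab)
print(company_ids.ravel())
```

Because `pd.get_dummies` also orders its columns alphabetically, the ids produced this way line up with the column positions of `company_onehot`.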

Gather the rows of the embedding table at those ids to get the dense vectors:

In [7]:
company_embeded_v2 = tf.gather(company_embed_table, company_index)
with tf.Session() as sess:
    company_embeded_v2 = sess.run(company_embeded_v2)
    
company_embeded_v2

array([[[ 0.00055747, -0.00159252, -0.00749974, -0.00555932]],

       [[ 0.00055747, -0.00159252, -0.00749974, -0.00555932]],

       [[ 0.00990909, -0.0041791 , -0.00641351, -0.00594623]],

       [[ 0.00055747, -0.00159252, -0.00749974, -0.00555932]],

       [[ 0.00055747, -0.00159252, -0.00749974, -0.00555932]],

       [[ 0.00055747, -0.00159252, -0.00749974, -0.00555932]],

       [[ 0.00990909, -0.0041791 , -0.00641351, -0.00594623]],

       [[ 0.00055747, -0.00159252, -0.00749974, -0.00555932]],

       [[ 0.00990909, -0.0041791 , -0.00641351, -0.00594623]],

       [[-0.00067203,  0.00156488, -0.01019037,  0.01614569]]],
      dtype=float32)

In [8]:
company_embeded_v2.shape

(10, 1, 4)

In [9]:
np.squeeze(company_embeded_v2)

array([[ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
       [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
       [-0.00067203,  0.00156488, -0.01019037,  0.01614569]],
      dtype=float32)

In [10]:
np.squeeze(company_embeded_v2).shape

(10, 4)
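
Both routes produce the same `(10, 4)` result, which is why the id lookup is preferred: it skips the multiplications by zero. A minimal NumPy check, with the table values copied from the output above and plain array indexing standing in for `tf.gather`:

```python
import numpy as np

# Embedding table values copied from the output above
table = np.array([
    [ 0.00990909, -0.0041791 , -0.00641351, -0.00594623],
    [ 0.00055747, -0.00159252, -0.00749974, -0.00555932],
    [-0.00067203,  0.00156488, -0.01019037,  0.01614569],
])
ids = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 2])

one_hot = np.eye(table.shape[0])[ids]   # (10, 3) one-hot matrix
via_matmul = one_hot @ table            # dense product over the whole vocabulary
via_lookup = table[ids]                 # row indexing, touches one row per sample

print(np.allclose(via_matmul, via_lookup))
```

With a vocabulary of three the difference is negligible, but the cost of the matrix product grows with the vocabulary size while the lookup does not.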

* ### Multi-valued categorical features & multi_hot

In [11]:
data['habit']

0    swimming,running,cycling
1                   traveling
2          traveling,swimming
3             running,cycling
4                     running
5                   traveling
6                   traveling
7                    swimming
8                   traveling
9                     cycling
Name: habit, dtype: object

In [12]:
habit_onehot = pd.get_dummies(data['habit'])
habit_onehot

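| | cycling | running | running,cycling | swimming | swimming,running,cycling | traveling | traveling,swimming |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |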


In [13]:
print(data['habit'])

habit_multi_hot = data['habit'].str.split(',').str.join('|').str.get_dummies()
habit_multi_hot

0    swimming,running,cycling
1                   traveling
2          traveling,swimming
3             running,cycling
4                     running
5                   traveling
6                   traveling
7                    swimming
8                   traveling
9                     cycling
Name: habit, dtype: object


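| | cycling | running | swimming | traveling |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 | 1 |
| 9 | 1 | 0 | 0 | 0 |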


Create the embedding table for the `habit` field:

In [14]:
habit_vocab_size = habit_multi_hot.shape[1]
embed_dim = 4
habit_embedding_table = tf.get_variable(name='habit_embedding_table',
                                  shape=(habit_vocab_size, embed_dim),
                                  initializer=tf.truncated_normal_initializer(0, 0.01))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    habit_embed_table = sess.run(habit_embedding_table)
    
habit_embed_table

array([[ 0.01577152, -0.01522578, -0.00711412, -0.00292747],
       [-0.00081242,  0.00565728, -0.00288759, -0.01013346],
       [ 0.00439587, -0.00260259,  0.01153456, -0.00773898],
       [ 0.00736157, -0.00729969,  0.01349804, -0.00022794]],
      dtype=float32)

Multiply the multi-hot encoding of the `habit` field by the embedding table to get the dense vectors for `habit`:

In [15]:
habit_embed = tf.matmul(tf.cast(habit_multi_hot.values, tf.float32), habit_embed_table)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(tf.cast(habit_multi_hot.values, tf.float32)))
    print(sess.run(habit_embedding_table))
    habit_embeded = sess.run(habit_embed)
    
habit_embeded

[[1. 1. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 1.]
 [1. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]
[[ 0.01577152 -0.01522578 -0.00711412 -0.00292747]
 [-0.00081242  0.00565728 -0.00288759 -0.01013346]
 [ 0.00439587 -0.00260259  0.01153456 -0.00773898]
 [ 0.00736157 -0.00729969  0.01349804 -0.00022794]]


array([[ 0.01935498, -0.01217109,  0.00153286, -0.02079991],
       [ 0.00736157, -0.00729969,  0.01349804, -0.00022794],
       [ 0.01175744, -0.00990228,  0.0250326 , -0.00796691],
       [ 0.0149591 , -0.0095685 , -0.01000171, -0.01306093],
       [-0.00081242,  0.00565728, -0.00288759, -0.01013346],
       [ 0.00736157, -0.00729969,  0.01349804, -0.00022794],
       [ 0.00736157, -0.00729969,  0.01349804, -0.00022794],
       [ 0.00439587, -0.00260259,  0.01153456, -0.00773898],
       [ 0.00736157, -0.00729969,  0.01349804, -0.00022794],
       [ 0.01577152, -0.01522578, -0.00711412, -0.00292747]],
      dtype=float32)
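
Multiplying a multi-hot row by the table is the same as summing the table rows of the active values, so the result grows with the number of habits a sample has. A small NumPy check for sample 0, with the table values copied from the output above:

```python
import numpy as np

# Values copied from `habit_embed_table` above
# (rows: cycling, running, swimming, traveling)
habit_table = np.array([
    [ 0.01577152, -0.01522578, -0.00711412, -0.00292747],
    [-0.00081242,  0.00565728, -0.00288759, -0.01013346],
    [ 0.00439587, -0.00260259,  0.01153456, -0.00773898],
    [ 0.00736157, -0.00729969,  0.01349804, -0.00022794],
])

multi_hot_row = np.array([1.0, 1.0, 1.0, 0.0])   # sample 0: cycling, running, swimming

via_matmul = multi_hot_row @ habit_table
via_sum = habit_table[[0, 1, 2]].sum(axis=0)      # add the three active rows

print(via_matmul)
print(np.allclose(via_matmul, via_sum))
```

This sum pooling is one reason to average the vectors instead when samples carry different numbers of values.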

**Best Practice**

Convert the multi-hot encoding of the `habit` field into ids. Ids start at 1 so that 0 can serve as the padding id:

In [16]:
print(habit_multi_hot.values)
max_length = habit_multi_hot.values.sum(axis=1).max()  # largest number of habits in any single row
print(f"max_length: {max_length}")

habit_index = []
for i in range(batch_size):
    temp = [item[1]+1 for item in np.argwhere(habit_multi_hot.values==1) if item[0]==i]
    while len(temp) < max_length:
        temp.append(0)
    habit_index.append(temp)
habit_index = np.array(habit_index)
habit_index

[[1 1 1 0]
 [0 0 0 1]
 [0 0 1 1]
 [1 1 0 0]
 [0 1 0 0]
 [0 0 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 0 0]]
max_length: 3


array([[1, 2, 3],
       [4, 0, 0],
       [3, 4, 0],
       [1, 2, 0],
       [2, 0, 0],
       [4, 0, 0],
       [4, 0, 0],
       [3, 0, 0],
       [4, 0, 0],
       [1, 0, 0]])
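
The same padded id matrix can be built straight from the comma-separated strings, without going through the multi-hot table. The helper below is a hypothetical sketch (`to_padded_ids` is not part of the notebook); it assumes the vocabulary is known in advance and reserves id 0 for padding.

```python
def to_padded_ids(values, vocab, sep=',', pad_id=0):
    """Map separator-joined strings to id lists padded to a common length.

    Ids start at 1 so that `pad_id` (0) never collides with a real value.
    """
    token_to_id = {token: i + 1 for i, token in enumerate(vocab)}
    # Sort the ids within each row to match the column order of the multi-hot table
    rows = [sorted(token_to_id[t] for t in v.split(sep)) for v in values]
    max_length = max(len(row) for row in rows)
    return [row + [pad_id] * (max_length - len(row)) for row in rows]

habit_vocab = ['cycling', 'running', 'swimming', 'traveling']
habits = ['swimming,running,cycling', 'traveling', 'traveling,swimming']
padded = to_padded_ids(habits, habit_vocab)
print(padded)
```

For the first three samples this reproduces the first three rows of `habit_index`.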

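Use the `habit` ids to gather the corresponding rows of the embedding table. The table has one extra row because id 0 is reserved for padding: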

In [17]:
with tf.variable_scope('embedding', reuse=tf.AUTO_REUSE):
    habit_embedding_table_v2 = tf.get_variable(name='habit_embedding_table_2',
                                      shape=(habit_vocab_size+1, embed_dim),
                                      initializer=tf.truncated_normal_initializer(0, 0.01))
habit_embed_v2 = tf.gather(habit_embedding_table_v2, tf.cast(habit_index, tf.int64))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(habit_embedding_table_v2))
    habit_embeded_v2 = sess.run(habit_embed_v2)
    
habit_embeded_v2

[[ 0.01020543  0.01566322 -0.00493186 -0.01034193]
 [-0.01021489 -0.00272903 -0.012645    0.00677401]
 [ 0.00953833  0.00702333  0.00557917  0.01392039]
 [ 0.01476383 -0.01419464  0.00723014  0.01381919]
 [ 0.01259444 -0.01933192  0.00539015  0.00478455]]


array([[[-0.01021489, -0.00272903, -0.012645  ,  0.00677401],
        [ 0.00953833,  0.00702333,  0.00557917,  0.01392039],
        [ 0.01476383, -0.01419464,  0.00723014,  0.01381919]],

       [[ 0.01259444, -0.01933192,  0.00539015,  0.00478455],
        [ 0.01020543,  0.01566322, -0.00493186, -0.01034193],
        [ 0.01020543,  0.01566322, -0.00493186, -0.01034193]],

       [[ 0.01476383, -0.01419464,  0.00723014,  0.01381919],
        [ 0.01259444, -0.01933192,  0.00539015,  0.00478455],
        [ 0.01020543,  0.01566322, -0.00493186, -0.01034193]],

       [[-0.01021489, -0.00272903, -0.012645  ,  0.00677401],
        [ 0.00953833,  0.00702333,  0.00557917,  0.01392039],
        [ 0.01020543,  0.01566322, -0.00493186, -0.01034193]],

       [[ 0.00953833,  0.00702333,  0.00557917,  0.01392039],
        [ 0.01020543,  0.01566322, -0.00493186, -0.01034193],
        [ 0.01020543,  0.01566322, -0.00493186, -0.01034193]],

       [[ 0.01259444, -0.01933192,  0.00539015,  0.00478455]

In [18]:
habit_embeded_v2.shape

(10, 3, 4)

Build a mask that separates real values from zero padding:

In [19]:
print(habit_index)
print(habit_index!=0)
mask = (habit_index!=0).astype(int)
mask

[[1 2 3]
 [4 0 0]
 [3 4 0]
 [1 2 0]
 [2 0 0]
 [4 0 0]
 [4 0 0]
 [3 0 0]
 [4 0 0]
 [1 0 0]]
[[ True  True  True]
 [ True False False]
 [ True  True False]
 [ True  True False]
 [ True False False]
 [ True False False]
 [ True False False]
 [ True False False]
 [ True False False]
 [ True False False]]


array([[1, 1, 1],
       [1, 0, 0],
       [1, 1, 0],
       [1, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])

Multiply the dense vectors of the `habit` field by the mask to zero out the vectors that came from padding:

In [20]:
habit_embeded_masked = habit_embeded_v2 * np.expand_dims(mask,2)
habit_embeded_masked

array([[[-0.01021489, -0.00272903, -0.012645  ,  0.00677401],
        [ 0.00953833,  0.00702333,  0.00557917,  0.01392039],
        [ 0.01476383, -0.01419464,  0.00723014,  0.01381919]],

       [[ 0.01259444, -0.01933192,  0.00539015,  0.00478455],
        [ 0.        ,  0.        , -0.        , -0.        ],
        [ 0.        ,  0.        , -0.        , -0.        ]],

       [[ 0.01476383, -0.01419464,  0.00723014,  0.01381919],
        [ 0.01259444, -0.01933192,  0.00539015,  0.00478455],
        [ 0.        ,  0.        , -0.        , -0.        ]],

       [[-0.01021489, -0.00272903, -0.012645  ,  0.00677401],
        [ 0.00953833,  0.00702333,  0.00557917,  0.01392039],
        [ 0.        ,  0.        , -0.        , -0.        ]],

       [[ 0.00953833,  0.00702333,  0.00557917,  0.01392039],
        [ 0.        ,  0.        , -0.        , -0.        ],
        [ 0.        ,  0.        , -0.        , -0.        ]],

       [[ 0.01259444, -0.01933192,  0.00539015,  0.00478455]

In [21]:
habit_embeded_masked.shape

(10, 3, 4)

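Average the dense vectors of the multi-valued feature. Note that `tf.reduce_mean` over axis 1 divides by the padded length (3 here), so rows with fewer than three habits are scaled down by the padding: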

In [22]:
habit_embed_final = tf.reduce_mean(habit_embeded_masked, axis=1)
with tf.Session() as sess:
    habit_embeded_final = sess.run(habit_embed_final)
habit_embeded_final

array([[ 4.69575357e-03, -3.30011438e-03,  5.47704597e-05,
         1.15045267e-02],
       [ 4.19814823e-03, -6.44397363e-03,  1.79671745e-03,
         1.59485079e-03],
       [ 9.11942342e-03, -1.11755213e-02,  4.20676482e-03,
         6.20124582e-03],
       [-2.25521624e-04,  1.43143333e-03, -2.35527692e-03,
         6.89813169e-03],
       [ 3.17944276e-03,  2.34110948e-03,  1.85972266e-03,
         4.64012877e-03],
       [ 4.19814823e-03, -6.44397363e-03,  1.79671745e-03,
         1.59485079e-03],
       [ 4.19814823e-03, -6.44397363e-03,  1.79671745e-03,
         1.59485079e-03],
       [ 4.92127519e-03, -4.73154771e-03,  2.41004738e-03,
         4.60639503e-03],
       [ 4.19814823e-03, -6.44397363e-03,  1.79671745e-03,
         1.59485079e-03],
       [-3.40496438e-03, -9.09676154e-04, -4.21499958e-03,
         2.25800291e-03]])

In [23]:
habit_embeded_final.shape

(10, 4)

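A second way to take the mean: divide the sum of the masked dense vectors by the sum of the mask. This divides by the number of real values in each row, so for rows that contain padding the result differs from the `tf.reduce_mean` output above: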

In [48]:
print(mask)
# Number of real (non-padding) values per row, shaped (batch_size, 1)
mask_sum = tf.cast(tf.reshape(tf.reduce_sum(mask, axis=1), (batch_size, 1)), tf.float64)
# Sum of the masked vectors divided by the count of real values
habit_mean = tf.divide(tf.reduce_sum(habit_embeded_masked, axis=1), mask_sum)
with tf.Session() as sess:
    print(sess.run(mask_sum))
    print(sess.run(habit_mean))


[[1 1 1]
 [1 0 0]
 [1 1 0]
 [1 1 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]]
[[3.]
 [1.]
 [2.]
 [2.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]]
[[ 4.69575357e-03 -3.30011438e-03  5.47704597e-05  1.15045267e-02]
 [ 1.25944447e-02 -1.93319209e-02  5.39015234e-03  4.78455238e-03]
 [ 1.36791351e-02 -1.67632820e-02  6.31014723e-03  9.30186873e-03]
 [-3.38282436e-04  2.14714999e-03 -3.53291538e-03  1.03471975e-02]
 [ 9.53832828e-03  7.02332845e-03  5.57916798e-03  1.39203863e-02]
 [ 1.25944447e-02 -1.93319209e-02  5.39015234e-03  4.78455238e-03]
 [ 1.25944447e-02 -1.93319209e-02  5.39015234e-03  4.78455238e-03]
 [ 1.47638256e-02 -1.41946431e-02  7.23014213e-03  1.38191851e-02]
 [ 1.25944447e-02 -1.93319209e-02  5.39015234e-03  4.78455238e-03]
 [-1.02148931e-02 -2.72902846e-03 -1.26449987e-02  6.77400874e-03]]
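
The difference between the two averages is easy to see on a toy example. The sketch below uses NumPy and a made-up table whose rows are easy to read; `masked_mean` is a hypothetical helper, not something defined in the notebook.

```python
import numpy as np

def masked_mean(embedded, ids, pad_id=0):
    """Average over axis 1, ignoring positions whose id equals `pad_id`."""
    mask = (ids != pad_id).astype(embedded.dtype)[..., np.newaxis]  # (batch, len, 1)
    summed = (embedded * mask).sum(axis=1)                           # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)                       # guard against 0/0
    return summed / counts

# Toy table: row 0 is the padding row, rows 1..4 belong to real ids
table = np.arange(20, dtype=np.float64).reshape(5, 4)
ids = np.array([[1, 2, 3],
                [4, 0, 0]])
embedded = table[ids]                                   # (2, 3, 4)

true_mean = masked_mean(embedded, ids)                  # divides by 3 and by 1
padded_mean = (embedded * (ids != 0)[..., None]).mean(axis=1)  # always divides by 3

print(true_mean)
print(padded_mean)
```

For the second sample, which has a single real value, `true_mean` returns that value's vector unchanged while `padded_mean` shrinks it to one third.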


### Numerical Features

Continuous features:

In [52]:
data['age']

0    48
1    30
2    34
3    35
4    27
5    45
6    44
7    21
8    39
9    31
Name: age, dtype: int64

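Four ways to handle continuous features:

* No embedding:
  
  Use the raw value, or a transformation of it, directly. Examples: Wide&Deep (Google Play), DMT (JD), YoutubeDNN.
  
  
* Direct embedding:
  
  For example, treat the `age` field as an age group with a granularity of 1, that is, as a categorical feature, and index its dense vector directly from the embedding table.
  
  
* Hard discretization:

  First bucket the `age` field, using equal-width bins, equal-frequency bins, or a log transform. With equal-width bins, for example: ages 11-20 → 1; 21-30 → 2; 31-40 → 3; 41-50 → 4; 51-60 → 5. The field is now a categorical feature, so it can be one-hot encoded and multiplied by the embedding matrix like any other categorical feature.
  
  Drawbacks of hard discretization: information is lost; similar values can land in different buckets; values that differ widely can share a bucket.
  
  
* Soft discretization: AutoDis

  Whereas hard discretization assigns each feature value to one specific bucket, soft discretization gives each sample's feature value a set of weights over the buckets and takes a weighted average: the more closely the value relates to a bucket, the larger that bucket's weight. In the example above, the weights for age=30 might be [0.2, 0.7, 0.3, 0.2, 0.1], which gives a bucket value of `0.2*1 + 0.7*2 + 0.3*3 + 0.2*4 + 0.1*5 = 3.8`. Every sample's age value thus gets its own bucket value, which is then multiplied by the embedding table to produce the dense vector.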
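
Equal-width hard discretization with the bucket boundaries listed above can be sketched with `np.digitize`. The ages are copied from `data['age']`; the embedding table here is a random stand-in drawn with NumPy rather than a trained TensorFlow variable.

```python
import numpy as np

ages = np.array([48, 30, 34, 35, 27, 45, 44, 21, 39, 31])

# Upper edges are inclusive: (10, 20] -> 1, (20, 30] -> 2, ..., (50, 60] -> 5
edges = np.array([10, 20, 30, 40, 50, 60])
age_bucket = np.digitize(ages, edges, right=True)
print(age_bucket)

# The bucket id then indexes an embedding table like any categorical id.
# Row 0 is unused here because bucket ids start at 1.
rng = np.random.default_rng(1024)
age_embedding_table = rng.normal(0.0, 0.01, size=(len(edges), 4))
age_embedded = age_embedding_table[age_bucket]   # shape (10, 4)
print(age_embedded.shape)
```

The drawbacks listed above show up directly: ages 30 and 31 land in different buckets, while 31 and 39 share one.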
  

In [57]:
with tf.variable_scope('autodis_embedding', reuse=tf.AUTO_REUSE):
    weights = tf.get_variable(name='autodis_weights',
                              shape=(1, 5),
                              initializer=tf.random_uniform_initializer(0, 1))
    
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    wt = sess.run(weights)
wt

array([[0.6441853 , 0.54293144, 0.41212475, 0.09288633, 0.8783301 ]],
      dtype=float32)

In [58]:
[30] * wt

array([[19.32555914, 16.28794312, 12.36374259,  2.78658986, 26.34990335]])
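
The products above are unnormalized scores, one per bucket. One way to turn them into bucket weights that sum to 1 is a softmax, after which the dense vector is the weighted average of the bucket embeddings. The NumPy sketch below is a simplified illustration of that idea and not the full AutoDis architecture; `proj`, `bucket_embeddings`, and the min-max scaling of age to the 18-60 range are assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1024)
num_buckets, embed_dim = 5, 4

# Hypothetical parameters; in a real model they would be learned
proj = rng.uniform(0.0, 1.0, size=(1, num_buckets))            # plays the role of `autodis_weights`
bucket_embeddings = rng.normal(0.0, 0.01, size=(num_buckets, embed_dim))

ages = np.array([[48.0], [30.0], [21.0]])
scaled = (ages - 18.0) / (60.0 - 18.0)        # assumed min-max scaling keeps the scores small

bucket_weights = softmax(scaled @ proj)              # (3, 5), each row sums to 1
age_embedded = bucket_weights @ bucket_embeddings    # (3, 4) weighted average of bucket vectors

print(bucket_weights.sum(axis=1))
print(age_embedded.shape)
```

Because the weights vary smoothly with the input, nearby ages receive nearby dense vectors, which addresses the bucket-boundary problem of hard discretization.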