# Chapter 3 word2vec

## 3.1 Inference-based method and neural network

### 3.1.1 Problems with the count-based method

The count-based method represents a word by the words that appear around it.
A large corpus can contain over 100 million terms.
Computing word vectors from the co-occurrence counts of such a corpus takes a very long time.
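
To make the cost concrete, the sketch below counts co-occurrences for a toy corpus and estimates how much memory a dense count matrix would need for a much larger vocabulary. The function name `co_occurrence_matrix`, the hard-coded word IDs, and the vocabulary size of 1,000,000 are assumptions chosen for illustration only.

```python
import numpy as np

def co_occurrence_matrix(corpus, vocab_size, window_size=1):
    """Count how often each word ID appears within window_size of another."""
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        for offset in range(1, window_size + 1):
            left, right = idx - offset, idx + offset
            if left >= 0:
                co_matrix[word_id, corpus[left]] += 1
            if right < len(corpus):
                co_matrix[word_id, corpus[right]] += 1
    return co_matrix

# Toy corpus: word IDs for "you say goodbye and i say hello ."
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
C = co_occurrence_matrix(corpus, vocab_size=7)
print(C[1])  # neighbours of "say": [1 0 1 0 1 1 0]

# Dense int32 storage grows with the square of the vocabulary size.
hypothetical_vocab = 1_000_000  # assumed size, for illustration only
bytes_needed = hypothetical_vocab ** 2 * 4
print(f"{bytes_needed / 1e12:.0f} TB")  # 4 TB
```

The matrix has one row and one column per vocabulary word, so both memory and any later matrix processing scale badly as the vocabulary grows.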



In [1]:
# 3.1.3 Word processing in a neural network
import numpy as np

c = np.array([[1, 0, 0, 0, 0, 0, 0]])  # one-hot input for word ID 0
W = np.random.randn(7, 3)              # weights: 7 words -> 3 hidden units
h = np.dot(c, W)                       # hidden layer, shape (1, 3)

print(h)

[[ 0.20241309 -1.31116949  0.68082365]]
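
Since `c` is a one-hot vector, the matrix product above simply extracts one row of `W`. The following minimal sketch checks this; the fixed random seed and the names `h_matmul` and `h_lookup` are assumptions added for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
W = rng.standard_normal((7, 3))

word_id = 0
c = np.zeros((1, 7))
c[0, word_id] = 1  # one-hot vector for word ID 0

h_matmul = np.dot(c, W)      # full matrix product
h_lookup = W[word_id][None]  # direct row lookup, reshaped to (1, 3)

print(np.allclose(h_matmul, h_lookup))  # True
```

In other words, each row of `W` can be read as the vector assigned to one word.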


In [2]:
class MatMul:
    """Matrix-multiplication layer: forward computes np.dot(x, W)."""

    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.x = None

    def forward(self, x):
        W, = self.params  # unpack the single weight matrix from the list
        out = np.dot(x, W)
        self.x = x  # cache the input for backward
        return out

    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)       # gradient with respect to the input
        dW = np.dot(self.x.T, dout)  # gradient with respect to the weights
        self.grads[0][...] = dW      # overwrite in place
        return dx


In [3]:
import numpy as np

c = np.array([[1, 0, 0, 0, 0, 0, 0]])
W = np.random.randn(7, 3)
layer = MatMul(W)
h = layer.forward(c)

print(h)

[[0.50664811 0.23032133 1.52703031]]


## 3.2 Implementation of simple word2vec

In this chapter, the continuous bag-of-words (CBOW) model is used.
CBOW is a neural network designed to infer a target word from its context.

The CBOW model can be shown as:
![img](./fig/3_2_1.drawio.svg)

In [4]:
# 3.2.1 Implementation of simple word2vec
import numpy as np

# sample context data
c0 = np.array([[1, 0, 0, 0, 0, 0, 0]])
c1 = np.array([[0, 0, 1, 0, 0, 0, 0]])

# initialize the weights
W_in = np.random.randn(7, 3)
W_out = np.random.randn(3, 7)

# the two input layers share the same weights W_in
in_layer0 = MatMul(W_in)
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)

# forward pass: average the two hidden vectors, then compute the scores
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)
s = out_layer.forward(h)

# show result
print(s)

[[-1.0399944   0.98493697  2.30230196  0.8781082   1.48025635
   2.80441429 -2.68816612]]
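
Because both input layers share `W_in`, the hidden vector is the average of the rows of `W_in` selected by the context words. The sketch below compares the one-hot route with direct row indexing; the fixed seed and the names `context_ids`, `h_one_hot`, and `h_index` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
W_in = rng.standard_normal((7, 3))
W_out = rng.standard_normal((3, 7))

context_ids = [0, 2]  # word IDs of the two context words used above

# One-hot route: multiply each context vector by W_in, then average.
one_hot = np.eye(7)[context_ids]  # shape (2, 7)
h_one_hot = (one_hot @ W_in).mean(axis=0, keepdims=True)

# Index route: pick the rows directly and average them.
h_index = W_in[context_ids].mean(axis=0, keepdims=True)

s = h_index @ W_out  # one score per vocabulary word
print(s.shape)                          # (1, 7)
print(np.allclose(h_one_hot, h_index))  # True
```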


### 3.2.2 Learning of the CBOW model

The model used for learning can be shown as:

![CBOW](./fig/3_2_2.drawio.svg)

Compared with the figure in 3.2.1, a softmax function and a cross-entropy function are added.
The softmax function converts the scores into probabilities.
The cross-entropy function computes the loss from those probabilities and the correct label; the gradients are then calculated from this loss.
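
As a minimal sketch of these two steps, the code below applies softmax and cross-entropy to a single row of scores. The helper names `softmax` and `cross_entropy_error`, the score values, and the choice of word ID 1 as the target are assumptions for illustration, not the notebook's own implementation.

```python
import numpy as np

def softmax(s):
    """Convert scores to probabilities row by row (numerically stable)."""
    s = s - s.max(axis=1, keepdims=True)
    exp_s = np.exp(s)
    return exp_s / exp_s.sum(axis=1, keepdims=True)

def cross_entropy_error(y, t):
    """Mean cross-entropy between probabilities y and one-hot labels t."""
    eps = 1e-7  # avoids log(0)
    return -np.sum(t * np.log(y + eps)) / y.shape[0]

# Hypothetical scores for a 7-word vocabulary; the target is word ID 1.
s = np.array([[-1.0, 1.0, 2.3, 0.9, 1.5, 2.8, -2.7]])
t = np.array([[0, 1, 0, 0, 0, 0, 0]])

y = softmax(s)
loss = cross_entropy_error(y, t)
ds = (y - t) / s.shape[0]  # gradient of the loss with respect to the scores

print(y.sum())  # the probabilities sum to 1
print(loss)
```

The gradient `ds` is what would be passed back through the output layer during learning.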

### 3.2.3 Weights and distributed representations



## 3.3 Preparation of learning data

### 3.3.1 Context data and target data

word2vec is trained on pairs of context data and target data.
The context is the set of words that appear around the target word.
If the window size is 1, the context contains two words: the one before the target and the one after it.

For the following corpus, the contexts and targets are shown below.

Corpus = "you say goodbye and I say hello."

| Contexts | Targets |
|----------|---------|
| you, goodbye | say |
| say, and | goodbye |
| goodbye, I | and |
| and, say | I |
| I, hello | say |
| say, . | hello |

In [5]:
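def preprocess(text):
    """Split text into words and map each word to an integer ID."""
    text = text.lower()
    text = text.replace('.', ' .')  # treat the period as its own token
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)  # IDs follow the order of first appearance
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    # the corpus is the text expressed as a sequence of word IDs
    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word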

In [6]:
# 3.3.1 Context data
text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
print(corpus)
print(id_to_word)

[0 1 2 3 4 1 5 6]
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}


In [7]:
def create_contexts_target(corpus, window_size=1):
    """Build (contexts, target) arrays from a corpus of word IDs."""
    # words closer than window_size to either end have no full context
    target = corpus[window_size:-window_size]
    contexts = []

    for idx in range(window_size, len(corpus) - window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue  # skip the target word itself
            cs.append(corpus[idx + t])
        contexts.append(cs)

    return np.array(contexts), np.array(target)


In [8]:
contexts, target = create_contexts_target(corpus, window_size=1)

print(contexts)
print(target)

[[0 2]
 [1 3]
 [2 4]
 [3 1]
 [4 5]
 [1 6]]
[1 2 3 4 1 5]
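
To see the effect of the window size, the sketch below reruns the same idea with `window_size=2`. It restates `create_contexts_target` in compact form so the block is self-contained, and hard-codes the word IDs printed above; both are conveniences assumed for this example.

```python
import numpy as np

def create_contexts_target(corpus, window_size=1):
    """Compact restatement of the function defined above."""
    target = corpus[window_size:-window_size]
    contexts = [
        [corpus[idx + t] for t in range(-window_size, window_size + 1) if t != 0]
        for idx in range(window_size, len(corpus) - window_size)
    ]
    return np.array(contexts), np.array(target)

corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, target = create_contexts_target(corpus, window_size=2)

print(contexts.shape)          # (4, 4)
print(contexts[0], target[0])  # [0 1 3 4] 2
```

With a window size of 2, each target has four context words, and two words at each end of the corpus can no longer serve as targets.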


In [11]:
target.shape

(6,)

### 3.3.2 Conversion to one-hot representation

The CBOW model takes one-hot vectors as input, so the word IDs in `contexts` and `target` are converted with `convert_one_hot`.

In [12]:
def convert_one_hot(corpus, vocab_size):
    '''Convert word IDs to one-hot representation.

    :param corpus: word IDs (1-D or 2-D NumPy array)
    :param vocab_size: vocabulary size
    :return: one-hot representation (2-D or 3-D NumPy array)
    '''
    N = corpus.shape[0]

    if corpus.ndim == 1:
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1

    elif corpus.ndim == 2:
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1

    return one_hot
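
The explicit loops can also be replaced by indexing into an identity matrix. The function name `convert_one_hot_eye` is hypothetical, and this variant is only a sketch of an alternative, not the function used in the rest of the notebook.

```python
import numpy as np

def convert_one_hot_eye(ids, vocab_size):
    """Hypothetical vectorized variant: index rows of an identity matrix."""
    return np.eye(vocab_size, dtype=np.int32)[ids]

target = np.array([1, 2, 3, 4, 1, 5])
contexts = np.array([[0, 2], [1, 3], [2, 4], [3, 1], [4, 5], [1, 6]])

print(convert_one_hot_eye(target, 7).shape)    # (6, 7)
print(convert_one_hot_eye(contexts, 7).shape)  # (6, 2, 7)
```

Row `i` of the identity matrix is the one-hot vector for word ID `i`, so fancy indexing handles both the 1-D and the 2-D case. The trade-off is that it first allocates a `vocab_size` by `vocab_size` matrix, which is only reasonable for small vocabularies.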


In [13]:
text = 'You say goodbye and I say hello.'

corpus, word_to_id, id_to_word = preprocess(text)

contexts, target = create_contexts_target(corpus, window_size=1)

vocab_size = len(word_to_id)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

print(contexts)
print(target)
print(vocab_size)

[[[1 0 0 0 0 0 0]
  [0 0 1 0 0 0 0]]

 [[0 1 0 0 0 0 0]
  [0 0 0 1 0 0 0]]

 [[0 0 1 0 0 0 0]
  [0 0 0 0 1 0 0]]

 [[0 0 0 1 0 0 0]
  [0 1 0 0 0 0 0]]

 [[0 0 0 0 1 0 0]
  [0 0 0 0 0 1 0]]

 [[0 1 0 0 0 0 0]
  [0 0 0 0 0 0 1]]]
[[0 1 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0]
 [0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0]]
7
