# Natural Language Processing
## 2️⃣ Word Embedding

### word2vec

**word2vec** is one of several methods for learning **word embeddings**.  
**word2vec** learns embedding vectors by training a model to predict which words appear in a given context.
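
To make "predicting which words appear in a given context" concrete, here is a minimal sketch of how (center, context) training pairs could be collected from a tokenized sentence. The function name `context_pairs` and the English toy sentence are illustrative assumptions, not part of gensim. The pairing shown is skip-gram style; gensim's default mode is CBOW, which predicts the center word from its context instead, and real implementations add details such as random window shrinking and negative sampling.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # first index inside the window
        hi = min(len(tokens), i + window + 1)   # one past the last index inside the window
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical toy sentence, used only for illustration.
tokens = ["i", "like", "dogs", "and", "cats"]
pairs = context_pairs(tokens, window=2)
print(pairs[:4])
```

Each pair is one prediction problem for the model: given one side of the pair, predict the other. The embedding vectors are the parameters that get adjusted while solving these problems.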

In [1]:
from gensim.models import Word2Vec

# One tokenized Korean sentence: "I, who live in Seoul, like dogs and cats."
doc = [["서울에", "살고", "있는", "나는", "강아지와", "고양이를", "좋아한다"]]

w2v_model = Word2Vec(min_count=1, window=2, vector_size=300)
w2v_model.build_vocab(doc)
w2v_model.train(doc, total_examples=w2v_model.corpus_count, epochs=20)

(10, 140)

In [8]:
similar_word = w2v_model.wv.most_similar("강아지와")
print(similar_word)

score = w2v_model.wv.similarity("강아지와", "고양이를")
print(score)

[('서울에', 0.13418062031269073), ('나는', 0.050058916211128235), ('좋아한다', 0.033167969435453415), ('고양이를', 0.025743037462234497), ('살고', 0.01304111909121275), ('있는', -0.03428904712200165)]
0.02574304
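
The scores returned by `most_similar` and `similarity` are cosine similarities between embedding vectors. The following is a minimal pure-Python sketch of that measure; the helper name `cosine_similarity` and the toy vectors are assumptions made for illustration, not the gensim implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional vectors, chosen only to show the range of the measure.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point in almost the same direction, and values near 0 mean they are unrelated. The scores above are all close to 0, which is probably because a single seven-word sentence gives the model too little data to move the vectors far from their random initialization.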


#### Parameters of Word2Vec()
- min_count: words that appear fewer than min_count times in the corpus are ignored and get no embedding.
- window: the maximum number of words on each side of a target word that are used as its context.
- vector_size: the dimensionality of each word embedding vector.
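
The effect of min_count can be sketched without gensim: count every token, then keep only the words that reach the threshold. The helper `filter_vocab` and the toy sentences are hypothetical and only approximate what the library does internally when it builds its vocabulary.

```python
from collections import Counter

def filter_vocab(sentences, min_count=5):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical toy corpus, used only for illustration.
sentences = [["i", "feel", "happy"], ["i", "feel", "sad"], ["i", "am", "happy"]]
vocab = filter_vocab(sentences, min_count=2)
print(sorted(vocab))
```

Here "sad" and "am" occur only once, so they are dropped. This is why the one-sentence example sets `min_count=1`: with gensim's default of 5, every word in that sentence would be discarded.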

In [9]:
import pandas as pd
from gensim.models import Word2Vec

def load_data(filepath):
    data = pd.read_csv(filepath, delimiter=';', header=None, names=['sentence','emotion'])
    data = data['sentence']

    gensim_input = []
    for text in data:
        gensim_input.append(text.rstrip().split())
    return gensim_input

input_data = load_data("emotions_train.txt")

# Train a word2vec model (min_count is left at its gensim default of 5).

w2v_model = Word2Vec(window = 2, vector_size = 300)
w2v_model.build_vocab(input_data)
w2v_model.train(input_data, total_examples=w2v_model.corpus_count, epochs=10)

# Find the words most similar to "happy".
similar_happy = w2v_model.wv.most_similar("happy")

print(similar_happy)

# Find the words most similar to "sad".
similar_sad = w2v_model.wv.most_similar("sad")

print(similar_sad)

# Compute the similarity between "good" and "bad".
similar_good_bad = w2v_model.wv.similarity("good", "bad")

print(similar_good_bad)

# Compute the similarity between "sad" and "lonely".
similar_sad_lonely = w2v_model.wv.similarity("sad", "lonely")

print(similar_sad_lonely)

# Look up the embedding vector of "happy".
wv_happy = w2v_model.wv["happy"]

print(wv_happy)


[('excited', 0.9061278700828552), ('thrilled', 0.9044565558433533), ('determined', 0.8859072327613831), ('stubborn', 0.8718532919883728), ('truthful', 0.8653934001922607), ('blessed', 0.8630205988883972), ('thankful', 0.8628314137458801), ('eager', 0.854218065738678), ('pleased', 0.8402495384216309), ('delighted', 0.8355127573013306)]
[('scared', 0.9373880624771118), ('unhappy', 0.9362308382987976), ('hopeless', 0.93414705991745), ('lonely', 0.9294252395629883), ('angry', 0.9264203906059265), ('depressed', 0.9154171347618103), ('paranoid', 0.9112583994865417), ('worthless', 0.9094374179840088), ('nervous', 0.9087405204772949), ('bitter', 0.9044232964515686)]
0.7675484
0.92942524
[-0.14070337  0.10991926 -0.082832    0.05044246 -0.16845034 -0.17876713
  0.06207728  0.2871339  -0.0346562  -0.03207461  0.08975355 -0.14058274
 -0.0584864  -0.28872845 -0.06976887 -0.1127381   0.4116943  -0.07580606
 -0.08733585 -0.07632934 -0.00583419  0.04894549  0.09784348 -0.1811314
  0.12294561 -0.03315

#### out-of-vocabulary problem

word2vec cannot produce an embedding vector for a word that does not appear in the training data. This is known as the out-of-vocabulary (OOV) problem, and **fastText** addresses it.

### fastText

fastText splits each word into **character n-grams** (subword units) and trains a model similar to word2vec over them, so it can build an embedding vector even for a word that is not in the training data.
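
The subword idea can be sketched as follows: wrap the word in boundary markers and collect every character n-gram within a length range. The helper `char_ngrams` is hypothetical; the range 3 to 6 matches the default `min_n` and `max_n` of gensim's `FastText`, but the real implementation also hashes n-grams into a fixed number of buckets, which this simplified sketch omits.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with < and > marking its boundaries."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

known = char_ngrams("좋아한다")      # a word from the training sentence
unseen = char_ngrams("좋아한다고")   # an inflected form that was never seen in training
shared = known & unseen
print(sorted(shared))
```

Because the unseen word shares n-grams such as `<좋아` with a word seen during training, the model already has learned vectors for part of it and can combine them into an embedding instead of failing.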

In [None]:
from gensim.models import FastText

# One tokenized Korean sentence: "I, who live in Seoul, like dogs and cats."
doc = [["서울에", "살고", "있는", "나는", "강아지와", "고양이를", "좋아한다"]]

ft_model = FastText(min_count=1, window=2, vector_size=300)
ft_model.build_vocab(doc)
ft_model.train(doc, total_examples=ft_model.corpus_count, epochs=10)

In [None]:
# "엘리스는" does not appear in the training sentence, but fastText can still embed it.
similar_word = ft_model.wv.most_similar("엘리스는")
print(similar_word)
# [('좋아한다', 0.03110547922551632),
# ('살고', 0.015657681971788406),
# ('강아지를', -0.09297232329845428),
# ('서울에', -0.10255782306194305),
# ('있는', -0.10588616132736206)]

# "좋아한다고" is an unseen inflected form of the training word "좋아한다" ("like").
new_vector = ft_model.wv["좋아한다고"]
print(new_vector)
# array([-5.8544584e-04, -1.5485507e-03, -1.3994898e-03, -9.1309723e-04, ...