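# One-hot encoding
One-hot encoding turns text into vectors. The representation is intuitive and easy to implement, but it has several drawbacks: each vector is as long as the vocabulary, so the representation is sparse; a text becomes one vector per word, so texts of different lengths do not get a fixed-length representation; there is no notion of similarity between words; and words that are not in the vocabulary cannot be handled.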

In [1]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
# Lowercase the text and strip the periods
processed_docs = [doc.lower().replace(".","") for doc in documents]

# Build the vocabulary (word -> 1-based index)
vocab={}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count = count+1
            vocab[word] = count
print(vocab)

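def get_onehot_vector(somestring):
    """Return a list with one one-hot vector per word of somestring."""
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        # Words outside the vocabulary keep an all-zero vector
        if word in vocab:
            temp[vocab[word]-1] = 1  # vocab indices start at 1
        onehot_encoded.append(temp)
    return onehot_encoded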

print(processed_docs[1])
print(get_onehot_vector(processed_docs[1]))

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
man bites dog
[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
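
The limitations listed at the top of this section can be checked directly. The following is a minimal sketch that re-declares the vocabulary printed above and the same encoding function; the test strings ("dog eats", "cat") are assumptions chosen only for illustration.

```python
# Vocabulary copied from the output above (1-based indices).
vocab = {"dog": 1, "bites": 2, "man": 3, "eats": 4, "meat": 5, "food": 6}

def get_onehot_vector(text):
    """Return one one-hot vector per word; unknown words stay all-zero."""
    vectors = []
    for word in text.split():
        vec = [0] * len(vocab)
        if word in vocab:
            vec[vocab[word] - 1] = 1
        vectors.append(vec)
    return vectors

short = get_onehot_vector("dog eats")
longer = get_onehot_vector("dog eats meat")
unknown = get_onehot_vector("cat")  # "cat" is not in the vocabulary

# The number of vectors follows the number of words,
# so a text does not get a fixed-length representation.
print(len(short), len(longer))  # 2 3

# An out-of-vocabulary word becomes an all-zero vector.
print(unknown)  # [[0, 0, 0, 0, 0, 0]]

# Two different words always have dot product 0,
# so no pair of words is closer than any other pair.
dog, eats = short
dot = sum(a * b for a, b in zip(dog, eats))
print(dot)  # 0
```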


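# Bag of Words
Bag of Words (BoW) represents a text as a collection of its words. Texts that contain the same words get nearby vectors, so BoW captures a degree of document similarity, and a sentence of any length is encoded as a fixed-length vector. On the other hand, the vectors are sparse, different words with the same meaning are treated as unrelated, out-of-vocabulary (OOV) words cannot be handled, and word order is lost.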

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
bow_rep = count_vect.fit_transform(processed_docs)

# Show the vocabulary
print(count_vect.vocabulary_)

# Show the BoW representation of each document
for i in range(len(processed_docs)):
    print(processed_docs[i])
    print(bow_rep[i].toarray())

# BoW representation of a new text
temp = count_vect.transform(["dog and dog are friends"])
print(temp.toarray())

{'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
dog bites man
[[1 1 0 0 1 0]]
man bites dog
[[1 1 0 0 1 0]]
dog eats meat
[[0 1 1 0 0 1]]
man eats food
[[0 0 1 1 1 0]]
[[0 2 0 0 0 0]]
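
To make the similarity claim and the loss of word order concrete, here is a small sketch that compares the BoW vectors printed above with cosine similarity. The vectors are copied from the output; the `cosine` helper is written here for illustration and is not part of scikit-learn.

```python
import math

# BoW vectors copied from the output above
# (column order: bites, dog, eats, food, man, meat).
bow = {
    "dog bites man": [1, 1, 0, 0, 1, 0],
    "man bites dog": [1, 1, 0, 0, 1, 0],
    "dog eats meat": [0, 1, 1, 0, 0, 1],
    "man eats food": [0, 0, 1, 1, 1, 0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same words in a different order: the vectors are identical.
same_words = cosine(bow["dog bites man"], bow["man bites dog"])
# Only "dog" is shared: the similarity is much lower.
one_shared = cosine(bow["dog bites man"], bow["dog eats meat"])

print(round(same_words, 3))
print(round(one_shared, 3))
```

"dog bites man" and "man bites dog" are indistinguishable under BoW even though their meanings differ, which is the word-order problem mentioned above.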


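# Bag of N-grams
Bag of N-grams (BoN) splits a text into chunks of n consecutive words, which produces a representation that takes phrases and word order into account. BoN captures similarity between documents that share n-grams, but sparsity grows rapidly as n increases, and OOV words still cannot be handled.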

In [13]:
# BoN with n = 1, 2, 3 (unigrams, bigrams, and trigrams)
count_vect = CountVectorizer(ngram_range=(1,3))

bow_rep = count_vect.fit_transform(processed_docs)
# Show the vocabulary
print(count_vect.vocabulary_)


{'dog': 3, 'bites': 0, 'man': 12, 'dog bites': 4, 'bites man': 2, 'dog bites man': 5, 'man bites': 13, 'bites dog': 1, 'man bites dog': 14, 'eats': 8, 'meat': 17, 'dog eats': 6, 'eats meat': 10, 'dog eats meat': 7, 'food': 11, 'man eats': 15, 'eats food': 9, 'man eats food': 16}


In [15]:
# Show the BoN representation of each document
for i in range(len(processed_docs)):
    print(processed_docs[i])
    print(bow_rep[i].toarray())

# BoN representation of a new text
temp = count_vect.transform(["dog and dog are friends"])
print(temp.toarray())

dog bites man
[[1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0]]
man bites dog
[[1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0]]
dog eats meat
[[0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1]]
man eats food
[[0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0]]
[[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
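
As a rough illustration of how the n-gram vocabulary grows with n, the sketch below extracts word n-grams by hand in plain Python. The `ngrams` helper is a hypothetical stand-in written for this example and only approximates what `CountVectorizer` does internally; it assumes whitespace-separated, already lowercased text.

```python
def ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of text for n_min <= n <= n_max."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

print(ngrams(docs[0]))
# ['dog', 'bites', 'man', 'dog bites', 'bites man', 'dog bites man']

# Vocabulary size for n up to 1, 2, and 3
sizes = {}
for n_max in (1, 2, 3):
    vocab = {g for d in docs for g in ngrams(d, 1, n_max)}
    sizes[n_max] = len(vocab)
print(sizes)  # {1: 6, 2: 14, 3: 18}
```

The vocabulary triples from 6 to 18 entries on this four-sentence corpus, matching the 18 columns of the vectors above.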


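# TF-IDF
TF-IDF is a text representation given by the product of TF, the frequency of a word within a document, and IDF, which weighs how frequent or rare that word is across the documents of the corpus.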

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

# Show the vocabulary
print(tfidf.vocabulary_)

# Show the TF-IDF representation of each document
for i in range(len(processed_docs)):
    print(processed_docs[i])
    print(bow_rep_tfidf[i].toarray())

print(tfidf.get_feature_names_out()) # all words in the vocabulary
print(tfidf.idf_) # IDF of each word

temp = tfidf.transform(["dog and dog are friends"])
print(temp.toarray())

{'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
dog bites man
[[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]]
man bites dog
[[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]]
dog eats meat
[[0.         0.44809973 0.55349232 0.         0.         0.70203482]]
man eats food
[[0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]
['bites' 'dog' 'eats' 'food' 'man' 'meat']
[1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
[[0. 1. 0. 0. 0. 0.]]
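
The numbers above are consistent with scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), where idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is scaled to unit L2 norm. The following plain-Python sketch recomputes them by hand under that assumption; `tfidf_vector` is a helper written for this example, not a scikit-learn function.

```python
import math

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
# Sorted vocabulary: bites, dog, eats, food, man, meat
vocab = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: number of documents containing each word
df = {w: sum(w in d.split() for d in docs) for w in vocab}
# Smoothed IDF, assumed to match TfidfVectorizer's defaults
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(text):
    """Raw count times IDF, then L2-normalized; OOV words are ignored."""
    words = text.split()
    raw = [words.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

idf_values = [round(idf[w], 8) for w in vocab]
doc_vector = [round(x, 8) for x in tfidf_vector("dog bites man")]
new_vector = [round(x, 8) for x in tfidf_vector("dog and dog are friends")]

print(idf_values)
print(doc_vector)
print(new_vector)
```

Because only "dog" in the new text is in the vocabulary, normalization scales its single non-zero entry to 1, which is why the last row printed above is a unit vector.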


# Word2Vec

## Pre-trained word embeddings

In [10]:
# Download the pre-trained word embeddings
!wget -P /tmp/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2022-03-04 10:07:26--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.37.86
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.37.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/tmp/input/GoogleNews-vectors-negative300.bin.gz’


2022-03-04 10:10:09 (9.71 MB/s) - ‘/tmp/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [1]:
from gensim.models import Word2Vec,KeyedVectors

path = "/tmp/input/GoogleNews-vectors-negative300.bin.gz"
vectors = KeyedVectors.load_word2vec_format(path,binary=True)

In [2]:
# Get the words most similar to "beautiful"
print(vectors.most_similar("beautiful"))

[('gorgeous', 0.8353005051612854), ('lovely', 0.8106936812400818), ('stunningly_beautiful', 0.7329413294792175), ('breathtakingly_beautiful', 0.7231340408325195), ('wonderful', 0.6854086518287659), ('fabulous', 0.6700063943862915), ('loveliest', 0.6612576246261597), ('prettiest', 0.6595001816749573), ('beatiful', 0.6593326330184937), ('magnificent', 0.6591402888298035)]


In [3]:
# Show the vector
vectors["beautiful"]

array([-0.01831055,  0.05566406, -0.01153564,  0.07275391,  0.15136719,
       -0.06176758,  0.20605469, -0.15332031, -0.05908203,  0.22851562,
       -0.06445312, -0.22851562, -0.09472656, -0.03344727,  0.24707031,
        0.05541992, -0.00921631,  0.1328125 , -0.15429688,  0.08105469,
       -0.07373047,  0.24316406,  0.12353516, -0.09277344,  0.08203125,
        0.06494141,  0.15722656,  0.11279297, -0.0612793 , -0.296875  ,
       -0.13378906,  0.234375  ,  0.09765625,  0.17773438,  0.06689453,
       -0.27539062,  0.06445312, -0.13867188, -0.08886719,  0.171875  ,
        0.07861328, -0.10058594,  0.23925781,  0.03808594,  0.18652344,
       -0.11279297,  0.22558594,  0.10986328, -0.11865234,  0.02026367,
        0.11376953,  0.09570312,  0.29492188,  0.08251953, -0.05444336,
       -0.0090332 , -0.0625    , -0.17578125, -0.08154297,  0.01062012,
       -0.04736328, -0.08544922, -0.19042969, -0.30273438,  0.07617188,
        0.125     , -0.05932617,  0.03833008, -0.03564453,  0.24

In [4]:
# Looking up a word that is not in the vocabulary
print(vectors.most_similar("practicnlp"))

KeyError: "Key 'practicnlp' not present"
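
One way to avoid the `KeyError` is to check membership before the lookup. The sketch below shows the idea on a plain dictionary of made-up 3-dimensional vectors, ranking neighbours by cosine similarity; `most_similar_or_empty` and `toy_vectors` are hypothetical names used only here. With gensim, the same guard should be possible with `"practicnlp" in vectors`, since `KeyedVectors` supports the `in` operator.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "beautiful": [0.9, 0.1, 0.0],
    "gorgeous": [0.8, 0.2, 0.1],
    "computer": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_or_empty(vectors, word, topn=2):
    """Rank other words by cosine similarity; return [] for unknown words."""
    if word not in vectors:
        return []
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

known = most_similar_or_empty(toy_vectors, "beautiful")
missing = most_similar_or_empty(toy_vectors, "practicnlp")

print([word for word, _ in known])  # ['gorgeous', 'computer']
print(missing)  # []
```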

## Training Word2Vec

In [1]:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

model = Word2Vec(common_texts,vector_size=10,window=5,min_count=1,workers=4)
model.save("tempmodel.w2v")

# Show the most similar words
print(model.wv.most_similar("computer",topn=5))
# Show the vector
print(model.wv["computer"])

[('eps', 0.2914133667945862), ('trees', 0.05541810393333435), ('minors', 0.042647670954465866), ('survey', -0.02176341600716114), ('interface', -0.15233567357063293)]
[ 0.0163195   0.00189972  0.03474648  0.00217841  0.09621626  0.05062076
 -0.08919986 -0.0704361   0.00901718  0.06394394]


## Combining embeddings

In [5]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
import spacy

nlp=spacy.load("en_core_web_sm")

doc = nlp("Canada is a large country")
# Show the vector for "Canada"
print(doc[0].vector)
# Vector for the whole sentence (mean of the token vectors)
print(doc.vector)

[-0.07389477 -1.1225122   0.3269888  -1.1655114  -0.04418483  1.5891542
 -0.28719527  0.09429426  0.06673294 -1.4717522  -0.0818031   0.07689728
  0.20577267  0.1534166   0.00336191 -0.43733618  0.07018674  2.1657472
  1.0276892   0.25029135 -0.80173635 -0.56803215  0.19891274  0.5664381
 -1.0816476  -1.4319063   0.17249304 -0.9627772   0.1964746  -0.06256803
  0.62617874 -1.106359    0.08719929  0.69526446  0.95803195 -1.0843782
 -0.4794482   0.6937295  -1.1727467   0.88092196 -0.8631144   0.206231
  0.20171316 -1.2485261  -0.87571794 -1.1993029  -0.48330194 -0.2675097
  1.3254739  -0.8328309   1.7992892   1.3626868  -0.48119223 -0.11044502
  0.84547126  1.543742   -0.69748783 -0.5775061  -1.0095484  -0.8765683
 -0.4263393   0.55180967  1.4581158  -0.1674192   0.14557643 -0.07270151
 -0.16963056 -0.3089923   1.7251426  -0.8527357  -0.11102927  1.0936136
  1.0242088   0.4009051  -0.9171642  -1.3943074  -1.4925234   0.20589393
 -0.01312634 -0.5379971   0.88315284 -0.871836    0.13177091
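
The sentence vector printed by `doc.vector` is an average of the token vectors. A minimal sketch of that averaging step, using made-up 2-dimensional word vectors instead of a real model, might look like this. `average_vector` and `word_vectors` are hypothetical names, and skipping words without a vector is a choice of this sketch rather than spaCy's behavior.

```python
# Made-up 2-dimensional word vectors, for illustration only.
word_vectors = {
    "canada": [1.0, 3.0],
    "large": [3.0, 1.0],
    "country": [2.0, 2.0],
}

def average_vector(text, vectors, dim=2):
    """Average the vectors of the known words; zeros if none are known."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension together
    return [sum(column) / len(found) for column in zip(*found)]

sentence_vec = average_vector("Canada is a large country", word_vectors)
empty_vec = average_vector("is a", word_vectors)

print(sentence_vec)  # [2.0, 2.0]
print(empty_vec)  # [0.0, 0.0]
```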