# Word2Vec Implementation
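- A word can carry many meanings, and the intended one depends on context. Capturing those meanings in a single vector requires word embedding, which maps a high-dimensional representation of a word to a low-dimensional vector.
- Earlier we introduced distributed representations of word vectors built with statistical methods such as PMI and SVD.
- After neural networks took off around 2013, Tomas Mikolov trained a neural network to learn word vectors and obtained strong results. A common interpretation is that the network approximates human-like understanding and thereby learns a well-structured vector space.
- This example uses a partial dump of Chinese Wikipedia.
- Data source: https://dumps.wikimedia.org/zhwiki/20240501/zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2
- Word segmentation is done with jieba, and the word2vec model is trained with the gensim package.
- This example takes about one hour to run.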
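
For comparison with the neural approach used in this notebook, the count-based route mentioned above (PMI followed by SVD) can be sketched on a toy corpus. This is only an illustrative sketch: the corpus, the window size of 1, and the choice of `k = 2` dimensions are assumptions made for the example, not values from this notebook.

```python
import numpy as np

# Toy corpus (hypothetical); each sentence is already tokenized.
corpus = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a window of 1 word on each side.
window = 1
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[w], index[sent[j]]] += 1

# Positive PMI: max(0, log2(P(w, c) / (P(w) * P(c)))).
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(counts * total / (row * col))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD keeps the top-k singular directions as dense word vectors.
k = 2
u, s, _ = np.linalg.svd(ppmi)
word_vectors = u[:, :k] * s[:k]

print(word_vectors.shape)
```

Each row of `word_vectors` is a dense vector for one vocabulary word, reduced from a vocabulary-sized row of the PPMI matrix to `k` dimensions.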


In [1]:
!wget https://dumps.wikimedia.org/zhwiki/20240501/zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2

--2024-05-17 06:00:19--  https://dumps.wikimedia.org/zhwiki/20240501/zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.71, 2620:0:861:3:208:80:154:71
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 233419414 (223M) [application/octet-stream]
Saving to: ‘zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2’


2024-05-17 06:01:03 (5.05 MB/s) - ‘zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2’ saved [233419414/233419414]
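
Because the later cells run for a long time, it may be worth confirming that the download is complete before continuing. The following is a minimal sketch; the helper name `download_is_complete` is hypothetical, and the expected size is the `Length` value reported in the wget log above.

```python
import os

def download_is_complete(path, expected_bytes):
    """Return True when the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes

# Values taken from the wget log above.
DUMP_PATH = "zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2"
EXPECTED_BYTES = 233419414

if not download_is_complete(DUMP_PATH, EXPECTED_BYTES):
    print("Dump is missing or truncated; re-run the download cell.")
```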



### opencc: a Traditional/Simplified Chinese conversion tool

In [2]:
!pip install opencc-python-reimplemented

Collecting opencc-python-reimplemented
  Downloading opencc_python_reimplemented-0.1.7-py2.py3-none-any.whl (481 kB)
Installing collected packages: opencc-python-reimplemented
Successfully installed opencc-python-reimplemented-0.1.7


### gensim: the library used to train word2vec

In [3]:
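from gensim.corpora import WikiCorpus

# Passing dictionary={} skips the initial vocabulary scan over the whole dump,
# which is slow and not needed when we only want the article texts.
wiki_corpus = WikiCorpus('zhwiki-20240501-pages-articles-multistream1.xml-p1p187712.bz2', dictionary={})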

In [4]:
wiki_corpus

<gensim.corpora.wikicorpus.WikiCorpus at 0x7c88a02801f0>

In [6]:
next(iter(wiki_corpus.get_texts()))[:20]

['歐幾里得',
 '西元前三世紀的古希臘數學家',
 '而現在被認為是幾何之父',
 '此畫為拉斐爾的作品',
 '雅典學院',
 '数学',
 '是研究數量',
 '屬於形式科學的一種',
 '數學利用抽象化和邏輯推理',
 '從計數',
 '計算',
 '量度',
 '對物體形狀及運動的觀察發展而成',
 '數學家們拓展這些概念',
 '以公式化新的猜想',
 '以及從選定的公理及定義出發',
 '嚴謹地推導出一些定理',
 '對數學基本概念的完善',
 '早在古埃及',
 '而在古希臘那裡有更為嚴謹的處理']

## Convert the wiki dump into a plain-text txt file

In [7]:
text_num = 0

with open('wiki_text.txt', 'w', encoding='utf-8') as f:
    for text in wiki_corpus.get_texts():
        f.write(' '.join(text)+'\n')
        text_num += 1
        if text_num % 10000 == 0:
            print('{} articles processed.'.format(text_num))

    print('{} articles processed.'.format(text_num))

10000 articles processed.
20000 articles processed.
30000 articles processed.
32786 articles processed.


In [8]:
import jieba
from opencc import OpenCC


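# Initialize the Simplified-to-Traditional converter
cc = OpenCC('s2t')
# Read the whole corpus into memory (one article per line)
train_data = open('wiki_text.txt', 'r', encoding='utf-8').read()
# Convert Simplified characters to Traditional, then segment with jieba
train_data = cc.convert(train_data)
train_data = jieba.lcut(train_data)
# Drop empty tokens and join the rest with spaces
train_data = [word for word in train_data if word != '']
train_data = ' '.join(train_data)
# write() returns the number of characters written, shown in the output below
open('seg.txt', 'w', encoding='utf-8').write(train_data)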

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.714 seconds.
Prefix dict has been built successfully.


136623985
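
The cell above loads and segments the entire corpus in memory at once. If memory is tight, the same step can be done one line (one article) at a time. The following is a minimal sketch under that assumption; `segment_file` is a hypothetical helper, and the converter and tokenizer are passed in as callables so the function itself depends only on the standard library.

```python
def segment_file(src_path, dst_path, convert, tokenize):
    """Segment src_path line by line and write space-separated tokens to dst_path."""
    n_lines = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = tokenize(convert(line.strip()))
            # Drop empty and whitespace-only tokens.
            tokens = [tok for tok in tokens if tok.strip()]
            dst.write(" ".join(tokens) + "\n")
            n_lines += 1
    return n_lines

# In this notebook the call would look like:
#   segment_file('wiki_text.txt', 'seg.txt', cc.convert, jieba.lcut)
```

Keeping one article per output line also matches what `LineSentence` expects, since it treats each line as one sentence.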

In [9]:
from gensim.models import word2vec


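# Settings
seed = 666            # RNG seed; fully reproducible runs also need a single worker
sg = 0                # 0 = CBOW, 1 = skip-gram
window_size = 10      # maximum distance between the target word and a context word
#vector_size = 100    # embedding dimension (gensim's default is 100)
min_count = 1         # keep every word, including those that appear only once
workers = 8           # number of worker threads
#epochs = 5           # passes over the corpus (gensim's default is 5)
batch_words = 10000   # number of words per batch handed to each worker

train_data = word2vec.LineSentence('seg.txt')
model = word2vec.Word2Vec(
    train_data,
    min_count=min_count,
    #vector_size=vector_size,  # named `size` before gensim 4.0
    workers=workers,
    #epochs=epochs,            # named `iter` before gensim 4.0
    window=window_size,
    sg=sg,
    seed=seed,
    batch_words=batch_words
)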

model.save('word2vec.model')
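
To make the `sg` and `window` settings more concrete, the sketch below shows how training examples could be built from one segmented sentence. With `sg = 0` (CBOW) the model predicts a target word from its surrounding context; with `sg = 1` (skip-gram) it predicts each context word from the target. The helper names and the sample sentence are hypothetical, and the sketch leaves out details of the real training procedure such as negative sampling and the downsampling of frequent words.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs for a CBOW-style objective."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target

def skipgram_pairs(tokens, window):
    """Yield (target_word, context_word) pairs for a skip-gram-style objective."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word

# Hypothetical segmented sentence, for illustration only.
sentence = ["數學", "是", "研究", "數量", "的", "學科"]
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```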

In [10]:
from gensim.models import word2vec

string = '門'
model = word2vec.Word2Vec.load('word2vec.model')
print(string)

# Look up the most similar words
for item in model.wv.most_similar(string):
    print(item)

門
('門的', 0.6725940108299255)
('大門', 0.6616427898406982)
('天安', 0.6539272665977478)
('門地區', 0.6512478590011597)
('門內', 0.6413958072662354)
('牆', 0.6378226280212402)
('門前', 0.6370177268981934)
('鐘鼓樓', 0.635375440120697)
('中門', 0.6262894868850708)
('門城樓', 0.6217420697212219)
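
The scores in the output above are cosine similarities between word vectors. As a rough illustration of how such a ranking can be computed, here is a minimal NumPy sketch; the function is a simplified stand-in rather than gensim's implementation, and the two-dimensional vectors are made up for the example.

```python
import numpy as np

def most_similar(query, vectors, topn=3):
    """Rank words by cosine similarity to `query`, excluding the query itself."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    ranked = sorted(zip(words, sims), key=lambda pair: -pair[1])
    return [(w, float(s)) for w, s in ranked if w != query][:topn]

# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {
    "door": [1.0, 0.1],
    "gate": [0.9, 0.2],
    "wall": [0.5, 0.5],
    "math": [0.0, 1.0],
}
print(most_similar("door", toy_vectors))
```

With a trained model, the same idea applies to the vectors stored in `model.wv`.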
