# Training a Chinese Word2Vec Model

## 1. Preparing and Preprocessing the Data

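First you need a fairly large Chinese corpus. Chinese Wikipedia is a good candidate (the Sogou news corpus is another option). The Chinese Wikipedia dump is available at https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2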

The Chinese Wikipedia data is not very large: the compressed XML file is about 1 GB.

First, process the compressed XML file with process_wiki_data.py:
> python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

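Next, segment the text with jieba to produce the segmented file wiki.zh.text.seg. By default, python -m jieba joins tokens with ' / '; the -d ' ' option switches the delimiter to a single space, which is what a whitespace-based reader expects at training time.
> python -m jieba -d ' ' /Users/zoe/Documents/GitHub/July-NLP/Lec\ 09\ Word2Vec/files/wiki.zh.text > /Users/zoe/Documents/GitHub/July-NLP/Lec\ 09\ Word2Vec/files/wiki.zh.text.seg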
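
The same segmentation step can be done from Python. The following is a minimal sketch, not the script used above: segment_file and its cut parameter are hypothetical names introduced here, and the default tokenizer is assumed to be jieba.lcut. It writes one line of output per line of input, with tokens separated by single spaces.

```python
def segment_file(src_path, dst_path, cut=None):
    """Write a space-separated, segmented copy of src_path to dst_path.

    cut is any callable mapping a string to an iterable of tokens.
    When omitted, jieba.lcut is used (assumes jieba is installed).
    Returns the number of lines processed.
    """
    if cut is None:
        import jieba  # third-party; imported lazily so other tokenizers work without it
        cut = jieba.lcut

    n_lines = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Drop whitespace-only tokens so the output splits cleanly on spaces.
            tokens = [tok for tok in cut(line.strip()) if tok.strip()]
            dst.write(" ".join(tokens) + "\n")
            n_lines += 1
    return n_lines
```

Calling segment_file('wiki.zh.text', 'wiki.zh.text.seg') would process the file line by line, so the whole corpus never has to fit in memory.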

Then train the model with the word2vec tool:
> python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
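
The training code in this document feeds the segmented file to gensim through LineSentence, which treats each line as one text and splits it on whitespace. The class below is a simplified stand-in written for illustration, not gensim's implementation; SegmentedLines is a hypothetical name. It is a class with __iter__ rather than a one-shot generator because training passes over the corpus more than once.

```python
class SegmentedLines:
    """Restartable iterable over a file with one whitespace-tokenized text per line."""

    def __init__(self, path, max_sentence_length=10000):
        self.path = path
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                # Very long lines are cut into chunks; empty lines yield nothing.
                for start in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[start:start + self.max_sentence_length]
```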

## 2. Testing the Model

In [1]:
import gensim

In [2]:
model = gensim.models.Word2Vec.load('./files/wiki.zh.text.model')

In [3]:
model.wv.most_similar('足球')

[('足球运动', 0.5982134342193604),
 ('冰球', 0.5636489987373352),
 ('排球', 0.5608409643173218),
 ('板球', 0.5432296991348267),
 ('手球', 0.5414480566978455),
 ('英超球', 0.5134667158126831),
 ('足球联赛', 0.5115091800689697),
 ('籃球', 0.5071338415145874),
 ('美式足球', 0.5016480088233948),
 ('德甲球', 0.5015369653701782)]
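
The scores in these lists are cosine similarities between the query word's vector and every other vector in the vocabulary, sorted in descending order. The sketch below shows that ranking in plain Python. The three-dimensional vectors are invented for illustration and have nothing to do with the trained model; cosine and most_similar here are stand-in functions, not gensim's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Return the topn (word, score) pairs closest to word, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "足球": [0.9, 0.1, 0.0],
    "排球": [0.8, 0.3, 0.1],
    "衣服": [0.0, 0.2, 0.9],
}
print(most_similar(toy_vectors, "足球", topn=2))
```

Because cosine similarity is symmetric, a pair of words gets the same score whichever one is used as the query, up to floating-point rounding.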

In [4]:
model.wv.most_similar('男人')

[('女人', 0.7860022187232971),
 ('傻瓜', 0.5513915419578552),
 ('家伙', 0.5501405000686646),
 ('女孩', 0.5029377341270447),
 ('女孩子', 0.49964481592178345),
 ('爸爸', 0.48556721210479736),
 ('老公', 0.478860080242157),
 ('小孩', 0.47839421033859253),
 ('小妹妹', 0.4783257842063904),
 ('心眼', 0.47733354568481445)]

In [5]:
model.wv.most_similar('女人')

[('男人', 0.7860022783279419),
 ('女孩', 0.5070950984954834),
 ('家伙', 0.5007292032241821),
 ('陌生人', 0.49460887908935547),
 ('傻瓜', 0.48818522691726685),
 ('基佬', 0.4780755639076233),
 ('撒嬌', 0.4745919406414032),
 ('老公', 0.46539241075515747),
 ('女明星', 0.4597305357456207),
 ('老婆', 0.458354651927948)]

In [6]:
model.wv.most_similar('青蛙')

[('烏龜', 0.6122806668281555),
 ('猴子', 0.6067733764648438),
 ('老鼠', 0.6040736436843872),
 ('螃蟹', 0.5925276875495911),
 ('章魚', 0.591739296913147),
 ('巫婆', 0.5903116464614868),
 ('狐狸', 0.5880568027496338),
 ('小狗', 0.5866398811340332),
 ('蟑螂', 0.5739650726318359),
 ('蚱蜢', 0.5694110989570618)]

In [7]:
model.wv.most_similar('姨夫')

[('侄媳妇', 0.6650698184967041),
 ('儿媳', 0.6523791551589966),
 ('外孙女', 0.6358482241630554),
 ('师兄', 0.6333339214324951),
 ('伯婆', 0.6228511333465576),
 ('姑夫', 0.6187611818313599),
 ('二女儿', 0.6163100600242615),
 ('郭伊助', 0.6098355054855347),
 ('嫫', 0.6084852814674377),
 ('孙女', 0.6060229539871216)]

In [9]:
model.wv.most_similar('衣服')

[('鞋子', 0.7649213671684265),
 ('衣物', 0.7573763728141785),
 ('裙子', 0.7024389505386353),
 ('大衣', 0.6892093420028687),
 ('外套', 0.6809908151626587),
 ('外衣', 0.6697770357131958),
 ('上衣', 0.6558798551559448),
 ('內褲', 0.6488358378410339),
 ('褲子', 0.6458038687705994),
 ('穿着', 0.6456842422485352)]

In [10]:
model.wv.most_similar('公安局')

[('检察院', 0.7569843530654907),
 ('纪委', 0.7540599703788757),
 ('北京市公安局', 0.7502676248550415),
 ('公安机关', 0.7441784143447876),
 ('公安分局', 0.7345539331436157),
 ('财政局', 0.7250583171844482),
 ('工商局', 0.7201695442199707),
 ('公安厅', 0.719853937625885),
 ('司法局', 0.7191680669784546),
 ('县公安局', 0.7026969790458679)]

In [11]:
model.wv.most_similar('铁道部')

[('中国铁道部', 0.779641330242157),
 ('国家计委', 0.7638068199157715),
 ('北京市政府', 0.7401143312454224),
 ('北京市人民政府', 0.6928790807723999),
 ('柳州铁路局', 0.6867716312408447),
 ('国家经委', 0.6844226121902466),
 ('国家教委', 0.6718689203262329),
 ('批复', 0.6703631281852722),
 ('广深铁路', 0.6619844436645508),
 ('国家文物局', 0.6578491926193237)]

In [12]:
model.wv.most_similar('清华大学')

[('北京大学', 0.8454896211624146),
 ('复旦大学', 0.7984063625335693),
 ('中国人民大学', 0.7835922241210938),
 ('武汉大学', 0.7804229259490967),
 ('同济大学', 0.7738265991210938),
 ('北京师范大学', 0.7696689367294312),
 ('南开大学', 0.7650592923164368),
 ('南京大学', 0.7648913264274597),
 ('天津大学', 0.7628638744354248),
 ('浙江大学', 0.7626370191574097)]

In [13]:
model.wv.most_similar('卫视')

[('衛視', 0.6956472992897034),
 ('湖南卫视', 0.6864842176437378),
 ('经视', 0.6304373741149902),
 ('影视频道', 0.6157732605934143),
 ('上檔', 0.6006176471710205),
 ('爱奇艺', 0.5971043705940247),
 ('金鹰', 0.5911879539489746),
 ('江蘇衛視', 0.5899255275726318),
 ('黄金档', 0.583237886428833),
 ('电影频道', 0.5815383791923523)]

In [14]:
model.wv.most_similar('习近平')

[('江泽民', 0.8337889313697815),
 ('胡锦涛', 0.8161196708679199),
 ('邓小平', 0.7476147413253784),
 ('温家宝', 0.7406492233276367),
 ('赵紫阳', 0.7238784432411194),
 ('朱镕基', 0.7178268432617188),
 ('胡耀邦', 0.7143329381942749),
 ('李克强', 0.7042862772941589),
 ('华国锋', 0.703812301158905),
 ('王岐山', 0.7010064125061035)]

In [15]:
model.wv.most_similar('林丹')

[('谌龙', 0.8853179216384888),
 ('傅海峰', 0.875369131565094),
 ('李宗伟', 0.8690844178199768),
 ('鲍春来', 0.8685038089752197),
 ('陈金', 0.8660693168640137),
 ('谢杏芳', 0.8643622994422913),
 ('李雪芮', 0.8565813302993774),
 ('张楠', 0.8555790185928345),
 ('徐晨', 0.851974368095398),
 ('蔡赟', 0.8502441048622131)]

In [16]:
model.wv.most_similar('语言学')

[('文字学', 0.7880354523658752),
 ('语音学', 0.7806053161621094),
 ('逻辑学', 0.7591598629951477),
 ('语义学', 0.7557202577590942),
 ('历史学', 0.7536841630935669),
 ('修辞学', 0.7435493469238281),
 ('音韵学', 0.7357654571533203),
 ('人类学', 0.7328830361366272),
 ('社会学', 0.7232593894004822),
 ('方法论', 0.7215195894241333)]

In [17]:
model.wv.most_similar('计算机')

[('电脑', 0.7502233386039734),
 ('电子计算机', 0.7116537094116211),
 ('图像处理', 0.6950002312660217),
 ('计算机网络', 0.68513023853302),
 ('图形学', 0.683332622051239),
 ('计算器', 0.6828181743621826),
 ('信号处理', 0.6819606423377991),
 ('集成电路', 0.677963376045227),
 ('超级计算机', 0.659066379070282),
 ('计算机技术', 0.6519860625267029)]

In [18]:
model.wv.similarity('计算机','自动化')

0.6177578890463322

In [19]:
model.wv.similarity('女人','男人')

0.7860021825853618

In [21]:
model.wv.doesnt_match('早餐 晚餐 午餐 中心'.split())

'中心'
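
One way to see why '中心' is picked: normalize each word vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below follows that idea in plain Python. The two-dimensional vectors are invented for illustration, and unit and doesnt_match are stand-in functions written for this example rather than gensim's code.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    # The dot product of two unit vectors is their cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: three meal words cluster together, '中心' does not.
toy_vectors = {
    "早餐": [0.9, 0.1],
    "午餐": [0.8, 0.2],
    "晚餐": [0.85, 0.15],
    "中心": [0.1, 0.9],
}
print(doesnt_match(toy_vectors, "早餐 晚餐 午餐 中心".split()))
```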

---

## 3. Summary

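Three common uses of a word2vec model:
- model.wv.most_similar('w1')  list the words closest to w1
- model.wv.similarity('w1','w2')   measure how similar two words are
- model.wv.doesnt_match('w1 w2 w3'.split())  find the word that does not belong with the rest of the group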

#### What the XML-to-text script actually does

Note: do not run this cell; it is shown for reference only.

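In [23]:

from gensim.corpora import WikiCorpus

# Parse the compressed dump. Passing dictionary={} skips the extra pass that builds a vocabulary.
# lemmatize=False is a gensim 3.x argument; gensim 4.x removed it, so omit it there.
wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})

# Write one article per line, with tokens separated by spaces.
output = open('wiki.zh.text', 'w', encoding='utf-8')
for text in wiki.get_texts():
    output.write(' '.join(text) + '\n')

output.close()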

#### Segmenting with the jieba command line

> python -m jieba -d ' ' /Users/zoe/Documents/GitHub/July-NLP/Lec\ 09\ Word2Vec/files/wiki.zh.text > /Users/zoe/Documents/GitHub/July-NLP/Lec\ 09\ Word2Vec/files/wiki.zh.text.seg

#### Training the word2vec model from wiki.zh.text.seg

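In [24]:
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [None]:
# size is the gensim 3.x name for the vector dimensionality; gensim 4.x renamed it to vector_size.
model = Word2Vec(LineSentence('wiki.zh.text.seg'), size=400, window=5, min_count=5,
                 workers=multiprocessing.cpu_count())

# Save the full model (reloadable with Word2Vec.load) and the vectors in word2vec text format.
model.save('wiki.zh.text.model')
model.wv.save_word2vec_format('wiki.zh.text.model.vector')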
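
With its default settings, save_word2vec_format writes plain text: a header line holding the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its values, all separated by spaces. The sketch below reads such a file without gensim, which can be handy for a quick inspection. load_text_vectors and its limit parameter are hypothetical names introduced here, and the code assumes that no word contains a space.

```python
def load_text_vectors(path, limit=None):
    """Read a word2vec text-format file into a dict of word -> list of floats.

    limit, if given, stops after that many words.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<vocab_size> <dimension>"
        vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            if limit is not None and len(vectors) >= limit:
                break
            parts = line.rstrip("\n").split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
            vectors[word] = values
    return vectors
```

Passing a small limit is useful with a 400-dimensional model, since the full text file can be large.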
