# Gensim

This article covers the basic usage of Gensim.

## 1. Using pre-trained word vectors

#### 1. Load pre-trained word2vec model

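Pre-trained models consume a lot of memory. If memory is limited, pass `limit=200000` to cap the number of words loaded into the lexicon. The risk is that a rare word may fall outside the truncated lexicon, which hurts NLP performance. A common practice is therefore to limit the lexicon during the development phase and load the full vocabulary in production.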

In [1]:
from gensim.models.keyedvectors import KeyedVectors

In [2]:
word2vec_bin_file = '/Users/chenwang/Workspace/datasets/GoogleNews-vectors-negative300.bin.gz'

word_vectors = KeyedVectors.load_word2vec_format(word2vec_bin_file, binary=True, limit=200000)
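
Because a truncated lexicon can miss rare words, lookups are safer when they guard against out-of-vocabulary tokens. The following is a minimal sketch, assuming only that the vectors object supports `in` and `[]` lookups the way `KeyedVectors` does. The helper name `lookup_or_none` and the toy dictionary are hypothetical stand-ins, not part of Gensim.

```python
def lookup_or_none(vectors, token):
    """Return the vector for `token`, or None if it is out of vocabulary.

    `vectors` is any mapping-like object that supports `in` and `[]`;
    a plain dict stands in for the loaded KeyedVectors here.
    """
    if token in vectors:
        return vectors[token]
    return None


# Toy stand-in for a lexicon truncated by `limit` (hypothetical values).
toy_vectors = {"king": [0.1, 0.3], "queen": [0.2, 0.4]}

found = lookup_or_none(toy_vectors, "king")
missing = lookup_or_none(toy_vectors, "zyzzyva")
print(found, missing)
```

With the real model, the same helper could be called as `lookup_or_none(word_vectors, token)`, and the caller decides how to treat a `None` result (skip the token, substitute a default vector, and so on).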



#### 2. Most similar

The `most_similar` method efficiently finds the nearest neighbors of a given word vector.

In [6]:
word_vectors.most_similar(positive=['exception'], topn=5)

[('exceptions', 0.7204183340072632),
 ('notable_exception', 0.6068810820579529),
 ('notable_exceptions', 0.5186542272567749),
 ('caveat', 0.4860664904117584),
 ('Exceptions', 0.4639701545238495)]

In [11]:
word_vectors.most_similar(positive=['Chicago', 'Bulls'], topn=5)

[('Chicago_Bulls', 0.675159215927124),
 ('Windy_City', 0.6629956960678101),
 ('Blackhawks', 0.5667264461517334),
 ('White_Sox', 0.5656265616416931),
 ('Bears', 0.5585381984710693)]
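
To make the ranking concrete, here is a toy re-implementation of the idea in plain Python: average the unit-normalized vectors of the query words, then rank every other word by cosine similarity to that average. This is a sketch under stated assumptions, not Gensim's actual code. The function names (`unit`, `cosine`, `nearest`) and the 3-dimensional vectors are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def nearest(vectors, positive, topn=5):
    """Rank words outside `positive` by cosine similarity to the mean query vector."""
    units = [unit(vectors[w]) for w in positive]
    # Average the normalized query vectors component by component.
    query = [sum(col) / len(units) for col in zip(*units)]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in positive]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-d embeddings, chosen by hand for the example.
toy = {
    "exception": [1.0, 0.1, 0.0],
    "exceptions": [0.9, 0.2, 0.0],
    "caveat": [0.6, 0.6, 0.1],
    "banana": [0.0, 0.1, 1.0],
}
ranking = nearest(toy, ["exception"], topn=2)
print(ranking)
```

A linear scan like this is fine for four words. On a real vocabulary the same computation is normally done as a single matrix-vector product.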

#### 3. Vector operations

Analogy: king - man = ? - woman

? = king + woman - man

We can solve for the unknown word as follows:


In [15]:
word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=2)

[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827)]
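
The arithmetic behind the analogy can be traced by hand on a toy example. The sketch below uses hypothetical 2-dimensional vectors, with one axis loosely standing for "royalty" and the other for "gender". It computes `king + woman - man` directly, then picks the closest remaining word by cosine similarity. As far as I know, Gensim normalizes each vector before combining them; this sketch uses the raw toy vectors to keep the arithmetic visible.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

# ? = king + woman - man, computed component by component.
target = [k + w - m for k, w, m in zip(toy["king"], toy["woman"], toy["man"])]

# The query words themselves are excluded from the candidates.
candidates = {w: v for w, v in toy.items() if w not in ("king", "woman", "man")}
answer = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(answer)
```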

#### 4. Finding the least similar token

The `doesnt_match` method works much like anomaly detection: it returns the token that lies farthest from the other tokens.

In [13]:
lst_of_tokens = "potatoes milk cake computer".split()

word_vectors.doesnt_match(lst_of_tokens)

'computer'
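
One way to picture the outlier search is to compare each token with the average direction of the whole group and return the one that agrees with it least. The sketch below does this with hand-picked 2-dimensional vectors. The helper name `odd_one_out` and the vectors are hypothetical, and this is an approximation of the idea rather than Gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, tokens):
    """Return the token whose unit vector has the smallest dot product with the group mean."""
    units = {t: unit(vectors[t]) for t in tokens}
    # Mean direction of the group, re-normalized to unit length.
    mean = unit([sum(col) / len(units) for col in zip(*units.values())])
    return min(tokens, key=lambda t: sum(a * b for a, b in zip(units[t], mean)))


# Hypothetical 2-d embeddings: axis 0 ~ "food", axis 1 ~ "technology".
toy = {
    "potatoes": [0.9, 0.1],
    "milk": [0.8, 0.2],
    "cake": [0.85, 0.05],
    "computer": [0.1, 0.9],
}
outlier = odd_one_out(toy, "potatoes milk cake computer".split())
print(outlier)
```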

#### 5. Computing similarity



In [16]:
word_vectors.similarity('princess', 'queen')

0.7070532

In [18]:
word_vectors.similarity('king', 'queen')

0.6510957
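
The score returned by `similarity` is the cosine similarity of the two word vectors. For vectors $u$ and $v$ of dimension $d$ (300 for this model), it can be written as:

$$
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
= \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \, \sqrt{\sum_{i=1}^{d} v_i^2}}
$$

The value lies in $[-1, 1]$, and a larger value means the two vectors point in more similar directions. By this measure, `princess` (0.707) is closer to `queen` than `king` (0.651) is.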

#### 6. Get word vectors

Use `[]` indexing or the `get_vector()` method.

In [22]:
word_vector_king = word_vectors['king']
word_vector_king


array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01, ...

In [23]:
len(word_vector_king)  # 300d


300

## 2. Training your own word vectors

Domain-specific word vectors can sometimes improve the performance (accuracy) of your NLP application. Below, we show how to train your own word vectors.