Python gensim cannot load the word vector file #8

Closed
sudazzk opened this issue May 15, 2018 · 5 comments
Comments

@sudazzk

sudazzk commented May 15, 2018

D:\Program\Anaconda3\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File ".\zzk_word2vec.py", line 101, in <module>
    test_word_embedding('D:\data\pretrain_word2vec\Chinese-Word-Vectors\sgns.zhihu.char\sgns.zhihu.char')
  File ".\zzk_word2vec.py", line 76, in test_word_embedding
    model = gensim.models.KeyedVectors.load_word2vec_format(vector_file, binary=False, encoding='utf8')
  File "D:\Program\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 250, in load_word2vec_format
    parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
  File "D:\Program\Anaconda3\lib\site-packages\gensim\utils.py", line 242, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

@sudazzk
Author

sudazzk commented May 15, 2018

I load the Zhihu Q&A vectors, sgns.zhihu.char, with this function:
model = gensim.models.KeyedVectors.load_word2vec_format(vector_file, binary=False, encoding='utf8')
and it raises the error above.
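
For anyone who wants to see which lines of a vector file trigger this kind of error before choosing a workaround, here is a minimal standard-library sketch. The helper name `find_undecodable_lines` and the file `demo.vec` are hypothetical, and the demo file is a tiny stand-in written on the spot, not the real sgns.zhihu.char data.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, exception) pairs for lines that fail strict decoding."""
    bad = []
    # Read raw bytes so that decoding happens line by line under our control.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((lineno, exc))
    return bad


# Build a small stand-in file (hypothetical data) in word2vec text layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "wb") as f:
        f.write(b"2 3\n")
        f.write("好".encode("utf-8") + b" 0.1 0.2 0.3\n")
        # A 3-byte UTF-8 character cut off after its second byte.
        f.write(b"\xe5\xa5 0.4 0.5 0.6\n")
    bad_lines = find_undecodable_lines(demo_path)

for lineno, exc in bad_lines:
    print(f"line {lineno}: {exc.reason}")
```

Pointing the helper at the real file path should list the affected entries, which helps judge how much data an "ignore" style workaround would alter.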

@pyanh

pyanh commented May 16, 2018

@sudazzk I load the file with the same function and it runs fine.

from gensim.models.keyedvectors import KeyedVectors
w2v_model = KeyedVectors.load_word2vec_format("Chinese-Word-Vectors/sgns.zhihu.char", binary=False, unicode_errors='ignore')
Set the last argument to unicode_errors='ignore'.
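
To illustrate what that setting changes: the traceback above suggests gensim hands `unicode_errors` to the decoder as its `errors` argument, one line at a time. The sketch below mimics that step with plain Python only. `parse_vector_line` is a hypothetical helper, not gensim's code, and the sample bytes are made up.

```python
def parse_vector_line(raw_line, encoding="utf-8", unicode_errors="strict"):
    """Decode one raw line of a text-format vector file into (word, values)."""
    text = raw_line.rstrip().decode(encoding, errors=unicode_errors)
    parts = text.split(" ")
    return parts[0], [float(x) for x in parts[1:]]


# "知" followed by a truncated multi-byte character, then two vector components.
raw = "知".encode("utf-8") + b"\xe4\xb9 0.5 -0.25\n"

# Strict decoding fails on the truncated character.
try:
    parse_vector_line(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops the broken bytes, so the word changes.
word_ignore, values_ignore = parse_vector_line(raw, unicode_errors="ignore")

# "replace" keeps a visible U+FFFD marker where the bytes were broken.
word_replace, values_replace = parse_vector_line(raw, unicode_errors="replace")

print(strict_failed, repr(word_ignore), repr(word_replace))
```

One thing to keep in mind is that "ignore" can shorten a word so that it collides with another vocabulary entry, while "replace" at least leaves a marker that is easy to search for afterwards.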

@shenshen-hungry
Collaborator

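I have updated the word vectors, so the Unicode encoding problem should no longer occur.
If you still run into problems, you can pass an extra argument when opening the file: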

open(filename, errors='ignore')
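
As a rough sketch of how that `open` call could be used to read a text-format vector file without gensim: the function `load_text_vectors` and the demo file are hypothetical, and the code assumes the usual word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line). It also passes `encoding="utf-8"` explicitly, which is an addition to the call above. Without it, `open` falls back to the platform default encoding, which on Windows is often not UTF-8.

```python
import os
import tempfile


def load_text_vectors(path, encoding="utf-8"):
    """Read a word2vec-style text file into a dict, dropping undecodable bytes."""
    vectors = {}
    with open(path, encoding=encoding, errors="ignore") as f:
        header = f.readline().split()
        dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1 or not parts[0]:
                continue  # skip lines that do not match the expected layout
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# Stand-in file (hypothetical data) with one damaged entry.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "wb") as f:
        f.write(b"3 2\n")
        f.write("知乎".encode("utf-8") + b" 0.1 0.2\n")
        # "问" followed by a truncated multi-byte character.
        f.write("问".encode("utf-8") + b"\xe7\xad 0.3 0.4\n")
        f.write("答".encode("utf-8") + b" 0.5 0.6\n")
    vectors, dim = load_text_vectors(demo_path)

print(dim, sorted(vectors))
```

The damaged entry still loads, but under a shortened key, so it is worth checking the result for unexpected or duplicate words.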

@jly8866

jly8866 commented Nov 8, 2018

Pretty good.

@jly8866

jly8866 commented Nov 9, 2018

from gensim.models.keyedvectors import KeyedVectors
w2v_model = KeyedVectors.load_word2vec_format("Chinese-Word-Vectors/sgns.zhihu.char", binary=False, unicode_errors='ignore')
Set the last argument to unicode_errors='ignore'.

This suggestion is the one that works.
