not chinese #6

Closed
niutyut opened this issue Sep 9, 2021 · 4 comments

@niutyut

niutyut commented Sep 9, 2021

image
file_in contains the text shown in this picture.
model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c( "鹰"), type = "nearest", top_n = 5)
lookslike

Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 鹰

But 鹰 is in the text shown in the picture. Can you provide an example in Chinese?

@jwijffels
Contributor

I ran this on a Linux box with the file below. I'm not sure the result makes sense, as I don't speak Chinese.
example.txt

> x <- readLines("example.txt", encoding = "UTF-8")
> cat(x)
形态 开花的苹果树 落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不规则的纵裂或片状剥落,小枝光滑。叶序为单叶互生,椭圆至卵圆形,叶缘有锯齿。伞房花序,花瓣白色,含苞时带粉红色,雄蕊20,花柱5,大多数品种自花不育,需种植授粉树。果实为仁果,颜色及大小因品种而异。蘋果膳食纖維含量很豐富﹐也含有大量的果膠﹐對於整腸及調整腸道菌叢生態大有幫助。是一種綠色水果 [编辑] 习性 喜光,喜微酸性到中性土壤。最适于土层深厚,富含有机质,心土通气排水良好的沙质土壤。 [编辑] 品种 世界苹果产量 蘋果有超过7,500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red Delicious(香港稱地利蛇果,簡稱蛇果;台灣稱五爪蘋果)、Gold Delicious等[1]。英國北威爾斯巴德西島(Bardsey Island)則在近年發現新品種,比普通的果樹更健康,除了蟲害以外,並不會患病,被媒體稱為「世界上最罕有的蘋果」。除鮮食的品種外,尚有烹調用的蘋果。由於蘋果的果酸有保持水份的作用,適宜烤焗。
> w2v <- word2vec::word2vec(x, min_count = 0)
> predict(w2v, newdata = "形态", type = "nearest")
$形态
  term1                                                                term2 similarity rank
1  形态                                                                    1  0.5622963    1
2  形态                                      。英國北威爾斯巴德西島(Bardsey  0.5039170    2
3  形态                                                                 习性  0.4285294    3
4  形态                                                         世界苹果产量  0.4036402    4
5  形态 落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不  0.3262694    5
6  形态                                                          蘋果有超过7  0.2895889    6
7  形态                                                         开花的苹果树  0.2073026    7
8  形态                                                               、Gold  0.1607383    8

> summary(w2v)
 [1] "500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red"        
 [2] "世界苹果产量"                                                            
 [3] "喜光,喜微酸性到中性土壤。最适于土层深厚,富含有机质,心土通气排水\xe8"  
 [4] "</s>"                                                                    
 [5] "。英國北威爾斯巴德西島(Bardsey"                                         
 [6] "1"                                                                       
 [7] "Delicious等"                                                             
 [8] "开花的苹果树"                                                            
 [9] "蘋果有超过7"                                                             
[10] "、Gold"                                                                  
[11] "落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不"    
[12] "编辑"                                                                    
[13] "Island)則在近年發現新品種,比普通的果樹更健康,除了蟲害以外,並不會\xe6"
[14] "习性"                                                                    
[15] "形态"                                                                    
[16] "品种"                                                                    
[17] "Delicious(香港稱地利蛇果,簡稱蛇果;台灣稱五爪蘋果"      

@niutyut
Author

niutyut commented Sep 9, 2021

library(word2vec)
x <- readLines("example.txt", encoding = "UTF-8")
cat(x)
w2v <- word2vec::word2vec(x, min_count = 0)
predict(w2v, newdata = "形态", type = "nearest")

The result is below:
predict(w2v, newdata = "形态", type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 形态

How can I solve this problem? My system is Windows. Thank you!

@jwijffels
Contributor

Write your cleaned text to a file (test.txt in the example below) and train word2vec from that file instead of passing a character vector.

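library(readr)
library(word2vec)
## read the raw text as UTF-8 (example.txt is the file attached above)
x <- readLines("example.txt", encoding = "UTF-8")
## keep non-ASCII and non-alphabetic characters, otherwise the Chinese text is stripped
x <- txt_clean_word2vec(x, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
## write the cleaned text to disk and train from the file path
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 0) ## change the hyperparameters to suit your own data
## pick 2 terms that are in the model vocabulary and look up their nearest neighbours
terminology <- summary(model)
example <- sample(terminology, size = 2)
example
predict(model, newdata = example, type = "nearest")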

@jwijffels
Contributor

R package version 0.4.0 may solve this issue. It allows building a word2vec model from a list of tokenised sentences.
In addition, the file-based approach (word2vec.character) now uses useBytes = TRUE when it writes the text data to a file before training.
So it seems to me you can choose either of these 2 options.
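
For the list-based option, a minimal sketch could look like the code below. It assumes word2vec >= 0.4.0 and that example.txt is the UTF-8 file attached earlier in this thread. The character-level split is only a stand-in so that the example runs without extra packages; it is not something the package does for you, and a proper Chinese word segmenter would normally take its place. I have not verified this on Windows.

```r
library(word2vec)

# Assumption: example.txt is the UTF-8 file attached earlier in this thread
x <- readLines("example.txt", encoding = "UTF-8")

# Crude character-level tokenisation, for illustration only:
# each line becomes a vector of single characters, whitespace dropped.
# A real Chinese word segmenter would normally replace this step.
tokens <- strsplit(x, split = "")
tokens <- lapply(tokens, function(tok) tok[!grepl("^[[:space:]]*$", tok)])
tokens <- tokens[lengths(tokens) > 0]

# word2vec >= 0.4.0 accepts a list of tokenised sentences
model <- word2vec(x = tokens, type = "cbow", dim = 15, iter = 20, min_count = 1)

# Query with a term taken from the vocabulary, so the lookup cannot miss
vocab <- summary(model, type = "vocabulary")
predict(model, newdata = vocab[1], type = "nearest", top_n = 5)
```

Taking the query term from summary(model) first is a quick way to check which tokens actually ended up in the dictionary before asking for a specific word such as 鹰.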

Closing, feel free to re-open if needed.
