not chinese #6

niutyut · 2021-09-09T13:11:51Z

file_in contains the text shown in this picture:
model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c( "鹰"), type = "nearest", top_n = 5)
lookslike

Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 鹰

However, 鹰 does appear in the text shown in the picture. Can you provide an example in Chinese?

jwijffels · 2021-09-09T15:22:24Z

I ran this on a Linux box with the file below. I'm not sure whether the output makes sense, as I don't speak Chinese.
example.txt

> x <- readLines("example.txt", encoding = "UTF-8")
> cat(x)
形态 开花的苹果树 落叶乔木，树高可达15米，栽培条件下一般高3～5米。树干灰褐色，老皮有不规则的纵裂或片状剥落，小枝光滑。叶序为单叶互生，椭圆至卵圆形，叶缘有锯齿。伞房花序，花瓣白色，含苞时带粉红色，雄蕊20，花柱5，大多数品种自花不育，需种植授粉树。果实为仁果，颜色及大小因品种而异。蘋果膳食纖維含量很豐富﹐也含有大量的果膠﹐對於整腸及調整腸道菌叢生態大有幫助。是一種綠色水果 [编辑] 习性 喜光，喜微酸性到中性土壤。最适于土层深厚，富含有机质，心土通气排水良好的沙质土壤。 [编辑] 品种 世界苹果产量 蘋果有超过7,500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red Delicious（香港稱地利蛇果，簡稱蛇果；台灣稱五爪蘋果)、Gold Delicious等[1]。英國北威爾斯巴德西島（Bardsey Island）則在近年發現新品種，比普通的果樹更健康，除了蟲害以外，並不會患病，被媒體稱為「世界上最罕有的蘋果」。除鮮食的品種外，尚有烹調用的蘋果。由於蘋果的果酸有保持水份的作用，適宜烤焗。
> w2v <- word2vec::word2vec(x, min_count = 0)
> predict(w2v, newdata = "形态", type = "nearest")
$形态
  term1                                                                term2 similarity rank
1  形态                                                                    1  0.5622963    1
2  形态                                      。英國北威爾斯巴德西島（Bardsey  0.5039170    2
3  形态                                                                 习性  0.4285294    3
4  形态                                                         世界苹果产量  0.4036402    4
5  形态 落叶乔木，树高可达15米，栽培条件下一般高3～5米。树干灰褐色，老皮有不  0.3262694    5
6  形态                                                          蘋果有超过7  0.2895889    6
7  形态                                                         开花的苹果树  0.2073026    7
8  形态                                                               、Gold  0.1607383    8

> summary(w2v)
 [1] "500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red"        
 [2] "世界苹果产量"                                                            
 [3] "喜光，喜微酸性到中性土壤。最适于土层深厚，富含有机质，心土通气排水\xe8"  
 [4] "</s>"                                                                    
 [5] "。英國北威爾斯巴德西島（Bardsey"                                         
 [6] "1"                                                                       
 [7] "Delicious等"                                                             
 [8] "开花的苹果树"                                                            
 [9] "蘋果有超过7"                                                             
[10] "、Gold"                                                                  
[11] "落叶乔木，树高可达15米，栽培条件下一般高3～5米。树干灰褐色，老皮有不"    
[12] "编辑"                                                                    
[13] "Island）則在近年發現新品種，比普通的果樹更健康，除了蟲害以外，並不會\xe6"
[14] "习性"                                                                    
[15] "形态"                                                                    
[16] "品种"                                                                    
[17] "Delicious（香港稱地利蛇果，簡稱蛇果；台灣稱五爪蘋果"

niutyut · 2021-09-09T23:59:44Z

library(word2vec)
x <- readLines("example.txt", encoding = "UTF-8")
cat(x)
w2v <- word2vec::word2vec(x, min_count = 0)
predict(w2v, newdata = "形态", type = "nearest")

The result is below:
predict(w2v, newdata = "形态", type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 形态

How can I solve this problem? I am on Windows. Thank you!

jwijffels · 2021-09-10T07:37:15Z

Write your cleaned text to a file (test.txt in the example below) and train word2vec from that file instead of passing a character vector.

library(readr)
library(word2vec)
x <- txt_clean_word2vec(x, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 0) ## adjust the hyperparameters to your own needs
terminology <- summary(model)
example <- sample(terminology, size = 2)
example
predict(model, newdata = example, type = "nearest")
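
For those who prefer not to depend on readr, the file-writing step can also be sketched in base R. This is a minimal sketch, not part of the package: the sample text and the temporary file path are hypothetical, and it assumes the goal is only to get UTF-8 bytes onto disk without R re-encoding them to the native Windows code page.

```r
# Hypothetical sample text; replace with your own cleaned character vector.
x <- c("形态 开花的苹果树", "习性 喜光")

# Make sure the strings are UTF-8 before writing.
x_utf8 <- enc2utf8(x)

# Temporary file used only for this illustration.
path <- tempfile(fileext = ".txt")

# useBytes = TRUE writes the bytes as they are instead of translating
# them to the native encoding of the session.
writeLines(x_utf8, con = path, useBytes = TRUE)

# Read the file back, declaring it as UTF-8, to check the round trip.
back <- readLines(path, encoding = "UTF-8")
round_trip_ok <- all(back == x_utf8)
print(round_trip_ok)
```

If the round trip succeeds, the path can be passed to word2vec(x = path, ...) in the same way as test.txt above.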

jwijffels · 2023-10-05T14:41:18Z

R package version 0.4.0 may solve this issue. It allows building a word2vec model from a list of tokenised sentences.
In addition, the file-based approach (word2vec.character) now uses useBytes = TRUE when writing text data to files before training.
So it seems to me you can choose either of the 2 options.
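
A rough sketch of the tokenised-list option follows. The toy sentences are hypothetical, and splitting into single characters is an assumption made purely for illustration, since Chinese text has no spaces between words; a proper word segmenter would normally give better tokens. The training call only runs if word2vec 0.4.0 or later is installed.

```r
# Hypothetical toy corpus; replace with your own text.
sentences <- enc2utf8(c("鹰是一种猛禽", "鹰在天空中飞翔", "苹果是一种水果"))

# Naive tokenisation: one token per character (illustration only).
tokens <- strsplit(sentences, split = "")

# The query term is now a token of its own rather than part of a long string.
has_term <- "鹰" %in% unlist(tokens)
print(has_term)

# Train only when the word2vec package (>= 0.4.0) is available.
if (requireNamespace("word2vec", quietly = TRUE) &&
    utils::packageVersion("word2vec") >= "0.4.0") {
  # The toy corpus is repeated so there is a little more data to train on.
  model <- word2vec::word2vec(x = rep(tokens, 50), type = "cbow",
                              dim = 15, iter = 20, min_count = 1)
  print(predict(model, newdata = "鹰", type = "nearest", top_n = 5))
}
```

Because the tokens are passed as a list, no intermediate text file is written, which sidesteps the re-encoding step altogether.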

Closing, feel free to re-open if needed.

jwijffels mentioned this issue Sep 10, 2021

avoid reencoding when writing out files #7

Closed

jwijffels closed this as completed Oct 5, 2023
