Quick start - error in type.convert #10

Closed
loudermilk opened this issue May 1, 2016 · 2 comments

@loudermilk

Problem:
Running through the Quick Start instructions in README.md, the process dies with an error in type.convert.

> model = train_word2vec("cookbooks.txt",output="cookbooks.vectors",threads = 3,vectors = 100,window=12)
Starting training using file /home/brandon/repo/stack/data/cookbooks.txt
Vocab size: 32421
Words in train file: 10577282
Alpha: 0.000195  Progress: 99.24%  Words/thread/sec: 18.39k  
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals,  : 
  invalid multibyte string at '<f6>(<83>;<a4><d0>�;��{<bb>{<d4>V<bb><b8>�<b3>:q<fd>E;ףv:<9a><99>]9<f6>(l<bb><d7>c�;'

I tried retraining the model on a small subset of the cookbooks, and that failed in the same way.

> model = train_word2vec("cookbooks.txt",output="cookbooks.vectors",threads = 3,vectors = 100,window=12, force=T)
Starting training using file /home/brandon/repo/stack/data/cookbooks.txt
Vocab size: 5331
Words in train file: 345615
Alpha: 0.000073  Progress: 100.34%  Words/thread/sec: 19.73k  
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals,  : 
  invalid multibyte string at '<f6>(<83>;<a4><d0>�;��{<bb>{<d4>V<bb><b8>�<b3>:q<fd>E;ףv:<9a><99>]9<f6>(l<bb><d7>c�;'

It appears to be choking on an unusual character or unexpected byte. Was there a change in the way the cookbook data was initially saved versus how it is currently processed? The following warnings are also shown:

In addition: Warning messages:
1: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 1 appears to contain embedded nulls
2: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 2 appears to contain embedded nulls
3: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 5 appears to contain embedded nulls
4: In utils::read.table(filename, header = F, skip = 1, nrows = 1,  :
  line 1 appears to contain embedded nulls

Additional system details:
OS - Ubuntu 14.04
R - [1] "R version 3.2.3 (2015-12-10)"

bmschmidt added a commit that referenced this issue May 2, 2016
Fixing error in quickstart induced by switch to binary storage relying on extension names: #10
@bmschmidt
Owner

Thanks for the bug report. This was my fault, forgetting to test the walkthrough all the way through to the end.

It should be fixed now in the readme, by changing the filename from "cookbooks.vectors" to "cookbook_vectors.bin".

If you've already run the model, you can also pick up in the quick start by running `model = read.binary.vectors("cookbooks.vectors")`.

The problem was that I changed the default storage format from text to binary with the latest release. But I didn't change the extension on the saved file. Since the function guesses at the filetype using the extension, it was trying to read a binary file as text.
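
To illustrate the mismatch described above, here is a minimal sketch in base R. Both helper functions are hypothetical and are not the package's actual code; the assumption that a `.bin` suffix is what marks a file as binary is inferred from the corrected filename. The second helper shows one possible alternative: checking the first bytes of the file for NUL characters, which is what the "embedded nulls" warnings in the report were hinting at.

```r
# Hypothetical helper: guess the storage format from the file name alone.
# Assumption: only a ".bin" suffix is treated as binary.
guess_binary_from_extension <- function(filename) {
  grepl("\\.bin$", filename)
}

# Hypothetical alternative: inspect the first bytes of the file for NUL
# characters, which do not normally occur in a plain-text vectors file.
looks_binary <- function(filename, n = 4096L) {
  bytes <- readBin(filename, what = "raw", n = n)
  any(bytes == as.raw(0))
}

# The old quick start name is classified as text, so a binary file saved
# under that name would be handed to the text reader.
guess_binary_from_extension("cookbooks.vectors")     # FALSE
guess_binary_from_extension("cookbook_vectors.bin")  # TRUE
```

A content check like `looks_binary` is only a heuristic, but it does not depend on how the output file happens to be named.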

@loudermilk
Author

Thanks for your prompt attention. I hope to get back to this in a couple of days. Thanks again. ~b

