Vocabulary size is much smaller than it ought to be #7

KevinBretonnelCohen · 2016-01-20T00:54:28Z

"Vocab size" is way off. See the attached screen shot: 1.4 billion words of PubMedCentral author manuscripts, and the vocabulary size is 1,307, according to the status message output. Doesn't seem likely.

I'm using whatever version of wordVectors was on GitHub as of mid-January 2015. Not sure what version of RStudio--I think those 1.4 billion words of text have choked my laptop to death... OS X.

KevinBretonnelCohen · 2016-01-20T17:56:41Z

An additional data point regarding the small vocabulary size that's showing up on the status message: I'm trying to run train_word2vec() on a data set that's about 1/10 the size of the data set that I was using yesterday, and it's showing the same vocabulary size--see the attached screen shot...

KevinBretonnelCohen · 2016-01-20T18:04:09Z

Cut the data set down by another order of magnitude--same vocabulary size showing. See screen shot.

bmschmidt · 2016-01-20T19:25:10Z

Huh. How are you normalizing the text? I can untar a few PubMed abstracts and run `cat */*.txt | perl -pe 's/[^A-Za-z \n]/ /g;' > all.txt` to get a file that gives a vocabulary of 40,195 in 5.8 million words.
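A minimal sketch for checking the vocabulary independently of the training status message: it applies the same Perl substitution as the command above and counts distinct whitespace-separated tokens. The file name `sample.txt` and the helper names `normalize` and `vocab_size` are placeholders for illustration, not part of wordVectors. The count is case-sensitive and is taken before any minimum-frequency cutoff a trainer might apply, so treat it as a rough upper bound rather than the number the trainer should report.

```shell
# Sketch only: sample.txt is a placeholder; point vocab_size at your own file.

# Replace everything except ASCII letters, spaces, and newlines with a space
# (the same substitution used in the command above).
normalize() {
  perl -pe 's/[^A-Za-z \n]/ /g;' "$1"
}

# Put one token per line, drop empty lines, and count the distinct tokens.
vocab_size() {
  normalize "$1" | tr -s ' ' '\n' | grep -v '^$' | sort -u | wc -l | tr -d ' '
}

# Tiny demonstration file: distinct tokens are The, cat, dog, sat.
printf 'The cat sat.\nThe dog sat!\n' > sample.txt
vocab_size sample.txt   # prints 4
```

If this count is in the tens of thousands or more while the status message still reports 1,307, the training file itself is probably fine and the problem lies in how it is being read.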

What's the output of `system("head -20 YOURFILENAME | cut -c 1-80")`? Does it look like real text? Another check: what are the first twenty `rownames()` of the trained model object?

Try rerunning `install_github("bmschmidt/wordVectors")`; maybe the update last week fixed it.

bmschmidt closed this as completed Mar 30, 2016
