Documentation for file iterators #69
Great question!

Regarding the question about reading files: in the example below I use the version from the 0.3 branch.

```r
library(text2vec)
library(data.table)

data("movie_review")

prepare <- function(txt) {
  txt %>%
    tolower %>%
    word_tokenizer
}

setDT(movie_review)
movie_review[, review_tokens := prepare(review)]
head(movie_review[, .(id, sentiment, review_tokens)])

v <- vocabulary(itoken(movie_review$review_tokens))
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(itoken(movie_review$review_tokens), vectorizer)
```

P.S. I vectorized the English Wikipedia (about 13 GB, 4M articles) in about 5 minutes on a 16-core machine, including vocabulary construction, pruning, and corpus construction. Hash-vectorization took ~2.5 minutes.
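For reference, hash-vectorization looks roughly like this; it skips the vocabulary pass, so the corpus needs only a single read. A minimal sketch (the `hash_size` value is just an illustrative choice):

```r
# tokens are hashed directly into a fixed-size feature space -- no vocabulary pass needed
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 18)
dtm_hash <- create_dtm(itoken(movie_review$review_tokens), h_vectorizer)
dim(dtm_hash)
```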
Thanks for the explanation. So it seems like I'm using it correctly.

One other question. Working with about 6,000 test files, I'm not seeing text2vec use multiple processor threads/cores to create the DTM. Is that just because the test data is small, or do I need to explicitly tell it how many cores to use? Just to give a sense of scale, I'm working with about 293 GB of text in 10.5 million documents. But for the particular thing that I'm trying to do, it is trivial for me to split that up into smaller chunks myself.

This is an awesome package, btw.
correct!
I haven't written new docs yet, but the following snippet demonstrates the new high-level API:

```r
library(text2vec)
library(data.table)   # for fread()
library(doParallel)
registerDoParallel(16)

start <- Sys.time()

# tab-separated wikipedia: "article_title \t article_body"
reader <- function(x) {
  fread(x, sep = '\t', header = FALSE, select = 2, colClasses = rep('character', 2))[[1]]
}
fls <- list.files("~/datasets/enwiki_splits/", full.names = TRUE)

# jobs are simply a list of itoken iterators; each element is a separate job in a separate process.
# after the jobs finish, the results are efficiently combined (especially efficiently for `dgTMatrix`).
jobs <- fls %>%
  split_into(64) %>%
  lapply(function(x) x %>% ifiles(reader_function = reader) %>% itoken)

v <- vocabulary(jobs)
dtm <- create_dtm(jobs, vocab_vectorizer(v), type = 'dgTMatrix')

finish <- Sys.time()
```

Also make sure your splits are roughly equal in size and not too small (to reduce overhead). In my case each split was ~200 MB; anything from 10 MB to 300 MB will probably also be fine. 1 MB splits will introduce too much overhead, and 1000 MB splits will trigger a much larger RAM footprint without any performance gain.
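If it helps, here is a quick way to sanity-check the split sizes before launching the jobs (a small base-R sketch, using the same illustrative directory as above):

```r
# report split sizes in MB so you can confirm they fall in a reasonable range (roughly 10-300 MB)
fls <- list.files("~/datasets/enwiki_splits/", full.names = TRUE)
sizes_mb <- file.size(fls) / 1024^2
summary(sizes_mb)
```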
🚀 Thanks very much for the helpful responses.
P.S. I hope you won't take this offer the wrong way, but I'd be happy to proofread/revise the documentation before the next CRAN release. If that would be helpful, maybe you could open a new issue and ping me? If it's not helpful, feel free to ignore.
Is there Python code for word2vec?
@lmullen it would be great! Yesterday I started writing the overview of v0.3. I'll create the 0.3 -> master pull request soon and add a link to the overview. After that I plan to wait about a week to collect feedback and then push the new version to CRAN.
Sounds good. I'll do as much as I can on it once you've finished.
@lmullen I have created an overview of text2vec 0.3: text2vec 0.3 announce. It would be great if you could check the docs and vignettes.
@dselivanov: I'll do my best to revise the docs and vignettes this weekend. Is a single pull request okay, or do you want changes broken up?
@lmullen a single PR will be okay. Thanks in advance.
Thanks
Done in 0.3.
I have a question about using file iterators. As mentioned in #65, it's necessary to iterate over the texts in a corpus twice: once to build the vocabulary, and then again to build the DTM. My assumption is that in the case of a file iterator, that means you are reading each file from disk twice. My guess is that for corpora that would fit into memory, one might be better off loading the texts into a character vector oneself, then iterating over that twice. But for corpora that don't fit into memory, reading each file twice would be the only way to construct the DTM. Is that correct, or is there a better way?
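For concreteness, the in-memory approach I have in mind would look roughly like this (an untested sketch; `fls` is a placeholder for the paths to my text files):

```r
library(text2vec)

# hypothetical vector of file paths
fls <- list.files("~/corpus/", full.names = TRUE)

# read everything into memory once...
docs <- vapply(fls, function(f) paste(readLines(f), collapse = " "), character(1))
tokens <- word_tokenizer(tolower(docs))

# ...then iterate over the in-memory tokens twice: once for the vocabulary, once for the DTM
v <- vocabulary(itoken(tokens))
dtm <- create_dtm(itoken(tokens), vocab_vectorizer(v))
```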