Documentation for file iterators #69

Closed
lmullen opened this issue Mar 10, 2016 · 14 comments
@lmullen
Contributor

lmullen commented Mar 10, 2016

I have a question about using file iterators. As mentioned in #65, it's necessary to iterate over the texts in a corpus twice, once for building the vocabulary, and then again to build the DTM. My assumption is that in the case of a file iterator, that means you are reading each file from disk twice. My guess is that for corpora that would fit into memory one might be better off loading the texts into a character vector one's self, then iterating over that twice. But for corpora that don't fit into memory, reading each file twice would be the only way to construct the DTM. Is that correct, or is there a better way?

@dselivanov
Owner

Great question!
I also don't like the idea of iterating through the input twice, but this doesn't seem to be an issue in real life. In fact, the very first version of text2vec needed only a single pass to construct the corpus / DTM (and had no vocabulary concept). The single-pass approach has two main disadvantages:

  1. Usually we don't know the vocabulary and the frequent words (stop words) in advance. This means we insert all these frequent terms into the corpus, and the DTM becomes much denser and heavier.
  2. If the user is able to construct the DTM, they usually need to prune very common and very uncommon terms. There is a transformer_filter_commons for that purpose (quite optimized). But as you can see, it involves 2 transpose operations on big DTM/TDM matrices. This usually increases RAM consumption by a factor of 3x-4x and can become a bottleneck!
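To make the two-pass idea concrete, here is a minimal base-R sketch. It is not text2vec code: the toy documents, the pruning rule (drop terms that occur in every document) and the dense matrix are all assumptions chosen for illustration.

```r
# Toy corpus: each document is already tokenized (made-up data).
docs <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "sat", "down"),
  c("the", "end")
)

# Pass 1: document frequency of every term.
doc_freq <- table(unlist(lapply(docs, unique)))

# Prune before building the matrix: drop terms present in every document.
vocab <- names(doc_freq)[as.vector(doc_freq) < length(docs)]

# Pass 2: count only the terms that survived pruning.
counts <- vapply(
  docs,
  function(d) as.integer(table(factor(d, levels = vocab))),
  integer(length(vocab))
)
dtm <- t(counts)          # rows = documents, columns = terms
colnames(dtm) <- vocab
dtm
```

Because the very common term is removed before the second pass, it never enters the matrix, so no pruning (and no transposing) of a large matrix is needed afterwards.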

Regarding the question about reading files: if you use data.table::fread or readr, I/O is only a tiny fraction of the whole work. A notable fraction of the computation can be in the preprocessing step (stemming, regular expressions, etc.). An experienced user should try to avoid doing this work twice. For that purpose, in the 0.3 branch I created a new itoken constructor which takes a list of character vectors as input (a list of preprocessed tokens). There are two choices here.

  1. When the data doesn't fit into RAM
    1. Preprocess and tokenize the raw text file by file (so each file holds a list of character vectors). Save it in serialized form with saveRDS().
    2. Read these files with ifiles, using readRDS as the reader function.
  2. When the data fits into RAM
    1. Read and combine the documents into a single structure (say a data.table or data.frame).
    2. Preprocess and tokenize the raw text, and save it to another column.
    3. Create an itoken iterator from this new column.
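The first option (data that doesn't fit into RAM) can be sketched in base R as follows. This is only an illustration under stated assumptions: `tokenize_simple` is a hypothetical stand-in for your real preprocessing, and the two small text files are created on the fly in a temporary directory.

```r
# Hypothetical tokenizer: lower-case, then split on non-alphanumeric characters.
tokenize_simple <- function(txt) {
  strsplit(tolower(txt), "[^[:alnum:]]+")
}

# Illustration only: write two tiny raw text files (one document per line).
raw_dir <- file.path(tempdir(), "raw_text")
rds_dir <- file.path(tempdir(), "tokenized")
dir.create(raw_dir, showWarnings = FALSE)
dir.create(rds_dir, showWarnings = FALSE)
writeLines(c("The first document.", "A second document here."),
           file.path(raw_dir, "a.txt"))
writeLines("Another file, another document.", file.path(raw_dir, "b.txt"))

# Step 1: do the expensive preprocessing once per file and serialize the result.
for (f in list.files(raw_dir, pattern = "\\.txt$", full.names = TRUE)) {
  tokens <- tokenize_simple(readLines(f))   # list of character vectors
  saveRDS(tokens, file.path(rds_dir, paste0(basename(f), ".rds")))
}

# Step 2: every later pass reads the ready-made token lists back.
token_files <- list.files(rds_dir, pattern = "\\.rds$", full.names = TRUE)
first_file <- readRDS(token_files[1])
first_file[[1]]
```

In the workflow described above, `token_files` would then be handed to ifiles with readRDS as the reader function, so both the vocabulary pass and the DTM pass skip the tokenization work.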

In the example below I use the version from the 0.3 branch:

library(text2vec)
library(data.table)
library(magrittr)  # provides %>%

data("movie_review")

prepare <- function(txt) {
  txt %>% 
    tolower %>% 
    word_tokenizer
}
setDT(movie_review)
movie_review[, review_tokens:= prepare(review)]

head(movie_review[, .(id, sentiment, review_tokens)])

       id sentiment review_tokens
1: 5814_8         1 with,all,this,stuff,going,down,
2: 2381_9         1 the,classic,war,of,the,worlds,
3: 7759_3         0 the,film,starts,with,a,manager,
4: 3630_4         0 it,must,be,assumed,that,those,
5: 9495_8         1 superbly,trashy,and,wondrously,unpretentious,80,
6: 8196_8         1 i,dont,know,why,people,think,

v <- vocabulary(itoken(movie_review$review_tokens))
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(itoken(movie_review$review_tokens), vectorizer)

With hash_vectorizer, the DTM is created in a single pass.

P.S. I vectorized English Wikipedia (about 13 GB, 4M articles) in 5 minutes on a 16-core machine, including vocabulary construction, pruning and corpus construction. Hash vectorization took ~2.5 minutes.

@lmullen
Contributor Author

lmullen commented Mar 10, 2016

Thanks for the explanation. So it seems like I'm using ifiles() correctly here by passing its returned value on to itoken(), correct? But I could have just read the files into a data frame and pre-processed them myself. (In this case, I've got a specific vocabulary that I want to look for.)

  files_it <- ifiles(newspaper_pages, reader_function = read_file)
  pages_it <- itoken(files_it, preprocess_function = str_to_lower,
                     tokenizer = word_tokenizer)
  newspaper_corpus <- create_corpus(pages_it, vocab_vectorizer(biblical_vocab))
  newspaper_dtm <- get_dtm(newspaper_corpus, type = "dgTMatrix")

One other question. Working with about 6,000 test files, I'm not noticing that text2vec is using multiple processor threads/cores to create the DTM. Is that just because the test data is small, or do I need to explicitly let it know how many cores to use?

Just for an explanation of scale, I'm working with about 293 GB of text in 10.5 million documents. But for the particular thing that I'm trying to do, it is trivial for me to split that up into smaller chunks myself.

This is an awesome package, btw.

@dselivanov
Owner

So it seems like I'm using ifiles() correctly here by passing its returned value on to itoken(), correct?

correct!

I'm not noticing that text2vec is using multiple processor threads/cores to create the DTM. Is that just because the test data is small, or do I need to explicitly let it know how many cores to use?

I haven't written new docs yet, but the following snippet demonstrates new high-level APIs:

library(text2vec)
library(data.table)  # fread
library(magrittr)    # %>%
library(doParallel)
registerDoParallel(16)

start <- Sys.time()
# tab-separated wikipedia "article_title \t article_body"; keep only the body column
reader <- function(x) {
  fread(x, sep = '\t', header = FALSE, select = 2, colClasses = rep('character', 2))[[1]]
}

fls <- list.files("~/datasets/enwiki_splits/", full.names = TRUE)

# jobs are simply a list of itoken iterators. Each element is a separate job in a separate process.
# After they finish, the results will be efficiently combined (especially efficiently in the case of `dgTMatrix`).
jobs <- fls %>% 
  split_into(64) %>% 
  lapply(function(x) x %>% ifiles(reader_function = reader) %>% itoken)

v <- vocabulary(jobs)

dtm <- create_dtm(jobs, vocab_vectorizer(v), type = 'dgTMatrix')

finish <- Sys.time()

Also make sure your splits are roughly equal in size and not too small (to reduce overhead). For example, in my case the splits were ~200 MB each. 10 MB to 300 MB will probably also be OK. 1 MB will introduce too much overhead, and 1000 MB will trigger a much larger RAM footprint without any performance gain...
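If the input files differ a lot in size, one way to get roughly equal splits is to group them by total bytes before creating the jobs. The helper below is a hypothetical base-R sketch (it is not part of text2vec), and the sizes are made-up numbers; with real files you could obtain them with `file.size(fls)`.

```r
# Hypothetical helper: assign each file to one of `n_chunks` groups so that the
# total size per group is roughly equal. Greedy rule: take files from largest
# to smallest and put each one into the currently lightest group.
balance_chunks <- function(sizes, n_chunks) {
  totals <- numeric(n_chunks)
  group <- integer(length(sizes))
  for (i in order(sizes, decreasing = TRUE)) {
    g <- which.min(totals)
    group[i] <- g
    totals[g] <- totals[g] + sizes[i]
  }
  group
}

sizes_mb <- c(120, 80, 200, 40, 160)   # made-up file sizes in MB
groups <- balance_chunks(sizes_mb, 2)
chunk_totals <- tapply(sizes_mb, groups, sum)
chunk_totals
```

Files that share a group number would then go into the same job, for example via `split(fls, groups)`.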

@lmullen
Contributor Author

lmullen commented Mar 10, 2016

🚀

Thanks very much for the helpful responses.

@lmullen
Contributor Author

lmullen commented Mar 10, 2016

P.S. I hope you won't take this offer the wrong way. But I'd be happy to proofread/revise the documentation before the next CRAN release. If that is something that would be helpful, maybe you could open a new issue and ping me? If it's not helpful, feel free to ignore.

@Sandy4321

Is there Python code for word2vec?

@zachmayer

@Sandy4321: https://radimrehurek.com/gensim/

@dselivanov
Owner

@lmullen that would be great! Yesterday I started writing the overview of v0.3. I'll create a 0.3 -> master pull request soon and add a link to the overview. After that I plan to wait about 1 week to collect feedback and then push the new version to CRAN.

@lmullen
Contributor Author

lmullen commented Mar 11, 2016

Sounds good. I'll do as much as I can on it once you've finished with the
0.3 documentation.

Lincoln Mullen, http://lincolnmullen.com
Assistant Professor, Department of History & Art History
George Mason University

@dselivanov
Owner

@lmullen I have created an overview of text2vec 0.3: text2vec 0.3 announce. It would be great if you could check the docs and vignettes.

cc @zachmayer, @trinker, @pommedeterresautee

@lmullen
Contributor Author

lmullen commented Mar 17, 2016

@dselivanov: I'll do my best to revise the docs and vignettes this weekend.

Is a single pull request okay, or do you want changes broken up?

@dselivanov
Owner

@lmullen a single PR will be OK. Thanks in advance.

dselivanov added this to the 0.3 milestone Mar 19, 2016

@Sandy4321

Thanks
On Mar 10, 2016 20:06, "Zach Mayer" wrote:

@Sandy4321: https://radimrehurek.com/gensim/

@dselivanov
Owner

Done in 0.3.
