Documentation for file iterators #69

Closed
lmullen opened this Issue Mar 10, 2016 · 14 comments

lmullen (Contributor) commented Mar 10, 2016

I have a question about using file iterators. As mentioned in #65, it's necessary to iterate over the texts in a corpus twice: once to build the vocabulary, and again to build the DTM. My assumption is that with a file iterator, this means each file is read from disk twice. My guess is that for corpora that fit into memory, one might be better off loading the texts into a character vector oneself and iterating over that twice; but for corpora that don't fit into memory, reading each file twice would be the only way to construct the DTM. Is that correct, or is there a better way?

dselivanov (Owner) commented Mar 10, 2016

Great question!
I also don't like the idea of iterating through the input twice, but in practice this isn't really an issue. In fact, the very first version of text2vec needed only a single pass to construct the corpus/DTM (and had no vocabulary concept). The single-pass approach has two main disadvantages:

  1. Usually we don't know the vocabulary and the frequent words (stop words) in advance. This means all of those frequent terms end up in the corpus, and the DTM becomes much denser and heavier.
  2. Once the user has constructed the DTM, they usually need to prune very common and very uncommon terms. There is a transformer_filter_commons for that purpose (quite optimized), but it involves two transpose operations on large dtm/tdm matrices. This typically increases RAM consumption by a factor of 3x-4x and can become a bottleneck!

Regarding the question about reading files: if you use data.table::fread or readr, I/O is only a tiny fraction of the whole work. A notable fraction of the computation can be in the preprocessing step (stemming, regular expressions, etc.), so an experienced user should try to avoid doing that work twice. For that purpose, in the 0.3 branch I created a new itoken constructor which takes a list of character vectors as input (i.e. a list of preprocessed tokens). There are two choices here:

  1. When the data doesn't fit into RAM (see the sketch after this list):
    1. Preprocess and tokenize the raw text file by file (so each file becomes a list of character vectors), and save it in serialized form with saveRDS().
    2. Read these files with ifiles, using readRDS as the reader function.
  2. When the data fits into RAM:
    1. Read and combine the documents into a single structure (say, a data.table or data.frame).
    2. Preprocess and tokenize the raw text, and save the result to another column.
    3. Create an itoken iterator from this new column.
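
For the out-of-core case (choice 1), here is a minimal sketch of the idea; raw_files and tokens_dir are hypothetical paths, and treating each file as a single document is an assumption:

# hypothetical sketch: preprocess and tokenize each file exactly once,
# then serialize the token lists so the expensive work is never repeated
for (f in raw_files) {
  txt <- paste(readLines(f), collapse = "\n")  # assume one file = one document
  saveRDS(word_tokenizer(tolower(txt)),
          file.path(tokens_dir, paste0(basename(f), ".rds")))
}

# both the vocabulary pass and the DTM pass now read pre-tokenized data,
# so only the cheap file reads (not the preprocessing) happen twice
token_files <- list.files(tokens_dir, full.names = TRUE)
it <- itoken(ifiles(token_files, reader_function = readRDS))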

The example below covers the in-memory case (choice 2) and uses the version from the 0.3 branch:

library(text2vec)
library(data.table)
library(magrittr)  # for %>% (also re-exported by text2vec)

data("movie_review")

# lowercase, then split each review into word tokens
prepare <- function(txt) {
  txt %>%
    tolower %>%
    word_tokenizer
}

setDT(movie_review)
movie_review[, review_tokens := prepare(review)]

head(movie_review[, .(id, sentiment, review_tokens)])

       id sentiment                                    review_tokens
1: 5814_8         1                  with,all,this,stuff,going,down,
2: 2381_9         1                   the,classic,war,of,the,worlds,
3: 7759_3         0                  the,film,starts,with,a,manager,
4: 3630_4         0                   it,must,be,assumed,that,those,
5: 9495_8         1 superbly,trashy,and,wondrously,unpretentious,80,
6: 8196_8         1                     i,dont,know,why,people,think,

# first pass over the tokens: build the vocabulary
v <- vocabulary(itoken(movie_review$review_tokens))
vectorizer <- vocab_vectorizer(v)
# second pass: build the DTM (iterating over in-memory tokens again is cheap)
dtm <- create_dtm(itoken(movie_review$review_tokens), vectorizer)

With hash_vectorizer, the DTM will be created in a single pass, since no vocabulary needs to be built.
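
A minimal sketch of that single-pass path, reusing the tokens from above (the 2^18 hash size is just an illustrative choice):

# no vocabulary step: terms are hashed directly to column indices
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 18)
dtm_hash <- create_dtm(itoken(movie_review$review_tokens), h_vectorizer)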

P.S. I vectorized the English Wikipedia (about 13 GB, 4M articles) in about 5 minutes on a 16-core machine, including vocabulary construction, pruning, and corpus construction. Hash vectorization took ~2.5 minutes.

lmullen (Contributor) commented Mar 10, 2016

Thanks for the explanation. So it seems like I'm using ifiles() correctly here by passing its return value on to itoken(), correct? But I could have just read the files into a data frame and preprocessed them myself. (In this case, I've got a specific vocabulary that I want to look for.)

  library(text2vec)
  library(readr)    # read_file()
  library(stringr)  # str_to_lower()

  files_it <- ifiles(newspaper_pages, reader_function = read_file)
  pages_it <- itoken(files_it, preprocess_function = str_to_lower,
                     tokenizer = word_tokenizer)
  newspaper_corpus <- create_corpus(pages_it, vocab_vectorizer(biblical_vocab))
  newspaper_dtm <- get_dtm(newspaper_corpus, type = "dgTMatrix")

One other question: working with about 6,000 test files, I'm not seeing that text2vec uses multiple processor threads/cores to create the DTM. Is that just because the test data is small, or do I need to tell it explicitly how many cores to use?

Just to give a sense of scale: I'm working with about 293 GB of text in 10.5 million documents. But for the particular thing I'm trying to do, it is trivial for me to split that up into smaller chunks myself.

This is an awesome package, btw.

dselivanov (Owner) commented Mar 10, 2016

So it seems like I'm using ifiles() correctly here by passing it's returned value on to itoken(), correct?

Correct!

I'm not noticing that text2vec is using multiple processor threads/cores to create the DTM. Is that just because the test data is small, or do I need to explicitly let it know how many cores to use?

I haven't written the new docs yet, but the following snippet demonstrates the new high-level API:

library(text2vec)
library(data.table)  # fread()
library(doParallel)
registerDoParallel(16)

start <- Sys.time()

# each input file is tab-separated wikipedia data: "article_title \t article_body";
# read only the body column
reader <- function(x) {
  fread(x, sep = '\t', header = FALSE, select = 2,
        colClasses = rep('character', 2))[[1]]
}

fls <- list.files("~/datasets/enwiki_splits/", full.names = TRUE)

# "jobs" is simply a list of itoken iterators: each element is a separate job,
# processed in a separate process. When they finish, the results are combined
# efficiently (especially efficiently in the case of `dgTMatrix`).
jobs <- fls %>%
  split_into(64) %>%
  lapply(function(x) x %>% ifiles(reader_function = reader) %>% itoken)

v <- vocabulary(jobs)

dtm <- create_dtm(jobs, vocab_vectorizer(v), type = 'dgTMatrix')

finish <- Sys.time()
finish - start

Also make sure your splits are roughly equal in size and not too small (to reduce overhead). In my case the splits were ~200 MB each; anything from roughly 10 MB to 300 MB will probably be fine. 1 MB splits will introduce too much overhead, while 1000 MB splits will trigger a much larger RAM footprint without any performance gain.
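
One quick way to sanity-check split sizes before launching the jobs, assuming the fls vector from the snippet above:

# split file sizes in MB; look for roughly equal values in a sensible range
summary(file.size(fls) / 1024 ^ 2)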

lmullen (Contributor) commented Mar 10, 2016

🚀

Thanks very much for the helpful responses.

lmullen (Contributor) commented Mar 10, 2016

P.S. I hope you won't take this offer the wrong way, but I'd be happy to proofread and revise the documentation before the next CRAN release. If that would be helpful, maybe you could open a new issue and ping me? If not, feel free to ignore this.

Sandy4321 commented Mar 11, 2016

Is there Python code for word2vec?

dselivanov (Owner) commented Mar 11, 2016

@lmullen that would be great! Yesterday I started writing an overview of v0.3. I'll create a 0.3 -> master pull request soon and add a link to the overview. After that I plan to wait about a week to collect feedback, then push the new version to CRAN.

lmullen (Contributor) commented Mar 11, 2016

Sounds good. I'll do as much as I can on it once you've finished with the 0.3 documentation.

dselivanov (Owner) commented Mar 17, 2016

@lmullen I have created an overview of text2vec 0.3 - text2vec 0.3 announce. It would be great if you could check the docs and vignettes.

cc @zachmayer, @trinker, @pommedeterresautee

lmullen (Contributor) commented Mar 17, 2016

@dselivanov: I'll do my best to revise the docs and vignettes this weekend.

Is a single pull request okay, or do you want the changes broken up?

dselivanov (Owner) commented Mar 18, 2016

@lmullen a single PR will be fine. Thanks in advance.

@dselivanov dselivanov added this to the 0.3 milestone Mar 19, 2016

Sandy4321 commented Mar 25, 2016

Thanks
On Mar 10, 2016 20:06, "Zach Mayer" notifications@github.com wrote:

@Sandy4321: https://radimrehurek.com/gensim/

dselivanov (Owner) commented Mar 31, 2016

Done in 0.3.

@dselivanov dselivanov closed this Mar 31, 2016
