I have a question about using file iterators. As mentioned in #65, the texts in a corpus have to be iterated over twice: once to build the vocabulary, and again to build the DTM. I assume that with a file iterator this means each file is read from disk twice. For a corpus that fits into memory, my guess is that it would be better to load the texts into a character vector oneself and iterate over that twice. For a corpus that doesn't fit into memory, reading each file twice would be the only way to construct the DTM. Is that correct, or is there a better way?
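To make the two cases concrete, here is a rough sketch of the pattern I have in mind. It is written in plain Python rather than R and does not use this package's API; every function name in it (`tokenize`, `file_texts`, `build_vocabulary`, `build_dtm`, `dtm_from_files`, `dtm_from_memory`) is made up for illustration, and the tokenizer is a naive whitespace split.

```python
from collections import Counter
from pathlib import Path


def tokenize(text):
    # Naive lowercase whitespace tokenizer; a stand-in for a real one.
    return text.lower().split()


def file_texts(paths):
    # Lazily yield one document per file, so only one file is held in memory at a time.
    for path in paths:
        yield Path(path).read_text(encoding="utf-8")


def build_vocabulary(texts):
    # Pass 1: map each distinct term to a column index, in order of first appearance.
    vocab = {}
    for text in texts:
        for term in tokenize(text):
            vocab.setdefault(term, len(vocab))
    return vocab


def build_dtm(texts, vocab):
    # Pass 2: count terms per document as sparse (row, column, count) triplets.
    triplets = []
    for row, text in enumerate(texts):
        for term, n in Counter(tokenize(text)).items():
            if term in vocab:
                triplets.append((row, vocab[term], n))
    return triplets


def dtm_from_files(paths):
    # Out-of-memory case: a fresh generator per pass, so every file is read twice.
    vocab = build_vocabulary(file_texts(paths))
    return vocab, build_dtm(file_texts(paths), vocab)


def dtm_from_memory(paths):
    # In-memory case: read every file once, then iterate over the list twice.
    texts = list(file_texts(paths))
    vocab = build_vocabulary(texts)
    return vocab, build_dtm(texts, vocab)
```

Both functions should produce the same vocabulary and the same counts; the only difference is whether the second pass goes back to disk or reuses the texts already loaded.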