Implement Java's CharSequence over very large text files; usable with regexes, or grappa
Latest commit db29b42 Dec 31, 2014


What this is

This library allows you to use very large (up to a few GiB) text files as a CharSequence.

OK, this does not sound very sexy, but please read on!

Motivation; and a bit of history

This project stemmed from a discussion on StackOverflow where I suggested that the OP (in StackOverflow jargon, the user asking the question) implement CharSequence over a large text file.

Even though the answer was accepted, well, nothing existed for that; and since I like a challenge (and there are many in this case), instead of just being satisfied with the answer, I decided to have a go at it; hence this project was born.

But the story does not end here. Since then I have also been working on Grappa, which is Parboiled (v1) continued; having this package in a corner of my mind, I decided to add CharSequence support.

So there we are: you can now use not only regexes, but full-fledged parboiled1/grappa grammars, on large text files without worrying about memory consumption, since you DO NOT need to load the whole file into memory; all of this thanks to a very simple interface which has been there since Java 1.4!

Versions

The current version is 0.2.0. Javadoc is available online. The package is available on Maven Central.

Using gradle:

dependencies {
    compile(group: "com.github.fge", name: "largetext", version: "0.2.0");
}

Using maven:

<dependency>
    <groupId>com.github.fge</groupId>
    <artifactId>largetext</artifactId>
    <version>0.2.0</version>
</dependency>

Warning about .toString()!

Yes, this very simple, seemingly innocuous method is this package's death trap. The CharSequence contract stipulates that its .toString() implementation must return a string whose length and contents are those of the sequence; but we deal here with files which can potentially contain billions of characters... And this means a billion-character-long string.

Using .toString() will therefore more than likely result in an OutOfMemoryError, not to mention that such an error will only be triggered after an inordinate amount of time... The current version does not deal with that, so, at this moment, the only thing I can say is:

DON'T DO THAT

Beware when debugging!
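
If you do need to look at the contents while debugging, one option is to copy only a bounded prefix instead of calling .toString() on the whole sequence. The following is a minimal sketch, not part of this library: the SafePreview class and its preview() method are hypothetical names, and the code relies only on length() and charAt() from the CharSequence interface.

```java
final class SafePreview {
    private SafePreview() {
    }

    /**
     * Copies at most maxChars leading characters of seq into a String,
     * so the resulting string stays small whatever the sequence length.
     */
    static String preview(final CharSequence seq, final int maxChars) {
        if (maxChars < 0)
            throw new IllegalArgumentException("maxChars must not be negative");
        final int len = Math.min(seq.length(), maxChars);
        final StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++)
            sb.append(seq.charAt(i));
        return sb.toString();
    }

    public static void main(final String... args) {
        // Any CharSequence will do; a StringBuilder stands in for a large file
        final CharSequence seq = new StringBuilder("a very long sequence, in theory");
        System.out.println(preview(seq, 6)); // prints "a very"
    }
}
```

Note that debuggers often call .toString() on their own when displaying a variable, so a helper like this only protects the calls you write yourself.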

Quick usage

The first thing to do is to create a LargeTextFactory. You can customize a factory in two ways:

  • specify the character encoding (Charset in Java) of your files;
  • specify the size of byte windows for the decoding process (see below).

Sample code:

// Default factory
final LargeTextFactory factory = LargeTextFactory.defaultFactory();
// Or submit your own charset and window size
final LargeTextFactory customFactory = LargeTextFactory.newBuilder()
    .setCharset(StandardCharsets.US_ASCII) // either a Charset instance...
    // .setCharsetByName("windows-1252")   // ...or a charset name (pick one)
    .setWindowSize(16, SizeUnit.MiB)       // set the window size
    .build();

The default factory uses UTF-8 as a character encoding and a 2 MiB byte window.

Then you create a LargeText instance; for this, you need the Path to the file.

Note that LargeText implements Closeable in addition to CharSequence, so it is important that you use it in a try-with-resources statement... Otherwise the file descriptor associated with it will stay open! Therefore:

final Path bigTextFile = Paths.get("/path/to/bigtextfile");

try (
    final LargeText largeText = factory.fromPath(bigTextFile);
) {
    // use "largeText" here
}

As mentioned in the introduction, the fact that it implements CharSequence means you can use it with regexes:

// You need Pattern.MULTILINE if you mean to match lines within
// the file! Otherwise "^" and "$" will only match the beginning
// and end of input (ie, the whole file) respectively.
private static final Pattern PATTERN = Pattern.compile("^\\d{4}:",
    Pattern.MULTILINE);

// In code:
final Path bigTextFile = Paths.get("/path/to/bigtextfile");

try (
    final LargeText largeText = factory.fromPath(bigTextFile);
) {
    final Matcher m = PATTERN.matcher(largeText);
    while (m.find())
        System.out.println("Match: " + m.group());
}
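
To see what Pattern.MULTILINE changes without a large file at hand, the same pattern can be run against a small in-memory CharSequence. This is only an illustrative sketch using java.util.regex; the MultilineDemo class and its countMatches() helper are hypothetical names, not part of this library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultilineDemo {
    private MultilineDemo() {
    }

    /** Counts how many times pattern is found in input. */
    static int countMatches(final Pattern pattern, final CharSequence input) {
        final Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find())
            count++;
        return count;
    }

    public static void main(final String... args) {
        // Three lines; two of them start with four digits and a colon
        final CharSequence input
            = new StringBuilder("1234: first\n5678: second\nno match here\n");

        final Pattern withFlag = Pattern.compile("^\\d{4}:", Pattern.MULTILINE);
        final Pattern withoutFlag = Pattern.compile("^\\d{4}:");

        System.out.println(countMatches(withFlag, input));    // prints 2
        System.out.println(countMatches(withoutFlag, input)); // prints 1
    }
}
```

Without the flag, "^" only matches at the very beginning of the input, which is why only the first line is found.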

Limitations

The limitations are those of CharSequence itself (and are shared by all of its implementations): length() and charAt() work with int values, so if you have more than Integer.MAX_VALUE characters in your file, you cannot use this class reliably!
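
A rough way to check for this limit up front is to derive an upper bound on the character count from the file size and the charset. The sketch below is an assumption of mine, not something this library provides: CharCountBound and fitsInCharSequence() are hypothetical names, and the bound comes from CharsetDecoder.maxCharsPerByte(), so it is conservative. A false result means the file might be too large, not that it is.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class CharCountBound {
    private CharCountBound() {
    }

    /**
     * Returns true if a file of sizeInBytes bytes, decoded with charset,
     * cannot produce more than Integer.MAX_VALUE characters.
     */
    static boolean fitsInCharSequence(final long sizeInBytes,
        final Charset charset) {
        // Upper bound on the number of chars produced per input byte
        final double maxCharsPerByte = charset.newDecoder().maxCharsPerByte();
        return sizeInBytes * maxCharsPerByte <= Integer.MAX_VALUE;
    }

    /** Same check, reading the size from the file system. */
    static boolean fitsInCharSequence(final Path path, final Charset charset)
        throws IOException {
        return fitsInCharSequence(Files.size(path), charset);
    }

    public static void main(final String... args) throws IOException {
        final Path bigTextFile = Paths.get(args[0]);
        System.out.println(
            fitsInCharSequence(bigTextFile, StandardCharsets.UTF_8));
    }
}
```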