regex-chunker

Splitting output from Read types with regular expressions.

The chief type in this crate is the ByteChunker, which wraps a type that implements Read and iterates over chunks of its byte stream delimited by a supplied regular expression. The following example reads from the standard input and prints word counts:

use std::collections::BTreeMap;
use std::error::Error;

use regex_chunker::ByteChunker;
  
fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();
    
    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}
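
To make the chunking idea concrete without pulling in the crate, here is a minimal std-only sketch. It is not how regex-chunker is implemented: it swaps the regular expression for a fixed set of delimiter bytes, and unlike a regex split it silently skips empty chunks. The names DelimChunker and count_words are hypothetical and exist only for this illustration.

```rust
use std::collections::BTreeMap;
use std::io::{self, BufReader, Bytes, Read};

/// Hypothetical stand-in for a chunker: yields the runs of bytes found
/// between delimiter bytes in any `Read` source.
struct DelimChunker<R: Read> {
    bytes: Bytes<BufReader<R>>,
    delims: Vec<u8>,
}

impl<R: Read> DelimChunker<R> {
    fn new(source: R, delims: &[u8]) -> Self {
        DelimChunker {
            // Buffer the source so that byte-at-a-time iteration stays cheap.
            bytes: BufReader::new(source).bytes(),
            delims: delims.to_vec(),
        }
    }
}

impl<R: Read> Iterator for DelimChunker<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut chunk = Vec::new();
        while let Some(result) = self.bytes.next() {
            match result {
                // Surface read errors to the caller, as a fallible iterator does.
                Err(e) => return Some(Err(e)),
                Ok(b) if self.delims.contains(&b) => {
                    // A delimiter ends the current chunk; runs of delimiters
                    // (and leading delimiters) produce no empty chunks.
                    if !chunk.is_empty() {
                        return Some(Ok(chunk));
                    }
                }
                Ok(b) => chunk.push(b),
            }
        }
        // End of input: emit whatever is left, if anything.
        if chunk.is_empty() {
            None
        } else {
            Some(Ok(chunk))
        }
    }
}

/// Counts lowercased words in `source`, mirroring the example above.
fn count_words<R: Read>(source: R) -> io::Result<BTreeMap<String, usize>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for chunk in DelimChunker::new(source, b" \"\r\n.,!?:;/") {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }
    Ok(counts)
}
```

Calling count_words(std::io::stdin()) would give the same shape of result as the ByteChunker example. The crate's version is more general because its delimiter is an arbitrary regular expression rather than a fixed byte set.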

The async feature enables the stream submodule, which contains an asynchronous version of ByteChunker that wraps a tokio::io::AsyncRead type and produces a Stream of byte chunks.

Running The Tests

To run the tests for the async features, first build src/bin/slowsource.rs with the async and test features enabled:

$ cargo build --bin slowsource --all-features

Some of the stream module tests run it in a subprocess and use it as a source of bytes.

Unanswered Questions and Stuff To Do

This is, as of yet, an essentially naive implementation. What can be done to optimize performance?

Is there room to tighten up the RcErr type?

When non-overlapping blanket impls (1672, maybe 20400) land, remove both the SimpleCustomChunker types.
