Splitting output from Read types with regular expressions.
The chief type in this crate is the ByteChunker, which wraps a type that implements Read and iterates over chunks of its byte stream delimited by a supplied regular expression. The following example reads from the standard input and prints word counts:
use std::collections::BTreeMap;
use std::error::Error;

use regex_chunker::ByteChunker;

fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();

    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        // Each item is a Result; `?` propagates read errors.
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}
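To make the idea of iterating over delimited chunks of a Read more concrete, here is a minimal standard-library-only sketch. It is not this crate's implementation: the DelimChunker type and count_words function are hypothetical names invented for illustration, and a fixed set of delimiter bytes stands in for the regular expression. The 8-byte read buffer is deliberately tiny so that delimiter runs can straddle two reads.

```rust
use std::collections::BTreeMap;
use std::io::{self, Cursor, Read};

/// Hypothetical stand-in for a chunker: splits a reader's bytes on runs of
/// delimiter bytes. A fixed byte set replaces the regular expression.
struct DelimChunker<R: Read> {
    source: R,
    delims: Vec<u8>,
    buf: Vec<u8>, // bytes read from the source but not yet yielded
    done: bool,   // true once the source has reported end of input
}

impl<R: Read> DelimChunker<R> {
    fn new(source: R, delims: &[u8]) -> Self {
        Self { source, delims: delims.to_vec(), buf: Vec::new(), done: false }
    }

    fn is_delim(&self, b: u8) -> bool {
        self.delims.contains(&b)
    }
}

impl<R: Read> Iterator for DelimChunker<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(start) = self.buf.iter().position(|&b| self.is_delim(b)) {
                let run = self.buf[start..]
                    .iter()
                    .take_while(|&&b| self.is_delim(b))
                    .count();
                let end = start + run;
                // Only split once the delimiter run is known to be complete:
                // either more bytes follow it, or the input has ended.
                if end < self.buf.len() || self.done {
                    let chunk = self.buf[..start].to_vec();
                    self.buf.drain(..end);
                    return Some(Ok(chunk));
                }
            } else if self.done {
                // No delimiter left: yield the remainder, if any, then stop.
                if self.buf.is_empty() {
                    return None;
                }
                return Some(Ok(std::mem::take(&mut self.buf)));
            }

            // Need more input; the small buffer is for illustration only.
            let mut tmp = [0u8; 8];
            match self.source.read(&mut tmp) {
                Ok(0) => self.done = true,
                Ok(n) => self.buf.extend_from_slice(&tmp[..n]),
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
        }
    }
}

/// Counts lowercased words in `input`, mirroring the example above.
fn count_words(input: &[u8]) -> io::Result<BTreeMap<String, usize>> {
    let mut counts = BTreeMap::new();
    for chunk in DelimChunker::new(Cursor::new(input), b" ,.\n") {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }
    Ok(counts)
}

fn main() -> io::Result<()> {
    let counts = count_words(b"The cat, the hat.\nThe end")?;
    println!("{:#?}", counts);
    Ok(())
}
```

In this sketch, input that begins with a delimiter yields an empty first chunk. Whether ByteChunker behaves the same way is not something this sketch establishes.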
The async feature enables the stream submodule, which contains an asynchronous version of ByteChunker that wraps a tokio::io::AsyncRead type and produces a Stream of byte chunks.
To run the tests for the async features, first build src/bin/slowsource.rs with the async and test features enabled:
$ cargo build --bin slowsource --all-features
Some of the stream module tests run it in a subprocess and use it as a source of bytes.
This is, as of yet, an essentially naive implementation. What can be done to optimize performance?
Is there room to tighten up the RcErr type?
When non-overlapping blanket impls (1672, maybe 20400) land, remove both the SimpleCustomChunker types.