
Improve loading speed (of regex?) - cli usecase #56

Open · drahnr opened this issue Apr 1, 2021 · 13 comments · Labels: P2 Medium priority

@drahnr
Contributor

drahnr commented Apr 1, 2021

The biggest issue using this library currently is that a lot of regular expressions are compiled on each startup.

If regex (or whatever crate is being used) implemented serialization of the compiled regexes, this could be avoided entirely and shifted to build time / done once. The blocker is that the regex crate itself does not implement serde serialization.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.
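
Roughly what I have in mind, as a sketch with rayon - compile_all and the pattern list are hypothetical, not nlprule's actual API:

use rayon::prelude::*;
use regex::Regex;

/// Hypothetical helper: compile all pattern strings in parallel.
/// Panics on an invalid pattern, mirroring the "pre-tested regex"
/// assumption the crate already makes.
fn compile_all(regex_strs: &[String]) -> Vec<Regex> {
    regex_strs
        .par_iter()
        .map(|s| Regex::new(s).expect("regex string should be pre-tested"))
        .collect()
}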

@bminixhofer
Owner

The biggest issue using this library currently is that a lot of regular expressions are compiled on each startup.

That is quite subjective :) But I agree that loading times should be improved / tracked more closely.

If regex (or whatever crate is being used) implemented serialization of the compiled regexes, this could be avoided entirely

This would be really nice. I believe regex-automata already implements this kind of serialization, but fancy-regex builds on regex, not regex-automata.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there:

fn regex(&self) -> &regex_impl::Regex {
    if let Some(regex) = self.regex.borrow() {
        regex
    } else {
        let regex = regex_impl::Regex::new(&self.regex_str).unwrap_or_else(|_| {
            panic!("regex string should be pre-tested: {}", self.regex_str)
        });
        self.regex.fill(regex).ok();
        self.regex.borrow().unwrap()
    }
}

Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. Happens here:

impl From<TaggerFields> for Tagger {

So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where the speed issues are.

I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now. For example, the word_store in the tagger is currently a map from String -> u32; I think that could be replaced with a set of Strings, computing hashes at runtime instead of storing a fixed u32 id. That would certainly improve memory usage, but I'm not sure if there'd be any impact on loading time.
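
Roughly this shape (a sketch; names are placeholders, and the id is widened to u64 here to make hash collisions unlikely - the real thing would have to deal with collisions properly):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Placeholder id type: derived from the string at runtime instead of
/// being stored alongside it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WordIdInt(u64);

fn word_id(word: &str) -> WordIdInt {
    let mut hasher = DefaultHasher::new();
    word.hash(&mut hasher);
    WordIdInt(hasher.finish())
}

/// The word store shrinks to a plain set; ids are recomputed on lookup.
struct WordStore {
    words: HashSet<String>,
}

impl WordStore {
    fn get(&self, word: &str) -> Option<WordIdInt> {
        self.words.contains(word).then(|| word_id(word))
    }
}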

@drahnr
Contributor Author

drahnr commented Apr 1, 2021

The biggest issue using this library currently is that a lot of regular expressions are compiled on each startup.

That is quite subjective :) But I agree that loading times should be improved / tracked more closely.

Note the constraint: CLI - and there, the load time is very significant for the end user.

If regex (or whatever crate is being used) implemented serialization of the compiled regexes, this could be avoided entirely

This would be really nice. I believe regex-automata already implements this kind of serialization, but fancy-regex builds on regex, not regex-automata.

I've had a quick peek into the source of regex, and unfortunately that task is non-trivial.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there.

Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. Happens here:

impl From<TaggerFields> for Tagger {

So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where the speed issues are.

Yeah, I'll add some criterion benches sometime next week.

I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now. For example, the word_store in the tagger is currently a map from String -> u32; I think that could be replaced with a set of Strings, computing hashes at runtime instead of storing a fixed u32 id. That would certainly improve memory usage, but I'm not sure if there'd be any impact on loading time.

Just making sure I am not investing time + effort for the 🗑️ 😉

@drahnr drahnr changed the title Improve loading speed of regex - cli usecase Improve loading speed (of regex?) - cli usecase Apr 1, 2021
@bminixhofer
Owner

Note the constraint: CLI - and there, the load time is very significant for the end user.

Right, for CLI that is definitely true.

Yeah, I'll add some criterion benches sometime next week.

That would be great! Some very simple benchmark + cargo-flamegraph should be enough to track down the biggest contributors, at least that's how I did it for the core.
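
Something along these lines would already do as a starting point (a sketch; the bench file name and the .bin paths are placeholders):

// benches/load.rs
use criterion::{criterion_group, criterion_main, Criterion};
use nlprule::{Rules, Tokenizer};

fn load_benchmark(c: &mut Criterion) {
    // Load the shipped binaries from disk on every iteration.
    c.bench_function("load_tokenizer", |b| {
        b.iter(|| Tokenizer::new("en_tokenizer.bin").unwrap())
    });
    c.bench_function("load_rules", |b| {
        b.iter(|| Rules::new("en_rules.bin").unwrap())
    });
}

criterion_group!(benches, load_benchmark);
criterion_main!(benches);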

@drahnr
Contributor Author

drahnr commented Apr 7, 2021

I took the lazy path; below is the current flamegraph of cargo-spellcheck with 0.5.3. You are correct, most time is spent in deserialization, which is quite unfortunate.

flamegraph.svg

Not quite sure how to approach this. I'd like to just do all the hard work in build.rs and serialize to something that is zero-copy, then deserialize that in the actual application, i.e. https://lib.rs/crates/rkyv or https://capnproto.org/ - both of which are clearly out of scope for this crate.

So my question here is: how much can be exposed for deserialization purposes, to make it easier to swap out as needed?
I'll think a bit about this at some point this week.
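
For reference, the kind of thing I have in mind with rkyv - a sketch against rkyv 0.7's API, with TaggerData as a stand-in type, not the crate's real data model:

use rkyv::{Archive, Deserialize, Serialize};

/// Stand-in for whatever subset of the model would be archived.
#[derive(Archive, Serialize, Deserialize)]
struct TaggerData {
    words: Vec<String>,
}

fn demo() {
    let data = TaggerData {
        words: vec!["example".to_owned()],
    };

    // Done once, e.g. in build.rs: serialize into an archive-friendly layout.
    let bytes = rkyv::to_bytes::<_, 1024>(&data).unwrap();

    // At startup: view the archived data in place, no per-field rebuild.
    // (Safe alternative: rkyv::check_archived_root with the validation feature.)
    let archived = unsafe { rkyv::archived_root::<TaggerData>(&bytes[..]) };
    assert_eq!(archived.words.len(), 1);
}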

@bminixhofer
Owner

bminixhofer commented Apr 7, 2021

Thanks for the graph! It seems almost all of the work is indeed done in the Tagger deserialization. In general, deserializing from an FST is necessary since otherwise the size of the binary blows up, but it makes sense that it's slow.

Here's the problematic code:

impl From<TaggerFields> for Tagger {
    fn from(data: TaggerFields) -> Self {
        let word_store_fst = Map::new(data.word_store_fst).unwrap();
        let word_store: BiMap<String, WordIdInt> = word_store_fst
            .into_stream()
            .into_str_vec()
            .unwrap()
            .into_iter()
            .map(|(key, value)| (key, WordIdInt(value as u32)))
            .collect();

        let mut tags = DefaultHashMap::new();
        let mut groups = DefaultHashMap::new();

        let tag_fst = Map::new(data.tag_fst).unwrap();
        let mut stream = tag_fst.into_stream();

        while let Some((key, value)) = stream.next() {
            let word = std::str::from_utf8(&key[..key.len() - 1]).unwrap();
            let word_id = *word_store.get_by_left(word).unwrap();

            let value_bytes = value.to_be_bytes();
            let inflection_id = WordIdInt(u32::from_be_bytes([
                value_bytes[0],
                value_bytes[1],
                value_bytes[2],
                value_bytes[3],
            ]));
            let pos_id = PosIdInt(u16::from_be_bytes([value_bytes[6], value_bytes[7]]));

            let group = groups.entry(inflection_id).or_insert_with(Vec::new);
            if !group.contains(&word_id) {
                group.push(word_id);
            }

            tags.entry(word_id)
                .or_insert_with(IndexMap::new)
                .entry(inflection_id)
                .or_insert_with(Vec::new)
                .push(pos_id);
        }

        Tagger {
            tags,
            tag_store: data.tag_store,
            word_store,
            groups,
            lang_options: data.lang_options,
            ..Default::default()
        }
    }
}

I can give parallelizing this to some degree a go; that's not trivial here, but it should help.

(also, I just noticed the into_iter + map + collect in L111 above can be easily avoided, so that should already measurably improve things, but it seems the bulk of the work is done in the loop starting at L119)

@drahnr
Contributor Author

drahnr commented Apr 7, 2021

tl;dr: I need a good test case to improve this for real.


My measurements really were only a few runs of cargo-spellcheck on the same file with different impls, comparing the total runtime percentage of the Tagger::from impl. So take it with a grain of salt.


The above code snippet seems slow due to the additional allocations; in reality the difference is insignificant, and the original is even slightly faster as is (probably due to some cache locality) compared to:

        let word_store_fst = Map::new(data.word_store_fst).unwrap();
        let mut word_store = FastBiMap::<String, WordIdInt>::with_capacity_and_hashers(
            word_store_fst.len(),
            Default::default(),
            Default::default(),
        );
        let mut stream = word_store_fst.into_stream();
        while let Some((key, value)) = stream.next() {
            if let Ok(key) = std::str::from_utf8(key) {
                word_store.insert(key.to_owned(), WordIdInt(value as u32));
            }
        }

The fnv hasher is supposed to be pretty fast for small keys, which is really all we have here, but it appears hashbrown beats fnv, maybe due to the SIMD lookup it advertises.
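
A throwaway harness along these lines is enough for a sanity check (hypothetical, not the exact runs behind the statement above):

use std::time::Instant;

/// Crude insert-throughput comparison; only the relative numbers matter.
fn bench_inserts(label: &str, mut insert: impl FnMut(String, u32)) {
    let start = Instant::now();
    for i in 0..1_000_000u32 {
        // Keys shaped like the short dictionary words in the word store.
        insert(format!("word{}", i), i);
    }
    println!("{}: {:?}", label, start.elapsed());
}

fn main() {
    let mut fnv = fnv::FnvHashMap::default();
    bench_inserts("fnv", |k, v| {
        fnv.insert(k, v);
    });

    let mut hb = hashbrown::HashMap::new();
    bench_inserts("hashbrown", |k, v| {
        hb.insert(k, v);
    });
}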

@drahnr
Contributor Author

drahnr commented Apr 7, 2021

Some more digging revealed that the data really is very sparse. We do 2 allocations (for the structure and the intermediate combinator key) where, in 99% of all cases, everything holds only 1 element. So at this point I am tempted to say that using Vec for the inner 2 layers should already give a pretty good speedup.
Another approach would be to eliminate the intermediate layer,

                .entry(inflection_id)

which is also barely used - only in get_raw, afaiu.
Using a Vec<(WordIdInt, PosIdInt)> should yield significantly better perf given our data sizes of mostly 1, up to 3 items.
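
Sketch of the flattened shape - the id types mirror this thread, the helpers are made up:

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct WordIdInt(u32);
#[derive(Clone, Copy, PartialEq, Eq)]
struct PosIdInt(u16);

/// One flat Vec of (inflection, POS) pairs per word instead of
/// IndexMap<WordIdInt, Vec<PosIdInt>>: a single allocation per word,
/// and a linear scan is cheap with 1-3 entries.
type Tags = HashMap<WordIdInt, Vec<(WordIdInt, PosIdInt)>>;

fn insert(tags: &mut Tags, word: WordIdInt, inflection: WordIdInt, pos: PosIdInt) {
    tags.entry(word).or_default().push((inflection, pos));
}

fn pos_ids<'a>(
    tags: &'a Tags,
    word: WordIdInt,
    inflection: WordIdInt,
) -> impl Iterator<Item = PosIdInt> + 'a {
    tags.get(&word)
        .into_iter()
        .flatten()
        .filter(move |(i, _)| *i == inflection)
        .map(|(_, pos)| *pos)
}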

@bminixhofer
Owner

Thanks for looking into this!

Vec for the inner 2 layers should already give a pretty good speedup.

Another approach would be to eliminate the intermediate layer

Sorry, what's the difference between these?

Using a Vec<(WordIdInt, PosIdInt)>

IIRC the reason for IndexMap<WordIdInt, Vec<PosIdInt>> was better binary size; I should've documented that. We can try Vec<(WordIdInt, PosIdInt)>, maybe the size difference is negligible.

Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?

I'd like to just do all the hard work in build.rs

If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs and deserializing them in parallel. That should speed things up roughly by a factor of the number n of FSTs, assuming the concurrent hashmap is fast. n would have to be one by default and settable via nlprule-build (which would have to do deserialization + update n + serialization), but that would be significantly more work and add quite a bit of complexity. The size difference is actually not too bad, though - using two FSTs instead of one adds ~1MB.
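
A sketch of the shard-and-merge variant (merging plain per-shard maps at the end instead of inserting into a concurrent map, and leaving out how the shards are produced):

use fst::{IntoStreamer, Map, Streamer};
use rayon::prelude::*;
use std::collections::HashMap;

/// Decode each FST shard on its own thread, then merge. Walking the
/// streams dominates, so this should scale roughly with the shard count.
fn load_shards(shards: Vec<Vec<u8>>) -> HashMap<String, u64> {
    shards
        .into_par_iter()
        .map(|bytes| {
            let fst = Map::new(bytes).unwrap();
            let mut stream = fst.into_stream();
            let mut part = HashMap::new();
            while let Some((key, value)) = stream.next() {
                part.insert(String::from_utf8(key.to_vec()).unwrap(), value);
            }
            part
        })
        .reduce(HashMap::new, |mut acc, part| {
            acc.extend(part);
            acc
        })
}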

@drahnr
Contributor Author

drahnr commented Apr 7, 2021

Thanks for looking into this!

Vec for the inner 2 layers should already give a pretty good speedup.

Another approach would be to eliminate the intermediate layer

Sorry, what's the difference between these?

One retains the current 2-layer inner nesting structure; the other flattens the IndexMap<_, Vec<_>> to a single Vec.

Using a Vec<(WordIdInt, PosIdInt)>

IIRC the reason for IndexMap<WordIdInt, Vec<PosIdInt>> was better binary size; I should've documented that. We can try Vec<(WordIdInt, PosIdInt)>, maybe the size difference is negligible.

Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?

That could work, but the most significant part is not the insertion, it's just calling next() on the Stream - I tried to write a safe wrapper as an iterator, to then be able to use rayon with a parallel iterator adapter (which suffers from the same issue).
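
The wrapper boils down to something like this - a sketch that copies each (key, value) pair out so the items stop borrowing the stream; Iterator becomes implementable, but next() stays strictly sequential:

use fst::{map::Stream, Map, Streamer};

/// Owning iterator over an fst map: each (key, value) pair is copied
/// out, so Iterator (and thus rayon's par_bridge) becomes usable.
/// The sequential next() bottleneck remains.
struct OwnedPairs<'m> {
    stream: Stream<'m>,
}

impl<'m> Iterator for OwnedPairs<'m> {
    type Item = (Vec<u8>, u64);

    fn next(&mut self) -> Option<Self::Item> {
        self.stream.next().map(|(k, v)| (k.to_vec(), v))
    }
}

fn pairs(map: &Map<Vec<u8>>) -> OwnedPairs<'_> {
    OwnedPairs {
        stream: map.stream(),
    }
}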

I'd like to just do all the hard work in build.rs

If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs and deserializing them in parallel. That should speed things up roughly by a factor of the number n of FSTs, assuming the concurrent hashmap is fast. n would have to be one by default and settable via nlprule-build (which would have to do deserialization + update n + serialization), but that would be significantly more work and add quite a bit of complexity. The size difference is actually not too bad, though - using two FSTs instead of one adds ~1MB.

I think fixing the topology of the Tagger should be the first step. I am not very deep into the code yet, so I don't know whether that already implies adding multiple FSTs or not.

@bminixhofer bminixhofer added the P2 Medium priority label Apr 8, 2021
@bminixhofer
Owner

bminixhofer commented Apr 8, 2021

Please rebase this onto main, there's now a simple bench for rules and tokenizer loading speed, and a CI fix.

The Tagger is not well documented at the moment, I'll try to get around to better documentation a bit later so you know roughly what's going on if you're working on it :)

@drahnr
Contributor Author

drahnr commented Apr 8, 2021

That'd be much appreciated, I'll have some time tomorrow night to look into this further.

@bminixhofer
Owner

Alright, there's better doc comments for the tagger on main now. I think I covered the important parts, let me know if anything is unclear!

@bminixhofer
Owner

bminixhofer commented Apr 28, 2021

This is somewhat addressed by #66 and #70. I'll keep this open because there are currently (at least) two more things worth investigating in this area:

  1. Track loading speed (and possibly some other metrics) over time using e.g. https://github.com/rhysd/github-action-benchmark to prevent regressions.
  2. Investigate parallelization of deserialization across structs. For example, the Tagger and the three Chunker models are all quite expensive to deserialize, and at the moment all of that happens sequentially although it could be done in parallel (see the sketch below). I'm not sure how difficult this would be; there's some related discussion in "Snazzy Rc/Arc handling" (bincode-org/bincode#139), but I don't think this will land in serde anytime soon.
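
For point 2, the parallel structure itself would be simple; here's a sketch with rayon::join where the types and decode functions are stand-ins for the actual bincode deserialization:

use rayon::join;

// Stand-ins for the real components and their decoders.
struct Tagger;
struct Chunker;

fn deserialize_tagger(_bytes: &[u8]) -> Tagger {
    Tagger
}
fn deserialize_chunker(_bytes: &[u8]) -> Chunker {
    Chunker
}

/// Decode the two expensive components on separate threads; join
/// blocks until both closures have finished.
fn load(tagger_bytes: &[u8], chunker_bytes: &[u8]) -> (Tagger, Chunker) {
    join(
        || deserialize_tagger(tagger_bytes),
        || deserialize_chunker(chunker_bytes),
    )
}

The hard part is presumably restructuring the serialized format so each component's bytes are addressable separately.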
