Improve loading speed (of regex?) - cli usecase #56
Comments
That is quite subjective :) But I agree that loading times should be improved / tracked more closely.
This would be really nice. I believe …
Regular expressions are already compiled lazily, and regex execution happens partially in parallel, so I don't think there's much potential for improvement there: nlprule/nlprule/src/utils/regex.rs, lines 70 to 80 at afc2d3e.
Are you sure that regex compilation is actually the bottleneck? I think the deserialization of the tagger from an FST could be the issue. It happens here: nlprule/nlprule/src/tokenizer/tag.rs, line 102 at afc2d3e.
So the first step here would be to add a criterion benchmark for loading the tokenizer and rules, and then check where the speed issues are. I want to focus on modularization for now, but I'm open to contributions, and if there's some identifiable bottleneck which isn't too hard to fix we can do that now (for example, the …)
Note the constraint: this is a CLI use case, and there the load time is very significant for the end user.
I've had a quick peek into the source of regex, and that task is actually non-trivial unfortunately.
Yeah, I'll add some criterion benches eventually next week.
Just making sure I am not investing time + effort for the 🗑️ 😉
Right, for CLI that is definitely true.
That would be great! Some very simple benchmark + …
I took the lazy path, and below is the current flamegraph of … Not quite sure how to approach this; I'd like to just do all the hard work in … So my question here is: how much can one expose for deserialization purposes, to make that easier to swap out as needed?
Thanks for the graph! It seems almost all of the work is indeed done in the Tagger deserialization. In general, deserializing from an FST is necessary since otherwise the size of the binary blows up, but it makes sense that it's slow. Here's the problematic code: nlprule/nlprule/src/tokenizer/tag.rs, lines 102 to 153 at 12bd52f.
I can give parallelizing this to some degree a go; that's not trivial here, but it should help. (Also, I just noticed the …)
tl;dr: need a good test case to improve this for real. My measurements really were only a few runs of cargo-spellcheck on the same file with different impls, comparing the total runtime percentage of the …

The above code snippet seems slow due to the additional allocations; in reality, the change is insignificant, and it is slightly faster as is (probably due to some cache locality) compared to:

```rust
let word_store_fst = Map::new(data.word_store_fst).unwrap();
let mut word_store = FastBiMap::<String, WordIdInt>::with_capacity_and_hashers(
    word_store_fst.len(),
    Default::default(),
    Default::default(),
);
let mut stream = word_store_fst.into_stream();
while let Some((key, value)) = stream.next() {
    if let Ok(key) = std::str::from_utf8(key) {
        word_store.insert(key.to_owned(), WordIdInt(value as u32));
    }
}
```
Some more digging revealed that the data really is very sparse. We do two allocations, for the structure and for the intermediate combinator key, where everything holds only one element in 99% of all cases. So at this point, I am tempted to say that using … which is also barely used, but in …
Thanks for looking into this!
Sorry, what's the difference between these?
IIRC the reason for … Also, what do you think about e.g. https://docs.rs/flurry/0.3.1/flurry/ for parallelization?
If the above is not fast enough, I think the best approach in this direction would be splitting the tag store into multiple FSTs + deserializing in parallel. That should speed things up roughly by a factor of the number …
One retains the current two-layer inner nesting structure; the other flattens the …
That could work, but the most significant part is not the insertion, but just calling …
I think fixing the topology of the …
Please rebase this onto … The …
That'd be much appreciated. I'll have some time tomorrow night to look into this further.
Alright, there are better doc comments for the tagger on …
This is somewhat addressed by #66 and #70. I'll keep this open because there are currently (at least) two more things worth investigating in this area: …
The biggest issue using this library currently is that on each startup a lot of regular expressions are compiled. If `regex` (or whatever crate is being used) implemented serialization of the compiled regex, this could be entirely avoided and shifted to build time / done once. The blocker here is that the `regex` crate itself does not implement `serde` serialization. In the meantime, parallel compilation of the parsed regexes would probably speed up the initial loading by a factor of $cores.