
Modularizing the crate #50

Open
bminixhofer opened this issue Mar 1, 2021 · 4 comments
Labels
P1 High priority

Comments

@bminixhofer
Owner

bminixhofer commented Mar 1, 2021

As suggested originally by @drahnr (#2 (comment)) we should consider splitting nlprule into multiple sub-crates.

What's certain is that there will be one nlprule crate combining the sub-crates into a higher-level API which is equivalent to the Python API.

There are multiple reasons for doing a split:

  • Some users might want to do only tokenization, or only some part of the tokenization pipeline. They should not have to pull in the weight of the suggestion producers.
  • Some users might not want to use all of the suggestion producers, but e.g. only spellchecking or only grammar rules.

Modularizing the crate would primarily reduce the size of the binaries and the number of required dependencies.

There is a distinction between:

  1. The original split into sentences and tokens.
  2. Things that set or modify information on the tokens (e.g. disambiguation rules, chunking, part-of-speech tagging); this is the functionality currently in the Tokenizer.
  3. What I called suggestion producers above: things that draw conclusions from the information set on the tokens. Currently that is only the grammar rules; spellchecking will fall into this category too.

Splitting (3.) into modules is easy: there should be one module for each separate entity which operates on the tokens, i.e. one for spellchecking (nlprule-spellcheck) and one for grammar rules (nlprule-grammar or something similar).

Splitting (2.) and (1.) is harder. I think having (1.) as a separate crate which only does sentence segmentation and token segmentation (nlprule-tokenize) would make sense. Then there could be multiple crates which set different information on the tokens, for example nlprule-disambiguate (disambiguation rules), nlprule-multiword (multiword tagging) and nlprule-crf (chunking and possibly other CRF models behind feature flags).

There are a number of open questions, such as:

  • How to handle distribution of binaries. Distributing binaries with everything enabled and functionality to disable specific parts in nlprule-build might be an option.
  • How to handle invalid combinations of the modules. For example, nlprule-grammar does not make sense without nlprule-disambiguate. Panicking as early as possible is probably best.
  • How to handle interdependencies between the modules from (2.). For example, lexical tagging is required for disambiguation, but it is conceivable to want tagging without doing disambiguation.
  • Whether the name nlprule still makes sense since it is possible to combine modules in a way that does not use rules at all. But that's not the biggest issue 🙂
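On the invalid-combinations point, a minimal sketch of what "panicking as early as possible" could look like: validate the module combination at construction time instead of failing mid-analysis. All names here (`Modules`, `validate`) are made up for illustration, not actual nlprule API:

```rust
// Hypothetical sketch: reject invalid module combinations immediately,
// rather than letting them produce wrong suggestions later.
struct Modules {
    grammar: bool,
    disambiguation: bool,
}

fn validate(modules: &Modules) {
    // Grammar rules rely on disambiguated tags, so this combination
    // can never work.
    if modules.grammar && !modules.disambiguation {
        panic!("nlprule-grammar requires nlprule-disambiguate");
    }
}

fn main() {
    // Silence the default panic message for the demonstration below.
    std::panic::set_hook(Box::new(|_| {}));

    validate(&Modules { grammar: true, disambiguation: true }); // fine

    let invalid = Modules { grammar: true, disambiguation: false };
    let result = std::panic::catch_unwind(|| validate(&invalid));
    assert!(result.is_err()); // fails as early as possible
    println!("ok");
}
```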

Implementing this is not a near-term goal right now (see #42), and I am not sure whether it is worth it from a practical point of view, but I wanted to open this issue for discussion. Thanks again to @drahnr for the original suggestion.

@bminixhofer
Owner Author

One interesting thing this would enable is using more sophisticated ML approaches (e.g. modern Transformers) as suggestion producers. This is not possible at the moment because it fundamentally clashes with the portability goals of nlprule.

@drahnr
Contributor

drahnr commented Mar 2, 2021

I like the writeup and the direction this points!

If some crates don't make sense without others, re-export behind a feature flag.
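A minimal sketch of that re-export idea, with illustrative crate and feature names (this is not the actual nlprule layout): the umbrella crate only exposes the grammar crate together with the crates it depends on, so the invalid combination cannot be selected in the first place.

```rust
// Hypothetical sketch: in the umbrella crate's Cargo.toml one would
// declare a feature that pulls in both crates together, e.g.:
//   [features]
//   grammar = ["dep:nlprule-grammar", "dep:nlprule-disambiguate"]
//
// The umbrella crate then re-exports them only when that feature is on,
// so `grammar` is never available without `disambiguate`.
#[cfg(feature = "grammar")]
pub use nlprule_grammar as grammar;
#[cfg(feature = "grammar")]
pub use nlprule_disambiguate as disambiguate;

fn main() {
    // With the feature disabled, the re-exports are compiled out entirely.
    println!("grammar feature enabled: {}", cfg!(feature = "grammar"));
}
```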

Feel free to ping me for feedback on such a PR :)

@bminixhofer bminixhofer mentioned this issue Mar 6, 2021
@sai-prasanna

I think the spaCy Python library also deals with these modularization questions. It has a layered pipeline architecture which could be worth checking out: https://spacy.io/usage/processing-pipelines

@bminixhofer
Owner Author

Thanks @sai-prasanna, spaCy is definitely a good example where this was solved well.

I will move forward with some of the higher-level changes mentioned here before merging #51 since, as @drahnr pointed out, spellchecking would benefit a lot from better modularization beforehand (also, I do not need nlprule in a project myself right now, so I'm not in a rush to add spellchecking).

In particular, properly addressing the complexity of combining multiple suggestion providers, which may or may not need some preprocessing (e.g. in the form of the Tokenizer), needs something that can represent a tree structure. I believe the following would work:

let correcter = Pipeline::new((
    tokenizer!("en"),
    Union::new((rules!("en"), spell!("en", "en_GB"))),
));

where Union and Pipeline are completely generic and check compatibility at compile time. The individual components would then have to be completely separated from each other, which should be easily possible. So this is part one.
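One way such compile-time compatibility checking could work is via a component trait with associated input/output types; a stage only type-checks if its input matches the previous stage's output. This is an illustrative sketch only, with made-up types and trivial bodies, not the real nlprule API:

```rust
// Components declare what they consume and produce, so Pipeline and Union
// can reject incompatible combinations at compile time.
#[derive(Clone)]
struct Tokens(Vec<String>);
struct Suggestions(Vec<String>);

trait Component {
    type Input;
    type Output;
    fn apply(&self, input: Self::Input) -> Self::Output;
}

struct Tokenizer;
impl Component for Tokenizer {
    type Input = String;
    type Output = Tokens;
    fn apply(&self, input: String) -> Tokens {
        Tokens(input.split_whitespace().map(str::to_owned).collect())
    }
}

struct Rules; // stand-in for a suggestion producer
impl Component for Rules {
    type Input = Tokens;
    type Output = Suggestions;
    fn apply(&self, tokens: Tokens) -> Suggestions {
        Suggestions(vec![format!("{} tokens checked", tokens.0.len())])
    }
}

// Runs two components in sequence; only type-checks when the first stage's
// output equals the second stage's input.
struct Pipeline<A, B>(A, B);
impl<A: Component, B: Component<Input = A::Output>> Component for Pipeline<A, B> {
    type Input = A::Input;
    type Output = B::Output;
    fn apply(&self, input: A::Input) -> B::Output {
        self.1.apply(self.0.apply(input))
    }
}

// Runs two suggestion producers on the same input and merges their results.
struct Union<A, B>(A, B);
impl<A, B> Component for Union<A, B>
where
    A: Component<Output = Suggestions>,
    B: Component<Input = A::Input, Output = Suggestions>,
    A::Input: Clone,
{
    type Input = A::Input;
    type Output = Suggestions;
    fn apply(&self, input: A::Input) -> Suggestions {
        let mut out = self.0.apply(input.clone()).0;
        out.extend(self.1.apply(input).0);
        Suggestions(out)
    }
}

fn main() {
    // Mirrors the Pipeline/Union combination from the snippet above.
    let correcter = Pipeline(Tokenizer, Union(Rules, Rules));
    let suggestions = correcter.apply("she are nice".to_owned());
    assert_eq!(suggestions.0.len(), 2);
    println!("{:?}", suggestions.0);
}
```

Swapping the stages (e.g. `Pipeline(Rules, Tokenizer)`) would then fail to compile, which is exactly the early feedback a user wants.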


Once spellchecking is implemented, this would add a third binary. I'm not happy with the way the binaries are currently included, for a couple of reasons:

  • it is very verbose:
let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/",
    tokenizer_filename!("en")
));

for a single binary is far too unergonomic.

  • it needs distribution via GitHub releases to be reliable, which has been an issue in the past.
  • it requires the user to add a build.rs.
  • it pushes fetching build directories from Backblaze into application code in nlprule-build. I don't really want to guarantee that Backblaze is reliably up either.
  • it pushes inclusion of the binaries downstream, which e.g. cargo-spellcheck solved by storing the binaries in-tree; that is problematic in itself and because of the 10MB limit on crates.io.

So to fix this I want to deprecate nlprule-build and store the binary source on crates.io instead (it will also stay on GitHub releases for the Python bindings). The crates.io team has kindly already increased the size limit for nlprule to 100MB to enable this. This then makes it possible to have macros, e.g. tokenizer!("en") and tokenizer_bytes!("en"), which directly include the correct binary at compile time in one line of code. This is part two.
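A sketch of how such a one-line macro could work, assuming the binaries are stored in-tree in the published crate. The macro name, directory, and file-name scheme below are made up for illustration, and only the path string is built here so the sketch runs without the actual binary:

```rust
// Hypothetical sketch: with binaries shipped inside the crate on crates.io,
// a macro can resolve the correct file from the language code alone. The
// real macro would wrap this in `include_bytes!` to embed the data.
macro_rules! tokenizer_path {
    ($lang:literal) => {
        concat!("binaries/", $lang, "_tokenizer.bin.gz")
    };
}

fn main() {
    // In the real crate this would be a single line like:
    //   let bytes = include_bytes!(tokenizer_path!("en"));
    let path = tokenizer_path!("en");
    assert_eq!(path, "binaries/en_tokenizer.bin.gz");
    println!("{}", path);
}
```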

The key tradeoff here is more convenience for less customizability. In particular, one change I believe I'll have to make is that the binaries will be internally gzipped (together with #20 they'll be a .tar.gz) to avoid having to store multiple variants, since the user will almost always want compression anyway.

I would as always appreciate feedback on both these things. They are not set in stone, just what I currently think is the best solution.

I'll open PRs once I have an implementation; in the meantime I think it makes sense to keep discussion in this issue even though the two parts are only tangentially related.

Also note that part one implies changes for the Python bindings too, with the only difference that correct composition is (obviously) not checked at compile time there. This would look very similar to the pipelines in sklearn.
