
Modularizing the crate #50

Open
bminixhofer opened this issue Mar 1, 2021 · 4 comments
Labels
P1 High priority

Comments

@bminixhofer
Owner

bminixhofer commented Mar 1, 2021

As suggested originally by @drahnr (#2 (comment)) we should consider splitting nlprule into multiple sub-crates.

What's certain is that there will be one nlprule crate combining the sub-crates into a higher-level API which is equivalent to the Python API.

There are multiple reasons for doing a split:

  • Some users might want to do only tokenization, or only some part of the tokenization pipeline. They should not have to pull in the weight of the suggestion producers.
  • Some users might not want to use all of the suggestion producers, but e.g. only spellchecking or only grammar rules.

Modularizing the crate would primarily reduce the size of the binaries and the number of required dependencies.

There is a distinction between:

  1. The original split into sentences and tokens.
  2. Things that set or modify information on the tokens (e.g. disambiguation rules, chunking, part-of-speech tagging); this is the functionality currently in the Tokenizer.
  3. What I called suggestion producers above: things that draw conclusions from the information set on the tokens. Currently that is only the grammar rules; spellchecking will fall into this category too.

Splitting (3.) into modules is easy: there should be one module for each separate entity which operates on the tokens, i.e. one for spellchecking (nlprule-spellcheck) and one for grammar rules (nlprule-grammar or something similar).

Splitting (2.) and (1.) is harder. I think having (1.) as a separate crate which only does sentence segmentation and token segmentation (nlprule-tokenize) would make sense. Then there could be multiple crates which set different information on the tokens, for example nlprule-disambiguate (disambiguation rules), nlprule-multiword (multiword tagging) and nlprule-crf (chunking and possibly other CRF models behind feature flags).

There are a number of open questions, such as:

  • How to handle distribution of binaries. Distributing binaries with everything enabled and functionality to disable specific parts in nlprule-build might be an option.
  • How to handle invalid combinations of the modules. For example, nlprule-grammar does not make sense without nlprule-disambiguate. Panicking as early as possible is probably best.
  • How to handle interdependencies between the modules from (2.). For example, lexical tagging is required for disambiguation, but it is conceivable to want tagging without doing disambiguation.
  • Whether the name nlprule still makes sense since it is possible to combine modules in a way that does not use rules at all. But that's not the biggest issue 🙂
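On the invalid-combinations point, a minimal sketch of what "panicking as early as possible" could look like: validate the module combination at construction time instead of failing mid-analysis. All names here (`Modules`, `validate`) are made up for illustration, not actual nlprule API:

```rust
// Hypothetical sketch: reject invalid module combinations immediately,
// rather than letting them produce wrong suggestions later.
struct Modules {
    grammar: bool,
    disambiguation: bool,
}

fn validate(modules: &Modules) {
    // Grammar rules rely on disambiguated tags, so this combination
    // can never work.
    if modules.grammar && !modules.disambiguation {
        panic!("nlprule-grammar requires nlprule-disambiguate");
    }
}

fn main() {
    // Silence the default panic message for the demonstration below.
    std::panic::set_hook(Box::new(|_| {}));

    validate(&Modules { grammar: true, disambiguation: true }); // fine

    let invalid = Modules { grammar: true, disambiguation: false };
    let result = std::panic::catch_unwind(|| validate(&invalid));
    assert!(result.is_err()); // fails as early as possible
    println!("ok");
}
```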

Implementing this is not a near-term goal right now (see #42), and I am not sure whether it is worth it from a practical point of view, but I wanted to open this issue for discussion. Thanks again to @drahnr for the original suggestion.

@bminixhofer
Owner Author

One interesting thing this would enable is using more sophisticated ML approaches (e.g. modern Transformers) as suggestion producers. This is not possible at the moment because it fundamentally clashes with the portability goals of nlprule.

@drahnr
Contributor

drahnr commented Mar 2, 2021

I like the writeup and the direction this points!

If some crates don't make sense without others, re-export behind a feature flag.
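A minimal sketch of that re-export idea, with illustrative crate and feature names (this is not the actual nlprule layout): the umbrella crate only exposes the grammar crate together with the crates it depends on, so the invalid combination cannot be selected in the first place.

```rust
// Hypothetical sketch: in the umbrella crate's Cargo.toml one would
// declare a feature that pulls in both crates together, e.g.:
//   [features]
//   grammar = ["dep:nlprule-grammar", "dep:nlprule-disambiguate"]
//
// The umbrella crate then re-exports them only when that feature is on,
// so `grammar` is never available without `disambiguate`.
#[cfg(feature = "grammar")]
pub use nlprule_grammar as grammar;
#[cfg(feature = "grammar")]
pub use nlprule_disambiguate as disambiguate;

fn main() {
    // With the feature disabled, the re-exports are compiled out entirely.
    println!("grammar feature enabled: {}", cfg!(feature = "grammar"));
}
```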

Feel free to ping me for feedback on such a PR :)

@bminixhofer bminixhofer mentioned this issue Mar 6, 2021
@sai-prasanna

I think the spaCy Python library also deals with these modularization questions. It has a layered pipeline architecture which could be worth checking out: https://spacy.io/usage/processing-pipelines

@bminixhofer
Owner Author

Thanks @sai-prasanna, spaCy is definitely a good example where this was solved well.

I will move forward with some of the higher-level changes mentioned here before merging #51 since, as @drahnr pointed out, spellchecking would benefit a lot from better modularization beforehand (also, I do not need nlprule in a project myself right now, so I'm not in a rush to add spellchecking).

In particular, properly addressing the complexity of combining multiple suggestion providers, which may or may not need some preprocessing (e.g. in the form of the Tokenizer), needs something that can represent a tree structure. I believe the following would work:

let correcter = Pipeline::new((
    tokenizer!("en"),
    Union::new((rules!("en"), spell!("en", "en_GB"))),
));

where Union and Pipeline are completely generic and check compatibility at compile time. The individual components would then have to be completely separated from each other, which should be easily possible. So this is part one.
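One way such compile-time compatibility checking could work is via a component trait with associated input/output types; a stage only type-checks if its input matches the previous stage's output. This is an illustrative sketch only, with made-up types and trivial bodies, not the real nlprule API:

```rust
// Components declare what they consume and produce, so Pipeline and Union
// can reject incompatible combinations at compile time.
#[derive(Clone)]
struct Tokens(Vec<String>);
struct Suggestions(Vec<String>);

trait Component {
    type Input;
    type Output;
    fn apply(&self, input: Self::Input) -> Self::Output;
}

struct Tokenizer;
impl Component for Tokenizer {
    type Input = String;
    type Output = Tokens;
    fn apply(&self, input: String) -> Tokens {
        Tokens(input.split_whitespace().map(str::to_owned).collect())
    }
}

struct Rules; // stand-in for a suggestion producer
impl Component for Rules {
    type Input = Tokens;
    type Output = Suggestions;
    fn apply(&self, tokens: Tokens) -> Suggestions {
        Suggestions(vec![format!("{} tokens checked", tokens.0.len())])
    }
}

// Runs two components in sequence; only type-checks when the first stage's
// output equals the second stage's input.
struct Pipeline<A, B>(A, B);
impl<A: Component, B: Component<Input = A::Output>> Component for Pipeline<A, B> {
    type Input = A::Input;
    type Output = B::Output;
    fn apply(&self, input: A::Input) -> B::Output {
        self.1.apply(self.0.apply(input))
    }
}

// Runs two suggestion producers on the same input and merges their results.
struct Union<A, B>(A, B);
impl<A, B> Component for Union<A, B>
where
    A: Component<Output = Suggestions>,
    B: Component<Input = A::Input, Output = Suggestions>,
    A::Input: Clone,
{
    type Input = A::Input;
    type Output = Suggestions;
    fn apply(&self, input: A::Input) -> Suggestions {
        let mut out = self.0.apply(input.clone()).0;
        out.extend(self.1.apply(input).0);
        Suggestions(out)
    }
}

fn main() {
    // Mirrors the Pipeline/Union combination from the snippet above.
    let correcter = Pipeline(Tokenizer, Union(Rules, Rules));
    let suggestions = correcter.apply("she are nice".to_owned());
    assert_eq!(suggestions.0.len(), 2);
    println!("{:?}", suggestions.0);
}
```

Swapping the stages (e.g. `Pipeline(Rules, Tokenizer)`) would then fail to compile, which is exactly the early feedback a user wants.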


Once spellchecking is implemented, this would add a third binary. I'm not happy with the way the binaries are currently included, for a couple of reasons:

  • it is very verbose:
let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/",
    tokenizer_filename!("en")
));

for a single binary is far too unergonomic.

  • it needs distribution via GitHub releases to be reliable, which has been an issue in the past.
  • it requires the user to add a build.rs.
  • it pushes fetching build directories from Backblaze into application code in nlprule-build. I don't really want to guarantee that Backblaze is reliably up either.
  • it pushes inclusion of the binaries downstream, which e.g. cargo-spellcheck solved by storing the binaries in-tree; that is problematic in itself and because of the 10MB limit on crates.io.

So to fix this I want to deprecate nlprule-build and store the binary source on crates.io instead (it will also stay on GitHub releases for the Python bindings). The crates.io team has kindly already increased the size limit for nlprule to 100MB to enable this. This then makes it possible to have macros, e.g. tokenizer!("en") and tokenizer_bytes!("en"), which directly include the correct binary at compile time in one line of code. This is part two.
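A sketch of how such a one-line macro could work, assuming the binaries are stored in-tree in the published crate. The macro name, directory, and file-name scheme below are made up for illustration, and only the path string is built here so the sketch runs without the actual binary:

```rust
// Hypothetical sketch: with binaries shipped inside the crate on crates.io,
// a macro can resolve the correct file from the language code alone. The
// real macro would wrap this in `include_bytes!` to embed the data.
macro_rules! tokenizer_path {
    ($lang:literal) => {
        concat!("binaries/", $lang, "_tokenizer.bin.gz")
    };
}

fn main() {
    // In the real crate this would be a single line like:
    //   let bytes = include_bytes!(tokenizer_path!("en"));
    let path = tokenizer_path!("en");
    assert_eq!(path, "binaries/en_tokenizer.bin.gz");
    println!("{}", path);
}
```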

The key tradeoff here is more convenience for less customizability. In particular, one change I believe I'll have to make is that the binaries will be internally gzipped (together with #20 they'll be a .tar.gz) to avoid having to store multiple variants, since the user will almost always want compression anyway.

I would as always appreciate feedback on both these things. They are not set in stone, just what I currently think is the best solution.

I'll open PRs once I have an implementation; in the meantime I think it makes sense to keep discussion in this issue even though the two parts are only tangentially related.

Also note that part one implies changes for the Python bindings too, with the only difference that correct composition is (obviously) not checked at compile time there. This would look very similar to the pipelines in sklearn.
