
Add support for custom trained PunktTokenizer in PreProcessor #2783

Merged

Conversation

@danielbichuetti (Contributor) commented Jul 8, 2022

Related Issue(s): #2780 #2781

Proposed changes:

  • Today, the PreProcessor uses NLTK's PunktTokenizer. In some specific domains, the default model needs extra training on a corpus from that domain. Based on this, I propose introducing a parameter tokenizer_model_folder that lets users supply custom trained models, named like language.pickle. This change doesn't break the current class and doesn't prevent reusing the same parameter for another sentence tokenizer if the project adopts one in the future (spaCy or Stanza, for example).
    If a model for the given language is present in this folder, the PreProcessor will use it; if not, it falls back to the default one. Users could keep one folder with models for legal documents, another for medical text, and so on.

This is the first draft of the PR, so let me know where it can be further improved.
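The proposed lookup-then-fallback behavior could be sketched like this (a minimal, stdlib-only illustration; the function name load_sentence_tokenizer and the "default" placeholder are hypothetical, standing in for the PreProcessor's real NLTK loading logic):

```python
import pickle
from pathlib import Path


def load_sentence_tokenizer(language, tokenizer_model_folder=None):
    """Load a custom Punkt model named `<language>.pickle` from the given
    folder, falling back to a default when none is found.

    Illustrative sketch only: the real PreProcessor falls back to NLTK's
    bundled Punkt models instead of the "default" placeholder used here.
    """
    if tokenizer_model_folder is not None:
        model_path = Path(tokenizer_model_folder) / f"{language}.pickle"
        if model_path.exists():
            with open(model_path, "rb") as f:
                return pickle.load(f)
    # Fallback branch: stand-in for loading NLTK's default model
    return "default"
```

With this shape, pointing tokenizer_model_folder at a directory of law-specific or medical-specific pickles changes only where the model comes from, not how the rest of the class uses it.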

Pre-flight checklist

@danielbichuetti danielbichuetti changed the title Add preprocessor custom model support Add support for custom trained PunktTokenizer in PreProcessor Jul 11, 2022
@ZanSara (Contributor) left a comment

Hey @danielbichuetti! Nice one! 👍 I added some comments on the usage of pathlib, a note on the tests and some other minor comments, but there's no issue with your approach. PR looks already very good as it is! Let me know if anything is unclear.

Review comments on haystack/nodes/preprocessor/preprocessor.py and test/nodes/test_preprocessor.py (outdated, resolved)
@danielbichuetti (Contributor, Author) commented Jul 11, 2022

@ZanSara I just read another issue about using Catalan with the sentence tokenizer; Catalan is not one of Punkt's default languages.

After reading the code and refactoring, I edited it here. I'll include a test for languages that are not in the hard-coded dictionary but that users have set up by name.

@danielbichuetti (Contributor, Author) commented
@ZanSara I have refactored the code. Because of the automatic fallback handling we adopted, the code is separated and the logs are specific to each scenario. This way, users can find their errors easily.

I added some extra tests, following your suggestions. The small Portuguese PunktSentenceTokenizer model in the samples folder achieves much better results on legal documents than the default one.
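For reference, a domain-specific Punkt model like the Portuguese one mentioned above can be trained and saved roughly like this (a sketch assuming NLTK is installed; the tiny corpus and the temp folder are placeholders — a real model needs a large domain corpus):

```python
import pickle
import tempfile
from pathlib import Path

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Tiny stand-in corpus; a real domain model needs a large corpus of,
# e.g., legal text in the target language.
corpus = "O juiz proferiu a sentença. O réu foi absolvido. O processo foi arquivado."

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations/abbreviations
trainer.train(corpus)

# Build a sentence tokenizer from the learned parameters
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Save under the ISO language name so a language-based lookup can find it.
model_dir = Path(tempfile.mkdtemp())
with open(model_dir / "pt.pickle", "wb") as f:
    # Protocol 4 keeps the file loadable on Python 3.7
    pickle.dump(tokenizer, f, protocol=4)
```

Pointing tokenizer_model_folder at model_dir would then make the PreProcessor pick up pt.pickle for Portuguese documents.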

@danielbichuetti (Contributor, Author) commented Jul 12, 2022

Interestingly, the test in the GH workflow is failing because it can't determine the format of the model pt.pickle:

ERROR haystack.nodes.preprocessor.preprocessor:preprocessor.py:484 PreProcessor couldn't determine model format of sentence tokenizer at /home/runner/work/haystack/haystack/test/samples/preprocessor/nltk_models/pt.pickle.

In VS Code and on the command line on my local machine, pt.pickle is correct. Yesterday I uploaded a huge model that was not for legal documents; then I committed this small one for the current tests. Does the workflow have some caching enabled? Something that may be loading the old pickle model from the samples folder?

@ZanSara (Contributor) left a comment
Ok, I changed my mind a bit and I have a new proposal for tokenizer_model_folder. Nothing big, but I want to know what you think about this 🙂 I also have a few comments on the new method, but again, it's mostly small technicalities.

Review comments on haystack/nodes/preprocessor/base.py and haystack/nodes/preprocessor/preprocessor.py (outdated, resolved)
@ZanSara (Contributor) commented Jul 12, 2022

Does workflow have some caching enabled ? Something that may be loading the old pickle model from samples folder?

Yes, PYTHONPATH is cached, but this is the first time I see a problem like this... Super interesting, might be a bug in the way we do the caching. Share the findings if you have any, I'm going to investigate 👍

@danielbichuetti (Contributor, Author) commented Jul 12, 2022

Super interesting, might be a bug in the way we do the caching. Share the findings if you have any, I'm going to investigate

Yeah, this issue gave me a headache.

The issue is caused by my having saved the model with pickle protocol version 5. As the Python documentation says:

Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

The repo workflow runs on Python 3.7, so the highest supported protocol version is 4.

I'll generate a model using version 4 😅
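The mismatch can be verified directly from the file bytes, since the protocol number is recorded at the start of the pickle stream. A stdlib-only sketch:

```python
import pickle

data = {"language": "pt"}

# Protocol 5 requires Python 3.8+ to load; protocol 4 loads on 3.4+.
blob = pickle.dumps(data, protocol=4)

# For protocol >= 2, the stream starts with the PROTO opcode (0x80)
# followed by a single byte holding the protocol number.
assert blob[0:1] == pickle.PROTO
assert blob[1] == 4

# Round-trip still works, of course.
assert pickle.loads(blob) == data
```

Checking the second byte of a committed .pickle file is a quick way to confirm which protocol it was written with before the CI runner ever touches it.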

Review comments on haystack/nodes/preprocessor/preprocessor.py (outdated, resolved)
@tstadel (Member) left a comment

I added a small comment about renaming the method, as it currently doesn't feel intuitive to me. Besides that, it looks cool.

Review comment on haystack/nodes/preprocessor/preprocessor.py (outdated, resolved)
@danielbichuetti (Contributor, Author) commented
@tstadel I have refactored the method name as per your request. Much more intuitive now 😃

@ZanSara The improvements you proposed have been committed. 😄

@ZanSara (Contributor) left a comment

Looks good! Thank you 🙂

@tstadel (Member) left a comment

@danielbichuetti I found some minor things; however, they are worth fixing before merging. Could you please remove the superfluous params from the docstring and make the check for None explicit?

Review comments on haystack/nodes/preprocessor/preprocessor.py (outdated, resolved)
return nltk.tokenize.sent_tokenize(text, language="english")

# Use a default NLTK model
if language_name:
@tstadel (Member) commented:

And here too:

Suggested change:
- if language_name:
+ if language_name is not None:

@danielbichuetti (Contributor, Author) commented:

@tstadel I had done it this way, but @ZanSara made this point (which seems pretty plausible):

Just a minor thing here: language_name is not None will be True if language_name == "". With strings I think it's safer to check with not language_name, which will evaluate to False if language_name == "". I know that in the current code this can't happen, but I find it a bit more future-proof.

I have committed the if x is not None version again; please advise on any further adjustments.

@tstadel (Member) commented:

Sorry for bringing this up again, but I think an explicit None check is just clearer about what we're checking for here. Without it, it's harder to understand what happens if someone passes "" or even [] (especially for the model path). I would find it unintuitive for this method to just behave as if the params hadn't been set in those cases.
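The two checks under discussion differ only for falsy non-None values. A toy sketch (function names hypothetical) makes the trade-off concrete:

```python
def default_if_falsy(language_name):
    # Truthiness check: None, "" and [] all trigger the fallback.
    if not language_name:
        return "english"
    return language_name


def default_if_none(language_name):
    # Explicit check: only None triggers the fallback; "" is passed
    # through (and would presumably fail loudly wherever it's used).
    if language_name is None:
        return "english"
    return language_name
```

With the truthiness check, an accidental empty string silently falls back to the default; with the explicit check, it propagates and surfaces as an error closer to the real cause.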

Co-authored-by: tstadel
@danielbichuetti (Contributor, Author) commented
@tstadel Hi, could you review the changes? @ZanSara approved; waiting for your review now.

@tstadel tstadel merged commit 3948b99 into deepset-ai:master Jul 21, 2022
@danielbichuetti danielbichuetti deleted the add_preprocessor_custom_model_support branch July 21, 2022 11:49
Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
…t-ai#2783)

* Add support for model folder into BasePreProcessor

* First draft of custom model on PreProcessor

* Update Documentation & Code Style

* Update tests to support custom models

* Update Documentation & Code Style

* Test for wrong models in custom folder

* Default to ISO names on custom model folder

Use long names only when needed

* Update Documentation & Code Style

* Refactoring language names usage

* Update fallback logic

* Check unpickling error

* Updated tests using parametrize

Co-authored-by:  Sara Zan <sara.zanzottera@deepset.ai>

* Refactored common logic

* Add format control to NLTK load

* Tests improvements

Add a sample for specialized model

* Update Documentation & Code Style

* Minor log text update

* Log model format exception details

* Change pickle protocol version to 4 for 3.7 compat

* Removed unnecessary model folder parameter

Changed logic comparisons

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>

* Update Documentation & Code Style

* Removed unused import

* Change errors with warnings

* Change to absolute path

* Rename sentence tokenizer method

Co-authored-by: tstadel

* Check document content is a string before process

* Change to log errors and not warnings

* Update Documentation & Code Style

* Improve split sentences method

Co-authored-by:  Sara Zan  <sara.zanzottera@deepset.ai>

* Update Documentation & Code Style

* Empty commit - trigger workflow

* Remove superfluous parameters

Co-authored-by: tstadel

* Explicit None checking

Co-authored-by: tstadel

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>