Add support for custom trained PunktTokenizer in PreProcessor #2783
Conversation
Use long names only when needed
Hey @danielbichuetti! Nice one! 👍 I added some comments on the usage of pathlib, a note on the tests, and some other minor comments, but there's no issue with your approach. The PR already looks very good as it is! Let me know if anything is unclear.
@ZanSara I just read another topic regarding the use of Catalan with the sentence tokenizer, which is not one of Punkt's default languages. After reading the code and refactoring, I edited it here. I'll include a test for languages that are not in the hard-coded dictionary but that users have set up by name.
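The fallback behaviour being discussed can be sketched as a small stand-alone helper. This is illustrative only (the function name `load_sentence_tokenizer` is hypothetical, not the PR's actual method): look for a custom `<language>.pickle` in a user-supplied folder first, and only then fall back to the model bundled with NLTK, which covers languages outside NLTK's default set, such as Catalan.

```python
import pickle
from pathlib import Path

def load_sentence_tokenizer(language_name: str, model_folder=None):
    """Return a Punkt sentence tokenizer for `language_name`.

    Prefers a custom pickled model named <language_name>.pickle inside
    `model_folder`; falls back to the model shipped with NLTK otherwise.
    """
    if model_folder is not None:
        model_path = Path(model_folder) / f"{language_name}.pickle"
        if model_path.exists():
            with open(model_path, "rb") as f:
                return pickle.load(f)  # a trained PunktSentenceTokenizer
    # Fallback: NLTK's bundled model (requires the "punkt" data package)
    import nltk
    return nltk.data.load(f"tokenizers/punkt/{language_name}.pickle")
```

With this layout, a language not shipped by NLTK works as long as the user drops a `ca.pickle` (for example) into the model folder.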
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Add a sample for specialized model
@ZanSara I have refactored the code. Due to the automatic fallback handling adopted, the code was separated and the logs are specific to each scenario. This way, users can find their own errors easily. I added some extra tests, following your suggestions. The small Portuguese PunktSentenceTokenizer model in the samples folder has achieved much better results on legal documents than the default ones.
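For context, a specialized model like the Portuguese one mentioned above can be trained with NLTK's `PunktTrainer`. The snippet below is a minimal sketch, not the script used for the PR's sample model; `corpus_text` is a tiny stand-in for what would be a large domain corpus (e.g. legal documents).

```python
import pickle
from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

# Stand-in corpus; a real model is trained on a large plain-text corpus.
corpus_text = "Primeira frase do documento. Segunda frase. Terceira frase."

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # learn collocations (e.g. abbreviations)
trainer.train(corpus_text, finalize=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Persist under the ISO language name so a folder-based lookup can find it
with open("pt.pickle", "wb") as f:
    pickle.dump(tokenizer, f, protocol=4)
```

Saving under the ISO code (`pt.pickle`) matches the `language.pickle` naming convention the PR proposes for the custom model folder.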
What sounds interesting is that the GH Workflow test is failing because it can't determine the format of the model.
On VS Code and the command line on my local machine, the pt.pickle is correct. Yesterday I uploaded a huge model which was not for legal text; then I committed this small one for the current tests. Does the workflow have some caching enabled? Something that may be loading the old pickle model from the samples folder?
Ok, I changed my mind a bit and I have a new proposal for tokenizer_model_folder. Nothing big, but I want to know what you think about this 🙂 I also have a few comments on the new method, but again it's mostly small technicalities.
Yes.
Yeah, I got a headache from this issue. It is being caused because I saved the model using pickle protocol version 5, which, as the Python documentation notes, requires Python 3.8 or newer.
The repo workflow is running on Python 3.7, so the highest allowed protocol version is 4. I'll generate a model using version 4 😅
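The incompatibility is easy to reproduce: protocol 5 was added in Python 3.8 (PEP 574), so a pickle written with it fails to load on Python 3.7 with "unsupported pickle protocol: 5". Writing with protocol 4, as below (the dict is a stand-in for the trained tokenizer), keeps the file loadable on 3.7 runners.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained PunktSentenceTokenizer object
model = {"note": "stand-in model"}

# protocol=4 is readable on Python 3.4+; protocol=5 needs 3.8+
path = Path(tempfile.mkdtemp()) / "pt.pickle"
with open(path, "wb") as f:
    pickle.dump(model, f, protocol=4)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

The file header records the protocol, so a 3.7 interpreter can load this pickle regardless of which version wrote it.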
I added a small comment about renaming the method, as it currently doesn't feel intuitive to me. Besides that, it looks cool.
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Looks good! Thank you 🙂
@danielbichuetti I found some minor things; however, they are worth fixing before merging. Could you please remove the superfluous params from the docstring and make the check for None explicit?
return nltk.tokenize.sent_tokenize(text, language="english")

# Use a default NLTK model
if language_name:
And here too:
- if language_name:
+ if language_name is not None:
@tstadel I have done it this way, but @ZanSara made this point (which seems pretty plausible):
Just a minor thing here: language_name is not None will be True if language_name == "". With strings I think it's safer to check with not language_name, which will evaluate to False if language_name == "". I know that in the current code this can't happen, but I find it a bit more future-proof.
I have committed the if x is not None version again; please advise of any further adjustments.
Sorry for bringing this up again, but I think an explicit None check is just clearer about what we're checking for here. Without it, it's harder to understand what happens if someone passes "" or even [] (especially for the model path). I would find it unintuitive for this method to just work as if the params hadn't been set in those cases.
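The trade-off the reviewers are weighing can be shown with a tiny stand-alone sketch (the function names here are illustrative, not from the PR):

```python
def pick_branch_truthy(language_name):
    # `not language_name` treats both None and "" as "unset"
    return "fallback" if not language_name else "custom"

def pick_branch_none(language_name):
    # explicit check: only None means "unset"; "" reaches the custom branch
    return "fallback" if language_name is None else "custom"

pick_branch_truthy("")  # -> "fallback": an empty string is silently ignored
pick_branch_none("")    # -> "custom": "" is passed on, likely failing later
```

The truthiness check silently swallows empty strings, while the explicit `is None` check lets an empty value flow into the custom branch, where it will surface as a visible error rather than an unexpected fallback.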
Co-authored-by: tstadel
Co-authored-by: tstadel
…t-ai#2783)

* Add support for model folder into BasePreProcessor
* First draft of custom model on PreProcessor
* Update Documentation & Code Style
* Update tests to support custom models
* Update Documentation & Code Style
* Test for wrong models in custom folder
* Default to ISO names on custom model folder

  Use long names only when needed
* Update Documentation & Code Style
* Refactoring language names usage
* Update fallback logic
* Check unpickling error
* Updated tests using parametrize

  Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Refactored common logic
* Add format control to NLTK load
* Tests improvements

  Add a sample for specialized model
* Update Documentation & Code Style
* Minor log text update
* Log model format exception details
* Change pickle protocol version to 4 for 3.7 compat
* Removed unnecessary model folder parameter

  Changed logic comparisons

  Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Removed unused import
* Change errors with warnings
* Change to absolute path
* Rename sentence tokenizer method

  Co-authored-by: tstadel
* Check document content is a string before process
* Change to log errors and not warnings
* Update Documentation & Code Style
* Improve split sentences method

  Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Empty commit - trigger workflow
* Remove superfluous parameters

  Co-authored-by: tstadel
* Explicit None checking

  Co-authored-by: tstadel

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Related Issue(s): #2780 #2781
Proposed changes:
Added a new parameter, tokenizer_model_folder, that allows users to use custom trained models. The naming would be something like language.pickle. This change doesn't break the current class and doesn't prevent using the same parameter for another sentence tokenizer if the project decides on one in the future (spaCy and Stanza as examples). If a model for that specific language is present in this folder, PreProcessor will use it; if not, it falls back to the default one. Anyone who wants could have one folder with models for the legal domain, another for a medical tokenizer, and so on.
This is the first draft of the PR, so let me know where it can be further improved.
Pre-flight checklist