Add support for custom trained PunktTokenizer in PreProcessor #2783
Conversation
Use long names only when needed
Hey @danielbichuetti! Nice one! 👍 I added some comments on the usage of pathlib, a note on the tests, and some other minor comments, but there's no issue with your approach. The PR already looks very good as it is! Let me know if anything is unclear.
@ZanSara I just read another topic regarding the use of Catalan with the sentence tokenizer, which is not one of Punkt's default languages. After reading the code and refactoring, I edited it here. I'll include a test for languages that are not in the hard-coded dictionary but that users have set up by name.
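The fallback behaviour being discussed can be sketched as a small stand-alone helper. This is illustrative only (the function name `load_sentence_tokenizer` is hypothetical, not the PR's actual method): look for a custom `<language>.pickle` in a user-supplied folder first, and only then fall back to the model bundled with NLTK, which covers languages outside NLTK's default set, such as Catalan.

```python
import pickle
from pathlib import Path

def load_sentence_tokenizer(language_name: str, model_folder=None):
    """Return a Punkt sentence tokenizer for `language_name`.

    Prefers a custom pickled model named <language_name>.pickle inside
    `model_folder`; falls back to the model shipped with NLTK otherwise.
    """
    if model_folder is not None:
        model_path = Path(model_folder) / f"{language_name}.pickle"
        if model_path.exists():
            with open(model_path, "rb") as f:
                return pickle.load(f)  # a trained PunktSentenceTokenizer
    # Fallback: NLTK's bundled model (requires the "punkt" data package)
    import nltk
    return nltk.data.load(f"tokenizers/punkt/{language_name}.pickle")
```

With this layout, a language not shipped by NLTK works as long as the user drops a `ca.pickle` (for example) into the model folder.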
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Add a sample for specialized model
@ZanSara I have refactored the code. Due to the automatic fallback handling adopted, the code was separated and the logs are specific to each scenario. This way, users can find their own errors easily. I added some extra tests, following your suggestions. The small Portuguese PunktSentenceTokenizer model in the samples folder has achieved much better results on legal documents than the default ones.
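For context, a specialized model like the Portuguese one mentioned above can be trained with NLTK's `PunktTrainer`. The snippet below is a minimal sketch, not the script used for the PR's sample model; `corpus_text` is a tiny stand-in for what would be a large domain corpus (e.g. legal documents).

```python
import pickle
from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

# Stand-in corpus; a real model is trained on a large plain-text corpus.
corpus_text = "Primeira frase do documento. Segunda frase. Terceira frase."

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # learn collocations (e.g. abbreviations)
trainer.train(corpus_text, finalize=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Persist under the ISO language name so a folder-based lookup can find it
with open("pt.pickle", "wb") as f:
    pickle.dump(tokenizer, f, protocol=4)
```

Saving under the ISO code (`pt.pickle`) matches the `language.pickle` naming convention the PR proposes for the custom model folder.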
What sounds interesting is that the GH Workflow test is failing because it can't determine the format of the model.
On VS Code and the command line on my local machine, the pt.pickle is correct. Yesterday I uploaded a huge model which was not for legal text; then I committed this small one for the current tests. Does the workflow have some caching enabled? Something that may be loading the old pickle model from the samples folder?
Ok, I changed my mind a bit and I have a new proposal for tokenizer_model_folder. Nothing big, but I want to know what you think about this 🙂 I also have a few comments on the new method, but again it's mostly small technicalities.
Yes.
Yeah, I got a headache from this issue. It is being caused because I saved the model using pickle protocol version 5, which, as the Python documentation notes, requires Python 3.8 or newer.
The repo workflow is running on Python 3.7, so the highest allowed protocol version is 4. I'll generate a model using version 4 😅
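The incompatibility is easy to reproduce: protocol 5 was added in Python 3.8 (PEP 574), so a pickle written with it fails to load on Python 3.7 with "unsupported pickle protocol: 5". Writing with protocol 4, as below (the dict is a stand-in for the trained tokenizer), keeps the file loadable on 3.7 runners.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained PunktSentenceTokenizer object
model = {"note": "stand-in model"}

# protocol=4 is readable on Python 3.4+; protocol=5 needs 3.8+
path = Path(tempfile.mkdtemp()) / "pt.pickle"
with open(path, "wb") as f:
    pickle.dump(model, f, protocol=4)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

The file header records the protocol, so a 3.7 interpreter can load this pickle regardless of which version wrote it.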
I added a small comment about renaming the method, as it currently doesn't feel intuitive to me. Besides that, it looks cool.
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Looks good! Thank you 🙂
@danielbichuetti I found some minor things; however, they are worth fixing before merging. Could you please remove the superfluous params from the docstring and make the check for None explicit?
return nltk.tokenize.sent_tokenize(text, language="english")

# Use a default NLTK model
if language_name:
And here too:
- if language_name:
+ if language_name is not None:
@tstadel I have done it this way, but @ZanSara made this point (which seems pretty plausible):
Just a minor thing here: language_name is not None will be True if language_name == "". With strings I think it's safer to check with not language_name, which will evaluate to False if language_name == "". I know that in the current code this can't happen, but I find it a bit more future-proof.
I have committed the if x is not None version again; please advise of any further adjustments.
Sorry for bringing this up again, but I think an explicit None check is just clearer about what we're checking for here. Without it, it's harder to understand what happens if someone passes "" or even [] (especially for the model path). I would find it unintuitive for this method to just work as if the params hadn't been set in those cases.
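The trade-off the reviewers are weighing can be shown with a tiny stand-alone sketch (the function names here are illustrative, not from the PR):

```python
def pick_branch_truthy(language_name):
    # `not language_name` treats both None and "" as "unset"
    return "fallback" if not language_name else "custom"

def pick_branch_none(language_name):
    # explicit check: only None means "unset"; "" reaches the custom branch
    return "fallback" if language_name is None else "custom"

pick_branch_truthy("")  # -> "fallback": an empty string is silently ignored
pick_branch_none("")    # -> "custom": "" is passed on, likely failing later
```

The truthiness check silently swallows empty strings, while the explicit `is None` check lets an empty value flow into the custom branch, where it will surface as a visible error rather than an unexpected fallback.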
Co-authored-by: tstadel
Co-authored-by: tstadel
…t-ai#2783)

* Add support for model folder into BasePreProcessor
* First draft of custom model on PreProcessor
* Update Documentation & Code Style
* Update tests to support custom models
* Update Documentation & Code Style
* Test for wrong models in custom folder
* Default to ISO names on custom model folder

  Use long names only when needed
* Update Documentation & Code Style
* Refactoring language names usage
* Update fallback logic
* Check unpickling error
* Updated tests using parametrize

  Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Refactored common logic
* Add format control to NLTK load
* Tests improvements

  Add a sample for specialized model
* Update Documentation & Code Style
* Minor log text update
* Log model format exception details
* Change pickle protocol version to 4 for 3.7 compat
* Removed unnecessary model folder parameter

  Changed logic comparisons

  Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Removed unused import
* Change errors with warnings
* Change to absolute path
* Rename sentence tokenizer method

  Co-authored-by: tstadel
* Check document content is a string before process
* Change to log errors and not warnings
* Update Documentation & Code Style
* Improve split sentences method

  Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
* Update Documentation & Code Style
* Empty commit - trigger workflow
* Remove superfluous parameters

  Co-authored-by: tstadel
* Explicit None checking

  Co-authored-by: tstadel

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Related Issue(s): #2780 #2781
Proposed changes:
Added a new parameter, tokenizer_model_folder, that allows users to use custom trained models. The naming would be something like language.pickle. This change doesn't break the current class and doesn't prevent using the same parameter for another sentence tokenizer if the project decides on one in the future (spaCy and Stanza as examples). If a model for that specific language is present in this folder, PreProcessor will use it; if not, it falls back to the default one. Anyone who wants could have one folder with models for the legal domain, another for a medical tokenizer, and so on.
This is the first draft of the PR, so let me know where it can be further improved.
Pre-flight checklist