Improve _HTMLWordTruncator. Fix #2863#2864
Conversation
|
It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further. |
Sorry to bother you, but have you found the reason now? 🤣 |
|
@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]",especially the first part. It is too permissive. It includes punctuations, braces ( SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
# ASCII |Extended Latin | Cyrillicwith this test cases remain same. |
Well, should I close this pull request and open a new one with your improved code? And I think I have to check DBC |
|
Hi NianQing. 👋 Thank you for asking first, instead of just closing and creating a new PR. Many folks do that without asking first, which is a shame because it is totally unnecessary. So thanks again for asking! Pull requests are designed to be modified, so it is fairly straightforward to make follow-up changes. Any new commits that are pushed to your forked repo's
Following are some related links from our documentation that you may find helpful:
I hope this was helpful. How else might I assist you? |
45aacfd to
963c552
Compare
|
Thank you for your instruction! And now it's time to check these new code.😄 |
21ef383 to
3af8382
Compare
|
That's strange, I've run all the tests on my machine and no failure happen. 🤔 |
|
There is currently an issue with feedgenerator and tests are likely failing because of that. I don't think your changes are the cause. Speaking of tests, can you add one or two tests for |
3af8382 to
2417f85
Compare
Oh, I have found where the problem is. I used wrong regular expressions like |
Use more than one unicode block in _word_regex, making word count function behave properly with CJK, cyrillic and more latin characters when generating summary.
2417f85 to
73aca20
Compare
Sorry, I've just found another typo in regular expression. I used |
|
Merged manually via 22192c1. Many thanks to @ImBearChild for the nice enhancement and to @avaris for reviewing. ✨ |
Use more than one unicode block in _word_regex, making word count function behave properly with CJK, cyrillic and more latin characters when generating summary.
Pull Request Checklist
Resolves: #2863