Skip to content

Improve _HTMLWordTruncator. Fix #2863#2864

Closed
ImBearChild wants to merge 1 commit into
getpelican:masterfrom
ImBearChild:better-word-count
Closed

Improve _HTMLWordTruncator. Fix #2863#2864
ImBearChild wants to merge 1 commit into
getpelican:masterfrom
ImBearChild:better-word-count

Conversation

@ImBearChild

@ImBearChild ImBearChild commented Apr 4, 2021

Copy link
Copy Markdown
Contributor

Use more than one unicode block in _word_regex, making word count function behave properly with CJK, cyrillic and more latin characters when generating summary.

Pull Request Checklist

Resolves: #2863

  • Ensured tests pass and (if applicable) updated functional test output
  • Conformed to code style guidelines by running appropriate linting tools
  • Added tests for changed code (Not necessary)
  • Updated documentation for changed code (Not necessary)

@justinmayer justinmayer requested a review from avaris June 7, 2021 16:21
@avaris

avaris commented Jun 12, 2021

Copy link
Copy Markdown
Member

It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further.

@ImBearChild

Copy link
Copy Markdown
Contributor Author

It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further.

Sorry to bother you, but have you found the reason now? 🤣

@avaris

avaris commented Sep 24, 2021

Copy link
Copy Markdown
Member

@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC block:

SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]",

especially the first part. It is too permissive. It includes punctuations, braces ([ ( { } ) ]), mathematical signs and as well as a lot of control characters. Those are counted as words in some cases and changes the test cases. I think we need to divide it a bit more. Based on the codepoints and without making it way too complicated, I think this would be more appropriate:

SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
#         ASCII  |Extended Latin | Cyrillic

with this test cases remain same.

@ImBearChild

ImBearChild commented Sep 25, 2021

Copy link
Copy Markdown
Contributor Author

@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC block:

SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]",

especially the first part. It is too permissive. It includes punctuations, braces ([ ( { } ) ]), mathematical signs and as well as a lot of control characters. Those are counted as words in some cases and changes the test cases. I think we need to divide it a bit more. Based on the codepoints and without making it way too complicated, I think this would be more appropriate:

SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
#         ASCII  |Extended Latin | Cyrillic

with this test cases remain same.

Well, should I close this pull request and open a new one with your improved code? And I think I have to check DBC
part to make sure there's no punctuation included. I have no idea of how to modify a open pull request.

@justinmayer

Copy link
Copy Markdown
Member

Hi NianQing. 👋  Thank you for asking first, instead of just closing and creating a new PR. Many folks do that without asking first, which is a shame because it is totally unnecessary. So thanks again for asking!

Pull requests are designed to be modified, so it is fairly straightforward to make follow-up changes. Any new commits that are pushed to your forked repo's better-word-count branch will appear in this pull request. The flow looks something like this:

  1. Use the same feature branch (better-word-count) to make changes to code, tests, documentation, etc.
  2. Use git add to select and stage the changed files.
  3. Make a new commit via git commit.
  4. Push the new commit to your forked repo's better-word-count branch, via the same manner as you did the first time.

Following are some related links from our documentation that you may find helpful:

I hope this was helpful. How else might I assist you?

@ImBearChild ImBearChild force-pushed the better-word-count branch 2 times, most recently from 45aacfd to 963c552 Compare September 25, 2021 08:10
@ImBearChild

Copy link
Copy Markdown
Contributor Author

Thank you for your instruction! And now it's time to check these new code.😄

@ImBearChild ImBearChild force-pushed the better-word-count branch 3 times, most recently from 21ef383 to 3af8382 Compare September 25, 2021 08:21
@ImBearChild

Copy link
Copy Markdown
Contributor Author

That's strange, I've run all the tests on my machine and no failure happen. 🤔

@avaris

avaris commented Sep 27, 2021

Copy link
Copy Markdown
Member

There is currently an issue with feedgenerator and tests are likely failing because of that. I don't think your changes are the cause. Speaking of tests, can you add one or two tests for truncate_html_words with some CJK text here? Just to make sure we cover this for any future changes.

@ImBearChild

Copy link
Copy Markdown
Contributor Author

There is currently an issue with feedgenerator and tests are likely failing because of that.

Oh, I have found where the problem is. I used wrong regular expressions like"([\u20000–\u2A6DF])|" instead of "([\U00020000-\U0002A6DF])|". The wrong ones will actually behave like ([\u2000–\u2A6D])|, so extra punctuation is included. Now I fixed this and added some test case.

@avaris avaris left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you!

Use more than one unicode block in _word_regex, making word count
function behave properly with CJK, cyrillic and more latin characters
when generating summary.
@ImBearChild

Copy link
Copy Markdown
Contributor Author

Looks good. Thank you

Sorry, I've just found another typo in regular expression. I used instead of - in one regular expression. And now it's fixed. Sorry for that...

@justinmayer

Copy link
Copy Markdown
Member

Merged manually via 22192c1. Many thanks to @ImBearChild for the nice enhancement and to @avaris for reviewing. ✨

@justinmayer justinmayer added the manual merge Pull request was merged manually label Oct 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

manual merge Pull request was merged manually

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Setting SUMMARY_MAX_LENGTH doesn't apply to CJK content

3 participants