Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashtags parser fixes and improvements #4473

Merged
merged 6 commits into from
Nov 15, 2018

Conversation

leio10
Copy link
Contributor

@leio10 leio10 commented Nov 14, 2018

馃帺 What? Why?

When creating a proposal, the hashtag #acci贸n_mutante is parsed as #acci and the rest of the characters are left as plain text. When looking for fixing this, I've found other aspects of the code that I tried to improve:

  1. As said, HASHTAG_REGEX was not taking unicode characters into account. Also, it was defined in the model, while it is mainly related to the parser.
  2. There was one find_or_create_by executed for each hashtag detected. I've replaced with a query for every hashtag found in the content, and a creation for each non-existing hashtag.
  3. I added some test for additional scenarios: hashtag creation, repeated tags in the content, unicode chars and non-hashtag characters
  4. Code comments was copy&pasted from the UserParser class, I've updated them.

I think that, at least, improvements 1 and 2 could be applied to the UserParser class, I can add them in this PR if you agree with these improvements.

馃搶 Related Issues

馃搵 Subtasks

  • Add CHANGELOG entry

馃摲 Screenshots (optional)

@ghost ghost assigned leio10 Nov 14, 2018
@ghost ghost added the status: WIP label Nov 14, 2018
@leio10 leio10 changed the title Hashtag parser fixes and improvements Hashtags parser fixes and improvements Nov 14, 2018
@mrcasals
Copy link
Contributor

Wohooo! Looks real good @leio10, thanks!! Can you add a CHANGELOG entry? I'd add it under Fixed, saying that hashtags with unicode characters are now parsed correctly.

I think it'd be awesome if improvements 1 & 2 were applied to the UserParser too, what do you think @decidim/lot-mods?

@leio10
Copy link
Contributor Author

leio10 commented Nov 14, 2018

@mrcasals great!

By the way, I was researching a little and found that Decidim allows unicode chars on usernames (I have a username @le贸n now), while it seems that other services like Twitter don't (I didn't tested it, but I couldn't find a username like that after performing a random search). Maybe we should evaluate to stop allowing it? I think that usernames should be as simple and possible, allowing unicode we open the door to impersonation attacks as in domain names.

UPDATE: with this message we can confirm that github doesn't allow unicode in usernames, I've changed it to avoid the mention to the user @le. 馃槃

@mrcasals
Copy link
Contributor

Damn, I guess you're right... 馃槥

@leio10
Copy link
Contributor Author

leio10 commented Nov 14, 2018

@mrcasals Done, I've improved only the users query, without touching the unicode issue for users, as I understand that we don't want to allow unicode on usernames. Also, I've added the changelog entry.

Also, I've remembered another improvement I made to both parsers: I changed the boundaries for the regular expression. Instead of searching for the line starting or an space, I search for any word boundary. This allow to detect hashtags and users in contents like "blablabla (#hashtag) and:@user" that weren't detected before.

@mrcasals mrcasals requested review from mrcasals and oriolgual and removed request for mrcasals November 14, 2018 15:10
@mrcasals mrcasals merged commit efcd050 into decidim:master Nov 15, 2018
@leio10 leio10 deleted the fix/hashtag_parser branch November 15, 2018 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants