Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slugify can only handle ASCII chars #9

Closed
spaetz opened this issue Apr 19, 2012 · 5 comments
Closed

slugify can only handle ASCII chars #9

spaetz opened this issue Apr 19, 2012 · 5 comments
Labels
enhancement New feature or request

Comments

@spaetz
Copy link
Contributor

spaetz commented Apr 19, 2012

So far slugify can only work when the title consists of ASCII only (it removes some special chars, though). Being a German I tend to have a few Umlauts in my titles though and volt bombs out. I can see a few ways to handle this better from the volt side of things:

  1. Filter out even more special chars, using a list such as öäü. I did this by adding it to the list of RE_PRUNE (or similar) regexp.
  2. Filter out all non-ascii chars by adding \W to the list of pruned chars. (more general solution)
  3. Use maketrans and str.translate to replace some chars with others, this would not work well for e.g. ä which needs to become 'ae'.
  4. Add more replacement rules that replace öäü with oe, ae, ue etc...
  5. Allow to have umlauts in slugs (not sure how web servers handle that)

Personally, I think #2 might be the easiest solution.This seems also in line with what e.g. django does on slugify:
"Converts to lowercase, removes non-word characters (alphanumerics and underscores) and converts spaces to hyphens. Also strips leading and trailing whitespace."

If RE_PRUNE would strip all non-ascii chars, this would also make the whole non-ascii check in lines 459-469 redundant.

@bow
Copy link
Owner

bow commented Apr 19, 2012

    1. doesn't seem to be a good idea, specifically for the reason you mentioned.
  • Applying 5) sounds like it might complicate deployment, so it's a no go.
    1. is basically a subset of 2), which does seem more sensible and the easiest to implement.
    1. is seems doable as well.

As a person who regularly deals with modified latin vowels yourself, which one do you prefer?

I was thinking of doing no 4), but with an extra option in CONFIG.SITE whether to do this replacement or not (e.g. CONFIG.SITE.STRIP_MODIFIED_LATIN_CHARS. If this option is set to True (the default), we would do number 2. If set to False, we'd do number 4).

Alternatively, we could let the user set their own desired character replacement scheme in CONFIG.SITE . For example, setting CONFIG.SITE.SLUG_REPLACE_CHARS to (('ä', 'ae'), ('ö', 'oe') would replace all ä with ae and so on. If it's not set, then we do no. 2). I imagine this would let the slug retain its original meaning (no missing vowels/consonants) while not incurring too many unecessary overhead on Volt's part.

@spaetz
Copy link
Contributor Author

spaetz commented Apr 20, 2012

Personally, I would prefer no 4, as it would not lose information, and slugs would still be sensible words. A slug that is "Khe" versus "Kuehe" (cows) is basically nonsensical. The only reason speaking against it is the additional regex rules, ie time consumption.
I would not thing that we would need a configuration setting for this. If we can do #4, this is preferable to dropping letters in any case.

@bow
Copy link
Owner

bow commented Apr 20, 2012

I agree with you on no. 4). However, I do think that we need to include an extra configuration setting, because of the following.

The number of non-ascii vowels / consonants which can be replaced with one or several ascii-characters can be large. If we do a replacement for each of them, that would waste a lot of CPU cycles. I also suspect each user will only be concerned with a subset of these non-ascii characters, making this a bigger problem. With an extra configuration setting, we can just let Volt do a simple if True check before processing the slug.

There is the drawback that anyone who wishes to use the non-ascii vowels / consonants will have to set their own rules, which could be repetitive. However, this may be solved by including a language-specific defaults, so the user can specify their desired replacement specifically or just specify a language-dependent scheme.

In addition, I was also thinking that we might not need to do regex. I don't know the speed comparison exactly (should look it up), but we can also do a simple str.replace('ä', 'ae'), for example. Of course, this would mean the code would be more verbose, but if the speed gain could be considerable, I'm in favor of it.

EDIT: I made a new branch to play around with the idea, and it seems to work ok for now (all tests passed).

@spaetz
Copy link
Contributor Author

spaetz commented Apr 23, 2012

I have now added some code comments to dac134d#commitcomment-1243896 . Looks good in general. Thanks!

@bow
Copy link
Owner

bow commented Apr 23, 2012

Replied there as well. I'll push my fixes soon :).

@bow bow closed this as completed Apr 23, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants