slugify can only handle ASCII chars #9

spaetz · 2012-04-19T08:01:39Z

So far slugify can only work when the title consists of ASCII only (it removes some special chars, though). Being a German I tend to have a few Umlauts in my titles though and volt bombs out. I can see a few ways to handle this better from the volt side of things:

Filter out even more special chars, using a list such as öäü. I did this by adding it to the list of RE_PRUNE (or similar) regexp.
Filter out all non-ascii chars by adding \W to the list of pruned chars. (more general solution)
Use maketrans and str.translate to replace some chars with others, this would not work well for e.g. ä which needs to become 'ae'.
Add more replacement rules that replace öäü with oe, ae, ue etc...
Allow to have umlauts in slugs (not sure how web servers handle that)

Personally, I think #2 might be the easiest solution.This seems also in line with what e.g. django does on slugify:
"Converts to lowercase, removes non-word characters (alphanumerics and underscores) and converts spaces to hyphens. Also strips leading and trailing whitespace."

If RE_PRUNE would strip all non-ascii chars, this would also make the whole non-ascii check in lines 459-469 redundant.

bow · 2012-04-19T08:22:14Z

1. doesn't seem to be a good idea, specifically for the reason you mentioned.
Applying 5) sounds like it might complicate deployment, so it's a no go.
1. is basically a subset of 2), which does seem more sensible and the easiest to implement.
1. is seems doable as well.

As a person who regularly deals with modified latin vowels yourself, which one do you prefer?

I was thinking of doing no 4), but with an extra option in CONFIG.SITE whether to do this replacement or not (e.g. CONFIG.SITE.STRIP_MODIFIED_LATIN_CHARS. If this option is set to True (the default), we would do number 2. If set to False, we'd do number 4).

Alternatively, we could let the user set their own desired character replacement scheme in CONFIG.SITE . For example, setting CONFIG.SITE.SLUG_REPLACE_CHARS to (('ä', 'ae'), ('ö', 'oe') would replace all ä with ae and so on. If it's not set, then we do no. 2). I imagine this would let the slug retain its original meaning (no missing vowels/consonants) while not incurring too many unecessary overhead on Volt's part.

spaetz · 2012-04-20T13:06:09Z

Personally, I would prefer no 4, as it would not lose information, and slugs would still be sensible words. A slug that is "Khe" versus "Kuehe" (cows) is basically nonsensical. The only reason speaking against it is the additional regex rules, ie time consumption.
I would not thing that we would need a configuration setting for this. If we can do #4, this is preferable to dropping letters in any case.

bow · 2012-04-20T13:22:24Z

I agree with you on no. 4). However, I do think that we need to include an extra configuration setting, because of the following.

The number of non-ascii vowels / consonants which can be replaced with one or several ascii-characters can be large. If we do a replacement for each of them, that would waste a lot of CPU cycles. I also suspect each user will only be concerned with a subset of these non-ascii characters, making this a bigger problem. With an extra configuration setting, we can just let Volt do a simple if True check before processing the slug.

There is the drawback that anyone who wishes to use the non-ascii vowels / consonants will have to set their own rules, which could be repetitive. However, this may be solved by including a language-specific defaults, so the user can specify their desired replacement specifically or just specify a language-dependent scheme.

In addition, I was also thinking that we might not need to do regex. I don't know the speed comparison exactly (should look it up), but we can also do a simple str.replace('ä', 'ae'), for example. Of course, this would mean the code would be more verbose, but if the speed gain could be considerable, I'm in favor of it.

EDIT: I made a new branch to play around with the idea, and it seems to work ok for now (all tests passed).

spaetz · 2012-04-23T10:24:00Z

I have now added some code comments to dac134d#commitcomment-1243896 . Looks good in general. Thanks!

bow · 2012-04-23T10:31:46Z

Replied there as well. I'll push my fixes soon :).

bow closed this as completed Apr 23, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slugify can only handle ASCII chars #9

slugify can only handle ASCII chars #9

spaetz commented Apr 19, 2012

bow commented Apr 19, 2012

spaetz commented Apr 20, 2012

bow commented Apr 20, 2012

spaetz commented Apr 23, 2012

bow commented Apr 23, 2012

slugify can only handle ASCII chars #9

slugify can only handle ASCII chars #9

Comments

spaetz commented Apr 19, 2012

bow commented Apr 19, 2012

spaetz commented Apr 20, 2012

bow commented Apr 20, 2012

spaetz commented Apr 23, 2012

bow commented Apr 23, 2012