
Add new unidecode_translate method #79

Open
wants to merge 3 commits into master

Conversation

@marcoffee commented Aug 11, 2022

This method behaves similarly to unidecode_expect_nonascii, but it uses a preloaded translation dict built from the xNNN.py files in the unidecode folder. This dictionary is then fed to str.translate.
It raises the same errors as unidecode, but only checks for surrogates when the check_surrogates parameter is True.
Since it has to load the dictionary on every initialization (I could not generate a cache for this case), it is slower than unidecode_expect_nonascii when used from the command-line utility, but it is faster in applications that convert many strings.
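The core idea can be illustrated with a tiny hand-written table (the real table would be built from the unidecode.xNNN data modules; this is just a minimal sketch of how str.translate consumes such a dict):

```python
# str.translate accepts a dict mapping code points (ints) to replacement
# strings. A preloaded transliteration table is just a big version of this:
table = {
    ord('ã'): 'a',
    ord('ç'): 'c',
}

result = "ãbç".translate(table)
print(result)  # -> "abc"
```

Characters not present in the table ('b' above) pass through unchanged, which is why the full table only needs entries for non-ASCII code points.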

Here are the results of benchmark.py when run with each configuration (I simply replaced its internal calls with each of these methods):

  • unidecode:
unidecode_expect_ascii, ASCII string
2000000 loops, best of 5: 104 nsec per loop
unidecode_expect_ascii, non-ASCII string
100000 loops, best of 5: 2.8 usec per loop
unidecode_expect_nonascii, ASCII string
100000 loops, best of 5: 2.31 usec per loop
unidecode_expect_nonascii, non-ASCII string
100000 loops, best of 5: 2.46 usec per loop
  • unidecode_translate with check_surrogates=True
unidecode_expect_ascii, ASCII string
2000000 loops, best of 5: 108 nsec per loop
unidecode_expect_ascii, non-ASCII string
200000 loops, best of 5: 1.77 usec per loop
unidecode_expect_nonascii, ASCII string
200000 loops, best of 5: 1.32 usec per loop
unidecode_expect_nonascii, non-ASCII string
200000 loops, best of 5: 1.36 usec per loop
  • unidecode_translate with check_surrogates=False
unidecode_expect_ascii, ASCII string
2000000 loops, best of 5: 109 nsec per loop
unidecode_expect_ascii, non-ASCII string
200000 loops, best of 5: 1.21 usec per loop
unidecode_expect_nonascii, ASCII string
500000 loops, best of 5: 796 nsec per loop
unidecode_expect_nonascii, non-ASCII string
500000 loops, best of 5: 862 nsec per loop

It is also faster for big strings, as can be seen in the following benchmark:

In [1]: import unidecode as udec

In [2]: big_str = "ãbç" * 100000

In [3]: %timeit udec.unidecode_expect_nonascii(big_str)
78 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit udec.unidecode_translate(big_str, check_surrogates=False)
7.67 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit udec.unidecode_translate(big_str, check_surrogates=True)
21.5 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the tests in the tests folder also pass for the unidecode_translate method, given that check_surrogates=True. The cases that compare the exception context to None fail (even when using raise ... from None), but this is easily solved by storing the exception object in a variable and raising it outside the try-except block.
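The workaround described above can be sketched as follows (the names `UnidecodeError` and `lookup` are hypothetical stand-ins, not the PR's actual code):

```python
class UnidecodeError(ValueError):
    pass

_table = {0xe3: 'a'}  # toy table: ã -> a

def lookup(codepoint):
    return _table[codepoint]  # may raise KeyError

def decode_or_raise(codepoint):
    error = None
    try:
        return lookup(codepoint)
    except KeyError:
        # `raise ... from None` only suppresses *display* of __context__;
        # the attribute itself is still set, so a test asserting
        # `exc.__context__ is None` would fail. Store the exception instead.
        error = UnidecodeError('no replacement for U+%04X' % codepoint)
    # Raised after the except block has exited, so __context__ stays None.
    raise error
```

Because the exception is raised outside any active handler, `__context__` is genuinely None rather than merely hidden.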

@avian2 (Owner) commented Aug 17, 2022

Thanks for this pull request. I like the performance increase and I think using str.translate might be interesting for use in Unidecode. I see some minor issues in the code, but they look easy to fix.

However the main issue I have with this change is that it basically duplicates all Unidecode functionality in another function. I don't like having two separate implementations.

I would be interested in exploring the possibility of just replacing the current implementation with one based on str.translate.

For a long-term running program, preloading the tables shouldn't have much overhead since the current implementation already caches the tables. In the long-term the cache ends up loading all translations anyway. I'm not sure how many people only use Unidecode for short runs though.

Maybe the Translator object for str.translate() can act as a cache/wrapper around the current _get_repl_str()? Perhaps something based on collections.defaultdict? That could end up being very close to the current implementation as far as memory usage is concerned.
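One wrinkle with collections.defaultdict is that its default_factory takes no arguments, so it cannot look anything up by code point; a plain dict subclass with __missing__(key) is a closer fit for wrapping a _get_repl_str-style lookup. A minimal sketch (the wiring to the real _get_repl_str is hypothetical, shown here with a toy lookup):

```python
class ReplCache(dict):
    """Lazy cache: unknown code points are resolved once, then memoized."""

    def __init__(self, lookup):
        super().__init__()
        self._lookup = lookup  # e.g. the existing _get_repl_str (hypothetical)

    def __missing__(self, codepoint):
        repl = self._lookup(codepoint)
        self[codepoint] = repl  # cache for subsequent lookups
        return repl

# Usage with a toy lookup function:
cache = ReplCache(lambda cp: 'a' if cp == 0xe3 else None)
print("ãã".translate(cache))  # -> "aa"; second 'ã' hits the cache
```

str.translate calls the mapping's __getitem__, and a dict subclass falls through to __missing__ only on a cache miss, so memory usage grows with the characters actually seen rather than with all loaded sections.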

@horsemankukka

I tried using collections.UserDict with __missing__() essentially acting as _get_repl_str(), but adding the missing section directly to self.data when loading, and also caching missing sections separately in a set so None can be returned quickly for those. The performance increase was impressive (it also roughly doubles the throughput in benchmark.py), but I am not sure what the most elegant way to handle errors and replace_str would be here. I have not done any further testing either, but doing this dynamically seems completely plausible.

import warnings
from collections import UserDict
from itertools import zip_longest

class UnidecodeCache(UserDict):
    missing_sections = set()

    def __missing__(self, codepoint):
        if codepoint < 0x80:
            # Already ASCII: raising LookupError makes str.translate
            # leave the character untouched.
            raise LookupError()

        if codepoint > 0xeffff:
            # No data on characters in Private Use Area and above.
            return None

        if 0xd800 <= codepoint <= 0xdfff:
            warnings.warn("Surrogate character %r will be ignored. "
                          "You might be using a narrow Python build." % (chr(codepoint),),
                          RuntimeWarning, 2)
            return None

        section = codepoint >> 8   # Chop off the last two hex digits

        if section in self.missing_sections:
            return None

        try:
            mod = __import__('unidecode.x%03x' % (section,), globals(), locals(), ['data'])
        except ImportError:
            # No data on this character
            self.missing_sections.add(section)
            return None

        # Cache the whole section at once; zip_longest pads short data
        # tables with None so every code point in the section gets an entry.
        for k, v in zip_longest(range(256), mod.data):
            self.data[(section << 8) | k] = v

        return self.data[codepoint]

Cache = UnidecodeCache()

# ...

    def _unidecode(string: str, errors: str, replace_str: str) -> str:
        return string.translate(Cache)

Furthermore, it initially looks like the performance of unidecode_expect_ascii might be improved by adding if string.isascii(): return string. At least it should not logically worsen it.
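That fast path can be sketched like this (`_slow_transliterate` is a toy stand-in for the existing non-ASCII code path, not the library's actual function):

```python
def _slow_transliterate(string):
    # Stand-in for the existing character-by-character path.
    return string.translate({ord('ã'): 'a', ord('ç'): 'c'})

def unidecode_expect_ascii_fast(string):
    # str.isascii() (Python 3.7+) is a cheap C-level scan, so pure-ASCII
    # input returns immediately instead of going through the try/except
    # encode path.
    if string.isascii():
        return string
    return _slow_transliterate(string)
```

For ASCII input this avoids any per-character work; for non-ASCII input it adds only one linear scan before the existing path runs.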

@marcoffee force-pushed the unicode-translate-method branch 3 times, most recently from bc4fbb5 to 0fd9e6a on December 15, 2023