
Add new unidecode_translate method #79

Open
wants to merge 3 commits into master

Conversation

@marcoffee commented Aug 11, 2022

This method behaves similarly to unidecode_expect_nonascii, but it uses a preloaded translation dict built from the xNNN.py files in the unidecode folder. This dictionary is then fed to str.translate.
It raises the same errors as unidecode, but only checks for surrogates when the check_surrogates parameter is True.
Since it has to load the dictionary on every initialization (I could not generate a cache for this case), it is slower than unidecode_expect_nonascii when used from the command-line utility, but it is faster in applications that convert many strings.
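The core idea can be illustrated with a tiny hand-written table (the real table would be built from the unidecode.xNNN data modules; this is just a minimal sketch of how str.translate consumes such a dict):

```python
# str.translate accepts a dict mapping code points (ints) to replacement
# strings. A preloaded transliteration table is just a big version of this:
table = {
    ord('ã'): 'a',
    ord('ç'): 'c',
}

result = "ãbç".translate(table)
print(result)  # -> "abc"
```

Characters not present in the table ('b' above) pass through unchanged, which is why the full table only needs entries for non-ASCII code points.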

Here are the results of benchmark.py when run with each configuration (I simply replaced its internal calls with each of these methods):

  • unidecode:
unidecode_expect_ascii, ASCII string
2000000 loops, best of 5: 104 nsec per loop
unidecode_expect_ascii, non-ASCII string
100000 loops, best of 5: 2.8 usec per loop
unidecode_expect_nonascii, ASCII string
100000 loops, best of 5: 2.31 usec per loop
unidecode_expect_nonascii, non-ASCII string
100000 loops, best of 5: 2.46 usec per loop
  • unidecode_translate with check_surrogates=True
unidecode_expect_ascii, ASCII string
2000000 loops, best of 5: 108 nsec per loop
unidecode_expect_ascii, non-ASCII string
200000 loops, best of 5: 1.77 usec per loop
unidecode_expect_nonascii, ASCII string
200000 loops, best of 5: 1.32 usec per loop
unidecode_expect_nonascii, non-ASCII string
200000 loops, best of 5: 1.36 usec per loop
  • unidecode_translate with check_surrogates=False
unidecode_expect_ascii, ASCII string
2000000 loops, best of 5: 109 nsec per loop
unidecode_expect_ascii, non-ASCII string
200000 loops, best of 5: 1.21 usec per loop
unidecode_expect_nonascii, ASCII string
500000 loops, best of 5: 796 nsec per loop
unidecode_expect_nonascii, non-ASCII string
500000 loops, best of 5: 862 nsec per loop

It is also faster for big strings, as can be seen in the following benchmark:

In [1]: import unidecode as udec

In [2]: big_str = "ãbç" * 100000

In [3]: %timeit udec.unidecode_expect_nonascii(big_str)
78 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit udec.unidecode_translate(big_str, check_surrogates=False)
7.67 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit udec.unidecode_translate(big_str, check_surrogates=True)
21.5 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the tests in the tests folder also pass for the unidecode_translate method, given that check_surrogates=True. The cases that compare the exception context to None fail (even when using raise ... from None), but this is easily solved by storing the exception object in a variable and raising it outside the try-except block.
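The workaround described above can be sketched as follows (the names `UnidecodeError` and `lookup` are hypothetical stand-ins, not the PR's actual code):

```python
class UnidecodeError(ValueError):
    pass

_table = {0xe3: 'a'}  # toy table: ã -> a

def lookup(codepoint):
    return _table[codepoint]  # may raise KeyError

def decode_or_raise(codepoint):
    error = None
    try:
        return lookup(codepoint)
    except KeyError:
        # `raise ... from None` only suppresses *display* of __context__;
        # the attribute itself is still set, so a test asserting
        # `exc.__context__ is None` would fail. Store the exception instead.
        error = UnidecodeError('no replacement for U+%04X' % codepoint)
    # Raised after the except block has exited, so __context__ stays None.
    raise error
```

Because the exception is raised outside any active handler, `__context__` is genuinely None rather than merely hidden.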

@avian2 (Owner) commented Aug 17, 2022

Thanks for this pull request. I like the performance increase and I think using str.translate might be interesting for use in Unidecode. I see some minor issues in the code, but they look easy to fix.

However the main issue I have with this change is that it basically duplicates all Unidecode functionality in another function. I don't like having two separate implementations.

I would be interested in exploring the possibility of just replacing the current implementation with one based on str.translate.

For a long-term running program, preloading the tables shouldn't have much overhead since the current implementation already caches the tables. In the long-term the cache ends up loading all translations anyway. I'm not sure how many people only use Unidecode for short runs though.

Maybe the Translator object for str.translate() can act as a cache/wrapper around the current _get_repl_str()? Perhaps something based on collections.defaultdict? That could end up being very close to the current implementation as far as memory usage is concerned.
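One wrinkle with collections.defaultdict is that its default_factory takes no arguments, so it cannot look anything up by code point; a plain dict subclass with __missing__(key) is a closer fit for wrapping a _get_repl_str-style lookup. A minimal sketch (the wiring to the real _get_repl_str is hypothetical, shown here with a toy lookup):

```python
class ReplCache(dict):
    """Lazy cache: unknown code points are resolved once, then memoized."""

    def __init__(self, lookup):
        super().__init__()
        self._lookup = lookup  # e.g. the existing _get_repl_str (hypothetical)

    def __missing__(self, codepoint):
        repl = self._lookup(codepoint)
        self[codepoint] = repl  # cache for subsequent lookups
        return repl

# Usage with a toy lookup function:
cache = ReplCache(lambda cp: 'a' if cp == 0xe3 else None)
print("ãã".translate(cache))  # -> "aa"; second 'ã' hits the cache
```

str.translate calls the mapping's __getitem__, and a dict subclass falls through to __missing__ only on a cache miss, so memory usage grows with the characters actually seen rather than with all loaded sections.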

@horsemankukka

I tried using collections.UserDict with __missing__() essentially acting as _get_repl_str(), but adding the missing section directly to self.data when loading, and also caching missing sections separately in a set so None can be returned quickly for those. The performance increase was impressive (it also roughly doubles the throughput in benchmark.py), but I am not sure what the most elegant way to handle errors and replace_str would be here. I have not done any further testing either, but doing this dynamically seems completely plausible.

import warnings
from collections import UserDict
from itertools import zip_longest

class UnidecodeCache(UserDict):
    missing_sections = set()

    def __missing__(self, codepoint):
        if codepoint < 0x80:
            # Already ASCII: raising LookupError makes str.translate
            # leave the character untouched.
            raise LookupError()

        if codepoint > 0xeffff:
            # No data on characters in Private Use Area and above.
            return None

        if 0xd800 <= codepoint <= 0xdfff:
            warnings.warn("Surrogate character %r will be ignored. "
                          "You might be using a narrow Python build." % (chr(codepoint),),
                          RuntimeWarning, 2)
            return None

        section = codepoint >> 8   # Chop off the last two hex digits

        if section in self.missing_sections:
            return None

        try:
            mod = __import__('unidecode.x%03x' % (section,), globals(), locals(), ['data'])
        except ImportError:
            # No data on this character
            self.missing_sections.add(section)
            return None

        # Cache the whole section at once; zip_longest pads short data
        # tables with None so every code point in the section gets an entry.
        for k, v in zip_longest(range(256), mod.data):
            self.data[(section << 8) | k] = v

        return self.data[codepoint]

Cache = UnidecodeCache()

# ...

    def _unidecode(string: str, errors: str, replace_str: str) -> str:
        return string.translate(Cache)

Furthermore, it initially looks like the performance of unidecode_expect_ascii might be improved by adding if string.isascii(): return string. At least it should not logically worsen it.
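That fast path can be sketched like this (`_slow_transliterate` is a toy stand-in for the existing non-ASCII code path, not the library's actual function):

```python
def _slow_transliterate(string):
    # Stand-in for the existing character-by-character path.
    return string.translate({ord('ã'): 'a', ord('ç'): 'c'})

def unidecode_expect_ascii_fast(string):
    # str.isascii() (Python 3.7+) is a cheap C-level scan, so pure-ASCII
    # input returns immediately instead of going through the try/except
    # encode path.
    if string.isascii():
        return string
    return _slow_transliterate(string)
```

For ASCII input this avoids any per-character work; for non-ASCII input it adds only one linear scan before the existing path runs.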

@marcoffee force-pushed the unicode-translate-method branch 3 times, most recently from bc4fbb5 to 0fd9e6a on December 15, 2023