Add new unidecode_translate method #79
base: master
Conversation
Force-pushed from 9d8c7d1 to 4f8e0f0.
Thanks for this pull request. I like the performance increase and I think using `str.translate` is an interesting idea.

However, the main issue I have with this change is that it basically duplicates all Unidecode functionality in another function. I don't like having two separate implementations. I would be interested in exploring the possibility of just replacing the current implementation with one based on `str.translate`.

For a long-running program, preloading the tables shouldn't have much overhead, since the current implementation already caches the tables. In the long term the cache ends up loading all translations anyway. I'm not sure how many people only use Unidecode for short runs, though. Maybe the translator object passed to `str.translate` could be made to load the data lazily?
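For context (my illustration, not part of the PR): `str.translate` accepts any mapping from Unicode code points to replacement strings, a `LookupError` during lookup leaves the character unchanged, and a value of `None` deletes it. That contract is what makes a translate-based implementation feasible:

```python
# Minimal illustration of the str.translate contract discussed above.
# Keys are Unicode code points; values are replacement strings or None.
table = {
    ord("é"): "e",
    ord("ß"): "ss",
    ord("™"): None,  # None deletes the character
}

# "ü" has no entry, so the implicit KeyError leaves it unchanged.
print("café™ süß".translate(table))  # -> "cafe süss"
```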
Tried using `UserDict` with a `__missing__` method to load the table sections lazily:

```python
import warnings
from collections import UserDict
from itertools import zip_longest


class UnidecodeCache(UserDict):
    missing_sections = set()

    def __missing__(self, codepoint):
        if codepoint < 0x80:
            # Already ASCII: raising LookupError makes str.translate
            # keep the character unchanged.
            raise LookupError()

        if codepoint > 0xeffff:
            # No data on characters in Private Use Area and above.
            return None

        if 0xd800 <= codepoint <= 0xdfff:
            warnings.warn("Surrogate character %r will be ignored. "
                          "You might be using a narrow Python build." % (chr(codepoint),),
                          RuntimeWarning, 2)
            return None

        section = codepoint >> 8  # Chop off the last two hex digits
        if section in self.missing_sections:
            return None

        try:
            mod = __import__('unidecode.x%03x' % (section,), globals(), locals(), ['data'])
        except ImportError:
            # No data on this section of the code space.
            self.missing_sections.add(section)
            return None

        # Cache the whole 256-codepoint section at once, padding with None
        # if the data table is shorter than 256 entries.
        for k, v in zip_longest(range(256), mod.data):
            self.data[(section << 8) | k] = v

        return self.data[codepoint]


Cache = UnidecodeCache()

# ...

def _unidecode(string: str, errors: str, replace_str: str) -> str:
    return string.translate(Cache)
```

Furthermore, initially it looks like the performance of […]
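A quick smoke test for the sketch above (my example; it assumes the unidecode data modules are importable and uses the names defined in the sketch):

```python
print(_unidecode("Žluťoučký kůň", "ignore", ""))  # -> "Zlutoucky kun"
# Sections are cached in 256-codepoint blocks as they are first touched:
print(sorted({cp >> 8 for cp in Cache.data}))     # e.g. [0, 1] for this input
```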
Force-pushed from bc4fbb5 to 0fd9e6a.
Force-pushed from 0fd9e6a to b43eb0a.
This method behaves similarly to `unidecode_expect_nonascii`, but it uses a preloaded translation dict, built from the `xNNN.py` files in the `unidecode` folder. This dictionary is then fed to `str.translate`. It throws the same errors as `unidecode`, but only checks for surrogates if the `check_surrogates` param is True.

Since it requires loading the dictionary on every initialization (I could not generate a cache for this case), it is slower than `unidecode_expect_nonascii` for one-off use in the utility, but faster in applications that convert many strings.

Here are the results of benchmark.py when run with each configuration (I just replaced the internal calls to each of these methods):

- `unidecode`: …
- `unidecode_translate` with `check_surrogates=True`: …
- `unidecode_translate` with `check_surrogates=False`: …
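The patch itself is not reproduced in this thread, but based on the description above, the eager preloading could look roughly like this (a sketch under my own naming; the real `unidecode_translate` signature, surrogate check, and error handling may differ):

```python
from typing import Dict, Optional

def _load_full_table() -> Dict[int, Optional[str]]:
    """Eagerly import every unidecode.xNNN data module into one dict."""
    table: Dict[int, Optional[str]] = {}
    for section in range(0xf00):  # code points up to 0xeffff, 256 per section
        try:
            mod = __import__('unidecode.x%03x' % (section,),
                             globals(), locals(), ['data'])
        except ImportError:
            continue  # no data module for this section
        for offset, replacement in enumerate(mod.data):
            codepoint = (section << 8) | offset
            if codepoint < 0x80:
                continue  # leave ASCII untouched, as unidecode does
            table[codepoint] = replacement
    return table

_TABLE = _load_full_table()

def unidecode_translate(string: str, check_surrogates: bool = False) -> str:
    # Sketch only: the surrogate check and error handling from the
    # actual PR are omitted here.
    return string.translate(_TABLE)
```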
It is also faster for big strings, which can be seen in the following benchmark:
Note that the tests located in the `tests` folder also work for the `unidecode_translate` method, given that `check_surrogates=True`. The cases where they compare the exception context to `None` fail (even with the usage of `raise ... from None`), but this can easily be solved by storing the exception object in a variable and raising it outside the try/except block.
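To illustrate that last point (my example, not from the test suite): raising inside an `except` block sets `__context__` on the new exception even with `from None`, while storing the exception and raising it after the block leaves `__context__` as `None`:

```python
def raise_inside():
    try:
        raise KeyError("inner")
    except KeyError:
        # __context__ is set here, even though `from None` suppresses
        # its display in the traceback.
        raise ValueError("outer") from None

def raise_outside():
    error = None
    try:
        raise KeyError("inner")
    except KeyError:
        error = ValueError("outer")  # only build the exception here
    if error is not None:
        raise error  # no exception is being handled: __context__ stays None

for fn in (raise_inside, raise_outside):
    try:
        fn()
    except ValueError as exc:
        print(fn.__name__, exc.__context__)
# raise_inside 'inner'
# raise_outside None
```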