Internal error 500: UnicodeEncodeError: 'utf-8' codec can't encode characters #1832

Open
SamantazFox opened this issue Feb 7, 2020 · 6 comments
Labels
bug

@SamantazFox commented Feb 7, 2020

I ran an image search and got an internal error 500.
I repeated it with debugging enabled; here is the call stack:

Traceback (most recent call last):
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)
  File "/srv/searx/searx/webapp.py", line 973, in __call__
    return self.app(environ, start_response)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/werkzeug/middleware/proxy_fix.py", line 169, in __call__
    return self.app(environ, start_response)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 2295, in wsgi_app
    response = self.handle_exception(e)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 1741, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 1816, in full_dispatch_request
    return self.finalize_request(rv)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 1831, in finalize_request
    response = self.make_response(rv)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/flask/app.py", line 1968, in make_response
    rv = self.response_class(rv, status=status, headers=headers)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/werkzeug/wrappers/base_response.py", line 212, in __init__
    self.set_data(response)
  File "/srv/searx/searx-ve/lib/python3.7/site-packages/werkzeug/wrappers/base_response.py", line 351, in set_data
    value = value.encode(self.charset)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 73432-73433: surrogates not allowed
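
The final error can be reproduced in isolation. This is a minimal sketch, assuming the response text contains two surrogate code points held as separate characters (which would match the two-character range 73432-73433 in the message); the variable names are illustrative only.

```python
# Two surrogate code points stored as separate characters, which is what
# a str holds when a UTF-16 pair was never combined into one code point.
lone_surrogates = "\ud83d\udcf7"

try:
    lone_surrogates.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    # UTF-8 has no valid encoding for code points in U+D800..U+DFFF.
    error_message = str(exc)

print(error_message)
```

The printed message ends with "surrogates not allowed", the same reason reported by werkzeug's set_data() above.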

I'm running a fresh install at the latest commit available at the time of writing (ab1b1ac).

Here are the engines used:
[image]

@asciimoo (Owner) commented Feb 7, 2020

Which Python version are you using? It is probably the Flickr engine; could you verify that?

@SamantazFox (Author) commented Feb 7, 2020

NB: I can't reproduce the issue when Flickr is disabled.

@SamantazFox (Author) commented Feb 7, 2020

I'm using the latest python3 package available from the Debian repos:

$ uname -srvo
Linux 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) GNU/Linux
$ apt show python3
Package: python3
Version: 3.7.3-1
[...]
$ python3 --version
Python 3.7.3

@SamantazFox (Author) commented Feb 7, 2020

And I confirm that this is related to the Flickr engine.
The issue disappears as soon as I disable it, and returns when I re-enable it.

Edit: After more research, this comes from the ecma_unescape() calls on one of these entries:
https://github.com/asciimoo/searx/blob/master/searx/engines/flickr_noapi.py#L79

Edit 2: The erroneous entry is generated when the content of photo.get('realname', '') goes through the ecma_unescape() function.
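
One way this could happen is sketched below. It assumes ecma_unescape() maps each %uXXXX escape to a single character; `naive_ecma_unescape` is a hypothetical stand-in written for illustration, not the actual searx code, and the input string is made up. JavaScript's escape() writes characters outside the Basic Multilingual Plane as two %uXXXX escapes (a UTF-16 surrogate pair), so a one-escape-to-one-character mapping leaves two separate surrogates in the resulting str.

```python
import re

# Hypothetical stand-in, NOT the actual searx code: every escape is turned
# into exactly one character, with no handling of UTF-16 surrogate pairs.
_ESCAPE_4 = re.compile(r"%u([0-9a-fA-F]{4})")
_ESCAPE_2 = re.compile(r"%([0-9a-fA-F]{2})")


def naive_ecma_unescape(text):
    # "%uXXXX" becomes the character with code point XXXX.
    text = _ESCAPE_4.sub(lambda m: chr(int(m.group(1), 16)), text)
    # "%XX" becomes the character with code point XX.
    return _ESCAPE_2.sub(lambda m: chr(int(m.group(1), 16)), text)


# Illustrative input: an escaped surrogate pair followed by " photo".
unescaped = naive_ecma_unescape("%uD83D%uDCF7%20photo")

# ascii() is used so that printing does not itself fail on the surrogates.
print(ascii(unescaped))
```

The result starts with two distinct characters in the surrogate range rather than one combined code point, and a string like that cannot be encoded as UTF-8.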

@SamantazFox (Author) commented Feb 7, 2020

And I think I found the culprit.
Here is the HTML extract, with the still-escaped photo-author field:

<div class="modal-body"><img class="img-responsive center-block" src="/image_proxy?url=https%3A%2F%2Flive.staticflickr.com%2F674%2F22986292445_c1d9af40d1_n.jpg&amp;h=ed726b9204f4ad8137e64c121809e0d8f4b969e73da009f74d4d02191d7a76cb" alt="201510_0093 Gouda - Museumcafé"><span class="photo-author">%uD83D%uDCF7%20Ad%20DeCort%20%28NL%29</span><br><p class="result-content">Gouda (Dutch pronunciation: [ˈɣʌu̯daː] is a municipality and city with population of 70,939 in the province of South Holland in the Netherlands. Gouda, which was granted city rights in 1272, is famous for its Gouda cheese, smoking pipes, and 15th-century city hall. In the Middle Ages, a settlement was founded at the location of the current city by the Van der Goude family, who built a fortified castle alongside the banks of the Gouwe River, from which the family and the city took its name. The area, originally marshland, developed over the course of two centuries. By 1225, a canal was linked to the Gouwe and its estuary was transformed into a harbour. Gouda's array of historic churches and other buildings makes it a very popular day trip destination. Around the year 1100, the area where Gouda now is located was swampy and covered with a peat forest, crossed by small creeks such as the Gouwe. Along the shores of this stream near the current market and city hall, peat harvesting began in the 11th and 12th centu</p><p class="result-format">jpg 800x528</p><p class="result-source">├ Ad DeCort (NL) ┤ @ Flickr</p></div>

Edit: Yeah, those are definitely the bad ones:
<span class="photo-author">%uD83D%uDCF7%20Ad%20DeCort%20%28NL%29</span>

https://unicode-table.com/en/search/?q=D83D
https://unicode-table.com/en/search/?q=DCF7
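
D83D is a high surrogate and DCF7 a low surrogate; together they form the UTF-16 encoding of U+1F4F7 (camera). A possible repair, sketched here as an assumption rather than as the fix adopted by the project, is to round-trip the string through UTF-16 so that valid pairs are merged into single code points. The helper name `combine_surrogate_pairs` is hypothetical.

```python
def combine_surrogate_pairs(text):
    # "surrogatepass" lets the encoder emit the surrogates as raw UTF-16
    # code units; decoding then reads each valid pair as one code point.
    # Unpaired surrogates are replaced with U+FFFD instead of raising.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# The author name from the HTML extract, with the pair still split.
broken = "\ud83d\udcf7 Ad DeCort (NL)"
fixed = combine_surrogate_pairs(broken)

# Encoding to UTF-8 no longer raises once the pair is combined.
encoded = fixed.encode("utf-8")
print(ascii(fixed))
```

Applying this to the output of the unescape step, rather than at response time, would keep the repair close to where the surrogates are introduced.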

@SamantazFox (Author) commented Feb 14, 2020

Update: It happened with another engine.
I haven't investigated much, but the issue is likely global and related to the function that encodes the page itself, rather than to a specific engine.
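
To narrow down which engine or field is responsible, the result fields could be scanned for surrogate code points before the page is rendered. This is only a diagnostic sketch: `find_lone_surrogates` and the field names are hypothetical and do not come from searx.

```python
def find_lone_surrogates(fields):
    """Return the names of fields whose text contains surrogate code points."""
    return [
        name
        for name, value in fields.items()
        # In a Python 3 str, a properly combined pair is a single code point
        # above U+FFFF, so any character in U+D800..U+DFFF is a stray one.
        if any(0xD800 <= ord(ch) <= 0xDFFF for ch in value)
    ]


# Hypothetical result fields; the names are illustrative only.
result = {
    "title": "201510_0093 Gouda - Museumcafé",
    "author": "\ud83d\udcf7 Ad DeCort (NL)",
}
suspect_fields = find_lone_surrogates(result)
print(suspect_fields)
```

Logging the engine name alongside any suspect field would show whether the bad strings always pass through the same unescape helper.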

@dalf dalf added the bug label Feb 15, 2020