parser.from_buffer breaks on unicode #47

hyp3ri0n-ng · 2015-07-07T20:09:13Z

Platform: Ubuntu 14.04
Python version: 2.7.x

The following breaks:

#!/usr/bin/env python2.7
import tika
from tika import parser
import requests

r = requests.get("http://www.hyperiongray.com/")
#stringed = r.text.encode('ascii','ignore')
string_parsed = parser.from_buffer(r.text)

with the exception:

punk@punk-controller:~/memex-dev/the-headless-horseman$ python tika-test.py 
Traceback (most recent call last):
  File "tika-test.py", line 10, in <module>
    string_parsed = parser.from_buffer(r.text)
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 29, in from_buffer
    {'Accept': 'application/json'}, False)
  File "/usr/local/lib/python2.7/dist-packages/tika/tika.py", line 245, in callServer
    resp = verbFn(serviceUrl, data=data, headers=headers)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 99, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 330, in send
    timeout=timeout
  File "/usr/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 542, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 367, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 833, in _send_output
    self.send(message_body)
  File "/usr/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5132-5134: ordinal not in range(128)
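
The last frames of the traceback suggest the cause: a unicode object reaches sock.sendall, and Python 2 then tries to encode it implicitly with the ASCII codec. A minimal sketch that reproduces the same error class in isolation, without tika or a network connection (the sample string and the helper name `ascii_encode_fails` are illustrative only, not part of the library):

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the error class from the traceback in isolation.
# The sample text is hypothetical, not taken from the fetched page.

def ascii_encode_fails(text):
    """Return True if encoding `text` as ASCII raises UnicodeEncodeError."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False

sample = u"caf\u00e9 \u2013 na\u00efve"  # contains non-ASCII code points

print(ascii_encode_fails(u"plain ascii"))  # ASCII-only text encodes fine
print(ascii_encode_fails(sample))          # non-ASCII text does not
```

If this reading is right, any r.text containing a character outside the ASCII range would trigger the failure, regardless of the page being fetched.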

A workaround is something like the following:

#!/usr/bin/env python2.7
import tika
from tika import parser
import requests

r = requests.get("http://www.hyperiongray.com/")
mangled_str = r.text.encode('ascii','ignore')
string_parsed = parser.from_buffer(mangled_str)

which runs without error. However, encode('ascii','ignore') silently drops every non-ASCII character, so this is not acceptable for non-HTML content or for most HTML content.
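
A less lossy variant of the workaround might encode to UTF-8 bytes instead, so that no characters are dropped. This is a sketch under assumptions: `to_utf8_bytes` is a hypothetical helper, and whether parser.from_buffer and the Tika server interpret the resulting bytes as UTF-8 has not been confirmed in this thread.

```python
# -*- coding: utf-8 -*-
# Sketch only: compares the lossy ASCII workaround with a UTF-8 encode.
# `to_utf8_bytes` is a hypothetical helper, not part of tika-python.

def to_utf8_bytes(text):
    """Encode text to UTF-8 bytes; pass byte strings through unchanged."""
    if isinstance(text, bytes):
        return text
    return text.encode("utf-8")

sample = u"caf\u00e9"                      # hypothetical non-ASCII input

lossy = sample.encode("ascii", "ignore")   # drops the accented character
lossless = to_utf8_bytes(sample)           # keeps it as two UTF-8 bytes

print("%d %d" % (len(lossy), len(lossless)))
```

In the script above this would be used as parser.from_buffer(to_utf8_bytes(r.text)), again assuming the server side accepts UTF-8.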

chrismattmann · 2015-07-07T22:43:56Z

Hey @acaceres2176 check this out:

bash-3.2$ env | grep UTF
lang=UTF_8
LANG=en_US.UTF-8
bash-3.2$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
bash-3.2$ export LC_ALL=en_US.UTF-8
bash-3.2$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
bash-3.2$ python2.7
Python 2.7.8 (default, Sep 27 2014, 11:46:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> from tika import parser
>>> r = requests.get('http://www.hyperiongray.com')
>>> parsed = parser.from_buffer(r.text)
>>>

Try that as a work-around for now. I will put some more resiliency into the library to guard against this. Working on it now.

chrismattmann · 2015-07-07T23:51:32Z

@acaceres2176 see above in #48 - that fixes it for me without having to set encoding to UTF-8 in my shell:

[chipotle:~] mattmann% python2.7
Python 2.7.8 (default, Sep 27 2014, 11:46:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tika import parser
>>> import requests
>>> r = requests.get('http://www.hyperiongray.com')
>>> parsed = parser.from_buffer(r.text)
>>>

Try it out.

chrismattmann closed this as completed in adc4b9d Jul 7, 2015

chrismattmann mentioned this issue Jul 9, 2015

fix for unicode broke .from_file methods #49

Closed
