parser.from_buffer breaks on unicode #47

Closed
hyp3ri0n-ng opened this issue Jul 7, 2015 · 2 comments

@hyp3ri0n-ng

Platform: Ubuntu 14.04
Python version: 2.7.x

The following breaks:

#!/usr/bin/env python2.7
import tika
from tika import parser
import requests

r = requests.get("http://www.hyperiongray.com/")
#stringed = r.text.encode('ascii','ignore')
string_parsed = parser.from_buffer(r.text)

with the exception:

punk@punk-controller:~/memex-dev/the-headless-horseman$ python tika-test.py 
Traceback (most recent call last):
  File "tika-test.py", line 10, in <module>
    string_parsed = parser.from_buffer(r.text)
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 29, in from_buffer
    {'Accept': 'application/json'}, False)
  File "/usr/local/lib/python2.7/dist-packages/tika/tika.py", line 245, in callServer
    resp = verbFn(serviceUrl, data=data, headers=headers)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 99, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 330, in send
    timeout=timeout
  File "/usr/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 542, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 367, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 833, in _send_output
    self.send(message_body)
  File "/usr/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5132-5134: ordinal not in range(128)
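
The positions in the error message are presumably indexes into the string that was passed in. A minimal sketch for locating the offending characters, assuming a hypothetical helper named `find_non_ascii` (it is not part of tika or requests):

```python
def find_non_ascii(text):
    """Return (index, character) pairs for every code point above 127."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


# Illustrative sample only; it is not taken from the page in this report.
sample = u"caf\u00e9 \u2013 na\u00efve"
for index, ch in find_non_ascii(sample):
    print("%d %r" % (index, ch))
```

In the script above this could be called as `find_non_ascii(r.text)` to see which characters the 'ascii' codec is rejecting.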

A workaround is something like the following:

#!/usr/bin/env python2.7
import tika
from tika import parser
import requests

r = requests.get("http://www.hyperiongray.com/")
# Encode to ASCII bytes; 'ignore' drops every character that is not ASCII
mangled_str = r.text.encode('ascii', 'ignore')
string_parsed = parser.from_buffer(mangled_str)

which runs without error. However, encoding with 'ignore' silently drops every non-ASCII character, so this won't be acceptable for non-HTML content or for most HTML content.
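
A possibly less lossy variant would be to encode to UTF-8 bytes before calling the parser. This is only a sketch: `to_utf8_bytes` is a hypothetical helper, and whether this version of tika and the Tika server handle UTF-8 bytes correctly is an assumption that has not been tested here. The snippet only shows the difference between the two encodings:

```python
def to_utf8_bytes(text):
    """Encode text as UTF-8 bytes; pass byte strings through unchanged."""
    if isinstance(text, bytes):
        return text
    return text.encode("utf-8")


text = u"caf\u00e9"
lossy = text.encode("ascii", "ignore")  # drops the accented character
lossless = to_utf8_bytes(text)          # keeps it as two UTF-8 bytes

print(repr(lossy))
print(repr(lossless))
print(lossless.decode("utf-8") == text)  # the original text round-trips
```

With the script above, the call would become `parser.from_buffer(to_utf8_bytes(r.text))`.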

@chrismattmann
Owner

Hey @acaceres2176 check this out:

bash-3.2$ env | grep UTF
lang=UTF_8
LANG=en_US.UTF-8
bash-3.2$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
bash-3.2$ export LC_ALL=en_US.UTF-8
bash-3.2$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
bash-3.2$ python2.7
Python 2.7.8 (default, Sep 27 2014, 11:46:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> from tika import parser
>>> r = requests.get('http://www.hyperiongray.com')
>>> parsed = parser.from_buffer(r.text)
>>> 

Try that (exporting LC_ALL=en_US.UTF-8 before starting Python) as a workaround for now. I will put some more resiliency into the library to guard against this. Working on it now.
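
To compare two machines, it may help to print what the interpreter itself sees. A small sketch using only the standard library; `encoding_report` is a hypothetical helper name, and whether the shell locale actually influences this particular error is not established here:

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings visible to this interpreter."""
    return {
        "LANG": os.environ.get("LANG", ""),
        "LC_ALL": os.environ.get("LC_ALL", ""),
        "default_encoding": sys.getdefaultencoding(),
        "preferred_encoding": locale.getpreferredencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None) or "",
    }


for name, value in sorted(encoding_report().items()):
    print("%s=%s" % (name, value))
```

Running this before and after the `export LC_ALL=...` step shows which of these values the environment change affects.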

@chrismattmann
Owner

@acaceres2176 see above in #48 - that fixes it for me without having to set encoding to UTF-8 in my shell:

[chipotle:~] mattmann% python2.7
Python 2.7.8 (default, Sep 27 2014, 11:46:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tika import parser
>>> import requests
>>> r = requests.get('http://www.hyperiongray.com')
>>> parsed = parser.from_buffer(r.text)
>>> 

Try it out.
