UnicodeEncodeError when using Stream flavor #183

Closed
stpete111 opened this issue Aug 14, 2020 · 6 comments · Fixed by #188

Comments

@stpete111

stpete111 commented Aug 14, 2020

Python 3.7 on Windows

Using this pdf: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf

I am running it through Camelot to convert to html using Stream flavor and I get the following error at execution of the export line, once it reaches page 4 of 8:

"UnicodeEncodeError -'charmap' codec can't encode character '\u2010' in position y: character maps to undefined."

Pages 1 through 3 get converted nicely - it crashes somewhere between page 4 and 5. In debug with the breakpoint after the tables.export line, it also brings me to line 19 of cp1252.py, if that's helpful.

I am on Windows, and this seems not to be an issue on Mac. But Windows is our environment, so I have to figure this out. I have done a ton of research on this error, and everything in the Python world points to adding either encoding="utf-8" or errors="ignore", but both of those are arguments for opening or reading a file yourself and can't be passed to Camelot's export method.

Any thoughts on what I could add to the script to get around this error? We can't avoid using Windows, and this seems to be the final blocker to our making really great use of this tool for our PDFs.
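
For readers following along, the error above can be reproduced without Camelot. The sketch below assumes the file is being written with the Windows default text encoding, cp1252 (which the traceback into cp1252.py suggests), and checks whether U+2010 (HYPHEN) can be encoded. The sample string and the `can_encode` helper are hypothetical, introduced only for illustration.

```python
# U+2010 (HYPHEN) is the character named in the error message.
# The surrounding words are made-up sample text, not taken from the PDF.
text = "Fiscal\u2010Year"


def can_encode(s, encoding):
    """Return True if every character of s can be encoded with `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 has no mapping for U+2010, while UTF-8 can encode any code point.
print("cp1252:", can_encode(text, "cp1252"))
print("utf-8: ", can_encode(text, "utf-8"))
```

This would also explain why the same script runs on a Mac, where the default text encoding is normally UTF-8.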

@stpete111
Author

stpete111 commented Aug 14, 2020

At this point I'm willing to put try/except code around the export method (but would need guidance on how to do that). You should see how many Stack Overflow tabs I have open in my browser right now, trying every solution I can find, and still getting the same error no matter what.

@anakin87
Contributor

I found this solution (it is a monkey patch): https://stackoverflow.com/questions/63403629/python-camelot-pdf-unicodeencodeerror-when-using-stream-flavor-on-windows/
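
The linked answer is not reproduced here. As a rough sketch of the general idea behind this kind of workaround (not necessarily what the answer or the eventual fix does), the HTML can be written through a file handle opened with an explicit encoding instead of relying on the platform default. The `write_html_utf8` helper, the sample HTML string, and the output file name are all hypothetical.

```python
import os
import tempfile


def write_html_utf8(html, path):
    """Write an HTML string to `path`, always encoding it as UTF-8."""
    # An explicit encoding avoids falling back to cp1252 on Windows.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


# Stand-in for the HTML of one extracted table (made-up content).
html = "<table><tr><td>Fiscal\u2010Year</td></tr></table>"

# Write to a temporary directory so the sketch leaves the working dir alone.
out_path = os.path.join(tempfile.mkdtemp(), "page-4.html")
write_html_utf8(html, out_path)
```

With Camelot, the HTML string could presumably come from each table's pandas DataFrame, for example `table.df.to_html()`, rather than from the built-in export. That wiring is an assumption and is not shown or tested here.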

@stpete111
Author

Thanks @anakin87 this works great.

@vinayak-mehta
Member

@anakin87 Would you like to open a PR to fix this in the library itself? :)

@anakin87
Contributor

anakin87 commented Aug 25, 2020

#188

It is my first PR. If it is incorrect, please provide some help.

@vinayak-mehta
Member

vinayak-mehta commented Aug 25, 2020

@anakin87 It looks good! I'm waiting for the tests to pass so that I can merge it, even though there isn't a test for the to_html method right now. (You can add one in a new PR if you want to work on it.)

Also, I've noticed that you use a lot of different camelot features, based on your issue tracker replies and SO answers. I would love to chat about how you use camelot if you have some time this / next week!
