New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 3 support #104

Merged
merged 8 commits into from Nov 15, 2016

Conversation

Projects
None yet
4 participants
@sirex
Contributor

sirex commented Feb 16, 2016

Test resuls, running with Python 3.4:

FAILED (SKIP=2, errors=2, failures=7)

Two tests are skipped because pdfminer does not support Python 3 yet, see:

euske/pdfminer#71

Two errors because of msg-extractor (I don't have it installed).

And 7 failures mostly related with white spaces, not sure how to fix that.
Maybe I should normalize white spaces?

Othen than that, everything works and textract works with Python 3 pretty well.

Python 3 support
Test resuls, running with Python 3.4:

    FAILED (SKIP=2, errors=2, failures=7)

Two tests are skipped because pdfminer does not support Python 3 yet, see:

euske/pdfminer#71

Two errors because of msg-extractor (I don't have it installed).

And 7 failures mostly related with white spaces, not sure how to fix that.
Maybe I should normalize white spaces?

Othen than that, everything works and textract works with Python 3 pretty well.
@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/parsers/xls_parser.py in e24ae5f Feb 16, 2016

Is there a reason you got rid of the relative import here? I'd prefer to keep the relative import (makes for cleaner code) if possible

This comment has been minimized.

Owner

sirex replied Feb 17, 2016

Here I changed ambiguous absolute import to relative. In Python 3 these ambiguous relative imports are no longer supported. Python 2 and 3 pythons support explicit relative imports. See the PEP-0328 explanation:

Imports can be ambiguous in the face of packages; within a package, it's not clear whether import foo refers to a module within the package or some module outside the package. (More precisely, a local module or package can shadow another hanging directly off sys.path.)

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on tests/base.py in e24ae5f Feb 16, 2016

The [Travis logs](https://travis-ci.org/deanmalmgren/textract/builds/109677000) suggest that a lot of the tests are failing in python 2.7, which I'm guessing 'r' to 'rb' file streams. Is there a reason for this change in python3?

This comment has been minimized.

Owner

sirex replied Feb 17, 2016

Honestly, i didn't tried to run tests on python 2.7, I will try to do that and see if it is possible to adjust changes so that they do not brake python 2.7. Sorry for being lazy.

In python 3 if file is opened in r mode, file content is interpreted as text. From what I understand textract internally process all data in binary form? So that's why I changed that, to make it consistent. And since Python 3 is more strict about text vs binary, if file is opened in text mode it fails in many places, where text is mixed with binary.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on tests/base.py in e24ae5f Feb 16, 2016

Same thing here. I'm pretty sure this should be 'r' and not 'rb' unless there is a reason the format is in binary.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on tests/base.py in e24ae5f Feb 16, 2016

another 'rb' issue?

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on tests/test_csv.py in e24ae5f Feb 16, 2016

Does python 3 choke on these relative imports within the tests/ package?

This comment has been minimized.

Owner

sirex replied Feb 17, 2016

import base in Python 3 is a global import, since tests are run from project root, and tests/ is not in sys.path, base can't be imported. Changing it to relative import (supported both by Python 2 and 3) fixed the issue.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/cli.py in e24ae5f Feb 16, 2016

This doesn't work on python 2.7. Is it necessary to specify sys.stdout.buffer here? It looks like argparse in python 3 understands '-' just fine [docs]

This comment has been minimized.

Owner

sirex replied Feb 17, 2016

Unfortunately Python 3 does not honour b flag and in case of '-' even if b is specified file handler is sys.stdio which is opened in text mode, but it should be sys.stdio.buffer which is in binary mode.

Again, sorry for being lazy, I will run tests with Python 2.7 and will find a way to make it work on both pythons.

I explained earlier, why I'm changing all places with text processing to binary mode. Let me know, if you think, that files being processed should be opened text mode.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/parsers/odt_parser.py in e24ae5f Feb 16, 2016

another 'rb' thingy

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Feb 16, 2016

This is fantastic. Thanks for putting this together @sirex! I'm really excited to see this coming together. This has come up in #91, and also in conversation on the v1.4.0 tag.

I'd like to see a few other things incorporated, but its great to see that converting this to python 3 will still support python 2 by using six. Here are a few other things I'd like to address:

  • First and foremost, don't forget to add a line to the changelog reflecting your contribution
  • I've added some comments on specific diffs. Forgive all the comments, but I figured it would be good to comment directly on the code in some contexts.
  • Let's add python 3.4 and/or 3.5 to the .travis.yml configuration here. This will make sure that we are running tests against python 2.7 and whatever python 3 versions that it makes sense to support.
  • It might be worth using @goulu's fork of pdfminer that is referenced in euske/pdfminer#71. It appears to be python 2/3 compatible (Thanks @goulu!). Alternatively, we could make the command line pdf extraction tool the default.
  • If you're having whitespace issues, you should be able to run pep8 textract/ bin/ to make sure everything is formatted correctly.
  • I'd like to think about the issues with msg-extractor a bit more

What do you think about all of this?

@sirex

This comment has been minimized.

Contributor

sirex commented Feb 17, 2016

@deanmalmgren thanks for the code review. As I mentioned in the comments I did not run tests on Python 2.7. I will take more time (probably this weekend) and finish Python 3 support properly, making sure, that it works with Python 2 and 3.

If you're having whitespace issues, you should be able to run pep8 textract/ bin/ to make sure everything is formatted correctly.

White space issues are related with the epub output. When epub files are processed with Python 3, the output has some extra white spaces and tests does not pass. I'm not sure why that happens, but I will try to look at it.

I'd like to think about the issues with msg-extractor a bit more

I found that msg-extractor was not installed because in requirements/python it is specified as URL instead of package name. I see, that in setup.py it is added as dependency link, but somehow it wasn't installed. I will check that myself too.

All other items in your check list makes sense.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Feb 17, 2016

@sirex Thanks for the quick reply on the code review. Here are a couple of follow-up responses:

binary, byte strings and unicode

Since the output of different parsers varies, textract reads the output of everything as a "byte string". To standardize the processing of these different outputs, it guesses the encoding using chardet and decodes the byte string into unicode. The unicode is the converted to the desired byte string encoding. All of this happens in the unicode sandwich in BaseParser.process. So, in short, textract does not internally handle things in binary, but as some combination of byte strings and unicode, depending on what different parsers extract. Does this make sense?

I'm a complete python3 n00b so forgive me if I'm way off on this (or anything else for that matter), but my understanding is that unicode handling in python 3 is a lot better and it may warrant a fresh approach to this whole thing. I'm completely open to suggestions.

whitespace

I'll be curious what you find here. If its just whitespace that is getting added for some reason, we can always reconfigure the tests. The important thing is that the words are extracted in the correct order.

msg-extractor

I'll also be curious on this one. I made a pull request to msg-extractor to allow it to be pip-installable. The maintainer hasn't put the package on pypi to the best of my knowledge, but as far as I know, this is the only viable mechanism to extract text from klunky micrsoft .msg email format.

sirex added some commits Feb 20, 2016

Different packages for different pythons
This adds possibility to specify different packages for different pythons.
Currently there is only one difference - pdfminer which does not yet support
Python 3 (at least officially).
Ignore empty lines when testing output
Not sure why, but on my laptop these tests fail:

```
tests/test_epub.py::EpubTestCase::test_raw_text_cli <- tests/base.py FAILED
tests/test_epub.py::EpubTestCase::test_raw_text_python <- tests/base.py FAILED
```

They fail, because `EbookLib==0.15` adds extra empty lines in several places.
Not sure why this happen, but this failure happens outside of this project so
to fix that decided to ignore empty lines.
Fix HtmlTestCase.test_table_text_cli test
With Python 3 `tests/test_html.py::HtmlTestCase::test_table_text_cli` failed,
because `&#160;` gets converted to empty space, but with Python 2 this gets
converted to `\x00\xa0`. To fix that, decided to replace all `&#160;` to
empty spaces in test fixtures.
Fix argparse.FileType binary support
This is a Python 3 bug [1], while it is not fixed upstream, adding fix here.

[1] https://bugs.python.org/issue14156
Add Python 3.4 to travis config
For now I'm adding only Python 3.4, because this version I tested on my laptop.
@sirex

This comment has been minimized.

Contributor

sirex commented Feb 20, 2016

I think Python 3 support is done, but where is the green light from Travis?

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Feb 22, 2016

Thanks!!! I'm still confused about the need to have binary i/o with python 3, but I guess I'll have a closer look when I have time to merge this in.

I'm also confused about the Travis build not working. I just tried to restart the build from the travis website but it looks like its just re-running based on your initial commit or something? hrmpf

It should re-trigger the build when I merge the PR, hopefully later this week or next week at the latest (I have a big deadline early next week)

Thanks for all of your help with this!

@sirex

This comment has been minimized.

Contributor

sirex commented Feb 22, 2016

@deanmalmgren regarding binary i/o.

I'm not sure if I interpreted this place correctly:

byte_string = self.extract(filename, **kwargs)
unicode_string = self.decode(byte_string)
return self.encode(unicode_string, encoding)

But it looks like you are using Unicode sandwich approach, where Parser.extract returns bytes and Parser.process returns bytes too.

Since Parser.process returns bytes, --output file have to be opened in binary mode (wb) for sure.

But for self.extract part I'm not sure, it looks like you expecting bytes (byte_string = self.extract...), but self.decode accepts anything and returns unicode employing decoding from bytes if needed.

Since information about how to decode binary to unicode is only available for each parser, so it is logical, that self.extract should return unicode and each parser should be responsible for returning unicode.

So if you agree, that self.extract should return unicode, then I could refactor that place making sure, that all parsers returning unicode. In that case, there is no need for serf.decode call in BaseParser.process, it should be moved by each parser if input encoding is unknown and chardet have to be used.

@rot26

This comment has been minimized.

rot26 commented Mar 27, 2016

Hello,

Re travis build:
It looks like your travis branch might have failed on python 2.7 here while trying to install PyYAML. PyYAML does not appear to have an ssl cert.

  Running setup.py install for PyYAML
....................................E.................................................
======================================================================
ERROR: Make sure extraction does not hang (issue #33)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/deanmalmgren/textract/tests/test_pdf.py", line 37, in test_large_pdf
    filename,
  File "/home/travis/build/deanmalmgren/textract/tests/base.py", line 76, in download_file
    response = requests.get(url, stream=True)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)
-------------------- >> begin captured logging << --------------------
requests.packages.urllib3.connectionpool: INFO: Starting new HTTPS connection (1): openknowledge.worldbank.org
--------------------- >> end captured logging << ---------------------

This seems like a similar problem...

http://odetodata.com/2015/01/https-certificate-verification-problem-in-python-2-7-9/

You can get around the new cert checking

https://www.python.org/dev/peps/pep-0476/#opting-out

import ssl

# This restores the same behavior as before.
context = ssl._create_unverified_context()
urllib.urlopen("https://no-valid-cert", context=context)

Unfortunately, I'm not quite sure how to get around this on pip install though? Any ideas?

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented May 3, 2016

@sirex did you have any trouble installing EbookLib on python 3? @bjgordon ran into problems in #108 when experimenting with your branch on a Windows machine.

(As a side note: apologies on not getting this merged and being able to address these questions myself. The past few months have been unusually busy in a good way)

@gordcorp

This comment has been minimized.

gordcorp commented May 4, 2016

FYI I just managed to get EbookLib installed ok using their master branch, so they must not have released the fix yet. After this, sirex's branch installed fine for me, and seems to work fine on a pdf at least. This is on Windows Server 2012 R2, Python 3.5.

This was referenced Nov 15, 2016

@deanmalmgren deanmalmgren merged commit dee3775 into deanmalmgren:master Nov 15, 2016

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Nov 15, 2016

THANK YOU @sirex for doing the bulk of this work to bring python 3 support to textract. The community owes you a 🙌 and probably some 🍻 . Please accept my apologies in how long it has taken me to merge this in. I am going to wrap this into version 1.5.0, which I plan on releasing later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment