New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

standardize encoding of textract.process? #39

Merged
merged 11 commits into from Aug 22, 2014

Conversation

Projects
None yet
4 participants
@deanmalmgren
Owner

deanmalmgren commented Aug 11, 2014

I'm not really sure this is needed but I thought I'd use this issue as an opportunity to get feedback from others. All thoughts welcome!

Some parsers return ascii-encoded text like the odt_parser whereas others are encoded as utf8 like the html_parser whereas others don't specify an encoding at all like the json_parser. Perhaps this matters. Perhaps it doesn't. But this seems like the type of thing that should maybe be standardized.

If we do standardize the encoding of parser methods, we should probably give end users the ability to specify the encoding that they want rather than enforcing a particular encoding on everyone. This would involve supporting a -e/--encoding option in the command line tool and supporting an encoding kwarg to the textract.parsers.*_parser.extract functions.

To make it as easy as possible to continue to add parsers, it might make sense to switch to using classes for parsers instead of functions so that we can use class inheritance for standardizing the encoding with something like this:

# textract/parsers/base.py
class BaseParser(object):
    def __call__(self, filename, encoding='utf-8', **kwargs):
        """This makes it possible to do things like: 
            parser = BaseParser()
            output = parser(filename, **kwargs)
        """
        return self.encode(self.extract(filename, **kwargs), encoding)
    def extract(self, filename, **kwargs):
        raise NotImplementedError('must be overridden by child classes')
    def encode(self, output, encoding):
        return output.encode(encoding)

class ShellParser(BaseParser):
    def run(self, command):
        # this would be where we could put the current textract.shell.run implementation
        pass

# textract/parsers/txt.py
from .base import BaseParser
class Parser(BaseParser):
    def extract(self, filename, **kwargs):
        with open(filename) as stream:
            return stream.read()

I'm also open to decorator implementations but, IMHO, decorators can be extremely confusing very quickly.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 10, 2014

@christomitov @anthonygarvan @Kokxx you've all contributed support for different extensions. What do you think about this? Should we standardize the output of the textract.process function? Do you have any thoughts on switching to class-based implementations?

I'll do the work, just wondering if anyone has an adverse reaction to it...

@nicokoch

This comment has been minimized.

Contributor

nicokoch commented Aug 10, 2014

Imo, yes we should definately do the switch, seems like a much cleaner solution. Especially with respect to i.e. the image format implementations, which could all extend a common superclass (this will also work well with different ebook formats).

For the encoding, I would introduce a new parameter to the extract function instead of using kwargs.

@christomitov

This comment has been minimized.

Contributor

christomitov commented Aug 10, 2014

Makes sense to me.

@deanmalmgren deanmalmgren referenced this pull request Aug 11, 2014

Open

--metadata flag? #47

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 11, 2014

I started to implement something along these lines and got the bare bones functionality set up with the txt_parser module (the easiest one). This obviously breaks the shit out of the tests, but I thought I'd solicit feedback before I dig too much further into this.

For what its worth, this presentation by @nedbat (props!) is amazing for thinking about this problem. I particularly like the suggestion to "create a unicode sandwich", which is to say that when we read in text from a file we should decode it from its byte-string to unicode and, after all our processing is done, encode it from unicode to whatever the user specifies as the output byte-string encoding.

We've already done a fair bit of thinking about the encoding from unicode to a byte-string via the -e/--encoding command line option but I didn't even think about the first step—decoding the original byte-string into unicode. The chardet project looks particularly promising in this regard. It gives an educated guess about what the encoding is so you can properly decode the byte-string into unicode.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 12, 2014

I made some more progress this morning. Nearly all tests are passing now. Here is a short list of things that still need to be addressed:

  • experiment with using chardet to infer the encoding and encode the input into unicode. This will fix the problem with the pdf/pptx parsers?
  • standardize where the documentation is located for each parser. doc string on the Parser class or doc string on the Parser.extract method?
@anthonygarvan

This comment has been minimized.

Contributor

anthonygarvan commented Aug 12, 2014

Awesome!! I'm in a crunch now but hope to return to the project in a few
weeks.

On Tue, Aug 12, 2014 at 8:52 AM, Dean Malmgren notifications@github.com
wrote:

I made some more progress this morning. Nearly all tests are passing now.
Here is a short list of things that still need to be addressed:

  • experiment with using chardet to infer the encoding and encode the
    input into unicode. This will fix the problem with the pdf/pptx parsers?
  • standardize where the documentation is located for each parser. doc
    string on the Parser class or doc string on the Parser.extract method?


Reply to this email directly or view it on GitHub
#39 (comment).

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 12, 2014

I'm feeling pretty good about how this is coming together. This will certainly involve a bump in the major version since the output is now guaranteed to be byte-string encoded whereas before it was a mix of unicode and byte-strings. As a result, I want to give people time to comment on the code, its readability, its functionality, etc to make sure textract is still as easy to extend as possible. I'll leave this PR open for the next few days in case anyone has any suggestions for how to improve the implementation.

With the current implementation, all someone has to do to add support for a new file type is to subclass the BaseParser or ShellParser class and write an extract(self, filename, **kwargs) method that returns the result as either a byte-string or unicode. I hope this is as easy as just writing an extract function, but realistically it is a bit more complicated to subclass something than it is to just write a function.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 15, 2014

One thing that we should probably figure out is how to deal with encoding errors. For example, this isn't the greatest behavior in the world:

vagrant@dev:/vagrant$ textract tests/xlsx/another_good_example.xlsx -e ascii
Traceback (most recent call last):
  File "/vagrant/bin/textract", line 33, in <module>
    main()
  File "/vagrant/bin/textract", line 25, in main
    output = process(**vars(args))
  File "/vagrant/textract/parsers/__init__.py", line 54, in process
    return parser.process(filename, encoding, **kwargs)
  File "/vagrant/textract/parsers/utils.py", line 39, in process
    return self.encode(unicode_string, encoding)
  File "/vagrant/textract/parsers/utils.py", line 28, in encode
    return text.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 4589: ordinal not in range(128)

Do we need to have the ability to let users specify how they want to handle errors? For the time being, I'm thinking that either using 'ignore'-ing or 'replace'-ing them is the way to go and dealing with user-specified error handling down the road is better. Adding code to that effect momentarily.

@deanmalmgren deanmalmgren merged commit 2808a9b into master Aug 22, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details

deanmalmgren added a commit that referenced this pull request Aug 22, 2014

@deanmalmgren deanmalmgren deleted the parser-class branch Aug 22, 2014

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 22, 2014

OK, going ahead and making this change. I'm going to merge a few other things throughout the day and hope to release v1.0.0 by the end of the day.

This was referenced Aug 22, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment