New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding .xls/xlsx support #42 #55

Merged
merged 9 commits into from Aug 15, 2014

Conversation

Projects
None yet
3 participants
@levivm
Contributor

levivm commented Aug 14, 2014

Added support for .xls and .xlsx format, using xlrd python lib.

@nicokoch

This comment has been minimized.

Contributor

nicokoch commented Aug 14, 2014

Two things I came across reviewing your code:

  • The comment in your parser describes the method as extracting .docx files :-P
  • The code for .xls and .xlsx seem to be very similar (or exactly the same?). Instead of copying the code, it would be a lot cleaner to reuse it (an example of how to do this can be seen in the .jpg, .gif and .png parsers).

So nothing critical, just some suggestions.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/parsers/xls_parser.py in 2ac7b59 Aug 14, 2014

Agree with @Kokxx on this. Since the xls_parser, it would be better if the xls_parser just did something like from xlsx_parser import export and then everything should work the same.

This comment has been minimized.

Owner

levivm replied Aug 14, 2014

Thanks to @Kokxx for poiting out this. @deanmalmgren did you mean <from xlsx_parser import extract> ?. Thanks :)

This comment has been minimized.

deanmalmgren replied Aug 14, 2014

yes; sorry about that.

This comment has been minimized.

Owner

levivm replied Aug 14, 2014

@deanmalmgren no problem, thansk to you :) I just pushed new changes fixing this and the others issues.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/parsers/xlsx_parser.py in 4f3967d Aug 14, 2014

If you end up using the same function for both, we should probably just generically say "Microsoft Excel files" (instead of "docx file using python-docx")

This comment has been minimized.

Owner

levivm replied Aug 14, 2014

Yes, sorry, It was a typo. I will fix it.

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/parsers/xlsx_parser.py in 4f3967d Aug 14, 2014

Thanks for thinking about unicode and byte-strings here. I ended up putting together a more robust solution in deanmalmgren#39 that I'll merge in after people have a chance to provide feedback. Feel free to leave this here if it makes things easier for you now. If you're so inclined, you could just replace this with worksheet.cell_value(curr_row, index_col) and not worry about decoding to unicode and then re-encoding the byte-string as utf-8 (which is what deanmalmgren#39 will take care of).

@deanmalmgren

This comment has been minimized.

deanmalmgren commented on textract/parsers/xlsx_parser.py in 4f3967d Aug 14, 2014

To be consistent with the other parsers, I think I'd prefer to use a whitespace character of some sort to join adjacent rows instead of the pipe ('|') character. Perhaps we could use a tab ('\t') instead?

This comment has been minimized.

Owner

levivm replied Aug 14, 2014

ok fine, sounds great, I will add it.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 14, 2014

I just took a quick look, too. Thanks for putting this together; looks great! I made some comments above, mostly building off @Kokxx's suggestions and accounting for changes I plan to merge in #39, probably sometime next week.

I was a little confused about all the changing of the functional test checksums but I see that all of the non-excel checksums were ultimately changed to the same values in the master branch. Apologies if you received an email about a comment I made; I ultimately figured it out and deleted the comment on github :)

@deanmalmgren deanmalmgren merged commit fe02796 into deanmalmgren:master Aug 15, 2014

deanmalmgren added a commit that referenced this pull request Aug 15, 2014

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Aug 15, 2014

Excellent; thanks @levivm! This is all merged in now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment