New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for JSON #13

Merged
merged 6 commits into from Jul 30, 2014

Conversation

Projects
None yet
2 participants
@anthonygarvan
Contributor

anthonygarvan commented Jul 30, 2014

Extracts string fields from a JSON document because hey, who doesn't love recursion? Suitable for text analysis of MongoDB dumps via mongoexport command. Output does not maintain / enforce any order of fields in an object, so the result should be used in bag-of-words modelling- n-gram algos beware. Also includes a small doc revision. All tests are passing on travis: https://travis-ci.org/anthonygarvan/textract.

Also, the md5sum seems to be broken for the snow_fall.html example on Ubuntu 13.10 (fails for me locally but passes on Travis). Using OS utilities for this seems overkill - any reason not to use python's hashlib?

@@ -18,8 +18,11 @@ def process(filename, **kwargs):
# is a relative import so the name of the package is necessary
root, ext = os.path.splitext(filename)
ext = ext.lower()
# cannot call module json.py, conflicts with built-in json library
module = '.json_parser' if ext == '.json' else ext

This comment has been minimized.

@deanmalmgren

deanmalmgren Jul 30, 2014

Owner

This is an interesting way of handling this. Maybe we should rename all of the parser modules to something like {{ext}}_parser to avoid conflicts with globally installed packages.

We run into this problem in the docx and pptx modules too and we use the utils.non_local_import function to deal with it. While clever, I'm not terribly happy with that implementation and think you've got a better way of handling this. I'll make some changes to how those are incorporated after merging.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Jul 30, 2014

Nice! Thanks for the contribution.

Regarding the use of python's hashlib vs using the system command md5, I'm totally ambivalent (they should actually give the same results, I believe). Since the run_functional_tests.sh script is already a shell script, md5 seemed like a natural choice but I'm open to changing.

Merging now and I'll rename all the modules to {{ext}}_parser as you suggest. Great idea; this is much clearer.

deanmalmgren added a commit that referenced this pull request Jul 30, 2014

@deanmalmgren deanmalmgren merged commit 5982099 into deanmalmgren:master Jul 30, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Jul 30, 2014

Also, for what its worth, in da09661 I made sure to sort the keys of the dictionary to make sure that the results of the json parser are always returned in the same (alphabetical) order.

Thanks again for the PR!

@anthonygarvan

This comment has been minimized.

Contributor

anthonygarvan commented Jul 30, 2014

Glad to help- cool project, well structured. I wonder what was going on with my md5sum - why it would work on Travis but not locally. I suspected OS issues but may be something else.

Makes sense to sort the dictionaries alphabetically.

I'll open up cases for adding image support - the nodejs version uses tesseract ocr, maybe we should do something similar.

@deanmalmgren

This comment has been minimized.

Owner

deanmalmgren commented Jul 31, 2014

I'm not sure what was going on with the md5sum's. That is a bizarre problem. The only difference I could imagine is that I'm using an Ubuntu 12.04 development environment that mocks the Travis CI environment instead of your 13.10; could that be the difference? Seems unlikely to me but I suppose its a possibility.

What version of python do you have? What version of bs4 gets installed? Those are the only differences that could affect the html output being different (which, in turn, would cause the md5sum's to be different).

If you want, maybe post a gist with the text output that you have and I can check to see what the difference is?

Oh, and thanks for opening up the image support things. OCR would be a great addition!

@anthonygarvan

This comment has been minimized.

Contributor

anthonygarvan commented Jul 31, 2014

Ah! I bet it's a beautiful soup thing. I'm on python 2.7.5+ and beautiful
soup 4.2.0. Will post a gist of my output in the next couple of days.

On Thu, Jul 31, 2014 at 8:27 AM, Dean Malmgren notifications@github.com
wrote:

I'm not sure what was going on with the md5sum's. That is a bizarre
problem. The only difference I could imagine is that I'm using an Ubuntu
12.04
https://github.com/deanmalmgren/textract/blob/master/Vagrantfile#L28
development environment that mocks the Travis CI environment instead of
your 13.10; could that be the difference? Seems unlikely to me but I
suppose its a possibility.

What version of python do you have? What version of bs4 gets installed?
Those are the only differences that could affect the html output being
different (which, in turn, would cause the md5sum's to be different).

If you want, maybe post a gist with the text output that you have and I
can check to see what the difference is?


Reply to this email directly or view it on GitHub
#13 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment