Support for JSON #13
Extracts string fields from a JSON document because hey, who doesn't love recursion? Suitable for text analysis of MongoDB dumps produced by the mongoexport command. The output does not maintain or enforce any order of fields within an object, so the result should only be used for bag-of-words modelling; n-gram algos beware. Also includes a small doc revision. All tests are passing on Travis: https://travis-ci.org/anthonygarvan/textract.
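For readers who want the gist of the recursion, here is a minimal sketch of the idea. It is not the code from this pull request: the names `extract_strings` and `json_to_text` are hypothetical, it targets Python 3, and it assumes that only string values (not object keys) count as text.

```python
import json


def extract_strings(node):
    """Recursively yield every string value in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        # Field order is whatever the dict gives us; nothing is enforced here.
        for value in node.values():
            yield from extract_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from extract_strings(item)
    # Numbers, booleans, and null carry no text and are skipped.


def json_to_text(raw):
    """Join all string values of a JSON document with single spaces."""
    return " ".join(extract_strings(json.loads(raw)))
```

Because the words come out in no guaranteed order, the result is fine for word counts but should not be fed to anything that depends on word adjacency.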
Also, the md5sum seems to be broken for the snow_fall.html example on Ubuntu 13.10 (fails for me locally but passes on Travis). Using OS utilities for this seems like overkill. Any reason not to use Python's hashlib?
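To make the hashlib suggestion concrete, a rough sketch of what a pure-Python checksum could look like. The helper name `md5_of_file` and the chunk size are my own assumptions, not anything in the textract test suite.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The hex string it returns has the same form as the first column of `md5sum` output, so the expected values in existing tests could probably stay as they are.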
Nice! Thanks for the contribution.
Regarding the use of python's hashlib vs using the system command
Merging now and I'll rename all the modules to
added a commit
this pull request
Jul 30, 2014
Glad to help. Cool project, well structured. I wonder what was going on with my md5sum and why it would work on Travis but not locally. I suspected OS issues, but it may be something else.
Makes sense to sort the dictionaries alphabetically.
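A quick sketch of how sorting could make the output deterministic, assuming "sort the dictionaries alphabetically" means visiting each object's keys in sorted order. The function name `extract_strings_sorted` is hypothetical and this is Python 3, not the merged code.

```python
def extract_strings_sorted(node):
    """Collect string values from parsed JSON, visiting dict keys alphabetically."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        found = []
        for key in sorted(node):  # alphabetical key order gives stable output
            found.extend(extract_strings_sorted(node[key]))
        return found
    if isinstance(node, list):
        found = []
        for item in node:  # lists already have a defined order
            found.extend(extract_strings_sorted(item))
        return found
    return []  # numbers, booleans, and null contribute no text
```

With this, the same document always produces the same text, which also makes checksum-based tests of the output meaningful.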
I'll open up cases for adding image support - the nodejs version uses tesseract ocr, maybe we should do something similar.
I'm not sure what was going on with the md5sums. That is a bizarre problem. The only difference I can imagine is that I'm using an Ubuntu 12.04 development environment that mirrors the Travis CI environment instead of your 13.10; could that be it? Seems unlikely to me, but I suppose it's a possibility.
What version of Python do you have? What version of bs4 gets installed? Those are the only differences I can think of that would change the HTML output (which, in turn, would cause the md5sums to differ).
If you want, maybe post a gist with the text output that you have and I can check to see what the difference is?
Oh, and thanks for opening up the image support things. OCR would be a great addition!
Ah! I bet it's a beautiful soup thing. I'm on python 2.7.5+ and beautiful
On Thu, Jul 31, 2014 at 8:27 AM, Dean Malmgren email@example.com