Add additional Audio parsing method - Sphinx audio #122

merged 69 commits into from Mar 24, 2017


9 participants

barrust commented Aug 25, 2016

The SpeechRecognition library allows for using different extraction libraries and APIs. I need to be able to extract text from audio locally and possibly without internet access. To do so I added the ability to programmatically select which engine to use.

  • Added engine parameter to extract method
    • Defaults to Google API extraction
  • Updated required packages
  • Updated required python packages
  • Added tests for both Sphinx and Google for MP3 extraction
  • Updated the documentation

Currently supports google and CMU Sphinx but should be mostly trivial to add others.
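A minimal sketch of the engine selection described above, assuming a dictionary that maps engine names to transcription callables. The names `ENGINES`, `transcribe`, `_google`, and `_sphinx` are hypothetical and are not textract's actual code; a real parser would call into the SpeechRecognition library where the stand-in functions are.

```python
# Hypothetical stand-ins for real recognizers; they only label their input.
def _google(audio):
    return "google:" + audio


def _sphinx(audio):
    return "sphinx:" + audio


# Adding another engine would mean adding one more entry here.
ENGINES = {"google": _google, "sphinx": _sphinx}


def transcribe(audio, engine="google"):
    """Dispatch to the chosen engine; the default mirrors the PR (Google)."""
    try:
        recognizer = ENGINES[engine]
    except KeyError:
        raise ValueError("unknown engine: %r" % engine)
    return recognizer(audio)
```

With a lookup table like this, supporting a further engine is a one-line change, which matches the claim that adding others should be mostly trivial.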

Tests worked using docker configuration.

ankushshah89 and others added some commits Nov 9, 2015

Python 3 support
Test results, running with Python 3.4:

    FAILED (SKIP=2, errors=2, failures=7)

Two tests are skipped because pdfminer does not support Python 3 yet, see:


Two errors because of msg-extractor (I don't have it installed).

And 7 failures, mostly related to whitespace; not sure how to fix that.
Maybe I should normalize whitespace?

Other than that, everything works and textract works with Python 3 pretty well.
Different packages for different pythons
This adds the possibility to specify different packages for different Python
versions. Currently there is only one difference: pdfminer, which does not yet
support Python 3 (at least officially).
Ignore empty lines when testing output
Not sure why, but on my laptop these tests fail:

tests/ <- tests/ FAILED
tests/ <- tests/ FAILED

They fail because `EbookLib==0.15` adds extra empty lines in several places.
Not sure why this happens, but the failure originates outside of this project,
so to fix that I decided to ignore empty lines.
Fix HtmlTestCase.test_table_text_cli test
With Python 3 `tests/` failed,
because `&#160;` gets converted to empty space, but with Python 2 this gets
converted to `\x00\xa0`. To fix that, I decided to replace every `&#160;` with
a space in the test fixtures.
Fix argparse.FileType binary support
This is a Python 3 bug [1]. Until it is fixed upstream, this adds a fix here.

Add Python 3.4 to travis config
For now I'm adding only Python 3.4, because that is the version I tested on my laptop.
Add .TIF extension synonym
Add ".tif" as an extension synonym for ".tiff" files
Bring back proper line comparison for unit tests, use proper ordering when extracting from epub, remove extra \n in epub output
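
The empty-line-tolerant comparison described in the "Ignore empty lines when testing output" commit could be sketched as follows. The helper names `non_empty_lines` and `same_ignoring_empty_lines` are hypothetical and are not taken from the textract test suite.

```python
def non_empty_lines(text):
    """Return the lines of text, dropping blank or whitespace-only ones."""
    return [line for line in text.splitlines() if line.strip()]


def same_ignoring_empty_lines(expected, actual):
    """Compare two outputs line by line, ignoring empty lines in both."""
    return non_empty_lines(expected) == non_empty_lines(actual)
```

A comparison like this hides extra blank lines emitted by a dependency, at the cost of also hiding genuine blank-line regressions.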

deanmalmgren and others added some commits Nov 15, 2016

Merge pull request #129 from ankushshah89/docs
Updating the documentation about docx file.

barrust commented Dec 30, 2016

@deanmalmgren Sorry for the slow response. Not a problem. I am currently swamped at work but will get to this as soon as I can. I don't think it should be too large a delta to get this working.


coveralls commented Dec 30, 2016

Coverage increased (+0.5%) to 88.962% when pulling 94da8ec on barrust:sphinx_audio into 2d598ad on deanmalmgren:master.


barrust commented Dec 30, 2016

@deanmalmgren Apparently it only required re-basing to get the test to pass; let me know if there are other questions or concerns!

One thing I was wondering about was the naming convention of the engine selector for audio and image extraction. I didn't think they should be the same name so that one could set optional settings for different file types. Let me know if you would like me to rename the variable to something else.


deanmalmgren commented Dec 30, 2016

@barrust thanks for putting this together. I'm on my phone at the moment and will merge this in when I'm back at my computer. In the meantime, what do you mean by "the naming convention of the engine selector for audio and image extraction"?


barrust commented Dec 30, 2016

@deanmalmgren Mostly, I am referring to the PDF extraction "method"; for audio I call it the "engine". I am not sure these are the best names, but at least they do not collide. I went with a different name, in this case engine, so that someone looping over an entire directory to parse could do something like the following:

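import os
import textract
for filename in os.listdir(directory):  # directory: path of the folder to parse
    txt = textract.process(os.path.join(directory, filename), method='tesseract', engine='sphinx')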

If they have the same name, then it would be impossible to set different extraction tools for different types. If you have a better convention or name for the audio tool, please, let me know!
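
To illustrate why distinct keyword names matter, here is a minimal sketch assuming a dispatcher that forwards all keyword arguments to a parser chosen by file extension. The names `parse_pdf`, `parse_audio`, `PARSERS`, and this `process` are hypothetical stand-ins, not textract's implementation.

```python
import os


# Hypothetical parsers: each reads only the keyword it cares about
# and ignores the rest via **kwargs.
def parse_pdf(filename, method="pdftotext", **kwargs):
    return "pdf via " + method


def parse_audio(filename, engine="google", **kwargs):
    return "audio via " + engine


PARSERS = {".pdf": parse_pdf, ".wav": parse_audio, ".mp3": parse_audio}


def process(filename, **kwargs):
    """Pick a parser by extension and pass every option through."""
    ext = os.path.splitext(filename)[1].lower()
    return PARSERS[ext](filename, **kwargs)


# One set of options serves a mixed directory of files.
options = {"method": "tesseract", "engine": "sphinx"}
results = [process(name, **options) for name in ["scan.pdf", "talk.mp3"]]
```

If both parsers read the same keyword, a single options dictionary could not select `tesseract` for PDFs and `sphinx` for audio in the same loop.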


deanmalmgren commented Mar 24, 2017

@barrust apologies for the slow response on this. Your point about having different keyword arguments for different extraction methods is interesting and a bit unexpected. Great point!! I'm going to start a new issue to discuss different ways of approaching it.

In the meantime, I'm going to merge this in and clean a few things up. This functionality will be available in the next release!

@deanmalmgren deanmalmgren merged commit 873519b into deanmalmgren:master Mar 24, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

barrust commented Mar 24, 2017

@deanmalmgren This is wonderful! Thank you!
