added support for extracting text for html documents using beautifulsoup4. fixes #7
Dean Malmgren committed Jul 23, 2014
1 parent 3dc299c commit 39c7cf1
Showing 9 changed files with 2,621 additions and 1 deletion.
1 change: 0 additions & 1 deletion Vagrantfile
@@ -26,7 +26,6 @@ Vagrant.configure("2") do |config|
vb.customize ["modifyvm", :id, "--memory", "2048"]
override_config.vm.box = "precise64"
override_config.vm.box_url = "http://files.vagrantup.com/precise64.box"
override_config.vm.network :forwarded_port, guest: 8000, host: 8000
end

# steps for provisioning so that these provisioning steps are
2 changes: 2 additions & 0 deletions docs/changelog.rst
@@ -9,6 +9,8 @@ track version numbers, where backwards incompatible changes
latest
------

* support for ``.html`` files (#7)


0.3.0
-----
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -39,6 +39,8 @@ Currently supporting

* ``.docx`` via `python-docx <https://python-docx.readthedocs.org/en/latest/>`__

* ``.html`` via `beautifulsoup4 <http://beautiful-soup-4.readthedocs.org/en/latest/>`__

* ``.pptx`` via `python-pptx <https://python-pptx.readthedocs.org/en/latest/>`__

* ``.pdf`` via `pdftotext <http://poppler.freedesktop.org/>`__ (default) or `pdfminer <https://euske.github.io/pdfminer/>`__
7 changes: 7 additions & 0 deletions docs/python_package.rst
@@ -30,6 +30,13 @@ textract.parsers.docx
:members:


textract.parsers.html
---------------------

.. automodule:: textract.parsers.html
:members:


textract.parsers.pdf
---------------------

1 change: 1 addition & 0 deletions requirements/python
@@ -4,3 +4,4 @@ argcomplete
python-pptx
python-docx
pdfminer==20140328
beautifulsoup4
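
The hunks above add beautifulsoup4 as a dependency and document a ``textract.parsers.html`` module, but the parser source itself is not shown here. As a rough illustration only, the following sketch shows one way text might be pulled out of an HTML string with beautifulsoup4. The helper name `html_to_text`, the choice of the built-in `"html.parser"` backend, and the decision to drop `<script>` and `<style>` elements are all assumptions made for this example, not the code from this commit.

```python
from bs4 import BeautifulSoup


def html_to_text(markup):
    """Return the text content of an HTML string.

    Hypothetical helper for illustration; it is not textract's parser.
    """
    # "html.parser" is Python's built-in backend, so no extra parser
    # library is needed beyond beautifulsoup4 itself (an assumption here).
    soup = BeautifulSoup(markup, "html.parser")

    # Remove elements whose contents are not rendered as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Join the remaining text nodes with newlines, stripping the
    # whitespace around each one.
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    )
    print(html_to_text(sample))
```

Dropping script and style content before calling `get_text` keeps JavaScript and CSS out of the extracted text; whether the actual parser does the same cannot be determined from the hunks shown.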