doc,readme: Update note about youtube-dl integration
chfoo committed Jan 28, 2015
1 parent 0239a61 commit 17d2f05
Showing 6 changed files with 44 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -17,7 +17,7 @@ Features:
* Graceful stopping and resuming
* Python & Lua scripting support
* Modular, extensible, & asynchronous API
-* PhantomJS integration
+* PhantomJS & youtube-dl integration

**Currently in beta quality! Some features are not implemented yet and the API
is not considered stable.**
12 changes: 12 additions & 0 deletions doc/api/coprocessor.youtubedl.rst
@@ -0,0 +1,12 @@
+.. This document was automatically generated.
+   DO NOT EDIT!
+
+:mod:`coprocessor.youtubedl` Module
+===================================
+
+.. automodule:: wpull.coprocessor.youtubedl
+    :members:
+    :show-inheritance:
+    :private-members:
+    :special-members:
+    :exclude-members: __dict__,__weakref__
1 change: 1 addition & 0 deletions doc/changelog.rst
@@ -8,6 +8,7 @@ What's New
* Added ``--no-skip-getaddrinfo`` option.
* Added ``--limit-rate`` option.
* Added ``--phantomjs-max-time`` option.
+* Added ``--youtube-dl`` option.
* Improved PhantomJS stability.


4 changes: 3 additions & 1 deletion doc/install.rst
@@ -26,9 +26,11 @@ The following are optional:

* `Lunatic Python (bastibe version)
  <https://github.com/bastibe/lunatic-python>`_ for Lua support
-* `Manhole <https://pypi.python.org/pypi/manhole>`_ for a REPL debugging socket
* `PhantomJS <http://phantomjs.org/>`_ for capturing interactive
  JavaScript pages
+* `Manhole <https://pypi.python.org/pypi/manhole>`_ for a REPL debugging socket
+* `youtube-dl <https://rg3.github.io/youtube-dl/>`_ for downloading complex
+  video streaming sites

For installing Wpull, it is recommended to use `pip installer
<http://www.pip-installer.org/>`_.
2 changes: 1 addition & 1 deletion doc/intro.rst
@@ -18,7 +18,7 @@ Features:
* Graceful stopping and resuming
* Python & Lua scripting support
* Modular, extensible, & asynchronous API
-* PhantomJS integration
+* PhantomJS & youtube-dl integration


.. ⬆ Please keep this intro above in sync with the README file. ⬆
33 changes: 26 additions & 7 deletions doc/usage.rst
@@ -65,18 +65,37 @@ the same behavior as the previous run.
file manually or use the additional option ``--warc-append``.


-PhantomJS Integration (Experimental)
-====================================
+Proxied Services
+================

-``--phantomjs`` will enable PhantomJS integration. If a HTML document is encountered, Wpull will open the URL in PhantomJS. The requests will go through an HTTP proxy to Wpull's HTTP client (which can be recorded with ``--warc-file``).
+Wpull is able to use an HTTP proxy server to capture traffic from third-party programs such as PhantomJS.
+The requests will go through the proxy to Wpull's HTTP client (which can be recorded with ``--warc-file``).

-After the page is loaded, Wpull will try to scroll the page as specified by ``--phantomjs-scroll``. Then, the HTML source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.
+.. warning:: Wpull uses the HTTP proxy insecurely on localhost.

+It is possible for another user, on the same machine as Wpull, to send bogus requests to the HTTP proxy. Wpull, however, does *not* expose the HTTP proxy outside to the net by default.


+PhantomJS Integration
++++++++++++++++++++++

+**PhantomJS support is currently experimental.**

+``--phantomjs`` will enable PhantomJS integration.

+If a HTML document is encountered, Wpull will open the URL in PhantomJS. After the page is loaded, Wpull will try to scroll the page as specified by ``--phantomjs-scroll``. Then, the HTML DOM source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.

Currently, Wpull will *not do anything else* to manipulate the page such as clicking on links. As a consequence, Wpull with PhantomJS is *not* a complete solution for dynamic web pages yet!

The filename of the PhantomJS executable must be on the PATH environment variable.

-.. warning:: Wpull uses an HTTP proxy insecurely with PhantomJS on localhost.
+youtube-dl Integration
+++++++++++++++++++++++

+**youtube-dl support is currently experimental.**

+``--youtube-dl`` will enable youtube-dl integration.

+If a HTML document is encountered, Wpull will run youtube-dl on the URL.

-It is possible for another user, on the same machine as Wpull, to send bogus requests to the HTTP proxy. Wpull, however, does *not* expose the HTTP proxy outside to the net.
+It is not recommended to use recursion because it may fetch redundant amounts of data.
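
The options documented in this commit could be combined from a small wrapper script. The following is only an illustrative sketch and is not part of the commit: ``build_wpull_command`` is a hypothetical helper, the executable names ``wpull``, ``phantomjs``, and ``youtube-dl`` are assumptions, and so is the idea that ``--warc-file`` takes a name argument and that youtube-dl must be found on the PATH in the same way as PhantomJS. Only options named in the diff above are used, and no recursion option is added, in line with the note above.

```python
import shutil


def build_wpull_command(url, warc_name=None, phantomjs=False, youtube_dl=False,
                        which=shutil.which):
    """Build an argument list for a Wpull run (hypothetical helper).

    `which` is injectable so the PATH lookup can be replaced in tests.
    """
    command = ["wpull", url]

    if warc_name:
        # Assumption: --warc-file takes a name for the WARC output.
        command += ["--warc-file", warc_name]

    # Pair each requested option with the executable name we assume it needs.
    optional_tools = []
    if phantomjs:
        optional_tools.append(("--phantomjs", "phantomjs"))
    if youtube_dl:
        optional_tools.append(("--youtube-dl", "youtube-dl"))

    for option, executable in optional_tools:
        # Fail early if the external program cannot be found on the PATH.
        if which(executable) is None:
            raise FileNotFoundError(f"{executable} was not found on PATH")
        command.append(option)

    return command
```

The returned list can be passed to ``subprocess.run(command, check=True)``. Checking the PATH before launching gives a clear error up front when an optional program is missing.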
