Commit

Removed references to Elasticsearch in TOR tutorial
aecio committed Jun 19, 2017
1 parent 3a5c616 commit 84af4ed
Showing 3 changed files with 8 additions and 63 deletions.
5 changes: 0 additions & 5 deletions config/config_docker_tor/ache.yml
@@ -1,8 +1,3 @@
# Configure ACHE to index pages in Elasticsearch container
target_storage.data_format.type: ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts:
- http://elasticsearch:9200

# Basic configuration in-depth web site crawling
link_storage.link_strategy.use_scope: true
link_storage.link_strategy.outlinks: true
20 changes: 1 addition & 19 deletions config/config_docker_tor/docker-compose.yml
@@ -10,28 +10,10 @@ services:
- ./data-ache/:/data
- ./:/config
links:
- elasticsearch
- torproxy
depends_on:
- elasticsearch
- torproxy
elasticsearch:
image: elasticsearch:2.4.5
environment:
- xpack.security.enabled=false
- cluster.name=docker-cluster
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ulimits:
memlock:
soft: -1
hard: -1
mem_limit: 1g
volumes:
- ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
ports:
- 9200:9200
torproxy:
image: dperson/torproxy
ports:
- "8118:8118"
- "8118:8118"
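
Read together, the hunk above would leave the services section of ``docker-compose.yml`` looking roughly like the sketch below. This is an assumption-laden reconstruction, not the file itself: the indentation is guessed because the diff view dropped leading whitespace, and the ``ache`` service settings that fall outside the hunk are left as a comment rather than filled in.

```yaml
# Sketch of docker-compose.yml after this commit (indentation assumed).
services:
  ache:
    # ... settings above the hunk (image, command, etc.) are not shown in the diff ...
    volumes:
      - ./data-ache/:/data   # crawled data is stored outside the container
      - ./:/config           # exposes ache.yml and tor.seeds to the container
    links:
      - torproxy
    depends_on:
      - torproxy
  torproxy:
    image: dperson/torproxy
    ports:
      - "8118:8118"
```

With the ``elasticsearch`` service gone, ``torproxy`` is the only container that ``ache`` links to and depends on.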
46 changes: 7 additions & 39 deletions docs/tutorial-crawling-tor.rst
@@ -10,11 +10,10 @@ being accessed.
Sites hidden on the TOR network are accessed via domain addresses under the top-level domain ``.onion``.
In order to crawl such sites, ACHE relies on external HTTP proxies, such as `Privoxy <https://www.privoxy.org/>`_,
configured to route traffic through the TOR network.
Besides configuring the proxy, we just need to configure ACHE to route requests to `.onion` addresses via the TOR proxy.
Besides configuring the proxy, we just need to configure ACHE to route requests to ``.onion`` addresses via the TOR proxy.

Fully configuring a web proxy to route traffic through TOR is out-of-scope of this tutorial, so we will just
use Docker to run the pre-configured docker image for Privoxy/TOR available at https://hub.docker.com/r/dperson/torproxy/.
For convenience, we will also run ACHE and Elasticsearch using docker containers.

To start and stop the containers, we will use `docker-compose`, so make sure that the Docker version that you installed includes it.
You can verify whether it is installed by running the following command on the Terminal (it should print the version of docker-compose to the output)::
@@ -31,15 +30,15 @@ The following steps explain in details how to crawl ``.onion`` sites using ACHE.
Download the following files and put them in single directory named ``config_docker_tor``:

#. `tor.seeds <https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/tor.seeds>`_: a plain text containing the URLs of the sites you want to crawl. In this example, the file contains a few URLs taken from https://thehiddenwiki.org/. If you want to crawl specific websites, you should list them on this file (one URL per line).
#. `ache.yml <https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/ache.yml>`_: the configuration file for ACHE. It basically configures ACHE to run a in-depth website crawl of the seed URLs, to index crawled pages in the Elasticsearch container, and to download .onion addresses using the TOR proxy container.
#. `docker-compose.yml <https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/docker-compose.yml>`_: a configuration file for Docker, which specifies which containers should be used. It starts an Elasticsearch node, the TOR proxy, and ACHE crawler.
#. `ache.yml <https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/ache.yml>`_: the configuration file for ACHE. It basically configures ACHE to run a in-depth website crawl of the seed URLs, and to download .onion addresses using the TOR proxy container.
#. `docker-compose.yml <https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/docker-compose.yml>`_: a configuration file for Docker, which specifies which containers should be used. It starts the TOR proxy and the crawler.

If you are using Mac or Linux, you can run the following commands on the Terminal to create a folder and download the files automatically:

.. code:: bash
mkdir /tmp/config_docker_tor/
cd /tmp/config_docker_tor/
mkdir config_docker_tor/
cd config_docker_tor/
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/ache.yml
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/docker-compose.yml
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/tor.seeds
@@ -81,28 +80,7 @@ named ``torproxy`` that listens on the port 8118:
ports:
- "8118:8118"
An Elasticsearch node named ``elasticsearch`` that listens on the port 9200 (we also add some common Elasticsearch settings):

.. code:: yaml
elasticsearch:
image: elasticsearch:2.4.5
environment:
- xpack.security.enabled=false
- cluster.name=docker-cluster
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ulimits:
memlock:
soft: -1
hard: -1
mem_limit: 1g
volumes:
- ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
ports:
- 9200:9200
An finally, we configure a container named ``ache``. Note that in order to make the config (``ache.yml``) and the seeds (``tor.seeds``) files available inside the container, we need to mount the volume ``/config`` to point to the current working directory. We also mount the volume ``/data`` in the directory ``./data-ache`` so that the crawled data is stored outside the container. In order to make ACHE communicate to the other containers, we need to link the ACHE container to the other two containers ``elasticsearch`` and ``torproxy``.
We also configure a container named ``ache``. Note that in order to make the config (``ache.yml``) and the seeds (``tor.seeds``) files available inside the container, we need to mount the volume ``/config`` to point to the current working directory. We also mount the volume ``/data`` in the directory ``./data-ache`` so that the crawled data is stored outside the container. In order to make ACHE communicate to the other containers, we need to link the ACHE container to the ``torproxy`` container.

.. code:: yaml
@@ -116,23 +94,13 @@ An finally, we configure a container named ``ache``. Note that in order to make
- ./data-ache/:/data
- ./:/config
links:
- elasticsearch
- torproxy
depends_on:
- elasticsearch
- torproxy
**Understanding the ache.yml file**

The ``ache.yml`` file basically configures ACHE to index crawled data in the ``elasticsearch`` container:

.. code:: yaml
target_storage.data_format.type: ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts:
- http://elasticsearch:9200
and to download .onion addresses using the ``torproxy`` container:
The ``ache.yml`` file basically configures ACHE to download .onion addresses using the ``torproxy`` container:

.. code:: yaml