From 84af4ed62016f56d9d62c927c5bfe81831d1d8b8 Mon Sep 17 00:00:00 2001
From: Aecio Santos
Date: Mon, 19 Jun 2017 02:42:29 -0400
Subject: [PATCH] Removed references to Elasticsearch in TOR tutorial

---
 config/config_docker_tor/ache.yml           |  5 ---
 config/config_docker_tor/docker-compose.yml | 20 +--------
 docs/tutorial-crawling-tor.rst              | 46 ++++-----------------
 3 files changed, 8 insertions(+), 63 deletions(-)

diff --git a/config/config_docker_tor/ache.yml b/config/config_docker_tor/ache.yml
index 8fe1368ee..06d9d13f9 100644
--- a/config/config_docker_tor/ache.yml
+++ b/config/config_docker_tor/ache.yml
@@ -1,8 +1,3 @@
-# Configure ACHE to index pages in Elasticsearch container
-target_storage.data_format.type: ELASTICSEARCH
-target_storage.data_format.elasticsearch.rest.hosts:
-  - http://elasticsearch:9200
-
 # Basic configuration in-depth web site crawling
 link_storage.link_strategy.use_scope: true
 link_storage.link_strategy.outlinks: true
diff --git a/config/config_docker_tor/docker-compose.yml b/config/config_docker_tor/docker-compose.yml
index 79d9571a8..88cbf8768 100644
--- a/config/config_docker_tor/docker-compose.yml
+++ b/config/config_docker_tor/docker-compose.yml
@@ -10,28 +10,10 @@ services:
       - ./data-ache/:/data
       - ./:/config
     links:
-      - elasticsearch
       - torproxy
     depends_on:
-      - elasticsearch
       - torproxy
-  elasticsearch:
-    image: elasticsearch:2.4.5
-    environment:
-      - xpack.security.enabled=false
-      - cluster.name=docker-cluster
-      - bootstrap.memory_lock=true
-      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
-    ulimits:
-      memlock:
-        soft: -1
-        hard: -1
-    mem_limit: 1g
-    volumes:
-      - ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
-    ports:
-      - 9200:9200
   torproxy:
     image: dperson/torproxy
     ports:
-      - "8118:8118"
\ No newline at end of file
+      - "8118:8118"
diff --git a/docs/tutorial-crawling-tor.rst b/docs/tutorial-crawling-tor.rst
index 8ded0bdd3..7230f49c7 100644
--- a/docs/tutorial-crawling-tor.rst
+++ b/docs/tutorial-crawling-tor.rst
@@ -10,11 +10,10 @@
 being accessed.
 Sites hidden on the TOR network are accessed via domain addresses under the top-level domain ``.onion``.
 In order to crawl such sites, ACHE relies on external HTTP proxies, such as `Privoxy `_, configured to route traffic trough the TOR network.
-Besides configuring the proxy, we just need to configure ACHE to route requests to `.onion` addresses via the TOR proxy.
+Besides configuring the proxy, we just need to configure ACHE to route requests to ``.onion`` addresses via the TOR proxy.

 Fully configuring a web proxy to route traffic through TOR is out-of-scope of this tutorial, so we will just use Docker to run the pre-configured docker image for Privoxy/TOR available at https://hub.docker.com/r/dperson/torproxy/.
-For convenience, we will also run ACHE and Elasticsearch using docker containers.
 To start and stop the containers, we will use `docker-compose`, so make sure that the Docker version that you installed includes it.
 You can verify whether it is installed by running the following command on the Terminal (it should print the version of docker-compose to the output)::

@@ -31,15 +30,15 @@ The following steps explain in details how to crawl ``.onion`` sites using ACHE.
 Download the following files and put them in single directory named ``config_docker_tor``:

  #. `tor.seeds `_: a plain text containing the URLs of the sites you want to crawl. In this example, the file contains a few URLs taken from https://thehiddenwiki.org/. If you want to crawl specific websites, you should list them on this file (one URL per line).
- #. `ache.yml `_: the configuration file for ACHE. It basically configures ACHE to run a in-depth website crawl of the seed URLs, to index crawled pages in the Elasticsearch container, and to download .onion addresses using the TOR proxy container.
- #. `docker-compose.yml `_: a configuration file for Docker, which specifies which containers should be used. It starts an Elasticsearch node, the TOR proxy, and ACHE crawler.
+ #. `ache.yml `_: the configuration file for ACHE. It basically configures ACHE to run a in-depth website crawl of the seed URLs, and to download .onion addresses using the TOR proxy container.
+ #. `docker-compose.yml `_: a configuration file for Docker, which specifies which containers should be used. It starts the TOR proxy and the crawler.

 If you are using Mac or Linux, you can run the following commands on the Terminal to create a folder and download the files automatically:

 .. code:: bash

-  mkdir /tmp/config_docker_tor/
-  cd /tmp/config_docker_tor/
+  mkdir config_docker_tor/
+  cd config_docker_tor/
   curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/ache.yml
   curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/docker-compose.yml
   curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/tor.seeds
@@ -81,28 +80,7 @@ named ``torproxy`` that listens on the port 8118:
       ports:
         - "8118:8118"

-An Elasticsearch node named ``elasticsearch`` that listens on the port 9200 (we also add some common Elasticsearch settings):
-
-.. code:: yaml
-
-    elasticsearch:
-      image: elasticsearch:2.4.5
-      environment:
-        - xpack.security.enabled=false
-        - cluster.name=docker-cluster
-        - bootstrap.memory_lock=true
-        - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
-      ulimits:
-        memlock:
-          soft: -1
-          hard: -1
-      mem_limit: 1g
-      volumes:
-        - ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
-      ports:
-        - 9200:9200
-
-An finally, we configure a container named ``ache``. Note that in order to make the config (``ache.yml``) and the seeds (``tor.seeds``) files available inside the container, we need to mount the volume ``/config`` to point to the current working directory. We also mount the volume ``/data`` in the directory ``./data-ache`` so that the crawled data is stored outside the container. In order to make ACHE communicate to the other containers, we need to link the ACHE container to the other two containers ``elasticsearch`` and ``torproxy``.
+We also configure a container named ``ache``. Note that in order to make the config (``ache.yml``) and the seeds (``tor.seeds``) files available inside the container, we need to mount the volume ``/config`` to point to the current working directory. We also mount the volume ``/data`` in the directory ``./data-ache`` so that the crawled data is stored outside the container. In order to make ACHE communicate to the other containers, we need to link the ACHE container to the ``torproxy`` container.

 .. code:: yaml

@@ -116,23 +94,13 @@ An finally, we configure a container named ``ache``. Note that in order to make
         - ./data-ache/:/data
         - ./:/config
       links:
-        - elasticsearch
         - torproxy
       depends_on:
-        - elasticsearch
         - torproxy

 **Understanding the ache.yml file**

-The ``ache.yml`` file basically configures ACHE to index crawled data in the ``elasticsearch`` container:
-
- .. code:: yaml
-
-    target_storage.data_format.type: ELASTICSEARCH
-    target_storage.data_format.elasticsearch.rest.hosts:
-      - http://elasticsearch:9200
-
-and to download .onion addresses using the ``torproxy`` container:
+The ``ache.yml`` file basically configures ACHE to download .onion addresses using the ``torproxy`` container:

  .. code:: yaml