Docs update (#210)
* [WIP] added troubleshootings documentation

* complete docs part 1

* more docs for admin

* more docs in admins

* more docs in admins: ES Performance considerations
ingenieroariel committed Sep 17, 2016
1 parent 661222f commit 49c8033
Showing 4 changed files with 325 additions and 7 deletions.
158 changes: 158 additions & 0 deletions _docs/admins.md
As an administrator, verify in the *periodic tasks* section that the index cached layers task is set.

![](https://cloud.githubusercontent.com/assets/54999/18128944/f18219f0-6f4d-11e6-98d3-6dab0a2a37d9.png)

## Performance considerations

The Hypermap architecture depends on 6 main components:


```
+-------------------+          +----------------------+
|                   |          |                      |
|                   |          |       postgres       |
|    django app   <-+----+---->|                      |
|                   |    |     |                      |
|                   |    |     |                      |
+--------^----------+    |     +----------------------+
         |               |
+--------v----------+    |     +----------------------+
|                   |    |     |                      |
|                   |    |     |                      |
|     rabbitmq      |    +---->|    elastic search    |
|                   |    |     |                      |
|                   |    |     |                      |
+---------^---------+    |     +----------------------+
          |              |
+---------v---------+    |     +----------------------+
|                   |    |     |                      |
|                   |    |     |                      |
|      celery     <-+----+---->|      memcached       |
|  & celery beats   |          |                      |
|                   |          |                      |
+-------------------+          +----------------------+
```


If you want to see how to install those services, refer to "Manual Installations" in the developers documentation.

### Django app

The application layer [#TODO: provide more info here]

The app can be hosted via the WSGI application located at `hypermap/wsgi.py`. For production environments it is recommended to host it with the uWSGI application server; refer to https://uwsgi-docs.readthedocs.io/en/latest/ for more documentation.
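For reference, a Django WSGI module typically looks like the following (an illustrative sketch of what a module like `hypermap/wsgi.py` usually contains, not a verbatim copy of the project file):

```python
# Typical shape of a Django wsgi.py module (illustrative sketch).
import os

from django.core.wsgi import get_wsgi_application

# Point Django at the project settings before building the application.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "hypermap.settings")

# uWSGI imports this module and serves `application`.
application = get_wsgi_application()
```

This is the `hypermap.wsgi:application` callable that the uWSGI command below refers to.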

##### How to start?

##### Development:
```
python manage.py runserver
```

##### Production:
```
uwsgi --module=hypermap.wsgi:application --env DJANGO_SETTINGS_MODULE=hypermap.settings
```
Read more about [Configuring and starting the uWSGI server for Django](https://docs.djangoproject.com/en/1.10/howto/deployment/wsgi/uwsgi/#configuring-and-starting-the-uwsgi-server-for-django)

##### Docker:
```
make start
```


### Rabbit MQ, Celery and Memcached

The queue/task layer. It performs the operations described below using dedicated asynchronous workers, which can run on local or remote machines connected to a RabbitMQ instance.

**Harvesting and Indexing**

Downloads metadata from the Internet: each time an Endpoint, Service or Layer is created, a worker starts async jobs to fetch the information from the remote services.

**Perform Periodic/Scheduled Tasks (AKA beats)**

Kicks off tasks at regular intervals. Two important periodic tasks are configured in the settings file:

Once Layers are created and checked with `hypermap.aggregator.tasks.check_all_services`, they are inserted into Memcached, which acts as a buffer for the task `hypermap.aggregator.tasks.index_cached_layers`; that task then makes a single batch call to the search engine to index them.
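The buffering described above can be sketched in plain Python (a minimal simulation for illustration only: a dict stands in for Memcached, a list stands in for the search backend, and the function names merely mirror the real Celery tasks):

```python
# Illustrative sketch of the cache-buffer / batch-index pattern described above.
cache = {}          # stands in for Memcached
search_index = []   # stands in for the search backend (Elasticsearch/Solr)

def check_layer(layer_id):
    """Stand-in for check_all_services: after checking a layer, buffer it."""
    cache.setdefault("layers_buffer", set()).add(layer_id)

def index_cached_layers():
    """Stand-in for the periodic task: drain the buffer with one batch call."""
    buffered = cache.pop("layers_buffer", set())
    if buffered:
        search_index.extend(sorted(buffered))  # one batch call, not N calls
    return len(buffered)

for layer in ("layer-1", "layer-2", "layer-3"):
    check_layer(layer)

indexed = index_cached_layers()  # drains all three layers in one batch
```

The point of the buffer is that many individual layer checks result in a single bulk request to the search backend instead of one request per layer.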


***Important settings***

`REGISTRY_CHECK_PERIOD` (in minutes) defines the interval at which the task `check_all_services` will be executed by the available workers to start checking the Service and Layers status.

`REGISTRY_INDEX_CACHED_LAYERS_PERIOD` (in minutes) defines the interval at which the task `index_cached_layers` will be executed by the available workers to start sending memcached-buffered layers to the search backend.

The setting `CELERYBEAT_SCHEDULE` registers the creation of those periodic tasks:

```
CELERYBEAT_SCHEDULE = {
'Check All Services': {
'task': 'hypermap.aggregator.tasks.check_all_services',
'schedule': timedelta(minutes=REGISTRY_CHECK_PERIOD)
},
'Index Cached Layers': {
'task': 'hypermap.aggregator.tasks.index_cached_layers',
'schedule': timedelta(minutes=REGISTRY_INDEX_CACHED_LAYERS_PERIOD)
}
}
```

These two periodic tasks should be created automatically in the admin site when the Celery workers start. One way to check this is to go to the admin site and verify their presence on the *Periodic Tasks* page:

##### How to start?

```
celery worker --app=hypermap.celeryapp:app --concurrency 4 -B -l INFO
```

##### Docker:
```
make start
```

##### How to check periodic tasks created by Celery?

<img src="http://panchicore.d.pr/kLYB+" width="400">

You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks. Using a centralized approach means the schedule does not have to be synchronized, and the service can operate without using locks.

**Why `REGISTRY_CHECK_PERIOD` should be an extended period of time**

`check_all_services` opens connections to the registered services in order to run checks and download information. If the check period is too low, it can generate massive numbers of connections to those services, causing high incoming traffic and workload that can look like a denial-of-service attack. The recommended value for `REGISTRY_CHECK_PERIOD` is `60*24`, which performs a daily check.
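A back-of-the-envelope estimate (with purely illustrative numbers) shows why a short period multiplies outbound traffic:

```python
def checks_per_day(num_services, check_period_minutes):
    """Total check runs per day: runs per service times the service count."""
    runs_per_day = (24 * 60) / check_period_minutes
    return int(num_services * runs_per_day)

# 500 registered services checked every 5 minutes vs. once a day:
aggressive = checks_per_day(500, 5)       # 144,000 outbound check runs per day
daily = checks_per_day(500, 60 * 24)      # 500 outbound check runs per day
```

Each check run may itself issue several HTTP requests per service, so the aggressive schedule is orders of magnitude heavier on the remote servers.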

One way to avoid remote connections to service servers that do not need to be harvested is to set `Service.is_monitored=False` for those services.

As in the admin site:

<img src="http://panchicore.d.pr/1eC9N+" width="300">

**Workers quantity (--concurrency N)**

The recommended number of concurrent workers running on a machine is close to the number of CPU cores.
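One way to derive that value at start-up (an illustrative sketch, not part of Hypermap):

```python
import multiprocessing

def recommended_concurrency(reserve_cores=0):
    """Concurrency close to the CPU core count, keeping at least one worker."""
    return max(1, multiprocessing.cpu_count() - reserve_cores)

# The result could be passed to: celery worker --concurrency <N>
print(recommended_concurrency())
```

Reserving a core or two (`reserve_cores=1`) can help on machines that also host other services from the diagram above.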

**Scaling up/down: registering Celery nodes**

Deploy the Hypermap code on a different machine in the same cluster and use the same `BROKER_URL`. Tasks will automatically start on this node. Don't start it with beats: this ensures only a single scheduler is running at a time, otherwise you would end up with duplicate tasks.

**Starting celery without beats**

Just remove the `-B` from the start command:

```
celery worker --app=hypermap.celeryapp:app --concurrency 4 -l INFO
```

**How to purge active and pending task from celery**

This is unrecoverable: the tasks will be deleted from the messaging server.

First stop all workers and run:
```
celery --app=hypermap.celeryapp:app purge -f
```


### Elasticsearch

**`REGISTRY_MAPPING_PRECISION`**

This parameter may be used instead of `tree_levels`: the value specifies the desired precision, and Elasticsearch will calculate the best `tree_levels` value to honor it. The value should be a number followed by the `m` distance unit (e.g. `500m`).
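For illustration, this is roughly the kind of `geo_shape` mapping fragment such a precision setting produces (a hand-built sketch; the field name and tree type below are assumptions, not Hypermap's actual mapping):

```python
# Sketch of a geo_shape mapping fragment driven by a precision setting.
REGISTRY_MAPPING_PRECISION = "500m"  # a number followed by the "m" distance unit

def layer_geo_mapping(precision):
    """Build an illustrative geo_shape property using precision, not tree_levels."""
    assert precision.endswith("m") and precision[:-1].isdigit(), "expected e.g. '500m'"
    return {
        "layer_geoshape": {          # illustrative field name
            "type": "geo_shape",
            "tree": "quadtree",      # illustrative tree implementation
            "precision": precision,  # Elasticsearch derives tree_levels from this
        }
    }

mapping = layer_geo_mapping(REGISTRY_MAPPING_PRECISION)
```

A coarser value such as `500m` keeps the prefix tree shallow; the performance cost of finer values is discussed below.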

**Performance considerations**

Elasticsearch uses the paths in the prefix tree as terms in the index and in queries. The higher the levels is (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters. Big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case. Generally one trades off accuracy against index size and query performance.

The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without overly bloating the resulting index too much relative to the input size.

So take care when setting a low `REGISTRY_MAPPING_PRECISION`: sending Layers to Elasticsearch will become slow.
62 changes: 58 additions & 4 deletions _docs/developers.md
docker-compose restart celery

##### Tests commands

*Unit tests* assert the correct functionality of the Hypermap workflow, in which an added endpoint creates Services and their Layers; they check that the correct metadata is harvested, stored in the DB, and indexed in the search backend.

```sh
make test-unit
```

*Solr backend* asserts the correct functionality of the Solr implementation for indexing Layers, and tests the Hypermap search API connected to that implementation by querying data by the most important fields.

1. inserts 4 layers
2. tests match-all docs, `q.text`, `q.geo`, `q.time` and some facets; see the search API documentation. (#TODO link to the api docs).

```sh
make test-solr
```

*Elasticsearch backend* asserts the correct functionality of the Elasticsearch implementation for indexing Layers, and tests the Hypermap search API connected to that implementation by querying data by the most important fields.

1. inserts 4 layers
2. tests match-all docs, `q.text`, `q.geo`, `q.time` and some facets; see the search API documentation. (#TODO link to the api docs).

```sh
make test-elastic
```

*Selenium Browser* is an end-to-end test that runs Firefox and emulates user interaction with some basic actions to test the correct functionality of the Django admin site and registry UI. This test covers the following actions:

1. admin login (user sessions work as expected)
2. periodic tasks verification (automatic periodic tasks are created on startup in order to perform important automatic tasks like checking layers, indexing cached layers on the search backend, and clean-up tasks)
3. upload endpoint list (the file uploads correctly and is stored in the DB; it triggers the full harvesting load: create endpoints, create services and their layers, index layers in the search backend, and perform the first service checks)
4. verify creation of endpoint, service and layers (the previous workflow executed correctly)
5. browse the search backend URL (the previously created layers should appear indexed)
6. browse /registry/ (created services are displayed to users correctly)
7. browse service details (check that basic service metadata is present on the page)
8. reset service checks (correct functionality should start new check tasks)
9. create new service checks and verification (trigger the verification tasks and verify them on the service listing page)
10. browse layer details (check that basic layer metadata is present on the page)
11. reset layer checks (correct functionality should start new check tasks)
12. create new layer checks and verification (trigger the verification tasks and verify them on the service layers listing page)
13. clear index (tests the index clean-up functionality)


```sh
make test-endtoend-selenium-firefox
```

The Selenium and Firefox interaction can be viewed by connecting via the VNC protocol; the easiest method is to use Safari. Open Safari, type `vnc://localhost:5900` in the URL bar, hit enter, and enter `secret` in the password field. Another method is to use VNC Viewer: https://www.realvnc.com/download/viewer/

*CSW-T* asserts the correct functionality of CSW transaction requests.

1. inserts a full XML document with `request=Transaction` and verifies that Layers are created correctly; the inserted document with 10 Layers can be found at `data/cswt_insert.xml`.
2. verifies the listing by calling `request=GetRecords` and asserting 10 Layers were created.
3. verifies the search by calling `request=GetRecords` and passing a `q` parameter.
4. as that harvesting method also sends the layers to the search backend, a verification is made in order to assert the 10 layers were created there.

```sh
make test-csw-transactions
```

To run all tests above in a single command:

```sh
make test
```

##### Travis Continuous Integration Server

The `master` branch is automatically synced on https://travis-ci.org/ and test results are reported there; to see how Travis runs the tests, refer to the `.travis.yml` file in the project root.
If you want to run the tests in your local containers first, execute travis-solo (`pip install travis-solo`) in the directory containing the `.travis.yml` configuration file. Its return code will be 0 in case of success and non-zero in case of failure.

##### Tool For Style Guide Enforcement

The modular source code checker wrapping `pep8`, `pyflakes` and co. runs thanks to `flake8`, already installed with the project requirements, and can be executed with this command:

```sh
make flake
```

Note that Travis-CI will assert that flake8 reports zero incidences.

## Known Issues in version 0.3.9

- Items from Brazil appear in Australia: https://github.com/cga-harvard/HHypermap/issues/199
12 changes: 12 additions & 0 deletions _docs/troubleshootings.md
# HHypermap registry troubleshooting

**1. When I add a URL into the database, services are not created**

- Verify that the database service is ready and migrations have been applied.
- Check that the celery process started after the database migrations.

**2. Services and layers are created, but layers are not indexed into search backend**

As an administrator, verify in the *periodic tasks* section that the index cached layers task is set.

![](https://cloud.githubusercontent.com/assets/54999/18128944/f18219f0-6f4d-11e6-98d3-6dab0a2a37d9.png)
100 changes: 97 additions & 3 deletions _docs/users.md
HHypermap Registry visualization tool comes with different types of filtering.

## Testing

### How to connect with Hypermap features

There are three ways to connect with Hypermap's functionality:

### 1. Registry web app

#### Creating layers

- **WMS testing:** Upload http://demo.opengeo.org/geoserver/wms. There should be 72 layers created and indexed in the search backend.
- Modify the environment variable REGISTRY_LIMIT_LAYERS to 2. If you are using Docker, modify your docker-compose.yml file for both the django and celery images and verify that no service has more than 2 layers.
- Verify that the number of layers created in the database and the number of documents indexed in the search backend are the same.
- Verify that the total number shown when the map UI client is loaded matches the total number of records in Elasticsearch and the total number of layers on Registry's home page.
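That count comparison can be scripted with a small helper (a sketch: the response dict below is a stand-in shaped like an Elasticsearch 2.x `_search` result; in practice the counts would come from the database and from `http://localhost/_elastic/hypermap/_search`):

```python
def es_hit_total(es_response):
    """Extract the total hit count from an Elasticsearch search response dict."""
    return es_response["hits"]["total"]

def counts_consistent(db_layer_count, es_response):
    """True when the DB layer count matches the search-backend document count."""
    return db_layer_count == es_hit_total(es_response)

# Stand-in response; real data would come from the search backend URL above.
fake_response = {"hits": {"total": 72, "hits": []}}
```

A mismatch between the two counts usually means the `index_cached_layers` periodic task has not yet drained the Memcached buffer.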

#### Service detail page.

- Using a previously generated service from an endpoint, remove checks using the **Remove Checks** button. Then press the **Check now** button and verify in the celery monitor tab that the check task runs. After the check is finished you can verify the number of total checks in the *monitoring period* section.
![image](https://cloud.githubusercontent.com/assets/3285923/17679102/91ec62b6-62ff-11e6-8672-4dfe306c7aa6.png)

![image](http://d.pr/i/16v0E+)

#### Celery monitor and search backend indexing

- Using a previously created service, verify that all layers are indexed in the search backend. For Elasticsearch you can verify by executing this command in a terminal:
```sh
curl -X "GET" "http://localhost/_elastic/hypermap/_search"
```
- Press *Reindex all layers* to add all created layers into the search backend. And verify in the search backend.
![image](https://cloud.githubusercontent.com/assets/3285923/17679268/584b7faa-6300-11e6-9bf3-31007ca6ce8f.png)
![image](http://d.pr/i/P0I1+)

### 2. Search API

This section outlines the architecture and specifications of the HHypermap Search API.

#### API Documentation powered by Swagger

The goal of Swagger is to define a standard, language-agnostic interface to REST APIs which allows both humans and computers to discover and understand the capabilities of the service without access to source code, documentation, or through network traffic inspection. When properly defined via Swagger, a consumer can understand and interact with the remote service with a minimal amount of implementation logic. Similar to what interfaces have done for lower-level programming, Swagger removes the guesswork in calling the service.

Technically speaking, Swagger is a formal specification surrounded by a large ecosystem of tools, which includes everything from front-end user interfaces to low-level code libraries and commercial API management solutions.

The Swagger file can be found at `hypermap/search_api/static/swagger/search_api.yaml` and, while the Hypermap server is up and running, is hosted at `http://localhost/registry/api/docs`.

![image](http://panchicore.d.pr/1jk74+)

#### Architecture

The Search API connects to a dedicated search backend instance and provides read-only search capabilities.

```
/registry/{catalog_slug}/api/
+---------------------------------+        +----------------------------------+
|                                 |        |                                  |
|  - filter params:               |        |                                  |
|      by text, geo, time         |        |                                  |
|  - facet params:                |  HTTP  |                                  |
|      text, geo heatmap,       <---------->          Search backend          |
|      time, username             |        |                                  |
|  - presentation params:         |        |                                  |
|      limits, pagination,        |        |                                  |
|      ordering                   |        |                                  |
|                                 |        |                                  |
+---------------------------------+        +----------------------------------+
```
#### Parameters documentation

As Swagger is the API documentation per se, all filter, facet, and presentation parameters, data types, request and response data models, response messages, curl examples, etc., are described in the Swagger UI at `http://localhost/registry/api/docs`, as presented in this screenshot:

![image](http://panchicore.d.pr/1gHWu+)
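For example, a Search API request can be assembled from those parameter groups with the standard library (a sketch: `q.text`/`q.time` follow the parameter convention used in the tests section, while `d.docs.limit` and the catalog slug `hypermap` are assumptions — check the Swagger UI for the authoritative names):

```python
from urllib.parse import urlencode

# /registry/{catalog_slug}/api/ with an illustrative catalog slug.
BASE = "http://localhost/registry/hypermap/api/"

params = {
    # filter params
    "q.text": "Airport",
    "q.time": "[2000-01-01 TO 2016-12-31]",
    # presentation params (assumed name)
    "d.docs.limit": 10,
}

url = BASE + "?" + urlencode(params)  # urlencode escapes brackets and spaces
```

The resulting URL can be issued with any HTTP client; the Swagger UI shows the full set of supported filter, facet, and presentation parameters.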

### 3. Embedded CSW-T

Hypermap has the ability to process CSW Harvest and Transaction requests via CSW-T.

pycsw is an OGC CSW server implementation written in Python.

pycsw fully implements the OpenGIS Catalogue Service Implementation Specification (Catalogue Service for the Web). Initial development started in 2010 (more formally announced in 2011). The project is certified OGC Compliant, and is an OGC Reference Implementation. Since 2015, pycsw is an official OSGeo Project.

Please read the docs at http://docs.pycsw.org/en/2.0.0/ for more information.


#### How to use

The following instructions show how to use the different request types:

#### 1. Insert
Insert layers from an XML file located at `data/cswt_insert.xml` with `request=Transaction`; the file contains 10 layers.

```
curl -v -X "POST" \
   "http://admin:admin@localhost/registry/hypermap/csw/?service=CSW&request=Transaction" \
   -H "Content-Type: application/xml" \
   -d @data/cswt_insert.xml
```

#### 2. Retrieve
Return the 10 layers added before with `request=GetRecords`:

```
curl -X "GET" \
   "http://admin:admin@localhost/registry/hypermap/csw/?service=CSW&version=2.0.2&request=GetRecords&typenames=csw:Record&elementsetname=full&resulttype=results"
```

#### 3. Filter layers
Query records with `mode=opensearch` and the `q=` parameter; in this example one layer is named "Airport":

```
curl -X "GET" \
"http://admin:admin@localhost/registry/hypermap/csw/?mode=opensearch&service=CSW&version=2.0.2&request=GetRecords&elementsetname=full&typenames=csw:Record&resulttype=results&q=Airport"
```

#### 4. Ensure layers are also indexed in the search backend:
```
curl -X "GET" \
"http://localhost/_elastic/hypermap/_search"
```
