Skip to content

Commit

Permalink
add biothings.web tutorial
Browse files Browse the repository at this point in the history
  • Loading branch information
Jerry committed Dec 1, 2021
1 parent f2e6f65 commit 723ed06
Show file tree
Hide file tree
Showing 2 changed files with 327 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,7 @@ We recommend to follow `this tutorial <tutorial/studio.html>`_ to develop your f
tutorial/studio
tutorial/standalone
tutorial/hub_tutorial
tutorial/web
tutorial/datatransform

.. toctree::
Expand Down
326 changes: 326 additions & 0 deletions docs/tutorial/web.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,326 @@
BioThings Web
---------------

In this tutorial we will start a Biothings API and learn to customize the API,
adding new features and overriding the default behaviors, using increasingly
more advanced techniques step by step. In the end, you will be able to make
your own Biothings API, run other production APIs, like Mygene.info, and
additionally, customize and add more features to those projects.

Before starting the tutorial, you should have the biothings package installed,
and have an Elasticsearch with only one index populated with this dataset
`data <https://github.com/biothings/biothings.api/blob/master/tests/web/handlers/test_data.ndjson>`_ using this
`mapping <https://github.com/biothings/biothings.api/blob/master/tests/web/handlers/test_data_index.json>`_.

First, assuming your Elasticsearch service is running on the default port,
we can run a Biothings API with all default settings to explore the data,
simply by creating a ``config.py`` under your project folder. After creating
the file, run ``python -m biothings.web`` to start the API server. You should
be able to see the following console output::

[I 211130 22:21:57 launcher:28] Biothings API 0.10.0
[I 211130 22:21:57 configs:86] <module 'config' from 'C:\\Users\\Jerry\\code\\biothings.tutorial\\config.py'>
[INFO biothings.web.connections:31] <Elasticsearch([{'host': 'localhost', 'port': 9200}])>
[INFO biothings.web.connections:31] <AsyncElasticsearch([{'host': 'localhost', 'port': 9200}])>
[INFO biothings.web.applications:137] API Handlers:
[('/', <class 'biothings.web.handlers.services.FrontPageHandler'>, {}),
('/status', <class 'biothings.web.handlers.services.StatusHandler'>, {}),
('/metadata/fields/?', <class 'biothings.web.handlers.query.MetadataFieldHandler'>, {}),
('/metadata/?', <class 'biothings.web.handlers.query.MetadataSourceHandler'>, {}),
('/v1/spec/?', <class 'biothings.web.handlers.services.APISpecificationHandler'>, {}),
('/v1/doc(?:/([^/]+))?/?', <class 'biothings.web.handlers.query.BiothingHandler'>, {'biothing_type': 'doc'}),
('/v1/metadata/fields/?', <class 'biothings.web.handlers.query.MetadataFieldHandler'>, {}),
('/v1/metadata/?', <class 'biothings.web.handlers.query.MetadataSourceHandler'>, {}),
('/v1/query/?', <class 'biothings.web.handlers.query.QueryHandler'>, {})]
[INFO biothings.web.launcher:99] Server is running on "0.0.0.0:8000"...
[INFO biothings.web.connections:25] Elasticsearch Package Version: 7.13.4
[INFO biothings.web.connections:27] Elasticsearch DSL Package Version: 7.3.0
[INFO biothings.web.connections:51] localhost:9200: docker-cluster 7.9.3

Note the console log shows the API version, the config file it uses, its
database connections, HTTP routes, service port, important python dependency
package versions, as well as the database cluster details, showing in the very
end because it is scheduled asynchronously at start time, but executed later
after the main program has launched.

Of all the information provided, note that it says the server is running on
port 8000, this is the default port we use when we start a Biothings API.
You can pass the "port" parameter during startup to configure the port our
service will run on, for example: ``python -m biothings.web --port=9000``.
Now we open the browser and when we access localhost:8000, we should be able
to see the biothings welcome page, showing the public routes in regex formats.

The last route in the welcome page shows the url pattern of the query API.
Let's use this pattern to access the query endpoint. Accessing http://localhost:8000/v1/query/
returns a JSON document containing 10 results from our elasticsearch index.

Let's explore some Biothings API feature here, adding a query parameter "fields"
to limit the fields returned by the API, and another parameter "size" to limit
the returned document number. If you used the dataset mentioned at the start
of the tutorial, accessing http://localhost:8000/v1/query?fields=symbol,alias,name&size=1
should return a document like this::

{
"took": 15,
"total": 1030,
"max_score": 1,
"hits": [
{
"_id": "1017",
"_score": 1,
"alias": [
"CDKN2",
"p33(CDK2)"
],
"name": "cyclin dependent kinase 2",
"symbol": "CDK2"
}
]
}

The most commonly used parameter is the "q" parameter, try http://localhost:8000/v1/query?q=cdk2
and see all the returned results contain "cdk2", the value specified for "q" parameter.
Is there a way to access individual documents directly by their "_id" or other
id fields? We can support this feature by providing more information for the
program in the ``config.py`` file. In the file, write::

ES_HOST = "localhost:9200" # optional
ES_INDICES = {"gene": "<your index name>"}
ANNOTATION_DEFAULT_SCOPES = ["_id", "symbol"]

Restart your program and see an addtional route prefixed with /v1/gene is created
if you pay close attention to the console log. Now try the follwing URL:

http://localhost:8000/v1/gene/1017
http://localhost:8000/v1/gene/CDK2

See that using both of the URL can take you staright to the document previously
mentioned. Note using the symbol field "CDK2" may yield multiple documents
because multiple documents may have the same key-value pair. This also means
"symbol" may not be a good choice of key field we want to support in the URL.

The query endpoint we first introduced is usually used to search documents with
text keywords. When we know the document id we want to access instead, we can
use the annotation endpoint in the previous example. These two endpoints are
the pillars for Biothings API. You can additional customize these endpoints
to work better with your data.

For example, if you think our default returned result for the query endpoint
is too verbose and we want to only include limited information unless the user
specifically asked for more, we can set a default "fields" value, for this
parameter used in the previous example. Open ``config.py`` and add::

from biothings.web.settings.default import QUERY_KWARGS
QUERY_KWARGS['*']['_source']['default'] = ['name', 'symbol', 'taxid', 'entrezgene']

Restart your program after changing the config file and visit http://localhost:8000/v1/query,
see the effect of specifying default fields to return. Like this::

{
"took": 9,
"total": 100,
"max_score": 1,
"hits": [
{
"_id": "1017",
"_score": 1,
"entrezgene": "1017",
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
},
{
"_id": "12566",
"_score": 1,
"entrezgene": "12566",
"name": "cyclin-dependent kinase 2",
"symbol": "Cdk2",
"taxid": 10090
},
{
"_id": "362817",
"_score": 1,
"entrezgene": "362817",
"name": "cyclin dependent kinase 2",
"symbol": "Cdk2",
"taxid": 10116
},
...
]
}

We also see that in this example, the "entrezgene" field are numbers formatted
as strings. Let's modify the internal logic called the query pipeline to convert
this value to an integer just to show what we can do for customization. Add to
config.py::
ES_RESULT_TRANSFORM = "pipeline.MyFormatter"
And create a file ``pipeline.py`` to include::

from biothings.web.query import ESResultFormatter


class MyFormatter(ESResultFormatter):

def transform_hit(self, path, doc, options):

if path == '' and 'entrezgene' in doc: # root level
try:
doc['entrezgene'] = int(doc['entrezgene'])
except:
...

Commit your changes and restart the web server process. Run some queries
and you should be able to see the "entrezgene" field now become integers::
{
"_id": "1017",
"_score": 1,
"entrezgene": 1017, # instead of the quoted "1017" (str)
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
}

In the previous example, we are making changes to the query transformation
stage, controlled by the ESResultFormatter class, this is one of the three
stages that defined the query pipeline. The two stages coming before this
are represented by ESQueryBuilder and ESQueryEngine.

Let's addtionally try to add another feature to better sort the query result
incorporating domain knowledge of the data to make sure we deliver the most
user-friendly result. Additionally import ``ESQueryBuilder`` from ``biothings.web.query``
and also add ``from elasticsearch_dsl import Search`` in the ``pipeline.py``.
Customize the sort score by adding in the same file::

class MyQueryBuilder(ESQueryBuilder):

def apply_extras(self, search, options):

search = Search().query(
"function_score",
query=search.query,
functions=[
{"filter": {"term": {"name": "pseudogene"}}, "weight": "0.5"}, # downgrade
{"filter": {"term": {"taxid": 9606}}, "weight": "1.55"},
{"filter": {"term": {"taxid": 10090}}, "weight": "1.3"},
{"filter": {"term": {"taxid": 10116}}, "weight": "1.1"},
], score_mode="first")

return super().apply_extras(search, options)

Make sure our application can pick up the change by adding this line to ``config.py``::

ES_QUERY_BUILDER = "pipeline.MyQueryBuilder"

Save the file and restart the web server process. Search something and if you compare
with the application before, you may notice some result rankings have changed.
It is not easy to pick up this change if you are not familiar with the data,
visit http://localhost:8000/v1/query?q=kinase&rawquery instead and see that
our code was indeed making a difference and get passed to elasticsearch,
affecting the query result ranking. Notice the "rawquery" is a feature in
our program to intercept the raw query we sent to elasticsearch for debugging.

Taking it one more step further, we can add more stages to the pipeline by
overwriting the Pipeline class. Add to the config file::

ES_QUERY_PIPELINE = "pipeline.MyQueryPipeline"

and add the following code to ``pipeline.py``::

class MyQueryPipeline(AsyncESQueryPipeline):

async def fetch(self, id, **options):

if id == "tutorial":
res = {"_welcome": "to the world of biothings.api"}
res.update(await super().fetch("1017", **options))
return res

res = await super().fetch(id, **options)
return res

Now we made ourselves a tutorial page to show what annotation results
can look like, by visiting http://localhost:8000/v1/gene/tutorial, you
can see what http://localhost:8000/v1/gene/1017 would typically give you,
and the additional welcome message::
{
"_welcome": "to the world of biothings.api",
"_id": "1017",
"_version": 1,
"entrezgene": 1017,
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
}

The examples above demonstrated the customizations you can make on top of
our pre-defined APIs, for the most demanding tasks, you can additionally add
your own API routes to the web app.

Declare a new route in the config file as a usual first step::

from biothings.web.settings.default import APP_LIST
APP_LIST = [
*APP_LIST, # keep the original ones
(r"/{ver}/echo/(.+)", "handlers.EchoHandler"),
]

Let's make an echo handler that just echos what the user put in the URL,
create a ``handlers.py`` and add::

from biothings.web.handlers import BaseAPIHandler


class EchoHandler(BaseAPIHandler):

def get(self, text):
self.write({
"status": "ok",
"result": text
})

Now we have added a completely new feature not basing on any of the existing
biothings offerings, which can be as simple and as complex as you need.
Visiting http://localhost:8000/v1/echo/hello would give you::

{
"status": "ok",
"result": "hello"
}

in which case, the "hello" in "result" is the input we give the application
in the URL. Another convenient place to combine your existing application
logic with biothings or configuring the server level settings is to provide
a launching module, typically called ``index.py``, create this file in your
project folder and add these lines::

from biothings.web.launcher import main
from tornado.web import RedirectHandler

if __name__ == '__main__':
main([
(r"/v2/query(.*)", RedirectHandler, {"url": "/v1/query{0}"})
], {
"static_path": "static"
})

Create another folder called "static" and add a file of random content
named "file.txt" under the newly created static folder. In this step,
we added a redirction of a later-to-launch v2 query API, that we
temporarily set to redirect to the v1 API and passing a tornado web
server, the default web server we use, static file configuration that
asks tornado to serve files under the static folder we specified, in
this case, named "static" and containing only one file.

After making the changes, visiting http://localhost:8000/v2/query/?q=cdk2
would direct you back to http://localhost:8000/v1/query/?q=cdk2 and by visiting
http://localhost:8000/static/file.txt you should see the random content you
previously created. Note in this step, you should run the python launcher
module directly by calling something like ``python index.py`` instead of
running the first command we introduced.

Finishing this tutorial, you have completed the most common steps to customize
biothings.api from the smallest step, by passing a different parameter at
launch time, to modifying the app code from the lower to highest level.
I hope you feel confident running biothings API now and please check out
the documentation page for more details on customizing APIs.

0 comments on commit 723ed06

Please sign in to comment.