Merge pull request #310 from biothings/tutorial-update
Tutorial update
jal347 committed Jan 4, 2024
2 parents f5a3ee0 + d81ae23 commit df1d9dd
Showing 1 changed file with 40 additions and 56 deletions.
96 changes: 40 additions & 56 deletions docs/tutorial/studio_tutorial.rst
@@ -7,16 +7,15 @@ to a fully operational BioThings API. In a second part, this API will enrich for

.. note:: You may also want to read the `developer's guide <studio_guide.html>`_ for more detailed information.

.. note:: The following tutorial is only valid for **BioThings Studio** release **0.2b**. Check
all available `releases <https://github.com/biothings/biothings_studio/releases>`_ for more.
.. note:: The following tutorial uses a docker-compose file to run the **BioThings Studio** and **Hub**. This file is available `here <https://github.com/biothings/biothings_docker/tree/docker-compose>`_.

====================
1. What you'll learn
====================

Through this guide, you'll learn:

* how to obtain a Docker image to run your favorite API
* how to use a docker-compose file to run your favorite API
* how to run that image inside a Docker container and how to access the **BioThings Studio** application
* how to integrate a new data source by defining a data plugin
* how to define a build configuration and create data releases
@@ -29,14 +28,9 @@ Through this guide, you'll learn:
=============

Using **BioThings Studio** requires a Docker server up and running and some basic knowledge
about commands to run and use containers. Images have been tested on Docker >=17. Using AWS cloud,
you can use our public AMI **biothings_demo_docker** (``ami-44865e3c`` in Oregon region) with Docker pre-configured
and ready for studio deployment. Instance type depends on the size of data you
want to integrate and parsers' performances. For this tutorial, we recommend using instance type with at least
4GiB RAM, such as ``t2.medium``. AMI comes with an extra 30GiB EBS volume, which is more than enough
for the scope of this tutorial.

Alternately, you can install your own Docker server (on recent Ubuntu systems, ``sudo apt-get install docker.io``
about commands to run and use containers. Images have been tested on Docker >=17.

You can install your own Docker server (on recent Ubuntu systems, ``sudo apt-get install docker.io``
is usually enough). You may need to point the Docker images directory to a specific hard drive to get enough space,
using the ``-g`` option:

@@ -47,16 +41,16 @@ using ``-g`` option:
# restart to make this change active
sudo service docker restart
Alternatively, if you are on Mac or Windows, you can install `Docker Desktop <https://www.docker.com/products/docker-desktop>`_,
which installs the Docker server for you. Once Docker Desktop is installed, go to Settings -> Resources -> Advanced and give at least 80% of your resources to Docker for each category.
This will prevent Docker from crashing if you are running a large datasource or build.

===============
3. Installation
===============

**BioThings Studio** is available as a Docker image that you can pull from our BioThings Docker Hub repository:

.. code:: bash
$ docker pull biothings/biothings-studio:0.2b
**BioThings Studio** is available as a docker-compose file in our `GitHub repository <https://github.com/biothings/biothings_docker/>`_.
Clone the repository and go to the ``docker-compose`` branch.

A **BioThings Studio** instance exposes several services on different ports:

@@ -70,44 +64,25 @@ A **BioThings Studio** instance exposes several services on different ports:
* **9000**: `Cerebro <https://github.com/lmenezes/cerebro>`_, a webapp used to easily interact with ElasticSearch clusters
* **60080**: `Code-Server <https://github.com/cdr/code-server>`_, a webapp used to directly edit code in the container

We will map and expose those ports to the host server using option ``-p`` so we can access BioThings services without
having to enter the container:
.. note:: Ports 8080, 7022, 7080, 9200, 27017, 8000, 9000, 60080 are exposed by default in the ``docker-compose.yml`` file.

.. code:: bash
$ docker run --rm --name studio -p 8080:8080 -p 7022:7022 -p 7080:7080 -p 7081:7081 -p 9200:9200 \
-p 27017:27017 -p 8000:8000 -p 9000:9000 -p 60080:60080 -d biothings/biothings-studio:0.2b
.. note:: we need to add the release number after the image name: biothings-studio:**0.2b**. Should you use another release (including unstable releases,
tagged as ``master``) you would need to adjust this parameter accordingly.

.. note:: BioThings Studio and the Hub are not designed to be publicly accessible. Those ports should **not** be exposed. When
accessing the Studio and any of these ports, SSH tunneling can be used to safely access the services from outside.
Ex: ``ssh -L 7080:localhost:7080 -L 8080:localhost:8080 -L 7022:localhost:7022 -L 9000:localhost:9000 user@mydockerserver`` will expose the Hub REST API, the web application,
the Hub SSH, and Cerebro app ports to your computer, so you can access the webapp using http://localhost:8080, the Hub REST API using http://localhost:7080,
http://localhost:9000 for Cerebro, and directly type ``ssh -p 7022 biothings@localhost`` to access Hub's internals via the console.
See https://www.howtogeek.com/168145/how-to-use-ssh-tunneling for more details.
$ docker compose up -d --build
We can follow the starting sequence using the ``docker logs`` command:

.. code:: bash
$ docker logs -f studio
Waiting for mongo
tcp 0 0 127.0.0.1:27017 0.0.0.0:* LISTEN -
* Starting Elasticsearch Server
...
Waiting for cerebro
...
now run webapp
not interactive
$ docker logs -f biothings
ARG
SSH keys not yet created, creating
Generating SSH Keys for BioThings Hub...
SSH Key has been generated, Public Key:
Please refer to `Filesystem overview <studio_guide.html#filesystem-overview>`_ and `Services check <studio_guide.html#services-check>`_ for
more details about Studio's internals.

By default, the studio will auto-update its source code to the latest version available and install all required dependencies. This behavior can be skipped
by adding ``no-update`` at the end of the command line of ``docker run ...``.

We can now access **BioThings Studio** using the dedicated web application (see `webapp overview <studio_guide.html#overview-of-biothings-studio-web-application>`_).


@@ -119,7 +94,7 @@ In this section we'll dive in more details on using the **BioThings Studio** and
within the **Hub**, declare a build configuration using that datasource, create a build from that configuration, then a data release and finally instantiate a new API service
and use it to query our data.

The whole source code is available at https://github.com/sirloon/pharmgkb, each branch pointing to a specific step in this tutorial.
The whole source code is available at https://github.com/biothings/tutorials/tree/master, each branch pointing to a specific step in this tutorial.

4.1. Input data
^^^^^^^^^^^^^^^
@@ -137,12 +112,15 @@ The last two files will be used in the second part of this tutorial when we'll a
.. _`drugLabels.zip`: https://s3.pgkb.org/data/drugLabels.zip
.. _`occurrences.zip`: https://s3.pgkb.org/data/occurrences.zip

These files will be downloaded by the **Hub** when we trigger the dumper; by default, they are stored in a folder named ``data_folder``.
This will be explained in more detail in the `Data plugin <studio.html#data-plugin>`_ section.

4.2. Parser
^^^^^^^^^^^

In order to ingest this data and make it available as an API, we first need to write a parser. The data is pretty simple, tab-separated files, and we'll
make parsing even simpler by using the ``pandas`` Python library. The first version of this parser is available in branch ``pharmgkb_v1`` at
https://github.com/sirloon/pharmgkb/blob/pharmgkb_v1/parser.py. After some boilerplate code at the beginning for dependencies and initialization,
https://github.com/biothings/tutorials/blob/pharmgkb_v1/parser.py. After some boilerplate code at the beginning for dependencies and initialization,
the main logic is the following:


@@ -179,6 +157,10 @@ containing the downloaded data. This path is automatically set by the Hub and po
It is the responsibility of the parser to select, within that folder, the file(s) of interest. Here we need data from a file named ``var_drug_ann.tsv``.
Following the motto "don't assume it, prove it", we make sure that file exists.

.. note:: In this case, an assertion isn't strictly necessary, as the code will fail anyway if the file doesn't exist. But it's good practice to make sure
the file exists before trying to open it. It's also good practice to use ``os.path.join()`` to build the path to the file, as it will
automatically use the right path separator for the operating system.
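
A minimal sketch of what this looks like in a parser (``data_folder`` is the folder path the **Hub** passes to the parsing function; the function name and error message are illustrative, not the exact tutorial code):

.. code:: python

    import os

    def load_annotations(data_folder):
        # build the full path with the right separator for the operating system
        infile = os.path.join(data_folder, "var_drug_ann.tsv")
        # "don't assume it, prove it": fail early if the file isn't there
        assert os.path.exists(infile), "Missing input file: %s" % infile
        # ... actual parsing continues from here ...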

.. code:: python
dat = pandas.read_csv(infile,sep="\t",squeeze=True,quoting=csv.QUOTE_NONE).to_dict(orient='records')
@@ -232,6 +214,9 @@ in a dictionary indexed by gene ID. The final documents are assembled in the las
.. note:: In this specific example, we read the whole content of this input file in memory, then store annotations per gene. The data itself
is small enough to do this, but memory usage always needs to be carefully considered when writing a parser.

.. note:: In this case, the final documents are assembled within a generator function, which is a good practice to save memory.
You may see within our `BioThings GitHub organization <https://github.com/biothings>`_ that some plugins return a dictionary or a list of documents.
This is also fine, but it is recommended to use a generator function when possible.
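
Below is a minimal, purely illustrative sketch of that generator pattern (the function and field names are placeholders, not the exact tutorial parser): documents are yielded one at a time instead of being accumulated in a list that is returned at the end.

.. code:: python

    def load_data(data_folder):
        # hypothetical example: annotations grouped by gene ID while reading
        # the input file (the actual parsing is omitted here for brevity)
        annotations_by_gene = {}
        # yield one document per gene instead of returning one big list
        for gene_id, annotations in annotations_by_gene.items():
            yield {"_id": gene_id, "annotations": annotations}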

4.3. Data plugin
^^^^^^^^^^^^^^^^
@@ -252,7 +237,7 @@ that contains everything useful for the datasource. This is what we'll do in the
so we don't have to regularly update the plugin code (``git pull``) from the webapp to fetch the latest code. That said, since the plugin
is already defined on GitHub in our case, we'll use the GitHub repo registration method.

The corresponding data plugin repository can be found at https://github.com/sirloon/pharmgkb/tree/pharmgkb_v1. The manifest file looks like this:
The corresponding data plugin repository can be found at https://github.com/biothings/tutorials/tree/pharmgkb_v1. The manifest file looks like this:

.. code:: bash
@@ -313,12 +298,9 @@ reconnect, which we'll do!
.. image:: ../_static/hub_restarting.png
:width: 250px

The Hub shows an error though:
Once you reconnect, you will have to do a hard refresh of the webpage, for example ``cmd + shift + r`` on a Mac or ``ctrl + shift + r`` on Windows/Linux.

.. image:: ../_static/nomanifest.png
:width: 250px

Indeed, we fetch source code from branch ``master``, which doesn't contain any manifest file. We need to switch to another branch (this tutorial is organized using branches,
Since we fetch source code from branch ``master``, which doesn't contain any manifest file, we need to switch to another branch (this tutorial is organized using branches,
and also it's a perfect opportunity to learn how to use a specific branch/commit using **BioThings Studio**...)

Let's click on the ``pharmgkb`` link, then |plugin|. In the textbox on the right, enter ``pharmgkb_v1``, then click on ``Update``.
@@ -332,6 +314,8 @@ Let's click on ``pharmgkb`` link, then |plugin|. In the textbox on the right, en
**BioThings Studio** will fetch the corresponding branch (we could also have specified a commit hash, for instance), source code changes will be detected, and the Hub will restart.
The new code version is now visible in the plugin tab.

.. note:: Remember to do a hard refresh again before continuing, as the Hub will attempt to restart.

.. image:: ../_static/branch.png
:width: 400px

@@ -364,7 +348,7 @@ we've run 3 commands to register the plugin, dump the data and upload the JSON d
.. image:: ../_static/allcommands.png
:width: 450px

We also have new notifications as shown by the red number on the right. Let's have a quick look:
We also have new notifications, as shown by the red number on the left. Let's have a quick look:

.. image:: ../_static/allnotifs.png
:width: 450px
@@ -375,7 +359,7 @@ release number, the data folder, when the last download was, how long it tooks t
.. image:: ../_static/dumptab.png
:width: 450px

Same for the `Uploader` tab, we now have 979 documents uploaded to MongoDB.
Same for the `Uploader` tab: we now have 979 documents uploaded to MongoDB. The exact number may change depending on the source file that is downloaded.

.. image:: ../_static/uploadtab.png
:width: 450px
@@ -498,10 +482,10 @@ tells the **Hub** which datasources should be merged together, and how. Click on
* the `document type` represents the kind of documents stored in the merged collection. It gives its name to the annotate API endpoint (e.g. ``/gene``). This source
is about gene annotations, so "gene" it is...
* open the dropdown list and select the `sources` you want to be part of the merge. We only have one, "pharmgkb".
* in `root sources`, we can declare which sources are allowed to create new documents in the merged collection, that is merge documents from a
datasource, but only if corresponding documents exist in the merged collection. It's useful if data from a specific source relates to data on
another source (it only makes sense to merge that relating data if the data itself is present). If root sources are declared, **Hub** will first
merge them, then the others. In our case, we can leave it empty (no root sources specified, all sources can create documents in the merged collection)
* in `root sources`, we can declare which sources are allowed to create new documents in the merged collection.
If a root source is declared, data from other sources will only be merged into documents that already exist with the same IDs (documents coming from root sources);
if not, the data is discarded. Finally, if no root source is declared, any data source can create a new document in the merged data (see the sketch after this list).
In our case, we can leave it empty (no root sources specified, all sources can create documents in the merged collection).
* selecting a builder is optional, but for the sake of this tutorial, we'll choose ``LinkDataBuilder``. This special builder will fetch documents directly from
our datasource `pharmgkb` when indexing documents, instead of duplicating documents into another collection (called the `target` or `merged` collection). We can
do this (and save time and disk space) because we only have one datasource here.
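
To make the root-source rule above concrete, here is a tiny, purely illustrative sketch of the per-document decision it describes (this is not the Hub's actual merge code; all names are made up):

.. code:: python

    def may_enter_merged_collection(doc_id, source, root_sources, existing_ids):
        """Illustrative only: decide whether a document from `source` is kept."""
        if not root_sources or source in root_sources:
            # no restriction, or the source is a root source: it may create new documents
            return True
        # non-root sources may only enrich documents that already exist
        return doc_id in existing_ids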
