# Thoth 0.5.0 - Example 2 Guided Notebook

This notebook is supporting material for Example 2 of Thoth's 0.5.0 release. Its prerequisite is "Thoth 0.5.0 - Example 1 Guided Notebook", which covers some of the commands in more depth; the reader is expected to be familiar with it.


See the internal document in [Google Docs](https://docs.google.com/document/d/1QflQpGXtOuHFFC2hkEFBlu0JmCQWEnXdgryvwxCNXpQ/edit#), section "Example 2", for more information and clarification.

## Initial graph database setup

To go through this scenario, we first need to connect to a graph database instance. This notebook can be run from your own computer; it inserts all the data into the JanusGraph instance you provide, so select the graph database instance you would like to use. If you want to run this notebook purely on your local machine, set up a local graph database as [described in the README file of thoth-station/janusgraph-thoth-config repo](https://github.com/thoth-station/janusgraph-thoth-config#running-janusgraph-instance-locally). Ideally, just clone the repo and issue the following command to set up your local JanusGraph database instance:

```
sudo ./local.sh all
```

In [None]:
# Configure JanusGraph instance to talk to:
JANUSGRAPH_SERVICE_HOST = 'localhost'

# For directly talking to test environment, uncomment the following line:
# JANUSGRAPH_SERVICE_HOST = 'janusgraph.test.thoth-station.ninja'

Now let's connect to the desired JanusGraph database and check that we are properly connected:

In [None]:
from thoth.storages import GraphDatabase

# Instantiate and connect the JanusGraph database.
graph = GraphDatabase.create(JANUSGRAPH_SERVICE_HOST)
graph.connect()

graph.is_connected()

In the next step we download the result of a [thoth-solver](https://github.com/thoth-station/solver) run which resolved all the TensorFlow stacks (TensorFlow and all its transitive dependencies) available up to the date of the run.

In [None]:
import requests
from thoth.common import timestamp2datetime

SOLVER_DOCUMENT_URL = 'https://raw.githubusercontent.com/thoth-station/misc/master/examples/runtime-environment/resolved.json'

response = requests.get(SOLVER_DOCUMENT_URL)
response.raise_for_status()
solver_document = response.json()

print("Document covers all transitive packages which can be installed when %r is installed (no version specifier, any version)." % solver_document["metadata"]["arguments"]["pypi"]["requirements"].replace("\\n", "\n"))
print("Stacks were resolved at", timestamp2datetime(solver_document["metadata"]["timestamp"]))

Whenever a new release of a package appears, Thoth automatically runs thoth-solver for it on [PEP-503](https://www.python.org/dev/peps/pep-0503/) compatible package sources. If you deploy [Warehouse](https://github.com/pypa/warehouse) on your own, Thoth can be configured to talk directly to its REST API, which optimizes some of the requests performed by Thoth components.

In the next step, we show which operating system (environment) and which Python version were used when resolving the software stacks. As Python is a dynamic programming language, [resolution of dependencies highly depends on what is evaluated during package installation](https://dustingram.com/articles/2018/03/05/why-pypi-doesnt-know-dependencies/). That's one of the main reasons why Thoth pre-computes dependencies and creates structures inside the graph database to capture the structure of the ecosystem (another significant reason is the speed of generating software stacks when scoring, since installing Python packages is slow). This pre-computation is done by solvers installed in a Thoth deployment (they can be added dynamically; Thoth automatically resolves monitored packages in the ecosystem and tracks updates).
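
To make the point about install-time evaluation concrete, here is a minimal sketch of a `setup.py`-style helper whose declared requirements depend on the interpreter evaluating it. The function name `install_requires_for` and the chosen packages are illustrative assumptions, not taken from any project Thoth analyzes.

```python
from typing import List, Tuple


def install_requires_for(python_version: Tuple[int, int]) -> List[str]:
    """Compute requirements the way a dynamic setup.py might (hypothetical)."""
    requirements = ["six"]
    if python_version < (3, 4):
        # The enum module joined the standard library in Python 3.4,
        # so older interpreters need the backport.
        requirements.append("enum34")
    return requirements


# The same package declares different dependencies per interpreter.
print(install_requires_for((2, 7)))
print(install_requires_for((3, 6)))
```

Because the answer only exists once the code runs in a concrete environment, a solver has to evaluate it per operating system and Python version, which is what the pre-computed solver documents capture.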

In [None]:
from thoth.storages import SolverResultsStore

document_id = SolverResultsStore.get_document_id(solver_document)
solver_name = SolverResultsStore.get_solver_name_from_document_id(document_id)
graph.parse_python_solver_name(solver_name)

Let's sync the solver document into the graph so we can use these dependency graphs later on:

In [None]:
%%time

graph.sync_solver_result(solver_document)

Another type of document in Thoth is the inspection document. These documents are created by a service called [Amun](https://github.com/thoth-station/amun-api/). [Amun can, based on a JSON specification, run a software stack](https://github.com/thoth-station/amun-api/#a-request-to-amun-api); such a run is called an "inspection".

![alt text](https://raw.githubusercontent.com/thoth-station/amun-api/master/fig/diagram.png "Amun API diagram")


The main aim of an inspection is to gather application stack characteristics, such as:

* Can the given stack be assembled?
  * Gather errors observed during software stack installation.

* Is the application runnable?
  * Gather errors caused by issues such as ABI incompatibility or Python version requirements.

The results for the characteristics above are automatically synced into Thoth's knowledge base. If all of the above succeeds, other characteristics such as performance can additionally be observed. We will use performance data from previous runs of [Dependency Monkey](https://github.com/thoth-station/adviser/blob/master/docs/dependency_monkey.md), in which Dependency Monkey generated software stacks and submitted them to Amun for inspection in the given runtime/buildtime environment. You can find the results of Dependency Monkey runs on Ceph, where Thoth primarily stores them and subsequently syncs them to the graph database. However, in this Jupyter Notebook we will use inspection documents stored on GitHub so anyone (without access to Ceph) can run this notebook and experiment with the data:

In [None]:
%%time

import requests

INSPECTION_DOCUMENT_BASE_URL = 'https://raw.githubusercontent.com/thoth-station/misc/master/examples/runtime-environment/'

for inspection in ('inspection_1.json', 'inspection_2.json'):
    url = INSPECTION_DOCUMENT_BASE_URL + inspection
    
    response = requests.get(url)
    response.raise_for_status()
    inspection_document = response.json()
    
    print("Syncing document %r into graph database..." % inspection)
    graph.sync_inspection_result(inspection_document)

We have fed all the data necessary for this notebook into our graph database instance. Next, let's consider a user's stack which is described in `Pipfile` and `Pipfile.lock` as produced by [Pipenv](https://pipenv.readthedocs.io).

In [None]:
PIPFILE_URL = "https://raw.githubusercontent.com/thoth-station/thamos/master/examples/runtime-environment/Pipfile"
PIPFILE_LOCK_URL = "https://raw.githubusercontent.com/thoth-station/thamos/master/examples/runtime-environment/Pipfile.lock"

response = requests.get(PIPFILE_URL)
response.raise_for_status()
pipfile_str = response.text

response = requests.get(PIPFILE_LOCK_URL)
response.raise_for_status()
pipfile_lock_str = response.text

Direct dependencies which the user uses directly in her/his application:

In [None]:
print(pipfile_str)

As a user wants to have deterministic builds, the user ships her/his application with a lock file, `Pipfile.lock`. Together, `Pipfile.lock`, `Pipfile` and the user's runtime environment make up Thoth's abstraction "Project".

In [None]:
from pprint import pprint
import yaml
from thoth.common import RuntimeEnvironment
from thoth.python import Pipfile
from thoth.python import PipfileLock
from thoth.python import Project


THOTH_YAML_URL = "https://raw.githubusercontent.com/thoth-station/thamos/master/examples/runtime-environment/.thoth.yaml"

response = requests.get(THOTH_YAML_URL)
response.raise_for_status()
# Use safe_load: the configuration is plain data fetched from a remote URL.
config_content = yaml.safe_load(response.text)


# We will stick with the first runtime environment stated in the configuration file. Optionally, a user can specify multiple runtime environments.
# In that case all of them are tracked in Thoth based on their configuration.
config_content["runtime_environments"][0]["name"] = "fedora:29"
config_content["runtime_environments"][0]["operating_system"]["name"] = "fedora"
config_content["runtime_environments"][0]["operating_system"]["version"] = "29"
runtime_environment = RuntimeEnvironment.from_dict(config_content["runtime_environments"][0])
print("Runtime configuration:")
pprint(runtime_environment.to_dict())

pipfile = Pipfile.from_string(pipfile_str)
project = Project(
    pipfile=pipfile,
    pipfile_lock=PipfileLock.from_string(pipfile_lock_str, pipfile=pipfile),
    runtime_environment=runtime_environment
)

The reported warning can be ignored. The `recommendation_type` configuration option in the configuration file is used in Thamos (Thoth's CLI) to override the default `recommendation_type` if needed (for a specific runtime environment entry).

Everything's set, let's compute some advice:

In [None]:
%%time
%env THOTH_ADVISER_SHOW_PACKAGES=1
%env ISIS_API_URL=http://isis-api-thoth-test-core.cloud.paas.upshift.redhat.com

from thoth.adviser.python import Adviser
from thoth.adviser.enums import RecommendationType

stack_info, report = Adviser.compute_on_project(
    project,
    recommendation_type=RecommendationType.STABLE,
    count=1,  # Number of stacks reported in the output.
    limit=1,  # Number of stacks scored in total.
    limit_latest_versions=2,  # Consider only first two latest versions of each package.
    dry_run=False,
    graph=graph,
)

To understand what is going on in the background, let's dive into the individual steps in the following paragraphs.

All the direct dependencies are parsed and resolved given the version ranges stated in `Pipfile`. The resolution is done offline but respects `pip`'s logic (`pip`'s internal resolution logic is reused); it is based on Thoth's resolver implemented on top of the graph instance, so resolution is faster and respects packages known to Thoth (packages already analyzed). The resolution works across package source indexes, meaning you can configure Thoth to be a resolver on top of different Python package indexes. This is something some tools support only *partially* (such as [Pipenv](https://pipenv.readthedocs.io/en/latest/advanced/#specifying-package-indexes)), because in the Python ecosystem all package sources are treated as mirrors. If a package cannot be found on one package index, current tools silently fall back to another Python package index, which can be misleading or undesired in some cases (e.g. if you want to consume optimized TensorFlow wheel files [from AICoE index](http://tensorflow.pypi.thoth-station.ninja), you don't want to fall back to an unoptimized upstream TensorFlow release; see [Thoth's provenance checks](https://github.com/thoth-station/adviser#provenance-checks) for more info and for provenance gating). This issue is also present in Pipenv even though the [docs](https://pipenv.readthedocs.io/en/latest/advanced/#specifying-package-indexes) do not state it.
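
The difference between index-aware resolution and silent fallback can be sketched as follows. This is a simplified illustration, not Thoth's resolver and not `pip`'s logic: it only handles a "greater than or equal" range on plain dotted versions, and the `KNOWN_VERSIONS` mapping, the `index.example.test` URL and the helper names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: index URL -> package name -> known versions.
KNOWN_VERSIONS: Dict[str, Dict[str, List[str]]] = {
    "https://pypi.org/simple": {"tensorflow": ["1.11.0", "1.12.0"]},
    "https://index.example.test/simple": {"tensorflow": ["1.12.0"]},
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as '1.12.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def resolve(package: str, minimum: str, index_url: str) -> List[str]:
    """Return versions >= minimum known on exactly the requested index, newest first."""
    versions = KNOWN_VERSIONS.get(index_url, {}).get(package)
    if versions is None:
        # No silent fallback to another index: report the miss instead.
        raise LookupError(f"{package} is not known on {index_url}")
    matching = [v for v in versions if parse_version(v) >= parse_version(minimum)]
    return sorted(matching, key=parse_version, reverse=True)


print(resolve("tensorflow", "1.11.0", "https://pypi.org/simple"))
```

Keeping the index as part of the lookup key is what makes provenance explicit: a package missing from the requested index is an error rather than a reason to try a different source.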

Once the direct dependencies are resolved, queries are issued to the JanusGraph instance to gather the transitive dependencies of each direct dependency, in a specific version from a specific package source index, respecting their version ranges (for a specific environment, in this case Fedora 29 with Python 3.6). The result of the query is in fact a serialized dependency graph as stored in Thoth's graph database.

Some of the packages are *NOT* installable into the given environment (e.g. versions too old to run on Python 3+, which run only on legacy Python 2); these are automatically removed in a *cut-off* phase. This way we discard stacks we know will not work at all and reduce the space traversed during stack generation. Another type of package removed in the *cut-off* phase are "core" Python packages, which are automatically pinned to the latest working release (e.g. `setuptools`, `pip`, ...).
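
A minimal sketch of such a cut-off filter is shown below. The candidate records, the `python_majors` field and the `cut_off` helper are hypothetical and serve only to illustrate the two removal rules described above; they do not mirror the adviser's internal data structures.

```python
from typing import Dict, List

# Hypothetical candidate records gathered from the dependency graph.
CANDIDATES: List[Dict] = [
    {"name": "legacy-only", "version": "0.9.0", "python_majors": {2}},
    {"name": "numpy", "version": "1.16.0", "python_majors": {2, 3}},
    {"name": "setuptools", "version": "40.6.3", "python_majors": {2, 3}},
]

# "Core" packages are pinned separately, so they are not enumerated.
CORE_PACKAGES = {"setuptools", "pip"}


def cut_off(candidates: List[Dict], python_major: int) -> List[Dict]:
    """Drop candidates that cannot be installed or need not be enumerated."""
    kept = []
    for candidate in candidates:
        if python_major not in candidate["python_majors"]:
            continue  # Not installable into the target environment.
        if candidate["name"] in CORE_PACKAGES:
            continue  # Handled by pinning to the latest working release.
        kept.append(candidate)
    return kept


print([c["name"] for c in cut_off(CANDIDATES, python_major=3)])
```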

After the step above is done, Thoth's adviser implementation sorts the packages in the query result in such a way that they can be used to generate software stacks systematically, from the latest stack to the oldest one. This way, newer stacks are preferred over older ones: the newer a stack is, the sooner it gets scored, and it takes precedence when it has the same score as an older software stack.
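
The effect of this ordering can be illustrated with a small sketch. It assumes each package's versions are already sorted newest first and ignores dependency constraints between packages, which the real graph traversal does respect; the `generate_stacks` helper and the version data are made up for the example.

```python
import itertools
from typing import Dict, Iterator, List

# Hypothetical versions per package, already sorted newest first.
VERSIONS: Dict[str, List[str]] = {
    "numpy": ["1.16.0", "1.15.4"],
    "tensorflow": ["1.12.0", "1.11.0"],
}


def generate_stacks(versions: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield fully pinned stacks, starting with the all-latest combination."""
    names = sorted(versions)
    for combination in itertools.product(*(versions[name] for name in names)):
        yield dict(zip(names, combination))


stacks = list(generate_stacks(VERSIONS))
print(stacks[0])   # the newest stack
print(stacks[-1])  # the oldest stack
```

Because the stacks arrive newest first, a consumer that keeps the first stack seen for each score naturally prefers newer stacks on ties.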

If a user supplies the `limit_latest_versions` option, older versions of each package are removed (in the example above, only the 2 latest versions of each package are considered). This option helps to reduce the search space: for larger stacks, the number of possible stacks becomes really huge and the dependency graph traversal does not fit into memory (you can try it yourself by removing the `limit_latest_versions` parameter or setting it to `None`). See the `estimated stacks count` entry in the adviser log for the number of stacks estimated for scoring (it is an estimate, an upper bound; the actual number may, and usually will, differ based on the dependency graph traversal).
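
To see why limiting versions matters, consider the following sketch. It assumes the upper bound is simply the product of the number of candidate versions per package; the helper names and the version counts are hypothetical, and the adviser's own estimate may be computed differently.

```python
from typing import Dict, List, Optional


def limit_latest(versions: Dict[str, List[str]], limit: Optional[int]) -> Dict[str, List[str]]:
    """Keep only the first `limit` entries per package (lists are newest first)."""
    if limit is None:
        return dict(versions)
    return {name: known[:limit] for name, known in versions.items()}


def estimated_stacks_count(versions: Dict[str, List[str]]) -> int:
    """Upper bound on the number of stacks: every version combined with every other."""
    count = 1
    for known in versions.values():
        count *= len(known)
    return count


# Hypothetical dependency graph: 10, 8 and 5 known versions respectively.
versions = {
    "tensorflow": [f"1.{minor}.0" for minor in range(12, 2, -1)],
    "numpy": [f"1.{minor}.0" for minor in range(16, 8, -1)],
    "protobuf": [f"3.{minor}.0" for minor in range(6, 1, -1)],
}

print(estimated_stacks_count(versions))
print(estimated_stacks_count(limit_latest(versions, 2)))
```

With three packages the bound drops from 400 to 8; with dozens of transitive dependencies the unlimited product grows far beyond what can be traversed.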

Now we are ready to perform the actual graph traversal. The current implementation uses a subprocess which calls into a [C/C++ dependency graph implementation](https://github.com/thoth-station/adviser/blob/master/docs/libdependency_graph.md) provided in the form of a library. The subprocess generates a stream of stacks which is consumed by the parent process (see [docs for more info](https://github.com/thoth-station/adviser/blob/master/docs/libdependency_graph.md)) to perform stack scoring. As stated above, stacks are generated from the newest one to the oldest one. The actual scoring is performed based on data stored in the graph database.
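
The producer/consumer split can be sketched with the standard library alone. In this illustration a small Python child process stands in for the native library, and stacks are exchanged as one JSON document per line; both choices are assumptions made for the example and say nothing about the wire format the adviser actually uses.

```python
import json
import subprocess
import sys
from typing import Dict, Iterator

# Child program standing in for the native stack generator.
CHILD_CODE = r'''
import itertools, json
versions = {"numpy": ["1.16.0", "1.15.4"], "tensorflow": ["1.12.0", "1.11.0"]}
names = sorted(versions)
for combination in itertools.product(*(versions[n] for n in names)):
    print(json.dumps(dict(zip(names, combination))), flush=True)
'''


def stream_stacks() -> Iterator[Dict[str, str]]:
    """Lazily yield stacks produced by the child process, one per output line."""
    process = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, text=True
    )
    try:
        for line in process.stdout:
            yield json.loads(line)
    finally:
        # Runs when the stream is exhausted or the consumer stops early.
        process.stdout.close()
        process.wait()


generator = stream_stacks()
first_two = [next(generator), next(generator)]
generator.close()  # Stop consuming; triggers the cleanup above.
print(first_two)
```

Streaming lets the parent start scoring before the traversal finishes and stop early once enough stacks have been scored.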

In this Jupyter Notebook we show the "STABLE" recommendation type, which covers performance-based scoring; the data for this scoring were already synced above (see the syncs of inspection documents). To get familiar with other types of scoring, see other Thoth notebooks.

The score is computed as an average over performance-related inspection runs. Inspections which show that the stack does not work contribute a negative (-1.0) performance score, whereas positive scores reflect the performance measured in inspections (see [thoth-station/performance](https://github.com/thoth-station/performance) for the performance indicators used).
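
The averaging rule described above can be written down as a short sketch. The record layout (`succeeded`, `performance`) and the function name `score_stack` are assumptions for illustration; the real inspection documents carry considerably more information.

```python
from statistics import mean
from typing import Dict, List, Optional

FAILED_INSPECTION_SCORE = -1.0


def score_stack(inspections: List[Dict]) -> Optional[float]:
    """Average performance over inspection runs; failed runs count as -1.0."""
    if not inspections:
        return None  # Nothing is known about this stack.
    return mean(
        run["performance"] if run["succeeded"] else FAILED_INSPECTION_SCORE
        for run in inspections
    )


# Hypothetical inspection runs for one software stack.
runs = [
    {"succeeded": True, "performance": 0.5},
    {"succeeded": True, "performance": 0.75},
    {"succeeded": False, "performance": None},
    {"succeeded": True, "performance": 0.75},
]

print(score_stack(runs))  # (0.5 + 0.75 - 1.0 + 0.75) / 4
```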

As some packages have no performance impact on a software stack, Thoth uses [Isis](https://github.com/thoth-station/isis-api) and its [feature based queries on top of project2vec](https://docs.google.com/document/d/18Nghre5s4O8MUvslU2n_avznEmF78gOiDq_Xe6ygMjc/). In this case the query consists only of "core performance" packages, i.e. those which do affect performance in some way (for example, the `markdown` package is left out in the case of TensorFlow stacks). Thoth internally queries the Isis API (which is an API for the project2vec vector space). If you would like to suppress this behaviour and ask for an exact match on the software stack instead, just omit the `ISIS_API_URL` environment variable; stacks will then be scored based on an exact match in the graph database.

In [None]:
%env ISIS_API_URL=http://isis-api-thoth-test-core.cloud.paas.upshift.redhat.com

from thoth.adviser.isis import ISIS_API

print("TensorFlow performance impact is", ISIS_API.get_python_project_performance_import("tensorflow"))
print("Markdown performance impact is", ISIS_API.get_python_project_performance_import("markdown"))
print("protobuf performance impact is", ISIS_API.get_python_project_performance_import("protobuf"))
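
How such impact values could narrow a stack down to its "core performance" packages is sketched below. The impact numbers, the threshold and the `performance_relevant` helper are made up for illustration; they are not values returned by Isis, and the sketch does not describe how the adviser combines them internally.

```python
from typing import Dict

# Hypothetical impact values in the range [0.0, 1.0].
PERFORMANCE_IMPACT: Dict[str, float] = {
    "tensorflow": 1.0,
    "numpy": 0.8,
    "markdown": 0.0,
}


def performance_relevant(
    stack: Dict[str, str], impact: Dict[str, float], threshold: float = 0.5
) -> Dict[str, str]:
    """Keep only pinned packages whose impact reaches the threshold."""
    return {
        name: version
        for name, version in stack.items()
        if impact.get(name, 0.0) >= threshold
    }


stack = {"tensorflow": "1.12.0", "numpy": "1.16.0", "markdown": "3.0.1"}
print(performance_relevant(stack, PERFORMANCE_IMPACT))
```

Matching on the reduced stack means two stacks that differ only in a package like `markdown` can share the same performance observations.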

... And let's have a look at the recommended stack (`Pipfile` and `Pipfile.lock`):

In [None]:
report[0][1].pipfile.to_dict()

In [None]:
report[0][1].pipfile_lock.to_dict()

In [None]:
report[0][0]

Thoth recommended using the [optimized AICoE TensorFlow release](http://tensorflow.pypi.thoth-station.ninja). Based on data stored in Thoth, it has the best performance on the given CPU, respecting the operating system and Python version.

If you take a close look, you might also spot an explicitly configured Python version in the Pipfile; that's another recommendation Thoth is giving:

In [None]:
stack_info[0][1]

Without this configuration option, you might run the application under a different Python interpreter (the default one configured in the operating system used), possibly causing issues on deployment.
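
As a small illustration of what the pinned version protects against, the sketch below compares a `python_version` value, as it would appear in the `[requires]` section of a `Pipfile`, with an interpreter version. The helper `interpreter_matches` is hypothetical and is not part of Thoth or Pipenv.

```python
import sys
from typing import Sequence


def interpreter_matches(required: str, version_info: Sequence[int] = sys.version_info) -> bool:
    """Check that the interpreter version starts with the required components."""
    required_parts = tuple(int(part) for part in required.split("."))
    return tuple(version_info[: len(required_parts)]) == required_parts


# The stack was resolved for Python 3.6; other interpreters are flagged.
print(interpreter_matches("3.6", (3, 6, 8)))
print(interpreter_matches("3.6", (3, 7, 0)))
```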

Feel free to experiment with the adviser parameters:

In [None]:
# Set recommendation type to one of the following:

[e.name for e in RecommendationType]

In [None]:
# Adjust runtime specific information (e.g. provide CPU, CUDA information, ...)

runtime_environment = RuntimeEnvironment.from_dict({}).to_dict(without_none=False)
runtime_environment

In [None]:
# Adjust version ranges of Python packages being installed. These versions are resolved using
# pip's internal algorithm, so anything which is compatible with PEP-440 (and Pipfile compatible
# for Pipfile inputs) works out of the box. Note this resolution is not done by installing
# dependencies (as in the case of pip/Pipenv); instead, a resolver is implemented on top of the
# graph database, which can resolve dependencies much faster as all the data are pre-computed.

PIPFILE_STR = """
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

# This can result in a large number of stacks when running against the test database.
[packages]
tensorflow = "*"

[requires]
python_version = "3.6"
"""

# In the example above, Pipfile.lock is also used. Pipfile.lock is actually not used in
# recommendations, but Thoth stores it internally to track users' stacks, their evolution and changes.
pipfile = Pipfile.from_string(PIPFILE_STR)
project = Project(pipfile=pipfile, pipfile_lock=None, runtime_environment=runtime_environment)

# Mind the dependencies resolved in the solver run: dependencies unknown to Thoth obviously cannot be resolved by Thoth.

The `count` parameter limits the number of stacks provided in the output; the `limit` parameter limits the number of stacks scored in total.
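
The interplay of the two parameters can be sketched as follows. The `advise` helper, the stack labels and the scores are hypothetical; the sketch only assumes that stacks arrive newest first and that ties are resolved in favour of the stack generated earlier.

```python
import itertools
from typing import Callable, Iterable, List, Tuple


def advise(
    stacks: Iterable[str], score: Callable[[str], float], count: int, limit: int
) -> List[Tuple[float, str]]:
    """Score at most `limit` stacks and report the `count` best of them."""
    scored = [(score(stack), stack) for stack in itertools.islice(stacks, limit)]
    # sorted() is stable, so on equal scores the stack generated earlier
    # (the newer one) stays ahead.
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return ranked[:count]


# Hypothetical stacks, newest first, with made-up scores.
SCORES = {"stack-a": 0.2, "stack-b": 0.9, "stack-c": 0.9, "stack-d": 1.0}

print(advise(iter(SCORES), SCORES.get, count=2, limit=3))
```

Note that `stack-d` has the best score but is never reported: with `limit=3` it is not scored at all, which is the trade-off `limit` makes between runtime and completeness.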

All of the above can be accomplished using the Thamos CLI (the above is a more in-depth description of what Thoth does at a lower level). From the user's perspective, a user just installs `Thamos`, adjusts the configuration via `thamos config` (automatic discovery of available hardware is performed) and issues `thamos advise`, which talks to a Thoth deployment. All of the above is transparent to the user; the report is shown in a well-formatted table. [Follow README instructions in thamos repo](https://github.com/thoth-station/thamos/tree/master/examples/runtime-environment) to experience this on your own.


Happy hacking! ;-)

In [None]:
from thoth.lab import packages_info

# Let's state Thoth's package versions for reproducible next runs of this Jupyter Notebook.
packages_info()