<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Document-type" data-toc-modified-id="Document-type-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><code>Document</code> type</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#FLow" data-toc-modified-id="FLow-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>FLow</a></span></li></ul></li><li><span><a href="#Executors" data-toc-modified-id="Executors-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Executors</a></span></li><li><span><a href="#Drivers" data-toc-modified-id="Drivers-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Drivers</a></span></li><li><span><a href="#Peas" data-toc-modified-id="Peas-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Peas</a></span></li><li><span><a href="#Pods" data-toc-modified-id="Pods-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Pods</a></span></li><li><span><a href="#Flow" data-toc-modified-id="Flow-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Flow</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Summary</a></span></li></ul></li></ul></div>

In [1]:
import jina

In [2]:
jina.__version__

'0.9.29'

## `Document` type

In Jina, `Document` is the central data type.

A search query is itself a `Document`, and it is matched against indexed items that are also of type `Document`.

In [3]:
jina.Document

jina.types.document.Document

In [4]:
d = jina.Document()

In [5]:
d.__dict__

{'_pb_body': id: "98469e92-67b4-11eb-9974-787b8ab3f5de"}

In [6]:
d.__getattr__('id')

'98469e92-67b4-11eb-9974-787b8ab3f5de'

In [7]:
d.__getattr__('tags')



In [8]:
d.__getattr__('text')

''

In [9]:
d.__getattr__('dog')

AttributeError: dog
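
The outputs above suggest that a `Document` keeps its data in a single inner object (`_pb_body`) and forwards attribute lookups to it. The sketch below illustrates that delegation pattern in plain Python. `ToyDocument` is a hypothetical stand-in written for this note, not Jina's implementation, and using `SimpleNamespace` as the inner object is an assumption made to keep the example self-contained.

```python
import uuid
from types import SimpleNamespace


class ToyDocument:
    """Hypothetical stand-in for a Document; not Jina's implementation."""

    def __init__(self):
        # All data lives in one inner object, mirroring the `_pb_body`
        # entry seen in `d.__dict__`.
        self._pb_body = SimpleNamespace(id=str(uuid.uuid4()), text='')

    def __getattr__(self, name):
        # Only called when normal lookup fails, so `_pb_body` itself is
        # found in `__dict__` first and there is no recursion.
        return getattr(self._pb_body, name)


toy = ToyDocument()
print(list(toy.__dict__))   # ['_pb_body']
print(repr(toy.text))       # ''
try:
    toy.dog                 # unknown fields fail, as with `d.__getattr__('dog')`
except AttributeError as exc:
    print('AttributeError:', exc)
```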

#### FLow

In [10]:
import jina

In [11]:
f = jina.Flow.load_config('flows/index.yml')

usage: ipykernel_launcher.py [-h] [-v] [-vf] [--name] [--log-config]
                             [--hide-exc-info] [--port-ctrl] [--ctrl-with-ipc]
                             [--timeout-ctrl] [--ssh-server] [--ssh-keyfile]
                             [--ssh-password] [--uses]
                             [--py-modules [PATH [PATH ...]]] [--port-in]
                             [--port-out] [--host-in] [--host-out]
                             [--socket-in {PULL_BIND, PULL_CONNECT, PUSH_BIND, PUSH_CONNECT ... 8 more choices}]
                             [--socket-out {PULL_BIND, PULL_CONNECT, PUSH_BIND, PUSH_CONNECT ... 8 more choices}]
                             [--dump-interval] [--read-only] [--memory-hwm]
                             [--on-error-strategy {IGNORE, SKIP_EXECUTOR, SKIP_HANDLE, THROW_EARLY}]
                             [--uses-internal] [--entrypoint]
                             [--docker-kwargs [KEY:VALUE [KEY:VALUE ...]]]
                             [--pull-l

ValueError: bad arguments "['--name', 'encoder', '--uses', 'pods/encode.yml', '--show-exc-info', '--parallel', '$JINA_PARALLEL', '--timeout-ready', '600000', '--read-only', '--pod-role', 'POD', '--num-part', '1']" with parser ArgumentParser(prog='ipykernel_launcher.py', usage=None, description='Command Line Interface of `%(prog)s`', formatter_class=<class 'jina.parsers.helper._ColoredHelpFormatter'>, conflict_handler='error', add_help=True), you may want to double check your args 

Note that the flow was not correctly created.

In the error message we can see:

```
ValueError: invalid literal for int() with base 10: '$JINA_PARALLEL'
```

This happens because several environment variables referenced in the YAML files, such as `JINA_PARALLEL`, have not been set.
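
The failure can be reproduced with the standard library alone. The sketch below assumes that `$NAME` placeholders are substituted from the environment before the value is parsed as an integer. `string.Template` is used here as a stand-in for whatever mechanism Jina actually uses, and `resolve` is a hypothetical helper.

```python
from string import Template


def resolve(value, env):
    """Replace `$NAME` placeholders using `env`; unknown names are left untouched."""
    return Template(value).safe_substitute(env)


raw = '$JINA_PARALLEL'

# Variable missing: the placeholder survives and cannot be parsed as an integer.
unresolved = resolve(raw, {})
try:
    int(unresolved)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Variable present: the placeholder becomes '1' and parses cleanly.
parallel = int(resolve(raw, {'JINA_PARALLEL': '1'}))

print(error_message)   # invalid literal for int() with base 10: '$JINA_PARALLEL'
print(parallel)        # 1
```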

To fix this, we define the `config` function below, which sets default values for the variables needed to instantiate a `Flow`.



In [12]:

import os

def config():
    # Number of parallel encoder replicas and index shards (see flows/index.yml).
    parallel = 1
    shards = 1
    os.environ['JINA_PARALLEL'] = str(parallel)
    os.environ['JINA_SHARDS'] = str(shards)
    # Working directory, created if it does not exist yet.
    os.environ['WORKDIR'] = './workspace'
    os.makedirs(os.environ['WORKDIR'], exist_ok=True)
    # Keep an existing JINA_PORT if one is already set.
    os.environ['JINA_PORT'] = os.environ.get('JINA_PORT', str(65481))
    os.environ['JINA_DATA_PATH'] = 'dataset/test_answers.csv'

In [13]:
config()

In [14]:
f = jina.Flow.load_config('flows/index.yml')

In [18]:
type(f)

jina.flow.Flow

Note that `f` was constructed from a `.yml` file with the following contents:

```yml
!Flow
pods:
  encoder:
    uses: pods/encode.yml
    show_exc_info: true
    parallel: $JINA_PARALLEL
    timeout_ready: 600000
    read_only: true
  doc_indexer:
    uses: pods/doc.yml
    shards: $JINA_SHARDS
    separated_workspace: true
```

This flow defines two pods because inside `pods:` there are two items:

- An `encoder`
```yml
  encoder:
    uses: pods/encode.yml
    show_exc_info: true
    parallel: $JINA_PARALLEL
    timeout_ready: 600000
    read_only: true
```
    - The encoder itself is defined in `pods/encode.yml`, which contains:
    
        ```yml
        !TFIDFTextEncoder
        metas:
          name: tfidf_encoder  
          py_modules: tfidf_vectorizer_jina.py
        with:
          path_vectorizers: ./pods/tfidf_vectorizer.pickle
        ```

- A `doc_indexer`

```yml
  doc_indexer:
    uses: pods/doc.yml
    shards: $JINA_SHARDS
    separated_workspace: true
```





Observations:

- TODO: why is `shards` set for the `doc_indexer` but not for the `encoder`?
- TODO: why is `parallel` set for the `encoder` but not for the `doc_indexer`?





In [24]:
f

In [25]:
f.__dict__

{'_exit_callbacks': deque([]),
 '_version': '1',
 '_pod_nodes': OrderedDict([('encoder',
               <jina.peapods.pods.BasePod at 0x7fa95888c310>),
              ('doc_indexer', <jina.peapods.pods.BasePod at 0x7fa948514c50>),
              ('gateway', <jina.peapods.pods.BasePod at 0x7fa95887cf50>)]),
 '_inspect_pods': {},
 '_build_level': <FlowBuildLevel.GRAPH: 1>,
 '_last_changed_pod': ['gateway', 'encoder', 'doc_indexer'],
 'args': Namespace(hide_exc_info=False, identity='29296701-d80f-484d-8887-878854b99541', inspect=<FlowInspectType.COLLECT: 2>, log_config='/Users/davidbuchaca1/Documents/git_stuff/jina/jina/resources/logging.default.yml', name=None, optimize_level=<FlowOptimizeLevel.NONE: 0>, uses=None),
 '_common_kwargs': {},
 '_kwargs': {},
 '_env': None,
 'logger': <jina.logging.logger.JinaLogger at 0x7fa958888d50>}

In [27]:
import numpy as np

In [31]:
np.random.random([2,3,4])

array([[[0.5446453 , 0.08644422, 0.88410086, 0.48285969],
        [0.50491452, 0.79365734, 0.60228975, 0.68198766],
        [0.35428707, 0.89117713, 0.20379067, 0.41147209]],

       [[0.81756016, 0.69862945, 0.36534735, 0.93785348],
        [0.77202678, 0.7108007 , 0.59331261, 0.91438273],
        [0.02077957, 0.88784957, 0.02834557, 0.54882761]]])

### Executors 

How do we break down a Document into Chunks, and what happens next? 

Executors do all of this hard work, and each represents an algorithmic unit. They do things like encoding images into vectors, storing vectors on disk, ranking results, and so on. Each one has a simple interface, letting you concentrate on the algorithm and not get lost in the weeds. They handle feature persistence, scheduling, chaining, grouping, and parallelization out of the box. The properties of an Executor are stored in a YAML file. They always go hand in hand.

The Executors are a big family. Each family member focuses on one important aspect of the search system. Let’s meet:

- **Crafter**: for crafting/segmenting/transforming the Documents and Chunks;

- **Encoder**: for representing the Chunk as a vector;

- **Indexer**: for saving and retrieving vectors and key-value information from storage;

- **Ranker**: for sorting results.

Got a new algorithm in mind? No problem, this family always welcomes new members!
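
To make the four roles concrete, here is a minimal sketch in plain Python. It does not use Jina's Executor classes: `craft`, `encode`, `ToyIndexer`, and `rank` are hypothetical names, and bag-of-words counts with cosine similarity are assumptions chosen only to keep the example small.

```python
import math
from collections import Counter


def craft(document):
    """Crafter: split a document into sentence-level chunks."""
    return [part.strip() for part in document.split('.') if part.strip()]


def encode(chunk, vocabulary):
    """Encoder: represent a chunk as a bag-of-words count vector."""
    counts = Counter(chunk.lower().split())
    return [counts[word] for word in vocabulary]


class ToyIndexer:
    """Indexer: keep vectors in memory, keyed by the chunk text."""

    def __init__(self):
        self._vectors = {}

    def add(self, key, vector):
        self._vectors[key] = vector

    def items(self):
        return self._vectors.items()


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vector, indexer):
    """Ranker: sort indexed chunks by similarity to the query, best first."""
    scored = [(cosine(query_vector, vec), key) for key, vec in indexer.items()]
    return [key for _, key in sorted(scored, reverse=True)]


vocabulary = ['cat', 'dog', 'sleeps', 'barks']
indexer = ToyIndexer()
for chunk in craft('The cat sleeps. The dog barks.'):
    indexer.add(chunk, encode(chunk, vocabulary))

results = rank(encode('dog', vocabulary), indexer)
print(results[0])   # The dog barks
```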




### Drivers

Executors do all the hard work, but they're not great at talking to each other.

A Driver helps them do this by defining how an Executor responds to network requests.

It translates network traffic into a format the Executor can understand, for example converting a Protobuf message into a NumPy array.
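
The translation step can be sketched without any Jina or Protobuf dependency. In the example below, raw bytes stand in for the Protobuf message and a tuple of floats stands in for the NumPy array. `ToyDriver` and `mean_executor` are hypothetical names, and the little-endian float64 wire format is an assumption made for the illustration.

```python
import struct


def mean_executor(values):
    """Stand-in Executor: works on plain floats and knows nothing about the wire format."""
    return sum(values) / len(values)


class ToyDriver:
    """Hypothetical Driver: decodes a request, calls the Executor, encodes the reply."""

    def __init__(self, executor):
        self.executor = executor

    def __call__(self, request_bytes):
        count = len(request_bytes) // 8                       # 8 bytes per float64
        values = struct.unpack(f'<{count}d', request_bytes)   # bytes -> floats
        result = self.executor(values)
        return struct.pack('<d', result)                      # float -> bytes


request = struct.pack('<3d', 1.0, 2.0, 6.0)   # what would arrive over the network
reply = ToyDriver(mean_executor)(request)
print(struct.unpack('<d', reply)[0])          # 3.0
```

The Executor never touches bytes, and the Driver never needs to know what the Executor computes.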



### Peas


All healthy families need to communicate, and the Executor clan is no different. They talk to each other via Peas.

While a Driver translates data for an Executor, a Pea wraps an Executor and lets it exchange data over a network or with other Peas. Peas can also run in Docker, containing all dependencies and context in one place.


```
Pea_1                   Pea_2
*--------------*        *--------------*
|              |        |              |
| Executor_1   |------->| Executor_2   | 
|              |        |              |
*--------------*        *--------------*
```
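
The diagram can be imitated with two threads connected by queues. This is only a sketch: `ToyPea` is a hypothetical class, in-process queues stand in for the network sockets, and the `None` shutdown sentinel is an assumption of this example rather than Jina's protocol.

```python
import queue
import threading


class ToyPea(threading.Thread):
    """Hypothetical Pea: runs one executor function and moves data between two queues."""

    def __init__(self, executor, inbox, outbox):
        super().__init__(daemon=True)
        self.executor = executor
        self.inbox = inbox
        self.outbox = outbox

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:              # sentinel: stop and tell the next Pea to stop
                self.outbox.put(None)
                break
            self.outbox.put(self.executor(item))


q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
pea_1 = ToyPea(str.upper, q_in, q_mid)                   # wraps Executor_1
pea_2 = ToyPea(lambda text: text + '!', q_mid, q_out)    # wraps Executor_2
pea_1.start()
pea_2.start()

for text in ['hello', 'jina', None]:
    q_in.put(text)

# Read results until the sentinel comes out of the last Pea.
outputs = list(iter(q_out.get, None))
print(outputs)   # ['HELLO!', 'JINA!']
```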

### Pods


A Pod is a group of Peas with the same property, running in parallel on a local host or over the network. 

A Pod provides a single network interface for its Peas, making them look like one single Pea from the outside. 

Beyond that, a Pod adds further control, scheduling, and context management to the Peas.
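
A minimal sketch of the "many workers, one interface" idea, using a thread pool from the standard library. `ToyPod` is a hypothetical class. Its `parallel` argument borrows its name from the `parallel` key in the flow YAML, but the behaviour shown here is an assumption, not a description of Jina's scheduling.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyPod:
    """Hypothetical Pod: several identical workers behind one `process` method."""

    def __init__(self, executor, parallel=1):
        self.executor = executor
        self.parallel = parallel

    def process(self, items):
        # Callers see a single entry point; the fan-out to `parallel`
        # workers is hidden inside the Pod.
        with ThreadPoolExecutor(max_workers=self.parallel) as pool:
            return list(pool.map(self.executor, items))   # map preserves input order


encoder_pod = ToyPod(len, parallel=2)
print(encoder_pod.process(['a', 'bb', 'ccc']))   # [1, 2, 3]
```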

### Flow

A Flow is like a pea plant. Just as a plant manages nutrient flow and growth rate for its branches, a Flow manages the state and context of a group of Pods, orchestrating them to accomplish one task.
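
The orchestration role can be sketched as an ordered chain of stages with a shared start/stop lifecycle. `ToyFlow` is a hypothetical class. The stage names mirror the `encoder` and `doc_indexer` pods from the YAML above, but the functions attached to them are placeholders.

```python
class ToyFlow:
    """Hypothetical Flow: an ordered chain of named stages with a shared lifecycle."""

    def __init__(self):
        self.pods = {}            # dicts preserve insertion order
        self.running = False

    def add(self, name, func):
        self.pods[name] = func
        return self               # returning self allows chained calls

    def __enter__(self):
        self.running = True       # a real system would start its workers here
        return self

    def __exit__(self, exc_type, exc, tb):
        self.running = False      # and stop them here, even after an error
        return False

    def run(self, data):
        if not self.running:
            raise RuntimeError('use the flow inside a `with` block')
        for func in self.pods.values():
            data = func(data)     # the output of one stage feeds the next
        return data


flow = ToyFlow().add('encoder', str.split).add('doc_indexer', len)
with flow:
    result = flow.run('index this short text')
print(result)   # 4
```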




### Summary


```

            FLOW
              |
              |
    *--------------------*
    |         |          |
  Pod_1     Pod_2      Pod_3
    |         |          |
  Pea_1_1   Pea_2_1   Pea_3_1
  Pea_1_2   Pea_2_2   Pea_3_2
  


Executor -> Specified by a YAML file
```
