<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Download-the-data" data-toc-modified-id="Download-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Download the data</a></span></li><li><span><a href="#Train-an-embedding-method" data-toc-modified-id="Train-an-embedding-method-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Train an embedding method</a></span></li><li><span><a href="#Create-a-custom-TFIDFTextEncoder" data-toc-modified-id="Create-a-custom-TFIDFTextEncoder-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create a custom TFIDFTextEncoder</a></span></li><li><span><a href="#Create-a-flow" data-toc-modified-id="Create-a-flow-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create a flow</a></span></li><li><span><a href="#Index-the-data" data-toc-modified-id="Index-the-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Index the data</a></span></li><li><span><a href="#Inspect-the-embeddings-created-with--index_generator" data-toc-modified-id="Inspect-the-embeddings-created-with--index_generator-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Inspect the embeddings created with  <code>index_generator</code></a></span></li><li><span><a href="#Query-a-document" data-toc-modified-id="Query-a-document-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Query a document</a></span></li></ul></div>

In [1]:
import os
import sys
import jina
from jina.flow import Flow
from jina import Document

In [2]:
jina.__version__

'1.0.1'

## Download the data

The script `01_fetch_dataset.py` downloads the 20 Newsgroups dataset and saves it as a CSV file under `./dataset`.

In [3]:
!python 01_fetch_dataset.py

Traceback (most recent call last):
  File "01_fetch_dataset.py", line 20, in <module>
    twenty_newsgroup_to_csv()
  File "01_fetch_dataset.py", line 7, in twenty_newsgroup_to_csv
    os.mkdir(data_path)
FileExistsError: [Errno 17] File exists: './dataset'
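
The traceback above is raised by `os.mkdir` because `./dataset` already exists from an earlier run; the CSV listed in the next cell is the one downloaded previously. A minimal sketch of how a script could tolerate re-runs, assuming it only needs the directory to be present (the helper name `ensure_dir` is hypothetical and not part of `01_fetch_dataset.py`):

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated in a temporary location so the real ./dataset is untouched.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "dataset")
    ensure_dir(data_path)
    ensure_dir(data_path)  # the second call does not raise FileExistsError
    created = os.path.isdir(data_path)

print(created)
```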


In [4]:
ls dataset/

20_newsgroup.csv


In [5]:
import pandas as pd

In [6]:
df = pd.read_csv('./dataset/20_newsgroup.csv')

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,target,title,date
0,0,I was wondering if anyone out there could enli...,7,rec.autos,2021-02-12 13:00:53.168143
1,17,I recently posted an article asking what kind ...,7,rec.autos,2021-02-12 13:00:53.168143
2,29,\nIt depends on your priorities. A lot of peo...,7,rec.autos,2021-02-12 13:00:53.168143
3,56,an excellent automatic can be found in the sub...,7,rec.autos,2021-02-12 13:00:53.168143
4,64,: Ford and his automobile. I need information...,7,rec.autos,2021-02-12 13:00:53.168143


In [58]:
df.shape

(11314, 5)

In [9]:
!head dataset/20_newsgroup.csv

,text,target,title,date
0,"I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.",7,rec.autos,2021-02-12 13:00:53.168143
17,"I recently posted an article asking what kind of rates single, male
drivers under 25 yrs old were paying on performance cars. Here's a summary of


## Train an embedding method

Jina helps us build search applications by embedding data into vectors.
Therefore, before we start any project with Jina, we need a method that transforms data into vectors.



In [10]:

def load_data(data_path):
    """
    Load the data from `data_path` and return a list of strings
    """
    import csv
    
    with open(data_path) as f:
        data = csv.reader(f, delimiter=',')
        X = [x[1] for x in data][1:]
    return X

In [11]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

data_path = "./dataset/20_newsgroup.csv"
X = load_data(data_path)
    
# fit the TF-IDF vectorizer on the corpus
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

# store the fitted vectorizer on disk
with open("tfidf_vectorizer.pickle", "wb") as fp:
    pickle.dump(tfidf_vectorizer, fp)

Now we have an object, `tfidf_vectorizer`, that can convert our data to vectors.
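
To make the fit/transform step concrete, here is a small sketch on a toy corpus. The sentences and the names `toy_corpus`, `toy_vectorizer` and `toy_vectors` are made up for illustration; only `TfidfVectorizer` and its default settings come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus (not taken from the 20 Newsgroups data).
toy_corpus = [
    "the car has a small door",
    "the engine of the car is loud",
    "windows is an operating system",
]

# Fitting learns the vocabulary and the inverse document frequencies.
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

# Transforming maps new text to one sparse row with one column per vocabulary term.
toy_vectors = toy_vectorizer.transform(["a loud car engine"])

print(toy_vectors.shape)
print(len(toy_vectorizer.vocabulary_))
```

Each output row has as many columns as there are terms in the fitted vocabulary, which is why the vocabulary size is used as the embedding dimension later on.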

## Create a custom TFIDFTextEncoder

Before we create our flow, we need to define an encoder inside `./pods`.

Here we define a `TFIDFTextEncoder(BaseEncoder)` that will be used for indexing the documents.

```python
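import pickle

import numpy as np
from jina.executors.decorators import as_ndarray, batching
from jina.executors.encoders import BaseEncoder


class TFIDFTextEncoder(BaseEncoder):
    def __init__(self,
                 path_vectorizer="./pods/tfidf_vectorizer.pickle",
                 *args,
                 **kwargs):
        super().__init__(*args, **kwargs)
        self.path_vectorizer = path_vectorizer

    def post_init(self):
        # load the fitted vectorizer once the executor is initialized
        with open(self.path_vectorizer, "rb") as fp:
            self.tfidf_vectorizer = pickle.load(fp)

    @batching
    @as_ndarray
    def encode(self, data: np.ndarray, *args, **kwargs) -> 'np.ndarray':
        # transform a batch of texts into a dense TF-IDF matrix
        return self.tfidf_vectorizer.transform(data).toarray()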
```
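
Because `encode` calls `.toarray()`, every document becomes a dense row with one value per vocabulary term. The arithmetic below is a rough sketch of the in-memory size of such dense arrays, using the document count and vocabulary size reported elsewhere in this notebook (11314 and 101631) and assuming `float64` values. It says nothing about what Jina actually sends over the wire or writes to disk.

```python
import numpy as np

n_docs = 11314        # rows reported by df.shape
n_features = 101631   # vocabulary size of the fitted vectorizer
bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes per float64

# Size of one dense embedding and of all embeddings held in memory at once.
per_doc_bytes = n_features * bytes_per_value
total_bytes = n_docs * per_doc_bytes

print(f"one dense embedding: {per_doc_bytes / 1e6:.2f} MB")
print(f"all dense embeddings: {total_bytes / 1e9:.2f} GB")
```

This is one reason the `@batching` decorator matters here: only one batch of documents is densified at a time.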

## Create a flow

In [12]:
def config():
    """
    Configure environment variables.
    """
    parallel = 1  # the same value is used for indexing and querying
    shards = 1
    os.environ['JINA_PARALLEL'] = str(parallel)
    os.environ['JINA_SHARDS'] = str(shards)
    os.environ['WORKDIR'] = './workspace'
    os.makedirs(os.environ['WORKDIR'], exist_ok=True)
    os.environ['JINA_PORT'] = os.environ.get('JINA_PORT', str(65481))
    os.environ['JINA_DATA_PATH'] = 'dataset/20_newsgroup.csv'

In [13]:
config()

In [14]:
ls

01_fetch_dataset.py                    flows/
02_build_tfidf.py                      jina_search_20newsgroup_dataset.ipynb
Dockerfile                             pods/
README.md                              tfidf_vectorizer.pickle
app.py                                 workspace/
dataset/


In [15]:
f = Flow.load_config('flows/index.yml')

In [16]:
f

## Index the data 

In [17]:
def index_generator():
    """
    Define data as Document to be indexed.
    """
    import csv
    data_path = os.path.join(os.curdir, os.environ['JINA_DATA_PATH'])

    # Get Document and ID
    with open(data_path) as f:
        reader = csv.reader(f, delimiter=',')
        next(reader, None)  # skip the header from 20newsgroup dataset
        for i, data in enumerate(reader):
            d = Document()
            # docid
            d.tags['id'] = int(data[0])
            # doc
            d.text = data[1]
            d.tags['label'] = int(data[2])
            yield d

def index():
    """
    Index data using Index Flow.
    """
    f = Flow.load_config('flows/index.yml')

    with f:
        f.index(input_fn=index_generator, request_size=16)

Now we can index our data using **`index_generator`**.

This creates a `workspace` folder that contains the indexed documents.
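
The posts in the CSV span several lines inside quoted fields, so the generator relies on `csv.reader` to reassemble them into single rows. The sketch below checks that behaviour on an in-memory sample without starting a flow. The sample text and the helper name `iter_rows` are made up for illustration; the column order (id, text, target) follows the header shown earlier.

```python
import csv
import io

# Two records; the first has a quoted text field that contains a line break.
sample_csv = (
    ',text,target,title,date\n'
    '0,"first line of a post\nsecond line of the same post",7,rec.autos,2021-02-12\n'
    '17,"another post",7,rec.autos,2021-02-12\n'
)


def iter_rows(fp):
    """Yield one dict per CSV record, skipping the header row."""
    reader = csv.reader(fp, delimiter=',')
    next(reader, None)  # skip the header
    for row in reader:
        yield {"id": int(row[0]), "text": row[1], "label": int(row[2])}


rows = list(iter_rows(io.StringIO(sample_csv, newline="")))
for row in rows:
    print(row["id"], row["label"], repr(row["text"]))
```

The number of records yielded matches the number of posts, not the number of physical lines in the file.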

In [18]:
with f:
    f.index(input_fn=index_generator, request_size=500)

        encoder@6725[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        encoder@6725[I]:input tcp://0.0.0.0:64474 (PULL_BIND) output tcp://0.0.0.0:64478 (PUSH_CONNECT) control over tcp://0.0.0.0:64473 (PAIR_BIND)
    doc_indexer@6726[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
    doc_indexer@6726[I]:input tcp://0.0.0.0:64478 (PULL_BIND) output tcp://0.0.0.0:64479 (PUSH_BIND) control over tcp://0.0.0.0:64477 (PAIR_BIND)
TFIDFTextEncoder@6725[I]:post_init may take some time...
        gateway@6727[I]:starting jina.peapods.runtimes.asyncio.grpc.GRPCRuntime...
        gateway@6727[I]:input tcp://0.0.0.0:64479 (PULL_CONNECT) output tcp://0.0.0.0:64474 (PUSH_CONNECT) control over ipc:///var/folders/05/h71x7gh54sx_5y43ppkq9_dw0000gq/T/tmpakvgop4i (PAIR_BIND)
        gateway@6727[S]:GRPCRuntime is listening at: 0.0.0.0:64484
   NumpyIndexer@6726[I]:post_init may take some time...

        encoder@6725[I]:recv IndexRequest  from gateway▸encoder/ZEDRuntime▸⚐
        encoder@6725[I]:#sent: 17 #recv: 18 sent_size: 26.8 MB recv_size: 6.1 MB
        encoder@6725[I]:recv IndexRequest  from gateway▸encoder/ZEDRuntime▸⚐
        encoder@6725[I]:#sent: 18 #recv: 19 sent_size: 28.2 MB recv_size: 6.6 MB
    doc_indexer@6726[I]:#sent: 8 #recv: 9 sent_size: 2.6 MB recv_size: 14.3 MB
    doc_indexer@6726[I]:recv IndexRequest  from gateway▸encoder/ZEDRuntime▸doc_indexer/ZEDRuntime▸⚐
index |█████████           | 📃   4500 ⏱️ 9.6s 🐎 467.3/s      9      batch
        encoder@6725[I]:recv IndexRequest  from gateway▸encoder/ZEDRuntime▸⚐
        encoder@6725[I]:#sent: 19 #recv: 20 sent_size: 30.0 MB recv_size: 7.2 MB
        encoder@6725[I]:recv IndexRequest  from gateway▸encoder/ZEDRuntime▸⚐
    doc_indexer@6726[I]:#sent: 9 #recv: 10 sent_size: 3.2 MB recv_size: 15.9 MB


In [53]:
ls 

01_fetch_dataset.py                    flows/
02_build_tfidf.py                      jina_search_20newsgroup_dataset.ipynb
Dockerfile                             pods/
README.md                              tfidf_vectorizer.pickle
app.py                                 workspace/
dataset/


In [55]:
df.shape

(11314, 5)

In [20]:
ls workspace/

doc.gz       doc.gz.head  docidx.bin   vec.gz       vecidx.bin


## Inspect the embeddings created with  `index_generator`

In [25]:
import gzip
import numpy as np

def load_embeddings(path, num_dim):
    # view the decompressed buffer as rows of `num_dim` float64 values
    with gzip.open(path, 'rb') as fp:
        return np.frombuffer(fp.read(), dtype=np.float64).reshape([-1, num_dim])


In [30]:
with open('./pods/tfidf_vectorizer.pickle', 'rb') as f_embedding_method:
    embedder = pickle.load(f_embedding_method)

In [36]:
n_dimensions = len(embedder.vocabulary_)

In [38]:
path = './workspace/vec.gz' 
X = load_embeddings(path, n_dimensions)

In [50]:
X.shape, df.size

((5657, 101631), 56570)
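
The loaded matrix has 5657 rows, exactly half of the 11314 documents in `df`. The notebook does not establish why. One thing worth checking is whether the dtype passed to `np.frombuffer` matches the dtype in which the vectors were written, since the row count depends on it. The sketch below illustrates that effect on a small in-memory buffer. The helper `load_from_bytes` and the toy sizes are hypothetical, and it is only an assumption that this applies to `vec.gz`.

```python
import gzip
import io

import numpy as np

num_dim = 4
# Six toy vectors written as float32 (4 bytes per value).
vectors = np.arange(6 * num_dim, dtype=np.float32).reshape(6, num_dim)

# Compress the raw bytes into an in-memory gzip stream.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as fp:
    fp.write(vectors.tobytes())
raw = buffer.getvalue()


def load_from_bytes(raw, num_dim, dtype):
    """Decompress `raw` and view it as rows of `num_dim` values of `dtype`."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb") as fp:
        return np.frombuffer(fp.read(), dtype=dtype).reshape([-1, num_dim])


as_float32 = load_from_bytes(raw, num_dim, np.float32)
as_float64 = load_from_bytes(raw, num_dim, np.float64)

print(as_float32.shape)  # matches the shape that was written
print(as_float64.shape)  # half as many rows, and the values are meaningless
```

Comparing a few loaded rows against `embedder.transform(...)` on the same documents is a simple way to confirm that the assumed dtype is right.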

In [49]:
embedder.transform(df['text'][0:10])

<10x101631 sparse matrix of type '<class 'numpy.float64'>'
	with 1315 stored elements in Compressed Sparse Row format>

## Query a document


Now that the document vectors are stored in `vec.gz`, we can search for similar documents.
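
As a rough picture of what such a search does, the sketch below ranks a toy corpus by cosine similarity between TF-IDF vectors. It is not the Jina query flow: the metric actually used is configured in `flows/query.yml`, which is not shown here, so cosine similarity is an assumption. The corpus and the helper names `l2_normalize` and `top_k_cosine` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus (not taken from the 20 Newsgroups data).
docs = [
    "windows nt setup and installation",
    "the car engine is loud",
    "apple computers run a stable operating system",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_vecs = vectorizer.transform(docs).toarray()


def l2_normalize(A):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(A, ord=2, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty vectors
    return A / norms


doc_unit = l2_normalize(doc_vecs)


def top_k_cosine(query, k=2):
    """Return (document index, cosine similarity) for the k closest documents."""
    q = l2_normalize(vectorizer.transform([query]).toarray())
    scores = doc_unit @ q[0]
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


result = top_k_cosine("car engine", k=2)
for idx, score in result:
    print(f"{score:.2f}  {docs[idx]}")
```

A query with no words in the vocabulary produces an all-zero vector, which is why the normalization guards against a zero norm.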

In [68]:

def print_resp(resp, text):
    """
    Print response.
    """
    for d in resp.search.docs:
        print(f"Ranked list of related documents: {text} \n")

        # d.matches contains the top_k closest documents, ordered
        # from nearest to farthest from the query.
        for idx, match in enumerate(d.matches):

            score = match.score.value
            if score < 0.0:
                continue
            answer = match.text.strip()
            print(f'> {idx+1:>2d}. "{answer}"\n Score: ({score:.2f})')


def search():
    """
    Search results using Query Flow.
    """
    f = Flow.load_config('flows/query.yml')

    with f:
        while True:
            text = input("Enter a snippet of text used as query: ")
            if not text:
                break

            def ppr(x):
                print_resp(x, text)

            f.search_lines(lines=[text, ], on_done=ppr, top_k=2, line_format=None)


In [69]:
search()

        encoder@8296[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        encoder@8296[I]:input tcp://0.0.0.0:59879 (PULL_BIND) output tcp://0.0.0.0:59883 (PUSH_CONNECT) control over tcp://0.0.0.0:59878 (PAIR_BIND)
    doc_indexer@8297[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
    doc_indexer@8297[I]:input tcp://0.0.0.0:59883 (PULL_BIND) output tcp://0.0.0.0:59884 (PUSH_BIND) control over tcp://0.0.0.0:59882 (PAIR_BIND)
        gateway@8298[I]:starting jina.peapods.runtimes.asyncio.grpc.GRPCRuntime...
TFIDFTextEncoder@8296[I]:post_init may take some time...
        gateway@8298[I]:input tcp://0.0.0.0:59884 (PULL_CONNECT) output tcp://0.0.0.0:59879 (PUSH_CONNECT) control over ipc:///var/folders/05/h71x7gh54sx_5y43ppkq9_dw0000gq/T/tmpr2xpve_p (PAIR_BIND)
        gateway@8298[S]:GRPCRuntime is listening at: 0.0.0.0:59889
   NumpyIndexer@8297[I]:post_init may take some time...

  return A / np.linalg.norm(A, ord=2, axis=1, keepdims=True)


BinaryPbIndexer@8297[I]:indexer size: 11314
    doc_indexer@8297[I]:#sent: 0 #recv: 1 sent_size: 0 Bytes recv_size: 397.6 KB
Enter a snippet of text used as query: The operating system that the Apple computer use is usually more stable than Windows. 

>  1. "This is the official Request for Discussion (RFD) for the creation of two
new newsgroups for Microsoft Windows NT.  This is a second RFD, replacing
the one originally posted in January '93 (and never taken to a vote).  The
proposed groups are described below:

NAME: 	 comp.os.ms-windows.nt.setup
STATUS:  Unmoderated.
PURPOSE: Discussions about setting up and installing Windows NT, and about
	 system and peripheral compatability issues for Windows NT.

NAME:	 comp.os.ms-windows.nt.misc
STATUS:	 Unmoderated.
PURPOSE: Miscellaneous non-programming discussions about using Windows NT,
	 including issues such as security, networking features, console
	 mode and Windows 3.1 (Win16) compatability.

RATIONALE:
	Microsoft NT is the newest me