![](https://miro.medium.com/max/700/1*pJOoQOlHCses8zvefz3YFg.png)

# Stack Overflow search engine

This tutorial helps you build an ML-powered search engine for Stack Overflow data while introducing [DocArray](https://docarray.jina.ai/) and [Jina](https://docs.jina.ai). A user can input a text query and then retrieve questions and answers where the question title is similar to the query.


## Meet our ingredients

### **[DocArray](https://docarray.jina.ai/)**

DocArray is a library for nested, unstructured data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multi-modal data with a Pythonic API. ([star the repo]())

### **[Jina](https://docs.jina.ai)**

Jina is a framework that empowers anyone to build cross-modal and multi-modal applications on the cloud. It uplifts a PoC into a production-ready service. Jina handles the infrastructure complexity, making advanced solution engineering and cloud-native technologies accessible to every developer. ([star the repo]())

### **[Jina Hub](https://hub.jina.ai)**

Download pre-built building blocks for neural search.


### **[Stack Overflow R dataset](https://www.kaggle.com/datasets/stackoverflow/rquestions)**

Why not use the [Python dataset](https://www.kaggle.com/datasets/stackoverflow/pythonquestions)? When I tried reading in the CSV I got a few encoding errors and it frankly wasn't worth the headache.

---

Let's start by installing DocArray:

In [1]:
!pip install -q docarray==0.13.30


...and then importing [DocumentArray](https://docarray.jina.ai/fundamentals/documentarray/):

In [2]:
from docarray import DocumentArray

## Downloading our Data

Unfortunately, Colab notebooks don't save state, so we can't store our data alongside our notebook. So how do we get the dataset's CSV into a DocumentArray?

We could remedy this in two ways:

1. Download the CSV and [import directly](https://docarray.jina.ai/datatypes/tabular/) into a [DocumentArray](https://docarray.jina.ai/fundamentals/documentarray/) with `docs = DocumentArray.from_csv("Questions.csv")`. This is tricky since it's stored on Kaggle and I don't really want to share my Kaggle key publicly. Or...

2. Here's one I made earlier! In one command we can [pull in a pre-existing DocumentArray from the cloud](https://docarray.jina.ai/fundamentals/documentarray/serialization/?highlight=pull#from-to-cloud). We'll just use the first 1,000 questions in the dataset since this is a demo:

In [3]:
docs = DocumentArray.pull(name="stack_overflow_r_q")[:1000]
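
If you'd rather go the CSV route (option 1), here's a rough standard-library sketch of the shape we're after: the question title becomes the text and every other column becomes a tag. Plain dicts stand in for Documents, the sample rows are made up, and `rows_to_records` is a hypothetical helper, not part of DocArray.

```python
import csv
import io

# A tiny in-memory stand-in for Questions.csv. The column names follow the
# tags shown in this notebook; the rows themselves are made up.
SAMPLE_CSV = """Id,OwnerUserId,CreationDate,Score,Title,Body
1,10,2008-09-16T21:40:29Z,5,How to access the last value in a vector?,<p>Example body</p>
2,11,2008-09-17T03:39:16Z,3,How do I create a matrix?,<p>Another body</p>
"""


def rows_to_records(csv_file, text_column="Title"):
    """Turn each CSV row into {"text": ..., "tags": {...}}."""
    records = []
    for row in csv.DictReader(csv_file):
        text = row.pop(text_column)  # the field we want to encode later
        records.append({"text": text, "tags": row})  # everything else is metadata
    return records


records = rows_to_records(io.StringIO(SAMPLE_CSV))
print(records[0]["text"])
print(records[0]["tags"]["Id"])
```

With a real file you could pass `open("Questions.csv", newline="", errors="replace")` instead of the `StringIO` object, so that undecodable bytes get replaced rather than aborting the read.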

Let's see what we've got. We have 1,000 [Documents](https://docarray.jina.ai/fundamentals/document/), each with:
- The title of the question in `doc.text` - this is what will be encoded later in our [Flow](https://docs.jina.ai/fundamentals/flow/).
- Tags - i.e. metadata, containing a `dict` of all the other fields associated with that question title.
- ID - a unique identifier for each Document.

In [4]:
docs.summary()

Let's take a closer look at a single Document so we can get an idea of the structure:

In [5]:
from pprint import pprint # Without pretty-print it'll be harder to read the output

print(docs[0].text)
pprint(docs[0].tags)
pprint(docs[0].id)

How to access the last value in a vector?
{'Body': '<p>Suppose I have a vector that is nested in a dataframe one or two '
         'levels.  Is there a quick and dirty way to access the last value, '
         'without using the <code>length()</code> function?  Something ala '
         "PERL's <code>$#</code> special var?</p>\n"
         '\n'
         '<p>So I would like something like:</p>\n'
         '\n'
         '<pre><code>dat$vec1$vec2[$#]\n'
         '</code></pre>\n'
         '\n'
         '<p>instead of</p>\n'
         '\n'
         '<pre><code>dat$vec1$vec2[length(dat$vec1$vec2)]\n'
         '</code></pre>\n',
 'CreationDate': '2008-09-16T21:40:29Z',
 'Id': '77434',
 'OwnerUserId': '14008',
 'Score': '171'}
'd273553fdd28b8b8012c384a40be3b88'


## Setting up our Flow

To build a search engine we need to pass our Documents into a [Flow](https://docs.jina.ai/fundamentals/flow/). This is what will create embeddings and store our Documents in an index for fast look-up later.

We'll use the [Jina](https://docs.jina.ai/) package to build and orchestrate our Flow.

In [6]:
!pip install -q jina==3.6.11

Creating a Flow is a matter of chaining together building blocks (a.k.a [Executors](https://docs.jina.ai/fundamentals/executor/)). In our case we won't [write these manually](https://docs.jina.ai/fundamentals/executor/executor-api/), but rather we'll either download them from [Jina Hub](https://hub.jina.ai) or run them in a [sandbox in the cloud](https://docs.jina.ai/how-to/sandbox/?highlight=sandbox). This will save us some time and effort.

Let's start by creating an empty Flow:

In [32]:
from jina import Flow

flow = Flow()

Now we'll add our encoder. This will encode the text from each Document into vector embeddings. We'll need these for matching similar text later on.

In our case we'll use [SpacyTextEncoder](https://hub.jina.ai/executor/u7h7cuh2) with the medium language model, though you could swap it out easily for other encoders like [Transformers](https://hub.jina.ai/executor/u9pqs8eb).

We'll run it in a sandbox in the cloud.

In [33]:
flow = flow.add(
    name="encoder",
    uses="jinahub+sandbox://SpacyTextEncoder/v0.4",
    uses_with={"model_name": "en_core_web_md"}
)
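
To get a feel for what "encoding text into a vector" means, here's a toy sketch. spaCy's `Doc.vector` defaults to the average of the token vectors, and I'm assuming the encoder leans on that. The three-dimensional word vectors and the `embed` helper are invented for illustration; the real model uses 300 dimensions.

```python
# Toy 3-dimensional word vectors, invented for illustration.
WORD_VECTORS = {
    "create": [0.9, 0.1, 0.0],
    "make": [0.8, 0.2, 0.0],
    "matrix": [0.1, 0.9, 0.1],
    "boxplot": [0.0, 0.2, 0.9],
}


def embed(text):
    """Average the vectors of all known words in the text."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]  # nothing recognised: fall back to a zero vector
    # zip(*vectors) groups the values dimension by dimension
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


print(embed("create matrix"))
print(embed("make matrix"))
```

Texts that share words (or use words with similar vectors) end up with similar embeddings, which is what lets us match a query against question titles later.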

Next we'll add our indexer. This takes the vector embeddings and metadata and stores them in a database for fast lookup when a user is searching.

We'll use [AnnLiteIndexer](https://hub.jina.ai/executor/7yypg8qk), which will store our data in a SQLite database. For production use, other indexers like [HNSWPostgresIndexer](https://hub.jina.ai/executor/dvp0845a) may be more suitable, but for a simple notebook this is a good fit.

In this case we won't run it in a sandbox, since we want our indexed data stored in the same place as our notebook.

In [35]:
flow = flow.add(
    name="indexer",
    uses="jinahub://AnnLiteIndexer/0.3.0",
    uses_with={"dim": 300},  # we're using a 300 dimension model
    uses_metas={"workspace": "workspace"},  # this is where we'll store our data on disk
    install_requirements=True
)
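
Conceptually, an indexer stores each embedding next to its metadata and, at search time, returns the entries closest to the query vector. The sketch below does this by brute force with cosine similarity. It is not how AnnLiteIndexer works internally (as its name suggests, it relies on approximate nearest-neighbour search), and `INDEX`, `cosine` and `search` are hypothetical names with made-up data.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical index: (embedding, metadata) pairs with toy 3-d vectors.
INDEX = [
    ([0.45, 0.55, 0.05], {"Id": "1", "title": "How do I make a matrix?"}),
    ([0.0, 0.2, 0.9], {"Id": "2", "title": "How do I produce a boxplot?"}),
    ([0.1, 0.9, 0.1], {"Id": "3", "title": "What is a matrix?"}),
]


def search(query_vector, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query_vector, vec), meta) for vec, meta in INDEX)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


for score, meta in search([0.5, 0.5, 0.05]):
    print(round(score, 3), meta["title"])
```

Scanning every vector is fine for 1,000 questions, but it scales linearly with the size of the index, which is why dedicated indexers exist.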

Let's preview our Flow:

In [36]:
flow.plot()

## Indexing our data

That's our Flow built. Now we can run it to start pushing our data through the pipeline.

In [37]:
with flow:
    docs = flow.index(docs)


## Searching our data

Now that we've built our index, it's time to do some searching!

Everything we've worked with while indexing has been in the form of a [Document](https://docarray.jina.ai/fundamentals/document/) (stored in a DocumentArray). So we'll need to create another Document for searching that index:

In [38]:
from docarray import Document

search_term = "How do I create a matrix?"
query = Document(text=search_term)

with flow:
  results = flow.search(query)


Now to look at what matched our search term. `results` is also a DocumentArray (can you see the pattern?). It holds one Document per query, so we'll take the first (and only) one and see what's stored in its `matches` attribute:

In [39]:
matches = results[0].matches

for match in matches:
  print(match.text)

How do I make a matrix from a list of vectors in R?
How do I produce a boxplot in ggplot using a matrix
How do I construct a new centrality measure?
How can I partition a vector?
how do tell if its better to standardize your data matrix first when you do principal component analysis in R?
How do I get confidence intervals without inverting a singular Hessian matrix in R?
How do I specify random factors in R?
How do you make a new dataset given a set of vectors?
How can I declare a thousand separator in read.csv?
R: How to write out a data.frame so that I can paste it into SO for others to read?


## Getting answers to our questions

So far, so good. We've got a list of matching questions. But how can we pair those with the relevant answers?

First we'll need to download our answers. In this case we won't limit them to just 1,000 because:

* Many questions have more than one answer.
* The answers may not be stored in the same order as the questions, so the first question in our dataset may be answered by answer 1,234, 50,234 or 1,337.

Once again, we'll [pull from the cloud](https://docarray.jina.ai/fundamentals/documentarray/serialization/?highlight=pull#from-to-cloud):

In [40]:
answers = DocumentArray.pull(name="stack_overflow_r_a")
answers.summary()

Now we can use the [`find` method](https://docarray.jina.ai/fundamentals/documentarray/find/) to dig out answers where the answer's `ParentId` tag matches the question's `Id` tag:

In [41]:
for match in matches:
  print(match.text)
  match_answers = answers.find({"tags__ParentId": {"$eq": match.tags["Id"]}})
  for answer in match_answers:
    print("---")
    print(answer.text)
  print("-----------")

How do I make a matrix from a list of vectors in R?
---
<p>One option is to use <code>do.call()</code>: </p>

<pre><code> &gt; do.call(rbind, a)
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    1    1    2    3    4    5
 [2,]    2    1    2    3    4    5
 [3,]    3    1    2    3    4    5
 [4,]    4    1    2    3    4    5
 [5,]    5    1    2    3    4    5
 [6,]    6    1    2    3    4    5
 [7,]    7    1    2    3    4    5
 [8,]    8    1    2    3    4    5
 [9,]    9    1    2    3    4    5
[10,]   10    1    2    3    4    5
</code></pre>

---
<p>Not straightforward, but it works:</p>

<pre><code>&gt; t(sapply(a, unlist))
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    1    1    2    3    4    5
 [2,]    2    1    2    3    4    5
 [3,]    3    1    2    3    4    5
 [4,]    4    1    2    3    4    5
 [5,]    5    1    2    3    4    5
 [6,]    6    1    2    3    4    5
 [7,]    7    1    2    3    4    5
 [8,]    8    1    2    3    4    5
 [9,]    9    1    2    3    4    5


Voila! You can see:

* Questions matching our search term
* Answers to those questions
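
The loop above runs one `find` per match. If you wanted to pair up many questions with their answers, one alternative is to group the answers by `ParentId` once and then look them up directly. Here's a minimal sketch of that idea, with plain dicts standing in for Documents and made-up sample data (`sample_answers`, `sample_matches` and `answers_by_parent` are hypothetical names):

```python
from collections import defaultdict

# Plain dicts standing in for Documents; the values are made up.
sample_answers = [
    {"text": "Use do.call(rbind, a)", "tags": {"ParentId": "100"}},
    {"text": "Try t(sapply(a, unlist))", "tags": {"ParentId": "100"}},
    {"text": "Use tail(x, 1)", "tags": {"ParentId": "200"}},
]
sample_matches = [
    {"text": "How do I make a matrix from a list of vectors in R?", "tags": {"Id": "100"}},
    {"text": "How can I partition a vector?", "tags": {"Id": "300"}},
]

# One pass over the answers builds a ParentId -> [answers] lookup table.
answers_by_parent = defaultdict(list)
for answer in sample_answers:
    answers_by_parent[answer["tags"]["ParentId"]].append(answer)

for match in sample_matches:
    print(match["text"])
    # .get avoids inserting empty lists for questions without answers
    for answer in answers_by_parent.get(match["tags"]["Id"], []):
        print("---")
        print(answer["text"])
    print("-----------")
```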

Admittedly, the HTML formatting looks a bit janky, but if you were using this IRL you'd strip that out or properly display it. Since this is just a notebook I'll leave that as an exercise for you, dear reader.
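
If you'd like a head start on that exercise, here's one possible approach using only Python's built-in `html.parser`. `TextExtractor` and `strip_html` are names I made up for this sketch, and it simply drops every tag, so code blocks lose their formatting cues.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so &gt; becomes >
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    """Return the text content of an HTML snippet with all tags removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered in the parser
    return "".join(parser.parts).strip()


sample = (
    "<p>One option is to use <code>do.call()</code>: </p>\n\n"
    "<pre><code> &gt; do.call(rbind, a)\n</code></pre>"
)
print(strip_html(sample))
```

In the answer loop you could then call `strip_html(answer.text)` instead of printing the raw HTML.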

## Putting it into production

Colab notebooks have a number of restrictions that make cool stuff quite difficult. If we were building this outside of a notebook, we could:

* Set up a [RESTful or gRPC gateway](https://docs.jina.ai/fundamentals/gateway/) and keep the Flow open to requests using `flow.block()`.
* Use [sharding and replicas](https://docs.jina.ai/how-to/scale-out/) to improve performance and reliability.
* [Monitor our Flow with Grafana](https://docs.jina.ai/fundamentals/flow/monitoring-flow/).
* Better yet, host our Flow on [JCloud](https://docs.jina.ai/fundamentals/jcloud/), so we don't have to use any of our own compute for encoding, indexing, hosting, etc. (encoding is especially hungry on the hardware).
* Fine-tune our model using [Finetuner](https://finetuner.jina.ai) to provide better matches.
* Use a more specialized model for dealing with technical/code queries (rather than a general-purpose one).