# Search Articles in Medium

## 0. Overview

We'll search the Medium dataset for a piece of text and retrieve the articles whose titles are most similar to it. Unlike traditional keyword search, which matches literal words, this approach searches for semantically relevant content. For example, a search for "**funny python demo**" returns "**Python Coding for Kids - Setting Up For the Adventure**", a title that is related in meaning rather than one that merely repeats the query's keywords.

We will use Milvus and Towhee to build the search. Towhee extracts the semantics of the text and returns a text embedding. The Milvus vector database stores and searches the vectors and returns the related articles. So we first need to install [Milvus](https://github.com/milvus-io/milvus) and [Towhee](https://github.com/towhee-io/towhee).

Before getting started, please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

## 1. Data preprocessing

The data comes from the [Cleaned Medium Articles Dataset](https://www.kaggle.com/datasets/shiyu22chen/cleaned-medium-articles-dataset) (you can download it from Kaggle). Articles with empty titles have been removed, and each title string has been converted to an embedding with the Towhee [text_embedding.dpr operator](https://towhee.io/text-embedding/dpr). The `title_vector` column holds the embedding vector of each title.

pip install -v towhee pymilvus==2.2.11

In [4]:
# Download data
! wget -q https://github.com/towhee-io/examples/releases/download/data/New_Medium_Data.csv

In [1]:
import ast

import pandas as pd

# title_vector is stored as a string in the CSV; parse it back into a list of floats
df = pd.read_csv('New_Medium_Data.csv', converters={'title_vector': ast.literal_eval})
df.head()

|  | id | title | title_vector | link | reading_time | publication | claps | responses |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | The Reported Mortality Rate of Coronavirus Is ... | [0.041732933, 0.013779674, -0.027564144, -0.01... | https://medium.com/swlh/the-reported-mortality... | 13 | The Startup | 1100 | 18 |
| 1 | 1 | Dashboards in Python: 3 Advanced Examples for ... | [0.0039737443, 0.003020432, -0.0006188639, 0.0... | https://medium.com/swlh/dashboards-in-python-3... | 14 | The Startup | 726 | 3 |
| 2 | 2 | How Can We Best Switch in Python? | [0.031961977, 0.00047043373, -0.018263113, 0.0... | https://medium.com/swlh/how-can-we-best-switch... | 6 | The Startup | 500 | 7 |
| 3 | 3 | Maternity leave shouldn’t set women back | [0.032572296, -0.011148319, -0.01688577, -0.00... | https://medium.com/swlh/maternity-leave-should... | 9 | The Startup | 460 | 1 |
| 4 | 4 | Python NLP Tutorial: Information Extraction an... | [-0.011735886, -0.016938083, -0.027233299, 0.0... | https://medium.com/swlh/python-nlp-tutorial-in... | 7 | The Startup | 163 | 0 |
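The `converters` argument in the cell above is needed because the CSV stores each embedding as text such as `"[0.04, 0.01, ...]"`. Below is a minimal sketch of that parsing step using only the standard library. The sample values are made up for illustration, and `ast.literal_eval` is used here as a safer alternative to `eval`, since it accepts only Python literals.

```python
import ast
import csv
import io

# A tiny stand-in for New_Medium_Data.csv; the values are made up for illustration.
sample_csv = io.StringIO(
    'id,title,title_vector\n'
    '0,Example title,"[0.04, 0.01, -0.03]"\n'
)

rows = []
for record in csv.DictReader(sample_csv):
    # literal_eval only parses Python literals, so it cannot run arbitrary code.
    record['title_vector'] = ast.literal_eval(record['title_vector'])
    rows.append(record)

print(type(rows[0]['title_vector']), rows[0]['title_vector'])
```

After parsing, `title_vector` is a list of floats rather than a string, which is the form a vector field expects.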


## 2. Load Data

The dataset already contains the title embeddings in `title_vector`, so the next step is to insert all of these vectors, together with the other fields, into Milvus.

### Create Milvus Collection

We first need to create a collection in Milvus with the fields `id`, `title`, `title_vector`, `link`, `reading_time`, `publication`, `claps` and `responses`.

In [2]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
            FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),   
            FieldSchema(name="title_vector", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=500),
            FieldSchema(name="reading_time", dtype=DataType.INT64),
            FieldSchema(name="publication", dtype=DataType.VARCHAR, max_length=500),
            FieldSchema(name="claps", dtype=DataType.INT64),
            FieldSchema(name="responses", dtype=DataType.INT64)
    ]
    schema = CollectionSchema(fields=fields, description='search text')
    collection = Collection(name=collection_name, schema=schema)
    
    index_params = {
        'metric_type': "L2",
        'index_type': "IVF_FLAT",
        'params': {"nlist": 2048}
    }
    collection.create_index(field_name='title_vector', index_params=index_params)
    return collection

collection = create_milvus_collection('search_article_in_medium', 768)
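The index above uses `IVF_FLAT` with `nlist` clusters and the `L2` metric. As a rough conceptual sketch (not Milvus's actual implementation), an inverted-file index groups vectors by their nearest centroid and, at query time, scans only the few clusters closest to the query. The centroids here are hand-picked, the data is 2-dimensional toy data, and `nprobe` (the number of clusters scanned) is a search-time parameter that this notebook does not set explicitly.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        buckets[nearest].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, limit=2):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vec_id for i in probe for vec_id in buckets[i]]
    ranked = sorted(candidates, key=lambda vec_id: l2(query, vectors[vec_id]))
    return [(vec_id, l2(query, vectors[vec_id])) for vec_id in ranked[:limit]]

# Toy 2-D data; the real title vectors have 768 dimensions.
vectors = {0: [0.0, 0.1], 1: [0.2, 0.0], 2: [5.0, 5.1], 3: [5.2, 4.9]}
centroids = [[0.1, 0.05], [5.1, 5.0]]  # hand-picked; a real index learns nlist centroids
buckets = build_ivf(vectors, centroids)
hits = ivf_search([5.0, 5.0], vectors, centroids, buckets, nprobe=1, limit=2)
print(buckets)
print(hits)
```

Scanning fewer clusters makes the search faster but can miss neighbors that fall in an unscanned cluster, which is the usual trade-off of this kind of index.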

### Data to Milvus


The insert pipeline takes the DataFrame as input, uses `flat_map` to split it into one list of values per row, and inserts each row into Milvus with the `ann_insert.milvus_client` operator. Each value in a row corresponds to one of the collection fields created earlier, so the DataFrame columns must follow the same order as those fields.

In [3]:
from towhee import ops, pipe, DataCollection

insert_pipe = (pipe.input('df')
                   .flat_map('df', 'data', lambda df: df.values.tolist())
                   .map('data', 'res', ops.ann_insert.milvus_client(host='127.0.0.1', 
                                                                    port='19530',
                                                                    collection_name='search_article_in_medium'))
                   .output('res')
)
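Because each row is passed as a plain list, the values are matched to the collection fields by position. The following sketch shows one way to sanity-check a row before inserting it. `row_to_entity` is a hypothetical helper written for this illustration, not part of Towhee or pymilvus, and the sample row is made up.

```python
# Field order of the collection created earlier.
FIELD_NAMES = ['id', 'title', 'title_vector', 'link',
               'reading_time', 'publication', 'claps', 'responses']

def row_to_entity(row, dim=768):
    """Pair a positional row with the field names and run basic checks.

    Hypothetical helper for illustration only.
    """
    if len(row) != len(FIELD_NAMES):
        raise ValueError(f'expected {len(FIELD_NAMES)} values, got {len(row)}')
    entity = dict(zip(FIELD_NAMES, row))
    if len(entity['title_vector']) != dim:
        raise ValueError(
            f"title_vector has {len(entity['title_vector'])} dims, expected {dim}")
    return entity

# A made-up row with a 3-dimensional vector to keep the example short.
row = [0, 'Example title', [0.1, 0.2, 0.3], 'https://medium.com/example',
       13, 'The Startup', 1100, 18]
entity = row_to_entity(row, dim=3)
print(entity['title'], entity['claps'])
```

A check like this could be applied to `df.values.tolist()` before running the pipeline to catch a missing or reordered column early.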

In [4]:
%time _ = insert_pipe(df)

2023-07-13 10:19:36,769 - 13243097088 - node.py-node:167 - INFO: Begin to run Node-_input
2023-07-13 10:19:36,770 - 13259886592 - node.py-node:167 - INFO: Begin to run Node-lambda-0
2023-07-13 10:19:36,771 - 13276676096 - node.py-node:167 - INFO: Begin to run Node-ann-insert/milvus-client-1
2023-07-13 10:19:36,771 - 13243097088 - node.py-node:167 - INFO: Begin to run Node-_output


CPU times: user 13.8 s, sys: 3.05 s, total: 16.9 s
Wall time: 57.8 s


After inserting the data, we call `collection.load()` to load the collection and then read `collection.num_entities` to get the number of vectors in it. The run recorded below reports 4317 entities, which confirms that data has been loaded into Milvus. Note that `num_entities` may lag behind the number of rows inserted until the data has been flushed.

In [5]:
collection.load()
collection.num_entities

4317

## 3. Search embedding title

### Search one text in Milvus


The retrieval process first generates the text embedding of the query text, then searches for similar vectors in Milvus, and finally returns the results, each of which contains an `id` (the primary key) and a `score`. For example, we can search for "funny python demo":

In [None]:
import numpy as np

# Initialize input
pipe_input = pipe.input('query')

# Map 'query' to 'vec' using DPR encoder
query_to_vec = pipe_input.map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))

# Normalize 'vec'
vec_normalized = query_to_vec.map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))

# Perform nearest neighbor search using Milvus client
search_result = vec_normalized.flat_map('vec', ('id', 'score'), ops.ann_search.milvus_client(host='127.0.0.1', port='19530', collection_name='search_article_in_medium'))

# Define the output of the pipeline
search_pipe = search_result.output('query', 'id', 'score')


In [None]:
res = search_pipe('funny python demo')
DataCollection(res).show()
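The pipeline normalizes the query vector before searching an index that uses the `L2` metric. If the stored title vectors are also unit length (an assumption; the source does not state this), ranking by L2 distance gives the same order as ranking by cosine similarity, because for unit vectors $\|q - d\|^2 = 2 - 2\cos(q, d)$. A small standard-library sketch with made-up 2-dimensional vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length, as the pipeline does with np.linalg.norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = normalize([3.0, 4.0])
docs = {
    'a': normalize([4.0, 3.0]),
    'b': normalize([-1.0, 2.0]),
    'c': normalize([3.0, 4.1]),
}

for name, vec in docs.items():
    # For unit vectors: ||q - d||^2 = 2 - 2 * cos(q, d)
    assert math.isclose(squared_l2(query, vec), 2 - 2 * cosine(query, vec), abs_tol=1e-12)

by_l2 = sorted(docs, key=lambda n: squared_l2(query, docs[n]))   # smaller is closer
by_cos = sorted(docs, key=lambda n: -cosine(query, docs[n]))     # larger is closer
print(by_l2, by_cos)
```

With an L2 metric, a smaller `score` therefore means a closer match.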

### Search multi text in Milvus

We can also search for several texts at once. For example, we can pass the list `['funny python demo', 'AI in data analysis']` to `search_pipe.batch`, which runs one search in Milvus per query:

In [None]:
res = search_pipe.batch(['funny python demo', 'AI in data analysis'])
for result in res:
    DataCollection(result).show()

### Search text and return multi fields

If we want to return more information when retrieving, we can set the `output_fields` parameter of the [ann_search.milvus operator](https://towhee.io/ann-search/milvus). For example, in addition to `id` and `score`, we can also return `title`, `link`, `claps`, `reading_time`, and `responses`:

In [None]:
search_pipe1 = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title'), ops.ann_search.milvus_client(host='127.0.0.1', 
                                                                                   port='19530',
                                                                                   collection_name='search_article_in_medium',
                                                                                   output_fields=['title']))  
                    .output('query', 'id', 'score', 'title')
               )

res = search_pipe1('funny python demo')
DataCollection(res).show()

In [None]:
# Milvus search with multiple output fields
search_pipe2 = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'), 
                                       ops.ann_search.milvus_client(host='127.0.0.1', 
                                                                    port='19530',
                                                                    collection_name='search_article_in_medium',
                                                                    output_fields=['title', 'link', 'reading_time', 'publication', 'claps', 'responses'], 
                                                                    limit=5))  
                    .output('query', 'id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses')
               )

res = search_pipe2('funny python demo')
DataCollection(res).show()

### Search text with some expr


In addition, we can filter the search with an expression. For example, we can restrict the results to articles whose title begins with "Python" by setting `expr='title like "Python%"'`:

In [None]:
search_pipe3 = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'), 
                                       ops.ann_search.milvus_client(host='127.0.0.1', 
                                                                    port='19530',
                                                                    collection_name='search_article_in_medium',
                                                                    expr='title like "Python%"',
                                                                    output_fields=['title', 'link', 'reading_time', 'publication', 'claps', 'responses'], 
                                                                    limit=5))  
                    .output('query', 'id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses')
               )

res = search_pipe3('funny python demo')
DataCollection(res).show()
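To make the meaning of the prefix pattern concrete, the sketch below applies the same kind of condition to a few titles in plain Python. It is only an illustration of what `like "Python%"` selects, not a reimplementation of Milvus expression parsing. `like_prefix` is a hypothetical helper, and the second and third titles are made up.

```python
def like_prefix(value, pattern):
    """Evaluate a prefix pattern such as 'Python%' against a string.

    Only the trailing-% form is handled in this sketch.
    """
    if not pattern.endswith('%') or '%' in pattern[:-1]:
        raise ValueError('only prefix patterns like "Python%" are supported here')
    return value.startswith(pattern[:-1])

titles = [
    'Python Coding for Kids - Setting Up For the Adventure',
    'A Funny Demo Written in Python',   # made-up title: "Python" is not at the start
    'python tips in lowercase',         # made-up title: different capitalization
]

matches = [title for title in titles if like_prefix(title, 'Python%')]
print(matches)
```

In this sketch the match is case-sensitive and anchored at the start of the string, so only the first title is selected.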

## 4. Query data in Milvus

We completed the text retrieval process above, where searching for "funny python demo" returns articles such as "Python Coding for Kids - Setting Up For the Adventure".

We can also run a simple query on the data by passing `expr` and `output_fields` to the `collection.query` interface. For example, we can select articles with more than 3000 claps and a reading time of less than 15 minutes that were published in Towards Data Science:

In [None]:
collection.query(
  expr = 'claps > 3000 && reading_time < 15 && publication like "Towards Data Science%"', 
  output_fields = ['id', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'],
  consistency_level='Strong'
)
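When the filter values come from user input or configuration, it can help to assemble the expression string in one place. The following is a minimal sketch. `build_expr` is a hypothetical helper introduced for this example, and it assumes the field names of the collection created earlier.

```python
def build_expr(min_claps=None, max_reading_time=None, publication_prefix=None):
    """Join optional filters into one boolean expression string.

    Hypothetical helper; field names match the collection created earlier.
    """
    clauses = []
    if min_claps is not None:
        clauses.append(f'claps > {int(min_claps)}')
    if max_reading_time is not None:
        clauses.append(f'reading_time < {int(max_reading_time)}')
    if publication_prefix is not None:
        if '"' in publication_prefix:
            raise ValueError('double quotes are not supported in this sketch')
        clauses.append(f'publication like "{publication_prefix}%"')
    return ' && '.join(clauses)

expr = build_expr(min_claps=3000, max_reading_time=15,
                  publication_prefix='Towards Data Science')
print(expr)
```

The resulting string is the same expression used in the query above and can be passed as `expr` to `collection.query`.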