# NLP text search
This notebook implements NLP text search in Elasticsearch using the Elastic blogs dataset.

You will index blog documents and use an ingest pipeline to generate text embeddings. With an NLP model deployed in Elasticsearch, you will then query the blog documents using natural language.

The dataset contains ~4000 blog posts from [elastic.co/blog](https://elastic.co/blog).

### Prerequisites

Before you start, make sure you have an Elasticsearch cluster running. The cluster must have at least one machine learning (ML) node with enough memory (4GB).
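One way to verify the ML node requirement is to inspect the node roles reported by the cluster. The sketch below is a hypothetical helper (`ml_nodes` is not part of this notebook) that works on a dictionary shaped like the nodes info API response; the sample data is made up for illustration.

```python
from typing import Any, Dict, List


def ml_nodes(nodes_info: Dict[str, Any]) -> List[str]:
    """Return the names of nodes whose roles include "ml"."""
    return [
        node.get("name", node_id)
        for node_id, node in nodes_info.get("nodes", {}).items()
        if "ml" in node.get("roles", [])
    ]


# Hypothetical response fragment, shaped like the nodes info API output.
sample = {
    "nodes": {
        "abc": {"name": "instance-0000000000", "roles": ["data_hot", "master"]},
        "def": {"name": "instance-0000000001", "roles": ["ml", "remote_cluster_client"]},
    }
}

print(ml_nodes(sample))  # ['instance-0000000001']
```

Once the client is connected later in this notebook, the same helper could be called as `ml_nodes(es.nodes.info().body)`. An empty list would mean no ML node is available for the model deployment.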

### Install Python requirements
Before you start, install the required Python dependencies.

In [None]:
!pip install sentence-transformers eland elasticsearch transformers torch tqdm

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting eland
  Downloading eland-8.7.0-py3-none-any.whl (152 kB)
Collecting elasticsearch
  Downloading elasticsearch-8.9.0-py3-none-any.whl (395 kB)
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_

### Upload NLP model for querying
Using the `eland` tool, you will import the required NLP model. The model transforms your search query into a vector, which is then used to search over the documents stored in Elasticsearch.

Model used:
- [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to transform the search text into a vector

To get the Cloud ID, go to [ESS cloud](https://cloud.elastic.co/logout?redirectTo=%2Fhome&reason=unauthorised) and, on the deployment overview page, copy the Cloud ID.

Authentication uses an API key (`--es-api-key`). Learn how to [generate an API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key). Alternatively, you can authenticate with a username and password.

```
$ eland_import_hub_model --cloud-id $CLOUD_ID \
  --hub-model-id sentence-transformers/all-MiniLM-L6-v2 \
  --task-type text_embedding --es-api-key $API_KEY --start
```



In [None]:
API_KEY=''
CLOUD_ID=''
!eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --es-api-key $API_KEY --start

2023-08-01 14:24:13,344 INFO : Establishing connection to Elasticsearch
2023-08-01 14:24:13,923 INFO : Connected to cluster named 'a597bbe1e0d047c494e7d4015f67ef37' (version: 8.8.2)
2023-08-01 14:24:13,924 INFO : Loading HuggingFace transformer tokenizer and model 'sentence-transformers/all-MiniLM-L6-v2'
Downloading (…)okenizer_config.json: 100% 350/350 [00:00<00:00, 1.84MB/s]
Downloading (…)solve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 3.98MB/s]
Downloading (…)cial_tokens_map.json: 100% 112/112 [00:00<00:00, 714kB/s]
Downloading (…)lve/main/config.json: 100% 612/612 [00:00<00:00, 3.15MB/s]
Downloading pytorch_model.bin: 100% 90.9M/90.9M [00:00<00:00, 198MB/s]
Downloading (…)e9125/.gitattributes: 100% 1.18k/1.18k [00:00<00:00, 6.64MB/s]
Downloading (…)_Pooling/config.json: 100% 190/190 [00:00<00:00, 1.07MB/s]
Downloading (…)7e55de9125/README.md: 100% 10.6k/10.6k [00:00<00:00, 37.5MB/s]
Downloading (…)55de9125/config.json: 100% 612/612 [00:00<00:00, 2.35MB/s]
Downloading (…)ce_tran
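The rest of this notebook refers to the imported model as `sentence-transformers__all-minilm-l6-v2`, while the Hugging Face identifier is `sentence-transformers/all-MiniLM-L6-v2`. The sketch below shows a conversion rule that is consistent with those two names. The rule and the helper name `hub_id_to_es_model_id` are assumptions inferred from this notebook, not a statement of what `eland` does internally.

```python
def hub_id_to_es_model_id(hub_model_id: str) -> str:
    """Derive the Elasticsearch model ID from a Hugging Face hub ID.

    Assumed rule (inferred from the IDs used in this notebook):
    replace "/" with "__" and lowercase the result.
    """
    return hub_model_id.replace("/", "__").lower()


MODEL_ID = hub_id_to_es_model_id("sentence-transformers/all-MiniLM-L6-v2")
print(MODEL_ID)  # sentence-transformers__all-minilm-l6-v2
```

Keeping the ID in one variable such as `MODEL_ID` avoids repeating the string in the pipeline definition and in the query.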

### Import Python libraries

In [None]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
import requests
import os
import sys
import shutil
import zipfile
from tqdm.auto import tqdm
import pandas as pd
import json
from getpass import getpass

### Connect to Elasticsearch cluster
Use your own cluster details for `ELASTIC_CLOUD_ID` and `API_KEY`.

Alternatively, you can authenticate with a username and password.

In [None]:
# ESS Cloud connection definition using an API_KEY
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
API_KEY = getpass("Elastic API key: ")

# ELASTIC_CLOUD_USER = "elastic"
# ELASTIC_CLOUD_PASSWORD = getpass("Elastic Password: ")

es = Elasticsearch(
  cloud_id=ELASTIC_CLOUD_ID,
  #basic_auth=(ELASTIC_CLOUD_USER, ELASTIC_CLOUD_PASSWORD),
  api_key=API_KEY,
  request_timeout=600
)

es.info() # should return cluster info

Elastic Cloud ID: ··········
Elastic API key: ··········


ObjectApiResponse({'name': 'instance-0000000000', 'cluster_name': 'a597bbe1e0d047c494e7d4015f67ef37', 'cluster_uuid': 'EnT0vwwSSZeAahPw3Vhsuw', 'version': {'number': '8.8.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '98e1271edf932a480e4262a471281f1ee295ce6b', 'build_date': '2023-06-26T05:16:16.196344851Z', 'build_snapshot': False, 'lucene_version': '9.6.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})
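The two authentication options mentioned above can be selected in one place. The helper below is a minimal sketch: `connection_kwargs` is a hypothetical name, and it only builds the keyword arguments (`cloud_id`, `api_key`, `basic_auth`, `request_timeout`) that the connection cell passes to the `Elasticsearch` client.

```python
from typing import Any, Dict, Optional


def connection_kwargs(
    cloud_id: str,
    api_key: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
    request_timeout: int = 600,
) -> Dict[str, Any]:
    """Build client keyword arguments, preferring the API key when given."""
    kwargs: Dict[str, Any] = {"cloud_id": cloud_id, "request_timeout": request_timeout}
    if api_key:
        kwargs["api_key"] = api_key
    elif username and password:
        kwargs["basic_auth"] = (username, password)
    else:
        raise ValueError("Provide either api_key or both username and password")
    return kwargs


# Usage (assumes the variables from the connection cell):
# es = Elasticsearch(**connection_kwargs(ELASTIC_CLOUD_ID, api_key=API_KEY))
```

Failing early with a `ValueError` makes a missing credential visible before any request is sent to the cluster.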

### Upload Ingest pipeline
Next, upload the ingest pipeline that generates vector (text) embeddings for the selected field.

The pipeline below defines an inference processor that points to the NLP model uploaded earlier.

In [None]:
# Ingest pipeline definition
PIPELINE_ID = "vectorize_blogs"
PIPELINE_DESC = "This is the description for document processing"
PIPELINE_VERSION = 3

es.ingest.put_pipeline(
    id=PIPELINE_ID,
    description=PIPELINE_DESC,
    version=PIPELINE_VERSION,
    processors=[{
        "inference": {
            "model_id": "sentence-transformers__all-minilm-l6-v2",
            "target_field": "text_embedding",
            # Map the document field to the input name the model expects
            "field_map": {
                "body_content_window": "text_field"
            }
        }
    }]
)

ObjectApiResponse({'acknowledged': True})
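To make the relationship between the model, the source field, and the target field explicit, the processor definition can be generated by a small function. This is only a sketch: `build_inference_processor` is a hypothetical helper, and it reproduces the same processor structure used in the cell above.

```python
from typing import Any, Dict


def build_inference_processor(
    model_id: str,
    source_field: str,
    target_field: str = "text_embedding",
) -> Dict[str, Any]:
    """Return an inference processor that embeds `source_field`.

    `text_field` is the input name used in this notebook's pipeline;
    the result is written under `target_field`.
    """
    return {
        "inference": {
            "model_id": model_id,
            "target_field": target_field,
            "field_map": {source_field: "text_field"},
        }
    }


processor = build_inference_processor(
    "sentence-transformers__all-minilm-l6-v2", "body_content_window"
)
print(processor["inference"]["field_map"])  # {'body_content_window': 'text_field'}
```

The returned dictionary could be passed as `processors=[processor]` to `es.ingest.put_pipeline`. With the default `target_field`, the vector ends up in `text_embedding.predicted_value`, which is the field the index mapping and the kNN query refer to.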

### Create Index and mappings for Blogs
Before you can index documents into Elasticsearch, you need to create an index with the correct mappings.

In [None]:
INDEX_NAME="blogs"
DELETE_INDEX=True

INDEX_MAPPING = {
    "properties": {
      "body_content_window": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "byline": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "publish_date": {
        "type": "date"
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "url": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "text_embedding": {
        "properties": {
          "is_truncated": {
            "type": "boolean"
          },
          "model_id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "predicted_value": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "l2_norm"
          }
        }
      }
    }
  }

INDEX_SETTINGS = {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "1",
      "default_pipeline": PIPELINE_ID
    }
}

if DELETE_INDEX:
  if es.indices.exists(index=INDEX_NAME):
    print("Deleting existing %s" % INDEX_NAME)
    # `ignore=` is deprecated in the 8.x client; use `options(ignore_status=...)`
    es.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

print("Creating index %s" % INDEX_NAME)
es.options(ignore_status=[400, 404]).indices.create(
    index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)


Creating index blogs


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'blogs'})
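The mapping above sets `"similarity": "l2_norm"` for the `dense_vector` field. With that setting, Elasticsearch documents the relevance score as `1 / (1 + l2_norm(query, vector)^2)`. The sketch below reproduces that formula in plain Python so the `_score` values returned later can be interpreted; the function names are illustrative.

```python
import math
from typing import Sequence


def l2_norm_score(query: Sequence[float], doc: Sequence[float]) -> float:
    """Score for l2_norm similarity: 1 / (1 + squared Euclidean distance)."""
    if len(query) != len(doc):
        raise ValueError("Vectors must have the same number of dimensions")
    squared_distance = sum((q - d) ** 2 for q, d in zip(query, doc))
    return 1.0 / (1.0 + squared_distance)


def distance_from_score(score: float) -> float:
    """Invert the formula to recover the Euclidean distance from a score."""
    return math.sqrt(1.0 / score - 1.0)


print(l2_norm_score([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical vectors)
print(l2_norm_score([1.0, 0.0], [0.0, 0.0]))  # 0.5 (distance of 1)
```

A score closer to 1 therefore means the document vector is closer to the query vector. The vectors must have the same length as the `dims` value in the mapping (384 here).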

### Get blogs dataset
Download the example blogs dataset from [GitHub](https://raw.githubusercontent.com/elastic/elasticsearch-labs/tree/main/notebooks/integrations/hugging-face/2023-07-27-blogs-no-embeddings.json.zip), then unzip the file.

In [None]:
# Download the dataset
!wget https://raw.githubusercontent.com/elastic/elasticsearch-labs/tree/main/notebooks/integrations/hugging-face/2023-07-27-blogs-no-embeddings.json.zip -P data

In [None]:
# Unzip the file
BLOGS_ZIP_FILE="data/2023-07-27-blogs-no-embeddings.json.zip"
with zipfile.ZipFile(BLOGS_ZIP_FILE, 'r') as zip_ref:
  print('Extracting file ', BLOGS_ZIP_FILE, '.')
  zip_ref.extractall('data')

Extracting file  2023-07-27-blogs-no-embeddings.json.zip .
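If `wget` is not available in your environment, the download and extraction steps can also be done with the Python standard library. This is a minimal sketch; `download` and `extract` are hypothetical helper names, and the download is skipped when the file already exists.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List


def download(url: str, dest_dir: str = "data") -> Path:
    """Download `url` into `dest_dir` unless the file is already there."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    target = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def extract(zip_path: str, dest_dir: str = "data") -> List[str]:
    """Extract the archive and return the names of the extracted members."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
        return zip_ref.namelist()


# Usage (assumes the dataset URL from the download cell above):
# zip_file = download(DATASET_URL)
# print(extract(str(zip_file)))
```

Returning the member names makes it easy to confirm that the expected JSON file is present before indexing.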


Let's index the blogs using the ingest pipeline.

In [None]:
BLOGS_FILE="data/2023-07-27-blogs-no-embeddings.json"
df_blogs_embeddings = pd.read_json(BLOGS_FILE, lines=True)

def gen_rows(df):
  for doc in df.to_dict(orient='records'):
    yield doc

count = 0
for success, info in parallel_bulk(
        client=es,
        actions=gen_rows(df_blogs_embeddings),
        thread_count=5,
        chunk_size=200,
        timeout='600s',
        index=INDEX_NAME
):
  if success:
    count += 1
    # Report progress once per chunk of 200 documents
    if count % 200 == 0:
      print('Indexed %s documents' % count, flush=True)
  else:
    print('Doc failed', info)

print('Indexed %s blogs embeddings documents' % count, flush=True)

Indexed 200 documents
Indexed 400 documents
Indexed 600 documents
Indexed 800 documents
Indexed 1000 documents
Indexed 1200 documents
Indexed 1400 documents
Indexed 1600 documents
Indexed 1800 documents
Indexed 2000 documents
Indexed 2200 documents
Indexed 2400 documents
Indexed 2600 documents
Indexed 2800 documents
Indexed 3000 documents
Indexed 3200 documents
Indexed 3400 documents
Indexed 3600 documents
Indexed 3800 documents
Indexed 4000 documents
Indexed 4200 documents
Indexed 4400 documents
Indexed 4600 documents
Indexed 4800 documents
Indexed 5000 documents
Indexed 5200 documents
Indexed 5400 documents
Indexed 5600 documents
Indexed 5800 documents
Indexed 6000 documents
Indexed 6200 documents
Indexed 6400 documents
Indexed 6600 documents
Indexed 6800 documents
Indexed 7000 documents
Indexed 7200 documents
Indexed 7400 documents
Indexed 7600 documents
Indexed 7800 documents
Indexed 8000 documents
Indexed 8200 documents
Indexed 8400 documents
Indexed 8600 documents
Indexed 8800 do
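The indexing cell loads the whole file into a pandas DataFrame first. Because the file is read with `lines=True` (one JSON document per line), the documents can also be streamed directly from disk. The generator below is a sketch of that alternative; `gen_actions` is a hypothetical name, and unlike `pd.read_json` it performs no type conversion on the values.

```python
import json
from typing import Any, Dict, Iterator


def gen_actions(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one document per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Usage (assumes BLOGS_FILE from the indexing cell):
# parallel_bulk(client=es, actions=gen_actions(BLOGS_FILE), index=INDEX_NAME, ...)
```

Streaming keeps memory use roughly constant regardless of the file size, which can matter for datasets larger than this one.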

### Query the dataset
The next step is to run a query to search for relevant blogs. The example query searches for `"model_text": "how to do image similarity search"` using the model we uploaded to Elasticsearch, `sentence-transformers__all-minilm-l6-v2`.

The process is a single query, even though internally it consists of two tasks: first, the NLP model transforms your search text into a vector; second, a vector search runs over the dataset.

```
POST blogs/_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "k": 10,
    "num_candidates": 50,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "how to do image similarity search"
      }
    }
  },
  "fields": [
    "body_content_window",
    "id",
    "title",
    "url"
  ],
  "_source": false
}
```
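To illustrate the two internal tasks, the same search can be split into two explicit calls: one that asks the deployed model for the query embedding, and one that runs kNN search with that vector. This is a sketch under assumptions: it expects a connected client `es`, uses the client's `ml.infer_trained_model` and `search` methods, and assumes the inference response has the `inference_results[0].predicted_value` shape. The function names are hypothetical.

```python
from typing import Any, List

MODEL_ID = "sentence-transformers__all-minilm-l6-v2"


def embed_query(es: Any, text: str, model_id: str = MODEL_ID) -> List[float]:
    """Task 1: transform the search text into a vector with the NLP model."""
    response = es.ml.infer_trained_model(
        model_id=model_id,
        docs=[{"text_field": text}],
    )
    return response["inference_results"][0]["predicted_value"]


def knn_search(es: Any, vector: List[float], index: str = "blogs",
               k: int = 10, num_candidates: int = 50) -> Any:
    """Task 2: run the vector search with an explicit query vector."""
    return es.search(
        index=index,
        knn={
            "field": "text_embedding.predicted_value",
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        fields=["title", "url"],
        source=False,
    )


# Usage (assumes a connected client `es`):
# vector = embed_query(es, "how to do image similarity search")
# response = knn_search(es, vector)
```

Using `query_vector_builder` as in the request above is usually more convenient, since it needs one round trip instead of two. The explicit form can help when debugging the embedding step on its own.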



In [None]:
INDEX_NAME="blogs"

source_fields = ["body_content_window", "id", "title", "url"]

query = {
  "field": "text_embedding.predicted_value",
  "k": 10,
  "num_candidates": 50,
  "query_vector_builder": {
    "text_embedding": {"model_id": "sentence-transformers__all-minilm-l6-v2",
      "model_text": "how to do image similarity search"
    }
  }
}

response = es.search(
    index=INDEX_NAME,
    fields=source_fields,
    knn=query,
    source=False)

print(response.body)

results = pd.json_normalize(response.body['hits']['hits'])
results[['_id', '_score', 'fields.title', 'fields.body_content_window', 'fields.url']]


{'took': 13, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 10, 'relation': 'eq'}, 'max_score': 0.5594575, 'hits': [{'_index': 'blogs', '_id': 'MbiqsYkBLg4Kd5ySdDxp', '_score': 0.5594575, '_ignored': ['body_content_window.keyword'], 'fields': {'body_content_window': [' GET my-image-embeddings/_search { "knn": { "field": "image_embedding", "k": 5, "num_candidates": 10, "query_vector": [ -0.19898493587970734, 0.1074572503566742, -0.05087625980377197, ... 0.08200495690107346, -0.07852292060852051 ] }, "fields": [ "image_id", "image_name", "relative_path" ], "_source": false } Read more The response from Elasticsearch will give you the best matching images based on our kNN search query, stored in Elastic as documents. The flow graph below summarizes the steps your interactive application moves through while processing a user query: Load the interactive application, its front end. The user selects an image they’re interes

Unnamed: 0,_id,_score,fields.title,fields.body_content_window,fields.url
0,MbiqsYkBLg4Kd5ySdDxp,0.559458,[5 Technical components of image similarity se...,"[ GET my-image-embeddings/_search { ""knn"": { ""...",[https://www.elastic.co/blog/5-technical-compo...
1,3birsYkBLg4Kd5ySxT6o,0.548522,[Overview of image similarity search | Elastic...,[ Compared to traditional text search (in Elas...,[https://www.elastic.co/blog/overview-image-si...
2,3LirsYkBLg4Kd5ySxT6o,0.544762,[Overview of image similarity search | Elastic...,[ Log in Blog Solutions Stack + Cloud Tech Top...,[https://www.elastic.co/blog/overview-image-si...
3,L7iqsYkBLg4Kd5ySdDxp,0.53475,[5 Technical components of image similarity se...,[ ... img_model = SentenceTransformer('clip-Vi...,[https://www.elastic.co/blog/5-technical-compo...
4,LriqsYkBLg4Kd5ySdDxp,0.527482,[5 Technical components of image similarity se...,[ # Run kNN search against <query-embedding> o...,[https://www.elastic.co/blog/5-technical-compo...
5,orivsYkBLg4Kd5ySC0bj,0.503949,[How to implement similarity image search | El...,"[ In the example below, the text entered was “...",[https://www.elastic.co/blog/implement-image-s...
6,JbiosYkBLg4Kd5ySgDcS,0.498829,[Text similarity search in Elasticsearch using...,[ Log in Blog Solutions Stack + Cloud Tech Top...,[https://www.elastic.co/blog/text-similarity-s...
7,LbiqsYkBLg4Kd5ySdDxp,0.491115,[5 Technical components of image similarity se...,[ Another important aspect of CLIP is that it ...,[https://www.elastic.co/blog/5-technical-compo...
8,qLinsYkBLg4Kd5yS0jWv,0.484716,[Introducing approximate nearest neighbor sear...,[ Thankfully there is a great open source proj...,[https://www.elastic.co/blog/introducing-appro...
9,p7insYkBLg4Kd5yS0jWv,0.482573,[Introducing approximate nearest neighbor sear...,[ Log in Blog Solutions Stack + Cloud Tech Top...,[https://www.elastic.co/blog/introducing-appro...
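Several rows in the results above share the same title and URL, which suggests that one blog post is indexed as multiple documents (one per `body_content_window`). If you only want one result per blog post, the hits can be de-duplicated by URL on the client side. The helper below is a sketch; `best_hit_per_url` is a hypothetical name, and it relies on the hits being sorted by descending `_score`, as they are in the response above.

```python
from typing import Any, Dict, List


def best_hit_per_url(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the first (highest scoring) hit for each URL."""
    seen = set()
    unique = []
    for hit in hits:
        # Each entry under `fields` is a list, so take its first element
        url = hit.get("fields", {}).get("url", [None])[0]
        if url in seen:
            continue
        seen.add(url)
        unique.append(hit)
    return unique


# Usage (assumes `response` from the query cell):
# unique_hits = best_hit_per_url(response.body['hits']['hits'])
```

Because de-duplication can leave fewer than `k` results, you may want to request a larger `k` when using this approach.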
