# NLP text search

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb)


This notebook implements NLP text search in Elasticsearch using the Elastic blogs dataset.

You will index blog documents and generate text embeddings for them with an ingest pipeline. You will then use an NLP model to query the blog documents in natural language.

The dataset contains ~4000 blog posts from [elastic.co/blog](https://elastic.co/blog).

### Prerequisites

Before you start, make sure you have an Elasticsearch cluster running. The cluster must have at least one machine learning (ML) node with enough memory (4GB).

### Install Python requirements
Before you start, install the required Python dependencies.

In [None]:
!pip install sentence-transformers eland elasticsearch transformers torch tqdm

### Upload NLP model for querying
Using the `eland` tool, you will import the required NLP model. The model transforms your search query into a vector, which is then used to search over the documents stored in Elasticsearch.

Model used:
- [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to transform the search text into a vector

How do you get the Cloud ID? Go to [ESS cloud](https://cloud.elastic.co/logout?redirectTo=%2Fhome&reason=unauthorised) and, on the deployment overview page, copy the Cloud ID.

Authentication uses an API key (`--es-api-key`). Learn how to generate an [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key). Another option is to use a username and password.

```
$ eland_import_hub_model --cloud-id $CLOUD_ID \
  --hub-model-id sentence-transformers/all-MiniLM-L6-v2 \
  --task-type text_embedding --es-api-key $API_KEY --start
```



In [None]:
API_KEY=''
CLOUD_ID=''
!eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --es-api-key $API_KEY --start

### Import Python libraries

In [None]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
import requests
import os
import sys
import shutil
import zipfile
from tqdm.auto import tqdm
import pandas as pd
import json
from getpass import getpass

### Connect to Elasticsearch cluster
Use your own cluster details: `ELASTIC_CLOUD_ID` and `API_KEY`.

You can also authenticate with a username and password if you prefer.

In [None]:
# ESS Cloud connection definition using an API_KEY
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
API_KEY = getpass("Elastic API key: ")

# ELASTIC_CLOUD_USER = "elastic"
# ELASTIC_CLOUD_PASSWORD = getpass("Elastic Password: ")

es = Elasticsearch(
  cloud_id=ELASTIC_CLOUD_ID,
  #basic_auth=(ELASTIC_CLOUD_USER, ELASTIC_CLOUD_PASSWORD),
  api_key=API_KEY,
  request_timeout=600
)

es.info() # should return cluster info
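
The username/password alternative mentioned above can be sketched as a small helper that builds the keyword arguments for `Elasticsearch(...)`. The helper name `build_client_kwargs` and its validation rules are illustrative assumptions, not part of the notebook; `cloud_id`, `api_key`, `basic_auth`, and `request_timeout` are the client parameters already used in the cell above.

```python
def build_client_kwargs(cloud_id, api_key=None, user=None, password=None, timeout=600):
    """Return keyword arguments for Elasticsearch(); hypothetical helper."""
    if not cloud_id:
        raise ValueError("cloud_id is required")
    kwargs = {"cloud_id": cloud_id, "request_timeout": timeout}
    if api_key:
        # Preferred in this notebook: API key authentication.
        kwargs["api_key"] = api_key
    elif user and password:
        # Alternative: basic authentication with a username and password.
        kwargs["basic_auth"] = (user, password)
    else:
        raise ValueError("provide either api_key or user and password")
    return kwargs


# Usage (not executed here because it needs a reachable cluster):
# es = Elasticsearch(**build_client_kwargs(ELASTIC_CLOUD_ID, user="elastic",
#                                          password=getpass("Elastic Password: ")))
```

Keeping the choice in one place makes it harder to pass both credentials by accident or to pass none at all.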

### Upload Ingest pipeline
You need to upload the ingest pipeline that will generate vector (text) embeddings for the selected field.

The pipeline below defines an inference processor that passes the field to the NLP model uploaded earlier.

In [None]:
# ingest pipeline definition
PIPELINE_ID="vectorize_blogs"
PIPELINE_DESC="This is the description for document processing"
PIPELINE_VERSION=3

es.ingest.put_pipeline(id=PIPELINE_ID, description=PIPELINE_DESC, version=PIPELINE_VERSION,
                       processors=[{
        "inference": {
          "model_id": "sentence-transformers__all-minilm-l6-v2",
          "target_field": "text_embedding",
          "field_map": {
            "body_content_window": "text_field"
          }
        }
      }])
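
The `model_id` in the pipeline is not the Hugging Face ID that was passed to `eland_import_hub_model`. Comparing the two names used in this notebook suggests that the Elasticsearch ID is the hub ID lowercased with `/` replaced by `__`. The sketch below encodes that observation; treat it as an assumption inferred from this one example rather than eland's documented rule. Both function names are hypothetical.

```python
def hub_id_to_es_model_id(hub_model_id: str) -> str:
    """Guess the Elasticsearch model ID for a Hugging Face hub ID.

    Assumption: lowercase the ID and replace "/" with "__", which matches
    the pair of names used in this notebook.
    """
    return hub_model_id.lower().replace("/", "__")


def inference_processor(hub_model_id: str, source_field: str, target_field: str) -> dict:
    """Build an inference processor shaped like the one in the pipeline above."""
    return {
        "inference": {
            "model_id": hub_id_to_es_model_id(hub_model_id),
            "target_field": target_field,
            # Map the document field to the input name the model expects.
            "field_map": {source_field: "text_field"},
        }
    }


processor = inference_processor(
    "sentence-transformers/all-MiniLM-L6-v2", "body_content_window", "text_embedding"
)
print(processor["inference"]["model_id"])  # sentence-transformers__all-minilm-l6-v2
```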

### Create Index and mappings for Blogs
Before you can index documents into Elasticsearch, you need to create an index with the correct mappings.

In [None]:
INDEX_NAME="blogs"
DELETE_INDEX=True

INDEX_MAPPING = {
    "properties": {
      "body_content_window": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "byline": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "publish_date": {
        "type": "date"
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "url": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "text_embedding": {
        "properties": {
          "is_truncated": {
            "type": "boolean"
          },
          "model_id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "predicted_value": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "l2_norm"
          }
        }
      }
    }
  }

INDEX_SETTINGS = {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "1",
      "default_pipeline": PIPELINE_ID
    }
}

if DELETE_INDEX:
  if es.indices.exists(index=INDEX_NAME):
    print("Deleting existing %s" % INDEX_NAME)
    # options(ignore_status=...) replaces the deprecated per-call `ignore` argument
    es.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

print("Creating index %s" % INDEX_NAME)
es.options(ignore_status=[400, 404]).indices.create(
    index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)
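
The mapping above sets `"similarity": "l2_norm"` on the `dense_vector` field. The Elasticsearch `dense_vector` documentation describes the resulting `_score` as `1 / (1 + l2_norm(query, vector)^2)`. The sketch below reproduces that formula in plain Python on toy 3-dimensional vectors (the real field has 384 dimensions); the function names are illustrative.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_norm_score(query, vector):
    """Score in (0, 1]; identical vectors score 1.0, distant ones approach 0."""
    return 1.0 / (1.0 + l2_distance(query, vector) ** 2)


query = [1.0, 0.0, 0.0]
print(l2_norm_score(query, [1.0, 0.0, 0.0]))  # 1.0
print(l2_norm_score(query, [0.0, 1.0, 0.0]))  # distance sqrt(2), so about 1/3
```

A higher `_score` therefore means a smaller distance between the query embedding and the document embedding.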


### Get blogs dataset
Download:
- The example blogs dataset is available on [GitHub](https://raw.githubusercontent.com/elastic/elasticsearch-labs/tree/main/notebooks/integrations/hugging-face/2023-07-27-blogs-no-embeddings.json.zip)

Then unzip the file.

In [None]:
# Download the dataset
!wget https://raw.githubusercontent.com/elastic/elasticsearch-labs/tree/main/notebooks/integrations/hugging-face/2023-07-27-blogs-no-embeddings.json.zip -P data

In [None]:
# Unzip the file
BLOGS_ZIP_FILE="data/2023-07-27-blogs-no-embeddings.json.zip"
with zipfile.ZipFile(BLOGS_ZIP_FILE, 'r') as zip_ref:
  print('Extracting file ', BLOGS_ZIP_FILE, '.')
  zip_ref.extractall('data')

Now index the blogs. The index's `default_pipeline` setting applies the ingest pipeline to every document.

In [None]:
BLOGS_FILE="data/2023-07-27-blogs-no-embeddings.json"
df_blogs_embeddings = pd.read_json(BLOGS_FILE, lines=True)

def gen_rows(df):
  for doc in df.to_dict(orient='records'):
    yield doc

count = 0
for success, info in parallel_bulk(
        client=es,
        actions=gen_rows(df_blogs_embeddings),
        thread_count=5,
        chunk_size=200,
        timeout='%ss' % 600,
        index=INDEX_NAME
):
  if success:
    count += 1
    if count % 200 == 0:
      print('Indexed %s documents' % count, flush=True)
  else:
    print('Doc failed', info)

print('Indexed %s blogs embeddings documents' % count, flush=True)
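
`gen_rows` above yields each DataFrame record unchanged. If some records carry missing values (an assumption about the data, not something the notebook states), pandas represents them as `NaN`, which is not valid JSON and which Elasticsearch may reject. The sketch below is a hypothetical stdlib-only variant that drops such values and names the target index explicitly through the `_index` and `_source` keys understood by the bulk helpers.

```python
import math


def gen_actions(records, index_name):
    """Yield bulk actions, skipping fields whose value is None or NaN."""
    for record in records:
        source = {
            key: value
            for key, value in record.items()
            if value is not None
            and not (isinstance(value, float) and math.isnan(value))
        }
        yield {"_index": index_name, "_source": source}


sample = [{"title": "Hello", "byline": float("nan"), "url": None}]
print(list(gen_actions(sample, "blogs")))
# [{'_index': 'blogs', '_source': {'title': 'Hello'}}]
```

With the DataFrame from this notebook it could be called as `gen_actions(df_blogs_embeddings.to_dict(orient='records'), INDEX_NAME)` and passed to `parallel_bulk` as `actions`.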

### Query the dataset
The next step is to run a query to search for relevant blogs. The example query searches for `"model_text": "how to do image similarity search"` using the model we uploaded to Elasticsearch, `sentence-transformers__all-minilm-l6-v2`.

The process is a single query even though it internally consists of two tasks. The first transforms your search text into a vector using the NLP model; the second runs the vector search over the dataset.

```
POST blogs/_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "k": 10,
    "num_candidates": 50,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "how to do image similarity search"
      }
    }
  },
  "fields": [
    "body_content_window",
    "id",
    "title",
    "url"
  ],
  "_source": false
}
```
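
The two tasks described above can be imitated in a few lines of plain Python. This is only a toy: `toy_embed` is a made-up word-count embedding standing in for `all-MiniLM-L6-v2`, the documents are invented, and the search is exact brute force rather than the approximate kNN that Elasticsearch performs.

```python
import math

VOCAB = ["image", "similarity", "search", "logs", "metrics"]  # toy vocabulary


def toy_embed(text):
    """Task 1 stand-in: turn text into a vector of word counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def knn_search(query_vector, doc_vectors, k):
    """Task 2 stand-in: return the k document IDs closest by L2 distance."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: math.dist(query_vector, doc_vectors[doc_id]),
    )
    return ranked[:k]


docs = {
    "a": "image similarity search with vectors",
    "b": "shipping logs and metrics",
    "c": "search your logs",
}
doc_vectors = {doc_id: toy_embed(text) for doc_id, text in docs.items()}
query_vector = toy_embed("how to do image similarity search")
print(knn_search(query_vector, doc_vectors, k=2))  # ['a', 'c']
```

In the real query, `query_vector_builder` plays the role of `toy_embed` and the `knn` clause plays the role of `knn_search`, both inside one request.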



In [None]:
INDEX_NAME="blogs"

source_fields = ["body_content_window", "id", "title", "url"]

query = {
  "field": "text_embedding.predicted_value",
  "k": 10,
  "num_candidates": 50,
  "query_vector_builder": {
    "text_embedding": {
      "model_id": "sentence-transformers__all-minilm-l6-v2",
      "model_text": "how to do image similarity search"
    }
  }
}

response = es.search(
    index=INDEX_NAME,
    fields=source_fields,
    knn=query,
    source=False)

print(response.body)

results = pd.json_normalize(response.body['hits']['hits'])
results[['_id', '_score', 'fields.title', 'fields.body_content_window', 'fields.url']]
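
Values requested through `fields` come back as lists, even for single-valued fields, so the DataFrame cells above contain lists. The sketch below flattens hits into plain rows; `flatten_hits` is a hypothetical helper and the sample hit is invented to mimic the response shape.

```python
def flatten_hits(hits, field_names):
    """Turn search hits into flat dicts, unwrapping single-element field lists."""
    rows = []
    for hit in hits:
        row = {"_id": hit.get("_id"), "_score": hit.get("_score")}
        fields = hit.get("fields", {})
        for name in field_names:
            values = fields.get(name, [])
            # Keep multi-valued fields as lists; unwrap the common single-value case.
            row[name] = values[0] if len(values) == 1 else (values or None)
        rows.append(row)
    return rows


sample_hits = [
    {"_id": "1", "_score": 0.61, "fields": {"title": ["A blog"], "url": ["/blog/a"]}}
]
print(flatten_hits(sample_hits, ["title", "url", "id"]))
# [{'_id': '1', '_score': 0.61, 'title': 'A blog', 'url': '/blog/a', 'id': None}]
```

`pd.DataFrame(flatten_hits(response.body['hits']['hits'], source_fields))` would then give one scalar per cell.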
