feat: semantic search on documents #812

tias112 · 2023-10-24T09:41:41Z

No description provided.

# Conflicts: # search/docker-compose.yml # search/search/es.py # search/search/harvester.py # search/search/schemas/pieces.py

BREAKING CHANGE: changes docker-compose

khyurri · 2023-11-22T09:31:36Z

.env.example

@@ -32,6 +32,9 @@ S3_PREFIX=
 S3_ACCESS_KEY=minioadmin
 S3_SECRET_KEY=minioadmin

+S3_START_PATH=annotation


What exactly do S3_START_PATH and S3_TEXT_PATH do? Are they required for all services?

These are folder names in MinIO. I guess a service that works directly with S3 needs to know them.

@khyurri
I have cleaned up all unnecessary properties. Here's the scope of changes:

moved search configs from root env to search env

extracted embeddings service

adjusted docker-compose & README notes

khyurri · 2023-11-22T09:33:01Z

.env.example

+MANIFEST=manifest.json
+TEXT_PIECES_PATH=/pieces
+INDEXATION_PATH=/indexation
+JOBS_URL=http://jobs


Use JOBS_SERVICE_* in the search microservice to connect to the jobs microservice.

khyurri · 2023-11-22T09:33:29Z

.env.example

+ES_HOST=badgerdoc-elasticsearch
+ES_PORT=9200
+APP_TITLE=Badgerdoc Search
+ANNOTATION_URL=http://annotation


Use ANNOTATION_SERVICE_* in the search microservice to connect to the annotation microservice.

khyurri · 2023-11-22T09:34:33Z

.env.example

+# Search Service
+ES_HOST=badgerdoc-elasticsearch
+ES_PORT=9200
+APP_TITLE=Badgerdoc Search


Do we need APP_TITLE in the search microservice, or is it BadgerDoc-wide configuration?

khyurri · 2023-11-22T09:36:17Z

.env.example

+ES_PORT=9200
+APP_TITLE=Badgerdoc Search
+ANNOTATION_URL=http://annotation
+ANNOTATION_CATEGORIES=/categories


You can move the URI paths inside the search microservice; we really don't need to configure paths globally for all services.

OK, I agree. Am I right that putting it in the microservice's own .env is enough?
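One way to read the suggestion above is sketched below: only the base URL stays configurable, and the endpoint paths become constants owned by the search service. The values are taken from the diff; the helper name `annotation_endpoint` is hypothetical and not part of the PR.

```python
import os
from urllib.parse import urljoin

# Only the base URL is read from the environment; the default mirrors the diff.
ANNOTATION_URL = os.environ.get("ANNOTATION_URL", "http://annotation")

# Endpoint paths live in the search service's code instead of the root .env.
ANNOTATION_CATEGORIES = "/categories"
ANNOTATION_CATEGORIES_SEARCH = "/categories/search"


def annotation_endpoint(path: str, base_url: str = ANNOTATION_URL) -> str:
    """Join the configured base URL with a service-local endpoint path."""
    # Normalise slashes so urljoin appends the path rather than replacing it.
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))
```

With this layout a deployment only needs to override ANNOTATION_URL; the paths change together with the code that calls them.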

khyurri · 2023-11-22T09:36:42Z

.env.example

+ANNOTATION_URL=http://annotation
+ANNOTATION_CATEGORIES=/categories
+ANNOTATION_CATEGORIES_SEARCH=/categories/search
+MANIFEST=manifest.json


Why does MANIFEST exist globally? Is it used only by the search microservice?

khyurri · 2023-11-22T09:38:43Z

.env.example

+JOBS_URL=http://jobs
+JOBS_SEARCH=/jobs/search
+COMPUTED_FIELDS=["job_id", "category"]
+EMBED_URL=http://embeddings:3334/api/use


If EMBED is a new microservice, make sure its configuration is aligned with the other host configurations, for example:

EMBEDINGS_SERVICE_SCHEME=http
EMBEDINGS_SERVICE_HOST=badgerdoc-embeddings
EMBEDINGS_SERVICE_PORT=8080
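A minimal sketch of how a service could consume variables in the SCHEME/HOST/PORT shape suggested above. The helper `service_url` is hypothetical, and the prefix is spelled exactly as in the review comment; the same helper would presumably serve the JOBS_SERVICE_* and ANNOTATION_SERVICE_* variables as well.

```python
import os
from typing import Mapping


def service_url(prefix: str, env: Mapping[str, str] = os.environ) -> str:
    """Build a base URL from <PREFIX>_SCHEME, <PREFIX>_HOST and <PREFIX>_PORT."""
    scheme = env.get(f"{prefix}_SCHEME", "http")
    host = env[f"{prefix}_HOST"]  # required: raises KeyError if it is missing
    port = env.get(f"{prefix}_PORT")
    return f"{scheme}://{host}:{port}" if port else f"{scheme}://{host}"


# Example values copied from the review comment.
example_env = {
    "EMBEDINGS_SERVICE_SCHEME": "http",
    "EMBEDINGS_SERVICE_HOST": "badgerdoc-embeddings",
    "EMBEDINGS_SERVICE_PORT": "8080",
}
embeddings_base_url = service_url("EMBEDINGS_SERVICE", example_env)
```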

khyurri · 2023-11-22T09:42:21Z

docker-compose-dev.yaml

@@ -268,6 +270,65 @@ services:
    devices:
      - "/dev/fuse"

+  badgerdoc-elasticsearch:
+    container_name: badgerdoc-elasticsearch
+    image: amazon/opendistro-for-elasticsearch:latest


Better to pin an exact version here instead of latest.

khyurri · 2023-11-22T09:46:11Z

docker-compose-dev.yaml

@@ -11,6 +11,7 @@
 #   file for local needs)
 # - 8082 for Keycloak.
 # - 8083 for BadgerDoc web
+#   8084 for BadgerDoc search


What exactly is exposed on 8084?

This is the search API microservice.

The web frontend talks to that service via the reverse proxy.

khyurri · 2023-11-22T09:47:15Z

docker-compose-dev.yaml

+    networks:
+      - badgerdoc
+    ports:
+      - "0.0.0.0:3334:3334"


Why do you expose the port here? Do we really need this service to be reachable outside of reproxy?

khyurri · 2023-11-22T09:48:30Z

docker-compose-dev.yaml

+      - ROOT_PATH=/search
+    working_dir: /opt/search
+    ports:
+      - 8084:8080


All microservices are proxied through reproxy so that they all keep the same port. You need to remove the port bindings here and add EXPOSE to the Dockerfile.

khyurri · 2023-11-22T09:49:27Z

search/create_dataset.py

+import csv
+from opensearchpy import OpenSearch, helpers
+
+PATH_PRODUCTS_DATASET = "data/"


All hardcoded values must be moved to the service configuration or the global configuration.
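As a rough sketch of that request, the script's constants could be collected in one place and read from the environment. `DatasetConfig` and `load_config` are hypothetical names, and using PATH_PRODUCTS_DATASET as an environment variable is an assumption; the defaults are the values visible in the diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatasetConfig:
    dataset_path: str
    es_host: str
    es_port: int


def load_config(env: Mapping[str, str] = os.environ) -> DatasetConfig:
    """Read script settings from the environment, falling back to the diff's values."""
    return DatasetConfig(
        dataset_path=env.get("PATH_PRODUCTS_DATASET", "data/"),
        es_host=env.get("ES_HOST", "badgerdoc-elasticsearch"),
        es_port=int(env.get("ES_PORT", "9200")),  # env values are strings
    )
```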

khyurri · 2023-11-22T09:50:20Z

search/create_dataset.py

+annotation_dataset = load_annotation_dataset()
+
+#### Use the embedding model to calculate vectors for all annotation texts
+print("Computing embeddings for %d sentences" % len(annotation_dataset))


Use logging instead of print
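For illustration, the print call from the diff could be replaced roughly as follows. The function name `report_dataset_size` is hypothetical; the message text is the one from the PR.

```python
import logging

logger = logging.getLogger(__name__)


def report_dataset_size(annotation_dataset) -> None:
    # Lazy %-style arguments: the message is formatted only if INFO is enabled.
    logger.info("Computing embeddings for %d sentences", len(annotation_dataset))


if __name__ == "__main__":
    # Configure logging once at the entry point, not inside library code.
    logging.basicConfig(level=logging.INFO)
    report_dataset_size(["first sentence", "second sentence"])
```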

khyurri · 2023-11-22T09:50:57Z

search/create_dataset.py

+es = OpenSearch([{"host": ES_HOST, "port": ES_PORT}])
+
+#### load test data set
+annotation_dataset = load_annotation_dataset()


Is it expected that we load annotation_dataset globally here?
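If the module-level load is not intended, one common alternative is to move the work into a function behind a `__main__` guard, so that importing the module has no side effects. In this sketch `load_annotation_dataset` is a stand-in, since the PR's loader body is not shown in the diff, and `main` is a hypothetical entry point.

```python
from typing import Callable, List


def load_annotation_dataset() -> List[str]:
    # Stand-in for the PR's loader, whose body is not shown in the diff.
    return ["first annotation text", "second annotation text"]


def main(loader: Callable[[], List[str]] = load_annotation_dataset) -> int:
    """Load the dataset on demand and return the number of texts handled."""
    annotation_dataset = loader()
    # Embedding computation and indexing would go here.
    return len(annotation_dataset)


if __name__ == "__main__":
    main()
```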

khyurri · 2023-11-22T09:52:04Z

search/embeddings/main.py

@@ -0,0 +1,66 @@
+from flask import request


We don't use Flask in BadgerDoc; use FastAPI instead.

khyurri

Very good job and great progress; however, the PR must be changed to comply with the current BadgerDoc development requirements.
Thank you!

tias112 added 12 commits October 24, 2023 11:40

draft vectors

e81ddf3

draft vectors

5624eef

Merge remote-tracking branch 'origin/add-vectors' into add-vectors

41335a2

# Conflicts: # search/docker-compose.yml # search/search/es.py # search/search/harvester.py # search/search/schemas/pieces.py

build(es.py): support for vectors

381c9a6

BREAKING CHANGE: changes docker-compose

feature: add question/answer search API

11a4618

feature: add image for embeddings

c7c421e

feature: add image for embeddings

ade006c

feature: add image for embeddings

332c54e

feature: add image for embeddings

8b38563

feature: add search in documents

05f6155

feature: add search in documents

6ac5019

feature: add search in documents

0e3c0b3

tias112 changed the title ~~draft: vectors~~ feat: semantic search on documents Nov 21, 2023

khyurri self-requested a review November 22, 2023 09:30

khyurri reviewed Nov 22, 2023

khyurri requested changes Nov 22, 2023

tias112 added 5 commits November 26, 2023 18:27

feat: refactoring

bf4c69c

feature: add search in documents

a9474e7

feature: add embedding search

e43e98d

feature: add embedding search

bb3c1bd

feature: add embedding search

946ef13
