Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: semantic search on documents #812

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Conversation

tias112
Copy link

@tias112 tias112 commented Oct 24, 2023

No description provided.

@tias112 tias112 changed the title draft: vectors feat: semantic search on documents Nov 21, 2023
@khyurri khyurri self-requested a review November 22, 2023 09:30
.env.example Outdated
@@ -32,6 +32,9 @@ S3_PREFIX=
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin

S3_START_PATH=annotation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly these S3_START_PATH and S3_TEXT_PATH are doing? Is it required for all services?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are folder names in Minio. I guess if service works directly with S3 it should have known it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khyurri
I have cleaned up all unnecessary properties. here's the scope of changes:

  • moved search configs from root env to search env
  • extracted embeddings service
  • adjust docker-compose & README notes

.env.example Outdated
MANIFEST=manifest.json
TEXT_PIECES_PATH=/pieces
INDEXATION_PATH=/indexation
JOBS_URL=http://jobs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use JOBS_SERVICE_* in search microservice to connect co jobs microservices

.env.example Outdated
ES_HOST=badgerdoc-elasticsearch
ES_PORT=9200
APP_TITLE=Badgerdoc Search
ANNOTATION_URL=http://annotation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ANNOTATION_SERVICE_* in search microservice to connect to annotation microservice

.env.example Outdated
# Search Service
ES_HOST=badgerdoc-elasticsearch
ES_PORT=9200
APP_TITLE=Badgerdoc Search
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use APP_TITLE in search microservice? Or it's badgerdoc configuration?

.env.example Outdated
ES_PORT=9200
APP_TITLE=Badgerdoc Search
ANNOTATION_URL=http://annotation
ANNOTATION_CATEGORIES=/categories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move URI inside search microservice, we really don't need to configure path for all services globally

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'd agree. Am I right to put in .env for microservice is enough?

.env.example Outdated
ANNOTATION_URL=http://annotation
ANNOTATION_CATEGORIES=/categories
ANNOTATION_CATEGORIES_SEARCH=/categories/search
MANIFEST=manifest.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this MANIFEST exists globally? Is it for search microservice only?

.env.example Outdated
JOBS_URL=http://jobs
JOBS_SEARCH=/jobs/search
COMPUTED_FIELDS=["job_id", "category"]
EMBED_URL=http://embeddings:3334/api/use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If EMBED is new microservice, ensure configuration aligned with other host configuration, for example:

EMBEDINGS_SERVICE_SCHEME=http
EMBEDINGS_SERVICE_HOST=badgerdoc-embeddings
EMBEDINGS_SERVICE_PORT=8080

@@ -268,6 +270,65 @@ services:
devices:
- "/dev/fuse"

badgerdoc-elasticsearch:
container_name: badgerdoc-elasticsearch
image: amazon/opendistro-for-elasticsearch:latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to bind exact version here

@@ -11,6 +11,7 @@
# file for local needs)
# - 8082 for Keycloak.
# - 8083 for BadgerDoc web
# 8084 for BadgerDoc search
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly exposed in 8084?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is search API microservice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

web talks to that service via RP

networks:
- badgerdoc
ports:
- "0.0.0.0:3334:3334"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you expose port here? Do we really need this service outside of reproxy ?

- ROOT_PATH=/search
working_dir: /opt/search
ports:
- 8084:8080
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All microservices proxies from reproxy to keep the same port for all microservices. You need to remove ports bindings here and add EXPOSE to dockerfile

import csv
from opensearchpy import OpenSearch, helpers

PATH_PRODUCTS_DATASET = "data/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All hardcode must be moved to configuration or global configuration

annotation_dataset = load_annotation_dataset()

#### Use the embedding model to calculate vectors for all annotation texts
print("Computing embeddings for %d sentences" % len(annotation_dataset))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use logging instead of print

es = OpenSearch([{"host": ES_HOST, "port": ES_PORT}])

#### load test data set
annotation_dataset = load_annotation_dataset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected, that we load annotation_dataset globally here?

@@ -0,0 +1,66 @@
from flask import request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use flask in Badgerdoc, use Fast API instead

Copy link
Contributor

@khyurri khyurri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job and great progress, however PR must be changed according current Badgerdoc requirements for development!
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants