# Search Vector Store

---
이 노트북은 OpenSearch 벡터 스토어에 저장되어 있는 문서를 검색하기 위한 노트북 입니다.
---

# 1. Bedrock Client 생성

In [3]:
%load_ext autoreload
%autoreload 2

import sys, os
# module_path = "../../utils"
module_path = "../.."
sys.path.append(os.path.abspath(module_path))
print(os.path.abspath(module_path))

/home/sagemaker-user/aws-ai-ml-workshop-kr/genai/aws-gen-ai-kr/20_applications/02_qa_chatbot/01_preprocess_docs


In [4]:
import json
import boto3
from pprint import pprint
from termcolor import colored
from utils import bedrock, print_ww
from utils.bedrock import bedrock_info

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    endpoint_url=os.environ.get("BEDROCK_ENDPOINT_URL", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None),
)

print(colored("\n== FM lists ==", "green"))
pprint(bedrock_info.get_list_fm_models())

Create new client
  Using region: us-east-1
  Using profile: None
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)
[32m
== FM lists ==[0m
{'Claude-Instant-V1': 'anthropic.claude-instant-v1',
 'Claude-V1': 'anthropic.claude-v1',
 'Claude-V2': 'anthropic.claude-v2',
 'Command': 'cohere.command-text-v14',
 'Jurassic-2-Mid': 'ai21.j2-mid-v1',
 'Jurassic-2-Ultra': 'ai21.j2-ultra-v1',
 'Llama2-13b-Chat': 'meta.llama2-13b-chat-v1',
 'Titan-Embeddings-G1': 'amazon.titan-embed-text-v1',
 'Titan-Text-G1': 'TBD'}


# 2. Embedding 모델 로딩

## Embedding Model 선택

In [5]:
Use_Titan_Embedding = True
Use_Cohere_English_Embedding = False

## Embedding Model 로딩

In [6]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

if Use_Titan_Embedding:
    llm_emb = BedrockEmbeddings(client=boto3_bedrock, model_id = "amazon.titan-embed-text-v1")
    dimension = 1536
elif Use_Cohere_English_Embedding:
    llm_emb = BedrockEmbeddings(client=boto3_bedrock, model_id = "cohere.embed-english-v3")    
    dimension = 1024
else:
    lim_emb = None

llm_emb

BedrockEmbeddings(client=<botocore.client.BedrockRuntime object at 0x7f94189d3880>, region_name=None, credentials_profile_name=None, model_id='amazon.titan-embed-text-v1', model_kwargs=None, endpoint_url=None, normalize=False)

# 3. Load all Json files

In [7]:
from utils.proc_docs import get_load_json, show_doc_json

In [8]:
import glob

# Specify the directory and file pattern for .txt files
folder_path = 'data/poc/preprocessed_json/all_processed_data.json'

# List all .txt files in the specified folder
json_files = glob.glob(folder_path)
# json_files = ['data/poc/customer_EFOTA.json']

# Load each item per json file and append to a list
doc_json_list = []
for file_path in json_files:
    doc_json = get_load_json(file_path)
    doc_json_list.append(doc_json)

print("all json files: ", len(doc_json_list))    
# Flatten the list of lists into a single list
all_docs = []
for item in doc_json_list:
        all_docs.extend(item)
        
print("all items: ", len(all_docs))

all json files:  0
all items:  0


# 4. Index 생성

## Index 이름 결정

In [9]:
index_name = "v15-genai-poc-knox-parent-doc-retriever"

# 5. LangChain OpenSearch VectorStore 생성 
## 선수 조건


## 오픈 서치 도메인 및 인증 정보 세팅

- [langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html)

#### [중요] 아래에 aws parameter store 에 아래 인증정보가 먼저 입력되어 있어야 합니다.

In [10]:
from utils.proc_docs import get_parameter

In [11]:
import boto3
ssm = boto3.client('ssm', 'us-east-1')

opensearch_domain_endpoint = get_parameter(
    boto3_clinet = ssm,
    parameter_name = 'knox_opensearch_domain_endpoint',
)

opensearch_user_id = get_parameter(
    boto3_clinet = ssm,
    parameter_name = 'knox_opensearch_userid',
)

opensearch_user_password = get_parameter(
    boto3_clinet = ssm,
    parameter_name = 'knox_opensearch_password',
)


In [12]:
opensearch_domain_endpoint = opensearch_domain_endpoint
rag_user_name = opensearch_user_id
rag_user_password = opensearch_user_password

http_auth = (rag_user_name, rag_user_password) # Master username, Master password

## OpenSearch Client 생성

In [13]:
from utils.opensearch import opensearch_utils

In [14]:
aws_region = os.environ.get("AWS_DEFAULT_REGION", None)

os_client = opensearch_utils.create_aws_opensearch_client(
    aws_region,
    opensearch_domain_endpoint,
    http_auth
)

In [15]:
from utils.opensearch import opensearch_utils

## 랭체인 인덱스 연결 오브젝트 생성

- [langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html)

In [16]:
from langchain.vectorstores import OpenSearchVectorSearch

In [17]:
vector_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth, # http_auth
    is_aoss =False,
    engine="faiss",
    space_type="l2",
    bulk_size=100000,
    timeout=60    
)
vector_db

<langchain_community.vectorstores.opensearch_vector_search.OpenSearchVectorSearch at 0x7f93f6f9d3f0>

삽입된 Parent Chunk 의 첫번째를 확인 합니다. family_tree, parent_id 의 값을 확인 하세요.

In [18]:
def show_opensearch_doc_info(response):
    print("opensearch document id:" , response["_id"])
    print("family_tree:" , response["_source"]["metadata"]["family_tree"])
    print("parent document id:" , response["_source"]["metadata"]["parent_id"])
    print("parent document text: \n" , response["_source"]["text"])



# 7. 검색 테스트

## Lexical 검색

In [19]:
q = "'how to add image"
query ={'query': 
        {'bool': {'must': 
                  [{'match': 
                    {'text': 
                     {'query': "{q}", 'minimum_should_match': '0%', 'operator': 'or'}}}], 
                  'filter': {
                    "term": {
                      "metadata.family_tree": "parent"
                    }                      
                  }
                 }
        }
       }
pprint(query)

{'query': {'bool': {'filter': {'term': {'metadata.family_tree': 'parent'}},
                    'must': [{'match': {'text': {'minimum_should_match': '0%',
                                                 'operator': 'or',
                                                 'query': '{q}'}}}]}}}


In [20]:
# query = "how to add image"
# query = opensearch_utils.get_query(
#     query=query
# )

response = opensearch_utils.search_document(os_client, query, index_name)
opensearch_utils.parse_keyword_response(response, show_size=3)

# of searched docs:  10
# of display: 3
---------------------
_id in index:  005475dd-ad31-4b5e-b311-923992779d59
9.430416
. Q - When I renew an expired license, does the license come into effect and enroll the relevant devices immediately? A - For iOS and Windows devices, they are enrolled immediately. For Android devices, they can be enrolled according to the schedule set on the system or by sending a device command. Q - When a license expires, does it expire in the order the devices were enrolled? A - When a license being used on various devices expires, all the devices become unable to use Knox Manage simultaneously. Q - If I want to allocate the renewed licenses only to new devices (not already existing registered devices), what should I do? A - After increasing the number of licenses, you should first unenroll the existing registered devices, and then enroll the new devices. The renewed licenses will then be allocated to the new devices. Q - When a factory reset command is sent f

## Only Filter 검색

In [24]:
def search_filter(how, field, value, family_tree="parent", verbose=False):
    '''
    1. Search URL
        url = "https://docs.samsungknox.com/admin/knox-guard"
        search_filter(how="match_phrase", field="metadata.url", value=url)    
    2. Title        
        title = "Get support"
        query = search_filter(how="match_phrase", field="metadata.title", value=title)    
    3. Project
        project = "KG"
        query = search_filter(how="term", field="metadata.project", value=project)    
    '''
    query = {
        "query": {
            "bool": {
                "must" : {
                    how: {field : value}
                },
                "filter": {
                    "term": {"metadata.family_tree" : family_tree}
                }                                
            }
        }
    }   
    
    if verbose:
        print("query:")    
        print(query)

    return query


# url = "https://docs.samsungknox.com/admin/knox-mobile-enrollment/how-to-guides/manage-devices/enroll-devices/"
url = "https://docs.samsungknox.com/admin/knox-manage/kbas/kba-472-upload-vpn-certificates-with-ksp-in-km"
query = search_filter(how="match_phrase", field="metadata.url", value=url, family_tree= "parent", verbose=True)    


response = opensearch_utils.search_document(os_client, query, index_name)
opensearch_utils.parse_keyword_response(response, show_size=30)


query:
{'query': {'bool': {'must': {'match_phrase': {'metadata.url': 'https://docs.samsungknox.com/admin/knox-manage/kbas/kba-472-upload-vpn-certificates-with-ksp-in-km'}}, 'filter': {'term': {'metadata.family_tree': 'parent'}}}}}
# of searched docs:  1
# of display: 30
---------------------
_id in index:  8ee2f844-0d16-4e86-8fb6-efe850d406b4
43.076317
How to upload certificates for VPN with KSP in Knox Manage. Knox Service Plugin provides the option of installing your certificates for VPN connection silently in the device keystore. This article will explain how to prepare and upload certificate data correctly with KSP. How to upload certificates for VPN with KSP in Knox Manage? Normally, in VPN configs you need to provide the VPN client with a user certificate. This may be signed by a CA and if it's not a public CA, then you also need to add the CA. How to install CA cert (pem encoded X509 certificate) with KSP 1. Open your pem file with an editor. 2. Copy the whole contents except th

## 시맨틱 검색

In [23]:
vector_db.similarity_search(q, k=2)

[Document(page_content='. You can add up to 10 image files in the PNG, JPG, JPEG, or GIF format (animated files are not supported). Each image file must be less than 5 MB. To upload an image file, click Add and select a file. To delete an image file, click next to the name of the uploaded image file. Note The device control command must be transferred to the device to apply an image file to it. &gt;&gt;&gt; Video Select a video file for the screen saver. You can add only one video file in the MP4 or MKV format. The video file must be less than 50 MB. To upload a video file, click Add and select a file. To delete a video file, click next to the name of the uploaded video file. Note The device control command must be transferred to the device to apply a video to it. &gt; Session timeout Allows the use of the session timeout feature for the Kiosk Browser', metadata={'source': 'all_processed_data.json', 'seq_num': 911, 'title': 'Android Enterprise policies', 'url': 'https://docs.samsungkno