# OpenSearch Warming Up (Using the Nori Analyzer)
> This notebook was tested on SageMaker Studio with the Data Science 3.0 kernel and an ml.t3.medium instance.



This notebook assumes that OpenSearch is already installed and shows how to use the Korean morphological analyzer (Nori).

---
## Ref: 
- [Amazon OpenSearch Service, 한국어 분석을 위한 ‘노리(Nori)’ 플러그인 활용](https://aws.amazon.com/ko/blogs/tech/amazon-opensearch-service-korean-nori-plugin-for-analysis/)
- [Amazon OpenSearch Service로 검색 구현하기](https://catalog.us-east-1.prod.workshops.aws/workshops/de4e38cb-a0d9-4ffe-a777-bf00d498fa49/ko-KR/indexing/blog-reindex)
- [OpenSearch Python Client](https://opensearch.org/docs/1.3/clients/python-high-level/)
- [nori_part_of_speech token filter](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori-speech.html)
- [Elasticsearch를 검색 엔진으로 사용하기(1): Nori 한글 형태소 분석기로 검색 고도화 하기](https://hanamon.kr/elasticsearch-%EA%B2%80%EC%83%89%EC%97%94%EC%A7%84-nori-%ED%98%95%ED%83%9C%EC%86%8C-%EB%B6%84%EC%84%9D%EA%B8%B0-%EA%B2%80%EC%83%89-%EA%B3%A0%EB%8F%84%ED%99%94-%EB%B0%A9%EB%B2%95/)

# 1. Environment Setup

In [2]:
%load_ext autoreload
%autoreload 2

import sys, os
module_path = "../"
sys.path.append(os.path.abspath(module_path))
print("module_path: ", os.path.abspath(module_path))
from utils import print_ww

module_path:  /root/aws-ai-ml-workshop-kr/genai/aws-gen-ai-kr/20_applications/02_qa_chatbot/01_preprocess_docs


# 2. Create the Bedrock Client

In [3]:
import json
import boto3
from pprint import pprint
from termcolor import colored
from utils import bedrock, print_ww
from utils.bedrock import bedrock_info

# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

# os.environ["AWS_DEFAULT_REGION"] = "<REGION_NAME>"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."
# os.environ["BEDROCK_ENDPOINT_URL"] = "<YOUR_ENDPOINT_URL>"  # E.g. "https://..."


boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    endpoint_url=os.environ.get("BEDROCK_ENDPOINT_URL", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None),
)

print(colored("\n== FM lists ==", "green"))
pprint(bedrock_info.get_list_fm_models())

Create new client
  Using region: us-east-1
  Using profile: None
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)

== FM lists ==
{'Claude-Instant-V1': 'anthropic.claude-instant-v1',
 'Claude-V1': 'anthropic.claude-v1',
 'Claude-V2': 'anthropic.claude-v2',
 'Command': 'cohere.command-text-v14',
 'Jurassic-2-Mid': 'ai21.j2-mid-v1',
 'Jurassic-2-Ultra': 'ai21.j2-ultra-v1',
 'Llama2-13b-Chat': 'meta.llama2-13b-chat-v1',
 'Titan-Embeddings-G1': 'amazon.titan-embed-text-v1',
 'Titan-Text-G1': 'TBD'}


# 3. Load the Titan Embedding Model

In [4]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

llm_emb = BedrockEmbeddings(client=boto3_bedrock)
llm_emb

BedrockEmbeddings(client=<botocore.client.BedrockRuntime object at 0x7f1867cb56c0>, region_name=None, credentials_profile_name=None, model_id='amazon.titan-embed-text-v1', model_kwargs=None, endpoint_url=None)

# 4. Create the OpenSearch Client

## Set the OpenSearch Domain and Credentials

- [langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html)

In [5]:
from utils.proc_docs import get_parameter

In [6]:
import boto3
ssm = boto3.client('ssm', 'us-east-1')

opensearch_domain_endpoint = get_parameter(
    boto3_client = ssm,
    parameter_name = 'knox_opensearch_domain_endpoint',
)

opensearch_user_id = get_parameter(
    boto3_client = ssm,
    parameter_name = 'knox_opensearch_userid',
)

opensearch_user_password = get_parameter(
    boto3_client = ssm,
    parameter_name = 'knox_opensearch_password',
)
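
`utils.proc_docs.get_parameter` is not shown in this notebook. As a rough sketch, assuming it simply wraps the SSM Parameter Store `get_parameter` API, a helper along these lines would behave the same way. The `with_decryption` argument name is an assumption made for illustration.

```python
def get_parameter(boto3_client, parameter_name, with_decryption=True):
    """Return the value of an SSM Parameter Store entry.

    Hypothetical stand-in for utils.proc_docs.get_parameter;
    the real helper may differ.
    """
    response = boto3_client.get_parameter(
        Name=parameter_name,
        WithDecryption=with_decryption,  # required to read SecureString values
    )
    return response["Parameter"]["Value"]
```

Keeping the endpoint and credentials in Parameter Store means they never have to appear in the notebook itself.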


In [7]:
rag_user_name = opensearch_user_id
rag_user_password = opensearch_user_password

http_auth = (rag_user_name, rag_user_password)  # Master username, master password

In [8]:
from utils.opensearch import opensearch_utils

In [9]:
aws_region = os.environ.get("AWS_DEFAULT_REGION", None)

os_client = opensearch_utils.create_aws_opensearch_client(
    aws_region,
    opensearch_domain_endpoint,
    http_auth
)
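
The body of `opensearch_utils.create_aws_opensearch_client` is not shown here. The sketch below is one plausible way to build such a client with `opensearch-py` and basic authentication; it is an assumption, not the helper's actual code. `build_client_kwargs` and `create_client` are hypothetical names, and the sketch assumes the stored endpoint may or may not carry an `https://` prefix.

```python
def build_client_kwargs(opensearch_domain_endpoint, http_auth):
    """Assemble keyword arguments for opensearchpy.OpenSearch (hypothetical helper)."""
    # Strip the scheme and any trailing slash so only the host name remains.
    host = opensearch_domain_endpoint.replace("https://", "").rstrip("/")
    return {
        "hosts": [{"host": host, "port": 443}],
        "http_auth": http_auth,  # (username, password) tuple
        "use_ssl": True,
        "verify_certs": True,
    }


def create_client(opensearch_domain_endpoint, http_auth):
    """Create an OpenSearch client; requires the opensearch-py package."""
    from opensearchpy import OpenSearch, RequestsHttpConnection

    return OpenSearch(
        connection_class=RequestsHttpConnection,
        **build_client_kwargs(opensearch_domain_endpoint, http_auth),
    )
```

With basic authentication the region is not needed to sign requests, which is why this sketch omits the `aws_region` argument that the notebook's helper accepts.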

# 5. Create the Default Index
- The index is kept simple: a `text` field analyzed with Nori, a `vector_field` for embeddings, and a `metadata` object.

In [10]:
from utils.rag import create_aws_opensearch_client, check_if_index_exists, delete_index
from utils.rag import create_index, add_doc, search_document

## Define the Index Name

In [11]:
index_name = 'sm-poc-konx-warming-up-nori-index'

## Delete the Existing Index

In [12]:

index_exists = opensearch_utils.check_if_index_exists(
    os_client,
    index_name
)

if index_exists:
    opensearch_utils.delete_index(
        os_client,
        index_name
    )
else:
    print("Index does not exist")

index_name=sm-poc-konx-warming-up-nori-index, exists=True

Deleting index:
{'acknowledged': True}
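
The check-then-delete pattern above can also be written directly against the `opensearch-py` client. This is a minimal sketch, assuming `client` is an `opensearchpy.OpenSearch` instance; `delete_index_if_exists` is a hypothetical name, not a function from `opensearch_utils`.

```python
def delete_index_if_exists(client, index_name):
    """Delete index_name if it exists; return True when a delete was issued."""
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
        return True
    return False
```

Returning a boolean lets the caller decide whether to print a message such as "Index does not exist".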


## Define the Index Schema

In [13]:
index_body = {
    'settings': {
        'analysis': {
            'analyzer': {
                'my_analyzer': {
                    'type': 'custom',
                    'char_filter': ['html_strip'],
                    'tokenizer': 'nori',
                    'filter': [
                        'nori_number',
                        'lowercase',
                        'trim',
                        'my_nori_part_of_speech'
                    ]
                }
            },
            'tokenizer': {
                'nori': {
                    'type': 'nori_tokenizer',
                    'decompound_mode': 'mixed',
                    'discard_punctuation': 'true'
                }
            },
            'filter': {
                'my_nori_part_of_speech': {
                    'type': 'nori_part_of_speech',
                    'stoptags': [
                        'E', 'IC', 'J', 'MAG', 'MAJ',
                        'MM', 'SP', 'SSC', 'SSO', 'SC',
                        'SE', 'XPN', 'XSA', 'XSN', 'XSV',
                        'UNA', 'NA', 'VSV'
                    ]
                }
            }
        },
        'index': {
            'knn': True,
            'knn.space_type': 'cosinesimil'  # Example space type
        }
    },
    'mappings': {
        'properties': {
            'metadata': {
                'properties': {
                    'source': {'type': 'keyword'},
                    'last_updated': {'type': 'date'},
                    'project': {'type': 'keyword'},
                    'seq_num': {'type': 'long'},
                    'title': {'type': 'text'},  # For full-text search
                    'url': {'type': 'text'}  # For full-text search
                }
            },
            'text': {
                'type': 'text',
                'analyzer': 'my_analyzer',
                'search_analyzer': 'my_analyzer'
            },
            'vector_field': {
                'type': 'knn_vector',
                'dimension': 1536  # Replace with your vector dimension
            }
        }
    }
}


In [14]:

opensearch_utils.create_index(os_client, index_name, index_body)
index_info = os_client.indices.get(index=index_name)
index_info


Creating index:
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'sm-poc-konx-warming-up-nori-index'}


{'sm-poc-konx-warming-up-nori-index': {'aliases': {},
  'mappings': {'properties': {'metadata': {'properties': {'last_updated': {'type': 'date'},
      'project': {'type': 'keyword'},
      'seq_num': {'type': 'long'},
      'source': {'type': 'keyword'},
      'title': {'type': 'text'},
      'url': {'type': 'text'}}},
    'text': {'type': 'text', 'analyzer': 'my_analyzer'},
    'vector_field': {'type': 'knn_vector', 'dimension': 1536}}},
  'settings': {'index': {'number_of_shards': '5',
    'provided_name': 'sm-poc-konx-warming-up-nori-index',
    'knn.space_type': 'cosinesimil',
    'knn': 'true',
    'creation_date': '1700967308037',
    'analysis': {'filter': {'my_nori_part_of_speech': {'type': 'nori_part_of_speech',
       'stoptags': ['E',
        'IC',
        'J',
        'MAG',
        'MAJ',
        'MM',
        'SP',
        'SSC',
        'SSO',
        'SC',
        'SE',
        'XPN',
        'XSA',
        'XSN',
        'XSV',
        'UNA',
        'NA',
        'VS

# 6. Add a Document to the Default Index
- Add a single document as shown below.

In [15]:
text = "이제 핵심기능으로 OpenSearch에서도 노리를 사용할 수 있습니다."
text_emb = llm_emb.embed_query(text)
print(len(text_emb))

1536


In [16]:
# Example document
doc_body = {
    "text": "이제 핵심기능으로 OpenSearch에서도 노리를 사용할 수 있습니다.",
    "vector_field": text_emb,  # Replace with your vector
    "metadata" : [
        {         
         "source": "all_kr_files.json", 
         "last_updated": "2022-01-01", 
         "project": "sample", 
         "seq_num": 1, 
         "title": "sample", 
         "url": ""}
    ]
}

opensearch_utils.add_doc(os_client, index_name, doc_body, id='1')



Adding document:
{'_index': 'sm-poc-konx-warming-up-nori-index', '_id': '1', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 3, 'successful': 3, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
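
The `'forced_refresh': True` in the response suggests that the helper indexes with `refresh=True`, although its body is not shown. The sketch below combines that assumption with a length check against the mapping's `dimension`, since a mismatched vector would be rejected at index time. `build_doc` and `index_doc` are hypothetical names.

```python
VECTOR_DIMENSION = 1536  # must match 'dimension' of vector_field in the mapping


def build_doc(text, embedding, metadata):
    """Assemble a document body, checking the embedding length first."""
    if len(embedding) != VECTOR_DIMENSION:
        raise ValueError(
            f"expected {VECTOR_DIMENSION} dimensions, got {len(embedding)}"
        )
    return {"text": text, "vector_field": list(embedding), "metadata": metadata}


def index_doc(client, index_name, document, doc_id):
    """Index one document and refresh so it is searchable right away."""
    return client.index(index=index_name, body=document, id=doc_id, refresh=True)
```

Refreshing on every write is convenient in a warm-up notebook like this one, but it is usually avoided for bulk loads because it is expensive.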


# 7. Check the Term Vector

Near the bottom of the result below, you can see that "opensearch" is stored as a single term, as in the following excerpt.

```
text = "이제 OpenSearch에서도 노리를 사용할 수 있습니다."
'terms': {'opensearch': {'term_freq': 1,
     'tokens': [{'position': 1, 'start_offset': 3, 'end_offset': 13}]},
```
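
Term vectors show what was stored for a document that is already indexed. To try the analyzer on arbitrary text without indexing anything, the `_analyze` API can be used instead. This is a minimal sketch, assuming `client` is an `opensearchpy.OpenSearch` instance and that the index defines `my_analyzer` as above; `analyze_text` is a hypothetical helper name.

```python
def analyze_text(client, index_name, text, analyzer="my_analyzer"):
    """Return the tokens an index-level analyzer produces for the given text."""
    response = client.indices.analyze(
        index=index_name,
        body={"analyzer": analyzer, "text": text},
    )
    # Each entry also carries offsets, type, and position; keep only the token text.
    return [token["token"] for token in response["tokens"]]
```

A call such as `analyze_text(os_client, index_name, text)` would then list the tokens for any sentence.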

### Term Vector Before Applying my_nori_part_of_speech
- If the index is created without the filter above, the following terms are generated.
- Terms:
```
{'_index': 'sm-poc-konx-warming-up-nori-index',
 '_id': '1',
 '_version': 1,
 'found': True,
 'took': 1,
 'term_vectors': {'text': {'field_statistics': {'sum_doc_freq': 16,
    'doc_count': 1,
    'sum_ttf': 16},
   'terms': {'opensearch': {'term_freq': 1,
     'tokens': [{'position': 4, 'start_offset': 10, 'end_offset': 20}]},
    'ᆯ': {'term_freq': 1,
     'tokens': [{'position': 11, 'start_offset': 30, 'end_offset': 31}]},
    '기능': {'term_freq': 1,
     'tokens': [{'position': 2, 'start_offset': 5, 'end_offset': 7}]},
    '노리': {'term_freq': 1,
     'tokens': [{'position': 7, 'start_offset': 24, 'end_offset': 26}]},
    '도': {'term_freq': 1,
     'tokens': [{'position': 6, 'start_offset': 22, 'end_offset': 23}]},
    '를': {'term_freq': 1,
     'tokens': [{'position': 8, 'start_offset': 26, 'end_offset': 27}]},
    '사용': {'term_freq': 1,
     'tokens': [{'position': 9, 'start_offset': 28, 'end_offset': 30}]},
    '수': {'term_freq': 1,
     'tokens': [{'position': 12, 'start_offset': 32, 'end_offset': 33}]},
    '습니다': {'term_freq': 1,
     'tokens': [{'position': 14, 'start_offset': 35, 'end_offset': 38}]},
    '에서': {'term_freq': 1,
     'tokens': [{'position': 5, 'start_offset': 20, 'end_offset': 22}]},
    '으로': {'term_freq': 1,
     'tokens': [{'position': 3, 'start_offset': 7, 'end_offset': 9}]},
    '이제': {'term_freq': 1,
     'tokens': [{'position': 0, 'start_offset': 0, 'end_offset': 2}]},
    '있': {'term_freq': 1,
     'tokens': [{'position': 13, 'start_offset': 34, 'end_offset': 35}]},
    '하': {'term_freq': 1,
     'tokens': [{'position': 10, 'start_offset': 30, 'end_offset': 31}]},
    '할': {'term_freq': 1,
     'tokens': [{'position': 10, 'start_offset': 30, 'end_offset': 31}]},
    '핵심': {'term_freq': 1,
     'tokens': [{'position': 1, 'start_offset': 3, 'end_offset': 5}]}}}}}
```

In [17]:
os_client.termvectors(index=index_name, id='1', fields='text')

{'_index': 'sm-poc-konx-warming-up-nori-index',
 '_id': '1',
 '_version': 1,
 'found': True,
 'took': 1,
 'term_vectors': {'text': {'field_statistics': {'sum_doc_freq': 7,
    'doc_count': 1,
    'sum_ttf': 7},
   'terms': {'opensearch': {'term_freq': 1,
     'tokens': [{'position': 4, 'start_offset': 10, 'end_offset': 20}]},
    '기능': {'term_freq': 1,
     'tokens': [{'position': 2, 'start_offset': 5, 'end_offset': 7}]},
    '노리': {'term_freq': 1,
     'tokens': [{'position': 7, 'start_offset': 24, 'end_offset': 26}]},
    '사용': {'term_freq': 1,
     'tokens': [{'position': 9, 'start_offset': 28, 'end_offset': 30}]},
    '수': {'term_freq': 1,
     'tokens': [{'position': 12, 'start_offset': 32, 'end_offset': 33}]},
    '있': {'term_freq': 1,
     'tokens': [{'position': 13, 'start_offset': 34, 'end_offset': 35}]},
    '핵심': {'term_freq': 1,
     'tokens': [{'position': 1, 'start_offset': 3, 'end_offset': 5}]}}}}}
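
The drop from 16 terms to 7 between the two term vectors comes from the `stoptags` list. The pure-Python sketch below imitates that filtering step. The part-of-speech tags attached to each term are hand-assigned assumptions for illustration, not output from Nori.

```python
# Same stoptags as in the index settings.
STOPTAGS = {
    "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC",
    "SE", "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV",
}

# (term, POS tag) pairs for the 16 terms of the unfiltered term vector.
# The tags are assumed for illustration only.
TOKENS = [
    ("이제", "MAG"), ("핵심", "NNG"), ("기능", "NNG"), ("으로", "J"),
    ("opensearch", "SL"), ("에서", "J"), ("도", "J"), ("노리", "NNG"),
    ("를", "J"), ("사용", "NNG"), ("하", "XSV"), ("할", "XSV"),
    ("ᆯ", "E"), ("수", "NNB"), ("있", "VV"), ("습니다", "E"),
]


def apply_stoptags(tokens, stoptags):
    """Keep only the terms whose POS tag is not in the stop list."""
    return [term for term, tag in tokens if tag not in stoptags]


kept = apply_stoptags(TOKENS, STOPTAGS)
print(len(TOKENS), len(kept))
print(kept)
```

Particles, endings, and adverbs are removed, which leaves the nouns and stems that are useful for search.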

# 7. Document Search

## Lexical Search

In [18]:
q = '핵심기능'
query = {
  "query": {
    "match": {
      "text": {
        "query": f"{q}"
      }
    }
  }
}
print("query: ", query)
response = opensearch_utils.search_document(os_client, query, index_name)    
response

query:  {'query': {'match': {'text': {'query': '핵심기능'}}}}


{'took': 5,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 0.5753642,
  'hits': [{'_index': 'sm-poc-konx-warming-up-nori-index',
    '_id': '1',
    '_score': 0.5753642,
    '_source': {'text': '이제 핵심기능으로 OpenSearch에서도 노리를 사용할 수 있습니다.',
     'vector_field': [0.19628906,
      0.087890625,
      0.06982422,
      0.36914062,
      0.061279297,
      -0.52734375,
      -0.19140625,
      -0.0005340576,
      -0.97265625,
      -0.515625,
      -0.09277344,
      0.8984375,
      -0.23339844,
      0.30859375,
      -0.07763672,
      -0.40234375,
      0.71875,
      0.69140625,
      -0.83203125,
      0.6171875,
      -0.84375,
      0.13769531,
      -0.10449219,
      1.078125,
      -0.06982422,
      0.44921875,
      -0.55859375,
      -0.38476562,
      0.01373291,
      -0.15917969,
      -0.546875,
      1.34375,
      0.765625,
      0.20996094,
      -0.4921875,
     

The reason can be seen in the term vector of the body field: "출시하고" was stored as a term, but the term we actually want, "출시", was not. You can use the termvectors query to check the term vector of the currently indexed document. In the response, confirm that only "출시하고" is stored.
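
The lexical search response above prints every value of `vector_field`, which buries the matched text. One option is to exclude that field from the returned `_source`. This is a sketch only; `build_match_query` is a hypothetical helper name.

```python
def build_match_query(field, text, exclude_fields=("vector_field",)):
    """Build a match query that omits bulky fields from the returned _source."""
    return {
        "_source": {"excludes": list(exclude_fields)},
        "query": {"match": {field: {"query": text}}},
    }


query = build_match_query("text", "핵심기능")
print(query)
```

The resulting dictionary can be passed to the same search helper used above; only the shape of `_source` in each hit changes.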

## Semantic Search

In [19]:
text = "핵심기능"
text_emb = llm_emb.embed_query(text)

In [20]:
query = {
  "size": 2,  
  "query": {
    "script_score": {
      "query": {
        "match_all": {}  
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['vector_field']) + 1.0",
        "params": {
          "query_vector": text_emb  
        }
      }
    }
  }
}

# print("query: ", query)
response = opensearch_utils.search_document(os_client, query, index_name)    
response


{'took': 4,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 1.363281,
  'hits': [{'_index': 'sm-poc-konx-warming-up-nori-index',
    '_id': '1',
    '_score': 1.363281,
    '_source': {'text': '이제 핵심기능으로 OpenSearch에서도 노리를 사용할 수 있습니다.',
     'vector_field': [0.19628906,
      0.087890625,
      0.06982422,
      0.36914062,
      0.061279297,
      -0.52734375,
      -0.19140625,
      -0.0005340576,
      -0.97265625,
      -0.515625,
      -0.09277344,
      0.8984375,
      -0.23339844,
      0.30859375,
      -0.07763672,
      -0.40234375,
      0.71875,
      0.69140625,
      -0.83203125,
      0.6171875,
      -0.84375,
      0.13769531,
      -0.10449219,
      1.078125,
      -0.06982422,
      0.44921875,
      -0.55859375,
      -0.38476562,
      0.01373291,
      -0.15917969,
      -0.546875,
      1.34375,
      0.765625,
      0.20996094,
      -0.4921875,
      0
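
The script in the query scores each document as cosine similarity plus 1.0, which shifts the range from [-1, 1] to [0, 2] so that scores are never negative. The pure-Python sketch below reproduces that arithmetic for small vectors. It illustrates the formula only and is not how OpenSearch computes the score internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    """Mirror the query's scoring expression: cosineSimilarity(...) + 1.0."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


print(script_score([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(script_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Read this way, the `_score` of 1.363281 in the response corresponds to a cosine similarity of about 0.363 between the query embedding and the document embedding.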

# 8. Delete the Created Index

In [21]:
index_exists = opensearch_utils.check_if_index_exists(
    os_client,
    index_name
)


if index_exists:
    opensearch_utils.delete_index(
        os_client,
        index_name
    )
else:
    print("Index does not exist")    

index_name=sm-poc-konx-warming-up-nori-index, exists=True

Deleting index:
{'acknowledged': True}
