# Required package setup and OpenSearch cluster creation (takes about 40 minutes)
> This notebook was tested on the SageMaker Studio **`Data Science 3.0`** kernel with an ml.t3.medium instance.

## 0. Prerequisites
- The role that runs this notebook must have the following permissions attached.
    - AmazonOpenSearchServiceFullAccess
    - AmazonSSMFullAccess
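
As a quick pre-flight check, the sketch below compares the managed policies attached to a role against the two listed above. It assumes the caller is allowed to use `iam:ListAttachedRolePolicies`, and the helper names (`missing_policies`, `attached_policy_names`) are illustrative. Permissions granted through inline or custom policies are not detected by this check.

```python
REQUIRED_POLICIES = {
    "AmazonOpenSearchServiceFullAccess",
    "AmazonSSMFullAccess",
}

def missing_policies(attached_names):
    """Return the required managed policies that are not in attached_names."""
    return sorted(REQUIRED_POLICIES - set(attached_names))

def attached_policy_names(role_name):
    """List managed policy names attached to role_name (needs IAM read access)."""
    import boto3  # imported lazily so the pure helper above works without AWS access
    iam = boto3.client("iam")
    names = []
    paginator = iam.get_paginator("list_attached_role_policies")
    for page in paginator.paginate(RoleName=role_name):
        names.extend(p["PolicyName"] for p in page["AttachedPolicies"])
    return names

# Example with a hard-coded list instead of a live IAM call:
print(missing_policies(["AmazonSSMFullAccess"]))
```

In the notebook, `missing_policies(attached_policy_names("<your role name>"))` would run the same comparison against the live role.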

<br>

# 1. Create the OpenSearch client
- LangChain OpenSearch reference
    - [Langchain Opensearch](https://python.langchain.com/docs/integrations/vectorstores/opensearch)

#### [Note] Creating the OpenSearch domain takes about 15-16 minutes.

In [3]:
import boto3
import uuid
import botocore
import time
DEV = True # True: create a 1-AZ domain without standby; False: 3-AZ with standby. For workshops, True is recommended to avoid unnecessary cost and resources.
VERSION = "2.11" # OpenSearch version (e.g. 2.7 / 2.9 / 2.11)

# Master user credentials. The values below are the workshop examples; replace them with your own.
opensearch_user_id = "raguser"
opensearch_user_password = "MarsEarth1!"

region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
opensearch = boto3.client('opensearch', region)
rand_str = uuid.uuid4().hex[:8]
domain_name = f'rag-hol-{rand_str}'

cluster_config_prod = {
    'InstanceCount': 3,
    'InstanceType': 'r6g.large.search',
    'ZoneAwarenessEnabled': True,
    'DedicatedMasterEnabled': True,
    'MultiAZWithStandbyEnabled': True,
    'DedicatedMasterType': 'r6g.large.search',
    'DedicatedMasterCount': 3
}

cluster_config_dev = {
    'InstanceCount': 1,
    'InstanceType': 'r6g.large.search',
    'ZoneAwarenessEnabled': False,
    'DedicatedMasterEnabled': False,
}


ebs_options = {
    'EBSEnabled': True,
    'VolumeType': 'gp3',
    'VolumeSize': 100,
}

advanced_security_options = {
    'Enabled': True,
    'InternalUserDatabaseEnabled': True,
    'MasterUserOptions': {
        'MasterUserName': opensearch_user_id,
        'MasterUserPassword': opensearch_user_password
    }
}

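# Resource-based policy that allows es:* on this domain for any principal;
# requests are still authenticated by the fine-grained access control master user above.
import json
ap = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "es:*",
        "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*"
    }]
})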

if DEV:
    cluster_config = cluster_config_dev
else:
    cluster_config = cluster_config_prod

response = opensearch.create_domain(
    DomainName=domain_name,
    EngineVersion=f'OpenSearch_{VERSION}',
    ClusterConfig=cluster_config,
    AccessPolicies=ap,
    EBSOptions=ebs_options,
    AdvancedSecurityOptions=advanced_security_options,
    NodeToNodeEncryptionOptions={'Enabled': True},
    EncryptionAtRestOptions={'Enabled': True},
    DomainEndpointOptions={'EnforceHTTPS': True}
)

In [4]:
%%time
def wait_for_domain_creation(domain_name):
    try:
        response = opensearch.describe_domain(
            DomainName=domain_name
        )
        # Every 60 seconds, check whether the domain is processing.
        while 'Endpoint' not in response['DomainStatus']:
            print('Creating domain...')
            time.sleep(60)
            response = opensearch.describe_domain(
                DomainName=domain_name)

        # Once we exit the loop, the domain is ready for ingestion.
        endpoint = response['DomainStatus']['Endpoint']
        print('Domain endpoint ready to receive data: ' + endpoint)
    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'ResourceNotFoundException':
            print('Domain not found.')
        else:
            raise error

wait_for_domain_creation(domain_name)

Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Creating domain...
Domain endpoint ready to receive data: search-rag-hol-96e7c241-bkmgtnquj3lxujicjpijra3nyi.us-west-2.es.amazonaws.com
CPU times: user 312 ms, sys: 10.3 ms, total: 322 ms
Wall time: 16min 3s
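
The wait loop above polls with no upper bound on the waiting time. A hedged alternative is a small generic helper with a timeout. `poll_until`, its parameters, and the injectable `sleep` and `clock` arguments are illustrative names, not part of boto3.

```python
import time

def poll_until(check, interval=60, timeout=3600, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)

# Usage sketch with the notebook's client (not executed here):
# endpoint = poll_until(
#     lambda: opensearch.describe_domain(DomainName=domain_name)["DomainStatus"].get("Endpoint")
# )
```

Passing `sleep` and `clock` as arguments keeps the helper easy to test without actually waiting.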


In [5]:
response = opensearch.describe_domain(DomainName=domain_name)
opensearch_domain_endpoint = f"https://{response['DomainStatus']['Endpoint']}"

### Store the OpenSearch credentials in SSM Parameter Store

In [6]:
%load_ext autoreload
%autoreload 2

In [7]:
import sys, os
module_path = ".."
sys.path.append(os.path.abspath(module_path))

In [8]:
import boto3
from utils.ssm import parameter_store

In [9]:
region=boto3.Session().region_name
pm = parameter_store(region)

In [10]:
opensearch_domain_endpoint

'https://search-rag-hol-96e7c241-bkmgtnquj3lxujicjpijra3nyi.us-west-2.es.amazonaws.com'

In [12]:
pm.put_params(
    key="opensearch_domain_endpoint",
    value=f'{opensearch_domain_endpoint}',
    overwrite=True,
    enc=False
)

pm.put_params(
    key="opensearch_user_id",
    value=f'{opensearch_user_id}',
    overwrite=True,
    enc=False
)

pm.put_params(
    key="opensearch_user_password",
    value=f'{opensearch_user_password}',
    overwrite=True,
    enc=True
)

Parameter stored successfully.
Parameter stored successfully.
Parameter stored successfully.
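
`utils.ssm.parameter_store` is a helper from the workshop repository, and its source is not shown here. As a rough sketch of what such a wrapper could look like, the class below maps `enc=True` to a `SecureString` parameter using the boto3 SSM calls `put_parameter` and `get_parameter`. The class name `SimpleParameterStore` and the injected `client` argument are assumptions made for illustration, not the actual implementation.

```python
class SimpleParameterStore:
    """Hypothetical stand-in for utils.ssm.parameter_store."""

    def __init__(self, client):
        # client is expected to be boto3.client("ssm", region_name=...)
        self.client = client

    def put_params(self, key, value, overwrite=True, enc=False):
        # SecureString values are encrypted with KMS; String values are stored as plain text.
        self.client.put_parameter(
            Name=key,
            Value=value,
            Type="SecureString" if enc else "String",
            Overwrite=overwrite,
        )

    def get_params(self, key, enc=False):
        # WithDecryption only matters for SecureString parameters.
        response = self.client.get_parameter(Name=key, WithDecryption=enc)
        return response["Parameter"]["Value"]
```

Under these assumptions, `SimpleParameterStore(boto3.client("ssm", region_name=region))` would be used the same way as `pm` above.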


### Load the OpenSearch credentials from SSM Parameter Store

In [13]:
print(pm.get_params(key="opensearch_domain_endpoint", enc=False))
print(pm.get_params(key="opensearch_user_id", enc=False))
print(pm.get_params(key="opensearch_user_password", enc=True))

https://search-rag-hol-96e7c241-bkmgtnquj3lxujicjpijra3nyi.us-west-2.es.amazonaws.com
raguser
MarsEarth1!
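
With the endpoint and credentials available, a client can be created. The sketch below uses `opensearch-py` (version 2.4.2 is listed at the end of this notebook) with HTTP basic auth. The helper names `client_kwargs` and `make_client` are illustrative, and the 30-second timeout is an arbitrary choice.

```python
from urllib.parse import urlparse

def client_kwargs(endpoint, user, password):
    """Build keyword arguments for opensearchpy.OpenSearch from an https:// endpoint."""
    host = urlparse(endpoint).hostname
    return {
        "hosts": [{"host": host, "port": 443}],
        "http_auth": (user, password),
        "use_ssl": True,
        "verify_certs": True,
        "timeout": 30,
    }

def make_client(endpoint, user, password):
    # Imported lazily so client_kwargs can be used without opensearch-py installed.
    from opensearchpy import OpenSearch, RequestsHttpConnection
    return OpenSearch(
        connection_class=RequestsHttpConnection,
        **client_kwargs(endpoint, user, password),
    )

# Usage sketch (requires network access to the domain):
# client = make_client(
#     pm.get_params(key="opensearch_domain_endpoint", enc=False),
#     pm.get_params(key="opensearch_user_id", enc=False),
#     pm.get_params(key="opensearch_user_password", enc=True),
# )
# print(client.info())
```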


<br>

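# 2. Install the Nori plugin for Korean text analysis
Amazon OpenSearch Service supports the Nori plugin, a popular open-source Korean text analyzer. Together with the previously supported Seunjeon plugin, Nori makes it easy for developers to implement full-text search over Korean documents.

In addition, the Pinyin and STConvert plugins for Chinese analysis and the Sudachi plugin for Japanese analysis have been added.
The Nori plugin is available on both new and existing domains running OpenSearch 1.0 or later.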
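
To make this concrete, the sketch below builds an index body that uses the plugin's `nori_tokenizer`. The analyzer name `korean_analyzer`, the tokenizer name `korean_tokenizer`, the field name `text`, and the choice of `decompound_mode` and token filters are illustrative assumptions, not settings taken from this workshop.

```python
import json

def nori_index_body(decompound_mode="mixed"):
    """Return an index body with a custom analyzer based on nori_tokenizer."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "korean_tokenizer": {
                        "type": "nori_tokenizer",
                        # none / discard / mixed control how compound nouns are split
                        "decompound_mode": decompound_mode,
                    }
                },
                "analyzer": {
                    "korean_analyzer": {
                        "type": "custom",
                        "tokenizer": "korean_tokenizer",
                        "filter": ["lowercase", "nori_readingform"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "korean_analyzer"}
            }
        },
    }

print(json.dumps(nori_index_body(), ensure_ascii=False, indent=2))
```

Once the plugin has been associated with the domain, a body like this could be passed to `client.indices.create(index=..., body=...)` in `opensearch-py`.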

#### Option 1. Manual installation from the AWS console
Install the plugin manually, following nori_1.png, nori_2.png, and nori_3.png in the ../10_advanced_question_answering/img folder.

#### Option 2. Installation with the boto3 API
Run the code cells below.

#### [Note] Associating the Nori plugin takes about 25-27 minutes (the run recorded below took about 30).

In [14]:
nori_pkg_id = {}
nori_pkg_id['us-east-1'] = {
    '2.3': 'G196105221',
    '2.5': 'G240285063',
    '2.7': 'G16029449', 
    '2.9': 'G60209291',
    '2.11': 'G181660338'
}

nori_pkg_id['us-west-2'] = {
    '2.3': 'G94047474',
    '2.5': 'G138227316',
    '2.7': 'G182407158', 
    '2.9': 'G226587000',
    '2.11': 'G79602591'
}

pkg_response = opensearch.associate_package(
    PackageID=nori_pkg_id[region][VERSION], # nori plugin
    DomainName=domain_name
)
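
The lookup `nori_pkg_id[region][VERSION]` raises a bare `KeyError` when the notebook runs in a region or with a version that the table does not cover. A small guard, sketched below with the hypothetical helper `lookup_package_id`, turns that into a clearer message. The table in the example is a shortened copy of the one above.

```python
def lookup_package_id(table, region, version):
    """Return the package ID for (region, version) or raise a descriptive error."""
    if region not in table:
        raise ValueError(
            f"No Nori package ID listed for region {region!r}; "
            f"known regions: {sorted(table)}"
        )
    if version not in table[region]:
        raise ValueError(
            f"No Nori package ID listed for OpenSearch {version!r} in {region!r}; "
            f"known versions: {sorted(table[region])}"
        )
    return table[region][version]

# Shortened copy of the table above, for illustration only.
example_table = {"us-west-2": {"2.9": "G226587000", "2.11": "G79602591"}}
print(lookup_package_id(example_table, "us-west-2", "2.11"))
```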

In [15]:
%%time
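def wait_for_associate_package(domain_name, max_results=1):
    # Only one package (Nori) is associated here, so the first entry is the one to watch.
    response = opensearch.list_packages_for_domain(
        DomainName=domain_name,
        MaxResults=max_results
    )
    # Every 60 seconds, check whether the package is still being associated.
    while response['DomainPackageDetailsList'][0]['DomainPackageStatus'] == "ASSOCIATING":
        print('Associating packages...')
        time.sleep(60)
        response = opensearch.list_packages_for_domain(
            DomainName=domain_name,
            MaxResults=max_results
        )

    # The loop also exits on failure, so confirm the final status before reporting success.
    status = response['DomainPackageDetailsList'][0]['DomainPackageStatus']
    if status != 'ACTIVE':
        raise RuntimeError(f'Package association ended with status {status}')
    print('Associated!')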

wait_for_associate_package(domain_name)

Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associating packages...
Associated!
CPU times: user 582 ms, sys: 44.7 ms, total: 627 ms
Wall time: 30min 4s


![nn](../10_advanced_question_answering/img/nori_4.png)

In [16]:
! pip list | grep langchain
! pip list | grep opensearch

langchain                             0.1.11
langchain-community                   0.0.25
langchain-core                        0.1.29
langchain-text-splitters              0.0.1
opensearch-py                         2.4.2
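
The same check can be done from Python with the standard library's `importlib.metadata`, which avoids depending on `grep` being available. The helper name `installed_versions` is illustrative.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(["langchain", "opensearch-py"]).items():
    print(f"{name}: {version}")
```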
