# Full Text Search 


## Set Up the Elasticsearch Python Client
In this section you will install the Elasticsearch client library for Python and use it to connect to the Elasticsearch service.



## Installation
The Elasticsearch client library is a Python package that is installed with pip. Make sure the virtual environment you created earlier is activated, and then run the following command to install the client:


```bash 
pip install elasticsearch
```

To avoid any potential incompatibilities, make sure the version of the Elasticsearch client library you install matches the version of the Elasticsearch stack that you are using.
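
One way to check for a mismatch is to compare major versions. The following is a minimal sketch rather than part of the tutorial code: the helper names are hypothetical, and it assumes both versions are plain dotted strings such as 8.13.0.

```python
from importlib import metadata
from typing import Optional


def installed_client_version() -> Optional[str]:
    """Return the installed elasticsearch package version, or None if absent."""
    try:
        return metadata.version('elasticsearch')
    except metadata.PackageNotFoundError:
        return None


def same_major_version(client_version: str, server_version: str) -> bool:
    """Return True when both dotted version strings share a major version."""
    return client_version.split('.')[0] == server_version.split('.')[0]


# The version strings below are illustrative values, not read from a service.
print(same_major_version('8.12.1', '8.13.0'))
print(same_major_version('7.17.0', '8.13.0'))
```

With a running service, the server version could be taken from the version.number field of the info() response and compared with installed_client_version().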

It is always recommended to keep a requirements.txt file updated with all your dependencies, so this is a good time to update this file to include the newly installed package. Run the following command from your terminal:


```bash 
pip freeze > requirements.txt
```

In [1]:
%pip install elasticsearch

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Connect to Elasticsearch
To create a connection to your Elasticsearch service, an Elasticsearch object must be created with the appropriate connection options.

Create a new search.py file in your code editor, located in the search-tutorial directory. The search.py file is going to be where all the search functions will be defined. The idea of having a separate file for the search functionality is that this will make it easy for you to extract this file and add it into your own projects later on.

Enter the following code in search.py to add a Search class:




In [4]:
import json
from pprint import pprint
import os
import time

from dotenv import load_dotenv
from elasticsearch import Elasticsearch

load_dotenv()


class Search:
    def __init__(self):
        self.es = Elasticsearch('http://localhost:9200')
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)

There is a lot to unpack here. The load_dotenv() function that is called right after the imports comes from the python-dotenv package. This package knows how to work with .env files, which are used to store configuration variables such as passwords and keys. The load_dotenv() function reads the variables that are stored in the .env file and imports them into the Python process as environment variables.

The Search class has a constructor that creates an instance of the Elasticsearch client class. This is where all the client logic to communicate with the Elasticsearch service lives. Note that the URL passed here, http://localhost:9200, is only appropriate for a service running locally; other deployments need different connection options, which are discussed below. Once created, the Elasticsearch object is then stored in an instance variable named self.es.

To ensure that the client object can communicate with your Elasticsearch service, the info() method is invoked. This method makes a call to the service requesting basic information. If this call succeeds, then you can assume that you have a valid connection to the service.

The method then prints a status message indicating that the connection has been established, and then uses the pprint function from Python to display the information that the service returned in an easy to read format.

NOTE: You may have noticed that the json package from the Python standard library is imported in this file, but not used. Do not remove this import, as this package will be used later.


To complete the constructor of the Search class, the Elasticsearch object needs to be given appropriate connection options. The following sub-sections will tell you what options you need for the Elastic Cloud and Docker methods of installation.
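
Since the exact options depend on the installation, the following is only a sketch of how the choice could be made in code. The environment variable names ELASTIC_CLOUD_ID and ELASTIC_API_KEY are assumptions (they would be loaded from the .env file by load_dotenv()), and the fallback is the local URL used in this notebook. The cloud_id, api_key and hosts names are keyword arguments of the Elasticsearch class in the 8.x client.

```python
import os


def connection_options(env=None):
    """Build keyword arguments for Elasticsearch(...) from environment variables.

    ELASTIC_CLOUD_ID and ELASTIC_API_KEY are assumed names, chosen for
    illustration only.
    """
    env = os.environ if env is None else env
    cloud_id = env.get('ELASTIC_CLOUD_ID')
    api_key = env.get('ELASTIC_API_KEY')
    if cloud_id and api_key:
        # Elastic Cloud deployment
        return {'cloud_id': cloud_id, 'api_key': api_key}
    # Local Docker container without security, as used in this notebook
    return {'hosts': 'http://localhost:9200'}


print(connection_options({}))
print(connection_options({'ELASTIC_CLOUD_ID': 'my-deployment:abc',
                          'ELASTIC_API_KEY': 'secret'}))
```

In the constructor, the result could then be used as `self.es = Elasticsearch(**connection_options())`.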



## Test the Connection
At this point you are ready to make a connection to your Elasticsearch service. To do this, make sure that your Python virtual environment is activated, and then type python to start a Python interactive session. You should see the familiar >>> prompt, in which you can enter Python statements.

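Import the Search class with from search import Search (this is not needed in a notebook where the class is defined in an earlier cell), and then create an instance as follows: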




In [6]:
es = Search()



Connected to Elasticsearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': 'KiZ5r8-vR7GZf9YFdRuehg',
 'name': '39f73e6b6ad4',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-03-22T03:35:46.757803203Z',
             'build_flavor': 'default',
             'build_hash': '09df99393193b2c53d92899662a8b8b3c55b45cd',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.10.0',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.13.0'}}


## Create an Elasticsearch Index

Two very important concepts in Elasticsearch are documents and indexes.

A document is a collection of fields with their associated values. To work with Elasticsearch you have to organize your data into documents, and then add all your documents to an index. You can think of an index as a collection of documents that is stored in a highly optimized format designed to perform efficient searches.

If you have worked with other databases, you may know that many databases require a schema definition, which is essentially a description of all the fields that you want to store and their types. An Elasticsearch index can be configured with a schema if desired, but it can also automatically derive the schema from the data itself. In this section you are going to let Elasticsearch figure out the schema on its own, which works quite well for simple data types such as text, numbers and dates. Later, after you are introduced to more complex data types, you will learn how to provide explicit schema definitions.



## Create the Index
This is how you create an Elasticsearch index using the Python client library:


```python
self.es.indices.create(index='my_documents')
```

In this example, self.es is an instance of the Elasticsearch class, which in this tutorial is stored in the Search class in search.py. An Elasticsearch deployment can be used to store multiple indexes, each identified by a name such as my_documents in the example above.

Indexes can also be deleted:

```python
self.es.indices.delete(index='my_documents')
```



If you attempt to create an index with a name that is already assigned to an existing index, you will get an error. Sometimes it is useful to create an index automatically deleting a previous instance of the index if it exists. This is especially useful while developing an application, because you will likely need to regenerate an index several times.

Let's add a create_index() helper method in search.py. Open this file in your code editor, and add the following code at the bottom, leaving the existing contents as they are:




In [9]:
class Search:
    def __init__(self):
        self.es = Elasticsearch('http://localhost:9200')
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents')

    def insert_document(self, document):
        return self.es.index(index='my_documents', document=document)


The create_index() method first deletes an index with the name my_documents. The ignore_unavailable=True option prevents this call from failing when the index name isn't found. The following line in the method creates a brand new index with that same name.

The example application featured in this tutorial needs a single Elasticsearch index, and for that reason it hardcodes the index name as my_documents. For more complex applications that use multiple indexes, you may consider accepting the index name as an argument.
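
As a sketch of that variation, the class below takes the index name as a constructor argument. The class name IndexManager and the recording stand-in client are hypothetical; the stand-in only exists so that the example runs without an Elasticsearch service.

```python
from types import SimpleNamespace


class IndexManager:
    """Variant of the create_index() helper with a configurable index name."""

    def __init__(self, client, index_name='my_documents'):
        self.es = client
        self.index_name = index_name

    def create_index(self):
        # Delete any previous index with this name, then create it again.
        self.es.indices.delete(index=self.index_name, ignore_unavailable=True)
        self.es.indices.create(index=self.index_name)


# Stand-in client that records calls instead of contacting a server.
calls = []
fake_client = SimpleNamespace(indices=SimpleNamespace(
    delete=lambda **kwargs: calls.append(('delete', kwargs)),
    create=lambda **kwargs: calls.append(('create', kwargs)),
))

IndexManager(fake_client, index_name='my_other_documents').create_index()
print(calls)
```

With a real deployment, the Elasticsearch client object would be passed in place of fake_client.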




## Add Documents to the Index
In the Elasticsearch client library for Python, a document is represented as a dictionary of key/value fields. Fields that have a string value are automatically indexed for full-text and keyword search, but in addition to strings you can use other field types such as numbers, dates and booleans, which are also indexed for efficient operations such as filtering. You can also build complex data structures in which a field is set to a list or a dictionary with sub-items.



In [10]:
es = Search()

document = {
    'title': 'Work From Home Policy',
    'contents': 'The purpose of this full-time work-from-home policy is...',
    'created_on': '2023-11-02',
}
response = es.insert_document(document)
print(response['_id'])

Connected to Elasticsearch!
dBOhn5ABMnum9rmL7gdm


## Ingesting Documents from a JSON File
When setting up a new Elasticsearch index, you are likely going to need to import a large number of documents. For this tutorial, the starter project includes a data.json file with some data in JSON format. In this section you will learn how to import all the documents contained in this file into the index.

The structure of the documents that are included in the data.json is as follows:
- name: the document title
- url: a URL to the document hosted on an external site
- summary: a short summary of the contents of the document
- content: the body of the document
- created_on: creation date
- updated_at: update date (could be missing if the document was never updated)
- category: the document's category, which can be github, sharepoint or teams
- rolePermissions: a list of role permissions

At this point you are encouraged to open data.json in your editor to familiarize yourself with the data that you are going to work with.

In essence, importing a large number of documents is no different than importing one document inside a for-loop. To import the entire contents of the data.json file, you could do something like this:




In [12]:
%pip install faker 

Collecting faker
  Downloading Faker-26.0.0-py3-none-any.whl.metadata (15 kB)
Downloading Faker-26.0.0-py3-none-any.whl (1.8 MB)
Installing collected packages: faker
Successfully installed faker-26.0.0
Note: you may need to restart the kernel to use updated packages.


In [13]:
import json
import random
from datetime import datetime, timedelta
from faker import Faker

fake = Faker()

def generate_data(num_records=1000):
    data = []
    categories = ['github', 'sharepoint', 'teams']
    roles = ['admin', 'user', 'guest', 'manager']

    for _ in range(num_records):
        created_on = fake.date_time_between(start_date='-2y', end_date='now')
        
        record = {
            'name': fake.sentence(nb_words=4),
            'url': fake.url(),
            'summary': fake.paragraph(nb_sentences=2),
            'content': fake.text(max_nb_chars=1000),
            'created_on': created_on.isoformat(),
            'category': random.choice(categories),
            'rolePermissions': random.sample(roles, k=random.randint(1, len(roles)))
        }

        # Add an updated_at field with 50% probability
        if random.random() > 0.5:
            updated_at = fake.date_time_between(start_date=created_on, end_date='now')
            record['updated_at'] = updated_at.isoformat()

        data.append(record)

    return data

def save_to_json(data, filename='data.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

data = generate_data()
save_to_json(data)
print(f"{len(data)} records were saved to data.json.")

1000 records were saved to data.json.


In [15]:
import json

es = Search()

with open('data.json', 'rt') as f:
    documents = json.loads(f.read())

for document in documents:
    es.insert_document(document)

Connected to Elasticsearch!


To inspect the index from a browser:
- http://localhost:9200/_cat/indices?v
- http://localhost:9200/my_documents/_search
- or use Kibana

In [19]:
class Search:
    def __init__(self):
        self.es = Elasticsearch('http://localhost:9200')
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents')

    def insert_document(self, document):
        return self.es.index(index='my_documents', document=document)

    def search(self, query=None, filters=None, sort=None, size=10, from_=0):
        """
        Search the my_documents index.
        :param query: search query (dictionary); defaults to match_all
        :param filters: filter clause(s) (list or dictionary)
        :param sort: sort criteria (list or dictionary)
        :param size: number of documents to return
        :param from_: offset of the first result
        :return: the search response, or None if the search failed
        """
        query = query or {"match_all": {}}

        if filters:
            # Wrap the base query in a bool query so that the filters run in
            # filter context and do not affect the relevance score.
            if not isinstance(filters, list):
                filters = [filters]
            query = {"bool": {"must": [query], "filter": filters}}

        try:
            # Pass query and sort as keyword arguments rather than mixing a
            # body argument with size and from_.
            return self.es.search(index="my_documents", query=query, sort=sort,
                                  size=size, from_=from_)
        except Exception as e:
            print(f"Search failed: {e}")
            return None


es = Search()

# After inserting the data
result = es.search()
print(f"There are {result['hits']['total']['value']} documents in total.")
for hit in result['hits']['hits']:
    print(hit['_source'])

Connected to Elasticsearch!
There are 1001 documents in total.
{'title': 'Work From Home Policy', 'contents': 'The purpose of this full-time work-from-home policy is...', 'created_on': '2023-11-02'}
{'name': 'Sense day.', 'url': 'http://www.day.com/', 'summary': 'Guess improve just born.', 'content': 'Natural second style service she read. Child plan response head.\nRoad space military weight responsibility true figure develop. M


In [20]:
class Search:
    def __init__(self):
        self.es = Elasticsearch('http://localhost:9200')
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents')

    def insert_documents(self, documents):
        operations = []
        for document in documents:
            operations.append({'index': {'_index': 'my_documents'}})
            operations.append(document)
        return self.es.bulk(operations=operations)


The insert_documents() method accepts a list of documents. Instead of adding each document separately, it assembles a single list called operations, and then passes the list to the bulk() method. For each document, two entries are added to the operations list:

- A description of what operation to perform, set to index, with the name of the index given as an argument.
- The actual data of the document.

When processing a bulk request, the Elasticsearch service walks the operations list from the start and performs the operations that were requested.

Learn more about the bulk() method in the documentation.
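
To make the layout of the operations list concrete, the sketch below builds it for two made-up documents without contacting a server. The function name build_bulk_operations and the sample documents are hypothetical.

```python
def build_bulk_operations(documents, index='my_documents'):
    """Return the flat list passed to bulk(): an action entry, then the document."""
    operations = []
    for document in documents:
        operations.append({'index': {'_index': index}})
        operations.append(document)
    return operations


sample_documents = [{'name': 'First'}, {'name': 'Second'}]
operations = build_bulk_operations(sample_documents)
for entry in operations:
    print(entry)
```

The list always has twice as many entries as there are documents, alternating between an action and the document it applies to.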


## Regenerating the Index
While you work on this tutorial you will need to regenerate the index a few times. To streamline this operation, add a reindex() method to search.py:




In [21]:
class Search:
    def __init__(self):
        self.es = Elasticsearch('http://localhost:9200')
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents')

    def insert_documents(self, documents):
        operations = []
        for document in documents:
            operations.append({'index': {'_index': 'my_documents'}})
            operations.append(document)
        return self.es.bulk(operations=operations)

    def reindex(self):
        self.create_index()
        with open('data.json', 'rt') as f:
            documents = json.loads(f.read())
        return self.insert_documents(documents)
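

The value returned by reindex() is the response of the bulk request. As a sketch of how it might be inspected, the helper below summarizes a response-shaped dictionary. The name summarize_bulk is hypothetical, and the sample response is hand-written and trimmed to the took, errors and items keys.

```python
def summarize_bulk(response):
    """Summarize a bulk response: item count, failures and time taken."""
    items = response['items']
    # Each item is keyed by its action ('index' here); failed items carry
    # an 'error' entry.
    failed = [item for item in items if 'error' in item.get('index', {})]
    return (f"{len(items)} documents processed, {len(failed)} failed, "
            f"took {response['took']} ms")


# Hand-written stand-in for the value returned by es.reindex().
sample_response = {
    'took': 30,
    'errors': True,
    'items': [
        {'index': {'_id': '1', 'status': 201}},
        {'index': {'_id': '2', 'status': 400,
                   'error': {'type': 'mapper_parsing_exception'}}},
    ],
}
print(summarize_bulk(sample_response))
```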


## Search Basics
Now that you have built an Elasticsearch index and loaded some documents into it, you are ready to implement full-text search.



## Elasticsearch Queries
The Elasticsearch service uses a Query DSL (Domain Specific Language) based on the JSON format to define queries.

The Elasticsearch client for Python has a search() method that is used to submit a search query. Let's add a search() helper method in search.py that uses this method:


```python
class Search:
    # ...

    def search(self, **query_args):
        return self.es.search(index='my_documents', **query_args)

```

This method invokes the search() method of the Elasticsearch client with the index name. The query_args argument captures all the keyword arguments provided to the method, and then passes them through to the es.search() method. These arguments are going to be how the caller specifies what to search for.



## Match Queries
The Elasticsearch Query DSL offers many different ways to query an index. Looking through the sub-sections in the documentation you will familiarize yourself with the different types of queries that are possible. The very common task of searching text is covered in the Full-Text queries section.

For the first search implementation, let's use the Match query. The Query DSL reference is available at:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
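
As a sketch of what a Match query looks like from Python, the function below builds one for the name field, which holds the document titles in data.json. The helper name match_query is hypothetical; the resulting dictionary would be passed as the query keyword argument of the search() helper.

```python
def match_query(text, field='name'):
    """Build a Match query that searches one field for the given text."""
    return {'match': {field: {'query': text}}}


query = match_query('work from home')
print(query)
# With a running service: results = es.search(query=query)
```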



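## Query DSL

Two types of query clauses:
- Leaf query clauses:
  - Look for a particular value in a particular field (for example match, term, range).

- Compound query clauses:
  - Wrap other leaf or compound queries.
  - Combine multiple queries logically (for example bool, dis_max) or alter their behavior (for example constant_score).

Query context and filter context:
- Query clauses behave differently depending on the context in which they are used.
- Query context:
  - Answers the question "How well does this document match this query clause?"
  - Characteristics:
    - Calculates a relevance score.
    - By default, results are sorted by score.
    - Mainly used for full-text search.
    - Examples: match, multi_match, query_string queries.
- Filter context:
  - Answers the yes/no question "Does this document match this query clause?"
  - Characteristics:
    - Does not calculate a relevance score.
    - Results can be cached, which benefits performance.
    - Mainly used for filtering on exact values or ranges.
  - Examples: term, terms, range, exists queries.

- Usage:
  - Within a bool query, the contexts are assigned as follows:
  - must and should clauses: query context
  - filter and must_not clauses: filter context
  - must vs filter:
    - must: the clause has to match, and it contributes to the relevance score.
    - filter: only checks whether the clause matches, and has no effect on the relevance score.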

```json
{
  "bool": {
    "must": [
      { "match": { "title": "search" } }  // query context
    ],
    "filter": [
      { "term": { "status": "published" } },  // filter context
      { "range": { "publish_date": { "gte": "2015-01-01" } } }  // filter context
    ]
  }
}
```


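- Selection criteria:
  - Full-text search or similarity-based search: query context
  - Exact value matching, range searches, existence checks: filter context

- Performance considerations:
  - Where possible, using filter context is better for performance.
  - Filter results can be cached and reused, whereas relevance scores in query context are computed for every request.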






Expensive queries:
- Some query types generally run slowly because of the way they are implemented, and can affect the stability of the cluster. They are grouped as follows:
  - Queries that need linear scans:
    - script query
  - Queries with a high up-front cost:
    - fuzzy query
    - regexp query
    - prefix query
    - wildcard query
    - range query on text and keyword fields
  - Joining queries
  - Queries that may have a high per-document cost:
    - script_score queries
    - percolate query


Controlling expensive queries:
- Setting search.allow_expensive_queries to false prevents these queries from running (the default is true).



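Script Query:
- Filters documents based on a user-defined script, which makes it possible to include complex conditions or calculations in a query. Scoring documents with a script is done with the script_score query.
- In other words, a complex condition on the fields of a document can be written as a script and used for searching.
- Painless is the default scripting language; other languages may be available depending on the configuration.
- Performance: script queries can run slower than regular queries, so use them with care.

Script Query example:
- This query finds documents whose price is below 100 and that are in stock.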


```json 
{
  "query": {
    "script": {
      "script": {
        "source": "doc['price'].value < 100 && doc['in_stock'].value == true"
      }
    }
  }
}
```

Script query with parameters:

```json
{
  "query": {
    "script": {
      "script": {
        "source": "doc['price'].value < params.max_price",
        "params": {
          "max_price": 50
        }
      }
    }
  }
}
```

Script score query:
- This query targets all documents and computes the score by combining popularity and rating.

```json
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "doc['popularity'].value / 10 + doc['rating'].value"
      }
    }
  }
}
```
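
The parameterized script query can also be assembled in Python. This is a sketch: script_filter is a hypothetical helper, and price is the example field used in the JSON above, not a field of data.json.

```python
def script_filter(source, **params):
    """Build a script query; values are passed as params instead of being inlined."""
    script = {'source': source}
    if params:
        script['params'] = params
    return {'script': {'script': script}}


query = script_filter("doc['price'].value < params.max_price", max_price=50)
print(query)
```

Keeping values in params rather than in the script source means the source text stays the same when only the values change.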

## Retrieving Individual Results
You may have noticed that the index.html template renders the title of each search result as a link. The link points to the third and last endpoint that came implemented in the starter Flask application, called get_document. The implementation that is provided returns a "Document not found" hardcoded text, so this is what you will see if you click on any of the results while playing with the application.

To correctly render individual documents, let's add a retrieve_document() helper method in search.py, using the get() method of the Elasticsearch client:




In [22]:
class Search:
    def __init__(self):
        self.es = Elasticsearch('http://localhost:9200')
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents')

    def insert_documents(self, documents):
        operations = []
        for document in documents:
            operations.append({'index': {'_index': 'my_documents'}})
            operations.append(document)
        return self.es.bulk(operations=operations)

    def reindex(self):
        self.create_index()
        with open('data.json', 'rt') as f:
            documents = json.loads(f.read())
        return self.insert_documents(documents)
    
    def retrieve_document(self, id):
        return self.es.get(index='my_documents', id=id)


# document = es.retrieve_document(id)
# title = document['_source']['name']
#  paragraphs = document['_source']['content'].split('\n')
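
The commented lines above can be tried without a running service. In the sketch below, the sample dictionary is a hand-written stand-in for the response of retrieve_document(), and the helper name is hypothetical.

```python
def document_title_and_paragraphs(document):
    """Extract the title and the body paragraphs from a get() response."""
    source = document['_source']
    return source['name'], source['content'].split('\n')


# Hand-written stand-in for the value returned by es.retrieve_document(id).
sample = {
    '_index': 'my_documents',
    '_id': 'abc123',
    'found': True,
    '_source': {
        'name': 'Sense day.',
        'content': 'First paragraph.\nSecond paragraph.',
    },
}
title, paragraphs = document_title_and_paragraphs(sample)
print(title)
print(paragraphs)
```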

## Searching Multiple Fields

After you have played with the application for a while you may have noticed that a lot of queries return no results. As you recall, the search is currently implemented on the name field of each document, which is where the document titles are stored. Documents also have summary and content fields, which have longer texts that are apt to be searched as well, but right now these are ignored.

In this section you are going to learn about another common full-text search query, the Multi-match, which requests a search to be carried out across multiple fields of an index.

Here is the example multi-match query from the documentation:





## Multi-match query
 
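The multi-match query is a flexible query type that searches several fields at the same time. It is especially useful when the same search terms should be applied to multiple fields.

Key characteristics:
- Multiple fields: a single query can search several fields at once.
- Several types: different matching strategies are available (best_fields, most_fields, cross_fields, and others).
- Field weights: individual fields can be given a higher importance.
- Flexibility: wildcards can be used to specify several similarly named fields at once.

Basic syntax:

```json
{
  "query": {
    "multi_match": {
      "query": "search terms",
      "fields": ["field1", "field2", "field3"]
    }
  }
}
```

Main match types:
- best_fields (default):
  - Uses the score of the best-scoring field.
  - Favors documents in which many of the search terms appear in a single field.

- most_fields:
  - Combines the scores of all matching fields.
  - Runs the query against each field independently and then adds up the resulting scores.
  - Each field is analyzed and scored on its own.
  - Uses an independent IDF for each field.
  - Favors documents in which the search terms appear across several fields.

- cross_fields:
  - Treats all the fields as if they were one big field.
  - Uses an IDF that is unified across the fields.
  - Useful when the search terms are spread over several fields.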
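
As a sketch of how the match type and field weights could be set from Python, the hypothetical helper below builds a multi-match query over the name, summary and content fields of data.json. The name^2 notation gives matches in the name field twice the weight.

```python
def multi_match_query(text, fields, match_type='best_fields'):
    """Build a multi_match query over the given fields."""
    return {
        'multi_match': {
            'query': text,
            'fields': list(fields),
            'type': match_type,
        }
    }


query = multi_match_query(
    'work from home',
    ['name^2', 'summary', 'content'],  # boost matches in the title
    match_type='most_fields',
)
print(query)
# With a running service: results = es.search(query=query)
```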

## Pagination

It is often impractical for an application to deal with a very large number of results. For this reason, APIs and web services use pagination controls to allow applications to request the results in small chunks or pages.

You may have noticed that Elasticsearch by default does not return more than 10 results. The optional size parameter can be given in a search request to change this maximum. The following example asks for up to 5 search results to be returned:


```python
results = es.search(
    query={
        'multi_match': {
            'query': query,
            'fields': ['name', 'summary', 'content'],
        }
    }, size=5
)
```

To access additional pages of results, the from_ parameter is used, which indicates from where in the complete list of results to start (since from is a reserved keyword in Python, from_ is used).

The next example retrieves a second page of 5 results:


```python
results = es.search(
    query={
        'multi_match': {
            'query': query,
            'fields': ['name', 'summary', 'content'],
        }
    }, size=5, from_=5
)

```
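
Applications usually think in page numbers rather than offsets. The sketch below converts a 1-based page number into the size and from_ arguments; the helper name page_arguments is hypothetical.

```python
def page_arguments(page, size=5):
    """Translate a 1-based page number into size and from_ arguments."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return {'size': size, 'from_': (page - 1) * size}


print(page_arguments(1))
print(page_arguments(2))
# With a running service: results = es.search(query=..., **page_arguments(2))
```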



## Filters

Many applications need to give users the power to customize queries in ways that complement what search queries alone can do. In this chapter you are going to learn about filtering, a technique that makes it possible to specify that a search query is executed only on the subset of the documents contained in an index that satisfy a given condition.



## Introduction to Boolean Queries
Before you can implement filters you have to understand how compound queries are implemented in Elasticsearch.

A compound query allows an application to combine two or more individual queries, so that they execute together, and if appropriate, return a combined set of results. The standard way to create compound queries in Elasticsearch is to use a Boolean query.

A boolean query acts as a wrapper for two or more individual queries or clauses. There are four different ways to combine queries:

- bool.must: the clause must match. If multiple clauses are given, all must match (similar to an AND logical operation).
- bool.should: when used without must, at least one clause should match (similar to an OR logical operation). When combined with must each matching clause boosts the relevance score of the document.
- bool.filter: only documents that match the clause(s) are considered search result candidates.
- bool.must_not: only documents that do not match the clause(s) are considered search result candidates.

As you can probably guess from the above, boolean queries involve a fair amount of complexity and can be used in a variety of ways. In this chapter you are going to learn how to combine the multi-match full-text search clause implemented in the previous chapters with a filter that restricts results to one category of documents. Recall that the dataset used with this tutorial includes a category field that can be set to sharepoint, teams or github.



## Adding a Filter to a Query

The multi-match query that is currently implemented in the tutorial application uses the following structure:

```python
{
    'multi_match': {
        'query': "query text here",
        'fields': ['name', 'summary', 'content'],
    }
}
```

To add a filter that restricts this search to a specific category, the query must be expanded as follows:


```python
{
    'bool': {
        'must': [{
            'multi_match': {
                'query': "query text here",
                'fields': ['name', 'summary', 'content'],
            }
        }],
        'filter': [{
            'term': {
                'category.keyword': {
                    'value': "category to filter"
                }
            }
        }]
    }
}
```

Let's look at the new components in this query in detail.

First of all, the multi_match query has been moved inside a bool.must clause. The bool.must clause is usually the place where the base query is defined. Note that must accepts a list of queries to search for, so this allows multiple base-level queries to be combined when desired.

The filtering is implemented in a bool.filter section, using a new query type, the term query. Using a match or multi_match query for a filter is not a good idea, because these are full-text search queries. For the purpose of filtering, the query must return an absolute true or false answer for each document and not a relevance score like the match queries do.

The term query performs an exact search for a value in a given field. This type of query is useful to search for identifiers, labels, tags, or as in this case, categories.

This query does not work well with fields that are indexed for full-text search. String fields are assigned a default type of text, and have their contents analyzed and separated into individual words before they are indexed. Elasticsearch assigns string fields a secondary type of keyword, which indexes the field contents as a whole, making them more appropriate for filtering with the term query. By using a field name of category.keyword in the filter portion of the query, the keyword typed variant of the field is used instead of the default text one.
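
Putting the pieces together, the sketch below builds either the plain multi-match query or the filtered variant, depending on whether a category is given. The helper name filtered_query is hypothetical; the field names are those of data.json.

```python
def filtered_query(text, category=None):
    """Build the multi-match query, optionally restricted to one category."""
    base = {
        'multi_match': {
            'query': text,
            'fields': ['name', 'summary', 'content'],
        }
    }
    if category is None:
        return base
    return {
        'bool': {
            'must': [base],
            # The term filter uses the keyword variant of the category field.
            'filter': [{'term': {'category.keyword': {'value': category}}}],
        }
    }


print(filtered_query('work from home'))
print(filtered_query('work from home', category='sharepoint'))
```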



## Term Query 

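The term query is used for exact value matching.

Basic concept:
- The term query finds documents that contain exactly the given term in the specified field.
- It is case-sensitive and searches for the exact, unanalyzed value.

Key characteristics:
- Exact matching: only exact matches are found, not partial or fuzzy ones.
- No analysis: the query string is used as-is, without being analyzed.
- Fast: exact matching allows very fast lookups.

Basic syntax: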
```json
{
  "query": {
    "term": {
      "field_name": "exact_value"
    }
  }
}
```

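Caveats:
- Take care with text fields: text fields are analyzed by default, so a term query on them can produce unexpected results.
- Prefer keyword fields: for exact matching, use fields of type keyword.
- Case sensitivity: "Active" and "active" are treated as different values.

Tip for text fields:
- Append '.keyword' to the field name to use the keyword version of the field (available when the field was mapped automatically, as described above).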

```json
{
  "query": {
    "term": {
      "status.keyword": "Active"
    }
  }
}
```

Searching for multiple values (terms query):
- Use this when you want to find documents that match any one of several values.

```json 
{
  "query": {
    "terms": {
      "status": ["active", "pending"]
    }
  }
}
``` 
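
When the filter is built in Python, the choice between term and terms can be made from the number of values. The following is a minimal sketch; exact_match_filter() is a hypothetical helper, not part of the Elasticsearch client, and the field name is taken from the examples above.

```python
def exact_match_filter(field, values):
    """Return a term query for one value or a terms query for several.

    values must be a list (or other iterable) of exact values.
    """
    values = list(values)
    if len(values) == 1:
        return {'term': {field: {'value': values[0]}}}
    return {'terms': {field: values}}


single = exact_match_filter('status.keyword', ['Active'])
several = exact_match_filter('status.keyword', ['Active', 'Pending'])
print(single)
print(several)
```

Either dictionary can be placed in the filter list of a bool query.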




## Text Type Query 

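Elasticsearch provides several query types for searching fields of type text. They are suited to full-text search, and each has its own characteristics and use cases. The main query types are the following:

Match Query:
- The most basic and widely used full-text search query.
- It analyzes the provided text and splits it into individual terms before searching.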

```json
{
  "query": {
    "match": {
      "description": "quick brown fox"
    }
  }
}
```

Multi-Match Query:
- Use this to search for the same text in multiple fields.

```json
{
  "query": {
    "multi_match": {
      "query": "quick brown fox",
      "fields": ["title", "description"]
    }
  }
}
```

Match Phrase Query:
- Use this to search for an exact phrase.
- It takes the order and proximity of the words into account.

```json
{
  "query": {
    "match_phrase": {
      "description": "quick brown fox"
    }
  }
}
``` 
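
The difference between match and match_phrase can be pictured with a toy model. This is only a sketch under simplifying assumptions: tokens() is a hypothetical analyzer that lowercases and splits on whitespace, toy_match() models the default OR behavior of match, and toy_match_phrase() models a phrase with no slop. The sample sentence is made up.

```python
def tokens(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def toy_match(document, query):
    # match: with the default OR operator, any one query term is enough.
    doc_terms = set(tokens(document))
    return any(term in doc_terms for term in tokens(query))


def toy_match_phrase(document, query):
    # match_phrase: all terms must appear next to each other, in order.
    doc, phrase = tokens(document), tokens(query)
    return any(doc[i:i + len(phrase)] == phrase
               for i in range(len(doc) - len(phrase) + 1))


text = 'The brown fox is quick'
print(toy_match(text, 'quick brown fox'))         # True
print(toy_match_phrase(text, 'quick brown fox'))  # False
print(toy_match_phrase(text, 'brown fox'))        # True
```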

Query String Query:
- Supports complex search syntax (AND, OR, NOT, and so on).
- Useful when users enter search queries directly.

```json
{
  "query": {
    "query_string": {
      "default_field": "description",
      "query": "quick AND fox OR (brown AND dog)"
    }
  }
}
``` 

Prefix Query:
- Finds words that start with a given prefix.

```json
{
  "query": {
    "prefix": {
      "title": "qu"
    }
  }
}
```

Wildcard Query:
- Supports pattern matching with wildcard characters (*, ?).

```json
{
  "query": {
    "wildcard": {
      "description": "qu*k"
    }
  }
}
```

Simple Query String Query:
- A simplified version of the query string query.
- It is more tolerant of syntax errors.

```json
{
  "query": {
    "simple_query_string": {
      "fields": ["title", "description"],
      "query": "quick brown +fox -dog"
    }
  }
}
```



Fuzzy Query:
- Performs an approximate search that tolerates spelling mistakes.

```json
{
  "query": {
    "fuzzy": {
      "description": {
        "value": "quik",
        "fuzziness": "AUTO"
      }
    }
  }
}
```
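
To make the idea of fuzziness concrete, here is a small sketch of edit-distance matching. It is a simplification, not how Elasticsearch is implemented: edit_distance() is plain Levenshtein distance (Elasticsearch by default also counts a swap of two adjacent characters as a single edit), and auto_fuzziness() mirrors the default length thresholds of "AUTO". All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Edits allowed by "AUTO" with its default length thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def toy_fuzzy_match(indexed_term, query_term):
    allowed = auto_fuzziness(query_term)
    return edit_distance(indexed_term, query_term) <= allowed


print(toy_fuzzy_match('quick', 'quik'))  # True
print(toy_fuzzy_match('quack', 'quik'))  # False
```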






## Specifying a Filter
Before the filtered query can be implemented, it is necessary to add a way for end users to enter a desired filter. The solution implemented in this tutorial will look for a category:<category-name> pattern in the text of the search query. Let's add a function called extract_filters() to app.py to look for filter expressions:


```python
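import re


def extract_filters(query):
    filters = []

    # Look for a category:<category-name> token anywhere in the query
    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        # Exact match on the keyword variant of the category field
        filters.append({
            'term': {
                'category.keyword': {
                    'value': m.group(1)
                }
            }
        })
        # Remove the filter expression from the search text
        query = re.sub(filter_regex, '', query).strip()

    return {'filter': filters}, query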
```

The function accepts the query entered by the user and returns a tuple with the filters that were found in the query, and the modified query after the filters were removed. To look for the filter pattern it uses a regular expression. The function is designed to be expanded with additional filters.

When a filter is found, the filters list is extended with a corresponding filter expression, which in this case is based on the term query, as discussed above.

To better understand how this function works, start a Python session (make sure the virtual environment is activated first) and run the following code:

```python
from app import extract_filters
extract_filters('this is the search text category:sharepoint')

```

The returned tuple from the function should be:

({'filter': [{'term': {'category.keyword': {'value': 'sharepoint'}}}]}, 'this is the search text')



## Implementing the Filtered Search

What remains to do is to change the handle_search() function to send an updated query that combines the full-text search expression with a filter, if one is given by the user. Below is the new version of this function:

```python
@app.post('/')
def handle_search():
    query = request.form.get('query', '')
    filters, parsed_query = extract_filters(query)
    from_ = request.form.get('from_', type=int, default=0)

    results = es.search(
        query={
            'bool': {
                'must': {
                    'multi_match': {
                        'query': parsed_query,
                        'fields': ['name', 'summary', 'content'],
                    }
                },
                **filters
            }
        },
        size=5,
        from_=from_
    )
    return render_template('index.html', results=results['hits']['hits'],
                           query=query, from_=from_,
                           total=results['hits']['total']['value'])

```

The query has now been changed to send a bool expression, and the search expression was moved inside a must section under it. The extract_filters() function returns the filter portion of the query in the form it needs to be sent to Elasticsearch, so it is inserted in the query dictionary also under the top-level bool key.

Try a search query such as work from home category:sharepoint to see how only documents from the given category are returned.



## Range Filters
Elasticsearch supports a variety of filters besides the term filter. Another one that is commonly used is the range filter, which works with numbers and dates. Let's add a year filter that can be used to restrict results based on the year they were last updated, which is given in the updated_at field.

Below is an updated version of the extract_filters() function that looks for both category:<category> and year:<yyyy> as filters:

```python
def extract_filters(query):
    filters = []

    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'term': {
                'category.keyword': {
                    'value': m.group(1)
                }
            },
        })
        query = re.sub(filter_regex, '', query).strip()

    filter_regex = r'year:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'range': {
                'updated_at': {
                    'gte': f'{m.group(1)}||/y',
                    'lte': f'{m.group(1)}||/y',
                }
            },
        })
        query = re.sub(filter_regex, '', query).strip()

    return {'filter': filters}, query

```

This version adds a second regular expression to find year:yyyy in the query string. It creates a range filter for the updated_at field, and sets the low and high bounds of the range to the year that is given after the colon, which is captured in the regular expression match as m.group(1).

There is a small complication, because the updated_at field contains full dates, and this filter only needs to look at the year. Luckily, when the range filter is used with a date field, the bounds of the range can be enhanced with date math. The ||/y suffix that is added to the gte (lower bound) and lte (upper bound) parameters of the range indicates that the given value is a year that must be completed to form a full date that can be compared against the field.

With this change, you can include a query such as year:2020 work from home to see results from the requested year only. The query can include the two filters as well, for example year:2020 category:teams work from home.
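
To see what the ||/y rounding amounts to, the bounds can be approximated in plain Python. This is a sketch of the documented rounding behavior (gte rounds down to the start of the year, lte rounds up to its last millisecond), not code from the application; year_bounds() is a hypothetical helper and time zones are ignored.

```python
from datetime import datetime, timedelta


def year_bounds(year):
    """Approximate the range selected by gte/lte of '<year>||/y'."""
    # gte rounds down to the first instant of the year.
    start = datetime(int(year), 1, 1)
    # lte rounds up to the last millisecond of the year.
    end = datetime(int(year) + 1, 1, 1) - timedelta(milliseconds=1)
    return start, end


start, end = year_bounds('2020')
print(start.isoformat())                       # 2020-01-01T00:00:00
print(end.isoformat(timespec='milliseconds'))  # 2020-12-31T23:59:59.999
print(start <= datetime(2020, 7, 15) <= end)   # True
print(start <= datetime(2021, 1, 1) <= end)    # False
```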



## The Match-All Query
Before moving on to a new topic, try entering only a filter in the search query text field, for example category:github. Unfortunately this does not return any results, but the expected behavior in this case would be to receive all the results that match the requested category.

What happens is that the extract_filters() function returns a tuple with the filter(s) in the first element and an empty query string in the second element. The multi_match query receives the empty string, and returns an empty list of results, because nothing matches an empty string.

To address this special case, the multi_match query can be replaced with match_all when the search text is empty. The version of the handle_search() function below adds logic to do this. Update the function in app.py.


```python
@app.post('/')
def handle_search():
    query = request.form.get('query', '')
    filters, parsed_query = extract_filters(query)
    from_ = request.form.get('from_', type=int, default=0)

    if parsed_query:
        search_query = {
            'must': {
                'multi_match': {
                    'query': parsed_query,
                    'fields': ['name', 'summary', 'content'],
                }
            }
        }
    else:
        search_query = {
            'must': {
                'match_all': {}
            }
        }

    results = es.search(
        query={
            'bool': {
                **search_query,
                **filters
            }
        },
        size=5,
        from_=from_
    )
    return render_template('index.html', results=results['hits']['hits'],
                           query=query, from_=from_,
                           total=results['hits']['total']['value'])

```

With this version, you can ask for all the documents that match a category. Note how all the results that are returned come back with the same score of 1.0, because there are no search terms to compute scores.
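
The query-building logic can also be checked on its own, without Flask or a running Elasticsearch service. The sketch below pulls it into a hypothetical build_bool_query() helper (this name is not part of the tutorial's code) and prints the request body that would be sent in each case.

```python
def build_bool_query(parsed_query, filters):
    """Combine the text search (or match_all) with the extracted filters."""
    if parsed_query:
        must = {
            'multi_match': {
                'query': parsed_query,
                'fields': ['name', 'summary', 'content'],
            }
        }
    else:
        # No search text is left after removing the filters: match everything
        must = {'match_all': {}}
    return {'bool': {'must': must, **filters}}


category_filter = {
    'filter': [{'term': {'category.keyword': {'value': 'github'}}}]
}
print(build_bool_query('', category_filter))
print(build_bool_query('work from home', {'filter': []}))
```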



## Faceted Search

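Faceted search is one of the powerful search features available in Elasticsearch. It lets users filter and organize search results by various categories or attributes (facets). 

Basic concepts:
- Faceted search presents search results classified along several dimensions.
- It makes it easy for users to explore and filter the search results.

Key features:
- Dynamic filtering: users can filter search results in real time.
- Multi-dimensional classification: results are classified by several attributes.
- Result summary: a document count is provided for each facet.

Common facet types:
- Category facets: product category, brand, and so on
- Numeric range facets: price ranges, date ranges, and so on
- Tag facets: keywords, tags, and so on

Usage example: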
```json
{
  "query": {
    "match": {
      "description": "laptop"
    }
  },
  "aggs": {
    "brands": {
      "terms": {
        "field": "brand.keyword"
      }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}
```

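Real-world applications:
- E-commerce: filter products by brand, price range, color, and so on
- Book search: filter by author, publication year, genre, and so on
- Hotel booking: filter by location, price, star rating, and so on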








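## Dynamic Filtering in Faceted Search

Core concepts of dynamic filtering:
- Real-time updates: each time the user selects a filter, the search results and the other facets are updated immediately.
- Interdependent facets: a selection in one facet affects the options available in the other facets.
- Filter combination: the filters selected across several facets are combined and applied together.


Implementation:
- 1. Applying filters:
  - Each time the user selects a facet, add the corresponding filter to the query.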

```json
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "laptop" } }
      ],
      "filter": [
        { "term": { "brand": "BrandA" } },
        { "range": { "price": { "gte": 500, "lt": 1000 } } }
      ]
    }
  },
  "aggs": {
    // aggregations here
  }
}
```

- 2. Dynamic aggregations: use filter aggregations to keep accurate counts for the facets other than the one whose filter is selected.

```json
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "laptop" } }
      ],
      "filter": [
        { "term": { "brand": "BrandA" } },
        { "range": { "price": { "gte": 500, "lt": 1000 } } }
      ]
    }
  },
  "aggs": {
    "all_categories": {
      "global": {},
      "aggs": {
        "categories": {
          "terms": { "field": "category" }
        }
      }
    },
    "filtered_brands": {
      "filter": {
        "range": { "price": { "gte": 500, "lt": 1000 } }
      },
      "aggs": {
        "brands": {
          "terms": { "field": "brand" }
        }
      }
    },
    "price_ranges": {
      "filter": {
        "term": { "brand": "BrandA" }
      },
      "aggs": {
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              { "to": 500 },
              { "from": 500, "to": 1000 },
              { "from": 1000 }
            ]
          }
        }
      }
    }
  }
}
```

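- 3. Filter removal: let the user remove a filter, and delete that filter from the query when it is removed.

- 4. Browser state management: keep the selected filters in URL parameters or browser state so that the filter state is preserved after a page reload.

- 5. Caching: caching the results of frequently used filter combinations can improve response times.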



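In summary, providing dynamic filtering requires two things: 
- Add the filter conditions selected by the user to the filter part of the query: this filters the search results according to the user's selections.
- Add a filter for each facet in the aggregations (aggs) part: this keeps the counts of the other facets accurate.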
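
The second point can be sketched in Python: for each facet, wrap its aggregation in a filter aggregation built from the filters selected on the other facets. This is only an illustration under assumptions. build_facet_aggs() is a hypothetical helper, and the facet names and fields are borrowed from the JSON example above.

```python
def build_facet_aggs(facets, selected):
    """Wrap each facet's aggregation in a filter built from the other facets.

    facets: facet name -> aggregation body
    selected: facet name -> filter query chosen by the user for that facet
    """
    aggs = {}
    for name, agg_body in facets.items():
        others = [query for facet, query in selected.items() if facet != name]
        if others:
            facet_filter = {'bool': {'filter': others}}
        else:
            # Nothing else is selected, so this facet counts every document
            facet_filter = {'match_all': {}}
        aggs[name] = {'filter': facet_filter, 'aggs': {name: agg_body}}
    return aggs


facets = {
    'brands': {'terms': {'field': 'brand'}},
    'price_ranges': {'range': {'field': 'price', 'ranges': [
        {'to': 500}, {'from': 500, 'to': 1000}, {'from': 1000}]}},
}
selected = {
    'brands': {'term': {'brand': 'BrandA'}},
    'price_ranges': {'range': {'price': {'gte': 500, 'lt': 1000}}},
}
aggs = build_facet_aggs(facets, selected)
print(aggs['brands']['filter'])
print(aggs['price_ranges']['filter'])
```

With these inputs, the brands facet is restricted only by the price range and the price facet only by the brand, so a facet's own selection does not hide its other options.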




## Term Aggregations

In Elasticsearch, faceted search is implemented using the aggregations feature. One of the supported aggregations divides the search results into buckets, based on some criteria. The list of buckets, each including the number of documents it contains, is going to be used to render the facets sidebar.

The simplest type of bucket aggregation is the one in which buckets are defined for each keyword. This type, which is called the terms aggregation, is perfect for creating the buckets for the category field. Here is the search request from the application, expanded to ask for category aggregations:


```python
results = es.search(
    query={
        'bool': {
            **search_query,
            **filters
        }
    },
    aggs={
        'category-agg': {
            'terms': {
                'field': 'category.keyword',
            }
        },
    },
    size=5,
    from_=from_
)

```
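
The buckets come back under the aggregations key of the response, keyed by the name given in the request. The snippet below shows one way to turn them into a simple mapping for the sidebar. The results dictionary here is a made-up fragment that follows the shape of a terms aggregation response, and the category names and counts are illustrative only.

```python
# Made-up response fragment in the shape of a terms aggregation result;
# the keys and counts are illustrative only.
results = {
    'aggregations': {
        'category-agg': {
            'buckets': [
                {'key': 'sharepoint', 'doc_count': 5},
                {'key': 'teams', 'doc_count': 3},
                {'key': 'github', 'doc_count': 2},
            ]
        }
    }
}

# Map each category to its document count, ready for a facets sidebar.
category_counts = {
    bucket['key']: bucket['doc_count']
    for bucket in results['aggregations']['category-agg']['buckets']
}
print(category_counts)
```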


## Aggregations 

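The aggregations feature of Elasticsearch is a powerful tool for grouping data and computing statistics. 

Basic concepts:
- Aggregations group data and compute metrics.
- They can be used together with a search query or on their own.
- They enable complex data analysis and summaries.

Main types:
- Bucket Aggregations:
  - Divide the data into groups (buckets).
  - Examples: terms, date_histogram, range

- Metric Aggregations: 
  - Perform calculations on numeric fields.
  - Examples: avg, sum, min, max, cardinality

- Pipeline Aggregations:
  - Use the results of other aggregations as input.
  - Examples: avg_bucket, sum_bucket, cumulative_sum

Main Bucket Aggregations:
- Terms: groups by the unique values of a field
- Date Histogram: groups by a date/time field
- Range: groups by numeric or date ranges
- Filters: groups by predefined filters

Main Metric Aggregations:
- Avg: computes the average
- Sum: computes the sum
- Min/Max: finds the minimum/maximum value
- Cardinality: counts the number of unique values (approximate)
- Percentiles: computes percentiles


Usage example: 
- This example computes the five most popular colors and the average price for each color.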
 
```json
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color",
        "size": 5
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

```
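
What this request computes can be emulated in plain Python over a handful of documents. This is a sketch of the idea only: a bucket step followed by a metric step per bucket. The docs list is made-up sample data, and no Elasticsearch call is involved.

```python
from collections import defaultdict

# Made-up documents standing in for the contents of an index.
docs = [
    {'color': 'red', 'price': 10},
    {'color': 'red', 'price': 30},
    {'color': 'blue', 'price': 20},
    {'color': 'green', 'price': 50},
    {'color': 'blue', 'price': 40},
    {'color': 'red', 'price': 20},
]

# Bucket step (terms): group the prices by each distinct color.
prices_by_color = defaultdict(list)
for doc in docs:
    prices_by_color[doc['color']].append(doc['price'])

# Keep the buckets with the most documents, like "size": 5.
top = sorted(prices_by_color.items(),
             key=lambda item: len(item[1]), reverse=True)[:5]

# Metric step (avg): compute one number per bucket.
summary = [(color, len(prices), sum(prices) / len(prices))
           for color, prices in top]
for color, doc_count, avg_price in summary:
    print(color, doc_count, avg_price)
```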

## Buckets 

Buckets are one of the core concepts of the aggregations feature. A bucket can be thought of as a container that groups documents according to some criterion.