OpenSearch Support #880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
OpenSearch Support #880
Commits (49):
- `0a83644` [skip ci] elasticsearch support init: structure and skeleton code
- `947119f` [skip ci] rename elasticsearch->opensearch
- `4534d7a` [skip ci] merge Assaf and Murali forks
- `4e8f4e3` [skip ci] fixed filter_path pandasticsearch issue
- `7a010cd` [skip ci] disable scan for now
- `4bfbfd6` Merge branch 'main' of https://github.com/awslabs/aws-data-wrangler i…
- `79e0a9a` [skip ci] path documentation
- `f07e698` [skip ci] add delete_index
- `7d7318b` [skip ci] add delete_index
- `6b90c93` [skip ci] add index_json
- `73db6f5` [skip ci] add index_csv local path
- `15d8aca` [skip ci] add is_scroll to search (scan)
- `e01b1a0` [skip ci] add search_by_sql
- `1e1fe37` [skip ci] opensearch test infra
- `d574341` [skip ci] index create/delete ignore exceptions
- `7bb6779` [skip ci] index_documents documents type
- `75a2701` [skip ci] removed pandasticsearch dependency
- `cea9abb` [skip ci] port typo
- `f6c7dd4` [skip ci] enforced_pandas_params
- `517a3a6` Merge branch 'main' of https://github.com/awslabs/aws-data-wrangler i…
- `030e21c` [skip ci] isort & black
- `950231d` Added OpenSearch tutorial
- `9829755` (mureddy29) typing fixes
- `0120e31` [skip ci] isort
- `b4700f6` [skip ci] black opensearch
- `51b8110` [skip ci] opensearch validation
- `cdf7dc7` Merge branch 'main' of https://github.com/awslabs/aws-data-wrangler i…
- `39457fc` [skip ci] opensearch: poetry add requests-aws4auth and elasticsearch
- `7be5062` [skip ci] opensearch: add support for host with schema http/https
- `cb8656c` Update 031 - OpenSearch.ipynb
- `22b5e9b` (mureddy29) [skip ci] opensearch: index_documents 429 error
- `c5092a2` [skip ci] opensearch: add jsonpath_ng library
- `8bd8985` Merge branch 'elasticsearch-support' of https://github.com/AssafMentz…
- `97a35bd` [skip ci] opensearch: renamed fgac user/password
- `a73d875` [skip ci] opensearch: add connection timeout
- `ed7a57c` opensearch: get_credentials_from_session
- `aaf8943` Merge branch 'main' of https://github.com/awslabs/aws-data-wrangler i…
- `545e163` [skip ci] opensearch: indexing progressbar
- `6042ae4` [skip ci] opensearch.index_documents.max_retries default 5
- `c53cd6f` opensearch: replace elasticsearch-py with opensearch-py low-level client
- `5c5d717` [skip ci] opensearch filter_path default value
- `152c407` [skip ci] opensearch tutorial
- `419a5ce` Merge branch 'main' into elasticsearch-support
- `808cf09` (jaidisido) Merge branch 'main' into elasticsearch-support
- `53dff4b` (jaidisido) Minor - Pylint
- `c6e6d80` (jaidisido) [skip ci] opensearch: pylint f-string and file open encoding
- `2307100` [skip ci] opensearch: pylint f-string
- `29f892c` opensearch: add to CONTRIBUTING.md
- `827c3bf` opensearch: update aws-cdk packages to have the same minimum version
awswrangler/opensearch/__init__.py (new file, +17 lines):

```python
"""Utilities Module for Amazon OpenSearch."""

from awswrangler.opensearch._read import search, search_by_sql
from awswrangler.opensearch._utils import connect
from awswrangler.opensearch._write import create_index, delete_index, index_csv, index_df, index_documents, index_json

__all__ = [
    "connect",
    "create_index",
    "delete_index",
    "index_csv",
    "index_documents",
    "index_df",
    "index_json",
    "search",
    "search_by_sql",
]
```
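For orientation, here is a minimal end-to-end sketch of the public API exported above. The domain endpoint, index name, and sample DataFrame are placeholders, and `index_df` lives in `_write.py`, which is not shown in this excerpt, so its exact signature is assumed from tutorial-style usage:

```python
import pandas as pd

import awswrangler as wr

# Connect to an Amazon OpenSearch domain (placeholder endpoint).
client = wr.opensearch.connect(host="my-test-domain.us-east-1.es.amazonaws.com")

# Index a small DataFrame, then query it back with the Query DSL.
df = pd.DataFrame({"title": ["Gone with the Wind"], "year": [1939]})
wr.opensearch.index_df(client, df=df, index="movies")

hits = wr.opensearch.search(
    client,
    index="movies",
    search_body={"query": {"match": {"title": "wind"}}},
)
print(hits)
```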
awswrangler/opensearch/_read.py (new file, +169 lines):

```python
"""Amazon OpenSearch Read Module (PRIVATE)."""

from typing import Any, Collection, Dict, List, Mapping, Optional, Union

import pandas as pd
from opensearchpy import OpenSearch
from opensearchpy.helpers import scan

from awswrangler.opensearch._utils import _get_distribution


def _resolve_fields(row: Mapping[str, Any]) -> Mapping[str, Any]:
    fields = {}
    for field in row:
        if isinstance(row[field], dict):
            nested_fields = _resolve_fields(row[field])
            for n_field, val in nested_fields.items():
                fields[f"{field}.{n_field}"] = val
        else:
            fields[field] = row[field]
    return fields


def _hit_to_row(hit: Mapping[str, Any]) -> Mapping[str, Any]:
    row: Dict[str, Any] = {}
    for k in hit.keys():
        if k == "_source":
            solved_fields = _resolve_fields(hit["_source"])
            row.update(solved_fields)
        elif k.startswith("_"):
            row[k] = hit[k]
    return row


def _search_response_to_documents(response: Mapping[str, Any]) -> List[Mapping[str, Any]]:
    return [_hit_to_row(hit) for hit in response["hits"]["hits"]]


def _search_response_to_df(response: Union[Mapping[str, Any], Any]) -> pd.DataFrame:
    return pd.DataFrame(_search_response_to_documents(response))


def search(
    client: OpenSearch,
    index: Optional[str] = "_all",
    search_body: Optional[Dict[str, Any]] = None,
    doc_type: Optional[str] = None,
    is_scroll: Optional[bool] = False,
    filter_path: Optional[Union[str, Collection[str]]] = None,
    **kwargs: Any,
) -> pd.DataFrame:
    """Return results matching query DSL as pandas DataFrame.

    Parameters
    ----------
    client : OpenSearch
        Instance of opensearchpy.OpenSearch to use.
    index : str, optional
        A comma-separated list of index names to search.
        Use `_all` or an empty string to perform the operation on all indices.
    search_body : Dict[str, Any], optional
        The search definition using the [Query DSL](https://opensearch.org/docs/opensearch/query-dsl/full-text/).
    doc_type : str, optional
        Name of the document type (for Elasticsearch versions 5.x and earlier).
    is_scroll : bool, optional
        Allows retrieving a large number of results from a single search request using
        [scroll](https://opensearch.org/docs/opensearch/rest-api/scroll/),
        for example, for machine learning jobs.
        Because scroll search contexts consume a lot of memory, we suggest you don't use the scroll operation
        for frequent user queries.
    filter_path : Union[str, Collection[str]], optional
        Use the filter_path parameter to reduce the size of the OpenSearch Service response \
        (default: ['hits.hits._id', 'hits.hits._source']).
    **kwargs :
        KEYWORD arguments forwarded to [opensearchpy.OpenSearch.search]\
        (https://opensearch-py.readthedocs.io/en/latest/api.html#opensearchpy.OpenSearch.search)
        and also to [opensearchpy.helpers.scan](https://opensearch-py.readthedocs.io/en/master/helpers.html#scan)
        if `is_scroll=True`.

    Returns
    -------
    Union[pandas.DataFrame, Iterator[pandas.DataFrame]]
        Results as Pandas DataFrame.

    Examples
    --------
    Searching an index using query DSL

    >>> import awswrangler as wr
    >>> client = wr.opensearch.connect(host='DOMAIN-ENDPOINT')
    >>> df = wr.opensearch.search(
    ...     client=client,
    ...     index='movies',
    ...     search_body={
    ...         "query": {
    ...             "match": {
    ...                 "title": "wind"
    ...             }
    ...         }
    ...     }
    ... )

    """
    if doc_type:
        kwargs["doc_type"] = doc_type

    if filter_path is None:
        filter_path = ["hits.hits._id", "hits.hits._source"]

    if is_scroll:
        if isinstance(filter_path, str):
            filter_path = [filter_path]
        filter_path = ["_scroll_id", "_shards"] + list(filter_path)  # required for scroll
        documents_generator = scan(client, index=index, query=search_body, filter_path=filter_path, **kwargs)
        documents = [_hit_to_row(doc) for doc in documents_generator]
        df = pd.DataFrame(documents)
    else:
        response = client.search(index=index, body=search_body, filter_path=filter_path, **kwargs)
        df = _search_response_to_df(response)
    return df


def search_by_sql(client: OpenSearch, sql_query: str, **kwargs: Any) -> pd.DataFrame:
    """Return results matching [SQL query](https://opensearch.org/docs/search-plugins/sql/index/) as pandas DataFrame.

    Parameters
    ----------
    client : OpenSearch
        Instance of opensearchpy.OpenSearch to use.
    sql_query : str
        SQL query.
    **kwargs :
        KEYWORD arguments forwarded to the request URL (e.g. filter_path).

    Returns
    -------
    Union[pandas.DataFrame, Iterator[pandas.DataFrame]]
        Results as Pandas DataFrame.

    Examples
    --------
    Searching an index using SQL query

    >>> import awswrangler as wr
    >>> client = wr.opensearch.connect(host='DOMAIN-ENDPOINT')
    >>> df = wr.opensearch.search_by_sql(
    ...     client=client,
    ...     sql_query='SELECT * FROM my-index LIMIT 50'
    ... )

    """
    if _get_distribution(client) == "opensearch":
        url = "/_plugins/_sql"
    else:
        url = "/_opendistro/_sql"

    kwargs["format"] = "json"
    body = {"query": sql_query}
    for size_att in ["size", "fetch_size"]:
        if size_att in kwargs:
            body["fetch_size"] = kwargs[size_att]
            del kwargs[size_att]  # unrecognized parameter
    response = client.transport.perform_request(
        "POST", url, headers={"Content-Type": "application/json"}, body=body, params=kwargs
    )
    df = _search_response_to_df(response)
    return df
```
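A short usage sketch of the two readers above; the endpoint, index, and field names are placeholders. `is_scroll=True` routes through `opensearchpy.helpers.scan`, while `search_by_sql` POSTs to the cluster's SQL endpoint:

```python
import awswrangler as wr

client = wr.opensearch.connect(host="my-test-domain.us-east-1.es.amazonaws.com")

# Scroll through a large result set; _scroll_id and _shards are appended to
# filter_path internally when is_scroll=True.
big_df = wr.opensearch.search(
    client,
    index="movies",
    search_body={"query": {"match_all": {}}},
    is_scroll=True,
    filter_path=["hits.hits._id", "hits.hits._source"],
)

# The same index queried through the SQL plugin
# (/_plugins/_sql on OpenSearch, /_opendistro/_sql otherwise).
sql_df = wr.opensearch.search_by_sql(
    client,
    sql_query="SELECT title, year FROM movies LIMIT 50",
)
```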
awswrangler/opensearch/_utils.py (new file, +108 lines):

```python
"""Amazon OpenSearch Utils Module (PRIVATE)."""

import logging
import re
from typing import Any, Optional

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

from awswrangler import _utils, exceptions

_logger: logging.Logger = logging.getLogger(__name__)


def _get_distribution(client: OpenSearch) -> Any:
    return client.info().get("version", {}).get("distribution", "elasticsearch")


def _get_version(client: OpenSearch) -> Any:
    return client.info().get("version", {}).get("number")


def _get_version_major(client: OpenSearch) -> Any:
    version = _get_version(client)
    if version:
        return int(version.split(".")[0])
    return None


def _strip_endpoint(endpoint: str) -> str:
    uri_schema = re.compile(r"https?://")
    return uri_schema.sub("", endpoint).strip().strip("/")


def connect(
    host: str,
    port: Optional[int] = 443,
    boto3_session: Optional[boto3.Session] = boto3.Session(),
    region: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> OpenSearch:
    """Create a secure connection to the specified Amazon OpenSearch domain.

    Note
    ----
    We use [opensearch-py](https://github.com/opensearch-project/opensearch-py), an OpenSearch low-level Python client.

    The username and password are mandatory if the OS Cluster uses [Fine Grained Access Control]\
    (https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html).
    If fine-grained access control is disabled, the session's access key and secret key are used.

    Parameters
    ----------
    host : str
        Amazon OpenSearch domain, for example: my-test-domain.us-east-1.es.amazonaws.com.
    port : int
        OpenSearch Service only accepts connections over port 80 (HTTP) or 443 (HTTPS).
    boto3_session : boto3.Session(), optional
        Boto3 Session. The default boto3 Session will be used if boto3_session receives None.
    region :
        AWS region of the Amazon OS domain. If not provided, it will be extracted from boto3_session.
    username :
        Fine-grained access control username. Mandatory if the OS Cluster uses Fine Grained Access Control.
    password :
        Fine-grained access control password. Mandatory if the OS Cluster uses Fine Grained Access Control.

    Returns
    -------
    opensearchpy.OpenSearch
        OpenSearch low-level client.
        https://github.com/opensearch-project/opensearch-py/blob/main/opensearchpy/client/__init__.py
    """
    valid_ports = {80, 443}

    if port not in valid_ports:
        raise ValueError(f"results: port must be one of {valid_ports}")

    if username and password:
        http_auth = (username, password)
    else:
        if region is None:
            region = _utils.get_region_from_session(boto3_session=boto3_session)
        creds = _utils.get_credentials_from_session(boto3_session=boto3_session)
        if creds.access_key is None or creds.secret_key is None:
            raise exceptions.InvalidArgument(
                "One of IAM Role or AWS ACCESS_KEY_ID and SECRET_ACCESS_KEY must be "
                "given. Unable to find ACCESS_KEY_ID and SECRET_ACCESS_KEY in boto3 "
                "session."
            )
        http_auth = AWS4Auth(creds.access_key, creds.secret_key, region, "es", session_token=creds.token)
    try:
        es = OpenSearch(
            host=_strip_endpoint(host),
            port=port,
            http_auth=http_auth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection,
            timeout=30,
            max_retries=10,
            retry_on_timeout=True,
        )
    except Exception as e:
        _logger.error("Error connecting to Opensearch cluster. Please verify authentication details")
        raise e
    return es
```
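The two authentication paths in `connect` can be exercised as follows; the endpoint, profile name, and fine-grained access control credentials are placeholders:

```python
import boto3

import awswrangler as wr

# 1) IAM-signed requests (fine-grained access control disabled): AWS4Auth is
#    built from the boto3 session's credentials and region.
session = boto3.Session(profile_name="analytics")  # placeholder profile
client = wr.opensearch.connect(
    host="https://my-test-domain.us-east-1.es.amazonaws.com",  # http(s):// prefix is stripped
    boto3_session=session,
    region="us-east-1",
)

# 2) Fine-grained access control enabled: pass the master username/password
#    instead of relying on the session credentials.
fgac_client = wr.opensearch.connect(
    host="my-test-domain.us-east-1.es.amazonaws.com",
    username="admin",          # placeholder FGAC user
    password="MyS3cretPass!",  # placeholder FGAC password
)

print(fgac_client.info())  # smoke test: cluster name, version, distribution
```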
Review comment: Is this needed for anything?
Reply: Yes, it is used in opensearch/_write.py's create_index method to support the deprecation of Elasticsearch mapping types.
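For context on that reply, here is a hypothetical sketch (not the PR's actual `_write.py` code) of how a major-version check such as `_get_version_major` could be used to emit the deprecated Elasticsearch mapping-type wrapper only for old clusters:

```python
from typing import Any, Dict, Optional


def build_index_body(
    mappings: Dict[str, Any], doc_type: Optional[str], major_version: Optional[int]
) -> Dict[str, Any]:
    """Hypothetical helper: wrap mappings in a doc type only for Elasticsearch < 7.

    Mapping types were removed in Elasticsearch 7.x and never existed in
    OpenSearch, so the doc_type wrapper is emitted only for older clusters.
    """
    if doc_type and major_version is not None and major_version < 7:
        return {"mappings": {doc_type: mappings}}
    return {"mappings": mappings}


# Example: an ES 6.x cluster keeps the type wrapper, OpenSearch does not.
# build_index_body({"properties": {"title": {"type": "text"}}}, "movie", 6)
# -> {"mappings": {"movie": {"properties": {"title": {"type": "text"}}}}}
```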