Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boto code for Athena internal partitioned tables #3812

Closed
2 tasks
jchowdhary opened this issue Aug 8, 2023 · 2 comments
Closed
2 tasks

boto code for Athena internal partitioned tables #3812

jchowdhary opened this issue Aug 8, 2023 · 2 comments
Assignees
Labels
feature-request This issue requests a feature. response-requested Waiting on additional information or feedback. service-api This issue is caused by the service API, not the SDK implementation.

Comments

@jchowdhary
Copy link

Describe the feature

I have an Athena table "mss_athena_intake" which contains 20 partitioned columns and 30 non partitioned columns and around 2 TB of data. From athena console when i make a select query call on partitioned columns for this table and if i give the limit for 200 records , the query when executed takes around 40-50 seconds. The same query when it is called from boto script , it takes around 1 min.

Now my recommendation is , Athena internally creates an internal partition table where only the partitioned columns exist. The table name for this one will be "mss_athena_intake$partitions" . The select query for only partitioned columns executed against this internal table hardly took 5-10 sec as compared to 40-50sec.

But from boto, if i make a select query on this Athena internal partitioned table, i get the error
{
"errorMessage": "An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Queries of this type are not supported",
"errorType": "InvalidRequestException",
"requestId": "121e08fd-256d-4600-88d6-ed33c47490a7",
"stackTrace": [
" File "/var/task/lambda_function.py", line 18, in lambda_handler\n response = client.start_query_execution(\n",
" File "/var/lang/lib/python3.11/site-packages/botocore/client.py", line 534, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
" File "/var/lang/lib/python3.11/site-packages/botocore/client.py", line 976, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
]
}

My suggestion is can't we allow these internal queries from boto as well when i am able to execute the same from Athena query console.. This will really benefit some complex processing and it will drastically reduce the processing time of the queries which are called on partitioned columns only.
I am adding below the boto code which i used in test lambda

import time
import boto3
import urllib

athena_tbl_name = urllib.parse.unquote_plus('mss_athena_intake%24partitions')

query = f"SELECT customer_id, customer_pid,device_id FROM mss_athena_database.`{athena_tbl_name}` limit 20;"

DATABASE = 'mss_athena_database'
output='s3://test123/'

def lambda_handler(event, context):
    client = boto3.client('athena')

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )
    print(f'The final response is {response}')
    return response
    return

Use Case

Boto will drastically reduce the query processing time for Athena table, if it makes direct query on Athena internally created partitioned table rather than actual Athena table.
Currently I get the below error when i try to execute the boto code
{
"errorMessage": "An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Queries of this type are not supported",
"errorType": "InvalidRequestException",
"requestId": "121e08fd-256d-4600-88d6-ed33c47490a7",
"stackTrace": [
" File "/var/task/lambda_function.py", line 18, in lambda_handler\n response = client.start_query_execution(\n",
" File "/var/lang/lib/python3.11/site-packages/botocore/client.py", line 534, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
" File "/var/lang/lib/python3.11/site-packages/botocore/client.py", line 976, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
]
}

Proposed Solution

My suggestion is can't we allow these internal queries from boto as well when i am able to execute the same from Athena query console.. This will really benefit some complex processing and it will drastically reduce the processing time of the queries which are called on partitioned columns only.

Other Information

I am adding below the boto code which i used in test lambda

import time
import boto3
import urllib

athena_tbl_name = urllib.parse.unquote_plus('mss_athena_intake%24partitions')

query = f"SELECT customer_id, customer_pid,device_id FROM mss_athena_database.`{athena_tbl_name}` limit 20;"

DATABASE = 'mss_athena_database'
output='s3://test123/'

def lambda_handler(event, context):
    client = boto3.client('athena')

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )
    print(f'The final response is {response}')
    return response
    return

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

SDK version used

boto 3, python 3.11(runtime)

Environment details (OS name and version, etc.)

windows 11

@jchowdhary jchowdhary added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Aug 8, 2023
@jchowdhary jchowdhary changed the title (short issue description) boto code for Athena internal partitioned tables Aug 8, 2023
@tim-finnigan tim-finnigan self-assigned this Aug 9, 2023
@tim-finnigan
Copy link
Contributor

Hi @jchowdhary thanks for reaching out. The Boto3 Athena commands correspond to Athena APIs defined upstream. Therefore requests like yours generally need to get forwarded to the service team. (If you have a support plan we recommend reaching out via AWS Support, otherwise we could forward your feature request to the Athena team for consideration.)

First I think we need a bit more clarity on what you are trying to do. You are using the start_query_execution command in a Lambda function and receiving the InvalidRequestException documented here. As noted in the documentation, this error:

Indicates that something is wrong with the input to the request. For example, a required parameter may be missing or out of range.

To investigate the underlying issue further we would want to review the debug logs (with any sensitive info redacted) which you can get by adding boto3.set_stream_logger('') to your script.


In regards to the query time performance I can share some resources that might help:

If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only from that partition. For more information, see Table location and partitions.

@tim-finnigan tim-finnigan added response-requested Waiting on additional information or feedback. service-api This issue is caused by the service API, not the SDK implementation. and removed needs-triage This issue or PR still needs to be triaged. labels Aug 9, 2023
@jchowdhary
Copy link
Author

jchowdhary commented Aug 10, 2023

Hey Tim..Thanks for the suggestion.. I will check with AWS Support Team.
Just for information, if we have a Athena table with partitioned columns and non partitioned columns and we make a SQL select query on that table involving the columns which are partitioned, then we have a good performance. But instead of making a call to Athena table, if we make an call to Athena internally created partitioned table where the internal table name will be "original tablename" suffixed with "$partitions". Like "mss-athena-intake" internal table name will be "mss-athena-intake$partitions". Response time for this query on internal Athena table will be 2x more faster than the original manuallly created Athena table.. Hope this information might help somebody..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request This issue requests a feature. response-requested Waiting on additional information or feedback. service-api This issue is caused by the service API, not the SDK implementation.
Projects
None yet
Development

No branches or pull requests

2 participants