# AWS Athena

### Introduction

In this lesson, we'll walk through some code for querying Athena.  Please use your code-reading skills as you work through it.  That means placing breakpoints to see the values of different variables, and trying to call certain functions yourself.

### Our starting point

You can see that we have two folders -- our `extract_and_load` folder and our `query_bucket` folder.  And we also have a `src/console.py` file.  

The `console.py` file is where we can perform our Athena queries.  We do this through the `query_results` function.  As you can see, we simply query our input bucket (where our original CSV file was stored) as if it were a table.

```python
query_results("SELECT * FROM jigsawtexasquery limit 3")
```

You can try this yourself, but you will need to make a couple of tweaks.  

* `output_bucket`
    * The first is the `output_bucket`.  Remember that with Athena, when we make a query, the query results get sent to an output bucket.

* `db_name`
    * The second is the `db_name`.  This is the name of the database, set up through [AWS Lake Formation](https://aws.amazon.com/lake-formation/), that Athena uses to query the S3 bucket.
    
If you change those variables to point to your `output_bucket` and `db_name`, you should then be able to query your bucket with the input data. 

Give that a shot.  Change the values of those variables, and then run `python3 -i console.py` to query the bucket.

### How did it work

Ok, so now that we queried our bucket, let's take a deeper look at our `query_results` function.  We can see that the function calls two other functions, `query_athena` and `get_query_results`.

Understanding why we have two functions depends on understanding how Athena works.  Remember that with Athena, when we query our S3 bucket, the results of that query are stored in a different bucket.  For that reason, we have one function, `query_athena`, that performs the query and specifies the `output_bucket`.  And we have another function, `get_query_results`, that then retrieves the results from that output bucket.

Ok, now let's dig deeper into each of these functions.  You can find them both in `query_bucket/athena_boto.py`.

* `query_athena` 

This function takes the Athena `query`, the `db_name`, and the `output_bucket_folder`, where we'll place the results of the query.  Notice that it returns a `response`.  This response **does not** contain the `results` of the query itself.  Rather, it contains some metadata about the request.  For example, this is what is returned.

```python
{'QueryExecutionId': '476cf070-6ffd-454a-9ca9-686f23c20b46', 'ResponseMetadata': {'RequestId': '9b0c1adf-f044-492a-9db5-b295cffc1ace', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 23 May 2023 17:24:08 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '59', 'connection': 'keep-alive', 'x-amzn-requestid': '9b0c1adf-f044-492a-9db5-b295cffc1ace'}, 'RetryAttempts': 0}}
```

We can see that the request was successful (an `HTTPStatusCode` of 200), and that there is also a `QueryExecutionId`.  

Athena names the results file after that `QueryExecutionId`, with a `.csv` extension.  We can see this by going to the S3 results bucket.

<img src="./s3-results.png" width="100%">
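To make the shape of `query_athena` concrete, here is a minimal sketch of how such a function might call boto3's `start_query_execution`.  It is not the code in `query_bucket/athena_boto.py`: passing the client in as `athena_client`, the `results_key` helper, and the assumption that `output_bucket_folder` is an S3 URI such as `s3://my-output-bucket/results/` are all illustrative choices.

```python
def query_athena(athena_client, query, db_name, output_bucket_folder):
    """Start an Athena query and return the response (metadata only).

    athena_client is assumed to come from boto3.client("athena").
    output_bucket_folder is assumed to be an S3 URI, e.g. "s3://my-bucket/results/".
    """
    return athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": db_name},
        ResultConfiguration={"OutputLocation": output_bucket_folder},
    )


def results_key(response, prefix=""):
    """Hypothetical helper: build the S3 key of the CSV Athena writes for this query."""
    return f"{prefix}{response['QueryExecutionId']}.csv"
```

With a real client, `results_key(response, "results/")` would give the key to look for in the output bucket once the query has finished.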

* `get_query_results`

Ok, so the main point of `get_query_results` is just to read our data from the output bucket.  However, you will see a `while True` loop with a `try` `except` block.  In that block we repeatedly catch the exception that is raised when the query has not yet finished.  So that's the purpose of our `try` `except` block -- to keep checking until our query has completed.  

Then when it has, we call the `read_from_bucket` function at the end.
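The waiting step can also be written by asking Athena for the query's status instead of catching an exception.  The sketch below uses that different approach, with boto3's `get_query_execution` call; the function name `wait_for_query` and the `delay` and `max_attempts` parameters are assumptions for illustration, not part of the lesson's code.

```python
import time


def wait_for_query(athena_client, query_execution_id, delay=1.0, max_attempts=60):
    """Poll Athena until the query reaches a final state.

    athena_client is assumed to come from boto3.client("athena").
    Returns "SUCCEEDED", or raises if the query fails or we give up waiting.
    """
    for _ in range(max_attempts):
        execution = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return state
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {query_execution_id} ended as {state}")
        # Still QUEUED or RUNNING, so pause before checking again.
        time.sleep(delay)
    raise TimeoutError(f"Query {query_execution_id} did not finish in time")
```

Capping the number of attempts avoids the endless loop that a bare `while True` can produce when a query never completes.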

* `read_from_bucket`

The `read_from_bucket` function calls `s3.get_object`, providing the bucket name and the path to the file as the key.  The data that comes back is raw bytes, which `pd.read_csv` cannot parse directly because it expects a file path or a file-like object.  So we wrap the bytes in an in-memory file-like object with the line `csv = BytesIO(data)`, and then pass that object to `pd.read_csv` to return a dataframe.
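As a rough sketch of that read step (the lesson's real function may differ), the version below takes the S3 client as an argument, which is an assumption made so the example is easy to try with or without AWS credentials.

```python
from io import BytesIO

import pandas as pd


def read_from_bucket(s3_client, bucket_name, key):
    """Download a CSV object from S3 and return it as a dataframe.

    s3_client is assumed to come from boto3.client("s3").
    """
    obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    data = obj["Body"].read()  # raw bytes of the CSV file
    csv = BytesIO(data)        # wrap the bytes in a file-like object
    return pd.read_csv(csv)
```

Here `key` would be the path to the results file, which ends in the `QueryExecutionId` followed by `.csv`.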

### Putting it all together

Ok, so we just went through querying Athena.  But remember that this is part of a broader data pipeline, which you can see in the `extract_load` folder.  If you look at the `extract_load/upload_console.py` file, you can see how we got that data into our bucket in the first place.

```python
# upload_console.py
restaurant_name = 'HONDURAS MAYA CAFE & BAR LLC'
df, file_name = request_and_download_locally(restaurant_name)
uploaded_text = upload_and_read(file_name, query_bucket_name)
```

We did so by first making a request to the API, downloading the results into a CSV file, and then uploading that file to our S3 bucket.  From there, we created a data lake that had access to this S3 bucket.  And we used Athena to query the bucket, storing the query results in a separate bucket.

### Resources

[AWS Athena](https://www.sqlshack.com/an-introduction-to-aws-athena/)

[Athena pros and cons](https://towardsaws.com/aws-athena-why-is-it-different-than-mysql-93d55fd4a757)

[AWS permissions](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html)

[S3 permissions](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html)

[boto bucket policy](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-bucket-policies.html)