# AWS Athena

### Introduction

In this lesson, we'll see how to work with AWS Athena.  Athena is a service that allows us to query our data directly from an S3 bucket.  Let's get started.

### Benefits of Athena

Athena allows us to query our S3 data without setting up or running a database like our RDS Postgres instance.

This has some benefits.  Remember that with a Postgres instance in RDS, we have to be careful about keeping the database running, especially if we are not using it very often.  With Athena, our data is stored in a file in S3, and nothing runs until we issue a query.

Another benefit of Athena is that we do not have to set up a traditional schema.  We can query unstructured data with Athena.

Still, Athena has its downsides.  It doesn't support typical database features like indexing, which is used to speed up queries.  And with Athena we pay per query, so the cost can add up if we run a lot of queries.  With a service like RDS we pay based on the size of the cluster, so the cost can be easier to predict.

For these reasons, Athena is typically used for a few ad hoc queries before moving the data to a database, or for some initial searching through unstructured data like log files.  You can read more about the pros and cons of Athena in the resources at the end of this lesson.  For now, let's start using it.

### Using Athena with S3

Athena allows us to query our JSON data directly from S3.  However, to accomplish this, our data needs to be in a specific form, which looks like the following.

```json
{"song": "royals", "artist": "lorde"}
{"song": "taxman", "artist": "the beatles"}
{"song": "paint it black", "artist": "rolling stones"}
```

For our JSON to work, each dictionary should be on a separate line in the S3 object, there should be no comma between the dictionaries, and there should be no square brackets at the beginning or end of the list of dictionaries.
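
One way to check a file against these rules before uploading it is sketched below. The helper name `is_json_lines` is hypothetical and not part of the lesson's code; it simply parses each line on its own and confirms that the result is a dictionary.

```python
import json


def is_json_lines(text):
    """Return True if every non-empty line of text is a standalone JSON object."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A trailing comma or an unclosed bracket makes the line invalid JSON.
            return False
        if not isinstance(record, dict):
            # A line such as [{"song": ...}] parses, but it is a list, not a dictionary.
            return False
    return True


good = '{"song": "royals", "artist": "lorde"}\n{"song": "taxman", "artist": "the beatles"}\n'
bad = '[{"song": "royals", "artist": "lorde"},\n{"song": "taxman", "artist": "the beatles"}]'

print(is_json_lines(good))
print(is_json_lines(bad))
```

The second string is a regular JSON list, with brackets and a comma between dictionaries, so the check rejects it.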

### Storing our data

If you look at the `src/console.py` and `index.py` files, you can see how we accomplish this.  Looking at the `console.py` file:

* We create a new bucket to store our data to query (you'll have to set a unique name).
* We retrieve a list of dictionaries with a call to `find_receipts`.
* We then call `build_in_mem_file`.  The function creates an in-memory file (with `io.StringIO`), adds each receipt to the file, and separates the receipts with a newline character.  It then calls `getvalue()` to retrieve the text from that file.
* Then if you uncomment the `put_object` method, you'll see that we add this set of dictionaries to a file in our bucket.
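
The steps above can be sketched roughly as follows. This is not the code from `console.py`: the body of `build_in_mem_file` is a guess at what such a function might look like, `upload_text` is a hypothetical helper, and the sample receipts stand in for whatever `find_receipts` returns.

```python
import io
import json


def build_in_mem_file(receipts):
    """Write each dictionary as one JSON line and return the combined text."""
    in_mem_file = io.StringIO()
    for receipt in receipts:
        in_mem_file.write(json.dumps(receipt))
        in_mem_file.write("\n")  # one dictionary per line, no commas or brackets
    return in_mem_file.getvalue()


def upload_text(s3_client, bucket_name, object_name, text):
    """Store the text as a single object; s3_client is assumed to be a boto3 S3 client."""
    return s3_client.put_object(
        Bucket=bucket_name, Key=object_name, Body=text.encode("utf-8")
    )


# Stand-in records; the lesson gets these from find_receipts.
receipts = [{"id": 1, "total": 12.5}, {"id": 2, "total": 7.0}]
text = build_in_mem_file(receipts)
print(text)
```

With real credentials, the upload would look something like `upload_text(boto3.client("s3"), "your-unique-bucket-name", "receipts.json", text)`, where the bucket and object names are placeholders.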

We can confirm that this works by then reading from the bucket.

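```python3
import boto3

s3 = boto3.client('s3')

bucket_name = ''  # the name of the bucket you created
object_name = ''  # the key of the object you uploaded
obj = s3.get_object(Bucket=bucket_name, Key=object_name)
text = obj['Body'].read()  # returns bytes; call .decode('utf-8') for a string
```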

Ok, so now that we have a bucket and an object to read from, the next step is to create a bucket to write to.  It turns out that Athena writes the results of our query to an object in a bucket, so let's create that too.

You can see in the `console.py` file that we have a couple of lines for creating just this bucket.
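
As a rough sketch of what those lines might do (the function name `create_results_bucket` and the bucket name in the usage note are assumptions, not taken from `console.py`), creating the bucket with a boto3 S3 client could look like this:

```python
def create_results_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create a bucket for Athena query results using a boto3 S3 client."""
    if region == "us-east-1":
        # us-east-1 is the default location and is not passed as a constraint.
        return s3_client.create_bucket(Bucket=bucket_name)
    # Any other region has to be named explicitly.
    return s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
```

It would be called with something like `create_results_bucket(boto3.client("s3"), "your-unique-results-bucket")`. As with the first bucket, the name has to be globally unique.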

### Setting up Athena

Ok, so the next step is to move on to Athena.

* To do so, first open the Athena service in the AWS console, and then click on settings, where we can specify the results bucket.
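
The results location can also be supplied from code rather than through the console. The sketch below assumes a boto3 Athena client and uses a hypothetical helper named `run_query`; the lesson itself only uses the console setting.

```python
def run_query(athena_client, query, results_bucket):
    """Start an Athena query and return its execution id.

    athena_client is assumed to be a boto3 Athena client, and results_bucket
    is the name of the bucket created for query results.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        # Athena writes the query output to this S3 location.
        ResultConfiguration={"OutputLocation": f"s3://{results_bucket}/"},
    )
    return response["QueryExecutionId"]
```

A call such as `run_query(boto3.client("athena"), "SELECT 1", "your-unique-results-bucket")` starts the query asynchronously, and the output file appears in the results bucket once the query finishes.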

### Resources

[AWS Athena](https://www.sqlshack.com/an-introduction-to-aws-athena/)

[Athena pros and cons](https://towardsaws.com/aws-athena-why-is-it-different-than-mysql-93d55fd4a757)