# AWS Athena

### Introduction

In this lesson, we'll see how to work with AWS Athena.  Athena is a service that lets us query our data directly from an S3 bucket.  Let's get started.

### Benefits of Athena

Athena allows us to query our S3 data without setting up or running a database like our RDS Postgres instance.

This has some benefits.  Remember that with a Postgres instance in RDS, we have to be careful about leaving the database running, especially if we are not using it very often.  With Athena, our data is stored in a file in S3, and nothing runs until we issue a query.

Another benefit of Athena is that we do not have to set up a traditional schema before loading our data.  We can query loosely structured data, like log files, directly where it sits in S3.

Still, Athena has its downsides.  It doesn't support typical database features like indexing, which is used to speed up queries.  And with Athena we pay per query, based on the amount of data each query scans, so running a lot of queries can make the cost add up.  With a service like RDS we pay for the size of the database instance, so the cost can be easier to predict.

For these reasons, Athena is typically used for a few ad hoc queries before moving the data to a database, or for some initial searching through unstructured data like log files.  You can read more about the pros and cons of Athena in the resources at the end of this lesson.  For now, let's start using it.

### Using Athena with S3

Athena allows us to query our JSON or CSV data directly from S3.

To query JSON, our data needs to be in a specific form that looks like the following.

```json
{"song": "royals", "artist": "lorde"}
{"song": "taxman", "artist": "the beatles"}
{"song": "paint it black", "artist": "rolling stones"}
```

For our JSON to work, each dictionary should be on a separate line of the object in our S3 bucket, there should be no comma between the dictionaries, and there should be no square brackets at the beginning or end of the list of dictionaries.
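As a minimal sketch of producing that layout, the snippet below serializes a list of dictionaries with one JSON object per line. The `to_json_lines` helper is a hypothetical name used for illustration and does not come from the lesson's source files.

```python
import json


def to_json_lines(records):
    """Serialize a list of dicts as one JSON object per line (no commas, no brackets)."""
    return "\n".join(json.dumps(record) for record in records)


songs = [
    {"song": "royals", "artist": "lorde"},
    {"song": "taxman", "artist": "the beatles"},
    {"song": "paint it black", "artist": "rolling stones"},
]

# The resulting string is what we would store as the S3 object body
body = to_json_lines(songs)
print(body)
```

Each record is dumped separately, which is why no enclosing list or separating comma ever appears in the output.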

For CSV data, we can just upload a standard CSV file to S3.

### Storing our data

If you look at the `src/console.py` and `index.py` files, you can see how we accomplish this.  Looking at the `console.py` file:

* We create a new bucket to store the data we want to query (you'll have to set a unique name).
* We retrieve a list of dictionaries with a call to `find_receipts`.
* We then use `pd.DataFrame` to convert this list of dictionaries to a dataframe.
* Then if you uncomment the `s3.upload_file` call, you'll see that we add the resulting csv file to our bucket.
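The steps above can be sketched roughly as follows. This is not the contents of `console.py`: the function names are hypothetical, the standard library `csv` module stands in for `pd.DataFrame` so the sketch has no third-party dependencies, and the S3 client is passed in as a parameter (in practice it would be `boto3.client('s3')`).

```python
import csv


def write_receipts_csv(receipts, path):
    """Write a list of dicts to a CSV file, using the first dict's keys as the header."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(receipts[0].keys()))
        writer.writeheader()
        writer.writerows(receipts)
    return path


def store_receipts(s3_client, receipts, bucket_name, object_name, path="receipts.csv"):
    """Create the bucket, write the CSV locally, then upload it to the bucket."""
    # Outside us-east-1, create_bucket also needs a CreateBucketConfiguration
    s3_client.create_bucket(Bucket=bucket_name)
    write_receipts_csv(receipts, path)
    # upload_file takes the local filename, the bucket name, and the object key
    s3_client.upload_file(path, bucket_name, object_name)
```

Passing the client in keeps the AWS calls in one place and makes the function easy to try against a stub before pointing it at a real account.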

We can confirm that this worked by then reading from the bucket.

```python3
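import boto3

s3 = boto3.client('s3')
bucket_name = ''  # name of the bucket we uploaded to
object_name = ''  # key of the uploaded csv file
obj = s3.get_object(Bucket=bucket_name, Key=object_name)
text = obj['Body'].read()  # raw bytes of the file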
```
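Since `read()` returns raw bytes, one way to check the upload is to decode those bytes and parse them as CSV. The sketch below assumes the file is UTF-8 encoded; `parse_csv_bytes` is a hypothetical helper, and the sample bytes are a made-up stand-in for the real object body.

```python
import csv
import io


def parse_csv_bytes(raw_bytes, encoding="utf-8"):
    """Decode the bytes returned by obj['Body'].read() and parse them as CSV rows."""
    decoded = raw_bytes.decode(encoding)
    return list(csv.DictReader(io.StringIO(decoded)))


# Made-up stand-in for the bytes read from S3
sample = b"location_name,liquor_receipts\nexample bar,0\n"
rows = parse_csv_bytes(sample)
print(rows[0]["location_name"])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns would still need converting.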

Ok, so now that we have a bucket and an object to read from, the next step is to create a bucket to write to.  Athena writes the results of each query to an object in a bucket, so let's create that too.

You can see in the `console.py` file that we have a couple of lines for creating just this bucket.

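```python
# s3 is the boto3 S3 client, e.g. s3 = boto3.client('s3')
bucket_name = 'jigsawtexasresults'  # replace with your own unique bucket name
# Athena will write its query results to this bucket
results_bucket = s3.create_bucket(Bucket=bucket_name)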
```
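One thing to watch for: the bare `create_bucket` call works in `us-east-1`, but in other regions S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. A minimal sketch that handles both cases is below. The `create_results_bucket` name and the `region` parameter are assumptions for illustration, and the client is passed in rather than created here.

```python
def create_results_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create a bucket, adding a LocationConstraint for regions other than us-east-1."""
    if region == "us-east-1":
        # us-east-1 is the default and must not be given as a LocationConstraint
        return s3_client.create_bucket(Bucket=bucket_name)
    return s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
```

With a real client this could be called as `create_results_bucket(boto3.client('s3', region_name='us-west-2'), 'my-results-bucket', region='us-west-2')`, where the bucket name is a placeholder.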

### Setting up Athena

Ok, so we can confirm that our buckets have been created, and our file has been uploaded.  

From here, type athena in the search bar of the AWS console, and click on Athena.

<img src="./athena.png" width="100%">

From there, click on Query your data, and click Launch query editor.

<img src="./query-data.png">

When we get into Athena, we'll see that the light blue banner asks us to set up a query results location.

<img src="./query-result-location.png" width="100%">

Let's do that now.  This is just the S3 path to the results bucket that we created, in the form `s3://<bucket-name>/`.

<img src="./results-bucket.png" width="80%">

So now that we have specified where the results go, the next step is to specify where we are getting our data from.  To the left, click on `Create`, and then AWS Glue Crawler.

<img src="./aws-glue-crawler.png" width="50%">

By selecting AWS Glue, we are instructing AWS to crawl the data in the specified S3 location, and then create a corresponding table from the columns of our csv file.

So let's do that.

### Creating our Glue Crawler

On the new page, follow the instructions: enter the crawler name, and then add the data source.

> You can see that for this step, we can specify the *bucket* where we uploaded our data to.  Notice that we placed a `/` at the end of the bucket name, indicating to crawl the files inside of the bucket.

<img src="./s3-bucket.png" width="70%">

After adding our S3 data source we should see something like the following.

<img src="./selected-source.png" width="100%">

Next we will need a new IAM role to read from our S3 bucket.  Click on `Create new IAM role`.  You can view the default IAM configuration by clicking on `View`.

<img src="./iam-role.png" width="70%">

<img src="./create_db.png" width="80%">

Click on add a database, and fill in a database name.  From there, if you click on the refresh button to the right, you will be able to see your database, and select it.

<img src="./select-new-db.png" width="80%">

> This database is created in the AWS Glue Data Catalog.  Access to it can be managed through a service called AWS Lake Formation.

We can keep the crawler schedule as on demand.  However, AWS allows us to recrawl our buckets on a schedule in case the structure of our data changes.

Finally, we can select create crawler.

<img src="./create-crawler.png" width="70%">

If it worked, you should see a green banner saying that the crawler was created, and from there you can click on `Run crawler` to the right.

<img src="./run-crawler.png" width="100%">
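The console steps above can also be expressed with boto3's Glue client. The sketch below is one possible way to do it and is not taken from the lesson's source files: `create_and_run_crawler` is a hypothetical helper, the client is passed in (for example `boto3.client('glue')`), and it assumes the IAM role and the database already exist.

```python
def create_and_run_crawler(glue_client, crawler_name, role_arn, database_name, bucket_name):
    """Register a crawler that targets every object in the bucket, then start it."""
    glue_client.create_crawler(
        Name=crawler_name,
        Role=role_arn,  # IAM role that Glue assumes to read the bucket
        DatabaseName=database_name,  # database where the crawled table is created
        # The trailing slash tells the crawler to look at the files inside the bucket
        Targets={"S3Targets": [{"Path": f"s3://{bucket_name}/"}]},
    )
    glue_client.start_crawler(Name=crawler_name)
```

Leaving out a schedule in `create_crawler` corresponds to the on demand setting in the console, so the crawler only runs when `start_crawler` is called.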

### Back to Athena

Ok, so remember that Glue just crawled our S3 bucket so that we can query this bucket as a table.  Now we can go back to Athena to perform some queries.

> For the database, select the database that we created while setting up the crawler.  Then we can query our bucket as if it were a table.

So below, our query is:

```sql
select location_name, liquor_receipts from jigsawtexasquery where liquor_receipts = 0 limit 3;
```

<img src="./query-athena.png" width="100%">

Finally, like everything else in AWS, it is also possible to use Athena from boto3.  You can take a look at that in the `src/athena_boto.py` file, and we can talk through it in the next lesson.
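To give a rough idea of what running a query from boto3 involves, here is a minimal sketch. It is not necessarily what `src/athena_boto.py` contains: `run_athena_query` is a hypothetical helper, and the Athena client is passed in (for example `boto3.client('athena')`). Athena runs queries asynchronously, so the sketch starts the query, polls until it reaches a final state, and then fetches the first page of results.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return the first page of result rows."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Same results location we set in the console, e.g. s3://<results-bucket>/
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a final state
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    return results["ResultSet"]["Rows"]
```

For a `select` query, the first returned row holds the column names, and each cell is a dictionary such as `{"VarCharValue": "..."}`.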

### Resources

[AWS Athena](https://www.sqlshack.com/an-introduction-to-aws-athena/)

[Athena pros and cons](https://towardsaws.com/aws-athena-why-is-it-different-than-mysql-93d55fd4a757)

[AWS permissions](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html)

[S3 permissions](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html)

[boto bucket policy](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-bucket-policies.html)