# AWS Wrangler Lab

### Introduction

In this lesson, we'll practice working with the AWS Wrangler library (`awswrangler`) to both read and write data to an S3 bucket.

You can view the documentation for this library [here](https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html).

### Getting started

In this lesson, we'll be working with the ecommerce data located in the `data` folder.

We provided you with a function to read the data, `get_data`, which is defined in the `index.py` file. You can call that function from the `console.py` file.

### Writing to S3

Now create an S3 bucket from the AWS console.

* `write_to_s3(df, path)`

Next, write a function called `write_to_s3` that takes two arguments: a dataframe and the S3 path to the bucket. Given those two arguments, the function should write the dataframe to the bucket in Parquet format, writing it as a dataset.

Notice that at the top of the function we placed the line `boto3.setup_default_session(region_name="us-east-1")`. Make sure that the region name matches the region of the bucket. You can find the region by viewing your S3 buckets in the console.

<img src="./aws-region.png" width="80%">

> **Remember** that writing as a **dataset** means that the files are written under a folder (prefix) rather than to a single named file, and can later be read back from that folder without specifying a particular file.

* `read_from_s3(path)`

Write a function called `read_from_s3` that, given a path to the bucket, returns a dataframe of our data.

> This function should also start with the line `boto3.setup_default_session(region_name="us-east-1")`. Make sure that the correct region is specified.

* Partition

Now let's partition the dataset. Update the `write_to_s3` function so that it passes the `partition_cols` argument, and partition by the date.

Then try calling `write_to_s3` again. This time, if you visit the bucket, you should see that the data is partitioned by date.

<img src="./partitioned-data.png" width="50%">
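
To get a feel for the folder layout that partitioning produces, here is a small pandas-only sketch. The sample rows and the column names `ordered_at` and `date` are hypothetical; the real ecommerce data may name its date column differently.

```python
import pandas as pd

# Hypothetical sample rows; the real column names may differ.
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "ordered_at": pd.to_datetime(
            ["2023-01-01 09:30", "2023-01-01 17:45", "2023-01-02 08:15"]
        ),
    }
)

# Derive a day-level column so that each day maps to one partition.
df["date"] = df["ordered_at"].dt.strftime("%Y-%m-%d")

# Hive-style folder names, one per distinct value of the partition column.
partition_folders = sorted(f"date={value}/" for value in df["date"].unique())
print(partition_folders)  # ['date=2023-01-01/', 'date=2023-01-02/']
```

Passing `partition_cols=["date"]` to `wr.s3.to_parquet` (together with `dataset=True`) writes the files under folders named this way. Partitioning on a full timestamp instead of a day-level value would create one folder per distinct timestamp, which is rarely what you want.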

### Creating a database

Now, let's use AWS Glue to bring some order to our data lake. Begin by using wrangler to list the available databases.

```python
import awswrangler as wr

databases = wr.catalog.databases()
```

Then create a new database called `ecommerce`.

```python
wr.catalog.create_database('ecommerce')
```

Finally, use the `store_parquet_metadata` function to scan the bucket and register its contents as a table in the `ecommerce` database, so that we can read the data as a table.

```python
path = 's3://jigsaw-wrangler'

res = wr.s3.store_parquet_metadata(
    path=path,
    database="ecommerce",
    table="sales",
    dataset=True,
    mode="overwrite"
)
```

From there, write a function called `read_from_db` that takes a query as its argument and returns the results of that query.

> You may need to call `boto3.setup_default_session(region_name="us-east-1")` at the top of the function.

### Resources

[Documentation](https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html)