# Intro to AWS Wrangler

### Introduction

In this lesson, we'll introduce AWS Wrangler (the `awswrangler` library) as a means to upload data to and read data from S3.

### Downloading Wrangler

We can get started with AWS Wrangler by installing the library.  
First, create a Python virtual environment.

```bash
python -m venv venv
```

Then activate the environment.

```bash
source venv/bin/activate
```

Then you can install the libraries in the requirements.txt file.

```bash
pip3 install -r requirements.txt
```
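Once the install finishes, a quick way to confirm that the library landed in the active environment is to ask Python whether it can locate the package. The sketch below uses only the standard library; the helper name `is_installed` is ours for illustration and is not part of awswrangler.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be located in this environment."""
    return find_spec(package_name) is not None


# awswrangler is the import name used throughout this lesson
status = "installed" if is_installed("awswrangler") else "missing"
print(f"awswrangler: {status}")
```

If this prints `missing`, check that the virtual environment is activated before running `pip3 install` again.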

### Setting up the credentials

We'll need to make sure that our AWS credentials are properly connected to our local environment.  To do this, we can begin by trying to access an object in one of our S3 buckets.

> **Get set up**: If you are unable to access the S3 object url below, we have provided you with the dataset in the `data` folder.  Upload it to an S3 bucket, and then change the url below.
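If you do need to upload the local copy yourself, one option is awswrangler's `wr.s3.upload` function. The sketch below is only illustrative: the local file name `data/chicago-crimes.csv`, the bucket name `my-bucket`, and the helper names `s3_url` and `upload_dataset` are assumptions, so swap in your own values.

```python
def s3_url(bucket_name: str, key: str) -> str:
    """Build an s3:// URL from a bucket name and an object key."""
    return f"s3://{bucket_name.strip('/')}/{key.lstrip('/')}"


def upload_dataset(local_file: str, bucket_name: str, key: str) -> str:
    """Upload a local file to S3 and return the URL it was written to."""
    # Imported here so the URL helper above works without AWS access.
    import awswrangler as wr

    url = s3_url(bucket_name, key)
    wr.s3.upload(local_file=local_file, path=url)
    return url


# Hypothetical values: replace with your own file and bucket.
# url_path = upload_dataset("data/chicago-crimes.csv", "my-bucket", "chicago-crimes.csv")
```

The returned URL is the value to use as `url_path` in the cells that follow.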

Now see if you can use awswrangler.

```python
import awswrangler as wr

# you may have to add your own url
url_path = "s3://jigsaw-labs-student/chicago-crimes.csv"

s3_df = wr.s3.read_csv(url_path, low_memory=False, index_col=0)
```

```python
s3_df[:2]
```

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | ... | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 2016.0 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) |
| 89 | 10508695 | HZ250409 | 05/03/2016 09:40:00 PM | 061XX S DREXEL AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | ... | 20.0 | 42.0 | 08B | 1183066.0 | 1864330.0 | 2016.0 | 05/10/2016 03:56:50 PM | 41.782922 | -87.604363 | (41.782921527, -87.60436317) |


If we get a credential error then we will need to set up our credentials.

#### Detour: Setting up your credentials

If we do not have our credentials set up, then we should go to `IAM` > `Users` in the AWS console, and create a new user.  Give the user administrative access by attaching the corresponding permissions.

Then, after the user is created, we'll need to create the related API keys.  Go to `Users` in the side panel on the left, click on the name of the user you just created, and then open security credentials.

<img src="./sec-creds.png" width="80%">

Then scroll down to the access keys section, and click on `create access keys`.

<img src="./create-access.png" width="60%">

Then run `aws configure` in the shell, and enter your access key ID and secret access key when prompted.
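To see whether the keys are now discoverable, we can look in the two most common places: the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, and the `~/.aws/credentials` file that `aws configure` writes. The following is a minimal sketch using only the standard library. The function name `credentials_source` is ours, and the check is deliberately partial, since AWS tooling can also pick up credentials from other sources such as SSO or instance roles.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional


def credentials_source(
    env: Optional[Mapping[str, str]] = None,
    credentials_file: Optional[Path] = None,
    profile: str = "default",
) -> Optional[str]:
    """Report where AWS access keys were found, or None if none were found."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"

    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"

    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_file)  # a missing file is silently skipped
    if parser.has_option(profile, "aws_access_key_id"):
        return f"shared credentials file ({profile} profile)"
    return None


# Usage: prints the source, or a hint if nothing was found.
# print(credentials_source() or "no credentials found; run `aws configure`")
```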

### Reading and Writing from Wrangler

Ok, so we just saw our first method in wrangler -- reading a csv file.  Let's see it again.

```python
import awswrangler as wr

# you may have to add your own url
url_path = "s3://jigsaw-labs-student/chicago-crimes.csv"

crimes_df = wr.s3.read_csv(url_path, low_memory=False, index_col=0)
```

> We use the `wr.s3` module, and then call the `read_csv` function, just like pandas. 

And from here we can write this to parquet with something like the following:

```python
bucket_name = "jigsaw-labs-student"  # change bucket name

crimes_parquet_url = f"s3://{bucket_name}/crimes.snappy.parquet"
wr.s3.to_parquet(df=crimes_df, path=crimes_parquet_url)
```

```text
{'paths': ['s3://jigsaw-labs-student/crimes.snappy.parquet'],
 'partitions_values': {}}
```

Ok, so we just wrote the file to s3, and did so writing a `parquet` object.  We can read this object like so.

```python
crimes_df = wr.s3.read_parquet(crimes_parquet_url)
crimes_df[:2]
```

| | ID | Case_Number | Date | Block | IUCR | Primary_Type | Description | Location_Description | Arrest | Domestic | ... | Ward | Community_Area | FBI_Code | X_Coordinate | Y_Coordinate | Year | Updated_On | Latitude | Longitude | Location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | ... | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 2016.0 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) |
| 1 | 10508695 | HZ250409 | 05/03/2016 09:40:00 PM | 061XX S DREXEL AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | ... | 20.0 | 42.0 | 08B | 1183066.0 | 1864330.0 | 2016.0 | 05/10/2016 03:56:50 PM | 41.782922 | -87.604363 | (41.782921527, -87.60436317) |


Using a parquet format has multiple benefits.

1. Columnar storage -- values from the same column are stored together, which allows for efficient reading of a few columns of data, common in analytics queries.

2. Compression -- because the values within a column share the same type, the data is easier to compress.  In fact, you'll see the file name is `.snappy.parquet`.  The `.snappy` is for the snappy compression algorithm.  As we know, compression reduces our storage costs.

3. Storing datatypes -- Parquet stores metadata along with the data.  So when we write our data, the datatype of each column is written too, and it is restored when we read the file back.
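To make the first and third points concrete, here is a toy illustration in plain Python. It is not how Parquet is laid out on disk; it only shows the idea of pivoting rows into per-column lists, so that one column can be read without touching the others, and of keeping one type per column. The helper names are ours, and the sample values come from the first two rows of the crimes data.

```python
from typing import Any, Dict, List

# A few fields from the first two rows of the crimes data.
rows: List[Dict[str, Any]] = [
    {"ID": 10508693, "Primary Type": "BATTERY", "Arrest": True, "Year": 2016.0},
    {"ID": 10508695, "Primary Type": "BATTERY", "Arrest": False, "Year": 2016.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_types(columns: Dict[str, List[Any]]) -> Dict[str, str]:
    """Record one type name per column, loosely like a schema."""
    return {name: type(values[0]).__name__ for name, values in columns.items()}


columns = to_columns(rows)
print(columns["Arrest"])      # [True, False]
print(column_types(columns))  # {'ID': 'int', 'Primary Type': 'str', 'Arrest': 'bool', 'Year': 'float'}
```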

### Working with directories

Oftentimes when reading and writing files, we will split a dataset up among multiple files.  

For example, let's take our chicago crimes data, and split it into two components.

```python
first_crimes = crimes_df[:100]

last_crimes = crimes_df[100:200]
```

From there, instead of writing to a specific file, let's write our dataset to a specified *folder*.

> Create a new folder inside of your bucket.

Then we can write to that folder with the `to_parquet` function.

```python
bucket_name = "jigsaw-labs-student"  # change bucket name
folder_name = "chicago"

wr.s3.to_parquet(df=first_crimes,
                path=f"s3://{bucket_name}/{folder_name}/",
                dataset=True)
```

```text
{'paths': ['s3://jigsaw-labs-student/chicago/3f35b6b5b3ae444f894de6993b89467b.snappy.parquet'],
 'partitions_values': {}}
```

Notice that this time, we passed through the argument `dataset = True`.  This tells awswrangler to treat the contents of the *entire folder* as a dataset.

From here, we can add additional files to the folder.  

```python
wr.s3.to_parquet(df=last_crimes,
                path=f"s3://{bucket_name}/{folder_name}/",
                dataset=True)
```

```text
{'paths': ['s3://jigsaw-labs-student/chicago/90f3f848ac0f49688a22d65a48885aba.snappy.parquet'],
 'partitions_values': {}}
```

> The `to_parquet` function takes an optional `mode` argument, which can be `overwrite` or `append` when `dataset=True`.  With `overwrite`, the existing dataset is replaced, whereas `append` adds new files alongside the existing ones.
>
> The default behavior is `append`, so above we left `mode` unset.

```python
wr.s3.to_parquet(df=last_crimes, 
                path=f"s3://{bucket_name}/{folder_name}/",
                dataset=True)
```
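To make the mode explicit, we can pass it ourselves. The wrapper below is a sketch, and `write_to_folder` is a hypothetical helper name. It checks the mode before calling `wr.s3.to_parquet`. Besides `append` and `overwrite`, awswrangler also documents an `overwrite_partitions` mode, which is included in the check.

```python
VALID_MODES = {"append", "overwrite", "overwrite_partitions"}


def write_to_folder(df, bucket_name: str, folder_name: str, mode: str = "append"):
    """Write df into an S3 folder as a Parquet dataset using the given mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    # Imported here so the check above can run without awswrangler installed.
    import awswrangler as wr

    return wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket_name}/{folder_name}/",
        dataset=True,
        mode=mode,
    )


# Replace whatever is in the folder with just the first 100 rows:
# write_to_folder(first_crimes, bucket_name, folder_name, mode="overwrite")
```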

Then, if we want to read from the entire folder, and treat that folder as a dataset, we can do so by just specifying `dataset = True`.

```python
crimes_df = wr.s3.read_parquet(path=f"s3://{bucket_name}/{folder_name}/",
                dataset=True)
```

Notice that doing so combined all of our rows.

```python
crimes_df.shape
```

```text
(200, 22)
```

Ok, that's good for that section.  Let's remove the files in our directory.

```python
wr.s3.delete_objects(f"s3://{bucket_name}/{folder_name}/")
```

### Partitioning

Now when writing to a folder, it's often a good idea to partition our data.  This way, if we want to query a dataset stored in S3, tools like Athena or Spark will not need to search through all of the files in the dataset, but just those in the matching partitions.

For example, let's partition our dataset by year.  We can see that there are three different years in our dataset.

```python
crimes_df.Year.unique()
```

```text
array([2016., 2015., 2012.])
```

Now let's store our dataset, partitioning by year.

```python
wr.s3.to_parquet(df=crimes_df,
                path=f"s3://{bucket_name}/{folder_name}/",
                partition_cols=['Year'],
                dataset=True)
```

```text
{'paths': ['s3://jigsaw-labs-student/chicago/Year=2012.0/6ca60edb4e1b491c8af4302ce6b69423.snappy.parquet',
  's3://jigsaw-labs-student/chicago/Year=2015.0/6ca60edb4e1b491c8af4302ce6b69423.snappy.parquet',
  's3://jigsaw-labs-student/chicago/Year=2016.0/6ca60edb4e1b491c8af4302ce6b69423.snappy.parquet'],
 'partitions_values': {'s3://jigsaw-labs-student/chicago/Year=2012.0/': ['2012.0'],
  's3://jigsaw-labs-student/chicago/Year=2015.0/': ['2015.0'],
  's3://jigsaw-labs-student/chicago/Year=2016.0/': ['2016.0']}}
```

This time, if we look at the bucket, we'll see that our data was partitioned into a separate folder per year.
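The folder names follow a `key=value` pattern, and this is what lets a reader skip folders that do not match. The sketch below parses that pattern from one of the paths returned above and defines a predicate that could be handed to `wr.s3.read_parquet` through its `partition_filter` argument. The helper names `partition_values` and `is_2016` are ours. Note that partition values arrive as strings, which is why the predicate compares against `"2016.0"`.

```python
from typing import Dict


def partition_values(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from an S3 object path."""
    values: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def is_2016(partitions: Dict[str, str]) -> bool:
    """Keep only the Year=2016.0 partition (values are strings)."""
    return partitions["Year"] == "2016.0"


example = "s3://jigsaw-labs-student/chicago/Year=2016.0/6ca60edb4e1b491c8af4302ce6b69423.snappy.parquet"
print(partition_values(example))  # {'Year': '2016.0'}

# With AWS access, the same predicate can be passed to awswrangler so that
# only the matching folder is read:
# import awswrangler as wr
# crimes_2016_df = wr.s3.read_parquet(
#     path=f"s3://{bucket_name}/{folder_name}/",
#     dataset=True,
#     partition_filter=is_2016,
# )
```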

### Connecting to glue

Ok, so now we can use Glue to catalog our dataset and then query it with Athena.

We can get the current list of databases with the following.

```python
databases = wr.catalog.databases()
# databases
```

And now let's create a new database called `chicago_datasets`.

```python
wr.catalog.create_database("chicago_datasets")
```

If we look at databases again, we'll see it listed there.

```python
databases = wr.catalog.databases()
# databases
```
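Running the `create_database` cell a second time may raise an error because the database already exists. One way around this is to check first, as in the following sketch. It assumes that `wr.catalog.get_databases()` yields dictionaries with a `"Name"` key (worth verifying against your installed version), and `ensure_database` is a hypothetical helper name. The catalog module is passed in as an argument so the function can be tried without AWS access.

```python
def ensure_database(name: str, catalog) -> bool:
    """Create the Glue database if it is missing; return True if it was created.

    `catalog` is expected to be awswrangler's `wr.catalog` module.
    """
    # Assumption: each item yielded by get_databases() is a dict with a "Name" key.
    existing = {database["Name"] for database in catalog.get_databases()}
    if name in existing:
        return False
    catalog.create_database(name)
    return True


# Usage with AWS access:
# import awswrangler as wr
# ensure_database("chicago_datasets", wr.catalog)
```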

And finally, we can have awswrangler read the schema and partitions from our parquet files and store them as a table in the Glue catalog.

```python
bucket_name = "jigsaw-labs-student"  # change bucket name
folder_name = "chicago"

path = f"s3://{bucket_name}/{folder_name}/"

res = wr.s3.store_parquet_metadata(
    path=path,
    database="chicago_datasets",
    table="crimes",
    dataset=True,
    mode="overwrite"
)
```

Let's see what it came up with.

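```python
wr.catalog.table(database="chicago_datasets", table="crimes").T[:2]
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Column Name | id | case_number | date | block | iucr | primary_type | description | location_description | arrest | domestic | ... | ward | community_area | fbi_code | x_coordinate | y_coordinate | updated_on | latitude | longitude | location | year |
| Type | bigint | string | string | string | string | string | string | string | boolean | boolean | ... | double | double | string | double | double | string | double | double | string | string |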


So it has found the columns and their related datatypes.  Finally, let's use Athena to query our created table.

```python
query = "SELECT * FROM crimes where Year > '2015' limit 10"
crimes_2015_df = wr.athena.read_sql_query(query,
                                        database="chicago_datasets")
```

```python
crimes_2015_df[:2]
```

| | id | case_number | date | block | iucr | primary_type | description | location_description | arrest | domestic | ... | ward | community_area | fbi_code | x_coordinate | y_coordinate | updated_on | latitude | longitude | location | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | ... | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) | 2016.0 |
| 1 | 10508695 | HZ250409 | 05/03/2016 09:40:00 PM | 061XX S DREXEL AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | ... | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 05/10/2016 03:56:50 PM | 41.782922 | -87.604363 | (41.782921527, -87.60436317) | 2016.0 |
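One thing to keep in mind about the query above: the catalog lists `year` as a `string`, so `year > '2015'` is a string comparison rather than a numeric one. Plain Python strings compare the same way for these values, so the sketch below uses them to show the difference. The final query is one possible alternative that casts before comparing; it assumes every partition value parses as a number.

```python
# Partition values as they appear in the Glue table (strings, not numbers).
year_partitions = ["2012.0", "2015.0", "2016.0"]

# String comparison, which is what `year > '2015'` does in the query above.
as_strings = [year for year in year_partitions if year > "2015"]

# Numeric comparison after casting.
as_numbers = [year for year in year_partitions if float(year) > 2015]

print(as_strings)  # ['2015.0', '2016.0']
print(as_numbers)  # ['2016.0']

# A query that casts before comparing (assumes the values always parse as numbers).
numeric_query = "SELECT * FROM crimes WHERE CAST(year AS double) > 2015 LIMIT 10"
```

With the string comparison, rows from the `2015.0` partition can also match, because `'2015.0'` sorts after `'2015'`.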


### Summary

In this lesson, we saw how we can use awswrangler to read and write to files like so.

```python
# read from object url
crimes_df = wr.s3.read_parquet(crimes_parquet_url)

# write to object specifying df, and url
wr.s3.to_parquet(df=crimes_df, path=crimes_parquet_url)
```

And that we can read and write to folders with the `dataset = True` argument.

```python
wr.s3.to_parquet(df=crimes_df, 
                path=folder_path,
                dataset=True)

crimes_df = wr.s3.read_parquet(folder_path, dataset=True)
```

### Resources

[AWS Wrangler Tutorial](https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/004%20-%20Parquet%20Datasets.ipynb)

[Crawling in Glue](https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/010%20-%20Parquet%20Crawler.ipynb)

[Glue Partitioning](https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html)