# Cloud Tools - AWS

In [1]:
import os
import pandas as pd
from pymagic.cloud_tools import AWS

# S3

## Admin

### Create a Bucket

### List Files on S3

Let's take a look at the files we have in our S3 bucket.

In [2]:
df = AWS.s3_list_files(
    bucket=os.environ['example_aws_bucket'],
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc']
)

Screen Shot 2020-06-17 at 6.19.55 PM.png
rs_test_data.csv000
test.png
test_df.csv


In [3]:
df.head()

Unnamed: 0,files,times,metadata
0,Screen Shot 2020-06-17 at 6.19.55 PM.png,2020-07-01 03:16:27+00:00,{}
1,rs_test_data.csv000,2020-09-26 23:47:20+00:00,{}
2,test.png,2020-09-26 23:47:14+00:00,"{'metadata_key1': 'metadata_key1', 'metadata_k..."
3,test_df.csv,2020-09-26 23:47:14+00:00,{}


### Delete S3 Object

Let's delete an object in our S3 bucket.

In [4]:
AWS.delete_s3_object(
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    bucket=os.environ['example_aws_bucket'], 
    obj_list=[{'Key':'test_df.csv'}] #pass a list of dictionaries
)

{'ResponseMetadata': {'RequestId': '0DFAF82595798FA9',
  'HostId': 'GuC54haguka5x1bVdY2oDFG6vFn3lwFi57pbA0KTdBjF76cI2lYLxNgE1OqhBLPHYSwehUPYyIk=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'GuC54haguka5x1bVdY2oDFG6vFn3lwFi57pbA0KTdBjF76cI2lYLxNgE1OqhBLPHYSwehUPYyIk=',
   'x-amz-request-id': '0DFAF82595798FA9',
   'date': 'Sat, 26 Sep 2020 23:48:57 GMT',
   'connection': 'close',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 1},
 'Deleted': [{'Key': 'test_df.csv'}]}
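The `obj_list` argument has the same shape that S3's multi-object delete request uses: a list of `{'Key': ...}` dictionaries. S3 accepts at most 1,000 keys per delete request, so longer lists need to be split into batches. Below is a minimal sketch of both steps; `to_obj_list` and `batched` are hypothetical helper names, and whether `AWS.delete_s3_object` already batches internally is not stated here.

```python
from typing import Dict, Iterable, Iterator, List

S3_DELETE_LIMIT = 1000  # S3 accepts at most 1,000 keys per multi-object delete


def to_obj_list(keys: Iterable[str]) -> List[Dict[str, str]]:
    """Wrap plain key names in the {'Key': name} shape used by obj_list."""
    return [{"Key": key} for key in keys]


def batched(obj_list: List[Dict[str, str]],
            size: int = S3_DELETE_LIMIT) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of obj_list with at most `size` entries each."""
    for start in range(0, len(obj_list), size):
        yield obj_list[start:start + size]


# Two keys from the listing above
obj_list = to_obj_list(["test_df.csv", "test.png"])

# 2,500 made-up keys split into request-sized batches
batches = list(batched(to_obj_list(f"file_{i}.csv" for i in range(2500))))
```

Each batch could then be passed as `obj_list` in its own call.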

## Loading Data

Let's look at some examples of loading data to S3.

### DataFrames

Let's push a Pandas DataFrame to S3 as a CSV file.

In [5]:
df = pd.DataFrame({
    "widgets":[12,31,43,32,33,12,3],
    "sales":[249,199,89,59,129,99,159]
})

In [6]:
df

Unnamed: 0,widgets,sales
0,12,249
1,31,199
2,43,89
3,32,59
4,33,129
5,12,99
6,3,159


In [7]:
AWS.df_to_s3(
    df=df, 
    object_name="test_df.csv", 
    bucket=os.environ['example_aws_bucket'], 
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    sep=",", 
    header=True
)

{'ResponseMetadata': {'RequestId': '80506F835F6841CF',
  'HostId': '7PjcIs7TLcF80tGmX2nBembdaXCCq7HZfGAp2KBQie1iUkluAVEipaiQ2IzQPt5R/Bi60FfGoMw=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '7PjcIs7TLcF80tGmX2nBembdaXCCq7HZfGAp2KBQie1iUkluAVEipaiQ2IzQPt5R/Bi60FfGoMw=',
   'x-amz-request-id': '80506F835F6841CF',
   'date': 'Sat, 26 Sep 2020 23:48:58 GMT',
   'etag': '"554a58475fccccc287150fb98445cf78"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 1},
 'ETag': '"554a58475fccccc287150fb98445cf78"'}
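Uploading tabular data as CSV does not require a temporary file: the rows can be serialized into an in-memory buffer and the resulting bytes sent as the object body. The sketch below shows that idea with only the standard library. `rows_to_csv_bytes` is a hypothetical helper, not part of pymagic, and how `AWS.df_to_s3` serializes the DataFrame internally is an assumption.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_bytes(columns: Sequence[str],
                      rows: Iterable[Sequence],
                      sep: str = ",",
                      header: bool = True) -> bytes:
    """Serialize rows to CSV in memory and return UTF-8 encoded bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    if header:
        writer.writerow(columns)  # column names as the first line
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# First two rows of the widgets/sales example
body = rows_to_csv_bytes(["widgets", "sales"], [(12, 249), (31, 199)])
```

With a DataFrame, the equivalent serialization step would be `df.to_csv(buffer, sep=sep, header=header, index=False)`.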

### Local Files

What about a local file? Here we upload a PNG image along with some optional metadata.

In [8]:
sav_dir = '/Users/Collier/Downloads/'
file_name = "test.png"

AWS.file_to_s3(
    folder_file=sav_dir+file_name, 
    object_name=file_name, 
    bucket=os.environ['example_aws_bucket'], 
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    metadata_d={ #optional metadata dictionary
        "metadata_key1":"metadata_key1",
        "metadata_key2":"metadata_key2"
    }, 
    public_file=False #True=Public, False=Private
)

loaded data to test.png from: /Users/Collier/Downloads/test.png


### Redshift

The AWS ecosystem allows some pretty flexible interoperability between AWS services. 

Let's dump the contents of a Redshift query into S3 as a csv file.

This uses the Redshift 'UNLOAD' command under the hood.

In [9]:
#import the Redshift connection class
from pymagic.db_conn_tools import Redshift

#psycopg2
cursor_rs, conn_rs = Redshift.conn_rs_pg(
    host=os.environ['example_rs_host'],
    db=os.environ['example_rs_db'],
    user=os.environ['example_rs_user'],
    pwd=os.environ['example_rs_pwd'],
    port=os.environ['example_rs_port']
)

#you can use any select query
sql = "select * from flights"

#unload
AWS.rs_to_s3(
    cursor=cursor_rs, 
    sql=sql, 
    #make sure to include the 's3://' prefix
    bucket_path="s3://"+os.environ['example_aws_bucket']+"/rs_test_data.csv", 
    delimiter=",", 
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc']
)

loaded data to s3://cctickerupdall/rs_test_data.csv from query: select * from flights
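To make the 'UNLOAD' step more concrete, here is a rough sketch of how such a statement could be assembled. The exact SQL that `AWS.rs_to_s3` generates is not shown in this notebook, so `build_unload_sql` is a hypothetical helper and the clause selection is an assumption. The one rule worth noting is that UNLOAD takes the query as a quoted string, so single quotes inside the query have to be doubled.

```python
def build_unload_sql(sql: str, bucket_path: str, pub: str, sec: str,
                     delimiter: str = ",") -> str:
    """Assemble a Redshift UNLOAD statement as a string (illustrative only)."""
    # The query is passed as a quoted string, so double any single quotes in it
    escaped_query = sql.replace("'", "''")
    credentials = f"aws_access_key_id={pub};aws_secret_access_key={sec}"
    return (
        f"UNLOAD ('{escaped_query}') "
        f"TO '{bucket_path}' "
        f"CREDENTIALS '{credentials}' "
        f"DELIMITER '{delimiter}'"
    )


# Placeholder bucket and credentials, not real values
statement = build_unload_sql(
    sql="select * from flights where origin = 'SFO'",
    bucket_path="s3://my-bucket/rs_test_data.csv",
    pub="PUB",
    sec="SEC",
)
```

UNLOAD treats the S3 path as a prefix and appends a suffix to the files it writes, which would explain the `rs_test_data.csv000` object in the earlier file listing.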


## Retrieving Data

What about getting data out of S3?

### DataFrames

Let's download the csv we loaded earlier back to a Pandas DataFrame.

In [10]:
df = AWS.s3_to_df(
    object_name="test_df.csv", 
    bucket=os.environ['example_aws_bucket'],
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    sep=",", 
    header=0
)

In [11]:
df

Unnamed: 0,widgets,sales
0,12,249
1,31,199
2,43,89
3,32,59
4,33,129
5,12,99
6,3,159


### Local Files

Of course we can download files from S3 to a local folder.

In [12]:
sav_dir = '/Users/Collier/Downloads/'
file_name = "test.png"

AWS.s3_to_file(
    object_name=file_name, 
    folder_file=sav_dir+file_name, 
    bucket=os.environ['example_aws_bucket'],
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
)

loaded data to /Users/Collier/Downloads/test.png from: test.png


### Redshift

Now let's load some data from S3 into Redshift.

This is the reverse of what we did earlier and uses the Redshift 'COPY' command under the hood.

For more Redshift ETL examples, see this notebook: []()

In [13]:
#one note is that you must create the target table ahead of time before you can load data into it.

#import the Redshift ETL class
from pymagic.db_etl_tools import Redshift

tbl_name = "rs_test_data_from_s3"

#let's grab our dataframe again so we can create the table 
df = AWS.s3_to_df(
    object_name="test_df.csv", 
    bucket=os.environ['example_aws_bucket'],
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    sep=",", 
    header=0
)

#if we already created the table, let's drop it so we can re-create it
try:
    Redshift.run_query_rs(sql=f"drop table {tbl_name}",conn=conn_rs)
except Exception:
    #if the drop fails (e.g. the table does not exist), end the failed transaction
    conn_rs.commit()

sql = Redshift.make_df_tbl_rs(
    tbl_name=tbl_name,
    df=df
)

Redshift.run_query_rs(sql=sql,conn=conn_rs)

Runtime: 0.0017358666666666665
Runtime: 0.0024833999999999998


In [14]:
#this is the SQL that we ran to create the table
sql

'CREATE TABLE rs_test_data_from_s3 ( widgets INTEGER, sales INTEGER )'
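A statement like this can be derived from the DataFrame's column dtypes. The sketch below shows one way to do that; the dtype-to-SQL mapping is an assumption (the mapping `Redshift.make_df_tbl_rs` actually uses is not shown), and `make_create_table_sql` is a hypothetical name. It takes a plain `{column: dtype name}` dictionary so that it runs without pandas.

```python
from typing import Dict

# Assumed mapping from pandas dtype names to Redshift column types
DTYPE_TO_SQL = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "object": "VARCHAR(256)",
}


def make_create_table_sql(tbl_name: str, dtypes: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement from a {column: dtype name} mapping."""
    columns = ", ".join(
        f"{col} {DTYPE_TO_SQL[dtype]}" for col, dtype in dtypes.items()
    )
    return f"CREATE TABLE {tbl_name} ( {columns} )"


sql_sketch = make_create_table_sql(
    "rs_test_data_from_s3",
    {"widgets": "int64", "sales": "int64"},
)
```

For a real DataFrame, the mapping could be built with `{col: str(dtype) for col, dtype in df.dtypes.items()}`. Note that Redshift's INTEGER is a 4-byte type, so BIGINT would be the safer target for large `int64` values.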

In [15]:
#load the Redshift table from the csv on S3

AWS.s3_to_rs(
    cursor=cursor_rs, #note we passed the Redshift cursor object
    tbl_name=tbl_name, 
    bucket_path="s3://"+os.environ['example_aws_bucket']+"/test_df.csv",
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    delimiter=","
)

loaded data from s3://cctickerupdall/test_df.csv to rs_test_data_from_s3


# DynamoDB

DynamoDB is AWS' flagship NoSQL database. Let's go over some examples with DynamoDB.

## Creating/Dropping Tables

First let's create a DynamoDB table by outlining some of the table details.

Below we define our key schema, attribute definitions, and provisioned throughput parameters.
These are all required when creating a DynamoDB table.

In [16]:
tbl_name = "dynamo_test"
region = "us-east-2"

key_schema = \
[
    {
        'AttributeName': 'year',
        'KeyType': 'HASH'  # Partition key
    },
    {
        'AttributeName': 'title',
        'KeyType': 'RANGE'  # Sort key
    }
]

attribute_definitions = \
[
    {
        'AttributeName': 'year',
        'AttributeType': 'N'
    },
    {
        'AttributeName': 'title',
        'AttributeType': 'S'
    },
]

provisioned_throughput = \
{
    'ReadCapacityUnits': 10,
    'WriteCapacityUnits': 10
}
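These three structures have to agree with each other: a table needs exactly one HASH (partition) key, may have at most one RANGE (sort) key, and every key attribute must appear in the attribute definitions with a scalar type (S, N, or B). The sketch below checks those rules locally before any request is sent. `check_key_schema` is a hypothetical helper and is not part of pymagic.

```python
from typing import Dict, List

KEY_ATTRIBUTE_TYPES = {"S", "N", "B"}  # keys must be string, number, or binary


def check_key_schema(key_schema: List[Dict[str, str]],
                     attribute_definitions: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    types = {a["AttributeName"]: a["AttributeType"] for a in attribute_definitions}
    key_types = [k["KeyType"] for k in key_schema]

    if key_types.count("HASH") != 1:
        problems.append("exactly one HASH (partition) key is required")
    if key_types.count("RANGE") > 1:
        problems.append("at most one RANGE (sort) key is allowed")

    for key in key_schema:
        name = key["AttributeName"]
        if name not in types:
            problems.append(f"key attribute '{name}' has no attribute definition")
        elif types[name] not in KEY_ATTRIBUTE_TYPES:
            problems.append(f"key attribute '{name}' has non-scalar type '{types[name]}'")
    return problems


# The schema from the cell above passes all checks
ok = check_key_schema(
    [{"AttributeName": "year", "KeyType": "HASH"},
     {"AttributeName": "title", "KeyType": "RANGE"}],
    [{"AttributeName": "year", "AttributeType": "N"},
     {"AttributeName": "title", "AttributeType": "S"}],
)

# A schema whose sort key was never defined is flagged
bad = check_key_schema(
    [{"AttributeName": "year", "KeyType": "HASH"},
     {"AttributeName": "title", "KeyType": "RANGE"}],
    [{"AttributeName": "year", "AttributeType": "N"}],
)
```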

In [17]:
#delete the table if it already exists so we can re-create it

try:
    AWS.delete_dynamo_tbl(
        pub=os.environ['aws_secret_access_id_cc'], 
        sec=os.environ['aws_secret_access_key_cc'],
        region_name=region, 
        tbl_name=tbl_name
    )
except Exception as e:
    #the delete fails if the table does not exist yet; print the error and continue
    print(str(e))

An error occurred (ResourceNotFoundException) when calling the DeleteTable operation: Requested resource not found: Table: dynamo_test not found


In [18]:
AWS.make_dynamo_tbl(
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    region_name=region, 
    tbl_name=tbl_name,
    key_schema=key_schema, 
    attribute_definitions=attribute_definitions,
    provisioned_throughput=provisioned_throughput
)

dynamodb.Table(name='dynamo_test')

## Loading Data

### Dictionaries

A DynamoDB table holds records called 'items', which are essentially Python dictionaries or JavaScript objects. The fields in an item are called 'attributes'.

If we want to insert a dictionary into a DynamoDB table, we can do so as shown below. Each attribute value is itself a small dictionary that maps a DynamoDB type code (such as N or S) to the value.

In [20]:
AWS.load_dynamo(
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    region_name=region, 
    tbl_name=tbl_name, 
    #the dictionary must include the table's key attributes (year and title) at the top level
    d={'year': {"N":"1997"},
       "title": {"S":"Titantic"}
      }
)

{'ResponseMetadata': {'RequestId': 'CH2FQ58L6AL86L6H25IDA4OE6RVV4KQNSO5AEMVJF66Q9ASUAAJG',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Server',
   'date': 'Sat, 26 Sep 2020 23:49:31 GMT',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'CH2FQ58L6AL86L6H25IDA4OE6RVV4KQNSO5AEMVJF66Q9ASUAAJG',
   'x-amz-crc32': '2745614147'},
  'RetryAttempts': 0}}

DynamoDB supports quite a few different data types for attributes in tables.

You can find a list of them here: []().

Let's take a look at a few here by loading them to our table.

In [21]:
AWS.load_dynamo(
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    region_name=region, 
    tbl_name=tbl_name, 
    d={'year': {"N":"1997"}, #reusing an existing primary key (year + title) overwrites that item
       "title": {"S":"Titantic"},
       'test_string': {'S': 'Australia'}, 
       'test_number': {'N': "99.99"},
       'test_bool':{'BOOL':True},
       'test_list':{'L':[{'N':"1"},{'S':'Hello'}]},
       'test_numbered_set':{'NS':["0","1","2","3","4"]},
       'test_string_set':{'SS':["Hello","World","Foo","Bar"]}
      }
)

{'ResponseMetadata': {'RequestId': '3ETRQAO02LKCN16VDL7NB40O27VV4KQNSO5AEMVJF66Q9ASUAAJG',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Server',
   'date': 'Sat, 26 Sep 2020 23:49:38 GMT',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': '3ETRQAO02LKCN16VDL7NB40O27VV4KQNSO5AEMVJF66Q9ASUAAJG',
   'x-amz-crc32': '2745614147'},
  'RetryAttempts': 0}}
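Writing the type codes by hand gets tedious for larger items. The sketch below converts plain Python values into the typed format used above. `to_dynamo_value` is a hypothetical helper covering only the types shown in this section; boto3 ships `boto3.dynamodb.types.TypeSerializer` for the same job, which would be the natural choice in real code.

```python
from decimal import Decimal
from typing import Any, Dict


def to_dynamo_value(value: Any) -> Dict[str, Any]:
    """Wrap a plain Python value in its DynamoDB type code."""
    # bool must be tested before numbers because bool is a subclass of int
    if isinstance(value, bool):
        return {"BOOL": value}
    if isinstance(value, (int, float, Decimal)):
        return {"N": str(value)}  # numbers are sent as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (list, tuple)):
        return {"L": [to_dynamo_value(v) for v in value]}
    if isinstance(value, (set, frozenset)):
        if value and all(isinstance(v, str) for v in value):
            return {"SS": sorted(value)}
        if value and all(isinstance(v, (int, float, Decimal))
                         and not isinstance(v, bool) for v in value):
            return {"NS": sorted(str(v) for v in value)}
        raise TypeError("sets must be non-empty and hold only strings or only numbers")
    raise TypeError(f"unsupported type: {type(value).__name__}")


plain_item = {
    "year": 1997,
    "title": "Titantic",
    "test_number": 99.99,
    "test_bool": True,
    "test_list": [1, "Hello"],
    "test_string_set": {"Hello", "World"},
}
typed_item = {name: to_dynamo_value(value) for name, value in plain_item.items()}
```

A Python list maps to the ordered `L` type, while a set maps to `SS` or `NS`, which DynamoDB treats as unordered collections of unique values.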

### DataFrames

What if we want to load a Pandas DataFrame? The process is similar to the above, but we first append each column's DynamoDB data type to its name so that each row can be converted into a typed dictionary before it is loaded.

In [22]:
#create the dataframe
d = pd.DataFrame({
    "year":[1998,1999,2000,2001],
    "title":["Armageddon","The Matrix","Remember the Titans","Donnie Darko"],
    "gross":[554_000_000,
             463_517_383,
             115_600_000,
             7_510_877],
    "starring":[
        ['Bruce Willis',"Billy Bob Thornton","Ben Affleck","Liv Tyler"],
        ['Keanu Reeves','Laurence Fishburne','Carrie-Anne Moss','Hugo Weaving'],
        ['Denzel Washington','Will Patton','Wood Harris','Ryan Hurst'],
        ['Jake Gyllenhaal','Holmes Osborne','Maggie Gyllenhaal']
    ]
})

#now we define the Dynamo data types for the attributes we are loading.
#these are appended to the column names in parentheses.
column_definitions = {
    "year":"year (N)",
    "title": "title (S)",
    "gross": "gross (N)",
    "starring": "starring (SS)" #string set
}

d = d.rename(columns=column_definitions)

In [23]:
d

Unnamed: 0,year (N),title (S),gross (N),starring (SS)
0,1998,Armageddon,554000000,"[Bruce Willis, Billy Bob Thornton, Ben Affleck..."
1,1999,The Matrix,463517383,"[Keanu Reeves, Laurence Fishburne, Carrie-Anne..."
2,2000,Remember the Titans,115600000,"[Denzel Washington, Will Patton, Wood Harris, ..."
3,2001,Donnie Darko,7510877,"[Jake Gyllenhaal, Holmes Osborne, Maggie Gylle..."
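The `name (TYPE)` labels carry enough information to turn each row into a typed item. The sketch below shows how such labels could be parsed and applied to a single row. `split_label` and `row_to_item` are hypothetical helpers, and the way `AWS.load_dynamo` handles these labels internally is an assumption.

```python
import re
from typing import Any, Dict, Tuple

# Matches labels such as "year (N)" or "starring (SS)"
LABEL_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<type>[A-Z]+)\)$")


def split_label(label: str) -> Tuple[str, str]:
    """Split 'name (TYPE)' into its attribute name and DynamoDB type code."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"column label has no type suffix: {label!r}")
    return match.group("name"), match.group("type")


def row_to_item(row: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Convert one row keyed by 'name (TYPE)' labels into a typed item."""
    item = {}
    for label, value in row.items():
        name, dynamo_type = split_label(label)
        if dynamo_type == "N":
            item[name] = {"N": str(value)}  # numbers are sent as strings
        elif dynamo_type == "SS":
            item[name] = {"SS": [str(v) for v in value]}
        else:
            item[name] = {dynamo_type: value}
    return item


# The last row of the DataFrame above, written as a plain dictionary
row = {
    "year (N)": 2001,
    "title (S)": "Donnie Darko",
    "gross (N)": 7510877,
    "starring (SS)": ["Jake Gyllenhaal", "Holmes Osborne", "Maggie Gyllenhaal"],
}
item = row_to_item(row)
```

With pandas, the rows could be obtained as dictionaries via `d.to_dict(orient="records")`.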


In [24]:
AWS.load_dynamo(
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    region_name=region, 
    tbl_name=tbl_name, 
    d=d
)

int64
object
int64
object
Loading df record: 0
Loading df record: 1
Loading df record: 2
Loading df record: 3


{'ResponseMetadata': {'RequestId': 'BOULGEJTS23OABOFHKHAHIA6DJVV4KQNSO5AEMVJF66Q9ASUAAJG',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Server',
   'date': 'Sat, 26 Sep 2020 23:49:43 GMT',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'BOULGEJTS23OABOFHKHAHIA6DJVV4KQNSO5AEMVJF66Q9ASUAAJG',
   'x-amz-crc32': '2745614147'},
  'RetryAttempts': 0}}

## Retrieving Data

Let's look at some examples where we retrieve data from DynamoDB.

### Getting Fields

Here we have a helpful function to just grab the list of fields in our table.

In [25]:
fields = \
    AWS.get_dynamo_fields(
        tbl_name=tbl_name, 
        pub=os.environ['aws_secret_access_id_cc'], 
        sec=os.environ['aws_secret_access_key_cc'],
        region_name=region
    )

In [26]:
fields

['title',
 'test_string',
 'test_number',
 'test_string_set',
 'test_list',
 'test_numbered_set',
 'starring',
 'test_bool',
 'gross',
 'year']

### Dictionaries

The first example is pretty straightforward: let's get a list of dictionaries from our DynamoDB table.

We leave fields=False to select all the attributes. To select only some of them, we can pass a list of attribute names instead.

In [27]:
ret = \
    AWS.query_dynamo(
        tbl_name=tbl_name, 
        pub=os.environ['aws_secret_access_id_cc'], 
        sec=os.environ['aws_secret_access_key_cc'],
        region_name=region,
        data_format="list", 
        fields=False
    )

In [28]:
ret

[{'title': ['The Matrix',
   'Titantic',
   'Remember the Titans',
   'Armageddon',
   'Donnie Darko']},
 {'test_string': [None, 'Australia', None, None, None]},
 {'test_number': [None, Decimal('99.99'), None, None, None]},
 {'test_string_set': [None,
   {'Bar', 'Foo', 'Hello', 'World'},
   None,
   None,
   None]},
 {'test_list': [None, [Decimal('1'), 'Hello'], None, None, None]},
 {'test_numbered_set': [None,
   {Decimal('0'), Decimal('1'), Decimal('2'), Decimal('3'), Decimal('4')},
   None,
   None,
   None]},
 {'starring': [{'Carrie-Anne Moss',
    'Hugo Weaving',
    'Keanu Reeves',
    'Laurence Fishburne'},
   None,
   {'Denzel Washington', 'Ryan Hurst', 'Will Patton', 'Wood Harris'},
   {'Ben Affleck', 'Billy Bob Thornton', 'Bruce Willis', 'Liv Tyler'},
   {'Holmes Osborne', 'Jake Gyllenhaal', 'Maggie Gyllenhaal'}]},
 {'test_bool': [None, True, None, None, None]},
 {'gross': [Decimal('463517383'),
   None,
   Decimal('115600000'),
   Decimal('554000000'),
   Decimal('7510877')]},
 {'year': [Decimal('1999'),
   Decimal('1997'),
   Decimal('2000'),
   Decimal('1998'),
   Decimal('2001')]}]

Let's select just year and title now.

In [29]:
ret = \
    AWS.query_dynamo(
        tbl_name=tbl_name, 
        pub=os.environ['aws_secret_access_id_cc'], 
        sec=os.environ['aws_secret_access_key_cc'],
        region_name=region,
        data_format="list", 
        fields=["year","title"]
    )

In [30]:
ret

[{'year': [Decimal('1999'),
   Decimal('1997'),
   Decimal('2000'),
   Decimal('1998'),
   Decimal('2001')]},
 {'title': ['The Matrix',
   'Titantic',
   'Remember the Titans',
   'Armageddon',
   'Donnie Darko']}]
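The "list" format is column-oriented: one dictionary per attribute, each holding a list of values in the same item order. If row-oriented records are more convenient, the columns can be zipped back together, and the `Decimal` values converted to plain numbers along the way. This is a small sketch with hypothetical helper names (`columns_to_records`, `plain_number`), using a shortened copy of the output above.

```python
from decimal import Decimal
from typing import Any, Dict, List


def columns_to_records(columns: List[Dict[str, list]]) -> List[Dict[str, Any]]:
    """Turn [{'a': [1, 2]}, {'b': [3, 4]}] into [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]."""
    merged: Dict[str, list] = {}
    for column in columns:
        merged.update(column)
    names = list(merged)
    return [dict(zip(names, values)) for values in zip(*merged.values())]


def plain_number(value: Any) -> Any:
    """Convert Decimal to int (when whole) or float; leave other values alone."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    return value


# First two items of the year/title result shown above
ret_sample = [
    {"year": [Decimal("1999"), Decimal("1997")]},
    {"title": ["The Matrix", "Titantic"]},
]
records = [
    {name: plain_number(value) for name, value in record.items()}
    for record in columns_to_records(ret_sample)
]
```

Converting `Decimal` to `float` can lose precision, so it is best kept for display or analysis rather than for values that are written back to the table.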

### DataFrames

Let's try getting Dynamo data into Pandas.

In [31]:
ret = \
    AWS.query_dynamo(
        tbl_name=tbl_name, 
        pub=os.environ['aws_secret_access_id_cc'], 
        sec=os.environ['aws_secret_access_key_cc'],
        region_name=region,
        data_format="df", 
        fields=False
    )

In [32]:
ret

Unnamed: 0,title,test_string,test_number,test_string_set,test_list,test_numbered_set,starring,test_bool,gross,year
0,The Matrix,,,,,,"{Keanu Reeves, Hugo Weaving, Laurence Fishburn...",,463517383.0,1999
1,Titantic,Australia,99.99,"{Foo, Hello, Bar, World}","[1, Hello]","{0, 1, 2, 3, 4}",,True,,1997
2,Remember the Titans,,,,,,"{Ryan Hurst, Will Patton, Denzel Washington, W...",,115600000.0,2000
3,Armageddon,,,,,,"{Bruce Willis, Ben Affleck, Billy Bob Thornton...",,554000000.0,1998
4,Donnie Darko,,,,,,"{Holmes Osborne, Maggie Gyllenhaal, Jake Gylle...",,7510877.0,2001


And, as before, let's select just the year and title fields.

In [33]:
ret = \
    AWS.query_dynamo(
        tbl_name=tbl_name, 
        pub=os.environ['aws_secret_access_id_cc'], 
        sec=os.environ['aws_secret_access_key_cc'],
        region_name=region,
        data_format="df", 
        fields=["year","title"]
    )

In [34]:
ret

Unnamed: 0,year,title
0,1999,The Matrix
1,1997,Titantic
2,2000,Remember the Titans
3,1998,Armageddon
4,2001,Donnie Darko


### Redshift

Finally, let's copy data from our DynamoDB table directly into a Redshift table.

In [35]:
#import the Redshift connection class
from pymagic.db_conn_tools import Redshift

#psycopg2
cursor_rs, conn_rs = Redshift.conn_rs_pg(
    host=os.environ['example_rs_host'],
    db=os.environ['example_rs_db'],
    user=os.environ['example_rs_user'],
    pwd=os.environ['example_rs_pwd'],
    port=os.environ['example_rs_port'],
)

#import the Redshift ETL class
from pymagic.db_etl_tools import Redshift

# tbl_name = "rs_test_data_from_s3"
tbl_name = "dynamo_test"

**Important:** When copying from DynamoDB, Redshift only supports Number and String attributes for now, so let's write our CREATE TABLE query to create just those types of columns.

In [36]:
#if we already created the table, let's drop it so we can re-create it
try:
    Redshift.run_query_rs(sql=f"drop table {tbl_name}",conn=conn_rs)
except Exception:
    #if the drop fails (e.g. the table does not exist), end the failed transaction
    conn_rs.commit()

#create the new table
sql = \
f'''
CREATE TABLE {tbl_name} ( year INTEGER, gross FLOAT, title VARCHAR(20))
'''
Redshift.run_query_rs(sql=sql,conn=conn_rs)

Runtime: 0.0025813666666666666
Runtime: 0.0025795666666666665


In [38]:
AWS.dynamo_to_redshift(
    cursor=cursor_rs, 
    pub=os.environ['aws_secret_access_id_cc'], 
    sec=os.environ['aws_secret_access_key_cc'],
    tbl_name_rs=tbl_name, 
    tbl_name_dynamo=tbl_name, 
    readratio=10, 
    fields=["year","gross","title"]
)
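Redshift can read a DynamoDB table directly with its 'COPY' command by using a `dynamodb://` source, and `READRATIO` sets the percentage of the table's provisioned read throughput that the load may use (an integer from 1 to 200). The sketch below assembles such a statement. `build_dynamo_copy_sql` is a hypothetical helper, the SQL that `AWS.dynamo_to_redshift` actually runs is not shown here, and the `fields` argument is left out because its handling is unknown.

```python
def build_dynamo_copy_sql(tbl_name_rs: str, tbl_name_dynamo: str,
                          pub: str, sec: str, readratio: int = 10) -> str:
    """Assemble a Redshift COPY statement that reads from a DynamoDB table."""
    if not 1 <= readratio <= 200:
        raise ValueError("readratio must be an integer between 1 and 200")
    credentials = f"aws_access_key_id={pub};aws_secret_access_key={sec}"
    return (
        f"COPY {tbl_name_rs} "
        f"FROM 'dynamodb://{tbl_name_dynamo}' "
        f"CREDENTIALS '{credentials}' "
        f"READRATIO {readratio}"
    )


# Placeholder credentials, not real values
copy_statement = build_dynamo_copy_sql(
    tbl_name_rs="dynamo_test",
    tbl_name_dynamo="dynamo_test",
    pub="PUB",
    sec="SEC",
    readratio=10,
)
```

COPY matches DynamoDB attribute names to Redshift column names, which is why the table created above uses the column names `year`, `gross`, and `title`.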