<a href="https://colab.research.google.com/github/rzl-ds/gu511/blob/master/010_s3.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# `aws s3` (Simple Storage Service)

when it comes to managing your data, you can get pretty far with just an `ec2` server and a virtual hard drive.

for example, right now I have a `postgres` server running on an `ec2` instance with a 64G hard drive, and that server has been polling the [`wmata` train position api](https://developer.wmata.com/docs/services/5763fa6ff91823096cac1057/operations/5763fb35f91823096cac1058) every 10 seconds for about 10 months. it currently holds about 200 million records. not exactly big data, but not exactly small data either.

as another example, consider the long-running project (mentioned last lecture) that scrapes power outage information in 15 minute intervals. the resulting `json` files are saved to the `/data` directory directly on that machine, and (if nothing changes) it will keep on downloading and storing those files forever

there are disadvantages to on-disk storage, though, like:

?

1. access
    1. you have to log in to the linux server to access the files
    2. this means I have to grant selective access to people
2. cost
    1. disk space (specifically: EBS (elastic block store)) on an `ec2` service is expensive, running about $0.10 per GB-month
3. manual administration
    1. if I want to grow the hard drive size, I have to do it myself
    2. I have to know *when* that's going to happen, or I could fill my harddrive without paying attention
    3. I have to intentionally set up backup policies and redundancy mechanisms on my `ec2` server myself

`s3` is a service offered by `aws` to be a central file storage location, accessible from everywhere via standard REST api requests (`GET, PUT, COPY, POST, LIST`), which addresses some of those disadvantages:

1. access
    1. you can control access to any "bucket" (basically, a top-level directory in the file system) from the web console
    2. you can control permissions down to the individual file level
    3. you can be as permissive or as restrictive as you desire
2. cost
    1. standard storage starts at about $0.023 per GB-month (so about 23% of the EBS `ec2` hard drive cost) for the first 50 TB, and gets *cheaper* from there
3. manual administration
    1. will grow on the fly without administration
    2. redundancy and backup is a built-in service option
    3. I can have easy-access logging and version information
    4. I can just host a static webpage with the click of a button
    
it's not perfect, but it's pretty cool
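
to make the cost comparison concrete, here is a rough back-of-the-envelope sketch using only the two list prices quoted above ($0.10 and $0.023 per GB-month). the 64 GB figure is the drive size from the `postgres` example; the helper name `monthly_storage_cost` is made up for illustration, and a real bill would also include request and transfer charges, which this ignores.

```python
# list prices quoted above, in dollars per GB-month (storage only)
EBS_PRICE = 0.10
S3_STANDARD_PRICE = 0.023


def monthly_storage_cost(size_gb, price_per_gb_month):
    """storage-only monthly cost in dollars (hypothetical helper)"""
    return size_gb * price_per_gb_month


size_gb = 64  # the drive size from the postgres example
ebs_cost = monthly_storage_cost(size_gb, EBS_PRICE)
s3_cost = monthly_storage_cost(size_gb, S3_STANDARD_PRICE)

print(f"ebs: ${ebs_cost:.2f} / month")
print(f"s3:  ${s3_cost:.2f} / month")
print(f"s3 is {s3_cost / ebs_cost:.0%} of the ebs cost")
```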

and what about cons?

1. size limit
    1. there is a 5T size limit for individual files
    2. there is a 5G size limit for individual `PUT` requests
    3. there is a separate process recommended for loading large files: a [multipart upload](http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)
2. small changes to files are inefficient
    1. require full download and re-upload
3. sub-optimal as mounted drives on a server
4. charged *per read*
    1. not just charged for taking up space
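
the size limits above suggest a simple decision rule, sketched below. the function name and the returned labels are hypothetical, and treating "G" and "T" as binary units (powers of 1024) is an assumption on my part.

```python
GB = 1024 ** 3
TB = 1024 ** 4

MAX_SINGLE_PUT = 5 * GB   # size limit for one PUT request
MAX_OBJECT_SIZE = 5 * TB  # size limit for one file


def upload_strategy(size_bytes):
    """decide how a file of this size could be uploaded (hypothetical helper)"""
    if size_bytes > MAX_OBJECT_SIZE:
        return 'too big'
    if size_bytes > MAX_SINGLE_PUT:
        return 'multipart'
    return 'single put'


for size in (500 * 1024 ** 2, 6 * GB, 6 * TB):
    print(size, upload_strategy(size))
```

note that multipart uploads are *recommended* well below the 5G mark; this only captures where they become mandatory.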

## buckets

open your `aws` console and go to the `s3` service

https://s3.console.aws.amazon.com/s3/home?region=us-east-1

everywhere you look: "Buckets". a *bucket* is effectively a top-level directory for a family of related files.

you *could* create a single "data" bucket and keep everything in there, but you probably don't want to. why?

1. a broad interpretation of the "separation of concerns" principle applies
    1. you don't want to mix different files that are doing different things for different purposes
2. permissions will be dicey
    1. eventually you may want to be very restrictive or very permissive
    2. you *can* control permissions on a per-file basis in `s3`, so it is *possible*
    3. *but* other project stakeholders may not approve of this at all, or it may just be more onerous to set permissions on a per-file basis than on a bucket-wide level
3. descriptive names actually make for better code
    1. long path names that have nothing to do with a task or project are wasted words
    2. descriptive names can help you understand exactly what you're looking for from the main page

**<div align="center">mini exercise: create an `s3` bucket</div>**

1. pick a name -- it has to be *globally* unique (that is, across *all* of `s3`, not just *your* buckets)
    1. one convention: use the "url" style of naming things, e.g. something like `*****.com`. I often use `***********.lamberty.io` bucket names because I own that domain
    2. it doesn't have to be that way -- just has to be unique
    3. no uppercase letters and no underscore characters (`_`)
2. go with the NOVA region
3. tag 'em!
4. submit
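
if you want to sanity-check a candidate name before typing it into the console, a small sketch like the one below can help. it only checks the basic character and length rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). it does *not* check global uniqueness or the finer-grained rules (e.g. names that look like ip addresses), and the function name is made up for illustration.

```python
import re

# lowercase letters, digits, dots, and hyphens; must start and end with a
# letter or digit; 3 to 63 characters in total
BUCKET_NAME_PATTERN = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')


def looks_like_valid_bucket_name(name):
    """rough check of the basic naming rules (hypothetical helper)"""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None


for candidate in ('test1.rzl.gu511.com', 'My_Bucket', 'ab'):
    print(candidate, looks_like_valid_bucket_name(candidate))
```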

**<div align="center">tour the `s3` web console</div>**

1. main page
    1. pretty self-explanatory: a list of buckets and the ability to create new ones
    1. click on any bucket *line* and a right context menu comes up
        1. basic descriptions of the three types of configuration values (properties, permissions, management), and the ability to quickly access the `arn` (amazon resource name)
1. bucket page (off of any bucket name link)
    1. overview tab
        1. allows you to upload a file, create a directory, and set object permissions
        1. if you have files, you can click on those lines and bring up a right context menu
    1. properties tab
        1. versioning: you have the ability to record version histories of every file in your bucket (off by default)
        1. server, object logging: log (to `s3`, $ for `aws`) access and file creation / manipulation log records, or specific `rest`ful requests
        1. static website hosting: host a webpage straight out of your `s3` bucket (not bad huh?)
        1. encryption: set up server-side encryption (that is, scrambled on `aws`'s servers) for your files
        1. tags: good for collecting resources for shared purpose or project
        1. transfer acceleration: pay for faster access
        1. events: create a trigger when a file gets dropped to a location
        1. requester pays: like calling collect, but for the internet age
    1. permissions tab
        1. block public access 
        1. access control list: control bucket-level permissions for `aws` users and the public
        1. bucket policy: write a resource-based policy (same `json` syntax as an `iam` policy) that is attached to the bucket itself and says which users or roles can do what with it
        1. CORS configuration: advanced; allows you to allow other web services to access these files as if they were local to that web service (forbidden by default)
    1. management tab
        1. lifecycle: making sure you pay less for old stuff you don't use
        1. replication: making sure you have multiple copies of your most important things
        1. a suite of other analytics and monitoring tools; not useful at this time since we have no information in these buckets

### putting things in the bucket: web interface

putting things into a bucket via the web interface is pretty intuitive, so let's not waste a lot of time

1. click on the bucket name
2. create a directory if you want
3. click the upload button

while uploading, you'll see a couple of different dialogs. let's upload a file together, and discuss some of the dialogs while we're doing it

**<div align="center">mini exercise: add a file to the bucket via the web console</div>**

1. create a text file on your *local* computer (not your `ec2` server)
    1. do it any way you want -- something super simple.
2. in your bucket, create a folder called "helloworld"
3. click the "Upload" button
    1. select your file
    2. leave permissions as-is (but you could make it public here!)
    3. leave properties as-is
        1. storage class discussion (notes below)
        2. encryption discussion (notes below)
        3. metadata discussion (notes below)
    4. review and upload

exit for notes

#### storage classes

there are different ways of storing files in `s3`, and the primary dimensions of difference are 

1. cost per GB of storage space used
2. cost per request for or upload of a file
3. availability (how often will the system fail to find your file at a given moment)
4. redundancy (how many backup copies of your file exist across all of the aws internal storage system)

given these dimensions of difference, the options available are

+ standard
    + default behavior
+ intelligent tiering
    + halfway between standard (above) and standard-ia (below)
    + wait for your users to tell you how rarely or frequently you need files
+ standard-ia
    + `ia` for "Infrequent Access"
    + cheaper storage but more expensive per access request
    + good for long-term storage of files
    + you can rotate files from `standard` to `standard-ia`
+ one-zone ia
    + this option is a little bit cheaper and made for things you are "more comfortable" losing
    + still not extremely likely to lose files this way
+ `glacier`
    + this is a "deeper" level of storage than `standard-ia`
    + much cheaper per `GB`
    + requests for files are neither instant nor cheap
    + this is basically just tape archives of your files
+ `glacier` deep archive
    + go even deeper -- in`s3`ption
    + if you can afford to wait hours to get something every few months, this is the storage level for you
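
outside of the web console, the storage class is just another argument on the upload request. the sketch below maps the names used in this list to the `StorageClass` values the `s3` api expects and passes them through `upload_file`'s `ExtraArgs`. the dictionary keys and both function names are my own labels; `bucket` is assumed to be a `boto3` `Bucket` resource (or anything else with a matching `upload_file` method).

```python
# names from the list above -> values the s3 api expects
STORAGE_CLASSES = {
    'standard': 'STANDARD',
    'intelligent tiering': 'INTELLIGENT_TIERING',
    'standard-ia': 'STANDARD_IA',
    'one-zone ia': 'ONEZONE_IA',
    'glacier': 'GLACIER',
    'glacier deep archive': 'DEEP_ARCHIVE',
}


def storage_class_args(name):
    """build the ExtraArgs dict for a storage class (hypothetical helper)"""
    return {'StorageClass': STORAGE_CLASSES[name]}


def upload_with_storage_class(bucket, filename, key, name):
    """upload a local file using the given storage class

    bucket is assumed to be a boto3 Bucket resource, e.g.
    boto3.resource('s3').Bucket('my-bucket-name')
    """
    bucket.upload_file(Filename=filename,
                       Key=key,
                       ExtraArgs=storage_class_args(name))
```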

#### encryption

the files you are loading to and downloading from `s3` can be encrypted or unencrypted (*aka* in plain text) during different stages of the process

+ on upload or download: the text sent over the internet in the upload or download request could be plain text or encrypted
+ in storage: the actual item stored in `s3` could be encrypted or unencrypted

there are basically two times a malicious agent could "steal" your data:

1. in transit (during upload or download over the internet)
2. at rest (while it's located on some aws machine hosting the `s3` service)

the different types of encryption policies and schemes basically boil down to which of those options you try to prevent.

if you don't care at all, you're not performing any encryption.

now suppose you only care to prevent the latter (at rest, people "hacking" `s3` as a service), and you don't care if you're sending that information over the internet un-encrypted. schemes that allow un-encrypted information to be transferred from your local computer to the `s3` service and *then* encrypt the information for `s3` storage, and which decrypt that information *prior* to sending it back to you (the client) are referred to as *server-side encryption*, because all of the information about how files are encrypted / decrypted resides on the server side of the transaction.

now suppose you want to make sure your *unencrypted* data never leaves your physical network (e.g. your computer, or your government-agency-wide network). schemes which focus on encrypting files before upload and decrypting them after download are referred to as *client-side encryption* because the knowledge about how files are encrypted / decrypted resides on the client (your local computer) side of the transaction.

if you're a paranoid sadist, you could do both -- have the server "double" encrypt the encrypted message.

when you're selecting the encryption option on the file upload dialog, you are *only* talking about server-side encryption. This will not affect how data is transferred to and from `s3`, but rather under what conditions it will be saved within `s3` itself. there are three options

+ none
    + the file will be saved in `s3` exactly as it was presented to `s3`
    + this is the chosen option for both no encryption and client-side encryption 
+ amazon `s3` master-key
    + this server-side encryption option will manage the keys within the `s3` service itself
    + each file will get a unique key
+ amazon `kms` master-key
    + this also does server-side encryption
    + encryption keys are controlled via a separate service, the `kms` service
        + you access this service as part of the `iam` service, `iam` > "encryption keys"
    + for the `cli` and the `sdk`s, there are some "signature" issues (the signature of the files returned in `GET` requests is not the `md5` sum of the file), but they are trivial to fix
    + the file returned to you is in plain text
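
the same three choices show up as request arguments when you upload from code. the sketch below builds the `ExtraArgs` dictionary that `boto3`'s `upload_file` accepts for each one. the option labels (`'none'`, `'s3'`, `'kms'`) and the function name are made up for illustration; `ServerSideEncryption` and `SSEKMSKeyId` are the actual request parameter names.

```python
def encryption_args(option, kms_key_id=None):
    """build upload ExtraArgs for a server-side encryption choice

    option is one of 'none', 's3', or 'kms' (hypothetical labels for the
    three console options above)
    """
    if option == 'none':
        # no encryption arguments are sent with the request
        return {}
    if option == 's3':
        # keys are managed within the s3 service itself
        return {'ServerSideEncryption': 'AES256'}
    if option == 'kms':
        # keys are managed by the kms service
        args = {'ServerSideEncryption': 'aws:kms'}
        if kms_key_id is not None:
            args['SSEKMSKeyId'] = kms_key_id
        return args
    raise ValueError(f'unknown encryption option: {option!r}')


# example (assuming bucket is a boto3 Bucket resource):
# bucket.upload_file(Filename='/tmp/helloworld.txt',
#                    Key='test/helloworld.encrypted.txt',
#                    ExtraArgs=encryption_args('s3'))
```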

#### metadata

there are several options in terms of meta-data. there are effectively three types of metadata fields on offer here:

1. standard `http` header message metadata
    1. these control how web-aware applications will handle your `s3` files
    2. of these, the one most worth knowing about from the start is the `content-type`, a field which offers a hint at how applications are supposed to interact with the file based on its type
2. `aws s3`-specific metadata
3. user-specific metadata
    1. for whatever information I want to tag these files with
    2. can be useful for bulk processing of your files, for example, based on header data queries
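
as a sketch of how the first and third kinds of metadata can be set from code: the helper below guesses a `content-type` from the file name using the standard library and attaches any user-specific key / value pairs. the function name and the fallback content type are my own choices; `ContentType` and `Metadata` are the argument names `boto3`'s `upload_file` accepts in `ExtraArgs`.

```python
import mimetypes


def metadata_args(filename, user_metadata=None):
    """build upload ExtraArgs with a guessed content-type (hypothetical helper)"""
    # guess_type returns a (type, encoding) pair; type is None if unknown
    content_type, _ = mimetypes.guess_type(filename)
    args = {'ContentType': content_type or 'application/octet-stream'}
    if user_metadata:
        # user-specific metadata: string keys and string values
        args['Metadata'] = dict(user_metadata)
    return args


print(metadata_args('/tmp/helloworld.txt', {'project': 'gu511'}))

# example (assuming bucket is a boto3 Bucket resource):
# bucket.upload_file(Filename='/tmp/helloworld.txt',
#                    Key='test/helloworld.meta.txt',
#                    ExtraArgs=metadata_args('/tmp/helloworld.txt',
#                                            {'project': 'gu511'}))
```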

now that you've created files, let's investigate the items available in the web console

1. click on the file *line* and review the side panel
    1. click on the URL in the side panel
    2. go back in your browser (to the file overview tab)
2. the file overview tab 
    1. the URL of the file is the same as the previous link
    2. you can click "open", though, to actually open this file
    3. you could download from here
3. click on the "test" link in the breadcrumbs menu at the top
4. right click one of the line items for a full context menu of options

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

### putting things in the bucket: `cli`

that's all well and good, but eventually we're going to want to be interacting with the `s3` service via applications and from the command line. fortunately, both the `aws cli` and the `python boto3` libraries make that straight-forward.

log in to your `ec2` server and check out the things that the `aws cli` can do with the `s3` service:

```bash
aws s3 help
```

**<div align="center">mini exercise: add a file to the bucket via the `aws cli`</div>**

1. create a text file on your `ec2` server
2. upload the file to the `s3` service
    1. use the `aws s3 cp` command
    2. post it to the bucket and directory you created above
3. verify (in the web console) that your file was posted successfully

to do the above:

```bash
echo "hello world" >> /tmp/helloworld.txt
aws s3 cp /tmp/helloworld.txt s3://test1.rzl.gu511.com/test/helloworld.cli.txt
```

### putting things in the bucket: `python` and `boto3`

under the hood, the `cli` is built on `botocore`, the same low-level library that `boto3` is built on -- so we should be able to easily reproduce this behavior within a `python` script / shell.

with `s3` we can *upload* or *download* file objects, and with `python` we can do it from/to files or `python` `str`s

for starters, let's go ahead and enter a `python` shell:

```bash
python
```

and then

```python
import boto3
session = boto3.session.Session(region_name='us-east-1')
s3 = session.resource('s3')
```

```python
# import boto3

# s3 = boto3.client('s3')
# help(s3.upload_file)
```

**<div align="center">mini exercise: upload a local file to the bucket using the `boto3` `python` library</div>**

1. create an `s3` `resource` object
2. get the `bucket` object corresponding to our desired bucket
3. use the `bucket.upload_file` function
4. verify it worked on the `s3` console

the above can be done in only a few lines:

```python
import boto3

session = boto3.session.Session(region_name='us-east-1')

s3 = session.resource('s3')
bucket = s3.Bucket('test1.rzl.gu511.com')
bucket.upload_file(Filename='/tmp/helloworld.txt',
                   Key='test/helloworld.boto3.txt')
```

**<div align="center">mini exercise: download an `s3` file to the local disk</div>**

1. create an `s3` `resource` object
2. get the `bucket` object corresponding to our desired bucket
3. use the `bucket.download_file` method

the above can be done in only a few lines:

```python
import boto3

session = boto3.session.Session(region_name='us-east-1')

s3 = session.resource('s3')
bucket = s3.Bucket('test1.rzl.gu511.com')
bucket.download_file(Key='test/helloworld.boto3.txt',
                     Filename='/tmp/helloworld.cli.downloaded.txt')
```

**<div align="center">mini exercise: create an `s3` file and update contents from a `str` (no local file)</div>**

1. create an `s3` `resource` object
2. get the `bucket` object corresponding to our desired bucket
3. use the `Object` method to create a new file object on that bucket, and set the body of that file to be a hard-coded string message
4. verify it worked on the `s3` console

the above can be done in only a few lines:

```python
import boto3

session = boto3.session.Session(region_name='us-east-1')

s3 = session.resource('s3')
bucket = s3.Bucket('test1.rzl.gu511.com')
s3file = bucket.Object(key='test/goodbyeworld.boto3.txt')
s3file.put(Body='goodbye, world!',
           ContentType='text/plain')
```

**<div align="center">mini exercise: loading the contents of an `s3` file to a `str`</div>**

1. create an `s3` `resource` object
2. get the `bucket` object corresponding to our desired bucket
3. use the `bucket.Object` method to access a file object
4. use the file object's `get` method to get a `dict` response describing the file (metadata plus a `Body` stream)
5. use the `read` method of the `get` response's `Body` value to read the contents as `bytes`, and `decode` those to get a string

the above can be done in only a few lines:

```python
import boto3

session = boto3.session.Session(region_name='us-east-1')

s3 = session.resource('s3')
bucket = s3.Bucket('test1.rzl.gu511.com')
s3file = bucket.Object(key='test/goodbyeworld.boto3.txt')

response = s3file.get()
print(response)

# the body is a stream of bytes; decode it to get a python str
response['Body'].read().decode('utf-8')
```

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

<strong><em><div align="center">sssuper</div></em></strong>
<div align="center"><img src="https://vice-images.vice.com/images/content-images-crops/2016/07/27/that-s-thing-everyone-drew-in-school-what-is-it-body-image-1469592131-size_1000.jpg?output-quality=75" width="500px"></div>

# END OF LECTURE

next lecture: [`aws` serverless function architecture `lambda`](011_lambda.ipynb)

### putting things in the bucket: multipart uploads

I worked on the following, but have yet to find a solid motivating test case. not to mention: in the `python boto3` library the managed transfer methods (e.g. `upload_file`) are preferred, and they chunk large files into multipart uploads as a matter of implementation, so doing this by hand is rarely necessary.

```bash
# create a big file of zeroes
FILE_NAME=output.dat
dd if=/dev/zero of=$FILE_NAME bs=1M count=500

# some variables
BUCKET_NAME=test1.rzl.gu511.com
KEY=test/output.dat

# create a multipart upload and capture the upload id from the json response
JUI=$(aws s3api create-multipart-upload --bucket $BUCKET_NAME --key $KEY \
    --metadata md5=$(openssl md5 -binary $FILE_NAME | base64) \
    --query UploadId --output text)

# split the file into 5 parts: output_00, output_01, ..., output_04
split -n 5 -d $FILE_NAME output_

# upload each part (part numbers start at 1)
PART=1
for FNOW in output_0*; do
    FMD5NOW=$(openssl md5 -binary $FNOW | base64)
    aws s3api upload-part --bucket $BUCKET_NAME --key $KEY --part-number $PART \
        --body $FNOW --upload-id $JUI --content-md5 $FMD5NOW
    PART=$((PART + 1))
done

# capture the uploaded parts to a json file containing only the "Parts" key,
# and for each part only the "ETag" and "PartNumber" elements
# (see https://aws.amazon.com/premiumsupport/knowledge-center/s3-multipart-upload-cli/)
aws s3api list-parts --bucket $BUCKET_NAME --key $KEY --upload-id $JUI \
    --query '{Parts: Parts[].{ETag: ETag, PartNumber: PartNumber}}' \
    --output json > mpu.json

# finalize
aws s3api complete-multipart-upload --multipart-upload file://mpu.json \
    --bucket $BUCKET_NAME --key $KEY --upload-id $JUI
```

```python
import boto3

s3 = boto3.client('s3')

BUCKET = 'test1.rzl.gu511.com'
KEY = 'test/output.dat'


createresp = s3.create_multipart_upload(
    Bucket=BUCKET,
    Key=KEY
)

# rather than do this, you should just use the built-in
# upload_file function, which splits large files into parts
# for you. doing it by hand means getting into the issues of
# reading a file in chunks, which is a bit much right now
# not that it wasn't a bit much up above, here...

abortresp = s3.abort_multipart_upload(
    Bucket=BUCKET,
    Key=KEY,
    UploadId=createresp['UploadId'],
)

# list parts to double-verify
listresp = s3.list_parts(
    Bucket=BUCKET,
    Key=KEY,
    UploadId=createresp['UploadId']
)
```
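
for the curious, here is a minimal sketch of what "reading a file in chunks" could look like if you did want to drive a multipart upload by hand. the function names and the `part_size` parameter are made up for illustration; `s3` is assumed to be a `boto3` client (e.g. `boto3.client('s3')`), and `create_multipart_upload`, `upload_part`, `complete_multipart_upload`, and `abort_multipart_upload` are the client methods it is assumed to provide. in practice, `upload_file` is still the better choice.

```python
# s3 requires every part except the last to be at least 5 MiB
MIN_PART_SIZE = 5 * 1024 ** 2


def iter_parts(path, part_size=MIN_PART_SIZE):
    """yield (part_number, chunk_of_bytes) pairs, numbering parts from 1"""
    with open(path, 'rb') as f:
        part_number = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            yield part_number, chunk
            part_number += 1


def multipart_upload(s3, bucket, key, path, part_size=MIN_PART_SIZE):
    """upload a local file in parts (a sketch, not production code)"""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']
    parts = []
    try:
        for part_number, chunk in iter_parts(path, part_size):
            resp = s3.upload_part(Bucket=bucket, Key=key,
                                  PartNumber=part_number,
                                  UploadId=upload_id, Body=chunk)
            # the final request needs each part's ETag and PartNumber
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
        return s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts})
    except Exception:
        # don't leave a half-finished upload (and its stored parts) behind
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

reading one chunk at a time keeps memory use at roughly one part, no matter how large the file is.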