[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# Amazon S3

## Table of Contents
* [1. CSV files](#1.-CSV-files)
	* [1.1 Writing CSV files](#1.1-Writing-CSV-files)
	* [1.2 Reading single CSV file](#1.2-Reading-single-CSV-file)
	* [1.3 Reading multiple CSV files](#1.3-Reading-multiple-CSV-files)
		* [1.3.1 Reading CSV by list](#1.3.1-Reading-CSV-by-list)
		* [1.3.2 Reading CSV by prefix](#1.3.2-Reading-CSV-by-prefix)
* [2. JSON files](#2.-JSON-files)
	* [2.1 Writing JSON files](#2.1-Writing-JSON-files)
	* [2.2 Reading single JSON file](#2.2-Reading-single-JSON-file)
	* [2.3 Reading multiple JSON files](#2.3-Reading-multiple-JSON-files)
		* [2.3.1 Reading JSON by list](#2.3.1-Reading-JSON-by-list)
		* [2.3.2 Reading JSON by prefix](#2.3.2-Reading-JSON-by-prefix)
* [3. Parquet files](#3.-Parquet-files)
	* [3.1 Writing Parquet files](#3.1-Writing-Parquet-files)
	* [3.2 Reading single Parquet file](#3.2-Reading-single-Parquet-file)
	* [3.3 Reading multiple Parquet files](#3.3-Reading-multiple-Parquet-files)
		* [3.3.1 Reading Parquet by list](#3.3.1-Reading-Parquet-by-list)
		* [3.3.2 Reading Parquet by prefix](#3.3.2-Reading-Parquet-by-prefix)
* [4. Fixed-width formatted files (only read)](#4.-Fixed-width-formatted-files-%28only-read%29)
	* [4.1 Reading single FWF file](#4.1-Reading-single-FWF-file)
	* [4.2 Reading multiple FWF files](#4.2-Reading-multiple-FWF-files)
		* [4.2.1 Reading FWF by list](#4.2.1-Reading-FWF-by-list)
		* [4.2.2 Reading FWF by prefix](#4.2.2-Reading-FWF-by-prefix)
* [5. Reading with lastModified filter](#5.-Reading-with-lastModified-filter)
	* [5.1 Define the Date time with UTC Timezone](#5.1-Define-the-Date-time-with-UTC-Timezone)
	* [5.2 Define the Date time and specify the Timezone](#5.2-Define-the-Date-time-and-specify-the-Timezone)
	* [5.3 Read json using the LastModified filters](#5.3-Read-json-using-the-LastModified-filters)
* [6. Delete objects](#6.-Delete-objects)


In [1]:
import awswrangler as wr
import pandas as pd
import boto3
import pytz
from datetime import datetime

df1 = pd.DataFrame({
    "id": [1, 2],
    "name": ["foo", "boo"]
})

df2 = pd.DataFrame({
    "id": [3],
    "name": ["bar"]
})

## Enter your bucket name:

In [2]:
import getpass
bucket = getpass.getpass()

 ············


# 1. CSV files

## 1.1 Writing CSV files

In [3]:
path1 = f"s3://{bucket}/csv/file1.csv"
path2 = f"s3://{bucket}/csv/file2.csv"

wr.s3.to_csv(df1, path1, index=False)
wr.s3.to_csv(df2, path2, index=False);

## 1.2 Reading single CSV file

In [4]:
wr.s3.read_csv([path1])

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |


## 1.3 Reading multiple CSV files

### 1.3.1 Reading CSV by list

In [5]:
wr.s3.read_csv([path1, path2])

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |
| 2 | 3  | bar  |


### 1.3.2 Reading CSV by prefix

In [6]:
wr.s3.read_csv(f"s3://{bucket}/csv/")

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |
| 2 | 3  | bar  |


# 2. JSON files

## 2.1 Writing JSON files

In [7]:
path1 = f"s3://{bucket}/json/file1.json"
path2 = f"s3://{bucket}/json/file2.json"

wr.s3.to_json(df1, path1)
wr.s3.to_json(df2, path2)

## 2.2 Reading single JSON file

In [8]:
wr.s3.read_json([path1])

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |


## 2.3 Reading multiple JSON files

### 2.3.1 Reading JSON by list

In [9]:
wr.s3.read_json([path1, path2])

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |
| 0 | 3  | bar  |


### 2.3.2 Reading JSON by prefix

In [10]:
wr.s3.read_json(f"s3://{bucket}/json/")

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |
| 0 | 3  | bar  |


# 3. Parquet files

For more complex features related to Parquet datasets, see tutorial number 4.

## 3.1 Writing Parquet files

In [11]:
path1 = f"s3://{bucket}/parquet/file1.parquet"
path2 = f"s3://{bucket}/parquet/file2.parquet"

wr.s3.to_parquet(df1, path1)
wr.s3.to_parquet(df2, path2);

## 3.2 Reading single Parquet file

In [12]:
wr.s3.read_parquet([path1])

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |


## 3.3 Reading multiple Parquet files

### 3.3.1 Reading Parquet by list

In [13]:
wr.s3.read_parquet([path1, path2])

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |
| 2 | 3  | bar  |


### 3.3.2 Reading Parquet by prefix

In [14]:
wr.s3.read_parquet(f"s3://{bucket}/parquet/")

|   | id | name |
|---|----|------|
| 0 | 1  | foo  |
| 1 | 2  | boo  |
| 2 | 3  | bar  |


# 4. Fixed-width formatted files (only read)

As of today, pandas doesn't implement a `to_fwf` function, so let's manually write two files:

In [15]:
content = "1  Herfelingen 27-12-18\n"\
          "2    Lambusart 14-06-18\n"\
          "3 Spormaggiore 15-04-18"
boto3.client("s3").put_object(Body=content, Bucket=bucket, Key="fwf/file1.txt")

content = "4    Buizingen 05-09-19\n"\
          "5   San Rafael 04-09-19"
boto3.client("s3").put_object(Body=content, Bucket=bucket, Key="fwf/file2.txt")

path1 = f"s3://{bucket}/fwf/file1.txt"
path2 = f"s3://{bucket}/fwf/file2.txt"
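
Because there is no `to_fwf`, the file bodies above are typed by hand. As a minimal sketch, the same text could be generated from rows with Python string formatting. `to_fwf_text` is a hypothetical helper (not part of pandas or awswrangler), and the 12-character right-aligned name column is an assumption read off the sample content.

```python
def to_fwf_text(rows, name_width=12):
    """Render (id, name, date) rows as fixed-width text, one row per line."""
    # Hypothetical helper: right-align the name so every column starts
    # at the same character offset on each line.
    return "\n".join(
        f"{id_} {name:>{name_width}} {date}" for id_, name, date in rows
    )


rows = [
    (1, "Herfelingen", "27-12-18"),
    (2, "Lambusart", "14-06-18"),
    (3, "Spormaggiore", "15-04-18"),
]
content = to_fwf_text(rows)
print(content)
```

The resulting string can be passed as `Body` to `put_object` in the same way as the hand-written content.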

## 4.1 Reading single FWF file

In [16]:
wr.s3.read_fwf([path1], names=["id", "name", "date"])

|   | id | name         | date     |
|---|----|--------------|----------|
| 0 | 1  | Herfelingen  | 27-12-18 |
| 1 | 2  | Lambusart    | 14-06-18 |
| 2 | 3  | Spormaggiore | 15-04-18 |


## 4.2 Reading multiple FWF files

### 4.2.1 Reading FWF by list

In [17]:
wr.s3.read_fwf([path1, path2], names=["id", "name", "date"])

|   | id | name         | date     |
|---|----|--------------|----------|
| 0 | 1  | Herfelingen  | 27-12-18 |
| 1 | 2  | Lambusart    | 14-06-18 |
| 2 | 3  | Spormaggiore | 15-04-18 |
| 3 | 4  | Buizingen    | 05-09-19 |
| 4 | 5  | San Rafael   | 04-09-19 |


### 4.2.2 Reading FWF by prefix

In [18]:
wr.s3.read_fwf(f"s3://{bucket}/fwf/", names=["id", "name", "date"])

|   | id | name         | date     |
|---|----|--------------|----------|
| 0 | 1  | Herfelingen  | 27-12-18 |
| 1 | 2  | Lambusart    | 14-06-18 |
| 2 | 3  | Spormaggiore | 15-04-18 |
| 3 | 4  | Buizingen    | 05-09-19 |
| 4 | 5  | San Rafael   | 04-09-19 |


# 5. Reading with lastModified filter 

Objects can be filtered by their LastModified date.

The filter must be specified as a datetime with a time zone.

Internally, the objects under the path are listed first, and then the filter is applied.

The filter compares the LastModified timestamp of each S3 object with the arguments `last_modified_begin` and `last_modified_end`.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
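
The list-then-filter behaviour described above can be sketched in plain Python. This is an illustration, not the library's implementation: `filter_by_last_modified` is a hypothetical helper, the listing is made-up data in the `Key`/`LastModified` shape of an S3 listing, and treating both bounds as inclusive is an assumption.

```python
from datetime import datetime, timezone


def filter_by_last_modified(objects, begin=None, end=None):
    """Return the keys whose LastModified falls within [begin, end]."""
    # Assumption: both bounds are optional and inclusive.
    selected = []
    for obj in objects:
        modified = obj["LastModified"]
        if begin is not None and modified < begin:
            continue
        if end is not None and modified > end:
            continue
        selected.append(obj["Key"])
    return selected


# Made-up listing; real listings carry timezone-aware timestamps as well.
listing = [
    {"Key": "json/old.json", "LastModified": datetime(2019, 1, 1, tzinfo=timezone.utc)},
    {"Key": "json/file1.json", "LastModified": datetime(2020, 12, 1, tzinfo=timezone.utc)},
    {"Key": "json/new.json", "LastModified": datetime(2022, 1, 1, tzinfo=timezone.utc)},
]

begin = datetime(2020, 7, 31, 20, 30, tzinfo=timezone.utc)
end = datetime(2021, 7, 31, 20, 30, tzinfo=timezone.utc)
kept = filter_by_last_modified(listing, begin, end)
print(kept)
```

Python raises `TypeError` when a naive datetime is compared with a timezone-aware one, which is why the bounds need a time zone.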

## 5.1 Define the Date time with UTC Timezone

In [19]:
begin = datetime.strptime("20-07-31 20:30", "%y-%m-%d %H:%M")
end = datetime.strptime("21-07-31 20:30", "%y-%m-%d %H:%M")

begin_utc = pytz.utc.localize(begin)
end_utc = pytz.utc.localize(end)

## 5.2 Define the Date time and specify the Timezone

In [20]:
begin = datetime.strptime("20-07-31 20:30", "%y-%m-%d %H:%M")
end = datetime.strptime("21-07-31 20:30", "%y-%m-%d %H:%M")

timezone = pytz.timezone("America/Los_Angeles")

begin_Los_Angeles = timezone.localize(begin)
end_Los_Angeles = timezone.localize(end)
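
The same wall-clock string attached to different time zones describes different instants, so the filter window moves with the chosen zone. A minimal sketch using only the standard library, assuming a fixed UTC-7 offset for Los Angeles on that date (Pacific Daylight Time) in place of the `pytz` zone used above:

```python
from datetime import datetime, timedelta, timezone

naive = datetime.strptime("20-07-31 20:30", "%y-%m-%d %H:%M")

# Same wall-clock time, attached to two different zones.
as_utc = naive.replace(tzinfo=timezone.utc)
# Assumption: a fixed UTC-7 offset stands in for America/Los_Angeles in July.
as_los_angeles = naive.replace(tzinfo=timezone(timedelta(hours=-7)))

# Converted to UTC, the Los Angeles value falls on the next day.
los_angeles_in_utc = as_los_angeles.astimezone(timezone.utc)
print(los_angeles_in_utc.isoformat())  # 2020-08-01T03:30:00+00:00
print(as_los_angeles - as_utc)         # 7:00:00
```

With the Los Angeles bounds, the filter window therefore starts and ends seven hours later than with the UTC bounds.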

## 5.3 Read json using the LastModified filters

In [21]:
wr.s3.read_fwf(f"s3://{bucket}/fwf/", names=["id", "name", "date"], last_modified_begin=begin_utc, last_modified_end=end_utc)
wr.s3.read_json(f"s3://{bucket}/json/", last_modified_begin=begin_utc, last_modified_end=end_utc)
wr.s3.read_csv(f"s3://{bucket}/csv/", last_modified_begin=begin_utc, last_modified_end=end_utc)
wr.s3.read_parquet(f"s3://{bucket}/parquet/", last_modified_begin=begin_utc, last_modified_end=end_utc);

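# 6. Delete objects

Passing the bucket root as the path deletes every object in the bucket, not only the files written above.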

In [22]:
wr.s3.delete_objects(f"s3://{bucket}/")