# Intro to Pachyderm

Pachyderm is an incredibly powerful platform, and can be used for many kinds of data-centered applications. In this notebook, we will introduce you to the basic concepts of data versioning and data pipelines and how they work in Pachyderm. 

## Installation

For this tutorial, we will use the `pachctl` command line interface. This means that any of these commands can be run from your terminal as well. 

If you are running this notebook on [Pachyderm Hub](https://hub.pachyderm.com/), the installation is done for you and everything should work automatically. If you are running in a self-hosted Pachyderm cluster, then you will have to install the Pachyderm client and connect to it before `pachctl` will run. For more information, see the [Getting Started](https://docs.pachyderm.com/latest/getting_started/local_installation/) docs. 

Let's make sure that we're connected to the Pachyderm cluster by checking the version. 

(`pachctl` is the version of the client running locally, `pachd` is the version of the Pachyderm server running in the cluster) 

In [None]:
# Install the AWS CLI (used for the S3 interface at the end of this notebook), then configure it with `aws configure`
!pip install awscli

In [22]:
!pachctl version

COMPONENT           VERSION             
pachctl             2.2.2               
pachd               2.2.2               
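
If you want to automate this check, one option is to parse the table that `pachctl version` prints and confirm that the client and server agree. The sketch below is only an illustration: `parse_versions` is a hypothetical helper, and the sample text is copied from the output above rather than read from a live cluster.

```python
def parse_versions(output: str) -> dict:
    """Parse the COMPONENT/VERSION table printed by `pachctl version`."""
    versions = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) == 2:
            component, version = parts
            versions[component] = version
    return versions


# Sample text copied from the cell output above (not a live call).
sample = """COMPONENT           VERSION
pachctl             2.2.2
pachd               2.2.2
"""

versions = parse_versions(sample)
print(versions)
print("client and server match:", versions["pachctl"] == versions["pachd"])
```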


We can always see the help to understand how a particular `pachctl` command works by adding the `--help` flag. 

In [23]:
!pachctl --help

Access the Pachyderm API.

Environment variables:
  PACH_CONFIG=<path>, the path where pachctl will attempt to load your config.
  JAEGER_ENDPOINT=<host>:<port>, the Jaeger server to connect to, if PACH_TRACE
    is set
  PACH_TRACE={true,false}, if true, and JAEGER_ENDPOINT is set, attach a Jaeger
    trace to any outgoing RPCs.
  PACH_TRACE_DURATION=<duration>, the amount of time for which PPS should trace
    a pipeline after 'pachctl create-pipeline' (PACH_TRACE must also be set).

Usage:
  pachctl [command]

Administration Commands:
  auth         Auth commands manage access to data in a Pachyderm cluster
  enterprise   Enterprise commands enable Pachyderm Enterprise features
  idp          Commands to manage identity provider integrations

Commands by Action:
  copy         Copy a Pachyderm resource.
  create       Create a new instance of a Pachyderm resource.
  delete       Delete an existing Pachyderm resource.
  diff         Show the differences between two Pachyderm resource

In [3]:
!pachctl create repo --help

Create a new repo.

Usage:
  pachctl create repo <repo> [flags]

Aliases:
  repo, repos

Flags:
  -d, --description string   A description of the repo.
  -h, --help                 help for repo

Global Flags:
      --no-color   Turn off colors.
  -v, --verbose    Output verbose logs


## Pachyderm Data Repositories

Pachyderm organizes data into data repositories. This is somewhat similar to git as we'll see, but scales much better for all file types, such as images, machine learning models, csv files, and many others.

Let's first start by creating a data repository. 

### Create a data repo

In [4]:
!pachctl create repo data

In [24]:
!pachctl list repo

NAME    CREATED      SIZE (MASTER) DESCRIPTION                       
count   5 hours ago  ≤ 22B         Output repo for pipeline count.   
data    5 hours ago  ≤ 728B                                          
reduce  22 hours ago ≤ 6.545KiB    Output repo for pipeline reduce.  
map     22 hours ago ≤ 8.583KiB    Output repo for pipeline map.     
scraper 22 hours ago ≤ 333.5KiB    Output repo for pipeline scraper. 
urls    22 hours ago ≤ 119B                                          


A data repository, similar to a git repository, will be what we use to organize and reference data. 

We can also view and explore our data repository in the Pachyderm [Console](https://docs.pachyderm.com/2.0.x-beta/getting_started/beginner_tutorial/#exploring-your-dag-in-pachyderm-console).


When we list our repos, we can see that we have an empty data repository, so let's add some data.

### Add data

First, we'll create a small csv file locally with some of the iris data. 

In [6]:
%%writefile /tmp/iris.csv
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica

Writing /tmp/iris.csv


Data repositories in Pachyderm automatically track versions of the data placed in them. Similar to Git, we organize our data via branches, so we will push our data to the master branch of our data repository.

In [7]:
!pachctl put file data@master -f /tmp/iris.csv



We can look at the data that's been uploaded to our data repository, by listing the files on the master branch.

In [8]:
!pachctl list file data@master

NAME      TYPE SIZE 
/iris.csv file 364B 


### Delete data

Similarly, if we want to delete our file, we can do that as well. 

In [9]:
!pachctl delete file data@master:/iris.csv

In [10]:
!pachctl list file data@master

NAME TYPE SIZE 


Now, if we add it back again...

In [11]:
!pachctl put file data@master -f /tmp/iris.csv



In [12]:
!pachctl list file data@master

NAME      TYPE SIZE 
/iris.csv file 364B 


No surprise, our file is there again. But when we list all of the commits that have been made to our repository, we can see the history of data on the master branch.

### Data commits

In [31]:
!pachctl list commit data

REPO BRANCH COMMIT                           FINISHED           SIZE ORIGIN DESCRIPTION
data master fa7e4423bd1c4a438ed6af61ed4f70d6 About a minute ago 364B USER    
data master 5da64681657f41de8e395f5ed8f24f0d 2 minutes ago      0B   USER    
data master 6bc28f6fdd3149519043a88a8ebeabbc 2 minutes ago      364B USER    


Pachyderm keeps a record of all the changes that happen to the data repository. This way if we ever want to revert to a previous version of our data repository (dataset in this case), we can do it.

For example, if we want to go back in time, we can move the "head" of our master branch to an earlier commit. Here we move it to the second commit, in which the file had been deleted:

**Note:** the commit hashes will be different. Copy and paste the hash(es) above to run it yourself.

In [19]:
!pachctl create branch data@master --head 5da64681657f41de8e395f5ed8f24f0d

In [20]:
!pachctl list branch data

BRANCH HEAD                             TRIGGER 
master 5da64681657f41de8e395f5ed8f24f0d -       


In [21]:
!pachctl list file data@master

NAME TYPE SIZE 


As we can see, the head of our master branch now points to the earlier commit. Because the file had been deleted in that commit, the file listing is empty.

Let's go back to our most recent commit. 

In [26]:
!pachctl create branch data@master --head fa7e4423bd1c4a438ed6af61ed4f70d6

We can also use [Ancestry Syntax](https://docs.pachyderm.com/latest/concepts/data-concepts/history/#ancestry-syntax) to traverse and explore commits: `^` refers to the parent of a commit, and `.n` references commits in numerical order, where `n` is the commit number.

In [42]:
!pachctl list commit data@master^

REPO BRANCH COMMIT                           FINISHED      SIZE ORIGIN DESCRIPTION
data master 5da64681657f41de8e395f5ed8f24f0d 2 minutes ago 0B   USER    
data master 6bc28f6fdd3149519043a88a8ebeabbc 2 minutes ago 364B USER    


In [43]:
!pachctl list commit data@master.1

REPO BRANCH COMMIT                           FINISHED      SIZE ORIGIN DESCRIPTION
data master 6bc28f6fdd3149519043a88a8ebeabbc 2 minutes ago 364B USER    


In [44]:
!pachctl list commit data@master.-1

REPO BRANCH COMMIT                           FINISHED      SIZE ORIGIN DESCRIPTION
data master 5da64681657f41de8e395f5ed8f24f0d 2 minutes ago 0B   USER    
data master 6bc28f6fdd3149519043a88a8ebeabbc 2 minutes ago 364B USER    


In [45]:
!pachctl list branch data

BRANCH HEAD                             TRIGGER 
master fa7e4423bd1c4a438ed6af61ed4f70d6 -       
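
To make the ancestry references above concrete, here is a toy resolver in Python. It is a sketch that only reproduces the behaviour visible in the outputs above (a single `^`, `.n`, and `.-n`); it is not how Pachyderm resolves references internally, and `resolve` is a hypothetical name. Note that `list commit` prints the resolved commit together with its ancestors, which is why `data@master^` shows two rows.

```python
# Commits on data@master from oldest to newest, copied from the output above.
HISTORY = [
    "6bc28f6fdd3149519043a88a8ebeabbc",  # first `put file`
    "5da64681657f41de8e395f5ed8f24f0d",  # `delete file`
    "fa7e4423bd1c4a438ed6af61ed4f70d6",  # second `put file` (branch head)
]


def resolve(ref: str, history: list) -> str:
    """Resolve 'master', 'master^', 'master.N' or 'master.-N' to a commit ID."""
    if ref.endswith("^"):
        # A single caret: the parent of the head commit.
        return history[-2]
    if "." in ref:
        n = int(ref.split(".", 1)[1])
        if n > 0:
            return history[n - 1]   # .1 is the first commit on the branch
        return history[-1 + n]      # .-1 is one commit behind the head
    return history[-1]              # a bare branch name is the head


for ref in ("master", "master^", "master.1", "master.-1"):
    print(ref, "->", resolve(ref, HISTORY))
```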


### Awesome Pachyderm Feature - Efficient Storage! 

If we list our repo info again, we can see that the *entire size* of the repo is just as big as the original file, even though we added it a second time! Pachyderm recognizes when the content of a file duplicates something it has seen before, which minimizes the amount of storage needed.

This can make storing and versioning data in Pachyderm considerably cheaper than keeping a full copy of every version.

Note: Deduplication happens per chunk (e.g. 8 MB per chunk), not per file. For more information on why this is better, see [this blog](https://medium.com/@jdoliner/debunking-the-fud-about-data-version-control-implementations-55cbe72014fb) on content-based chunking.

(This feature was introduced in Pachyderm 2.0)
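
The idea behind deduplication can be sketched with a toy content-addressed store in Python. This is a simplified illustration, not Pachyderm's implementation: it uses small fixed-size chunks (the linked post describes content-based chunking), and the `ChunkStore` class and its 64-byte chunk size are assumptions made for the example.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size: int = 64):
        self.chunk_size = chunk_size
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data and return the chunk hashes that describe it."""
        refs = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            refs.append(digest)
        return refs

    def stored_bytes(self) -> int:
        """Total number of bytes actually held in the store."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore()
data = b"5.1,3.5,1.4,0.2,Iris-setosa\n" * 4

first = store.put(data)
size_after_first = store.stored_bytes()
second = store.put(data)  # the same content, added a second time

print(size_after_first, store.stored_bytes())
```

Both printed numbers are the same: the second `put` returns the same chunk hashes and adds no new bytes to the store.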

In [46]:
!pachctl list repo

NAME    CREATED       SIZE (MASTER) DESCRIPTION                       
data    3 minutes ago ≤ 364B                                          
reduce  17 hours ago  ≤ 6.545KiB    Output repo for pipeline reduce.  
map     17 hours ago  ≤ 8.583KiB    Output repo for pipeline map.     
scraper 17 hours ago  ≤ 333.5KiB    Output repo for pipeline scraper. 
urls    17 hours ago  ≤ 119B                                          


## Pachyderm Pipelines

Managing and versioning data by itself is only half the story. Once you have data, you typically want to do something with it, whether that's transforming it, running tests on it, or even training a model. 

**A Pachyderm Pipeline is how you apply code to your data.**

Pipelines work seamlessly with data inside your data repositories, but even better, these pipelines can be triggered by your data! 

This means that we can deploy a pipeline to transform the data from our `data` repo, and anytime we modify our data, the pipeline will automatically re-run. 

Initially, this can be a hard concept to grasp, so let's walk through an example.

### Count Pipeline

Let's say we just want to count the number of lines in our csv file. We can create a Pachyderm Pipeline that looks like the `yaml` below that uses a shell command to count the number of lines (we'll see why we use shell later on).

In [32]:
%%writefile /tmp/count.yaml
pipeline:
    name: 'count'
description: 'Count the number of lines in a csv file'
input:
    pfs:
        repo: 'data'
        branch: 'master'
        glob: '/'
transform:
    image: alpine:3.14.0
    cmd: ['/bin/sh']
    stdin: ['wc -l /pfs/data/iris.csv > /pfs/out/line_count.txt']

Overwriting /tmp/count.yaml


### Pipelines in detail

Let's break this pipeline down section by section and explain it: 

Every pipeline must have a unique name. In our case, we will call this one `count`. It's also good practice to give our pipeline a description to help others know what it does. 

When the pipeline runs, it will also **create a data repository** `count` for any files created when the pipeline runs. 
```yaml
pipeline:
  name: count
description: Count the number of lines in a csv file
```

The `input` section defines what Pachyderm Data Repositories (or other type of input) will be connected to the pipeline. In our case, the `master` branch of our `data` repo will be used. 

When the pipeline runs, it will map the files from the `master` branch of our `data` repo, into the file system at `/pfs/data/` (`/pfs/` stands for Pachyderm File System). 

We'll talk more about glob patterns in another tutorial, but in this example, `/` means that every file on the head commit of the master branch is accessible to the pipeline. 

```yaml
input:
  pfs:
    repo: data
    branch: master
    glob: /
```

The `transform` portion of the pipeline defines what code should be run when the pipeline executes. Pachyderm Pipelines use Docker containers to allow code written in any language to be executed as a pipeline. In this case, we are using `alpine:3.14.0` as our Docker image. When this pipeline runs, it executes the `cmd` along with the `stdin` inside our container. 

Our `stdin` command will count the number of lines in `/pfs/data/iris.csv` and write the output to `/pfs/out/line_count.txt`. `/pfs/out` is a special location in Pachyderm pipelines. Anything written to this directory will be *committed* to the `count` data repository (automatically created) as the output of the pipeline.

```yaml
transform:
  image: alpine:3.14.0
  cmd: ['/bin/sh']
  stdin: ['wc -l /pfs/data/iris.csv > /pfs/out/line_count.txt']
```
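
Pipeline specs can be written as JSON as well as YAML, so the three sections above can also be assembled programmatically. The following Python sketch builds the same spec as a dictionary and prints it as JSON. The check for the three sections is illustrative only and is not a substitute for the validation Pachyderm performs when the pipeline is created.

```python
import json

# The same pipeline spec as above, expressed as a Python dictionary.
spec = {
    "pipeline": {"name": "count"},
    "description": "Count the number of lines in a csv file",
    "input": {"pfs": {"repo": "data", "branch": "master", "glob": "/"}},
    "transform": {
        "image": "alpine:3.14.0",
        "cmd": ["/bin/sh"],
        "stdin": ["wc -l /pfs/data/iris.csv > /pfs/out/line_count.txt"],
    },
}

# Illustrative sanity check for the three sections discussed above.
missing = [key for key in ("pipeline", "input", "transform") if key not in spec]
if missing:
    raise ValueError(f"spec is missing sections: {missing}")

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```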

For more on the pipeline spec, see the [Pipeline Specification](https://docs.pachyderm.com/latest/reference/pipeline-spec) reference.

### Creating pipelines

We can submit our pipeline to Pachyderm by using the `create pipeline` command.

We can also view our pipelines in the Pachyderm Console. Notice that Pachyderm automatically creates an output data repository with the same name as the pipeline. 


In [33]:
!pachctl create pipeline -f /tmp/count.yaml

### Monitor pipelines

If we list our pipelines, we can see the status of them. 

In [49]:
!pachctl list pipeline

NAME    VERSION INPUT        CREATED       STATE / LAST JOB  DESCRIPTION                                                                                 
count   1       data:/       2 seconds ago running / -       Count the number of lines in a csv file
reduce  1       map:/        17 hours ago  running / success A pipeline that aggregates the total counts for each word.
map     1       scraper:/*/* 17 hours ago  running / success A pipeline that tokenizes scraped pages and appends counts of words to corresponding files.
scraper 1       urls:/*      17 hours ago  running / success A pipeline that pulls content from a specified Internet source.


It looks like our pipeline is `running` and the last job succeeded. Let's take a look at the job.

A job is an execution of our pipeline. We can see our job status by running: 

In [50]:
!pachctl list job

ID                               SUBJOBS PROGRESS CREATED        MODIFIED
e35d00004c5b4288b6580c5c0519cc80 1       ▇▇▇▇▇▇▇▇ 10 seconds ago 10 seconds ago
ebee0fc1176c4e01a8093559cb893a5c 3       ▇▇▇▇▇▇▇▇ 25 minutes ago 25 minutes ago
f3b7d09fb53f49acb727ce0010027b9f 3       ▇▇▇▇▇▇▇▇ 17 hours ago   17 hours ago
0a1ef75590de4b56ae92470d7e2281ab 1       ▇▇▇▇▇▇▇▇ 17 hours ago   17 hours ago
47d4cc3c190648a2a31ce435c8e2f3d7 1       ▇▇▇▇▇▇▇▇ 17 hours ago   17 hours ago
2d03390e3d14482c925c595a82f94ba7 1       ▇▇▇▇▇▇▇▇ 17 hours ago   17 hours ago


We can also see that we have a new data repository called `count` that holds the output of our pipeline. 

### View pipeline output commits

In [51]:
!pachctl list repo

NAME    CREATED        SIZE (MASTER) DESCRIPTION                       
count   16 seconds ago ≤ 22B         Output repo for pipeline count.   
data    4 minutes ago  ≤ 364B                                          
reduce  17 hours ago   ≤ 6.545KiB    Output repo for pipeline reduce.  
map     17 hours ago   ≤ 8.583KiB    Output repo for pipeline map.     
scraper 17 hours ago   ≤ 333.5KiB    Output repo for pipeline scraper. 
urls    17 hours ago   ≤ 119B                                          


In [52]:
!pachctl list file count@master

NAME            TYPE SIZE 
/line_count.txt file 22B  


Let's download the file created by our `count` pipeline and see what's in it. 

In [53]:
!pachctl get file count@master:/line_count.txt -o /tmp/line_count.txt



We can see that our output file correctly counted the number of lines in our csv file. 

In [54]:
# Output file
!cat /tmp/line_count.txt

12 /pfs/data/iris.csv


In [55]:
# Original file
!wc -l /tmp/iris.csv

12 /tmp/iris.csv


### Data-Driven Pipelines
If we recall, all of our pipelines in Pachyderm are data-driven. They are always ready to run whenever the data contained in an input repository changes. So let's do that. Let's update our iris data (this time with 24 lines). 

In [56]:
%%writefile /tmp/iris_v2.csv
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica

Writing /tmp/iris_v2.csv


We'll overwrite our original file with the command: 

In [57]:
!pachctl put file data@master:iris.csv -f /tmp/iris_v2.csv



In [58]:
!pachctl list file data@master

NAME      TYPE SIZE 
/iris.csv file 728B 


In [59]:
!pachctl list commit data@master

REPO BRANCH COMMIT                           FINISHED      SIZE ORIGIN DESCRIPTION
data master e168bea3fdbf49d2849354c2dc833dd9 8 seconds ago 728B USER    
data master fa7e4423bd1c4a438ed6af61ed4f70d6 5 minutes ago 364B USER    
data master 5da64681657f41de8e395f5ed8f24f0d 5 minutes ago 0B   USER    
data master 6bc28f6fdd3149519043a88a8ebeabbc 5 minutes ago 364B USER    


We have a new commit to our `data` repository, so let's see what's happened to our pipeline. 

In [60]:
!pachctl list job

ID                               SUBJOBS PROGRESS CREATED            MODIFIED
e168bea3fdbf49d2849354c2dc833dd9 1       ▇▇▇▇▇▇▇▇ 11 seconds ago     11 seconds ago
e35d00004c5b4288b6580c5c0519cc80 1       ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago
ebee0fc1176c4e01a8093559cb893a5c 3       ▇▇▇▇▇▇▇▇ 27 minutes ago     27 minutes ago
f3b7d09fb53f49acb727ce0010027b9f 3       ▇▇▇▇▇▇▇▇ 17 hours ago       17 hours ago
0a1ef75590de4b56ae92470d7e2281ab 1       ▇▇▇▇▇▇▇▇ 17 hours ago       17 hours ago
47d4cc3c190648a2a31ce435c8e2f3d7 1       ▇▇▇▇▇▇▇▇

We have a new job that has just run. But remember, we only uploaded a file to our input repo. Pachyderm intelligently tells pipelines to run when their input data changes. If we look at the output of our `count` repository, we now see 2 commits. 

In [61]:
!pachctl list commit count@master

REPO  BRANCH COMMIT                           FINISHED           SIZE ORIGIN DESCRIPTION
count master e168bea3fdbf49d2849354c2dc833dd9 15 seconds ago     22B  AUTO    
count master e35d00004c5b4288b6580c5c0519cc80 About a minute ago 22B  AUTO    
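
The trigger behaviour can be illustrated with a small toy model in Python. This is a sketch of the idea only, not Pachyderm's implementation: `ToyRepo`, `ToyPipeline`, and `count_lines` are hypothetical names, commit IDs are random, and the file contents are placeholder rows. It mirrors one detail visible in the tables above: the output commit carries the same ID as the input commit that triggered it.

```python
import uuid


class ToyPipeline:
    """Toy model: every commit to the input repo triggers one job."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.output_commits = []  # list of (commit_id, output files) pairs

    def on_input_commit(self, commit_id, files):
        # The output commit reuses the ID of the commit that triggered it.
        self.output_commits.append((commit_id, self.transform(files)))


class ToyRepo:
    """Toy model of an input repo that notifies subscribed pipelines."""

    def __init__(self, name):
        self.name = name
        self.commits = []      # list of (commit_id, files) pairs
        self.subscribers = []  # pipelines that read from this repo

    def put_file(self, path, content):
        files = dict(self.commits[-1][1]) if self.commits else {}
        files[path] = content
        commit_id = uuid.uuid4().hex
        self.commits.append((commit_id, files))
        for pipeline in self.subscribers:
            pipeline.on_input_commit(commit_id, files)
        return commit_id


def count_lines(files):
    """Stand-in for the `wc -l` transform of the count pipeline."""
    n = files["/iris.csv"].count("\n")
    return {"/line_count.txt": f"{n} /pfs/data/iris.csv\n"}


data = ToyRepo("data")
count = ToyPipeline("count", count_lines)
data.subscribers.append(count)

first = data.put_file("/iris.csv", "row\n" * 12)   # placeholder rows
second = data.put_file("/iris.csv", "row\n" * 24)  # overwrite with 24 rows

for commit_id, output in count.output_commits:
    print(commit_id, output["/line_count.txt"].strip())
```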


In [62]:
!pachctl get file count@master:/line_count.txt -o /tmp/line_count_v2.txt



In [63]:
!cat /tmp/line_count_v2.txt

24 /pfs/data/iris.csv
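
Both `count` commits are listed as 22B even though the line count changed. A short Python sketch shows why: the output is the count, a space, the path, and a newline, and both `12` and `24` are two characters long. The `wc_l` helper is a hypothetical stand-in for `wc -l`, and the rows are placeholders since only the number of lines matters here.

```python
def wc_l(path: str, content: str) -> str:
    """Mimic `wc -l <path>`: count newline characters, print count and path."""
    return f"{content.count(chr(10))} {path}\n"


row = "5.1,3.5,1.4,0.2,Iris-setosa\n"  # placeholder row
v1 = wc_l("/pfs/data/iris.csv", row * 12)
v2 = wc_l("/pfs/data/iris.csv", row * 24)

print(repr(v1), len(v1.encode()))
print(repr(v2), len(v2.encode()))
```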


### Awesome Pachyderm Feature - Data Lineage!

The data-driven nature of Pachyderm Pipelines allows you to reliably maintain data and process lineage at scale. By combining versioned data with code in Docker containers, Pachyderm can be used to automate, debug, and maintain any data + code workflow. 

For example, if we want to know the lineage of our most recent `line_count.txt`, we can run: 

In [64]:
!pachctl list commit count@master

REPO  BRANCH COMMIT                           FINISHED           SIZE ORIGIN DESCRIPTION
count master e168bea3fdbf49d2849354c2dc833dd9 31 seconds ago     22B  AUTO    
count master e35d00004c5b4288b6580c5c0519cc80 About a minute ago 22B  AUTO    


This gives us the unique commit for that run of the `count` pipeline. We can use this commit to see the unique combination of inputs and pipelines that resulted in this file. 

In [68]:
!pachctl list commit e168bea3fdbf49d2849354c2dc833dd9

REPO       BRANCH COMMIT                           FINISHED       SIZE     ORIGIN DESCRIPTION
count.spec master e168bea3fdbf49d2849354c2dc833dd9 47 seconds ago 0B       ALIAS   
data       master e168bea3fdbf49d2849354c2dc833dd9 46 seconds ago 728B     USER    
count.meta master e168bea3fdbf49d2849354c2dc833dd9 45 seconds ago 1.438KiB AUTO    
count      master e168bea3fdbf49d2849354c2dc833dd9 45 seconds ago 22B      AUTO    


We will gloss over some details here, but the important thing is, we can see the commit to the `data` repo was initiated by a `USER`. We can see exactly what commit triggered the pipeline. 

In [9]:
!pachctl list file data@e168bea3fdbf49d2849354c2dc833dd9

NAME      TYPE SIZE 
/iris.csv file 728B 



If we inspect the job associated with this commit, then we can get all the information about what pipeline was run on the data from this commit.

In [10]:
!pachctl inspect job count@e168bea3fdbf49d2849354c2dc833dd9

ID: e168bea3fdbf49d2849354c2dc833dd9
Pipeline: count
Started: 54 minutes ago 
Duration: 1 second 
State: success
Reason: 
Processed: 1
Failed: 0
Skipped: 0
Recovered: 0
Total: 1
Data Downloaded: 728B
Data Uploaded: 22B
Download Time: Less than a second
Process Time: Less than a second
Upload Time: Less than a second
Datum Timeout: (duration: nil Duration)
Job Timeout: (duration: nil Duration)
Worker Status:
WORKER              JOB                 DATUM               STARTED             
Restarts: 0
ParallelismSpec: <nil>



Input:
{
  "pfs": {
    "name": "data",
    "repo": "data",
    "repo_type": "user",
    "branch": "master",
    "commit": "e168bea3fdbf49d2849354c2dc833dd9",
    "glob": "/"
  }
}

Transform:
{
  "image": "alpine:3.14.0",
  "cmd": [
    "/bin/sh"
  ],
  "stdin": [
    "wc -l /pfs/data/iris.csv > /pfs/out/line_count.txt"
  ]
} 
Output Commit: e168bea3fdbf49d2849354c2dc833dd9 


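## Updating the pipeline

If we change the pipeline spec, we can apply the new version with the `update pipeline` command.

In [34]:
!pachctl update pipeline -f /tmp/count.yaml

For more information, see the [Updating Pipelines](https://docs.pachyderm.com/latest/how-tos/pipeline-operations/updating-pipelines/) docs.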

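## S3 interface

Pachyderm repos can also be read through an S3-compatible endpoint. In the commands below, buckets are addressed as `<branch>.<repo>` or `<commit>.<branch>.<repo>`.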

In [16]:
!aws --endpoint-url http://localhost:30600 s3 ls s3://master.data/

2022-06-14 13:30:39        728 iris.csv


In [38]:
!aws --endpoint-url http://localhost:30600 s3 ls s3://6bc28f6fdd3149519043a88a8ebeabbc.master.data/

2022-06-14 13:25:11        364 iris.csv


In [20]:
!aws --endpoint-url http://localhost:30600 s3 cp s3://6bc28f6fdd3149519043a88a8ebeabbc.master.data/iris.csv /tmp/iris.csv

download: s3://6bc28f6fdd3149519043a88a8ebeabbc.master.data/iris.csv to ../../../../tmp/iris.csv


In [21]:
!ls -lah /tmp/iris.csv

-rw-rw-r-- 1 ubuntu ubuntu 364 Jun 14 13:25 /tmp/iris.csv
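
The bucket names used in the `aws s3` commands above follow a simple pattern: the repo name comes last, preceded by the branch and, optionally, a commit ID. A small Python sketch of that pattern, based only on the commands shown in this section (`bucket_name` and `s3_url` are hypothetical helpers, not part of any Pachyderm client):

```python
def bucket_name(repo, branch, commit=None):
    """Build a bucket name of the form [<commit>.]<branch>.<repo>."""
    parts = [branch, repo]
    if commit is not None:
        parts.insert(0, commit)
    return ".".join(parts)


def s3_url(repo, branch, path="", commit=None):
    """Build the s3:// URL for a path inside a repo, branch, and optional commit."""
    return f"s3://{bucket_name(repo, branch, commit)}/{path.lstrip('/')}"


print(s3_url("data", "master"))
print(s3_url("data", "master", "/iris.csv",
             commit="6bc28f6fdd3149519043a88a8ebeabbc"))
```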
