# Running Pandas on Bacalhau



## Introduction

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

## TL;DR
Running a Pandas script on Bacalhau

## Prerequisite

To get started, you need to install the Bacalhau client. See the [installation guide](https://docs.bacalhau.org/getting-started/installation) for more information.

In [6]:
!command -v bacalhau >/dev/null 2>&1 || (export BACALHAU_INSTALL_DIR=.; curl -sL https://get.bacalhau.org/install.sh | bash)
path=!echo $PATH
%env PATH=./:{path[0]}

Your system is linux_amd64

BACALHAU CLI is detected:
Client Version: v0.3.29
Server Version: v0.3.29
Reinstalling BACALHAU CLI - ./bacalhau...
Getting the latest BACALHAU CLI...
Installing v0.3.29 BACALHAU CLI...
Downloading https://github.com/bacalhau-project/bacalhau/releases/download/v0.3.29/bacalhau_v0.3.29_linux_amd64.tar.gz ...
Downloading sig file https://github.com/bacalhau-project/bacalhau/releases/download/v0.3.29/bacalhau_v0.3.29_linux_amd64.tar.gz.signature.sha256 ...
Verified OK
Extracting tarball ...
NOT verifying Bin
bacalhau installed into . successfully.
Client Version: v0.3.29
Server Version: v0.3.29
env: PATH=./:/home/gitpod/.pyenv/versions/3.11.1/bin:/home/gitpod/.pyenv/libexec:/home/gitpod/.pyenv/plugins/python-build/bin:/home/gitpod/.pyenv/shims:/ide/bin/remote-cli:/home/gitpod/.nix-profile/bin:/home/gitpod/.local/bin:/home/gitpod/.sdkman/candidates/maven/current/bin:/home/gitpod/.sdkman/candidates/java/current/bin:/home/gitpod/.sdkman/candidates/gradle/current/b


## Running Pandas Locally

To run a Pandas script on Bacalhau for analysis, we first place the script in a container and then run it at scale on Bacalhau. Before doing that, we run the script locally, which requires installing the Pandas library from pip.

In [7]:
%%bash
pip install pandas

Collecting pandas
  Using cached pandas-2.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting numpy>=1.21.0 (from pandas)
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-1.24.3 pandas-2.0.1 pytz-2023.3 tzdata-2023.3


### Importing data from CSV to DataFrame

Pandas is built around the DataFrame, a two-dimensional container for tabular data. Below you will create a DataFrame by importing a CSV file. A CSV file is a text file with one record per line, in which the values within each record are separated by commas. Pandas provides the `read_csv()` function to read the contents of a CSV file into a DataFrame. For example, we can use a file named `transactions.csv` containing details of transactions. The CSV file is stored in the same directory as the Python script.
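As a minimal sketch of what `read_csv()` does, the snippet below parses a tiny CSV held in memory instead of a file on disk. The two records and their hash values are hypothetical; only the column names are borrowed from `transactions.csv`.

```python
import io

import pandas as pd

# Hypothetical two-record CSV; the column names are a subset of those in transactions.csv.
csv_text = (
    "hash,nonce,block_number\n"
    "0xaaa,12,483920\n"
    "0xbbb,84,483920\n"
)

# read_csv accepts a path or a file-like object, so StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the header line becomes the column labels
```

The header line supplies the column labels, and each following line becomes one row of the DataFrame.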


In [9]:
%%writefile read_csv.py
import pandas as pd

print(pd.read_csv("transactions.csv"))

Overwriting read_csv.py


In [10]:
%%bash
# Downloading the dataset
wget https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv

--2023-05-03 11:46:15--  https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/transactions.csv
Resolving cloudflare-ipfs.com (cloudflare-ipfs.com)... 104.17.64.14, 104.17.96.13, 2606:4700::6811:400e, ...
Connecting to cloudflare-ipfs.com (cloudflare-ipfs.com)|104.17.64.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1567 (1.5K) [text/csv]
Saving to: ‘transactions.csv.2’

     0K .                                                     100% 21.3M=0s

2023-05-03 11:46:16 (21.3 MB/s) - ‘transactions.csv.2’ saved [1567/1567]



In [11]:
%%bash
cat transactions.csv

hash,nonce,block_hash,block_number,transaction_index,from_address,to_address,value,gas,gas_price,input,block_timestamp,max_fee_per_gas,max_priority_fee_per_gas,transaction_type
0x04cbcb236043d8fb7839e07bbc7f5eed692fb2ca55d897f1101eac3e3ad4fab8,12,0x246edb4b351d93c27926f4649bcf6c24366e2a7c7c718dc9158eea20c03bc6ae,483920,0,0x1b63142628311395ceafeea5667e7c9026c862ca,0xf4eced2f682ce333f96f2d8966c613ded8fc95dd,0,150853,50000000000,0xa9059cbb000000000000000000000000ac4df82fe37ea2187bc8c011a23d743b4f39019a00000000000000000000000000000000000000000000000000000000000186a0,1446561880,,,0
0xcea6f89720cc1d2f46cc7a935463ae0b99dd5fad9c91bb7357de5421511cee49,84,0x246edb4b351d93c27926f4649bcf6c24366e2a7c7c718dc9158eea20c03bc6ae,483920,1,0x9b22a80d5c7b3374a05b446081f97d0a34079e7f,0xf4eced2f682ce333f96f2d8966c613ded8fc95dd,0,150853,50000000000,0xa9059cbb00000000000000000000000066f183060253cfbe45beff1e6e7ebbe318c81e560000000000000000000000000000000000000000000000000000000000030d40,1446561880,,,0
0x463d53f

### Running the script

Now let's run the script to read in the CSV file. The script prints the resulting DataFrame.

In [12]:
%%bash
python3 read_csv.py

                                                hash  ...  transaction_type
0  0x04cbcb236043d8fb7839e07bbc7f5eed692fb2ca55d8...  ...                 0
1  0xcea6f89720cc1d2f46cc7a935463ae0b99dd5fad9c91...  ...                 0
2  0x463d53f0ad57677a3b430a007c1c31d15d62c37fab5e...  ...                 0
3  0x05287a561f218418892ab053adfb3d919860988b1945...  ...                 0

[4 rows x 15 columns]
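The printed DataFrame abbreviates the middle columns with `...`. One way to inspect a wide frame more fully is sketched below. The small DataFrame is a hypothetical stand-in for the one returned by `pd.read_csv("transactions.csv")`, with only three of its columns.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_csv("transactions.csv").
df = pd.DataFrame(
    {
        "hash": ["0xaaa", "0xbbb"],
        "block_number": [483920, 483920],
        "gas": [150853, 150853],
    }
)

n_rows, n_cols = df.shape
summary = f"{n_rows} rows x {n_cols} columns"
column_names = df.columns.tolist()

print(summary)
print(column_names)
# to_string() renders every column instead of abbreviating wide frames with "...".
print(df.to_string())
```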


## Ingesting data

To run pandas on Bacalhau you must store your assets in a location that Bacalhau has access to. We usually default to storing data on IPFS and code in a container, but you can also easily upload your script to IPFS too.

If you are interested in finding out more about how to ingest your data into IPFS, please see the [data ingestion guide](https://docs.bacalhau.org/examples/data-ingestion/).

We've already uploaded the script and data to IPFS under the following CID: `QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz`. You can inspect the contents by browsing to one of the HTTP IPFS gateways, such as [cloudflare-ipfs.com](https://cloudflare-ipfs.com/ipfs/QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz/) or [w3s.link](https://bafybeih4hyydvojazlyv5zseelgn5u67iq2wbrbk2q4xoiw2d3cacdmzlu.ipfs.w3s.link/).
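The gateway URLs used in this example follow a simple path pattern: gateway host, then `/ipfs/`, then the CID, then the file name. The sketch below builds such a URL. The helper name `gateway_url` is hypothetical, and whether a given public gateway is reachable is an assumption that may not hold at any given time.

```python
def gateway_url(cid: str, filename: str, gateway: str = "https://cloudflare-ipfs.com") -> str:
    """Build a path-style HTTP gateway URL for a file inside an IPFS directory CID."""
    return f"{gateway}/ipfs/{cid}/{filename}"


CID = "QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz"
url = gateway_url(CID, "transactions.csv")
print(url)
```

The resulting URL is the same one passed to `wget` earlier in this example.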

## Running a Bacalhau Job

Now we're ready to run a Bacalhau job that mounts the Pandas script and data from IPFS. We'll use the `bacalhau docker run` command to do this. The `-i` flag mounts a file or directory from IPFS into the container. Its value has the form `ipfs://<CID>:<path>`, where the first part is the IPFS CID and the second part is the mount path inside the container. The `-i` flag can be used multiple times to mount multiple inputs. To submit the job, run the following command:

In [13]:
%%bash --out job_id
bacalhau docker run \
    --wait \
    --id-only \
    -i ipfs://QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files \
    -w /files \
    amancevice/pandas \
    -- python read_csv.py

### Structure of the command

- `bacalhau docker run`: call to Bacalhau

- `amancevice/pandas`: the Docker image to run, which has Pandas installed

- `-i ipfs://QmfKJT13h5k1b23ja3Z .....`: mounts the uploaded script and dataset from IPFS at the path `/files`

- `-w /files`: sets the working directory to `/files`, so that `read_csv.py` and `transactions.csv` can be referenced by relative path

- `python read_csv.py`: runs the Python script that reads the CSV file with Pandas
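The pieces of the command can also be assembled programmatically, for instance when submitting the same job for several CIDs. The following is a sketch only: `build_docker_run` is a hypothetical helper, it uses only the flags shown in the command above, and it prints the command line without executing it.

```python
import shlex


def build_docker_run(cid: str, mount_path: str, image: str, command: list[str]) -> list[str]:
    """Assemble the argv for the `bacalhau docker run` invocation used in this example."""
    return [
        "bacalhau", "docker", "run",
        "--wait",
        "--id-only",
        "-i", f"ipfs://{cid}:{mount_path}",  # IPFS CID and mount path in the container
        "-w", mount_path,                     # working directory inside the container
        image,
        "--", *command,                       # everything after "--" runs in the container
    ]


argv = build_docker_run(
    "QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz",
    "/files",
    "amancevice/pandas",
    ["python", "read_csv.py"],
)
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` would avoid shell quoting issues, at the cost of requiring the `bacalhau` binary on `PATH`.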

When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on.

In [14]:
%env JOB_ID={job_id}

env: JOB_ID=61e542a7-bea1-4382-b3c9-40050d143ad6
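Captured command output often carries a trailing newline, so it can be worth normalizing the job ID before reusing it. The sketch below assumes that job IDs are UUID-formatted, as the one printed above is. `normalize_job_id` is a hypothetical helper, not part of Bacalhau.

```python
import uuid


def normalize_job_id(raw: str) -> str:
    """Strip surrounding whitespace and confirm the value parses as a UUID."""
    candidate = raw.strip()
    uuid.UUID(candidate)  # raises ValueError if the text cannot be parsed as a UUID
    return candidate


job_id = normalize_job_id("61e542a7-bea1-4382-b3c9-40050d143ad6\n")
# The first hyphen-separated group is the short form of the ID.
short_id = job_id.split("-")[0]
print(job_id, short_id)
```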


## Checking the State of your Jobs

- **Job status**: You can check the status of the job using `bacalhau list`. 

In [15]:
%%bash
bacalhau list --id-filter ${JOB_ID}

 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 11:46:26  61e542a7  Docker amancevice/pa...  Completed            ipfs://QmY2MEETWyX77...


When it says `Completed`, that means the job is done, and we can get the results.

- **Job information**: You can find out more information about your job by using `bacalhau describe`.

In [16]:
%%bash
bacalhau describe ${JOB_ID}

Job:
  APIVersion: V1beta1
  Metadata:
    ClientID: 07bde6e8241b19d58c1c5ff3e8ec17e1e80ac6424cd029bd1317a60f1705b583
    CreatedAt: "2023-05-03T11:46:26.767484787Z"
    ID: 61e542a7-bea1-4382-b3c9-40050d143ad6
    Requester:
      RequesterNodeID: QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL
      RequesterPublicKey: CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDVRKPgCfY2fgfrkHkFjeWcqno+MDpmp8DgVaY672BqJl/dZFNU9lBg2P8Znh8OTtHPPBUBk566vU3KchjW7m3uK4OudXrYEfSfEPnCGmL6GuLiZjLf+eXGEez7qPaoYqo06gD8ROdD8VVse27E96LlrpD1xKshHhqQTxKoq1y6Rx4DpbkSt966BumovWJ70w+Nt9ZkPPydRCxVnyWS1khECFQxp5Ep3NbbKtxHNX5HeULzXN5q0EQO39UN6iBhiI34eZkH7PoAm3Vk5xns//FjTAvQw6wZUu8LwvZTaihs+upx2zZysq6CEBKoeNZqed9+Tf+qHow0P5pxmiu+or+DAgMBAAE=
  Spec:
    Deal:
      Concurrency: 1
    Docker:
      Entrypoint:
      - python
      - read_csv.py
      Image: amancevice/pandas
      WorkingDirectory: /files
    Engine: Docker
    Language:
      JobContext: {}
    Network:
      Type: None
    Publisher: Estuary
    

- **Job download**: You can download your job results directly by using `bacalhau get`. In the command below, we create a `results` directory and download the job output into it.

In [17]:
%%bash
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID}  --output-dir results

Fetching results of job '61e542a7-bea1-4382-b3c9-40050d143ad6'...

Computing default go-libp2p Resource Manager limits based on:
    - 'Swarm.ResourceMgr.MaxMemory': "34 GB"
    - 'Swarm.ResourceMgr.MaxFileDescriptors': 524288

Applying any user-supplied overrides on top.
Run 'ipfs swarm limit all' to see the resulting limits.

Results for job '61e542a7-bea1-4382-b3c9-40050d143ad6' have been written to...
results


## Viewing your Job Output

Each job creates 3 subfolders: the **combined_results**, **per_shard files**, and the **raw** directory. To view the job's standard output, run the following command:

In [18]:
%%bash
cat results/stdout # displays the contents of the file

                                                hash  ...  transaction_type
0  0x04cbcb236043d8fb7839e07bbc7f5eed692fb2ca55d8...  ...                 0
1  0xcea6f89720cc1d2f46cc7a935463ae0b99dd5fad9c91...  ...                 0
2  0x463d53f0ad57677a3b430a007c1c31d15d62c37fab5e...  ...                 0
3  0x05287a561f218418892ab053adfb3d919860988b1945...  ...                 0

[4 rows x 15 columns]
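The same output can be read from Python, for example to post-process it. This is a minimal sketch that assumes the results were downloaded into `results` and that standard output sits at `results/stdout`, as in the command above. `read_job_stdout` is a hypothetical helper.

```python
from pathlib import Path


def read_job_stdout(results_dir: str = "results") -> str:
    """Return the captured standard output of a downloaded job."""
    return (Path(results_dir) / "stdout").read_text()


# Only print when the results have actually been downloaded.
if (Path("results") / "stdout").exists():
    print(read_job_stdout())
```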
