# Running Pandas on Bacalhau
## Introduction
Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

## Installing and Getting Started with Pandas

In [None]:
%%bash
pip install pandas

### Running your pandas script locally

**Importing data from CSV to DataFrame**

You can create a DataFrame by importing a CSV file. A CSV file is a text file with one record of data per line, and the values within each record are separated by commas. Pandas provides the `read_csv()` function to read the contents of a CSV file into a DataFrame. For example, suppose a file named `transactions.csv` contains transaction details and is stored in the same directory as the Python script.
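As a rough illustration, a script like `read_csv.py` might look like the sketch below. The original script is not reproduced in this guide, so the function name `load_transactions` and the overall layout are assumptions; only `pd.read_csv` and the file name `transactions.csv` come from the text above.

```python
from pathlib import Path

import pandas as pd


def load_transactions(csv_path):
    """Read a CSV file into a DataFrame; the first line supplies the column names."""
    return pd.read_csv(csv_path)


if __name__ == "__main__":
    # Assumption: transactions.csv sits in the current working directory.
    csv_path = Path("transactions.csv")
    if csv_path.exists():
        # Printing the DataFrame sends a (possibly truncated) table to stdout.
        print(load_transactions(csv_path))
    else:
        print(f"{csv_path} not found in the current directory")
```

Printing the DataFrame matters later: whatever the script writes to the console is what ends up in the job's `stdout` file.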

A script named `read_csv.py` that reads this file can then be run with:


In [None]:
%%bash
python3 read_csv.py

### Running the script on Bacalhau
To run pandas on Bacalhau, you must upload your datasets along with the script to IPFS. You can do this with the IPFS CLI or with a pinning service such as [pinata](https://www.pinata.cloud/) or [nft.storage](https://nft.storage/).

Adding the scripts and datasets to IPFS:

In [None]:
%%bash
ipfs add -r .

The output should look like:
``` bash
added QmPqx4BaWzAmZm4AuBqGtG6dkX7bGSVgjfgpkv2g7mi3uz pandas/read_csv.py
added QmYErPqtdpNTxpKot9pXR5QbhGSyaGdMFxfUwGHm4rzXzH pandas/transactions.csv
added QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz pandas
 1.59 KiB / 1.59 KiB [===================================================================================]
```

To run pandas on Bacalhau, you need a container image that has Python and pandas installed.

Structure of the command:

- `bacalhau docker run` is similar to `docker run`.
- `-v` mounts a CID into the container, in the form `CID:<PATH-TO-WHERE-THE-CID-IS-TO-BE-MOUNTED>`; here, `QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files`.
- `-w` sets the working directory inside the container.
- `amancevice/pandas` is the container image to run.
- `-- python read_csv.py` runs the script.
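The pieces listed above can also be assembled programmatically. The following is a minimal sketch, not part of Bacalhau: `build_run_command` is a hypothetical helper, and it uses only the flags described in the list (`-v`, `-w`, and the `--` separator).

```python
import shlex


def build_run_command(cid, image, script, mount_path="/files"):
    """Assemble the `bacalhau docker run` argument list (hypothetical helper)."""
    return [
        "bacalhau", "docker", "run",
        "-v", f"{cid}:{mount_path}",  # mount the IPFS CID inside the container
        "-w", mount_path,             # use the mounted directory as working directory
        image,
        "--", "python", script,       # everything after `--` runs in the container
    ]


command = build_run_command(
    cid="QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz",
    image="amancevice/pandas",
    script="read_csv.py",
)
# Print the command as a single shell-style line for inspection.
print(shlex.join(command))
```

On a machine with the Bacalhau CLI installed, the resulting list could be passed to `subprocess.run(command, check=True)`.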

Command:


In [None]:
%%bash
bacalhau docker run \
-v QmfKJT13h5k1b23ja3ZCVg5nFL9oKz2bVXc8oXgtwiwhjz:/files \
-w /files \
amancevice/pandas \
-- python read_csv.py

Running the command outputs a UUID (for example, `940c7fd7-c15a-4d00-8170-0d138cdca7eb`). This is the ID of the job that was created.
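If you capture the command's output in a script, the job ID can be pulled out with a regular expression. This is only a sketch: `extract_job_id` is a hypothetical helper, and it assumes the UUID appears somewhere in the captured text in the lowercase form shown above.

```python
import re

# Matches a lowercase UUID such as 940c7fd7-c15a-4d00-8170-0d138cdca7eb.
UUID_PATTERN = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def extract_job_id(output):
    """Return the first UUID found in `output`, or None if there is none."""
    match = UUID_PATTERN.search(output)
    return match.group(0) if match else None


job_id = extract_job_id("940c7fd7-c15a-4d00-8170-0d138cdca7eb\n")
print(job_id)
```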

You can check the status of the job with the following command:

In [None]:
%%bash
bacalhau list --id-filter 940c7fd7-c15a-4d00-8170-0d138cdca7eb

This should result in an output like the following:
``` bash
 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 04:56:11  940c7fd7  Docker amancevice/pa...  Published            /ipfs/bafybeihaqoxj7...
```

When the `STATE` column shows `Published`, the job is done and the results are ready to download.

If there is an error, you can view it with the following command:


In [None]:
%%bash
bacalhau describe 940c7fd7-c15a-4d00-8170-0d138cdca7eb

Since there is no error, the output instead shows the state of the job. Here one node reports `Published` with a verified result:
``` bash
Shards:
    - ShardIndex: 0
      Nodes:
        - Node: QmXaXu9N5GNetatsvwnTfQqNtSeKAD6uCmarbh3LMRYAcF
          State: Cancelled
          Status: ""
          Verified: false
          ResultID: ""
        - Node: QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL
          State: Published
          Status: 'Got results proposal of length: 0'
          Verified: true
          ResultID: bafybeihaqoxj7ty55af23hfyu423ic5sazgwocpa4lzfgcxknqcioi3co4
Start Time: 2022-09-13T04:56:11.493291381Z
```
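To pick out the published result from this kind of output in a script, a small line-based parser is enough. The sketch below assumes the text layout shown above, which may differ between Bacalhau versions; `parse_node_states` is a hypothetical helper, and `SAMPLE` is an excerpt of the example output.

```python
SAMPLE = """
Shards:
    - ShardIndex: 0
      Nodes:
        - Node: QmXaXu9N5GNetatsvwnTfQqNtSeKAD6uCmarbh3LMRYAcF
          State: Cancelled
          ResultID: ""
        - Node: QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL
          State: Published
          ResultID: bafybeihaqoxj7ty55af23hfyu423ic5sazgwocpa4lzfgcxknqcioi3co4
Start Time: 2022-09-13T04:56:11.493291381Z
"""


def parse_node_states(describe_text):
    """Collect the Node, State and ResultID fields for each node entry."""
    nodes = []
    for raw_line in describe_text.splitlines():
        line = raw_line.strip()
        if line.startswith("- Node:"):
            # A new node entry starts here.
            nodes.append({"Node": line.split(":", 1)[1].strip()})
        elif nodes and ":" in line:
            key, value = line.split(":", 1)
            if key in ("State", "ResultID"):
                nodes[-1][key] = value.strip().strip('"')
    return nodes


nodes = parse_node_states(SAMPLE)
published = [node for node in nodes if node.get("State") == "Published"]
for node in published:
    print(node["Node"], node["ResultID"])
```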



To download the results of your job, run the following command:

In [None]:
%%bash
bacalhau get 940c7fd7-c15a-4d00-8170-0d138cdca7eb

An example output looks like:
``` bash
2022/09/13 05:04:15 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
05:04:15.835 | INF ipfs/downloader.go:115 > Found 1 result shards, downloading to temporary folder.
05:04:18.602 | INF ipfs/downloader.go:195 > Combining shard from output volume 'outputs' to final location: '/home/vedant/test/pandas'
```

After the download has finished, running `ls` shows the following contents:
``` bash
shards stderr stdout volumes
```

In [None]:
%%bash
ls

The structure of the files and directories will look like this:
``` bash
├── shards
│   └── job-940c7fd7-c15a-4d00-8170-0d138cdca7eb-shard-0-host-QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL
│       ├── exitCode
│       ├── stderr
│       └── stdout
├── stderr
├── stdout
└── volumes
    └── outputs
```
- `stdout` contains anything the job printed to the console.
- `stderr` contains any errors. In this case there are no errors, so it will be empty.
- The `volumes` folder contains the volumes you named with the `-o` flag when you started the job. In addition, there is always an `outputs` volume, which is provided by default.
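The layout described above can also be inspected from Python. This is a minimal sketch that assumes the directory structure shown in the tree; `summarize_results` is a hypothetical helper, not a Bacalhau API.

```python
from pathlib import Path


def summarize_results(results_dir):
    """Return the stdout and stderr text plus the names of the output volumes."""
    root = Path(results_dir)
    summary = {}
    for name in ("stdout", "stderr"):
        path = root / name
        # A missing file is treated the same as an empty one.
        summary[name] = path.read_text() if path.is_file() else ""
    volumes = root / "volumes"
    summary["volumes"] = (
        sorted(entry.name for entry in volumes.iterdir()) if volumes.is_dir() else []
    )
    return summary


if __name__ == "__main__":
    # Assumption: the results were downloaded into the current directory.
    print(summarize_results("."))
```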

Because the script prints its result to stdout, the output appears in the `stdout` file. You can read it with the following command:


In [None]:
%%bash
cat stdout

The output should look something like this:
``` bash
                                                hash  ...  transaction_type
0  0x04cbcb236043d8fb7839e07bbc7f5eed692fb2ca55d8...  ...                 0
1  0xcea6f89720cc1d2f46cc7a935463ae0b99dd5fad9c91...  ...                 0
2  0x463d53f0ad57677a3b430a007c1c31d15d62c37fab5e...  ...                 0
3  0x05287a561f218418892ab053adfb3d919860988b1945...  ...                 0

[4 rows x 15 columns]
```