# Training a PyTorch Model

## Introduction

PyTorch is a deep learning framework developed by Facebook AI Research. It offers beginner-friendly debugging tools as well as a high level of customization for advanced users, and it is used by researchers and practitioners at companies such as Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.

In this example we will train a recurrent neural network (RNN) on the MNIST dataset.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bacalhau-project/examples/blob/main/model-training/Training-Tensorflow-Model/index.ipynb)
[![Open In Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/bacalhau-project/examples/HEAD?labpath=model-training/Training-Tensorflow-Model/index.ipynb)

## Training the model locally

### Prerequisites
- python
- torch
- torchvision

### Cloning the PyTorch examples

In [1]:
%%bash
git clone https://github.com/pytorch/examples

Cloning into 'examples'...
remote: Enumerating objects: 3718, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 3718 (delta 11), reused 32 (delta 7), pack-reused 3678
Receiving objects: 100% (3718/3718), 40.95 MiB | 21.46 MiB/s, done.
Resolving deltas: 100% (1831/1831), done.


### Training the mnist_rnn model

We add the `--save-model` flag so that the trained model is saved.

In [8]:
%%bash
python ./examples/mnist_rnn/main.py --save-model


Test set: Average loss: 0.7476, Accuracy: 7615/10000 (76%)


Test set: Average loss: 0.4355, Accuracy: 8636/10000 (86%)


Test set: Average loss: 0.3266, Accuracy: 9035/10000 (90%)


Test set: Average loss: 0.2797, Accuracy: 9146/10000 (91%)


Test set: Average loss: 0.2519, Accuracy: 9238/10000 (92%)


Test set: Average loss: 0.2355, Accuracy: 9277/10000 (93%)


Test set: Average loss: 0.2244, Accuracy: 9333/10000 (93%)


Test set: Average loss: 0.2163, Accuracy: 9362/10000 (94%)


Test set: Average loss: 0.2151, Accuracy: 9366/10000 (94%)


Test set: Average loss: 0.2124, Accuracy: 9367/10000 (94%)


Test set: Average loss: 0.2098, Accuracy: 9373/10000 (94%)


Test set: Average loss: 0.2096, Accuracy: 9381/10000 (94%)


Test set: Average loss: 0.2075, Accuracy: 9385/10000 (94%)


Test set: Average loss: 0.2066, Accuracy: 9386/10000 (94%)
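The per-epoch lines printed above can be turned into numbers for plotting or comparison. The following is a minimal sketch, assuming the log keeps the exact `Test set: Average loss: ..., Accuracy: .../...` wording shown in the output. The helper name `parse_test_metrics` is hypothetical and not part of the PyTorch examples.

```python
import re

# Matches lines such as:
# "Test set: Average loss: 0.7476, Accuracy: 7615/10000 (76%)"
TEST_LINE = re.compile(
    r"Test set: Average loss: (?P<loss>\d+\.\d+), "
    r"Accuracy: (?P<correct>\d+)/(?P<total>\d+)"
)


def parse_test_metrics(log_text):
    """Return one (index, loss, accuracy) tuple per 'Test set' line found."""
    metrics = []
    for index, match in enumerate(TEST_LINE.finditer(log_text), start=1):
        loss = float(match.group("loss"))
        accuracy = int(match.group("correct")) / int(match.group("total"))
        metrics.append((index, loss, accuracy))
    return metrics


# First and last lines of the training output shown above.
sample_log = """
Test set: Average loss: 0.7476, Accuracy: 7615/10000 (76%)

Test set: Average loss: 0.2066, Accuracy: 9386/10000 (94%)
"""

for index, loss, accuracy in parse_test_metrics(sample_log):
    print(f"entry {index}: loss={loss:.4f} accuracy={accuracy:.2%}")
```

Feeding the full output to the same function gives one tuple per epoch, in the order the epochs were logged.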



## Running on Bacalhau

### Downloading the dataset

Since containers running on Bacalhau have no network access, we need to upload the dataset to IPFS manually.

We can download the MNIST dataset using PyTorch datasets. First, we create a folder named `data` to download the dataset into.

In [9]:
%%bash
mkdir ./data

In [10]:
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.MNIST(
    root="./data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.MNIST(
    root="./data",
    train=False,
    download=True,
    transform=ToTensor()
)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw
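Before uploading, it can be worth confirming that all four archives named in the log actually landed in `./data/MNIST/raw`. The following is a minimal sketch using only the standard library. The helper name `missing_mnist_files` is hypothetical, and the check only looks at file presence, not at file integrity.

```python
from pathlib import Path

# Archive names taken from the download log above.
MNIST_ARCHIVES = (
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
)


def missing_mnist_files(root="./data"):
    """Return the archive names that are absent from <root>/MNIST/raw."""
    raw_dir = Path(root) / "MNIST" / "raw"
    return [name for name in MNIST_ARCHIVES if not (raw_dir / name).is_file()]


missing = missing_mnist_files("./data")
if missing:
    print("Missing archives:", ", ".join(missing))
else:
    print("All four MNIST archives are present.")
```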



### Uploading the dataset to IPFS

Using the IPFS CLI:
```
ipfs add -r data
```
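`ipfs add -r` prints one `added <cid> <path>` line per file and directory, and the line for the top-level directory carries the CID to use later. The sketch below extracts that CID from captured output. The helper name `root_cid` is hypothetical, and the CIDs in the sample text are made-up placeholders, not real output.

```python
def root_cid(add_output, directory="data"):
    """Return the CID from the 'added <cid> <path>' line of the top-level directory."""
    for line in add_output.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[0] == "added" and parts[2] == directory:
            return parts[1]
    return None


# Illustrative text only: the CIDs below are placeholders.
sample_output = """\
added QmExampleChildCid1 data/MNIST/raw/train-images-idx3-ubyte.gz
added QmExampleChildCid2 data/MNIST/raw
added QmExampleChildCid3 data/MNIST
added QmExampleRootCid data
"""

print(root_cid(sample_output))
```

Matching on the exact directory name, rather than taking the last line, keeps the helper correct even if extra text follows the listing.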



Data uploaded to IPFS with the IPFS CLI is not guaranteed to stay available: if it isn't pinned, it will eventually be garbage collected.

The data therefore needs to be pinned. Pinning is the mechanism that allows you to tell IPFS to always keep a given object somewhere. The default is your local node, though this can be different if you use a third-party remote pinning service.

There are different pinning services available; you can use any one of them.


### [Pinata](https://app.pinata.cloud/)

Click on the upload folder button

![](https://i.imgur.com/crnkrwy.png)

After the upload has finished, copy the CID.

### [NFT.Storage](https://nft.storage/) (Recommended Option)

[Upload files and directories with NFTUp](https://nft.storage/docs/how-to/nftup/) 

To upload your dataset using NFTUp, just drag and drop your directory and it will be uploaded to IPFS.

![](https://i.imgur.com/03NEonV.png)


Copy the CID. In this case it is `QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw` (if you used Pinata) or `bafybeif5m2md7bo2iua3kfate72kh54jgwr2spgvdtn33zdeqffh3d6qce` (if you used NFT.Storage).

You can view your uploaded dataset by clicking on the gateway URL:

[https://gateway.pinata.cloud/ipfs/QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw/?filename=data](https://gateway.pinata.cloud/ipfs/QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw/?filename=data)
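The gateway link above follows a simple pattern: gateway host, `/ipfs/`, the CID, and an optional `filename` query parameter. The sketch below rebuilds it from a CID. The helper name `gateway_url` is hypothetical, and the URL layout is inferred from the single link shown here rather than from gateway documentation.

```python
from urllib.parse import quote


def gateway_url(cid, gateway="https://gateway.pinata.cloud", filename=None):
    """Build a gateway link of the form <gateway>/ipfs/<cid>/?filename=<name>."""
    url = f"{gateway.rstrip('/')}/ipfs/{quote(cid)}/"
    if filename:
        url += f"?filename={quote(filename)}"
    return url


print(gateway_url("QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw", filename="data"))
```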

In [11]:
!curl -sL https://get.bacalhau.org/install.sh | bash

Your system is linux_amd64
No BACALHAU detected. Installing fresh BACALHAU CLI...
Getting the latest BACALHAU CLI...
Installing v0.3.13 BACALHAU CLI...
Downloading https://github.com/filecoin-project/bacalhau/releases/download/v0.3.13/bacalhau_v0.3.13_linux_amd64.tar.gz ...
Downloading sig file https://github.com/filecoin-project/bacalhau/releases/download/v0.3.13/bacalhau_v0.3.13_linux_amd64.tar.gz.signature.sha256 ...
Verified OK
Extracting tarball ...
NOT verifying Bin
bacalhau installed into /usr/local/bin successfully.
Client Version: v0.3.13
Server Version: v0.3.13


In [13]:
%%bash --out job_id
bacalhau docker run \
--gpu 1 \
--wait \
--id-only \
pytorch/pytorch \
-w /outputs \
-v QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
-u https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model

Structure of the command:

- `--gpu 1`: request 1 GPU to train the model.
- `pytorch/pytorch`: use the official PyTorch Docker image.
- `-v QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data`: mount the uploaded dataset at the path `/data`.
- `-u https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py`: mount our training script. We use the [training script](https://github.com/pytorch/examples/blob/main/mnist_rnn/main.py) from the PyTorch examples, referenced by its raw link.
- `-w /outputs`: set the working directory to `/outputs`. Everything written to this folder is automatically uploaded to IPFS as the job's outputs, so this is where we want the model to be saved.
- `python ../inputs/main.py --save-model`: run the script. The script fetched from the URL is mounted in the `/inputs` folder of the container, and since our working directory is `/outputs`, we give Python the relative path `../inputs/main.py`.
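The same command can be assembled programmatically, which makes it easier to swap in a different CID or script URL. The sketch below only builds and prints the argument list and does not submit a job. The helper name `bacalhau_command` is hypothetical, and the flags are exactly those used in the cell above. The last line shows how the relative script path resolves against the `/outputs` working directory.

```python
import posixpath
import shlex


def bacalhau_command(cid, script_url, image="pytorch/pytorch", gpus=1):
    """Assemble the argument list in the same order as the notebook cell."""
    return [
        "bacalhau", "docker", "run",
        "--gpu", str(gpus),
        "--wait",
        "--id-only",
        image,
        "-w", "/outputs",
        "-v", f"{cid}:/data",
        "-u", script_url,
        "--", "python", "../inputs/main.py", "--save-model",
    ]


cmd = bacalhau_command(
    "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw",
    "https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py",
)
print(shlex.join(cmd))

# "../inputs/main.py" relative to the working directory "/outputs":
print(posixpath.normpath(posixpath.join("/outputs", "../inputs/main.py")))
```

Keeping the command as a list avoids shell quoting issues if it is later passed to `subprocess.run`.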

In [14]:
%env JOB_ID={job_id}

env: JOB_ID=1658bb6b-21d1-4d1a-a278-b0984c967e14


In [16]:
%%bash
bacalhau list --id-filter ${JOB_ID}

 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 14:43:37  1658bb6b  Docker pytorch/pytor...  Completed            /ipfs/QmTZKuZJX3Zj9v...


When the STATE column says "Completed", the job is done and we can get the results.
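When the output of `bacalhau list` is captured for scripting, it may contain ANSI color codes around each cell. The sketch below strips those codes and checks whether the row for a given short job ID reports `Completed`. The helper names are hypothetical, and the sample row is reconstructed from the listing above with the escape character written as `\x1b`.

```python
import re

# ANSI "select graphic rendition" sequences, e.g. ESC[97;40m and ESC[0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    """Remove color codes so the table can be split on whitespace."""
    return ANSI_SGR.sub("", text)


def is_completed(list_output, short_id):
    """Return True if the row containing short_id also contains 'Completed'."""
    for line in strip_ansi(list_output).splitlines():
        columns = line.split()
        if short_id in columns:
            return "Completed" in columns
    return False


sample_row = (
    "\x1b[97;40m 14:43:37 \x1b[0m\x1b[97;40m 1658bb6b \x1b[0m"
    "\x1b[97;40m Docker pytorch/pytor... \x1b[0m\x1b[97;40m Completed \x1b[0m"
)

print(strip_ansi(sample_row).split())
print(is_completed(sample_row, "1658bb6b"))
```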

To find out more information about your job, run the following command:


In [None]:
%%bash
bacalhau describe ${JOB_ID}

In [15]:
%%bash
rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

Fetching results of job '1658bb6b-21d1-4d1a-a278-b0984c967e14'...
Results for job '1658bb6b-21d1-4d1a-a278-b0984c967e14' have been written to...
results


2022/11/21 14:46:56 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.


In [17]:
%%bash
ls results/

combined_results
per_shard
raw


In [18]:
%%bash
cat results/combined_results/stdout


Test set: Average loss: 0.7476, Accuracy: 7615/10000 (76%)


Test set: Average loss: 0.4355, Accuracy: 8636/10000 (86%)


Test set: Average loss: 0.3266, Accuracy: 9035/10000 (90%)


Test set: Average loss: 0.2797, Accuracy: 9146/10000 (91%)


Test set: Average loss: 0.2519, Accuracy: 9238/10000 (92%)


Test set: Average loss: 0.2355, Accuracy: 9277/10000 (93%)


Test set: Average loss: 0.2244, Accuracy: 9333/10000 (93%)


Test set: Average loss: 0.2163, Accuracy: 9362/10000 (94%)


Test set: Average loss: 0.2151, Accuracy: 9366/10000 (94%)


Test set: Average loss: 0.2124, Accuracy: 9367/10000 (94%)


Test set: Average loss: 0.2098, Accuracy: 9373/10000 (94%)


Test set: Average loss: 0.2096, Accuracy: 9381/10000 (94%)


Test set: Average loss: 0.2075, Accuracy: 9385/10000 (94%)


Test set: Average loss: 0.2066, Accuracy: 9386/10000 (94%)



The model has been successfully trained and downloaded.

In [19]:
%%bash
ls results/combined_results/outputs/

mnist_rnn.pt
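To confirm the download from Python, the sketch below looks for the model file in the results layout listed above. The helper names are hypothetical. `describe_checkpoint` assumes the training script saved a `state_dict` (a mapping from parameter names to tensors), which this document does not confirm, so treat that part as an assumption to check.

```python
from pathlib import Path


def find_model(results_dir="results", name="mnist_rnn.pt"):
    """Return the path to the downloaded model, or None if it is absent."""
    path = Path(results_dir) / "combined_results" / "outputs" / name
    return path if path.is_file() else None


def describe_checkpoint(path):
    """Print parameter names and shapes; assumes the file holds a state_dict."""
    import torch  # imported lazily so find_model works without PyTorch

    state = torch.load(path, map_location="cpu")
    for key, tensor in state.items():
        print(f"{key}: {tuple(tensor.shape)}")


model_path = find_model()
if model_path is None:
    print("mnist_rnn.pt not found; has `bacalhau get` finished?")
else:
    print(f"Found {model_path} ({model_path.stat().st_size} bytes)")
    # With PyTorch installed, inspect the weights with:
    # describe_checkpoint(model_path)
```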
