Fix typos (#254)
* First version

* Fix codespell

* Fix StyleGAN3

* Replace logo
enoldev committed Jun 22, 2023
1 parent 4742b5d commit 7d416ee
Showing 54 changed files with 509 additions and 502 deletions.
4 changes: 2 additions & 2 deletions docs/data-ingestion/from-url.md
@@ -70,7 +70,7 @@ head -n 15 ./results/combined_results/stdout

## Get the CID From the Completed Job

-To get the CID from a completed jon output, run the following command:
+To get the output CID from a completed job, run the following command:

```bash
%%bash --out cid
@@ -100,4 +100,4 @@ The job has been submitted and Bacalhau has printed out the related job id. We s

## Need Support?

-For questions, feedback, please reach out in our [forum](https://github.com/filecoin-project/bacalhau/discussions)
+For questions and feedback, please reach out in our [forum](https://github.com/filecoin-project/bacalhau/discussions)
16 changes: 8 additions & 8 deletions docs/data-ingestion/pin.md
@@ -5,11 +5,11 @@ description: "How to pin data to public storage"
---
# Pinning Data

-If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Web3.Storage, Estuary, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.
+If you have data that you want to make available to your Bacalhau jobs (or other people), you can pin it using a pinning service like Web3.Storage, Estuary, etc. Pinning services store data on behalf of users. The pinning provider is essentially guaranteeing that your data will be available if someone knows the CID. Most pinning services offer you a free tier, so you can try them out without spending any money.

## Web3.Storage

-This example will demonstrate how to pin data using Web3.Storage. Web3.Storage is a pinning service that is built on top of IPFS and Filecoin. It is free to use for small amounts of data, and has a generous free tier. You can find more information about Web3.Storage [here](https://web3.storage/).
+This example will demonstrate how to pin data using Web3.Storage. Web3.Storage is a pinning service that is built on top of IPFS and Filecoin. It is free to use for small amounts of data and has a generous free tier. You can find more information about Web3.Storage [here](https://web3.storage/).

- First you need to create an [account](https://web3.storage/login/) (if you don't have one already).
- Next, sign in and browse to the [Create API Key](https://web3.storage/tokens/?create=true) page. Follow the instructions to create an API key. Once created, you will need to copy the API key to your clipboard.
@@ -34,7 +34,7 @@ curl -X POST https://api.web3.storage/upload -H "Authorization: Bearer ${TOKEN}"

3. **Pin multiple local files via Node.JS**: Web3.Storage has a [node.js library](https://web3.storage/docs/reference/js-client-library/) to interact with their API. The following example requires node.js to be installed. The following code uses a docker container. The javascript code is located on [their website](https://web3.storage/docs/intro/#create-the-upload-script) or on [github](https://github.com/bacalhau-project/examples/blob/main/data-ingestion/nodejs/put-files.js).

-First create some files to upload.
+First, create some files to upload.

```python
%%writefile nodejs/test1.txt
@@ -50,7 +50,7 @@ docker run --rm --env TOKEN=$TOKEN -v $PWD/nodejs:/nodejs node:18-alpine ash -c

The response will return the CID of the file, which can now be used as an input to Bacalhau.
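As a minimal sketch of that last step (the CID is a placeholder and the `ubuntu`/`ls` job is purely illustrative), a pinned CID can be passed to a job with the `-i` flag, which mounts it at `/inputs` by default:

```bash
# Sketch: use a pinned CID as a Bacalhau input. The CID below is a placeholder;
# replace it with the one returned by web3.storage. Data is mounted at /inputs by default.
bacalhau docker run \
  -i ipfs://bafybeiexampleplaceholdercid \
  ubuntu -- ls -lah /inputs
```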

-4. **Pin a file from a URL via Curl**: You can use curl to download a file then re-upload to web3.storage. For example:
+4. **Pin a file from a URL via Curl**: You can use curl to download a file and then re-upload it to web3.storage. For example:

```bash
export TOKEN=YOUR_API_KEY
@@ -67,13 +67,13 @@ docker run --rm --env TOKEN=$TOKEN -v $PWD/nodejs:/nodejs node:18-alpine ash -c

## Estuary

-This example show you how to pin data using [estuary](https://estuary.tech/api-admin).
+This example shows you how to pin data using [estuary](https://estuary.tech/api-admin).

-- Before you can upload files via estuary,create an [account](https://estuary.tech) (if you don't have one already).
+- Before you can upload files via estuary, you must create an [account](https://estuary.tech) (if you don't have one already).

- Browse to [the API Key management page](https://estuary.tech/api-admin) and create a key.

-### Ways to pin using Esturay
+### Ways to pin using Esturay

1. **Pin a local file via the Esturay UI**: You can [browse to the Estuary UI](https://estuary.tech/upload) to upload a file via your web browser.

@@ -91,7 +91,7 @@ curl -X POST https://upload.estuary.tech/content/add -H "Authorization: Bearer $

The response will return the CID of the file.

-## View pinned files
+## View pinned files

If the upload was successful, you can view the file via your [estuary account page](https://estuary.tech/home). Alternatively, you can obtain this information from the CLI:
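A minimal sketch of that CLI check is below; the `/content/list` endpoint is an assumption, so verify the path against Estuary's current API documentation.

```bash
# Sketch: list content pinned under your Estuary API key.
# The /content/list endpoint is an assumption; check Estuary's API docs.
export TOKEN=YOUR_API_KEY
curl -X GET https://api.estuary.tech/content/list \
  -H "Authorization: Bearer ${TOKEN}"
```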

20 changes: 10 additions & 10 deletions docs/examples/data-engineering/DuckDB/index.md
@@ -8,7 +8,7 @@ sidebar_position: 3
[![stars - badge-generator](https://img.shields.io/github/stars/bacalhau-project/bacalhau?style=social)](https://github.com/bacalhau-project/bacalhau)


-DuckDB is a relational table-oriented database management system and supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.
+DuckDB is a relational table-oriented database management system that supports SQL queries for producing analytical results. It also comes with various features that are useful for data analytics.

DuckDB is suited for the following use cases:

@@ -17,7 +17,7 @@ DuckDB is suited for the following use cases:
- Concurrent large changes, to multiple large tables, e.g. appending rows, adding/removing/updating columns
- Large result set transfer to client

-In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install, there is no need to download the datasets since the datasets are
+In this example tutorial, we will show how to use DuckDB with Bacalhau. The advantage of using DuckDB with Bacalhau is that you don’t need to install, and there is no need to download the datasets since the datasets are
already there on IPFS or on the web.
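For orientation, the kind of query the job runs looks roughly like the sketch below; the parquet path is a placeholder for a file Bacalhau mounts under `/inputs`, and only the `duckdb -s` invocation mirrors the example used later.

```bash
# Sketch: a DuckDB query over a mounted dataset. The parquet path is a placeholder
# for whatever Bacalhau mounts at /inputs; only `duckdb -s` is taken from the example.
duckdb -s "SELECT count(*) FROM '/inputs/yellow_tripdata_2021-01.parquet';"
```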

## TD;lR
@@ -101,7 +101,7 @@ docker push davidgasquez/datadex:v0.2.0

## Running a Bacalhau Job

-After the repo image has been pushed to docker hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:
+After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau. To submit a job, run the following Bacalhau command:
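Putting together the pieces visible in this diff, the submission looks roughly like the sketch below (any flags elided from the hunk are omitted here):

```bash
# Sketch: submit the datadex container to Bacalhau and run a trivial DuckDB query.
# Flags not shown in this diff are omitted.
bacalhau docker run \
  davidgasquez/datadex:v0.2.0 \
  -- /bin/bash -c 'duckdb -s "select 1"'
```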


```bash
@@ -117,7 +117,7 @@ davidgasquez/datadex:v0.2.0 -- /bin/bash -c 'duckdb -s "select 1"'

Let's look closely at the command above:

-* `bacalhau docker run`: call to bacalhau
+* `bacalhau docker run`: call to bacalhau

* `davidgasquez/datadex:v0.2.0 `: the name and the tag of the docker image we are using

@@ -135,7 +135,7 @@ When a job is submitted, Bacalhau prints out the related `job_id`. We store that

## Checking the State of your Jobs

-- **Job status**: You can check the status of the job using `bacalhau list`.
+- **Job status**: You can check the status of the job using `bacalhau list`.


```bash
@@ -182,7 +182,7 @@ cat results/stdout # displays the contents of the file

## Running Arbitrary SQL commands

-Below is the `bacalhau docker run` command to to run arbitrary SQL commands over yellow taxi trips dataset
+Below is the `bacalhau docker run` command to to run arbitrary SQL commands over the yellow taxi trips dataset


```bash
@@ -201,8 +201,8 @@ bacalhau docker run \

Let's look closely at the command above:

-* `bacalhau docker run`: call to bacalhau
+* `bacalhau docker run`: call to bacalhau

* `-i ipfs://bafybeiejgmdpwlfgo3dzfxfv3cn55qgnxmghyv7vcarqe3onmtzczohwaq \`: CIDs to use on the job. Mounts them at '/inputs' in the execution.

* `davidgasquez/duckdb:latest`: the name and the tag of the docker image we are using
@@ -214,7 +214,7 @@ Let's look closely at the command above:

When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on.
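A sketch of that pattern is below; the image and query are trivial stand-ins for the full command above, and the `--id-only` flag is an assumption about the CLI version in use.

```bash
# Sketch: capture the printed job ID in an environment variable for later reuse.
# The --id-only flag is an assumption about the Bacalhau CLI version.
JOB_ID=$(bacalhau docker run --id-only \
  davidgasquez/duckdb:latest \
  -- /bin/bash -c 'duckdb -s "select 1"')
echo "${JOB_ID}"
```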

-- **Job status**: You can check the status of the job using `bacalhau list`.
+- **Job status**: You can check the status of the job using `bacalhau list`.


```bash
@@ -260,4 +260,4 @@ cat results/stdout

## Need Support?

-For questions, feedback, please reach out in our [forum](https://github.com/filecoin-project/bacalhau/discussions)
+For questions, and feedback, please reach out in our [forum](https://github.com/filecoin-project/bacalhau/discussions)
40 changes: 20 additions & 20 deletions docs/examples/data-engineering/blockchain-etl/index.md
@@ -7,11 +7,11 @@ sidebar_position: 1

[![stars - badge-generator](https://img.shields.io/github/stars/bacalhau-project/bacalhau?style=social)](https://github.com/bacalhau-project/bacalhau)

-Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (event more in my experience) and importing and exporting data from the Ethereum node is slow.
+Mature blockchains are difficult to analyze because of their size. Ethereum-ETL is a tool that makes it easy to extract information from an Ethereum node, but it's not easy to get working in a batch manner. It takes approximately 1 week for an Ethereum node to download the entire chain (even more in my experience) and importing and exporting data from the Ethereum node is slow.

-For this example, we ran an Ethereum node for a week and allowed it to synchronise. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can both now access the data without having to run another ethereum node.
+For this example, we ran an Ethereum node for a week and allowed it to synchronize. We then ran ethereum-etl to extract the information and pinned it on Filecoin. This means that we can both now access the data without having to run another Ethereum node.

-But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralised network like Bacalhau to process the data in a scalable way.
+But there's still a lot of data and these types of analyses typically need repeating or refining. So it makes absolute sense to use a decentralized network like Bacalhau to process the data in a scalable way.

## TD;LR
Running Ethereum-etl tool on Bacalhau to extract Ethereum node.
@@ -64,7 +64,7 @@ df.info()

The following code inspects the daily trading volume of Ethereum for a single chunk (100,000 blocks) of data.

-This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyse the entire Ethereum blockchain without having to download it locally.
+This is all good, but we can do better. We can use the Bacalhau client to download the data from IPFS and then run the analysis on the data in the cloud. This means that we can analyze the entire Ethereum blockchain without having to download it locally.


```python
@@ -75,7 +75,7 @@ df[['block_datetime', 'value']].groupby(pd.Grouper(key='block_datetime', freq='1

## Analysing Ethereum Data With Bacalhau

-To run jobs on the Bacalhau network you need to package your code. In this example I will package the code as a Docker image.
+To run jobs on the Bacalhau network you need to package your code. In this example, I will package the code as a Docker image.

But before we do that, we need to develop the code that will perform the analysis. The code below is a simple script to parse the incoming data and produce a CSV file with the daily trading volume of Ethereum.

@@ -172,11 +172,11 @@ The job has been submitted and Bacalhau has printed out the related job id. We s

The `bacalhau docker run` command allows to pass input data volume with a `-i ipfs://CID:path` argument just like Docker, except the left-hand side of the argument is a [content identifier (CID)](https://github.com/multiformats/cid). This results in Bacalhau mounting a *data volume* inside the container. By default, Bacalhau mounts the input volume at the path `/inputs` inside the container.

-Bacalhau also mounts a data volume to store output data. The `bacalhau docker run` command creates an output data volume mounted at `/outputs`. This is a convenient location to store the results of your job.
+Bacalhau also mounts a data volume to store output data. The `bacalhau docker run` command creates an output data volume mounted at `/outputs`. This is a convenient location to store the results of your job.
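A hedged illustration of both conventions (placeholder CID, trivial command) might look like this:

```bash
# Sketch: mount an input CID at /inputs and write results to /outputs,
# which Bacalhau publishes when the job completes. The CID is a placeholder.
bacalhau docker run \
  -i ipfs://bafybeiplaceholdercid:/inputs \
  ubuntu -- /bin/bash -c 'cp -r /inputs/. /outputs/'
```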

## Checking the State of your Jobs

-- **Job status**: You can check the status of the job using `bacalhau list`.
+- **Job status**: You can check the status of the job using `bacalhau list`.


```bash
@@ -203,7 +203,7 @@ rm -rf ./results && mkdir -p ./results # Temporary directory to store the result
bacalhau get --output-dir ./results ${JOB_ID} # Download the results
```

-After the download has finished you should see the following contents in results directory.
+After the download has finished you should see the following contents in the results directory.

## Viewing your Job Output

@@ -217,7 +217,7 @@ ls -lah results/outputs

### Display the image

-To view the images, we will use **glob** to return all file paths that match a specific pattern.
+To view the images, we will use **glob** to return all file paths that match a specific pattern.


```python
@@ -232,7 +232,7 @@ df.plot()

### Massive Scale Ethereum Analysis

-Ok so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.
+Ok, so that works. Let's scale this up! We can run the same analysis on the entire Ethereum blockchain (up to the point where I have uploaded the Ethereum data). To do this, we need to run the analysis on each of the chunks of data that we have stored on IPFS. We can do this by running the same job on each of the chunks.

See the appendix for the `hashes.txt` file.

@@ -249,15 +249,15 @@ for h in $(cat hashes.txt); do \
done
```

-Now take a look at the job id's. You can use these to check the status of the jobs and download the results. You might want to double check that the jobs ran ok by doing a `bacalhau list`.
+Now take a look at the job id's. You can use these to check the status of the jobs and download the results. You might want to double-check that the jobs ran ok by doing a `bacalhau list`.


```bash
%%bash
cat job_ids.txt
```
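One hedged way to run that `bacalhau list` check over every submitted job is sketched below; the `--id-filter` flag is an assumption about the CLI version in use.

```bash
# Sketch: check the status of each job recorded in job_ids.txt.
# The --id-filter flag is an assumption about the Bacalhau CLI version.
for id in $(cat job_ids.txt); do
  bacalhau list --id-filter "${id}"
done
```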

-Wait until all of these jobs have completed:
+Wait until all of these jobs have been completed:


```bash
@@ -279,7 +279,7 @@ wait

### Display the image

-To view the images, we will use **glob** to return all file paths that match a specific pattern.
+To view the images, we will use **glob** to return all file paths that match a specific pattern.


```python
@@ -303,15 +303,15 @@ df = df_unsorted.groupby(level=0).sum()
df.plot(figsize=(16,9))
```

-That's it! There is several years of Ethereum transaction volume data.
+That's it! There are several years of Ethereum transaction volume data.


```bash
%%bash
rm -rf results_* output_* outputs results temp # Remove temporary results
```

-## Appendix 1: List Ethereum Data CIDs
+## Appendix 1: List Ethereum Data CIDs

The following list is a list of IPFS CID's for the Ethereum data that we used in this tutorial. You can use these CID's to download the rest of the chain if you so desire. The CIDs are ordered by block number and they increase 50,000 blocks at a time. Here's a list of ordered CIDs:

@@ -376,7 +376,7 @@ bafybeien55egngdpfvrsxr2jmkewdyha72ju7qaaeiydz2f5rny7drgzta

## Appendix 2: Setting up an Ethereum Node

-In the course of writing this example I had to setup an Ethereum node. It was a slow and painful process so I thought I would share the steps I took to make it easier for others.
+In the course of writing this example, I had to set up an Ethereum node. It was a slow and painful process so I thought I would share the steps I took to make it easier for others.

### Geth setup and sync

@@ -467,19 +467,19 @@ Watch the logs:
journalctl -u prysm -f
```

-Prysm will need to finish synchronising before geth will start synchronising.
+Prysm will need to finish synchronizing before geth will start synchronizing.

-In Prysm you will see lots of log messages saying: `Synced new block`, and in Geth you will see: `Syncing beacon headers downloaded=11,920,384 left=4,054,753 eta=2m25.903s`. This tells you how long it will take to sync the beacons. Once that's done, get will start synchronising the blocks.
+In Prysm you will see lots of log messages saying: `Synced new block`, and in Geth you will see: `Syncing beacon headers downloaded=11,920,384 left=4,054,753 eta=2m25.903s`. This tells you how long it will take to sync the beacons. Once that's done, get will start synchronizing the blocks.

-Bring up the ethereum javascript console with:
+Bring up the Ethereum javascript console with:

```
sudo geth --datadir /mnt/disks/ethereum/ attach
```

Once the block sync has started, `eth.syncing` will return values. Before it starts, this value will be `false`.
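A one-line sketch of that check without keeping the console open (`--exec` is a standard geth flag, but verify it against your geth version):

```bash
# Sketch: query sync progress non-interactively; prints `false` until the block sync begins.
sudo geth --datadir /mnt/disks/ethereum/ attach --exec 'eth.syncing'
```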

-Note that by default, geth will perform a fast sync, without downloading the full blocks. The `syncmode=flull` flag forces geth to do a full sync. If we didn't do this, then we wouldn't be able to backup the data properly.
+Note that by default, geth will perform a fast sync, without downloading the full blocks. The `syncmode=flull` flag forces geth to do a full sync. If we didn't do this, then we wouldn't be able to back up the data properly.
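For reference, a minimal sketch of the intended `--syncmode=full` invocation (other flags from the fuller setup above are omitted):

```bash
# Sketch: force a full sync so the chain data can be backed up completely.
# Network and tuning flags from the full setup are omitted here.
sudo geth --datadir /mnt/disks/ethereum/ --syncmode=full
```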

### Extracting the Data
