
Is there any detailed description about config? #183

Open
hyphennn opened this issue Dec 14, 2023 · 6 comments

@hyphennn
Hi,
I'm studying Buildbarn and trying to build a remote execution cluster. After reading the docs in this repo, I still have some questions:

  1. Is there any detailed description about config?
  2. We're using bazel-remote as our remote cache right now. Is there any way for us to keep using bazel-remote if we want to use bb-remote-execution?

Thanks for your help.
@EdSchouten
Member

EdSchouten commented Dec 14, 2023

Is there any detailed description about config?

Yes. Quoting the README:

The schema of the storage configuration file gives a good overview of which storage backends are available and how they can be configured.

Just study the .proto files under pkg/proto/configuration to figure out what options are available. The Protobuf language guide may also be of use.
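
As a concrete illustration (the address below is hypothetical): a Protobuf field such as string address = 1; in the gRPC client configuration maps one-to-one onto a key of the same name in the Jsonnet configuration, so:

{
  backend: {
    grpc: {
      // Corresponds to the `string address = 1;` field in the .proto file.
      address: 'storage.example.com:8980',
    },
  },
}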

We're using bazel-remote as our remote cache right now, so is there any way for us to keep use bazel-remote if we wanna use bb-remote-execution?

Sure. If bazel-remote works for you, be sure to just use that. Do note that you will not be able to make use of Buildbarn specific data stores, such as the File System Access Cache (FSAC).
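
For example, a minimal sketch of pointing a Buildbarn component's blobstore section at bazel-remote, assuming bazel-remote is serving the gRPC REAPI protocol (the address and port below are made up):

{
  contentAddressableStorage: {
    backend: { grpc: { address: 'bazel-remote.example.com:9092' } },
  },
  actionCache: {
    backend: { grpc: { address: 'bazel-remote.example.com:9092' } },
  },
}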

@hyphennn
Author

hyphennn commented Dec 14, 2023

Thanks for your kind response, that really helps a lot.

Sure. If bazel-remote works for you, be sure to just use that. Do note that you will not be able to make use of Buildbarn specific data stores, such as the File System Access Cache (FSAC).

So, what are the advantages of using the FSAC instead of continuing to use bazel-remote? Replacing bazel-remote with another cache is also on our list of options: our bazel-remote service is still being built, so we can tolerate changing our remote cache. But I did not find any information about the FSAC that would let us evaluate the ROI.

And another question: after reading the .proto files and bb_deployment, I noticed that there is a demultiplexing node. It seems to be a load balancer? What does the key of this map mean?

[screenshot: the demultiplexing configuration schema from the .proto file]

I drew a picture based on my own understanding, but the question is: how does the demultiplexing node work? I read the source code and found that it simply matches the longest prefix, which really confused me:

  • Must all these nodes run on one machine, or can they be distributed? If distributed, how do I configure it?

  • If the demultiplexing node just matches a prefix, where is the prefix added? The requests sent by the Bazel client look like http://host:port/ac/sha256digest, so will requests just be sent randomly? And does that mean all the bb_storage disks end up with almost the same contents?

  • Can the demultiplexing node be configured like Redis slots? That is, when a request like http://host:port/ac/sha256digest arrives, the demultiplexing node computes a hash of it with some hash function and redirects it to a particular bb_storage node. That way, we could use the disks most efficiently?

[diagram: the author's sketch of the proposed cluster layout]

I'm not sure whether my understanding is right; I'm really hoping for your reply. Thanks a lot.

@EdSchouten
Member

So, what are the advantages of using the FSAC instead of continuing to use bazel-remote? Replacing bazel-remote with another cache is also on our list of options: our bazel-remote service is still being built, so we can tolerate changing our remote cache. But I did not find any information about the FSAC that would let us evaluate the ROI.

The File System Access Cache is needed if you want to spin up workers using the virtual file system, and want to have parallel prefetching of input files. That can be quite advantageous for actions that are file system intensive, especially if the latency between workers and storage gets higher.
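
If you do adopt Buildbarn storage later, the FSAC is just another data store in bb_storage's configuration. A minimal sketch, following the same pattern as the CAS and AC sections above (the exact field name and the local backend details should be double-checked against the .proto files):

fileSystemAccessCache: {
  backend: { 'local': { /* ... */ } },
  getAuthorizer: { allow: {} },
  putAuthorizer: { allow: {} },
},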

And another question. After reading the pb and bb_deployment, I noticed that there is a demultiplexing node, it seems to be a load balancer?

It's not a load balancer. It's a way to route requests based on instance_name. Search for that field in REv2 for the details. It's for example used by bb_clientd to support access to multiple clusters.

What is the key means of this map?

That's stated almost literally in one of the sentences above the screenshot:

Map of storage backends, where the key corresponds to the instance name prefix to match.

  • Must all these nodes run on one machine, or can they be distributed? If distributed, how do I configure it?

Generally distributed. You can make it distributed by just using the grpc backend and forwarding those requests to another system.
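
As a sketch (the instance names and addresses are made up; check the field names against the demultiplexing schema in the .proto files), a demultiplexing backend whose entries forward to separate systems over gRPC could look like:

backend: {
  demultiplexing: {
    instanceNamePrefixes: {
      // Requests whose instance name matches prefix 'foo' go to one cluster...
      'foo': { backend: { grpc: { address: 'foo-storage.example.com:8980' } } },
      // ...while 'bar' is forwarded to a completely different system.
      'bar': { backend: { grpc: { address: 'bar-storage.example.com:8980' } } },
    },
  },
},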

  • If the demultiplexing node just matches a prefix, where is the prefix added? The requests sent by the Bazel client look like http://host:port/ac/sha256digest, so will requests just be sent randomly? And does that mean all the bb_storage disks end up with almost the same contents?

The URL scheme you use above is used by Bazel if you use the HTTP based protocol for caching. This protocol is generally not used by Buildbarn. We use the gRPC based protocol (which I linked above). That one allows specifying the instance name as part of each request.
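
Concretely, with the gRPC protocol the instance name is set on the client side. For example (the endpoint and name here are examples):

bazel build \
  --remote_executor=grpc://frontend.example.com:8980 \
  --remote_instance_name=foo \
  //...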

@hyphennn
Author

Thanks for your generous response. With your help I finally figured out how to set up a bb_storage node.

Now I'm trying to use ShardingBlobAccessConfiguration to build a bb_storage cluster. I plan to run one sharding node and two local storage nodes, so I wrote the following config for the sharding node:

{
  contentAddressableStorage: {
    backend: {
      sharding: {
        hashInitialization: 1,
        shards: [
          {
            backend: { grpc: { address: '10.101.62.103:8980' } },
            weight: 1,
          },
          {
            backend: { grpc: { address: '10.101.187.11:8980' } },
            weight: 1,
          },
        ],
      },
    },
    getAuthorizer: { allow: {} },
    putAuthorizer: { allow: {} },
    findMissingAuthorizer: { allow: {} },
  },
  actionCache: {
    backend: {
      completenessChecking: {
        backend: {
          sharding: {
            hashInitialization: 1,
            shards: [
              {
                backend: { grpc: { address: '10.101.62.103:8980' } },
                weight: 1,
              },
              {
                backend: { grpc: { address: '10.101.187.11:8980' } },
                weight: 1,
              },
            ],
          },
        },
        maximumTotalTreeSizeBytes: 16 * 1024 * 1024,
      },
    },
    getAuthorizer: { allow: {} },
    putAuthorizer: { allow: {} },
  },
  global: { diagnosticsHttpServer: {
    httpServers: [{
      listenAddresses: [':9980'],
      authenticationPolicy: { allow: {} },
    }],
    enablePrometheus: true,
    enablePprof: true,
  } },
  grpcServers: [{
    listenAddresses: [':8980'],
    authenticationPolicy: { allow: {} },
  }],
  schedulers: {
    bar: { endpoint: { address: 'bar-scheduler:8981' } },
  },
  executeAuthorizer: { allow: {} },
  maximumMessageSizeBytes: 16 * 1024 * 1024,
}

As you can see, I use two grpc backends as my storage nodes. These two nodes' configs are exactly the same as the config in the README.

Then I tested this cluster, but found only a few cache hits when building bb_storage itself. Specifically, with a single storage node there were 854/1118 actions hitting the cache, but with the sharded cluster only 248/1118.

[screenshots: Bazel output showing 854/1118 vs. 248/1118 cached actions]

I read some source code and guessed that it may be because the actionCache cannot be distributed. When I changed the actionCache's backend to local, it worked.

So, my question is:

  • Is my guess that 'the actionCache cannot be distributed' correct?
  • Is this a bug or a feature?

@EdSchouten
Member

Did you remove completenessChecking from the individual shards? You need to remove it there; otherwise each shard will only return AC entries if all referenced CAS objects are present in that same shard (which won't be the case).
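
In other words (a sketch; the local backend details are elided): on each storage node, the action cache should talk straight to the underlying store, and completenessChecking should only wrap the sharding backend on the frontend node:

// On each individual storage node: no completenessChecking wrapper here.
actionCache: {
  backend: { 'local': { /* ... */ } },
  getAuthorizer: { allow: {} },
  putAuthorizer: { allow: {} },
},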

@hyphennn
Author

Gorgeous, it works. Sorry for my silly mistake, haha.

Do you have any plans to flesh out bb_storage's documentation, or would you accept pull requests for it? I agree that bb_storage is a great project, but honestly speaking, it is painful to have to read Protobuf comments and source code to figure out how the config works, especially for a non-English speaker.
