
rgw/s3vector: implement the VectorBucket API#66327

Closed
yuvalif wants to merge 3 commits into ceph:wip-s3vector from yuvalif:wip-vector-bucket-apis

@yuvalif
Contributor

@yuvalif yuvalif commented Nov 19, 2025

  • This is a draft of adding Create/Get/List/Remove Vector Bucket APIs.
  • The idea is to borrow as much as possible from the existing mechanism for regular S3 buckets, so that in the future we will be able to support metadata sync, policies, quota, stats, etc. for vector buckets without reimplementing all of these mechanisms.
  • Currently, the main difference between VectorBuckets and Buckets is in the prefixes of the RADOS object names. However, to make the code change less intrusive, I mainly used copy-and-paste and inheritance to implement that.
  • Once we have a working solution, refactoring MUST BE DONE so that the implementation is much smaller
    • bucket_sobj refactoring
    • RGWBucketCtl refactoring
    • sal::VectorBucket refactoring
    • RGWRados refactoring
  • Bucket index code is removed from VectorBuckets
  • Currently, the tests that verify combinations of buckets and vector buckets are failing: if a bucket is created with the same name as a vector bucket, the bucket creation fails (and vice versa).
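
The "same code path, different RADOS object-name prefix" idea from the description can be sketched as follows. This is only an illustration: the class names and both prefix strings are hypothetical placeholders, not the identifiers or prefixes used in the PR.

```cpp
#include <cassert>
#include <string>

namespace sketch {  // simplified stand-ins, not the real RGW classes

// Base naming scheme shared by regular buckets.
class BucketNaming {
 public:
  virtual ~BucketNaming() = default;
  // Placeholder prefix, chosen for illustration only.
  virtual std::string prefix() const { return ".bucket.meta."; }
  // Builds the name of the metadata object for one bucket instance.
  std::string instance_oid(const std::string& name,
                           const std::string& id) const {
    assert(!name.empty() && !id.empty());
    return prefix() + name + ":" + id;
  }
};

// Vector buckets reuse the logic and only override the prefix.
class VectorBucketNaming : public BucketNaming {
 public:
  // Placeholder prefix, chosen for illustration only.
  std::string prefix() const override { return ".vector.bucket.meta."; }
};

}  // namespace sketch
```

With distinct prefixes, a bucket and a vector bucket that share a name and ID map to different object names, which is the property the mixed bucket/vector-bucket tests would rely on.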

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

do some fixes to message validation based on the tests

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
@yuvalif yuvalif force-pushed the wip-vector-bucket-apis branch 2 times, most recently from 726f536 to 7e88e92 Compare November 27, 2025 10:50
@yuvalif yuvalif marked this pull request as ready for review November 27, 2025 15:39
@yuvalif yuvalif requested a review from a team as a code owner November 27, 2025 15:39
@yuvalif yuvalif requested a review from dang November 27, 2025 15:51
@cbodley
Contributor

cbodley commented Dec 2, 2025

rgw/s3vector: disable mdlog writes for vector buckets
since the rest of the md sync code does not support them

TODO:

  1. create RGWMetadataHandlers for vector bucket entrypoints (ex RGWBucketMetadataHandler) and instances (ex RGWBucketInstanceMetadataHandler)
  2. attach them to the RGWMetadataManager in RGWCtl::init()

that's all that metadata sync should require, other than these mdlog entries

@yuvalif
Contributor Author

yuvalif commented Dec 3, 2025

rgw/s3vector: disable mdlog writes for vector buckets
since the rest of the md sync code does not support them

TODO:

  1. create RGWMetadataHandlers for vector bucket entrypoints (ex RGWBucketMetadataHandler) and instances (ex RGWBucketInstanceMetadataHandler)
  2. attach them to the RGWMetadataManager in RGWCtl::init()

that's all that metadata sync should require, other than these mdlog entries

done here: 340d931
ran some basic testing: listed, on the secondary, vector buckets created on the primary
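
The two TODO steps quoted above (create one handler for entrypoints and one for instances, then attach both to the manager at init time) follow a plain registry pattern. A minimal sketch, assuming simplified stand-in types: `MetadataHandler`, `MetadataManager`, `init_vector_bucket_metadata` and the section names `"vectorbucket"` and `"vectorbucket.instance"` are all hypothetical and do not mirror the real `RGWMetadataHandler` or `RGWMetadataManager` signatures.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace sketch {  // simplified stand-ins, not the real RGW classes

class MetadataHandler {
 public:
  virtual ~MetadataHandler() = default;
  virtual std::string get_type() const = 0;  // metadata section name
};

// Step 1: one handler per metadata kind (section names are made up).
class VectorBucketEntrypointHandler : public MetadataHandler {
 public:
  std::string get_type() const override { return "vectorbucket"; }
};

class VectorBucketInstanceHandler : public MetadataHandler {
 public:
  std::string get_type() const override { return "vectorbucket.instance"; }
};

class MetadataManager {
 public:
  // Returns false if a handler for the same section already exists.
  bool register_handler(std::unique_ptr<MetadataHandler> h) {
    assert(h != nullptr);
    std::string type = h->get_type();
    return handlers_.emplace(std::move(type), std::move(h)).second;
  }
  // Returns the handler for a section, or nullptr if none is registered.
  MetadataHandler* find(const std::string& type) const {
    auto it = handlers_.find(type);
    return it == handlers_.end() ? nullptr : it->second.get();
  }

 private:
  std::map<std::string, std::unique_ptr<MetadataHandler>> handlers_;
};

// Step 2: attach both handlers during initialization.
inline bool init_vector_bucket_metadata(MetadataManager& mgr) {
  const bool ep =
      mgr.register_handler(std::make_unique<VectorBucketEntrypointHandler>());
  const bool inst =
      mgr.register_handler(std::make_unique<VectorBucketInstanceHandler>());
  return ep && inst;
}

}  // namespace sketch
```

Once both sections are registered, a sync process that dispatches on the section name can find a handler for vector bucket entries instead of skipping them.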

Contributor

@cbodley cbodley left a comment

in the weekly meeting, we briefly discussed the contents of RGWBucketInfo::layout for vector buckets and i suggested marking them as "indexless"

Comment on lines +3140 to +3146
class RGWVectorBucketInstanceMetadataHandler : public RGWBucketInstanceMetadataHandler {
 protected:
  int put_prepare(const DoutPrefixProvider* dpp, optional_yield y,
                  const std::string& entry, RGWBucketCompleteInfo& bci,
                  const std::optional<RGWBucketCompleteInfo>& old_bci,
                  const RGWObjVersionTracker& objv_tracker,
                  bool from_remote_zone) override { return 0;}
Contributor

for multisite, RGWBucketInstanceMetadataHandler::put_prepare() is what calls init_default_bucket_layout() to initialize the local bucket layout

since you're overriding put_prepare(), you can just force indexless here:

// vector buckets are indexless
bci.info.layout.current_index.layout.type = rgw::BucketIndexType::Indexless;
return 0;

Comment on lines +130 to +132
rgw::sal::VectorBucket::CreateParams createparams;
createparams.owner = s->user->get_id();
createparams.zonegroup_id = zonegroup.id;
Contributor

then on creation, you can request indexless with:

// vector buckets are indexless
createparams.index_type = rgw::BucketIndexType::Indexless;
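
The two suggestions above cover the two paths by which a vector bucket instance can come into existence: the metadata sync path (`put_prepare()`) and the local creation path (`CreateParams`). A self-contained sketch of both, using deliberately simplified stand-in types in a hypothetical `sketch` namespace; the real `RGWBucketCompleteInfo`, `RGWBucketInstanceMetadataHandler` and `rgw::sal::VectorBucket::CreateParams` have more fields and longer signatures than shown here.

```cpp
#include <cassert>
#include <string>

namespace sketch {  // simplified stand-ins, not the real RGW types

enum class BucketIndexType { Normal, Indexless };

struct BucketInfo {
  std::string name;
  BucketIndexType index_type = BucketIndexType::Normal;
};

struct CreateParams {
  std::string owner;
  BucketIndexType index_type = BucketIndexType::Normal;
};

class BucketInstanceHandler {
 public:
  virtual ~BucketInstanceHandler() = default;
  // Sync path: runs before a replicated instance is written locally.
  virtual int put_prepare(BucketInfo& info) {
    info.index_type = BucketIndexType::Normal;  // default local layout
    return 0;
  }
};

class VectorBucketInstanceHandler : public BucketInstanceHandler {
 public:
  int put_prepare(BucketInfo& info) override {
    // vector buckets are indexless
    info.index_type = BucketIndexType::Indexless;
    return 0;
  }
};

// Creation path: request an indexless layout up front.
inline CreateParams make_vector_bucket_params(const std::string& owner) {
  assert(!owner.empty());
  CreateParams params;
  params.owner = owner;
  params.index_type = BucketIndexType::Indexless;
  return params;
}

}  // namespace sketch
```

Setting the type in both places keeps the layout consistent whether the instance was created on this zone or arrived through metadata sync.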

@yuvalif yuvalif requested a review from cbodley December 24, 2025 12:24
@yuvalif
Contributor Author

yuvalif commented Dec 25, 2025

jenkins test docs

yuvalif added a commit to yuvalif/ceph that referenced this pull request Jan 1, 2026
should be merged after ceph#66327

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
yuvalif added a commit to yuvalif/ceph that referenced this pull request Jan 1, 2026
should be merged after ceph#66327

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
yuvalif added a commit to yuvalif/ceph that referenced this pull request Jan 13, 2026
not implemented in this commit:
* VectorBucket APIs (done in ceph#66327)
* caching information as VectorBucket attributes and fetching them from
  there (after the VectorBucket PR is merged): schema, distance type
* VectorBucket policy APIs
* metadata support

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
const CreateParams& params,
optional_yield y) = 0;

/** Get the cached attributes associated with this vector bucket */
Contributor

Could we avoid using optional_yield since this is a new interface? Since we're trying to do async between Rust and C++, I'd really rather avoid having the back-end deal with the possibility of blocking, and instead have that handled at the caller. (If the functions we call take optional_yield, they should be callable with the plain yield_context?)

Contributor Author

the vector bucket creation does not involve calling into lancedb; this is very similar to the flow of S3 bucket creation.
since optional_yield is used there, and since I'm using the same (or similar) calls, I think that we are "stuck" with it for now...

/** Get the cached placement rule of this vector bucket */
virtual rgw_placement_rule& get_placement_rule() = 0;
/** Get the cached creation time of this vector bucket */
virtual ceph::real_time& get_creation_time() = 0;
Contributor

Returning a non-const reference here doesn't seem right. Since it's a 64-bit integer, could we just return it by value?

Contributor Author

i copied that from sal::Bucket, but i completely agree that it should be by value
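
A sketch of what by-value and const-reference accessors could look like, under stated assumptions: `real_time` here is a stand-in alias assumed to behave like `ceph::real_time` (a chrono time point), and `VectorBucketView` and `CachedVectorBucket` are hypothetical names, not the actual `sal::VectorBucket` interface.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Stand-in for ceph::real_time; assumed to be a chrono time_point.
using real_time = std::chrono::system_clock::time_point;

// Hypothetical, trimmed-down read-only interface.
class VectorBucketView {
 public:
  virtual ~VectorBucketView() = default;
  // Small, trivially copyable value: return by value.
  virtual real_time get_creation_time() const = 0;
  // Larger object: a const reference avoids a copy without
  // letting callers mutate the cached state.
  virtual const std::string& get_bucket_id() const = 0;
};

class CachedVectorBucket : public VectorBucketView {
 public:
  CachedVectorBucket(std::string id, real_time created)
      : id_(std::move(id)), created_(created) {
    assert(!id_.empty());
  }
  real_time get_creation_time() const override { return created_; }
  const std::string& get_bucket_id() const override { return id_; }

 private:
  std::string id_;
  real_time created_;
};
```

Because both accessors are `const`, they can be called through a `const VectorBucketView&`, and any change to the cached values has to go through an explicit setter rather than a leaked reference.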

/** Get the cached ID of this vector bucket */
virtual const std::string& get_bucket_id() const = 0;
/** Get the cached placement rule of this vector bucket */
virtual rgw_placement_rule& get_placement_rule() = 0;
Contributor

This being a mutable reference doesn't seem right?

Contributor Author

ditto

/** Get the key for this vector bucket */
virtual rgw_bucket& get_key() = 0;
/** Get the info for this vector bucket */
virtual RGWBucketInfo& get_info() = 0;
Contributor

Non-const?

Contributor Author

ditto

ldpp_dout(dpp, 20) << "s3vector --- RGWRados::create_vector_bucke called" << dendl;
int ret = 0;

#define MAX_CREATE_RETRIES 20 /* need to bound retries */
Contributor

Please don't do this. Use a constexpr.

Contributor Author

copied from...
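
For reference, a minimal sketch of the constexpr alternative to the `#define`. The names `max_create_retries` and `create_with_retries`, and the choice of `-EEXIST` as the retry condition, are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Typed, scoped replacement for `#define MAX_CREATE_RETRIES 20`.
constexpr int max_create_retries = 20;  // need to bound retries

// Hypothetical helper: calls try_create until it returns something other
// than -EEXIST, or until the retry bound is reached.
inline int create_with_retries(const std::function<int()>& try_create) {
  assert(static_cast<bool>(try_create));  // an empty function would throw
  int ret = -EEXIST;
  for (int i = 0; i < max_create_retries && ret == -EEXIST; ++i) {
    ret = try_create();
  }
  return ret;
}
```

Unlike the macro, the constant obeys scope rules, has a type, and is visible to the debugger, and it can be moved into a class or function body without leaking into every file that includes the header.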


int RGWRados::get_raw_obj_ref(const DoutPrefixProvider *dpp, rgw_raw_obj obj, rgw_rados_ref* ref)
{
ldpp_dout(dpp, 1) << "INFO: s3vector -- called RGWRados::get_raw_obj_ref()" << dendl;
Contributor

This prints s3vector whether s3vector is involved or not?

Contributor Author

will remove all of the debug log used to help with the call flow

RGWBucketEntryPoint ep;
r = ctl.vector_bucket->read_bucket_entrypoint_info(bucket_info.bucket,
&ep,
null_yield,
Contributor

Why are we passing null_yield here?

Contributor Author

copied from the s3 bucket code. will fix here, but i'm not sure why null_yield is used there? @cbodley ?

Comment on lines +2027 to +2028
/** Store the cached bucket info into the backing store */
//virtual int put_info(const DoutPrefixProvider* dpp, bool exclusive, ceph::real_time mtime, optional_yield y) = 0;
Contributor

you'll probably need this to implement things like PutVectorBucketPolicy to update existing bucket instance metadata objects

Contributor Author

will add that. though the bucket policy part will be done in subsequent work

Contributor Author

@cbodley added 2 APIs that will be used in setting vector bucket policies in: e69ecd4

@yuvalif yuvalif requested a review from cbodley January 22, 2026 18:29
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
@yuvalif
Contributor Author

yuvalif commented Jan 28, 2026

cherry-picked into #66066

@yuvalif yuvalif closed this Jan 28, 2026