ref-counted Metadata type #10069

@emilk

Background

Both RecordBatch/Schema and Field can have metadata. In both cases it is encoded as a HashMap<String, String>.

One downside with this is that cloning the metadata is slow (requires a deep clone and a lot of allocations). This is in contrast with basically everything else in arrow-rs, which uses an Arc for fast cloning.

Proposal outline

use std::collections::{BTreeMap, HashMap};
use std::sync::Arc;

#[derive(Clone, Default)]
pub struct Metadata(
    // Use `Option` to avoid allocation in case of empty metadata
    Option<Arc<BTreeMap<String, String>>>,
);

impl Metadata {
    pub fn get(&self, key: &str) -> Option<&String> {
        self.0.as_ref()?.get(key)
    }

    /// Does deep clone if (and only if) this `Metadata` is shared
    pub fn insert(&mut self, key: impl Into<String>, value: impl Into<String>) {
        Arc::make_mut(self.0.get_or_insert_default()).insert(key.into(), value.into());
    }
}

// Trait impls (bodies omitted in this outline):
impl Index<…> for Metadata {}
impl From<HashMap<String, String>> for Metadata {}
impl From<BTreeMap<String, String>> for Metadata {}
impl Into<HashMap<String, String>> for Metadata {}
impl Into<BTreeMap<String, String>> for Metadata {}

// Also: impl IntoIterator, FromIterator, …
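
To make the copy-on-write behaviour of the outline concrete, here is a minimal, self-contained sketch. It is not the final API: `ptr_eq` and `copy_on_write_demo` are hypothetical helpers added only for illustration, and `get_or_insert_with` is used in place of `get_or_insert_default` so the snippet also builds on older toolchains.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Sketch of the proposed ref-counted metadata type.
#[derive(Clone, Debug, Default)]
pub struct Metadata(Option<Arc<BTreeMap<String, String>>>);

impl Metadata {
    /// Looks up a value; empty metadata (`None`) has no entries.
    pub fn get(&self, key: &str) -> Option<&String> {
        self.0.as_ref()?.get(key)
    }

    /// Inserts an entry. `Arc::make_mut` deep-clones the map only if
    /// another `Metadata` currently shares the same allocation.
    pub fn insert(&mut self, key: impl Into<String>, value: impl Into<String>) {
        let map = Arc::make_mut(self.0.get_or_insert_with(Default::default));
        map.insert(key.into(), value.into());
    }

    /// Hypothetical helper (not in the proposal): true if both values
    /// point at the same underlying map, or both are empty.
    pub fn ptr_eq(&self, other: &Self) -> bool {
        match (&self.0, &other.0) {
            (Some(a), Some(b)) => Arc::ptr_eq(a, b),
            (None, None) => true,
            _ => false,
        }
    }
}

/// Hypothetical demo: clone cheaply, then mutate the clone.
pub fn copy_on_write_demo() -> (Metadata, Metadata) {
    let mut original = Metadata::default();
    original.insert("unit", "m/s");

    // Cloning only bumps the reference count; the map is shared.
    let mut copy = original.clone();
    assert!(original.ptr_eq(&copy));

    // The map is shared, so this insert deep-clones it first.
    copy.insert("source", "sensor-7");
    (original, copy)
}
```

After `copy_on_write_demo()`, the original still holds only `unit`, while the copy holds both entries in its own allocation.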

PRO/CON vs status quo (HashMap<String, String>)

  • PRO: Fast cloning of the whole Metadata
  • PRO: Deterministic iteration order (thanks to BTreeMap) - good for IPC/FFI encoding, test stability, hashing, …
  • NEUTRAL: Can still add/remove Metadata fields without extra cost, as long as the Metadata is not shared (a shared one is deep-cloned first)
  • CON: New type; more complexity
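
The deterministic-order point can be illustrated with a small sketch. The `encode` function below is a made-up `key=value;…` serialization used purely for demonstration; it is not the Arrow IPC or FFI format, and `from_pairs` is likewise a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Hypothetical toy encoding: joins entries as `key=value` separated by `;`.
/// `BTreeMap` iterates in sorted key order, so the output does not depend
/// on the order in which entries were inserted.
pub fn encode(metadata: &BTreeMap<String, String>) -> String {
    metadata
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(";")
}

/// Hypothetical helper: builds a map from borrowed pairs.
pub fn from_pairs(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

With a `HashMap`, the iteration order is unspecified and can differ between maps holding the same entries, so the same comparison of encoded output would not be reliable.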

Alternatives

Instead of storing String, we could store Arc<str>. That would make it efficient to share the same keys across many metadata tables.

The downside is added complexity.
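
As a rough sketch of this alternative, the snippet below shares one key allocation across several tables. It assumes both keys and values become `Arc<str>`; the type alias `SharedStrMap` and the function `build_tables` are hypothetical names used only for illustration.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical map type for the `Arc<str>` alternative.
pub type SharedStrMap = BTreeMap<Arc<str>, Arc<str>>;

/// Builds `n` tables that all reuse the same key and value allocations.
pub fn build_tables(n: usize) -> Vec<SharedStrMap> {
    let key: Arc<str> = Arc::from("unit");
    let value: Arc<str> = Arc::from("m/s");
    (0..n)
        .map(|_| {
            let mut table = SharedStrMap::new();
            // `Arc::clone` bumps a reference count; no string data is copied.
            table.insert(Arc::clone(&key), Arc::clone(&value));
            table
        })
        .collect()
}
```

Lookups by `&str` still work, because `Arc<str>` implements `Borrow<str>`. The added complexity shows up at the edges: callers holding a `String` need a conversion, and sharing only pays off if keys are deliberately reused rather than allocated afresh for each table.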
