fix: substring with negative start index#4017

Merged
andygrove merged 10 commits into apache:main from kazuyukitanimura:fix-3919
Apr 29, 2026

Conversation

Contributor

@kazuyukitanimura kazuyukitanimura commented Apr 21, 2026

Which issue does this PR close?

Closes #3919
Closes #3337

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review April 24, 2026 21:35
let result = DictionaryArray::try_new(dict.keys().clone(), values)?;
Ok(Arc::new(result) as ArrayRef)
}
_ => Ok(Arc::clone(array)),
Member


Should this be an error rather than just returning the input data?

Contributor Author


Thank you, updated
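The change discussed here (failing loudly on an unsupported input type instead of silently returning the input array) can be sketched with a stdlib-only mock. The `DataType` enum, `substring_kernel` function, and error message below are illustrative stand-ins, not the actual Comet/Arrow code:

```rust
use std::fmt;

// Hypothetical stand-in for Arrow's DataType enum.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Dictionary,
    Int32,
}

#[derive(Debug)]
struct ExecutionError(String);

impl fmt::Display for ExecutionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Execution error: {}", self.0)
    }
}

// Instead of a catch-all `_ => Ok(Arc::clone(array))`, unsupported
// types now produce an error the caller can surface.
fn substring_kernel(dt: &DataType) -> Result<&'static str, ExecutionError> {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 => Ok("string path"),
        DataType::Dictionary => Ok("dictionary path"),
        other => Err(ExecutionError(format!(
            "substring: unsupported input type {other:?}"
        ))),
    }
}

fn main() {
    assert!(substring_kernel(&DataType::Utf8).is_ok());
    assert!(substring_kernel(&DataType::Int32).is_err());
    println!("ok");
}
```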

Ok(Arc::new(builder.finish()) as ArrayRef)
}
DataType::Dictionary(_, _) => {
let dict = as_dictionary_array::<Int32Type>(array);
Member


This would panic for a dictionary with Int64Type. Can we add a check for the type?

Contributor Author


Parquet dictionaries use Integer keys, so we do not handle Int64Type here, consistent with other locations. E.g. https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/static_invoke/char_varchar_utils/read_side_padding.rs#L68

Comment on lines +202 to +216
fn spark_substr_negative(s: &str, pos: i64, len: u64) -> String {
let num_chars = s.chars().count() as i64;
let start = num_chars + pos;
let end = start.saturating_add(len as i64).min(num_chars);
let start = start.max(0);

if start >= end {
return String::new();
}

s.chars()
.skip(start as usize)
.take((end - start) as usize)
.collect()
}
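The helper above resolves a negative Spark start position against the character count, clamps the window to the string, and collects the selected characters. A quick standalone check of its semantics (the call values below are my own examples, not tests from the PR):

```rust
// Copy of the helper from the diff above, for a standalone check.
fn spark_substr_negative(s: &str, pos: i64, len: u64) -> String {
    let num_chars = s.chars().count() as i64;
    let start = num_chars + pos;
    let end = start.saturating_add(len as i64).min(num_chars);
    let start = start.max(0);

    if start >= end {
        return String::new();
    }

    s.chars()
        .skip(start as usize)
        .take((end - start) as usize)
        .collect()
}

fn main() {
    // Spark: substring('hello', -3, 2) = 'll'
    assert_eq!(spark_substr_negative("hello", -3, 2), "ll");
    // Start before the beginning: the window is clamped, matching Spark,
    // where substring('hello', -10, 7) = 'he'.
    assert_eq!(spark_substr_negative("hello", -10, 7), "he");
    // Multibyte input is counted in characters, not bytes.
    assert_eq!(spark_substr_negative("こんにちは世界", -2, 2), "世界");
    println!("ok");
}
```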
Member


Claude recommended an optimized version to avoid an intermediate string allocation per row. I have not verified.

fn spark_substr_negative(s: &str, pos: i64, len: u64) -> &str {
    let num_chars = s.chars().count() as i64;
    let end = (num_chars + pos).saturating_add(len as i64).min(num_chars);
    let start = (num_chars + pos).max(0);
    if start >= end {
        return "";
    }

    // Translate char indices [start, end) to byte offsets in a single forward pass.
    let mut it = s.char_indices();
    let byte_start = it.by_ref().nth(start as usize).map(|(b, _)| b).unwrap_or(s.len());
    let span = (end - start - 1) as usize;
    let byte_end = it.nth(span).map(|(b, _)| b).unwrap_or(s.len());

    &s[byte_start..byte_end]
}
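The suggested version returns a borrowed `&str` slice instead of allocating a `String` per row, translating the character window into byte offsets in one pass over `char_indices()`. A standalone check that it behaves the same as the allocating version, including on multibyte input (the call values are my own examples):

```rust
// Copy of the allocation-free version suggested above, for a standalone check.
fn spark_substr_negative(s: &str, pos: i64, len: u64) -> &str {
    let num_chars = s.chars().count() as i64;
    let end = (num_chars + pos).saturating_add(len as i64).min(num_chars);
    let start = (num_chars + pos).max(0);
    if start >= end {
        return "";
    }

    // Translate char indices [start, end) to byte offsets in a single forward pass.
    let mut it = s.char_indices();
    let byte_start = it.by_ref().nth(start as usize).map(|(b, _)| b).unwrap_or(s.len());
    let span = (end - start - 1) as usize;
    let byte_end = it.nth(span).map(|(b, _)| b).unwrap_or(s.len());

    &s[byte_start..byte_end]
}

fn main() {
    // Matches the String-returning version on ASCII input.
    assert_eq!(spark_substr_negative("hello", -3, 2), "ll");
    // Multibyte characters: byte offsets are resolved via char_indices,
    // so slicing stays on char boundaries.
    assert_eq!(spark_substr_negative("こんにちは世界", -2, 2), "世界");
    // The returned slice borrows from the input: no per-row allocation.
    assert_eq!(spark_substr_negative("🎉🎊🎈🎁", -3, 2), "🎊🎈");
    println!("ok");
}
```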

Contributor Author


Updated

SELECT substring('こんにちは世界', -2)

query
SELECT substring('🎉🎊🎈🎁', 2, 2)
Contributor


👍
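The emoji test above exercises exactly the case where naive byte-based slicing breaks: each of those code points occupies 4 bytes in UTF-8, so character positions and byte offsets diverge. A minimal stdlib illustration:

```rust
fn main() {
    let s = "🎉🎊🎈🎁";
    // 4 characters, but 16 bytes: byte offsets != character positions.
    assert_eq!(s.chars().count(), 4);
    assert_eq!(s.len(), 16);

    // SQL substring('🎉🎊🎈🎁', 2, 2) is 1-based: skip 1 char, take 2.
    let sub: String = s.chars().skip(1).take(2).collect();
    assert_eq!(sub, "🎊🎈");

    // Naive byte slicing at character positions would panic:
    // &s[1..3] is not on a char boundary.
    assert!(!s.is_char_boundary(1));
    println!("ok");
}
```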

Contributor

@comphead comphead left a comment


Thanks @kazuyukitanimura. I think we would also need to donate this to DF later.

Member

@andygrove andygrove left a comment


LGTM. Thanks @kazuyukitanimura.

nit: there is some overlap between the Scala and SQL tests, and the SQL tests may have been sufficient alone, but we can review later

@andygrove andygrove merged commit dc5929a into apache:main Apr 29, 2026
133 checks passed


Development

Successfully merging this pull request may close these issues.

substring incompatible with spark for negative start index
Native engine panics on all-scalar inputs for Substring and StringSpace

3 participants