Skip to content

Conversation

@krishan1390
Copy link
Contributor

@krishan1390 krishan1390 commented Dec 1, 2025

  1. Add ColumnarDataSource which enables clients to create multiple ColumnReaderFactory from same input file.
  2. Add Object getValue(int docId) which is used if clients don't know the data type to read from a column but want access based on docId

@codecov-commenter
Copy link

codecov-commenter commented Dec 1, 2025

Codecov Report

❌ Patch coverage is 12.50000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.23%. Comparing base (0ae7f87) to head (dd62289).
⚠️ Report is 1 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...egment/readers/PinotSegmentColumnarDataSource.java 0.00% 8 Missing ⚠️
...ocal/segment/readers/DefaultValueColumnReader.java 25.00% 3 Missing ⚠️
.../segment/readers/PinotSegmentColumnReaderImpl.java 0.00% 2 Missing ⚠️
...gment/processing/timehandler/EpochTimeHandler.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17293   +/-   ##
=========================================
  Coverage     63.22%   63.23%           
- Complexity     1474     1475    +1     
=========================================
  Files          3147     3148    +1     
  Lines        187575   187586   +11     
  Branches      28712    28710    -2     
=========================================
+ Hits         118601   118618   +17     
  Misses        59771    59771           
+ Partials       9203     9197    -6     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.19% <12.50%> (-0.01%) ⬇️
java-21 63.18% <12.50%> (+0.01%) ⬆️
temurin 63.23% <12.50%> (+<0.01%) ⬆️
unittests 63.23% <12.50%> (+<0.01%) ⬆️
unittests1 55.64% <0.00%> (+0.01%) ⬆️
unittests2 33.88% <12.50%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@krishan1390
Copy link
Contributor Author

The test case failures look unrelated. And are passing locally. So should be intermittent failures

Copy link
Contributor

@9aman 9aman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor queries. They are mostly questions and not as such suggestions for improvement.

* Get the value at the given document ID as a Java Object.
* Can be used for both single-value and multi-value columns.
* This should be used if
* 1. The data type is not known at compile time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compile time ??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea. updated the comment to clarify more

Assert.assertTrue(reader.hasNext());
Assert.assertFalse(reader.isNextNull());
Assert.assertTrue(reader.isNextNull());
Assert.assertEquals(reader.nextInt(), expectedValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change kind of makes these tests depict the wrong access behaviour

nextInt should only be called when isNextNull is false.
We can still call hasNext and next as it does not guarantee non-null returns.

Confused me a bit while reading.

Is guess the entire objective to return true in isNextNull is to ensure that the reader does not go ahead with some default value reads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is that hadNext() and next() access pattern for a long default column will give 0 while isNextNull() access pattern would expect the user to skip the value and handle nulls.

Or do we return null object as the default value ?

_defaultValue = defaultNullValue;

even for primitive types ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextInt should only be called when isNextNull is false.

Yes in https://github.com/apache/pinot/pull/17304/files#diff-b2f354bd2a7666251988f742a8e710c754933547eaa67fbff28992ed18997705 I am making the change to throw an exception for nextInt() in DefaultValueColumnReader

I am removing this change of returning true from this PR as its handled better in that PR

@Override
public ColumnReader getColumnReader(String columnName)
throws IOException {
public ColumnReader getColumnReader(String columnName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we moving away from throwing an exception. Is the logic similar to that of next() where the caller is expected to return null.

Is so, are there benefits of the same. ?

In case we are sticking to this, can we just update the interface to say that a null will be returned in case the ColumnReader is not present.

/**

  • Get a column reader for the specified column.
  • Implementations may cache and reuse readers for efficiency.
  • @param columnName Name of the column to read
  • @return ColumnReader instance for the specified column (may be cached)
    */
    @nullable ColumnReader getColumnReader(String columnName);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case for the method is to get a column reader from a input file.

The client needs to know if the input file doesn't contain the column. If we throw an IOException in such a case, we can't know if there is an exception in creating the reader or if the column doesn't exist.

Thus returning null can help the client differentiate that.

Currently PinotSegmentColumnReaderFactory creates a DefaultValueColumnReader if it doesn't exist. Ideally it shouldn't do that. The client should handle case where the input doesn't have the column (in some cases it might make sense to fail or not read that column etc)

I am not making the changes to remove DefaultValueColumnReader from PinotSegmentColumnReaderFactory because it might break something if some external code depends on that.

However other factories that we will create will return null if a column doesn't exist. So made the change here too to return null. This shouldn't break anything major if some external code is using this factory

* </ul>
* <p>
*/
public interface ColumnarDataSource extends Closeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noob question
@krishan1390

Why can't we directly use: ColumnReaderFactory

How and where does this additional level of abstraction help ?

I see an already written columnar reader that is being used during segment reload that directly relies on the factory ?

What value add do we expect from this interface in future ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the java doc in the class and createColumnReaderFactory() help in answering this ?

ColumnarDataSource is similar to RecordReaderFileConfig that it holds required attributes to initialise the ColumnReaderFactory / ColumnReader

For the same input attributes (file name, etc), we want to create independent instances of ColumnReaderFactory / ColumnReader

Copy link
Contributor

@9aman 9aman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@swaminathanmanish swaminathanmanish merged commit a9f36e4 into apache:master Dec 18, 2025
51 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants