
Cache local JAR file in ProtoBufCodeGenMessageDecoder to eliminate redundant remote fetches #18233

Open
rseetham wants to merge 1 commit into apache:master from rseetham:protobuf-jar-cache

Conversation


@rseetham rseetham commented Apr 16, 2026

Why

ProtoBufCodeGenMessageDecoder.init() is called once per consuming segment creation — once per topic partition every time a segment rolls over, and once per partition on server restart. Each call unconditionally fetched the protobuf schema JAR from remote storage (S3, HDFS, etc.) via ProtoBufUtils.getFileCopiedToLocal(), which copies the JAR into a new timestamped temp directory every time. The JAR only changes when the table's decoder config is updated, so in normal operation every fetch after the first is unnecessary network I/O.
Additionally, if the JAR is fetched from an object store and that connection is broken, ingestion currently stops. With this fix, ingestion continues using the cached copy.

What
Introduce a JVM-level ConcurrentHashMap<String, CachedJar> keyed by topicName. CachedJar stores the remote JAR path and the local File it was copied to. On every init():

  • Cache hit (same jarPath): return the cached local file immediately — no network call. Codegen and Janino compilation still run fresh per init(), which is correct because fieldsToRead can differ between decoder instances for the same topic.
  • jarPath changed (config update): fetch the new JAR, replace the cache entry.
  • Fetch failure with a stale entry: log an error and return the previously cached local file so segment creation succeeds rather than failing on a transient network issue. The error log states explicitly that rows decoded during this window may have used a stale schema.
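
The three cases above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: `JarCacheSketch` and `JarFetcher` are hypothetical names, and the fetcher is an assumed stand-in for `ProtoBufUtils.getFileCopiedToLocal()` so the logic can be exercised without remote storage.

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the per-topic JAR cache described above; not the PR's code. */
final class JarCacheSketch {
  /** Assumed stand-in for the remote fetch; expected to throw on network failure. */
  interface JarFetcher {
    File fetch(String jarPath) throws Exception;
  }

  /** Pairs the remote JAR path with the local file it was copied to. */
  static final class CachedJar {
    final String jarPath;
    final File localFile;

    CachedJar(String jarPath, File localFile) {
      this.jarPath = jarPath;
      this.localFile = localFile;
    }
  }

  private final Map<String, CachedJar> cache = new ConcurrentHashMap<>();
  private final JarFetcher fetcher;

  JarCacheSketch(JarFetcher fetcher) {
    this.fetcher = fetcher;
  }

  File resolveJar(String topicName, String jarPath) throws Exception {
    CachedJar cached = cache.get(topicName);
    // Cache hit: same remote path and the local copy still exists.
    if (cached != null && cached.jarPath.equals(jarPath) && cached.localFile.exists()) {
      return cached.localFile;
    }
    try {
      // Cache miss or changed jarPath: fetch and replace the entry.
      File localFile = fetcher.fetch(jarPath);
      cache.put(topicName, new CachedJar(jarPath, localFile));
      return localFile;
    } catch (Exception e) {
      // Fetch failed: fall back to the stale local copy if one is still usable.
      if (cached != null && cached.localFile.exists()) {
        System.err.println("Fetch of " + jarPath + " failed; reusing stale copy " + cached.localFile);
        return cached.localFile;
      }
      throw e;
    }
  }
}
```

With no usable cached entry, a fetch failure still propagates, so the first `init()` for a topic behaves as it did before the change.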

The URLClassLoader created to load the proto class is closed after the compiled Method is extracted, releasing the file handle immediately rather than accumulating them across segment rollovers.

How it behaves in each lifecycle event
Normal segment rollover: init() hits the cache, skips the remote fetch, runs codegen + Janino in memory (~ms), and returns. Each segment manager thread runs its own init() in parallel — no serialization across topics.

New table creation: First init() for that topicName — cache miss, JAR is fetched and cached. Subsequent segments for the same table hit the fast path.

Decoder config update (new JAR deployed): Next init() sees cached._jarPath != newJarPath, fetches the new JAR, replaces the cache entry.

Server restart: All cache entries are gone (JVM-level cache). Each partition's first init() after restart fetches the JAR once; subsequent rollovers hit the cache.

Tests

  • Existing behavioral tests are unchanged and continue to pass.
  • Added testCacheHit: two decoders initialized for the same topic and JAR both decode correctly, exercising the cache-hit path.
  • Added testStaleFallbackOnFetchFailure: a decoder initialized with an unreachable JAR path falls back to the previously cached local file and decodes correctly.

🤖 Generated with Claude Code

@xiangfu0 added the ingestion and plugins labels on Apr 16, 2026

Copilot AI left a comment


Pull request overview

This PR optimizes protobuf decoder initialization in pinot-protobuf by caching the locally-copied schema JAR to avoid repeated remote fetches during consuming segment rollovers, while adding tests for cache hit and fetch-failure fallback behavior.

Changes:

  • Add a JVM-level JAR cache and a resolveJar() path in ProtoBufCodeGenMessageDecoder.init().
  • Close the per-init URLClassLoader after codegen/Janino compilation.
  • Add unit tests covering cache-hit behavior and stale fallback on fetch failure.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File | Description
pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufCodeGenMessageDecoder.java | Introduces local JAR caching + fetch-failure fallback and classloader lifecycle updates.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufCodeGenMessageDecoderTest.java | Adds tests for cache hit and stale fallback; clears cache between tests.
pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufUtilsTest.java | Updates test classloading approach for descriptor lookup.

Comment on lines 57 to 60
URL jarFile = getClass().getClassLoader().getResource("complex_types.jar");
- ClassLoader clsLoader = ProtoBufCodeGenMessageDecoder.loadClass(jarFile.getPath());
+ ClassLoader clsLoader = new URLClassLoader(new URL[]{jarFile});
Descriptors.Descriptor desc = ProtoBufCodeGenMessageDecoder.getDescriptorForProtoClass(clsLoader,
"org.apache.pinot.plugin.inputformat.protobuf.ComplexTypes$TestMessage");

Copilot AI Apr 16, 2026


URLClassLoader implements Closeable, but this test never closes it. Use try-with-resources to avoid leaking an open JAR handle across the test suite (especially since this PR is explicitly addressing classloader/JAR handle churn).

Comment on lines +98 to +105
File localFile = resolveJar(topicName, jarPath);
URLClassLoader loader = new URLClassLoader(new URL[]{localFile.toURI().toURL()});
Descriptors.Descriptor descriptor = getDescriptorForProtoClass(loader, protoClassName);
String codeGenCode = new MessageCodeGen().codegen(descriptor, fieldsToRead);
- Class<?> recordExtractor = compileClass(protoMessageClsLoader,
+ Class<?> recordExtractor = compileClass(loader,
MessageCodeGen.EXTRACTOR_PACKAGE_NAME + "." + MessageCodeGen.EXTRACTOR_CLASS_NAME, codeGenCode);
_decodeMethod = recordExtractor.getMethod(MessageCodeGen.EXTRACTOR_METHOD_NAME, byte[].class, GenericRow.class);
loader.close();

Copilot AI Apr 16, 2026


URLClassLoader loader is only closed on the happy path. If getDescriptorForProtoClass(), codegen(), compileClass(), or getMethod() throws, the classloader (and underlying JAR handle) will leak. Use try-with-resources (or a finally) so the loader is always closed.
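
A minimal sketch of the try-with-resources shape suggested here, assuming the only thing that has to outlive the loader is the reflected `Method`. `LoaderLifecycleSketch` and `extractMethod` are hypothetical names; the real `init()` would also run codegen and Janino compilation inside the same scope.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical sketch: close the per-init loader on every path, including failures. */
final class LoaderLifecycleSketch {
  static Method extractMethod(File jarOrDir, String className, String methodName,
      Class<?>... paramTypes) throws Exception {
    // try-with-resources closes the loader whether the body returns or throws.
    try (URLClassLoader loader = new URLClassLoader(new URL[] {jarOrDir.toURI().toURL()})) {
      Class<?> cls = Class.forName(className, true, loader);
      return cls.getMethod(methodName, paramTypes);
    }
  }
}
```

Per the `URLClassLoader.close()` Javadoc, classes that were already loaded remain accessible after the loader is closed, but the loader can no longer load new classes or resources. Whether every class the generated extractor needs has been loaded by that point is something the PR would have to confirm.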

Comment on lines +119 to +123
private static File resolveJar(String topicName, String jarPath)
throws Exception {
CachedJar cached = JAR_CACHE.get(topicName);
if (cached != null && cached._jarPath.equals(jarPath) && cached._localFile.exists()) {
return cached._localFile;

Copilot AI Apr 16, 2026


The cache is keyed only by topicName. Pinot can have multiple tables consuming the same topic (note other components key by (table, topic)), and in that case a fetch failure could cause this decoder to fall back to a cached JAR from a different table/config, risking silent schema mismatch and bad ingestion. Consider including at least protoClassName (and/or jarPath) in the cache key, or restricting stale fallback so it cannot reuse an entry from a different configuration scope.

Contributor Author


For a given Kafka topic, the data can only be in one format, so the same JAR is used to read from it. If the schemas of different tables differ, only the relevant fields are extracted per table. Here we are only caching the JAR so that we don't keep fetching it every time.

}
try {
File localFile = ProtoBufUtils.getFileCopiedToLocal(jarPath);
localFile.getParentFile().deleteOnExit();

Copilot AI Apr 16, 2026


localFile.getParentFile().deleteOnExit() won’t delete the temp directory at JVM exit because the directory is non-empty (the JAR file remains). If you want cleanup-on-exit, register localFile.deleteOnExit() first and then register the directory, or otherwise perform recursive cleanup.

Suggested change
- localFile.getParentFile().deleteOnExit();
+ localFile.deleteOnExit();
+ File parentFile = localFile.getParentFile();
+ if (parentFile != null) {
+   parentFile.deleteOnExit();
+ }
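
One detail worth checking against the `java.io.File.deleteOnExit()` Javadoc: registered files and directories are deleted in the reverse order of registration, so a directory has to be registered before the file it contains for the file to be removed first. The sketch below uses the hypothetical name `TempJarCleanupSketch` to show that ordering, plus an eager recursive delete as the alternative mentioned in the comment.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical sketch of two cleanup options for the temp directory holding the JAR. */
final class TempJarCleanupSketch {
  /** Registers the parent first so that, in reverse order, the file is deleted before it. */
  static void registerDeleteOnExit(File localFile) {
    File parent = localFile.getParentFile();
    if (parent != null) {
      parent.deleteOnExit(); // registered first, deleted last
    }
    localFile.deleteOnExit(); // registered last, deleted first
  }

  /** Deletes a directory tree immediately, children before parents. */
  static void deleteRecursively(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      // Reverse path order visits nested entries before the directories that contain them.
      paths.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
    }
  }
}
```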

if (cached != null && cached._localFile.exists()) {
LOGGER.error("Failed to fetch JAR for topic '{}' from '{}', reusing stale local copy from '{}'. "
+ "Rows decoded with the stale schema may be incorrect if the schema has changed.",
topicName, jarPath, cached._jarPath, e);

Copilot AI Apr 16, 2026


The error log says it is "reusing stale local copy from '…'" but the value being logged is cached._jarPath (remote path), not the local file path. This makes troubleshooting difficult; log cached._localFile (and optionally also log the stale remote path separately).

Suggested change
- topicName, jarPath, cached._jarPath, e);
+ topicName, jarPath, cached._localFile, e);


codecov-commenter commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 89.28571% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.36%. Comparing base (87e09fd) to head (e9f3f4f).

Files with missing lines | Patch % | Lines
...format/protobuf/ProtoBufCodeGenMessageDecoder.java | 89.28% | 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18233   +/-   ##
=========================================
  Coverage     63.36%   63.36%           
  Complexity     1627     1627           
=========================================
  Files          3243     3243           
  Lines        197038   197054   +16     
  Branches      30466    30468    +2     
=========================================
+ Hits         124845   124856   +11     
- Misses        62195    62204    +9     
+ Partials       9998     9994    -4     
Flag | Coverage Δ
custom-integration1 | 100.00% <ø> (ø)
integration | 100.00% <ø> (ø)
integration1 | 100.00% <ø> (ø)
integration2 | 0.00% <ø> (ø)
java-11 | 63.32% <89.28%> (+0.02%) ⬆️
java-21 | 63.31% <89.28%> (-0.02%) ⬇️
temurin | 63.36% <89.28%> (+<0.01%) ⬆️
unittests | 63.35% <89.28%> (+<0.01%) ⬆️
unittests1 | 55.32% <ø> (+<0.01%) ⬆️
unittests2 | 34.94% <89.28%> (-0.01%) ⬇️



@xiangfu0 xiangfu0 left a comment


Found one high-signal ingestion correctness risk; see inline comment.

private static File resolveJar(String topicName, String jarPath)
throws Exception {
CachedJar cached = JAR_CACHE.get(topicName);
if (cached != null && cached._jarPath.equals(jarPath) && cached._localFile.exists()) {

This changes the rollout semantics from 're-fetch the protobuf JAR on every init' to 'reuse it indefinitely while the URI string stays the same'. Pinot deployments often replace the decoder JAR in place at the same S3/HDFS path during schema rollouts; after this cache hits once, long-lived servers will keep decoding with the old generated classes until restart, which can silently ingest rows with the wrong schema. We need either a freshness/version check here or an explicit versioned-URI contract before making the cached file authoritative.


@rseetham rseetham Apr 16, 2026


I experimented with a background job that re-fetches the JAR every hour. The problem is that the plugin module does not have access to the server's executor service, so I would have to create and manage one here. I don't think that is a good solution.

Another option is a cache with a TTL of an hour or some configured value (a server property), which would also force a periodic fetch. The catch is that if the segment completion time matches the cache expiration, all segments complete at the same time and would wait for the JAR fetch anyway, so users would have to set this more carefully. But it solves the problem. I'll add this and address the other smaller comments that were raised.

Still, the fundamental issue with both approaches is that, if the JAR was changed in place, the only way to force a fetch is a restart. At the moment, a table force commit forces a fetch. During incidents, telling users that a refresh will take an hour will be a problem.

Is there another solution you'd suggest?
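
For discussion, a minimal sketch of the TTL idea described above. Everything here is hypothetical: `TtlJarCacheSketch`, the `ttlMs` parameter (which a real version might read from a server property), and the injected clock, which exists only to make expiry testable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a TTL freshness check layered on a per-topic cache. */
final class TtlJarCacheSketch {
  /** Remote path plus the time the local copy was last fetched. */
  static final class Entry {
    final String jarPath;
    final long fetchedAtMs;

    Entry(String jarPath, long fetchedAtMs) {
      this.jarPath = jarPath;
      this.fetchedAtMs = fetchedAtMs;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();
  private final long ttlMs;
  private final LongSupplier clockMs;

  TtlJarCacheSketch(long ttlMs, LongSupplier clockMs) {
    this.ttlMs = ttlMs;
    this.clockMs = clockMs;
  }

  /** True when there is no entry, the remote path changed, or the entry is older than the TTL. */
  boolean needsFetch(String topicName, String jarPath) {
    Entry entry = cache.get(topicName);
    if (entry == null || !entry.jarPath.equals(jarPath)) {
      return true;
    }
    return clockMs.getAsLong() - entry.fetchedAtMs >= ttlMs;
  }

  /** Records a successful fetch; expired entries are overwritten here, never evicted. */
  void recordFetch(String topicName, String jarPath) {
    cache.put(topicName, new Entry(jarPath, clockMs.getAsLong()));
  }
}
```

Because an expired entry stays in the map until a fetch succeeds, a stale-copy fallback on fetch failure would still be possible with this shape. It does not address the concern about forcing an immediate refresh during an incident.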


4 participants