Fix content type detection with leading whitespace #32632

jasontedor · 2018-08-05T17:34:42Z

Today, content type detection on an input stream works by peeking up to twenty bytes into the stream. If the stream begins with more than twenty bytes of whitespace, we might fail to detect the content type. We should ignore this whitespace before attempting detection. This commit does that by skipping all leading whitespace in an input stream before attempting to guess the content type.

Relates #32357
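
To make the failure mode in the description concrete, here is a minimal sketch. It is not the Elasticsearch code: `PeekWindowDemo`, `looksLikeJsonWithinWindow`, and `padded` are hypothetical names, and the only value taken from the description is the twenty-byte peek. It checks whether a JSON opening brace is visible inside a fixed window at the head of a stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class PeekWindowDemo {
    // Assumption: 20 mirrors the "twenty bytes" mentioned in the description.
    static final int WINDOW = 20;

    // Hypothetical helper: peeks at most WINDOW bytes and reports whether the
    // first non-whitespace byte inside that window is '{'. The stream is reset
    // afterwards so the caller can still read it from the start.
    static boolean looksLikeJsonWithinWindow(InputStream in) throws IOException {
        in.mark(WINDOW);
        try {
            for (int i = 0; i < WINDOW; i++) {
                int b = in.read();
                if (b == -1) {
                    return false; // stream ended before any content
                }
                if (!Character.isWhitespace(b)) {
                    return b == '{';
                }
            }
            return false; // window exhausted, only whitespace seen
        } finally {
            in.reset();
        }
    }

    // Builds a small JSON body preceded by the given number of spaces.
    static InputStream padded(int spaces) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < spaces; i++) {
            sb.append(' ');
        }
        sb.append("{\"field\":1}");
        return new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(looksLikeJsonWithinWindow(padded(5)));  // true
        System.out.println(looksLikeJsonWithinWindow(padded(25))); // false
    }
}
```

With 25 leading spaces the brace sits outside the window, so a fixed-size peek cannot see it no matter how the bytes inside the window are interpreted.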

elasticmachine · 2018-08-05T17:35:05Z

Pinging @elastic/es-core-infra

andrershov · 2018-08-05T19:18:31Z

@jasontedor what is the reason for repeatedly doubling readLimit in the si.mark() call? Why not just call si.mark(Integer.MAX_VALUE-8) and skip all whitespace during the first pass? I think all sane implementations allocate the read buffer lazily, increasing its capacity when the data does not fit.
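
The single-pass idea in this comment can be sketched as follows. This is an illustration only: `LargeReadLimitDemo` and `countLeadingWhitespace` are hypothetical names, and the remark about lazy growth is an assumption about `java.io.BufferedInputStream` as implemented in OpenJDK, not a guarantee for every `InputStream`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class LargeReadLimitDemo {
    // Reads past leading whitespace under one large mark, then rewinds.
    // Returns the number of whitespace bytes seen before the first
    // non-whitespace byte, or -1 if the stream holds nothing but whitespace.
    static int countLeadingWhitespace(InputStream in) throws IOException {
        // The read limit is an upper bound on how far we may read before
        // reset(). Assumption: the stream grows its buffer on demand rather
        // than allocating this much up front (OpenJDK's BufferedInputStream does).
        in.mark(Integer.MAX_VALUE - 8);
        try {
            int skipped = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (!Character.isWhitespace(b)) {
                    return skipped;
                }
                skipped++;
            }
            return -1;
        } finally {
            in.reset(); // rewind so the caller sees the stream from the start
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1002];
        Arrays.fill(data, 0, 1000, (byte) ' ');
        data[1000] = '{';
        data[1001] = '}';
        // Deliberately tiny 16-byte initial buffer, far smaller than the input.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        System.out.println(countLeadingWhitespace(in)); // 1000
        System.out.println(in.read());                  // 32, the first space again
    }
}
```

The second line of output shows that `reset()` still rewinds to the start even though far more than the initial buffer size was read under the mark.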

bleskes · 2018-08-05T19:48:24Z

libs/x-content/src/main/java/org/elasticsearch/common/xcontent/XContentFactory.java

-                    break;
+        int iteration = 1;
+        while (true) {
+            si.mark(iteration * GUESS_HEADER_LENGTH);


This is where I was a bit uncomfortable with requiring the underlying stream to have unbounded support for mark. If I read things correctly, we end up with a stream that just wraps a byte reference, and we therefore don't care. If you're comfortable relying on that behavior, then I'm good, but then I also think we don't need to allocate a growing firstBytes array; we can instead read byte by byte until we find the first non-whitespace byte.

I am comfortable with relying on that behavior. I pushed 86227ba.
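
A rough sketch of the byte-by-byte approach suggested in this thread, under stated assumptions: it is not the code pushed in 86227ba, `HeaderAfterWhitespace` and `headerAfterWhitespace` are hypothetical names, `GUESS_HEADER_LENGTH` reuses the constant name from the diff with the value 20 assumed from the description, and the stream is assumed to tolerate an effectively unbounded mark.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HeaderAfterWhitespace {
    // Assumption: name borrowed from the diff, value from the description.
    static final int GUESS_HEADER_LENGTH = 20;

    // Returns up to GUESS_HEADER_LENGTH bytes starting at the first
    // non-whitespace byte, or an empty array if there is no such byte.
    // The stream is reset to its starting position before returning.
    static byte[] headerAfterWhitespace(InputStream si) throws IOException {
        if (!si.markSupported()) {
            throw new IllegalArgumentException("mark/reset support is required");
        }
        si.mark(Integer.MAX_VALUE);
        try {
            // Scan one byte at a time until the first non-whitespace byte.
            int current;
            do {
                current = si.read();
                if (current == -1) {
                    return new byte[0];
                }
            } while (Character.isWhitespace(current));

            // Collect the header, starting with the byte just read.
            byte[] header = new byte[GUESS_HEADER_LENGTH];
            header[0] = (byte) current;
            int filled = 1;
            while (filled < header.length) {
                int n = si.read(header, filled, header.length - filled);
                if (n == -1) {
                    break; // stream shorter than the header window
                }
                filled += n;
            }
            return Arrays.copyOf(header, filled);
        } finally {
            si.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        String body = "                              {\"a\":1}"; // 30 leading spaces
        InputStream si = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        byte[] header = headerAfterWhitespace(si);
        System.out.println(new String(header, StandardCharsets.UTF_8)); // {"a":1}
    }
}
```

The returned bytes could then be handed to whatever guesses the content type; no growing array is needed because the whitespace itself is never stored.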

…pe-detection-with-leading-whitespace

* elastic/master: (34 commits)
  Cross-cluster search: preserve cluster alias in shard failures (elastic#32608)
  Handle AlreadyClosedException when bumping primary term
  [TEST] Allow to run in FIPS JVM (elastic#32607)
  [Test] Add ckb to the list of unsupported languages (elastic#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (elastic#32068)
  Painless: Use LocalMethod Map For Lookup at Runtime (elastic#32599)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (elastic#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (elastic#32613)
  [Rollup] Update wire version check after backport
  Suppress Wildfly test in FIPS JVMs (elastic#32543)
  [Rollup] Improve ID scheme for rollup documents (elastic#32558)
  ingest: doc: move Dot Expander Processor doc to correct position (elastic#31743)
  [ML] Add some ML config classes to protocol library (elastic#32502)
  [TEST]Split transport verification mode none tests (elastic#32488)
  Core: Move helper date formatters over to java time (elastic#32504)
  [Rollup] Remove builders from DateHistogramGroupConfig (elastic#32555)
  [TEST} unmutes SearchAsyncActionTests and adds debugging info
  [ML] Add Detector config classes to protocol library (elastic#32495)
  [Rollup] Remove builders from MetricConfig (elastic#32536)
  ...

jasontedor · 2018-08-06T12:46:53Z

@andrershov It was to deal with the possibility of insane implementations, but it appears that we should always have a sane implementation here.

bleskes

LGTM

jasontedor requested a review from bleskes August 5, 2018 17:34

jasontedor added the >bug, review, :Core/Infra/Core, v7.0.0, v6.4.0, v5.6.11, and v6.5.0 labels Aug 5, 2018

jasontedor mentioned this pull request Aug 5, 2018

20+ whitespace chars in bulk insert makes index unsearchable #32357

Closed

bleskes reviewed Aug 5, 2018

jasontedor added 2 commits August 6, 2018 08:11

Iteration

86227ba

jasontedor requested a review from bleskes August 6, 2018 12:47

bleskes approved these changes Aug 6, 2018

jasontedor merged commit 3fb0923 into elastic:master Aug 6, 2018

jasontedor deleted the improve-content-type-detection-with-leading-whitespace branch August 6, 2018 22:33

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
