This repository has been archived by the owner. It is now read-only.

Allow leading zeros for NumberedFileInputSplit #420

Merged
merged 3 commits into from Nov 14, 2017

Conversation

Projects
None yet
3 participants
@turambar
Copy link
Contributor

turambar commented Sep 5, 2017

What changes were proposed in this pull request?

With these changes, NumberedFileInputSplit permits two kinds of formatting -- "prefix4.suffix" or "prefix0004.suffix." Latter is more suitable for, e.g., Hadoop mapfiles spit out by a Spark ETL pipeline.

How was this patch tested?

Added a NumberedFileInputSplitTests class with some unit tests to verify the name formatting works. I've also run it in live code, and it performed fine.

@turambar

This comment has been minimized.

Copy link
Contributor

turambar commented Sep 5, 2017

Addresses this issue: #421

@AlexDBlack
Copy link
Member

AlexDBlack left a comment

Assuming all tests pass, LGTM.

@huitseeker
Copy link
Contributor

huitseeker left a comment

Please check your multiple-digit cases and document one way or another the fragment of printf you support..

@@ -1,19 +1,3 @@
/*-

This comment has been minimized.

@huitseeker

huitseeker Sep 6, 2017

Contributor

This should not be removed

@@ -39,15 +25,18 @@
private final int minIdx;
private final int maxIdx;

private static final Pattern p = Pattern.compile("\\%(0\\d)?d");

This comment has been minimized.

@huitseeker

huitseeker Sep 6, 2017

Contributor

To be compliant with printf-formatted strings, which is the format your users will expect to deal with, you have to at least document the following interactions in unit tests (and possibly do something better in the cases below) — they are all valid printf format strings.

scala> val pat = "\\%(0\\d)?d".r
pat: scala.util.matching.Regex = \%(0\d)?d

scala> def f(x: String): String = x match {
     |   case pat.unanchored(y) => s"Matched, with pattern $y"
     |   case _ => "UNMATCHED"
     | }
f: (x: String)String

scala> f("/path/to/files/prefix%5d.suffix")
res24: String = UNMATCHED

scala> f("/path/to/files/prefix%+5d.suffix")
res26: String = UNMATCHED

scala> f("/path/to/files/prefix%-5d.suffix")
res27: String = UNMATCHED

scala> f("/path/to/files/prefix%011d.suffix")
res28: String = UNMATCHED

scala> f("/path/to/files/prefix%101d.suffix")
res29: String = UNMATCHED

scala> f("/path/to/files/prefix%0505d.suffix")
res30: String = UNMATCHED
}

@Test(expected = IllegalArgumentException.class)
public void testNumberedFileInputSplitWithLeadingSpaces() {

This comment has been minimized.

@huitseeker

huitseeker Sep 6, 2017

Contributor

O_o

@huitseeker
Copy link
Contributor

huitseeker left a comment

@turambar FTFY

@huitseeker huitseeker force-pushed the dck_numFileInputSplit branch from 56a34b4 to 2a6eafb Nov 10, 2017

@huitseeker huitseeker merged commit 736c041 into master Nov 14, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.