METRON-682: Unify and Improve the Flat File Loader #432
… flatfile load script
I know it looks like a lot of code changed, but most of this is reorganization rather than new code: restructuring the flat file loader class, reformatting old code to conform to the current spacing standard, and splitting it into separate reusable components.
Testing Plan
Preliminaries
Import from URL
Import from local file (non-zipped)
Import from local file (gzipped)
Import from local file (zipped)
Import from HDFS via MR
This is a good PR. I actually pulled this into a branch to test out top-domain (Alexa) enrichments. The flexible file loading is helpful, and the enhancements to the integration tests are nice. No real issues on inspection. The architectural changes seem reasonable to me and fit with the idioms employed in other parts of the system.
import java.util.Map;

public interface InputFormatHandler {
  void set(Job job, Path input, Map<String, Object> config) throws IOException;
  void set(Job job, List<Path> input, Map<String, Object> config) throws IOException;
I like
import java.util.Optional;

public abstract class OptionHandler implements Function<String, Option>
Would this be useful in metron-common? Doesn't have to be this PR, but maybe "LoadOptions" becomes a parameterized type OPT_TYPE.
Job job = Job.getInstance(hadoopConfig);
List<String> inputs = (List<String>) config.get(LoadOptions.INPUT).get();
job.setJobName("MapReduceImporter: " + inputs.stream().collect(Collectors.joining(",")) + " => " + table + ":" + cf);
System.out.println("Configuring " + job.getJobName());
Why System.out instead of a logger?
good catch
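For illustration, the change discussed here might look roughly like the following. This is only a sketch under stated assumptions: it uses java.util.logging so that it runs with nothing but the JDK (the project's actual logging framework may well differ), and the class name `JobNameLoggingSketch`, the helper `jobName`, and the sample inputs are hypothetical, not taken from the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;
import java.util.stream.Collectors;

// Hypothetical sketch: not the PR's actual code.
class JobNameLoggingSketch {
  // Assumption: java.util.logging stands in for whatever logger the project uses.
  private static final Logger LOG = Logger.getLogger(JobNameLoggingSketch.class.getName());

  // Builds the job name the same way the reviewed snippet does:
  // comma-joined inputs, then the target table and column family.
  static String jobName(List<String> inputs, String table, String cf) {
    return "MapReduceImporter: "
        + inputs.stream().collect(Collectors.joining(","))
        + " => " + table + ":" + cf;
  }

  public static void main(String[] args) {
    // Placeholder values for illustration only.
    String name = jobName(Arrays.asList("/tmp/a.csv", "/tmp/b.csv"), "enrichment", "t");
    // Log through the logger rather than printing to System.out.
    LOG.info("Configuring " + name);
  }
}
```

Routing the message through a logger lets the deployment's logging configuration decide where it goes and at what level, which System.out cannot do.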
I am trying to test this out, but my VM is not doing well at the moment. Hopefully I can get it straightened out.
[vagrant@node1 tmp]$ /usr/metron/0.3.0/bin/flatfile_loader.sh -i http://s3.amazonaws.com/alexa-static/top-1m.csv.zip -t enrichment -c t -e ./extractor.json -p 5 -b 128
with extractor.json of:
@ottobackwards your
Also, I'll point out that you can make your life easier by killing pretty much everything on your Vagrant VM before doing this. The loader only relies on HBase and MR. I would suggest killing:
I'll point out as well that it'd be nice to have a decent exception there, kinda like what you'd get from jsonlint.com:
That might be worth a JIRA, honestly.
Works like a champ. Nice job. +1
Currently the flat file loader is deficient in a couple of ways:
This JIRA will:
Testing plan in comments