# Random notes while looking through code

## What happens when we start a Kiji MR job?

Look at KijiProduce in KijiMR. This is the command-line tool we use to run a Kiji producer job. Before launching the Hadoop job, the tool sets up a number of configuration parameters on an instance of KijiProduceJobBuilder. The KijiProduce instance calls its configure(jobBuilder) method, which in turn calls the superclass methods of the same name. These methods seem to seed the KijiProduceJobBuilder with the default Hadoop Configuration settings and then layer producer-specific settings on top. See, for example, JobTool.java, line 101:

jobBuilder.withConf(getConf());

(JobTool extends BaseTool, which extends Configured.)
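To make that handoff concrete, here is a minimal sketch of the Configured/getConf/withConf pattern. SketchTool and SketchJobBuilder are hypothetical stand-ins for JobTool and KijiProduceJobBuilder; only Configured and its getConf() are real Hadoop API here.

```java
// Minimal sketch of the Configured/withConf handoff described above.
// SketchTool and SketchJobBuilder are hypothetical stand-ins for JobTool and
// KijiProduceJobBuilder; only Configured.getConf() is real Hadoop API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

public class SketchTool extends Configured {

  /** Hypothetical builder that just remembers the base Configuration. */
  static class SketchJobBuilder {
    private Configuration mConf;

    SketchJobBuilder withConf(Configuration conf) {
      mConf = conf;  // keep the base Hadoop Configuration for later
      return this;   // return 'this' so calls can be chained, builder-style
    }
  }

  /** Same shape as JobTool.java line 101: pass the tool's Configuration on. */
  void configure(SketchJobBuilder jobBuilder) {
    // getConf() comes from Configured (populated by ToolRunner via setConf).
    jobBuilder.withConf(getConf());
  }
}
```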

KijiProduceJobBuilder extends KijiTableInputJobBuilder, which extends MapReduceJobBuilder. Together, these three classes hold most of the state we clearly need to run a Kiji-flavored Hadoop job (a rough sketch of how the pieces fit together follows the list):

- JAR directories
- A base Hadoop Configuration object
- A MapReduceJobOutput
- A map from keys to KeyValueStore instances
- KijiURI
- Starting and ending EntityIds
- The KijiProducer class
- KijiProducer, KijiMapper, and KijiReducer instances
- A KijiDataRequest
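
For reference, assembling a producer job from those pieces probably looks something like the snippet below. Only withConf() and build() are taken from these notes; KijiProduceJobBuilder.create(), withInputTable(), withProducer(), withOutput(), DirectKijiTableMapReduceJobOutput, and KijiURI.newBuilder() are my best guesses at the KijiMR/KijiSchema API and need to be checked against the actual builders (MyProducer is a placeholder producer class).

```java
// Rough sketch only: withConf() and build() appear in the notes, but the
// other with*() methods, DirectKijiTableMapReduceJobOutput, and
// KijiURI.newBuilder() are best guesses at the KijiMR/KijiSchema API.
Configuration conf = getConf();                    // base Hadoop Configuration
KijiURI tableURI =                                 // the KijiURI from the list
    KijiURI.newBuilder("kiji://.env/default/my_table").build();

MapReduceJob job = KijiProduceJobBuilder.create()  // guessed factory method
    .withConf(conf)                                // cf. JobTool.java line 101
    .withInputTable(tableURI)                      // guessed input-table setter
    .withProducer(MyProducer.class)                // the KijiProducer class
    .withOutput(new DirectKijiTableMapReduceJobOutput(tableURI))  // a MapReduceJobOutput
    .build();                                      // MapReduceJobBuilder.build(), ~line 116
```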

Eventually these builders give us a fully configured MapReduce job, ready to run (see the build() method around line 116 of MapReduceJobBuilder). In our Spark code, we should be able to call getConfiguration() on the Job we get from the MapReduceJobBuilder and hand that Configuration to newAPIHadoopRDD.
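Here is a sketch of that last step. Job.getConfiguration() and JavaSparkContext.newAPIHadoopRDD() are standard Hadoop/Spark APIs; the Kiji class names and package paths (KijiTableInputFormat, EntityId, KijiRowData) and the buildKijiJob() helper are assumptions that need to be verified against what the builder actually configures.

```java
// Sketch of handing the builder's Configuration to Spark. getConfiguration()
// and newAPIHadoopRDD() are standard Hadoop/Spark APIs; the Kiji class names
// and package paths below are assumptions and should be verified.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.kiji.mapreduce.framework.KijiTableInputFormat;  // assumed package path
import org.kiji.schema.EntityId;                           // assumed package path
import org.kiji.schema.KijiRowData;                        // assumed package path

public class KijiSparkSketch {
  public static void main(String[] args) throws Exception {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("kiji-rdd-sketch"));

    // Hypothetical helper: however we end up getting the Hadoop Job that
    // MapReduceJobBuilder.build() configures (possibly via a wrapper object).
    Job job = buildKijiJob();
    Configuration conf = job.getConfiguration();  // the fully populated config

    // Assumes KijiTableInputFormat produces (EntityId, KijiRowData) pairs;
    // if the builder wires up a different InputFormat, change these classes.
    JavaPairRDD<EntityId, KijiRowData> rows = sc.newAPIHadoopRDD(
        conf, KijiTableInputFormat.class, EntityId.class, KijiRowData.class);

    System.out.println("rows: " + rows.count());
    sc.stop();
  }

  private static Job buildKijiJob() {
    throw new UnsupportedOperationException("wire up the Kiji job builder here");
  }
}
```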

We can also just look at KijiTap and KijiScheme in KijiExpress; between them, they do essentially everything we need.