--jobconf mapred.task.timeout does not appear to change timeout #257

Closed
ekulakow opened this issue Nov 8, 2011 · 13 comments

@ekulakow

ekulakow commented Nov 8, 2011

Hi guys,

I've been running a decently large job that I'm having trouble completing: one of the reducers finishes its work but then times out during the file upload. I tried adding --jobconf mapred.task.timeout=1800000 to the command line, but the job still appears to terminate after 10 minutes of upload time.

I'm not sure why the upload should be having problems, since the file doesn't seem outrageously large, though it is significantly larger than the outputs of most of the other reducers (the key space is not well balanced; I'm going to try fixing that, but it shouldn't matter for this bug). In any case, it looks like something is timing out the task during the upload stage, and setting mapred.task.timeout to a larger number isn't being honored.
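For reference, the full invocation looks roughly like this (the script name and S3 paths are placeholders, not my actual job):

python my_big_job.py -r emr s3://path/to/input/ \
    --output-dir s3://path/to/output/ \
    --jobconf mapred.task.timeout=1800000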

I'm on development-v0.2.x. As far as other options go, my mrjob.conf has

bootstrap_actions: &boots
- s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

and

base_tmp_dir: /mnt/tmp/

The bug was still happening without the memory-intensive bootstrap, though I didn't try changing mapred.task.timeout without it.

Some log file stuff:

2011-11-08 02:03:40,421 INFO org.apache.hadoop.mapred.Merger (main): Merging 9 sorted segments
2011-11-08 02:03:41,284 INFO org.apache.hadoop.mapred.Merger (main): Down to the last merge-pass, with 9 segments left of total size: 1648153512 bytes
2011-11-08 02:03:42,535 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://path/to/output/part-00050' in S3
2011-11-08 02:03:42,537 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'path/to/output/part-00050' writing to tempfile '/mnt/var/lib/hadoop/s3/output-5430952274196003800.tmp'
2011-11-08 02:03:42,578 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=1/0/0 in:0=1/92 [rec/s] out:0=0/92 [rec/s]
2011-11-08 02:03:42,579 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=10/0/0 in:0=10/92 [rec/s] out:0=0/92 [rec/s]
2011-11-08 02:03:42,579 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=100/0/0 in:1=100/92 [rec/s] out:0=0/92 [rec/s]

...

2011-11-08 02:11:14,944 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=9700000/210394/0 in:17798=9700000/545 [rec/s] out:386=210394/545 [rec/s]
2011-11-08 02:11:19,745 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=9800000/210394/0 in:17850=9800000/549 [rec/s] out:383=210394/549 [rec/s]
2011-11-08 02:11:24,922 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=9900000/210394/0 in:17870=9900000/554 [rec/s] out:379=210394/554 [rec/s]
2011-11-08 02:11:32,005 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=10000000/210394/0 in:17793=10000000/562 [rec/s] out:374=210394/562 [rec/s]

End Of File

Anything else I can look at?

@coyotemarin
Collaborator

I'm not sure what's wrong, but I think we've seen a similar issue at Yelp. I'll post here if/when we find out anything more.

@ekulakow
Author

ekulakow commented Nov 8, 2011

Are you also sorting logs by IP to have 1 IP per line? :-p

I'll let you know how redistributing the key space works.

@coyotemarin
Collaborator

Haha, no, but that sounds like something that would produce very long lines for a handful of heavily trafficked IPs. :)

@ekulakow
Author

ekulakow commented Nov 8, 2011

I think I'm just trying to soak up way too much data with the [i for i in values] list expansion for one specific key. I'll switch to a (slower but) safer way of doing that.
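For illustration, the safer pattern I have in mind is roughly this (MRHotKeyJob is a made-up sketch, not my actual job):

from mrjob.job import MRJob

class MRHotKeyJob(MRJob):
    # Hypothetical job, just to show the pattern: stream over the values
    # generator instead of expanding it into a list with [i for i in values],
    # so memory stays flat even when one key has millions of values.
    def reducer(self, key, values):
        count = 0
        for _ in values:
            count += 1
        yield key, count

if __name__ == '__main__':
    MRHotKeyJob.run()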

If you're curious, here are some errors.

Redistributing the keys has produced a new error, and I also found an exit status for the reduce task:

2011-11-08 21:20:32,987 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 19 on 9001): Error from attempt_201111081851_0001_r_000060_2: java.io.IOException: Task process exit with nonzero status of 137.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

Waiting 5.0s for S3 eventual consistency
Job failed with status SHUTTING_DOWN: Shut down as step failed
Logs are in s3://mrjob-1ac31f1ce091d5aa/tmp/logs/j-3UK9OSOU8SYVG/
Scanning logs for probable cause of failure
Probable cause of failure (from s3://mrjob-1ac31f1ce091d5aa/tmp/logs/j-3UK9OSOU8SYVG/task-attempts/attempt_201111081851_0001_m_000435_1/syslog):
java.io.IOException: log:null
(while reading from s3://path/to/input/2011-07/2011-07-72.gz)
Terminating job flow: j-3UK9OSOU8SYVG

The 1st try on this mapper finishes successfully.

One issue that I don't know how to address: I'm not getting my status messages or counters out, and I don't know where to find them.

Also the log files under "steps" show the following strange behavior:

2011-11-08 20:15:29,332 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 98%
2011-11-08 20:15:38,369 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 99%
2011-11-08 20:17:18,838 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%
2011-11-08 20:42:37,569 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 98%
2011-11-08 20:42:47,614 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 99%
2011-11-08 20:44:17,894 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%

And the offending node appears to have run out of memory:

2011-11-08 20:15:07,820 WARN org.apache.hadoop.dfs.DataNode (DataNode: [/mnt/var/lib/hadoop/dfs]): java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
at org.apache.hadoop.dfs.FSDataset$FSVolume.getCapacity(FSDataset.java:334)
at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:495)
at org.apache.hadoop.dfs.FSDataset.getCapacity(FSDataset.java:692)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:691)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2970)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 9 more

2011-11-08 20:15:10,260 WARN org.apache.hadoop.dfs.DataNode (DataNode: [/mnt/var/lib/hadoop/dfs]): java.io.IOException: Cannot run program "du": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DU.access$200(DU.java:29)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:84)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 5 more

@coyotemarin
Collaborator

Try the development branch. v0.2.8 can only get logs from S3, and we can't control how fast the logs get transferred there, but the development branch can actually SSH in and grab them.

You'll have to use a key-pair file if you aren't already; see http://packages.python.org/mrjob/amazon.html#ssh-tunneling.
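If it's not set up already, the relevant bits of mrjob.conf look something like this (the key pair name and file path here are just examples):

runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem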

@ekulakow
Author

ekulakow commented Nov 8, 2011

Cool, I didn't know that about the development branch.

@irskep
Contributor

irskep commented Feb 2, 2012

You might try using the configure-hadoop bootstrap action to have the option set at bootstrap time rather than job run time.
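Something along these lines in mrjob.conf, for instance (the -m switch for setting mapred-site values is from memory; double-check it against the configure-hadoop bootstrap action docs):

bootstrap_actions:
- s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=1800000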

@irskep
Contributor

irskep commented Mar 19, 2012

I'm pretty sure you have to do as I suggested for this option to work. Dave, please close if you agree, or someone provide evidence to the contrary.

@coyotemarin
Collaborator

No, we've definitely made this work in the past with just --jobconf. We should probably hand-test it before closing this ticket.

@irskep
Contributor

irskep commented Mar 23, 2012

It definitely works with configure-hadoop and not with --jobconf. I feel like if we were using --jobconf incorrectly we would be getting actual errors.

@coyotemarin
Collaborator

Huh. Maybe it only worked with Hadoop 0.18. That's interesting.

Is this something we want to add to the documentation, or should we just close this issue?

@irskep
Contributor

irskep commented Mar 26, 2012

@coyotemarin
Collaborator

Ah, right. Cool, changing this to a "Docs" issue.
