--jobconf mapred.task.timeout does not appear to change timeout #257

Closed
ekulakow opened this issue Nov 8, 2011 · 13 comments

@ekulakow

ekulakow commented Nov 8, 2011

Hi guys,

I've been running a decently large job that I'm having trouble completing: one of the reducers finishes its work but then times out during the file upload. I tried adding --jobconf mapred.task.timeout=1800000 to the command line, but the job still appears to terminate after 10 minutes of upload time.

I'm not sure why the upload should be having problems, since the file doesn't seem outrageously large, though it is significantly larger than the outputs of most of the other reducers (the key space is not well balanced; I'm going to try fixing that, but it shouldn't matter for this bug). In any case, it looks like something is timing out the task during the upload stage, and setting mapred.task.timeout to a larger number isn't being honored.
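For reference, the full invocation looks roughly like this (the script name and S3 paths are placeholders, not my actual job):

python my_big_job.py -r emr s3://path/to/input/ \
    --output-dir s3://path/to/output/ \
    --jobconf mapred.task.timeout=1800000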

I'm on development-v0.2.x. As far as other options go, my mrjob.conf has

bootstrap_actions: &boots
- s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

and

base_tmp_dir: /mnt/tmp/

The bug was still happening without the memory-intensive bootstrap, though I didn't try changing mapred.task.timeout without it.

Some log file stuff:

2011-11-08 02:03:40,421 INFO org.apache.hadoop.mapred.Merger (main): Merging 9 sorted segments
2011-11-08 02:03:41,284 INFO org.apache.hadoop.mapred.Merger (main): Down to the last merge-pass, with 9 segments left of total size: 1648153512 bytes
2011-11-08 02:03:42,535 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://path/to/output/part-00050' in S3
2011-11-08 02:03:42,537 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'path/to/output/part-00050' writing to tempfile '/mnt/var/lib/hadoop/s3/output-5430952274196003800.tmp'
2011-11-08 02:03:42,578 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=1/0/0 in:0=1/92 [rec/s] out:0=0/92 [rec/s]
2011-11-08 02:03:42,579 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=10/0/0 in:0=10/92 [rec/s] out:0=0/92 [rec/s]
2011-11-08 02:03:42,579 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=100/0/0 in:1=100/92 [rec/s] out:0=0/92 [rec/s]

...

2011-11-08 02:11:14,944 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=9700000/210394/0 in:17798=9700000/545 [rec/s] out:386=210394/545 [rec/s]
2011-11-08 02:11:19,745 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=9800000/210394/0 in:17850=9800000/549 [rec/s] out:383=210394/549 [rec/s]
2011-11-08 02:11:24,922 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=9900000/210394/0 in:17870=9900000/554 [rec/s] out:379=210394/554 [rec/s]
2011-11-08 02:11:32,005 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=10000000/210394/0 in:17793=10000000/562 [rec/s] out:374=210394/562 [rec/s]

End Of File

Anything else I can look at?

@coyotemarin
Collaborator

I'm not sure what's wrong, but I think we've seen a similar issue at Yelp. I'll post here if/when we find out anything more.

@ekulakow
Author

ekulakow commented Nov 8, 2011

Are you also sorting logs by IP to have 1 IP per line? :-p

I'll let you know how redistributing the key space works.

@coyotemarin
Collaborator

Haha, no, but that sounds like something that would produce very long lines for a handful of heavily trafficked IPs. :)

@ekulakow
Author

ekulakow commented Nov 8, 2011

I think I'm just trying to soak up way too much data with the [i for i in values] list expansion for one specific key. I'll switch to a (slower but) safer way of doing that.
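For illustration, the safer pattern I have in mind is roughly this (MRHotKeyJob is a made-up sketch, not my actual job):

from mrjob.job import MRJob

class MRHotKeyJob(MRJob):
    # Hypothetical job, just to show the pattern: stream over the values
    # generator instead of expanding it into a list with [i for i in values],
    # so memory stays flat even when one key has millions of values.
    def reducer(self, key, values):
        count = 0
        for _ in values:
            count += 1
        yield key, count

if __name__ == '__main__':
    MRHotKeyJob.run()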

If you're curious, here are some errors.

Redistributing the keys has produced a new error, and I also found an exit status for the reduce task:

2011-11-08 21:20:32,987 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 19 on 9001): Error from attempt_201111081851_0001_r_000060_2: java.io.IOException: Task process exit with nonzero status of 137.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

Waiting 5.0s for S3 eventual consistency
Job failed with status SHUTTING_DOWN: Shut down as step failed
Logs are in s3://mrjob-1ac31f1ce091d5aa/tmp/logs/j-3UK9OSOU8SYVG/
Scanning logs for probable cause of failure
Probable cause of failure (from s3://mrjob-1ac31f1ce091d5aa/tmp/logs/j-3UK9OSOU8SYVG/task-attempts/attempt_201111081851_0001_m_000435_1/syslog):
java.io.IOException: log:null
(while reading from s3://path/to/input/2011-07/2011-07-72.gz)
Terminating job flow: j-3UK9OSOU8SYVG

The 1st try on this mapper finishes successfully.

One issue that I don't know how to address: I'm not getting my status messages or counters out, and I don't know where to find them.

Also the log files under "steps" show the following strange behavior:

2011-11-08 20:15:29,332 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 98%
2011-11-08 20:15:38,369 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 99%
2011-11-08 20:17:18,838 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%
2011-11-08 20:42:37,569 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 98%
2011-11-08 20:42:47,614 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 99%
2011-11-08 20:44:17,894 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%

And the offending node appears to have run out of memory:

2011-11-08 20:15:07,820 WARN org.apache.hadoop.dfs.DataNode (DataNode: [/mnt/var/lib/hadoop/dfs]): java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
at org.apache.hadoop.dfs.FSDataset$FSVolume.getCapacity(FSDataset.java:334)
at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:495)
at org.apache.hadoop.dfs.FSDataset.getCapacity(FSDataset.java:692)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:691)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2970)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 9 more

2011-11-08 20:15:10,260 WARN org.apache.hadoop.dfs.DataNode (DataNode: [/mnt/var/lib/hadoop/dfs]): java.io.IOException: Cannot run program "du": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DU.access$200(DU.java:29)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:84)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 5 more

@coyotemarin
Collaborator

Try the development branch. v0.2.8 can only get logs from S3, and we can't control how fast the logs get transferred there, but the development branch can actually SSH in and grab them.

You'll have to use a key-pair file if you aren't already; see http://packages.python.org/mrjob/amazon.html#ssh-tunneling.
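If it's not set up already, the relevant bits of mrjob.conf look something like this (the key pair name and file path here are just examples):

runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem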

@ekulakow
Author

ekulakow commented Nov 8, 2011

Cool, I didn't know that about the development branch.

@irskep
Contributor

irskep commented Feb 2, 2012

You might try using the configure-hadoop bootstrap action to have the option set at bootstrap time rather than job run time.
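Something along these lines in mrjob.conf, for instance (the -m switch for setting mapred-site values is from memory; double-check it against the configure-hadoop bootstrap action docs):

bootstrap_actions:
- s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=1800000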

@irskep
Contributor

irskep commented Mar 19, 2012

I'm pretty sure you have to do as I suggested for this option to work. Dave, please close if you agree, or someone provide evidence to the contrary.

@coyotemarin
Collaborator

No, we've definitely made this work in the past with just --jobconf. We should probably hand-test it before closing this ticket.

@irskep
Contributor

irskep commented Mar 23, 2012

It definitely works with configure-hadoop and not with --jobconf. I feel like if we were using --jobconf incorrectly we would be getting actual errors.

@coyotemarin
Collaborator

Huh. Maybe it only worked with Hadoop 0.18. That's interesting.

Is this something we want to add to the documentation, or should we just close this issue?

@irskep
Contributor

irskep commented Mar 26, 2012

@coyotemarin
Collaborator

Ah, right. Cool, changing this to a "Docs" issue.
