Capture output printed via System.out.println #1202

Closed
marklit opened this Issue Dec 29, 2015 · 6 comments


marklit commented Dec 29, 2015

I am trying to run the pi estimation example shipped with Hadoop on EMR. The jar file executes, but its output written via System.out.println (lines 358 & 359) doesn't appear anywhere in the output given by mrjob, nor is it to be found anywhere in the S3 bucket output.

Here are the steps I took to run the job:

Got a copy of the examples jar:

$ curl -O http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.3/hadoop-2.6.3.tar.gz
$ tar zxf hadoop-2.6.3.tar.gz
$ cp hadoop-2.6.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.3.jar ./

Created an mrjob.conf file:

runners:
  emr:
    aws_region: us-east-1
    ec2_master_instance_type: c3.xlarge
    ec2_master_instance_bid_price: '0.05'
    ec2_instance_type: c3.xlarge
    ec2_core_instance_bid_price: '0.05'
    num_ec2_instances: 2
    ami_version: 3.6.0

Created a job:

from mrjob.job import MRJob
from mrjob.step import JarStep


class CalcPiJob(MRJob):

    def steps(self):
        # Single jar step: run the bundled 'pi' example with
        # 10 maps of 100,000 samples each.
        return [JarStep(
            jar='hadoop-mapreduce-examples.jar',
            args=['pi', 10, 100000])]


if __name__ == '__main__':
    CalcPiJob.run()

Created an empty input file:

$ touch none

Ran the job:

$ python calc_pi_job.py -r emr --conf-path mrjob.conf --output-dir s3://<my throwaway bucket>/test1/ none
...
Job completed.
Running time was 74.0s (not counting time spent waiting for the EC2 instances)
ec2_key_pair_file not specified, going to S3
...
Streaming final output from s3://<my throwaway bucket>/test1/
...

The bucket does exist, but the new test1 directory was never created.

I looked through the S3 bucket contents for the mrjob logs, but I couldn't find '3.14' anywhere:

$ s3cmd get --recursive s3://mrjob-blah/logs/j-blah
$ cd j-blah
$ find . -type f -name '*.gz' -exec gunzip "{}" \;
$ grep -r '3.14' * | wc -l
0

The output usually looks something like:

Estimated value of Pi is 3.14158440000000000000

Any idea how I can capture this output?

davidmarin added the Bug label Dec 30, 2015

davidmarin (Collaborator) commented Dec 30, 2015

Thank you for the very detailed bug report. I'll try to duplicate your steps, and see what I come up with.

davidmarin (Collaborator) commented Dec 31, 2015

So, I think the problem is that the example jar is meant to print to the stdout of your Hadoop command, not to the output directory (equating stdout with the output directory is a Hadoop Streaming thing).

EMR does log the stdout of individual steps (i.e., runs of Hadoop), so you should be able to find Pi on S3 somewhere (<log dir>/<cluster id>/steps/s-<step id>/stdout, I believe).
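
For example, here's an untested sketch using boto3 (the bucket name and cluster ID are just the placeholders from your s3cmd command below, so substitute your real ones) that lists whatever EMR uploaded for your steps:

import boto3

# Placeholder bucket and prefix, matching the s3cmd command from the
# report above; swap in your real log bucket and cluster ID.
LOG_BUCKET = 'mrjob-blah'
STEPS_PREFIX = 'logs/j-blah/steps/'

s3 = boto3.client('s3')
resp = s3.list_objects(Bucket=LOG_BUCKET, Prefix=STEPS_PREFIX)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])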

Are there non-example Hadoop jars that output results in this way? If so, I could look into adding an mrjob feature for them.

Also, FWIW, the 3.6.0 AMIs use Hadoop 2.4.0, so you'd probably want to use the example from that version, if it's available (don't know if it's any different).

davidmarin closed this Dec 31, 2015

davidmarin (Collaborator) commented Dec 31, 2015

Yep, ran your example, and here is the step log on the master node (ssh'ed in with the AWS CLI):

$ aws emr ssh --cluster-id j-3TA5Q2BERU1G8 --key-pair-file ~/.ssh/EMR.pem
...
[hadoop@ip-172-31-27-231 ~]$ cat /mnt/var/log/hadoop/steps/s-3QSFO02HYEA8/stdout 
Number of Maps  = 10
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 186.566 seconds
Estimated value of Pi is 3.14155200000000000000

Oddly, only the stderr gets uploaded to S3 (s3://mrjob-35cdec11663cb1cb/tmp/logs/j-3TA5Q2BERU1G8/steps/s-3QSFO02HYEA8/stderr.gz in my case). So yeah, printing to stdout is nice for Hadoop examples, but it's not a great way to make data available on EMR.
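
If it's useful, here's a quick untested boto3 sketch for pulling down and decompressing one of those step logs (the bucket and key are the exact ones from my run above; use your own):

import gzip
import io

import boto3

# Bucket and key taken from the stderr.gz path above; substitute your own.
s3 = boto3.client('s3')
obj = s3.get_object(
    Bucket='mrjob-35cdec11663cb1cb',
    Key='tmp/logs/j-3TA5Q2BERU1G8/steps/s-3QSFO02HYEA8/stderr.gz')

with gzip.GzipFile(fileobj=io.BytesIO(obj['Body'].read())) as f:
    print(f.read().decode('utf-8'))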

marklit commented Jan 1, 2016

Thanks @davidmarin for the tip on pulling out the value by SSHing into the master node. Do you have the conf values handy for passing in your public SSH key (so it gets added to the Hadoop user's authorized_keys file) and for stopping the cluster from auto-terminating on completion, so I have time to retrieve the value?

davidmarin (Collaborator) commented Jan 1, 2016

See https://pythonhosted.org/mrjob/guides/emr-quickstart.html#ssh-tunneling for setting up SSH.

To stop the cluster from terminating, you can either use Pooling, or start the cluster with mrjob create-job-flow and then run your job on it with the --emr-job-flow-id option.

Either way, I suggest starting the cluster with --max-hours-idle 1 so it doesn't run forever.
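
For reference, the relevant conf keys look roughly like this (the key pair name and file path are just examples from my setup, and option names vary a bit between mrjob versions, so check the docs above):

runners:
  emr:
    ec2_key_pair: EMR                  # name of your EC2 key pair
    ec2_key_pair_file: ~/.ssh/EMR.pem  # local path to its private key
    max_hours_idle: 1                  # terminate after an hour of idling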

marklit commented Jan 1, 2016

Fantastic, cheers.
