Capture output printed via System.out.println #1202

Closed: marklit opened this issue Dec 29, 2015 · 6 comments
Labels: Bug

Comments

marklit commented Dec 29, 2015

I am trying to run the pi estimation example shipped with Hadoop on EMR. The jar file executes, but its output written via System.out.println (lines 358 & 359) doesn't appear anywhere in the output given by mrjob, nor is it to be found anywhere in the S3 bucket output.

Here are the steps I took to run the job:

Got a copy of the examples jar:

$ curl -O http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.3/hadoop-2.6.3.tar.gz
$ tar zxf hadoop-2.6.3.tar.gz
$ cp hadoop-2.6.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.3.jar ./hadoop-mapreduce-examples.jar

Created an mrjob.conf file:

runners:
  emr:
    aws_region: us-east-1
    ec2_master_instance_type: c3.xlarge
    ec2_master_instance_bid_price: '0.05'
    ec2_instance_type: c3.xlarge
    ec2_core_instance_bid_price: '0.05'
    num_ec2_instances: 2
    ami_version: 3.6.0

Created a job:

from mrjob.job import MRJob
from mrjob.step import JarStep


class CalcPiJob(MRJob):

    def steps(self):
        # Run the 'pi' example from the examples jar:
        # 10 maps, 100,000 samples per map.
        return [JarStep(
            jar='hadoop-mapreduce-examples.jar',
            args=['pi', 10, 100000])]


if __name__ == '__main__':
    CalcPiJob.run()

Created an empty input file to satisfy mrjob's input-path argument:

$ touch none

Ran the job:

$ python calc_pi_job.py -r emr --conf-path mrjob.conf --output-dir s3://<my throwaway bucket>/test1/ none
...
Job completed.
Running time was 74.0s (not counting time spent waiting for the EC2 instances)
ec2_key_pair_file not specified, going to S3
...
Streaming final output from s3://<my throwaway bucket>/test1/
...

The bucket does exist, but the new test1 directory wasn't created.

I looked through the S3 bucket contents for the mrjob logs, but I couldn't find '3.14' anywhere:

$ s3cmd get --recursive s3://mrjob-blah/logs/j-blah
$ cd j-blah
$ find . -type f -name '*.gz' -exec gunzip "{}" \;
$ grep -r '3.14' * | wc -l
0

The output usually looks something like:

Estimated value of Pi is 3.14158440000000000000

Any idea how I can capture this output?

coyotemarin added the Bug label Dec 30, 2015
coyotemarin (Collaborator) commented Dec 30, 2015

Thank you for the very detailed bug report. I'll try to duplicate your steps and see what I come up with.

coyotemarin (Collaborator) commented Dec 31, 2015

So, I think the problem is that the example jar is meant to print to the stdout of your Hadoop command, not to the output directory (equating stdout with the output directory is a Hadoop Streaming thing).

EMR does log the stdout of individual steps (i.e., runs of Hadoop), so you should be able to find Pi on S3 somewhere (<log dir>/<cluster id>/steps/s-<step id>/stdout, I believe).
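If you want to check which step log files actually landed in your bucket, listing the prefix with s3cmd (as you used above) should do it; the angle-bracket parts are the same placeholders as in the path above:

$ s3cmd ls s3://<log dir>/<cluster id>/steps/s-<step id>/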

Are there non-example Hadoop jars that output results this way? If so, I could look into adding an mrjob feature for them.

Also, FWIW, the 3.6.0 AMIs use Hadoop 2.4.0, so you'd probably want to use the example from that version, if it's available (don't know if it's any different).

coyotemarin (Collaborator) commented Dec 31, 2015

Yep, ran your example, and here is the step log on the master node (SSHed in with the AWS CLI):

$ aws emr ssh --cluster-id j-3TA5Q2BERU1G8 --key-pair-file ~/.ssh/EMR.pem
...
[hadoop@ip-172-31-27-231 ~]$ cat /mnt/var/log/hadoop/steps/s-3QSFO02HYEA8/stdout 
Number of Maps  = 10
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 186.566 seconds
Estimated value of Pi is 3.14155200000000000000

Oddly, only the stderr gets uploaded to S3 (s3://mrjob-35cdec11663cb1cb/tmp/logs/j-3TA5Q2BERU1G8/steps/s-3QSFO02HYEA8/stderr.gz in my case). So yeah, printing to stdout is nice for Hadoop examples, but not a great way to make data available on EMR.
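If you'd rather grab that value without an interactive session, plain ssh against the master node works too. A rough sketch using the IDs from this run (the XX'ed-out hostname is a placeholder; the master's public DNS comes from the EMR console or aws emr describe-cluster):

$ aws emr describe-cluster --cluster-id j-3TA5Q2BERU1G8 \
    --query Cluster.MasterPublicDnsName --output text
ec2-XX-XXX-XX-XX.compute-1.amazonaws.com
$ ssh -i ~/.ssh/EMR.pem hadoop@ec2-XX-XXX-XX-XX.compute-1.amazonaws.com \
    'grep "Estimated value" /mnt/var/log/hadoop/steps/s-3QSFO02HYEA8/stdout'
Estimated value of Pi is 3.14155200000000000000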

marklit (Author) commented Jan 1, 2016

Thanks @DavidMarin for the tip on pulling out the value by SSHing into the master node. Do you have the conf values handy for passing in a public SSH key (so it gets added to the Hadoop user's authorized_keys file), and a command to stop the cluster from auto-terminating on completion, so I have time to retrieve the value?

coyotemarin (Collaborator) commented Jan 1, 2016

See https://pythonhosted.org/mrjob/guides/emr-quickstart.html#ssh-tunneling for setting up SSH.

To stop the cluster from terminating, you can either use pooling, or start the cluster with mrjob create-job-flow and then run your job on it with the --emr-job-flow-id option.

Either way, I suggest starting the cluster with --max-hours-idle 1 so it doesn't run forever.
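Putting that together, your mrjob.conf additions might look something like this (an untested sketch; pool_emr_job_flows is my best guess at the pooling option as of mrjob 0.4.x, so double-check the option names against the docs for your version):

runners:
  emr:
    ec2_key_pair: EMR                  # name of the key pair registered with AWS
    ec2_key_pair_file: ~/.ssh/EMR.pem  # local path to the matching private key
    max_hours_idle: 1                  # terminate the cluster after an hour idle
    pool_emr_job_flows: true           # optional: reuse clusters across runs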

marklit (Author) commented Jan 1, 2016

Fantastic, cheers.
