# MrJob Logging & Counters
This notebook shows how to write information to logs from a MrJob job running on AWS EMR, and how to find those logs afterward.

#### Import the libraries and env for running on EMR

In [3]:
import os
import sys

# Get environment variables set from utils/setup.sh
home_dir = os.environ['HOME']
root_dir = os.environ['BD_GitRoot']

# Add utils to the python system path
sys.path.append(root_dir + '/utils')

# Read AWS credentials from 'EC2_VAULT'/Creds.pkl 
from read_mrjob_creds import *
(key_id, secret_key, s3_bucket, username) = read_credentials()
print s3_bucket,key_id,username

s3://dse-jgilliii/ AKIAI2W7F3RNEJ3Z35AQ jgilliii


The following is a sample MrJob script that implements the DSE230 Homework #3 filtering on the weather data.

In [4]:
%%writefile weather_filter.py
from mrjob.job import MRJob
import mrjob
import sys
import random

class InitCenter(MRJob):
    def mapper(self, _, line):
        cols = line.split(',')
        if cols[0] == 'station':
            pass
        else:
            if cols[1] == 'TMAX':
                # Write out to the log the TMAX line
                sys.stderr.write(line+'\n')
                
                # Filter out any rows that have more than 50 data points missing.
                # Also, make sure to only take rows that have 365 data points
                if sum(1 for d in cols[3:] if d == "") <= 50 and len(cols[3:]) == 365:
                    self.increment_counter('mapper', 'tmax', 1)
                    yield(random.randint(1, 1000000),line)
                else:
                    self.increment_counter('mapper', 'missing_tmax', 1)
            else:
                self.increment_counter('mapper', 'no_tmax', 1)
    
    # Just output the lines
    def reducer(self, _, line):
        for l in line:
            yield None,l

if __name__ == '__main__':
    InitCenter.run()

Overwriting weather_filer.py
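
The row filter described in the mapper's comments (keep TMAX rows that have 365 data points and at most 50 of them missing) can be checked locally before submitting anything to EMR. The following is a minimal sketch; the function name `keep_tmax_row` and the synthetic rows are illustrative assumptions, not part of the homework code.

```python
def keep_tmax_row(line, max_missing=50, expected_points=365):
    """Return True if a CSV line is a TMAX row that passes the filter."""
    cols = line.split(',')
    # Skip the header row, short rows, and any non-TMAX measurement
    if cols[0] == 'station' or len(cols) < 3 or cols[1] != 'TMAX':
        return False
    values = cols[3:]
    # Missing data points show up as empty strings
    missing = sum(1 for d in values if d == '')
    return len(values) == expected_points and missing <= max_missing

# Synthetic rows (made-up station name and values) for a quick local check
full_row = ','.join(['STATION1', 'TMAX', '1950'] + ['10'] * 365)
sparse_row = ','.join(['STATION1', 'TMAX', '1950'] + ['10'] * 300 + [''] * 65)
```

Here `keep_tmax_row(full_row)` passes the filter, while `sparse_row` is rejected because 65 of its 365 values are empty.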


In [6]:
!python weather_filter.py --runner=emr --emr-job-flow-id=j-396W2TEG2UPAJ $root_dir/data/weather/F1000.csv > tmax1000_filter.out

using configs in /Users/johngill/.mrjob.conf
creating tmp directory /var/folders/n3/xkz_j8js6vb475vj7f6c6h180000gn/T/weather_filer.johngill.20150530.203638.295956
Copying non-input files into s3://dse-jgilliii/tmp/weather_filer.johngill.20150530.203638.295956/files/
Adding our job to existing job flow j-396W2TEG2UPAJ
Job launched 30.6s ago, status WAITING: Waiting after step completed
Job launched 61.3s ago, status RUNNING: Running step (weather_filer.johngill.20150530.203638.295956: Step 1 of 1)
Job completed.
Running time was 49.0s (not counting time spent waiting for the EC2 instances)
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters may not have been uploaded to S3 yet. Try again in 5 minutes with: mrjob fetch-logs --counters j-396W2TEG2UPAJ
Counters from step 1:
  (no counters found)
Streaming final output from s3://dse-jgilliii/tmp/weather_filer.johngill.20150530.203638.295956/output/
removing tmp directory /var/folders/n3/xkz_j8js6vb475vj7f6c6h18000

## Locating the log directory
To find the logs for your particular job, you need to know which job-flow-id you used. In the above command I used 'j-396W2TEG2UPAJ'.

From a web browser, go to the S3 section of the [AWS Console](https://console.aws.amazon.com/s3/home).

All of the logs for the EMR servers are under *mas-dse-emr/log* in S3. The logs that are specific to your run are under the directory that corresponds to the `job-flow-id`: *mas-dse-emr/log/j-396W2TEG2UPAJ*.

Under this directory there are a few folders, but the two key folders are **steps** and **task-attempts**:
 * **task-attempts**: logging information for the individual mapper and reducer tasks.
 * **steps**: logging information for the entire job that has been submitted.
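
The directory layout above can be written down as a small helper that builds the two log paths from a job-flow-id. This is only a sketch: the function name `log_prefixes` is hypothetical, and it assumes the folder names are lowercase and sit directly under the job-flow-id directory.

```python
LOG_ROOT = 'mas-dse-emr/log'  # log location named in this notebook

def log_prefixes(job_flow_id, log_root=LOG_ROOT):
    """Build the S3 paths of the two key log folders for one job flow."""
    base = '{0}/{1}'.format(log_root, job_flow_id)
    return {
        'steps': base + '/steps/',
        'task-attempts': base + '/task-attempts/',
    }

prefixes = log_prefixes('j-396W2TEG2UPAJ')
```

The resulting strings can be pasted into the S3 console search box or passed to whatever S3 client you prefer.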

## Locating your step 'ID'
Under the **steps** directory you should see a list of directories named similar to:
- s-1CRJQNUKL7R9X
- s-1NDESE9KY8QEX
- s-2HXLUWTMX2WIF

Most often your particular step will be the last one run. Logs take a noticeable amount of time to appear on S3 after a run. It's not long, but it is not instantaneous.

To be certain that you get the step ID that corresponds to your job, use the following steps:
1. Open the [AWS Console](https://console.aws.amazon.com).
2. Click on 'EMR'.
3. Search for the ID that corresponds to the `job-flow-id` parameter provided during execution.
4. Click that name.
5. Scroll down to the **Steps** section and click the arrow to expand it.
6. Look at the 'Name' field. It should match the name in the output above, which appears in the status lines: "Running step (**weather_filer.johngill.20150530.203638.295956**: Step 1 of 1)".
7. The ID column holds the step ID.
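
If you saved the mrjob console output, the step name to look for in the 'Name' field can be pulled out with a regular expression instead of reading it by eye. This is a sketch based only on the status line format shown in the output above; the names `STEP_RE` and `step_names` are hypothetical.

```python
import re

# Matches status lines such as:
#   "... Running step (<name>: Step 1 of 1)"
STEP_RE = re.compile(
    r'Running step \((?P<name>[^:]+): Step (?P<num>\d+) of (?P<total>\d+)\)')

def step_names(output_text):
    """Return every step name mentioned in mrjob's console output."""
    return [m.group('name') for m in STEP_RE.finditer(output_text)]

sample = ("Job launched 61.3s ago, status RUNNING: Running step "
          "(weather_filer.johngill.20150530.203638.295956: Step 1 of 1)")
names = step_names(sample)
```

The name this returns is the value to compare against the 'Name' column in the EMR console.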

### Locating Counters and task_attempt name
Under the directory for your step ID (inside **steps**) there are 4 (possibly more) files:
- controller: Output from the parent process of the mappers and reducers
- stderr: stderr for the parent process of mappers and reducers
- stdout: stdout for the parent process of mappers and reducers
- syslog: The full log for the job. **Important file**

The syslog file has the overall status of a job, as well as the task-attempt name. Search for a line matching something like: "Submitted application **application_1433004582303_0001**". The bold portion of the previous string is the name of the directory that contains all the individual mapper and reducer output.
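
Once the syslog file is downloaded, the same search can be done in Python. The sketch below assumes only the "Submitted application ..." wording quoted above; the function name `find_application_ids` is hypothetical.

```python
import re

# The application ID is "application_" followed by two numeric parts
APP_RE = re.compile(r'Submitted application (application_\d+_\d+)')

def find_application_ids(syslog_text):
    """Return all application IDs mentioned in a step's syslog text."""
    return APP_RE.findall(syslog_text)

sample_line = 'Submitted application application_1433004582303_0001'
app_ids = find_application_ids(sample_line)
```

Each returned ID is the name of a directory under **task-attempts**.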

This same syslog file also contains the counter output for your job:
```
filter
    tmax=93
    tmax_missing=30
    tmax_no=876
```
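
A counter block like the one above is easy to turn into a dictionary for further checks. This is a minimal sketch that assumes the block has already been cut out of the syslog and looks exactly like the excerpt (an unindented group name followed by indented `name=value` lines); real syslog lines may carry extra prefixes that would need stripping first. The function name `parse_counters` is hypothetical.

```python
def parse_counters(text):
    """Parse a 'group / name=value' counter block into nested dicts."""
    counters = {}
    group = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        if not raw[0].isspace():
            # An unindented line starts a new counter group
            group = raw.strip()
            counters[group] = {}
        elif group is not None and '=' in raw:
            name, value = raw.strip().split('=', 1)
            counters[group][name] = int(value)
    return counters

sample = "filter\n    tmax=93\n    tmax_missing=30\n    tmax_no=876\n"
counters = parse_counters(sample)
```

With the excerpt above, `counters['filter']['tmax']` holds the number of rows that passed the filter.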

## Debugging stderr for mapper/reducers
Now that you have found the **stepID** and the **task_attempt** names, we can finally get to the logs for the mappers and reducers. This is also where the stack traces are.

Under the application directory there are a number of subdirectories (containers) that hold individual stderr, stdout, and syslog files for each mapper and reducer task started.

Typically I found that starting with container directory 0001 is not helpful, as that is a Hadoop control task. Starting with the 0002 directory is often much more useful. Double-clicking the stderr.gz file opens it in another window and displays the stderr output for your task.

If only a single task has failed, you may need to search through multiple stderr output files to locate a particular error.
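
If you have downloaded the application directory to your machine, the search across many stderr files can be automated. The sketch below walks a local copy and reports which `stderr.gz` files contain a Python traceback. It assumes the files are named `stderr.gz` as described above; the function name `find_failing_stderr` and the choice of `'Traceback'` as the search string are illustrative assumptions.

```python
import gzip
import os

def find_failing_stderr(root, needle='Traceback'):
    """Return paths of stderr.gz files under root that contain needle."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if fname != 'stderr.gz':
                continue
            path = os.path.join(dirpath, fname)
            # Read as bytes and decode leniently so odd characters
            # in the log do not stop the scan
            with gzip.open(path, 'rb') as fh:
                text = fh.read().decode('utf-8', 'replace')
            if needle in text:
                hits.append(path)
    return sorted(hits)
```

Calling `find_failing_stderr('path/to/downloaded/application_dir')` returns a sorted list of the files worth opening first.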

# Using the get_emr_logs utility
It appears that Professor Yoav has provided an interactive script that fetches all the logs that correspond to the job you just ran.