
Introduction

Apache Pig is a platform for analyzing large data sets using a high-level language for expressing data analysis programs, coupled with the infrastructure for evaluating these programs. Key benefits of using Pig are:

  • Ease of programming: Because Pig abstracts the Map/Reduce paradigm, users can express even complex tasks easily in Pig Latin (see the short example after this list).
  • Optimization opportunities: Users can focus on the semantics of their analysis rather than on execution efficiency.
  • Extensibility: Users can create their own functions to do special-purpose processing.
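
To give a flavor of Pig Latin, the sketch below counts word occurrences in a text file; the input path and aliases are illustrative only and are not taken from this project.

```
-- minimal word-count sketch; input path and aliases are illustrative
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO 'wordcount_output';
```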

Limitations

Pig is widely used for ad-hoc analysis of large data sets. Because Pig is mainly used through a command line interface (CLI), several issues arise when using it in a shared Hadoop cluster environment:

  • Managing users: Since users submit Pig jobs from the CLI machine, each user needs an account on that machine. On-boarding a new user means creating an account, setting the required permissions, and so on.
  • Automation of Pig execution: Because Pig jobs must be submitted from the CLI machine, there is no easy or programmatic way to automate submission of Pig scripts.
  • Limited by the number of processes: Each Pig execution spawns a separate Java process, so the number of Pig requests that can run is limited by the number of processes the CLI machine can support.
  • User is exposed to cluster configuration: The user needs to know Hadoop configuration details such as the Namenode and Jobtracker in order to run a Pig script, which is a poor fit for a shared cluster environment with a large number of users from different teams/organizations.
  • No governance: Once users have access to the CLI machine, they can submit as many requests as they want, and it is difficult to apply rate limiting or QoS to such jobs.

Tools such as Ambrose and Lipstick provide visualization of data workflows, but none of them offers an interface to submit a Pig script or to view the output of a Pig script programmatically.

Oink

The aim of this project is to provide a REST-based interface for Pig execution. Oink is Pig on a servlet and provides the following functionality:

  • Register/unregister/view a Pig script
  • Register/unregister/view a jar file (for custom user-defined functions, UDFs)
  • Execute a Pig job
  • View the status/stats of a Pig job
  • Cancel a Pig job

Design

The Oink webapp runs inside a servlet container (such as Tomcat) and executes multiple Pig requests in parallel within a single JVM. Each Pig request runs on a separate thread inside the web container. To achieve this, we have modified the Pig source code to manipulate the classloader so that each thread executing a Pig script has its own classloader.

When a Pig script or jar is registered, it is stored in Hadoop itself and remains available for future reference. Whenever someone submits a request for execution, a request id is returned immediately (the call is asynchronous) and can be used for tracking. Users can register parameterized Pig scripts, supplying all required (dynamic) parameters when submitting the request for execution (see the example below). There is also a provision to receive HTTP notifications when a Pig request changes status or completes.
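
As a rough illustration of a parameterized script, the sketch below uses Pig's standard parameter substitution; the parameter names $input_path and $date are hypothetical and would be supplied as values with the execution request ('$output' is described under REST API below).

```
-- illustrative parameterized script; $input_path and $date are hypothetical
-- parameters supplied with the execution request
logs    = LOAD '$input_path' AS (pool:chararray, ts:chararray, msg:chararray);
one_day = FILTER logs BY ts MATCHES '${date}.*';
by_pool = GROUP one_day BY pool;
counts  = FOREACH by_pool GENERATE group AS pool, COUNT(one_day) AS events;
STORE counts INTO '$output';
```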

Currently, the thread executing the Pig job blocks, since there is no support yet for asynchronous execution of Pig jobs. Request-specific information is stored in DFS itself, generally under three directories per request id:

  • input: stores the actual input json file
  • stats: stores request-specific information such as the status of the request, output bytes, the number of Hadoop jobs, and the URL of each Hadoop job; this information is updated dynamically as the request progresses
  • output: stores the actual output of the request

REST API

For details on the REST API, please refer to this page.

Currently, there is only one validation in place when registering a Pig script: it checks that the script contains no DUMP statements and only STORE statements, as DUMP is not currently supported. In addition, '$output' must be given as the location for storing the output of the Pig script; the actual location behind '$output' is determined by Oink itself and is relative to the request id.
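
As a sketch (not taken from the Oink documentation), a registered script therefore has to end with a STORE into '$output' rather than a DUMP; here `counts` stands for whatever relation the script produces.

```
-- rejected at registration: DUMP is not supported
--   DUMP counts;

-- accepted: results must be written to '$output',
-- which Oink resolves relative to the request id
STORE counts INTO '$output';
```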

Use case

At eBay, all application server logs are stored centrally through CAL (Centralized Application Logging). There are more than 2.5K pools and more than 40K application servers. All application-specific logs are stored in Hadoop, which has a capacity of a few PBs. Many users are interested in processing these logs to extract meaningful information: generating analytical reports, debugging specific scenarios, deriving statistics that are not available directly from metrics, and so on. Each user writes their own Pig script and submits it for processing. Given that there are more than 2.5K pools (and potentially as many Pig users), it becomes very cumbersome to provide support and maintain the cluster. It is also impossible to provide any governance when users submit requests directly from the CLI.

Oink is the right solution for such scenarios. With Oink, on-boarding a new customer becomes very easy because no manual process is involved. Users can submit ad-hoc requests and get the required output without knowing the Hadoop configuration. Also, since Oink is a single point of entry, rate limiting can be applied when the number of requests grows large. With Oink, the CAL team has on-boarded more than 100 Pig scripts and currently runs more than 6,000 Pig jobs per day. It also gains 2-3 new customers per week, with no involvement from the CAL or Hadoop teams to on-board them.

Benefits

  • Provides a multi-tenant environment for different kinds of Pig execution
  • Can serve as a single entry point for Pig job execution, where rate limiting and QoS can be implemented and applied very easily
  • Very easy to automate script execution
  • No manual intervention needed to grant user access, etc.
  • Abstraction of Hadoop cluster configuration from users