# Hetrogeneous Multiconstraint Application Partitioner (HMAP) -Exploiting parallelism in a hetrogeneous HPC environment

Servesh Muralidharan<sup>†</sup>, Aravind Vasudevan<sup>†</sup>, Avinash Malik<sup>§</sup> and David Gregg<sup>†</sup>

†School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland

Email: muralis@scss.tcd.ie, vasudeva@scss.tcd.ie, David.Gregg.cs.tcd.ie

§IBM Research - Ireland

Email: avinmali@ie.ibm.com

Abstract-In this article we provide a framework for exploiting parallelism onto heterogeneous HPC architectures. Given an Hetrogeneous HPC cluster with varying compute units, communication constraints and topology, our Hetrogeneous Multiconstraint Application Partitioner (HMAP) framework can be utilized for partitioning applications exhibiting task and data parallelism resulting in increased throughput. The challenge lies in the fact that heterogeneous compute clusters consist of processing elements exhibiting different compute speeds, vector lengths, and communication bandwidths, which all need to be considered when partitioning the application and associated data. We tackle this problem using a staged graph partitioning framework. HMAP framework finishes within seconds even for architectures with 100's of processing elements, which makes our algorithm suitable for exploring parallelism potential off line or on line.

Keywords-Graph partitioning, vectorization, data parallelism, heterogeneous architectures, clusters.

#### I. INTRODUCTION

HPC clusters increasingly consist of large numbers of heterogeneous processing elements such as CPUs, graphics processing units (GPUs), field programmable gate arrays (FPGAs), low-power processors intended for digital signal processing (DSP), etc. By combining heterogeneous processing units it may be possible to divide the work so that different types of computation in the application are run on different types of units. This can result in significant speedups, lower hardware costs and/or reduced power consumption by the HPC system. For example, if a computation contains the right patterns of data parallelism it may run dozens or even hundreds of times faster on a GPU than on a CPU that has similar cost and power consumption. On the other hand, computations with less data parallelism and more complex control flow may run faster on CPUs. Matching the type of computation to the processor can yield significant benefits.

Although the potential of heterogeneous computing is great, exploiting that potential is more difficult. Given a parallel application, it is difficult to map the available parallelism onto the hardware. For example, how does one decide which parallel filters should run on which type of execution unit? Given a system with dozens or hundreds of

CPUs, GPUs and other units, how does one divide the work between them? There are several conflicting factors. For example, one wants to allocate filters to the type of execution unit that will execute it most efficiently. On the other hand, one wants to achieve a good load balance by dividing the work evenly across the units. We want to allocate the filters to reduce communication costs while at the same time taking account of all the other factors.

In this paper we consider the problem of mapping graphs of parallel tasks to heterogeneous HPC computing systems. This problem has been studied extensively for homogeneous architectures where all processing elements are the same. Although the homogeneous case is NP-hard [1], several heuristic solutions have been found that work well in practice. However, extending these solutions to the heterogeneous case is difficult for two reasons.

In the heterogeneous case some processing elements are more powerful than others, so achieving a good load balance usually involves distributing the work unevenly.

A second reason why it can be difficult to extend algorithms for homogeneous architectures to the heterogeneous hardware relates to the strengths and weaknesses of different types of processors. When considering heterogeneous architectures, it is tempting to think of some processing elements simply being more powerful than others. A GPU is not simply a more powerful CPU. In fact, some types of computation run better on CPUs and some on GPUs. For a mapping algorithm to work well, it needs to take account of the strengths and weaknesses of different types of processing elements. In this paper we present an approach to mapping parallel tasks to heterogeneous architectures that addresses both of these concerns.

Our main contributions are as follows:

- We present a novel approach to characterizing the type of processing elements based on their level of vector parallelism which allows us to distinguish the suitability of different types of units to different filters.
- We provide a novel algorithm for mapping task and data parallelism to heterogeneous architectures based on hierarchical graph partitioning.
- In addition to mapping filters to processing elements,

our framework also allocates the data stores being used by the different filters.

The rest of this paper is organized as follows. We first provide a motivating example in Section II formalizes the problem statement and defines the objective function. Next, in Section III, we provide a detailed description of our framework. Section IV gives the quantitative comparisons of our approach against other approaches. Section V describes the related work and positions our approach in comparison to these works. Finally, we conclude in Section VI.

## II. PRELIMINARIES

We now present a formal description of the problem along with the notations used.

#### A. Execution model

In this paper we consider the streaming [5] model of computation. Streaming is a popular model for programs such as image and signal processing, financial applications, networking, telecommunications, etc.

In the streaming model statements (also called filters/actors/tasks or kernels) execute iteratively, processing the incoming tokens of data. Consider the Jacobi example and its filter graph in Figure 1. The Jacobi algorithm is used in fluid dynamics and heat transfer problems. We consider every statement (marked 1 to 4) in this example to be a filter that can be run in a software pipelined [6] manner on a given architecture. An *example* execution trace of the Jacobi example is shown in Table I for some arbitrary value of computation and communication latency of statements.

| P0 | 10 | $2_{0}$  | 11    |       | $2_1$ | 12    |
|----|----|----------|-------|-------|-------|-------|
| P1 |    | $_{3_0}$ | $4_0$ | $4_0$ | $3_1$ | $4_1$ |

 $\label{eq:Table I} \textbf{Table I} \\ \textbf{Example execution trace of the Jacobi Kernel} \\$ 

In a software pipelined model, the different iterations of the filters are run in parallel, e.g.,  $\mathbf{1}_0$  is the  $\mathbf{1}^{st}$  iteration of statement 1 in the Jacobi example, while  $\mathbf{1}_1$  is the second iteration and so on and so forth. The latency of the application (termed makespan) is the period, which iterates continuously (shown within the double lined columns in Table I). In such a model, the resource allocation (rather than dependencies) determines the application latency, especially without back-edges in the filter graph (as is the case with our model). In Table I, the resource allocation on processing element P1 determines the application latency, because that is the maximum of the two allocation latencies.

## B. Notations

We refer to our application graph, as a *Synchronous DataFlow* (SDF) graph defined formally as a weighted directed graph:  $G_t(V_t, E_t)$ , where  $V_t$  is the set of all filters in the application graph and  $E_t$  represents the communication buffers between these filters. The system resources are represented by a weighted undirected graph  $G_r(V_r, E_r)$  where

 $V_r$  represents a set of processing elements (PEs) which can have different processing capabilities and  $E_r$  represent the communication links between these PEs with differing latencies and bandwidths.

#### C. Problem Definition

Given a graph  $G_t(V_t, E_t)$ , each vertex in the filter graph,  $t_i \in V_t$  has a set of associated requirements represented by  $T^i_j$  where  $j=0...n_t$  with  $n_t$  being the number of requirements. These requirements represent the computational requirements of the filter. Namely,  $T^i_0$  represents the scalar requirements, while  $T^i_1$  represents the vector requirements.

The communication edges are decorated with  $e^c \in E_t$ , which denotes the data the filter requires for processing.

Each resource node  $r_i \in V_r$  has a number of computational capabilities, represented by  $R^i_j$  where  $j=0...n_r$ . In our formulation  $n_t=n_r$ . For each node, capability:  $R^i_0$  represents the frequency of the PE or how many scalar instructions the PE can perform in one second (the *Million Instructions Per Second* (MIPS) count). Capability:  $R^i_1$  denotes the maximum number of parallel vector operations it can perform (the vector length). Each edge  $e \in E_r$  has a weight which represents the bandwidth between two PEs  $r_i$  and  $r_i$  which is denoted by  $E^c$ .

The problem at hand is to effectively map the filter graph  $G_t$  onto given resource graph  $G_r$ . This problem is known to be NP-Hard [1].

Given some filter  $t_i \in V_t$  mapped to some resource  $r_j \in V_r$ , the latency for that node is computed by Equation (1). In this formulation for some filter  $t_i$  being mapped onto some resource  $r_j$ , we first calculate the number of vectorized instructions that can be executed in parallel (by dividing the required vector length by the vector capacity of  $r_j$  represented by  $R_1^j$ ). We then multiply this number by the number of iterative (non-vectorized) instructions required to get the total number of instructions to be performed by that filter-graph node. Once we have this number we calculate latency of execution of this filter-graph node  $t_i$  on this resource  $r_j$  by dividing it with the MIPS value of the resource denoted by  $R_0^j$ . Calculation of the communication latency requires dividing the number of bits by the bandwidth of the shortest path.

Given the filter-graph and the resource-graph, let  $\mathcal{M}$  be some mapping of the application on the resource-graph. For a particular allocation onto some resource node  $r_s \in V_r$ , we define its computation and communication latency as in Equation (2).

$$\begin{array}{l} ((T_1^i/R_1^j \times T_0^i)/R_0^j) + (e^c/{E^c}')|c = (t_i,t_k), t_k \neq t_i, \\ \forall t_k \in V_t, c' = (r_j,r_l), r_l \neq r_j, \forall r_l \in V_r \end{array}$$

Finally, the complete application latency can then be defined as:

$$L^{\mathcal{M}} = \max(L_s^{\mathcal{M}}), \forall r_s \in V_r \tag{3}$$

```
for (int i=0; i<999; ++i) {
              for (int j=0; j<999; ++j) {
               1: A[i][j] = (i*j+2.0+2.0/1000)
               2: B[i][j] = (i * j + 3.0 + 3.0 / 1000)
                  (int k=0; k<1000; ++k) {
              for (int i=1; i<998; ++i)
               for (int j=1; j<998; ++j)
                 3: B[i][j] = 0.2*(A[i][j]+A[i][j-1]
                                 +A[i][j+1]+A[i-1][j])
              //Data-parallel
              for (int i=1; i<999; ++i)
               for (int j=1; j<999; ++j)
                 4: A[i][j] = B[i][j]
                   (a) Example 2-dimensional Jacobi application
                   L_s^{\mathcal{M}} = Lcomp_s^{\mathcal{M}} + Lcomm_s^{\mathcal{M}} Figure 1. Jacobi example and its filter-graph
            Lcomp_s^{\mathcal{M}} = \sum_{\forall t_i \in V_t} ((T_1^i/R_1^s \times T_0^i)/R_0^s)
                  Lcomm_s^{\mathcal{M}} = \sum_{\forall t_i \in V_t} e^c / E^{c'}
s.t., d = (t_i, t_k), t_k \neq t_i, \forall t_k \in V_t \land \ c' = (r_s, r_l), r_l \neq r_s, \forall r_l \in V_r \land d \text{ is routed on } c'
```

//Task and data-parallel

The objective of our framework is to find a mapping  $\mathcal{M}$ that minimizes the total application latency as described in Equation (3).

#### III. OUR FRAMEWORK

A common approach to solving the homogeneous case is to partition the graph across the processing elements [2], [3], [4]. However, we have found that such heuristic partitioning approaches do not work well for heterogeneous architectures. Instead we propose a novel approach where the architecture is hierarchically partitioned into sub-clusters. Each sub-cluster at the same level of the architecture hierarchy has approximately the same computing capability and has relatively local communication. This allows us to use existing approaches that work well for the homogeneous case to partition *heterogeneous* architectures across the subclusters at each level of the hierarchy. We show that this is an effective approach if the sub-clusters are well balanced at each level of the hierarchy.

In this section we describe our heuristic algorithm. There are two important concepts that need description. First, we describe how the topology clusters are formed from the resource graphs accounting for communication and heterogeneity of the topology. Secondly, given a filter graph with data parallel filters, task parallel filters, and communication



(b) The filter graph for the Jacobi example <sup>1</sup>

extracted as shown in Figure 1(b) how the mapping is performed.

#### A. Clustering the resource graph

The resource graph represents the cluster of compute nodes on which the filter graph will be executed. A sample resource graph is shown in Figure 2 at level 0. The resource graph that is shown is heterogeneous in both computation and communication. The properties of the resource graph are described below:

- **Compute nodes:** The compute nodes (PEs) are assumed to belong under two categories of processing units, mainly CPUs and GPUs. Following the current trend, the CPUs have a larger MIPS count. The MIPS capacity of a PE is denoted by the first decoration  $R_0^i, \forall i \in V_r$ . The GPU nodes have a lower MIPS count, but have a large resident vector length, denoted by the second decoration  $R_1^i, \forall i \in V_r$ . For example, the fifth PE (E0) at level 0 is a CPU, since it has a small vector count and a large MIPS count, the first PE (A0) on the other hand is a GPU, since the capabilities are reversed.
- Communication links: The resource graph shown in Figure 2 at level 0, follows a 2D mesh topology. In this topology the PEs are connected in a grid with individual communication links between them. The bandwidth of these links is non uniform. The different bandwidths on the communication links is represented by the decoration  $E^C$ . Our framework can handle any kind of topology, the 2D mesh shown in Figure 2, is just an example topology.
- 1) Clustering the topology: The main idea behind our partitioning approach is to first hierarchically cluster the nodes in the heterogeneous topology, provided by the designer, thereby forming fewer contracted resource nodes. The application is then partitioned in stages (levels) onto the resulting hierarchy. The intuition behind this approach is two fold:

<sup>&</sup>lt;sup>1</sup>Ellipses represent data stores. Rectangles represent filter nodes. Rounded rectangle represents data parallel nodes. The dots represent other data parallel nodes not shown in the figure. Dashed arrows represent communication between data stores and execution statements. Solid arrows represent dependence edges.

- Heterogeneous K-way partitioning: The process of partitioning an application onto a given architecture is equivalent to a heterogeneous K-way partitioning problem. Hierarchically clustering a heterogeneous topology such that the resulting hierarchy consists of contracted PEs with equivalent compute capabilities can reduce the heterogeneous K-way partitioning problem to a homogeneous one.
- Considering communication links: The communication
  can be considered into the equation, while building
  the hierarchical cluster using the min-cut technique.
  Thus, a min-cut load-balancing of the PEs in a topology
  intuitively means: we are clustering together PEs, which
  have large bandwidth together into a single contracted
  node, while making an attempt to load balance the two
  capabilities: MIPS and vector lengths.

The hierarchical cluster built for the synthetic topology at level 0 of Figure 2, is shown in the levels 1-3. The stages used to build the hierarchy are as follows:

- Effective computation during contraction: Given a topology graph with  $|V_r|$  resources, we cluster the PEs in levels, whereby the height of the cluster is  $log_2|V_r|$ . For example, consider the PEs at level 0 in Figure 2. Contraction from level 0 to level 1 results in 4 PEs at level 1, where the two capabilities,  $R_0^k$ ,  $R_1^k$  for each contracted node  $k=\{i,j\}, \exists i\in V_r \land \exists j\in V_r$  is computed as:  $R_0^k=R_0^i+R_0^j,$  and  $R_1^k=R_1^k+R_1^j.$ Without loss of generality we assume this for any  $k = \{i, j, ..., n\}, \forall n \in V_r$ . This process is continued until we reach the top-level with just 1 contracted node. We end up with a load-balanced hierarchy, with each level showing a larger amount of homogeneity, and a smaller number of contracted PEs. The reason for a level based contraction, instead of contracting all nodes into a single 2 node partition, is that when partitioning the application on the resulting top-down cluster, we have fine grained details within each of the contracted node.
- Effective communication during contraction: It is essential that we handle communication correctly while contracting nodes. There are a number of paths that communication between two nodes might follow. The effective bandwidth between any two contracted nodes depends upon the route in the topology graph between these contracted nodes. We use the all-pair Floyd-Warshall [13] algorithm to calculate the shortest path, in terms of latency of data-transfer, for every communication link in the topology. Once the nodes are contracted we use the bottleneck amongst these possible links as our effective bandwidth.

For the topology, at level 0, in Figure 2, we first calculate the best possible paths between every pair of nodes  $(i, j) \in (V_r \times V_r)$ . Now let us consider the

example of the contracted nodes B1 and C1. Node B1 is a contraction of the set of children nodes  $\{3, 8, 6\}$ and C1 is the contraction of the node set  $\{2,1\}$ . Being an undirected graph with loss of generality we can consider B1 to be the source and C1 to be the destination. Hence, the worst path, in terms of latency of data-transfer, following the all-pair Floyd-Warshall algorithm, between nodes 3 and 2 within the contracted nodes is given by the maximum latency path amongst the memoized best edges: max((3,2), (8,2), (6,2)). Similar computation is carried out for all link pairs in the contracted node with 1 as the destination node to find the minimum latency path. Finally, the addition of these two latencies and its reciprocal gives us the effective bandwidth between the two contracted nodes. This effective bandwidth is the overall bottleneck between the two contracted nodes.



Figure 2. Clustering of a resource graph

## B. Task graph partitioning

The filter partitioning on the resulting hierarchical cluster takes a top-down approach. We start with the filter graph extracted from the application (Figure 1). We then recursively partition the filter graph (using K-way partitioning) on the hierarchical cluster of the resource-graph. For the example cluster in Figure 2, we start by partitioning the filter graph in Figure 1, first into two (K=2) partitions considering two equally weighted compute nodes at level 2. Once a partition is obtained for this level, we move onto the next level (level 1), whereby all the filter graph nodes allocated onto node A2 are further partitioned onto the nodes A1 and C1 (again

K=2), which are coupled into the contracted node A2. This process is continued recursively for all contracted nodes until the final level (level 0). During each partition we make sure that the filter node requirements are closely matched to the resource node capabilities.

Doing it in this top down manner has two important consequences.

- Increase in the time complexity: As stated previously, the partitioning problem is equivalent to the K-way partitioning problem. The multi-level K-way partitioning results in the worst cast time complexity of  $O(|E_t| \times log_2|V_r|)$  [14]. Our algorithm gives a worst case complexity of  $log_2|V_r| \times O(|E_t| \times log_2|V_r|)$  when using the multi-level K-way partitioning. But, in the average case we ask for a balanced 2-way partition at all levels, which results in a complexity of  $log_2|V_r| \times O(E_t)$  in the average case.
- Refined mapping: By dividing the mapping on to several levels we can achieve much better load balancing by considering only fewer nodes to map than if we were to do it directly on to the resource graph.

#### IV. EXPERIMENTS AND RESULTS

## A. Our implementation

We use the Metis [17] graph partitioning library to implement our partitioning algorithm. We are not tied to Metis and any other graph partitioner such as Zoltan [18] or Scotch [19] can be used for implementing our algorithm. Herein, we describe how we used Metis to implement the graph partitioning.

The resource graph is represented in the Metis graph format. We represent the PEs' capabilities as constraints of the nodes and the links' bandwidths as communication weight on the edges. We then construct our clustered structure (Figure 2) by asking for a 2-way partition at each level of the  $log_2|V_r|$  height. Metis partitions the graph by load balancing the constraints and performing a minimum edge cut. In partitioning the filter graph, we need to balance the constraints on to the available partitions. Metis offers the ability to load balance multiple constraints on to different partitions based on the metric 'tp-weight'. We calculate the ratios between the capabilities of different partitions and represent them as this metric in order to load balance on to the available partitions.

## B. The experimental set-up

The experimental set-up consists of the resource graph generation and the filter graph generation. Herein, we describe the two set-ups.

- 1) The resource graph set-up: The experimental set up consists of the following.
  - 1) An interconnection network with  $|V_r|$  nodes.  $|V_r|$  vary from 64 to 4096 PEs. A node can be just a multi-core CPU or a multi-core CPU with an attached GPU.

- 2) A set of  $N_G$  GPUs where  $N_G$  is at most  $|V_r|$ . The GPUs are connected in the network at locations, chosen randomly in the normal distribution of 25% to 75% of  $|V_r|$ .
- 3) A set  $G = \{G_1, G_2, G_3, ... G_{|G|}\}$  where  $G_i$  is a power of 2. Every GPU in this experiment has a vector length of  $G_i$  where  $G_i$  is sampled randomly from the set G. The elements of set G are chosen from a normal distribution ranging from: 1024 to 16284.
- 4) A set  $C = \{C_1, C_2, C_3, ... C_{|C|}\}$  where  $C_i$  is a power of 2. Every CPU in this experiment has  $C_i$  cores where  $C_i$  is sampled randomly from the set C.
- 5) A set  $M = \{M_1, M_2, M_3, ... M_{|M|}\}$  where  $M_i$  is a power of 2. Every  $C_i \in C$  and GPU in this experiment has a MIPS count of  $M_i$  where  $M_i$  is sampled randomly from the set M. The elements of set M are chosen from a normal distribution ranging from: XXX to XXX.

For given values of  $|V_r|$ ,  $N_G$ , G, C, and M and a given application, let the k-th <u>trial</u> be defined as one execution of the following sequence of steps.

- For each GPU  $G_i$ , sample G and M randomly to determine its vector length  $V_i$  and MIPS count  $M_i$ .
- For each CPU  $P_i$ , sample C randomly to determine the number of cores  $C_i$  in the processor  $P_i$ .
- For each core  $C_i$  in the processor  $P_i$  sample  $V_i$  and  $M_i$  randomly from set G and M.
- Use our framework to extract data and filter parallelism that is best utilizable by the heterogeneity created by parameters in items 1, 2, and 3 above. Determine the execution time  $L^{\mathcal{M}}$ .

An experiment,  $E(|V_r|, N_G, G, CM)$ , consists of conducting enough of the above trials so that width of the 95% confidence interval on the average value of  $Latency^{\zeta_M}$  is less than 10% of the average value. This results in a variable number of trials with different experimental setups. Note that two trials differ from each other only in the seed for the random number generator. This reduces the dependence of our results on a lucky sequence of numbers from the random number generator.

- 2) Random filter graph generation: We built a random graph generator to test our partitioning methodology rigorously. The random graph generator needs as input the following parameters:
  - Number of nodes (n) Total number of nodes to be present in the filter graph
  - Indegree (i) Average indegree of every vertex
  - Outdegree (o) Average outdegree of every vertex
  - Communication to Computation Ratio CCR (c) It is the ratio of the average communication cost of an outedge the average computation cost of the vertex itself. If a DAG's c is low, then it can be called a computation intensive application and if it is greater than 1 it can

- be called a communication intensive application
- Structure of the graph  $(\alpha)$  We generate the height of the graph based on  $\alpha$  as  $\frac{\sqrt{n}}{\alpha}$ . This implies the width of the graph becomes  $\sqrt{n}*\alpha$ . Higher values of  $\alpha$  give wider graphs which means the graph has more inherent task parallelism, while lower values give taller graphs which means the graph is inherently serial
- Beta (β) We use this parameter to decide if an actor in the filter graph is CPU intensive or GPU intensive. Smaller values of β makes actors CPU intensive (by making first constraint larger than the second), while larger values make it GPU intensive (by making the second constraint larger than the first)
- Skewedness factor  $(\gamma)$  This parameter dictates how computation is spread across the graph. Smaller values of  $\gamma$  give uniformly distributed values for the constraints of the actors while larger values produces skewed graphs

For our experiments we generated random graphs by choosing values for the input parameters from the following sets:

- $\chi_n = \{256, 1024, 16384\}$
- $\chi_i = \{1\}$
- $\chi_o = \{2, 4, 8\}$
- $\chi_c = \{0.0001, 0.001, 0.009\}$
- $\chi_{\alpha} = \{0.1, 1.0, 10.0\}$
- $\chi_{\beta} = \{5, 25, 50, 75, 95\}$
- $\chi_{\gamma} = \{25, 50, 75\}$

We generated one graph per combination for a total of [servesh fill this number in]. Since the random graph generator has a variety of inputs and these inputs are filled in from a large set of possible values, a diverse set of DAGs are generated with various characteristics. Experiments based on diverse set of DAGs prevent biasing towards a particular partitioning algorithm.

#### C. Experimental Results

The experimental set-up consists of a dual socket system consisting of Intel Xeon E5620 CPU running at 2.4Ghz with 24GB DDR3 RAM. The system is running Linux kernel ver 3.0.40-1. Our framework was compiled using gcc version 4.7.1 with '-O3' optimization flag.

#### V. RELATED WORK

A significant amount of research literature exists for extracting parallelism from programs [11], [20], [21], [22], [23]. The polyhedral optimization model [11] concentrates on extracting parallelism from loops, which is a form of data-parallelism. The polyhedral optimization community has concentrated on optimizing for CPUs and GPUs separately, but to our knowledge has not explored the combination of the two. Carpenter et. al [23] have again explored ideas for partitioning onto heterogeneous architectures, but at a much smaller scale and again ignoring data-allocation costs

and depending upon the polyhedral model for vectorization. The StreamIt [24] community has also explored parallelization techniques, but they have only targeted homogeneous RAW [25] architecture. There are the classical algorithms such as critical path scheduling [26] and list scheduling [27], which have been used for scheduling task parallel process onto homogeneous architectures. The list scheduling techniques targeting heterogeneous architectures such as [28] do not exploit SIMD parallelism onto vector processors. Declustering [22], is another technique, which misses the opportunity to mix SIMD and task-parallel optimizations together.

Cluster based partitioning techniques [29], [30], [12] only consider independent tasks without communication. The proposed heuristics for partitioning data-parallel applications onto clusters [7], [31] do not consider vectorization potential available on the compute clusters and only concentrate on partitioning task parallel processes.

Random search techniques when combined with a certain objective use random choices to guide them through a search space to arrive at a solution. The objective is mostly derived from information obtained from previous random searches or prior information about the system. Genetic Algorithms [] are the commonly used from this technique and generate reasonably good results but this requires significantly more time than heusristic based alogirthms []. Moreover determining optimal parameters to fine tune these random techniques to get good results is a challenge. Parameters that give good results for a certain scenario may not be the same for a different one. Several other techniques also exist that belong to the same category in addition to GA such as simulated Annealing [] and local search methods [] suffer from the same disadvantages.

# VI. CONCLUSION

In this paper we have described a novel staged graph based partitioning technique to partition and schedule applications onto heterogeneous execution architectures. HMAP framework considers both filter and data parallelism, which allows the application writers to design and tune their applications to extract parallelism. We also consider communication along application and underlying architecture edges, which when combined with the filter and data parallelism gives a better estimate of the application latency than the currently described research literature targeting similar problems.

We have tested our framework on a statistical sample of randomly generated filter graphs with varying compute, vector, communication requirements.

# VII. ACKNOWLEDGMENT

This work is partly funded by the IRCSET Enterprise Partnership Scheme in collaboration with IBM Research, Ireland.



Figure 3. Experimental Results on random application graphs

#### REFERENCES

- V. Sarkar, "Partitioning and scheduling parallel algorithms for execution on multiprocessors," Ph.D. dissertation, Stanford university, 1989.
- [2] A. Aletà, J. M. Codina, J. Sánchez, and A. González, "Graph-partitioning based instruction scheduling for clustered processors," in *Proceedings of the 34th annual* ACM/IEEE international symposium on Microarchitecture, ser. MICRO 34. Washington, DC, USA: IEEE Computer Society, 2001, pp. 150–159. [Online]. Available: http://dl.acm.org/citation.cfm?id=563998.564019
- [3] K. Purna and D. Bhatia, "Temporal partitioning and scheduling data flow graphs for reconfigurable computers," *Computers, IEEE Transactions on*, vol. 48, no. 6, pp. 579 –590, jun 1999.
- [4] E. Nystrom and A. E. Eichenberger, "Effective cluster assignment for modulo scheduling," in *Proceedings of* the 31st annual ACM/IEEE international symposium on Microarchitecture, ser. MICRO 31. Los Alamitos, CA, USA: IEEE Computer Society Press, 1998, pp. 103–114. [Online]. Available: http://dl.acm.org/citation.cfm?id=290940.290966
- [5] J. Buck and E. Lee, *The Token Flow Model*. Advanced Topics in Dataflow Computing and Multithreading, Wiley IEEE Computer Society, 1995.
- [6] A. Udupa, R. Govindarajan, and M. J. Thazhuthaveetil, "Software Pipelined Execution of Stream Programs on GPUs," in CGO, 2009, pp. 200–209.
- [7] S. Sanyal and S. Das, "Match: Mapping data-parallel tasks on a heterogeneous computing platform using the cross-entropy heuristic," in *Parallel and Distributed Processing Symposium*, 2005. Proceedings. 19th IEEE International, april 2005, p. 64b.
- [8] A. Jain, S. Sanyal, S. K. Das, and R. Biswas, "Fastmap: A distributed scheme for mapping large scale applications onto computational grids," *Challenges of Large Applications in Distributed Environments, International Workshop on*, vol. 0, p. 118, 2004.

- [9] D. Ajwani, S. Ali, and J. Morrison, "Graph partitioning for reconfigurable topology," in *Parallel Distributed Processing Symposium (IPDPS)*, 2012 IEEE 26th International, may 2012, pp. 836 –847.
- [10] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, Mar 2004
- [11] M. Griebl, C. Lengauer, and S. Wetzel, "Code Generation in the Polytope Model," in *In IEEE PACT*. IEEE Computer Society Press, 1998, pp. 106–111.
- [12] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, "A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems," *J. Parallel Distrib. Comput.*, vol. 61, no. 6, pp. 810–837, Jun. 2001. [Online]. Available: http://dx.doi.org/10.1006/jpdc.2000.1714
- [13] S. S. Skiena, The Algorithms Design Manual, 2nd ed. Springer-Verlag London, 2008.
- [14] G. Karypis, V. Kumar, and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," *Journal of Parallel* and Distributed Computing, vol. 48, pp. 96–129, 1998.
- [15] T. G. Crainic, G. Perboli, W. Rei, and R. Tadei, "Efficient lower bounds and heuristics for the variable cost and size bin packing problem," *Comput. Oper. Res.*, vol. 38, no. 11, pp. 1474–1482, Nov. 2011. [Online]. Available: http://dx.doi.org/10.1016/j.cor.2011.01.001
- [16] E. Coffman, Jr., M. Garey, and D. Johnson, "An application of bin-packing to multiprocessor scheduling," *SIAM Journal* on *Computing*, vol. 7, no. 1, pp. 1–17, 1978. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/0207001
- [17] G. Karypis and V. Kumar, "Metis unstructured graph partitioning and sparse matrix ordering system, version 2.0," Tech. Rep., 1995.

- [18] K. Devine, E. Boman, L. Riesen, U. Catalyurek, and C. Chevalier, "Getting started with zoltan: A short tutorial," in Proc. of 2009 Dagstuhl Seminar on Combinatorial Scientific Computing, 2009, also available as Sandia National Labs Tech Report SAND2009-0578C.
- [19] C. Chevalier and F. Pellegrini, "Pt-scotch: A tool for efficient parallel graph ordering," *Parallel Comput.*, vol. 34, no. 6-8, pp. 318–331, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.parco.2007.12.001
- [20] J. Donald and M. Martonosi, "An Efficient, Practical Parallelization Methodology for Multicore Architecture Simulation," *IEEE Computer. Architecture. Letter.*, vol. 5, pp. 14–, July 2006.
- [21] M. I. Gordon, W. Thies, and S. Amarasinghe, "Exploiting coarse-grained task, data, and pipeline parallelism in stream programs," *SIGOPS Oper. Syst. Rev.*, vol. 40, pp. 151–162, October 2006. [Online]. Available: http://doi.acm.org/10.1145/1168917.1168877
- [22] G. C. Sih and E. A. Lee, "Declustering: A new multiprocessor scheduling technique," *IEEE Trans. Parallel Distrib. Syst.*, vol. 4, pp. 625–637, June 1993. [Online]. Available: http://portal.acm.org/citation.cfm?id=628910.629186
- [23] P. M. Carpenter, A. Ramirez, and E. Ayguade, "Mapping stream programs onto heterogeneous multiprocessor systems," in *Proceedings of the 2009 international* conference on Compilers, architecture, and synthesis for embedded systems, ser. CASES '09. New York, NY, USA: ACM, 2009, pp. 57–66. [Online]. Available: http://doi.acm.org/10.1145/1629395.1629406
- [24] W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A Language for Streaming Applications," in CC '02: 11th International Conference on Compiler Construction, London, UK, 2002, pp. 179–196.
- [25] E. Waingold, M. Taylor, V. Sarkar, V. Lee, W. Lee, J. Kim, M. Frank, P. Finch, S. Devabhaktumi, R. Barua, J. Babb, S. Amarsinghe, and A. Agarwal, "Baring it all to software: The raw machine," MIT, Cambridge, MA, USA, Tech. Rep., 1997.
- [26] W. Kohler, "A Preliminary Evaluation of the Critical Path Method for Scheduling Tasks on Multiprocessor Systems," *Computers, IEEE Transactions on*, vol. C-24, no. 12, pp. 1235 – 1238, dec. 1975.
- [27] T. L. Adam, K. M. Chandy, and J. R. Dickson, "A comparison of list schedules for parallel processing systems," *Commun. ACM*, vol. 17, no. 12, pp. 685–690, Dec. 1974. [Online]. Available: http://doi.acm.org/10.1145/361604.361619
- [28] H. Topcuoglu, S. Hariri, and M.-Y. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," *Parallel and Distributed Systems, IEEE Transactions on*, vol. 13, no. 3, pp. 260 –274, mar 2002.
- [29] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund, "Dynamic mapping of a class of independent tasks onto heterogeneous computing systems," *J. Parallel Distrib. Comput.*, vol. 59, no. 2, pp. 107–131, Nov. 1999. [Online]. Available: http://dx.doi.org/10.1006/jpdc.1999.1581
- [30] A. Doğan and F. özgüner, "Genetic algorithm based scheduling of meta-tasks with stochastic execution times in heterogeneous computing systemst1," *Cluster Computing*, vol. 7, no. 2, pp. 177–190, Apr. 2004. [Online]. Available: http://dx.doi.org/10.1023/B:CLUS.0000018566.13071.cb
- [31] Kumar, S. and Jantsch, A. and Soininen, J.-P. and Forsell, M. and Millberg, M. and Oberg, J. and Tiensyrja, K. and Hemani,

A., "A network on chip architecture and design methodology," in *VLSI*, 2002. *Proceedings. IEEE Computer Society Annual Symposium on*, 2002, pp. 105 –112.