# Efficient Parallel Mappings of a Dynamic Programming Algorithm: A Summary of Results\*

George Karypis and Vipin Kumar karypis@cs.umn.edu and kumar@cs.umn.edu Department of Computer Science University of Minnesota Minneapolis, MN 55414

### **Abstract**

In this paper we are concerned with Dynamic Programming (DP) algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem. Guibas, Kung and Thompson presented a systolic array algorithm for this problem that uses  $O(n^2)$  processing cells and solves the problem in O(n) time. We present three different mappings of this systolic algorithm on a mesh connected parallel computer. The first two mappings use commonly known techniques for mapping systolic arrays to mesh computers. Both of them are able to obtain only a fraction of maximum possible performance. The primary reason for the poor performance of these formulations is that different nodes at different levels in the multistage graph in the DP formulation require different amounts of computation. Any adaptation has to take this into consideration and evenly distribute the work among the processors. Our third mapping balances the work load among processors and thus is capable of providing efficiency approximately equal to 1 (i.e., speedup approximately equal to the number of processors) for any number of processors and sufficiently large problem. We experimentally evaluate these mappings on a mesh embedded onto a 256 processor nCUBE/2\*. It can be shown that our mapping can be used to efficiently map a wide class of two dimension systolic array algorithms onto mesh connected parallel computers.

### 1 Introduction

Dynamic programming (DP) is a widely used problem solving paradigm for optimization problems that is widely applied to a large number of areas including optimal control, industrial engineering, economics and artificial intelligence [3, 4, 13, 17]. Many practical problems involving a sequence of interrelated decisions can be efficiently solved by DP. The essence of many DP algorithms lies in computing solutions of the smallest subproblems and storing the results for usage in computing larger subproblems. Thus the solution to the original problem is constructed

in a bottom-up fashion. A natural method of parallelizing various DP algorithms is to assign the task of of solving different subproblems to different processors.

A DP formulation is expressed as a recursive functional equation whose left-hand side is an expression involving the maximization (or minimization) of values of some cost functions. Li and Wah [11], have developed a classification of DP programming schemes according to the form of the functional equations and the nature of the recursion. As it was shown in [11, 5] monadic-serial DP problems can be solved by a series of matrix-vector multiplication which is easy to parallelize [5]. On the other hand there is no general parallel formulation for polyadic-nonserial DP problems.

In this paper we are concerned with the polyadic-nonserial DP algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem [6]. Examples of these problems are: optimal triangularization of polygons, optimal binary search trees [6], and the CYK parser [1]. The serial complexity of these problems is  $O(n^3)$ . A number of parallel formulations have been proposed in [9] that use O(n) processors on a hypercube and solves the problem in  $O(n^2)$  time. A systolic array algorithm has been proposed in [7] that uses  $O(n^2)$  processing cells and solves the problem in O(n) time. Finally, there are some non-cost-optimal parallel formulations for PRAM machines that solve the problem in  $O(\log^2 n)$  time using  $n^6/\log n$  processors [14, 15].

The systolic algorithm for two dimension systolic arrays can be directly mapped onto a mesh connected parallel computer by assigning each cell to a different processor. This mapping leads to poor utilization, because in general purpose parallel computers, the communication cost for sending a unit message is much higher than unit computations. As shown in Ibara, Pong and Sohn [10], this problem can be corrected by assigning a block of of cells to each processor. Now the computation at each processor becomes proportional to the area of the block, and communication becomes proportional to the periphery. By choosing big enough block sizes, the ratio of communication to computation at each processor can be made arbitrarily small.

In this paper, we present three different formulations of the systolic algorithm [7] on a mesh connected parallel computer. The first formulation is a mapping of the systolic algorithm on a two dimension mesh computer along the lines proposed in [10]. This formulation results an upper

<sup>\*</sup>This work was supported by IST/SDIO through the Army Research Office grant #28408-MA-SDI and by the United States Army Research Office, Contract Number DAAL03-89-C-0038 at the University of Minnesota Army High Performance Computing Research Center.

<sup>\*</sup>nCUBE/2 is a registered trademark of nCUBE Corporation.

bound on the efficiency equal to 1/12 for sufficiently large number of processors. The second formulation is a slightly modified version of the first scheme but also has an upper bound of 1/3 in efficiency. The primary reason for the poor performance of these formulations is that different nodes at different levels in the multistage graph require different amounts of computation. Any adaptation has to take this into consideration and evenly distribute the work among the processors. The third formulation uses a mapping that balances the work load among processors and thus is capable of providing efficiency approximately equal to 1 for any number of processors and sufficiently large problem. We present a theoretical analysis of these mappings and experimentally evaluate them on a mesh embedded onto a 256 processor nCUBE/2.

This paper is organized as follows: Section 2 and 3 present an overview of the dynamic programming algorithm and the available parallel formulations. Section 4 and 5 present and analyze our various mappings of the systolic algorithm onto a mesh parallel computer. Section 6 presents experimental results, and finally Section 7 provides some concluding remarks.

## 2 The Parenthesization Problem and the Dynamic Programming Algorithm

The parenthesization and other isomorphic problems, can be efficiently solved using a dynamic programming algorithm [6]. Let c(i,j) be the cost of multiplying the matrices  $A_i, A_{i+1}, \ldots A_j$ . The dynamic programming paradigm constructs the solution to this problem based on the solution of its subproblems. This approach gives rise to the following recurrence relation for the parenthesization problem:

$$c(i,j) = \left\{ \begin{array}{l} \min\limits_{1 \leq i \leq k < j \leq n} \{c(i,k) + c(k+1,j) + r_{i-1}r_kr_j\} \\ 0, \qquad 0 \leq i = j \leq n \end{array} \right.$$

where matrix  $A_i$  has  $r_{i-1}$  rows and  $r_i$  columns. Given equation (1) the problem reduces to finding the value for c(1, n).

The solution to this recurrence relation, equation (1), is obtained by a bottom-up approach. An auxiliary table C[n][n] is used for storing the values of c(i,j) and an other one S[n][n] for storing the optimal indices for k. The algorithm fills in the tables C and S in a manner that corresponds to solving the parenthesization problem on matrix chains of increasing length. We can graphically visualize this if we think of filling in the tables in a diagonal order (see Figure 1). This concept of diagonal oriented computations will be extensively used in the rest of this paper. For a more detailed description refer to [6]. The complexity of this algorithm is  $n^3/6$  for large enough n.

# 3 Parallel Formulations of the Dynamic Programming Algorithm

The dynamic programming algorithm for the parenthesization problem can be easily parallelized using a linear array of p processors where  $1 \le p \le n$ . This linear array formulation will compute successive diagonals of matrix C at successive steps. If there are I nodes in a diagonal, we assign I/p nodes to each of the p processors. Each processor computes the cost of the entries c(i,j) assigned to it. This is followed by an all-to-all broadcast [5] during which solution costs of the subproblems at that diagonal

are made known to all the processors. Since each processor has complete information about subproblem costs at preceding diagonals, no communication is needed other than the all-to-all broadcast. The cost of performing the all-to-all broadcast of O(n/p) information among p processors is O(n) hence, The runtime of this formulation is  $O(n^3/p) + O(n^2)$ , where  $O(n^3/p)$  is the time spent in computation, and  $O(n^2)$  communication time. If n is sufficiently larger than p, then the communication time can be made to be an arbitrarily small fraction of the computation time, and linear speedups can be obtained. An alternative mapping was proposed by Ibara, Pong and Sohn in the context of the CYK parser [9]. Their formulation uses p = O(n) processors, connected in a hypercube topology, and solves the problem in  $O(n^3/p)$  time, which is cost optimal. The formulation of Ibara et.al. has properties similar to the formulation for linear array mentioned above. Both formulations are efficient only if p is sufficiently smaller

A faster formulation can be achieved using n(n+1)/2 processors on a PRAM machine. In this mapping each processor computes an entry c(i,j) of the matrix C. From equation 1, it can be shown that having finished diagonal t, we can perform some computations on the subsequent t+1 diagonals. Thus, the work in diagonal n can start when diagonal n/2 has been computed. Furthermore, we know that entries in diagonal n require n computations; hence, the runtime of this formulation is given by the recurrence relation: T(n) = T(n/2) + n, whose solution for sufficiently large n is: T(n) = 2n. The exact processor-time product of the PRAM formulation is  $n(n+1)/2 \times 2n \approx n^3$ ; hence, even though the PRAM algorithm is significantly faster, it does 6 times more work than the sequential algorithm therefore its efficiency is only 0.167.

Guibas, Kung and Thomson [7] have developed a systolic algorithm for the parenthesization problem. Their algorithm uses n(n+1)/2 processing elements (cells) connected as a two dimension systolic array (TSA) as shown in Figure 1, and solves the problem in essentially the same time as the PRAM algorithm outlined above. For the rest of this paper we will refer to this algorithm as GKT. A brief description of the algorithm follows. For a more detailed description the reader should refer to [7].

The inputs c(i,i) are applied in parallel to the cells with coordinates (i,i) and each cell (i,j) computes c(i,j). If a cell is computing an element of diagonal t, then its result is ready at time 2t. At that moment the cell starts transmitting its result upwards and to the right. The result travels along both directions by moving by one cell per time unit for t additional units. From that moment until eternity the result moves a cell every two time units. During a time unit a cell (i,j) will receive results for previous subproblems. If the new results improve the cost, they replace the currently held values. At each time unit a cell receives at most two sets of results from smaller subproblems, hence it has to perform at most two sets of computations.

The purpose of this paper is to investigate a number of possible mappings of the GKT algorithm onto a mesh connected parallel computer [2] with p processors. A mesh connected parallel computer has a structure similar to that of the TSA; hence, the mapping of the TSA algorithm onto a mesh can be done in a natural way. Furthermore, we will assume that the mesh connected parallel computer has



Figure 1: GKT algorithm for a TSA

wrap-around communication links. This is done merely to simplify the presentation and is not required by any of our proposed mappings.

# 4 Mapping the Systolic Algorithm onto a Mesh Parallel Computer

In the GKT algorithm results are communicated at two different speeds (either once every time unit or twice every time unit). This guarantees that results arrive at a cell when this cell is ready to use them. This is important for systolic algorithms, as a systolic array is supposed to have only a small amount of memory at each cell. General purpose processors have substantial amounts of memory which can be used to store results arriving at earlier times. In our mappings, messages are transmitted with no delays. When messages are received at a processor they are stored in local memory until they are used.

Despite the above simplification, mapping the GKT algorithm onto a mesh connected parallel computer poses a number of problems. Direct implementation of the systolic algorithm (i.e., use of n(n+1)/2 processors) will lead to an inefficient algorithm and underutilization of the parallel computer. This is because in general purpose parallel computers, the cost of sending an element to another processor is much higher than the cost of performing the computations associated with that element. For example in nCUBE/2, the cost of sending one element is  $180\mu s$  while the cost of performing a computation is  $4\mu s$  for the parenthesization problem.

A solution to this problem is to map more than one TSA cell onto a single mesh processor. The computations associated with each processor, in a time step, is usually proportional to the number of cells assigned to it, while the communication is proportional to the number of cells it has at the boundary. By varying the number of processors, we can adjust the cost of communication to computation and hence obtain an efficient parallel formulation.

Furthermore, an efficient mapping has to keep as many processors doing useful work as possible. Due to the nature of the GKT algorithm, the computations will move in a wavefront form within the TSA. At any given time in the execution of the algorithm, just a band of diagonal cells will be performing computations, while the remaining cells will either have finished their share of work or will be waiting to

receive results for subproblems currently being computed. This computational pattern will lead to cells sitting idle at various points of the execution of the algorithm, and depending on the mapping might lead to processors sitting idle as well.

Finally, different cells in the GKT algorithm will perform different amounts of computation. Each TSA cell will compute an entry c(i,j). The amount of computation required is proportional to the diagonal that c(i,j) belongs to. The higher the diagonal (i.e., greater the value of (j-i)), the higher the amount of required computation. Hence, even though we might be able to map the same number of cells onto each processor, the computations required may vary significantly.

Given these criteria for efficient mappings of the GKT algorithm onto a mesh connected parallel computer, we present and analyze three different mappings that address these issues to different degrees. We assume that the mesh has  $\sqrt{p} \times \sqrt{p}$  processors, n is a multiple of  $\sqrt{p}$ , and  $P_{i,j}$  is the processor of the  $i^{th}$  row and  $j^{th}$  column. These mappings are described in the following sections.

# 4.1 Checkerboarding Mapping (CM)

An intuitive and straight forward mapping is to group blocks of  $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$  cells together and map them onto the same processor. Because of the triangular shape of the TSA, processors  $P_{i,j}$  for i>j will not be assigned any TSA cells while, the processors  $P_{i,i}$  will be assigned only  $\frac{n}{2\sqrt{p}}(\frac{n}{\sqrt{p}}+1)$  cells. Figure 2 illustrates this mapping. This mapping scheme is often called checkerboarding and it has been used in a number of applications [8, 5, 10]. the memory requirements at each processor is  $O(\frac{n^2}{\sqrt{p}})$ .



Figure 2: Checkerboarding Mapping

The only TSA cells that need to communicate with the surrounding processors are those along the periphery of the block where for every diagonal received or computed each processor can perform computations on a number of diagonals residing on it. Hence, during each step, the computation performed is  $O(n^2/p)$  while the communication is  $O(n/\sqrt{p})$ .

However, checkerboarding mapping has a number of limitations. It maps cells only to  $\sqrt{p}(\sqrt{p}+1)/2$  processors, and thus the remaining  $\sqrt{p}(\sqrt{p}-1)/2$  processors are idle all the time. Also due to the nature of the algorithm, the band of diagonal entries being computed will reside on a small number of adjacent diagonals of mesh processors. For example, after computing diagonal t we can perform computations on the following  $\min(t+1,n-t)$  diagonals. Hence, during that time processors on these diagonals processors will be performing computations while the remaining processors will either have finished their work or will be waiting to receive diagonals that are currently being computed. Finally, because computations associated with a cell increases as the number of the diagonal containing this cell increases, different processors will have different work loads even though they have the same number of TSA cells.

# 4.2 Modified Checkerboarding Mapping (MCM)

One of the limitations of the checkerboarding mapping is that it maps no work to about half of the processors. The modified checkerboarding solves this problem and at the same time it preserves the communication properties of CM.

The modified checkerboarding mapping is achieved as follows: First we partition the TSA into 3 blocks  $L_1$ ,  $L_2$  and  $L_3$  as illustrated in Figure 3. Block  $L_3$  is partitioned in a checkerboarding fashion into blocks each having  $\frac{n}{2\sqrt{p}} \times \frac{n}{2\sqrt{p}}$  cells, and is mapped onto the processor mesh. Blocks  $L_1$  and  $L_2$  are rotated  $180^o$  about their common boundary with block  $L_3$  and then are mapped onto the processor mesh in a checkerboarding fashion. An alternative way of visualizing this mapping is to think that first the TSA is being folded along the common boundaries  $L_1 - L_3$  and  $L_2 - L_3$  and then the  $\frac{n}{2} \times \frac{n}{2}$  square obtained, is mapped onto the  $\sqrt{p} \times \sqrt{p}$  processor mesh in a checkerboarding fashion.

This mapping guarantees that all the processors will have the same number of cells  $\frac{n^2}{2p}$ , with the exception of the processors  $P_{i,\sqrt{p}+1-i}$  for  $1 \le i \le \sqrt{p}$  that have  $\frac{n}{2\sqrt{p}}$  additional cells. Also, it can be easily seen that the work mapped to processors on the same row is roughly the same (with the exception of the diagonal processors). The same statement though, doesn't hold for the processors along the same column. The processors at the first row do more work than those at the second, and so forth. For example, each cell of a processor at the first row belongs to a diagonal that is by  $\frac{n}{2\sqrt{p}}$  higher than the corresponding cell of the processor at the second row. Thus, the processors at the first row has to perform  $\frac{n^3}{4p\sqrt{p}}$  more computations. Simi-

larly, the processors at the second row has to perform  $\frac{n^3}{4p\sqrt{p}}$  more computations than the processors at the third row, and so forth. Hence, even though modified checkerboarding utilizes p processors, it does not eliminate work load imbalances.



| (1,5) (1,6) | (1,4) (1,3) | (1,7) (1,8) | (1,2) (1,1) |
|-------------|-------------|-------------|-------------|
| (2,5) (2,6) | (2,4) (2,3) | (2,7) (2,8) | (2,2)       |
|             |             | İ           |             |
|             |             | (8,8)       |             |
|             |             | (7,7) (7,8) |             |
| (3,5) (3,6) | (3,4) (3,3) | (3,7) (3,8) |             |
| (4,5) (4,6) | (4,4)       | (4,7) (4,8) |             |
|             |             |             |             |
| (6,6)       |             | (5,7) (5,8) |             |
| (5,5) (5,6) |             | (6,7) (6,8) |             |

Figure 3: Modified Checkerboarding Mapping

Shuffling Mapping (SM)

The two mappings proposed so far didn't fully address the various issues involved in efficiently mapping the TSA algorithm onto a mesh connected parallel computer. Even though communication locality was a property of both the checkerboarding and the modified checkerboarding mappings, work load was unevenly distributed among the processors. Here we present a different mapping that preserves the communication characteristics of checkerboarding and at the same time evenly distributes the work among the processors.

This new mapping maps successive rows and columns of cells onto successive rows and columns of the mesh respectively. In particular, the c(i,j) cell of the TSA is mapped onto the  $(((i-1) \bmod \sqrt{P}) + 1, ((j-1) \bmod \sqrt{P}) + 1)$  processor of the mesh. The above definition requires a wrap around mesh but there is an alternative way of mapping the TSA that eliminates this requirement. We can think of the TSA as being partitioned into columns each containing  $\sqrt{p}$  consecutive cells. Then these 'fat' columns are being folded along their common boundaries and a column containing  $\sqrt{p}$  cells of depth  $\frac{n}{\sqrt{p}}$  is obtained. This column is again being partitioned into rows each having  $\sqrt{p}$  cells and these rows are being folded along their common boundaries. The resulting  $\sqrt{p} \times \sqrt{p}$  block is then mapped onto the

processor mesh. This mapping maps either  $\frac{n}{\sqrt{p}}(\frac{n}{\sqrt{p}}+1)$  or  $\frac{n}{\sqrt{p}}(\frac{n}{\sqrt{p}}-1)$  TSA cells onto a mesh processor. Even though adjacent rows and columns of the TSA are being mapped onto adjacent rows and columns of the processor mesh, the amount of communication performed is similar to the checkerboarding scheme. This is because when cell c(i,j) sends its results to cell c(i,j+1) then the result is also received by cells  $c(i,j+1+k\sqrt{p})$  for  $k=1,2,3,\ldots$  This mapping is illustrated in Figure 4. For the rest of this paper this mapping wall be referred to as shuffling. A variation of this mapping was used in the context of shortest path on sparse graphs in [16].



| (1,1) (1,3) (1,5) (1,7) (3,3) (3,5) (3,7) | (1,2) (1,4) (1,6) (1,8) (3,4) (3,6) (3,8) |
|-------------------------------------------|-------------------------------------------|
| (5,5) (5,7)                               | (5,6) (5,8)                               |
| (7,7)                                     | (7,8)                                     |
| (2,3) (2,5) (2,7)                         | (2,2) (2,4) (2,6) (2,8)                   |
|                                           |                                           |
| (4,5) (4,7)                               | (4,4) (4,6) (4,8)                         |
| (4,5) (4,7)<br>(6,7)                      | (4,4) (4,6) (4,8)<br>(6,6) (6,8)          |

Figure 4: Shuffling mapping of a  $8 \times 8$  TSA onto a  $2 \times 2$ 

mesh. Note that both in CM and MCM mappings, each processor is assigned portions of consecutive diagonals where in SM each processors is assigned portions of diagonals that are  $\sqrt{p}$  apart. As we know, the amount of work required to compute a diagonal increases as the diagonal increases; thus, in the CM and MCM mappings the processors having higher diagonals will do more work than those having lower ones. On the other hand, in SM, each processor is assigned an equal number of low and high diagonals, thus the work allocated to each processor doesn't vary significantly. Furthermore, because consecutive TSA rows and columns reside on consecutive mesh rows and columns, the

processors will start working at an earlier time compared with either checkerboarding or modified checkerboarding mappings.

5 Analysis of the Various Mappings

We analyzed the performance of all the three different mappings. Due to space limitations, in this section, we will only present a summary of our analytical results. The reader should refer to [12] for a detailed analysis. For reasons, discussed in the previous section, checkerboarding mapping does not manage to evenly distribute the work among the processors. As a result of that, the maximum obtained efficiency of CM is smaller than 1. The attainable efficiency depends on the number of processors, and as p increases it approaches 1/12. Thus, the efficiency of the checkerboarding mapping is bounded by 1/12 for sufficiently large p. For similar reasons, the efficieny of the modified checkerboarding mapping is bouned by 1/3 for sufficiently large p. Clearly, MCM provides a significant improvement over CM, but still it only utilizes 1/3 of the processors efficiently. The dependence on the number of processors and the upper bound on the efficiency of CM and MCM is illustrated in Table 1. On the other hand, as nincreases, the efficiency of the shuffling mapping increases approaching 1; hence, shuffling yields a cost optimal parallel formulation. As pointed out in the previous section, the upper bounds on the efficiency of CM and MCM is due to poor work load balancing while, shuffling mapping manages to evenly distribute the work among the processors.

|     | p | 4    | 16   | 64   | 256  | 1024 | 4096  |
|-----|---|------|------|------|------|------|-------|
| CM  | E | 0.33 | 0.17 | 0.13 | 0.10 | 0.09 | 0.083 |
| MCM | E | 0.72 | 0.53 | 0.43 | 0.39 | 0.36 | 0.33  |

Table 1: Analytical efficiency upper bounds for CM and MCM formulations.

#### 6 Experimental Results

We implemented all three mappings of the GKT algorithm presented in Section 4 on an nCUBE/2 parallel computer. nCUBE/2 is a hypercube connected parallel computer and a well known mapping [2] was used to embed a wrap around mesh on it.

Å large number of experiments were made with different values of p and n. Some of these results are shown in Table 2. In calculating these efficiencies, we used the runtime of the serial algorithm on one processor, as the amount of work W

From the results shown in Table 2, we can clearly see how the various mappings perform. The shuffling mapping does significantly better than either the checkerboarding or the modified checkerboarding mappings. For all schemes, the efficiencies increase with higher n, as the overheads due to communication and idling become a smaller fraction of the actual work. For SM, the efficiency goes all the way upto 1 (with increasing problem size), but for CM and MCM, it saturates at a smaller value than 1 as predicted by the analysis presented in Section 5. The saturation points for CM and MCM become smaller for larger number of processors as predicted by the analysis. Comparing the points where the efficiency saturates at Table 2 with the theoretical upper bounds in the efficiencies Table 1 we see that they are very close. In particular, for p = 16, the predicted upper bound for CM and MCM are .17 and .53

respectively, which are quite close to the observed values .15 and .52. Similar statements are true for p=64 and p=256.

| $p\downarrow$ | $n \rightarrow$ | 200 | 280  | 360  | 520  | 600  | 2000 |
|---------------|-----------------|-----|------|------|------|------|------|
|               | CM              | .15 | .15  | .15  | .15  | .15  | .15  |
| 16            | MCM             | .45 | .47  | .49  | .52  | .52  | .52  |
|               | SM              | .63 | .74  | .80  | .86  | .89  | .97  |
|               |                 |     |      |      |      |      |      |
|               | $n \rightarrow$ | 400 | 560  | 720  | 1040 | 1200 | 4000 |
|               | CM              | .11 | .11  | .11  | .11  | .11  | .11  |
| 64            | MCM             | .34 | .37  | .39  | .42  | .42  | .42  |
|               | SM              | .57 | .69  | .75  | .82  | .86  | .96  |
|               |                 |     |      |      |      | •    |      |
|               | $n \rightarrow$ | 800 | 1120 | 1440 | 2080 | 2400 | 8000 |
|               | CM              | .09 | .09  | .09  | .09  | .09  | .09  |
| 256           | MCM             | .28 | .31  | .33  | .38  | .38  | .38  |
|               | SM              | .50 | .64  | .69  | .78  | .83  | .95  |

Table 2: Efficiencies of the various mappings.

#### 7 Conclusions

This paper presents a mapping of a two dimension systolic array on a mesh connected parallel computer that balances work among processors and minimizes communication costs for a class of systolic algorithms. In mapping of systolic algorithms similar to the parenthesization problem, it is particularly important that work be evenly distributed. For example, checkerboarding and modified checkerboarding mappings yield poor performance even if we assume that idling and communication time is zero. On the other hand shuffling evenly distributes the work among the mesh processors and yields efficiency approaching 1 for large enough problems. It can be shown that the shuffling mapping can be used to efficiently map a wide class of TSA algorithms onto mesh connected parallel computers. In particular, any TSA algorithm where the inputs to a cell are forwarded with no changes can be mapped efficiently onto a mesh parallel computer using this mapping.

#### References

- A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation and Compiling, Vol. 1, Parsing. Englewood Cliffs, NJ Prentice Hall, 1972.
- [2] S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall. 1989.
- [3] R. Bellman and S. Dreyfus. Applied Dynamic Programming. Princeton University Press, Princeton, NJ, 1962.
- [4] U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, NY, 1972.
- [5] D. P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
- [6] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. *Introduction to Algorithms*. MIT Press, McGraw-Hill, 1990.
- [7] L. J. Guibas, H. T. Kung, and C. D. Thompson. Direct VLSI Implementation of Combinatorial Algorithms.

- In Proceedings of Conference on Very Large Scale Integration, California Institute of Technology, pages 509–525, 1979.
- [8] Anshul Gupta and Vipin Kumar. On the scalability of Matrix Multiplication Algorithms on Parallel Computers. Technical Report TR 91-54, Computer Science Department, University of Minnesota, Minneapolis, MN 55455, 1991.
- [9] Oscar H. Ibara, Ting-Chuen Pong, and Stephen M. Sohn. Parallel Recognition and Parsing on the Hypercube. IEEE Transactions on Computers, 40(6):764– 770, June 1991.
- [10] Oscar H. Ibara and Stephen M. Sohn. On Mapping Systolic Algorithms onto the Hypercube. IEEE Transaction on Parallel and Distributed Systems, 1(1):48– 63, January 1990.
- [11] Guo jie Li and Benjamin W. Wah. Parallel Processing of Serial Dynamic Programming Problems. In Proc. COMPSAC 85, pages 81–89, 1985.
- [12] George Karypis and Vipin Kumar. Efficient Parallel Mappings of a Dynamic Programming Algorithm. Technical Report 92-59, University of Minnesota, Computer Science Departement, October 1992.
- [13] H. Ney. Dynamic programming as a technique for Pattern Recognition. In Proc. 6th Intl. Conf. Pattern Recognition, pages 1119 – 1125, Oct. 1982.
- [14] Wojciech Rytter. On Efficienct Parallel Computations for Some Dynamic Programming Problems. Theoretical Computer Science, 59:297–307, 1988.
- [15] L. Valiant, S. Skyum, S. Berkowitz, and C. Rackoff. Fast Parallel Computation of Polynomials Using Few Processors. SIAM J. Computing, 12(4), 1983.
- [16] Kumiko Wada and Nobuyuki Ichiyoshi. A Distributed Shortest Path Algorithm and Its Mapping on the Multi-PSI. In Proceedings of Distributed Massively Concurent Computers., 1989.
- [17] D. White. Dynamic Programming. Oliver and Boyd, Edinburgh, England, 1969.