## HW2

Group: Thread-Titans

Group Members: Alex, Bashar, Fizza, Maya, Ramin 

## Exercise 2.6
Prove that $E = 1$ implies that all processors are active all the time.
(Hint: suppose all processors finish their work in time $T$, except for one processor in $T' < T$. What is $T_p$ in this case? Explore the above relations.)


To prove that $E = 1$ implies that all processors are active all the time, we start with the definition of efficiency:

$$E_p = \frac{S_p}{p} = \frac{T_1 / T_p}{p}$$

If $E_p = 1$, then:

$$\frac{T_1 / T_p}{p} = 1$$

which simplifies to:

$$T_p = \frac{T_1}{p}$$

This is the ideal case where the speedup $S_p$ equals $p$, i.e., the work is divided perfectly among the $p$ processors.

Now suppose, for contradiction, that $E_p = 1$ but one processor finishes its work early, at time $T' < T$, while the other $p - 1$ processors work until time $T$. The parallel run time is determined by the slowest processor, so $T_p = T$. A single processor can do all the work one piece after another, so the sequential time is at most the total work done by all processors:

$$T_1 \le (p - 1)T + T' < pT$$

Hence:

$$E_p = \frac{T_1}{p \, T_p} \le \frac{(p - 1)T + T'}{pT} = 1 - \frac{T - T'}{pT} < 1$$

which contradicts $E_p = 1$. The idle time $T - T'$ of even a single processor is enough to push the efficiency below 1.

If $E_p = 1$, then there is no such idle time and every processor is doing useful work at every moment until completion. Therefore, all processors must be active the entire time.
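
The argument can be checked numerically. The following is a minimal sketch, assuming no parallel overhead, so that $T_1$ is the sum of each processor's busy time and $T_p$ is the busy time of the slowest processor. The function name and the sample times are illustrative only.

```python
def efficiency(busy_times):
    """Efficiency of a run in which processor i is busy for busy_times[i].

    Assumption: no parallel overhead, so T_1 is the sum of all busy
    times and T_p is the time of the slowest processor.
    """
    p = len(busy_times)
    t_1 = sum(busy_times)  # total work, as done by a single processor
    t_p = max(busy_times)  # the parallel run ends with the slowest processor
    return t_1 / (p * t_p)


# All four processors busy until T = 10.
print(efficiency([10.0, 10.0, 10.0, 10.0]))
# One processor finishes at T' = 6 and idles for T - T' = 4.
print(efficiency([10.0, 10.0, 10.0, 6.0]))
```

The first call gives an efficiency of 1, while the second gives $36/40 = 0.9$, which matches $1 - (T - T')/(pT)$.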






## Exercise 2.10

Show that, with the scheme for parallel addition just outlined, you can multiply two matrices in $\log_2 N$ time with $N^3/2$ processors. What is the resulting efficiency?

The standard algorithm for multiplying two $N \times N$ matrices requires $O(N^3)$ operations. Each element $C_{ij}$ of the result matrix is computed as:

$$C_{ij} = \sum_{k=1}^{N} A_{ik} B_{kj}$$

which involves summing $N$ terms. Instead of summing sequentially in $O(N)$ time, we use the parallel sum strategy: in each step, disjoint pairs of values are summed in parallel, one processor per pair. The number of remaining values halves at each step, so $N$ terms are summed in $\log_2 N$ steps using $N/2$ processors. Thus, summing $N$ terms takes $O(\log_2 N)$ time.
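
The pairwise scheme can be sketched in a few lines. This version runs sequentially and only counts the steps a parallel machine would need; the function name is hypothetical and the input is assumed to be non-empty.

```python
def tree_sum(values):
    """Sum a non-empty sequence pairwise; return (total, parallel steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # The pair sums within one level are independent of each other,
        # so a parallel machine could compute them all in a single step.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # an unpaired value carries over unchanged
        level = paired
        steps += 1
    return level[0], steps


print(tree_sum(range(1, 9)))  # (36, 3): eight terms need log2(8) = 3 steps
```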

To see why $N^3/2$ processors suffice, note that matrix multiplication involves computing all $N^2$ elements of $C$. Each element requires $N$ multiplications and $N-1$ additions, which gives about $N^3$ multiply-add operations in total. The tree summation for one element needs $N/2$ processors in its first step (one per pair of terms) and fewer in every later step, so all $N^2$ elements together need $N^2 \cdot N/2 = N^3/2$ processors. The $N^3$ products can be formed by the same processors in constant time, since each processor computes the two products it is about to add. Thus, $N^3/2$ processors are enough to parallelize both the multiplications and the summation.

Since the slowest step in this approach is summation, and we use the parallel sum strategy, the total execution time for matrix multiplication is $O(\log_2 N)$. Efficiency is given by:

$$E = \frac{\text{sequential time}}{\text{parallel time} \times \text{number of processors}}$$

Sequential matrix multiplication takes about $N^3$ time units, while our parallel version takes about $\log_2 N$ time units on $N^3/2$ processors. This gives:

$$E \approx \frac{N^3}{\log_2 N \times N^3/2} = \frac{2}{\log_2 N}$$

so, ignoring constant factors, $E = O(1/\log_2 N)$.

This efficiency decreases as $N$ grows and tends to zero, because after the first summation step at least half of the processors are idle, and the idle share doubles with every further step.
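
To get a feel for how quickly the efficiency drops, the estimate can be evaluated for a few matrix sizes. This is a rough sketch under the same counting as above ($N^3$ sequential time units, $\log_2 N$ parallel steps, $N^3/2$ processors); constant factors are ignored and the function name is hypothetical.

```python
import math


def matmul_tree_efficiency(n):
    """Estimated efficiency for n x n matrices, n a power of two with n >= 2."""
    sequential_time = n ** 3       # operation count used in the text
    processors = n ** 3 // 2
    parallel_time = math.log2(n)   # number of tree-summation steps
    return sequential_time / (processors * parallel_time)


for n in (16, 256, 4096):
    print(n, matmul_tree_efficiency(n))
```

Each squaring of $N$ doubles $\log_2 N$ and therefore halves the estimated efficiency.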

## Exercise 2.11

Let’s do a specific example. Assume that a code has a setup that takes 1 second and a parallelizable section that takes 1000 seconds on one processor. What are the speedup and efficiency if the code is executed with 100 processors? What are they for 500 processors? Express your answer to at most two significant digits.


Let the total execution time on one processor be $T_1 = 1 + 1000 = 1001$ seconds. The sequential fraction of the code is $F_s = \frac{1}{1001}$, and the parallelizable fraction is $F_p = \frac{1000}{1001}$. To compute the speedup and efficiency for 100 processors, we use Amdahl’s Law: 

$$ T_P = T_1(F_s + F_p/P)$$
$$ T_{100} = 1001 \left( \frac{1}{1001} + \frac{1000}{1001 \times 100} \right) = 11$$

The speedup is:

$$ S_{100} = \frac{T_1}{T_{100}} = \frac{1001}{11} = 91$$

and the efficiency is:

$$ E_{100} = \frac{S_{100}}{P} = \frac{91}{100} = 0.91$$

For 500 processors, we similarly compute:

$$ T_{500} = 1001 \left( \frac{1}{1001} + \frac{1000}{1001 \times 500} \right) = 3$$

The speedup is:

$$ S_{500} = \frac{T_1}{T_{500}} = \frac{1001}{3} \approx 333.67 $$

and the efficiency is:

$$ E_{500} = \frac{S_{500}}{P} = \frac{333.67}{500} \approx 0.67$$

Rounded to two significant digits: for 100 processors, the speedup is 91 and the efficiency is 0.91, while for 500 processors, the speedup is approximately 330 and the efficiency is approximately 0.67.
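
The same numbers can be reproduced with a short script. This is a minimal sketch of Amdahl's law as used above, with the setup time and the parallelizable time passed in seconds; the function name is hypothetical.

```python
def amdahl(t_seq, t_par, p):
    """Return (T_P, speedup, efficiency) on p processors.

    t_seq is the sequential (setup) time and t_par the time of the
    parallelizable section on one processor, both in seconds.
    """
    t_1 = t_seq + t_par
    t_p = t_seq + t_par / p  # only the parallel section is divided by p
    speedup = t_1 / t_p
    return t_p, speedup, speedup / p


for p in (100, 500):
    t_p, s, e = amdahl(1.0, 1000.0, p)
    print(f"P={p}: T_P={t_p:.0f} s, S={s:.2f}, E={e:.2f}")
```

Going from 100 to 500 processors cuts the run time from 11 s to 3 s, but the fixed 1 s of setup now accounts for a third of the run, which is why the efficiency drops.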

## Exercise 2.12

Investigate the implications of Amdahl’s law: if the number of processors $P$ increases, how does the parallel fraction of a code have to increase to maintain a fixed efficiency?

Consider the speedup and efficiency of a parallel system as given by Amdahl’s law:

$$ S_P = \frac{1}{F_s + \frac{F_p}{P}} $$

The efficiency $E_P$ is defined as:

$$ E_P = \frac{S_P}{P} = \frac{1}{P \left( F_s + \frac{F_p}{P} \right)} $$

To maintain a fixed efficiency as $P$ increases, the denominator $P F_s + F_p$ must stay constant. Substituting $F_p = 1 - F_s$ gives

$$ E_P = \frac{1}{1 + (P - 1) F_s} $$

so for a fixed efficiency $E$ the two fractions must satisfy

$$ F_s = \frac{1 - E}{E (P - 1)}, \qquad F_p = 1 - \frac{1 - E}{E (P - 1)} $$

The sequential fraction therefore has to shrink like $1/P$, and the parallel fraction has to approach 1 as $P$ grows.

If instead $F_s$ stays constant, then for large $P$ the term $\frac{F_p}{P}$ becomes small compared to $F_s$, the speedup is bounded by $1/F_s$, and the efficiency falls off as $E_P \approx \frac{1}{P F_s}$.

In a nutshell, adding processors lowers the efficiency unless the parallel fraction $F_p$ also increases. In terms of work, the ratio of parallel to sequential work is $\frac{F_p}{F_s} = \frac{E (P - 1)}{1 - E} - 1$, so for a fixed sequential part the parallelizable part of the code must grow in proportion to $P$ to keep the efficiency constant.
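
The required parallel fraction can be tabulated for a target efficiency. This is a minimal sketch, assuming $F_s + F_p = 1$, in which case the efficiency formula solves to $F_s = (1 - E) / (E (P - 1))$. The function name is hypothetical.

```python
def required_parallel_fraction(efficiency, p):
    """Parallel fraction F_p needed for the given efficiency on p processors.

    Assumes Amdahl's law with F_s + F_p = 1. The target must lie between
    1/p (fully sequential code) and 1 (fully parallel code).
    """
    if p < 2 or not (1.0 / p <= efficiency <= 1.0):
        raise ValueError("need p >= 2 and 1/p <= efficiency <= 1")
    f_s = (1.0 - efficiency) / (efficiency * (p - 1))
    return 1.0 - f_s


for p in (10, 100, 1000, 10000):
    print(p, required_parallel_fraction(0.9, p))
```

For $E = 0.91$ and $P = 100$ this returns $1000/1001$, the parallel fraction of the code in Exercise 2.11.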
