# CS 4820 Algorithms

This is a study note for CS 4820: algorithms. It is meant to be a summary of crucial materials in the course. This was the first time that I wrote notes like this in English.

## 1 Greedy Algorithms

Greedy algorithms rely on local reasoning: at each step you make the choice that looks best at that moment and never revisit it. If the sequence of locally optimal choices always adds up to a globally optimal solution, then the problem can be solved greedily.

### 1.1 Problem: Interval Scheduling

#### The Problem

There are n requests, each occupying a time interval that starts at s(i) and ends at f(i). Only one request can be processed at a time. Find the maximum number of requests that can be scheduled so that no two chosen intervals overlap.

#### The Design

Sort the requests by finishing time. Repeatedly choose the request with the smallest f(i) among those compatible with the requests already chosen.

#### The Analysis

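Proof: By the "Stays Ahead" method. We prove by induction that, for every i, the i-th interval chosen by the greedy algorithm finishes no later than the i-th interval of any other valid schedule. Then we prove by contradiction that the greedy schedule is optimal: if an optimal schedule had more intervals, its next interval would also have been available to the greedy algorithm.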

Run Time: O(n log n), from sorting.
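The greedy rule can be sketched in a few lines of Python. This is a minimal illustration, not course code: the function name `schedule_intervals` and the `(start, finish)` tuple format are assumptions, and two requests that merely touch at an endpoint are treated as compatible.

```python
def schedule_intervals(requests):
    """Greedy interval scheduling on a list of (start, finish) pairs."""
    chosen = []
    last_finish = float("-inf")
    # Consider requests in order of increasing finishing time.
    for start, finish in sorted(requests, key=lambda r: r[1]):
        # Compatible means: starts no earlier than the last chosen request ends.
        if start >= last_finish:
            chosen.append((start, finish))
            last_finish = finish
    return chosen
```

The sort dominates the running time, so the sketch runs in O(n log n).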

### 1.2 Problem: Minimize Lateness Scheduling

#### The Problem

All jobs are available from time s. Job i has length t(i) and deadline d(i). A job that finishes after its deadline is late, and its lateness is the amount of time by which it misses the deadline. Only one job can be processed at a time, and all jobs must be scheduled. Find a schedule that minimizes the maximum lateness over all jobs.

#### The Design

Always schedule the one with the earliest deadline first.

#### The Analysis

Proof: By the "Exchange" method. First we prove that there is an optimal solution O with no idle time. Then we prove that swapping two adjacent jobs that are scheduled out of deadline order (an inversion) does not increase the maximum lateness. Since inversions can be removed one at a time until the schedule is fully sorted by deadline, our solution is at least as good as O, hence optimal.
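A small sketch of the earliest-deadline-first rule follows. The function name `schedule_min_lateness` and the `(length, deadline)` tuple format are assumptions made for illustration; the function reports the maximum lateness of the schedule it builds.

```python
def schedule_min_lateness(jobs, start=0):
    """jobs: list of (length, deadline). Returns (schedule, max_lateness)."""
    schedule = []
    time = start
    max_lateness = 0
    # Earliest deadline first, with no idle time between jobs.
    for length, deadline in sorted(jobs, key=lambda job: job[1]):
        finish = time + length
        schedule.append((time, finish, deadline))
        max_lateness = max(max_lateness, finish - deadline)
        time = finish
    return schedule, max_lateness
```

Each entry of `schedule` is `(start_time, finish_time, deadline)`, so the lateness of any single job can be read off directly.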

### 1.3 Problem: Graph Shortest Paths

#### The Problem

In a graph where each edge e has a non-negative length l(e), find a path from s to v with the smallest total length.

#### The Design: Dijkstra's Algorithm

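Start from s and maintain the set S of explored nodes, with d(s) = 0. For each unexplored node v that has at least one edge from S, compute the tentative distance

\begin{equation*}
d'(v) = \min_{e=(u,v):u\in S} d(u)+l_{e}
\end{equation*}

Then add the node v with the smallest d'(v) to S, set d(v) = d'(v), and repeat until all reachable nodes are explored.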

#### The Analysis

Proof: By the "Stays Ahead" method. Let P be the path found when v is added to S, and let P' be any other s-v path. P' must leave S somewhere; at that point its cost is already at least the cost of P, because the algorithm always adds the cheapest reachable node, and the rest of P' cannot reduce the cost since edge lengths are non-negative.

Run Time: Using a priority queue, Dijkstra costs O(m) + n O(log n) + m O(log n) = O(m log n) for a connected graph with n nodes and m edges: n extract-min operations and at most m key updates, each taking O(log n).
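A possible Python sketch of Dijkstra's Algorithm with the standard-library `heapq` module is shown here. The adjacency-list format and the name `dijkstra` are assumptions. Since `heapq` has no decrease-key operation, this sketch pushes a new entry on every improvement and skips stale entries when they are popped, which keeps the same O(m log n) bound.

```python
import heapq

def dijkstra(graph, source):
    """graph: dict node -> list of (neighbor, length), lengths non-negative.
    Returns a dict of shortest distances from source to each reachable node."""
    dist = {source: 0}
    explored = set()                  # the set S from the text
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in explored:
            continue                  # stale entry: u was already finalized
        explored.add(u)
        for v, length in graph.get(u, []):
            candidate = d + length
            if v not in explored and candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist
```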

### 1.4 Problem: Minimum Spanning Tree

#### The Problem

A **minimum spanning tree** of a connected, edge-weighted graph is a subgraph that is a tree, connects all the nodes, and has the least total edge cost possible. Find the minimum spanning tree of a connected edge-weighted graph.

#### The Design

**Kruskal's Algorithm**: Start from no edges at all, and insert edges by increasing cost. We only insert those which do not create cycles.

**Prim's Algorithm**: Start from any node s and build up a set S by repeatedly adding the node v that minimizes the attachment cost $\min_{e=(u,v): u \in S} C_e$. The edges added form a minimum spanning tree.

**Reverse-Delete Algorithm**: Start from the full graph, delete edges by decreasing cost. Only delete those that will not make the graph disconnected.

#### The Analysis

Proof: First, a lemma called the "Cut Property": assuming all edge costs are distinct, for any non-empty proper subset S of the nodes, the minimum spanning tree of G contains the lowest-cost edge e with one end in S and the other end outside S. This is proven by the "Exchange" method: if a spanning tree does not contain e, some other edge crossing the cut can be exchanged for e to lower the total cost. The counterpart of that lemma for Reverse-Delete is the "Cycle Property": the minimum spanning tree does not contain the highest-cost edge e of any cycle C in G. Finally, we prove that the outputs of Kruskal's and Prim's are spanning trees and that every edge they add is justified by the Cut Property.

Run Time: Prim's Algorithm is nearly the same as Dijkstra's and takes O(m + n log n + m log n) = O(m log n) with a priority queue.

#### Appendix: The Union Find Data Structure

Kruskal's Algorithm needs to find which connected component a node belongs to and to merge (union) two connected components. The Union Find data structure supports both operations.

The efficient implementation of Union Find gives each node a pointer, initially pointing to itself; a node whose pointer points to itself is the root (name) of its component. Union of two components only requires changing one pointer: we always make the root of the smaller component point to the root of the larger one. Find follows the pointer chain to the root and, once done, redirects every node on the path to point directly to the root. Union takes O(1) given the two roots, and Find takes O(log n).

Kruskal's Algorithm, implemented by Union Find, takes O(m log n) time.
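The following is one way to sketch Union Find and Kruskal's Algorithm together in Python. The class and function names and the `(cost, u, v)` edge format are assumptions chosen for illustration.

```python
class UnionFind:
    def __init__(self, nodes):
        self.parent = {v: v for v in nodes}   # every node starts as its own root
        self.size = {v: 1 for v in nodes}

    def find(self, v):
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Redirect every node on the path to point directly at the root.
        while self.parent[v] != root:
            next_v = self.parent[v]
            self.parent[v] = root
            v = next_v
        return root

    def union(self, a, b):
        """Merge two components given their distinct roots a and b."""
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a                    # smaller component points to larger
        self.size[a] += self.size[b]


def kruskal(nodes, edges):
    """edges: list of (cost, u, v). Returns the edges of a minimum spanning tree."""
    uf = UnionFind(nodes)
    tree = []
    for cost, u, v in sorted(edges):          # increasing cost
        root_u, root_v = uf.find(u), uf.find(v)
        if root_u != root_v:                  # the edge does not create a cycle
            uf.union(root_u, root_v)
            tree.append((cost, u, v))
    return tree
```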

#### Appendix: Huffman Code

A prefix code is a function f that maps each $x \in S$ to a sequence of 0's and 1's such that if $x \neq y$, then f(x) is not a prefix of f(y). Notice that a binary tree can express a prefix code: the words sit at the leaves, and the path from the root to a leaf spells out its code.

Since an optimal prefix code gives the lowest-frequency words the longest paths, we can build the tree bottom up. Specifically, merge the two lowest-frequency words into one word (node) whose frequency is their sum, and add it to the tree as their parent. Continue until all words are fitted into the tree. With a priority queue, the total running time is O(k log k) for k words.
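The bottom-up construction can be sketched with a priority queue as below. The name `huffman_code` and the dictionary input format are assumptions, and the sketch assumes that no symbol is itself a Python tuple, because tuples are used to represent internal tree nodes.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """freq: dict symbol -> frequency. Returns dict symbol -> bit string."""
    tiebreak = count()    # unique counter so equal frequencies never compare symbols
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:    # a single symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two lowest-frequency nodes under a new parent.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def assign(node, prefix):
        if isinstance(node, tuple):           # internal node: (left, right)
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                  # leaf: an original symbol
            codes[node] = prefix

    assign(heap[0][2], "")
    return codes
```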

## 2 Divide and Conquer

Divide and Conquer divides a problem into several non-overlapping subproblems, solves them recursively, and combines their solutions into the solution of the original problem. The Master Theorem is the main tool for analyzing the running time of such algorithms.

### 2.1 The Master Theorem

In Divide and Conquer, a recurrence relation

\begin{equation*}T(n) = aT(n/b) + f(n)\end{equation*}

is very common. The complexity of T(n) is given by the Master Theorem:

1. Suppose $\exists \epsilon > 0$ such that $f(n) = O(n^{\log_{b}a - \epsilon})$, then $T(n) = \Theta(n^{\log_{b}a})$.

2. Suppose $\exists k \geq 0$ such that $f(n) = \Theta(n^{\log_{b}a}\log^{k}n)$, then $T(n) = \Theta(n^{\log_{b}a}\log^{k+1}n)$.

3. Suppose $\exists \epsilon > 0$ such that $f(n) = \Omega(n^{\log_{b}a + \epsilon})$, and $\exists c < 1$ such that $af(n/b) \leq cf(n)$ for all sufficiently large $n$, then $T(n) = \Theta(f(n))$.
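To see how the three cases are applied, here are a few standard recurrences worked out. These examples are added for illustration and are not taken from the course material.

\begin{equation*}T(n) = 2T(n/2) + \Theta(n)\end{equation*}

Here $a = 2$ and $b = 2$, so $n^{\log_{b}a} = n$ and $f(n) = \Theta(n \log^{0} n)$. Case 2 with $k = 0$ gives $T(n) = \Theta(n \log n)$. This is the merge sort recurrence.

\begin{equation*}T(n) = 7T(n/2) + \Theta(n^2)\end{equation*}

Here $n^{\log_{b}a} = n^{\log_{2}7} \approx n^{2.81}$, and $f(n) = O(n^{\log_{2}7 - \epsilon})$ for, say, $\epsilon = 0.5$. Case 1 gives $T(n) = \Theta(n^{\log_{2}7})$.

\begin{equation*}T(n) = 2T(n/2) + n^2\end{equation*}

Here $n^{\log_{b}a} = n$ and $f(n) = \Omega(n^{1 + \epsilon})$ with $\epsilon = 1$. The extra condition holds with $c = 1/2$ because $af(n/b) = 2(n/2)^2 = n^2/2$. Case 3 gives $T(n) = \Theta(n^2)$.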

### 2.2 Problem: Inversion Number

#### The Problem

In a sequence a(i), if i < j but a(i) > a(j), then the pair (i, j) is called an inversion. Find the number of inversions in a.

#### The Design and the Analysis

Inversions can be counted while merge sorting the list. While merging the two sorted halves, maintain a pointer into each half and advance the one pointing at the smaller value. Whenever the smaller value comes from the second half, add the number of elements remaining in the first half to the inversion counter, since each of them forms an inversion with it. The run time is O(n log n).
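A short sketch of the merge-and-count idea is given here. The function name `count_inversions` is an assumption, and equal elements are not counted as inversions.

```python
def count_inversions(a):
    """Return (sorted copy of a, number of inversions in a)."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every element still waiting in left.
            merged.append(right[j])
            inversions += len(left) - i
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions
```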

### 2.3 Problem: Closest Pair of Points

#### The Problem

Find the closest pair of points among n points in a plane.

#### The Design and the Analysis

The 1-D version is easier to solve. We first sort the points in O(n log n) and then scan the line in O(n), comparing adjacent points, to find the closest pair.

The 2-D version can be solved by divide and conquer. Split the points into a left half and a right half by x-coordinate and solve each half recursively. We now know the closest distance within each half; call the smaller of the two d'. While combining the halves, the only points we need to consider are those within distance d' of the dividing line. For each such point, taken in order of y-coordinate, we only need to compare it with a constant number of the points that follow it, so the "merge" complexity is O(n). By the Master Theorem, the total complexity is O(n log n).
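A compact sketch of the 2-D algorithm follows. The name `closest_pair_distance` is an assumption, `math.dist` requires Python 3.8 or later, and the constant 7 is the bound commonly quoted in textbooks for how many following points in the strip need to be checked. The recursion returns its points sorted by y so that the strip can be built by a linear-time merge.

```python
import heapq
from math import dist, inf

def closest_pair_distance(points):
    """points: list of (x, y) tuples. Returns the smallest pairwise distance."""
    def solve(px):
        # px is sorted by x. Returns (best distance, the same points sorted by y).
        n = len(px)
        if n <= 3:
            best = min((dist(p, q) for i, p in enumerate(px) for q in px[i + 1:]),
                       default=inf)
            return best, sorted(px, key=lambda p: p[1])
        mid = n // 2
        mid_x = px[mid][0]                    # x-coordinate of the dividing line
        d_left, left_y = solve(px[:mid])
        d_right, right_y = solve(px[mid:])
        d = min(d_left, d_right)
        # Linear-time merge of the two y-sorted halves.
        by_y = list(heapq.merge(left_y, right_y, key=lambda p: p[1]))
        strip = [p for p in by_y if abs(p[0] - mid_x) < d]
        for i, p in enumerate(strip):
            for q in strip[i + 1:i + 8]:      # only the next 7 points can be closer
                d = min(d, dist(p, q))
        return d, by_y

    return solve(sorted(points))[0]
```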

### 2.4 Problem: Convolution & FFT

#### The Problem

The **convolution** of two vectors a and b of length n is $a * b = (a_0 b_0, a_0 b_1 + a_1 b_0, a_0 b_2 + a_1 b_1 + a_2 b_0, \ldots, a_{n-2}b_{n-1} + a_{n-1}b_{n-2}, a_{n-1}b_{n-1})$, a vector of diagonal sums of the products $a_i b_j$. Find the convolution of two vectors efficiently.

#### The Design and the Analysis: Fast Fourier Transform

Think about multiplying two polynomials: the coefficients of the resulting polynomial are the convolution of the coefficients of the two factors. A way to multiply P and Q, each with n coefficients, is to

(1) evaluate P and Q at 2n points.

(2) multiply these values pointwise to get 2n values of PQ.

(3) recover the coefficients of PQ: since PQ has degree at most 2n - 2, these 2n values determine it.

(2) can be done in O(n). (1) is harder and needs divide and conquer. Instead of evaluating P and Q at arbitrary points, we evaluate them at the 2n-th roots of unity $\omega_{j,2n}$. We split P into two polynomials: Pl with the even-indexed coefficients and Pr with the odd-indexed coefficients. Since $P(x) = Pl(x^2)+xPr(x^2)$, we have $P(\omega_{j, 2n})=Pl(\omega_{j,2n}^2)+\omega_{j,2n}Pr(\omega_{j,2n}^2)$, and the squares $\omega_{j,2n}^2$ are exactly the n-th roots of unity, so Pl and Pr only need to be evaluated at the n-th roots of unity. This gives the recurrence $T(n) \leq 2T(n/2) + O(n)$, and hence O(n log n) complexity.

(3) requires polynomial interpolation, which is done by the inverse Discrete Fourier Transform, summarized below:

For any polynomial $C(x) = \sum_{s = 0}^{2n-1}c_s x^s$, and corresponding polynomial $D(x) = \sum_{s = 0}^{2n-1}C(\omega_{s, 2n}) x^s$, we have that $c_s = \frac{1}{2n}D(\omega_{2n - s, 2n})$.

Evaluating the values $D(\omega_{2n-s,2n})$ is the same task as in (1), which takes O(n log n) time. The whole procedure of polynomial multiplication (and thus convolution) therefore takes O(n log n) time.
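The three steps can be sketched with a recursive FFT as follows. The names `fft` and `convolve` are assumptions, the input length is padded to a power of two, and the final rounding assumes integer input vectors, so this is an illustration rather than a numerically careful implementation.

```python
import cmath

def fft(coeffs, invert=False):
    """Evaluate a polynomial at the len(coeffs)-th roots of unity.
    len(coeffs) must be a power of two. With invert=True the inverse
    roots are used, which is what the interpolation step needs."""
    n = len(coeffs)
    if n == 1:
        return list(coeffs)
    even = fft(coeffs[0::2], invert)          # Pl: even-indexed coefficients
    odd = fft(coeffs[1::2], invert)           # Pr: odd-indexed coefficients
    sign = -1 if invert else 1
    result = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        result[k] = even[k] + w * odd[k]
        result[k + n // 2] = even[k] - w * odd[k]
    return result

def convolve(a, b):
    """Convolution of two non-empty integer vectors via FFT."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size *= 2
    fa = fft(list(a) + [0] * (size - len(a)))             # step (1)
    fb = fft(list(b) + [0] * (size - len(b)))
    values = [x * y for x, y in zip(fa, fb)]              # step (2)
    coeffs = fft(values, invert=True)                     # step (3)
    return [round((c / size).real) for c in coeffs[:out_len]]
```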

### 2.5 Problem: Matrix Multiplication Optimization

#### The Problem

The naive way of multiplying two n * n matrices takes $O(n^3)$ time. Can we do better, even just a little bit?

#### The Design and the Analysis: Strassen's Algorithm

#### Problem Related: Integer Multiplication Optimization

## 3 Dynamic Programming

Dynamic programming is an extension of Greedy and Divide and Conquer. You divide a problem into a polynomial number of possibly overlapping subproblems, solve them recursively, and combine their solutions through a recurrence relation. Because the subproblems overlap, this typically requires memoization or, better, solving the subproblems bottom-up by iteration.

### 3.1 Problem: Weighted Interval Scheduling

#### The Problem

This is similar to Problem 1.1. There are n requests, each with a starting time s(i), a finishing time f(i), and a weight v(i). Find a set of pairwise non-overlapping intervals whose total weight is maximized.

#### The Design

Sort the intervals by finishing time. Suppose opt(j) is the maximum total weight achievable using intervals 1, ..., j. Then
\begin{equation*}opt(j) = \max(v_j + opt(p(j)), opt(j - 1)),\end{equation*} where p(j) is the largest index i < j such that interval i does not overlap with interval j (p(j) = 0 if there is none). Also, opt(0) = 0.

#### The Analysis

The memoized version takes O(n) time once the intervals are sorted and the values p(j) are computed, which takes O(n log n). Recovering the chosen intervals by tracing back through opt does not need more than O(n). Replacing top-down recursion with bottom-up iteration avoids the overhead of function calls and is more efficient in practice.
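A bottom-up sketch of the recurrence, including the trace-back, could look like this. The name `max_weight_schedule` and the `(start, finish, weight)` tuple format are assumptions; p(j) is computed by binary search over the sorted finishing times, and intervals that touch at an endpoint are treated as compatible.

```python
from bisect import bisect_right

def max_weight_schedule(requests):
    """requests: list of (start, finish, weight).
    Returns (maximum total weight, list of chosen requests)."""
    jobs = sorted(requests, key=lambda r: r[1])           # by finishing time
    finishes = [finish for _, finish, _ in jobs]
    n = len(jobs)
    # p[j] = largest index i < j (1-based) whose job finishes by the start of job j.
    p = [0] * (n + 1)
    for j in range(1, n + 1):
        p[j] = bisect_right(finishes, jobs[j - 1][0], 0, j - 1)
    opt = [0] * (n + 1)
    for j in range(1, n + 1):
        opt[j] = max(jobs[j - 1][2] + opt[p[j]], opt[j - 1])
    # Trace back through opt to recover one optimal set of requests.
    chosen = []
    j = n
    while j > 0:
        if jobs[j - 1][2] + opt[p[j]] >= opt[j - 1]:
            chosen.append(jobs[j - 1])
            j = p[j]
        else:
            j -= 1
    return opt[n], chosen[::-1]
```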

### 3.2 Problem: Segmented Least Squares

#### The Problem

Fit the data with a sequence of line segments. Specifically, each line segment added to the model incurs a fixed penalty, the error between the data and the model also incurs a penalty, and the goal is to find a solution with the least total penalty.

#### The Design

Suppose opt(j) is the optimal penalty for the first j data points (the solution to the problem would be opt(n)), then
\begin{equation*}opt(j) = \min_{1 \leq i \leq j}(e_{i,j} + C + opt(i-1)).\end{equation*} Here, C is the penalty of adding a new line, and $e_{i,j}$ is the sum of squared errors of the least-squares best-fit line through data points i, ..., j. The base case would be opt(0) = 0.

#### The Analysis

Calculating one least-squares error $e_{i,j}$ takes O(n), so calculating all $O(n^2)$ of them takes $O(n^3)$. Given the errors, calculating all the opt's takes $O(n^2)$.
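The recurrence can be sketched directly, computing each error in O(n) as described. The name `segmented_least_squares` is an assumption, the points are assumed to be sorted by x, and for a segment whose points all share one x value the sketch falls back to a horizontal line, which is a choice made here for illustration.

```python
def segmented_least_squares(points, C):
    """points: list of (x, y) sorted by x. C: penalty per line segment.
    Returns the minimum total penalty opt(n)."""
    n = len(points)

    def error(i, j):
        """Squared error of the best-fit line through points i..j (1-based)."""
        seg = points[i - 1:j]
        m = len(seg)
        sx = sum(x for x, _ in seg)
        sy = sum(y for _, y in seg)
        sxx = sum(x * x for x, _ in seg)
        sxy = sum(x * y for x, y in seg)
        denom = m * sxx - sx * sx
        # denom is 0 when all x values coincide; use a horizontal line then.
        a = (m * sxy - sx * sy) / denom if denom != 0 else 0.0
        b = (sy - a * sx) / m
        return sum((y - a * x - b) ** 2 for x, y in seg)

    opt = [0.0] * (n + 1)
    for j in range(1, n + 1):
        opt[j] = min(error(i, j) + C + opt[i - 1] for i in range(1, j + 1))
    return opt[n]
```

A small penalty C favors many short segments, while a large C pushes the solution toward a single line.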

### 3.3 Problem: Knapsack

#### The Problem

There are n objects, each with a value v(i) and a weight w(i). You have a knapsack with capacity W. Find a subset of objects with maximum total value whose total weight does not exceed W.

#### The Design

Let opt(i, w) be the optimal value achievable using the first i objects and capacity w. Then
\begin{equation*} opt(i, w) = \max(opt(i-1, w), v_i + opt(i-1, w-w_i)) \end{equation*} when $w_i \leq w$, and opt(i, w) = opt(i-1, w) otherwise. Also, opt(0, x) = 0 for all x and opt(y, 0) = 0 for all y.

#### The Analysis

This program runs in O(nW) time with bottom-up iteration. Such a running time is called pseudo-polynomial: it is proportional to W, but W is written in the input using only about log W digits, so the running time is exponential in the input size.
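A bottom-up sketch of the table is shown here. The name `knapsack` is an assumption, and the weights and the capacity are assumed to be non-negative integers so that they can index the table.

```python
def knapsack(values, weights, capacity):
    """Return the maximum total value that fits into the given capacity."""
    n = len(values)
    # opt[i][w]: best value using the first i objects with capacity w.
    opt = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = values[i - 1], weights[i - 1]
        for w in range(capacity + 1):
            opt[i][w] = opt[i - 1][w]                     # skip object i
            if weight <= w:                               # take object i if it fits
                opt[i][w] = max(opt[i][w], value + opt[i - 1][w - weight])
    return opt[n][capacity]
```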

### 3.4 Problem: RNA Secondary Structure

#### The Problem

An RNA molecule can fold onto itself. It is a chain of bases A, U, C, G. A pairs with U, and C pairs with G. The RNA structure with the most base pairs has the lowest energy. But not all sets of pairs are possible. Specifically, (1) the two bases of a pair must be separated by at least 4 other bases (no sharp turns). (2) No base appears in more than one pair. (3) No two pairs cross each other. Find the pairing of an RNA string with the lowest energy.

#### The Design and the Analysis

This is a dynamic program over intervals. Let opt(i, j) be the maximum number of pairs among bases i, ..., j. Then it satisfies

(1) if $i \geq j - 4$, then opt(i, j) = 0.

(2) in all other cases, $opt(i, j) = \max(opt(i, j - 1), \max_t(1 + opt(i, t - 1) + opt(t + 1, j - 1)))$, where the inner max is over all t with $i \leq t < j - 4$ such that $b_t$ and $b_j$ form a valid base pair.

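```python
def rna_structure(bases):
    """Return the maximum number of valid base pairs in the string `bases`."""
    valid = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(bases)
    # opt[i][j] uses 1-based i, j. The extra row and column keep opt[i][i - 1]
    # and opt[t + 1][j - 1] defined. Entries with i >= j - 4 stay 0.
    opt = [[0] * (n + 2) for _ in range(n + 2)]
    for k in range(5, n):                    # k = j - i, shortest intervals first
        for i in range(1, n - k + 1):
            j = i + k
            best = opt[i][j - 1]             # case: base j is unpaired
            for t in range(i, j - 4):        # case: base t pairs with base j
                if (bases[t - 1], bases[j - 1]) in valid:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt[1][n]
```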

Since there are three nested loops in the algorithm above, the time complexity is $O(n^3)$.

### 3.5 Problem: Sequence Alignment

#### The Problem

Two strings need to be aligned with each other. You can insert gaps into either string or match two different characters against each other, each with a penalty. Find the alignment with the lowest total penalty.

#### The Design

Let opt(i, j) be the minimum cost of aligning the first i characters of string A with the first j characters of string B. Then
\begin{equation*}opt(i, j) = \min(a_{x_i y_j} + opt(i - 1, j - 1), \delta + opt(i - 1, j), \delta + opt(i, j - 1)),\end{equation*}
where $a_{x_i y_j}$ is the replacement penalty for the ith character of the first string and the jth character of the second, and $\delta$ is the gap penalty. Also, $opt(i, 0) = i\delta$ and $opt(0, j) = j\delta$.

#### The Analysis

Correctness can be proved by induction on i + j. The time complexity is O(mn), where m and n are the lengths of the strings.
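A sketch of the alignment table in Python follows. The name `alignment_cost` is an assumption, and so is the simplified cost model: a single `mismatch` penalty for any two different characters and 0 for equal ones. With both penalties set to 1 the result is the edit distance.

```python
def alignment_cost(x, y, gap=1, mismatch=1):
    """Minimum cost of aligning strings x and y."""
    m, n = len(x), len(y)
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        opt[i][0] = i * gap                   # align a prefix of x against gaps
    for j in range(1, n + 1):
        opt[0][j] = j * gap                   # align a prefix of y against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace = 0 if x[i - 1] == y[j - 1] else mismatch
            opt[i][j] = min(replace + opt[i - 1][j - 1],
                            gap + opt[i - 1][j],
                            gap + opt[i][j - 1])
    return opt[m][n]
```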

#### Problem Related: Longest Common Subsequence

Finding the longest common subsequence of two strings is also a DP problem. If $a_i = b_j$, then opt(i, j) = 1 + opt(i - 1, j - 1), otherwise opt(i, j) = max(opt(i, j - 1), opt(i - 1, j)), and opt(i, 0) = opt(0, j) = 0.
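As a small illustration, the length of the longest common subsequence can be computed with the table below, where every matched pair of characters adds one to the length. The name `lcs_length` is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    opt = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                opt[i][j] = 1 + opt[i - 1][j - 1]         # extend a common subsequence
            else:
                opt[i][j] = max(opt[i][j - 1], opt[i - 1][j])
    return opt[len(a)][len(b)]
```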

### 3.6 Problem: Graph Shortest Path

#### The Problem

Unlike Problem 1.3, this time we allow negative edge costs, but still no negative cycles. Find the path with the lowest total cost from s to t in this graph.

#### The Design and the Analysis: Bellman-Ford Algorithm

Let opt(i, v) be the minimum cost of a v-t path using at most i edges. Then
\begin{equation*} opt(i, v) = \min(opt(i - 1, v), \min_{w \in V}(opt(i-1, w) + c_{vw})).\end{equation*}
Also, opt(0, t) = 0, and opt(0, v) is infinity for every other v in V. We just need to calculate opt(n - 1, s). Note that i can also be read simply as an iteration counter: after the i-th iteration, opt(i, v) is correct for every node v whose shortest path to t uses at most i edges.

An important fact guarantees the correctness of this algorithm: if the graph does not have a negative cycle, then there is a shortest path that is simple and therefore has at most n - 1 edges. Conversely, if after n - 1 iterations there still exist some v and w such that $opt(n-1, v) > opt(n - 1, w) + c_{vw}$, then there must be a negative cycle.

The recurrence relation could be further simplified into this:
\begin{equation*} M(v) = \min(M(v), \min_{w \in V}(M(w) + c_{vw})) \end{equation*}
which is run (n-1) times on all nodes $v \in V$. Note that, depending on the order in which nodes are processed, a path with more than i edges may already be found in the i-th iteration, but we can guarantee that every shortest path to the destination with at most i edges has been found after the i-th iteration of the algorithm.

The time complexity of this algorithm is O(mn). During each iteration, each node checks all edges leaving it for updates, so every edge is examined once per iteration, and there are at most n - 1 iterations.
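The simplified update rule can be sketched as below. The name `bellman_ford` and the `(v, w, cost)` format for a directed edge from v to w are assumptions. The final pass is the negative-cycle check; note that it can only detect negative cycles from which the target is reachable.

```python
def bellman_ford(nodes, edges, target):
    """edges: list of (v, w, cost) for a directed edge v -> w.
    Returns a dict M with M[v] = cost of a cheapest path from v to target."""
    M = {v: float("inf") for v in nodes}
    M[target] = 0
    for _ in range(len(nodes) - 1):
        changed = False
        for v, w, cost in edges:
            if M[w] + cost < M[v]:            # going through w improves v
                M[v] = M[w] + cost
                changed = True
        if not changed:                       # no update: all values are final
            break
    for v, w, cost in edges:
        if M[w] + cost < M[v]:
            raise ValueError("graph contains a negative cycle")
    return M
```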

#### Further Reading: SPFA

Many of these checks are unnecessary: M(v) can only change if M(w) changed for some neighbor w. If we put newly updated nodes into a queue, then each time we only need to re-check the nodes v affected by a node in the queue. This is the idea of the Shortest Path Faster Algorithm (SPFA). It is often much faster in practice, but its worst-case complexity is still O(mn).
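A queue-based sketch of this idea is given below. The name `spfa` and the `(v, w, cost)` format for a directed edge from v to w are assumptions, and the sketch assumes the graph has no negative cycle, since otherwise the loop would not terminate.

```python
from collections import deque

def spfa(nodes, edges, target):
    """edges: list of (v, w, cost) for a directed edge v -> w.
    Returns a dict M with M[v] = cost of a cheapest path from v to target."""
    incoming = {v: [] for v in nodes}         # incoming[w]: (v, cost) for v -> w
    for v, w, cost in edges:
        incoming[w].append((v, cost))
    M = {v: float("inf") for v in nodes}
    M[target] = 0
    queue = deque([target])
    queued = {target}
    while queue:
        w = queue.popleft()
        queued.discard(w)
        # Only nodes with an edge into w can be affected by the change of M[w].
        for v, cost in incoming[w]:
            if M[w] + cost < M[v]:
                M[v] = M[w] + cost
                if v not in queued:
                    queue.append(v)
                    queued.add(v)
    return M
```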

#### Problem Related: Floyd-Warshall Algorithm

There is a related algorithm that calculates the shortest paths between all pairs of nodes in a graph in $O(n^3)$ time.
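For reference, a minimal sketch of Floyd-Warshall is shown here. The name `floyd_warshall` and the `(u, v, cost)` format for a directed edge from u to v are assumptions, and the graph is assumed to have no negative cycle.

```python
def floyd_warshall(nodes, edges):
    """edges: list of (u, v, cost) for a directed edge u -> v.
    Returns dist with dist[u][v] = cost of a cheapest path from u to v."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, cost in edges:
        dist[u][v] = min(dist[u][v], cost)
    # After processing k, dist[i][j] is the best cost over paths whose
    # intermediate nodes all come from the nodes processed so far.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```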

