# Subsequences and Segments
###### (Algorithm Design with Haskell)

A subsequence of a list is obtained by deleting zero or more elements while keeping the rest in their original order; a segment is a contiguous subsequence. The goal is to solve several problems involving these concepts using thinning algorithms, which are like greedy algorithms except that they keep more than one candidate at each step.

### The longest upsequence
Given a sequence of elements of an ordered type, find the longest subsequence in which the elements are strictly increasing.

In [4]:
subseq :: [a] -> [[a]]
subseq = foldr go [[]] where
  go x xss = xss ++ map (x:) xss
  
up :: Ord a => [a] -> Bool
up xs = and $ zipWith (<) xs (tail xs)

The naive implementation takes exponential time because there are exponentially many subsequences to check: a list of n items has 2^n subsequences, compared with only n(n+1)/2 nonempty segments.

Our aim is to produce a linearithmic algorithm.

The starting point is the following refinement:

`lus <- MaxWith length . filter up . subseq`
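Read literally, this specification is already executable. The sketch below uses `maximumBy (comparing length)` as one deterministic stand-in for `MaxWith length`. The name `lusNaive` is hypothetical, and `drop 1` replaces `tail` only to avoid a partial function.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- All subsequences of a list (2^n of them).
subseq :: [a] -> [[a]]
subseq = foldr go [[]] where
  go x xss = xss ++ map (x:) xss

-- True when the elements are strictly increasing.
up :: Ord a => [a] -> Bool
up xs = and $ zipWith (<) xs (drop 1 xs)

-- Hypothetical name: brute-force search straight from the specification.
-- Exponential time, so only suitable for small inputs.
lusNaive :: Ord a => [a] -> [a]
lusNaive = maximumBy (comparing length) . filter up . subseq
```

Such a function is mainly useful as a reference against which faster versions can be tested on short lists.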

The first step is to fuse `filter up` with the fold that generates the subsequences:

```
lus <- MaxWith length . foldr step [[]] where
  step x xss = xss ++ map (x:) (filter (ok x) xss)
  ok x xs = null xs || x < head xs
```

This way only upsequences are kept at each step.
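As a sanity check on the fusion, the fused fold can be compared with filtering after the fact. This is a small sketch: `upSubseqs` is a hypothetical name for `foldr step [[]]`, and `ok` is written with pattern matching instead of `head`.

```haskell
subseq :: [a] -> [[a]]
subseq = foldr go [[]] where
  go x xss = xss ++ map (x:) xss

up :: Ord a => [a] -> Bool
up xs = and $ zipWith (<) xs (drop 1 xs)

-- Hypothetical name for the fused fold: only upsequences are ever built.
upSubseqs :: Ord a => [a] -> [[a]]
upSubseqs = foldr step [[]] where
  step x xss = xss ++ map (x:) (filter (ok x) xss)
  ok _ []    = True
  ok x (y:_) = x < y

-- The fused and unfused pipelines should produce the same list.
agrees :: Ord a => [a] -> Bool
agrees xs = upSubseqs xs == filter up (subseq xs)
```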

The next step is to see whether a greedy solution is possible. It is not: keeping only the single best candidate at each step can discard a candidate that later turns out to be part of the answer. With letters in alphabetical order, `ab` is the lus of `xab`, but the lus of `uvwxab` is `uvwx`, so the candidate `x` must not be discarded. Nor is a greedy solution possible when traversing the list from left to right.

Since there is no greedy solution, we want to introduce a thinning step.

```
lus <- MaxWith length . ThinBy thin . foldr step [[]]
```

Here `thin` must satisfy `thin xs ys => length xs >= length ys`.

A candidate is clearly better than another if it is no shorter and its first element, if it exists, is larger. We also want to be sure to keep the empty list as a candidate.

In [5]:
thin :: Ord a => [a] -> [a] -> Bool
thin [] [] = True
thin [] _ = False
thin _ [] = False
thin (x:xs) (y:ys) = x >= y && length xs >= length ys

Now we want to fuse `ThinBy thin` with `foldr step [[]]`

After some swizzling, we end up with this refinement:

```
tstep x xss = thinBy thin (step x xss)
foldr tstep [[]] <- ThinBy thin . foldr step [[]]
lus <- MaxWith length . foldr tstep [[]]
```

The thinning process is made more efficient by keeping the candidates in increasing order of length using `mergeBy`.

In [11]:
thinBy :: (a -> a -> Bool) -> [a] -> [a]
thinBy f = foldr go [] where
  go x (y:xs) | f x y = x : xs
              | f y x = y : xs
              | otherwise = x : y : xs     
  go x [] = [x]
  
mergeBy :: (a -> a -> Bool) -> [a] -> [a] -> [a]
mergeBy _ xs [] = xs
mergeBy _ [] ys = ys
mergeBy f (x:xs) (y:ys)
  | f x y = x : mergeBy f xs (y:ys)
  | otherwise = y : mergeBy f (x:xs) ys
  
lus = last . foldr tstep [[]]
tstep x xss = thinBy thin $ mergeBy cmp xss yss where
  yss = map (x:) (filter (ok x) xss)
  cmp xs ys = length xs <= length ys  -- shorter candidates first, so `last` picks the longest
  ok x [] = True
  ok x (y:_) = x < y

Ignoring length calculation, this takes `O(nr)` steps where n is the length of the input and r is the length of the longest up sequence. At most r + 1 upsequences are kept in play at each stage and these can be updated in O(r) steps.

The path to further optimisation is to observe what `tstep` actually does. The candidates are held in increasing order of length, so the heads of the nonempty candidates are in decreasing order. `tstep` finds the position where `x` is less than the head of one candidate and greater than or equal to the head of the next, and replaces that next candidate with `x` consed onto the first. If there is no such position, `x` consed onto the longest candidate is added to the end of the list.

To do this efficiently, we must be able to search for the position where this condition occurs, which calls for keeping the candidates in a binary search tree rather than a list. The list-based code in this notebook does not do that.
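As a rough illustration of the tree-based idea, the candidates can be kept in a `Data.Map` keyed by their head element. This is a sketch under assumptions, not the source's implementation: the name `lusFast` is hypothetical, the empty candidate is left implicit, and the search relies on the observation that longer candidates have smaller heads.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical name: candidates are stored as head -> candidate.
-- The smallest key therefore belongs to the longest candidate.
lusFast :: Ord a => [a] -> [a]
lusFast = longest . foldr tstepM Map.empty
  where
    longest m = maybe [] snd (Map.lookupMin m)

    tstepM x m =
      let -- Longest candidate that x can extend: smallest head above x.
          best = maybe [] snd (Map.lookupGT x m)
          -- The candidate one element longer than `best`, if any, has the
          -- largest head not exceeding x; x : best is at least as good.
          m'   = maybe m (\(k, _) -> Map.delete k m) (Map.lookupLE x m)
      in Map.insert x (x : best) m'
```

Each step performs a constant number of logarithmic map operations, and `x : best` shares its tail with an existing candidate, so no candidate is copied.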

In [12]:
tstep x ([]:xss) = [] : search x [] xss where
  search x xs [] = [x:xs]
  search x xs (ys:xss)
    | head ys > x = ys : search x ys xss
    | otherwise = (x:xs):xss

This version finds the insertion point by linear search.

### The longest common subsequence

Given two input sequences, find the longest sequence that is a subsequence of both inputs.

Our initial refinement is
```
lcs <- MaxWith length . filter (sub xs) . subseq
```
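For small inputs this specification can be run directly. The sketch below leans on `subsequences` and `isSubsequenceOf` from `Data.List` instead of the notebook's own `subseq` and `sub`; `lcsNaive` is a hypothetical name, and `maximumBy` again stands in for `MaxWith`.

```haskell
import Data.List (isSubsequenceOf, maximumBy, subsequences)
import Data.Ord (comparing)

-- Hypothetical name: enumerate every subsequence of ys, keep those that
-- are also subsequences of xs, and pick a longest one. Exponential time.
lcsNaive :: Eq a => [a] -> [a] -> [a]
lcsNaive xs =
  maximumBy (comparing length) . filter (`isSubsequenceOf` xs) . subsequences
```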

In [17]:
sub :: Eq a => [a] -> [a] -> Bool
sub _ [] = True
sub (x:xs) (y:ys) | x == y = sub xs ys
                  | otherwise = sub xs (y:ys)  -- skip x: ys must be found within xs
sub [] _ = False

First we look to fuse `filter (sub xs)` with `subseq`. The result is that we filter at each step to maintain a smaller pool of candidates rather than filtering at the end over a huge set.

In [18]:
step :: Eq a => [a] -> a -> [[a]] -> [[a]]
step xs y yss = yss ++ filter (sub xs) (map (y:) yss)

We now decide whether a greedy approach is possible. It is not, because we cannot discard all candidates except one at each step: the optimal solution may appear to be suboptimal until the very end.

Therefore we introduce a thinning step.

```
lcs <- MaxWith length . ThinBy thin . foldr step [[]]
```

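A candidate is at least as good as another if its length is at least as large and its position in `xs` is at least as large. The position of a candidate is the index at which the shortest suffix of `xs` containing it begins, or -1 if it is not a subsequence of `xs`.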

In [40]:
position :: Eq a => [a] -> [a] -> Int
position xs ys = help (length xs) (reverse xs) (reverse ys) where
  help i _ [] = i
  help _ [] _ = -1
  help i (x:xs) (y:ys) = help (pred $! i) xs $ if x == y then ys else y:ys
  
thin :: Eq a => [a] -> [a] -> [a] -> Bool
thin xs ys zs = length ys >= length zs && position xs ys >= position xs zs
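A few concrete values may help in reading `position` and `thin`. The block restates both definitions, with local variables renamed for readability, so that it can be loaded on its own; the example inputs are arbitrary.

```haskell
-- Index in xs where the shortest suffix containing ys begins, or -1.
position :: Eq a => [a] -> [a] -> Int
position xs ys = help (length xs) (reverse xs) (reverse ys) where
  help i _ [] = i
  help _ [] _ = -1
  help i (a:as) (b:bs) = help (i - 1) as (if a == b then bs else b:bs)

-- ys is at least as good as zs: no shorter, and starting no earlier in xs.
thin :: Eq a => [a] -> [a] -> [a] -> Bool
thin xs ys zs = length ys >= length zs && position xs ys >= position xs zs

-- position "abcdef" "df" == 3   (the suffix "def" starts at index 3)
-- position "abcdef" ""   == 6   (the empty suffix)
-- position "abcdef" "xf" == -1  (not a subsequence)
-- thin "abcdef" "ef" "df" == True
-- "ef" and "f" are incomparable: "ef" is longer but starts earlier.
```

Incomparable pairs such as `"ef"` and `"f"` are exactly why more than one candidate has to be kept.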

Our next objective is to fuse `thinBy thin` with `foldr step [[]]`.

In [45]:
tstep :: Eq a => [a] -> a -> [[a]] -> [[a]]
tstep xs y yss = thinBy (thin xs) $ mergeBy cmp zss yss
 where
 zss = dropWhile negpos $ map (y:) yss
 negpos ys = position xs ys < 0
 cmp ys zs = position xs ys <= position xs zs

We keep the candidates in order of increasing position and therefore decreasing length.

The next optimisation is to cache, for each candidate, its position, its length, and the reversed prefix of `xs` that lies before that position:

In [50]:
ext (_, _, _, x) = x
psn (x, _, _, _) = x
lng (_, x, _, _) = x

cons x (p,k,ws,us) = (p - 1 - length as, k + 1, tail bs, x : us) where
 (as, bs) = span (/= x) ws
 
lcs :: Eq a => [a] -> [a] -> [a]
lcs xs = ext . head . foldr tstep start where
  start = [(length xs, 0, reverse xs, [])]
  tstep y yss = thinBy thin $ mergeBy cmp zss yss where
    zss = dropWhile ((< 0) . psn) $ map (cons y) yss
  thin a b = lng a >= lng b && psn a >= psn b
  cmp a b = psn a <= psn b
  
lcs "abcdef" "1dbbe32f"

"def"

This algorithm takes O(mn) steps, where m and n are the lengths of the input lists.

### A short segment with maximum sum

Given a list of positive and negative integers, find a segment with maximum sum that is no longer than a specified length.

```
mss b <- MaxWith sum . filter (short b) . segments 
```

In [55]:
import Data.List

short :: Int -> [a] -> Bool
short x = (<= x) . length

segments :: [a] -> [[a]]
segments = concatMap inits . tails
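Put together, the specification gives a brute-force reference implementation. In this sketch `mssNaive` is a hypothetical name and `maximumBy (comparing sum)` is one deterministic choice for `MaxWith sum`.

```haskell
import Data.List (inits, maximumBy, tails)
import Data.Ord (comparing)

short :: Int -> [a] -> Bool
short b = (<= b) . length

segments :: [a] -> [[a]]
segments = concatMap inits . tails

-- Hypothetical name: try every segment of length at most b.
mssNaive :: (Num a, Ord a) => Int -> [a] -> [a]
mssNaive b = maximumBy (comparing sum) . filter (short b) . segments
```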

We can reason about the spec equationally as follows
```
  MaxWith sum . filter (short b) . segments
= {definition of segments}
  MaxWith sum . filter (short b) . concatMap inits . tails
= {since filter p . concat = concat . map (filter p)}
  MaxWith sum . concatMap (filter (short b) . inits) . tails
= {distributive law}
  MaxWith sum . map (MaxWith sum . filter (short b) . inits) . tails
-> {with msp b <- MaxWith sum . filter (short b) . inits}
  MaxWith sum . map (msp b) . tails
```
The form of `mss` suggests an appeal to the Scan Lemma, which is an important tool when dealing with problems involving segments.
```
map (foldr op e) . tails = scanr op e
```
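A small instance of the lemma, with `op = (+)` and `e = 0`, shows both sides computing the suffix sums of a list. The names `viaTails` and `viaScan` are hypothetical.

```haskell
import Data.List (tails)

-- Left-hand side: fold every tail separately (quadratic work).
viaTails :: [Int] -> [Int]
viaTails = map (foldr (+) 0) . tails

-- Right-hand side: one pass that reuses the previous result (linear work).
viaScan :: [Int] -> [Int]
viaScan = scanr (+) 0

-- Both give [6,5,3,0] on [1,2,3].
```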
If we can express `msp` as an instance of `foldr` then we can apply the Scan Lemma. There is no obvious way to do so, so instead we proceed with the usual thinning strategy and try to fuse `filter (short b)` with `inits`.

```
msp b <- MaxWith sum . foldr op [[]] where
  op x xss = [] : take b (map (x:) xss)
```
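The fold in this refinement can be checked against `filter (short b) . inits` on small inputs. In this sketch `shortInits` is a hypothetical name for `foldr op [[]]` with `b` passed explicitly.

```haskell
import Data.List (inits)

-- Hypothetical name: the prefixes of length at most b, built by a fold.
shortInits :: Int -> [a] -> [[a]]
shortInits b = foldr op [[]] where
  op x xss = [] : take b (map (x:) xss)

-- The unfused version, for comparison.
shortInitsSpec :: Int -> [a] -> [[a]]
shortInitsSpec b = filter ((<= b) . length) . inits
```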
Next we introduce thinning:
```
msp b <- MaxWith sum . ThinBy thin . foldr op [[]]
```

In [58]:
thin :: (Num a, Ord a) => [a] -> [a] -> Bool
thin xs ys = sum xs >= sum ys && length xs <= length ys

The next step is to use fusion again to thin at each step rather than once at the end.

In [71]:
msp :: (Num a, Ord a) => Int -> [a] -> [a]
msp b = last . foldr (op b) [[]]
op b x xss = [] : thin (map (x:) (cut xss)) where
  thin = dropWhile ((<= 0) . sum)
  cut xss' = if length (last xss') == b then init xss' else xss'

We cut from the end of the list to keep the prefixes short and thin from the front to remove non-maximal sums.

In [89]:
import Data.Ord

maxWith :: Ord b => (a -> b) -> [a] -> a
maxWith = maximumBy . comparing

mss :: (Num a, Ord a) => Int -> [a] -> [a]
mss b = maxWith sum . map last . scanr (op b) [[]]

mss 2 [1,2,3]

[2,3]

Further optimisation comes from the fact that each candidate is a prefix of the next, so `op` is inefficient because it `map`s over every candidate. Instead we can represent the candidates by their differences, storing only what each candidate adds to the one before it.

We can define a function that goes from this compact representation to the normal one:
```
abst :: [[a]] -> [[a]]
abst = scanl (++) []
```
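For example, the candidates `[[],[2],[2,-1,3]]` are represented compactly as `[[2],[-1,3]]`. The sketch below restates the abstraction function under the name `abst`, which is the name used in the laws that follow.

```haskell
-- Expand the compact (difference) representation into the full candidates.
abst :: [[a]] -> [[a]]
abst = scanl (++) []

-- abst [[2],[-1,3]] == [[],[2],[2,-1,3]]
-- abst []           == [[]]
```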
To effect this change we need a function, `opR`, so that
```
abst (opR b x xss) = op b x (abst xss)
```
Then by the fusion law of `foldr`, we have
```
abst . foldr (opR b) [] = foldr (op b) [[]]
```
Here we are applying the fusion law in the fission direction to split the fold on the right into two functions.
To define `opR` we need `cutR` and `thinR`.

In [94]:
cutR :: Int -> [[a]] -> [[a]]
cutR b xss = if length (concat xss) == b then init xss else xss

thinR :: (Ord a, Num a) => a -> [[a]] -> [[a]]
thinR x xss = add [x] xss where
  add xs xss
    | sum xs > 0 = xs : xss
    | xs':xss' <- xss
    = add (xs ++ xs') xss'
    | otherwise = []
    
opR :: (Ord a, Num a) => Int -> a -> [[a]] -> [[a]]
opR b x = thinR x . cutR b

mss :: (Ord a, Num a) => Int -> [a] -> [a]
mss b = maxWith sum . map concat . scanr (opR b) []

mss 3 [1,2,-1,3]

[2,-1,3]

The final step would be to make sure all the `++`, `length`, `concat`, `init`, and `sum` operations are efficient, which involves tupling and concatenation via function composition. This is straightforward, if tedious.
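To give a flavour of that last step, here is a partial sketch of the tupling. Everything in it is an assumption layered on the code above: `Piece`, `joinP`, `cutT`, `thinT`, and `mssT` are hypothetical names, sums and lengths are cached next to each piece, and elements are held as difference lists so that joining two pieces is a function composition. It assumes `b >= 1`, and it still uses `last` and `init` on the list of pieces, so a queue-like structure would be needed to remove that remaining linear cost.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- One piece of the compact representation, with its sum and length cached
-- and its elements held as a difference list.
data Piece a = Piece { pSum :: a, pLen :: Int, pElems :: [a] -> [a] }

joinP :: Num a => Piece a -> Piece a -> Piece a
joinP (Piece s k f) (Piece t l g) = Piece (s + t) (k + l) (f . g)

-- Candidate state: total length, total sum, and the pieces themselves.
type State a = (Int, a, [Piece a])

-- Drop the last piece when the longest candidate has reached length b.
cutT :: Num a => Int -> State a -> State a
cutT b (n, t, ps)
  | n == b && not (null ps) = (n - pLen p, t - pSum p, init ps)
  | otherwise               = (n, t, ps)
  where p = last ps

-- Add x at the front, merging pieces until the leading sum is positive.
thinT :: (Ord a, Num a) => a -> State a -> State a
thinT x (n, t, ps) = case add (Piece x 1 (x :)) ps of
    [] -> (0, 0, [])
    qs -> (n + 1, t + x, qs)
  where
    add p qs
      | pSum p > 0    = p : qs
      | q : qs' <- qs = add (joinP p q) qs'
      | otherwise     = []

mssT :: (Ord a, Num a) => Int -> [a] -> [a]
mssT b = flatten . maximumBy (comparing total)
       . scanr (\x -> thinT x . cutT b) (0, 0, [])
  where
    total (_, t, _)    = t
    flatten (_, _, ps) = foldr pElems [] ps
```

With these caches, `length (concat xss)` and `sum` become field lookups, and only the winning state is ever flattened into a list.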