# Comparison of longest common subsequence (LCS) algorithms

This notebook compares longest common subsequence (LCS) algorithms:

- Brute-force: generate combinations of subsequences and check if they are common subsequences.
- Dynamic programming: take advantage of common subproblems to not evaluate the same subsequence more than once.
- Hirschberg's linear space: a dynamic programming approach that uses significantly less memory.
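
As a rough illustration of the dynamic programming idea, the sketch below fills a table of LCS lengths for all prefix pairs and walks back through it to recover one LCS. A second function shows the observation that Hirschberg's algorithm builds on: the LCS length alone can be computed while keeping only two rows of the table. The names `lcs_dp` and `lcs_length_two_rows` are hypothetical and are not necessarily how the implementations compared in this notebook are written.

```python
def lcs_dp(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (illustrative sketch)."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to rebuild one LCS
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


def lcs_length_two_rows(a: str, b: str) -> int:
    """Return only the LCS length, keeping two table rows in memory."""
    previous = [0] * (len(b) + 1)
    for ch in a:
        current = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


print(lcs_dp("AGGTAB", "GXTXAYB"))
print(lcs_length_two_rows("AGGTAB", "GXTXAYB"))
```

The full table needs memory proportional to the product of the two input lengths, while the two-row version needs memory proportional to the length of one input, at the cost of not being able to backtrack directly.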

The comparison measures:

- Runtime efficiency: how long it takes to find the LCS.
- Memory efficiency: how much memory is used to find the LCS.
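
One possible way to collect both measurements with the standard library is sketched below, using `time.perf_counter` for elapsed time and `tracemalloc` for peak memory. The helper `measure` is hypothetical; the `metrics` module used later in this notebook may collect its numbers differently.

```python
import time
import tracemalloc


def measure(func, *args):
    """Return (result, elapsed seconds, peak KiB) for one call of func.

    Illustrative only: tracing allocations slows the call down, so a real
    benchmark would likely time and trace memory in separate runs.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 1024


# Example: measure sorting a reversed list of integers
result, seconds, peak_kib = measure(sorted, list(range(100_000, 0, -1)))
print(f"{seconds:.4f} s, peak {peak_kib:.1f} KiB")
```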

## Problem description

>> Add here the problem description and references.

## Notebook structure

>>> Describe here the structure of the notebook

## Sanity check and initialization

Check that the algorithms work by testing them against controlled input.

There are three parts to the tests:

1. Automated tests that check against well-defined inputs. They are meant to be easy to debug, in case an algorithm fails.
1. Tests with longer inputs that simulate DNA strands. They test more realistic scenarios, but are still short enough to run fast.
1. A visual check, by printing the aligned subsequence. It guards against the test code itself having a failure that generates false positives.
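
The visual check could look roughly like the sketch below, which prints a sequence and, underneath it, the positions where a subsequence matches. The helper `show_alignment` is hypothetical and uses a simple greedy left-to-right match; the output produced by `lcs_test` may be formatted differently.

```python
def show_alignment(sequence: str, subsequence: str) -> str:
    """Return two lines: the sequence, and the matched characters below it.

    Unmatched positions are shown as dots. Raises ValueError when
    subsequence is not actually a subsequence of sequence.
    """
    markers = []
    k = 0  # index of the next subsequence character to match
    for ch in sequence:
        if k < len(subsequence) and ch == subsequence[k]:
            markers.append(ch)
            k += 1
        else:
            markers.append(".")
    if k != len(subsequence):
        raise ValueError("not a subsequence")
    return sequence + "\n" + "".join(markers)


print(show_alignment("AGGTAB", "GTAB"))
```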

In [None]:
import lcs_test

lcs_test.test(visualize=True)

Set a seed so that the pseudo-random number generator produces the same sequence across runs. This makes it easier to compare different runs of the algorithms.

In [None]:
import random
random.seed(42)
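
The effect of the seed can be seen in a minimal sketch: reseeding with the same value reproduces the same pseudo-random DNA string. The helper `random_dna` is hypothetical and only stands in for however the test inputs are actually generated.

```python
import random


def random_dna(length: int) -> str:
    """Return a pseudo-random string over the DNA alphabet (illustrative)."""
    return "".join(random.choices("ACGT", k=length))


random.seed(42)
first = random_dna(20)

random.seed(42)  # reseeding restarts the same pseudo-random sequence
second = random_dna(20)

print(first == second)
```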

## Tests

To illustrate a real-life scenario, the code checks if a DNA strand is a subsequence of
a larger DNA sequence (see [this illustration](https://en.wikipedia.org/wiki/Subsequence#Applications)).
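
A strand is a subsequence of a DNA sequence exactly when the LCS of the two has the same length as the strand. For comparison, a direct check that does not compute an LCS could be sketched as follows; `is_subsequence` is a hypothetical helper, not part of this notebook's modules.

```python
def is_subsequence(strand: str, dna: str) -> bool:
    """Return True if strand appears in dna in order, not necessarily contiguously."""
    remaining = iter(dna)
    # `base in remaining` advances the iterator until it finds base,
    # so later bases can only match further to the right in dna.
    return all(base in remaining for base in strand)


print(is_subsequence("GTAB", "AGGTAB"))
print(is_subsequence("BA", "AGGTAB"))
```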

In [None]:
import metrics

### Runtime tests and analysis

#### Data collection

Get raw test data. This includes all repetitions.

In [None]:
rt_results_raw = metrics.runtime(verbose=1)

Convert the raw results into a pandas DataFrame to facilitate analysis.

In [None]:
import pandas as pd
rt_results = pd.DataFrame(rt_results_raw)
rt_results.columns = ['Algorithm', 'DNA size', 'Strand size', 'Test number', 'Runtime (s)']
display(rt_results)

In [None]:
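# Average the repetitions of each (algorithm, DNA size, strand size) combination
rt_results_summary = rt_results.groupby(['Algorithm', 'DNA size', 'Strand size']).mean()
# The mean of the test number is meaningless, so drop it
rt_results_summary.drop(['Test number'], axis='columns', inplace=True)
rt_results_summary.reset_index(inplace=True)
display(rt_results_summary.sort_values(by=['DNA size', 'Runtime (s)']))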

In [None]:
import seaborn as sns
sns.set(style="whitegrid")
sns.barplot(data=rt_results_summary, y='DNA size', x='Runtime (s)', hue='Algorithm',
    orient='h');

### Memory usage and analysis

Get raw test data. This includes all repetitions.

In [None]:
mem_results_raw = metrics.memory(verbose=1)

In [None]:
mem_results = pd.DataFrame(mem_results_raw)
mem_results.columns = ['Algorithm', 'DNA size', 'Strand size', 'Test number', 'Memory used (KiB)', 'Runtime (s)']
display(mem_results)

## References

| <!-- -->    | <!-- -->    |
|-------------|-------------|
| 1  | CLRS  |