# Ordering Strings of Varying Length Lexicographically

## Background Info

We can use the lexicographic order of strings to catalog genetic strings of varying lengths.

## Problem

Say that we have strings $s = s_1s_2...s_m$ and $t = t_1t_2...t_n$ with $m\le n$. Consider the substring $t'=t[1:m]$. 

We have 2 cases:

1. If $s=t'$, then we set $s \lt_{Lex} t$ because $s$ is shorter than $t$ (e.g., APPLE $\lt_{Lex}$ APPLET).
2. Otherwise, $s \neq t'$. We define $s \lt_{Lex} t$ if $s \lt_{Lex} t'$ and define $s \gt_{Lex} t$ if $s \gt_{Lex} t'$ (e.g., APPLET $\lt_{Lex}$ ARTS because APPL $\lt_{Lex}$ ARTS).
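
The two cases above can be sketched as a small comparison function. This is only an illustration, not part of the original solution: the name `lex_compare` and the `rank` lookup table are assumptions introduced here, and the alphabet is passed in as an ordered sequence of symbols.

```python
def lex_compare(s, t, alphabet):
    """Return -1, 0, or 1 as s comes before, equals, or comes after t."""
    # Position of each symbol in the ordered alphabet (hypothetical helper table).
    rank = {symbol: i for i, symbol in enumerate(alphabet)}
    # Case 2: the strings differ somewhere within the shorter length.
    for a, b in zip(s, t):
        if rank[a] != rank[b]:
            return -1 if rank[a] < rank[b] else 1
    # Case 1: one string is a prefix of the other, so the shorter one comes first.
    if len(s) == len(t):
        return 0
    return -1 if len(s) < len(t) else 1


english = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(lex_compare("APPLE", "APPLET", english))  # -1 (prefix case)
print(lex_compare("APPLET", "ARTS", english))   # -1 (P comes before R)
print(lex_compare("N", "DA", "DNA"))            # 1 (custom alphabet: D < N < A)
```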

**Given**: A permutation of at most 12 symbols defining an ordered alphabet $A$ and a positive integer $n \ (n \le 4)$.<br>
**Return**: All strings of length at most $n$ formed from $A$, ordered lexicographically. 

## Solution Explanation

This problem can be solved with recursion. Let's say:<br>
$n$ = the maximum length of the string formed (given)<br>
$L$ = the list containing symbols defining an ordered alphabet $A$<br>
$l_m$ = the list containing all the possible strings of length at most $m$ formed with symbols in $L$, ordered lexicographically

Suppose we know $l_{n-1}$. Then we can obtain $l_{n}$ by keeping every element of $l_{n-1}$ in its current position and, immediately after each element of length $n-1$, inserting that element extended by each symbol of $L$ in alphabet order. Let's take Rosalind's sample values to illustrate this:<br>
* $n$ = 3
* $L$ = [D, N, A]

Suppose we know $l_1$, the list of all possible strings of length at most 1 formed with symbols in $L$; that is, $l_1$ = [D, N, A]. Then, to obtain $l_2$, we:
1. include all the elements that are in $l_1$<br>
$l_2$ = [D, N, A]
2. append each of the symbols in $L$ to every string in $l_1$ of length 1, placing the new strings directly after the string they extend<br>
$l_2$ = [D, DD, DN, DA, N, ND, NN, NA, A, AD, AN, AA]

The above process is the recursive case. What, then, is the base case? The base case is $n = 1$, where the answer is just the alphabet itself: $l_1 = L$.

In [22]:
l = 'D N A'
n = 3
l = l.split(' ')

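def get_all_possible_seqs(n):
    # Base case: the strings of length at most 1 are the alphabet symbols themselves.
    if n == 1:
        return l
    # Recursive case: build l_n from l_(n-1) by extending its longest strings.
    return get_next_set_of_seqs(get_all_possible_seqs(n-1), n-1)

def get_next_set_of_seqs(seq_lst, m):
    res = []
    for s in seq_lst:
        res.append(s)
        # Only strings of length m are extended; the extensions go right after s.
        if len(s) == m:
            for i in l:
                res.append(s + i)
    return res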
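
The same ordering can also be read as a depth-first walk: emit a string, then emit everything that extends it, then move on to the next symbol. The generator below is a minimal sketch of that view, not the method used in this notebook. The name `ordered_strings` and its parameters are assumptions made for illustration, and the alphabet is passed explicitly instead of being read from the global `l`.

```python
def ordered_strings(alphabet, max_len, prefix=""):
    """Yield every non-empty string over `alphabet` of length <= max_len,
    in lexicographic order with respect to the order of `alphabet`."""
    for symbol in alphabet:
        word = prefix + symbol
        # A string always comes before any longer string that starts with it.
        yield word
        if len(word) < max_len:
            yield from ordered_strings(alphabet, max_len, word)


print(list(ordered_strings(["D", "N", "A"], 2)))
# ['D', 'DD', 'DN', 'DA', 'N', 'ND', 'NN', 'NA', 'A', 'AD', 'AN', 'AA']
```

Because it yields one string at a time, this version never needs to hold the intermediate lists in memory.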


In [21]:
#print(get_all_possible_seqs(n))

In [23]:
# -- Print out in Rosalind way --
for i in get_all_possible_seqs(n):
    print(i)

D
DD
DDD
DDN
DDA
DN
DND
DNN
DNA
DA
DAD
DAN
DAA
N
ND
NDD
NDN
NDA
NN
NND
NNN
NNA
NA
NAD
NAN
NAA
A
AD
ADD
ADN
ADA
AN
AND
ANN
ANA
AA
AAD
AAN
AAA
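
One way to cross-check this output is to generate the strings without any recursion and sort them afterwards. The sketch below swaps in a different technique: it builds all strings with `itertools.product` and sorts them by the alphabet rank of each symbol. The function name `ordered_strings_by_sorting` and the `rank` table are assumptions introduced for this example.

```python
from itertools import product


def ordered_strings_by_sorting(alphabet, max_len):
    """Build every string of length 1..max_len, then sort by alphabet rank."""
    rank = {symbol: i for i, symbol in enumerate(alphabet)}
    words = [
        "".join(chars)
        for length in range(1, max_len + 1)
        for chars in product(alphabet, repeat=length)
    ]
    # Lists of ranks compare element by element, and a proper prefix sorts
    # before any longer list that starts with it.
    return sorted(words, key=lambda w: [rank[c] for c in w])


result = ordered_strings_by_sorting(["D", "N", "A"], 3)
print(len(result))   # 39
print(result[:5])    # ['D', 'DD', 'DDD', 'DDN', 'DDA']
```

This does more work than the recursive approach because of the sort, but for at most 12 symbols and $n \le 4$ that cost is unlikely to matter.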


## Actual Dataset

In [18]:
l = 'F M Z L C X P A K R'
l = l.split(' ')
n = 4

In [20]:
print(get_all_possible_seqs(n))

['F', 'FF', 'FFF', 'FFFF', 'FFFM', 'FFFZ', 'FFFL', 'FFFC', 'FFFX', 'FFFP', 'FFFA', 'FFFK', 'FFFR', 'FFM', 'FFMF', 'FFMM', 'FFMZ', 'FFML', 'FFMC', 'FFMX', 'FFMP', 'FFMA', 'FFMK', 'FFMR', 'FFZ', 'FFZF', 'FFZM', 'FFZZ', 'FFZL', 'FFZC', 'FFZX', 'FFZP', 'FFZA', 'FFZK', 'FFZR', 'FFL', 'FFLF', 'FFLM', 'FFLZ', 'FFLL', 'FFLC', 'FFLX', 'FFLP', 'FFLA', 'FFLK', 'FFLR', 'FFC', 'FFCF', 'FFCM', 'FFCZ', 'FFCL', 'FFCC', 'FFCX', 'FFCP', 'FFCA', 'FFCK', 'FFCR', 'FFX', 'FFXF', 'FFXM', 'FFXZ', 'FFXL', 'FFXC', 'FFXX', 'FFXP', 'FFXA', 'FFXK', 'FFXR', 'FFP', 'FFPF', 'FFPM', 'FFPZ', 'FFPL', 'FFPC', 'FFPX', 'FFPP', 'FFPA', 'FFPK', 'FFPR', 'FFA', 'FFAF', 'FFAM', 'FFAZ', 'FFAL', 'FFAC', 'FFAX', 'FFAP', 'FFAA', 'FFAK', 'FFAR', 'FFK', 'FFKF', 'FFKM', 'FFKZ', 'FFKL', 'FFKC', 'FFKX', 'FFKP', 'FFKA', 'FFKK', 'FFKR', 'FFR', 'FFRF', 'FFRM', 'FFRZ', 'FFRL', 'FFRC', 'FFRX', 'FFRP', 'FFRA', 'FFRK', 'FFRR', 'FM', 'FMF', 'FMFF', 'FMFM', 'FMFZ', 'FMFL', 'FMFC', 'FMFX', 'FMFP', 'FMFA', 'FMFK', 'FMFR', 'FMM', 'FMMF', 'FMMM', 'F
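
Since the full list is long, a quick size check can help confirm that nothing is missing. An alphabet of $|A|$ symbols gives $|A|^k$ strings of each length $k$, so the result should contain $\sum_{k=1}^{n} |A|^k$ strings. The helper name `count_strings` below is an assumption made for illustration.

```python
def count_strings(alphabet_size, max_len):
    """Number of non-empty strings of length <= max_len over the alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_len + 1))


print(count_strings(3, 3))    # 39, matching the sample output for D N A
print(count_strings(10, 4))   # 11110 for the 10-symbol dataset with n = 4
```

Comparing this number with `len(get_all_possible_seqs(n))` is a cheap test before submitting.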

## Problem solved!