To demonstrate integer encoding of protein sequences using the ``SequencePreprocessor().encode_integer()`` method, we first create two example sequences and a ``SequencePreprocessor`` object:

In [7]:
import aaanalysis as aa
import pandas as pd

list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
sp = aa.SequencePreprocessor()

Provide the sequences via the ``list_seq`` parameter to obtain a feature matrix (``X``) and the respective ``features`` (the position labels ``P1`` to ``Pn``). Each value in ``X`` is the integer representation of the amino acid at the given residue position:

In [8]:
X, features = sp.encode_integer(list_seq=list_seq)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 10)


Unnamed: 0,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10
1,1,1,2,3,4,5,6,7,8,20
2,8,8,7,6,5,4,2,3,1,20
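The values above are consistent with each residue being replaced by its 1-based position in an alphabet of the 20 canonical amino acids sorted by one-letter code. The following is a minimal pure-Python sketch of that mapping, not the library's implementation. The helper ``encode_integer_sketch`` and the constant ``CANONICAL_AA`` are hypothetical names, and the alphabet order is an assumption inferred from the output shown:

```python
# Hypothetical illustration; not part of aaanalysis.
# Assumed default alphabet: 20 canonical amino acids sorted by one-letter code.
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"


def encode_integer_sketch(list_seq, alphabet=CANONICAL_AA):
    """Map each residue to its 1-based index in ``alphabet``."""
    aa_to_int = {aa: i for i, aa in enumerate(alphabet, start=1)}
    return [[aa_to_int[aa] for aa in seq] for seq in list_seq]


X_sketch = encode_integer_sketch(["AACDEFGHIY", "IIHGFECDAY"])
print(X_sketch)
```

Under this assumption, ``A`` maps to 1, ``I`` to 8, and ``Y`` to 20, which matches the first and last columns of the table.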


Adjust the ``alphabet`` parameter to change which characters are considered:

In [9]:
# Show integer encoding with smaller alphabet
list_seq = ["ABC", "CBA"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)


Unnamed: 0,P1,P2,P3
ABC,1,2,3
CBA,3,2,1


Change the ``gap`` symbol (default: ``-``), which is encoded as 0, as follows:

In [10]:
# Show integer encoding with other gap ('*')
list_seq = ["ABC", "CB*"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, gap="*")

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)


Unnamed: 0,P1,P2,P3
ABC,1,2,3
CB*,3,2,0
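In the output above, the gap symbol is encoded as 0 while alphabet characters start at 1. A minimal sketch of this behavior, assuming the gap is simply added to the lookup table with value 0, could look as follows. The helper ``encode_with_gap_sketch`` is a hypothetical name and does not reflect the library's internals:

```python
# Hypothetical illustration; not part of aaanalysis.
def encode_with_gap_sketch(list_seq, alphabet="ABC", gap="-"):
    """Encode alphabet characters as 1..n and the gap symbol as 0."""
    aa_to_int = {aa: i for i, aa in enumerate(alphabet, start=1)}
    aa_to_int[gap] = 0  # Assumption: gaps are reserved the value 0
    return [[aa_to_int[aa] for aa in seq] for seq in list_seq]


X_gap = encode_with_gap_sketch(["ABC", "CB*"], alphabet="ABC", gap="*")
print(X_gap)
```

Reserving 0 for gaps keeps them distinguishable from every real character, since alphabet indices start at 1.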


If the sequences differ in length, the shorter ones are filled up with gaps at either the C-terminus (default) or the N-terminus, which is called padding. Choose the padding side using the ``pad_at`` parameter (``N`` or ``C``):

In [11]:
# Show default padding (at C-terminus)
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)


Unnamed: 0,P1,P2,P3
ABC,1,2,3
B,2,0,0


In [12]:
# Show N-terminal padding
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = sp.encode_integer(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")

# Convert to DataFrame for visualization
df_encode = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_encode, show_shape=True)

DataFrame shape: (2, 3)


Unnamed: 0,P1,P2,P3
ABC,1,2,3
B,0,0,2
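The two padding modes can be reproduced with a short pure-Python sketch: each sequence is extended with gap symbols to the length of the longest sequence, on either side, and then encoded. The helper ``pad_and_encode_sketch`` is a hypothetical name, and the sketch assumes that gaps are encoded as 0, as in the outputs above. It is not the library's implementation:

```python
# Hypothetical illustration; not part of aaanalysis.
def pad_and_encode_sketch(list_seq, alphabet="ABC", gap="-", pad_at="C"):
    """Pad sequences to equal length with gaps, then integer-encode them."""
    if pad_at not in ("N", "C"):
        raise ValueError("pad_at must be 'N' or 'C'")
    aa_to_int = {aa: i for i, aa in enumerate(alphabet, start=1)}
    aa_to_int[gap] = 0  # Assumption: gaps are encoded as 0
    max_len = max(len(seq) for seq in list_seq)
    rows = []
    for seq in list_seq:
        padding = gap * (max_len - len(seq))
        # N-terminal padding goes before the sequence, C-terminal after it
        padded = padding + seq if pad_at == "N" else seq + padding
        rows.append([aa_to_int[aa] for aa in padded])
    return rows


X_c = pad_and_encode_sketch(["ABC", "B"], pad_at="C")
X_n = pad_and_encode_sketch(["ABC", "B"], pad_at="N")
print(X_c)
print(X_n)
```

With C-terminal padding, residues stay aligned from the start of the sequence. N-terminal padding aligns them from the end instead.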
