To demonstrate one-hot encoding of protein sequences using the ``SequencePreprocessor().encode_one_hot()`` method, we first create two example sequences:

In [1]:
import aaanalysis as aa
import pandas as pd

list_seq = ["AACDEFGHIY", "IIHGFECDAY"]
sp = aa.SequencePreprocessor()

Provide the sequences via the ``list_seq`` parameter to obtain a feature matrix (``X``) and the corresponding ``features``. Each feature is a binary indicator for one amino acid at one residue position (e.g., ``1A`` is 1 if the first residue is alanine):

In [2]:
X, features = sp.encode_one_hot(list_seq=list_seq)

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 200)


Unnamed: 0,1A,1C,1D,1E,1F,1G,1H,1I,1K,1L,1M,1N,1P,1Q,1R,1S,1T,1V,1W,1Y,2A,2C,2D,2E,2F,2G,2H,2I,2K,2L,2M,2N,2P,2Q,2R,2S,2T,2V,2W,2Y,3A,3C,3D,3E,3F,3G,3H,3I,3K,3L,3M,3N,3P,3Q,3R,3S,3T,3V,3W,3Y,4A,4C,4D,4E,4F,4G,4H,4I,4K,4L,4M,4N,4P,4Q,4R,4S,4T,4V,4W,4Y,5A,5C,5D,5E,5F,5G,5H,5I,5K,5L,5M,5N,5P,5Q,5R,5S,5T,5V,5W,5Y,6A,6C,6D,6E,6F,6G,6H,6I,6K,6L,6M,6N,6P,6Q,6R,6S,6T,6V,6W,6Y,7A,7C,7D,7E,7F,7G,7H,7I,7K,7L,7M,7N,7P,7Q,7R,7S,7T,7V,7W,7Y,8A,8C,8D,8E,8F,8G,8H,8I,8K,8L,8M,8N,8P,8Q,8R,8S,8T,8V,8W,8Y,9A,9C,9D,9E,9F,9G,9H,9I,9K,9L,9M,9N,9P,9Q,9R,9S,9T,9V,9W,9Y,10A,10C,10D,10E,10F,10G,10H,10I,10K,10L,10M,10N,10P,10Q,10R,10S,10T,10V,10W,10Y
AACDEFGHIY,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
IIHGFECDAY,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
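
The layout of ``X`` and ``features`` shown above can be reproduced with a few lines of plain Python. The following is a minimal sketch for illustration only, not the ``aaanalysis`` implementation; the helper ``one_hot_sketch`` and the constant ``AMINO_ACIDS`` are hypothetical names, and the sketch assumes sequences of equal length:

```python
# Plain-Python sketch of position-wise one-hot encoding.
# Hypothetical helper for illustration; not the aaanalysis implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids, as in the table header above


def one_hot_sketch(list_seq, alphabet=AMINO_ACIDS):
    """Return (X, features) for sequences of equal length."""
    n_pos = len(list_seq[0])
    if any(len(seq) != n_pos for seq in list_seq):
        raise ValueError("This sketch expects sequences of equal length.")
    # Feature names pair a 1-based position with an alphabet character, e.g. '1A'
    features = [f"{pos}{char}" for pos in range(1, n_pos + 1) for char in alphabet]
    X = []
    for seq in list_seq:
        row = []
        for residue in seq:
            # One block of len(alphabet) binary values per residue position
            row.extend(1 if residue == char else 0 for char in alphabet)
        X.append(row)
    return X, features


X, features = one_hot_sketch(["AACDEFGHIY", "IIHGFECDAY"])
# Names of the features set to 1 for each sequence
active = [[name for name, value in zip(features, row) if value] for row in X]
print(len(X), len(features))
print(active[0])
```

Each sequence contributes exactly one active feature per position, so the number of columns equals the sequence length multiplied by the alphabet size (10 x 20 = 200 here).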


Use the ``alphabet`` parameter to change which characters are encoded. The number of features equals the sequence length multiplied by the alphabet size:

In [3]:
# Show one-hot encoding with smaller alphabet
list_seq = ["ABC", "CBA"]
ALPHABET = "ABC"
X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)


Unnamed: 0,1A,1B,1C,2A,2B,2C,3A,3B,3C
ABC,1,0,0,0,1,0,0,0,1
CBA,0,0,1,0,1,0,1,0,0


Change the gap symbol (default ``-``) using the ``gap`` parameter. Gap positions are encoded as all zeros:

In [4]:
# Show one-hot encoding with other gap ('*')
list_seq = ["ABC", "CB*"]
ALPHABET = "ABC"
X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET, gap="*")

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)


Unnamed: 0,1A,1B,1C,2A,2B,2C,3A,3B,3C
ABC,1,0,0,0,1,0,0,0,1
CB*,0,0,1,0,1,0,0,0,0
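
The all-zero encoding of the gap position in ``CB*`` can be sketched as follows. This is an illustrative assumption about the logic, not the library's code: ``encode_residue`` and ``encode_sequence`` are hypothetical helpers, and rejecting characters that are neither in the alphabet nor the gap symbol is a choice made for this sketch only.

```python
def encode_residue(residue, alphabet="ABC", gap="-"):
    """Encode one character; the gap symbol maps to an all-zero block."""
    if residue == gap:
        return [0] * len(alphabet)
    if residue not in alphabet:
        # Assumption of this sketch: unknown characters are rejected
        raise ValueError(f"'{residue}' is neither in the alphabet nor the gap symbol.")
    return [1 if residue == char else 0 for char in alphabet]


def encode_sequence(seq, alphabet="ABC", gap="-"):
    """Concatenate the per-residue blocks of one sequence into a single row."""
    row = []
    for residue in seq:
        row.extend(encode_residue(residue, alphabet=alphabet, gap=gap))
    return row


row = encode_sequence("CB*", alphabet="ABC", gap="*")
print(row)
```

The last three values of ``row`` (features ``3A``, ``3B``, ``3C``) are all zero, matching the second row of the table above.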


If the sequences differ in length, shorter sequences are padded with gaps to the length of the longest sequence, either at the C-terminus (default) or at the N-terminus. Select the side using the ``pad_at`` parameter (``N`` or ``C``):

In [5]:
# Show default padding (at C-terminus)
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET)

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)


Unnamed: 0,1A,1B,1C,2A,2B,2C,3A,3B,3C
ABC,1,0,0,0,1,0,0,0,1
B,0,1,0,0,0,0,0,0,0


In [6]:
# Show N-terminal padding
list_seq = ["ABC", "B"]
ALPHABET = "ABC"
X, features = sp.encode_one_hot(list_seq=list_seq, alphabet=ALPHABET, pad_at="N")

# Convert to DataFrame for visualization
df_one_hot = pd.DataFrame(X, columns=features, index=list_seq)
aa.display_df(df=df_one_hot, show_shape=True)

DataFrame shape: (2, 9)


Unnamed: 0,1A,1B,1C,2A,2B,2C,3A,3B,3C
ABC,1,0,0,0,1,0,0,0,1
B,0,0,0,0,0,0,0,1,0
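
The two padding modes differ only in the side on which gap symbols are added before encoding. A minimal sketch, assuming a single-character gap symbol (``pad_sequences`` is a hypothetical helper, not part of ``aaanalysis``):

```python
def pad_sequences(list_seq, gap="-", pad_at="C"):
    """Pad all sequences with the gap symbol to the length of the longest one."""
    if pad_at not in ("N", "C"):
        raise ValueError("'pad_at' should be 'N' or 'C'.")
    max_len = max(len(seq) for seq in list_seq)
    if pad_at == "C":
        # Gaps appended after the last residue (C-terminal padding)
        return [seq.ljust(max_len, gap) for seq in list_seq]
    # Gaps inserted before the first residue (N-terminal padding)
    return [seq.rjust(max_len, gap) for seq in list_seq]


padded_c = pad_sequences(["ABC", "B"], pad_at="C")
padded_n = pad_sequences(["ABC", "B"], pad_at="N")
print(padded_c)
print(padded_n)
```

With C-terminal padding, ``B`` stays at position 1 (feature ``1B``), whereas N-terminal padding shifts it to position 3 (feature ``3B``), as seen in the two tables above.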
