To demonstrate the ``filter_seq()`` function, we load the ``SEQ_CAPSID`` example dataset (see [Breimann24a]_):

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=1000)

The ``filter_seq()`` function is a Python wrapper for two sequence clustering and filtering algorithms, CD-HIT and MMseqs2, which must be installed independently of the ``AAanalysis`` package. Select one by setting the ``method`` parameter to ``cd-hit`` (default) or ``mmseqs``:

In [2]:
# Filtering using CD-HIT (default)
df_clust = aa.filter_seq(df_seq=df_seq)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")
aa.display_df(df_clust, n_rows=-5, show_shape=True)

# Filtering using MMseqs2
df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs")
n_clust = df_clust["cluster"].nunique()
print(f"Number of MMSeqs2 clusters: {n_clust}")

Number of CD-HIT clusters: 2000
DataFrame shape: (2000, 4)


Unnamed: 0,entry,cluster,identity_with_rep,is_representative
1996,CAPSID_4517,1995,100.0,1
1997,CAPSID_4516,1996,100.0,1
1998,CAPSID_4300,1997,100.0,1
1999,CAPSID_4108,1998,100.0,1
2000,CAPSID_4984,1999,100.0,1


Number of MMSeqs2 clusters: 1998
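
Because both tools are external programs, it can help to check that they are reachable before calling ``filter_seq()``. The following is a minimal sketch using only the Python standard library. The executable names ``cd-hit`` and ``mmseqs`` are assumptions based on the tools' usual command-line names, and ``find_clustering_tools`` is a hypothetical helper, not part of ``AAanalysis``:

```python
import shutil

# Assumed executable names (the usual command-line names of CD-HIT and MMseqs2)
TOOLS = {"cd-hit": "cd-hit", "mmseqs": "mmseqs"}


def find_clustering_tools(tools=TOOLS):
    """Map each method name to the path of its executable, or None if not found."""
    return {method: shutil.which(exe) for method, exe in tools.items()}


available = find_clustering_tools()
for method, path in available.items():
    status = path if path is not None else "not found on PATH"
    print(f"{method}: {status}")
```

A method reported as not found would need to be installed (or added to the ``PATH``) before it can be selected via ``method``.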


You can obtain a redundancy-reduced set of protein sequences by selecting the representative sequence of each cluster:

In [3]:
# Select redundancy-reduced sequences
df_selected = df_clust[df_clust["is_representative"] == 1]
aa.display_df(df_selected, n_rows=-5, show_shape=True)

DataFrame shape: (1998, 4)


Unnamed: 0,entry,cluster,identity_with_rep,is_representative
1996,CAPSID_4936,1993,100.0,1
1997,CAPSID_4968,1994,100.0,1
1998,CAPSID_5002,1995,100.0,1
1999,CAPSID_5037,1996,100.0,1
2000,CAPSID_5069,1997,100.0,1
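
To see how much redundancy each representative stands for, the cluster sizes can be attached to the selected rows. This is a small sketch on a made-up toy DataFrame that only mimics the column names shown above; the entries and values are illustrative assumptions, not results of ``filter_seq()``:

```python
import pandas as pd

# Toy clustering result with the same columns as above (values are made up)
df_toy = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4", "P5"],
    "cluster": [0, 0, 1, 2, 2],
    "identity_with_rep": [100.0, 93.5, 100.0, 100.0, 91.2],
    "is_representative": [1, 0, 1, 1, 0],
})

# One representative per cluster gives the redundancy-reduced set
df_rep = df_toy[df_toy["is_representative"] == 1]

# Cluster sizes show how many sequences each representative stands for
cluster_sizes = df_toy.groupby("cluster")["entry"].count()
df_rep = df_rep.assign(cluster_size=df_rep["cluster"].map(cluster_sizes))
print(df_rep[["entry", "cluster", "cluster_size"]])
```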


To reduce the number of clusters, lower the ``similarity_threshold`` parameter (default=0.9) so that less similar sequences are grouped together:

In [4]:
# Decrease number of clusters by using lower sequence similarity
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

# Filtering with MMseqs2
df_clust = aa.filter_seq(df_seq=df_seq, method="mmseqs", similarity_threshold=0.5)
n_clust = df_clust["cluster"].nunique()
print(f"Number of MMSeqs2 clusters: {n_clust}")

Number of CD-HIT clusters: 1584
Number of MMSeqs2 clusters: 1603
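
Why a lower threshold yields fewer clusters can be illustrated with a toy version of greedy incremental clustering, the general strategy CD-HIT is known for: sequences are processed from longest to shortest, and each one either joins the first representative it is similar enough to or becomes a new representative. The sketch below is a simplification under stated assumptions. The position-wise ``identity`` function stands in for a real alignment, and the sequences are made up:

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions (simplified stand-in for alignment identity)."""
    n_match = sum(a == b for a, b in zip(seq_a, seq_b))
    return n_match / max(len(seq_a), len(seq_b))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative reaching the threshold."""
    clusters = {}  # representative -> list of cluster members
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: seq becomes a new representative
    return clusters


# Made-up toy sequences of equal length
seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTAYLLKQL", "GSHWWPDNEC", "GSHWWPDNAA"]

n_strict = len(greedy_cluster(seqs, threshold=0.9))
n_relaxed = len(greedy_cluster(seqs, threshold=0.5))
print(f"Clusters at 0.9: {n_strict}, clusters at 0.5: {n_relaxed}")
```

On this toy data, the stricter threshold keeps more sequences apart as their own representatives, while the relaxed threshold merges them into fewer clusters.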


Adjust the length of the subsequences (called 'words' or 'k-mers') used for sequence comparison with the ``word_size`` parameter. By default, it is optimized depending on the similarity threshold:

In [5]:
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, word_size=2)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

Number of CD-HIT clusters: 1584
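
The role of words can be sketched by counting the k-mers two sequences share: tools such as CD-HIT use shared short words as a fast pre-filter before aligning sequences. The helper ``words`` and the two sequences below are illustrative assumptions, not part of ``AAanalysis``:

```python
def words(seq, word_size):
    """Return the set of all overlapping subsequences (words) of length word_size."""
    return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}


# Made-up toy sequences
seq_a = "MKTAYIAKQR"
seq_b = "MKTAYLLKQL"

n_shared = {}
for word_size in (2, 5):
    shared = words(seq_a, word_size) & words(seq_b, word_size)
    n_shared[word_size] = len(shared)
    print(f"word_size={word_size}: {len(shared)} shared words {sorted(shared)}")
```

Shorter words are shared more often, which makes the pre-filter more permissive; this is why small word sizes are typically paired with low similarity thresholds.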


For ``cd-hit``, you can change to local (less strict) sequence identity by setting ``global_identity=False``:

In [6]:
# Clustering with local sequence identity
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, global_identity=False)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters: {n_clust}")

Number of CD-HIT clusters: 1627
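
The difference between the two identity definitions can be sketched with simple arithmetic. Following the definitions given in the CD-HIT user guide (an assumption about what ``global_identity`` maps to), global identity divides the number of identical residues by the full length of the shorter sequence, whereas local identity divides it by the length of the alignment. The numbers below are made up:

```python
def global_identity(n_identical, len_short):
    """Identical residues relative to the full length of the shorter sequence."""
    return n_identical / len_short


def local_identity(n_identical, len_alignment):
    """Identical residues relative to the aligned region only."""
    return n_identical / len_alignment


# Hypothetical alignment: 45 identical residues in a 50-residue alignment,
# where the shorter of the two sequences is 100 residues long
id_global = global_identity(n_identical=45, len_short=100)
id_local = local_identity(n_identical=45, len_alignment=50)
print(f"Global identity: {id_global:.2f}, local identity: {id_local:.2f}")
```

In this hypothetical case, the pair would miss a threshold of 0.5 under global identity but pass it under local identity, since the unaligned part of the sequence is ignored.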


The minimum coverage of the longer and the shorter sequence can be adjusted using the ``coverage_long`` and ``coverage_short`` parameters:

In [7]:
# Clustering with the highest sequence coverage
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=1, coverage_short=1)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters (high coverage): {n_clust}")

# Clustering with the lowest sequence coverage
df_clust = aa.filter_seq(df_seq=df_seq, similarity_threshold=0.5, coverage_long=0.1, coverage_short=0.1)
n_clust = df_clust["cluster"].nunique()
print(f"Number of CD-HIT clusters (low coverage): {n_clust}")

Number of CD-HIT clusters (high coverage): 1930
Number of CD-HIT clusters (low coverage): 1596
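
Coverage here can be read as the fraction of a sequence that is spanned by the alignment. The following sketch shows how such a check could work; ``passes_coverage`` is a hypothetical helper with made-up numbers and default values, not the implementation used by CD-HIT or ``AAanalysis``:

```python
def passes_coverage(len_alignment, len_long, len_short,
                    coverage_long=0.0, coverage_short=0.0):
    """Check whether the alignment spans enough of both sequences."""
    cov_long = len_alignment / len_long    # covered fraction of the longer sequence
    cov_short = len_alignment / len_short  # covered fraction of the shorter sequence
    return cov_long >= coverage_long and cov_short >= coverage_short


# Hypothetical pair: 80 aligned residues, sequence lengths 200 and 100
kwargs = dict(len_alignment=80, len_long=200, len_short=100)
ok_low = passes_coverage(**kwargs, coverage_long=0.1, coverage_short=0.1)
ok_high = passes_coverage(**kwargs, coverage_long=1, coverage_short=1)
print(f"Low coverage requirement met: {ok_low}")
print(f"High coverage requirement met: {ok_high}")
```

Stricter coverage requirements reject more sequence pairs, which is consistent with the higher number of clusters reported for the high-coverage run above.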


To sort the sequences by cluster, set ``sort_clusters=True`` (default=``False``):

In [8]:
# Sort sequences by clusters
df_clust = aa.filter_seq(df_seq=df_seq, sort_clusters=True)
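
An already computed result can also be ordered afterwards with pandas. The sketch below uses a made-up toy DataFrame and lists each representative before the other members of its cluster; this ordering is an illustrative assumption and not necessarily identical to what ``sort_clusters=True`` produces:

```python
import pandas as pd

# Toy clustering result in arbitrary order (values are made up)
df_toy = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4", "P5"],
    "cluster": [2, 0, 1, 0, 2],
    "is_representative": [1, 0, 1, 1, 0],
})

# Sort by cluster and list each representative before the other members
df_sorted = df_toy.sort_values(
    by=["cluster", "is_representative"], ascending=[True, False]
).reset_index(drop=True)
print(df_sorted)
```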

Multiprocessing can be enabled using the ``n_jobs`` parameter, which is set to the maximum if ``n_jobs=None``. However, this is only recommended for large datasets, since the multiprocessing overhead can outweigh the speed-up for small ones:

In [9]:
import time

# Run without multiprocessing
time_start = time.time()
df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=1)
time_no_mp = round(time.time() - time_start, 2)
print(f"Time without multiprocessing: {time_no_mp} seconds")

# Run with multiprocessing
time_start = time.time()
df_clust = aa.filter_seq(df_seq=df_seq, n_jobs=3)
time_mp = round(time.time() - time_start, 2)
print(f"Time with multiprocessing: {time_mp} seconds")

Time without multiprocessing: 0.54 seconds
Time with multiprocessing: 0.68 seconds
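
Single timing runs of well under a second are noisy, so repeating the measurement gives a more reliable comparison. The following is a minimal sketch of a hypothetical helper, ``time_call``, based on ``time.perf_counter()``; the built-in ``sum`` serves only as a stand-in workload:

```python
import time


def time_call(func, *args, n_repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over n_repeats calls of func."""
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; with AAanalysis installed, aa.filter_seq could be passed
# instead, together with df_seq=df_seq and the n_jobs value to compare
best = time_call(sum, range(100_000))
print(f"Best of 3 runs: {best:.4f} seconds")
```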


Set ``verbose=True`` to show the messages emitted directly by the algorithms during processing and in case of errors. This output can be very detailed for MMseqs2 and is therefore not demonstrated here.