By default, ``load_dataset()`` returns an overview table of all available datasets. The prefix of each name in the 'Dataset' column ('AA', 'SEQ', or 'DOM') corresponds to the value in the 'Level' column ('Amino acid', 'Sequence', or 'Domain'):

In [1]:
import aaanalysis as aa
df_info = aa.load_dataset()
aa.display_df(df=df_info, show_shape=True, max_height=600)

DataFrame shape: (14, 11)


Unnamed: 0,Level,Dataset,# Sequences,Avg length,# Amino acids,# Positives,# Negatives,Predictor,Description,Reference,Label
1,Amino acid,AA_CASPASE3,233,796.587983,185605,705,184900,PROSPERous,Prediction of c...3 cleavage site,"Song et al., 2018",1 (adjacent to ... cleavage site)
2,Amino acid,AA_FURIN,71,831.028169,59003,163,58840,PROSPERous,Prediction of f...n cleavage site,"Song et al., 2018",1 (adjacent to ... cleavage site)
3,Amino acid,AA_LDR,342,345.754386,118248,35469,82779,IDP-Seq2Seq,Prediction of l...d regions (LDR),"Tang et al., 2020","1 (disordered), 0 (ordered)"
4,Amino acid,AA_MMP2,573,546.205934,312976,2416,310560,PROSPERous,Prediction of M...) cleavage site,"Song et al., 2018",1 (adjacent to ... cleavage site)
5,Amino acid,AA_RNABIND,221,248.873303,55001,6492,48509,GMKSVM-RU,Prediction of R...(RBP60 dataset),"Yang et al., 2021","1 (binding), 0 (non-binding)"
6,Amino acid,AA_SA,233,796.587983,185605,101082,84523,PROSPERous,Prediction of s...PASE3 data set),"Song et al., 2018",1 (exposed/acce...non-accessible)
7,Sequence,SEQ_AMYLO,1414,6.0,8484,511,903,ReRF-Pred,Prediction of a...ognenic regions,Teng et al. 2021,1 (amyloidogeni...-amyloidogenic)
8,Sequence,SEQ_CAPSID,7935,424.030246,3364680,3864,4071,VIRALpro,Prediction of capdsid proteins,"Galiez et al., 2016",1 (capsid prote...capsid protein)
9,Sequence,SEQ_DISULFIDE,2547,241.252454,614470,897,1650,Dipro,Prediction of d...es in sequences,"Cheng et al., 2006",1 (sequence wit...ithout SS bond)
10,Sequence,SEQ_LOCATION,1835,399.126975,732398,1045,790,,Prediction of s...lasma membrane),"Shen et al., 2019",1 (protein in c...asma membrane)
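
The relation between dataset names and levels can be sketched in plain Python. This is only an illustration of the naming convention visible in the overview table; ``PREFIX_TO_LEVEL`` and ``level_from_name`` are hypothetical names and not part of ``aaanalysis``:

```python
# Naming convention seen in the overview table: name prefix -> 'Level' value.
# PREFIX_TO_LEVEL and level_from_name are illustrative, not aaanalysis API.
PREFIX_TO_LEVEL = {"AA": "Amino acid", "SEQ": "Sequence", "DOM": "Domain"}


def level_from_name(name: str) -> str:
    """Return the level implied by the prefix of a dataset name."""
    prefix = name.split("_", 1)[0]
    if prefix not in PREFIX_TO_LEVEL:
        raise ValueError(f"Unknown dataset prefix in {name!r}")
    return PREFIX_TO_LEVEL[prefix]


print(level_from_name("AA_CASPASE3"))  # Amino acid
print(level_from_name("SEQ_CAPSID"))   # Sequence
print(level_from_name("DOM_GSEC_PU"))  # Domain
```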


Load one of the datasets from the overview table by using a name from the 'Dataset' column (e.g., ``name='SEQ_CAPSID'``). The number of proteins per class can be adjusted by the ``n`` parameter:

In [2]:
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2)
aa.display_df(df=df_seq)

Unnamed: 0,entry,sequence,label
1,CAPSID_1,MVTHNVKINKHVTRR...DTPRIPATKLDEENV,0
2,CAPSID_2,MKKRQKKMTLSNFTD...AMLEAVINARHFGEE,0
3,CAPSID_4072,MALTTNDVITEDFVR...AWKAIFPEAAVKVDA,1
4,CAPSID_4073,MGELTDNGVQLAKAQ...TCTNPAAHAKIRDLK,1


Set ``random=True`` to select these proteins randomly:

In [3]:
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2, random=True)
aa.display_df(df=df_seq)

Unnamed: 0,entry,sequence,label
1,CAPSID_1080,MKFTQFGEKFTRYSG...GINIIAEEVLKAYSE,0
2,CAPSID_3847,MLAELLSTFRRRPPE...KRAATGYGEGGRRRG,0
3,CAPSID_6263,MANYQDIAVEFAGDL...QNVVAAARVFRGTGV,1
4,CAPSID_5423,MGALLAVIAEVAEVS...TTPHRSSKTYSKRRH,1
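
The two selection modes shown in the cells above (the first ``n`` entries per class versus a random draw per class) can be mimicked on toy data as follows. This is a minimal sketch of the idea and not the library's implementation; ``sample_per_class``, its ``shuffle`` and ``seed`` arguments, and the toy records are all assumptions made for illustration:

```python
import random


def sample_per_class(records, n, shuffle=False, seed=None):
    """Pick up to n records per label (hypothetical helper).

    With shuffle=False the first n records of each class are taken in order;
    with shuffle=True they are drawn at random without replacement.
    """
    by_label = {}
    for rec in records:
        by_label.setdefault(rec["label"], []).append(rec)
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(n, len(group))
        picked.extend(rng.sample(group, k) if shuffle else group[:k])
    return picked


# Toy data: ten entries with alternating labels 0 and 1
records = [{"entry": f"P{i}", "label": i % 2} for i in range(10)]

first = sample_per_class(records, n=2)
rand = sample_per_class(records, n=2, shuffle=True, seed=0)
print([r["entry"] for r in first])  # ['P0', 'P2', 'P1', 'P3']
print(len(rand))                    # 4
```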


Sequences containing non-canonical amino acids are removed by default. To change this behavior, set ``non_canonical_aa='keep'`` or ``non_canonical_aa='gap'``:

In [4]:
n_unfiltered = len(aa.load_dataset(name='SEQ_DISULFIDE', non_canonical_aa="keep"))
n = len(aa.load_dataset(name='SEQ_DISULFIDE'))
print(f"'SEQ_DISULFIDE' contains {n_unfiltered} proteins and {n} after filtering.")

'SEQ_DISULFIDE' contains 2547 proteins and 2202 after filtering.
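
To make the three options concrete, the sketch below applies them to toy sequences. It is an illustration under stated assumptions, not the library's code: ``handle_non_canonical`` is a hypothetical helper, the 20 standard one-letter codes are taken as the canonical set, and ``'gap'`` is assumed to replace each non-canonical residue with a gap symbol (the symbol ``'-'`` used here is itself an assumption):

```python
# The 20 standard amino acids in one-letter code
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def handle_non_canonical(seqs, mode="remove", gap_symbol="-"):
    """Hypothetical helper illustrating 'remove', 'keep', and 'gap'."""
    if mode == "keep":
        # Leave all sequences untouched
        return list(seqs)
    if mode == "remove":
        # Drop every sequence containing at least one non-canonical residue
        return [s for s in seqs if set(s) <= CANONICAL_AA]
    if mode == "gap":
        # Assumption: keep the sequence, mask non-canonical residues
        return [
            "".join(aa if aa in CANONICAL_AA else gap_symbol for aa in s)
            for s in seqs
        ]
    raise ValueError(f"Unknown mode: {mode!r}")


seqs = ["MKVLA", "MKXLA", "ACUDE"]  # 'X' and 'U' are non-canonical
print(handle_non_canonical(seqs, mode="remove"))  # ['MKVLA']
print(handle_non_canonical(seqs, mode="gap"))     # ['MKVLA', 'MK-LA', 'AC-DE']
```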


Datasets can be filtered for the minimum and maximum sequence length using ``min_len`` and ``max_len``:

In [5]:
n_len_filtered = len(aa.load_dataset(name='SEQ_DISULFIDE', min_len=100, max_len=200))
print(f"'SEQ_DISULFIDE' contains {n_unfiltered} proteins, of which {n_len_filtered} have a length between 100 and 200 residues.")

'SEQ_DISULFIDE' contains 2547 proteins, of which 644 have a length between 100 and 200 residues.
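
The length filter amounts to a simple range check per sequence. The sketch below assumes that both bounds are inclusive, which the output above does not confirm; ``filter_by_length`` is a hypothetical helper operating on plain strings:

```python
def filter_by_length(seqs, min_len=None, max_len=None):
    """Keep sequences with min_len <= len(seq) <= max_len (bounds assumed inclusive)."""
    kept = []
    for seq in seqs:
        if min_len is not None and len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append(seq)
    return kept


# Toy sequences of length 50, 100, 150, 200, and 250
toy_seqs = ["A" * length for length in (50, 100, 150, 200, 250)]
kept = filter_by_length(toy_seqs, min_len=100, max_len=200)
print([len(s) for s in kept])  # [100, 150, 200]
```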


For datasets at the 'Amino acid' level, the size of the amino acid window can be adjusted using the ``aa_window_size`` parameter:

In [6]:
df_aa = aa.load_dataset(name="AA_CASPASE3", n=2, aa_window_size=5)
aa.display_df(df=df_aa)

Unnamed: 0,entry,sequence,label
1,CASPASE3_1_pos126,LRDSM,1
2,CASPASE3_1_pos127,RDSML,1
3,CASPASE3_1_pos2,MSLFD,0
4,CASPASE3_1_pos3,SLFDL,0
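
The window extraction can be pictured as sliding a fixed-size frame along the sequence. The sketch below is inferred from the output above, where ``pos2`` maps to ``MSLFD`` and ``pos3`` to ``SLFDL``, which is consistent with windows centered on a zero-based residue index. It is not the library's implementation: ``residue_windows`` is a hypothetical helper, the toy sequence is made up, and the restriction to odd window sizes is an assumption of this sketch:

```python
def residue_windows(entry, seq, window_size):
    """Return (name, window) pairs for every residue with a full centered window."""
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number in this sketch")
    half = window_size // 2
    windows = []
    # Only centers with `half` residues on both sides yield a full window
    for center in range(half, len(seq) - half):
        name = f"{entry}_pos{center}"
        windows.append((name, seq[center - half:center + half + 1]))
    return windows


wins = residue_windows("TOY_1", "MSLFDLKA", window_size=5)
for name, window in wins:
    print(name, window)
# TOY_1_pos2 MSLFD
# TOY_1_pos3 SLFDL
# TOY_1_pos4 LFDLK
# TOY_1_pos5 FDLKA
```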


For Positive-Unlabeled (PU) learning, datasets are provided that contain only positive samples (label '1') and unlabeled samples (label '2'). These datasets are indicated by a 'PU' suffix in their name in the 'Dataset' column:

In [7]:
df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=3)
aa.display_df(df=df_seq)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
2,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
3,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER
4,P12821,MGAASGRRGPGLLLP...SHGPQFGSEVELRHS,2,1257,1276,GLDLDAQQAR,VGQWLLLFLGIALLVATLGL,SQRLFSIRHR
5,P36896,MAESAGASSFFPLVV...KKTLSQLSVQEDVKI,2,127,149,EHPSMWGPVE,LVGIIAGPVFLLFLIIIIVFLVI,NYHQRVYHNR
6,Q8NER5,MTRALCSALRQALLL...KKTISQLCVKEDCKA,2,114,136,PNAPKLGPME,LAIIITVPVCLLSIAAMLTVWAC,QGRQCSYRKK
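
Downstream code usually needs the positive and unlabeled entries as separate groups. A minimal sketch of that split, using the entry IDs and labels from the table above as plain tuples rather than a DataFrame (``split_pu`` is a hypothetical helper, not part of ``aaanalysis``):

```python
# (entry, label) pairs copied from the table above; 1 = positive, 2 = unlabeled
records = [
    ("P05067", 1), ("P14925", 1), ("P70180", 1),
    ("P12821", 2), ("P36896", 2), ("Q8NER5", 2),
]


def split_pu(records, pos_label=1, unl_label=2):
    """Split (entry, label) pairs into positive and unlabeled entry lists."""
    positives, unlabeled = [], []
    for entry, label in records:
        if label == pos_label:
            positives.append(entry)
        elif label == unl_label:
            unlabeled.append(entry)
        else:
            raise ValueError(f"Unexpected label {label!r} for entry {entry!r}")
    return positives, unlabeled


positives, unlabeled = split_pu(records)
print(positives)  # ['P05067', 'P14925', 'P70180']
print(unlabeled)  # ['P12821', 'P36896', 'Q8NER5']
```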
