An overview table of the available datasets is returned by default. The prefix of each name in the 'Dataset' column ('AA', 'SEQ', or 'DOM') corresponds to the value in the 'Level' column ('Amino acid', 'Sequence', or 'Domain'):

In [30]:
import aaanalysis as aa
df_info = aa.load_dataset()
aa.display_df(df=df_info, show_shape=True)

DataFrame shape: (14, 10)


Unnamed: 0,Level,Dataset,# Sequences,# Amino acids,# Positives,# Negatives,Predictor,Description,Reference,Label
1,Amino acid,AA_CASPASE3,233,185605,705,184900,PROSPERous,Prediction o...leavage site,"Song et al., 2018",1 (adjacent ...eavage site)
2,Amino acid,AA_FURIN,71,59003,163,58840,PROSPERous,Prediction o...leavage site,"Song et al., 2018",1 (adjacent ...eavage site)
3,Amino acid,AA_LDR,342,118248,35469,82779,IDP-Seq2Seq,Prediction o...egions (LDR),"Tang et al., 2020",1 (disordere... 0 (ordered)
4,Amino acid,AA_MMP2,573,312976,2416,310560,PROSPERous,Prediction o...leavage site,"Song et al., 2018",1 (adjacent ...eavage site)
5,Amino acid,AA_RNABIND,221,55001,6492,48509,GMKSVM-RU,Prediction o...P60 dataset),"Yang et al., 2021","1 (binding),...non-binding)"
6,Amino acid,AA_SA,233,185605,101082,84523,PROSPERous,Prediction o...E3 data set),"Song et al., 2018",1 (exposed/a...-accessible)
7,Sequence,SEQ_AMYLO,1414,8484,511,903,ReRF-Pred,Prediction o...enic regions,Teng et al. 2021,1 (amyloidog...yloidogenic)
8,Sequence,SEQ_CAPSID,7935,3364680,3864,4071,VIRALpro,Prediction o...sid proteins,"Galiez et al., 2016",1 (capsid pr...sid protein)
9,Sequence,SEQ_DISULFIDE,2547,614470,897,1650,Dipro,Prediction o...in sequences,"Cheng et al., 2006",1 (sequence ...out SS bond)
10,Sequence,SEQ_LOCATION,1835,732398,1045,790,,Prediction o...ma membrane),"Shen et al., 2019",1 (protein i...a membrane)
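
The naming convention in the overview table can be used to derive the level of a dataset from its name alone. The following is a minimal sketch in plain Python; the helper ``level_from_name`` is hypothetical and not part of the library, and the mapping is simply read off the 'Dataset' and 'Level' columns above:

```python
# Mapping read off the overview table: leading tag of the dataset name -> 'Level' value.
TAG_TO_LEVEL = {"AA": "Amino acid", "SEQ": "Sequence", "DOM": "Domain"}


def level_from_name(name: str) -> str:
    """Return the level implied by the leading tag of a dataset name (hypothetical helper)."""
    tag = name.split("_", 1)[0]
    if tag not in TAG_TO_LEVEL:
        raise ValueError(f"unknown dataset tag: {tag!r}")
    return TAG_TO_LEVEL[tag]


print(level_from_name("AA_CASPASE3"))  # Amino acid
print(level_from_name("SEQ_CAPSID"))   # Sequence
print(level_from_name("DOM_GSEC_PU"))  # Domain
```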


Load one of the datasets from the overview table by using a name from the 'Dataset' column (e.g., ``name='SEQ_CAPSID'``). The number of proteins per class can be adjusted by the ``n`` parameter:

In [31]:
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2)
aa.display_df(df=df_seq)

Unnamed: 0,entry,sequence,label
1,CAPSID_1,MVTHNVKINKHV...RIPATKLDEENV,0
2,CAPSID_2,MKKRQKKMTLSN...EAVINARHFGEE,0
3,CAPSID_4072,MALTTNDVITED...AIFPEAAVKVDA,1
4,CAPSID_4073,MGELTDNGVQLA...NPAAHAKIRDLK,1


Sampling can be randomized by setting ``random=True``:

In [32]:
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=2, random=True)
aa.display_df(df=df_seq)

Unnamed: 0,entry,sequence,label
1,CAPSID_2975,MTGAQIFTKLLN...GEEWIKILKEDL,0
2,CAPSID_3308,MENTYRPRRTCL...LSRTGERFRPPA,0
3,CAPSID_5245,MALINPQFPYAG...TFNQPLINTQEG,1
4,CAPSID_5158,MKMASNDAAPST...PMGTGNGRRRVQ,1
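
The two sampling modes shown above (the first ``n`` entries per class versus a random draw of ``n`` entries per class) can be illustrated without the library. This is only a sketch of the idea, not the implementation behind ``load_dataset``; the function ``sample_per_class``, its ``seed`` argument, and the toy identifiers are assumptions made for illustration:

```python
import random
from collections import defaultdict


def sample_per_class(rows, n, use_random=False, seed=None):
    """Pick n rows per label from (entry, label) tuples (hypothetical helper).

    use_random=False keeps the first n rows of each class in input order;
    use_random=True draws n rows per class without replacement.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_label):  # class 0 first, then class 1
        group = by_label[label]
        if len(group) < n:
            raise ValueError(f"class {label} has only {len(group)} rows")
        picked.extend(rng.sample(group, n) if use_random else group[:n])
    return picked


# Toy rows with made-up identifiers (not real dataset entries).
rows = [(f"toy_{i}", i % 2) for i in range(1, 11)]
print(sample_per_class(rows, n=2))
print(sample_per_class(rows, n=2, use_random=True, seed=0))
```

Both calls return two rows per class; only the second depends on the random seed.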


Sequences containing non-canonical amino acids are removed by default. This behavior can be changed by setting ``non_canonical_aa='keep'`` or ``non_canonical_aa='gap'``:

In [33]:
n_unfiltered = len(aa.load_dataset(name='SEQ_DISULFIDE', non_canonical_aa="keep"))
n = len(aa.load_dataset(name='SEQ_DISULFIDE'))
print(f"'SEQ_DISULFIDE' contains {n_unfiltered} proteins and {n} after filtering.")

'SEQ_DISULFIDE' contains 2547 proteins and 2202 after filtering.
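
To make the three options concrete, here is a rough sketch of how such a filter could work on plain strings. It is not the library's code: the function ``handle_non_canonical`` is hypothetical, and the behavior of the ``'gap'`` mode (replacing each non-canonical residue with a gap character, here ``'-'``) is an assumption chosen for illustration:

```python
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids


def handle_non_canonical(seqs, mode="remove", gap="-"):
    """Apply one of three strategies to sequences with non-canonical residues.

    'remove': drop every sequence that contains a non-canonical residue.
    'keep':   return all sequences unchanged.
    'gap':    ASSUMPTION: replace each non-canonical residue with `gap`.
    """
    if mode == "keep":
        return list(seqs)
    if mode == "remove":
        return [s for s in seqs if set(s) <= CANONICAL_AA]
    if mode == "gap":
        return ["".join(aa if aa in CANONICAL_AA else gap for aa in s) for s in seqs]
    raise ValueError(f"unknown mode: {mode!r}")


# Toy sequences; 'X' and 'U' are not among the 20 canonical letters.
seqs = ["MKTAY", "MKXAY", "ACUDE"]
print(handle_non_canonical(seqs, mode="remove"))  # ['MKTAY']
print(handle_non_canonical(seqs, mode="keep"))    # ['MKTAY', 'MKXAY', 'ACUDE']
print(handle_non_canonical(seqs, mode="gap"))     # ['MKTAY', 'MK-AY', 'AC-DE']
```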


Datasets can be filtered for the minimum and maximum sequence length using ``min_len`` and ``max_len``:

In [34]:
n_len_filtered = len(aa.load_dataset(name='SEQ_DISULFIDE', min_len=100, max_len=200))
print(f"'SEQ_DISULFIDE' contains {n_unfiltered} proteins, of which {n_len_filtered} have a length between 100 and 200 residues.")

'SEQ_DISULFIDE' contains 2547 proteins, of which 644 have a length between 100 and 200 residues.
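
The length filter can be pictured as a simple range check on each sequence. The sketch below assumes that both bounds are inclusive, which the text above does not state explicitly; ``filter_by_length`` is a hypothetical helper, not a library function:

```python
def filter_by_length(seqs, min_len=None, max_len=None):
    """Keep sequences with min_len <= len(seq) <= max_len.

    ASSUMPTION: both bounds are inclusive; None disables the respective bound.
    """
    kept = []
    for seq in seqs:
        if min_len is not None and len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append(seq)
    return kept


# Toy sequences of length 99, 100, 200, and 201.
seqs = ["A" * 99, "A" * 100, "A" * 200, "A" * 201]
print([len(s) for s in filter_by_length(seqs, min_len=100, max_len=200)])  # [100, 200]
```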


For 'Amino acid' level datasets, the size of the amino acid window can be adjusted using the ``aa_window_size`` parameter:

In [35]:
df_aa = aa.load_dataset(name="AA_CASPASE3", n=2, aa_window_size=5)
aa.display_df(df=df_aa)

Unnamed: 0,entry,sequence,label
1,CASPASE3_1_pos126,LRDSM,1
2,CASPASE3_1_pos127,RDSML,1
3,CASPASE3_1_pos2,MSLFD,0
4,CASPASE3_1_pos3,SLFDL,0
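
The output above suggests that each row is a window centered on one residue: the window at ``pos2`` is ``MSLFD`` and the one at ``pos3`` is ``SLFDL``, so the sequence starts with ``MSLFDL`` and the position appears to be a 0-based center index. A minimal sketch under that assumption follows; ``centered_windows`` is a hypothetical helper, and requiring an odd window size is a choice made here so that the center is well defined:

```python
def centered_windows(seq, window_size):
    """Return (center_index, window) pairs for every full window in seq.

    ASSUMPTION: positions are 0-based and refer to the central residue,
    as inferred from the 'pos2'/'pos3' entries in the output above.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    half = window_size // 2
    return [
        (center, seq[center - half:center + half + 1])
        for center in range(half, len(seq) - half)
    ]


# First six residues implied by the two negative windows shown above.
seq = "MSLFDL"
for center, window in centered_windows(seq, window_size=5):
    print(f"pos{center}", window)
# pos2 MSLFD
# pos3 SLFDL
```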


For Positive-Unlabeled (PU) learning, datasets are provided that contain only positive (label '1') and unlabeled (label '2') samples. They are marked by a 'PU' suffix in the dataset name:

In [36]:
df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=10)
aa.display_df(df=df_seq)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,P05067,MLPGLALLLLAA...NPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
2,P14925,MAGRARSGLLLL...YSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
3,P70180,MRSLLLFTFSAC...REDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER
4,Q03157,MGPTSPAARGQG...ENPTYRFLEERP,1,585,607,APSGTGVSRE,ALSGLLIMGAGGGSLIVLSLLLL,RKKKPYGTIS
5,Q06481,MAATGTAAAAAT...NPTYKYLEQMQI,1,694,716,LREDFSLSSS,ALIGLLVIAVAIATVIVISLVML,RKRQYGTISH
6,P35613,MAAALFVLLGFA...DKGKNVRQRNSS,1,323,345,IITLRVRSHL,AALWPFLGIVAEVLVLVTIIFIY,EKRRKPEDVL
7,P35070,MDRAARCSGASS...PINEDIEETNIA,1,119,141,LFYLRGDRGQ,ILVICLIAVMVVFIILVIGVCTC,CHPLRKRRKR
8,P09803,MGARCRSFSALL...KLADMYGGGEDD,1,711,733,GIVAAGLQVP,AILGILGGILALLILILLLLLFL,RRRTVVKEPL
9,P19022,MCRIAGALRTLL...KKLADMYGGGDD,1,724,746,RIVGAGLGTG,AIIAILLCIIILLILVLMFVVWM,KRRDKERQAK
10,P16070,MDKFWWHAAWGL...RNLQNVDMKIGV,1,650,672,GPIRTPQIPE,WLIILASLLALALILAVCIAVNS,RRRCGQKKKL
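
Downstream PU workflows usually start by separating the positive from the unlabeled entries. The sketch below shows one way to do this with the label convention described above (1 for positive, 2 for unlabeled); ``split_pu`` and the toy identifiers are hypothetical and only serve as an illustration:

```python
def split_pu(rows):
    """Split (entry, label) rows into positives (label 1) and unlabeled (label 2).

    Raises ValueError if any other label is present, since a PU dataset
    should contain no explicit negatives.
    """
    unexpected = {label for _, label in rows} - {1, 2}
    if unexpected:
        raise ValueError(f"not a PU dataset, found labels: {sorted(unexpected)}")
    positives = [entry for entry, label in rows if label == 1]
    unlabeled = [entry for entry, label in rows if label == 2]
    return positives, unlabeled


# Toy rows with made-up identifiers (not real dataset entries).
rows = [("toy_a", 1), ("toy_b", 2), ("toy_c", 2), ("toy_d", 1)]
positives, unlabeled = split_pu(rows)
print(positives)  # ['toy_a', 'toy_d']
print(unlabeled)  # ['toy_b', 'toy_c']
```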
