To demonstrate the ``dPULearn().fit()`` method, we create a small example dataset containing positive (1) and unlabeled (2) data samples:

In [1]:
import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False

X = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels = np.array([1, 2, 2, 2])

Use ``dPULearn`` with the default Principal Component Analysis (PCA) to obtain a defined number of reliable negative samples (0) by specifying the ``n_unl_to_neg`` parameter:

In [2]:
dpul = aa.dPULearn()
dpul.fit(X=X, labels=labels, n_unl_to_neg=1)
df_pu = dpul.df_pu_
labels = dpul.labels_ # Updated labels
aa.display_df(df_pu)

Unnamed: 0,selection_via,PC1 (100.0%),PC1 (100.0%)_abs_dif
1,,-0.4,0.0
2,,-0.2,0.2
3,,0.4,0.8
4,PC1,0.8,1.2
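
The idea behind PCA-based selection can be illustrated with plain NumPy. The following is a minimal sketch under stated assumptions, not the ``dPULearn`` implementation: it projects the samples onto the first principal component, measures each sample's absolute difference to the mean of the positives, and relabels the most distant unlabeled samples as negatives (0). The helper name ``select_negatives_pc1`` is hypothetical, and the projected values need not match the ``PC1`` column shown above.

```python
import numpy as np


def select_negatives_pc1(X, labels, n_unl_to_neg=1):
    """Relabel the unlabeled samples (2) farthest from the positives (1) on PC1 as negatives (0)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Center the data and take the first right-singular vector as the PC1 direction
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    pc1_scores = X_centered @ vt[0]
    # Absolute difference of every sample to the mean PC1 score of the positives
    abs_dif = np.abs(pc1_scores - pc1_scores[labels == 1].mean())
    # Rank unlabeled samples by that difference (largest first) and relabel the top ones
    idx_unl = np.flatnonzero(labels == 2)
    idx_neg = idx_unl[np.argsort(-abs_dif[idx_unl])[:n_unl_to_neg]]
    new_labels = labels.copy()
    new_labels[idx_neg] = 0
    return new_labels, abs_dif


# Same toy data as in the first cell
X_toy = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels_toy = np.array([1, 2, 2, 2])
new_labels_toy, abs_dif_toy = select_negatives_pc1(X_toy, labels_toy, n_unl_to_neg=1)
print(new_labels_toy)
```

On this toy data the sketch relabels the fourth sample, which is also the one marked ``PC1`` in the table above.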


As a real-world example, you can load our γ-secretase substrate prediction dataset containing substrates (positive samples, 1) and a redundancy-reduced set of single-span type I transmembrane proteins with unknown substrate status (unlabeled samples, 2):

In [3]:
df_seq = aa.load_dataset(name="DOM_GSEC_PU")
labels = df_seq["label"].to_numpy()
n_pos = sum([x == 1 for x in labels])   # Get number of positive samples
aa.display_df(df=df_seq.tail(5), show_shape=True, n_cols=5)

DataFrame shape: (5, 8)


Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop
690,P60852,MAGGSATTWGYPVAL...LSQTWAQKLWESNRQ,2,602,624
691,P20239,MARWQRKASVSSPCG...FICYLYKKRTIRFNH,2,684,703
692,P21754,MELSYRLFICLLLWG...TRRCRTASHPVSASE,2,387,409
693,Q12836,MWLLRCVLLCVSLSL...LAVKKQKSCPDQMCQ,2,506,528
694,Q8TCW7,MEQIWLLLLLTIRVL...PTSLVLNGIRNPVFD,2,374,396


Using the respective features, we can create a feature matrix and obtain 'reliable' non-substrates by dPULearn:

In [4]:
df_feat = aa.load_features(name="DOM_GSEC")
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

# Number of positive (1) and unlabeled (2) samples
print(pd.Series(labels).value_counts())
# Identify as many reliable negatives (0) as there are positive samples
dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos)
df_pu = dpul.df_pu_
new_labels = dpul.labels_

# Number of updated labels containing reliable negatives (0)
print(pd.Series(new_labels).value_counts())

# Show only selected entries
df = df_pu[df_pu["selection_via"].str.contains("PC", na=False)]
aa.display_df(df=df, show_shape=True, n_rows=20, n_cols=5)

2    631
1     63
Name: count, dtype: int64
2    568
1     63
0     63
Name: count, dtype: int64
DataFrame shape: (63, 15)


Unnamed: 0,selection_via,PC1 (56.2%),PC2 (7.4%),PC3 (2.9%),PC4 (2.8%)
81,PC3,0.0336,0.0073,0.0982,-0.0078
82,PC7,0.0334,-0.0411,0.0335,-0.0052
84,PC1,0.021,-0.0478,0.0752,-0.0054
90,PC4,0.039,-0.032,-0.0013,0.1109
95,PC2,0.032,-0.0821,0.0258,-0.0377
109,PC1,0.0261,-0.0585,0.0757,-0.0209
149,PC1,0.0265,-0.038,0.0191,0.0455
158,PC1,0.0235,-0.0607,0.054,0.0009
161,PC1,0.0259,0.0314,0.0449,0.0554
169,PC1,0.0265,-0.0099,0.0125,-0.0167


Since ``dPULearn().fit()`` returns the fitted model, a list comprehension can be used to collect results for various settings of ``n_components``. If given as a float > 0 and < 1, this parameter represents the proportion of total variance (e.g., 0.8 for 80%) to be retained by principal component analysis (PCA):

In [5]:
list_labels = [dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, n_components=i).labels_
               for i in [0.6, 0.7, 0.8, 0.9, 0.95]]
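
To make the meaning of a float ``n_components`` concrete, the sketch below computes how many leading principal components are needed to exceed a given proportion of the total variance. It follows the convention of scikit-learn's ``PCA``; whether ``dPULearn`` resolves the value in exactly this way is an assumption. The helper names and the ratios in ``ratios_demo`` are hypothetical and for illustration only.

```python
import numpy as np


def explained_variance_ratio(X):
    """Proportion of total variance explained by each principal component."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance along each PC
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def n_components_for_fraction(ratios, fraction):
    """Smallest number of leading PCs whose cumulative ratio exceeds `fraction`."""
    cumulative = np.cumsum(ratios)
    n = int(np.searchsorted(cumulative, fraction, side="right")) + 1
    # Guard against rounding error in the last cumulative value
    return min(n, len(cumulative))


# Hypothetical explained-variance ratios, for illustration only
ratios_demo = np.array([0.6, 0.25, 0.1, 0.05])
for fraction in [0.5, 0.8, 0.9]:
    print(fraction, n_components_for_fraction(ratios_demo, fraction))
```

Larger proportions therefore retain more components, which gives the selection procedure more PCs to draw negatives from.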

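As an alternative to ``PCA-based identification`` of negatives, ``distance-based identification`` can be performed using distance metrics including 'euclidean', 'manhattan', or 'cosine' distance. The ``df_pu_`` DataFrame then contains the corresponding distance columns (here ``euclidean_dif`` and ``euclidean_abs_dif``):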

In [6]:
df_pu = dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, metric="euclidean").df_pu_
aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=10, show_shape=True)

DataFrame shape: (694, 3)


Unnamed: 0,selection_via,euclidean_dif,euclidean_abs_dif
84,euclidean,3.4807,3.4807
505,euclidean,3.2327,3.2327
509,euclidean,3.3363,3.3363
526,euclidean,3.3897,3.3897
533,euclidean,3.3639,3.3639
542,euclidean,3.075,3.075
546,euclidean,3.1625,3.1625
548,euclidean,3.1119,3.1119
552,euclidean,3.2886,3.2886
553,euclidean,3.6208,3.6208
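
The distance-based idea can also be sketched with NumPy. This is an illustration under a stated assumption, namely that 'reliable' negatives are the unlabeled samples with the largest average distance to the positives; the exact computation inside ``dPULearn`` may differ. Both helper names are hypothetical.

```python
import numpy as np


def mean_distance_to_positives(X, labels, metric="euclidean"):
    """Average distance of every sample to all positive samples (label 1)."""
    X = np.asarray(X, dtype=float)
    X_pos = X[np.asarray(labels) == 1]
    # Pairwise differences with shape (n_samples, n_positives, n_features)
    diff = X[:, None, :] - X_pos[None, :, :]
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=2))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=2)
    elif metric == "cosine":
        norms = np.linalg.norm(X, axis=1)[:, None] * np.linalg.norm(X_pos, axis=1)[None, :]
        dist = 1.0 - (X @ X_pos.T) / norms
    else:
        raise ValueError(f"Unsupported metric: {metric}")
    return dist.mean(axis=1)


def select_negatives_by_distance(X, labels, n_unl_to_neg=1, metric="euclidean"):
    """Relabel the unlabeled samples (2) farthest from the positives as negatives (0)."""
    labels = np.asarray(labels)
    avg_dist = mean_distance_to_positives(X, labels, metric=metric)
    idx_unl = np.flatnonzero(labels == 2)
    idx_neg = idx_unl[np.argsort(-avg_dist[idx_unl])[:n_unl_to_neg]]
    new_labels = labels.copy()
    new_labels[idx_neg] = 0
    return new_labels


# Toy data: one positive sample followed by three unlabeled samples
X_toy = np.array([[0.2, 0.1], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels_toy = np.array([1, 2, 2, 2])
print(select_negatives_by_distance(X_toy, labels_toy, n_unl_to_neg=1))
```

Note that the choice of metric can change which samples are selected: cosine distance compares directions only, so a sample far away in Euclidean terms is not necessarily the most distant one by cosine.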


Using ``PCA-based identification``, ``df_pu_`` provides the values of all principal components (PCs) used and a label (``selection_via``) indicating the PC on which each negative sample was identified:

In [7]:
df_pu = dpul.fit(X=X, labels=labels, n_unl_to_neg=n_pos, n_components=0.8).df_pu_
aa.display_df(df_pu.sort_values(by="selection_via"), n_rows=n_pos+1, n_cols=4, show_shape=True)

DataFrame shape: (694, 15)


Unnamed: 0,selection_via,PC1 (56.2%),PC2 (7.4%),PC3 (2.9%)
497,PC1,0.0225,-0.0512,0.0134
615,PC1,0.0261,-0.0533,0.0993
406,PC1,0.0254,-0.0308,0.0272
446,PC1,0.0262,-0.0137,0.0545
455,PC1,0.0266,-0.0521,0.0895
468,PC1,0.0256,-0.0688,0.0118
471,PC1,0.025,-0.0055,0.0835
668,PC1,0.0232,-0.0169,0.0765
605,PC1,0.0258,-0.0545,0.0067
505,PC1,0.0231,-0.0484,0.0339
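
To see how the identified negatives are distributed across components, the ``selection_via`` column can be tallied with pandas. The sketch below uses a small hypothetical stand-in for ``dpul.df_pu_`` (the values in ``df_pu_demo`` are made up) and reuses the ``str.contains("PC", na=False)`` filter from the earlier cell.

```python
import pandas as pd

# Hypothetical stand-in for ``dpul.df_pu_``; only the ``selection_via`` column matters here
df_pu_demo = pd.DataFrame({"selection_via": ["PC1", None, "PC1", "PC2", None, "PC3", "PC1"]})

# Keep only rows that were selected via a principal component
mask_selected = df_pu_demo["selection_via"].str.contains("PC", na=False)

# Count identified negatives per principal component
counts_per_pc = df_pu_demo.loc[mask_selected, "selection_via"].value_counts()
print(counts_per_pc)
```

Applied to the real ``df_pu_``, the counts should sum to the value passed as ``n_unl_to_neg``.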
