Create a small example dataset for dPULearn containing positive (1) and unlabeled (2) samples as well as the identified negatives (0):

In [1]:
import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
# Three label sets, each marking a different pair of samples as identified negatives (0)
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]

Use the ``dPULearn().eval()`` method to obtain the evaluation for each label set:  

In [2]:
dpul = aa.dPULearn()
df_eval = dpul.eval(X, list_labels=list_labels)
aa.display_df(df_eval)

| | name | n_rel_neg | avg_STD | avg_IQR | avg_abs_AUC_pos | avg_abs_AUC_unl |
|---|---|---|---|---|---|---|
| 1 | Set 1 | 2 | 0.175 | 0.175 | 0.4375 | 0.25 |
| 2 | Set 2 | 2 | 0.1875 | 0.1875 | 0.5 | 0.25 |
| 3 | Set 3 | 2 | 0.0375 | 0.0375 | 0.4375 | 0.5 |


The dataset names shown in the 'name' column can be customized, typically with the name of the identification method (e.g., 'euclidean' for a Euclidean distance-based approach). To do so, set ``names_datasets``:

In [3]:
names_datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets)
aa.display_df(df_eval)

| | name | n_rel_neg | avg_STD | avg_IQR | avg_abs_AUC_pos | avg_abs_AUC_unl |
|---|---|---|---|---|---|---|
| 1 | Dataset 1 | 2 | 0.175 | 0.175 | 0.4375 | 0.25 |
| 2 | Dataset 2 | 2 | 0.1875 | 0.1875 | 0.5 | 0.25 |
| 3 | Dataset 3 | 2 | 0.0375 | 0.0375 | 0.4375 | 0.5 |


The `df_eval` DataFrame provides two categories of quality measures:

1. **Homogeneity Within Negatives**: Measured by 'avg_STD' and 'avg_IQR', which describe the spread of the identified negatives; lower values indicate a more uniform group.
2. **Dissimilarity With Other Groups**: Represented here by 'avg_abs_AUC_pos' and 'avg_abs_AUC_unl', which compare the identified negatives with the positives ('pos', label 1) and the unlabeled samples ('unl', label 2), respectively.
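As a rough cross-check of the homogeneity measures, the sketch below recomputes them with plain NumPy for the five-sample dataset from the first cell. It assumes that 'avg_STD' is the population standard deviation (`ddof=0`) per feature, averaged over features, and that 'avg_IQR' is the per-feature interquartile range averaged in the same way. This is an assumption about the computation, not a description of the library's implementation, and the helper name `homogeneity` is hypothetical.

```python
import numpy as np

# Five-sample dataset and label sets from the first cell
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]


def homogeneity(X, labels, label_neg=0):
    """Hypothetical helper: per-feature STD and IQR of the identified
    negatives, each averaged over all features."""
    X_ident = X[np.asarray(labels) == label_neg]
    # Assumption: population standard deviation (ddof=0)
    avg_std = np.std(X_ident, axis=0).mean()
    # Interquartile range with NumPy's default linear interpolation
    q75, q25 = np.percentile(X_ident, [75, 25], axis=0)
    avg_iqr = (q75 - q25).mean()
    return avg_std, avg_iqr


results = [homogeneity(X, labels) for labels in list_labels]
for name, (avg_std, avg_iqr) in zip(["Set 1", "Set 2", "Set 3"], results):
    print(name, round(avg_std, 4), round(avg_iqr, 4))
```

Under these assumptions, the values agree with the 'avg_STD' and 'avg_IQR' columns reported for the three label sets in this example.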

For a more comprehensive analysis, include `X_neg` as a feature matrix of ground-truth negatives to assess their dissimilarity with the identified negatives:

In [4]:
X_neg = [[0.5, 0.8], [0.4, 0.4]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg)
aa.display_df(df_eval)

| | name | n_rel_neg | avg_STD | avg_IQR | avg_abs_AUC_pos | avg_abs_AUC_unl | avg_abs_AUC_neg |
|---|---|---|---|---|---|---|---|
| 1 | Dataset 1 | 2 | 0.175 | 0.175 | 0.4375 | 0.25 | 0.1875 |
| 2 | Dataset 2 | 2 | 0.1875 | 0.1875 | 0.5 | 0.25 | 0.1875 |
| 3 | Dataset 3 | 2 | 0.0375 | 0.0375 | 0.4375 | 0.5 | 0.5 |
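The AUC-based columns in the table above can be approximated by hand. The sketch below assumes that, for each feature, an AUC is computed between the identified negatives and the comparison group (fraction of pairs in which the identified negative has the larger value, with ties counted as one half), that it is shifted by 0.5 and taken as an absolute value, and that the result is averaged over features. This is an assumed reading of 'avg_abs_AUC', not the library's documented implementation, and `avg_abs_auc` is a hypothetical helper.

```python
import numpy as np

# Five-sample dataset, label sets, and ground-truth negatives from the cells above
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
X_neg = np.array([[0.5, 0.8], [0.4, 0.4]])


def avg_abs_auc(X_a, X_b):
    """Hypothetical helper: mean over features of |AUC - 0.5| between two groups."""
    scores = []
    for a, b in zip(X_a.T, X_b.T):
        # All pairwise differences between the two groups for one feature
        diff = a[:, None] - b[None, :]
        # Pairwise (Mann-Whitney style) AUC, ties counted as 0.5
        auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
        scores.append(abs(auc - 0.5))
    return float(np.mean(scores))


rows = []
for labels in list_labels:
    labels = np.asarray(labels)
    X_ident = X[labels == 0]  # identified negatives
    groups = (X[labels == 1], X[labels == 2], X_neg)  # pos, unl, ground-truth neg
    rows.append(tuple(avg_abs_auc(X_ident, X_ref) for X_ref in groups))

for name, row in zip(["Dataset 1", "Dataset 2", "Dataset 3"], rows):
    print(name, row)
```

Under these assumptions, a value of 0 means the identified negatives cannot be separated from the comparison group on any single feature, while 0.5 means they are perfectly separated on every feature.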


If the variance within the data is high enough, the Kullback-Leibler Divergence (KLD) can be computed to assess the dissimilarity of distributions between the identified negatives and the other groups:

In [5]:
# Extend the unlabeled group by one sample to fulfill variance requirements
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])
list_labels = [[1, 1, 2, 0, 0, 2], [1, 1, 0, 2, 0, 2], [1, 1, 0, 0, 2, 2]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg, comp_kld=True)
aa.display_df(df_eval)

| | name | n_rel_neg | avg_STD | avg_IQR | avg_abs_AUC_pos | avg_KLD_pos | avg_abs_AUC_unl | avg_KLD_unl | avg_abs_AUC_neg | avg_KLD_neg |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Dataset 1 | 2 | 0.175 | 0.175 | 0.4375 | 1.4144 | 0.125 | 0.0031 | 0.1875 | 0.1813 |
| 2 | Dataset 2 | 2 | 0.1875 | 0.1875 | 0.5 | 1.3669 | 0.125 | 0.0033 | 0.1875 | 0.1041 |
| 3 | Dataset 3 | 2 | 0.0375 | 0.0375 | 0.4375 | 1.0168 | 0.5 | 30.3179 | 0.5 | 12.0202 |
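
For reference, the Kullback-Leibler divergence between two continuous distributions with densities $p$ and $q$ is commonly defined as shown below. The symbols $p$ and $q$ are notation introduced here (for example, one group's feature distribution and the other's). How the densities are estimated from the samples, and which group plays the role of $p$, is not stated in this example, so this is the general definition rather than a description of the library's computation.

```latex
$$
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \, \log \frac{p(x)}{q(x)} \, dx
$$
```

Unlike the adjusted AUC, the KLD is asymmetric and has no upper bound, which is consistent with the much larger 'avg_KLD_unl' and 'avg_KLD_neg' values reported for Dataset 3.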
