## Estimating Accuracy from Unlabeled Data

Try to implement the Bayesian Error Estimation (BEE) model in following paper. This is based on my limited understanding and I can't guarantee the implementation is bug-free.

[Emmanouil Antonios Platanios, Avinava Dubey, Tom Mitchell ; Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1416-1425, 2016.](http://proceedings.mlr.press/v48/platanios16.html)

In [24]:
import numpy as np

### 0. Generate dummy data

In real-life, the estimations should come from estimators

In [161]:
num_iters = 50
num_estimators = 4
num_samples = 1000

In [162]:
labeling_matrix = np.random.randint(0, 2, (num_samples, num_estimators))

### 1. Gibbs Sampling

In [271]:
num_samples = labeling_matrix.shape[0]

In [272]:
true_labels = np.random.randint(0, 2, num_samples)
error_rates = 0.2*np.random.random(num_estimators)
print("initial error rate:", error_rates)

initial error rate: [0.11223538 0.09043202 0.10214499 0.07726982]


In [273]:
# set the hyper-parameters > 1 so that it's convex shape with mean 0.5. 
alpha_p, beta_p, alpha_e, beta_e = 2, 2, 2, 10

In [274]:
def sample_p():
    """ equation 2 + discounting the old label when sampling
    """
    sigma_l = np.sum(true_labels)
    return np.random.beta(alpha_p + sigma_l, beta_p + num_samples - sigma_l)

In [275]:
def sample_l(p, i):
    """ equation 3
    """
    pi = 1
    # the number of correct predictions of each estimator. dim [num_estimators, 1]
    pi = np.zeros(2)  # the pi value for l=0 and l=1
    for k in range(2):    
        num_corrects = labeling_matrix[i,:] == k
        temp = np.power(error_rates, 1 - num_corrects)*np.power(1 - error_rates, num_corrects)
        pi[k] = np.prod(temp)
    prob = pi * np.asarray([1-p, p])
    positive_prob = prob[1]/np.sum(prob)
    return random.binomial(1, positive_prob)

In [276]:
def sample_e(j):
    """ equation 4
    """
    sigma_j = np.sum(labeling_matrix[:, j] == true_labels)
    return np.random.beta(alpha_e + sigma_j, beta_e + num_samples - sigma_j)

In [277]:
for it in range(num_iters):
    for i in range(num_samples):
        p = sample_p()
        true_labels[i] = sample_l(p, i)
    for j in range(num_estimators):
        error_rates[j] = sample_e(j)
    print("Iteration", it, ":")
    print("Accuracy", 1 - error_rates)
    #print(true_labels)

Iteration 0 :
Accuracy [0.1113605  0.14529091 0.29715096 0.14500209]
Iteration 1 :
Accuracy [0.96665613 0.93794597 0.68095726 0.97851989]
Iteration 2 :
Accuracy [0.09558476 0.08943538 0.31542269 0.06481639]
Iteration 3 :
Accuracy [0.93316706 0.92556004 0.6528714  0.96728755]
Iteration 4 :
Accuracy [0.09743072 0.08342346 0.4194996  0.10258508]
Iteration 5 :
Accuracy [0.93246761 0.92822207 0.705937   0.95412881]
Iteration 6 :
Accuracy [0.0693895  0.1451995  0.41912669 0.04520672]
Iteration 7 :
Accuracy [0.9629295  0.91559994 0.68213596 0.97005163]
Iteration 8 :
Accuracy [0.0805446  0.10719331 0.39014003 0.08738436]
Iteration 9 :
Accuracy [0.94776536 0.93314799 0.68136139 0.93279774]
Iteration 10 :
Accuracy [0.07772267 0.09775339 0.32426243 0.09794675]
Iteration 11 :
Accuracy [0.94670279 0.94160717 0.65472558 0.97841544]
Iteration 12 :
Accuracy [0.09030817 0.10477313 0.36965285 0.04090852]
Iteration 13 :
Accuracy [0.95820329 0.93890483 0.64415449 0.97212839]
Iteration 14 :
Accuracy [0.049

### 2. Generate some real predictions

In [247]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()

In [248]:
from sklearn.linear_model import LogisticRegression

In [249]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.33, random_state=42)
num_samples = X_train.shape[0]

#### Alternative 1: varying the number of samples

In [252]:
train_ratios = [0.02, 0.02, 0.05, 0.1]

predictions = []
for i in range(len(train_ratios)):
    mask = np.random.binomial(1, train_ratios[i], num_samples)
    #print(mask)
    X = X_train[mask==1]
    y = y_train[mask==1]
    print("num_samples:", X.shape)
    model = LogisticRegression()
    model.fit(X, y)
    print(model.score(X_test, y_test))
    predictions.append(model.predict(X_test))

num_samples: (6, 30)
0.9414893617021277
num_samples: (9, 30)
0.9361702127659575
num_samples: (22, 30)
0.9148936170212766
num_samples: (35, 30)
0.9574468085106383


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

#### Alternative 2: split the feature space

In [268]:
feature_ids = [0, 2, 15, 17, 30]

predictions = []
for i in range(len(train_ratios)):
    X = X_train[:, feature_ids[i]: feature_ids[i+1]]
    y = y_train
    X_test_temp = X_test[:, feature_ids[i]: feature_ids[i+1]]
    print("num_samples:", X.shape)
    model = LogisticRegression()
    model.fit(X, y)
    print(model.score(X_test_temp, y_test))
    predictions.append(model.predict(X_test_temp))

num_samples: (381, 2)
0.9148936170212766
num_samples: (381, 13)
0.9202127659574468
num_samples: (381, 2)
0.6436170212765957
num_samples: (381, 13)
0.9680851063829787


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [269]:
predictions = np.column_stack(predictions)
predictions.shape

(188, 4)

In [270]:
labeling_matrix = predictions 
# set the variable down. Now go back to section 1