## Estimating Accuracy from Unlabeled Data

Try to implement the Bayesian Error Estimation (BEE) model in following paper. This is based on my limited understanding and I can't guarantee the implementation is bug-free.

[Emmanouil Antonios Platanios, Avinava Dubey, Tom Mitchell ; Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1416-1425, 2016.](http://proceedings.mlr.press/v48/platanios16.html)

In [24]:
import numpy as np

### 0. Generate dummy data

In real-life, the estimations should come from estimators

In [161]:
num_estimators = 4
num_samples = 1000

In [162]:
labeling_matrix = np.random.randint(0, 2, (num_samples, num_estimators))

### 1. Gibbs Sampling

In [289]:
num_iters = 50
num_samples = labeling_matrix.shape[0]
num_estimators = labeling_matrix.shape[1]

In [290]:
true_labels = np.random.randint(0, 2, num_samples)
error_rates = 0.2*np.random.random(num_estimators)
print("initial error rate:", error_rates)

initial error rate: [0.01478959 0.01661954 0.19450443 0.16508176]


In [291]:
# set the hyper-parameters > 1 so that it's convex shape with mean 0.5. 
alpha_p, beta_p, alpha_e, beta_e = 2, 2, 2, 10

In [292]:
def sample_p():
    """ equation 2 + discounting the old label when sampling
    """
    sigma_l = np.sum(true_labels)
    return np.random.beta(alpha_p + sigma_l, beta_p + num_samples - sigma_l)

In [293]:
def sample_l(p, i):
    """ equation 3
    """
    pi = 1
    # the number of correct predictions of each estimator. dim [num_estimators, 1]
    pi = np.zeros(2)  # the pi value for l=0 and l=1
    for k in range(2):    
        num_corrects = labeling_matrix[i,:] == k
        temp = np.power(error_rates, 1 - num_corrects)*np.power(1 - error_rates, num_corrects)
        pi[k] = np.prod(temp)
    prob = pi * np.asarray([1-p, p])
    positive_prob = prob[1]/np.sum(prob)
    return random.binomial(1, positive_prob)

In [294]:
def sample_e(j):
    """ equation 4
    """
    sigma_j = np.sum(labeling_matrix[:, j] == true_labels)
    return np.random.beta(alpha_e + sigma_j, beta_e + num_samples - sigma_j)

In [296]:
for it in range(num_iters):
    for i in range(num_samples):
        p = sample_p()
        true_labels[i] = sample_l(p, i)
    for j in range(num_estimators):
        error_rates[j] = sample_e(j)
    print("Iteration", it, ":")
    print("Estimator accuracy:", 1 - error_rates)
    #print(true_labels)

Iteration 0 :
Estimator accuracy: [0.0991644  0.14431785 0.334768   0.06524638]
Iteration 1 :
Estimator accuracy: [0.9595642  0.9077215  0.66613227 0.98375647]
Iteration 2 :
Estimator accuracy: [0.07955334 0.11209361 0.38814018 0.06319425]
Iteration 3 :
Estimator accuracy: [0.91937292 0.92211103 0.66620295 0.97203312]
Iteration 4 :
Estimator accuracy: [0.09628537 0.1230774  0.37456843 0.11796218]
Iteration 5 :
Estimator accuracy: [0.95378078 0.93439577 0.66027637 0.95436496]
Iteration 6 :
Estimator accuracy: [0.11846627 0.08854344 0.30590127 0.06125564]
Iteration 7 :
Estimator accuracy: [0.95153381 0.93784875 0.66264068 0.97646642]
Iteration 8 :
Estimator accuracy: [0.11553379 0.11559109 0.37038231 0.07592953]
Iteration 9 :
Estimator accuracy: [0.94002122 0.94173215 0.699374   0.95706724]
Iteration 10 :
Estimator accuracy: [0.09899512 0.11967651 0.42670451 0.06972055]
Iteration 11 :
Estimator accuracy: [0.94084482 0.92680299 0.73004305 0.98041103]
Iteration 12 :
Estimator accuracy: [0.

### 2. Generate some real predictions

In [282]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()

In [283]:
from sklearn.linear_model import LogisticRegression

In [284]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.33, random_state=42)
num_samples = X_train.shape[0]

#### Alternative 1: varying the number of samples

In [285]:
train_ratios = [0.02, 0.02, 0.05, 0.1]

predictions = []
for i in range(len(train_ratios)):
    mask = np.random.binomial(1, train_ratios[i], num_samples)
    #print(mask)
    X = X_train[mask==1]
    y = y_train[mask==1]
    print("num_samples:", X.shape)
    model = LogisticRegression()
    model.fit(X, y)
    print(model.score(X_test, y_test))
    predictions.append(model.predict(X_test))

num_samples: (6, 30)
0.8936170212765957
num_samples: (8, 30)
0.925531914893617
num_samples: (22, 30)
0.9308510638297872
num_samples: (39, 30)
0.9308510638297872


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

#### Alternative 2: split the feature space

In [286]:
feature_ids = [0, 2, 15, 17, 30]

predictions = []
for i in range(len(train_ratios)):
    X = X_train[:, feature_ids[i]: feature_ids[i+1]]
    y = y_train
    X_test_temp = X_test[:, feature_ids[i]: feature_ids[i+1]]
    print("num_samples:", X.shape)
    model = LogisticRegression()
    model.fit(X, y)
    print(model.score(X_test_temp, y_test))
    predictions.append(model.predict(X_test_temp))

num_samples: (381, 2)
0.9148936170212766
num_samples: (381, 13)
0.9202127659574468
num_samples: (381, 2)
0.6436170212765957
num_samples: (381, 13)
0.9680851063829787


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [287]:
predictions = np.column_stack(predictions)
predictions.shape

(188, 4)

In [288]:
labeling_matrix = predictions 
# set the variable down. Now go back to section 1