Learning from crowds methods implemented in Python. The available methods are:
- Majority Voting: soft, hard, and weighted variants.
- Dawid and Skene: ground truth (GT) inference based on the confusion matrices (CM) of the annotators.
- Raykar et al.: predictive model over GT inference, based on the CM of the annotators.
- Mixture Models: inference of a predictive model and groups of behavior, over the annotations grouped by data or by annotators.
- Global Behavior: based on label-noise solutions; uses a single global confusion matrix to infer a predictive model.
- Without predictive model: as Dawid and Skene, infers only the GT, based on a global confusion matrix.
- Rodrigues et al. (2013): predictive model over GT inference, based on the annotators' reliability.

New methods are added in updates.
For examples of how to use the methods, see the notebooks Tutorials on:
- LabelMe: a real image dataset
- Sentiment: a real text dataset
- Synthetic: a synthetic dataset
- Scalability Comparison: over the synthetic dataset
Read some dataset annotations:
```python
import numpy as np

# -1 is the symbol for a missing annotation
y_obs = np.loadtxt("./data/LabelMe/answers.txt", dtype='int16')
T_weights = np.sum(y_obs != -1, axis=0)  # number of annotations per annotator
print("Remove %d annotators that do not annotate on this set" % (np.sum(T_weights == 0)))
y_obs = y_obs[:, T_weights != 0]  # drop annotators without annotations on this set
print("Shape (n_samples, n_annotators) = ", y_obs.shape)
```
For further details on the representation, see the documentation.

You can estimate the ground truth with some aggregation technique, e.g. Majority Voting (MV):
```python
from codeE.representation import set_representation
from codeE.methods import LabelAgg

# Global representation: per-sample counts of the votes for each class
r_obs = set_representation(y_obs, "global")
print("Global representation shape (n_samples, n_classes) = ", r_obs.shape)

label_A = LabelAgg(scenario="global")
mv_soft = label_A.predict(r_obs, 'softMV')  # soft (probabilistic) majority voting
mv_hard = label_A.predict(r_obs, 'hardMV')  # hard (discrete) majority voting
```
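As a quick sanity check (a sketch, assuming 'softMV' returns a row of class probabilities per sample and 'hardMV' one class index per sample), the two outputs can be compared:
```python
import numpy as np

# Collapse the soft estimate to hard decisions and measure agreement
# with the hard majority vote
mv_soft_labels = np.argmax(mv_soft, axis=1)
print("Soft/hard MV agreement: %.3f" % np.mean(mv_soft_labels == mv_hard))
```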
Read the dataset input patterns:
```python
X_train = ...
```
Define a predictive model over the ground truth:
```python
fz_x = ...
```
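For instance, a minimal sketch of such a base model, assuming a Keras backend; `input_dim` and `n_classes` are placeholders that depend on your dataset:
```python
import tensorflow as tf

# Hypothetical base model f(z|x); replace input_dim and n_classes
# with the values of your dataset
input_dim, n_classes = 512, 8
fz_x = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(n_classes, activation='softmax'),  # class probabilities
])
```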
You can infer a predictive model jointly with the ground truth:
```python
from codeE.representation import set_representation
from codeE.methods import ModelInf_EM as Raykar

# Individual representation: one-hot labels per annotator, with
# N samples, T annotators, and K classes
y_obs_categorical = set_representation(y_obs, 'onehot')
print("Individual representation shape (N, T, K) = ", y_obs_categorical.shape)

R_model = Raykar()
R_model.set_model(fz_x)
R_model.fit(X_train, y_obs_categorical, runs=20)  # runs: number of repetitions of the inference
raykar_fx = R_model.get_basemodel()  # the learned predictive model
raykar_fx.predict(new_X)  # new_X: unseen input patterns
```
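A minimal evaluation sketch, assuming hypothetical held-out arrays `X_test` and `y_test` (not part of the tutorial data) and that the base model returns class probabilities:
```python
import numpy as np

probs = raykar_fx.predict(X_test)   # (n_samples, n_classes) probabilities
y_pred = np.argmax(probs, axis=1)   # hard decisions
print("Test accuracy: %.3f" % np.mean(y_pred == y_test))
```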
You can also infer the predictive model together with groups of behaviors:
```python
from codeE.methods import ModelInf_EM_CMM as CMM

CMM_model = CMM(M=3)  # M: number of behavior groups
CMM_model.set_model(fz_x)
CMM_model.fit(X_train, r_obs, runs=20)  # uses the global representation
cmm_fx = CMM_model.get_basemodel()  # the learned predictive model
cmm_fx.predict(new_X)
```
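For instance (a sketch, assuming both base models return class probabilities over the same hypothetical inputs `new_X`), the two learned models can be compared:
```python
import numpy as np

p_cmm = cmm_fx.predict(new_X)
p_raykar = raykar_fx.predict(new_X)
disagreement = np.mean(np.argmax(p_cmm, axis=1) != np.argmax(p_raykar, axis=1))
print("Fraction of samples where CMM and Raykar disagree: %.3f" % disagreement)
```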
For the other available methods, see the methods documentation.
- The predictive model also supports Logistic Regression from sklearn, but only with one run (`runs=1`) in the configuration of the methods. Example:
```python
from sklearn.linear_model import LogisticRegression as LR
from codeE.methods import ModelInf_EM as Raykar

model_sklearn_A = LR(C=1, multi_class="multinomial")

R_model = Raykar(init_Z="softmv")
args = {'epochs': 1, 'optimizer': "newton-cg", 'lib_model': "sklearn"}
R_model.set_model(model_sklearn_A, **args)
R_model.fit(X_train, y_obs_categorical, runs=1)  # sklearn models support a single run
```
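Since the base model here is a scikit-learn estimator, the fitted model recovered with `get_basemodel()` should follow the standard sklearn API (a sketch; `new_X` is a hypothetical input array):
```python
lr_fitted = R_model.get_basemodel()
probs = lr_fitted.predict_proba(new_X)  # class probabilities
labels = lr_fitted.predict(new_X)       # hard labels
```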
- New methods for learning from crowds without EM (using only backpropagation on neural networks). Define your base predictive model over the ground truth as a Keras model:
```python
fz_x = ...  # a Keras model, e.g. as in the sketch above
```
Rodrigues & Pereira - CrowdLayer (based on Raykar et al.)
```python
from codeE.methods import ModelInf_BP as Rodrigues18

Ro_model = Rodrigues18()
args = {'batch_size': BATCH_SIZE, 'optimizer': OPT}  # BATCH_SIZE and OPT are user-defined
Ro_model.set_model(fz_x, **args)
Ro_model.fit(X_train, y_obs_categorical, runs=10)
learned_fz_x = Ro_model.get_basemodel()
# ... use learned_fz_x
```
Goldberger & Ben-Reuven - NoiseLayer (based on Global Behavior)
```python
from codeE.methods import ModelInf_BP_G as G_Noise

GNoise_model = G_Noise()
args = {'batch_size': BATCH_SIZE, 'optimizer': OPT}  # BATCH_SIZE and OPT are user-defined
GNoise_model.set_model(fz_x, **args)
GNoise_model.fit(X_train, r_obs, runs=10)  # uses the global representation
learned_fz_x = GNoise_model.get_basemodel()
# ... use learned_fz_x
```
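A sketch of the last step for either backpropagation-based method, assuming `new_X` is a hypothetical array of unseen inputs and that the recovered Keras model outputs class probabilities:
```python
import numpy as np

probs = learned_fz_x.predict(new_X)  # (n_samples, n_classes) probabilities
y_pred = np.argmax(probs, axis=1)    # hard class decisions
```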
For examples of how to use these new methods, see the notebooks Tutorials on:
- V2 LabelMe: a real image dataset
- V2 Sentiment: a real text dataset
- Or the methods documentation
Methods pending to be added:
- Prior on label noise without EM
- Guan et al. 2018 (models with label aggregation)
- Kajino et al. 2012 (models with model aggregation)
- Fast estimation, based on hard or discrete labels, for methods other than DS
Copyright (C) 2022 the authors of this repository.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.