# Active Learning - Comparando estratégias

Este notebook visa comparar as diversas estratégias de aprendizado ativo encontradas no documento do Burr Settles, disponível em: http://active-learning.net/.

Algumas das estratégias implementadas são:

- Amostra por incerteza
- Amostragem aleatória
- Consulta por comitê
- Aprendizado passivo
- Redução do erro esperado
- Expected Gradient Length

## O Framework

As estruturas do framework seguem o seguinte pipeline:
1. É usuário define quantas instâncias ele deseja através da variável *n_queries* (nota: quanto maior o número de instâncias, maior o custo computacional);

2. É definido um classificador através da função *which_classifier*, sendo os parâmentos:
    - **Classifier:** Define qual o classificador será utilizado no processo (atualmente só existe o KNN);
    
3. É definido o dataset através da função *which_dataset*:
    - **dataset:** Define o dataset a ser utilizado no framework(atualmente só possui o iris_dataset);
    - **n_split:** Define o tamanho das divisões feitas no dataset  (*cross-validation*).
    
4. A função *which_dataset* é responsável por retornar:
    - **X_raw:** Características dos dados do conjunto;
    - **y_raw:** Rótulos dos dados do conjunto;
    - **idx_data:** n listas (n = n_split) com a seguinte estrutura: [[train],[test]], nas listas train tendo os ids dos dados de treino e test os ids dos dados de teste. Assim, idx_data[i][j], tal que i = bag e j = treino(0) ou teste(1);
    
5. Após definir todo o ambiente, uma bateria de funções é executada, sendo essas as estratégias de amostragem do aprendizado ativo junto do dataset e do classificador escolhido.

Cada função de estratégia possui a mesma entrada e saída para padronização do framework, sendo elas:

#### Entrada
- **X_raw:** Características dos dados do conjunto;
- **y_raw:**  Rótulos dos dados do conjunto;
- **idx_data:** n listas (n = n_split) de ids do conjunto;
- **idx_bag:** Qual lista é desejado usar (idx_bag < n_splits);
- **classifier:** Qual classificador será utilizado (definido na função *which_classifier*);
- **init_size:**  Tamanho inicial da amostra (toda estratégia parte de um tamanho mínimo aleatório).

#### Saída:
- **score:** Acurácia do classificador + estratégia naquela bag;
- **time_elapsed:** Tempo de execução;
- **sample_size:** Quantidade de amostras utilizadas para treino daquele modelo;

## Importações

### Bibliotecas

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

from modAL.uncertainty import classifier_uncertainty
from modAL.models import ActiveLearner

from modAL.models import ActiveLearner, Committee   

In [2]:
from timeit import default_timer as timer
# start = timer()
# end = timer()
# total_time = end - start # em segundos

In [3]:
from copy import deepcopy

### Pré-ajuste do conjunto de dados e dos classificadores

#### Conjuntos de dados

In [4]:
from sklearn.datasets import load_iris

In [15]:
def which_dataset(dataset = "iris", n_splits = 5):
    
    # Futuramente essa etapa será ajustada para receber qualquer dataset (ou lista com datasets)
    if (dataset == "iris"):
        data = load_iris()
        X_raw = data['data']
        y_raw = data['target']
        
    # cross validation bags
    data_cv = ShuffleSplit(n_splits= n_splits, test_size=0.3, random_state=0) #n_splits
    
    # extraindo ids do data_cv
    idx_data = []
    for train_index, test_index in data_cv.split(X_raw):
            idx_data.append([train_index, test_index])

    return X_raw, y_raw, idx_data

In [49]:
X_raw, y_raw, idx_data = data[0],data[1],data[2]
# X_raw[idx_data[idx_bag][0][initial_idx]]
#X_raw[data[2][0][0]]
y_raw[idx_data[0][1]]

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1, 1, 1, 2, 0, 2, 0,
       0])

#### Classificadores

In [6]:
from sklearn.neighbors import KNeighborsClassifier

In [7]:
def which_classifier(parameters, classifier = 'k'):
    
    if (classifier == 'k'):
        return KNeighborsClassifier(parameters)

## Estratégias

### Amostra por incerteza

In [8]:
def uncertain_sample(X_raw, y_raw, idx_data, idx_bag, classifier, init_size):
    sample_size = 0
    start = timer()

    # amostragem aleatória inicial
    
    initial_idx = np.random.choice(range(len(idx_data[idx_bag][0])), size=init_size, replace=False)
    X_train, y_train = X_raw[idx_data[idx_bag][0][initial_idx]], y_raw[idx_data[idx_bag][0][initial_idx]]
    sample_size = sample_size + len(initial_idx)

    # iniciando o aprendiz
    learner = ActiveLearner(
        estimator=classifier,
        X_training=X_train, 
        y_training=y_train
    )
    unqueried_score = learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]])
    new_score = unqueried_score

    i = len(idx_data[idx_bag][0])

    while (learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]]) < 0.90) and (i != 0):
        stream_idx = i
    #         print("i = ", i,\
    #               "len X_raw: ", len(X_raw[idx_data[idx_bag][0]]),\
    #               "\tstream_idx: ", stream_idx,\
    #               "\tuncertainty: ", classifier_uncertainty(learner, X_raw[stream_idx].reshape(1, -1)),\
    #               "\tscore: ", learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]]))
        i = i - 1
        if classifier_uncertainty(learner, X_raw[stream_idx].reshape(1, -1)) > 0.3:
            sample_size = sample_size + 1
            learner.teach(X_raw[stream_idx].reshape(1, -1), y_raw[stream_idx].reshape(-1, ))
            new_score = learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]])

        np.delete(idx_data[idx_bag][0], i, axis = 0)

    end = timer()
    time_elapsed = end - start
    
    return [new_score, time_elapsed, sample_size]

### Amostragem aleatória

In [9]:
def random_sampling(X_raw, y_raw, idx_data, idx_bag, classifier, init_size):
    sample_size = 0
    start = timer()    

    #amostra aleatória
    training_indices = np.random.randint(low=0, high=len(idx_data), size=init_size)
    sample_size = sample_size + len(training_indices)
    
    #sub-amostragem de treino e teste
    X_train = X_raw[idx_data[idx_bag][0][training_indices]]
    y_train = y_raw[idx_data[idx_bag][0][training_indices]]

    X_test = np.delete(X_raw, idx_data[idx_bag][1][training_indices], axis=0)
    y_test = np.delete(y_raw, idx_data[idx_bag][1][training_indices], axis=0)

    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    score = classifier.score(X_test,y_test)
    
    end = timer()
    time_elapsed = end - start
    
    return [score, time_elapsed, sample_size]

### Consulta por comitê

In [10]:
def query_by_committee(X_raw, y_raw, idx_data, idx_bag, classifier, init_size):
    start = timer()
    
    # define todos os dados como pool
    learner_list = []
    X_pool = X_raw[idx_data[idx_bag][0]]
    y_pool = y_raw[idx_data[idx_bag][0]]

    # definindo dados para treino
    sample_size = 0
    train_idx = np.random.choice(range(len(idx_data[idx_bag][0])), size=init_size, replace=False)
    sample_size = sample_size + len(train_idx)
    X_train = X_pool[train_idx]
    y_train = y_pool[train_idx]

    # removendo dados extraídos da pool
    X_pool = np.delete(X_pool, train_idx, axis=0)
    y_pool = np.delete(y_pool, train_idx)

    # iniciando o aprendiz
    learner = ActiveLearner(
        estimator=classifier,
        X_training=X_train, y_training=y_train
    )
    learner_list.append(learner)

    # juntando os membros do comitê
    committee = Committee(learner_list=learner_list)

    # estratégia query by committee
    for idx in range(init_size):
        query_idx, query_instance = committee.query(X_pool)
        committee.teach(
            X=X_pool[query_idx].reshape(1, -1),
            y=y_pool[query_idx].reshape(1, )
        )

        # removendo novas amostras do comitê
        X_pool = np.delete(X_pool, query_idx, axis=0)
        y_pool = np.delete(y_pool, query_idx)

    
    score = committee.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]])
    

    end = timer()
    time_elapsed = end - start
    
    return [score, time_elapsed, sample_size]

### Aprendizado passivo

In [11]:
def passive_learning(X_raw, y_raw, idx_data, idx_bag, classifier, init_size):
    sample_size = 0
    start = timer() 
    
    from sklearn.model_selection import train_test_split

    classifier.fit(X_raw[idx_data[idx_bag][0]], y_raw[idx_data[idx_bag][0]])
    sample_size = sample_size + len(X_raw[idx_data[idx_bag][0]])
    y_predict = classifier.predict(X_raw[idx_data[idx_bag][1]])

    from sklearn.metrics import accuracy_score

    score = accuracy_score(y_raw[idx_data[idx_bag][1]],y_predict)
    
    end = timer()
    time_elapsed = end - start
    
    return [score, time_elapsed, sample_size]

## Algoritmo

### O Framework

In [17]:
legend = []
n_queries = 10
n_splits = 5
k = 5

classifier = which_classifier(k)

X_raw, y_raw, idx_data = which_dataset()

#idx_data[loop_cv][train_or_test][index]
#print(" X_raw: \n", iris_x[X_raw], "\n y_raw: \n", iris_y[y_raw])

performance_history_total = []

performance_history = []
for idx_bag, cv_bag in enumerate(idx_data):
    uncertain_score = uncertain_sample(X_raw, y_raw, idx_data, idx_bag, classifier, k)
    performance_history.append(uncertain_score)
performance_history_total.append(performance_history)
legend.append("Uncertainty Sampling")

performance_history = []
for idx_bag, cv_bag in enumerate(idx_data):
    classifier = which_classifier(k)
    random_score = random_sampling(X_raw, y_raw, idx_data, idx_bag, classifier, k)
    performance_history.append(random_score)
performance_history_total.append(performance_history)
legend.append("Random Sampling")

performance_history = []
for idx_bag, cv_bag in enumerate(idx_data):
    classifier = which_classifier(k)
    qbc_score = query_by_committee(X_raw, y_raw, idx_data, idx_bag, classifier, k)
    performance_history.append(qbc_score)
performance_history_total.append(performance_history)
legend.append("Query by committee")

performance_history = []
for idx_bag, cv_bag in enumerate(idx_data):
    classifier = which_classifier(k)
    passive_score = passive_learning(X_raw, y_raw, idx_data, idx_bag, classifier, k)
    performance_history.append(passive_score)
performance_history_total.append(performance_history)
legend.append("Passive learning")

In [13]:
def plot_strategies_acc(performance_history_total, legend, title):
    
    %matplotlib inline
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8.5, 6), dpi=130)

    for idx,pht in enumerate(performance_history_total):
        ax.plot(pht)
        ax.scatter(range(len(pht)), pht, s=13)

    ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(nbins=5, integer=True))
    ax.yaxis.set_major_locator(mpl.ticker.MaxNLocator(nbins=10))
    ax.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(xmax=1))

    ax.set_ylim(bottom=0, top=1)
    ax.grid(True)

    ax.set_title(title + " - Classification accuracy with {n_queries} queries".format(n_queries = n_queries))
    ax.set_xlabel('Query iteration')
    ax.set_ylabel('Classification Accuracy')
    ax.legend(legend, loc='lower right')

    plt.show()

In [60]:
title = "iris"
##plot_strategies_acc(performance_history_total, legend, title)
# [score, time_elapsed, sample_size]

inner_list = []
for i in range(n_splits):
    for idx,pht in enumerate(performance_history_total):
        inner_list.append(pht[i][0])
        

for j in range(len(legend)):
    for i in range(n_splits):
        print(j , " ", i, " ", inner_list[i+j])

0   0   0.8444444444444444
0   1   0.3287671232876712
0   2   0.6
0   3   0.9777777777777777
0   4   0.8888888888888888
1   0   0.3287671232876712
1   1   0.6
1   2   0.9777777777777777
1   3   0.8888888888888888
1   4   0.3356164383561644
2   0   0.6
2   1   0.9777777777777777
2   2   0.8888888888888888
2   3   0.3356164383561644
2   4   0.8888888888888888
3   0   0.9777777777777777
3   1   0.8888888888888888
3   2   0.3356164383561644
3   3   0.8888888888888888
3   4   0.9555555555555556


In [36]:
inner_list

[0.8444444444444444,
 0.3287671232876712,
 0.6,
 0.9777777777777777,
 0.8888888888888888,
 0.3356164383561644,
 0.8888888888888888,
 0.9555555555555556,
 0.9333333333333333,
 0.3333333333333333,
 0.6222222222222222,
 0.9555555555555556,
 0.7777777777777778,
 0.3401360544217687,
 0.6444444444444445,
 0.9333333333333333,
 0.5333333333333333,
 0.3333333333333333,
 0.6444444444444445,
 0.9777777777777777]

In [21]:
df = pd.DataFrame(performance_history_total, index = legend)
display(df)

Unnamed: 0,0,1,2,3,4
Uncertainty Sampling,"[0.8444444444444444, 0.3131388540004991, 14]","[0.8888888888888888, 0.446496502001537, 18]","[0.9333333333333333, 0.2909028459998808, 16]","[0.7777777777777778, 0.45157809299962537, 14]","[0.5333333333333333, 0.5020524119991023, 13]"
Random Sampling,"[0.3287671232876712, 0.010698504998799763, 5]","[0.3356164383561644, 0.014665246000731713, 5]","[0.3333333333333333, 0.013384011001107865, 5]","[0.3401360544217687, 0.010826646001078188, 5]","[0.3333333333333333, 0.010833574000571389, 5]"
Query by committee,"[0.6, 0.035534458000256564, 5]","[0.8888888888888888, 0.035704555000847904, 5]","[0.6222222222222222, 0.03529515700029151, 5]","[0.6444444444444445, 0.03521676499985915, 5]","[0.6444444444444445, 0.035621463001007214, 5]"
Passive learning,"[0.9777777777777777, 0.002517123999496107, 105]","[0.9555555555555556, 0.0026776190006785328, 105]","[0.9555555555555556, 0.004469530998903792, 105]","[0.9333333333333333, 0.004586813000059919, 105]","[0.9777777777777777, 0.002795252999931108, 105]"


In [None]:
#df[['score', 'time_elapsed', 'sample_size']] = df.columns.split(" ",expand=True,)

## Espaço para testes

### Redução do erro esperado

__Returns:__ The indices of the instances from X chosen to be labelled; the instances from X chosen to be labelled.

In [None]:
init_size = k = 3
classifier = which_classifier(k)
X_raw, y_raw, idx_data = which_dataset()
idx_bag = 0

initial_idx = np.random.choice(range(len(idx_data[idx_bag][0])), size=init_size, replace=False)
X_train, y_train = X_raw[idx_data[idx_bag][0][initial_idx]], y_raw[idx_data[idx_bag][0][initial_idx]]

learner = ActiveLearner(
    estimator=classifier,
    X_training=X_train, 
    y_training=y_train
)
unqueried_score = learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]])
new_score = unqueried_score

i = len(idx_data[idx_bag][0])

while (learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]]) < 0.8) and (i != 0):
    stream_idx = i
    i = i - 1
    if classifier_uncertainty(learner, X_raw[stream_idx].reshape(1, -1)) > 0.3:
        learner.teach(X_raw[stream_idx].reshape(1, -1), y_raw[stream_idx].reshape(-1, ))
        new_score = learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]])

    np.delete(idx_data[idx_bag][0], i, axis = 0)
    
from modAL.expected_error import expected_error_reduction
print("Amostra por incerteza sem redução de erro:")
print([new_score])

sample_er = expected_error_reduction(learner, X_raw[idx_data[idx_bag][0]])[0][0]
learner.teach(X_raw[sample_er].reshape(1, -1), y_raw[sample_er].reshape(-1, ))
new_score = learner.score(X_raw[idx_data[idx_bag][1]], y_raw[idx_data[idx_bag][1]])
print("Amostra por incerteza com redução de erro:")
print([new_score])

In [None]:
print(expected_error_reduction(learner, X_raw[idx_data[idx_bag][0]]))

print(X_raw[39],"\t", y_raw[39])

print(X_raw[expected_error_reduction(learner, X_raw[idx_data[idx_bag][0]])[0][0]])

print(i)