# Experiment 1 - Single Sample
In this experiment, we aim to solve the counting task using QuaNet, a deep learning architecture for quantification that predicts class prevalence values. It takes as input: (i) class prevalence values estimated by a classifier; (ii) the posterior probabilities $\Pr(y|x)$ of the positive class (since QuaNet is binary) for each document $x$; and (iii) embedded document representations.
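
As an illustration of input (i), the sketch below derives two simple prevalence estimates from positive-class posteriors: classify and count (CC), the fraction of documents classified as positive, and probabilistic classify and count (PCC), the mean posterior. The posterior values and the function names are hypothetical, and this is not QuaNet's own code, only a minimal picture of what "prevalence estimated by a classifier" can mean.

```python
def classify_and_count(posteriors, threshold=0.5):
    """CC: fraction of documents whose positive-class posterior exceeds the threshold."""
    return sum(p > threshold for p in posteriors) / len(posteriors)


def probabilistic_classify_and_count(posteriors):
    """PCC: mean of the positive-class posteriors."""
    return sum(posteriors) / len(posteriors)


# Hypothetical posteriors Pr(y=1|x) for five documents (illustrative values only).
posteriors = [0.9, 0.8, 0.4, 0.3, 0.1]

cc = classify_and_count(posteriors)                 # two of five documents exceed 0.5
pcc = probabilistic_classify_and_count(posteriors)  # average posterior, about 0.5

print(f'CC estimate:  {cc:.2f}')
print(f'PCC estimate: {pcc:.2f}')
```

The two estimates differ because CC discards the classifier's confidence while PCC keeps it.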

We chose QuaNet because, according to the detailed overview of the [LeQua challenge](https://ceur-ws.org/Vol-3180/paper-146.pdf), it outperforms other methods in the **binary task on raw documents**. However, the methods provided by `QuaPy` are interchangeable black boxes, so one can easily swap them to test which performs best for the given task.

In [1]:
from experiment_1_code import *

We first load the dataset by *Cabrio* and *Villata*, building for each abstract a dictionary that contains the graph structure of its arguments, reconstructed from the *.ann* files provided with the raw documents.

For this experiment, the label of each document is its number of arguments. We then drop every label that occurs in fewer than 4 documents, to limit class imbalance.

In [2]:
dataset = read_brat_dataset('../data/train/neoplasm_train') + read_brat_dataset('../data/dev/neoplasm_dev')

from collections import Counter

# Count how many abstracts carry each label (label = number of arguments).
label_counts = Counter(data['n'] for data in dataset)

# Drop every label that occurs in fewer than `threshold` abstracts.
threshold = 4
labels_to_remove = {label for label, count in label_counts.items() if count < threshold}
dataset = [data for data in dataset if data['n'] not in labels_to_remove]

# Recount on the filtered dataset.
label_counts = Counter(data['n'] for data in dataset)

for label, count in sorted(label_counts.items()):
    print(f'Label {label}: {count} samples')

print(f'\nThere are {len(label_counts)} different labels with at least {threshold} samples -> {list(label_counts.keys())}\n')

Label 0: 4 samples
Label 1: 133 samples
Label 2: 108 samples
Label 3: 72 samples
Label 4: 44 samples
Label 5: 22 samples
Label 6: 11 samples

There are 7 different labels with at least 4 samples -> [1, 2, 4, 3, 5, 6, 0]



In [3]:
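# randrange excludes the upper bound, so idx is always a valid index
# (randint(0, len(dataset)) could return len(dataset) and raise IndexError).
idx = random.randrange(len(dataset))

# Double quotes inside the f-string keep this valid on Python versions before 3.12.
print(f'Abstract {dataset[idx]["filename"]}: {dataset[idx]["text"]}')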
print()
for ann in dataset[idx]['annotations']:
    print(f'\t{ann}')
print()
print(f'N. Args: {dataset[idx]["n"]}')
for arg, data in dataset[idx]['arguments'].items():
    print(f'\t{arg}: {data}')

print()
print(f'C: {dataset[idx]["c"]}')
print()
print(f'S: {dataset[idx]["s"]}')
print()
print(f'R: {dataset[idx]["r"]}')

Abstract 10880550:  In phase II trials, paclitaxel has been shown to have antitumor activity in patients with advanced non-small-cell lung cancer (NSCLC). However, the survival and quality-of-life (QOL) benefits of paclitaxel used as a single agent compared with supportive care alone have not been assessed in a randomized clinical trial. A total of 157 patients with stage IIIB or IV NSCLC who had received no prior chemotherapy were randomly assigned to receive either best supportive care alone (78 patients) or paclitaxel plus supportive care (79 patients). Paclitaxel was administered as a 3-hour intravenous infusion every 3 weeks. Supportive care included palliative radiotherapy and supportive therapy with corticosteroids, antibiotics, analgesics, antiemetics, transfusions, and other symptomatic therapy as required. The primary end point of the study was survival. Time to disease progression, response rate, adverse events, and QOL were secondary end points. Pretreatment characteristics

We now construct the `LabelledCollection` required by `QuaPy`, from which we obtain the *train* and *test* collections. Note that this split is a shortcut and is not correct: the test set should instead be built from the other files provided by *Cabrio* and *Villata*. Finally, we create a `Dataset` object and tokenize it.

In [4]:
labelled_collection = qp.data.LabelledCollection([data['text'] for data in dataset], 
                                                 [data['n'] for data in dataset], 
                                                 classes=list(label_counts.keys()))

In [5]:
train_instances, test_instances, train_labels, test_labels = train_test_split(
    labelled_collection.instances, labelled_collection.labels, test_size=0.1, random_state=42
)

train_collection = qp.data.LabelledCollection(train_instances, train_labels, classes=list(label_counts.keys()))
test_collection = qp.data.LabelledCollection(test_instances, test_labels, classes=list(label_counts.keys()))

unique_labels = list(set(labelled_collection.classes_))
print(unique_labels)

unique_labels = list(set(train_collection.classes_))
print(unique_labels)

unique_labels = list(set(test_collection.labels))
print(unique_labels)

[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6)]
[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6)]
[np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6)]
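
The last line of the output above shows that label 0 does not appear in the test split, because the split is purely random. A stratified split would keep every label on both sides. With scikit-learn this is usually done by passing the labels to the `stratify` argument of `train_test_split`. The standard-library sketch below shows the idea on the label counts reported earlier. The function `stratified_split` is a hypothetical helper, not part of `QuaPy` or of this notebook's code.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, test_idx), sampling the test part label by label."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one test document per label, but never the whole class
        # (a class with a single document stays entirely in the train split).
        n_test = max(1, round(len(indices) * test_size))
        n_test = min(n_test, len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Label counts reported earlier in this notebook.
counts = {0: 4, 1: 133, 2: 108, 3: 72, 4: 44, 5: 22, 6: 11}
labels = [label for label, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)
print(sorted({labels[i] for i in test_idx}))  # every label, including 0, is present
```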


In [6]:
abs_dataset = Dataset(train_collection, test_collection)
qp.data.preprocessing.index(abs_dataset, min_df=5, inplace=True)

# qp.environ['SAMPLE_SIZE'] = 100
qp.environ['SAMPLE_SIZE'] = 1

indexing: 100%|██████████████████████████████████████████████████| 354/354 [00:00<00:00, 7545.02it/s]
indexing: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<?, ?it/s]


At this point, we train a simple CNN for the task. Note that this part will be replaced with the baseline in future experiments.

In [7]:
# train the text classifier:
cnn_module = CNNnet(abs_dataset.vocabulary_size, abs_dataset.training.n_classes)
cnn_classifier = NeuralClassifierTrainer(cnn_module, device='cpu')
cnn_classifier.fit(*abs_dataset.training.Xy)

[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(checkpoint))
[CNNnet] training epoch=32 tr-loss=0.06350 tr-acc=99.19% tr-macroF1=99.35% patience=1/10 val-loss=1.9


training ended by patience exhasted; loading best model parameters in ../checkpoint/classifier_net.dat for epoch 22
performing one training pass over the validation set...
[done]


<quapy.classification.neural.NeuralClassifierTrainer at 0x242f6e3a3f0>

Next, we train `QuaNet`.

In [8]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier = QuaNet(cnn_classifier, device='cpu')
quantifier.fit(abs_dataset.training, fit_classifier=False)



QuaNetModule(
  (lstm): LSTM(107, 64, batch_first=True, dropout=0.5, bidirectional=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (ff_layers): ModuleList(
    (0): Linear(in_features=156, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
  )
  (output): Linear(in_features=512, out_features=7, bias=True)
)


  ptrue = torch.as_tensor([sample_data.prevalence()], dtype=torch.float, device=self.device)
[QuaNet] epoch=1 [it=499/500]	tr-mseloss=0.05892 tr-maeloss=0.12085	val-mseloss=-1.00000 val-maeloss=
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 75.89it/s]
[QuaNet] epoch=2 [it=499/500]	tr-mseloss=0.03342 tr-maeloss=0.06458	val-mseloss=0.03485 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 70.84it/s]
[QuaNet] epoch=3 [it=499/500]	tr-mseloss=0.02845 tr-maeloss=0.05770	val-mseloss=0.02434 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 75.88it/s]
[QuaNet] epoch=4 [it=499/500]	tr-mseloss=0.03267 tr-maeloss=0.05947	val-mseloss=0.04299 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 91.04it/s]
[QuaNet] epoch=5 [it=499/500]	tr-mseloss=0.03398 tr-maeloss=0.06501	val-mseloss=0.02693 val

training ended by patience exhausted; loading best model parameters in ../checkpoint\QuaNet-628254-150520-402110-325807-111020 for epoch 6



  self.quanet.load_state_dict(torch.load(checkpoint))


Finally, we evaluate the accuracy of our model by sampling one document at a time and inferring its class distribution (hence the name *Single Sample*). The `argmax` of the distribution is taken as the model's output, i.e. the predicted number of arguments.

In [9]:
evaluate_accuracy(abs_dataset, quantifier, set_type='train')
evaluate_accuracy(abs_dataset, quantifier, set_type='test')

Correct predictions on train set: 287
Wrong predictions on train set: 67
Accuracy on train set: 0.8107

Correct predictions on test set: 9
Wrong predictions on test set: 31
Accuracy on test set: 0.2250
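
The `evaluate_accuracy` helper comes from `experiment_1_code` and is not shown here. The sketch below is only an assumption about the logic it follows: each document yields one estimated prevalence vector, the `argmax` of that vector is the predicted label, and accuracy is the share of documents where the prediction matches the true label. The prevalence vectors and function names are hypothetical.

```python
def argmax_label(distribution, classes):
    """Return the class whose estimated prevalence is highest."""
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return classes[best]


def single_sample_accuracy(distributions, true_labels, classes):
    """Return (correct, wrong, accuracy) for one prevalence vector per document."""
    correct = sum(
        argmax_label(dist, classes) == label
        for dist, label in zip(distributions, true_labels)
    )
    return correct, len(true_labels) - correct, correct / len(true_labels)


classes = [0, 1, 2, 3, 4, 5, 6]

# Hypothetical prevalence vectors, one per single-document sample.
distributions = [
    [0.05, 0.60, 0.20, 0.05, 0.05, 0.03, 0.02],  # argmax -> label 1
    [0.02, 0.10, 0.55, 0.20, 0.08, 0.03, 0.02],  # argmax -> label 2
    [0.01, 0.30, 0.25, 0.34, 0.05, 0.03, 0.02],  # argmax -> label 3
]
true_labels = [1, 2, 4]

correct, wrong, accuracy = single_sample_accuracy(distributions, true_labels, classes)
print(f'Correct: {correct}, wrong: {wrong}, accuracy: {accuracy:.4f}')
```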



The results on the test set are poor: accuracy drops from 0.8107 on the train set to 0.2250 on the test set. We might:
- Tune the model, since we only used the default configuration. This *must* be tested, as the method might work better with minimal effort.
- Explore quantification techniques other than QuaNet. If another method turns out to be a better fit, we might also rethink how to run the other experiments, such as the third one. Please have a look at the [LeQua](https://ceur-ws.org/Vol-3180/paper-146.pdf) paper: the UniOviedo team should have obtained the best result on the *T2A* task, the one with raw documents and multiple classes.