# Experiment 4 - Relations
Now that our model can detect components, we turn to the relations between them. As before, we split each abstract into sentences; we then form every possible pair of sentences and label each pair as either *Not related* or *Related*, corresponding to 0 and 1 respectively.

During inference, we will use the same sampling technique, which processes together all the sentences that form an abstract. This time, our goal is to estimate the percentage of relations between argument components that each abstract contains, and to check whether this information, together with the percentage of components, could help build a working model that predicts the number of arguments.
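
As a minimal sketch of the quantity we are after, the snippet below computes the fraction of `Related` pairs per abstract from a list of dictionaries carrying `filename` and `label` keys, the same keys used by the dataset built later in this notebook. The helper name and the toy data are assumptions made up for illustration.

```python
from collections import defaultdict


def relation_prevalence_by_file(pairs):
    """Return {filename: fraction of sentence pairs labelled 1}."""
    totals = defaultdict(int)
    related = defaultdict(int)
    for pair in pairs:
        totals[pair["filename"]] += 1
        related[pair["filename"]] += pair["label"]
    return {name: related[name] / totals[name] for name in totals}


# Toy data (made up): five pairs from one abstract, two from another.
toy_pairs = [
    {"filename": "doc_a", "label": 0},
    {"filename": "doc_a", "label": 1},
    {"filename": "doc_a", "label": 0},
    {"filename": "doc_a", "label": 0},
    {"filename": "doc_a", "label": 0},
    {"filename": "doc_b", "label": 1},
    {"filename": "doc_b", "label": 0},
]

prevalences = relation_prevalence_by_file(toy_pairs)
print(prevalences)
```

These per-abstract fractions are what the quantifier is asked to estimate without access to the true labels.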

In [1]:
from experiment_4_code import *

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Antonio\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### 1. Preprocessing
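Here we preprocess our data by splitting the abstracts into sentences. Each combination of two sentences is then labeled `1` (`Relationship`) if the components they contain are linked by a relation, and `0` (`Not relationship`) otherwise. Pairs in which neither sentence is an argument component are discarded. With `reduce=True`, a negative pair of two components is kept only while the running count of such negatives in the file is at most four above the running count of positives.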
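
The pairing step can be illustrated on a toy abstract. This is only a sketch under assumptions: the sentences, component IDs and the `label_pair` helper are hypothetical, and the relation lookup is treated as undirected, as in the preprocessing code of this notebook.

```python
from itertools import combinations

# Hypothetical toy input: each sentence carries the ID of the component it
# holds, or None when the sentence is not an argument component.
sentences = [
    {"sentence": "Drug A lowered pressure.", "arg_id": "T1"},
    {"sentence": "Patients were enrolled in 2010.", "arg_id": None},
    {"sentence": "Drug A is therefore effective.", "arg_id": "T2"},
]
# Directed relations stored as (source, target) -> type, as read from a BRAT `R` line.
relations = {("T1", "T2"): "Support"}


def label_pair(first, second, relations):
    """Return 1 if the two components are linked in either direction, else 0."""
    a, b = first["arg_id"], second["arg_id"]
    if a is None or b is None:
        return 0
    return int((a, b) in relations or (b, a) in relations)


labelled_pairs = [
    (s1["arg_id"], s2["arg_id"], label_pair(s1, s2, relations))
    for s1, s2 in combinations(sentences, 2)
]
print(labelled_pairs)
```

An abstract with n sentences yields n(n-1)/2 candidate pairs, which is why negatives dominate the label counts reported below.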

In [104]:
def split_by_boundaries(text, components_boundaries, components_types):
    """
    Split the text into parts based on the component boundaries and label them accordingly.
    
    This returns a list of tuples: (text_part, label, arg_id), where:
        - label is 1 if the part is an annotated Claim, MajorClaim or Premise, and 0 otherwise
        - arg_id is the component ID (e.g., T1, T2) for annotated parts, and None for the text in between
    """
    labeled_parts = []
    last_pos = 0

    # Iterate over sorted boundaries and extract the corresponding text parts
    sorted_boundaries = sorted([(start, end, components_types[key], key) 
                                for key, bounds in components_boundaries.items() 
                                for start, end in bounds])

    for start, end, label_type, arg_id in sorted_boundaries:
        # Add the non-labeled part between the previous boundary and the current one
        if last_pos < start:
            non_labeled_part = text[last_pos:start]
            if non_labeled_part.strip():  # Avoid empty parts
                labeled_parts.append((non_labeled_part, 0, None))

        # Add the labeled part (Claim or Premise) with its component ID (arg_id)
        labeled_parts.append((text[start:end], 1 if label_type in ['Claim', 'MajorClaim', 'Premise'] else 0, arg_id))
        
        last_pos = end

    # Add the remaining part of the text (after the last component)
    if last_pos < len(text):
        remaining_part = text[last_pos:]
        if remaining_part.strip():
            labeled_parts.append((remaining_part, 0, None))

    return labeled_parts

def label_sentences(text, reduce, components_boundaries, components_types):
    """
    First split the text according to the component boundaries, label premises and claims,
    and then split the remaining parts into sentences and label them as None.
    """
    labeled_sentences = []
    
    # Split text based on boundaries and label Claims/Premises
    labeled_parts = split_by_boundaries(text, components_boundaries, components_types)
    
    # Now process each part
    for part, label, arg_id in labeled_parts:
        if label == 0:
            # For the non-labeled parts, split into sentences
            sentences = sent_tokenize(part)
            for sentence in sentences:
                labeled_sentences.append({'sentence': sentence, 'label': 0, 'arg_id': None})
        else:
            # For labeled parts (Claim or Premise), treat the entire part as one sentence
            labeled_sentences.append({'sentence': part, 'label': label, 'arg_id': arg_id})

    return labeled_sentences

def read_brat_dataset(folder, reduce=False):
    dataset = []
    
    for root, _, files in os.walk(folder):
        for file in files:
            positives,negatives = 0,0
            if file.endswith('.ann'):
                ann_path = os.path.join(root, file)
                txt_path = ann_path.replace('.ann', '.txt')
                
                if os.path.exists(txt_path):
                    with open(ann_path, 'r', encoding='utf-8') as ann_f, open(txt_path, 'r', encoding='utf-8') as txt_f:
                        annotations = [line.strip().split('\t') for line in ann_f]
                        text = txt_f.read()

                        # Extract component boundaries and types
                        components_boundaries = {
                            ann[0]: [(int(ann[1].split(' ')[1]), int(ann[1].split(' ')[2]))] 
                            for ann in annotations if ann[0].startswith('T')
                        }
                        components_types = {
                            ann[0]: ann[1].split(' ')[0] 
                            for ann in annotations if ann[0].startswith('T')
                        }

                        # Extract relations between components (using Arg1 and Arg2)
                        relations = {}
                        for ann in annotations:
                            if ann[0].startswith('R'):
                                relation_type, arg1, arg2 = ann[1].split(' ')
                                arg1_id = arg1.split(':')[1]
                                arg2_id = arg2.split(':')[1]
                                relations[(arg1_id, arg2_id)] = relation_type
                                
                        # Label sentences based on boundaries first, then split the remaining text into sentences
                        labeled_sentences = label_sentences(text, reduce, components_boundaries, components_types)
                        
                        # Store sentence combinations and assign relation labels
                        sentence_combinations = list(combinations(labeled_sentences, 2))
                        for sentence1, sentence2 in sentence_combinations:
                            label = 0  # Default label is 0 (no relation)
                            rel_components = None
                            if sentence1['arg_id'] and sentence2['arg_id']:
                                # Check if there is a relation between the component IDs of the two sentences
                                if (sentence1['arg_id'], sentence2['arg_id']) in relations:
                                    label = 1  # Label 1 if relation exists between the components
                                    rel_components = (sentence1['arg_id'], sentence2['arg_id'])
                                elif (sentence2['arg_id'], sentence1['arg_id']) in relations:
                                    label = 1  # Label 1 if relation exists between the components
                                    rel_components = (sentence2['arg_id'], sentence1['arg_id'])

                                # Component-component pairs: always keep positives; with
                                # `reduce`, keep a negative only while negatives <= positives + 4.
                                if label:
                                    positives += 1
                                    keep = True
                                else:
                                    negatives += 1
                                    keep = not reduce or negatives <= positives + 4
                            else:
                                # Keep pairs with exactly one component (label 0); pairs with
                                # no component at all are discarded.
                                keep = bool(sentence1['arg_id'] or sentence2['arg_id'])

                            if keep:
                                dataset.append({
                                    'text': text,
                                    'filename': file.split('.')[0],
                                    'components': rel_components,
                                    'sentence_pair': sentence1['sentence'] + sentence2['sentence'],
                                    'label': label
                                })
                                    
    return dataset

In [105]:
train_set = read_brat_dataset('../data/train/neoplasm_train') + read_brat_dataset('../data/dev/neoplasm_dev')
test_set = read_brat_dataset('../data/test/glaucoma_test') + read_brat_dataset('../data/test/neoplasm_test') + read_brat_dataset('../data/test/mixed_test')

In [106]:
label_counts_train, avg_sentences_per_file_train = compute_dataset_statistics(train_set, dataset_name="train")
label_counts_test, avg_sentences_per_file_test = compute_dataset_statistics(test_set, dataset_name="test")

- Train set:
	Label 0: 24969 samples
	Label 1: 1636 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences pairs per file in train set: 67
	Max sentence pair length: 162
	Average relationships per file: 4.16
	Average no relationships per file: 62.42

- Test set:
	Label 0: 17528 samples
	Label 1: 1120 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences pairs per file in test set: 69
	Max sentence pair length: 140
	Average relationships per file: 4.23
	Average no relationships per file: 65.16



In [107]:
reduce = True
train_set_reduced = read_brat_dataset('../data/train/neoplasm_train', reduce) + read_brat_dataset('../data/dev/neoplasm_dev', reduce)
test_set_reduced = read_brat_dataset('../data/test/glaucoma_test', reduce) + read_brat_dataset('../data/test/neoplasm_test', reduce) + read_brat_dataset('../data/test/mixed_test', reduce)

In [108]:
label_counts_train_reduced, avg_sentences_per_file_train_reduced = compute_dataset_statistics(train_set_reduced, dataset_name="train")
label_counts_test_reduced, avg_sentences_per_file_test_reduced = compute_dataset_statistics(test_set_reduced, dataset_name="test")

- Train set:
	Label 0: 20184 samples
	Label 1: 1636 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences pairs per file in train set: 55
	Max sentence pair length: 162
	Average relationships per file: 4.16
	Average no relationships per file: 50.46

- Test set:
	Label 0: 14484 samples
	Label 1: 1120 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences pairs per file in test set: 58
	Max sentence pair length: 140
	Average relationships per file: 4.23
	Average no relationships per file: 53.84



Below is an example of what an entry of our dataset looks like.

In [86]:
# display_file_info(train_set, filename='16416368') 
# display_file_info(train_set, filename=None)

Next, we create the dataset `FilenameLabelledCollection`, which inherits from `QuaPy`'s `LabelledCollection` class and additionally stores, for each sentence pair, the filename of the abstract it belongs to. This lets us retrieve all the pairs of a given abstract at once. The `index` function is also modified to return two `FilenameLabelledCollection` instances (training and test).
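
To make the idea concrete without relying on `QuaPy`, here is a deliberately simplified stand-in. `TinyFilenameCollection` and its methods are hypothetical names and do not reproduce the interface of `LabelledCollection` or of our `FilenameLabelledCollection`; the sketch only shows why keeping a parallel list of filenames is enough to recover whole abstracts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TinyFilenameCollection:
    """Hypothetical stand-in (not QuaPy's API): parallel lists of instances, labels, filenames."""
    instances: List[str]
    labels: List[int]
    filenames: List[str]

    def __post_init__(self):
        if not (len(self.instances) == len(self.labels) == len(self.filenames)):
            raise ValueError("instances, labels and filenames must have equal length")

    def select_file(self, filename):
        """Return a new collection with only the items that belong to `filename`."""
        keep = [i for i, name in enumerate(self.filenames) if name == filename]
        return TinyFilenameCollection(
            [self.instances[i] for i in keep],
            [self.labels[i] for i in keep],
            [self.filenames[i] for i in keep],
        )

    def prevalence(self, n_classes=2):
        """Fraction of items per class, in class order."""
        return [self.labels.count(c) / len(self.labels) for c in range(n_classes)]


collection = TinyFilenameCollection(
    ["pair 1", "pair 2", "pair 3", "pair 4"],
    [0, 1, 0, 0],
    ["doc_a", "doc_a", "doc_b", "doc_b"],
)
doc_a = collection.select_file("doc_a")
print(doc_a.prevalence())
```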

In [87]:
train_collection = FilenameLabelledCollection([data['sentence_pair'] for data in train_set], 
                                                 [data['label'] for data in train_set], 
                                                 [data['filename'] for data in train_set], 
                                                 classes=list(label_counts_train.keys()))

test_collection = FilenameLabelledCollection([data['sentence_pair'] for data in test_set], 
                                                 [data['label'] for data in test_set], 
                                                 [data['filename'] for data in test_set], 
                                                 classes=list(label_counts_test.keys()))

train_collection_reduced = FilenameLabelledCollection([data['sentence_pair'] for data in train_set_reduced], 
                                                 [data['label'] for data in train_set_reduced], 
                                                 [data['filename'] for data in train_set_reduced], 
                                                 classes=list(label_counts_train_reduced.keys()))

test_collection_reduced = FilenameLabelledCollection([data['sentence_pair'] for data in test_set_reduced], 
                                                 [data['label'] for data in test_set_reduced], 
                                                 [data['filename'] for data in test_set_reduced], 
                                                 classes=list(label_counts_test_reduced.keys()))

In [88]:
def index(dataset: Dataset, indexer, inplace=False, fit=True, **kwargs):
    """
    Indexes the tokens of a textual :class:`quapy.data.base.Dataset` of string documents.
    To index a document means to replace each different token by a unique numerical index.
    Rare words (i.e., words occurring less than the indexer's `min_df` times) are replaced by a special token `UNK`

    :param dataset: a :class:`quapy.data.base.Dataset` object where the instances of training and test documents
        are lists of str
    :param indexer: the indexer (e.g., `qp.data.preprocessing.IndexTransformer`) that maps tokens to indices
    :param inplace: whether or not to apply the transformation inplace (True), or to a new copy (False, default)
    :param fit: whether to fit the indexer on the training instances (True, default) or to reuse an
        already fitted indexer (False)
    :param kwargs: currently unused
    :return: a new :class:`quapy.data.base.Dataset` (if inplace=False) or a reference to the current
        :class:`quapy.data.base.Dataset` (inplace=True) consisting of lists of integer values representing indices.
    """
    qp.data.preprocessing.__check_type(dataset.training.instances, np.ndarray, str)
    qp.data.preprocessing.__check_type(dataset.test.instances, np.ndarray, str)

    training_index = indexer.fit_transform(dataset.training.instances) if fit else indexer.transform(dataset.training.instances) 
    test_index = indexer.transform(dataset.test.instances)

    training_index = np.asarray(training_index, dtype=object)
    test_index = np.asarray(test_index, dtype=object)

    if inplace:
        dataset.training = FilenameLabelledCollection(training_index, dataset.training.labels, dataset.training.filenames, dataset.classes_)
        dataset.test = FilenameLabelledCollection(test_index, dataset.test.labels, dataset.test.filenames, dataset.classes_)
        dataset.vocabulary = indexer.vocabulary_
        return dataset
    else:
        training = FilenameLabelledCollection(training_index, dataset.training.labels.copy(), dataset.training.filenames, dataset.classes_)
        test = FilenameLabelledCollection(test_index, dataset.test.labels.copy(), dataset.test.filenames, dataset.classes_)
        return Dataset(training, test, indexer.vocabulary_)
        
indexer = qp.data.preprocessing.IndexTransformer(min_df=1)

abs_dataset = Dataset(train_collection, test_collection)
index(abs_dataset, indexer, inplace=True)

abs_dataset_reduced = Dataset(train_collection_reduced, test_collection_reduced)
index(abs_dataset_reduced, indexer, fit=False, inplace=True)

qp.environ['SAMPLE_SIZE'] = avg_sentences_per_file_train

indexing: 100%|███████████████████████████████████████████████| 8158/8158 [00:00<00:00, 47694.50it/s]
indexing: 100%|███████████████████████████████████████████████| 5522/5522 [00:00<00:00, 46554.57it/s]
indexing: 100%|███████████████████████████████████████████████| 3373/3373 [00:00<00:00, 44832.35it/s]
indexing: 100%|███████████████████████████████████████████████| 2478/2478 [00:00<00:00, 43837.54it/s]


### 2. Classifier
`QuaNet` requires a classifier that can provide embedded representations of the inputs. In the original paper, `QuaNet` was tested using an `LSTM` as the base classifier; as `QuaPy`'s authors show in their [example](https://hlt-isti.github.io/QuaPy/manuals/methods.html#the-quanet-neural-network), we will use an instantiation of `QuaNet` that employs a `CNN` as a probabilistic classifier, taking its last layer representation as the document embedding.

In [89]:
from sklearn.metrics import accuracy_score, f1_score

class WeightedNeuralClassifierTrainer(NeuralClassifierTrainer):

    def __init__(self,
                 net: 'TextClassifierNet',
                 lr=1e-3,
                 weight_decay=0,
                 patience=10,
                 epochs=200,
                 batch_size=64,
                 batch_size_test=512,
                 padding_length=300,
                 device='cuda',
                 checkpointpath='../checkpoint/classifier_net.dat',
                 criterion=torch.nn.CrossEntropyLoss()):

        super().__init__(
                 net,
                 lr=lr,
                 weight_decay=weight_decay,
                 patience=patience,
                 epochs=epochs,
                 batch_size=batch_size,
                 batch_size_test=batch_size_test,
                 padding_length=padding_length,
                 device=device,
                 checkpointpath=checkpointpath)
        
        self.criterion = criterion
    
    def _train_epoch(self, data, status, pbar, epoch):
        self.net.train()
        losses, predictions, true_labels = [], [], []
        for xi, yi in data:
            self.optim.zero_grad()
            logits = self.net.forward(xi)
            loss = self.criterion(logits, yi)
            loss.backward()
            self.optim.step()
            losses.append(loss.item())
            preds = torch.softmax(logits, dim=-1).detach().cpu().numpy().argmax(axis=-1)

            status["loss"] = np.mean(losses)
            predictions.extend(preds.tolist())
            true_labels.extend(yi.detach().cpu().numpy().tolist())
            status["acc"] = accuracy_score(true_labels, predictions)
            status["f1"] = f1_score(true_labels, predictions, average='macro')
            self.__update_progress_bar(pbar, epoch)

    def _test_epoch(self, data, status, pbar, epoch):
        self.net.eval()
        losses, predictions, true_labels = [], [], []
        with torch.no_grad():
            for xi, yi in data:
                logits = self.net.forward(xi)
                loss = self.criterion(logits, yi)
                losses.append(loss.item())
                preds = torch.softmax(logits, dim=-1).detach().cpu().numpy().argmax(axis=-1)
                predictions.extend(preds.tolist())
                true_labels.extend(yi.detach().cpu().numpy().tolist())

            status["loss"] = np.mean(losses)
            status["acc"] = accuracy_score(true_labels, predictions)
            status["f1"] = f1_score(true_labels, predictions, average='macro')
            self.__update_progress_bar(pbar, epoch)

    def __update_progress_bar(self, pbar, epoch):
        pbar.set_description(f'[{self.net.__class__.__name__}] training epoch={epoch} '
                             f'tr-loss={self.status["tr"]["loss"]:.5f} '
                             f'tr-acc={100 * self.status["tr"]["acc"]:.2f}% '
                             f'tr-macroF1={100 * self.status["tr"]["f1"]:.2f}% '
                             f'patience={self.early_stop.patience}/{self.early_stop.PATIENCE_LIMIT} '
                             f'val-loss={self.status["va"]["loss"]:.5f} '
                             f'val-acc={100 * self.status["va"]["acc"]:.2f}% '
                             f'macroF1={100 * self.status["va"]["f1"]:.2f}%')

In [90]:
# cnn_module = CNNnet(abs_dataset.vocabulary_size, abs_dataset.training.n_classes)

# negative_samples = len([el for el in abs_dataset_reduced.training.labels if el == 0])
# positive_samples = len([el for el in abs_dataset_reduced.training.labels if el == 1])
# class_weights = torch.tensor([1-(negative_samples/(negative_samples+positive_samples)),1-(positive_samples/(negative_samples+positive_samples))] )
# criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# cnn_classifier = WeightedNeuralClassifierTrainer(cnn_module, device='cpu', criterion=criterion)
# cnn_classifier.fit(*abs_dataset_reduced.training.Xy)

cnn_module = CNNnet(abs_dataset.vocabulary_size, abs_dataset.training.n_classes)
cnn_classifier = NeuralClassifierTrainer(cnn_module, device='cpu')
cnn_classifier.fit(*abs_dataset_reduced.training.Xy)

[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(checkpoint))
[CNNnet] training epoch=14 tr-loss=0.02162 tr-acc=99.15% tr-macroF1=99.15% patience=1/10 val-loss=1.4


training ended by patience exhasted; loading best model parameters in ../checkpoint/classifier_net.dat for epoch 4
performing one training pass over the validation set...
[done]


<quapy.classification.neural.NeuralClassifierTrainer at 0x135ea618650>

In [91]:
# f1_train = 1-qp.error.f1e(abs_dataset_reduced.training.labels, cnn_classifier.predict(abs_dataset_reduced.training.instances))
# accuracy_train = 1-qp.error.acce(abs_dataset_reduced.training.labels, cnn_classifier.predict(abs_dataset_reduced.training.instances))
# print('- Train set:')
# print(f'\tF1: {f1_train}')    
# print(f'\tAccuracy: {accuracy_train}')    

# f1_test = 1-qp.error.f1e(abs_dataset_reduced.test.labels, cnn_classifier.predict(abs_dataset_reduced.test.instances))
# accuracy_test = 1-qp.error.acce(abs_dataset_reduced.test.labels, cnn_classifier.predict(abs_dataset_reduced.test.instances))
# print('- Test set:')
# print(f'\tF1: {f1_test}')    
# print(f'\tAccuracy: {accuracy_test}')    

f1_train = 1-qp.error.f1e(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
accuracy_train = 1-qp.error.acce(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
print('- Train set:')
print(f'\tF1: {f1_train}')    
print(f'\tAccuracy: {accuracy_train}')    

f1_test = 1-qp.error.f1e(abs_dataset.test.labels, cnn_classifier.predict(abs_dataset.test.instances))
accuracy_test = 1-qp.error.acce(abs_dataset.test.labels, cnn_classifier.predict(abs_dataset.test.instances))
print('- Test set:')
print(f'\tF1: {f1_test}')    
print(f'\tAccuracy: {accuracy_test}')    

- Train set:
	F1: 0.6595181368937398
	Accuracy: 0.6978426084824711
- Test set:
	F1: 0.5787563314172924
	Accuracy: 0.6481347337921043


In [95]:
infer(test_set, indexer, comp_quantifier=None, comp_classifier=None, rel_quantifier=None, rel_classifier=cnn_classifier, filename=None, use_tokenizer=True)

File 15037889 - Relations: [0] - Text:


To compare the longitudinal effects of treatment on intraocular pressure (IOP) and visual field performance in Japanese normal-tension glaucoma (NTG) between latanoprost and timolol.
This is an open-label, randomized, study. A total of 62 NTG patients were prospectively, consecutively enrolled. All study subjects were randomly assigned to 0.005% latanoprost instillation once daily in the morning or 0.5% timolol instillation twice daily for a prospective 3-year follow-up, and underwent a routine ocular examination every month. Automated perimetry was performed every 6 months using Humphrey field analysers. Stereophotographs of optic discs were also obtained every 6 months.
Percentage of IOP reduction or the magnitude of IOP reduction showed no intergroup differences either at any time point (13-15%). In the visual field, the estimated rate of change in the MD value (dB/year) was -0.34+/-0.17 (SE) for the latanoprost group, and -0.10+/-0.18 (SE) f

indexing: 100%|███████████████████████████████████████████████████| 21/21 [00:00<00:00, 20916.74it/s]

	Classification:
		# Ground truth relations: (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0)
		# Predicted relations labels: [0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1]





### 3. QuaNet 
The classifier reaches an F1 of about 0.58 and an accuracy of about 0.65 on the test set, which leaves room for improvement but is enough to move on to the `QuaNet` training phase. `QuaNet` observes the classification predictions to learn higher-order *quantification embeddings*, which are then refined by incorporating quantification predictions of simple classify-and-count-like methods.

![architecture](./images/quanet_architecture.png)

The QuaNet architecture (shown in the figure above) consists of two main components: a **recurrent component** and a **fully connected component**.

#### 3.1 Recurrent Component: Bidirectional LSTM
- The core of the model is a **Bidirectional LSTM** (Long Short-Term Memory), a type of recurrent neural network. 
- The LSTM receives as input a **list of pairs** $\langle Pr(c|x), \vec{x} \rangle$, where:
  - $Pr(c|x)$ is the probability that a classifier $h$ assigns class $c$ to document $x$.
  - $\vec{x}$ is the **document embedding**, a vector representing the document's content.
- The list is **sorted by the value of $Pr(c|x)$**, meaning the documents are arranged from least to most likely to belong to class $c$.
  
The **intuition** behind this approach is that the LSTM will "learn to count" positive and negative examples. By observing the ordered sequence of probabilities, the LSTM should learn to recognize the point where the documents switch from negative to positive examples. The document embedding $\vec{x}$ helps the LSTM assign different importance to each document when making its prediction.

The output of the LSTM is called a **quantification embedding**—a dense vector representing the information about the quantification task learned from the input data.
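
The construction of the LSTM input can be sketched in a few lines. The posteriors and the two-dimensional embeddings below are made-up values, and `build_quanet_sequence` is a hypothetical helper, not a function of `QuaPy`.

```python
# Hypothetical classifier outputs for four documents: the posterior of the
# positive class and a (tiny) document embedding.
posteriors = [0.91, 0.08, 0.47, 0.63]
embeddings = [[0.2, 0.7], [0.9, 0.1], [0.5, 0.5], [0.3, 0.6]]


def build_quanet_sequence(posteriors, embeddings):
    """Pair each posterior with its embedding and sort by ascending posterior."""
    ordered = sorted(zip(posteriors, embeddings), key=lambda pair: pair[0])
    # Each step of the sequence is the concatenation [Pr(c|x), x_1, ..., x_d].
    return [[prob] + list(vector) for prob, vector in ordered]


sequence = build_quanet_sequence(posteriors, embeddings)
for step in sequence:
    print(step)
```

Reading the sequence from start to end, the posterior grows monotonically, so the position where it crosses the decision threshold is a rough indicator of how many documents are positive.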

#### 3.2 Fully Connected Component
- The vector returned by the LSTM is combined with additional information, specifically **quantification-related statistics**:
  - $\hat{p}_c^{CC}(D)$, $\hat{p}_c^{ACC}(D)$, $\hat{p}_c^{PCC}(D)$, and $\hat{p}_c^{PACC}(D)$, which are quantification predictions from different methods.
  - $tpr_b$, $fpr_b$, $tpr_s$, and $fpr_s$, aggregate statistics related to true positive and false positive rates, which are easy to compute from the classifier $h$ using a validation set.

This combined vector then passes through the second part of the architecture, which is made up of **fully connected layers** with **ReLU activations**. These layers adjust the quantification embedding using the additional statistics from the classifier to improve the accuracy of the quantification.

The final output is a prediction $\hat{p}_c^{QuaNet}(c|D)$, which represents the estimated prevalence of class $c$ in the dataset $D$, produced by a **softmax layer**.

QuaNet could use quantification predictions from many methods, but it focuses on those that are **computationally efficient** (like CC, ACC, PCC, and PACC). This ensures that the process remains fast while still providing sufficient information for accurate predictions.
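
For the binary case, the four auxiliary estimates can be sketched as follows. The posteriors and the validation rates are invented numbers, and the adjusted variants use the usual correction $(\hat{p} - fpr) / (tpr - fpr)$ clipped to $[0, 1]$; this is an illustration of the methods, not the code that `QuaPy` runs internally.

```python
def clip01(value):
    """Clip a number to the interval [0, 1]."""
    return min(1.0, max(0.0, value))


def cc(hard_predictions):
    """Classify and Count: fraction of items predicted positive."""
    return sum(hard_predictions) / len(hard_predictions)


def pcc(posteriors):
    """Probabilistic Classify and Count: mean positive posterior."""
    return sum(posteriors) / len(posteriors)


def adjust(estimate, tpr, fpr):
    """Solve estimate = tpr * p + fpr * (1 - p) for the prevalence p."""
    if tpr == fpr:  # correction undefined; fall back to the raw estimate
        return estimate
    return clip01((estimate - fpr) / (tpr - fpr))


# Made-up positive-class posteriors for ten sentence pairs.
posteriors = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1, 0.7, 0.2]
hard = [int(p >= 0.5) for p in posteriors]

# Made-up validation statistics: hard (b) and soft (s) rates.
tpr_b, fpr_b = 0.8, 0.3
tpr_s, fpr_s = 0.7, 0.35

p_cc = cc(hard)
p_pcc = pcc(posteriors)
p_acc = adjust(p_cc, tpr_b, fpr_b)     # ACC corrects CC with the hard rates
p_pacc = adjust(p_pcc, tpr_s, fpr_s)   # PACC corrects PCC with the soft rates
print(p_cc, p_acc, p_pcc, p_pacc)
```

All four estimates require only one pass over the classifier outputs, which is why they are cheap enough to feed into the network.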

#### 3.3 Details

| Layer | Type | Dimensions | Activation | Dropout |
|---|---|---|---|---|
| Input | LSTM | 128 | N/A | N/A |
| Dense 1 | Dense | 1024 | ReLU | 0.5 |
| Dense 2 | Dense | 512 | ReLU | 0.5 |
| Output | Dense | 2 | Softmax | N/A |

- The LSTM has **64 hidden dimensions**, and since it’s bidirectional, the final LSTM output has **128 dimensions**.
- This LSTM output is concatenated with the **8 quantification statistics** (giving a total of 136 dimensions), which is then fed into:
  - **Two dense layers** with **1,024** and **512 dimensions**, each using **ReLU activation** and **0.5 dropout**.
  - Finally, the output is passed through a **softmax layer** of size 2 to produce the final prevalence estimate.
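
The dimensions above can be checked with a little arithmetic. The snippet assumes plain fully connected layers with one bias per output unit; the parameter counts are derived from the sizes listed in the table, not read from the trained model.

```python
lstm_hidden = 64           # hidden units per direction
n_statistics = 8           # CC, ACC, PCC, PACC, tpr_b, fpr_b, tpr_s, fpr_s
dense_sizes = [1024, 512]  # the two fully connected layers
n_classes = 2

lstm_output = 2 * lstm_hidden             # bidirectional: forward + backward
ff_input = lstm_output + n_statistics     # input of the first dense layer


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_dims = [ff_input] + dense_sizes + [n_classes]
params = [linear_params(a, b) for a, b in zip(layer_dims, layer_dims[1:])]
print(ff_input, params, sum(params))
```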


In [59]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier = QuaNet(cnn_classifier, qp.environ['SAMPLE_SIZE'], device='cpu')
quantifier.fit(abs_dataset.training, fit_classifier=False)



QuaNetModule(
  (lstm): LSTM(102, 64, batch_first=True, dropout=0.5, bidirectional=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (ff_layers): ModuleList(
    (0): Linear(in_features=136, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
  )
  (output): Linear(in_features=512, out_features=2, bias=True)
)


  ptrue = torch.as_tensor([sample_data.prevalence()], dtype=torch.float, device=self.device)
[QuaNet] epoch=1 [it=499/500]	tr-mseloss=0.02247 tr-maeloss=0.11509	val-mseloss=-1.00000 val-maeloss=
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 85.34it/s]
[QuaNet] epoch=2 [it=499/500]	tr-mseloss=0.01620 tr-maeloss=0.10015	val-mseloss=0.00939 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 98.89it/s]
[QuaNet] epoch=3 [it=499/500]	tr-mseloss=0.01581 tr-maeloss=0.09844	val-mseloss=0.01178 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 94.75it/s]
[QuaNet] epoch=4 [it=499/500]	tr-mseloss=0.01450 tr-maeloss=0.09204	val-mseloss=0.00923 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 88.83it/s]
[QuaNet] epoch=5 [it=499/500]	tr-mseloss=0.01611 tr-maeloss=0.09939	val-mseloss=0.01475 val

training ended by patience exhausted; loading best model parameters in ../checkpoint\QuaNet-90930-370682-278513-222161-753977 for epoch 19



  self.quanet.load_state_dict(torch.load(checkpoint))


We wrapped `QuaPy`'s error evaluation function and manually modified how each sample is selected; we adjusted the sampling strategy to work with batches where the batch size is equal to the number of sentences that compose each abstract. This allows us to select the entire document based on the filename associated with each sentence. We will also evaluate the results using the standard random sampling technique, where sentences from different abstracts are grouped into the same batch.
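
The difference between the two sampling schemes can be sketched as follows. Both generators are hypothetical helpers written for illustration: they assume that `n` is the number of whole abstracts grouped into one sample and that the standard scheme draws items uniformly at random, which may differ from what `evaluate` and `QuaPy`'s protocols actually do.

```python
import random
from collections import defaultdict


def by_document_samples(filenames, n_docs=1):
    """Yield lists of item indices, each covering `n_docs` whole abstracts."""
    groups = defaultdict(list)
    for index, name in enumerate(filenames):
        groups[name].append(index)
    names = list(groups)  # dictionaries keep first-seen order
    for start in range(0, len(names), n_docs):
        sample = []
        for name in names[start:start + n_docs]:
            sample.extend(groups[name])
        yield sample


def random_samples(n_items, sample_size, n_samples, seed=0):
    """Yield index samples drawn uniformly, ignoring document boundaries."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        yield rng.sample(range(n_items), sample_size)


filenames = ["doc_a"] * 3 + ["doc_b"] * 2 + ["doc_c"] * 4
single = list(by_document_samples(filenames, n_docs=1))
paired = list(by_document_samples(filenames, n_docs=2))
mixed = list(random_samples(len(filenames), sample_size=4, n_samples=2))
print(single)
print(paired)
```

With by-document sampling the sample size varies with the length of each abstract, whereas random sampling keeps it fixed but mixes pairs from unrelated abstracts.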

In [60]:
print('Results on training set:')
result_train = evaluate(collection=abs_dataset.training, n=[1,2], quantifier=quantifier)

print('\nResults on test set:')
result_test = evaluate(collection=abs_dataset.test, n=[1,2], quantifier=quantifier)

Results on training set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=2)
------------	--------	-----------	-----------
AE             	0.0567         	0.1781         	0.1400         
RAE            	0.1601         	0.4373         	0.3508         
MSE            	0.0032         	0.0612         	0.0372         
MAE            	0.0567         	0.1781         	0.1400         
MRAE           	0.1601         	0.4373         	0.3508         
MKLD           	0.0078         	0.1657         	0.0797         

Results on test set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=2)
------------	--------	-----------	-----------
AE             	0.1822         	0.2672         	0.2279         
RAE            	0.5105         	0.8070         	0.6694         
MSE            	0.0332         	0.1067         	0.0777         
MAE            	0.1822         	0.2672         	0.2279         
MRAE           	0.5105         	0.8070         	0.6694         
MKLD           	0.0686         	0.3070         	0.1892         


The results seem promising, although the by-document sampling strategy yields substantially higher errors than standard random sampling (test MAE of 0.2672 with `n=1` versus 0.1822). Increasing the number of elements per batch reduces the error (test MAE of 0.2279 with `n=2`), bringing it closer to the values obtained with standard random sampling. This suggests two directions to explore:

- **Modify the Sampling Strategy During Training**: Adjusting the sampling strategy at training time could help the model learn the distribution of components within a single abstract.
  
- **Utilize Longer Documents**: We may consider working with longer documents containing more sentences. For instance, the [dataset suggested by Galassi](https://madoc.bib.uni-mannheim.de/46084/1/argmining-18-multi%20%289%29.pdf) contains entire papers annotated with argument components.
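One possible reading of the batch-size effect is that the prevalence of a single abstract varies widely, while the prevalence of several pooled abstracts concentrates around the corpus-wide value that standard sampling sees. The sketch below illustrates this on a purely synthetic corpus; the number of documents, the 13 sentences per document, and the Beta-distributed per-document rates are all assumptions made for the illustration, not properties of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic corpus (NOT the real data): 200 documents of 13 sentences each,
# every document with its own positive rate.
doc_rates = rng.beta(2, 2, size=200)
docs = [rng.random(13) < rate for rate in doc_rates]


def pooled_prevalence_spread(docs, n, trials=2000):
    """Standard deviation of the positive prevalence when pooling n documents."""
    prevalences = []
    for _ in range(trials):
        chosen = rng.choice(len(docs), size=n, replace=False)
        pooled = np.concatenate([docs[i] for i in chosen])
        prevalences.append(pooled.mean())
    return float(np.std(prevalences))


spread = {n: pooled_prevalence_spread(docs, n) for n in (1, 2, 5, 10)}
for n, value in spread.items():
    print(f"n={n:2d}  std of batch prevalence = {value:.3f}")
```

The spread shrinks as `n` grows, so a quantifier trained on randomly sampled batches is asked to predict less extreme prevalences when more abstracts are pooled.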

Let’s observe how the current model behaves with some sample instances.

In [75]:
infer(test_set, indexer, comp_quantifier=None, comp_classifier=None, rel_quantifier=quantifier, rel_classifier=cnn_classifier, filename='29605574', use_tokenizer=True)

File 29605574 - Relations: [0] - Text:
Multiple studies have evaluated the hypoglycemic effect of cinnamon in patients with diabetes mellitus (DM) type II, with conflicting results. Differences in Baseline Body Mass Index (BMI) of patients may be able to explain the observed differences in the results. This study was designed to evaluate the effect of cinnamon supplementation on anthropometric, glycemic and lipid outcomes of patients with DM type II based on their baseline BMI. The study was designed as a triple-blind placebo-controlled randomized clinical trial, using a parallel design. One hundred and forty patients referred to Diabetes Clinic of Yazd University of Medical Sciences with diagnosis of DM type II were randomly assigned in four groups: cinnamon (BMI ≥ 27, BMI < 27) and Placebo (BMI ≥ 27, BMI < 27). Patients received cinnamon bark powder or placebo in 500 mg capsules twice daily for 3 months. Anthropometric, glycemic and lipid outcomes were measured before and after the i

indexing: 100%|████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9530.34it/s]


	Quantification:
		# True distribution relations: [Class 0 = 0.8000, Class 1 = 0.2000]
		# Estimated distribution relations: [Class 0 = 0.0228, Class 1 = 0.9772]


indexing: 100%|███████████████████████████████████████████████████| 55/55 [00:00<00:00, 27564.43it/s]

	# Estimated distribution on tokenized relations: [Class 0 = 0.0779, Class 1 = 0.9221]

	# AE: 0.7772
	# RAE: 2.1980
	# MSE: 0.6040
	# MAE: 0.7772
	# MRAE: 2.1980
	# MKLD: 1.9173
	Classification:
		# Ground truth relations: (0, 0, 0, 0, 0, 1, 0, 0, 1, 0)
		# Predicted relations labels: [1 1 1 1 1 1 1 1 1 1]





For this abstract the predictions are far from the ground truth: the classifier labels every sentence pair as related, and the quantifier estimates 97.7% of related pairs against a true value of 20%. We compute the posterior estimations in two ways: first, by using the sentence tokenization applied to the dataset, which takes into account the annotations provided by *Cabrio* and *Villata*; second, by simply applying `sent_tokenize()` to each abstract. The two estimates are close to each other (0.9772 vs. 0.9221 for Class 1).
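The error values printed for file 29605574 follow directly from the true and estimated distributions. The sketch below recomputes them with plain NumPy rather than through `QuaPy`. The relative measures are smoothed with a small `eps`; the value `0.025` is an assumption, chosen because it approximately reproduces the reported RAE and MKLD, and it is not taken from the notebook's configuration.

```python
import numpy as np


def smooth(prev, eps):
    """Additive smoothing that keeps the prevalence vector summing to 1."""
    prev = np.asarray(prev, dtype=float)
    return (prev + eps) / (1.0 + eps * prev.size)


def absolute_error(true, est):
    return float(np.mean(np.abs(np.asarray(est) - np.asarray(true))))


def squared_error(true, est):
    return float(np.mean((np.asarray(est) - np.asarray(true)) ** 2))


def relative_absolute_error(true, est, eps):
    t, e = smooth(true, eps), smooth(est, eps)
    return float(np.mean(np.abs(e - t) / t))


def kl_divergence(true, est, eps):
    t, e = smooth(true, eps), smooth(est, eps)
    return float(np.sum(t * np.log(t / e)))


true_prev = [0.8000, 0.2000]  # true distribution of relations
est_prev = [0.0228, 0.9772]   # estimated distribution of relations
EPS = 0.025                   # assumed smoothing factor (see the text above)

ae = absolute_error(true_prev, est_prev)    # 0.7772
mse = squared_error(true_prev, est_prev)    # 0.6040
rae = relative_absolute_error(true_prev, est_prev, EPS)
kl = kl_divergence(true_prev, est_prev, EPS)
print(f"AE={ae:.4f}  MSE={mse:.4f}  RAE={rae:.4f}  KLD={kl:.4f}")
```

With a single sample, the mean variants (MAE, MRAE, MKLD) coincide with AE, RAE, and KLD, which is why the output lists identical values for each pair.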

### 4. QuaNet with custom training routine

Now, let us attempt the first of the previously proposed modifications. We have adjusted the `QuaNet` training routine so that, in each epoch, every batch contains all the sentences from an abstract.

In [36]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier_custom = QuaNetTrainerABS(cnn_classifier, qp.environ['SAMPLE_SIZE'], device='cpu')
quantifier_custom.fit(abs_dataset.training, fit_classifier=False)


KeyboardInterrupt



In [13]:
print('Results on training set:')
result_train_custom = evaluate(collection=abs_dataset.training, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier_custom)

print('\nResults on test set:')
result_test_custom = evaluate(collection=abs_dataset.test, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier_custom)

Results on training set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=3)	ByDoc (n=5)	ByDoc (n=10)	ByDoc (n=13)	ByDoc (n=15)
------------	--------	-----------	-----------	-----------	------------	------------	------------
AE             	0.1426         	0.1616         	0.1423         	0.1424         	0.1427         	0.1430         	0.1426         
RAE            	0.2651         	0.3329         	0.2715         	0.2679         	0.2669         	0.2675         	0.2661         
MSE            	0.0203         	0.0381         	0.0255         	0.0228         	0.0217         	0.0219         	0.0212         
MAE            	0.1426         	0.1616         	0.1423         	0.1424         	0.1427         	0.1430         	0.1426         
MRAE           	0.2651         	0.3329         	0.2715         	0.2679         	0.2669         	0.2675         	0.2661         
MKLD           	0.0360         	0.0675         	0.0450         	0.0403         	0.0384         	0.0386         	0.0375         

Results on t

In [14]:
filename = random.choice(test_collection.filenames)

print('Standard QuaNet:')
infer(test_set, indexer, comp_quantifier=quantifier, comp_classifier=cnn_classifier, rel_quantifier=None, filename=filename, show_text=False, use_tokenizer=True)

print('Custom approach QuaNet:')
infer(test_set, indexer, comp_quantifier=quantifier_custom, comp_classifier=cnn_classifier, rel_quantifier=None, filename=filename, show_text=False, use_tokenizer=True)

Standard QuaNet:


indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<?, ?it/s]


	# Labels: (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
	# Predicted labels: [0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1]

	# True distribution: [Non-components = 0.625, Components = 0.375]
	# Estimated distribution: [Non-components = 0.7147798538208008, Components = 0.28522008657455444]



indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<?, ?it/s]


	# Estimated distribution on tokenized sentences: [None = 0.7086560130119324, Components = 0.29134401679039]

	# AE: 0.0898
	# RAE: 0.1762
	# MSE: 0.0081
	# MAE: 0.0898
	# MRAE: 0.1762
	# MKLD: 0.0158
Custom approach QuaNet:


indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<?, ?it/s]


	# Labels: (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
	# Predicted labels: [0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1]

	# True distribution: [Non-components = 0.625, Components = 0.375]
	# Estimated distribution: [Non-components = 0.37237972021102905, Components = 0.6276203393936157]



indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<?, ?it/s]

	# Estimated distribution on tokenized sentences: [None = 0.37244391441345215, Components = 0.6275560855865479]

	# AE: 0.2526
	# RAE: 0.4959
	# MSE: 0.0638
	# MAE: 0.2526
	# MRAE: 0.4959
	# MKLD: 0.1122





Unfortunately, we observe generally higher error values with this methodology. Moreover, running the previous cell multiple times shows that the custom quantifier tends to predict a roughly equal distribution of `Components` and `Non-components`. This might reflect the fact that, on average, the abstracts in both the training and test sets contain a roughly equal number of argument components and non-components.

```text
- Train set:
	Label 0: 2760 samples
	Label 1: 2593 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences per file in train set: 13
	Max sentence length: 107
	Average components per file: 6.48
	Average non-components per file: 6.90

- Test set:
	Label 0: 1948 samples
	Label 1: 1880 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences per file in test set: 14
	Max sentence length: 91
	Average components per file: 6.99
	Average non-components per file: 7.24
```
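The balance between the two classes can be checked with a few lines of arithmetic on the statistics above. This sketch only restates the reported counts; the dictionary layout and the helper `component_share` are illustrative choices.

```python
# Counts and per-file averages copied from the dataset statistics.
stats = {
    "train": {"components": 2593, "non_components": 2760,
              "avg_components": 6.48, "avg_non_components": 6.90},
    "test": {"components": 1880, "non_components": 1948,
             "avg_components": 6.99, "avg_non_components": 7.24},
}


def component_share(positive, negative):
    """Fraction of sentences that are argument components."""
    return positive / (positive + negative)


shares = {}
for split, s in stats.items():
    overall = component_share(s["components"], s["non_components"])
    per_file = component_share(s["avg_components"], s["avg_non_components"])
    shares[split] = (overall, per_file)
    print(f"{split}: overall={overall:.3f}, per-file average={per_file:.3f}")
```

Both views give a component share of about 0.484 on the training set and 0.491 on the test set, so a quantifier that always answers close to an even split already lands near the average abstract.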