# Experiment 5 - Combinations
Now that our model can detect components, we turn to the relations between them. We still split each abstract into sentences, but we now build every possible pair of sentences and label each pair as either *Not related* or *Related*, corresponding to 0 and 1 respectively.

During inference, we use the same sampling technique as before, which processes all the sentences of an abstract together. This time, our goal is to estimate the percentage of relations between argument components in each abstract, and to check whether this information, together with the percentage of components, could help build a working model that predicts the number of arguments.

In [1]:
from experiment_5_code import *

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Antonio\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### 1. Preprocessing
Here we preprocess our data by splitting the abstracts into sentences. Each pair of sentences is then labeled either `Relationship` (`1`) or `Not relationship` (`0`).
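As a rough illustration of the pairing step, the sketch below builds every unordered pair of sentences and labels it. It is not the code in `experiment_5_code`: the function name `build_sentence_pairs`, the `related_pairs` argument and the way the two sentences are joined are assumptions made for the example.

```python
from itertools import combinations

def build_sentence_pairs(sentences, related_pairs):
    """Return (pair_text, label) for every unordered pair of sentences.

    `related_pairs` is a hypothetical set of index pairs (i, j) with i < j
    marking sentences that are linked by an annotated relation.
    """
    samples = []
    for i, j in combinations(range(len(sentences)), 2):
        label = 1 if (i, j) in related_pairs else 0
        # Joining with a space is an assumption; any separator would do.
        samples.append((sentences[i] + " " + sentences[j], label))
    return samples

sentences = [
    "Drug A lowered IOP.",
    "Drug B did not.",
    "Drug A is preferable.",
    "Funding: none.",
]
pairs = build_sentence_pairs(sentences, related_pairs={(0, 2), (1, 2)})
print(len(pairs), sum(label for _, label in pairs))
```

An abstract with n sentences yields n(n-1)/2 pairs, so 14 sentences already give 91 pairs. Since only a few of them are related, the two labels end up heavily imbalanced.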

In [2]:
train_set = read_brat_dataset('../data/train/neoplasm_train') + read_brat_dataset('../data/dev/neoplasm_dev')
test_set = read_brat_dataset('../data/test/glaucoma_test') + read_brat_dataset('../data/test/neoplasm_test') + read_brat_dataset('../data/test/mixed_test')

In [3]:
label_counts_train, avg_sentences_per_file_train = compute_dataset_statistics(train_set, dataset_name="train")
label_counts_test, avg_sentences_per_file_test = compute_dataset_statistics(test_set, dataset_name="test")

- Train set:
	Label 0: 35324 samples
	Label 1: 1636 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences pairs per file in train set: 92
	Max sentence pair length: 162
	Average relationships per file: 4.16
	Average no relationships per file: 88.31

- Test set:
	Label 0: 24835 samples
	Label 1: 1120 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences pairs per file in test set: 96
	Max sentence pair length: 145
	Average relationships per file: 4.23
	Average no relationships per file: 92.32



An example of what our dictionary looks like follows.

In [38]:
# display_file_info(train_set, filename='16416368') 
# display_file_info(train_set, filename=None)

Next, we create the dataset `FilenameLabelledCollection`, which inherits from `QuaPy`'s `LabelledCollection` class. This allows us to keep track of the filenames corresponding to each abstract to which the sentences belong. The `index` method is also modified to return two `FilenameLabelledCollection` instances.

In [5]:
train_collection = FilenameLabelledCollection([data['sentence_pair'] for data in train_set], 
                                                 [data['label'] for data in train_set], 
                                                 [data['filename'] for data in train_set], 
                                                 classes=list(label_counts_train.keys()))

test_collection = FilenameLabelledCollection([data['sentence_pair'] for data in test_set], 
                                                 [data['label'] for data in test_set], 
                                                 [data['filename'] for data in test_set], 
                                                 classes=list(label_counts_test.keys()))

In [6]:
indexer = qp.data.preprocessing.IndexTransformer(min_df=1)
abs_dataset = Dataset(train_collection, test_collection)

index(abs_dataset, indexer, inplace=True)

qp.environ['SAMPLE_SIZE'] = avg_sentences_per_file_train

indexing: 100%|███████████████████████████████████████████████████████████████| 36960/36960 [00:01<00:00, 23797.87it/s]
indexing: 100%|███████████████████████████████████████████████████████████████| 25955/25955 [00:00<00:00, 33018.80it/s]


### 2. Classifier
`QuaNet` requires a classifier that can provide embedded representations of the inputs. In the original paper, `QuaNet` was tested using an `LSTM` as the base classifier; as `QuaPy`'s authors show in their [example](https://hlt-isti.github.io/QuaPy/manuals/methods.html#the-quanet-neural-network), we will use an instantiation of `QuaNet` that employs a `CNN` as a probabilistic classifier, taking its last layer representation as the document embedding.

In [7]:
# train the text classifier:
cnn_module = CNNnet(abs_dataset.vocabulary_size, abs_dataset.training.n_classes)
cnn_classifier = NeuralClassifierTrainer(cnn_module, device='cpu')
cnn_classifier.fit(*abs_dataset.training.Xy)

[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(checkpoint))
[CNNnet] training epoch=19 tr-loss=0.01389 tr-acc=99.56% tr-macroF1=97.36% patience=1/10 val-loss=0.47329 val-acc=94.72


training ended by patience exhasted; loading best model parameters in ../checkpoint/classifier_net.dat for epoch 9
performing one training pass over the validation set...
[done]


<quapy.classification.neural.NeuralClassifierTrainer at 0x1e32edb9bb0>

In [8]:
f1_train = 1-qp.error.f1e(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
accuracy_train = 1-qp.error.acce(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
print('- Train set:')
print(f'\tF1: {f1_train}')    
print(f'\tAccuracy: {accuracy_train}')    

f1_test = 1-qp.error.f1e(abs_dataset.test.labels, cnn_classifier.predict(abs_dataset.test.instances))
accuracy_test = 1-qp.error.acce(abs_dataset.test.labels, cnn_classifier.predict(abs_dataset.test.instances))
print('- Test set:')
print(f'\tF1: {f1_test}')    
print(f'\tAccuracy: {accuracy_test}')    

- Train set:
	F1: 0.4886835260915279
	Accuracy: 0.9557359307359308
- Test set:
	F1: 0.4889742075211656
	Accuracy: 0.9568483914467347
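Because the labels are so imbalanced, it helps to compare these scores with a trivial baseline. The sketch below computes the accuracy and macro F1 of a classifier that always predicts `0`, using the label counts printed by `compute_dataset_statistics`. We assume here that `qp.error.f1e` is based on macro-averaged F1 and that the F1 of a class that is never predicted counts as 0.

```python
def majority_baseline(n_negative, n_positive):
    """Accuracy and macro F1 of a classifier that always predicts label 0."""
    total = n_negative + n_positive
    accuracy = n_negative / total
    # Class 0: recall is 1 and precision equals the share of negatives,
    # which simplifies to 2n / (2n + p).
    f1_negative = 2 * n_negative / (2 * n_negative + n_positive)
    # Class 1 is never predicted, so its F1 is taken as 0 (assumption).
    f1_positive = 0.0
    return accuracy, (f1_negative + f1_positive) / 2

train_acc, train_f1 = majority_baseline(35324, 1636)
test_acc, test_f1 = majority_baseline(24835, 1120)
print(f"train: accuracy={train_acc:.4f} macro-F1={train_f1:.4f}")
print(f"test:  accuracy={test_acc:.4f} macro-F1={test_f1:.4f}")
```

The baseline values coincide with the accuracy and F1 reported above, which would be consistent with a classifier that assigns `0` to nearly every sentence pair.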


### 3. QuaNet 
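Accuracy is high, but the F1 of about 0.49 shows that the minority `Related` class is barely detected, so the classifier's output should be read with caution. With that caveat, let's move on to the `QuaNet` training phase. `QuaNet` observes the classification predictions to learn higher-order *quantification embeddings*, which are then refined by incorporating quantification predictions of simple classify-and-count-like methods.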

![architecture](./images/quanet_architecture.png)

The QuaNet architecture (see Figure 1) consists of two main components: a **recurrent component** and a **fully connected component**.

#### 3.1 Recurrent Component: Bidirectional LSTM
- The core of the model is a **Bidirectional LSTM** (Long Short-Term Memory), a type of recurrent neural network. 
- The LSTM receives as input a **list of pairs** $\langle \Pr(c|x), \vec{x} \rangle$, where:
  - $Pr(c|x)$ is the probability that a classifier $h$ assigns class $c$ to document $x$.
  - $\vec{x}$ is the **document embedding**, a vector representing the document's content.
- The list is **sorted by the value of $Pr(c|x)$**, meaning the documents are arranged from least to most likely to belong to class $c$.
  
The **intuition** behind this approach is that the LSTM will "learn to count" positive and negative examples. By observing the ordered sequence of probabilities, the LSTM should learn to recognize the point where the documents switch from negative to positive examples. The document embedding $\vec{x}$ helps the LSTM assign different importance to each document when making its prediction.

The output of the LSTM is called a **quantification embedding**—a dense vector representing the information about the quantification task learned from the input data.
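The input sequence described above can be sketched in a few lines of plain Python. This is only an illustration of the ordering step, not `QuaPy`'s implementation: the function name `build_quanet_sequence` is hypothetical, and we assume a binary problem in which each item is the pair of posteriors followed by the embedding.

```python
def build_quanet_sequence(posteriors, embeddings):
    """Sort documents by Pr(c|x) and join each posterior with its embedding.

    posteriors: list of Pr(c|x) for the positive class, one per document.
    embeddings: list of embedding vectors, aligned with `posteriors`.
    """
    # Indices ordered from least to most likely to belong to class c.
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i])
    # Each LSTM input item: [Pr(not c|x), Pr(c|x)] followed by the embedding.
    return [
        [1.0 - posteriors[i], posteriors[i]] + list(embeddings[i])
        for i in order
    ]

posteriors = [0.9, 0.1, 0.5]
embeddings = [[1, 1], [2, 2], [3, 3]]
sequence = build_quanet_sequence(posteriors, embeddings)
print([item[1] for item in sequence])
```

Under this layout, a 100-dimensional embedding plus two posteriors would give items of size 102, which is presumably where the input size of `LSTM(102, 64, ...)` in the module printout further down comes from.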

#### 3.2 Fully Connected Component
- The vector returned by the LSTM is combined with additional information, specifically **quantification-related statistics**:
  - $\hat{p}_c^{CC}(D)$, $\hat{p}_c^{ACC}(D)$, $\hat{p}_c^{PCC}(D)$, and $\hat{p}_c^{PACC}(D)$, which are quantification predictions from different methods.
  - $tpr_b$, $fpr_b$, $tpr_s$, and $fpr_s$, aggregate statistics related to true positive and false positive rates, which are easy to compute from the classifier $h$ using a validation set.

This combined vector then passes through the second part of the architecture, which is made up of **fully connected layers** with **ReLU activations**. These layers adjust the quantification embedding using the additional statistics from the classifier to improve the accuracy of the quantification.

The final output is a prediction $\hat{p}_c^{QuaNet}(c|D)$, the estimated prevalence of class $c$ in the dataset $D$, produced by a **softmax layer**.

QuaNet could use quantification predictions from many methods, but it focuses on those that are **computationally efficient** (like CC, ACC, PCC, and PACC). This ensures that the process remains fast while still providing sufficient information for accurate predictions.
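For reference, the four inexpensive estimates and the four rates can be written down for the binary case as follows. This is a minimal sketch using the textbook definitions, not `QuaPy`'s code. The helper names are hypothetical, and we read $tpr_b$, $fpr_b$ as rates computed from hard decisions and $tpr_s$, $fpr_s$ as their soft counterparts computed from posteriors.

```python
def mean(values):
    return sum(values) / len(values)

def classifier_rates(val_posteriors, val_labels, threshold=0.5):
    """Hard (tpr_b, fpr_b) and soft (tpr_s, fpr_s) rates on validation data."""
    pos = [p for p, y in zip(val_posteriors, val_labels) if y == 1]
    neg = [p for p, y in zip(val_posteriors, val_labels) if y == 0]
    tpr_b = mean([p >= threshold for p in pos])
    fpr_b = mean([p >= threshold for p in neg])
    tpr_s, fpr_s = mean(pos), mean(neg)
    return tpr_b, fpr_b, tpr_s, fpr_s

def adjust(estimate, tpr, fpr):
    """Adjusted count: invert estimate = tpr * p + fpr * (1 - p), clip to [0, 1]."""
    if tpr == fpr:
        return estimate  # the correction is undefined, keep the raw estimate
    return min(1.0, max(0.0, (estimate - fpr) / (tpr - fpr)))

def quantify(posteriors, rates, threshold=0.5):
    """Positive-class prevalence according to CC, ACC, PCC and PACC."""
    tpr_b, fpr_b, tpr_s, fpr_s = rates
    cc = mean([p >= threshold for p in posteriors])   # classify and count
    pcc = mean(posteriors)                            # probabilistic CC
    return {
        "CC": cc,
        "ACC": adjust(cc, tpr_b, fpr_b),
        "PCC": pcc,
        "PACC": adjust(pcc, tpr_s, fpr_s),
    }

rates = classifier_rates([0.9, 0.8, 0.3, 0.2, 0.1, 0.6], [1, 1, 1, 0, 0, 0])
estimates = quantify([0.9, 0.7, 0.2, 0.1], rates)
print(estimates)
```

All four estimates only need one pass over the posteriors, which is why they are cheap enough to feed into the network alongside the quantification embedding.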

### Details

| Layer | Type | Dimensions | Activation | Dropout |
|---|---|---|---|---|
| Input | LSTM | 128 | N/A | N/A |
| Dense 1 | Dense | 1024 | ReLU | 0.5 |
| Dense 2 | Dense | 512 | ReLU | 0.5 |
| Output | Dense | 2 | Softmax | N/A |

- The LSTM has **64 hidden dimensions**, and since it’s bidirectional, the final LSTM output has **128 dimensions**.
- This LSTM output is concatenated with the **8 quantification statistics** (giving a total of 136 dimensions), which is then fed into:
  - **Two dense layers** with **1,024** and **512 dimensions**, each using **ReLU activation** and **0.5 dropout**.
  - Finally, the output is passed through a **softmax layer** of size 2 to produce the final class prevalence estimates.
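The dimensions listed above can be checked with a little arithmetic. The snippet below only derives the input and output size of each fully connected layer from the numbers in the list. The variable names are ours and no deep learning library is involved.

```python
lstm_hidden = 64      # hidden units per LSTM direction
n_statistics = 8      # CC, ACC, PCC, PACC estimates plus the four rates
ff_sizes = [1024, 512]
n_classes = 2

# A bidirectional LSTM concatenates the forward and backward states.
lstm_output = 2 * lstm_hidden

# Each layer takes the previous layer's output as its input.
in_features = [lstm_output + n_statistics] + ff_sizes
out_features = ff_sizes + [n_classes]
layer_shapes = list(zip(in_features, out_features))
print(layer_shapes)
```

These shapes can be compared with the `Linear` layers of the `QuaNetModule` printed when the quantifier is trained.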


In [9]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier = QuaNet(cnn_classifier, qp.environ['SAMPLE_SIZE'], device='cpu')
quantifier.fit(abs_dataset.training, fit_classifier=False)



QuaNetModule(
  (lstm): LSTM(102, 64, batch_first=True, dropout=0.5, bidirectional=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (ff_layers): ModuleList(
    (0): Linear(in_features=136, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
  )
  (output): Linear(in_features=512, out_features=2, bias=True)
)


  ptrue = torch.as_tensor([sample_data.prevalence()], dtype=torch.float, device=self.device)
[QuaNet] epoch=1 [it=499/500]	tr-mseloss=0.01962 tr-maeloss=0.09637	val-mseloss=-1.00000 val-maeloss=-1.00000 patience=
100%|███████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 120.25it/s]
[QuaNet] epoch=2 [it=499/500]	tr-mseloss=0.00497 tr-maeloss=0.05501	val-mseloss=0.00512 val-maeloss=0.05798 patience=10
100%|███████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 118.36it/s]
[QuaNet] epoch=3 [it=499/500]	tr-mseloss=0.00496 tr-maeloss=0.05493	val-mseloss=0.00699 val-maeloss=0.07354 patience=9/
100%|███████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 123.70it/s]
[QuaNet] epoch=4 [it=499/500]	tr-mseloss=0.00436 tr-maeloss=0.05148	val-mseloss=0.00355 val-maeloss=0.04846 patience=10
100%|██████████████████████████████████████████████████████████████

training ended by patience exhausted; loading best model parameters in ../checkpoint\QuaNet-8306-461904-406973-78647-744218 for epoch 8



  self.quanet.load_state_dict(torch.load(checkpoint))


We wrapped `QuaPy`'s error evaluation function and manually modified how each sample is selected; we adjusted the sampling strategy to work with batches where the batch size is equal to the number of sentences that compose each abstract. This allows us to select the entire document based on the filename associated with each sentence. We will also evaluate the results using the standard random sampling technique, where sentences from different abstracts are grouped into the same batch.
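To make the by-document sampling concrete, here is a small sketch of how batches made of whole abstracts could be drawn from the filenames attached to each instance. It is one plausible reading of the idea rather than the actual `evaluate` wrapper: `by_document_batches` and `prevalence` are hypothetical helpers, and `n` is taken to be the number of abstracts grouped in one batch.

```python
import random
from collections import defaultdict

def by_document_batches(filenames, n=1, seed=0):
    """Yield lists of instance indices, each covering n whole documents."""
    groups = defaultdict(list)
    for index, name in enumerate(filenames):
        groups[name].append(index)
    documents = list(groups)
    random.Random(seed).shuffle(documents)
    for start in range(0, len(documents), n):
        batch = []
        for name in documents[start:start + n]:
            batch.extend(groups[name])
        yield batch

def prevalence(labels, indices):
    """True class distribution [P(0), P(1)] of the selected instances."""
    positives = sum(labels[i] for i in indices)
    return [1 - positives / len(indices), positives / len(indices)]

filenames = ["a", "a", "b", "b", "b", "c"]
labels = [0, 1, 0, 0, 1, 0]
batches = list(by_document_batches(filenames, n=1))
for batch in batches:
    print(batch, prevalence(labels, batch))
```

Unlike random sampling, every batch here has a size and a prevalence dictated by the documents it contains, so the quantifier is evaluated on the distributions it would actually meet at inference time.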

In [10]:
print('Results on training set:')
result_train = evaluate(collection=abs_dataset.training, n=[1,2,3,4,5], quantifier=quantifier)

print('\nResults on test set:')
result_test = evaluate(collection=abs_dataset.test, n=[1,2,3,4,5], quantifier=quantifier)

Results on training set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=2)	ByDoc (n=3)	ByDoc (n=4)	ByDoc (n=5)
------------	--------	-----------	-----------	-----------	-----------	-----------
AE             	0.0249         	0.0360         	0.0320         	0.0497         	0.0794         	0.1111         
RAE            	0.2635         	0.3390         	0.3188         	0.5028         	0.8205         	1.1640         
MSE            	0.0006         	0.0018         	0.0018         	0.0043         	0.0090         	0.0157         
MAE            	0.0249         	0.0360         	0.0320         	0.0497         	0.0794         	0.1111         
MRAE           	0.2635         	0.3390         	0.3188         	0.5028         	0.8205         	1.1640         
MKLD           	0.0050         	0.0325         	0.0168         	0.0244         	0.0418         	0.0657         

Results on test set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=2)	ByDoc (n=3)	ByDoc (n=4)	ByDoc (n=5)
------------	--------	-----------	-

The results seem promising, although the differences observed with the modified sampling strategy are substantial. This observation led us to investigate the effects of increasing the number of elements per batch, which allows us to notice a decrease in error values that tend to align more closely with the standard random sampling technique. This suggests that we might need to explore several solutions:

- **Modify the Sampling Strategy During Training**: Adjusting the sampling strategy at training time could help the model learn the distribution of components within a single abstract.
  
- **Utilize Longer Documents**: We may consider working with longer documents containing more sentences. For instance, the [dataset suggested by Galassi](https://madoc.bib.uni-mannheim.de/46084/1/argmining-18-multi%20%289%29.pdf) contains entire papers annotated with argument components.

Let’s observe how the current model behaves with some sample instances.

In [37]:
infer(test_set, indexer, comp_quantifier=None, comp_classifier=None, rel_quantifier=quantifier, rel_classifier=cnn_classifier, filename=None, use_tokenizer=True)

File 22310086 - Relations: [0] - Text:


To compare the efficacy and safety of tafluprost, a preservative-free (PF) prostaglandin analogue, with PF timolol in patients with open-angle glaucoma or ocular hypertension.
Randomized, double-masked, multicenter clinical trial.
After discontinuation and washout of existing ocular hypotensive treatment, patients who had intraocular pressure (IOP) =23 and =36 mm Hg in at least 1 eye at the 08:00 hour time point were randomized 1:1 to 12 weeks of treatment with either PF tafluprost 0.0015% or PF timolol 0.5%. IOP was measured 3 times during the day (08:00, 10:00, 16:00 hours) at baseline and at weeks 2, 6, and 12. It was hypothesized that PF tafluprost would be noninferior to PF timolol over 12 weeks with regard to change from baseline IOP. The trial was powered for a noninferiority margin of 1.5 mm Hg at each of the 9 time points assessed.
A total of 643 patients were randomized and 618 completed (PF tafluprost = 306, PF timolol = 312). IOPs at

indexing: 100%|█████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 30143.87it/s]


	# Relations Labels: (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0)
	# Predicted relations labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

	# True distribution relations: [No relation = 0.945054945054945, Relation = 0.054945054945054944]
	# Estimated distribution relations: [No relation = 0.9992772936820984, Relation = 0.0007227116730064154]



indexing: 100%|█████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 45492.45it/s]

	# Estimated distribution relations on tokenized sentence pairs: [No relation = 0.9992772936820984, Relation = 0.0007227116730064154]

	# AE: 0.0542
	# RAE: 0.4775
	# MSE: 0.0029
	# MAE: 0.0542
	# MRAE: 0.4775
	# MKLD: 0.0842





The predictions, although not perfect, appear promising. We compute the posterior estimations in two ways: first, by using sentence tokenization applied to the dataset, taking into account the annotations provided by *Cabrio* and *Villata*; second, by simply applying `sent_tokenize()` to each abstract. The error difference between the two approaches is minimal.
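The per-abstract error values printed above can be recomputed by hand from the true and estimated distributions. The sketch below does so for AE and RAE on file 22310086. It assumes that RAE is computed on additively smoothed prevalences with `eps = 1 / (2 * SAMPLE_SIZE)`, which is our understanding of `QuaPy`'s default, and it uses the sample size of 92 set earlier in the notebook.

```python
def smooth(prevalences, eps):
    """Additive smoothing that keeps the vector summing to 1."""
    return [(p + eps) / (1 + eps * len(prevalences)) for p in prevalences]

def absolute_error(p_true, p_hat):
    return sum(abs(t - h) for t, h in zip(p_true, p_hat)) / len(p_true)

def relative_absolute_error(p_true, p_hat, eps):
    # Smoothing avoids dividing by zero when a class is absent.
    t_s, h_s = smooth(p_true, eps), smooth(p_hat, eps)
    return sum(abs(t - h) / t for t, h in zip(t_s, h_s)) / len(p_true)

# File 22310086: 5 related pairs out of 91.
p_true = [86 / 91, 5 / 91]
p_hat = [0.9992772936820984, 0.0007227116730064154]
eps = 1 / (2 * 92)  # assumed default, with SAMPLE_SIZE = 92

ae = absolute_error(p_true, p_hat)
rae = relative_absolute_error(p_true, p_hat, eps)
print(f"AE={ae:.4f} RAE={rae:.4f}")
```

The relative error is dominated by the rare `Relation` class: the same absolute gap of about 0.054 is small relative to 0.945 but large relative to 0.055.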

### 4. QuaNet with custom training routine

Now, let us attempt the first of the previously proposed modifications. We have adjusted the `QuaNet` training routine so that, in each epoch, every batch contains all the sentences from an abstract.

In [36]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier_custom = QuaNetTrainerABS(cnn_classifier, qp.environ['SAMPLE_SIZE'], device='cpu')
quantifier_custom.fit(abs_dataset.training, fit_classifier=False)


KeyboardInterrupt



In [13]:
print('Results on training set:')
result_train_custom = evaluate(collection=abs_dataset.training, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier_custom)

print('\nResults on test set:')
result_test_custom = evaluate(collection=abs_dataset.test, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier_custom)

Results on training set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=3)	ByDoc (n=5)	ByDoc (n=10)	ByDoc (n=13)	ByDoc (n=15)
------------	--------	-----------	-----------	-----------	------------	------------	------------
AE             	0.1426         	0.1616         	0.1423         	0.1424         	0.1427         	0.1430         	0.1426         
RAE            	0.2651         	0.3329         	0.2715         	0.2679         	0.2669         	0.2675         	0.2661         
MSE            	0.0203         	0.0381         	0.0255         	0.0228         	0.0217         	0.0219         	0.0212         
MAE            	0.1426         	0.1616         	0.1423         	0.1424         	0.1427         	0.1430         	0.1426         
MRAE           	0.2651         	0.3329         	0.2715         	0.2679         	0.2669         	0.2675         	0.2661         
MKLD           	0.0360         	0.0675         	0.0450         	0.0403         	0.0384         	0.0386         	0.0375         

Results on t

In [14]:
filename = random.choice(test_collection.filenames)

print('Standard QuaNet:')
infer(test_set, indexer, comp_quantifier=quantifier, comp_classifier=cnn_classifier, rel_quantifier=None, filename=filename, show_text=False, use_tokenizer=True)

print('Custom approach QuaNet:')
infer(test_set, indexer, comp_quantifier=quantifier_custom, comp_classifier=cnn_classifier, rel_quantifier=None, filename=filename, show_text=False, use_tokenizer=True)

Standard QuaNet:


indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<?, ?it/s]


	# Labels: (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
	# Predicted labels: [0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1]

	# True distribution: [Non-components = 0.625, Components = 0.375]
	# Estimated distribution: [Non-components = 0.7147798538208008, Components = 0.28522008657455444]



indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<?, ?it/s]


	# Estimated distribution on tokenized sentences: [None = 0.7086560130119324, Components = 0.29134401679039]

	# AE: 0.0898
	# RAE: 0.1762
	# MSE: 0.0081
	# MAE: 0.0898
	# MRAE: 0.1762
	# MKLD: 0.0158
Custom approach QuaNet:


indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<?, ?it/s]


	# Labels: (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
	# Predicted labels: [0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1]

	# True distribution: [Non-components = 0.625, Components = 0.375]
	# Estimated distribution: [Non-components = 0.37237972021102905, Components = 0.6276203393936157]



indexing: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<?, ?it/s]

	# Estimated distribution on tokenized sentences: [None = 0.37244391441345215, Components = 0.6275560855865479]

	# AE: 0.2526
	# RAE: 0.4959
	# MSE: 0.0638
	# MAE: 0.2526
	# MRAE: 0.4959
	# MKLD: 0.1122





Unfortunately, we observe generally higher error values with this methodology. Moreover, if we run the previous cell multiple times, we notice that the custom quantifier tends to predict an equal distribution of `Components` and `Non-components`. This might reflect the fact that, on average, both the training and test sets have a roughly equal number of argument components and non-components within each individual abstract.

```
- Train set:
	Label 0: 2760 samples
	Label 1: 2593 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences per file in train set: 13
	Max sentence length: 107
	Average components per file: 6.48
	Average non-components per file: 6.90

- Test set:
	Label 0: 1948 samples
	Label 1: 1880 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences per file in test set: 14
	Max sentence length: 91
	Average components per file: 6.99
	Average non-components per file: 7.24
```