# Experiment 1 - Claims Quantification through Sentence-wise Approach

This is a variant of the first experiment in which only `Claims` and `MajorClaims` count as positive; `Premises` do not.

In [1]:
from experiment_1_code import *

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Antonio\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Antonio\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### 1. Preprocessing
Here we preprocess the data by splitting each abstract into sentences. Each sentence is then labeled `1` if it is a `Claim` and `0` if it is `Not a claim`.
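
As an illustration of this labeling step, here is a minimal sketch. It is not the implementation of `read_brat_dataset`, which is not shown in this notebook: the regex-based splitter, the helper names `split_sentences` and `label_sentences`, and the rule "a sentence is positive if it overlaps an annotated claim span" are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    Returns (start, end) character offsets for each sentence."""
    spans, start = [], 0
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans

def label_sentences(text, claim_spans):
    """Label a sentence 1 if it overlaps any annotated claim span, else 0."""
    rows = []
    for s_start, s_end in split_sentences(text):
        overlaps = any(s_start < c_end and c_start < s_end
                       for c_start, c_end in claim_spans)
        rows.append({"sentence": text[s_start:s_end], "label": int(overlaps)})
    return rows

# Toy abstract with one hypothetical claim annotation on the last sentence.
abstract = ("We enrolled 40 patients. Drug A reduced pressure more than drug B. "
            "Drug A is therefore preferable.")
claim_start = abstract.index("Drug A is")
labelled = label_sentences(abstract, [(claim_start, len(abstract))])
for row in labelled:
    print(row["label"], row["sentence"])
```

The dictionaries reuse the `sentence` and `label` keys that the cells below read from each dataset entry.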

In [2]:
train_set = read_brat_dataset('../data/train/neoplasm_train', positives=['Claim', 'MajorClaim'])
val_set = read_brat_dataset('../data/dev/neoplasm_dev', positives=['Claim', 'MajorClaim'])

glaucoma_test = read_brat_dataset('../data/test/glaucoma_test', positives=['Claim', 'MajorClaim'])
neoplasm_test = read_brat_dataset('../data/test/neoplasm_test', positives=['Claim', 'MajorClaim'])
mixed_test = read_brat_dataset('../data/test/mixed_test', positives=['Claim', 'MajorClaim'])

test_set = glaucoma_test + neoplasm_test + mixed_test

In [3]:
label_counts_train, avg_sentences_per_file_train = compute_dataset_statistics(train_set, dataset_name="train")
label_counts_test, avg_sentences_per_file_test = compute_dataset_statistics(test_set, dataset_name="test")

- Train set:
	Label 0: 3924 samples
	Label 1: 730 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences per file in train set: 13
	Max sentence length: 107
	Average components per file: 2.09
	Average non-components per file: 11.21

- Test set:
	Label 0: 3188 samples
	Label 1: 650 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences per file in test set: 14
	Max sentence length: 91
	Average components per file: 2.43
	Average non-components per file: 11.85



Next, we build the dataset from several `FilenameLabelledCollection` instances, a class that inherits from `QuaPy`'s `LabelledCollection`.

We created this class because we need to keep track of the filename of the abstract to which each sentence belongs. The `index` method is also modified to return `FilenameLabelledCollection` instances.
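
To make the idea concrete, the following is a simplified, dependency-free sketch of a collection that carries filenames alongside instances and labels. The class name `FilenameAwareCollection` and its methods are hypothetical; the actual `FilenameLabelledCollection` subclasses `QuaPy`'s `LabelledCollection` and its code is not reproduced here.

```python
from collections import defaultdict

class FilenameAwareCollection:
    """Hypothetical stand-in: keeps instances, labels and filenames aligned."""

    def __init__(self, instances, labels, filenames):
        if not (len(instances) == len(labels) == len(filenames)):
            raise ValueError("instances, labels and filenames must be aligned")
        self.instances = list(instances)
        self.labels = list(labels)
        self.filenames = list(filenames)

    def by_filename(self):
        """Map each filename to the indices of its sentences."""
        groups = defaultdict(list)
        for idx, name in enumerate(self.filenames):
            groups[name].append(idx)
        return dict(groups)

    def select(self, indices):
        """Return a new collection restricted to the given indices."""
        return FilenameAwareCollection(
            [self.instances[i] for i in indices],
            [self.labels[i] for i in indices],
            [self.filenames[i] for i in indices],
        )

    def prevalence(self):
        """Fraction of positive (label 1) sentences."""
        return sum(self.labels) / len(self.labels)

toy = FilenameAwareCollection(
    ["s1", "s2", "s3", "s4"], [0, 1, 0, 0], ["a.txt", "a.txt", "b.txt", "b.txt"]
)
doc_a = toy.select(toy.by_filename()["a.txt"])
print(doc_a.instances, doc_a.prevalence())
```

Keeping the filenames inside the collection is what later makes it possible to draw whole abstracts as samples instead of random sentences.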

In [4]:
train_collection = FilenameLabelledCollection([data['sentence'] for data in train_set], 
                                                 [data['label'] for data in train_set], 
                                                 [data['filename'] for data in train_set])

# train_collection, val_collection = train_collection.split_stratified(0.7)
# train_collection, val_collection = train_collection.split_stratified_by_filenames(0.75)
val_collection = FilenameLabelledCollection([data['sentence'] for data in val_set], 
                                                 [data['label'] for data in val_set], 
                                                 [data['filename'] for data in val_set])

test_collection = FilenameLabelledCollection([data['sentence'] for data in test_set], 
                                                 [data['label'] for data in test_set], 
                                                 [data['filename'] for data in test_set])

glaucoma_collection = FilenameLabelledCollection([data['sentence'] for data in glaucoma_test], 
                                                 [data['label'] for data in glaucoma_test], 
                                                 [data['filename'] for data in glaucoma_test])

neoplasm_collection = FilenameLabelledCollection([data['sentence'] for data in neoplasm_test], 
                                                 [data['label'] for data in neoplasm_test], 
                                                 [data['filename'] for data in neoplasm_test])

mixed_collection = FilenameLabelledCollection([data['sentence'] for data in mixed_test], 
                                                 [data['label'] for data in mixed_test], 
                                                 [data['filename'] for data in mixed_test])

In [5]:
indexer = qp.data.preprocessing.IndexTransformer(min_df=1)

# Create and index the dataset
abs_dataset = CustomDataset(training=train_collection, test=test_collection, val=val_collection)
index(abs_dataset, indexer, inplace=True)

# Index the test collections
index(glaucoma_collection, indexer, fit=False, inplace=True)
index(neoplasm_collection, indexer, fit=False, inplace=True)
index(mixed_collection, indexer, fit=False, inplace=True)

qp.environ['SAMPLE_SIZE'] = avg_sentences_per_file_train

indexing: 100%|███████████████████████████████████████████████| 4654/4654 [00:00<00:00, 74464.85it/s]
indexing: 100%|█████████████████████████████████████████████████| 708/708 [00:00<00:00, 45312.00it/s]
indexing: 100%|███████████████████████████████████████████████| 3838/3838 [00:00<00:00, 84646.98it/s]
indexing: 100%|███████████████████████████████████████████████| 1291/1291 [00:00<00:00, 82809.75it/s]
indexing: 100%|███████████████████████████████████████████████| 1338/1338 [00:00<00:00, 85645.07it/s]
indexing: 100%|███████████████████████████████████████████████| 1209/1209 [00:00<00:00, 51159.85it/s]


### 2. Classifier
`QuaNet` requires a classifier that can provide embedded representations of the inputs. In the original paper, `QuaNet` was tested using an `LSTM` as the base classifier; as `QuaPy`'s authors show in their [example](https://hlt-isti.github.io/QuaPy/manuals/methods.html#the-quanet-neural-network), we will use an instantiation of `QuaNet` that employs a `CNN` as a probabilistic classifier, taking its last layer representation as the document embedding.

We reuse the hyperparameters tuned for the previous model, although tuning them for this variant could lead to better results.

In [6]:
set_seed(42)

embedding_size = 180
hidden_size = 269   
lr = 0.0009964893016712443

cnn_module = CNNnet(
    abs_dataset.vocabulary_size,
    abs_dataset.training.n_classes,
    embedding_size=embedding_size,
    hidden_size=hidden_size
)

optimizer = Adam(cnn_module.parameters(), lr=lr)
scheduler = CosineAnnealingLR(optimizer, T_max=2)

cnn_classifier = ScheduledNeuralClassifierTrainer(
    cnn_module,
    lr_scheduler=scheduler,
    optim = optimizer,
    device='cpu',
    checkpointpath='../checkpoints/claims/classifier_net.dat',
    padding_length=107,
    patience=10
)

cnn_classifier.fit(*abs_dataset.training.Xy, *abs_dataset.val.Xy)

[NeuralNetwork running on cpu]


[CNNnet] training epoch=32 tr-loss=0.01027 tr-acc=99.76% tr-macroF1=99.55% patience=1/10 val-loss=0.5


Training ended by patience exhausted; loading best model parameters from ../checkpoints/claims/classifier_net.dat from epoch 22
Performing a final training pass over the validation set...
[Training complete] - Best loss on validation set: 0.4284893870353699 - Best f1 on validation set: 0.8128353056432426


<experiment_1_code.ScheduledNeuralClassifierTrainer at 0x1a099ab63c0>

In [7]:
f1_train = 1-qp.error.f1e(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
accuracy_train = 1-qp.error.acce(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
print('- Train set:')
print(f'\tF1: {f1_train}')    
print(f'\tAccuracy: {accuracy_train}')    

f1_test_glaucoma = 1-qp.error.f1e(glaucoma_collection.labels, cnn_classifier.predict(glaucoma_collection.instances))
accuracy_test_glaucoma = 1-qp.error.acce(glaucoma_collection.labels, cnn_classifier.predict(glaucoma_collection.instances))

print('- Glaucoma test set:')
print(f'\tF1: {f1_test_glaucoma}')    
print(f'\tAccuracy: {accuracy_test_glaucoma}')    

f1_test_neoplasm = 1-qp.error.f1e(neoplasm_collection.labels, cnn_classifier.predict(neoplasm_collection.instances))
accuracy_test_neoplasm = 1-qp.error.acce(neoplasm_collection.labels, cnn_classifier.predict(neoplasm_collection.instances))

print('- Neoplasm test set:')
print(f'\tF1: {f1_test_neoplasm}')    
print(f'\tAccuracy: {accuracy_test_neoplasm}')

f1_test_mixed = 1-qp.error.f1e(mixed_collection.labels, cnn_classifier.predict(mixed_collection.instances))
accuracy_test_mixed = 1-qp.error.acce(mixed_collection.labels, cnn_classifier.predict(mixed_collection.instances))

print('- Mixed test set:')
print(f'\tF1: {f1_test_mixed}')    
print(f'\tAccuracy: {accuracy_test_mixed}')

- Train set:
	F1: 0.9995935971377502
	Accuracy: 0.9997851310700473
- Glaucoma test set:
	F1: 0.763537600192457
	Accuracy: 0.8962044926413633
- Neoplasm test set:
	F1: 0.7571175278622087
	Accuracy: 0.8714499252615845
- Mixed test set:
	F1: 0.756428775527134
	Accuracy: 0.8734491315136477


In [8]:
infer(test_set, indexer, comp_quantifier=None, comp_classifier=cnn_classifier, rel_quantifier=None, rel_classifier=None, filename=None, use_tokenizer=True)

File 9643663 - Text:
 To evaluate the efficacy and tolerability of 'Casodex' monotherapy (150 mg daily) for metastatic and locally advanced prostate cancer. A total of 1,453 patients with either confirmed metastatic disease (M1), or T3/T4 non-metastatic disease with elevated prostate-specific antigen (M0) were recruited into one of two identical, multicentre, randomised studies to compare 'Casodex' 150 mg/day with castration. The protocols allowed for combined analysis. At a median follow-up period of approximately 100 weeks for both studies, 'Casodex' 150 mg was found to be less effective than castration in patients with metastatic disease (M1) at entry (hazard ratio of 1.30 for time to death) with a difference in median survival of 6 weeks. In symptomatic M1 patients, 'Casodex' was associated with a statistically significant improvement in subjective response (70%) compared with castration (58%). Analysis of a validated quality-of-life questionnaire proved an advantage for 'Casodex' 

indexing: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:00<?, ?it/s]

	Classification:
		# Ground truth components: (0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
		# Predicted components labels: [0 0 0 0 1 0 0 0 1 0 1 1 1]





### 3. QuaNet 
The results are solid, so let's move on to the `QuaNet` training phase. `QuaNet` observes the classification predictions to learn higher-order *quantification embeddings*, which are then refined by incorporating quantification predictions of simple classify-and-count-like methods.

![architecture](./images/quanet_architecture.png)

The QuaNet architecture (see Figure 1) consists of two main components: a **recurrent component** and a **fully connected component**.

#### 3.1 Recurrent Component: Bidirectional LSTM
- The core of the model is a **Bidirectional LSTM** (Long Short-Term Memory), a type of recurrent neural network. 
- The LSTM receives as input a **list of pairs** $⟨Pr(c|x), \vec{x}⟩$, where:
  - $Pr(c|x)$ is the probability that a classifier $h$ assigns class $c$ to document $x$.
  - $\vec{x}$ is the **document embedding**, a vector representing the document's content.
- The list is **sorted by the value of $Pr(c|x)$**, meaning the documents are arranged from least to most likely to belong to class $c$.
  
The **intuition** behind this approach is that the LSTM will "learn to count" positive and negative examples. By observing the ordered sequence of probabilities, the LSTM should learn to recognize the point where the documents switch from negative to positive examples. The document embedding $\vec{x}$ helps the LSTM assign different importance to each document when making its prediction.

The output of the LSTM is called a **quantification embedding**—a dense vector representing the information about the quantification task learned from the input data.
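
The construction of the LSTM input described above can be sketched in plain Python. This is only an illustration under assumptions: the helper name `build_lstm_input`, the choice to prepend the posterior vector to each embedding, and the toy values are not taken from `QuaPy`'s source.

```python
def build_lstm_input(posteriors, embeddings, target_class=1):
    """Sort documents by Pr(c|x) for the target class (ascending) and
    pair each posterior vector with its document embedding."""
    order = sorted(range(len(posteriors)),
                   key=lambda i: posteriors[i][target_class])
    return [list(posteriors[i]) + list(embeddings[i]) for i in order]

# Toy data: three documents, binary posteriors, 1-dimensional embeddings.
posteriors = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
embeddings = [[1.0], [2.0], [3.0]]

sequence = build_lstm_input(posteriors, embeddings)
for step in sequence:
    print(step)
```

After sorting, the least likely positive comes first and the most likely positive last, so a recurrent model reading the sequence sees the probabilities rise monotonically.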

#### 3.2 Fully Connected Component
- The vector returned by the LSTM is combined with additional information, specifically **quantification-related statistics**:
  - $\hat{p}_c^{CC}(D)$, $\hat{p}_c^{ACC}(D)$, $\hat{p}_c^{PCC}(D)$, and $\hat{p}_c^{PACC}(D)$, which are quantification predictions from different methods.
  - $tpr_b$, $fpr_b$, $tpr_s$, and $fpr_s$, aggregate statistics related to true positive and false positive rates, which are easy to compute from the classifier $h$ using a validation set.

This combined vector then passes through the second part of the architecture, which is made up of **fully connected layers** with **ReLU activations**. These layers adjust the quantification embedding using the additional statistics from the classifier to improve the accuracy of the quantification.

The final output is a prediction $\hat{p}_c^{QuaNet}(c|D)$, the estimated prevalence of class $c$ in the dataset $D$, produced by a **softmax layer**.

QuaNet could use quantification predictions from many methods, but it focuses on those that are **computationally efficient** (like CC, ACC, PCC, and PACC). This ensures that the process remains fast while still providing sufficient information for accurate predictions.
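
The four inexpensive estimates can be computed from classifier outputs alone. The sketch below uses the standard binary formulas (adjusted estimates are obtained as `(raw - fpr) / (tpr - fpr)` and clipped to `[0, 1]`); the helper names and the toy validation data are assumptions for illustration, not `QuaPy` calls.

```python
def rates(labels, scores):
    """Mean score over true positives (tpr) and over true negatives (fpr).
    With hard 0/1 scores these are tpr_b/fpr_b; with posteriors, tpr_s/fpr_s."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def adjust(raw, tpr, fpr):
    """Adjusted count: invert the classifier's bias, then clip to [0, 1]."""
    if tpr == fpr:
        return raw  # adjustment undefined; fall back to the raw estimate
    return min(1.0, max(0.0, (raw - fpr) / (tpr - fpr)))

# Toy validation set used to estimate the rates.
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_post = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
val_hard = [int(p >= 0.5) for p in val_post]
tpr_b, fpr_b = rates(val_labels, val_hard)
tpr_s, fpr_s = rates(val_labels, val_post)

# Toy sample whose positive prevalence we want to estimate.
sample_post = [0.9, 0.6, 0.4, 0.2, 0.1]
sample_hard = [int(p >= 0.5) for p in sample_post]

p_cc = sum(sample_hard) / len(sample_hard)   # Classify and Count
p_acc = adjust(p_cc, tpr_b, fpr_b)           # Adjusted Classify and Count
p_pcc = sum(sample_post) / len(sample_post)  # Probabilistic Classify and Count
p_pacc = adjust(p_pcc, tpr_s, fpr_s)         # Probabilistic Adjusted Classify and Count
print(p_cc, p_acc, p_pcc, p_pacc)
```

Each estimate costs a single pass over the predictions, which is why they are cheap enough to feed into the network as extra features.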

#### 3.3 Details

| Layer | Type | Dimensions | Activation | Dropout |
|---|---|---|---|---|
| Input | LSTM | 128 | N/A | N/A |
| Dense 1 | Dense | 1024 | ReLU | 0.5 |
| Dense 2 | Dense | 512 | ReLU | 0.5 |
| Output | Dense | 2 | Softmax | N/A |

- The LSTM has **64 hidden dimensions**, and since it’s bidirectional, the final LSTM output has **128 dimensions**.
- This LSTM output is concatenated with the **8 quantification statistics** (giving a total of 136 dimensions), which is then fed into:
  - **Two dense layers** with **1,024** and **512 dimensions**, each using **ReLU activation** and **0.5 dropout**.
  - Finally, the output is passed through a **softmax layer** of size 2 to produce the final prevalence estimate.
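
The dimensions above can be checked with a little arithmetic. This sketch only counts weights and biases of the fully connected layers, using the usual `in * out + out` rule for a linear layer; the helper name `linear_params` is introduced here for illustration.

```python
def linear_params(n_in, n_out):
    """Number of trainable parameters in a dense layer with bias."""
    return n_in * n_out + n_out

lstm_hidden = 64   # hidden size per direction
n_stats = 8        # 4 quantification estimates + 4 tpr/fpr statistics
ff_input = 2 * lstm_hidden + n_stats  # bidirectional output plus statistics

layer_sizes = [ff_input, 1024, 512, 2]
params = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(ff_input)     # 136
print(params)       # [140288, 524800, 1026]
print(sum(params))  # 666114
```

The value `136` matches the `in_features` of the first linear layer in the `QuaNetModule` summary printed after training.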


In [9]:
#train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
set_seed(42)

quantifier = QuaNet(cnn_classifier, qp.environ['SAMPLE_SIZE'], qdrop_p=0, device='cpu', checkpointdir='../checkpoints/claims', checkpointname='Quanet-Claims')
quantifier.fit(abs_dataset.training, fit_classifier=False)

QuaNetModule(
  (lstm): LSTM(102, 64, batch_first=True, bidirectional=True)
  (dropout): Dropout(p=0, inplace=False)
  (ff_layers): ModuleList(
    (0): Linear(in_features=136, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=512, bias=True)
  )
  (output): Linear(in_features=512, out_features=2, bias=True)
)


[QuaNet] epoch=1 [it=499/500]	tr-mseloss=0.00481 tr-maeloss=0.03717	val-mseloss=-1.00000 val-maeloss=
100%|█████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 104.56it/s]
[QuaNet] epoch=2 [it=499/500]	tr-mseloss=0.00065 tr-maeloss=0.01784	val-mseloss=0.00022 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 95.11it/s]
[QuaNet] epoch=3 [it=499/500]	tr-mseloss=0.00037 tr-maeloss=0.01340	val-mseloss=0.00233 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 88.57it/s]
[QuaNet] epoch=4 [it=499/500]	tr-mseloss=0.00018 tr-maeloss=0.00897	val-mseloss=0.00040 val-maeloss=0
100%|██████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 95.10it/s]
[QuaNet] epoch=5 [it=499/500]	tr-mseloss=0.00098 tr-maeloss=0.02128	val-mseloss=0.00012 val

training ended by patience exhausted; loading best model parameters in ../checkpoints/claims\Quanet-Claims for epoch 28





We wrapped `QuaPy`'s error evaluation function and manually modified how each sample is selected; we adjusted the sampling strategy to work with batches where the batch size is equal to the number of sentences that compose each abstract. This allows us to select the entire document based on the filename associated with each sentence. We will also evaluate the results using the standard random sampling technique, where sentences from different abstracts are grouped into the same batch.
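
The by-document sampling idea can be sketched as follows. This is not the code of our `evaluate` wrapper: the function name `document_batches`, the `docs_per_batch` parameter (assumed here to correspond to the `n` values passed below), and the shuffling with a fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def document_batches(filenames, docs_per_batch, seed=0):
    """Yield lists of sentence indices, each covering whole documents.
    Sentences of the same file are never split across batches."""
    groups = defaultdict(list)
    for idx, name in enumerate(filenames):
        groups[name].append(idx)
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    for start in range(0, len(names), docs_per_batch):
        batch = []
        for name in names[start:start + docs_per_batch]:
            batch.extend(groups[name])
        yield batch

# Toy collection: three abstracts with 2, 1 and 3 sentences.
filenames = ["a", "a", "b", "c", "c", "c"]
labels = [0, 1, 0, 1, 1, 0]

single_doc = list(document_batches(filenames, docs_per_batch=1))
two_docs = list(document_batches(filenames, docs_per_batch=2))

for batch in single_doc:
    true_prev = sum(labels[i] for i in batch) / len(batch)
    print(sorted(batch), round(true_prev, 4))
```

With one document per batch, each sample's true prevalence is that of a single abstract, which is the setting closest to the intended use of the quantifier.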

In [10]:
print('- Train set:')
result_train = evaluate(collection=abs_dataset.training, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

print('- Val set:')
result_val = evaluate(collection=abs_dataset.val, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

print('- Glaucoma test set:')
result_test = evaluate(collection=glaucoma_collection, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

print('- Neoplasm test set:')
result_test = evaluate(collection=neoplasm_collection, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

print('- Mixed test set:')
result_test = evaluate(collection=mixed_collection, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

- Train set:
Error Metric	Standard	ByDoc (n=1)	ByDoc (n=3)	ByDoc (n=5)	ByDoc (n=10)	ByDoc (n=13)	ByDoc (n=15)
------------	--------	-----------	-----------	-----------	------------	------------	------------
AE             	0.0091         	0.0050         	0.0113         	0.0104         	0.0097         	0.0096         	0.0096         
RAE            	0.0284         	0.0182         	0.0414         	0.0360         	0.0318         	0.0311         	0.0310         
MSE            	0.0001         	0.0001         	0.0002         	0.0001         	0.0001         	0.0001         	0.0001         
MAE            	0.0091         	0.0050         	0.0113         	0.0104         	0.0097         	0.0096         	0.0096         
MRAE           	0.0284         	0.0182         	0.0414         	0.0360         	0.0318         	0.0311         	0.0310         
MKLD           	0.0002         	0.0002         	0.0007         	0.0005         	0.0003         	0.0003         	0.0003         
- Val set:
Error Metric	S

The results seem promising, although the errors obtained with the modified sampling strategy differ substantially from those of standard random sampling. This led us to investigate the effect of increasing the number of elements per batch: as the batch grows, the error values decrease and align more closely with those of the standard random sampling technique. This suggests a few directions worth exploring:

- **Modify the Sampling Strategy During Training**: Adjusting the sampling strategy at training time could help the model learn the distribution of components within a single abstract.
  
- **Utilize Longer Documents**: We may consider working with longer documents containing more sentences. For instance, the [dataset suggested by Galassi](https://madoc.bib.uni-mannheim.de/46084/1/argmining-18-multi%20%289%29.pdf) contains entire papers annotated with argument components.

Let’s observe how the current model behaves with some sample instances.

In [11]:
infer(test_set, indexer, comp_quantifier=quantifier, comp_classifier=cnn_classifier, filename=None, use_tokenizer=True)

File 9643663 - Text:
 To evaluate the efficacy and tolerability of 'Casodex' monotherapy (150 mg daily) for metastatic and locally advanced prostate cancer. A total of 1,453 patients with either confirmed metastatic disease (M1), or T3/T4 non-metastatic disease with elevated prostate-specific antigen (M0) were recruited into one of two identical, multicentre, randomised studies to compare 'Casodex' 150 mg/day with castration. The protocols allowed for combined analysis. At a median follow-up period of approximately 100 weeks for both studies, 'Casodex' 150 mg was found to be less effective than castration in patients with metastatic disease (M1) at entry (hazard ratio of 1.30 for time to death) with a difference in median survival of 6 weeks. In symptomatic M1 patients, 'Casodex' was associated with a statistically significant improvement in subjective response (70%) compared with castration (58%). Analysis of a validated quality-of-life questionnaire proved an advantage for 'Casodex' 

indexing: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:00<?, ?it/s]


	Quantification:
		# True distribution components: [Class 0 = 0.6154, Class 1 = 0.3846]
		# Estimated distribution components: [Class 0 = 0.6111, Class 1 = 0.3889]


indexing: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:00<?, ?it/s]

	# Estimated distribution on tokenized components: [Class 0 = 0.6111, Class 1 = 0.3889]

	# AE: 0.0042
	# RAE: 0.0083
	# MSE: 0.0000
	# MAE: 0.0042
	# MRAE: 0.0083
	# MKLD: 0.0000
	Classification:
		# Ground truth components: (0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
		# Predicted components labels: [0 0 0 0 1 0 0 0 1 0 1 1 1]





The predictions, although not perfect, appear promising. We compute the posterior estimations in two ways: first, by using sentence tokenization applied to the dataset, taking into account the annotations provided by *Cabrio* and *Villata*; second, by simply applying `sent_tokenize()` to each abstract. The error difference between the two approaches is minimal.
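
For reference, the per-sample error figures reported above can be approximated with a few lines of Python. The sketch assumes that the relative error is computed on additively smoothed prevalences with `eps = 1 / (2 * sample_size)`, a common convention in quantification; the function names are introduced here and the printed values are only expected to be close to the log, since the log shows rounded prevalences.

```python
def absolute_error(true_prev, est_prev):
    """Mean absolute difference between two prevalence vectors."""
    return sum(abs(t - e) for t, e in zip(true_prev, est_prev)) / len(true_prev)

def smooth(prev, eps):
    """Additive smoothing, so that no class has zero prevalence."""
    return [(p + eps) / (1 + eps * len(prev)) for p in prev]

def relative_absolute_error(true_prev, est_prev, eps):
    """Mean of |true - est| / true, computed on smoothed prevalences."""
    t, e = smooth(true_prev, eps), smooth(est_prev, eps)
    return sum(abs(a - b) / a for a, b in zip(t, e)) / len(t)

# Values for the 13-sentence abstract shown above (5 positive sentences).
sample_size = 13
true_prev = [8 / 13, 5 / 13]
est_prev = [0.6111, 0.3889]   # rounded estimates copied from the log
eps = 1 / (2 * sample_size)   # assumed smoothing factor

ae = absolute_error(true_prev, est_prev)
rae = relative_absolute_error(true_prev, est_prev, eps)
print(f"AE:  {ae:.4f}")
print(f"RAE: {rae:.4f}")
```

Because a single abstract contains few sentences, one misplaced sentence shifts the prevalence by roughly `1/13`, so small absolute errors at this scale indicate that the estimate is within a fraction of a sentence of the truth.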