# Experiment 3 - Sentence-wise
We are now testing a different approach. Instead of considering the document as a whole and using the number of arguments as the label, we split each abstract into sentences and label them as either *Non-component* or *Component*, corresponding to 0 and 1 respectively.

During inference, we apply a sampling technique that processes all the sentences of an abstract together. Our goal is to estimate the percentage of argument components each abstract contains. From there, several approaches can be explored to complete the task. One is to use a regressor that learns the number of arguments in an abstract from the sentence distribution. Another is to apply a model like the one in the first experiment, this time using the number of relations as the label. The resulting data, combined with the premise/claim distribution, can then be used to estimate the number of arguments.
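
As a rough illustration of the quantity we are after, the share of component sentences in each abstract can be computed from sentence-level labels as sketched below. The function name `component_share` and the toy filenames are assumptions made for this example; they are not part of `experiment_3_code`.

```python
from collections import defaultdict

def component_share(filenames, labels):
    """Return {filename: fraction of sentences labelled 1 (Component)}."""
    counts = defaultdict(lambda: [0, 0])  # filename -> [n_components, n_sentences]
    for name, label in zip(filenames, labels):
        counts[name][0] += int(label == 1)
        counts[name][1] += 1
    return {name: comp / total for name, (comp, total) in counts.items()}

# Toy input: two hypothetical abstracts with 4 and 2 sentences.
filenames = ["doc_a", "doc_a", "doc_a", "doc_a", "doc_b", "doc_b"]
labels = [1, 0, 1, 1, 0, 0]
shares = component_share(filenames, labels)
print(shares)
```

A quantifier aims to estimate these per-abstract shares directly, without relying on every single sentence being classified correctly.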

This experiment is still in progress. To-do list:
- Ensure the sentences are split fairly; ✅
- Train using the same sampling technique as during inference;
- Fix the sampling technique and inference functions; ✅
- Investigate the proposed approaches;

NOTE: the `epoch` method of `QuaNetTrainer` must be rewritten to change the sampling strategy.

In [1]:
from experiment_3_code import *

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Antonio\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Antonio\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### 1. Preprocessing
Here we preprocess our data by splitting the abstracts into sentences. Each sentence is then labeled `1` if it is a `Component` of an argument and `0` if it is a `Non-component`.
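
The labeling step can be pictured with the minimal sketch below. It assumes that a sentence counts as a `Component` when it overlaps at least one annotated character span, and it uses a naive regex splitter instead of the `nltk` tokenizer loaded above. The helper names and the offsets are hypothetical, and the actual rule in `read_brat_dataset` may differ.

```python
import re

def split_with_offsets(text):
    """Naive sentence splitter returning (start, end, sentence) triples."""
    triples = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        sentence = raw.strip()
        if sentence:
            start = match.start() + (len(raw) - len(raw.lstrip()))
            triples.append((start, start + len(sentence), sentence))
    return triples

def label_sentences(text, component_spans):
    """Label a sentence 1 if it overlaps any annotated component span, else 0."""
    labelled = []
    for start, end, sentence in split_with_offsets(text):
        overlaps = any(start < c_end and c_start < end
                       for c_start, c_end in component_spans)
        labelled.append({"sentence": sentence, "label": int(overlaps)})
    return labelled

text = "Aspirin reduced pain. The trial had 40 patients. Aspirin is effective."
# Hypothetical character offsets of two annotated components (brat-style).
spans = [(0, 20), (49, 69)]
labelled = label_sentences(text, spans)
for item in labelled:
    print(item["label"], item["sentence"])
```

A regex splitter like this breaks on decimal points such as `P < .001`, which is one reason a dedicated sentence tokenizer is preferable on real abstracts.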

In [2]:
train_set = read_brat_dataset('../data/train/neoplasm_train') 
val_set = read_brat_dataset('../data/dev/neoplasm_dev')

glaucoma_test = read_brat_dataset('../data/test/glaucoma_test')
neoplasm_test = read_brat_dataset('../data/test/neoplasm_test')
mixed_test = read_brat_dataset('../data/test/mixed_test')

test_set = glaucoma_test + neoplasm_test + mixed_test

In [3]:
label_counts_train, avg_sentences_per_file_train = compute_dataset_statistics(train_set, dataset_name="train")
label_counts_val, avg_sentences_per_file_val = compute_dataset_statistics(val_set, dataset_name="val")
label_counts_test, avg_sentences_per_file_test = compute_dataset_statistics(test_set, dataset_name="test")

- Train set:
	Label 0: 2378 samples
	Label 1: 2267 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences per file in train set: 13
	Max sentence length: 107
	Average components per file: 6.48
	Average non-components per file: 6.79

- Val set:
	Label 0: 382 samples
	Label 1: 326 samples

	There are 2 different labels in the val set -> [0, 1]
	Average number of sentences per file in val set: 14
	Max sentence length: 91
	Average components per file: 6.52
	Average non-components per file: 7.64

- Test set:
	Label 0: 1948 samples
	Label 1: 1880 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences per file in test set: 14
	Max sentence length: 91
	Average components per file: 6.99
	Average non-components per file: 7.24



Below is an example of what an entry in our dictionary looks like.

In [4]:
display_file_info(train_set, filename=None)

File 17075117 - Text:
 In the randomized, multinational phase II/III trial (V325) of untreated advanced gastric cancer patients, the phase II part selected docetaxel, cisplatin, and fluorouracil (DCF) over docetaxel and cisplatin for comparison against cisplatin and fluorouracil (CF; reference regimen) in the phase III part. Advanced gastric cancer patients were randomly assigned to docetaxel 75 mg/m2 and cisplatin 75 mg/m2 (day 1) plus fluorouracil 750 mg/m2/d (days 1 to 5) every 3 weeks or cisplatin 100 mg/m2 (day 1) plus fluorouracil 1,000 mg/m2/d (days 1 to 5) every 4 weeks. The primary end point was time-to-progression (TTP). In 445 randomly assigned and treated patients (DCF = 221; CF = 224), TTP was longer with DCF versus CF (32% risk reduction; log-rank P < .001). Overall survival was longer with DCF versus CF (23% risk reduction; log-rank P = .02). Two-year survival rate was 18% with DCF and 9% with CF. Overall response rate was higher with DCF (chi2 P = .01). Grade 3 to 4 tre

Next, we create the dataset class `FilenameLabelledCollection`, which inherits from `QuaPy`'s `LabelledCollection`. It keeps track of the filename of the abstract each sentence belongs to. The `index` method is also modified to return two `FilenameLabelledCollection` instances.
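
To make the idea concrete, here is a simplified, dependency-free sketch of a collection that remembers filenames. The class `FilenameAwareCollection` and its methods are hypothetical stand-ins written for this example. They do not reproduce the real `FilenameLabelledCollection`, which builds on `QuaPy`'s `LabelledCollection`.

```python
from collections import defaultdict

class FilenameAwareCollection:
    """Toy labelled collection that remembers the source file of each sentence."""

    def __init__(self, instances, labels, filenames):
        if not (len(instances) == len(labels) == len(filenames)):
            raise ValueError("instances, labels and filenames must have equal length")
        self.instances = list(instances)
        self.labels = list(labels)
        self.filenames = list(filenames)

    def indices_by_filename(self):
        """Map each filename to the positions of its sentences."""
        groups = defaultdict(list)
        for position, name in enumerate(self.filenames):
            groups[name].append(position)
        return dict(groups)

    def abstract(self, filename):
        """Return a new collection holding only the sentences of one abstract."""
        idx = self.indices_by_filename()[filename]
        return FilenameAwareCollection(
            [self.instances[i] for i in idx],
            [self.labels[i] for i in idx],
            [self.filenames[i] for i in idx],
        )

    def prevalence(self):
        """Fractions of Non-component (0) and Component (1) sentences."""
        positives = sum(self.labels) / len(self.labels)
        return [1 - positives, positives]

collection = FilenameAwareCollection(["s1", "s2", "s3"], [1, 0, 1], ["a", "a", "b"])
doc = collection.abstract("a")
print(doc.instances, doc.prevalence())
```

Keeping the filenames next to the labels is what later makes it possible to draw a whole abstract as a single sample.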

In [5]:
train_collection = FilenameLabelledCollection([data['sentence'] for data in train_set], 
                                                 [data['label'] for data in train_set], 
                                                 [data['filename'] for data in train_set])

val_collection = FilenameLabelledCollection([data['sentence'] for data in val_set], 
                                                 [data['label'] for data in val_set], 
                                                 [data['filename'] for data in val_set])

test_collection = FilenameLabelledCollection([data['sentence'] for data in test_set], 
                                                 [data['label'] for data in test_set], 
                                                 [data['filename'] for data in test_set])

glaucoma_collection = FilenameLabelledCollection([data['sentence'] for data in glaucoma_test], 
                                                 [data['label'] for data in glaucoma_test], 
                                                 [data['filename'] for data in glaucoma_test])

neoplasm_collection = FilenameLabelledCollection([data['sentence'] for data in neoplasm_test], 
                                                 [data['label'] for data in neoplasm_test], 
                                                 [data['filename'] for data in neoplasm_test])

mixed_collection = FilenameLabelledCollection([data['sentence'] for data in mixed_test], 
                                                 [data['label'] for data in mixed_test], 
                                                 [data['filename'] for data in mixed_test])

In [6]:
indexer = qp.data.preprocessing.IndexTransformer(min_df=1)

# Create and index the dataset
abs_dataset = CustomDataset(training=train_collection, test=test_collection, val=val_collection)
index(abs_dataset, indexer, inplace=True)

# Index the test collections
index(glaucoma_collection, indexer, fit=False, inplace=True)
index(neoplasm_collection, indexer, fit=False, inplace=True)
index(mixed_collection, indexer, fit=False, inplace=True)

qp.environ['SAMPLE_SIZE'] = avg_sentences_per_file_train

indexing: 100%|███████████████████████████████████████████████| 4645/4645 [00:00<00:00, 27156.62it/s]


AttributeError: 'Dataset' object has no attribute 'val'

### 2. Classifier
`QuaNet` requires a classifier that can provide embedded representations of the inputs. In the original paper, `QuaNet` was tested with an `LSTM` as the base classifier. Following the [example](https://hlt-isti.github.io/QuaPy/manuals/methods.html#the-quanet-neural-network) by `QuaPy`'s authors, we instead use an instantiation of `QuaNet` that employs a `CNN` as the probabilistic classifier and takes its last-layer representation as the document embedding.

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(lambda trial: objective(trial, abs_dataset), n_trials=50)

print("Best trial:")
trial = study.best_trial
print(f"  Macro F1 Score: {trial.value}")
print("  Params:")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

[I 2024-10-30 13:51:07,611] A new study created in memory with name: no-name-512b1d1c-904d-4a95-bd09-dffc24c34565


Starting trial 0 with parameters:
    Embedding size: 191 - Hidden size: 269
    Optimizer: Adam (lr: 7.429468607521178e-05) - Scheduler: CosineAnnealingWarmRestarts (params: {'T_0': 11, 'T_mult': 3})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=33 tr-loss=0.02935 tr-acc=99.41% tr-macroF1=99.41% patience=1/5 val-loss=0.43


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 28
Performing a final training pass over the validation set...


[I 2024-10-30 14:03:52,321] Trial 0 finished with value: 0.84483459057097 and parameters: {'embedding_size': 191, 'hidden_size': 269, 'optimizer': 'Adam', 'lr': 7.429468607521178e-05, 'scheduler': 'CosineAnnealingWarmRestarts', 'T_0': 11, 'T_mult': 3}. Best is trial 0 with value: 0.84483459057097.


[Training complete] - Best loss on validation set: 0.3879459351301193 - Best f1 on validation set: 0.84483459057097
Starting trial 1 with parameters:
    Embedding size: 106 - Hidden size: 278
    Optimizer: Adam (lr: 4.496657263459981e-05) - Scheduler: CosineAnnealingLR (params: {'T_max': 11})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=61 tr-loss=0.11613 tr-acc=96.00% tr-macroF1=95.99% patience=1/5 val-loss=0.39


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 56
Performing a final training pass over the validation set...


[I 2024-10-30 14:18:56,683] Trial 1 finished with value: 0.8256494720010918 and parameters: {'embedding_size': 106, 'hidden_size': 278, 'optimizer': 'Adam', 'lr': 4.496657263459981e-05, 'scheduler': 'CosineAnnealingLR', 'T_max': 11}. Best is trial 0 with value: 0.84483459057097.


[Training complete] - Best loss on validation set: 0.3825526237487793 - Best f1 on validation set: 0.8256494720010918
Starting trial 2 with parameters:
    Embedding size: 163 - Hidden size: 281
    Optimizer: Adam (lr: 2.1239254662200434e-05) - Scheduler: CosineAnnealingLR (params: {'T_max': 12})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=58 tr-loss=0.22415 tr-acc=92.47% tr-macroF1=92.47% patience=1/5 val-loss=0.40


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 53
Performing a final training pass over the validation set...


[I 2024-10-30 14:38:59,793] Trial 2 finished with value: 0.8053998632946002 and parameters: {'embedding_size': 163, 'hidden_size': 281, 'optimizer': 'Adam', 'lr': 2.1239254662200434e-05, 'scheduler': 'CosineAnnealingLR', 'T_max': 12}. Best is trial 0 with value: 0.84483459057097.


[Training complete] - Best loss on validation set: 0.4101656898856163 - Best f1 on validation set: 0.8053998632946002
Starting trial 3 with parameters:
    Embedding size: 170 - Hidden size: 298
    Optimizer: Adam (lr: 1.806464467046538e-05) - Scheduler: CosineAnnealingLR (params: {'T_max': 12})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=55 tr-loss=0.26673 tr-acc=90.45% tr-macroF1=90.44% patience=1/5 val-loss=0.41


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 50
Performing a final training pass over the validation set...


[I 2024-10-30 14:59:15,080] Trial 3 finished with value: 0.8064693187048302 and parameters: {'embedding_size': 170, 'hidden_size': 298, 'optimizer': 'Adam', 'lr': 1.806464467046538e-05, 'scheduler': 'CosineAnnealingLR', 'T_max': 12}. Best is trial 0 with value: 0.84483459057097.


[Training complete] - Best loss on validation set: 0.4288260191679001 - Best f1 on validation set: 0.8064693187048302
Starting trial 4 with parameters:
    Embedding size: 159 - Hidden size: 284
    Optimizer: Adam (lr: 1.1268958994171102e-05) - Scheduler: CosineAnnealingWarmRestarts (params: {'T_0': 9, 'T_mult': 3})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=53 tr-loss=0.41050 tr-acc=82.89% tr-macroF1=82.86% patience=1/5 val-loss=0.49


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 48
Performing a final training pass over the validation set...


[I 2024-10-30 15:16:42,187] Trial 4 finished with value: 0.783466862023817 and parameters: {'embedding_size': 159, 'hidden_size': 284, 'optimizer': 'Adam', 'lr': 1.1268958994171102e-05, 'scheduler': 'CosineAnnealingWarmRestarts', 'T_0': 9, 'T_mult': 3}. Best is trial 0 with value: 0.84483459057097.


[Training complete] - Best loss on validation set: 0.4912678897380829 - Best f1 on validation set: 0.783466862023817
Starting trial 5 with parameters:
    Embedding size: 173 - Hidden size: 256
    Optimizer: Adam (lr: 0.00043601308174167394) - Scheduler: CosineAnnealingLR (params: {'T_max': 9})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=17 tr-loss=0.01548 tr-acc=99.55% tr-macroF1=99.55% patience=1/5 val-loss=0.50


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 12
Performing a final training pass over the validation set...


[I 2024-10-30 15:22:32,221] Trial 5 finished with value: 0.8491569287709215 and parameters: {'embedding_size': 173, 'hidden_size': 256, 'optimizer': 'Adam', 'lr': 0.00043601308174167394, 'scheduler': 'CosineAnnealingLR', 'T_max': 9}. Best is trial 5 with value: 0.8491569287709215.


[Training complete] - Best loss on validation set: 0.45491214841604233 - Best f1 on validation set: 0.8491569287709215
Starting trial 6 with parameters:
    Embedding size: 151 - Hidden size: 258
    Optimizer: Adam (lr: 0.0005346212833028434) - Scheduler: CosineAnnealingLR (params: {'T_max': 3})
[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(self.checkpointpath))
[CNNnet] training epoch=28 tr-loss=0.01646 tr-acc=99.41% tr-macroF1=99.41% patience=1/5 val-loss=0.66


Training ended by patience exhausted; loading best model parameters from ../checkpoint/components/classifier_net.dat from epoch 23
Performing a final training pass over the validation set...


[I 2024-10-30 15:31:24,667] Trial 6 finished with value: 0.8549063962891201 and parameters: {'embedding_size': 151, 'hidden_size': 258, 'optimizer': 'Adam', 'lr': 0.0005346212833028434, 'scheduler': 'CosineAnnealingLR', 'T_max': 3}. Best is trial 6 with value: 0.8549063962891201.


[Training complete] - Best loss on validation set: 0.5792213827371597 - Best f1 on validation set: 0.8549063962891201
Starting trial 7 with parameters:
    Embedding size: 130 - Hidden size: 292
    Optimizer: Adam (lr: 2.3918643253843326e-05) - Scheduler: CosineAnnealingWarmRestarts (params: {'T_0': 11, 'T_mult': 3})
[NeuralNetwork running on cpu]


[CNNnet] epoch=27 lr=0.00002 tr-loss=0.41912 tr-F1=82.02% patience=4/5 val-loss=0.49307 val-F1=77.52%

In [None]:
set_seed(42)

embedding_size = 161
hidden_size = 262   

cnn_module = CNNnet(
    abs_dataset.vocabulary_size,
    abs_dataset.training.n_classes,
    embedding_size=embedding_size,
    hidden_size=hidden_size
)

optimizer = Adam(cnn_module.parameters(), lr=8e-4)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=5, T_mult=4)

cnn_classifier = ScheduledNeuralClassifierTrainer(
    cnn_module,
    lr_scheduler=scheduler,
    optim=optimizer,
    device='cpu',
    checkpointpath='../checkpoint/relations/classifier_net.dat',
    padding_length=107,
    patience=5
)

cnn_classifier.fit(*abs_dataset.training.Xy, *abs_dataset.val.Xy)

In [7]:
# train the text classifier:
# cnn_module = CNNnet(abs_dataset.vocabulary_size, abs_dataset.training.n_classes, padding=1)
cnn_module = CNNnet(abs_dataset.vocabulary_size, abs_dataset.training.n_classes)
cnn_classifier = NeuralClassifierTrainer(cnn_module, device='cpu', checkpointpath='../checkpoint/components/classifier_net.dat')
cnn_classifier.fit(*abs_dataset.training.Xy)

[NeuralNetwork running on cpu]


  self.net.load_state_dict(torch.load(checkpoint))
[CNNnet] training epoch=12 tr-loss=0.00068 tr-acc=100.00% tr-macroF1=100.00% patience=1/10 val-loss=1


training ended by patience exhasted; loading best model parameters in ../checkpoint/components/classifier_net.dat for epoch 2
performing one training pass over the validation set...
[done]


<quapy.classification.neural.NeuralClassifierTrainer at 0x272742a1dc0>

In [8]:
f1_train = 1-qp.error.f1e(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
accuracy_train = 1-qp.error.acce(abs_dataset.training.labels, cnn_classifier.predict(abs_dataset.training.instances))
print('- Train set:')
print(f'\tF1: {f1_train}')    
print(f'\tAccuracy: {accuracy_train}')    

f1_test = 1-qp.error.f1e(abs_dataset.test.labels, cnn_classifier.predict(abs_dataset.test.instances))
accuracy_test = 1-qp.error.acce(abs_dataset.test.labels, cnn_classifier.predict(abs_dataset.test.instances))
print('- Composed test set:')
print(f'\tF1: {f1_test}')    
print(f'\tAccuracy: {accuracy_test}')    

f1_test_glaucoma = 1-qp.error.f1e(glaucoma_collection.labels, cnn_classifier.predict(glaucoma_collection.instances))
accuracy_test_glaucoma = 1-qp.error.acce(glaucoma_collection.labels, cnn_classifier.predict(glaucoma_collection.instances))

print('- Glaucoma test set:')
print(f'\tF1: {f1_test_glaucoma}')    
print(f'\tAccuracy: {accuracy_test_glaucoma}')    

f1_test_neoplasm = 1-qp.error.f1e(neoplasm_collection.labels, cnn_classifier.predict(neoplasm_collection.instances))
accuracy_test_neoplasm = 1-qp.error.acce(neoplasm_collection.labels, cnn_classifier.predict(neoplasm_collection.instances))

print('- Neoplasm test set:')
print(f'\tF1: {f1_test_neoplasm}')    
print(f'\tAccuracy: {accuracy_test_neoplasm}')

f1_test_mixed = 1-qp.error.f1e(mixed_collection.labels, cnn_classifier.predict(mixed_collection.instances))
accuracy_test_mixed = 1-qp.error.acce(mixed_collection.labels, cnn_classifier.predict(mixed_collection.instances))

print('- Mixed test set:')
print(f'\tF1: {f1_test_mixed}')    
print(f'\tAccuracy: {accuracy_test_mixed}')

- Train set:
	F1: 0.9517755830616709
	Accuracy: 0.9518027274425556
- Composed test set:
	F1: 0.8653922324991364
	Accuracy: 0.8654649947753396
- Glaucoma test set:
	F1: 0.8775207855977778
	Accuracy: 0.8781055900621118
- Neoplasm test set:
	F1: 0.8648209791345296
	Accuracy: 0.8648648648648649
- Mixed test set:
	F1: 0.8525908213408213
	Accuracy: 0.8526490066225165


### 3. QuaNet 
The results are solid, so we move on to the `QuaNet` training phase. `QuaNet` observes the classification predictions to learn higher-order *quantification embeddings*, which are then refined by incorporating the quantification predictions of simple classify-and-count-like methods.

![architecture](./images/quanet_architecture.png)

The QuaNet architecture (see Figure 1) consists of two main components: a **recurrent component** and a **fully connected component**.

#### 3.1 Recurrent Component: Bidirectional LSTM
- The core of the model is a **Bidirectional LSTM** (Long Short-Term Memory), a type of recurrent neural network. 
- The LSTM receives as input a **list of pairs** $\langle Pr(c|x), \vec{x} \rangle$, where:
  - $Pr(c|x)$ is the probability that a classifier $h$ assigns class $c$ to document $x$.
  - $\vec{x}$ is the **document embedding**, a vector representing the document's content.
- The list is **sorted by the value of $Pr(c|x)$**, meaning the documents are arranged from least to most likely to belong to class $c$.
  
The **intuition** behind this approach is that the LSTM will "learn to count" positive and negative examples. By observing the ordered sequence of probabilities, the LSTM should learn to recognize the point where the documents switch from negative to positive examples. The document embedding $\vec{x}$ helps the LSTM assign different importance to each document when making its prediction.

The output of the LSTM is called a **quantification embedding**—a dense vector representing the information about the quantification task learned from the input data.
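
The input sequence described in this subsection can be sketched as follows for the binary case. The function `build_quanet_sequence` and the toy vectors are assumptions made for illustration; the actual `QuaPy` implementation works on tensors and may arrange the features differently.

```python
def build_quanet_sequence(posteriors, embeddings):
    """Pair each posterior Pr(c|x) with its embedding and sort by posterior."""
    pairs = sorted(zip(posteriors, embeddings), key=lambda pair: pair[0])
    # Each LSTM step would see the vector [Pr(c|x), x_1, ..., x_d].
    return [[p, *embedding] for p, embedding in pairs]

# Three hypothetical sentences with 2-dimensional embeddings.
posteriors = [0.9, 0.1, 0.6]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sequence = build_quanet_sequence(posteriors, embeddings)
for step in sequence:
    print(step)
```

After sorting, the least likely positives come first, so the switch from negative to positive examples appears as a single transition along the sequence.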

#### 3.2 Fully Connected Component
- The vector returned by the LSTM is combined with additional information, specifically **quantification-related statistics**:
  - $\hat{p}_c^{CC}(D)$, $\hat{p}_c^{ACC}(D)$, $\hat{p}_c^{PCC}(D)$, and $\hat{p}_c^{PACC}(D)$, which are quantification predictions from different methods.
  - $tpr_b$, $fpr_b$, $tpr_s$, and $fpr_s$, aggregate statistics related to true positive and false positive rates, which are easy to compute from the classifier $h$ using a validation set.

This combined vector then passes through the second part of the architecture, which is made up of **fully connected layers** with **ReLU activations**. These layers adjust the quantification embedding using the additional statistics from the classifier to improve the accuracy of the quantification.

The final output is a prediction $\hat{p}_c^{QuaNet}(c|D)$, the estimated prevalence of class $c$ in the dataset $D$, produced by a **softmax layer**.

QuaNet could use quantification predictions from many methods, but it focuses on those that are **computationally efficient** (like CC, ACC, PCC, and PACC). This ensures that the process remains fast while still providing sufficient information for accurate predictions.
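
For reference, the four simple estimators and the rate statistics listed above can be computed as in this plain-Python sketch for the binary case. It assumes that the `b` subscript refers to rates from hard (binary) decisions and the `s` subscript to rates from soft posteriors. All numbers are made up for illustration, and the function names are not `QuaPy`'s API.

```python
def clip01(value):
    """Keep a prevalence estimate inside [0, 1]."""
    return min(1.0, max(0.0, value))

def cc(posteriors, threshold=0.5):
    """Classify and Count: fraction of items predicted positive."""
    return sum(p >= threshold for p in posteriors) / len(posteriors)

def pcc(posteriors):
    """Probabilistic Classify and Count: mean positive posterior."""
    return sum(posteriors) / len(posteriors)

def rates(val_posteriors, val_labels, soft=False, threshold=0.5):
    """Estimate (tpr, fpr) on validation data from hard or soft decisions."""
    score = (lambda p: p) if soft else (lambda p: float(p >= threshold))
    pos = [score(p) for p, y in zip(val_posteriors, val_labels) if y == 1]
    neg = [score(p) for p, y in zip(val_posteriors, val_labels) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def adjust(estimate, tpr, fpr):
    """Adjusted count: solve estimate = tpr * p + fpr * (1 - p) for p."""
    if tpr == fpr:
        return estimate  # correction undefined, fall back to the raw estimate
    return clip01((estimate - fpr) / (tpr - fpr))

# Made-up validation and test posteriors, for illustration only.
val_posteriors = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
val_labels = [1, 1, 1, 0, 0, 0]
test_posteriors = [0.9, 0.7, 0.2, 0.1]

tpr_b, fpr_b = rates(val_posteriors, val_labels)             # hard decisions
tpr_s, fpr_s = rates(val_posteriors, val_labels, soft=True)  # soft posteriors

estimates = {
    "CC": cc(test_posteriors),
    "ACC": adjust(cc(test_posteriors), tpr_b, fpr_b),
    "PCC": pcc(test_posteriors),
    "PACC": adjust(pcc(test_posteriors), tpr_s, fpr_s),
}
print(estimates)
```

All four estimates need only the classifier outputs and a handful of validation statistics, which is why they are cheap enough to feed into the network as extra features.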

### Details

| Layer | Type | Dimensions | Activation | Dropout |
|---|---|---|---|---|
| Input | LSTM | 128 | N/A | N/A |
| Dense 1 | Dense | 1024 | ReLU | 0.5 |
| Dense 2 | Dense | 512 | ReLU | 0.5 |
| Output | Dense | 2 | Softmax | N/A |

- The LSTM has **64 hidden dimensions**, and since it’s bidirectional, the final LSTM output has **128 dimensions**.
- This LSTM output is concatenated with the **8 quantification statistics** (giving a total of 136 dimensions), which is then fed into:
  - **Two dense layers** with **1,024** and **512 dimensions**, each using **ReLU activation** and **0.5 dropout**.
  - Finally, the output is passed through a **softmax layer** of size 2 to produce the final class prevalence estimates.
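
The dimension arithmetic above can be checked with a few lines of Python. The sketch assumes the fully connected layers are stacked exactly as listed (136 → 1024 → 512 → 2) and counts weights plus biases; it does not inspect the actual `QuaNet` module.

```python
lstm_hidden = 64   # hidden units per LSTM direction
n_stats = 8        # CC, ACC, PCC, PACC, tpr_b, fpr_b, tpr_s, fpr_s
layer_sizes = [2 * lstm_hidden + n_stats, 1024, 512, 2]

def dense_parameters(sizes):
    """Weights plus biases of each fully connected layer."""
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]

params = dense_parameters(layer_sizes)
print("input width:", layer_sizes[0])
print("parameters per dense layer:", params)
print("total dense parameters:", sum(params))
```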


In [None]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier = QuaNet(cnn_classifier, qp.environ['SAMPLE_SIZE'], device='cpu', checkpointdir='../checkpoint/components', checkpointname='Quanet-Components')
quantifier.fit(abs_dataset.training, fit_classifier=False)

We wrapped `QuaPy`'s error evaluation function and changed how each sample is selected: the sampling strategy now works with batches whose size equals the number of sentences in each abstract, so that an entire document is selected based on the filename associated with each sentence. We also evaluate the results with the standard random sampling technique, where sentences from different abstracts are grouped into the same batch.
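
The two sampling protocols can be contrasted with the sketch below. The helper names, the constant estimator and the toy data are hypothetical; the real `evaluate` wrapper in `experiment_3_code` calls the trained quantifier and `QuaPy`'s error functions instead.

```python
import random
from collections import defaultdict

def per_abstract_samples(filenames):
    """One sample per abstract: the indices of all its sentences."""
    groups = defaultdict(list)
    for i, name in enumerate(filenames):
        groups[name].append(i)
    return list(groups.values())

def random_samples(n_items, sample_size, n_samples, seed=0):
    """Standard protocol: sentences drawn regardless of their abstract."""
    rng = random.Random(seed)
    return [rng.sample(range(n_items), sample_size) for _ in range(n_samples)]

def mean_absolute_error(samples, labels, estimate_prevalence):
    """Average |true - estimated| component prevalence over the samples."""
    errors = []
    for idx in samples:
        true_prev = sum(labels[i] for i in idx) / len(idx)
        errors.append(abs(true_prev - estimate_prevalence(idx)))
    return sum(errors) / len(errors)

filenames = ["a", "a", "a", "b", "b"]
labels = [1, 1, 0, 0, 0]

def always_half(idx):
    """Placeholder estimator that ignores its input."""
    return 0.5

abstract_samples = per_abstract_samples(filenames)
mae_abstract = mean_absolute_error(abstract_samples, labels, always_half)
mae_random = mean_absolute_error(random_samples(len(labels), 3, 10), labels, always_half)
print(abstract_samples, mae_abstract, mae_random)
```

With per-abstract sampling the sample size varies from document to document, whereas random sampling fixes it in advance.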

In [None]:
print('Results on training set:')
result_train = evaluate(collection=abs_dataset.training, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

print('\nResults on test set:')
result_test = evaluate(collection=abs_dataset.test, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier)

The results seem promising, although the errors obtained with the per-abstract sampling strategy differ substantially from those obtained with random sampling. This led us to investigate the effect of increasing the number of sentences per batch: as the batch grows, the error values decrease and move closer to those of the standard random sampling technique. This suggests two directions worth exploring:

- **Modify the Sampling Strategy During Training**: Adjusting the sampling strategy at training time could help the model learn the distribution of components within a single abstract.
  
- **Utilize Longer Documents**: We may consider working with longer documents containing more sentences. For instance, the [dataset suggested by Galassi](https://madoc.bib.uni-mannheim.de/46084/1/argmining-18-multi%20%289%29.pdf) contains entire papers annotated with argument components.
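
The link between sample size and error can be illustrated with a small simulation of plain classify-and-count. It assumes balanced classes and a classifier with symmetric errors and 86% accuracy, a value loosely based on the test accuracy reported earlier. It is a toy model of the effect, not a reproduction of the results in this notebook.

```python
import random

def simulated_cc_error(sample_size, n_trials=1000, accuracy=0.86, seed=0):
    """Mean absolute error of classify-and-count on samples of a given size."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        labels = [rng.random() < 0.5 for _ in range(sample_size)]
        # Flip each label with probability 1 - accuracy to mimic the classifier.
        preds = [y if rng.random() < accuracy else not y for y in labels]
        total += abs(sum(preds) - sum(labels)) / sample_size
    return total / n_trials

errors = {n: simulated_cc_error(n) for n in (5, 13, 50, 200)}
for n, err in errors.items():
    print(f"sample size {n:>3}: mean absolute error {err:.3f}")
```

In this toy setting misclassifications cancel out more reliably in large samples, so an abstract of about 13 sentences is a much harder quantification target than a batch of several hundred.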

Let’s observe how the current model behaves with some sample instances.

In [None]:
infer(test_set, indexer, comp_quantifier=quantifier, comp_classifier=cnn_classifier, filename=None, use_tokenizer=True)

The predictions, although not perfect, appear promising. We compute the posterior estimations in two ways: first, by using sentence tokenization applied to the dataset, taking into account the annotations provided by *Cabrio* and *Villata*; second, by simply applying `sent_tokenize()` to each abstract. The error difference between the two approaches is minimal.

### 4. QuaNet with custom training routine

Now, let us attempt the first of the previously proposed modifications. We have adjusted the `QuaNet` training routine so that, in each epoch, every batch contains all the sentences from an abstract.

In [None]:
# train QuaNet (alternatively, we can set fit_classifier=True and let QuaNet train the classifier)
quantifier_custom = QuaNetTrainerABS(cnn_classifier, qp.environ['SAMPLE_SIZE'], device='cpu', checkpointdir='../checkpoint/components-custom', checkpointname='Quanet-Components')
quantifier_custom.fit(abs_dataset.training, fit_classifier=False)

In [None]:
print('Results on training set:')
result_train_custom = evaluate(collection=abs_dataset.training, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier_custom)

print('\nResults on test set:')
result_test_custom = evaluate(collection=abs_dataset.test, n=[1,3,5,10, qp.environ['SAMPLE_SIZE'], 15], quantifier=quantifier_custom)

In [None]:
filename = random.choice(test_collection.filenames)

print('Standard QuaNet:')
infer(test_set, indexer, comp_quantifier=quantifier, comp_classifier=cnn_classifier, rel_quantifier=None, filename=filename, show_text=False, use_tokenizer=True)

print('Custom approach QuaNet:')
infer(test_set, indexer, comp_quantifier=quantifier_custom, comp_classifier=cnn_classifier, rel_quantifier=None, filename=filename, show_text=False, use_tokenizer=True)

Unfortunately, we observe generally higher error values with this methodology. Moreover, if we run the previous cell multiple times, we notice that the custom quantifier tends to predict an equal distribution of `Components` and `Non-components`. This might reflect the fact that, on average, both the training and test sets have a roughly equal number of argument components and non-components within each individual abstract.

```text
- Train set:
	Label 0: 2760 samples
	Label 1: 2593 samples

	There are 2 different labels in the train set -> [0, 1]
	Average number of sentences per file in train set: 13
	Max sentence length: 107
	Average components per file: 6.48
	Average non-components per file: 6.90

- Test set:
	Label 0: 1948 samples
	Label 1: 1880 samples

	There are 2 different labels in the test set -> [0, 1]
	Average number of sentences per file in test set: 14
	Max sentence length: 91
	Average components per file: 6.99
	Average non-components per file: 7.24
```