<a target="_blank" href="https://colab.research.google.com/github/fracapuano/Spectroid/blob/main/MLP_Classification/Specter_FineTuning.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

>*Note*: When executing in Colab, please be sure to install the dependencies and download the data and trained models needed to reproduce our results. 

Please refer to [these instructions](https://github.com/fracapuano/deep_NLP/blob/main/README.md?plain=1#L172-L178) for details. 
You can install the required dependencies by turning the following cell into a code cell and running it.


!pip install transformers datasets rich fastlangid wandb

# SPECTER Fine-Tuning

In this second extension to the SPECTER paper, we study how to fine-tune the pre-trained SPECTER embedder. In what follows, we use the terms *SPECTER* and *the embedder* interchangeably.

The [SPECTER paper](https://arxiv.org/abs/2004.07180) suggests that classification of the embeddings $e_i \in \mathbb R^{d}\ \forall i \in \mathcal X$ can be carried out using classical Machine Learning algorithms such as SVM. In particular, the authors' findings suggest that a Linear Kernel SVM (with a tuned value of $C$) obtains the following classification performance:

|         Task        | Macro F1 Score |
|:-------------------:|:--------------:|
| MeSH Classification |      87.7      |
|  MAG Classification |      79.4      |
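
For reference, the macro F1 score is the unweighted mean of the per-class F1 scores. The snippet below is a minimal, dependency-free sketch of that definition on toy labels (the function name `macro_f1` and the toy data are illustrative, not taken from the SPECTER code base). Note that the table reports the score on a 0-100 scale, whereas the sketch returns a value in $[0, 1]$.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# toy example with three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)
print(f"macro F1: {score:.4f}")
```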

The basic intuition behind fine-tuning pretrained models is that the loss obtained on a downstream task can be used to adjust the model's embeddings, *i.e.* the embeddings can be changed (albeit only slightly, hence the term *fine-tuning*) to better perform a given task, such as the classification task we focus on in this extension. 

In general, classification consists of learning a discriminative function $\hat f:\mathcal X \mapsto \mathcal Y$ that maps data points in a given feature space $\mathcal X$ to their corresponding label (**one** of a finite number in a label set $\mathcal Y$). The original intuition of SPECTER's authors is that one can decouple the task of Text Classification into two main subcomponents: 

1. **Natural Language Embedding**, *i.e.* the task of obtaining contextualized numerical representations of textual data

2. **Embedding Classification**, *i.e.* the task of actually classifying such numerical representations using any given classification algorithm (that is, learning $\hat f$).


<p align="center">
    <img width=1000 src="https://i.ibb.co/Dp6K4wt/SPECTERclassification.png" alt="ext2-scheme" border="0">
</p>

The process just displayed is one in which each paper $P_i$ is first embedded through $\texttt{SPECTER}$ into the corresponding embedding $e_i$. 
Later on, traditional Machine Learning techniques (here represented with the scikit-learn symbol) are used to learn the discriminative function $\hat f$ that (hopefully) minimizes the classification error $\Vert l_i - \hat f(e_i) \Vert_{p}$ over all papers $i$, for some $p$-norm. Here, the loss function is used only to "learn" $\hat f$. If one uses an SVM to classify the embedded papers, it is possible to reproduce the results of SPECTER, *i.e.* to correctly classify the majority of *classification-static* paper embeddings.

<p align="center">
    <img width=1000 src="https://i.ibb.co/G7cJtLj/improvement.png" alt="ext2-scheme" border="0">
</p>

Although this approach performs very well in a variety of situations, the information about the loss can in principle be used differently. In particular, one can propagate the loss back to also change the procedure with which the very same papers are embedded. This idea relies on the simple yet possibly very powerful intuition that embeddings produced by SPECTER might suffer from over-generalization (*i.e.*, they might not be specific enough for the tackled task) when used in the context of Text Classification.

The fact that the embeddings might be slightly sub-optimal for classification tasks follows from the fact that they are produced with the aim of obtaining high-quality (citation-network informed) numerical representations of scientific papers. From this perspective, Text Classification is simply a downstream activity and does not enter the pipeline in its initial stages. Indeed, it is no more relevant to the procedure used to embed papers than other tasks such as Citation Prediction. 
This clearly limits the possibility of using SPECTER to its fullest in one specific application, since the embeddings it produces might simply not be tailored to that application. 

Our intuition is that one can **chain** the two steps on which Text Classification is based, thus unifying the whole process. 
After an often very extensive and data-intensive phase of pre-training, the embeddings produced by SPECTER are fed into a **Classification Head** (CH) based on a Multi-Layer Perceptron architecture. This allows a complete flow of information not only between the CH parameters and the classification output, but also between the SPECTER embedding model and the classification output itself.

Theoretically, this flow of information can be used to tweak (or better, **fine-tune**) SPECTER parameters specifically for classification (or really any downstream task).

This is justified by the fact that, if one has a **labelled** dataset $\tau$ defined as: 

$$
\begin{equation}
\tau =  \{ \mathcal P_i \vert l_i \}_{i = 1, \dots, \vert \mathcal X \vert}
\end{equation}
$$

Then, in the bottom part of the diagram, it is clear that the classification function $\hat f: \mathcal X \mapsto \mathcal Y$ is applied to any given paper $\mathcal P$ as follows: 

$$
\begin{equation}
g_{\text{bottom}}(\mathcal P) = \hat f(\texttt{SPECTER}(\mathcal P))
\end{equation}
$$

It follows that, if one measures the misclassification error with a differentiable loss function $L(y, \hat y) \in \mathbb R^+$ (such as the cross-entropy), then in practice one observes that:

$$
\begin{equation}
\frac{\partial L}{\partial w_\texttt{SPECTER}} \neq 0
\end{equation}
$$

Now, of course, one cannot expect the extensively trained weights of SPECTER to change significantly for one specific task: as an encoder, SPECTER's job is, at the end of the day, to turn text into meaningful and contextual *general-purpose* numerical representations. 
Indeed, most of the information in $\tau$ is used to train the CH on top of SPECTER.
Nevertheless, the embeddings are indeed updated, so that classification is performed in a dynamic feature space whose geometry is affected by the actual task rather than being independent of it.
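
The difference between the decoupled pipeline and the chained one can be illustrated with a small PyTorch sketch. This is only a toy under stated assumptions: `encoder` is a hypothetical stand-in for SPECTER (a single linear layer, not the real model), and the features and labels are random. It shows that, when the head is chained to the encoder, the loss gradient with respect to the encoder weights is non-zero, whereas with static embeddings only the head receives a training signal.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the SPECTER encoder (NOT the real model):
# any differentiable module producing 768-dimensional embeddings works here.
encoder = nn.Sequential(nn.Linear(32, 768), nn.Tanh())
head = nn.Linear(768, 11)  # linear classification head, 11 classes as in MeSH

features = torch.randn(8, 32)        # toy "paper" features, batch of 8
labels = torch.randint(0, 11, (8,))  # toy labels

# Chained pipeline: the loss is back-propagated through head AND encoder.
loss = nn.functional.cross_entropy(head(encoder(features)), labels)
loss.backward()
encoder_grad_norm = encoder[0].weight.grad.norm().item()
head_grad_norm = head.weight.grad.norm().item()
print(f"chained: encoder grad norm {encoder_grad_norm:.4f}, head grad norm {head_grad_norm:.4f}")

# Decoupled pipeline: embeddings are computed without tracking gradients,
# so the encoder receives no training signal at all.
encoder.zero_grad()
head.zero_grad()
with torch.no_grad():
    static_embeddings = encoder(features)
nn.functional.cross_entropy(head(static_embeddings), labels).backward()
# None or all zeros, depending on the PyTorch version's zero_grad behaviour
frozen_encoder_grad = encoder[0].weight.grad
```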

In [1]:
# use wandb to track experiments and trainings -- uncomment next line to login with wandb account.
# !wandb login

In [2]:
# seed this notebook
from commons.utils import *
from transformers import AutoTokenizer, AutoModel

seedEverything(seed=321)  # specter's seed

# load SPECTER pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = AutoModel.from_pretrained('allenai/specter')


The data needed for the considered task are stored in the `data` folder. However, they cannot be used directly for the task presented here, since the textual data are stored separately from the corresponding labels. 

Should the data folder be empty (or not present) in your version, you can create it by simply running: 
```bash
$ bash commons/getdata.sh
```

This bash file will download the `data` folder so that later steps of analysis are possible.

Inside the `data` folder, one can find:

1. `paper_metadata_mag_mesh.json`, which contains various features, such as:
    
   `pid`: Paper-Id, string (*e.g.*, `00021eeee2bf4e06fec98941206f97083c38b54d`).
   
   `abstract`: Paper abstract, text.
   
   `cited_by`: List. Citation list in which each element is the `pid` of another paper citing this one.
   
   `references`: List. Citation list in which each element is the `pid` of another paper cited by this one.
   
   `title`: Paper title, text.
   
   
2. `mag/{train/val/test}.csv` which is organized as:

    `pid`: Paper-Id, string, (*e.g.*, `00021eeee2bf4e06fec98941206f97083c38b54d`).
    
    `label`: Label, int. An integer value in the 0-18 range representing one of the [MAG classes](https://github.com/allenai/scidocs/blob/ebf239d30d70062b4111f9e3a8efe2b3d3f3d303/README.md?plain=1#L121-L139)


3. `mesh/{train/val/test}.csv` which is organized as:

    `pid`: Paper-Id, string, (*e.g.*, `00021eeee2bf4e06fec98941206f97083c38b54d`).
    
    `label`: Label, int. An integer value in the 0-10 range representing one of the [MeSH classes](https://github.com/allenai/scidocs/blob/ebf239d30d70062b4111f9e3a8efe2b3d3f3d303/README.md?plain=1#L106-L116)
    

This clearly indicates that the data need to be preprocessed to make them usable for the considered model. 

In particular, the data undergo: 
1. A step of **cleaning**, in which invalid papers are removed from the pool of those considered later. Given our limited computational resources, we chose to discard papers that lack either a title or an abstract, as well as papers that are not in English. This reduces the original dataset size by ~23%.
2. A step in which they are **joined with the labelled data** (which is in `mesh` and `mag`).
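
The two steps above can be sketched with pandas on toy data. This is an illustrative sketch only: the toy frames, their values and the `class_label` column name are assumptions, the language-filtering step is omitted, and the helper functions in `commons` may be implemented differently.

```python
import pandas as pd

# Toy metadata indexed by paper id (hypothetical values).
metadata = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", None, "Paper D"],
        "abstract": ["Abstract A", None, "Abstract C", "Abstract D"],
    },
    index=pd.Index(["p1", "p2", "p3", "p4"], name="pid"),
)

# Toy labels, also indexed by paper id.
labels = pd.DataFrame(
    {"class_label": [3, 7, 1]},
    index=pd.Index(["p1", "p2", "p4"], name="pid"),
)

# Cleaning: keep only papers that have both a title and an abstract.
clean = metadata.dropna(subset=["title", "abstract"])

# Join: an inner join keeps only the papers present in both tables.
labelled = labels.join(clean, how="inner")
print(labelled)
```

Here `p2` and `p3` are dropped during cleaning, and the inner join leaves only `p1` and `p4`, each with its label, title and abstract.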

In [3]:
from commons.data_utils import *
# perform data pre-processing
scidocs = load_metadata()
mag = load_dataset(dataset="mag").join(scidocs, how="inner")
mesh = load_dataset(dataset="mesh").join(scidocs, how="inner")

del scidocs

Total number of papers in SciDocs: 48473
Total number of papers after data removing abstract/title lacking papers: 37556
Total number of papers after data removing non english papers: 37227


Once the different data sources have been merged, a little effort is needed to interface Pandas with the DL framework used here, that is, PyTorch.

Here, we use Hugging Face Datasets as a middle ground to obtain our final result.

Several steps are needed to turn a `pd.DataFrame` into something one can use to train a PyTorch model. These are:

1. Rename the `class_label` column to `labels` (as per the PyTorch API).
2. Concatenate title and abstract, using the formula: `title + tokenizer.sep_token + abstract`.
3. Obtain the numerical representation of the concatenated title and abstract, using a tokenizer.
4. Remove all unused columns from the dataset.
   
Here, we perform all four steps by calling a custom function defined in `commons`.
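
As a rough illustration of steps 1, 2 and 4, the sketch below processes a single record in plain Python. The function `prepare_record` and the hard-coded `"[SEP]"` separator are assumptions made for this example (in the notebook the separator comes from `tokenizer.sep_token`), and step 3 (tokenization) is deliberately left out since it requires the actual tokenizer.

```python
# Assumption: a BERT-style separator; the notebook uses tokenizer.sep_token instead.
SEP_TOKEN = "[SEP]"


def prepare_record(record, sep_token=SEP_TOKEN):
    """Rename the label column, build the model input text, drop unused columns."""
    return {
        "labels": record["class_label"],                            # step 1
        "text": record["title"] + sep_token + record["abstract"],   # step 2
        # step 4: every other column (pid, cited_by, references, ...) is dropped
    }


# hypothetical raw record, mimicking the joined metadata and labels
raw = {
    "pid": "p1",
    "class_label": 3,
    "title": "A title",
    "abstract": "An abstract.",
    "cited_by": [],
    "references": [],
}
prepared = prepare_record(raw)
print(prepared)
```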

In [4]:
# takes ~45 seconds
mesh_hf, mag_hf = [
    tokenize_hf(
        hf = to_hf_dataset(dataset=dataset), 
        tokenizer=tokenizer
        )
    for dataset in [mesh, mag]
]

# set torch format for the considered data
mesh_hf.set_format("torch")
mag_hf.set_format("torch")


The MeSH and MAG embeddings can be obtained through the original `SPECTER` model.
Please notice that embedding these datasets may take up to ~30 minutes, depending on the computational resources available. 

Should you want to use pre-computed embeddings, they are available in the `embeddings` folder in `data`.

In [5]:
import numpy as np
import torch

from commons.model_utils import embed_data

do_embed = False
if do_embed:
    # embedding takes approximately 30 mins
    mesh_embeddings = embed_data(model=model, data=mesh_hf.remove_columns("labels"))
    mag_embeddings = embed_data(model=model, data=mag_hf.remove_columns("labels"))
else:
    # alternatively, read pre-computed embeddings from the data folder
    mesh_embeddings = torch.from_numpy(np.loadtxt("data/embeddings/mesh/mesh_embeddings.txt"))
    mag_embeddings = torch.from_numpy(np.loadtxt("data/embeddings/mag/mag_embeddings.txt"))


Since the embeddings may be produced with the task at hand in mind, we build a `SPECTER` model with a classification head mounted on top. The classification head is one out of three candidates, chosen according to an analysis of the degree of non-linearity needed to correctly separate the given classes. 

<p align="center">
    <img width=750 src="https://i.ibb.co/2qKZgfY/architectures.png" border="0">
</p>


These three classification heads have been chosen according to the degree of non-linearity needed. In particular, we tested three alternatives: 
1. `CH1 = Linear(768, m)`, where `m` is the number of class labels (19 and 11 for MAG and MeSH, respectively)
2. `CH2 = Sequential(Linear(768, n1), ReLU(), Linear(n1, m))`, where `n1` is a chosen hidden dimension
3. `CH3 = Sequential(Linear(768, n1), ReLU(), ...,  Linear(n4, m))`, where `5` linear layers are stacked on top of one another, separated by ReLU activation functions.
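
The three heads listed above can be sketched in PyTorch as follows. The helper `make_head` and the hidden dimensions are hypothetical choices made for illustration; the values of `n1`, ..., `n4` actually used live in the configuration dictionaries in `commons` and may differ.

```python
import torch
from torch import nn

EMBED_DIM = 768  # size of the SPECTER embeddings


def make_head(hidden_dims, n_classes):
    """Stack Linear layers separated by ReLU; no activation after the last layer."""
    dims = [EMBED_DIM, *hidden_dims, n_classes]
    layers = []
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if i < len(dims) - 2:  # every layer but the last is followed by a ReLU
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


n_classes = 11  # MeSH; use 19 for MAG
ch1 = make_head([], n_classes)                    # single linear layer
ch2 = make_head([256], n_classes)                 # hypothetical n1
ch3 = make_head([512, 256, 128, 64], n_classes)   # hypothetical n1..n4, 5 linear layers

# each head maps a batch of embeddings to one logit per class
logits = ch3(torch.randn(4, EMBED_DIM))
print(tuple(logits.shape))
```

With these definitions, `CH1` has only `768 * 11 + 11` trainable parameters for MeSH, so almost all of the model's capacity sits in the encoder underneath.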

Since Neural Networks can identify decision boundaries that become increasingly non-linear as the complexity of the architecture increases, these three candidate architectures introduce fairly different degrees of non-linearity.

In particular: 
1. `CH1` introduces a small degree of non-linearity, mainly by shifting the embeddings' positions. This shift in the latent space happens in a fairly non-linear fashion (because of the non-linearity of the underlying BERT architecture). In this context, the decision boundary is learned mainly by moving the points in the latent space so as to ease classification with a linear decision boundary (the one obtained with a single layer and no activation function).
2. `CH2` introduces considerably more non-linearity than `CH1`, although clearly less than `CH3`. Here, the loss information is used both to learn a possibly complex decision boundary *and* to move the embeddings in the considered latent space.
3. `CH3` introduces the largest amount of non-linearity in the decision boundary among the architectures considered, *given* the embeddings.

We tested all three configurations on each task (both MAG and MeSH classification) and obtained results fairly in line with our expectations.

# MeSH Classification

In [6]:
# four config dictionaries (one per classification head considered)
from commons.utils import mesh_config_1, mesh_config_2, mesh_config_3, mesh_config_2bis
mesh_splits = mesh_hf.train_test_split(test_size=mesh_config_1["test_size"])

## `CH1`: Really limited non-linearity
<p align="center">
    <img width=500 src="https://i.ibb.co/gzDg1M3/architecture1.png" alt="ext2-scheme" border="0">
</p>

In [7]:
from commons.experiment import Experiment
do_track=False
# instantiate an Experiment, when verbose prints out classification head architecture and number of parameters
ch1 = Experiment(config=mesh_config_1, splits=mesh_splits, dataset=mesh_hf, track=do_track)

Classification head architecture:
Sequential(
  (0): Linear(in_features=768, out_features=11, bias=True)
)
Number of parameters (MESH model): 1.0995e+08


Next, we train the model on the MeSH labelled dataset for only 5 epochs, saving a checkpoint at each training epoch.

In [8]:
train, test = False, True
if train:
    ch1.perform_training()  # might take some time...
else: 
    ch1.load_run()

if test: # tests the given configuration
    ch1.test_model()

Model trainedmodels/MESH_CH1.pth loaded successfully!


Average F1-Score 0.9370





If one simply wants to reproduce our findings, it is also possible to load one of our pre-trained models from the `trainedmodels` folder.

## `CH2` & `CH3`: Increasing non-linearity
The same procedure can be carried out to evaluate the performance of `CH2` and `CH3`.



<div align="center" style="width:100%; display:flex;">
  <div style="width:49%; display:inline-block; margin-right:2%;">
    <img src="https://i.ibb.co/SvkHN21/architecture2.png" alt="Image 1">
  </div>
  <div style="width:49%; display:inline-block;">
    <img src="https://i.ibb.co/cgC3nvg/architecture3.png" alt="Image 2">
  </div>
</div>


In [10]:
# CH2
ch2 = Experiment(config=mesh_config_2, splits=mesh_splits, dataset=mesh_hf, track=do_track)
if train:
    ch2.perform_training()
else: 
    ch2.load_run()
if test:
    ch2.test_model()

# CH3
ch3 = Experiment(config=mesh_config_3, splits=mesh_splits, dataset=mesh_hf, track=do_track)
if train:
    ch3.perform_training()
else: 
    ch3.load_run()
if test:
    ch3.test_model()

# CH2.2
ch2bis = Experiment(config=mesh_config_2bis, splits=mesh_splits, dataset=mesh_hf, track=do_track)
if train:
    ch2bis.perform_training()
else: 
    ch2bis.load_run()
if test:
    ch2bis.test_model()

# MAG Classification

In [None]:
from commons.utils import mag_config_1, mag_config_2, mag_config_3, mag_config_2bis
mag_splits = mag_hf.train_test_split(test_size=mag_config_1["test_size"])

Here we repeat for MAG classification the exact same procedure we implemented for MeSH classification.

In [None]:
train, test, do_track = False, True, False

# CH1
ch1 = Experiment(config=mag_config_1, splits=mag_splits, dataset=mag_hf, track=do_track)
if train:
    ch1.perform_training()
else: 
    ch1.load_run()
if test:
    ch1.test_model()

# CH2
ch2 = Experiment(config=mag_config_2, splits=mag_splits, dataset=mag_hf, track=do_track)
if train:
    ch2.perform_training()
else: 
    ch2.load_run()
if test:
    ch2.test_model()

# CH3
ch3 = Experiment(config=mag_config_3, splits=mag_splits, dataset=mag_hf, track=do_track)
if train:
    ch3.perform_training()
else: 
    ch3.load_run()
if test:
    ch3.test_model()

# CH2.2
ch2bis = Experiment(config=mag_config_2bis, splits=mag_splits, dataset=mag_hf, track=do_track)
if train:
    ch2bis.perform_training()
else: 
    ch2bis.load_run()
if test:
    ch2bis.test_model()

# Training Results
For details about the training procedures and comments on the results obtained, please refer to the report that comes with this repo.

Here, we show the training curves for the two considered tasks and a visualization of the effect of fine-tuning on the embeddings obtained through SPECTER. In particular, we observed the following behavior in the training curves.

<div style="display: flex;">
  <figure>
    <a href="https://ibb.co/FkT1WjG"><img width=500 src="https://i.ibb.co/SDCgx9q/Mesh-Training.png" alt="Mesh-Training" border="0"></a>
    <figcaption>Cross-Entropy evolution during training of various CHs for MeSH</figcaption>
  </figure>
  <figure>
    <a href="https://ibb.co/YbnWRm3"><img width=500 src="https://i.ibb.co/BwWPzbt/Mag-Training.png" alt="Mag-Training" border="0"></a>
    <figcaption>Cross-Entropy evolution during training of various CHs for MAG</figcaption>
  </figure>
</div>

Please notice that the curve for `CH2.2` is shorter than the others simply because of the much larger batch size used (32 instead of 8).
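
The effect of the batch size on the length of a training curve follows from simple arithmetic, sketched below. The training-set size and the number of epochs are hypothetical, and the sketch assumes that one loss value is logged per batch, which may not match the actual logging setup.

```python
import math


def logged_steps(n_samples, batch_size, epochs):
    """Number of points on the curve if one loss value is logged per batch."""
    return math.ceil(n_samples / batch_size) * epochs


# hypothetical training-set size and number of epochs
n_samples, epochs = 10_000, 5
steps_bs8 = logged_steps(n_samples, 8, epochs)
steps_bs32 = logged_steps(n_samples, 32, epochs)
print(steps_bs8, steps_bs32)  # 6250 1565
```

Under these assumptions, a batch size of 32 yields roughly a quarter of the points obtained with a batch size of 8.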

Beyond the evolution of the loss curve, we also studied the evolution of the embeddings across epochs. In particular, we show a 2D visualization (obtained using 2-component PCA) in which one clearly sees the evolution of the embeddings during training, despite the low variance explained by the two components. Please note that we do not show a legend for the class labels of the embeddings, as the purpose of the visualization is to show the evolution of the embedder during fine-tuning rather than to label the individual points. As a final note, the visualization shown here has been obtained on a subsample of 1000 papers for both the MeSH and MAG datasets.
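
A 2-component PCA projection of this kind can be sketched with NumPy alone. The function `pca_2d` and the random stand-in embeddings are illustrative assumptions; the plots in this notebook may have been produced with a different PCA implementation.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    # fraction of the total variance captured by the two components
    explained = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()
    return projected, explained


rng = np.random.default_rng(0)
# random stand-in for the embeddings of 1000 papers (NOT real SPECTER output)
toy_embeddings = rng.normal(size=(1000, 768))
points, explained_ratio = pca_2d(toy_embeddings)
print(points.shape, f"{explained_ratio:.3f}")
```

The explained-variance ratio returned alongside the projection is the quantity referred to above as the variance explained by the two components.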

## MeSH fine-tuning
<head>
   <style>
   .img-grid {
      display: grid;
      grid-template-columns: repeat(2, 1fr);
      grid-gap: 5px;
   }
   .img-grid img {
      width: 75%;
      height: auto;
   }
   </style>
</head>
<body>
   <div class="img-grid">
   <a href="https://ibb.co/VChR7cj"><img src="https://i.ibb.co/jwxmcP6/specter-reference-mesh.png" alt="specter-reference-mesh" border="0"></a>
   <a href="https://ibb.co/bBG9qXn"><img src="https://i.ibb.co/gTcKHJC/MESH-CH1.gif" alt="MESH-CH1" border="0"></a>
   <a href="https://ibb.co/DzpTfyp"><img src="https://i.ibb.co/KbwSxkw/MESH-CH2.gif" alt="MESH-CH2" border="0"></a>
   <a href="https://ibb.co/7p2fZFX"><img src="https://i.ibb.co/z2GMBwr/MESH-CH3.gif" alt="MESH-CH3" border="0"></a>
   </div>
</body>

## MAG fine-tuning
<head>
   <style>
   .img-grid {
      display: grid;
      grid-template-columns: repeat(2, 1fr);
      grid-gap: 5px;
   }
   .img-grid img {
      width: 75%;
      height: auto;
   }
   </style>
</head>
<body>
   <div class="img-grid">
   <a href="https://ibb.co/VChR7cj"><img src="https://i.ibb.co/jwxmcP6/specter-reference-mesh.png" alt="specter-reference-mesh" border="0"></a>
   <a href="https://ibb.co/bBG9qXn"><img src="https://i.ibb.co/gTcKHJC/MESH-CH1.gif" alt="MESH-CH1" border="0"></a>
   <a href="https://ibb.co/DzpTfyp"><img src="https://i.ibb.co/KbwSxkw/MESH-CH2.gif" alt="MESH-CH2" border="0"></a>
   <a href="https://ibb.co/7p2fZFX"><img src="https://i.ibb.co/z2GMBwr/MESH-CH3.gif" alt="MESH-CH3" border="0"></a>
   </div>
</body>

With these configurations we obtained the following Macro F1 Scores, which are considerably higher than the ones reported in SPECTER for both the MeSH (+~6.6) and MAG (+~16) tasks. 

|         Task        | Macro F1 Score |
|:-------------------:|:--------------:|
| *MeSH Classification (SPECTER)* |      *87.7*      |
| **MeSH Classification (ours)** |      **94.32**    |
|  *MAG Classification (SPECTER)* |     *79.4*      |
|  **MAG Classification (ours)** |      **95.45**     |
