# 2nd Extension

In this 2nd Extension to the Specter Paper, we always consider the fully pre-trained Specter Embedder and use this term interchangeably with Specter.

The [Specter Paper](https://arxiv.org/abs/2004.07180) suggests that classification of the embeddings $e_i \in \mathbb R^{d}\ \forall i \in \mathcal X$ can be carried out using standard classical Machine Learning Algorithms such as SVM. 

In this case, a discriminative function $\hat f$ is learned and applied on the embeddings $e_i$ to perform classification. 

The intuition behind this idea (which underlies the single-digit $\Delta$-improvement in performance reported by the authors of Specter) is that one can decouple the task of Text Classification into two main subcomponents: 

1. **Natural Language Embedding**, i.e. to obtain contextualized numerical representations of textual data

2. **Embedding Classification**, i.e. to actually classify these numerical representations using standard Machine Learning techniques.

Consider now the following Figure, representing two alternative systems for Specter-based Text Classification. 

<img src="https://i.ibb.co/wJ5zYzp/ext2-scheme.png" alt="ext2-scheme" border="0">

In this picture, two alternative ways of classifying a given paper $P_i$ are displayed. In the process shown in the top part of the image, paper $P_i$ is first embedded through $\texttt{specter}$ into the corresponding embedding $e_i$. Traditional Machine Learning techniques (here represented with the scikit-learn symbol) are then used to learn the discriminative function $\hat f$ that (hopefully) minimizes the classification error $\Vert l_i - \hat f(e_i) \Vert_{p} \ \forall i$, for some $p$-norm. 


This process is pretty straightforward, as the task of Text Classification is here split into the two constituent sub-tasks of NL-Embedding and downstream classification. Clearly enough, the **labelled** data is mainly used in learning $\hat f$. 
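The top pipeline can be sketched as follows. This is a minimal illustration, not the authors' setup: the embeddings are synthetic stand-ins for frozen Specter outputs, the dimension `d = 768`, the number of classes and the linear-kernel SVM are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for frozen Specter embeddings e_i (assumption: d = 768).
rng = np.random.default_rng(0)
n_papers, d, n_classes = 300, 768, 3
labels = rng.integers(0, n_classes, size=n_papers)
class_centers = rng.normal(size=(n_classes, d))
embeddings = class_centers[labels] + 0.5 * rng.normal(size=(n_papers, d))

# The labelled data is only used here, to learn f_hat on fixed embeddings.
e_train, e_test, l_train, l_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0, stratify=labels
)
f_hat = SVC(kernel="linear").fit(e_train, l_train)

accuracy = f_hat.score(e_test, l_test)
print(f"held-out accuracy on synthetic embeddings: {accuracy:.2f}")
```

Note that the embedder never sees the labels in this setting: whatever structure the embeddings have is fixed before $\hat f$ is learned.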

The bottom diagram shows instead **our** extension, based on the simple yet possibly very powerful intuition that embeddings produced by SPECTER might suffer from over-generalization when used in Text Classification. 

This directly follows from the fact that said embeddings are produced with the aim of being high-quality (citation-network informed) representations, to be used later on for tasks such as Text Classification, but also Citation Prediction and many more. This may prevent SPECTER from being used to its fullest in Text Classification, since the embeddings it produces might simply not be tailored to this aim. 

Our intuition is that one can **chain** the two steps on which Text Classification is based, unifying the whole process. After an often very extensive and data-intensive pre-training phase, the embeddings produced by SPECTER are fed into a Classification Head (CH) based on a Multi-Layer Perceptron. This allows, at least in principle, a complete flow of information not only between the CH parameters and the classification output, but also between the SPECTER embedding model and the classification output itself.

Theoretically, this flow of information can be used to tweak (or better, **fine-tune**) SPECTER parameters specifically for classification (or any downstream task really).
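The chaining idea can be sketched in PyTorch as below. This is an illustrative sketch under stated assumptions: `ClassificationHead` and `ChainedClassifier` are hypothetical names, the encoder is a small toy network standing in for Specter, the layer sizes are arbitrary, and cross-entropy is used as a differentiable stand-in for the classification error.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """MLP mapping an embedding to class logits (layer sizes are assumptions)."""

    def __init__(self, embedding_dim, hidden_dim, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, embedding):
        return self.mlp(embedding)


class ChainedClassifier(nn.Module):
    """Encoder followed by the head, so both are trained through one loss."""

    def __init__(self, encoder, head):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, inputs):
        return self.head(self.encoder(inputs))


torch.manual_seed(0)
# Toy stand-in for Specter: a real run would wrap the transformer instead.
toy_encoder = nn.Sequential(nn.Linear(32, 16), nn.Tanh())
classifier = ChainedClassifier(toy_encoder, ClassificationHead(16, 8, 3))

toy_papers = torch.randn(4, 32)          # 4 toy "papers" as numeric vectors
toy_labels = torch.tensor([0, 1, 2, 0])

logits = classifier(toy_papers)
loss = nn.functional.cross_entropy(logits, toy_labels)
loss.backward()

# Both the head and the encoder receive a gradient from the same loss.
encoder_grad = sum(p.grad.norm().item() for p in classifier.encoder.parameters())
head_grad = sum(p.grad.norm().item() for p in classifier.head.parameters())
print(f"encoder grad norm: {encoder_grad:.4f}, head grad norm: {head_grad:.4f}")
```

The only structural difference from the two-stage pipeline is that the encoder sits inside the same module as the head, so a single backward pass reaches the parameters of both.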

This is justified as follows. Consider a **labelled** dataset $\tau$ defined as: 

$$
\begin{equation}
\tau =  \{ \mathcal P_i \vert l_i \}_{i = 1, \dots, \vert \mathcal X \vert}
\end{equation}
$$

Then, in the bottom part of the diagram, the classification function $g: \mathcal X \to \mathcal L$ is applied to any given paper $\mathcal P$ as follows: 

$$
\begin{equation}
g_{\text{bottom}}(\mathcal P) = \bar f(\texttt{specter}(\mathcal P))
\end{equation}
$$

It follows that if one uses as loss function the misclassification error $L(l, \bar l) \in \mathbb R^+$, where $\bar l = g_{\text{bottom}}(\mathcal P)$ is the predicted label, then one observes in practice that:

$$
\begin{equation}
\frac{\partial L}{\partial w_\texttt{specter}} \neq 0
\end{equation}
$$
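A sketch of why this holds, assuming that $L$ (or a differentiable surrogate of it), $\bar f$ and $\texttt{specter}$ are all differentiable, and writing $e = \texttt{specter}(\mathcal P)$ and $\bar l = \bar f(e)$ as shorthand. The chain rule then gives:

$$
\begin{equation}
\frac{\partial L}{\partial w_\texttt{specter}} = \frac{\partial L}{\partial \bar l} \, \frac{\partial \bar f}{\partial e} \, \frac{\partial e}{\partial w_\texttt{specter}}
\end{equation}
$$

In the top pipeline the last factor never enters training, since $e$ is computed once and then treated as a constant input to $\hat f$.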

Now, of course, one cannot expect the extensively trained weights of Specter to change significantly for one specific task: as an encoder, Specter's job is, at the end of the day, to turn text into meaningful and dynamic numerical representations. Indeed, the information in $\tau$ is mostly used in training the CH on top of Specter. 
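One common way to reflect this asymmetry during training is to give the encoder a much smaller learning rate than the head. The sketch below is an assumption, not something stated in the text: the toy `encoder` and `head` stand in for Specter and the CH, and the learning rates `1e-5` and `1e-3` are illustrative values only.

```python
import torch
from torch import nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(32, 16), nn.Tanh())             # stand-in for Specter
head = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 3))  # stand-in for the CH

# Per-parameter-group learning rates (values are assumptions).
optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ]
)

encoder_before = [p.detach().clone() for p in encoder.parameters()]
head_before = [p.detach().clone() for p in head.parameters()]

# One training step on a toy batch.
inputs = torch.randn(8, 32)
targets = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(head(encoder(inputs)), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()


def total_change(before, module):
    """Sum of absolute parameter changes after the optimizer step."""
    return sum(
        (p.detach() - b).abs().sum().item()
        for p, b in zip(module.parameters(), before)
    )


encoder_change = total_change(encoder_before, encoder)
head_change = total_change(head_before, head)
print(f"encoder change: {encoder_change:.6f}, head change: {head_change:.6f}")
```

With this choice the encoder weights still move, but by far less than the head weights in each step.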

Nevertheless, the embeddings are indeed updated, so that Classification is applied in a dynamical feature space, whose geometry is affected to from the 

```python
from transformers import AutoTokenizer, AutoModel

# load specter pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = AutoModel.from_pretrained('allenai/specter')
```
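With a tokenizer and model loaded this way, paper embeddings can be computed along the following lines. `embed_papers` is a hypothetical helper introduced for illustration; it assumes each paper is a dict with a `"title"` and an optional `"abstract"`, joins the two with the tokenizer's separator token, and takes the hidden state of the first token as the embedding.

```python
import torch


def embed_papers(papers, tokenizer, model, max_length=512):
    """Return one embedding per paper (hypothetical helper).

    `papers` is a list of dicts with a "title" and an optional "abstract".
    """
    # Title and abstract are joined with the tokenizer's separator token.
    texts = [
        p["title"] + tokenizer.sep_token + (p.get("abstract") or "")
        for p in papers
    ]
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():  # frozen encoder: embeddings only, no gradient tracking
        outputs = model(**inputs)
    # The hidden state of the first token is used as the paper embedding e_i.
    return outputs.last_hidden_state[:, 0, :]
```

Calling `embed_papers(papers, tokenizer, model)` returns a tensor with one row per paper. The `torch.no_grad()` context matches the frozen, two-stage pipeline; for fine-tuning, the encoder would have to be called with gradient tracking enabled.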

