<br>
<u>Notebook One</u> | 
<a href=https://leone.gdn/NLP target=_blank>Report</a> | 
<a href=https://github.com/andrealeone/NLP target=_blank>Repository</a>
<br><br>
<b>Transcript vectorisation</b><br><br>
Andrea Leone<br>
ML for NLP — University of Trento<br>
January 2022
<hr><br><br>

In [None]:
import project

import numpy  as np
import spacy

from tqdm.notebook import tqdm

<br>

Load the pre-trained pipelines for word embeddings:
* [`en_core_web_lg`](https://spacy.io/models/en#en_core_web_lg): English tok2vec pipeline optimized for CPU.  
Includes 685k unique vectors (300 dimensions); trained on [GloVe Common Crawl](https://nlp.stanford.edu/projects/glove/).
* [`en_core_web_trf`](https://spacy.io/models/en#en_core_web_trf): English transformer pipeline based on [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta).<br>
It ships without static vectors, but contextual vectors can be obtained by extracting the transformer's internal embeddings (768 dimensions).

Both pipelines are trained on the [WordNet 3.0](https://wordnet.princeton.edu/) lexical database of English, the [ClearNLP](https://github.com/clir/clearnlp-guidelines/blob/master/md/components/dependency_conversion.md) Constituent-to-Dependency Conversion, and the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.

In [None]:
nlp  = spacy.load('en_core_web_lg')
trf  = spacy.load('en_core_web_trf')

<br>

### Static model

Query the records whose transcript has not been vectorised yet.

In [None]:
records = project.sql(""" 
    SELECT * FROM talks
    WHERE
        transcript IS NOT NULL AND
        vector     IS NULL
    ORDER BY slug ASC;
""")

For each record retrieved, pass the transcript through the `nlp` pipeline, which vectorises the document token by token, then take the document vector and convert its values to `float64`.
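
As a rough illustration of what the document vector is: by default spaCy computes `Doc.vector` as the average of the token vectors, which it stores as `float32`. The sketch below reproduces that averaging and the `float64` conversion on hypothetical 4-dimensional toy vectors (the real `en_core_web_lg` vectors have 300 dimensions); the array values are made up for the example.

```python
import numpy as np

# Hypothetical toy token vectors standing in for the 300-dimensional
# vectors of `en_core_web_lg`; one row per token.
token_vectors = np.array(
    [
        [1.0, 0.0, 2.0, 4.0],
        [3.0, 2.0, 0.0, 0.0],
        [2.0, 4.0, 4.0, 2.0],
    ],
    dtype=np.float32,
)

# Average over the token axis, then widen to float64 before storing.
doc_vector = token_vectors.mean(axis=0).astype(np.float64)

print(doc_vector.shape, doc_vector.dtype)
```

Because the result is a plain average, the document vector has the same dimensionality as the token vectors regardless of transcript length.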

In [None]:
for record in tqdm( records ): 

    slug       = record[0]
    transcript = record[4]

    vector     = nlp( transcript ).vector.astype( np.float64 )
    vector     = project.sqlize_array( vector )

    project.sql_commit("UPDATE talks SET vector='{0}' WHERE slug='{1}'".format(vector, slug))

<br/>
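
The `UPDATE` above builds its statement with string formatting, which fails as soon as a slug contains a single quote. A minimal sketch of the same write with bound parameters follows. It assumes an SQLite backend reached through the standard `sqlite3` module, a simplified `talks` schema, and JSON as the serialisation format; none of these is confirmed by the notebook, and `store_vector` is a hypothetical helper, not part of `project`.

```python
import json
import sqlite3


def store_vector(connection, slug, vector):
    """Write one serialised vector, letting the driver quote the values."""
    # JSON is a hypothetical stand-in for `project.sqlize_array`.
    payload = json.dumps([float(x) for x in vector])
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "UPDATE talks SET vector = ? WHERE slug = ?",
            (payload, slug),
        )


# In-memory database with an assumed, simplified schema.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE talks (slug TEXT PRIMARY KEY, transcript TEXT, vector TEXT)"
)
connection.execute(
    "INSERT INTO talks (slug, transcript) VALUES (?, ?)",
    ("what's_next", "placeholder transcript"),
)

# The apostrophe in the slug is handled by the parameter binding.
store_vector(connection, "what's_next", [0.1, 0.2])

row = connection.execute(
    "SELECT vector FROM talks WHERE slug = ?", ("what's_next",)
).fetchone()
stored = json.loads(row[0])
print(stored)
```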

### Transformer model

Query the records whose transcript has no transformer-based vector (`vector_trf`) yet.

In [None]:
records = project.sql(""" 
    SELECT * FROM talks
    WHERE
        transcript IS NOT NULL AND
        vector_trf IS NULL
    ORDER BY slug ASC;
""")

Transformer-based pipelines output tensors over wordpieces rather than tokens, so the output must be re-aligned with spaCy's tokens before word, span, or document vectors can be extracted.
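
The re-alignment can be sketched on toy data: flatten the transformer output into one row per wordpiece, look up the rows aligned with each token, and average them. The tensor shape, the `alignment` list, and the `pooled_vector` helper below are hypothetical illustrations, not the spaCy API.

```python
import numpy as np

# Toy transformer output: 2 spans x 3 wordpieces x 4 dimensions.
hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# One row per wordpiece, as done with `reshape(-1, out_dim)`.
flat = hidden.reshape(-1, hidden.shape[-1])

# Hypothetical alignment: for each token, the wordpiece rows it maps to.
# Token 1 was split into two wordpieces (rows 1 and 2).
alignment = [[0], [1, 2], [4]]


def pooled_vector(flat_rows, alignment, start, end):
    """Average the wordpiece rows aligned with tokens start..end-1."""
    rows = [row for token_rows in alignment[start:end] for row in token_rows]
    return flat_rows[rows].mean(axis=0)


token_vec = pooled_vector(flat, alignment, 1, 2)  # single token
span_vec = pooled_vector(flat, alignment, 0, 2)   # span of two tokens

print(token_vec)
print(span_vec)
```

A token and a span are handled the same way; the only difference is how many alignment entries are gathered before averaging.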

In [None]:
from spacy.language import Language 

@Language.factory('tensor2attr')
class Tensor2Attr:

    def __init__(self, name, nlp):
        pass

    def __call__(self, doc):
        self.add_attributes(doc)
        return doc

    def add_attributes(self, doc):
        doc.user_hooks['vector']           = self.doc_tensor
        doc.user_span_hooks['vector']      = self.span_tensor
        doc.user_token_hooks['vector']     = self.token_tensor
        doc.user_hooks['similarity']       = self.get_similarity
        doc.user_span_hooks['similarity']  = self.get_similarity
        doc.user_token_hooks['similarity'] = self.get_similarity

    def doc_tensor(self, doc):
        # Document vector: mean over the first axis of the last output tensor.
        return doc._.trf_data.tensors[-1].mean(axis=0)

    def span_tensor(self, span):
        # Gather the wordpiece rows aligned with the span's tokens and average them.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
        out_dim   = span.doc._.trf_data.tensors[0].shape[-1]
        tensor    = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
        return tensor.mean(axis=0)

    def token_tensor(self, token):
        # Same as above, restricted to the wordpieces of a single token.
        tensor_ix = token.doc._.trf_data.align[token.i].data.flatten()
        out_dim   = token.doc._.trf_data.tensors[0].shape[-1]
        tensor    = token.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
        return tensor.mean(axis=0)

    def get_similarity(self, doc1, doc2):
        # Cosine similarity between the two hooked vectors.
        return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)

trf.add_pipe('tensor2attr')
trf.pipeline
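
The `get_similarity` hook registered by the component is a cosine similarity. A standalone sketch of that formula on plain arrays is given below; `cosine_similarity` is a hypothetical helper, and returning `0.0` for a zero-length vector is a choice made for this sketch rather than the notebook's behaviour.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0.0:
        # Guard added for the sketch: avoids a division by zero.
        return 0.0
    return float(np.dot(a, b) / denominator)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not magnitude, so vectors of different norms can still be compared.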

For each record retrieved, pass the transcript through the `trf` pipeline, which encodes the document with the transformer and aligns the tensors with the tokens via the custom `tensor2attr` component, then take the document vector and convert its values to `float64`.

In [None]:
for record in tqdm( records ): 

    slug       = record[0]
    transcript = record[4]

    vector     = trf( transcript ).vector.astype( np.float64 )
    vector     = project.sqlize_array( vector )

    project.sql_commit("UPDATE talks SET vector_trf='{0}' WHERE slug='{1}'".format(vector, slug))

<br/>