Copyright 2021 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Latent variable vectorization

We previously discussed vectorization, by which words are converted into numeric features based on their distribution in a text.
We've seen that we can perform vectorization on words, transformations on words, or transformations on multi-word units.
However, in all of these cases, the vectorization is ultimately based on counts in documents (where a document can be a sentence, paragraph, or something longer).
As a result, the **length of the vectors**, or equivalently **the dimension of the vector space**, is determined by the number of documents or the size of the vocabulary.
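As a reminder of what count-based vectorization looks like, here is a minimal sketch in plain Python. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which are simplifications for illustration and not necessarily what a library vectorizer does:

```python
from collections import Counter

texts = ["dogs chase cats", "cats chase mice", "mice eat cheese"]

# Tokenize on whitespace (an assumption; real pipelines use a proper tokenizer).
tokenized = [t.split() for t in texts]

# The vocabulary fixes the columns of the matrix, here in alphabetical order.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word.
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Every new vocabulary word adds a column, so each document vector gets longer as the collection grows.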

What if we could represent the text vectors in a lower dimensional space, especially one where the dimensions correspond to latent variables of interest rather than document boundaries?
There are two reasons why this could be a good idea.
First, our vector spaces become extremely large as our text collections and vocabulary get bigger, which makes computation more costly.
Second, if we can re-represent our vector space according to latent variables of interest, the resulting vectors should be even better features in our predictive models.

## What you will learn

You will learn various methods for representing a vector space as a lower dimensional space of latent variables.
  
We will cover:

- Latent semantic analysis
- Latent Dirichlet allocation
- Word2vec/Doc2vec
- Contextual embeddings with BERT

## When to use latent variable vectorization

Vectorization based on latent variables generally produces better features than vectorization based on counts alone.
However, it can be computationally costly to construct vectorizations based on latent variables, and if you use a precomputed model, it's possible words in your text will have no associated representation in the model.
A good strategy is to try a precomputed model and compare to count-based vectorization; if the precomputed model is worse, you can consider if the problem is a mismatch with your text and whether it is worth correcting it.
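One rough way to check for a mismatch between your text and a precomputed model is to measure how many of your tokens the model has no vector for. The sketch below is only illustrative: `oov_report` is a hypothetical helper, `toy_vocabulary` stands in for a real model's vocabulary, and tokenization is plain lowercased whitespace splitting.

```python
def oov_report(texts, model_vocabulary):
    """Return the out-of-vocabulary rate and the set of missing words."""
    tokens = [word for text in texts for word in text.lower().split()]
    missing_tokens = [word for word in tokens if word not in model_vocabulary]
    rate = len(missing_tokens) / len(tokens) if tokens else 0.0
    return rate, set(missing_tokens)


texts = ["dogs chase cats", "cats chase mice", "mice eat cheese"]

# Hypothetical model vocabulary; "cheese" is deliberately left out.
toy_vocabulary = {"dogs", "cats", "mice", "chase", "eat"}

rate, missing = oov_report(texts, toy_vocabulary)
print(f"{rate:.1%} of tokens have no vector; missing words: {sorted(missing)}")
```

A high rate suggests the precomputed model was built from text unlike yours.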

## Latent semantic analysis

[Latent semantic analysis (LSA; also known as latent semantic indexing)](https://en.wikipedia.org/wiki/Latent_semantic_analysis) is an old but still popular approach to latent variable vectorization.
We start with it because it is based on the term-document matrix: LSA simply applies truncated [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) to the term-document matrix.
If you are unfamiliar with SVD, the basic idea is that it finds a low-dimensional approximation of the original matrix (in the least squares sense) whose dimensions are ranked by how much of the signal they capture (i.e. the first dimension captures the most signal, then the second, and so on).

Here and throughout this notebook, we'll make extensive use of a new library, [gensim](https://radimrehurek.com/gensim/), which provides support for a wide variety of vector space methods.
Importantly, gensim provides wrappers for sklearn, so you can call gensim transformations as part of sklearn pipelines (i.e. the same way you've already called `CountVectorizer`).
This is extremely useful, because it allows us to use gensim without going deeply into its native API, which is slightly unusual.
Note these wrappers were [removed in v4 of gensim](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4), so the code below requires gensim 3.x and won't work with the latest version.

Let's use our previous example from [Vectorization](Vectorization-weighting.ipynb) so we can compare results:

- import `gensim.sklearn_api` as `sklearn_api`
- import `sklearn.pipeline` as `pipeline`
- Set `texts` to a list containing `"dogs chase cats"`, `"cats chase mice"`, `"mice eat cheese"`

In [1]:
import gensim.sklearn_api as sklearn_api
import sklearn.pipeline as pipeline
texts = ['dogs chase cats', 'cats chase mice', 'mice eat cheese']

Gensim assumes the input texts are in its [native bag of words format (BOW)](https://radimrehurek.com/gensim_3.8.3/sklearn_api/text2bow.html) so we need a pipeline that starts with the BOW transformation followed by the LSA transformation.
We can specify the number of dimensions/latent variables we want using `num_topics`.

- Set `pipe` to with `pipeline` create `Pipeline` using a list with a list of tuples inside it:
    - `"bow"` and with `sklearn_api` create `Text2BowTransformer`
    - `"lsi"` and with `sklearn_api` create `LsiTransformer` using `num_topics=5`

*Note: 5 is a small number so we can display the results in Jupyter and compare to standard vectorization.
In practice, you want 50 to 1000 dimensions.* 

In [2]:
pipe = pipeline.Pipeline([('bow',(sklearn_api.Text2BowTransformer())), ('lsi',(sklearn_api.LsiTransformer(num_topics=5)))])

Apply `pipe` to our list of texts and store the result:

- Set `matrix` to with `pipe` do `fit_transform` using `texts`

In [3]:
matrix = pipe.fit_transform(texts)

The work has been done, but to get Jupyter to display it, we have to convert the result to a list:

- with `matrix` do `tolist` 

In [4]:
matrix.tolist()

[[1.4472136497497559, -0.7745966911315918, 0.5527864098548889, 0.0, 0.0],
 [1.6180340051651, 0.0, -0.6180340051651001, 0.0, 0.0],
 [0.7236068248748779, 1.5491933822631836, 0.27639320492744446, 0.0, 0.0]]

Our original term-document matrix was:
```
matrix([[1, 1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 1]])
```
Interestingly, our new matrix has zeros in the last two of its 5 dimensions.
This suggests that it is properly a 3-dimensional space, which makes sense: with only 3 documents, the term-document matrix has rank at most 3.
If these dimensions are latent variables, what do they mean?
In general, we can't expect to recover the meaning of latent dimensions directly or exactly.
If you are interested in exploring what the latent dimensions mean, check out gensim's `print_topics` method.
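To see where the LSA numbers come from, the decomposition can be sketched with plain NumPy instead of gensim. This is an illustration under the assumption that the document vectors are the left singular vectors scaled by the singular values; the signs of individual dimensions are arbitrary in SVD, so only absolute values are compared with the output above:

```python
import numpy as np

# Term-document counts from the running example (rows are documents,
# columns are cats, chase, cheese, dogs, eat, mice).
counts = np.array([
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 1],
], dtype=float)

# Thin SVD: counts = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# A 3 x 6 matrix has at most 3 singular values, hence at most 3 dimensions.
print(s.round(3))

# Scaling U by the singular values gives one latent vector per document.
doc_vectors = U * s
print(np.abs(doc_vectors).round(3))

# Keeping only the first k dimensions gives a rank-k approximation.
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(approx.round(2))
```

Up to sign, `doc_vectors` should agree with the three nonzero columns gensim returned, and the error of the rank-2 approximation equals the singular value that was dropped.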

## Latent Dirichlet allocation

[Latent Dirichlet allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is similar to LSA in spirit but has its basis in generative Bayesian modeling rather than linear algebra.
LDA also looks at word counts in documents but supposes that each document is a mixture of latent topics and that each word in a document is generated from one of these topic distributions.
Like LSA, there is a free parameter for the number of topics, which correspond to the latent variables/dimensions of the space.
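The generative story behind LDA can be sketched in a few lines of NumPy. This only illustrates how a document is assumed to be generated, not how gensim fits the model; the number of topics, the document length, and the symmetric Dirichlet parameters of 1 are all arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["cats", "chase", "cheese", "dogs", "eat", "mice"]
num_topics = 2   # assumed for illustration
doc_length = 5   # assumed for illustration

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.ones(len(vocabulary)), size=num_topics)

# Each document has its own mixture over topics.
doc_topic = rng.dirichlet(np.ones(num_topics))

# For every word position: pick a topic, then pick a word from that topic.
document = []
for _ in range(doc_length):
    topic = rng.choice(num_topics, p=doc_topic)
    word = rng.choice(vocabulary, p=topic_word[topic])
    document.append(str(word))

print(doc_topic.round(3))
print(document)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixture of each document, and those mixtures become the document vectors.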

We can use the [same approach for LDA](https://radimrehurek.com/gensim_3.8.3/sklearn_api/ldamodel.html) as LSA:

- Set `pipe` to with `pipeline` create `Pipeline` using a list with a list of tuples inside it:
    - `"bow"` and with `sklearn_api` create `Text2BowTransformer`
    - `"lda"` and with `sklearn_api` create `LdaTransformer` using `num_topics=5`
- Set `matrix` to with `pipe` do `fit_transform` using `texts`
- with `matrix` do `tolist`

*Note: You may wish to copy the blocks from above into the cell below and change as appropriate.*

In [5]:
pipe = pipeline.Pipeline([('bow',(sklearn_api.Text2BowTransformer())), ('lda',(sklearn_api.LdaTransformer(num_topics=5)))])
matrix = pipe.fit_transform(texts)

matrix.tolist()

[[0.7987903952598572,
  0.05001029744744301,
  0.050049956887960434,
  0.05109935253858566,
  0.05004999786615372],
 [0.05107072368264198,
  0.05050072446465492,
  0.050051089376211166,
  0.7983267307281494,
  0.050050731748342514],
 [0.05001014098525047,
  0.7993845343589783,
  0.05004967376589775,
  0.050505951046943665,
  0.05004971846938133]]

The result this time is both interesting and less impressive.
It seems that LDA has effectively created a dummy coding, with one large value per document, each in a different position.
This is perhaps a bit of an unfair conclusion, because methods like this need a lot of text in practice to work well.
However, it does serve to illustrate the difference between LSA and LDA, as well as the importance of choosing an appropriate number of dimensions when doing LDA.
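One way to read the LDA output is as topic proportions: each row should sum to roughly 1, and the largest entry marks the dominant topic. The short check below uses values copied from the output above and rounded to four places, so the sums are only approximately 1:

```python
# Topic proportions copied (rounded) from the LDA output above.
lda_rows = [
    [0.7988, 0.0500, 0.0500, 0.0511, 0.0500],
    [0.0511, 0.0505, 0.0501, 0.7983, 0.0501],
    [0.0500, 0.7994, 0.0500, 0.0505, 0.0500],
]

# Each row behaves like a probability distribution over the 5 topics.
row_sums = [sum(row) for row in lda_rows]

# The dominant topic is the position of the largest proportion.
dominant_topics = [row.index(max(row)) for row in lda_rows]

print([round(total, 3) for total in row_sums])
print(dominant_topics)
```

With 5 topics and only 3 documents, each document gets a topic of its own, which is why the result resembles a dummy coding.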

## Word2vec/Doc2vec

[Word2vec (w2v)](https://en.wikipedia.org/wiki/Word2vec) is different from LSA/LDA in that it does not start with a term-document matrix and then find latent dimensions, but rather tries to predict words in context using a neural network.
Therefore, while LSA/LDA can be described as latent vectorization based on counts, w2v can be described as latent vectorization based on predicting words in context.
W2v and other prediction-based methods have both better scaling properties and better predictive qualities than count-based methods when trained with large amounts of text.
Doc2vec (d2v) is based on a similar idea to w2v, but predicts the words in a document using a vector that represents the document.
D2v is more appropriate for our use here, because it gives us one vector per text rather than one vector per word.
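To make "predicting words in context" concrete, the sketch below builds the (target, context) pairs that a skip-gram style model would be trained on. Skip-gram is one of the two w2v training setups; the window size of 1 and the helper name `skipgram_pairs` are assumptions for illustration, and no neural network is trained here:

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs("dogs chase cats".split(), window=1)
for target, context in pairs:
    print(target, "->", context)
```

The network learns a vector for each target word such that its context words become easy to predict, so words that occur in similar contexts end up with similar vectors.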

We can again use the [same approach for d2v](https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec) in gensim, but with a few tweaks.

First, the d2v implementation in `sklearn_api` [isn't entirely compatible with sklearn](https://github.com/RaRe-Technologies/gensim/issues/2403), so we need a special preprocessor to go in our pipeline in front of it (kind of like `Text2BowTransformer`).
This code was adapted from [here](https://github.com/alex2awesome/gensim-sklearn-tutorial/blob/master/notebooks/gensim-in-sklearn-pipelines.ipynb); if you are interested in doing more complex things with `gensim`/`sklearn` pipelines, it has some good examples:

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin
import nltk
class MyTokenizer(BaseEstimator, TransformerMixin):
    """Tokenize input strings using NLTK."""
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        #This is where you'd put any custom NLTK processing
        #The final output must be a list of word lists
        return [nltk.word_tokenize(t) for t in X]
    
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

After you execute the above code, do this:

- Set `pipe` to with `pipeline` create `Pipeline` using a list with a list of tuples inside it:
    - `"mytokens"` and freestyle `MyTokenizer()`
    - `"d2v"` and with `sklearn_api` create `D2VTransformer` using freestyle `min_count=1, size=5`
- Set `matrix` to with `pipe` do `fit_transform` using `texts`
- with `matrix` do `tolist`

*Notes:*

- *`min_count=1` is needed for our small example and prevents low frequency word filtering*
- *You may wish to copy the blocks from above into the cell below and change as appropriate.*

In [7]:
pipe = pipeline.Pipeline([('mytokens',MyTokenizer()), ('d2v',(sklearn_api.D2VTransformer(min_count=1, size=5)))])
matrix = pipe.fit_transform(texts)

matrix.tolist()

[[0.07412126660346985,
  -0.0947391465306282,
  -0.04089774191379547,
  -0.0947369858622551,
  0.03851011022925377],
 [-0.050562601536512375,
  -0.06402575224637985,
  0.09507104754447937,
  0.08805355429649353,
  0.056405507028102875],
 [0.016094354912638664,
  0.0790107399225235,
  -0.09235618263483047,
  0.029193880036473274,
  -0.07684177905321121]]

The output seems even less structured than with LSA/LDA, but this is likely due to the small size of our corpus.

### Pretrained models

Gensim has pretrained models for w2v and related word-level methods that you can download in order to vectorize your text data.
If your text data is small, you will very likely get a better result using these models than if you trained your own.
However, to use these in a pipeline, you will need to combine the word vectors to get a single vector representing the document.
`W2VTransformerDocLevel` [as shown here](https://github.com/alex2awesome/gensim-sklearn-tutorial/blob/master/notebooks/gensim-in-sklearn-pipelines.ipynb) gives a good start for how to do this.
Another option would be to download the models but then [load them into spaCy](https://stackoverflow.com/questions/50466643/in-spacy-how-to-use-your-own-word2vec-model-created-in-gensim); you'll no longer be using them in an sklearn pipeline, but that's not a problem per se (see next section).

In [8]:
import gensim.downloader as api
info = api.info() 
info['models'].keys() #lists available models; use 'corpora' to show corpora
# model = api.load("word2vec-google-news-300") #uncomment to run

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])
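Combining word vectors into a single document vector, as mentioned above, is often done by averaging. The sketch below uses a hypothetical dictionary of 3-dimensional toy vectors in place of a downloaded model; a loaded gensim model should support similar `in` and `[]` lookups, but treat that as an assumption to check against your gensim version:

```python
import numpy as np

# Hypothetical toy word vectors standing in for a pretrained model.
toy_vectors = {
    "dogs": np.array([1.0, 0.0, 0.0]),
    "cats": np.array([0.0, 1.0, 0.0]),
    "chase": np.array([0.0, 0.0, 1.0]),
}


def document_vector(text, word_vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    known = [word_vectors[w] for w in text.split() if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vec = document_vector("dogs chase cats", toy_vectors, dim=3)
empty_vec = document_vector("mice eat cheese", toy_vectors, dim=3)
print(doc_vec)
print(empty_vec)
```

Words without a vector are simply skipped, which is one simple way to handle text that does not match the pretrained vocabulary.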

## Contextual embeddings with BERT

The above latent variable vectorization techniques are not context sensitive, meaning that they create a single global representation for each text.
This is easiest to understand at the word level.
Consider what happens when we have one vector for a word with multiple senses, e.g. `fly` - the vector carries some of the meaning of all of those senses at once.
Contextual embeddings create vectors for words *in context*, so only the sense in a given context is represented (ideally speaking).

We can work with contextual embeddings using an extension to spaCy called `spacy-transformers`.
This software library integrates several transformer-based models (including BERT) into the spaCy pipeline.
We will download a pre-trained model (~500 MB) and use it.

Run the following in your **terminal**:

- `python -m spacy download en_core_web_trf`

*Note: You can [download a model that uses GloVe vectors](https://spacy.io/models/en#en_core_web_lg), which are like w2v; to use w2v with spaCy, see the previous section.*

Next import spacy:

- import `spacy` as `spacy`

In [2]:
import spacy as spacy

Remember spaCy creates an NLP pipeline, which we typically call `nlp`.

- Set `nlp` to with `spacy` do `load` using `"en_core_web_trf"`

In [3]:
nlp = spacy.load('en_core_web_trf')

Let's process **all** our `texts`; to process a list of strings, use `pipe` instead of `__call__`.
Since this gives us a list of docs, we use a comprehension to get the tensors out:

- Set `docs` to list with `nlp` do `pipe` using `texts`
- Set `tensors` to list for each item `i` in `docs` yield freestyle `i._.trf_data.tensors[0][0]`
- freestyle `tensors[0].shape` (this is the tensor for `dogs chase cats`)

*Note: The standard API to get a tensor would be `i.tensor` here, but that appears to be broken, so we are using a more low-level approach.*

In [24]:
docs = list(nlp.pipe(texts))
tensors = list(i._.trf_data.tensors[0][0] for i in docs)

tensors[0].shape

(6, 768)

There are 3 words in the document, but we have 6 vectors in the tensor!
The extra vectors are for hidden start/end tokens and *padding* (you would also see extra vectors for punctuation).

So for each of the underlying tokens, we have a 768-dimensional vector.
However, we want a single vector to represent the entire text.

To get a single vector for the sentence, we can sum them:

- Set `docVectors` to list for each `i` in list `tensors` yield freestyle `i.sum(axis=0)`
- Display length of `docVectors[0]`

In [25]:
docVectors = list(i.sum(axis=0) for i in tensors)
print(len(docVectors[0]))

768


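Why the `sum(axis=0)`? 
Because tensors can be summed in different directions, we had to tell it which direction to sum in. 
Summing along `axis=0` adds up the 6 token vectors, giving a vector that is 768 long; summing along `axis=1` would instead have given a vector that was 6 long, and summing with no axis at all would have collapsed everything into a single number.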
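The effect of the axis is easy to check on a stand-in array with the same shape. The sketch below uses a NumPy array of ones rather than real transformer output, and the mean at the end is shown only as a common alternative to the sum, not as what this notebook does:

```python
import numpy as np

# Stand-in for one document's tensor: 6 token vectors of 768 dimensions each.
tensor = np.ones((6, 768))

sum_over_tokens = tensor.sum(axis=0)      # collapses the 6 tokens
sum_over_dimensions = tensor.sum(axis=1)  # collapses the 768 dimensions
sum_of_everything = tensor.sum()          # collapses both axes to one number

print(sum_over_tokens.shape)
print(sum_over_dimensions.shape)
print(np.shape(sum_of_everything))

# Averaging keeps the scale independent of the number of tokens.
mean_over_tokens = tensor.mean(axis=0)
print(mean_over_tokens.shape)
```

Only the reductions over `axis=0` keep one value per embedding dimension, which is what a document vector needs.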

Now we have created contextual embeddings for our texts, just as in the other methods above.
The major difference in usage is that we do not have this set up in an sklearn pipeline.
However, it shouldn't be that hard to adapt the sample code above to put it into a pipeline.
Alternatively, you can leave it as is and copy the results back to a dataframe before running a model with the other variables in that dataframe.
