# Document Embedding Basic Usage Example

This notebook gives a basic overview of how the TextWiser API can be used.

In [1]:
import os
os.chdir('..')

We use a small sample of scikit-learn's 20 newsgroups dataset (the first 20 training documents from four categories) to test the embeddings.

In [2]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
documents = newsgroups_train.data[:20]

In general, you can use `Embedding` models, which take in raw text and convert it into real-valued vectors. You can then follow them with `Transformation` models, which take in and output real-valued vectors.
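To make this contract concrete, here is a minimal sketch of the embedding-then-transformation pattern in plain Python. The classes `CountEmbedding`, `L2Normalize`, and `ToyPipeline` are hypothetical stand-ins invented for illustration; they are not TextWiser classes and say nothing about how TextWiser is implemented.

```python
import math
from collections import Counter
from typing import List


class CountEmbedding:
    """Toy embedding (hypothetical): raw text -> word-count vectors."""

    def fit(self, docs: List[str]) -> "CountEmbedding":
        # Learn a fixed, sorted vocabulary from the training documents.
        self.vocab = sorted({word for doc in docs for word in doc.lower().split()})
        return self

    def transform(self, docs: List[str]) -> List[List[float]]:
        rows = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            rows.append([float(counts[word]) for word in self.vocab])
        return rows


class L2Normalize:
    """Toy transformation (hypothetical): vectors -> unit-length vectors."""

    def fit(self, vecs: List[List[float]]) -> "L2Normalize":
        return self  # nothing to learn

    def transform(self, vecs: List[List[float]]) -> List[List[float]]:
        out = []
        for vec in vecs:
            # Fall back to 1.0 so an all-zero vector is left unchanged.
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            out.append([x / norm for x in vec])
        return out


class ToyPipeline:
    """The first step sees raw text; every later step sees vectors only."""

    def __init__(self, embedding, *transformations):
        self.steps = [embedding, *transformations]

    def fit_transform(self, docs: List[str]) -> List[List[float]]:
        data = docs
        for step in self.steps:
            data = step.fit(data).transform(data)
        return data


toy = ToyPipeline(CountEmbedding(), L2Normalize())
toy_vecs = toy.fit_transform(["a b", "b b"])
```

Only the first step ever receives strings, which is why an embedding has to come first and transformations can be chained freely after it.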

For example, we can replicate a TF-IDF followed by NMF featurization as follows:

In [3]:
from textwiser import TextWiser, Embedding, Transformation

emb = TextWiser(Embedding.TfIdf(min_df=5), Transformation.NMF(n_components=30))
vecs = emb.fit_transform(documents)
vecs

array([[2.9383649e-04, 8.3787434e-02, 5.4838383e-01, 0.0000000e+00,
        0.0000000e+00, 3.5673371e-01, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        2.4264532e-01, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        4.0479924e-04, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [1.8922535e-01, 1.6309661e-04, 2.8942347e-01, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3527455e-05,
        0.0000000e+00, 0.0000000e+00, 1.8145434e-05, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        5.3897417e-01, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 6.3893108e-06,
        2.2896272e-04, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0

Here's the old way, which calls the internal `_TfIdfEmbeddings` and `_NMFTransformation` classes directly:

In [4]:
from textwiser.embeddings import _TfIdfEmbeddings
from textwiser.transformations import _NMFTransformation

emb = _TfIdfEmbeddings(min_df=5).fit_transform(documents)
emb = _NMFTransformation(n_components=30).fit_transform(emb)
emb

tensor([[2.4315e-01, 0.0000e+00, 0.0000e+00, 3.6069e-02, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 2.1295e-01, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.9885e-01, 1.2645e-04, 1.3521e-01, 2.4087e-01, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 9.8129e-06, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         9.6432e-01, 0.0000e+00, 0.0000e+00, 5.0608e-07, 0.0000e+00, 0.0000e+00],
        [1.1295e-03, 0.0000e+00, 1.9995e-05, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 1.6312e-07, 0.0000e+00, 0.0000e+00, 0.0000e+00, 6.9571e-01,
         0.0000e+00, 0.000

We can just as easily try out a pooling of word embeddings:

In [5]:
from textwiser import TextWiser, Embedding, PoolOptions, Transformation, WordOptions

emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='en'), Transformation.Pool(pool_option=PoolOptions.max))
vecs = emb.fit_transform(documents)
vecs

array([[0.1969, 0.2157, 0.2076, ..., 0.4019, 0.2006, 0.2305],
       [0.2327, 0.3131, 0.3033, ..., 0.3267, 0.2783, 0.2475],
       [0.0992, 0.1078, 0.207 , ..., 0.2473, 0.1487, 0.1541],
       ...,
       [0.2538, 0.2691, 0.1851, ..., 0.3561, 0.3472, 0.1758],
       [0.2038, 0.1847, 0.4164, ..., 0.3179, 0.24  , 0.2351],
       [0.2573, 0.2906, 0.2647, ..., 0.4127, 0.205 , 0.2459]],
      dtype=float32)

Here's the old way, which calls the internal `_WordEmbeddings` and `_PoolTransformation` classes directly:

In [6]:
from textwiser.embeddings import _WordEmbeddings
from textwiser.transformations import _PoolTransformation
from textwiser.options import WordOptions, PoolOptions

emb = _WordEmbeddings(WordOptions.word2vec, pretrained='en').fit_transform(documents)
emb = _PoolTransformation(PoolOptions.max).fit_transform(emb)
emb

tensor([[0.1969, 0.2157, 0.2076,  ..., 0.4019, 0.2006, 0.2305],
        [0.2327, 0.3131, 0.3033,  ..., 0.3267, 0.2783, 0.2475],
        [0.0992, 0.1078, 0.2070,  ..., 0.2473, 0.1487, 0.1541],
        ...,
        [0.2538, 0.2691, 0.1851,  ..., 0.3561, 0.3472, 0.1758],
        [0.2038, 0.1847, 0.4164,  ..., 0.3179, 0.2400, 0.2351],
        [0.2573, 0.2906, 0.2647,  ..., 0.4127, 0.2050, 0.2459]],
       grad_fn=<StackBackward>)
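The pooling step used in the word-embedding example reduces a variable number of word vectors to a single document vector. The following is a minimal sketch of that idea, assuming one vector per word; the `pool` function and its `option` argument are hypothetical names for illustration, not the TextWiser implementation.

```python
from statistics import fmean
from typing import List


def pool(word_vectors: List[List[float]], option: str = "max") -> List[float]:
    """Collapse per-word vectors into one vector, dimension by dimension."""
    if not word_vectors:
        raise ValueError("cannot pool an empty document")
    # zip(*rows) iterates over columns, i.e. one embedding dimension at a time.
    columns = list(zip(*word_vectors))
    if option == "max":
        return [max(col) for col in columns]
    if option == "mean":
        return [fmean(col) for col in columns]
    raise ValueError(f"unknown pooling option: {option!r}")


# Three "words", each with a 2-dimensional toy embedding.
toy_doc = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]
max_pooled = pool(toy_doc, "max")    # largest value in each dimension
mean_pooled = pool(toy_doc, "mean")  # average value in each dimension
```

Whatever the document length, the pooled vector has the same dimensionality as a single word vector, which is what lets documents of different lengths be stacked into one array.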

## Schema

We also provide a convenient schema for mixing and matching different word and document embeddings. This makes it easy to try out different embedding types rapidly.

There are two main operations: `transform` and `concat`.

The `transform` operation defines a list of operations. The first of these operations should be an `Embedding`, while the rest should be `Transformation`s. The idea is that an `Embedding` has access to raw text and turns it into vectors, so the `Transformation`s that follow only need to operate on vectors. In PyTorch terms, this is equivalent to using `nn.Sequential`.

The `concat` operation defines a concatenation of multiple embedding vectors. This can be done at both the word level and the sentence level. In PyTorch terms, this is equivalent to using `torch.cat`.
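The semantics of the two operations can be sketched with a small recursive evaluator. This is an illustration only: the toy models in `EMBEDDINGS` and `TRANSFORMATIONS`, and the functions `split_spec` and `evaluate`, are hypothetical and work on one document at a time, whereas the real library operates on batches and on actual embedding models.

```python
from typing import Any, Dict, List, Tuple

Vector = List[float]

# Hypothetical toy models, used only to make the control flow visible.
EMBEDDINGS = {
    "n_chars": lambda doc: [float(len(doc))],
    "n_words": lambda doc: [float(len(doc.split()))],
}
TRANSFORMATIONS = {
    "scale": lambda vec, factor=2.0: [factor * x for x in vec],
}


def split_spec(spec: Any) -> Tuple[str, Dict[str, Any]]:
    """Accept either 'name' or ('name', {params})."""
    if isinstance(spec, str):
        return spec, {}
    name, params = spec
    return name, params


def evaluate(schema: Any, doc: str) -> Vector:
    if isinstance(schema, dict) and "transform" in schema:
        # Sequential: the first entry produces vectors, the rest map vectors to vectors.
        first, *rest = schema["transform"]
        vec = evaluate(first, doc)
        for spec in rest:
            name, params = split_spec(spec)
            vec = TRANSFORMATIONS[name](vec, **params)
        return vec
    if isinstance(schema, dict) and "concat" in schema:
        # Concatenation: evaluate every child and join the results end to end.
        out: Vector = []
        for child in schema["concat"]:
            out.extend(evaluate(child, doc))
        return out
    # Leaf: an embedding that reads raw text (parameters are ignored by the toy models).
    name, _ = split_spec(schema)
    return EMBEDDINGS[name](doc)


toy_schema = {
    "transform": [
        {"concat": [
            "n_chars",
            {"transform": ["n_words", ("scale", {"factor": 3.0})]},
        ]},
        "scale",
    ]
}
result = evaluate(toy_schema, "hello world")
```

For `"hello world"`, the two branches yield `[11.0]` and `[6.0]`, the concatenation is `[11.0, 6.0]`, and the final `scale` step doubles it to `[22.0, 12.0]`.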

### Sample Schema

Below we outline a sample, and presumably common, use case. At the root level, we have three different embeddings. The first two are different word embeddings (word2vec and Flair), each of which is pooled (the first using max pooling, the second using mean pooling) to generate the sentence representations $s_1$ and $s_2$. The third is a tf-idf embedding of the document whose dimensionality is then reduced to 30 using NMF, generating the sentence representation $s_3$. These representations are concatenated into $s_{123}$, which is fed into a final SVD transformation that brings the sentence vector back down to a manageable size ($s$).
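As a rough bookkeeping sketch of the dimensions involved: the 300 is read off the `Embedding(1000001, 300)` layer shown in the fitted model, the 30 comes from `n_components`, and the 10 is the width of the arrays printed by `transform`. The 1024 for the Flair representation is an assumption based on the `LSTM(100, 1024)` hidden size and is not confirmed elsewhere in this notebook.

$$
s_1 \in \mathbb{R}^{300}, \qquad s_2 \in \mathbb{R}^{1024}, \qquad s_3 \in \mathbb{R}^{30}
$$

$$
s_{123} = [\, s_1 ;\ s_2 ;\ s_3 \,] \in \mathbb{R}^{300 + 1024 + 30} = \mathbb{R}^{1354}, \qquad s = \mathrm{SVD}(s_{123}) \in \mathbb{R}^{10}
$$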

In [7]:
from textwiser import TextWiser, Embedding

doc_embeddings_schema = {
    'transform': [
        {
            'concat': [
                {
                    'transform': [
                        ('word2vec', {'pretrained': 'en'}),
                        'pool'
                    ]
                },
                {
                    'transform': [
                        ('flair', {'pretrained': 'news-forward-fast'}),
                        ('pool', {'pool_option': 'mean'})
                    ]
                },
                {
                    'transform': [
                        'tfidf',
                        ('nmf', { 'n_components': 30 })
                    ]
                }
            ]
        },
        'svd'
    ]
}

doc_embeddings = TextWiser(Embedding.Compound(schema=doc_embeddings_schema))
doc_embeddings

TextWiser(
  (_imp): _Sequential(
    (0): _CompoundEmbeddings()
  )
)

Once the embeddings object is initialized, we can feed in a list of text documents and get the relevant output.

In [8]:
doc_embeddings.fit(documents)

TextWiser(
  (_imp): _Sequential(
    (0): _CompoundEmbeddings(
      (model): _Sequential(
        (0): _Concat(
          (embeddings): ModuleList(
            (0): _Sequential(
              (0): _WordEmbeddings(
                (model): Embedding(1000001, 300)
              )
              (1): _PoolTransformation()
            )
            (1): _Sequential(
              (0): _WordEmbeddings(
                (model): FlairEmbeddings(
                  (lm): LanguageModel(
                    (drop): Dropout(p=0.25, inplace=False)
                    (encoder): Embedding(275, 100)
                    (rnn): LSTM(100, 1024)
                    (decoder): Linear(in_features=1024, out_features=275, bias=True)
                  )
                )
              )
              (1): _PoolTransformation()
            )
            (2): _Sequential(
              (0): _TfIdfEmbeddings()
              (1): _NMFTransformation()
            )
          )
        )
        (1): _SVDTransform

In [9]:
%%time

emb = doc_embeddings.transform(documents)
emb

CPU times: user 1min 52s, sys: 2.94 s, total: 1min 55s
Wall time: 29.7 s


array([[-5.09309196e+00,  3.08767557e-02, -1.04446597e-01,
         3.07375669e-01, -2.61844337e-01,  2.92069882e-01,
        -2.39583835e-01,  2.90402502e-01, -4.02518362e-01,
         3.02182108e-01],
       [-5.28662968e+00,  2.36051977e-02,  6.81536943e-02,
        -2.25922227e-01,  3.66908275e-02, -4.84507084e-01,
         4.91836667e-02, -3.45129669e-01, -7.58192360e-01,
         3.14387113e-01],
       [-4.50751591e+00,  4.87101912e-01,  2.60197937e-01,
         6.20816827e-01,  6.30345106e-01,  2.53131449e-01,
         2.78462708e-01, -1.31215543e-01,  1.76260233e-01,
         2.01231360e-01],
       [-5.07704115e+00, -3.70861232e-01,  1.33992821e-01,
         3.22225809e-01, -4.74442035e-01, -1.31559968e-02,
        -3.82126033e-01, -1.68419063e-01,  5.44337705e-02,
        -4.31150459e-02],
       [-4.58557606e+00,  6.77263260e-01,  7.67389908e-02,
         1.90357313e-01, -2.35140249e-02, -7.93806255e-01,
        -5.83654642e-03,  2.23456278e-01,  2.67650597e-02,
        -5.

You can also specify the same schema in a JSON file.
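The contents of `notebooks/schema.json` are not shown in this notebook. As a sketch, assuming the file simply mirrors the dictionary used earlier, such a file could be produced with the standard `json` module. JSON has no tuple type, so each `(name, params)` tuple is written as a two-element array and comes back as a list; the temporary path below is a placeholder rather than the real file.

```python
import json
import os
import tempfile

# Same structure as the dictionary schema defined earlier in this notebook.
schema = {
    "transform": [
        {"concat": [
            {"transform": [("word2vec", {"pretrained": "en"}), "pool"]},
            {"transform": [("flair", {"pretrained": "news-forward-fast"}),
                           ("pool", {"pool_option": "mean"})]},
            {"transform": ["tfidf", ("nmf", {"n_components": 30})]},
        ]},
        "svd",
    ]
}

# Round-trip through a temporary file (placeholder path, not notebooks/schema.json).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "schema.json")
    with open(path, "w") as f:
        json.dump(schema, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

# Tuples were serialized as JSON arrays, so they are lists after loading.
first_leaf = loaded["transform"][0]["concat"][0]["transform"][0]
```

After the round trip, `first_leaf` is the list `['word2vec', {'pretrained': 'en'}]` rather than a tuple.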

In [10]:
import json

doc_embeddings = TextWiser(Embedding.Compound(schema='notebooks/schema.json'))
doc_embeddings

TextWiser(
  (_imp): _Sequential(
    (0): _CompoundEmbeddings()
  )
)

In [11]:
doc_embeddings.fit_transform(documents)

array([[-5.09357882e+00,  1.56707168e-02, -6.55764639e-02,
         3.95372391e-01, -4.27129604e-02,  4.13021803e-01,
        -1.29411668e-01,  4.42747742e-01, -2.32591271e-01,
        -1.64550215e-01],
       [-5.28606558e+00, -5.77974319e-03,  7.31223524e-02,
        -2.19560072e-01,  2.83031315e-01, -2.59074181e-01,
         3.54273975e-01,  4.36569363e-01, -4.44127023e-01,
         1.38857424e-01],
       [-4.50827408e+00,  4.91460443e-01, -8.75518262e-01,
        -2.96749286e-02, -5.35864472e-01, -1.42926574e-01,
         4.27924842e-03, -2.01539412e-01, -8.58182907e-02,
        -3.00842196e-01],
       [-5.07706833e+00, -4.18934226e-01, -1.10421285e-01,
         3.86063278e-01,  3.44134212e-01,  3.53466451e-01,
         8.69917870e-02, -1.62974387e-01,  6.05377629e-02,
         2.08491743e-01],
       [-4.58593941e+00,  6.70846045e-01, -2.32599854e-01,
         2.62758043e-02,  7.20463276e-01, -4.45691228e-01,
        -4.98465821e-02,  1.22333467e-01,  4.64539975e-01,
         4.