# Produce decontextualized word embeddings

In this notebook, we will produce [decontextualized word embeddings][1] from [the `roberta-base` model][2] and from our fine-tuned version of it, which can represent math-specific tokens.

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base

In [1]:
! hostname

mir


In [2]:
%%capture
! pip install .[transformers,scm]

In [3]:
from gensim.models import KeyedVectors

## The `roberta-base` model

First, we will extract decontextualized word embeddings out of the `roberta-base` model.

In [4]:
%%capture
! make decontextualized-word-embeddings-roberta-base

In [5]:
%ls -lh decontextualized-word-embeddings-roberta-base

-rw-r--r-- 1 novotny novotny 373K May 20 07:30 decontextualized-word-embeddings-roberta-base


To see how well the `roberta-base` model can represent scientific terms, we will load its decontextualized embeddings and look at similar terms in the word embedding space.

In [6]:
base_embeddings = KeyedVectors.load('decontextualized-word-embeddings-roberta-base')
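As a reminder of what "similar terms" means here, `KeyedVectors.most_similar` ranks the other terms by cosine similarity to the query term. The following is a minimal sketch of that ranking in plain Python. The `toy_vectors` dictionary and both helper functions are hypothetical and exist only for illustration; they are not part of gensim or of this repository.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=10):
    """Rank all terms except the query by cosine similarity to the query."""
    query_vector = vectors[query]
    scores = [
        (term, cosine_similarity(query_vector, vector))
        for term, vector in vectors.items()
        if term != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical two-dimensional embeddings, not taken from either model.
toy_vectors = {
    'real': [1.0, 0.0],
    'actual': [0.9, 0.1],
    'rational': [0.5, 0.5],
    'estate': [0.0, 1.0],
}

neighbors = most_similar(toy_vectors, 'real', topn=2)
print(neighbors)
```

With these toy vectors, *actual* ranks above *rational* because its vector points in almost the same direction as the vector of *real*.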

We can see that for scientific terms that have a different meaning in math than in common usage, such as *absolute*, *property*, and *real*, `roberta-base` tends to favor the common usage:

- *absolute*: total and complete *as opposed to* involving absolute values
- *property*: someone's belongings/a building and the land belonging to it *as opposed to* an attribute of a math object
- *real*: something that actually exists *as opposed to* involving or containing real numbers

In [7]:
base_embeddings.most_similar('absolute')

[('absolutely', 0.9164975761832016),
 ('effective', 0.9140561937498846),
 ('extreme', 0.9050805288952175),
 ('actual', 0.9006715920201244),
 ('exclusive', 0.898657442002306),
 ('angular', 0.8981691067282201),
 ('relative', 0.8975861042260462),
 ('acceptable', 0.8964603402839815),
 ('offset', 0.8962548321691873),
 ('integer', 0.8945229985048906)]

In [8]:
base_embeddings.most_similar('property')

[('Property', 0.9564183915543218),
 ('properties', 0.9490537038487242),
 ('perties', 0.9119670440793409),
 ('estate', 0.9104328288512181),
 ('value', 0.9087446331980998),
 ('pointer', 0.9086655706777714),
 ('theme', 0.9080653826613683),
 ('policy', 0.907620184092995),
 ('deck', 0.9064049816662365),
 ('attribute', 0.9063193512822962)]

In [9]:
base_embeddings.most_similar('real')

[('Real', 0.9660123198743351),
 ('actual', 0.9356304996211505),
 ('reality', 0.9333833963212447),
 ('true', 0.9189921386453295),
 ('normal', 0.9181932473473009),
 ('natural', 0.9154994974594166),
 ('Normal', 0.9127584272831695),
 ('really', 0.9115041775006173),
 ('pal', 0.9112555406270034),
 ('legal', 0.9111350762652691)]

Since our decontextualized token embeddings are the mean of contextual token embeddings on scientific texts, even `roberta-base` is not completely off the mark and includes some scientific terms among the ten most similar terms. However, `roberta-base` cannot always properly tokenize a scientific article or attend to its context.
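The averaging step mentioned above can be sketched as follows. This is only an illustration under simplifying assumptions: the contextual vectors are given as plain lists, and the function name `decontextualize` and the toy data are hypothetical rather than the code that the `make` targets in this notebook run.

```python
from collections import defaultdict


def decontextualize(occurrences):
    """Average the contextual vectors of every token.

    `occurrences` is an iterable of (token, contextual_vector) pairs,
    one pair per occurrence of a token in the corpus.
    """
    sums = {}
    counts = defaultdict(int)
    for token, vector in occurrences:
        if token not in sums:
            sums[token] = [0.0] * len(vector)
        for index, value in enumerate(vector):
            sums[token][index] += value
        counts[token] += 1
    return {
        token: [value / counts[token] for value in total]
        for token, total in sums.items()
    }


# Hypothetical contextual embeddings of tokens in two different sentences.
toy_occurrences = [
    ('real', [1.0, 0.0]),    # an occurrence in an everyday context
    ('real', [0.0, 1.0]),    # an occurrence in a mathematical context
    ('number', [0.2, 0.8]),
]

embeddings = decontextualize(toy_occurrences)
print(embeddings['real'])
```

Because the mean is taken over scientific texts, the mathematical contexts of a term pull its decontextualized vector toward other scientific terms, even for a model that was not fine-tuned.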

## The tuned `roberta-base` model

Next, we will extract decontextualized word embeddings out of the `roberta-base` model fine-tuned so that it can represent math-specific tokens.

In [10]:
%%capture
! make decontextualized-word-embeddings-tuned-roberta-base-text+latex

In [11]:
%ls -lh decontextualized-word-embeddings-tuned-roberta-base-text+latex

-rw-r--r-- 1 novotny novotny 1.1M May 20 06:55 decontextualized-word-embeddings-tuned-roberta-base-text+latex


To see how well our fine-tuned model can represent scientific terms, we will load its decontextualized embeddings and look at similar terms in the word embedding space.

In [12]:
tuned_embeddings = KeyedVectors.load('decontextualized-word-embeddings-tuned-roberta-base-text+latex')

We can see that unlike `roberta-base`, our fine-tuned model tends to favor similar scientific terms:

- *absolute*: top ten terms include fragments of mathematical operators rather than terms similar to the common usage
- *property*: top ten terms do not include *value*, *estate*, or *policy*
- *real*: top ten terms no longer include *actual* and now include *rational*

In [13]:
tuned_embeddings.most_similar('absolute')

[('Bra', 0.849663554118112),
 ('sqrt', 0.8285149035579248),
 ('gtr', 0.8284741846550531),
 ('framebox', 0.8179374852182264),
 ('frak', 0.8155009521481359),
 ('\\under', 0.814968915081996),
 ('alty', 0.8130912137629961),
 ('operatorname', 0.8126520618368384),
 ('operatorname*', 0.8112455505496647),
 ('inst', 0.8105142482596693)]

In [14]:
tuned_embeddings.most_similar('property')

[('Property', 0.9384866797919728),
 ('properties', 0.9065765208383727),
 ('perty', 0.8977596035241844),
 ('attribute', 0.8971946003432749),
 ('element', 0.8941358996661748),
 ('entity', 0.8936722270445662),
 ('topic', 0.8895547168639073),
 ('perties', 0.8877034119576299),
 ('instance', 0.8855805134449843),
 ('functional', 0.8855760842701841)]

In [15]:
tuned_embeddings.most_similar('real')

[('Real', 0.9148605082046193),
 ('reality', 0.9077418971319128),
 ('binary', 0.8970199766755064),
 ('rational', 0.8940688385836887),
 ('ral', 0.8923245888063248),
 ('urnal', 0.8919872449237995),
 ('mal', 0.8905495854135415),
 ('functional', 0.8905244359371303),
 ('mental', 0.8890649030130348),
 ('ual', 0.8888234435310957)]
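To make the comparison between the two models easier to check, the following sketch lists which neighbors of *real* were dropped and which were gained after fine-tuning. The two lists are copied from the outputs printed in this notebook; the helper `compare_neighbors` is a hypothetical name introduced only for this illustration.

```python
# Top ten neighbors of 'real', copied from the outputs in this notebook.
base_real = ['Real', 'actual', 'reality', 'true', 'normal',
             'natural', 'Normal', 'really', 'pal', 'legal']
tuned_real = ['Real', 'reality', 'binary', 'rational', 'ral',
              'urnal', 'mal', 'functional', 'mental', 'ual']


def compare_neighbors(base, tuned):
    """Return (dropped, gained) terms, keeping the original ranking order."""
    base_set, tuned_set = set(base), set(tuned)
    dropped = [term for term in base if term not in tuned_set]
    gained = [term for term in tuned if term not in base_set]
    return dropped, gained


dropped, gained = compare_neighbors(base_real, tuned_real)
print('dropped:', dropped)
print('gained:', gained)
```

With the live models, the same helper could be fed the terms returned by `base_embeddings.most_similar('real')` and `tuned_embeddings.most_similar('real')`.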

Furthermore, unlike with `roberta-base`, we can use our fine-tuned model to find similar math-specific tokens:

In [16]:
tuned_embeddings.most_similar(r'\cos')

[('}\\cos', 0.956758189554533),
 ('\\sin', 0.9514626298264707),
 ('\\cos%', 0.9434441854751765),
 ('\\tan', 0.9278311303489227),
 ('}\\sin', 0.9254699807540067),
 ('(\\cos', 0.9244286133752913),
 ('\\cosh', 0.9221619260868039),
 ('\\frac{\\cos', 0.919889123507782),
 ('\\displaystyle\\cos', 0.9196743298469137),
 ('\\cos(', 0.9189877033669472)]
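A note on reading the output above: the doubled backslashes are an artifact of how Python prints strings, not part of the tokens. The short sketch below, which uses only built-in Python behavior, shows that the raw string `r'\cos'` used in the query and the escaped form `'\\cos'` shown in the output denote the same four-character token.

```python
query = r'\cos'      # raw string literal, as used in the query above
escaped = '\\cos'    # the same token written with an escaped backslash

same = query == escaped   # both literals denote the same string
length = len(query)       # one backslash followed by three letters

# repr() is what the notebook uses to display list elements,
# which is why a single backslash is shown doubled.
printed = repr('\\sin')
print(same, length, printed)
```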

In [17]:
tuned_embeddings.most_similar('F(x)')

[('F(x)=', 0.9593355274441487),
 ('F(x', 0.9417396755668178),
 ('F(t)', 0.9369839133795281),
 ('G(x)', 0.9361178736043807),
 ('F(y', 0.9223169943227582),
 ('F(z', 0.9210098745370999),
 ('V(x)', 0.9149784694692186),
 ('F(X', 0.9138278425878289),
 ('F(', 0.913121134182822),
 ('F(u', 0.911264177551353)]

In [18]:
tuned_embeddings.most_similar(r'\prod')

[('}\\prod', 0.9532089875946642),
 ('\\displaystyle\\prod', 0.9342738865765974),
 ('}=\\prod', 0.9288030534822265),
 ('\\prod\\limits', 0.9229099580532697),
 ('\\sum', 0.9175814060521996),
 ('\\bigoplus', 0.9112117963203984),
 ('\\frac{\\prod', 0.9109938754752355),
 ('\\displaystyle\\sum', 0.9089826808485659),
 ('\\bigsqcup', 0.906801984020918),
 ('\\sum\\limits', 0.9052707240234698)]