# Produce term similarity matrices

In this notebook, we will produce the term similarity matrices used by the soft vector space models.
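To illustrate what a term similarity matrix contributes, here is a minimal pure-Python sketch of the soft cosine measure, the similarity that soft vector space models are usually built on. The three-term vocabulary, the vectors, and the similarity values are made up for illustration and do not come from this notebook.

```python
import math

def inner(x, s, y):
    """Compute x^T S y for dense lists x, y and a dense square matrix s."""
    return sum(x[i] * s[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, s):
    """Soft cosine measure: x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))."""
    norm = math.sqrt(inner(x, s, x)) * math.sqrt(inner(y, s, y))
    return inner(x, s, y) / norm if norm > 0 else 0.0

# Hypothetical vocabulary: ["derivative", "differential", "banana"]
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

query = [1.0, 0.0, 0.0]     # contains only "derivative"
document = [0.0, 1.0, 0.0]  # contains only "differential"

print(soft_cosine(query, document, identity))  # ordinary cosine, no term similarity
print(soft_cosine(query, document, S))         # soft cosine with term similarity
```

With the identity matrix the two vectors share no terms and do not match at all, whereas the term similarity matrix lets related terms contribute to the score.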

In [1]:
! hostname

mir


In [2]:
%%capture
! pip install .[scm]

In [3]:
from gensim.similarities import SparseTermSimilarityMatrix

## Levenshtein similarities

First, we will produce term similarity matrices based on surface-level similarities using Levenshtein distances.
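As a rough illustration of a surface-level similarity, the following sketch computes the Levenshtein distance with dynamic programming and maps it to a similarity with `alpha * (1 - distance / max_length) ** beta`. Both the mapping and the values of `alpha` and `beta` are assumptions made for this example; the notebook does not state which settings the `make` targets use.

```python
def levenshtein_distance(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b, alpha=1.0, beta=5.0):
    """Assumed mapping from distance to similarity; alpha and beta are placeholders."""
    longest = max(len(a), len(b))
    if longest == 0:
        return alpha
    return alpha * (1.0 - levenshtein_distance(a, b) / longest) ** beta

print(levenshtein_distance('matrix', 'matrices'))
print(levenshtein_similarity('matrix', 'matrices'))
```

A large `beta` makes the similarity fall off quickly, so only near-identical spellings keep a noticeable weight.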

### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
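For orientation, a document in this format might look like the made-up string in the sketch, which separates the text parts from the math parts with a regular expression. The helper name `split_text_and_math` and the example sentence are hypothetical; the actual preprocessing and tokenization are not shown in this notebook.

```python
import re

# Non-greedy match of everything between a [MATH] and the next [/MATH] token.
MATH_SPAN = re.compile(r'\[MATH\](.*?)\[/MATH\]', re.DOTALL)

def split_text_and_math(document):
    """Return (text_parts, math_parts) for a string using [MATH] ... [/MATH] markers."""
    math_parts = [match.strip() for match in MATH_SPAN.findall(document)]
    # re.split with one capture group alternates text, math, text, ...
    text_parts = [part.strip() for part in MATH_SPAN.split(document)[::2] if part.strip()]
    return text_parts, math_parts

example = 'Solve [MATH] x^2 + 1 = 0 [/MATH] over the complex numbers'
text_parts, math_parts = split_text_and_math(example)
print(text_parts)
print(math_parts)
```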

In [4]:
%%capture
! make levenshtein-similarity-matrix-text+latex

In [5]:
%ls -lh levenshtein-similarity-matrix-text+latex

-rw-r--r-- 1 novotny novotny 38M May  6 20:50 levenshtein-similarity-matrix-text+latex


In [6]:
SparseTermSimilarityMatrix.load('levenshtein-similarity-matrix-text+latex').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 4878801 stored elements in Compressed Sparse Column format>
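The shape and the number of stored elements reported above are enough to judge how sparse the matrix is. The small helper here, whose name is made up for this example, computes the density and the average number of stored elements per term (a count that includes the diagonal).

```python
def sparsity_summary(num_terms, stored_elements):
    """Return (density, average stored elements per term) for a square sparse matrix."""
    density = stored_elements / (num_terms * num_terms)
    per_term = stored_elements / num_terms
    return density, per_term

# Values copied from the output of the previous cell.
density, per_term = sparsity_summary(71897, 4878801)
print(f'density: {density:.4%}, stored elements per term: {per_term:.1f}')
```

The same helper can be applied to the outputs of the other cells to compare the matrices.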

### The text format

For baselines and for models with separate indices for text and math, we have a separate term similarity matrix with just text.

In [7]:
%%capture
! make levenshtein-similarity-matrix-text

In [8]:
%ls -lh levenshtein-similarity-matrix-text

-rw-rw-r-- 1 novotny novotny 26M May  6 13:53 levenshtein-similarity-matrix-text


In [9]:
SparseTermSimilarityMatrix.load('levenshtein-similarity-matrix-text').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 3276353 stored elements in Compressed Sparse Column format>

### The LaTeX format

For baselines and for models with separate indices for text and math, we have a separate term similarity matrix with just LaTeX.

In [10]:
%%capture
! make levenshtein-similarity-matrix-latex

In [11]:
%ls -lh levenshtein-similarity-matrix-latex

-rw-rw-r-- 1 novotny novotny 16M May  6 11:14 levenshtein-similarity-matrix-latex


In [12]:
SparseTermSimilarityMatrix.load('levenshtein-similarity-matrix-latex').matrix

<29772x29772 sparse matrix of type '<class 'numpy.float32'>'
	with 2010880 stored elements in Compressed Sparse Column format>

### The Tangent-L format

For baselines and for models with separate indices for text and math, we have a separate term similarity matrix with just the format used by [the Tangent-L search engine from UWaterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [13]:
%%capture
! make levenshtein-similarity-matrix-tangentl

In [14]:
%ls -lh levenshtein-similarity-matrix-tangentl

-rw-r--r-- 1 novotny novotny 62M May  7 00:06 levenshtein-similarity-matrix-tangentl


In [15]:
SparseTermSimilarityMatrix.load('levenshtein-similarity-matrix-tangentl').matrix

<100000x100000 sparse matrix of type '<class 'numpy.float32'>'
	with 8065212 stored elements in Compressed Sparse Column format>

## Word embedding similarities

Second, we will produce term similarity matrices based on semantic similarities using word embeddings extracted from language models.
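A minimal sketch of how a semantic term similarity could be derived from two embeddings: take their cosine similarity, discard values at or below a threshold, and raise the rest to an exponent. The `threshold` and `exponent` values and the toy vectors are assumptions for illustration only, not the settings used to build the matrices in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def embedding_similarity(u, v, threshold=0.0, exponent=2.0):
    """Assumed mapping: keep cosines above the threshold and raise them to an exponent."""
    similarity = cosine(u, v)
    return similarity ** exponent if similarity > threshold else 0.0

# Hypothetical two-dimensional embeddings.
print(embedding_similarity([1.0, 0.0], [1.0, 1.0]))
print(embedding_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Dropping non-positive cosines is what keeps the resulting matrix sparse: unrelated term pairs are simply not stored.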

### Non-positional `word2vec` embeddings

First, we will produce term similarity matrices based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

#### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [16]:
%%capture
! make word-embedding-similarity-matrix-text+latex

In [17]:
%ls -lh word-embedding-similarity-matrix-text+latex

-rw-r--r-- 1 novotny novotny 16M May  6 21:47 word-embedding-similarity-matrix-text+latex


In [18]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-text+latex').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 2022487 stored elements in Compressed Sparse Column format>

#### The text format

For models with separate indices for text and math, we have a separate term similarity matrix with just text.

In [19]:
%%capture
! make word-embedding-similarity-matrix-text

In [20]:
%ls -lh word-embedding-similarity-matrix-text

-rw-r--r-- 1 novotny novotny 19M May  7 21:15 word-embedding-similarity-matrix-text


In [21]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-text').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 2444223 stored elements in Compressed Sparse Column format>

#### The LaTeX format

For models with separate indices for text and math, we have a separate term similarity matrix with just LaTeX.

In [22]:
%%capture
! make word-embedding-similarity-matrix-latex

In [23]:
%ls -lh word-embedding-similarity-matrix-latex

-rw-rw-r-- 1 novotny novotny 6.0M May  6 14:03 word-embedding-similarity-matrix-latex


In [24]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-latex').matrix

<29772x29772 sparse matrix of type '<class 'numpy.float32'>'
	with 767746 stored elements in Compressed Sparse Column format>

#### The Tangent-L format

For models with separate indices for text and math, we have a separate term similarity matrix with just the format used by [the Tangent-L search engine from UWaterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [25]:
%%capture
! make word-embedding-similarity-matrix-tangentl

In [26]:
%ls -lh word-embedding-similarity-matrix-tangentl

-rw-r--r-- 1 novotny novotny 5.8M May  7 01:45 word-embedding-similarity-matrix-tangentl


In [27]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-tangentl').matrix

<100000x100000 sparse matrix of type '<class 'numpy.float32'>'
	with 701998 stored elements in Compressed Sparse Column format>

### Positional `word2vec` embeddings

Next, we will produce term similarity matrices based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
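As a sketch of the idea, assuming the usual formulation of positional weighting in which each context word vector is multiplied elementwise by a learned vector for its relative position before averaging, the context representation could be computed as follows. The function name and the toy vectors are hypothetical.

```python
def positional_context_vector(context_vectors, positional_vectors):
    """Average the context word vectors after reweighting each one elementwise
    by the vector that belongs to its position in the context window."""
    dimensions = len(context_vectors[0])
    total = [0.0] * dimensions
    for word_vector, position_vector in zip(context_vectors, positional_vectors):
        for k in range(dimensions):
            total[k] += word_vector[k] * position_vector[k]
    return [value / len(context_vectors) for value in total]

context = [[1.0, 2.0], [3.0, 4.0]]    # vectors of two context words
positions = [[1.0, 1.0], [0.5, 0.0]]  # one weight vector per context position
print(positional_context_vector(context, positions))
```

With all positional vectors set to ones, the function reduces to the plain average used by non-positional models.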

#### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [28]:
%%capture
! make word-embedding-similarity-matrix-text+latex-positional

In [29]:
%ls -lh word-embedding-similarity-matrix-text+latex-positional

-rw-r--r-- 1 novotny novotny 22M May  6 21:48 word-embedding-similarity-matrix-text+latex-positional


In [30]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-text+latex-positional').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 2812669 stored elements in Compressed Sparse Column format>

#### The text format

For baselines and for models with separate indices for text and math, we have a separate term similarity matrix with just text.

In [31]:
%%capture
! make word-embedding-similarity-matrix-text-positional

In [32]:
%ls -lh word-embedding-similarity-matrix-text-positional

-rw-r--r-- 1 novotny novotny 21M May  8 00:12 word-embedding-similarity-matrix-text-positional


In [33]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-text-positional').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 2657157 stored elements in Compressed Sparse Column format>

#### The LaTeX format

For models with separate indices for text and math, we have a separate term similarity matrix with just LaTeX.

In [34]:
%%capture
! make word-embedding-similarity-matrix-latex-positional

In [35]:
%ls -lh word-embedding-similarity-matrix-latex-positional

-rw-rw-r-- 1 novotny novotny 7.1M May  6 14:03 word-embedding-similarity-matrix-latex-positional


In [36]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-latex-positional').matrix

<29772x29772 sparse matrix of type '<class 'numpy.float32'>'
	with 906458 stored elements in Compressed Sparse Column format>

#### The Tangent-L format

For models with separate indices for text and math, we have a separate term similarity matrix with just the format used by [the Tangent-L search engine from UWaterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [37]:
%%capture
! make word-embedding-similarity-matrix-tangentl-positional

In [38]:
%ls -lh word-embedding-similarity-matrix-tangentl-positional

-rw-r--r-- 1 novotny novotny 9.7M May  7 01:38 word-embedding-similarity-matrix-tangentl-positional


In [39]:
SparseTermSimilarityMatrix.load('word-embedding-similarity-matrix-tangentl-positional').matrix

<100000x100000 sparse matrix of type '<class 'numpy.float32'>'
	with 1216694 stored elements in Compressed Sparse Column format>

### Decontextualized `roberta-base` embeddings

Next, we will produce term similarity matrices based on semantic similarities using the [decontextualized word embeddings][1] of [the `roberta-base` model][2].

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base
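A minimal sketch of decontextualization, assuming it means averaging the contextual embeddings of a word over all of its occurrences in a corpus. The function name, the input layout, and the toy vectors are hypothetical; extracting the contextual vectors from the model is not shown.

```python
from collections import defaultdict

def decontextualize(occurrences):
    """Average contextual vectors per word.

    occurrences is an iterable of (word, vector) pairs, one per occurrence."""
    sums = {}
    counts = defaultdict(int)
    for word, vector in occurrences:
        if word not in sums:
            sums[word] = [0.0] * len(vector)
        sums[word] = [total + value for total, value in zip(sums[word], vector)]
        counts[word] += 1
    return {word: [total / counts[word] for total in sums[word]] for word in sums}

# Made-up contextual vectors for three token occurrences.
toy_occurrences = [('bank', [1.0, 0.0]), ('bank', [0.0, 1.0]), ('river', [2.0, 2.0])]
print(decontextualize(toy_occurrences))
```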

In [40]:
%%capture
! make decontextualized-word-embedding-similarity-matrix-roberta-base

In [41]:
%ls -lh decontextualized-word-embedding-similarity-matrix-roberta-base

-rw-r--r-- 1 novotny novotny 720K May 25 19:45 decontextualized-word-embedding-similarity-matrix-roberta-base


In [42]:
SparseTermSimilarityMatrix.load('decontextualized-word-embedding-similarity-matrix-roberta-base').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 67179 stored elements in Compressed Sparse Column format>

### Decontextualized tuned `roberta-base` embeddings

Next, we will produce term similarity matrices based on semantic similarities using the [decontextualized word embeddings][1] of [the `roberta-base` model][2] fine-tuned so that it can represent math-specific tokens.

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base

In [43]:
%%capture
! make decontextualized-word-embedding-similarity-matrix-tuned-roberta-base-text+latex

In [44]:
%ls -lh decontextualized-word-embedding-similarity-matrix-tuned-roberta-base-text+latex

-rw-r--r-- 1 novotny novotny 1.3M May 25 19:49 decontextualized-word-embedding-similarity-matrix-tuned-roberta-base-text+latex


In [45]:
SparseTermSimilarityMatrix.load('decontextualized-word-embedding-similarity-matrix-tuned-roberta-base-text+latex').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 126649 stored elements in Compressed Sparse Column format>

## Combined similarities

Third, we will combine the Levenshtein and word embedding term similarity matrices into combined term similarity matrices that take into account both surface-level and semantic term similarities.
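The notebook does not say how the two matrices are combined. One plausible scheme, shown here purely as an assumption, is a weighted sum over the union of the stored term pairs. The sketch represents each sparse matrix as a dictionary from `(row, column)` pairs to values; the function name, the `weight` parameter, and the toy values are hypothetical.

```python
def combine(levenshtein, embedding, weight=0.5):
    """Hypothetical combination: weighted sum over the union of stored term pairs."""
    combined = {}
    for pair in set(levenshtein) | set(embedding):
        combined[pair] = (weight * levenshtein.get(pair, 0.0)
                          + (1.0 - weight) * embedding.get(pair, 0.0))
    return combined

# Toy sparse matrices over a three-term vocabulary (diagonal plus a few pairs).
toy_levenshtein = {(0, 0): 1.0, (1, 1): 1.0, (2, 2): 1.0, (0, 1): 0.4, (1, 0): 0.4}
toy_embedding = {(0, 0): 1.0, (1, 1): 1.0, (2, 2): 1.0, (1, 2): 0.6, (2, 1): 0.6}

toy_combined = combine(toy_levenshtein, toy_embedding)
print(len(toy_combined), sorted(toy_combined.items()))
```

Under such a scheme, the combined matrix stores at most as many elements as the two inputs together, and fewer wherever both inputs store the same pair, such as on the diagonal.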

### Non-positional `word2vec` embeddings

First, we will combine the Levenshtein similarities with semantic similarities from `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

#### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [46]:
%%capture
! make similarity-matrix-text+latex

In [47]:
%ls -lh similarity-matrix-text+latex

-rw-r--r-- 1 novotny novotny 52M May  8 22:25 similarity-matrix-text+latex


In [48]:
SparseTermSimilarityMatrix.load('similarity-matrix-text+latex').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 6728251 stored elements in Compressed Sparse Column format>

#### The text format

For models with separate indices for text and math, we have a separate term similarity matrix with just text.

In [49]:
%%capture
! make similarity-matrix-text

In [50]:
%ls -lh similarity-matrix-text

-rw-r--r-- 1 novotny novotny 43M May  7 21:15 similarity-matrix-text


In [51]:
SparseTermSimilarityMatrix.load('similarity-matrix-text').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 5608333 stored elements in Compressed Sparse Column format>

#### The LaTeX format

For models with separate indices for text and math, we have a separate term similarity matrix with just LaTeX.

In [52]:
%%capture
! make similarity-matrix-latex

In [53]:
%ls -lh similarity-matrix-latex

-rw-rw-r-- 1 novotny novotny 21M May  6 14:04 similarity-matrix-latex


In [54]:
SparseTermSimilarityMatrix.load('similarity-matrix-latex').matrix

<29772x29772 sparse matrix of type '<class 'numpy.float32'>'
	with 2626954 stored elements in Compressed Sparse Column format>

#### The Tangent-L format

For models with separate indices for text and math, we have a separate term similarity matrix with just the format used by [the Tangent-L search engine from UWaterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [55]:
%%capture
! make similarity-matrix-tangentl

In [56]:
%ls -lh similarity-matrix-tangentl

-rw-r--r-- 1 novotny novotny 65M May  7 01:45 similarity-matrix-tangentl


In [57]:
SparseTermSimilarityMatrix.load('similarity-matrix-tangentl').matrix

<100000x100000 sparse matrix of type '<class 'numpy.float32'>'
	with 8399748 stored elements in Compressed Sparse Column format>

### Positional `word2vec` embeddings

Next, we will combine the Levenshtein similarities with semantic similarities from `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

#### The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.

In [58]:
%%capture
! make similarity-matrix-text+latex-positional

In [59]:
%ls -lh similarity-matrix-text+latex-positional

-rw-r--r-- 1 novotny novotny 58M May  8 22:26 similarity-matrix-text+latex-positional


In [60]:
SparseTermSimilarityMatrix.load('similarity-matrix-text+latex-positional').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 7492943 stored elements in Compressed Sparse Column format>

#### The text format

For baselines and for models with separate indices for text and math, we have a separate term similarity matrix with just text.

In [61]:
%%capture
! make similarity-matrix-text-positional

In [62]:
%ls -lh similarity-matrix-text-positional

-rw-r--r-- 1 novotny novotny 45M May  8 00:12 similarity-matrix-text-positional


In [63]:
SparseTermSimilarityMatrix.load('similarity-matrix-text-positional').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 5800883 stored elements in Compressed Sparse Column format>

#### The LaTeX format

For models with separate indices for text and math, we have a separate term similarity matrix with just LaTeX.

In [64]:
%%capture
! make similarity-matrix-latex-positional

In [65]:
%ls -lh similarity-matrix-latex-positional

-rw-rw-r-- 1 novotny novotny 21M May  6 14:04 similarity-matrix-latex-positional


In [66]:
SparseTermSimilarityMatrix.load('similarity-matrix-latex-positional').matrix

<29772x29772 sparse matrix of type '<class 'numpy.float32'>'
	with 2729102 stored elements in Compressed Sparse Column format>

#### The Tangent-L format

For models with separate indices for text and math, we have a separate term similarity matrix with just the format used by [the Tangent-L search engine from UWaterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [67]:
%%capture
! make similarity-matrix-tangentl-positional

In [68]:
%ls -lh similarity-matrix-tangentl-positional

-rw-r--r-- 1 novotny novotny 68M May  7 01:38 similarity-matrix-tangentl-positional


In [69]:
SparseTermSimilarityMatrix.load('similarity-matrix-tangentl-positional').matrix

<100000x100000 sparse matrix of type '<class 'numpy.float32'>'
	with 8743780 stored elements in Compressed Sparse Column format>

### Decontextualized `roberta-base` embeddings

Next, we will combine the Levenshtein similarities with semantic similarities from the [decontextualized word embeddings][1] of [the `roberta-base` model][2].

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base

In [70]:
%%capture
! make decontextualized-similarity-matrix-roberta-base

In [71]:
%ls -lh decontextualized-similarity-matrix-roberta-base

-rw-r--r-- 1 novotny novotny 26M May 25 19:45 decontextualized-similarity-matrix-roberta-base


In [72]:
SparseTermSimilarityMatrix.load('decontextualized-similarity-matrix-roberta-base').matrix

<49559x49559 sparse matrix of type '<class 'numpy.float32'>'
	with 3292243 stored elements in Compressed Sparse Column format>

### Decontextualized tuned `roberta-base` embeddings

Finally, we will combine the Levenshtein similarities with semantic similarities from the [decontextualized word embeddings][1] of [the `roberta-base` model][2] fine-tuned so that it can represent math-specific tokens.

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base

In [73]:
%%capture
! make decontextualized-similarity-matrix-tuned-roberta-base-text+latex

In [74]:
%ls -lh decontextualized-similarity-matrix-tuned-roberta-base-text+latex

-rw-r--r-- 1 novotny novotny 38M May 25 19:49 decontextualized-similarity-matrix-tuned-roberta-base-text+latex


In [75]:
SparseTermSimilarityMatrix.load('decontextualized-similarity-matrix-tuned-roberta-base-text+latex').matrix

<71897x71897 sparse matrix of type '<class 'numpy.float32'>'
	with 4924485 stored elements in Compressed Sparse Column format>