# Produce ARQMath runs

In this notebook, we will produce runs on the ARQMath-1, ARQMath-2, and ARQMath-3 topics to be submitted to [the ARQMath-3 competition][1].

 [1]: https://www.cs.rit.edu/~dprl/ARQMath/

In [1]:
! hostname

mir


In [2]:
%%capture
! pip install .[scm,evaluation]

In [3]:
from pathlib import Path
from re import sub

from pandas import DataFrame

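def evaluate_run(basename: str) -> DataFrame:
    """Load the stored MAP and nDCG' scores of a run for the 2020 and 2021 topics."""
    years = 2020, 2021
    measures = 'MAP', "nDCG'"
    labels = {"MAP": "map", "nDCG'": "ndcg"}
    # Strip a trailing percent sign from MAP and anything after the first ", " from nDCG'.
    formatters = {"MAP": lambda x: sub('%$', '', x),
                  "nDCG'": lambda x: sub(', .*', '', x)}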

    rows = []
    for year in years:
        row = []
        year_directory = Path(f'submission{year}')
        for measure in measures:
            label = labels[measure]
            measure_file = f'{basename}.{label}_score'
            result_file = year_directory / measure_file
            with result_file.open('rt') as f:
                formatter = formatters[measure]
                result = f.read().rstrip('\r\n')
                result = formatter(result)
                result = float(result)
                row.append(result)
        rows.append(row)

    dataframe = DataFrame(rows, index=years, columns=measures)
    return dataframe
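
To make the two formatters in `evaluate_run` easier to follow, here is a small sketch of what they do to a score string. The example file contents are hypothetical and only inferred from the regular expressions above; the actual score files may look different.

```python
from re import sub

# Hypothetical file contents, inferred from the regular expressions in evaluate_run.
map_text = '1.15%'
ndcg_text = '0.137, extra details'

# A trailing percent sign is removed before the conversion to float.
map_score = float(sub('%$', '', map_text))
# Everything from the first ", " onwards is removed before the conversion to float.
ndcg_score = float(sub(', .*', '', ndcg_text))
```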

## The text format with no term similarities (baseline)

As our baseline, we will use a vector space model that uses just text and does not model any term similarities.
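
As a rough illustration of what "no term similarities" means, the following sketch scores a document against a query with the plain cosine similarity of bag-of-words vectors, where only exactly matching terms contribute. The use of raw term counts and whitespace tokenization is an assumption made for brevity; the notebook does not state the term weighting used by the actual runs.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(query: list[str], document: list[str]) -> float:
    """Cosine similarity of two bag-of-words vectors without term similarities."""
    query_counts, document_counts = Counter(query), Counter(document)
    # Only terms that occur in both texts contribute to the dot product.
    shared = query_counts.keys() & document_counts.keys()
    dot = sum(query_counts[t] * document_counts[t] for t in shared)
    norm = (sqrt(sum(c * c for c in query_counts.values()))
            * sqrt(sum(c * c for c in document_counts.values())))
    return dot / norm if norm else 0.0

query = 'prove that the series converges'.split()
answer = 'the series converges by the ratio test'.split()
score = cosine_similarity(query, answer)
```

Two texts that share no terms always receive a score of zero under this model, which is the limitation that the soft vector space models below try to address.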

In [4]:
%%capture
! make submission2020/SCM-task1-baseline_joint_text-text-auto-X.tsv
! make submission2021/SCM-task1-baseline_joint_text-text-auto-X.tsv
! make submission2022/SCM-task1-baseline_joint_text-text-auto-X.tsv

In [5]:
evaluate_run('SCM-task1-baseline_joint_text-text-auto-X')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 1.15 | 0.137 |
| 2021 | 0.87 | 0.103 |


## Joint soft vector space models

First, we will produce runs using soft vector space models that jointly model both text and math. Information retrieval systems based on joint soft vector space models allow users to request math information using natural language and vice versa.

### The text + LaTeX format with no term similarities (baseline)

As a baseline, we will use a joint vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and does not model any term similarities.
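
The joint format can be pictured as a single token sequence in which math is wrapped in `[MATH]` and `[/MATH]`. The sketch below is only an illustration: the `tokenize_joint` helper, the `$...$` delimiters, and the whitespace tokenization of LaTeX are assumptions, not the preprocessing used to build the runs.

```python
import re

def tokenize_joint(text: str) -> list[str]:
    """Turn text with inline $...$ math into one token sequence with [MATH] markers."""
    tokens = []
    # re.split with a capturing group places the math spans at odd indices.
    for index, part in enumerate(re.split(r'\$(.+?)\$', text)):
        if index % 2 == 1:
            tokens += ['[MATH]', *part.split(), '[/MATH]']
        else:
            tokens += part.lower().split()
    return tokens

tokens = tokenize_joint(r'Solve $x ^ 2 = 4$ for x')
```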

In [6]:
%%capture
! make submission2020/SCM-task1-baseline_joint_text+latex-both-auto-X.tsv
! make submission2021/SCM-task1-baseline_joint_text+latex-both-auto-X.tsv
! make submission2022/SCM-task1-baseline_joint_text+latex-both-auto-X.tsv

In [7]:
evaluate_run('SCM-task1-baseline_joint_text+latex-both-auto-X')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 3.3 | 0.222 |
| 2021 | 1.37 | 0.168 |


### The text + LaTeX format with non-positional `word2vec` embeddings

As an alternative run, we will use a joint soft vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and models term similarities based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
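
To show how term similarities enter the score, here is a minimal sketch of the soft cosine measure, where the inner product is taken through a term similarity matrix derived from word embeddings. The toy embeddings, the clipping of negative similarities at zero, and the helper names are assumptions for illustration; the notebook does not describe how the actual similarity matrix is built.

```python
from collections import Counter
from math import sqrt

# Toy 2-D embeddings with illustrative values, not trained word2vec vectors.
EMBEDDINGS = {
    'derivative': (1.0, 0.0),
    'gradient': (0.8, 0.6),
    'matrix': (0.0, 1.0),
}

def term_similarity(a: str, b: str) -> float:
    """Cosine similarity of two term embeddings, clipped at zero; identical terms score 1."""
    if a == b:
        return 1.0
    if a not in EMBEDDINGS or b not in EMBEDDINGS:
        return 0.0
    (x1, y1), (x2, y2) = EMBEDDINGS[a], EMBEDDINGS[b]
    cosine = (x1 * x2 + y1 * y2) / (sqrt(x1 * x1 + y1 * y1) * sqrt(x2 * x2 + y2 * y2))
    return max(cosine, 0.0)

def soft_inner(x: Counter, y: Counter) -> float:
    """Compute x^T S y, where S holds the pairwise term similarities."""
    return sum(wx * term_similarity(a, b) * wy
               for a, wx in x.items() for b, wy in y.items())

def soft_cosine(query: list[str], document: list[str]) -> float:
    """Soft cosine measure between two bags of words."""
    x, y = Counter(query), Counter(document)
    norm = sqrt(soft_inner(x, x)) * sqrt(soft_inner(y, y))
    return soft_inner(x, y) / norm if norm else 0.0

score = soft_cosine(['derivative'], ['gradient'])
```

Unlike the plain cosine similarity, the two single-term texts above receive a non-zero score even though they share no term, because their embeddings are similar.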

In [8]:
%%capture
! make submission2020/SCM-task1-joint_word2vec-both-auto-A.tsv
! make submission2021/SCM-task1-joint_word2vec-both-auto-A.tsv
! make submission2022/SCM-task1-joint_word2vec-both-auto-A.tsv

In [9]:
evaluate_run('SCM-task1-joint_word2vec-both-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 3.36 | 0.247 |
| 2021 | 1.48 | 0.183 |


### The text + LaTeX format with positional `word2vec` embeddings

As another alternative run, we will use a joint soft vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and models term similarities based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
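
As a hedged sketch of the idea behind positional weighting, the context representation used during training can be formed by multiplying each context word vector element-wise with a weight vector that belongs to its position in the window, and then averaging. The function name and the toy numbers below are hypothetical, and the linked library may implement the technique differently.

```python
def positional_context(window: list[list[float]],
                       positional_weights: list[list[float]]) -> list[float]:
    """Average context word vectors after element-wise reweighting by position."""
    total = [0.0] * len(window[0])
    for vector, weights in zip(window, positional_weights):
        # Each position in the context window has its own weight vector.
        total = [t + w * v for t, w, v in zip(total, weights, vector)]
    return [t / len(window) for t in total]

# Two context words with 2-D vectors and one weight vector per position.
window = [[1.0, 2.0], [3.0, 4.0]]
positional_weights = [[1.0, 0.0], [0.5, 1.0]]
context = positional_context(window, positional_weights)
```

Without positional weighting, every weight vector would be all ones and the result would reduce to the plain average of the context word vectors.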

In [10]:
%%capture
! make submission2020/SCM-task1-joint_positional_word2vec-both-auto-A.tsv
! make submission2021/SCM-task1-joint_positional_word2vec-both-auto-A.tsv
! make submission2022/SCM-task1-joint_positional_word2vec-both-auto-A.tsv

In [11]:
evaluate_run('SCM-task1-joint_positional_word2vec-both-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 3.36 | 0.247 |
| 2021 | 1.49 | 0.184 |


### The text format with decontextualized `roberta-base` embeddings

As another alternative run, we will use a joint soft vector space model that uses just text and models term similarities based on semantic similarities using the [decontextualized word embeddings][1] of [the `roberta-base` model][2].

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base
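
One way to read "decontextualized" is that each token receives a single static vector obtained by averaging its contextual embeddings over all of its occurrences in a corpus. The sketch below follows that reading, which is an assumption about the linked paper rather than something this notebook states. The contextual vectors are hypothetical placeholders for the outputs of a model such as `roberta-base`.

```python
def decontextualize(occurrences: list[tuple[str, list[float]]]) -> dict[str, list[float]]:
    """Average the contextual vectors of each token over all of its occurrences."""
    sums: dict[str, list[float]] = {}
    counts: dict[str, int] = {}
    for token, vector in occurrences:
        previous = sums.get(token, [0.0] * len(vector))
        sums[token] = [s + v for s, v in zip(previous, vector)]
        counts[token] = counts.get(token, 0) + 1
    return {token: [s / counts[token] for s in total] for token, total in sums.items()}

# Hypothetical contextual vectors of tokens observed in different sentences.
occurrences = [
    ('integral', [1.0, 0.0]),
    ('integral', [0.0, 1.0]),
    ('sum', [0.5, 0.5]),
]
static = decontextualize(occurrences)
```

The resulting static vectors can then be compared pairwise to obtain term similarities, in the same way as `word2vec` embeddings.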

In [12]:
%%capture
! make submission2020/SCM-task1-joint_roberta_base-text-auto-A.tsv
! make submission2021/SCM-task1-joint_roberta_base-text-auto-A.tsv
! make submission2022/SCM-task1-joint_roberta_base-text-auto-A.tsv

In [13]:
evaluate_run('SCM-task1-joint_roberta_base-text-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 1.02 | 0.129 |
| 2021 | 0.73 | 0.097 |


### The text + LaTeX format with decontextualized tuned `roberta-base` embeddings

As another alternative run, we will use a joint soft vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and models term similarities based on semantic similarities using the [decontextualized word embeddings][1] of [the `roberta-base` model][2] fine-tuned so that it can represent math-specific tokens.

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base

In [14]:
%%capture
! make submission2020/SCM-task1-joint_tuned_roberta_base-both-auto-A.tsv
! make submission2021/SCM-task1-joint_tuned_roberta_base-both-auto-A.tsv
! make submission2022/SCM-task1-joint_tuned_roberta_base-both-auto-A.tsv

In [15]:
evaluate_run('SCM-task1-joint_tuned_roberta_base-both-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 3.36 | 0.248 |
| 2021 | 1.48 | 0.184 |


## Interpolated soft vector space models

Secondly, we will produce runs using soft vector space models that model text and math separately and produce the final score of a document by interpolating scores for text and math. Interpolated soft vector space models are better theoretically motivated and more modular than joint vector space models, but they cannot model the similarities between text and math-specific tokens.
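
The interpolation step can be sketched as a weighted combination of the two per-document scores. The linear form, the `alpha` parameter, and the example scores below are assumptions for illustration; the notebook only states that the text and math scores are interpolated.

```python
def interpolate(text_score: float, math_score: float, alpha: float = 0.5) -> float:
    """Linearly interpolate text and math scores; alpha is the weight of the text score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError('alpha must lie in [0, 1]')
    return alpha * text_score + (1.0 - alpha) * math_score

# Hypothetical per-document scores produced by two separate models.
text_scores = {'doc1': 0.9, 'doc2': 0.2}
math_scores = {'doc1': 0.1, 'doc2': 0.8}

ranking = sorted(
    text_scores,
    key=lambda doc: interpolate(text_scores[doc], math_scores[doc], alpha=0.4),
    reverse=True,
)
```

Because the two models are scored independently, either one can be swapped out without retraining the other, which is the modularity mentioned above.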

### The text + LaTeX format with no term similarities (baseline)

As a baseline, we will use interpolated vector space models that use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and do not model any term similarities.

In [16]:
%%capture
! make submission2020/SCM-task1-baseline_interpolated_text+latex-both-auto-X.tsv
! make submission2021/SCM-task1-baseline_interpolated_text+latex-both-auto-X.tsv
! make submission2022/SCM-task1-baseline_interpolated_text+latex-both-auto-X.tsv

In [17]:
evaluate_run('SCM-task1-baseline_interpolated_text+latex-both-auto-X')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 2.17 | 0.208 |
| 2021 | 1.43 | 0.169 |


### The text + Tangent-L format with no term similarities (baseline)

As another baseline, we will use interpolated vector space models that use text and the format used by [the Tangent-L search engine from UWaterloo][1], and do not model any term similarities.

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [18]:
%%capture
! make submission2020/SCM-task1-baseline_interpolated_text+tangentl-both-auto-X.tsv
! make submission2021/SCM-task1-baseline_interpolated_text+tangentl-both-auto-X.tsv
! make submission2022/SCM-task1-baseline_interpolated_text+tangentl-both-auto-X.tsv

In [19]:
evaluate_run('SCM-task1-baseline_interpolated_text+tangentl-both-auto-X')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 3.74 | 0.293 |
| 2021 | 2.82 | 0.237 |


### The text + LaTeX format with non-positional `word2vec` embeddings

As an alternative run, we will use interpolated soft vector space models that use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens. The LaTeX soft vector space model uses term similarities based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [20]:
%%capture
! make submission2020/SCM-task1-interpolated_text+word2vec_latex-both-auto-A.tsv
! make submission2021/SCM-task1-interpolated_text+word2vec_latex-both-auto-A.tsv
! make submission2022/SCM-task1-interpolated_text+word2vec_latex-both-auto-A.tsv

In [21]:
evaluate_run('SCM-task1-interpolated_text+word2vec_latex-both-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 2.53 | 0.224 |
| 2021 | 1.58 | 0.186 |


### The text + LaTeX format with positional `word2vec` embeddings

As another alternative run, we will use interpolated soft vector space models that use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens. The LaTeX model uses term similarities based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [22]:
%%capture
! make submission2020/SCM-task1-interpolated_text+positional_word2vec_latex-both-auto-A.tsv
! make submission2021/SCM-task1-interpolated_text+positional_word2vec_latex-both-auto-A.tsv
! make submission2022/SCM-task1-interpolated_text+positional_word2vec_latex-both-auto-A.tsv

In [23]:
evaluate_run('SCM-task1-interpolated_text+positional_word2vec_latex-both-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 2.52 | 0.223 |
| 2021 | 1.61 | 0.186 |


### The text + Tangent-L format with non-positional `word2vec` embeddings

As another alternative run, we will use interpolated soft vector space models that use text and the Tangent-L format. The Tangent-L soft vector space model uses term similarities based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [24]:
%%capture
! make submission2020/SCM-task1-interpolated_text+word2vec_tangentl-both-auto-A.tsv
! make submission2021/SCM-task1-interpolated_text+word2vec_tangentl-both-auto-A.tsv
! make submission2022/SCM-task1-interpolated_text+word2vec_tangentl-both-auto-A.tsv

In [25]:
evaluate_run('SCM-task1-interpolated_text+word2vec_tangentl-both-auto-A')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 2.49 | 0.257 |
| 2021 | 2.19 | 0.199 |


### The text + Tangent-L format with positional `word2vec` embeddings

As our primary run, we will use interpolated soft vector space models that use text and the Tangent-L format. The Tangent-L soft vector space model uses term similarities based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [26]:
%%capture
! make submission2020/SCM-task1-interpolated_text+positional_word2vec_tangentl-both-auto-P.tsv
! make submission2021/SCM-task1-interpolated_text+positional_word2vec_tangentl-both-auto-P.tsv
! make submission2022/SCM-task1-interpolated_text+positional_word2vec_tangentl-both-auto-P.tsv

In [27]:
evaluate_run('SCM-task1-interpolated_text+positional_word2vec_tangentl-both-auto-P')

| Year | MAP | nDCG' |
|------|-----|-------|
| 2020 | 2.36 | 0.254 |
| 2021 | 1.98 | 0.197 |
