# Produce ARQMath runs

In this notebook, we will produce runs on the ARQMath-1, ARQMath-2, and ARQMath-3 topics to be submitted to [the ARQMath-3 competition][1].

 [1]: https://www.cs.rit.edu/~dprl/ARQMath/

In [1]:
! hostname

docker.apollo.fi.muni.cz


In [2]:
%%capture
! pip install .[scm,evaluation]

## Joint soft vector space models

First, we will produce runs using soft vector space models that jointly model both text and math. Information retrieval systems based on joint soft vector space models allow users to request math information using natural language and vice versa.
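
To make the idea concrete, here is a minimal sketch of how a soft vector space model can score a text-only query against a math-only document. It assumes the soft cosine measure with a term similarity matrix `S`; the vocabulary, the vectors, and the similarity values are made up for illustration and are not taken from the actual runs.

```python
from math import sqrt

def bilinear(x, S, y):
    """Compute x^T S y for plain Python lists."""
    return sum(x[i] * S[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft cosine similarity of bag-of-words vectors x and y under S."""
    denominator = sqrt(bilinear(x, S, x)) * sqrt(bilinear(y, S, y))
    return bilinear(x, S, y) / denominator if denominator > 0 else 0.0

# Hypothetical vocabulary: ['derivative', '\\frac', 'integral']
query = [1.0, 0.0, 0.0]     # contains only the text term
document = [0.0, 1.0, 0.0]  # contains only the math term

# Identity matrix: no term similarities (a regular vector space model)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Made-up similarities that relate the text term to the math term
similarities = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

hard_score = soft_cosine(query, document, identity)
soft_score = soft_cosine(query, document, similarities)
print(hard_score, soft_score)
```

Without term similarities the query and the document share no terms and the score is zero, whereas the made-up similarity between the text term and the math term yields a non-zero score.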

In [3]:
import json

from pandas import DataFrame

def evaluate_joint_run(basename: str) -> DataFrame:
    with open(f'submission/{basename}.alpha_and_gamma', 'rt') as f:
        alpha_and_gamma = json.load(f)
    if 'alpha' in alpha_and_gamma or 'gamma' in alpha_and_gamma:
        raise ValueError(f'Joint system from run {basename} is not yet optimized')

    alpha = alpha_and_gamma['best_alpha']
    gamma = alpha_and_gamma['best_gamma']
    
    with open(f'submission/{basename}.ndcg_score', 'rt') as f:
        ndcg = f.read()

    ndcg, *_ = ndcg.split(', ')
    ndcg = float(ndcg)

    formatters = {"alpha": lambda alpha: f'{alpha:.1f}',
                  "gamma": lambda gamma: f'{gamma:g}',
                  "ndcg": lambda ndcg: f'{ndcg:.3f}'}

    rows = 'ARQMath-3',
    columns = 'α', 'γ', "NDCG'"
    data = [[alpha, gamma, ndcg]]

    dataframe = DataFrame(data, columns=columns, index=rows)

    return dataframe

### The text format with no term similarities (baseline)

As our baseline, we will use a vector space model that uses just text and does not model any term similarities.

In [4]:
%%capture
! make submission/SCM-task1-baseline_joint_text-text-auto-X.tsv

In [5]:
evaluate_joint_run('SCM-task1-baseline_joint_text-text-auto-X')

|           | α   | γ | NDCG' |
|-----------|-----|---|-------|
| ARQMath-3 | 0.0 | 2 | 0.235 |


### The text + LaTeX format with no term similarities (baseline)

As another baseline, we will use a joint vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and does not model any term similarities.
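
As an illustration of the joint format, the sketch below turns a post into a single token sequence in which math is delimited by `[MATH]` and `[/MATH]`. The function `tokenize_joint`, the `$...$` delimiters, and the naive LaTeX tokenizer are assumptions made for this example and may differ from the preprocessing actually used for the runs.

```python
import re

def tokenize_joint(text: str) -> list[str]:
    """Tokenize text with inline $...$ math into one joint token sequence."""
    tokens = []
    # re.split with a capture group keeps the math spans at odd indices
    for index, part in enumerate(re.split(r'\$(.+?)\$', text)):
        if index % 2 == 1:
            # Naive LaTeX tokenizer: commands or single non-space characters
            math_tokens = re.findall(r'\\[A-Za-z]+|\S', part)
            tokens += ['[MATH]', *math_tokens, '[/MATH]']
        else:
            # Naive text tokenizer: lowercased runs of word characters
            tokens += re.findall(r'\w+', part.lower())
    return tokens

tokens = tokenize_joint(r'Solve $x^2 = \frac{1}{4}$ for x')
print(tokens)
```

Because text and math tokens end up in one vocabulary, a single model can relate them to each other.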

In [6]:
%%capture
! make submission/SCM-task1-baseline_joint_text+latex-both-auto-X.tsv

In [7]:
evaluate_joint_run('SCM-task1-baseline_joint_text+latex-both-auto-X')

|           | α   | γ | NDCG' |
|-----------|-----|---|-------|
| ARQMath-3 | 0.0 | 3 | 0.224 |


### The text + LaTeX format with non-positional `word2vec` embeddings

As an alternative run, we will use a joint soft vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and models term similarities based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine
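
The following sketch shows one way to derive term similarities from word embeddings, assuming that the similarity of two terms is the cosine of their embedding vectors. The two-dimensional vectors are made up for illustration; the real `word2vec` embeddings are trained by the `make` targets in this notebook.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two vectors given as plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# Hypothetical embeddings of one text term and two math-related terms
embeddings = {
    'sum': [1.0, 0.0],
    '\\sum': [1.0, 1.0],
    'matrix': [0.0, 1.0],
}

terms = sorted(embeddings)
# Term similarity matrix with rows and columns ordered as in `terms`
similarity_matrix = [[cosine(embeddings[s], embeddings[t]) for t in terms]
                     for s in terms]

for term, row in zip(terms, similarity_matrix):
    print(term, [round(value, 3) for value in row])
```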

In [8]:
%%capture
! make submission/SCM-task1-joint_word2vec-both-auto-A.tsv

In [9]:
evaluate_joint_run('SCM-task1-joint_word2vec-both-auto-A')

|           | α   | γ | NDCG' |
|-----------|-----|---|-------|
| ARQMath-3 | 0.6 | 5 | 0.251 |


### The text + LaTeX format with positional `word2vec` embeddings

As another alternative run, we will use a joint soft vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and models term similarities based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [10]:
%%capture
! make submission/SCM-task1-joint_positional_word2vec-both-auto-A.tsv

In [11]:
evaluate_joint_run('SCM-task1-joint_positional_word2vec-both-auto-A')

|           | α   | γ | NDCG' |
|-----------|-----|---|-------|
| ARQMath-3 | 0.7 | 5 | 0.249 |


### The text format with decontextualized `roberta-base` embeddings

As another alternative run, we will use a joint soft vector space model that uses just text and models term similarities based on semantic similarities using the [decontextualized word embeddings][1] of [the `roberta-base` model][2].

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base
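
As a rough sketch, and assuming that decontextualization means averaging the contextual embeddings of a token over all its occurrences in a corpus, static embeddings could be derived as follows. The function `decontextualize` and the two-dimensional vectors are hypothetical; in practice, the vectors would come from the `roberta-base` model.

```python
from collections import defaultdict

def decontextualize(occurrences):
    """Average the contextual vectors of each token over all its occurrences."""
    sums, counts = {}, defaultdict(int)
    for token, vector in occurrences:
        if token not in sums:
            sums[token] = [0.0] * len(vector)
        sums[token] = [s + v for s, v in zip(sums[token], vector)]
        counts[token] += 1
    return {token: [s / counts[token] for s in total]
            for token, total in sums.items()}

# Made-up contextual vectors: 'bank' occurs in two different contexts
occurrences = [
    ('bank', [1.0, 0.0]),
    ('bank', [0.0, 1.0]),
    ('river', [0.5, 0.5]),
]
static = decontextualize(occurrences)
print(static)
```

The resulting static vectors can then be compared in the same way as `word2vec` embeddings to obtain term similarities.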

In [12]:
%%capture
! make submission/SCM-task1-joint_roberta_base-text-auto-A.tsv

In [13]:
evaluate_joint_run('SCM-task1-joint_roberta_base-text-auto-A')

|           | α   | γ | NDCG' |
|-----------|-----|---|-------|
| ARQMath-3 | 0.6 | 2 | 0.247 |


### The text + LaTeX format with decontextualized tuned `roberta-base` embeddings

As another alternative run, we will use a joint soft vector space model that uses text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and models term similarities based on semantic similarities using the [decontextualized word embeddings][1] of [the `roberta-base` model][2] fine-tuned so that it can represent math-specific tokens.

 [1]: https://aclanthology.org/2021.wmt-1.112
 [2]: https://huggingface.co/roberta-base

In [14]:
%%capture
! make submission/SCM-task1-joint_tuned_roberta_base-both-auto-A.tsv

In [15]:
evaluate_joint_run('SCM-task1-joint_tuned_roberta_base-both-auto-A')

|           | α   | γ | NDCG' |
|-----------|-----|---|-------|
| ARQMath-3 | 0.6 | 4 | 0.249 |


## Interpolated soft vector space models

Secondly, we will produce runs using soft vector space models that model text and math separately and produce the final score of a document by interpolating scores for text and math. Interpolated soft vector space models are better theoretically motivated and more modular than joint vector space models, but they cannot model the similarities between text and math-specific tokens.
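
The interpolation can be sketched as a weighted sum of the two scores. The sketch assumes that β weights the text score and 1 − β weights the math score; which side β applies to in the actual system is an assumption, and the document identifiers and scores are made up.

```python
def interpolate(text_scores, math_scores, beta):
    """Linearly interpolate per-document text and math scores."""
    documents = set(text_scores) | set(math_scores)
    return {
        # A document missing from one system gets a score of zero there
        doc: beta * text_scores.get(doc, 0.0)
             + (1 - beta) * math_scores.get(doc, 0.0)
        for doc in documents
    }

# Hypothetical scores produced by the two separate models
text_scores = {'A.1': 0.8, 'A.2': 0.25}
math_scores = {'A.2': 0.75, 'A.3': 0.5}

final_scores = interpolate(text_scores, math_scores, beta=0.5)
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
print(ranking)
```

Since the two models are only combined at the score level, either one can be replaced without retraining the other.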

In [16]:
import json

from pandas import DataFrame

def evaluate_interpolated_run(basename: str) -> DataFrame:
    with open(f'submission/{basename}.first_alpha_and_gamma', 'rt') as f:
        first_alpha_and_gamma = json.load(f)
    if 'alpha' in first_alpha_and_gamma or 'gamma' in first_alpha_and_gamma:
        raise ValueError(f'First system from run {basename} is not yet optimized')

    first_alpha = first_alpha_and_gamma['best_alpha']
    first_gamma = first_alpha_and_gamma['best_gamma']

    with open(f'submission/{basename}.second_alpha_and_gamma', 'rt') as f:
        second_alpha_and_gamma = json.load(f)
    if 'alpha' in second_alpha_and_gamma or 'gamma' in second_alpha_and_gamma:
        raise ValueError(f'Second system from run {basename} is not yet optimized')

    second_alpha = second_alpha_and_gamma['best_alpha']
    second_gamma = second_alpha_and_gamma['best_gamma']

    with open(f'submission/{basename}.beta', 'rt') as f:
        _beta = json.load(f)
    if 'beta' in _beta:
        raise ValueError(f'Interpolated system from run {basename} is not yet optimized')
    
    beta = _beta['best_beta']
    
    with open(f'submission/{basename}.ndcg_score', 'rt') as f:
        ndcg = f.read()

    ndcg, *_ = ndcg.split(', ')
    ndcg = float(ndcg)
        
    formatters = {"first_alpha": lambda alpha: f'{alpha:.1f}',
                  "first_gamma": lambda gamma: f'{gamma:g}',
                  "second_alpha": lambda alpha: f'{alpha:.1f}',
                  "second_gamma": lambda gamma: f'{gamma:g}',
                  "beta": lambda beta: f'{beta:.1f}',
                  "ndcg": lambda ndcg: f'{ndcg:.3f}'}

    rows = 'ARQMath-3',
    columns = 'α₁', 'γ₁', 'α₂', 'γ₂', 'β', "NDCG'"
    data = [[first_alpha, first_gamma, second_alpha, second_gamma, beta, ndcg]]

    dataframe = DataFrame(data, columns=columns, index=rows)

    return dataframe

### The text + LaTeX format with no term similarities (baseline)

As a baseline, we will use interpolated vector space models that use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens, and do not model any term similarities.

In [17]:
%%capture
! make submission/SCM-task1-baseline_interpolated_text+latex-both-auto-X.tsv

In [18]:
evaluate_interpolated_run('SCM-task1-baseline_interpolated_text+latex-both-auto-X')

|           | α₁  | γ₁ | α₂  | γ₂ | β   | NDCG' |
|-----------|-----|----|-----|----|-----|-------|
| ARQMath-3 | 0.0 | 2  | 0.0 | 5  | 0.6 | 0.257 |


### The text + Tangent-L format with no term similarities (baseline)

As another baseline, we will use interpolated vector space models that use text and the format used by [the Tangent-L search engine from UWaterloo][1], and do not model any term similarities.

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [19]:
%%capture
! make submission/SCM-task1-baseline_interpolated_text+tangentl-both-auto-X.tsv

In [20]:
evaluate_interpolated_run('SCM-task1-baseline_interpolated_text+tangentl-both-auto-X')

|           | α₁  | γ₁ | α₂  | γ₂ | β   | NDCG' |
|-----------|-----|----|-----|----|-----|-------|
| ARQMath-3 | 0.0 | 2  | 0.0 | 4  | 0.6 | 0.349 |


### The text + LaTeX format with non-positional `word2vec` embeddings

As an alternative run, we will use interpolated soft vector space models that use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens. The LaTeX soft vector space model uses term similarities based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [21]:
%%capture
! make submission/SCM-task1-interpolated_word2vec_text+latex-both-auto-A.tsv

In [22]:
evaluate_interpolated_run('SCM-task1-interpolated_word2vec_text+latex-both-auto-A')

|           | α₁  | γ₁ | α₂  | γ₂ | β   | NDCG' |
|-----------|-----|----|-----|----|-----|-------|
| ARQMath-3 | 0.6 | 2  | 1.0 | 5  | 0.6 | 0.288 |


### The text + LaTeX format with positional `word2vec` embeddings

As another alternative run, we will use interpolated soft vector space models that use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens. The LaTeX model uses term similarities based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [23]:
%%capture
! make submission/SCM-task1-interpolated_positional_word2vec_text+latex-both-auto-A.tsv

In [24]:
evaluate_interpolated_run('SCM-task1-interpolated_positional_word2vec_text+latex-both-auto-A')

|           | α₁  | γ₁ | α₂  | γ₂ | β   | NDCG' |
|-----------|-----|----|-----|----|-----|-------|
| ARQMath-3 | 0.7 | 2  | 1.0 | 5  | 0.6 | 0.288 |


### The text + Tangent-L format with non-positional `word2vec` embeddings

As another alternative run, we will use interpolated soft vector space models that use text and the Tangent-L format. The Tangent-L soft vector space model uses term similarities based on semantic similarities using `word2vec` models without [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [25]:
%%capture
! make submission/SCM-task1-interpolated_word2vec_text+tangentl-both-auto-A.tsv

In [26]:
evaluate_interpolated_run('SCM-task1-interpolated_word2vec_text+tangentl-both-auto-A')

|           | α₁  | γ₁ | α₂  | γ₂ | β   | NDCG' |
|-----------|-----|----|-----|----|-----|-------|
| ARQMath-3 | 0.6 | 2  | 0.0 | 5  | 0.7 | 0.351 |


### The text + Tangent-L format with positional `word2vec` embeddings

As our primary run, we will use interpolated soft vector space models that use text and the Tangent-L format. The Tangent-L soft vector space model uses term similarities based on semantic similarities using `word2vec` models with [positional weighting][1].

 [1]: https://github.com/MIR-MU/pine

In [27]:
%%capture
! make submission/SCM-task1-interpolated_positional_word2vec_text+tangentl-both-auto-P.tsv

In [28]:
evaluate_interpolated_run('SCM-task1-interpolated_positional_word2vec_text+tangentl-both-auto-P')

|           | α₁  | γ₁ | α₂  | γ₂ | β   | NDCG' |
|-----------|-----|----|-----|----|-----|-------|
| ARQMath-3 | 0.7 | 2  | 0.0 | 5  | 0.7 | 0.355 |
