# Produce dictionaries

In this notebook, we produce the dictionaries that are used in the final systems.

In [1]:
! hostname

mir


In [2]:
%%capture
! pip install .[word2vec]

In [3]:
from gensim.corpora import Dictionary

## The text + LaTeX format

As our primary representation, we use text and LaTeX separated by special `[MATH]` and `[/MATH]` tokens.
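To make the representation concrete, the following sketch splits a single string into alternating text and LaTeX segments. It is only an illustration: the `[MATH]` and `[/MATH]` delimiters come from the description above, while the function name `split_text_and_latex`, the regular expression, and the example sentence are assumptions and not the code behind `make dictionary-text+latex`.

```python
import re

# The delimiters are taken from the description above; the pattern is an assumption.
MATH_SPAN = re.compile(r'\[MATH\](.*?)\[/MATH\]', re.DOTALL)


def split_text_and_latex(document):
    """Yield ('text', segment) and ('math', segment) pairs in document order."""
    position = 0
    for match in MATH_SPAN.finditer(document):
        # Everything between the previous math span and this one is plain text.
        text = document[position:match.start()].strip()
        if text:
            yield 'text', text
        yield 'math', match.group(1).strip()
        position = match.end()
    # Trailing text after the last math span, if any.
    tail = document[position:].strip()
    if tail:
        yield 'text', tail


# Hypothetical example sentence, not taken from the actual corpus.
example = 'The identity [MATH] e^{i\\pi} + 1 = 0 [/MATH] is due to Euler.'
segments = list(split_text_and_latex(example))
for kind, segment in segments:
    print(kind, repr(segment))
```

The non-greedy `.*?` keeps each match confined to a single `[MATH]` ... `[/MATH]` pair, so a document with several formulae would yield one `math` segment per formula.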

In [4]:
%%capture
! make dictionary-text+latex

In [5]:
%ls -lh dictionary-text+latex

-rw-r--r-- 1 novotny novotny 2.3M May  6 20:36 dictionary-text+latex


In [6]:
len(Dictionary.load('dictionary-text+latex'))

71897

In [7]:
Dictionary.load('dictionary-text+latex').most_common(10)

[('Ġs', 65257098),
 ('Ġ}', 56983202),
 ('Ġwe', 51958468),
 ('Ġe', 50470017),
 ('Ġwith', 48166926),
 ('ĠThe', 47652222),
 ('Ġby', 43615246),
 ('Ġ)', 42822233),
 ('Ġbe', 42302554),
 ('Ġare', 41573593)]

## The text format

For baselines and for models with separate indices for text and math, we produce a separate dictionary that contains only text.

In [8]:
%%capture
! make dictionary-text

In [9]:
%ls -lh dictionary-text

-rw-rw-r-- 1 novotny novotny 1.6M May  6 13:48 dictionary-text


In [10]:
len(Dictionary.load('dictionary-text'))

49559

In [11]:
Dictionary.load('dictionary-text').most_common(10)

[('-', 83532676),
 (')', 79652058),
 ('Ġ.', 59955611),
 ('Ġ,', 59671401),
 ('Ġwe', 52938848),
 ('Ġwith', 49090999),
 ('ĠThe', 47649681),
 ('Ġby', 44548555),
 ('Ġare', 42455468),
 ('Ġbe', 40728618)]

## The LaTeX format

For models with separate indices for text and math, we produce a separate dictionary that contains only LaTeX.
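As a rough illustration of what separate indices imply for the vocabularies, the sketch below routes text and math into two independent frequency tables. It makes several assumptions: documents use `[MATH]` and `[/MATH]` delimiters, whitespace splitting stands in for whatever tokenizers the real pipeline uses, and `collections.Counter` stands in for a gensim `Dictionary`. The function name `count_tokens_separately` and the two example documents are hypothetical.

```python
import re
from collections import Counter

# Assumed delimiters for math spans inside a document.
MATH_SPAN = re.compile(r'\[MATH\](.*?)\[/MATH\]', re.DOTALL)


def count_tokens_separately(documents):
    """Return (text_counts, latex_counts) over whitespace-split tokens."""
    text_counts, latex_counts = Counter(), Counter()
    for document in documents:
        # With one capturing group, re.split alternates: text, math, text, ...
        parts = MATH_SPAN.split(document)
        for index, part in enumerate(parts):
            target = latex_counts if index % 2 else text_counts
            # Whitespace tokenization is a placeholder for the real tokenizers.
            target.update(part.split())
    return text_counts, latex_counts


# Hypothetical toy documents, not taken from the actual corpus.
documents = [
    'Let [MATH] x + y [/MATH] be a sum.',
    'Then [MATH] x [/MATH] is a term.',
]
text_counts, latex_counts = count_tokens_separately(documents)
print(text_counts.most_common(3))
print(latex_counts.most_common(3))
```

Keeping the two tables apart means that a symbol such as `x` is only ever counted as math, which is the property that separate text and math indices rely on.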

In [12]:
%%capture
! make dictionary-latex

In [13]:
%ls -lh dictionary-latex

-rw-rw-r-- 1 novotny novotny 946K May  6 06:01 dictionary-latex


In [14]:
len(Dictionary.load('dictionary-latex'))

29772

In [15]:
Dictionary.load('dictionary-latex').most_common(10)

[('%', 11245581),
 ('\\displaystyle', 10412723),
 ('\\', 5876856),
 ('}', 5349242),
 ('x', 5306850),
 ('p', 5078834),
 ('A', 5032629),
 ('n', 5023353),
 ('}%', 4991690),
 ('k', 4817987)]

## The Tangent-L format

For models with separate indices for text and math, we produce a separate dictionary that contains only tokens in the format used by [the Tangent-L search engine from UWaterloo][1].

 [1]: http://ceur-ws.org/Vol-2936/paper-05.pdf

In [16]:
%%capture
! make dictionary-tangentl

In [17]:
%ls -lh dictionary-tangentl

-rw-r--r-- 1 novotny novotny 3.3M May  7 00:01 dictionary-tangentl


In [18]:
len(Dictionary.load('dictionary-tangentl'))

100000

In [19]:
Dictionary.load('dictionary-tangentl').most_common(10)

[('(n!2,!0)', 8593592),
 ('(n!1,!0)', 7957511),
 ('(m!()1x1,[n,w])', 7364114),
 ('(f!,[n,o,u])', 3748478),
 ('(n!0,!0)', 3437038),
 ('(v!n,!0)', 3367907),
 ('(v!x,!0)', 3153294),
 ('(f!,[o,u])', 2400971),
 ('(m!()1x1,=,n)', 2283522),
 ('(v!t,!0)', 2198859)]