Python tools for creating a suitable dataset for OpenAI's im2latex task: https://openai.com/requests-for-research/#im2latex. You can download a prebuilt dataset from here. The data is split into train (~84k), validation (~9k) and test (~10k) sets, which may not be enough for this task. I can build bigger sets on request.
Note: This code is very ad hoc and requires tinkering with the source.
- To provide a dataset suitable for solving the im2latex task
  - So people can compare performance between systems
- To provide the tools used to generate said dataset
  - So people can generate different kinds of images (quality, size), different formulas (different fonts), etc.
- Misc tools for handling the datasets
  - TeX math tokenizer (possibly)
  - Performance metric (takes a list of true formulas and a list of estimated formulas, outputs performance/accuracy)
  - Tools for modifying the images as wanted
/src/latex2formulas.py
- Script for parsing downloaded LaTeX sources for formulas. Stores formulas in a single .txt file (one formula per line)
/src/stackexchange2formulas.py
- Similar to latex2formulas.py, but for parsing StackExchange XMLs.
/src/arxiv2formulas.py
- Similar to latex2formulas.py, but for parsing arXiv .tar/.tar.gz files (source downloads).
/src/formula2image.py
- Creates images and dataset from a file of formulas
/src/im2latex_utils.py
- Collection of misc functions for handling these formulas
latex_urls.txt
- Text file containing URLs to LaTeX dataset from here. Use wget -i latex_urls.txt to download these files.
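To illustrate the kind of extraction the ...2formulas.py scripts perform, here is a minimal sketch. It is not the repository's code: the function name extract_formulas, the regular expression, and the choice to handle only equation environments, \[...\] and $$...$$ are all assumptions made for this example. Real LaTeX sources contain many more cases (inline math, align environments, macros).

```python
import re

# Assumed patterns for display math; the actual scripts may use different rules.
FORMULA_PATTERN = re.compile(
    r"\\begin\{equation\*?\}(.*?)\\end\{equation\*?\}"
    r"|\\\[(.*?)\\\]"
    r"|\$\$(.*?)\$\$",
    re.DOTALL,
)


def extract_formulas(latex_source):
    """Return display-math formulas found in a LaTeX string, one string each."""
    formulas = []
    for match in FORMULA_PATTERN.finditer(latex_source):
        # Only the group belonging to the matched alternative is not None.
        body = next(group for group in match.groups() if group is not None)
        # Collapse whitespace so every formula fits on a single line.
        body = " ".join(body.split())
        if body:
            formulas.append(body)
    return formulas


if __name__ == "__main__":
    sample = r"Text \begin{equation} a^2 + b^2 = c^2 \end{equation} more $$x = 1$$"
    for formula in extract_formulas(sample):
        print(formula)
```

Collapsing whitespace matters because the formula file stores one formula per line, so a multi-line equation has to be flattened before it is written out.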
- Python 2.x or 3.x (only run on 2.x, should work on 3.x too; not tested on Windows)
- For running the script with current settings and generating full-page images:
  - Properly installed LaTeX-to-PDF chain (e.g. calling pdflatex outputs a .pdf for a .tex file)
  - ImageMagick installed so that the convert command works
- For creating more compact images of formulas (image cropped so that the formula fits):
  - textogif and its dependencies
  - textogif needs to be placed in the same directory where images are generated, otherwise it won't work.
- For tokenizing LaTeX, the Python package plasTeX is required.
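Before starting a long run it can help to confirm that the external tools listed above are reachable. The following is a small sketch, not part of the repository; the helper name find_missing_tools is made up for this example, and it relies on shutil.which, which is only available on Python 3.3 and later.

```python
import shutil


def find_missing_tools(tools=("pdflatex", "convert")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("pdflatex and convert were found on PATH")
```

This only checks that an executable with that name exists; it does not verify that the LaTeX installation has the packages a given formula needs.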
- Download arXiv tarballs by following the instructions here and put them under a directory <TARDIR>.
- Run python src/arxiv2tabulars.py <TARDIR>. A file tabulars.txt, which contains all extracted tabulars, will be created.
- Run python src/tabular2image.py tabulars.txt. It will create im2latex.lst, which contains the image filenames and the corresponding tabular ids; im2latex_tabulars.lst, which lists all tabulars starting from id 0; the folder tabular_images, containing all images; and the folder tabular_pdfs, containing all PDFs. pdflatex and ImageMagick are required.
- Run python src/deduplicate.py im2latex.lst im2latex_tabulars.lst. It will remove duplicate tables.
- Run python src/split_train_val_test.py im2latex.lst im2latex_tabulars.lst. It will split the data into train, validation and test sets at article level. im2latex_train.lst, im2latex_validate.lst and im2latex_test.lst will be generated.
- Run python src/tokenize.py im2latex_tabulars.lst im2latex_tabulars.tok.txt. It will tokenize the tabulars. The Python package plasTeX is required.
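Splitting "at article level" means that all tabulars from one article land in the same set, so near-identical tables from a single paper cannot leak between train and test. The sketch below shows one way to get that property; it is not how split_train_val_test.py is implemented. The function names, the (article_id, tabular_id) input shape, and the 10%/10% fractions are all assumptions for illustration.

```python
import hashlib


def assign_split(article_id, val_fraction=0.1, test_fraction=0.1):
    """Map an article id to 'train', 'validate' or 'test' deterministically."""
    digest = hashlib.md5(article_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / float(16 ** 8)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validate"
    return "train"


def split_by_article(items):
    """Group (article_id, tabular_id) pairs so no article spans two sets."""
    splits = {"train": [], "validate": [], "test": []}
    for article_id, tabular_id in items:
        splits[assign_split(article_id)].append(tabular_id)
    return splits


if __name__ == "__main__":
    # Hypothetical article ids and tabular ids.
    example = [("article-a", 0), ("article-a", 1), ("article-b", 2)]
    print(split_by_article(example))
```

Hashing the article id rather than shuffling keeps the assignment stable across runs, so adding new articles later does not move existing ones between sets.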
- Download a bunch of LaTeX sources packed in .tar files (by using latex_urls.txt, for example)
- Run python latex2formulas.py [directory where .tars are stored]
- Run python formula2image.py [path to generated formula text file]
- Run python formula2image.py [dataset_file] [formula_file] [image_dir] to confirm the dataset is valid
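Image generation calls external programs such as pdflatex, and pdflatex can occasionally hang on a malformed formula. One possible mitigation is to run each external command with a timeout. The sketch below is an assumption about how that could be done, not what formula2image.py does; run_with_timeout is a made-up helper and it needs Python 3.5 or later for subprocess.run.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds=30):
    """Run `command`; return True on exit code 0, False on failure or timeout."""
    try:
        completed = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,   # never wait for interactive input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in for a stuck compiler: a Python child that sleeps too long.
    # A real call might look like (formula.tex is a hypothetical file name):
    #   run_with_timeout(["pdflatex", "-interaction=nonstopmode", "formula.tex"])
    stuck = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_timeout(stuck, timeout_seconds=1))
```

A formula whose command times out can simply be skipped, which keeps one bad input from blocking the whole run.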
- The end result should have two files and one directory (names can be changed in formula2image.py):
  - im2latex.lst
    - Each line is in format formula_idx image_name render_type
      - formula_idx is the line number where the formula is in im2latex_formulas.lst
      - image_name is the name of the image connected to this rendering (without '.png')
      - render_type is the name of the render setup used, defined in formula2image.py
  - im2latex_formulas.lst
    - Each line contains one formula
  - /formula_images
    - Directory where images are stored
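The file layout described above can be read back into (image, formula) pairs with a few lines of Python. This is a sketch under stated assumptions rather than code from the repository: the function names are made up, fields are assumed to be separated by whitespace, and whether formula_idx starts at 0 or 1 is not stated here, so it is exposed as the first_index parameter.

```python
def parse_dataset_line(line):
    """Split one im2latex.lst line into (formula_idx, image_name, render_type)."""
    formula_idx, image_name, render_type = line.split()
    return int(formula_idx), image_name, render_type


def load_pairs(dataset_lines, formula_lines, first_index=0):
    """Yield (image_filename, formula) pairs.

    `first_index` is an assumption: set it to 1 if formula_idx counts
    lines starting from 1 rather than 0.
    """
    for line in dataset_lines:
        if not line.strip():
            continue  # skip blank lines
        formula_idx, image_name, _render_type = parse_dataset_line(line)
        formula = formula_lines[formula_idx - first_index].rstrip("\n")
        # image_name is stored without its '.png' extension.
        yield image_name + ".png", formula


if __name__ == "__main__":
    # Made-up sample content standing in for the two files.
    formulas = ["a + b\n", "x^2\n"]
    dataset = ["1 img_b basic\n", "0 img_a basic\n"]
    for image_file, formula in load_pairs(dataset, formulas):
        print(image_file, formula)
```

With real data, dataset_lines and formula_lines would come from reading im2latex.lst and im2latex_formulas.lst line by line.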
- Sometimes pdflatex gets stuck inside an infinite loop when compiling an image.
  - To fix this you need to manually kill the stuck pdflatex processes, otherwise the script won't end.
- If pdflatex is used with convert, this will generate pictures of the whole page.
  - While this might be a good thing (e.g. fixed input size), it might also severely slow down training.
- textogif generates smaller images, but these will have varying dimensions.
- Possible TODOs:
  - Finish the tokenizer function / output a list of tokens instead of the raw formula in the formula list
  - Add an accuracy metric (e.g. word-error-rate or similar).
    - Check this repository for some evaluation scripts: https://github.com/harvardnlp/im2markup
  - Combine the ...2formula.py scripts into one, or at least make the system more sensible rather than a bunch of separate scripts.
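As a starting point for the accuracy-metric TODO, here is a minimal sketch of a word-error-rate style metric that takes a list of true formulas and a list of estimated formulas. It is an illustration only: the function names are made up, and tokens are obtained by splitting on whitespace, which is an assumption; a proper TeX tokenizer would replace that step.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def word_error_rate(true_formulas, estimated_formulas):
    """Total token edit distance divided by total reference tokens."""
    if len(true_formulas) != len(estimated_formulas):
        raise ValueError("both lists must have the same length")
    errors = 0
    reference_tokens = 0
    for truth, estimate in zip(true_formulas, estimated_formulas):
        truth_tokens = truth.split()  # assumed whitespace tokenization
        errors += edit_distance(truth_tokens, estimate.split())
        reference_tokens += len(truth_tokens)
    return errors / float(reference_tokens) if reference_tokens else 0.0


if __name__ == "__main__":
    print(word_error_rate(["a + b", "x ^ 2"], ["a - b", "x ^ 2"]))
```

Summing errors over the whole set before dividing weights long formulas more heavily than averaging per-formula rates would; which is preferable depends on how systems are to be compared.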