da03/im2latex-dataset

 
 


im2latex-dataset

Python tools for creating a suitable dataset for OpenAI's im2latex task: https://openai.com/requests-for-research/#im2latex. You can download a prebuilt dataset from here. The data is split into train (~84k), validation (~9k) and test (~10k) sets, which may not be enough for this task. I can build bigger sets on request.

Note: This code is very ad hoc and requires tinkering with the source.

Ultimate goals

  • To provide a dataset suitable for solving the im2latex task
    • So people can compare performance between systems
  • To provide the tools used to generate said dataset
    • So people can generate different kinds of images (quality, size), different formulas (different fonts), etc.
  • Misc tools for handling the datasets
    • TeX Math tokenizer (possibly)
    • Performance metric (takes a list of true formulas and a list of estimated formulas, outputs performance/accuracy)
    • Tools for modifying the images as needed
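
The performance metric listed above could start out as a plain exact-match accuracy. The following is only a minimal sketch, not code from this repository: the function name `exact_match_accuracy` is hypothetical, and comparing formulas after collapsing whitespace is an assumption about what should count as "equal".

```python
def exact_match_accuracy(true_formulas, predicted_formulas):
    """Fraction of predictions equal to their reference formula.

    Hypothetical helper, not part of the repository. Formulas are compared
    after collapsing whitespace (an assumption about what counts as equal).
    """
    if len(true_formulas) != len(predicted_formulas):
        raise ValueError("both lists must have the same length")
    if not true_formulas:
        return 0.0

    def normalize(formula):
        # Collapse runs of whitespace so "a  +  b" and "a + b" compare equal.
        return " ".join(formula.split())

    hits = sum(
        1
        for truth, guess in zip(true_formulas, predicted_formulas)
        if normalize(truth) == normalize(guess)
    )
    # float() keeps the division correct on Python 2 as well.
    return float(hits) / len(true_formulas)
```

Exact match is strict: two formulas that render identically but differ in markup count as an error, so it is best read as a lower bound on quality.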

Contents

  • /src/latex2formulas.py
    • Script for parsing downloaded LaTeX sources for formulas. Stores the formulas in a single .txt file (one formula per line)
  • /src/stackexchange2formulas.py
    • Similar to latex2formulas.py, but for parsing StackExchange XMLs.
  • /src/arxiv2formulas.py
    • Similar to latex2formulas.py, but for parsing arXiv .tar/.tar.gz files (source downloads).
  • /src/formula2image.py
    • Creates images and a dataset from a file of formulas
  • /src/im2latex_utils.py
    • Collection of misc functions for handling these formulas
  • latex_urls.txt
    • Text file containing URLs to the LaTeX dataset from here. Use wget -i latex_urls.txt to download these files.
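
To illustrate the kind of parsing the ...2formulas.py scripts perform, here is a minimal sketch that pulls display-math formulas out of LaTeX source and puts each one on a single line. It is not the repository's implementation: the patterns and the name `extract_formulas` are assumptions, only `equation` environments and `$$...$$` are handled, and LaTeX comments are not stripped.

```python
import re

# Hypothetical patterns covering display math only; the repository's
# scripts may recognise more (or different) environments.
FORMULA_PATTERNS = [
    re.compile(r"\\begin\{equation\*?\}(.*?)\\end\{equation\*?\}", re.DOTALL),
    re.compile(r"\$\$(.*?)\$\$", re.DOTALL),
]


def extract_formulas(latex_source):
    """Return the formulas found in a LaTeX string, one line each."""
    formulas = []
    for pattern in FORMULA_PATTERNS:
        for body in pattern.findall(latex_source):
            # Collapse newlines and repeated spaces so that every formula
            # fits the "one formula per line" output format.
            one_line = " ".join(body.split())
            if one_line:
                formulas.append(one_line)
    return formulas
```

Results are grouped by pattern rather than by position in the source. Joining lines is unsafe when a formula contains `%` comments, so a fuller version would strip those first.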

Dependencies

  • Python 2.x or 3.x (only tested on 2.x; it should work on 3.x too. Not tested on Windows)
  • For running the script with current settings and generating full-page images:
    • A properly installed LaTeX-to-PDF chain (e.g. calling pdflatex on a .tex file outputs a .pdf)
    • ImageMagick installed so that the convert command works
  • For creating more compact images of formulas (image cropped so that the formula fits):
    • textogif and its dependencies
    • textogif needs to be placed in the same directory where images are generated, otherwise it won't work.
  • For tokenizing LaTeX, the Python package plasTeX is required.
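
Before starting a long run it can help to check that the external commands listed above are reachable. This is a small sketch, assuming Python 3.3+ (for `shutil.which`); the helper name `missing_tools` is hypothetical and not part of the repository.

```python
import shutil


def missing_tools(tools=("pdflatex", "convert")):
    """Return the command names that cannot be found on PATH.

    Hypothetical helper; the default tuple mirrors the dependencies
    listed above for full-page image generation.
    """
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` returns an empty list when both pdflatex and ImageMagick's convert are available.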

Commands for building TABLE2LATEX dataset

  1. Download arXiv tarballs by following the instructions here and put them under a directory <TARDIR>.
  2. Run python src/arxiv2tabulars.py <TARDIR>. This creates a file tabulars.txt containing all extracted tabulars.
  3. Run python src/tabular2image.py tabulars.txt. This creates im2latex.lst, which contains the image filenames and the corresponding tabular ids; im2latex_tabulars.lst, which lists all tabulars starting from id 0; the folder tabular_images, containing all images; and the folder tabular_pdfs, containing all PDFs. pdflatex and ImageMagick are required.
  4. Run python src/deduplicate.py im2latex.lst im2latex_tabulars.lst. This removes duplicate tables.
  5. Run python src/split_train_val_test.py im2latex.lst im2latex_tabulars.lst. This splits the data into train, validation and test sets at the article level, generating im2latex_train.lst, im2latex_validate.lst and im2latex_test.lst.
  6. Run python src/tokenize.py im2latex_tabulars.lst im2latex_tabulars.tok.txt. This tokenizes the tabulars. The Python package plasTeX is required.
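
Splitting "at article level" in step 5 means that all tables from one article end up in the same set, so near-identical tables from a single paper cannot leak between train and test. The sketch below shows the idea only; it is not the logic of split_train_val_test.py. The function name, the 10%/10% fractions, the fixed seed and the way article ids are obtained are all assumptions.

```python
import random


def split_by_article(article_ids, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Map each distinct article id to 'train', 'validate' or 'test'.

    Hypothetical sketch: fractions and seed are illustrative defaults,
    not values taken from the repository.
    """
    # Sort first so the shuffle is reproducible regardless of input order.
    unique_ids = sorted(set(article_ids))
    random.Random(seed).shuffle(unique_ids)

    n_test = int(len(unique_ids) * test_fraction)
    n_val = int(len(unique_ids) * val_fraction)

    assignment = {}
    for position, article_id in enumerate(unique_ids):
        if position < n_test:
            assignment[article_id] = "test"
        elif position < n_test + n_val:
            assignment[article_id] = "validate"
        else:
            assignment[article_id] = "train"
    return assignment
```

Each table then inherits the split of the article it came from, so the fractions apply to articles rather than to individual tables.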

Building your own dataset

  1. Download a bunch of LaTeX sources packed in .tar files (by using latex_urls.txt, for example)
  2. Run python latex2formulas.py [directory where .tars are stored]
  3. Run python formula2image.py [path to generated formula text file]
  4. Run python formula2image.py [dataset_file] [formula_file] [image_dir] to confirm the dataset is valid
  • The end result should have two files and one directory (names can be changed in formula2image.py):

    • im2latex.lst
      • Each line is in format formula_idx image_name render_type
        • formula_idx is the line number of the formula in im2latex_formulas.lst
        • image_name is the name of the image connected to this rendering (without '.png')
        • render_type is the name of render setup used, defined in formula2image.py
    • im2latex_formulas.lst
      • Each line contains one formula
    • /formula_images
      • Directory where images are stored
  • Sometimes pdflatex gets stuck in an infinite loop when compiling an image.

    • To fix this, you need to manually kill the stuck pdflatex processes, otherwise the script won't end
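
One way to avoid killing stuck pdflatex processes by hand is to run each compilation with a timeout. This is a sketch of that idea under stated assumptions, not what formula2image.py currently does: it needs Python 3.5+ for `subprocess.run`, and the helper name `run_with_timeout` and the 30-second default are hypothetical.

```python
import subprocess


def run_with_timeout(command, timeout_seconds=30, cwd=None):
    """Run a command; return True on exit code 0, False otherwise.

    Hypothetical helper. A timeout or a missing executable is reported
    as False instead of hanging or raising.
    """
    try:
        completed = subprocess.run(
            command,
            cwd=cwd,
            stdin=subprocess.DEVNULL,   # never wait for interactive input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    except OSError:
        # The executable was not found or could not be started.
        return False
    return completed.returncode == 0
```

For example, `run_with_timeout(["pdflatex", "-interaction=nonstopmode", "formula.tex"])` would give up on a formula after 30 seconds, letting the script skip it and continue.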

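The im2latex.lst and im2latex_formulas.lst files described above can be joined with a few lines of Python. The loader below is a sketch, not a function from im2latex_utils.py. It assumes that formula_idx is zero-based, that fields are separated by whitespace and that the files are UTF-8 encoded; none of these is confirmed here, so adjust them if your files differ.

```python
import io
import os


def load_dataset(dataset_path, formulas_path, image_dir):
    """Return a list of (formula, image_path, render_type) tuples.

    Hypothetical loader; assumes zero-based formula_idx and UTF-8 files.
    """
    with io.open(formulas_path, encoding="utf-8") as handle:
        formulas = [line.rstrip("\n") for line in handle]

    samples = []
    with io.open(dataset_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            formula_idx, image_name, render_type = fields
            # image_name is stored without its '.png' extension.
            image_path = os.path.join(image_dir, image_name + ".png")
            samples.append((formulas[int(formula_idx)], image_path, render_type))
    return samples
```

A call such as `load_dataset("im2latex.lst", "im2latex_formulas.lst", "formula_images")` then gives one tuple per rendered image.
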
Issues and possible TODOs

  • If pdflatex is used with convert, this will generate pictures of the whole page

    • While this might be a good thing (e.g. fixed input size), it might also severely slow down training
  • textogif generates smaller images, but these will have varying dimensions.

  • Possible TODOs:

    • Finish tokenizer function / output list of tokens instead of raw formula in formula list
    • Add accuracy metric (eg. word-error-rate or similar).
    • Combine the ...2formulas.py scripts into one, or at least make the system more sensible than a bunch of separate scripts.
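
For the accuracy-metric TODO, a word-error-rate can be computed as the token-level edit distance divided by the number of reference tokens. The code below is one possible sketch rather than a planned implementation: the function names are hypothetical, and tokens are obtained by splitting on whitespace, which assumes the formulas were already tokenized into space-separated tokens.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(true_formulas, predicted_formulas):
    """Total token edits divided by total reference tokens.

    Hypothetical helper; assumes whitespace-separated tokens.
    """
    if len(true_formulas) != len(predicted_formulas):
        raise ValueError("both lists must have the same length")
    total_errors = 0
    total_tokens = 0
    for truth, guess in zip(true_formulas, predicted_formulas):
        reference = truth.split()
        total_errors += edit_distance(reference, guess.split())
        total_tokens += len(reference)
    return float(total_errors) / total_tokens if total_tokens else 0.0
```

Unlike exact-match scoring, this gives partial credit to predictions that are nearly right, and it can exceed 1.0 when predictions are much longer than the references.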
