Building the Dataset

To build the dataset used to train the model, run python <markdown folder> <json folder> <output folder>.

markdown folder:
It contains markdown files corresponding to the documents to annotate. The headings are used to generate the training labels for heading detection.

json folder:
It contains json files as produced by Parsr. Each file must correspond to a file in the markdown folder. It must have the same name but use the .json extension instead of .md.
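
The naming rule above can be checked before running the script. The following is a minimal sketch, not the script's own logic: the function name pair_documents is hypothetical, and it assumes both folders are flat (no subfolders).

```python
from pathlib import Path


def pair_documents(md_dir, json_dir):
    """Return (markdown_path, json_path) pairs that share a file name.

    Hypothetical helper: a markdown file is paired only when a .json file
    with the same stem exists in json_dir; unmatched files are skipped.
    """
    md_dir, json_dir = Path(md_dir), Path(json_dir)
    pairs = []
    for md_path in sorted(md_dir.glob("*.md")):
        json_path = json_dir / (md_path.stem + ".json")
        if json_path.is_file():
            pairs.append((md_path, json_path))
    return pairs
```

Markdown files without a matching json file are silently left out here; a stricter variant could raise an error instead.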

For each document, the script generates a corresponding csv in the output folder. The csv contains 7 columns:

  1. line: the line as extracted by Parsr. It should correspond to a line in the pdf document.
  2. word_count: the number of words in the line.
  3. font_size: the size of the most common font present on the line.
  4. is_bold: a boolean indicating if the line is entirely bold.
  5. color: the color (in hex format) of the most common font on the line.
  6. title_case: a boolean indicating whether the line is written in title case. This feature is mostly meaningful for English text only.
  7. label: "paragraph" or "heading"

usage: [-h] md_dir json_dir out_dir

Extracts features to csv from .json files using .md files as labels

positional arguments:
  md_dir      folder containing the .md files (labels)
  json_dir    folder containing the .json files (data)
  out_dir     folder in which to save the .csv files

optional arguments:
  -h, --help  show this help message and exit
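
To make the column definitions above concrete, here is a rough sketch of how the per-line features could be derived. It is not the script's actual code: the input format (a list of word dicts with text, size, bold and color keys) is an assumption made for illustration and does not reproduce the Parsr json schema, and title case is approximated with Python's str.istitle(), which is stricter than most style guides.

```python
from collections import Counter


def line_features(words):
    """Compute the feature columns for one line.

    `words` is a hypothetical, non-empty list of dicts such as
    {"text": "Intro", "size": 14.0, "bold": True, "color": "#000000"}.
    """
    text = " ".join(word["text"] for word in words)
    # "Most common font" is approximated by the most frequent size / color.
    font_size = Counter(word["size"] for word in words).most_common(1)[0][0]
    color = Counter(word["color"] for word in words).most_common(1)[0][0]
    return {
        "line": text,
        "word_count": len(words),
        "font_size": font_size,
        # The line counts as bold only if every word on it is bold.
        "is_bold": all(word["bold"] for word in words),
        "color": color,
        "title_case": text.istitle(),
    }
```

The label column is not computed here, since it comes from matching the line against the headings of the markdown file.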

Training the Model

To train the model, run python <csv folder> <output folder>.

csv folder:
It must contain the csv files generated in the dataset-building step, formatted as described previously.
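
As an illustration of what the training step consumes, the sketch below gathers every csv in the folder into feature rows and labels. It is only an assumption about the layout: it supposes each csv has a header row whose names match the 7 columns listed earlier, and the names FEATURES and load_dataset are hypothetical. Values are kept as strings; converting them to numbers or booleans is left to the training code.

```python
import csv
from pathlib import Path

# Assumed header names, taken from the column list in this document.
FEATURES = ["word_count", "font_size", "is_bold", "color", "title_case"]


def load_dataset(dataset_dir):
    """Read all .csv files in dataset_dir into (rows, labels)."""
    rows, labels = [], []
    for csv_path in sorted(Path(dataset_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for record in csv.DictReader(handle):
                rows.append({name: record[name] for name in FEATURES})
                labels.append(record["label"])
    return rows, labels
```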

The script generates a file named model.js containing an executable model. This file should be placed in server/src/processing/HeadingDetectionDTModule.

usage: [-h] dataset_dir out_dir

Train a decision tree to recognize headings.

positional arguments:
  dataset_dir  folder containing the .csv files generated by
  out_dir      folder in which to save the trained model

optional arguments:
  -h, --help   show this help message and exit
