Building the Dataset
To build the dataset used to train the model, run:
python build_dataset.py <markdown folder> <json folder> <output folder>
The markdown folder contains markdown files corresponding to the documents to annotate. Their headings are used to generate the training labels for heading detection.
The json folder contains json files as produced by Parsr. Each file must correspond to a file in the markdown folder: it must have the same name but use the .json extension instead of .md.
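The naming rule above can be checked before running the script. The following is a minimal sketch, not the logic of build_dataset.py itself; the function name pair_documents is hypothetical and it only assumes that files are matched by their name without the extension.

```python
from pathlib import Path


def pair_documents(md_dir, json_dir):
    """Return (md_path, json_path) pairs that share a file stem.

    Markdown files without a matching .json file are returned separately
    so the caller can decide whether to skip them or abort.
    Hypothetical helper, not part of build_dataset.py.
    """
    md_dir, json_dir = Path(md_dir), Path(json_dir)
    pairs, missing = [], []
    for md_path in sorted(md_dir.glob("*.md")):
        # Same name, .json extension instead of .md.
        json_path = json_dir / (md_path.stem + ".json")
        if json_path.is_file():
            pairs.append((md_path, json_path))
        else:
            missing.append(md_path)
    return pairs, missing
```

Calling pair_documents("markdown", "json") on two example folders returns the matched pairs first and the unmatched markdown files second, which makes it easy to spot a missing Parsr output.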
For each document, the script generates a corresponding csv in the output folder. The csv contains 7 columns:
- line: the line as extracted by Parsr. It should correspond to a line in the pdf document.
- word_count: the number of words in the line.
- font_size: the size of the most common font present on the line.
- is_bold: a boolean indicating if the line is entirely bold.
- color: the color (in hex format) of the most common font on the line.
- title_case: a boolean indicating whether the line is written in title case. This feature is mostly meaningful only for English text.
- label: "paragraph" or "heading".
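To make the columns concrete, here is a rough sketch of how the six feature columns could be derived for one line. It is an illustration under assumptions: the input is a hypothetical list of word dicts with the keys "text", "font_size", "bold" and "color", which is not the actual Parsr json schema, and title case is approximated with Python's str.istitle(), which may be stricter than the check the script uses.

```python
from collections import Counter


def line_features(words):
    """Compute the feature columns for one non-empty line.

    `words` is a hypothetical list of dicts with the keys "text",
    "font_size", "bold" and "color"; it is NOT the Parsr schema.
    """
    # The most frequent (size, color) pair stands in for "the most common font".
    font, _ = Counter((w["font_size"], w["color"]) for w in words).most_common(1)[0]
    text = " ".join(w["text"] for w in words)
    return {
        "line": text,
        "word_count": len(words),
        "font_size": font[0],
        # True only if every word on the line is bold.
        "is_bold": all(w["bold"] for w in words),
        "color": font[1],
        # Assumption: str.istitle() as the title-case test.
        "title_case": text.istitle(),
    }
```

The label column is not computed here, since it comes from comparing the line against the headings of the matching markdown file.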
    usage: build_dataset.py [-h] md_dir json_dir out_dir

    Extracts features to csv from .json files using .md files as labels

    positional arguments:
      md_dir      folder containing the .md files (labels)
      json_dir    folder containing the .json files (data)
      out_dir     folder in which to save the .csv files

    optional arguments:
      -h, --help  show this help message and exit
Training the Model
To train the model, run:
python train_model.py <csv folder> <output folder>
The csv folder must contain the csv files generated by build_dataset.py, formatted as described above.
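For readers who want to inspect the data before training, the sketch below reads such a folder into numeric feature rows and boolean labels. It is not the code of train_model.py: the function name load_dataset is hypothetical, booleans are assumed to be written as "True"/"False", and colors are assumed to look like "#rrggbb".

```python
import csv
from pathlib import Path


def load_dataset(dataset_dir):
    """Read every .csv file in `dataset_dir` into feature rows and labels.

    Hypothetical helper; the column encodings below are assumptions.
    """
    features, labels = [], []
    for csv_path in sorted(Path(dataset_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                features.append([
                    int(row["word_count"]),
                    float(row["font_size"]),
                    row["is_bold"] == "True",           # assumed boolean spelling
                    int(row["color"].lstrip("#"), 16),  # hex color as an integer
                    row["title_case"] == "True",        # assumed boolean spelling
                ])
                # True for headings, False for paragraphs.
                labels.append(row["label"] == "heading")
    return features, labels
```

The resulting lists have one entry per line of every document and can be passed to a decision tree implementation of your choice. The text column (line) is left out on purpose because it is not a numeric feature.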
The script generates a file named model.js containing an executable model. This script should be placed in
    usage: train_model.py [-h] dataset_dir out_dir

    Train a decision tree to recognize headings.

    positional arguments:
      dataset_dir  folder containing the .csv files generated by build_dataset.py
      out_dir      folder in which to save the trained model

    optional arguments:
      -h, --help   show this help message and exit