Building the Dataset

To build the dataset used to train the model, run python build_dataset.py <markdown folder> <json folder> <output folder>.

markdown folder:
Contains the Markdown files for the documents to annotate. Their headings are used to generate the training labels for heading detection.

json folder:
Contains the JSON files produced by Parsr. Each file must correspond to a file in the markdown folder: it must have the same name, but with the .json extension instead of .md.
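The naming rule above can be checked before running the script. The following is a minimal sketch, not part of build_dataset.py: the helper name find_pairs is hypothetical, and it assumes both folders are flat (no subfolders).

```python
from pathlib import Path


def find_pairs(md_dir, json_dir):
    """Pair each .json file with the .md file of the same name.

    Hypothetical helper for illustration only. Returns (pairs, orphans):
    pairs is a list of (md_path, json_path) tuples, orphans is a list of
    .json files that have no matching .md file.
    """
    md_dir = Path(md_dir)
    pairs, orphans = [], []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        # Same base name, .md extension instead of .json.
        md_path = md_dir / (json_path.stem + ".md")
        if md_path.is_file():
            pairs.append((md_path, json_path))
        else:
            orphans.append(json_path)
    return pairs, orphans
```

Any path returned in orphans points at a JSON file that does not follow the naming rule.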

output:
For each document, the script generates a corresponding csv file in the output folder. Each csv file contains 7 columns:

  1. line: the line as extracted by Parsr. It should correspond to a line in the pdf document.
  2. word_count: the number of words in the line.
  3. font_size: the size of the most common font present on the line.
  4. is_bold: a boolean indicating if the line is entirely bold.
  5. color: the color (in hex format) of the most common font on the line.
  6. title_case: a boolean indicating whether the line is written in title case. This feature is mostly meaningful for English text only.
  7. label: either "paragraph" or "heading".

usage: build_dataset.py [-h] md_dir json_dir out_dir

Extracts features to csv from .json files using .md files as labels

positional arguments:
  md_dir      folder containing the .md files (labels)
  json_dir    folder containing the .json files (data)
  out_dir     folder in which to save the .csv files

optional arguments:
  -h, --help  show this help message and exit
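To make the feature columns concrete, here is a rough sketch of how the six features could be derived for one line. It is not the code from build_dataset.py. The input format (a list of word dicts with the keys text, font_size, bold and color) is an assumption made for illustration and does not reflect Parsr's actual JSON schema. The use of str.istitle() for title case and the treatment of a (size, color) pair as "the font" are assumptions as well.

```python
from collections import Counter


def line_features(words):
    """Compute the feature columns for one non-empty line.

    `words` is a hypothetical structure: a list of dicts with the keys
    "text", "font_size", "bold" and "color" (not Parsr's real schema).
    """
    text = " ".join(w["text"] for w in words)
    # Assumption: a font is identified by its (size, color) pair,
    # and the most frequent pair on the line wins.
    fonts = Counter((w["font_size"], w["color"]) for w in words)
    (font_size, color), _ = fonts.most_common(1)[0]
    return {
        "line": text,
        "word_count": len(words),
        "font_size": font_size,
        # The line counts as bold only if every word is bold.
        "is_bold": all(w["bold"] for w in words),
        "color": color,
        # Assumption: str.istitle() as the title-case test.
        "title_case": text.istitle(),
    }
```

The label column is not computed here, because it comes from the headings in the matching markdown file rather than from the line itself.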

Training the Model

To train the model, run python train_model.py <csv folder> <output folder>.

csv folder:
Contains the csv files generated by build_dataset.py, formatted as described in the previous section.
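Before training, it can be useful to confirm that the folder really holds files in the expected format and to see how balanced the labels are. The sketch below is not part of train_model.py. The helper name label_counts is hypothetical, and it assumes each csv file starts with a header row that uses the seven column names listed earlier.

```python
import csv
from collections import Counter
from pathlib import Path

# Column names as listed in the dataset description; the presence of a
# header row with exactly these names is an assumption.
EXPECTED_COLUMNS = [
    "line", "word_count", "font_size", "is_bold",
    "color", "title_case", "label",
]


def label_counts(dataset_dir):
    """Count the "heading" and "paragraph" rows across all csv files."""
    counts = Counter()
    for csv_path in sorted(Path(dataset_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
            if missing:
                raise ValueError(
                    f"{csv_path.name}: missing columns {sorted(missing)}"
                )
            counts.update(row["label"] for row in reader)
    return counts
```

A strong imbalance between the two labels is expected, since most lines in a document are paragraphs, and is worth keeping in mind when judging the trained model.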

output:
The script generates a file named model.js containing an executable model. This file should be placed in server/src/processing/HeadingDetectionDTModule.

usage: train_model.py [-h] dataset_dir out_dir

Train a decision tree to recognize headings.

positional arguments:
  dataset_dir  folder containing the .csv files generated by build_dataset.py
  out_dir      folder in which to save the trained model

optional arguments:
  -h, --help   show this help message and exit