Parsr Configuration

To configure the pipeline and choose what modules will be called and with what parameters, you have to provide a JSON file. There are only a few required keys:

  • version [Number] is the version number of the configuration file format.
  • extractor [Object] contains the parameters for the extraction step.
  • cleaner [Array] is the list of cleaning tools that will be called.
  • output [Object] contains the list of formats to export and some other details.

Cleaning tools have default parameters that work pretty well, but you can override these parameters by providing them in the config.

The cleaner array may appear unconventional but is really easy to use. Every item can be of type:

  • String: it's the name of the cleaning tool you want to call, run with its default options.
  • Array: the first element is the name of the cleaning tool, the second an object containing the options you want to override.

1. Structure

{
    "version": 0.9,               // Version number of the configuration file format
    "extractor": {                // Extraction options (See section 2.)
        "pdf": "extractor-tool",  // Select the tool to extract PDF files
        // "img": "extractor-tool", Deprecated since 0.9 version
        "ocr": "extractor-tool",  // Select the tool to extract image files (JPG, PNG, TIFF, etc.)
        "language": "lang",       // Select the defaut language of your document. This is used to increase the accuracy of OCR tools (See section 2.2)
        "credentials": {          // Extractors running online services may require credentials to work. (see section 2.3)
          "API_KEY": "..."
        }
    },
    // The cleaner pipeline consists of a list of modules that will run on the given file (See section 3.)
    "cleaner": [
        // The first module to run on the document and send the result to the next module
        "module-name-1",
        // The second module to run. This syntax is also accepted. It will use only the default module options
        [
            "module-name-2"
        ],
        // The third module to run with some special options
        [
            "module-name-3",
            { "option-1": { value: 50, min: 0, max: 100 }, "option-2": { value: true } }
        ],
    ],
    // Output options (See section 4.)
    "output": {
        "formats": {                       // list of format that will be outputed (See section 4.1.)
            "json": true,
            "text": false,
            "csv": true,
            "markdown": false
        },
        "granularity": "word | character", // Set the granularity of the output (See section 4.2.)
        "includeMarginals": false          // Chose whether the output will include headers and footers (See section 4.3.)
    }
}

This means module-name-1 will be called first and its result passed to module-name-2, then module-name-3 will run with its custom options, and so on.

2. Extractor Config

2.1. Extractor Tools

Different extractors are available for each input file format.

  • PDF files: two extractors are currently available for PDF files:
    • pdfminer, which is an advanced Python-based extractor capable of extracting low- and high-level textual structures (from characters to paragraphs),
    • pdfjs, Mozilla's free solution for parsing documents. This is the recommended extractor to parse large documents (200+ pages).
  • Images: five OCR extractors are supported for images:
    • tesseract, an open-source OCR software,
    • abbyy, which relies on ABBYY FineReader, a paid solution for OCR on documents and images,
    • google-vision, which uses the Google Vision API to detect the contents of an image (see the google vision documentation for more),
    • ms-cognitive-services, which uses Microsoft Cognitive Services OCR to detect and process text inside an image,
    • amazon-textract, which uses the Amazon Textract service to detect and process text inside an image.

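For example, a minimal extractor block (a sketch; any of the extractors listed above can be substituted) that parses large PDFs with pdfjs and sends images to google-vision would look like:

  "extractor": {
    "pdf": "pdfjs",
    "ocr": "google-vision"
  },

Extractors running online services such as google-vision, ms-cognitive-services or amazon-textract may additionally need credentials (see section 2.3).
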
2.2. Language

The language parameter is an option that will be passed to Tesseract when using it. It must be in the Tesseract language format, which is equivalent to ISO 639-2/T.

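For example, to declare a document written in both English and French (the same setting used in the full example of section 5):

  "language": ["eng", "fra"]
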
2.3. Credentials

Each input module may require some credentials to work properly. For example, the ms-cognitive-services extractor requires two values:

{
  "OCP_APIM_SUBSCRIPTION_KEY": "<SUBSCRIPTION_KEY>",
  "OCP_APIM_ENDPOINT": "https://westeurope.api.cognitive.microsoft.com/"
}

OCP_APIM_SUBSCRIPTION_KEY has to be obtained through the Azure web console. OCP_APIM_ENDPOINT is required, but a default value is already set.

Default credential values for each module can be found in its credentials.json file.

The recommended way to set credentials is to add them to the extractor config:

  "extractor": {
    "pdf": "pdfminer",
    "ocr": "ms-cognitive-services",
    "credentials": {
      "OCP_APIM_SUBSCRIPTION_KEY": "<API_KEY>",
      "OCP_APIM_ENDPOINT": "<API_ENDPOINT>" //as OCP_APIM_ENDPOINT is already defined in credentials.json, this key can be omitted.
    }
  },

For more information about input modules and their required credentials, you can check the Input Modules Documentation.

3. Cleaner Config

The pipeline is defined by the list of module names and their options. Modules are then run one by one, in the same order as in the configuration.

A module can be called in the full form:

[
	"module",
	{
		"option-1": 100,
		"option-2": true
	}
],

Or just with their default options:

[
	"module"
],

Or just with their default options in the condensed format:

"module",

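As a concrete illustration of these forms, the pipeline fragment below (borrowed from the full example in section 5) runs header-footer-detection with an overridden option and link-detection with its defaults:

  "cleaner": [
    ["header-footer-detection", { "maxMarginPercentage": 15 }],
    "link-detection"
  ],
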
Note: some modules depend on other modules that need to be run earlier in the pipeline.

4. Output Config

4.1. Output Format

The platform can export the following formats:

  • json
  • markdown
  • text
  • csv
  • pdf (pandoc required)

4.2. Granularity

The granularity parameter can be either word or character and defines the level of detail at which the document will be exported.

Warning: exporting with character granularity will result in very big JSON files (probably more than 10 MB).

4.3. Include Marginals

The includeMarginals boolean parameter lets you choose whether the output will include headers and footers.

5. Exempli gratia

{
  "version": 0.9,
  "extractor": {
    "pdf": "pdfminer",
    "ocr": "tesseract",
    "language": ["eng", "fra"]
  },
  "cleaner": [
    "out-of-page-removal",
    "whitespace-removal",
    "redundancy-detection",
    "table-detection",
    ["header-footer-detection", { "maxMarginPercentage": 15 }],
    ["reading-order-detection", { "minColumnWidthInPagePercent": 15 }],
    "link-detection",
    ["words-to-line", { "maximumSpaceBetweenWords": 100 }],
    "lines-to-paragraph",
    "page-number-detection",
    "hierarchy-detection",
    [
      "regex-matcher",
      {
        "queries": [
          {
            "label": "Car",
            "regex": "([A-Z]{2}\\-[\\d]{3}\\-[A-Z]{2})"
          },
          {
            "label": "Age",
            "regex": "(\\d+)[ -]*(ans|jarige)"
          },
          {
            "label": "Percent",
            "regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)"
          }
        ]
      }
    ]
  ],
  "output": {
    "granularity": "word",
    "includeMarginals": false,
    "formats": {
      "json": true,
      "text": true,
      "csv": true,
      "markdown": true,
      "pdf": false
    }
  }
}