Commit: [ocr-credentials] Updated documentation

jvalls-axa committed Feb 19, 2020
1 parent 0fa1e36 commit 29c1374
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions docs/configuration.md
@@ -1,6 +1,7 @@
# Parsr Configuration

- [Parsr Configuration](#parsr-configuration)

- [1. Structure](#1-structure)
- [2. Extractor Config](#2-extractor-config)
- [2.1. Extractor Tools](#21-extractor-tools)
@@ -11,7 +12,7 @@
- [4.2. Granularity](#42-granularity)
- [4.3. Include Marginals](#43-include-marginals)
- [5. Exempli gratia](#5-exempli-gratia)

To configure the pipeline and choose which modules will be called and with what parameters, you have to provide a JSON file.
There are only a few required keys:
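
As a sketch only (tool names other than those shown in this document are placeholders), a minimal configuration file might look like this:

```js
{
  "version": 0.9,              // version number of the configuration file format
  "extractor": {
    "pdf": "extractor-tool",   // placeholder -- pick a PDF extraction tool supported by your install
    "ocr": "extractor-tool",   // placeholder -- pick an OCR tool (see section 2.1)
    "language": "eng"          // default document language (see section 2.2)
  },
  "cleaner": []                // cleaner modules to run, in order (see section 3)
}
```

The full example later in this document shows the complete set of keys.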

@@ -31,9 +32,10 @@ The cleaner array may appear unconventional but is really easy to use. Every it

```js
{
"version": 0.5, // Version number of the configuration file format
"version": 0.9, // Version number of the configuration file format
"extractor": { // Extraction options (See section 2.)
"pdf": "extractor-tool", // Select the tool to extract PDF files
// "img": "extractor-tool", // Deprecated since version 0.9
"ocr": "extractor-tool", // Select the tool to extract image files (JPG, PNG, TIFF, etc.)
"language": "lang", // Select the default language of your document. This is used to increase the accuracy of OCR tools (See section 2.2)
"credentials": { // Extractors running online services may require credentials to work. (see section 2.3)
@@ -85,6 +87,7 @@ Different extractors are available for each input file format.
- `google-vision`, which uses the [Google Vision](https://cloud.google.com/vision/) API to detect the contents of an image (see the [google vision documentation for more](../server/src/input/google-vision/README.md)),
- `ms-cognitive-services`, which uses [Microsoft Cognitive Services](https://azure.microsoft.com/es-es/services/cognitive-services/) OCR to detect and process text inside an image,
- `amazon-textract`, which uses the [Amazon Textract](https://us-east-2.console.aws.amazon.com/textract/home) service to detect and process text inside an image.
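
Selecting one of these tools is done through the `ocr` key of the extractor config. A sketch, assuming `ms-cognitive-services` is the desired backend:

```js
"extractor": {
  "pdf": "extractor-tool",          // placeholder for the PDF extraction tool
  "ocr": "ms-cognitive-services",   // any of the OCR tools listed above
  "language": "eng"
}
```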

### 2.2. Language

The language parameter is an option that will be passed to Tesseract when using it. It must be in the [Tesseract language format](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages), which is equivalent to [ISO 639-2/T](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
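
For example, to process a French document (a sketch; `fra` is the ISO 639-2/T style code Tesseract uses for French):

```js
"language": "fra"  // three-letter Tesseract codes, e.g. "eng", "fra", "deu"
```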
@@ -102,7 +105,7 @@ For example, `ms-cognitive-services` extractor requires two values:
```

`OCP_APIM_SUBSCRIPTION_KEY` has to be obtained through Azure web console.
`OCP_APIM_ENDPOINT` is required, but has a default value set.

Default credential values for each module can be found in each `credentials.json` file.

@@ -119,7 +122,6 @@ The recommended way to set credentials is to add them to the extractor config:
},
```
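
Since the block above is truncated, here is a hedged sketch of what a credentials block inside the extractor config could look like for `ms-cognitive-services` (both values are placeholders):

```js
"extractor": {
  "ocr": "ms-cognitive-services",
  "credentials": {
    "OCP_APIM_SUBSCRIPTION_KEY": "<your-azure-subscription-key>",  // placeholder -- obtained through the Azure web console
    "OCP_APIM_ENDPOINT": "<your-endpoint-url>"                     // placeholder -- a default value is already set
  }
}
```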


For more information about input modules and their required credentials, you can check the [Input Modules Documentation](../server/src/input/README.md).

## 3. Cleaner Config
