Run a spell checker on the whole project

BinaryBrain committed Nov 1, 2019
1 parent 2488b93 commit 49be8656becb9ab2126e308c15ccfa5b6f9372b1
Showing with 174 additions and 157 deletions.
  1. +5 −5 README.md
  2. +2 −2 api/server/src/FileManager.ts
  3. +1 −2 api/server/src/ProcessManager.ts
  4. +2 −2 api/server/src/ServerManager.ts
  5. +3 −3 api/server/src/api.ts
  6. +2 −2 api/server/src/types.ts
  7. +1 −1 demo/jupyter-notebook/sampleConfig.json
  8. +1 −1 docker-compose-build.yml
  9. +4 −4 docs/API-deprecated.md
  10. +32 −7 docs/api-guide.md
  11. +1 −1 server/bin/index.ts
  12. +2 −2 server/configKeyValueSearch.json
  13. +1 −1 server/defaultConfig.json
  14. +1 −1 server/remoteModuleConfig.json
  15. +25 −25 server/src/input/abbyy/AbbyyTools.ts
  16. +1 −1 server/src/input/google-vision/GoogleVisionExtractor.ts
  17. +8 −8 server/src/input/pdfminer/pdfminer.ts
  18. +1 −1 server/src/input/tesseract/TesseractExtractor.ts
  19. +1 −1 server/src/input/tesseract/tesseract2json.ts
  20. +1 −1 server/src/processing/HeaderFooterDetectionModule/README.md
  21. +2 −2 server/src/processing/KeyValueDetectionModule/KeyValueDetectionModule.ts
  22. +2 −2 server/src/processing/KeyValueDetectionModule/README.md
  23. +7 −7 server/src/processing/LinesToParagraphModule/LinesToParagraphModule.ts
  24. +6 −6 server/src/processing/LinesToParagraphModule/README.md
  25. +1 −1 server/src/processing/LinkDetectionModule/README.md
  26. +1 −1 server/src/processing/ListDetectionModule/ListDetectionModule.ts
  27. +3 −3 server/src/processing/NumberCorrectionModule/NumberCorrectionModule.ts
  28. +1 −1 server/src/processing/OutOfPageRemovalModule/OutOfPageRemovalModule.ts
  29. +3 −3 server/src/processing/README.md
  30. +2 −2 server/src/processing/ReadingOrderDetectionModule/README.md
  31. +4 −4 server/src/processing/ReadingOrderDetectionModule/ReadingOrderDetectionModule.ts
  32. +1 −1 server/src/processing/SeparateWordsModule/README.md
  33. +0 −7 server/src/processing/SeparateWordsModule/SeparateWordsModule.ts
  34. +1 −1 server/src/processing/TableDetectionModule/README.md
  35. +2 −2 server/src/processing/WhitespaceRemovalModule/README.md
  36. +1 −1 server/src/processing/WhitespaceRemovalModule/WhitespaceRemovalModule.ts
  37. +1 −1 server/src/processing/WordsToLineModule/README.md
  38. +7 −7 server/src/types/DocumentRepresentation/BoundingBox.ts
  39. +1 −1 server/src/types/DocumentRepresentation/Character.ts
  40. +3 −3 server/src/types/DocumentRepresentation/Element.ts
  41. +1 −1 server/src/types/DocumentRepresentation/Font.ts
  42. +1 −1 server/src/types/DocumentRepresentation/Page.ts
  43. +8 −8 server/src/types/DocumentRepresentation/Paragraph.ts
  44. +1 −1 server/src/types/DocumentRepresentation/Table.ts
  45. +18 −18 server/src/utils.ts
  46. +1 −1 server/src/utils/json2document.ts
@@ -48,7 +48,7 @@ You can install Parsr either using Docker containers, or directly on your machin

### 1.1. Docker Installation

Containers are already avaiable on [Docker Hub](https://hub.docker.com/u/axarev).
Containers are already available on [Docker Hub](https://hub.docker.com/u/axarev).

The documentation to build and run Docker containers is [here](docs/docker.md).

@@ -87,7 +87,7 @@ Next, install the required dependencies:
brew install node qpdf imagemagick graphicsmagick tesseract tesseract-lang
```

To install the python based depedencies (pdfminer and camelot), install, first install `pip`:
To install the python based dependencies (pdfminer and camelot), install, first install `pip`:

```sh
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
@@ -104,7 +104,7 @@ pip install ghostscript camelot-py

#### 1.2.3. Installing Dependencies under Windows

1. We recommand using [Chocolatey](https://chocolatey.org) as the package manager for installing dependencies under Windows. To install Chocolatey, [follow these instructions](https://chocolatey.org/install#installing-chocolatey).
1. We recommend using [Chocolatey](https://chocolatey.org) as the package manager for installing dependencies under Windows. To install Chocolatey, [follow these instructions](https://chocolatey.org/install#installing-chocolatey).
2. [Download and install **`node.js`**](https://nodejs.org/en/download)
3. For the **pdfminer** extractor for pdfs, [follow these steps](https://github.com/pdfminer/pdfminer.six#how-to-install).
4. Install **`qpdf`** and **`imagemagick`** using Powershell (Run as Administrator):
@@ -163,7 +163,7 @@ To install MuPDF, follow the steps corresponding to your environment:
```

If MuPDF is not installed, a corrupt/unreadable PDF file at input will be left untreated.
A message of such an occurance will be logged.
A message of such an occurrence will be logged.

#### 1.3.2. Pandoc

@@ -337,7 +337,7 @@ If images (`jpg`, `png`, `tiff`, etc.) are to be used with the tool, then the to

The following _optional_ dependencies may to be installed:

1. `mupdf-tools`: For error-correcting corrupt PDF's at input.
1. `mupdf-tools`: For error-correcting corrupt PDFs at input.
2. `pandoc`: Generate PDF files from an intermediate Markdown output after the cleaning operation in the pipeline.

## 5. Contribute
@@ -77,8 +77,8 @@ export class FileManager {
return this.checkFile(binder, `${binder.name}.xml`);
}

if (type === 'confidances') {
return this.checkFile(binder, `${binder.name}-confidances.txt`);
if (type === 'confidences') {
return this.checkFile(binder, `${binder.name}-confidences.txt`);
}

if (type === 'csvs') {
@@ -66,8 +66,7 @@ export class ProcessManager {
try {
logger.info(JSON.parse(json).msg);
} catch (err) {
console.log(json);
console.log(err);
logger.info(json);
}
}
});
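
The `ProcessManager` hunk above replaces two `console.log` calls with a single `logger.info(json)` fallback when the output cannot be parsed. A minimal sketch of that parse-or-fall-back pattern follows. It assumes each chunk of output is either a JSON object carrying a `msg` field or plain text. `extractLogMessage` is a hypothetical helper, not part of Parsr, and unlike the hunk it also falls back to the raw text when the parsed value has no string `msg`.

```typescript
// Hypothetical helper (not from the Parsr codebase): pick the text to log
// for one chunk of child-process output.
function extractLogMessage(raw: string): string {
  try {
    const parsed: unknown = JSON.parse(raw);
    // Only trust the parsed value if it is an object with a string `msg`.
    if (
      typeof parsed === 'object' &&
      parsed !== null &&
      typeof (parsed as { msg?: unknown }).msg === 'string'
    ) {
      return (parsed as { msg: string }).msg;
    }
    return raw;
  } catch (err) {
    // Not JSON at all: log the raw text, as the new fallback does.
    return raw;
  }
}
```

The returned string would then be passed to whatever logger is in use, for example `logger.info(extractLogMessage(chunk))`.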
@@ -44,8 +44,8 @@ export class ServerManager {
*/
public getModules(): string[] {
return readdirSync(this.defaultModulesFolder, { withFileTypes: true })
.filter(dirent => dirent.isDirectory())
.map(dirent => dirent.name)
.filter(dirEntry => dirEntry.isDirectory())
.map(dirEntry => dirEntry.name)
.map(d => d.replace(/([a-z])([A-Z])/g, '$1-$2').toLowerCase())
.map(d => d.replace(/-module/g, ''));
}
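
The rename from `dirent` to `dirEntry` leaves the name conversion in `getModules` unchanged. As a rough illustration of what the two `replace` calls do to a single folder name, here is a sketch. `toModuleName` is a hypothetical helper, and the sample names are taken from the processing module folders in the file list above, on the assumption that the default modules folder contains directories named like them.

```typescript
// Hypothetical helper: the same two replacements as in getModules,
// applied to one directory name.
function toModuleName(folderName: string): string {
  return folderName
    .replace(/([a-z])([A-Z])/g, '$1-$2') // insert "-" at each lower/upper boundary
    .toLowerCase()
    .replace(/-module/g, ''); // drop the "-module" suffix
}

console.log(toModuleName('LinesToParagraphModule')); // lines-to-paragraph
console.log(toModuleName('OutOfPageRemovalModule')); // out-of-page-removal
```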
@@ -80,7 +80,7 @@ export class ApiServer {
v1_0.get('/queue/:id', this.handleGetQueue.bind(this));
v1_0.get('/json/:id', this.handleGetJson.bind(this));
v1_0.get('/text/:id', this.handleGetText.bind(this));
v1_0.get('/confidances/:id', this.handleGetConfidances.bind(this));
v1_0.get('/confidences/:id', this.handleGetConfidences.bind(this));
v1_0.get('/csv/:id', this.handleGetCsvList.bind(this));
v1_0.get('/csv/:id/:page/:table', this.handleGetCsv.bind(this));
v1_0.get('/markdown/:id', this.handleGetMarkdown.bind(this));
@@ -279,8 +279,8 @@ export class ApiServer {
this.handleGetFile(req, res, 'text');
}

private handleGetConfidances(req: Request, res: Response) {
this.handleGetFile(req, res, 'confidances');
private handleGetConfidences(req: Request, res: Response) {
this.handleGetFile(req, res, 'confidences');
}

private handleGetCsv(req: Request, res: Response) {
@@ -51,7 +51,7 @@ export type SingleFileType =
| 'text'
| 'pdf'
| 'markdown'
| 'confidances';
| 'confidences';

export type QueueStatus = {
'progress-percentage': number;
@@ -69,7 +69,7 @@ export type OutputConfig = {
text?: boolean;
markdown?: boolean;
xml?: boolean;
confidances?: boolean;
confidences?: boolean;
csv?: boolean;
pdf?: boolean;
};
@@ -25,7 +25,7 @@
"regex": "(\\d+)[ -]*(ans|jarige)"
}, {
"label": "Percent",
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)"
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|prozent)"
}]
}]
],
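
Unlike the other spelling fixes, the `procent` to `prozent` change sits inside a regular expression, so it changes what the `Percent` rule matches. This may matter because `procent` is itself a word for percent in Dutch, and the neighbouring rule already matches the Dutch `jarige`. The sketch below compares the two pattern strings copied from the hunk above. `matchedText` is a hypothetical helper, and the sample inputs are made up for illustration.

```typescript
// Pattern strings copied from the hunk above; the JSON escapes are also
// valid TypeScript string escapes.
const before = new RegExp(
  '([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)',
);
const after = new RegExp(
  '([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|prozent)',
);

// Hypothetical helper: return the matched substring, or null if no match.
function matchedText(pattern: RegExp, text: string): string | null {
  const m = pattern.exec(text);
  return m ? m[0] : null;
}

console.log(matchedText(before, '10 procent')); // 10 procent
console.log(matchedText(after, '10 procent')); // null
console.log(matchedText(before, '12,5 prozent')); // null
console.log(matchedText(after, '12,5 prozent')); // 12,5 prozent
```

Because `per` is listed before `percent` in the alternation, both patterns match only `50 per` in the input `50 percent`.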
@@ -1,7 +1,7 @@
version: '3.3'

services:
# parsr-base is the baseimage with all the dependencies already installed on it
# parsr-base is the base image with all the dependencies already installed on it
# just build it if you need new dependencies otherwise use the publish one
parsr-base:
image: axarev/parsr-base
@@ -23,9 +23,9 @@ In the end, the full request may look like this:
```http
POST /upload HTTP/1.1
Host: localhost:3000
Content-Type: multipart/form-data; boundary=MultipartBoundry
Content-Type: multipart/form-data; boundary=MultipartBoundary
--MultipartBoundry
--MultipartBoundary
Content-Disposition: form-data; name="config"
Content-Type: application/json
@@ -51,12 +51,12 @@ Content-Type: application/json
"paragraphLastLine"
]
}
--MultipartBoundry--
--MultipartBoundary--
Content-Disposition: form-data; name="file"; filename="example.pdf"
Content-Type: application/pdf
<pdf_content>
--MultipartBoundry
--MultipartBoundary
```

## Response
@@ -2,7 +2,32 @@

This page is a guide on how to use the API.

- [API Guide](#api-guide) - [0. Introduction](#0-introduction) - [1. Send Your Document: POST /document](#1-send-your-document-post-document) - [`curl` command](#curl-command) - [Status: 202 - Accepted](#status-202---accepted) - [Status: 415 - Unsupported Media Type](#status-415---unsupported-media-type) - [2. Get the queue status: GET /queue/{id}](#2-get-the-queue-status-get-queueid) - [`curl` command](#curl-command-1) - [Status: 200 - OK](#status-200---ok) - [Status: 201 - Created](#status-201---created) - [Status: 404 - Not Found](#status-404---not-found) - [Status: 500 - Internal Server Error](#status-500---internal-server-error) - [3. Get the results](#3-get-the-results) - [3.1. JSON, Markdown and Text results](#31-json-markdown-and-text-results) - [`curl` command](#curl-command-2) - [Status: 200 - OK](#status-200---ok-1) - [Status: 404 - Not Found](#status-404---not-found-1) - [3.2. CSV List of Files: GET /csv/{id}](#32-csv-list-of-files-get-csvid) - [`curl` command](#curl-command-3) - [Status: 200 - OK](#status-200---ok-2) - [Status: 404 - Not Found](#status-404---not-found-2) - [3.3. CSV File: GET /csv/{id}/{page}/{table}](#33-csv-file-get-csvidpagetable) - [`curl` command](#curl-command-4) - [Status: 200 - OK](#status-200---ok-3) - [Status: 404 - Not Found](#status-404---not-found-3) - [4. Server Configuration Access](#4-server-configuration-access)
- [API Guide](#api-guide)
- [0. Introduction](#0-introduction)
- [1. Send Your Document: POST /document](#1-send-your-document-post-document)
- [`curl` command](#curl-command)
- [Status: 202 - Accepted](#status-202---accepted)
- [Status: 415 - Unsupported Media Type](#status-415---unsupported-media-type)
- [2. Get the queue status: GET /queue/{id}](#2-get-the-queue-status-get-queueid)
- [`curl` command](#curl-command-1)
- [Status: 200 - OK](#status-200---ok)
- [Status: 201 - Created](#status-201---created)
- [Status: 404 - Not Found](#status-404---not-found)
- [Status: 500 - Internal Server Error](#status-500---internal-server-error)
- [3. Get the results](#3-get-the-results)
- [3.1. JSON, Markdown and Text results](#31-json-markdown-and-text-results)
- [`curl` command](#curl-command-2)
- [Status: 200 - OK](#status-200---ok-1)
- [Status: 404 - Not Found](#status-404---not-found-1)
- [3.2. CSV List of Files: GET /csv/{id}](#32-csv-list-of-files-get-csvid)
- [`curl` command](#curl-command-3)
- [Status: 200 - OK](#status-200---ok-2)
- [Status: 404 - Not Found](#status-404---not-found-2)
- [3.3. CSV File: GET /csv/{id}/{page}/{table}](#33-csv-file-get-csvidpagetable)
- [`curl` command](#curl-command-4)
- [Status: 200 - OK](#status-200---ok-3)
- [Status: 404 - Not Found](#status-404---not-found-3)
- [4. Server Configuration Access](#4-server-configuration-access)

## 0. Introduction

@@ -11,7 +36,7 @@ First of all there is a few things to know:
- **The API is RESTful:** The API is over HTTP and follow REST standards.
- **The API is asynchronous:** There is a simple queue system and every job is managed by the API server.

The API has an endpoint prefix `/api` and then, optionaly, the version number `/v1.0`. That mean every request must be send to:
The API has an endpoint prefix `/api` and then, optionally, the version number `/v1.0`. That mean every request must be send to:

- `/api/v1.0`: will use the API version 1.0
- `/api/v1`: will use the latest API version 1.x
@@ -93,7 +118,7 @@ This error means the queue ID doesn't refer to any known processing queue.

### Status: 500 - Internal Server Error

This error means that something went terribly wrong on the backend, probably an error comming Parsr.
This error means that something went terribly wrong on the backend, probably an error coming from Parsr.

## 3. Get the results

@@ -131,7 +156,7 @@ For more information on the JSON format, please [refer to the specific guide](js

#### Status: 404 - Not Found

This error means that the result file doesn't exist. Maybe it wasn't asked to be outputed in the config you sent in the first request.
This error means that the result file doesn't exist. Maybe it wasn't asked to be outputted in the config you sent in the first request.

### 3.2. CSV List of Files: [GET /csv/{id}](https://axatechlab.github.io/Parsr/docs/api.html#api-Output-getCsvList)

@@ -159,7 +184,7 @@ curl -X GET \

#### Status: 404 - Not Found

This error means that the result file doesn't exist. Maybe it wasn't asked to be outputed in the config you sent in the first request.
This error means that the result file doesn't exist. Maybe it wasn't asked to be outputted in the config you sent in the first request.

### 3.3. CSV File: [GET /csv/{id}/{page}/{table}](https://axatechlab.github.io/Parsr/docs/api.html#api-Output-getCsv)

@@ -190,11 +215,11 @@ This CSV output example contains multiline cells and an empty column.

#### Status: 404 - Not Found

This error means that the result file doesn't exist. Maybe `{page}` and `{table}` parameters doesn't refer to an or it wasn't asked to be outputed in the config you sent in the first request.
This error means that the result file doesn't exist. Maybe `{page}` and `{table}` parameters doesn't refer to an or it wasn't asked to be outputted in the config you sent in the first request.

## 4. Server Configuration Access

The API can also be queried to gain access to the following server assts:
The API can also be queried to gain access to the following server assets:

1. **Default Configuration**: The server's default configuration can be queried (at `/api/v1/default-config`) using:

@@ -50,7 +50,7 @@ function main(): void {
.option('-n, --document-name [name]', 'Name of the document')
.option(
'-c, --config <filename>',
"The file's path from which the application's parameres will be loaded",
"The file's path from which the application's parameters will be loaded",
)
.option(
'-l, --log-level <verbosity>',
@@ -87,7 +87,7 @@
"Reported": ["Reported"],
"Title": ["Title"],
"Visit ID": ["VISIT ID", "Visit ID"],
"Visite TYPE": ["VISIT TYPE", "Visit Type"],
"Visit TYPE": ["VISIT TYPE", "Visit Type"],
"Visit No": ["Visit No"]
},
"thresholdRatio": 0.8
@@ -107,7 +107,7 @@
},
{
"label": "Percent",
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)"
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|prozent)"
}
]
}
@@ -84,7 +84,7 @@
},
{
"label": "Percent",
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)"
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|prozent)"
}
]
}
@@ -35,7 +35,7 @@
},
{
"label": "Percent",
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)"
"regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|prozent)"
}
]
}
