Add a quick guide for adding new languages
Closes: #942
Showing 2 changed files with 68 additions and 6 deletions.
@@ -0,0 +1,68 @@
+++
title = "Adding new language"
weight = 30
+++

# Adding a new language for document processing

These commits and issues show how other languages were added:

- [Add Spanish language](https://github.com/eikek/docspell/commit/26dff18ae0d32ce2b32b4d11ce381ada0e99314f)
- [Add Latvian language](https://github.com/eikek/docspell/issues/679) and [PR](https://github.com/eikek/docspell/pull/694/commits/9991ad5fcc43ccefe011a6cc4d01bdae4bcd4573)
- [Add Japanese language](https://github.com/eikek/docspell/issues/948) and [PR](https://github.com/eikek/docspell/pull/961/commits/f994d4b2488e64668ee064676f8c6469d9ccc1be), which had some corrections: [1](https://github.com/eikek/docspell/commit/c59d4f8a6d021ec4b01a92320c211248503f16a5), [Issue](https://github.com/eikek/docspell/issues/973)
- [Add Hebrew language](https://github.com/eikek/docspell/pull/1027)

Some older commits may be a bit out of date, but they still show the
relevant steps:

- add it to `Language.scala`: create a new `case object` and add it to
  the `all` list (then fix the compile errors)
- define a list of month names to support date recognition and update
  `DateFind.scala` to recognize date patterns for that language. Add
  some tests to `DateFindTest`.
- add it to the joex dockerfile so it is available for tesseract
- update the solr migration/field definitions in `SolrSetup`. Create a
  new solr migration that adds the content field for the new
  language; it is mostly copy & paste from other similar changes.
- update `FtsRepository` for the PostgreSQL fulltext search variant;
  if unsure, use `simple` here
- update the elm file so it shows up on the client. This also requires
  adding translations in `Messages.Data.Language`
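
The first two steps can be sketched roughly as follows. This is a
simplified, hypothetical model and not the actual content of
`Language.scala` or `DateFind.scala`: the `LanguageSketch` wrapper, the
`iso2` member and the `fromString`, `monthNames` and `findMonth` helpers
are assumptions made for illustration. It only shows why a new
`case object` in a sealed hierarchy makes the compiler point at the
places that still need updating.

```scala
// Hypothetical, simplified model; not docspell's real Language.scala.
object LanguageSketch {

  sealed trait Language {
    def iso2: String // two-letter language code, e.g. "es"
  }

  object Language {
    case object English extends Language { val iso2 = "en" }
    // Step 1a: the new language becomes a new case object ...
    case object Spanish extends Language { val iso2 = "es" }

    // Step 1b: ... and is appended to the `all` list.
    val all: List[Language] = List(English, Spanish)

    // Look up a language by its code, ignoring case.
    def fromString(code: String): Option[Language] =
      all.find(_.iso2.equalsIgnoreCase(code))
  }

  // Step 2: month names per language, index 0 = January.
  // Because Language is sealed, a missing case is reported by the
  // compiler (a warning by default, an error with fatal warnings).
  def monthNames(lang: Language): List[String] = lang match {
    case Language.English =>
      List("january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december")
    case Language.Spanish =>
      List("enero", "febrero", "marzo", "abril", "mayo", "junio",
        "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre")
  }

  // Returns the month number (1 to 12) if the word is a month name
  // in the given language.
  def findMonth(lang: Language, word: String): Option[Int] = {
    val idx = monthNames(lang).indexWhere(_.equalsIgnoreCase(word))
    if (idx >= 0) Some(idx + 1) else None
  }
}
```

With this sketch, `LanguageSketch.findMonth(LanguageSketch.Language.Spanish, "Marzo")`
evaluates to `Some(3)`, while the same word looked up for English yields `None`.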

## Test

Check that everything is fine with `sbt Test/compile`. After the project
compiles without errors, run `sbt fix` to apply formatting fixes.

It would be good to start up docspell and check the new language a bit,
including whether fulltext search is working.

Sometimes, Solr doesn't support a language. In this case the migration
needs to first add the new *field type*. There are examples for
Lithuanian and Hebrew in the code.
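
As a rough illustration of that ordering, the sketch below builds the
two JSON commands of the Solr Schema API (`add-field-type` and
`add-field`) as plain strings. It is not the code in `SolrSetup`: the
`SolrFieldSketch` object, its helper names, the `text_<lang>` and
`content_<lang>` naming and the chosen tokenizer and filter are
assumptions for illustration only. A real field type for a language
without built-in support usually needs a more careful analyzer. The
list returned by `migration` keeps the field type first, because a field
can only reference a type that already exists.

```scala
// Hypothetical helper; names and analyzer settings are made up for
// illustration and are not what docspell's SolrSetup uses.
object SolrFieldSketch {

  // Schema API command that registers a new text field type.
  def addFieldType(typeName: String): String =
    s"""{
       |  "add-field-type": {
       |    "name": "$typeName",
       |    "class": "solr.TextField",
       |    "analyzer": {
       |      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
       |      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]
       |    }
       |  }
       |}""".stripMargin

  // Schema API command that adds a content field using that type.
  def addContentField(fieldName: String, typeName: String): String =
    s"""{
       |  "add-field": {
       |    "name": "$fieldName",
       |    "type": "$typeName",
       |    "indexed": true,
       |    "stored": true
       |  }
       |}""".stripMargin

  // Commands in the order they would have to be sent:
  // first the field type, then the field that refers to it.
  def migration(lang: String): List[String] = {
    val typeName = s"text_$lang"
    List(addFieldType(typeName), addContentField(s"content_$lang", typeName))
  }
}
```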

For the docker image, you can run

```bash
PLATFORMS=linux/amd64 ./build.sh 0.36.0-SNAPSHOT
```

in the `docker/dockerfile` directory to build the docker image (just
choose some version, it doesn't matter).

## Non-NLP only

Note that the steps above add a language without support for NLP.
Including support for NLP means that the
[Stanford NLP](https://github.com/stanfordnlp/CoreNLP)
library needs to provide models for it and these must be included in
the build and tested a bit.

## Opening issues on GitHub

You can also open an issue on GitHub requesting support for a language.
I kindly ask you to include all necessary information, as in
[this](https://github.com/eikek/docspell/issues/1540) issue. I know
that I can dig it out from websites, but it would be nice to have
everything ready. Also, it is better to learn some details from a local
person, such as which date patterns are more likely to appear than
others.