Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add example scripts and documentation.
Showing 10 changed files with 100,144 additions and 50 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,33 @@ | ||
# word-level-language-id | ||
Simple word-level language ID using Viterbi based on unigram frequencies and character n-grams. | ||
# Word-level language ID | ||
Simple word-level language identification using the Viterbi algorithm based on unigram frequencies and character n-grams. | ||
|
||
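As a rough illustration of the approach described above (this is not code from the repository), Viterbi decoding over per-word language scores can be sketched as follows. The function names, the toy counts, the add-one smoothing, and the `stay_prob` transition parameter are all assumptions made for the example; the actual `identify.py` may work differently.

```python
import math

# Toy unigram counts, invented for this example only.
TOY_COUNTS = {
    "en": {"the": 50, "is": 30, "a": 40, "house": 5},
    "ga": {"tá": 40, "an": 45, "teach": 6, "mór": 8, "a": 20},
}


def toy_emission(word, lang):
    """Add-one smoothed log P(word | lang) from the toy unigram counts."""
    counts = TOY_COUNTS[lang]
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return math.log((counts.get(word.lower(), 0) + 1) / (total + vocab))


def viterbi_tags(words, languages, emission, stay_prob=0.9):
    """Return the most likely language tag for each word.

    `emission(word, lang)` must return a log-probability. `stay_prob` is the
    assumed probability that the next word is in the same language.
    """
    if not words:
        return []
    n = len(languages)
    switch_prob = (1.0 - stay_prob) / (n - 1) if n > 1 else 1.0

    def transition(prev, cur):
        return math.log(stay_prob if prev == cur else switch_prob)

    # Uniform prior over languages for the first word.
    scores = {lang: math.log(1.0 / n) + emission(words[0], lang)
              for lang in languages}
    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for cur in languages:
            best_prev = max(languages,
                            key=lambda prev: scores[prev] + transition(prev, cur))
            new_scores[cur] = (scores[best_prev] + transition(best_prev, cur)
                               + emission(word, cur))
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Follow the backpointers from the best final state.
    path = [max(languages, key=lambda lang: scores[lang])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


sentence = "the house tá an teach mór".split()
print(viterbi_tags(sentence, ["en", "ga"], toy_emission))
# ['en', 'en', 'ga', 'ga', 'ga', 'ga']
```

The transition term is what makes this a sequence model rather than a per-word classifier: a high `stay_prob` discourages switching languages on a single ambiguous word such as "a".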
## Usage | ||
|
||
I recommend using Python 3 for better Unicode support. | ||
|
||
To try out the system quickly, corpora and language models for British English and Irish are already included; see below for how to add new ones. You may want to post-process the lexicons, because the Irish one, for example, contains some English words and vice versa. | ||
|
||
Run word-level language ID on some example sentences: | ||
|
||
```bash | ||
python word-level-language-id/identify.py | ||
``` | ||
|
||
## Train new language models | ||
|
||
Create or download a unigram frequency lexicon, e.g. from the [Crúbadán Project](http://crubadan.org/), which provides them for over 2000 languages. | ||
|
||
For example, download and unzip British English and Irish: | ||
|
||
```bash | ||
wget http://crubadan.org/files/en-GB.zip | ||
wget http://crubadan.org/files/ga.zip | ||
|
||
unzip '*.zip' -d word-level-language-id/corpora | ||
``` | ||
|
||
Train the language models: | ||
|
||
```bash | ||
python word-level-language-id/train.py | ||
``` |
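The contents of `train.py` are not shown here, so the following is only a minimal sketch of what training a character n-gram model from a unigram frequency lexicon could look like. It assumes the lexicon has already been parsed into a word-to-count mapping; the function names, the `^`/`$` padding symbols, and the add-one smoothing are hypothetical choices for the example, not the repository's implementation.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Split a word into character n-grams, padded with start/end markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_char_model(lexicon, n=3):
    """Count character n-grams, weighting each word by its frequency.

    `lexicon` maps word -> count (an assumed in-memory format).
    """
    ngram_counts = Counter()
    context_counts = Counter()
    alphabet = {"$"}
    for word, freq in lexicon.items():
        alphabet.update(word.lower())
        for gram in char_ngrams(word, n):
            ngram_counts[gram] += freq
            context_counts[gram[:-1]] += freq
    # +1 reserves probability mass for characters never seen in training.
    return ngram_counts, context_counts, len(alphabet) + 1


def word_logprob(word, model, n=3):
    """Add-one smoothed log-probability of a word under a character model."""
    ngram_counts, context_counts, vocab_size = model
    return sum(
        math.log((ngram_counts[g] + 1) / (context_counts[g[:-1]] + vocab_size))
        for g in char_ngrams(word, n)
    )


# Tiny invented lexicons, for illustration only.
en_model = train_char_model({"the": 50, "house": 5, "this": 20})
ga_model = train_char_model({"tá": 40, "teach": 6, "bhfuil": 10})

# "that" is in neither lexicon, but its "th" onset is frequent in the English one.
print(word_logprob("that", en_model) > word_logprob("that", ga_model))
```

A character model like this gives a usable score for words missing from the unigram lexicon, which is presumably why the README pairs the two sources of evidence.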
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
Crúbadán language dataset (c) by Kevin Scannell | ||
|
||
This Crúbadán language dataset is licensed under a | ||
Creative Commons Attribution 4.0 International License. | ||
|
||
You should have received a copy of the license along with this | ||
work. If not, see <http://creativecommons.org/licenses/by/4.0/>. |