Yomichan Dictionary Parser

This is a library that handles parsing the Yomichan dictionary format.

This library was created to simplify using a Yomichan dictionary in a Java application.

The Yomichan dictionary format is awkward to parse in Java because the same JSON key can hold an array, an object, or a string depending on the entry. That does not map cleanly onto Java's type system unless you declare everything as Object, check instanceof, and cast.
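To illustrate the problem, here is a minimal sketch of what handling such a value by hand can look like. It is independent of this library: ManualDefinitionReader is a hypothetical name, and the value is assumed to have already been deserialized into plain Java objects (a String, a List, or a Map with a "text" key).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of this library. It flattens a value that
// may be a String, a List, or a Map with a "text" key into plain text.
final class ManualDefinitionReader {

    static String toText(Object value) {
        if (value instanceof String s) {
            return s;
        }
        if (value instanceof List<?> list) {
            // Each element can itself be any of the three shapes.
            return list.stream()
                .map(ManualDefinitionReader::toText)
                .collect(Collectors.joining("; "));
        }
        if (value instanceof Map<?, ?> map) {
            Object text = map.get("text");
            return text == null ? "" : toText(text);
        }
        throw new IllegalArgumentException("Unexpected JSON value: " + value);
    }
}
```

Every consumer of the raw data would need checks like these, which is the boilerplate a typed object model is meant to remove.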

This library was created based on the JSON schema definitions here.

Requirements

Java 17

Installation

Add the dependency using JitPack:

repositories {
    maven { url 'https://jitpack.io' }
}

dependencies {
    // Specific version
    implementation 'com.github.caseyscarborough:yomichan-dictionary-parser:1.0.2'
    
    // Or, instead, the master branch (latest)
    implementation 'com.github.caseyscarborough:yomichan-dictionary-parser:master-SNAPSHOT'
}

Usage

Parse a Dictionary File

You can parse a dictionary .zip file directly by passing either the path to the file or a File object.

YomichanParser parser = new YomichanParser();
YomichanDictionary dictionary = parser.parseDictionary("/path/to/yomichan/dictionary.zip");

This will return a YomichanDictionary object, which contains the object representation of the dictionary including the index, terms, and tags.

Note: The dictionary file will be extracted to a temporary directory which will be removed after parsing.
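The extract-then-clean-up pattern described in the note can be sketched with the standard library alone. This is not the library's actual implementation; TempExtraction and its methods are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical illustration of extracting a zip to a temporary directory
// and removing it afterwards. Not the library's actual implementation.
final class TempExtraction {

    // Unzips the archive into a new temporary directory and returns its path.
    static Path extract(Path zipFile) throws IOException {
        Path target = Files.createTempDirectory("yomichan-");
        try (InputStream in = Files.newInputStream(zipFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = target.resolve(entry.getName()).normalize();
                // Reject entries that would escape the target directory.
                if (!out.startsWith(target)) {
                    throw new IOException("Unsafe zip entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out);
                }
            }
        }
        return target;
    }

    // Deletes the directory and everything below it, children first.
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            for (Path p : paths.sorted(Comparator.reverseOrder()).toList()) {
                Files.delete(p);
            }
        }
    }
}
```

A caller following this pattern would typically run the parsing step inside try/finally so that deleteRecursively still runs when parsing fails.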

Parse Extracted Dictionary Files

You can also individually parse the index, terms, and tags by passing the path (or File object) to the JSON file from the extracted dictionary.

Index index = parser.parseIndex("/path/to/yomichan/index.json");
List<Term> terms = parser.parseTerms("/path/to/yomichan/term_bank_1.json");
List<Tag> tags = parser.parseTags("/path/to/yomichan/tag_bank_1.json");
List<Kanji> kanjis = parser.parseKanjis("/path/to/yomichan/kanji_bank_1.json");

Using the `YomichanDictionary` Object

The YomichanDictionary object contains the index, terms, kanji, and tags from the dictionary.

// The index parsed from the index.json file within the dictionary.
// Contains the metadata for the dictionary.
Index index = dictionary.getIndex();

// The dictionary type, e.g. TERM, KANJI, FREQUENCY, PITCH, or KANJI_FREQUENCY
YomichanDictionaryType type = dictionary.getType();

// The terms parsed from the term_bank.json files within the dictionary.
// This will be populated when the type is TERM.
List<Term> terms = dictionary.getTerms();

// The metadata parsed from the term_meta_bank.json files within the dictionary.
// This will be populated when the type is FREQUENCY or PITCH.
List<TermMetadata> metadata = dictionary.getTermMetadata();

// The kanji parsed from the kanji_bank.json files within the dictionary.
// This will be populated when the type is KANJI.
List<Kanji> kanjis = dictionary.getKanjis();

// The metadata parsed from the kanji_meta_bank.json files within the dictionary.
// This will be populated when the type is KANJI_FREQUENCY.
List<KanjiMetadata> kanjiMetadata = dictionary.getKanjiMetadata();

// The tags parsed from the tag_bank.json files within the dictionary.
List<Tag> tags = dictionary.getTags();

The `Index` Object

The index contains metadata about the dictionary such as the name, description, attribution, and version:

Format - The version of the dictionary
Version - The version of the dictionary (alias for format)
Title - The title of the dictionary
Description - The description of the dictionary
Author - The author of the dictionary
Attribution - Attribution information
Url - URL for the source of the dictionary
Revision - Revision of the dictionary
Frequency Mode - OCCURRENCE or RANK based frequency mode

Java Examples

// The version of the dictionary (both methods return the version).
index.getFormat();
index.getVersion();
// The title and description of the dictionary.
index.getTitle();
index.getDescription();
// The author of the dictionary.
index.getAuthor();
// Attribution information.
index.getAttribution();
// URL for the source of the dictionary.
index.getUrl();
// Revision of the dictionary.
index.getRevision();
// OCCURRENCE or RANK based frequency mode.
Index.FrequencyMode mode = index.getFrequencyMode();

For more details and the full list of methods, see the Yomichan Index JSON Schema or the Index class.

Working with Terms

The terms have been converted from their array format in the dictionary file to an object with the following properties:

Term - The term itself, e.g. "読む"
Reading - The reading of the term, e.g. "よむ"
Definition Tags - Tags for the definitions, e.g. "v1", "vt"
Term Tags - Tags for the entire term, e.g. "common"
Score - Score used to determine popularity.
Rules - String of space-separated rule identifiers for the definition, used to validate deinflection, e.g. v1, v5, vs, adj-i
Sequence Number - Sequence number for the term. Terms with the same sequence number are usually shown together.
Contents - List of definitions for the term.

Java Examples

Term term = terms.get(0);
// The term itself, e.g. "読む"
String word = term.getTerm();
// The reading of the term, e.g. "よむ"    
String reading = term.getReading();
// Tags for the definitions, e.g. "v1", "vt"
List<String> definitionTags = term.getDefinitionTags();
// Tags for the entire term, e.g. "common"
List<String> termTags = term.getTermTags();
// Score used to determine popularity.
Integer score = term.getScore();
// Rule identifiers for the definition, used to validate
// deinflection, e.g. v1, v5, vs, adj-i. They are stored as a
// space-separated string in the dictionary file.
List<String> rules = term.getRules();
// Sequence number for the term. Terms with the
// same sequence number are usually shown together.
Integer sequence = term.getSequenceNumber();
// List of definitions for the term.
List<Content> contents = term.getContents();
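Because terms that share a sequence number are usually shown together, a caller may want to group them before display. The sketch below shows one way to do that. TermEntry is a hypothetical record standing in for the library's Term class, reduced to three fields so that the example compiles on its own.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SequenceGrouping {

    // Hypothetical stand-in for the library's Term class, reduced to the
    // fields this example needs.
    record TermEntry(String term, String reading, Integer sequenceNumber) {}

    // Groups entries by sequence number, keeping first-seen order.
    // Entries without a sequence number are skipped.
    static Map<Integer, List<TermEntry>> bySequence(List<TermEntry> entries) {
        return entries.stream()
            .filter(e -> e.sequenceNumber() != null)
            .collect(Collectors.groupingBy(
                TermEntry::sequenceNumber,
                LinkedHashMap::new,
                Collectors.toList()));
    }
}
```

With the real Term objects, the same collector could presumably key on term.getSequenceNumber() instead.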

The definitions (the Content list) can be in one of three formats: TEXT, IMAGE, or STRUCTURED_CONTENT.

TEXT definitions are simple and only contain a string of text for the definition:

Content content = contents.get(0);
// The type of content, e.g. TEXT, IMAGE, STRUCTURED_CONTENT
ContentType type = content.getType();
// The text of the definition when the type is TEXT, e.g. "to read"
String text = content.getText();

The STRUCTURED_CONTENT type is a more complex definition that essentially maps to the structure of specific HTML tags. This full structure from the Yomichan dictionary is retained in the Java object.

For example, it might be a ul or table type. Examples are shown below:

Unordered List Example

{
  "content": [
    {
      "text": "to read",
      "tag": "li"
    },
    {
      "text": "to decipher",
      "tag": "li"
    }
  ],
  "tag": "ul"
}

Table Example

{
  "content": [
    {
      "content": [
        {
          "text": "definition",
          "tag": "th"
        }
      ],
      "tag": "tr"
    },
    {
      "content": [
        {
          "text": "to read",
          "tag": "td"
        }
      ],
      "tag": "tr"
    }
  ],
  "tag": "table"
}

Structured content nodes also carry additional properties such as styles (which map to CSS properties), data (which map to data attributes on the HTML elements), and language.
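Since structured content mirrors HTML, it can be turned back into markup with a short recursive walk. The sketch below works on plain Map and List values shaped like the JSON examples above (keys "tag", "text", and "content"). It does not use the library's Content class, StructuredContentHtml is a hypothetical name, and the set of allowed tags is an arbitrary choice for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical renderer, not part of this library. It walks nodes shaped
// like the JSON examples ("tag", "text", "content") and emits HTML.
final class StructuredContentHtml {

    // Only these tags are emitted; other nodes render without a wrapper.
    private static final Set<String> ALLOWED =
        Set.of("ul", "ol", "li", "table", "tr", "th", "td", "div", "span");

    static String render(Map<?, ?> node) {
        String tag = (node.get("tag") instanceof String t && ALLOWED.contains(t)) ? t : null;
        StringBuilder html = new StringBuilder();
        if (tag != null) {
            html.append('<').append(tag).append('>');
        }
        if (node.get("text") instanceof String text) {
            html.append(escape(text));
        }
        if (node.get("content") instanceof List<?> children) {
            for (Object child : children) {
                if (child instanceof Map<?, ?> childNode) {
                    html.append(render(childNode));
                }
            }
        }
        if (tag != null) {
            html.append("</").append(tag).append('>');
        }
        return html.toString();
    }

    // Escapes the characters that would otherwise be read as markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For the unordered list example above, render returns `<ul><li>to read</li><li>to decipher</li></ul>`. Styles, data attributes, and language are ignored in this sketch.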

For more information take a look at the Yomichan Term Bank v3 JSON Schema or the Term class.

Working with Tags

Similar to terms, the tags have been converted from their array format in the dictionary file to an object, but the structure is far simpler. Tags have the following:

Name - The name of the tag
Category - The category of the tag
Order - The sorting order of the tag
Notes - Notes for the tag
Score - The score used to determine popularity. Negative values are more rare and positive values are more frequent. This score is also used to sort search results.
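As a rough illustration of the array-to-object conversion, the sketch below maps one raw row to a typed value. The element order (name, category, order, notes, score) is an assumption taken from the property list above, and TagRowMapper and TagEntry are hypothetical names rather than the library's classes.

```java
import java.util.List;

final class TagRowMapper {

    // Hypothetical stand-in for the library's Tag class.
    record TagEntry(String name, String category, Integer order, String notes, Integer score) {}

    // Converts one row such as ["v1", "partOfSpeech", -3, "Ichidan verb", 0].
    // The element order is assumed, not taken from the library's source.
    static TagEntry fromRow(List<?> row) {
        if (row.size() != 5) {
            throw new IllegalArgumentException("Expected 5 elements but got " + row.size());
        }
        return new TagEntry(
            (String) row.get(0),
            (String) row.get(1),
            toInteger(row.get(2)),
            (String) row.get(3),
            toInteger(row.get(4)));
    }

    // JSON libraries may hand numbers back as Integer, Long, or Double.
    private static Integer toInteger(Object value) {
        return value instanceof Number n ? n.intValue() : null;
    }
}
```

Once a row is mapped this way, callers read named fields instead of remembering which array index holds which value.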

Java Examples

Tag tag = tags.get(0);
// The name of the tag.
String name = tag.getName();
// The category for the tag.
String category = tag.getCategory();
// Sorting order for the tag.
Integer order = tag.getOrder();
// Notes for the tag.
String notes = tag.getNotes();
// Score used to determine popularity. Negative values are more
// rare and positive values are more frequent. This score is
// also used to sort search results.
Integer score = tag.getScore();

For more information take a look at the Yomichan Tag Bank v3 JSON Schema or the Tag class.

Working with Kanji

Kanji have the following fields:

Character - The kanji character
On'yomi - A list of on'yomi readings (in katakana)
Kun'yomi - A list of kun'yomi readings (in hiragana)
Meanings - A list of all meanings
Tags - A list of tags for the kanji
Stats - Key-value pairs of statistics for the kanji

Java Examples

Kanji kanji = kanjis.get(0);
// The kanji character
String character = kanji.getCharacter();
// A list of on'yomi readings (in katakana)
List<String> onyomi = kanji.getOnyomi();
// A list of kun'yomi readings (in hiragana)
List<String> kunyomi = kanji.getKunyomi();
// A list of all meanings
List<String> meanings = kanji.getMeanings();
// A list of tags for the kanji
List<String> tags = kanji.getTags();
// Key-value pairs of statistics for the kanji
Map<String, String> stats = kanji.getStats();

For more information take a look at the Yomichan Kanji Bank v3 JSON Schema or the Kanji class.

TODO

Implement index.json files
Implement term_bank.json files for version 3
Implement term_meta_bank.json files
Implement tag_bank.json files
Implement kanji_bank.json files for version 3
Implement kanji_meta_bank.json files for version 3
Implement kanji_bank.json files for version 1
Implement term_bank.json files for version 1
