caseyscarborough/yomichan-dictionary-parser

Yomichan Dictionary Parser

This is a library that handles parsing the Yomichan dictionary format.

It was created to simplify using a Yomichan dictionary in a Java application.

The Yomichan dictionary format cannot be easily parsed in Java without manual work, because the JSON terms can use arrays, objects, or strings for the same keys. This makes the format difficult to map onto Java's type system without using Object everywhere, checking instanceof, and casting.
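To illustrate the problem, here is a minimal sketch of what handling such a value looks like without a typed model. The `DefinitionFlattener` class is hypothetical and not part of this library. It assumes the value has already been deserialized by a generic JSON parser into plain `String`, `List`, and `Map` instances, and that objects carry their text under `text` and their children under `content`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this library.
class DefinitionFlattener {

    // Collects all text from a loosely typed value that may be a String,
    // a List of nested values, or a Map with "text" and/or "content" keys
    // (key names are an assumption for this sketch).
    static List<String> flatten(Object value) {
        List<String> out = new ArrayList<>();
        collect(value, out);
        return out;
    }

    private static void collect(Object value, List<String> out) {
        if (value instanceof String) {
            out.add((String) value);
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                collect(item, out);
            }
        } else if (value instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) value;
            collect(map.get("text"), out);
            collect(map.get("content"), out);
        }
        // Anything else (null, numbers, booleans) is ignored in this sketch.
    }
}
```

Every consumer of the raw format would need a chain of checks like this, which is the boilerplate the typed objects in this library are meant to replace.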

This library was created based on the JSON schema definitions here.

Requirements

  • Java 17

Installation

Add the dependency using JitPack:

repositories {
    maven { url 'https://jitpack.io' }
}

dependencies {
    // Specific version
    implementation 'com.github.caseyscarborough:yomichan-dictionary-parser:1.0.2'
    
    // Master branch (latest)
    implementation 'com.github.caseyscarborough:yomichan-dictionary-parser:master-SNAPSHOT'
}

Usage

Parse a Dictionary File

You can parse a dictionary .zip file directly by passing either the path to the file or a File object.

YomichanParser parser = new YomichanParser();
YomichanDictionary dictionary = parser.parseDictionary("/path/to/yomichan/dictionary.zip");

This will return a YomichanDictionary object, which contains the object representation of the dictionary including the index, terms, and tags.

Note: The dictionary file will be extracted to a temporary directory which will be removed after parsing.

Parse Extracted Dictionary Files

You can also individually parse the index, terms, and tags by passing the path (or File object) to the JSON file from the extracted dictionary.

Index index = parser.parseIndex("/path/to/yomichan/index.json");
List<Term> terms = parser.parseTerms("/path/to/yomichan/term_bank_1.json");
List<Tag> tags = parser.parseTags("/path/to/yomichan/tag_bank_1.json");
List<Kanji> kanjis = parser.parseKanjis("/path/to/yomichan/kanji_bank_1.json");
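When working with an extracted dictionary, you first need to decide which parse method applies to each file. The sketch below classifies a file name by its bank type. The `BankFiles` class is hypothetical and not part of this library, and the numbered-suffix naming for the meta banks (for example `term_meta_bank_1.json`) is an assumption based on the term, tag, and kanji file names shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of this library.
class BankFiles {

    // Matches names such as term_bank_1.json or kanji_meta_bank_12.json.
    // The numbered suffix on meta banks is an assumption.
    private static final Pattern BANK =
            Pattern.compile("(term_meta|kanji_meta|term|kanji|tag)_bank_(\\d+)\\.json");

    // Returns "index", "term", "term_meta", "kanji", "kanji_meta", "tag",
    // or "unknown" for a file name (without any directory part).
    static String kindOf(String fileName) {
        if (fileName.equals("index.json")) {
            return "index";
        }
        Matcher m = BANK.matcher(fileName);
        return m.matches() ? m.group(1) : "unknown";
    }
}
```

A caller could then dispatch on the result, for example passing "term" files to `parseTerms` and "tag" files to `parseTags`.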

Using the YomichanDictionary Object

The YomichanDictionary object contains the index, terms, kanji, and tags from the dictionary.

// The index parsed from the index.json file within the dictionary.
// Contains the metadata for the dictionary.
Index index = dictionary.getIndex();

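// The dictionary type, e.g. TERM or KANJI (FREQUENCY, PITCH, and
// KANJI_FREQUENCY are also referenced in the comments below).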
YomichanDictionaryType type = dictionary.getType();

// The terms parsed from the term_bank.json files within the dictionary.
// This will be populated when the type is TERM.
List<Term> terms = dictionary.getTerms();

// The metadata parsed from the term_meta_bank.json files within the dictionary.
// This will be populated when the type is FREQUENCY or PITCH.
List<TermMetadata> metadata = dictionary.getTermMetadata();

// The kanji parsed from the kanji_bank.json files within the dictionary.
// This will be populated when the type is KANJI.
List<Kanji> kanjis = dictionary.getKanjis();

// The metadata parsed from the kanji_meta_bank.json files within the dictionary.
// This will be populated when the type is KANJI_FREQUENCY.
List<KanjiMetadata> kanjiMetadata = dictionary.getKanjiMetadata();

// The tags parsed from the tag_bank.json files within the dictionary.
List<Tag> tags = dictionary.getTags();

The Index Object

The index contains metadata about the dictionary such as the name, description, attribution, and version:

  • Format - The version of the dictionary
  • Version - The version of the dictionary (alias for format)
  • Title - The title of the dictionary
  • Description - The description of the dictionary
  • Author - The author of the dictionary
  • Attribution - Attribution information
  • Url - URL for the source of the dictionary
  • Revision - Revision of the dictionary
  • Frequency Mode - OCCURRENCE or RANK based frequency mode

Java Examples

// The version of the dictionary (both methods return the version).
index.getFormat();
index.getVersion();
// The title and description of the dictionary.
index.getTitle();
index.getDescription();
// The author of the dictionary.
index.getAuthor();
// Attribution information.
index.getAttribution();
// URL for the source of the dictionary.
index.getUrl();
// Revision of the dictionary.
index.getRevision();
// OCCURRENCE or RANK based frequency mode.
Index.FrequencyMode mode = index.getFrequencyMode();

For more details and the full list of methods, take a look at the Yomichan Index JSON Schema or the Index class.

Working with Terms

The terms have been converted from their array format in the dictionary file to an object with the following properties:

  • Term - The term itself, e.g. "読む"
  • Reading - The reading of the term, e.g. "よむ"
  • Definition Tags - Tags for the definitions, e.g. "v1", "vt"
  • Term Tags - Tags for the entire term, e.g. "common"
  • Score - Score used to determine popularity.
  • Rules - String of space-separated rule identifiers for the definition, used to validate deinflection, e.g. v1, v5, vs, adj-i
  • Sequence Number - Sequence number for the term. Terms with the same sequence number are usually shown together.
  • Contents - List of definitions for the term.

Java Examples

Term term = terms.get(0);
// The term itself, e.g. "読む"
String word = term.getTerm();
// The reading of the term, e.g. "よむ"    
String reading = term.getReading();
// Tags for the definitions, e.g. "v1", "vt"
List<String> definitionTags = term.getDefinitionTags();
// Tags for the entire term, e.g. "common"
List<String> termTags = term.getTermTags();
// Score used to determine popularity.
Integer score = term.getScore();
// Space-separated rule identifiers for the definition,
// used to validate deinflection,
// e.g. v1, v5, vs, adj-i
List<String> rules = term.getRules();
// Sequence number for the term. Terms with the
// same sequence number are usually shown together.
Integer sequence = term.getSequenceNumber();
// List of definitions for the term.
List<Content> contents = term.getContents();
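Since terms with the same sequence number are usually shown together, a common first step is to group them. The following is a minimal sketch using a hypothetical `SimpleTerm` stand-in that mirrors only the `getTerm()` and `getSequenceNumber()` accessors shown above; with the real library you would apply the same loop to `Term` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this library.
class SequenceGrouping {

    // Stand-in for the library's Term class, reduced to two accessors.
    static final class SimpleTerm {
        private final String term;
        private final Integer sequenceNumber;

        SimpleTerm(String term, Integer sequenceNumber) {
            this.term = term;
            this.sequenceNumber = sequenceNumber;
        }

        String getTerm() { return term; }
        Integer getSequenceNumber() { return sequenceNumber; }
    }

    // Groups term strings by sequence number, keeping first-seen order
    // for both the groups and the terms inside each group.
    static Map<Integer, List<String>> groupBySequence(List<SimpleTerm> terms) {
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (SimpleTerm t : terms) {
            groups.computeIfAbsent(t.getSequenceNumber(), k -> new ArrayList<>())
                  .add(t.getTerm());
        }
        return groups;
    }
}
```

A `LinkedHashMap` is used here so the groups come back in the order they appear in the term bank; a `TreeMap` would be the alternative if you want them ordered by sequence number.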

Each definition in the Content list is in one of three formats: TEXT, IMAGE, or STRUCTURED_CONTENT.

TEXT definitions are simple and only contain a string of text for the definition:

Content content = contents.get(0);
// The type of content, e.g. TEXT, IMAGE, STRUCTURED_CONTENT
ContentType type = content.getType();
// The text of the definition when the type is TEXT, e.g. "to read"
String text = content.getText();

The STRUCTURED_CONTENT type is a more complex definition that essentially maps to the structure of specific HTML tags. The full structure from the Yomichan dictionary is retained in the Java object.

For example, it might be a ul or table type. Examples are shown below:

Unordered List Example

{
  "content": [
    {
      "text": "to read",
      "tag": "li"
    },
    {
      "text": "to decipher",
      "tag": "li"
    }
  ],
  "tag": "ul"
}

Table Example

{
  "content": [
    {
      "content": [
        {
          "text": "definition",
          "tag": "th"
        }
      ],
      "tag": "tr"
    },
    {
      "content": [
        {
          "text": "to read",
          "tag": "td"
        }
      ],
      "tag": "tr"
    }
  ],
  "tag": "table"
}

Structured content nodes also have many additional properties, such as styles (which map to CSS properties), data (which maps to data attributes on the HTML elements), and language.
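Because the structure mirrors HTML, it can be walked recursively. The sketch below renders a node shaped like the JSON examples above into an HTML string. It is an illustration only: `StructuredContentRenderer` is a hypothetical class that works on plain `Map` and `List` values rather than this library's Content objects, it ignores styles, data, and language, and it does not escape text.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this library.
class StructuredContentRenderer {

    // Renders a node with "tag", optional "text", and optional "content"
    // (a list of child nodes) into an HTML string.
    static String toHtml(Map<String, ?> node) {
        StringBuilder sb = new StringBuilder();
        render(node, sb);
        return sb.toString();
    }

    private static void render(Map<?, ?> node, StringBuilder sb) {
        Object tag = node.get("tag");
        sb.append('<').append(tag).append('>');

        Object text = node.get("text");
        if (text != null) {
            sb.append(text); // No HTML escaping in this sketch.
        }

        Object content = node.get("content");
        if (content instanceof List) {
            for (Object child : (List<?>) content) {
                if (child instanceof Map) {
                    render((Map<?, ?>) child, sb);
                }
            }
        }

        sb.append("</").append(tag).append('>');
    }
}
```

Passing in the unordered list example would produce a `ul` element containing two `li` elements, one per definition.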

For more information take a look at the Yomichan Term Bank v3 JSON Schema or the Term class.

Working with Tags

Similar to terms, the tags have been converted from their array format in the dictionary file to an object, but the structure is far simpler. Tags have the following:

  • Name - The name of the tag
  • Category - The category of the tag
  • Order - The sorting order of the tag
  • Notes - Notes for the tag
  • Score - The score used to determine popularity. Negative values are more rare and positive values are more frequent. This score is also used to sort search results.

Java Examples

Tag tag = tags.get(0);
// The name of the tag.
String name = tag.getName();
// The category for the tag.
String category = tag.getCategory();
// Sorting order for the tag.
Integer order = tag.getOrder();
// Notes for the tag.
String notes = tag.getNotes();
// Score used to determine popularity. Negative values are more
// rare and positive values are more frequent. This score is
// also used to sort search results.
Integer score = tag.getScore();
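As a sketch of how the order and score fields might be used together, the example below sorts tags by order and breaks ties by score. The `SimpleTag` class is a hypothetical stand-in for the library's Tag class, and the sort direction (ascending order, then higher score first) is an assumption; adjust it to match how your application displays tags.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of this library.
class TagSorting {

    // Stand-in for the library's Tag class, reduced to three fields.
    static final class SimpleTag {
        final String name;
        final int order;
        final int score;

        SimpleTag(String name, int order, int score) {
            this.name = name;
            this.order = order;
            this.score = score;
        }
    }

    // Returns tag names sorted by ascending order, then by descending score.
    // The input list is not modified.
    static List<String> sortedNames(List<SimpleTag> tags) {
        List<SimpleTag> copy = new ArrayList<>(tags);
        copy.sort(Comparator.comparingInt((SimpleTag t) -> t.order)
                .thenComparing(Comparator.comparingInt((SimpleTag t) -> t.score).reversed()));
        List<String> names = new ArrayList<>();
        for (SimpleTag t : copy) {
            names.add(t.name);
        }
        return names;
    }
}
```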

For more information take a look at the Yomichan Tag Bank v3 JSON Schema or the Tag class.

Working with Kanji

Kanji have the following fields:

  • Character - The kanji character
  • On'yomi - A list of on'yomi readings (in katakana)
  • Kun'yomi - A list of kun'yomi readings (in hiragana)
  • Meanings - A list of all meanings
  • Tags - A list of tags for the kanji
  • Stats - Key-value pairs of statistics for the kanji

Java Examples

Kanji kanji = kanjis.get(0);
// The kanji character
String character = kanji.getCharacter();
// A list of on'yomi readings (in katakana)
List<String> onyomi = kanji.getOnyomi();
// A list of kun'yomi readings (in hiragana)
List<String> kunyomi = kanji.getKunyomi();
// A list of all meanings
List<String> meanings = kanji.getMeanings();
// A list of tags for the kanji
List<String> tags = kanji.getTags();
// Key-value pairs of statistics for the kanji
Map<String, String> stats = kanji.getStats();

For more information take a look at the Yomichan Kanji Bank v3 JSON Schema or the Kanji class.

TODO

  • Implement index.json files
  • Implement term_bank.json files for version 3
  • Implement term_meta_bank.json files
  • Implement tag_bank.json files
  • Implement kanji_bank.json files for version 3
  • Implement kanji_meta_bank.json files for version 3
  • Implement kanji_bank.json files for version 1
  • Implement term_bank.json files for version 1