Special Text, Image Elements and Nonstandard Layouts #5
Unicode is used to represent Chinese text in the NTI Reader. Siddhaṃ cannot yet be represented in Unicode, although there is a proposal to add Siddhaṃ to the Unicode standard (Pandey, 2012). Some documents in the Esoteric section of the Taishō contain a mix of Chinese and Siddhaṃ script, which presents a challenge for text representation. The solution adopted in SAT and CBETA is to display the Siddhaṃ elements as images, which prevents searching and other text processing.
Some texts in the Esoteric section of the Taishō contain Siddhaṃ script, which is represented in image format in SAT and CBETA. A view of the Siddhaṃ in T 944B in SAT can be found at the link below, and a screenshot can be found in the attached file.
The original scanned images can be seen on CBETA. Other texts that include Siddhaṃ are T 974B, T 983B, T 1005B, T 1034, T 1062B, T 116, T 1168B, T 1208, T 1213 and T 1244. There is presently no way for the NTI Reader to display scripts that are not stored as Unicode. Unicode is a text encoding standard that can express the scripts of most of the world's languages. Siddhaṃ cannot yet be represented in Unicode, although there is a current proposal to add Siddhaṃ to the Unicode standard (Pandey, 2012). Some investigation is required to determine the best way to handle this. Since the NTI Reader is considered a monolingual corpus, this is somewhat out of scope.
In addition to Siddhaṃ, some rare Chinese characters in the Taishō are not expressed as Unicode characters. Wittern describes the encoding of rare characters in CBETA (Wittern, 2006). T 1115, T 1159A and T 1238 include rare characters without Unicode in their titles, which makes them difficult to index. These need an updated investigation to check whether the rare characters have since been added to the Unicode standard or need to be handled in some other way. Some of them have been added recently. For example, in the title of T 1115, the character represented in CBETA by [齒來] can be expressed in Unicode as 𪘨 zhāi. In the title of T 1238, the character represented as [牛句] can be expressed in Unicode as 𤘽 hǒu. In T 1159A, the character represented in CBETA by 大+(企-止) has the Unicode equivalent 𡇪, but no pronunciation is given in the Unicode standard (Unihan s.v. ‘𡇪’). This makes transliteration of the title difficult.
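The substitution described above could be sketched roughly as follows. This is a minimal illustration, not how the NTI Reader or CBETA actually handles rare characters: the table `CBETA_TO_UNICODE` and both function names are hypothetical, and the table contains only the three mappings mentioned in this issue.

```python
import re

# Hypothetical lookup table, filled only with the three examples from this issue.
CBETA_TO_UNICODE = {
    "[齒來]": "𪘨",        # title of T 1115
    "[牛句]": "𤘽",        # title of T 1238
    "大+(企-止)": "𡇪",    # T 1159A, no reading listed in Unihan
}


def replace_rare_characters(text, table=CBETA_TO_UNICODE):
    """Replace known CBETA composition expressions with Unicode characters."""
    # Longest expressions first, so a short key never clobbers part of a longer one.
    for expr in sorted(table, key=len, reverse=True):
        text = text.replace(expr, table[expr])
    return text


def unresolved_expressions(text):
    """Return bracketed expressions such as [齒來] that are still in the text."""
    return re.findall(r"\[[^\[\]]+\]", text)
```

Running `unresolved_expressions` over the titles after substitution would give a list of characters that still need to be checked against the current Unicode standard. It only detects the square-bracket form, so expressions written like 大+(企-止) would need a separate pattern.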
Besides text elements, images and special layouts are included in the Taishō. T 1108B, T 1221, T 1265, T 1290 and T 1891 include images that are not displayed in the NTI Reader. In the print version of a number of texts, such as T 1221, some characters are formatted to indicate actions during a ceremony or commentary on another text. For example, an action may be joining the palms, or a note may give the number of times to repeat a chant. These are indicated using modern punctuation, typically “()”, in the CBETA version. The best representation for these elements needs investigation.
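One possible starting point for that investigation is to separate the parenthesized notes from the main text so the two can be indexed or styled differently. The sketch below is only an assumption about how this might be done: `split_annotations` is a hypothetical helper, it accepts both ASCII and full-width parentheses because it is not certain which form a given file uses, and it does not handle nested parentheses.

```python
import re

# Matches one parenthesized note, in ASCII "()" or full-width "（）" form.
ANNOTATION = re.compile(r"[(（]([^()（）]*)[)）]")


def split_annotations(line):
    """Return (main_text, notes): the line with notes removed, and the notes."""
    notes = ANNOTATION.findall(line)
    main_text = ANNOTATION.sub("", line)
    return main_text, notes
```

For a made-up sample line such as `唵(三遍)吽`, this returns the main text `唵吽` and the single note `三遍`.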