Special Text, Image Elements and Nonstandard Layouts #5

Open
alexamies opened this Issue Mar 20, 2017 · 0 comments

alexamies commented Mar 20, 2017

Unicode is used to represent Chinese text in the NTI Reader. Siddhaṃ cannot yet be represented in Unicode, although there is a proposal to add Siddhaṃ to the Unicode standard (Pandey, 2012). Some documents in the Esoteric section of the Taishō contain a mix of Chinese and Siddhaṃ script, which presents a challenge for text representation. The solution adopted in SAT and CBETA is to display the Siddhaṃ elements as images, which prevents searching and other text processing.

Some texts in the Esoteric section of the Taishō contain Siddhaṃ script, which is represented in image format in SAT and CBETA. The Siddhaṃ in T 944B can be viewed in SAT at the link below, and a screenshot is attached.

http://21dzk.l.u-tokyo.ac.jp/SAT/ddb-bdk-sat2.php?lang=en

Attached screenshot: taisho_siddam

The original scanned images can be seen on CBETA. Other texts that include Siddhaṃ are T 974B, T 983B, T 1005B, T 1034, T 1062B, T 116, T 1168B, T 1208, T 1213 and T 1244. There is presently no way for the NTI Reader to display scripts that are not stored as Unicode. Unicode is a character encoding standard that covers the scripts of most of the world's languages. Siddhaṃ cannot yet be represented in Unicode, although there is a current proposal to add Siddhaṃ to the Unicode standard (Pandey, 2012). Some investigation is required to determine the best way to handle this. Since the NTI Reader is considered a monolingual corpus, this is somewhat out of scope.
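One way to picture the mixed-script problem is a document model in which Unicode text and image-only script are separate segments. The sketch below is only an illustration under that assumption, not how SAT, CBETA or the NTI Reader actually store text. The class names, the sample passage and the image file name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    """A run of text stored as Unicode (for example Chinese)."""
    text: str


@dataclass
class ImageSegment:
    """A run of script available only as an image (for example Siddhaṃ)."""
    image_ref: str  # hypothetical file name, not a real SAT or CBETA path


Segment = Union[TextSegment, ImageSegment]


def searchable_text(segments: List[Segment]) -> str:
    """Join only the Unicode segments; image segments contribute nothing."""
    return "".join(s.text for s in segments if isinstance(s, TextSegment))


# Made-up passage: Chinese text, an image of Siddhaṃ, then more Chinese text.
passage = [
    TextSegment("大佛頂"),
    ImageSegment("siddham_0001.gif"),
    TextSegment("陀羅尼"),
]

print(searchable_text(passage))
```

The point of the sketch is that anything held in an `ImageSegment` is invisible to search and other text processing, which is the limitation described above.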

In addition to Siddhaṃ, some rare Chinese characters in the Taishō are not expressed as Unicode characters. Wittern describes the encoding of rare characters in CBETA (Wittern, 2006). T 1115, T 1159A and T 1238 include rare characters without Unicode in their titles, which makes them difficult to index. These need a fresh investigation to check whether the rare characters have since been added to the Unicode standard or need to be handled some other way. Some of them have been added recently. For example, in the title of T 1115, the character represented in CBETA by [齒來] can be expressed in Unicode as 𪘨 zhāi. In the title of T 1238, the character represented as [牛句] can be expressed in Unicode as 𤘽 hǒu. In T 1159A, the character represented in CBETA by 大+(企-止) has the Unicode equivalent 𡇪, but no pronunciation is given in the Unicode standard (Unihan s.v. ‘𡇪’). This makes transliteration of the title difficult.
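The three substitutions above could be applied with a small lookup table. This is a minimal sketch, assuming the CBETA notations appear in titles exactly as written in this issue; the function names `replace_rare` and `pinyin_for` are hypothetical and not part of the NTI Reader.

```python
from typing import Dict, Optional, Tuple

# CBETA notation -> (Unicode character, pinyin or None), taken only from the
# three examples in this issue. None means Unihan gives no pronunciation.
RARE_CHARACTERS: Dict[str, Tuple[str, Optional[str]]] = {
    "[齒來]": ("𪘨", "zhāi"),
    "[牛句]": ("𤘽", "hǒu"),
    "大+(企-止)": ("𡇪", None),
}


def replace_rare(text: str) -> str:
    """Replace each known CBETA notation with its Unicode character."""
    for notation, (char, _pinyin) in RARE_CHARACTERS.items():
        text = text.replace(notation, char)
    return text


def pinyin_for(char: str) -> Optional[str]:
    """Return the pinyin recorded for a rare character, or None if unknown."""
    for unicode_char, pinyin in RARE_CHARACTERS.values():
        if unicode_char == char:
            return pinyin
    return None
```

All three replacement characters lie outside the Basic Multilingual Plane, so any storage or indexing layer that handles them needs to support supplementary code points.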

Besides text elements, images and special layouts are included in the Taishō. T 1108B, T 1221, T 1265, T 1290 and T 1891 include images that are not displayed in the NTI Reader. In the print version of a number of texts, such as T 1221, some characters are formatted to indicate actions during a ceremony or commentary on another text. For example, an annotation may direct the practitioner to join palms or state how many times to repeat a chant. These are indicated using modern punctuation, typically “()”, in the CBETA version. The best representation for these elements needs investigation.
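If the parenthesized annotations are to be treated differently from the main text, a first step is to separate the two. The sketch below assumes annotations are never nested and may use either ASCII or full-width parentheses. The function name `split_notes` and the sample line are made up for illustration and are not taken from T 1221.

```python
import re
from typing import List, Tuple

# Matches one annotation in ASCII or full-width parentheses (no nesting).
NOTE_PATTERN = re.compile(r"[(（]([^()（）]*)[)）]")


def split_notes(line: str) -> List[Tuple[str, str]]:
    """Split a line into ("text", ...) and ("note", ...) pieces, in order."""
    pieces: List[Tuple[str, str]] = []
    pos = 0
    for match in NOTE_PATTERN.finditer(line):
        if match.start() > pos:
            pieces.append(("text", line[pos:match.start()]))
        pieces.append(("note", match.group(1)))
        pos = match.end()
    if pos < len(line):
        pieces.append(("text", line[pos:]))
    return pieces


# Invented sample line with one annotation in the middle.
sample = "誦呪(三遍)合掌"
for kind, content in split_notes(sample):
    print(kind, content)
```

Keeping the pieces tagged rather than stripping the notes leaves the choice of display (inline, smaller type, hidden) open until the representation is decided.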

### References

  1. Pandey, A 2012, “Proposal to Encode Section Marks for Siddham in ISO/IEC 10646”, viewed 30 September 2016, ftp://std.dkuug.dk/jtc1/sc2/wg2/docs/n4336.pdf

  2. Wittern, C 2006, “Chinese Buddhist Texts for the New Millennium — The Chinese Buddhist Electronic Text Association (CBETA) and its Digital Tripitaka”, Journal of Digital Information, vol. 3, no. 2, viewed 1 October 2016, https://journals.tdl.org/jodi/index.php/jodi/article/view/84
