-
Notifications
You must be signed in to change notification settings - Fork 16
/
character.theory.txt
59 lines (52 loc) · 4.85 KB
/
character.theory.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
CHARACTER
REPRESENTATION ==> #Binary representation is character encoding
#A code unit is the charset binary unit, e.g. 7 bits with ASCII, 1 byte with UTF-8, 2 bytes with UTF-16
DIFFERENCES ==> #Character:
# - character meaning unit, e.g. punctuation, diacritics, letter, ideogram
# - correspond to a Unicode codepoint
#Grapheme:
# - character + its combining characters
#Glyph:
# - visual unit, i.e. visual representation of a grapheme.
# - some graphemes do not have any.
# - some graphemes share same glyph
# - some graphemes have similar looking glyphs ("homoglyph")
#Phoneme:
# - sound unit, i.e. single sound
#Syllable:
# - rhythm unit, i.e. several sounds pronounced as one
#Morpheme:
# - meaning unit, e.g. prefixes (e.g. "un"), suffixes (e.g. "s"), roots ("believe")
# - lexical morpheme: has a meaning on its own, e.g. noun, verb, adjective
# - grammatical morpheme: must be associated with another morpheme, e.g. preposition, articles, conjunctions
#Lexeme:
# - word beyond different forms (conjugaison, plural, etc.)
# - lemma: canonical form of a lexeme, usually with least marks/form
# - paradigm: series of possible forms for a lexeme
#Word
TYPES ==> #Combining:
# - e.g. diacritics or partial ideogram
# - precomposed|composite character:
# - character containing combining characters
# - e.g. é (U-00E9) contains both e (U-0065) and ́ (U-0301)
# - Unicode equivalent to its non-precomposed form
#Non-printing / control character:
# - C0 control characters: U-0000-U-001F, including \0, \a, \b, \t, \n, \v, \f, \r, \e, space, DEL
# - C1 control characters: U-0080-U-009F
# - RTL chars: U-061C, U-200E-U-200F, U-202A-U-202E
# - line|paragraph separators U-2028-U-2029 are not control characters
STRING ==> #Array of characters
#Length can be:
# - set compile-time ("fixed-length") or runtime ("variable-length")
# - indicated by:
# - a leading number ("Pascal string" / "P-string" / "length-prefixed")
# - a sentinel value (often \0) ("C-string" / "null-terminated string")
# - pros:
# - unlimited length
# - takes less space
# - cons:
# - slower length access
# - cannot contain termination character
# - this leads to distinguishing "binary" strings vs "non-binary" strings