## Character Representation

_burton rosenberg, 24 june 2023_


### Representations of characters

<div style="float:right;margin:2em;">
<a title="an unknown officer or employee of the United States Government, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:USASCII_code_chart.png"><img width="512" alt="USASCII code chart" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/USASCII_code_chart.png/512px-USASCII_code_chart.png"></a>
</div>

The 8 bit byte proved useful for storing the characters of the alphabet, along with numerical digits, punctuation, and then some, including very important control characters such as new line, end of transmission, delete and break. As a matter of historical accident, the characters considered were,

- the 26 unaccented letters of the Latin alphabet, in both minuscule and majuscule,
- the 10 digits,
- a character representing a single spacing element, called the space or blank,
- all common punctuation, including the period, comma, parentheses, question and exclamation marks, and so on,
- and control characters that were sort of "meta" in that they talked about the text rather than being in the text, such as a newline, or a delete.

The assignment of bit patterns to characters was the ASCII standard.



In [3]:
%%file hello-chars.c

#include<stdio.h>
#include<string.h>

int main(int argc, char * argv[]){
    int i ;
    char hello[] = {104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33, 10} ;
    int n = sizeof(hello) ;
    
    for (i=0;i<n;i++){
        printf("%c", hello[i]) ;
    }

    return 0 ;
}

Overwriting hello-chars.c


In [4]:
%%bash
cc -o hello-chars hello-chars.c
./hello-chars
rm hello-chars

hello world!



### Checksums

The ASCII chart is a 7 bit chart, yet it was clear that bytes would be larger, most likely 8 bits. The extra bit was put to work: in that era data corruption was more likely than now, and the extra bit was used to detect errors that might occur in the communication of the character.


#### Parity checksum

The extra bit is called a checksum, and the type of checksum used was parity. For an even parity checksum, the 8th bit is set to one if the number of ones in the other 7 bits is odd, so that the total number of ones is even. If a single bit were flipped in the communication of the character, the parity would be off, and the receiver could signal a parity error. Odd parity is defined the same way, except that the 8th bit is set to make the total number of ones odd.


#### Luhn checksum

Checksums are also found on credit cards. The Luhn algorithm calculates the very last digit of a credit card number from the others. If someone mistypes their card number, it is likely that the checksum will reveal the error. These are _error detecting codes_. They all have their limits as to the type and quantity of errors they can detect.

#### LC checksum

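Checksums are also found on the Library of Congress ISBN numbers. The check character is the digit, or the letter X, that is suffixed at the end of the number, after a dash. In the ten digit form, the digits are weighted 10 down to 1 and the weighted sum must be divisible by 11; the letter X stands for the value 10.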

These days, parity for characters is used in this same way. 


In [115]:
%%file luhn-checksum.c

#include<stdio.h>
#include<string.h>

#define NOT_VALID "not valid"
#define VALID "valid"

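int main(int argc, char * argv[]) {
    char * s ;
    int i, n, mult, d ;
    int ck_sum = 0 ;
    char * r ;
    
    if (argc<2) {
        printf("usage: %s card-number\n", argv[0]) ;
        return 1 ;
    }
    s = argv[1] ;
    n = strlen(s) ;
    
    for (i=0;i<n;i++) {
        /* double every second digit, counting from the right;
           the rightmost digit (the check digit) is not doubled */
        mult = ((n-1-i)%2)?2:1 ;
        d = mult*(s[i]-'0') ;
        /* a two digit product is replaced by the sum of its digits */
        d = (d>9)?(1+d%10):d ;
        ck_sum += d ;
    }
    
    /* a valid number has a total that is a multiple of 10 */
    r = (ck_sum%10)? NOT_VALID : VALID ;
    printf("The credit card number %s is %s\n", s, r ) ;
    return 0 ;
}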

Overwriting luhn-checksum.c


In [116]:
%%bash
S=luhn-checksum
cc -o $S $S.c
./$S 4417123456789113
./$S 4417323456789113
./$S 4417123456989113
./$S 4417123465789113
rm $S

The credit card number 4417123456789113 is valid
The credit card number 4417323456789113 is not valid
The credit card number 4417123456989113 is not valid
The credit card number 4417123465789113 is not valid



In [119]:
%%file ten-digit-lc.c

#include<stdio.h>
#include<string.h>

#define NOT_VALID "not valid"
#define VALID "valid"

int main(int argc, char * argv[]) {
    char * s = argv[1] ;
    int i, d ;
    int ck_sum = 0 ;
    int mult = 10 ;
    char * r ;
    
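    for (i=0;i<strlen(s);i++) {
        if (s[i]=='-') continue ;   /* dashes only separate groups */
        d = s[i]-'0' ;
        if (d>9) d = 10 ;           /* the letter X stands for 10 */
        ck_sum += (mult--) * d ;    /* weights run from 10 down to 1 */
    }
    
    /* a valid number has a weighted sum divisible by 11 */
    r = (ck_sum%11)? NOT_VALID : VALID ;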
    printf("The ISBN 10 number %s is %s\n", s, r ) ;
    return 0 ;
}

Overwriting ten-digit-lc.c


In [120]:
%%bash
S=ten-digit-lc
cc -o $S $S.c
./$S 0-534-08072-3
./$S 0-915144-76-X
./$S 0-8076-0453-4

rm $S

The ISBN 10 number 0-534-08072-3 is valid
The ISBN 10 number 0-915144-76-X is valid
The ISBN 10 number 0-8076-0453-4 is valid


### UTF-8

It is possible that the 8 bit byte was influenced by the need for a byte to signify one text character. At minimum this meant the 26 upper and 26 lower case letters and 10 digits, plus a bunch of common punctuation. There were also _control characters_, which are non-printing characters that are sent in the character stream such as,

- The newline character, ASCII 0x0a, C Language `\n`, a.k.a. control-J (the related carriage return, ASCII 0x0d, C Language `\r`, sometimes shows up as a control-M),
- The tab character, ASCII 0x09, C Language `\t`,
- The End of Transmission (EOT) character, ASCII 0x04, a.k.a. control-D.

And many more fascinating details.

This added up to enough to set the byte size to 8 bits, and to use only the lower 7 bits for the 128 lucky characters included in ASCII.

__But what about other alphabets?__

Soon the issue arose of representing an expanded collection of alphabets. A variety of systems were invented to accommodate these characters.

One idea was to use the unused upper 128 code slots for these characters, and to have those slots defined according to a _code page_ that was set into the context.

To get even greater space, enough to accommodate languages with a very large number of characters, like Chinese, a 16-bit character, called a _wide character_, was used.

The 16-bit solution led to Unicode, a single code space for every character of every language. When defining Unicode, the word _glyph_ is used for the character as printed. For instance, certain languages have two forms of a letter, depending on whether it ends a word or not. Each is a different glyph for what one might think of as the same character.

#### Storing wide characters

The question is how to store these character streams so that older 8 bit codes are compatible with newer 16 bit codes, or perhaps codes of intermediate sizes. The answer appeared on a placemat in a New Jersey diner where Rob Pike and Ken Thompson were eating, and it became the standard UTF-8.

The idea is that the number of bytes used is variable, according to the requirements of the encoding. Standard ASCII should remain unchanged. This means that other encodings can set the high order bit to 1 to signal that this is not a 1 byte code.

If 11 bits of code space are needed, then 2 bytes are used in a sequence. Of the 16 bits available in 2 bytes, 5 are used to mark the sequence, leaving 11 for the code. The first byte uses its top 3 bits (0b110) to indicate that this is exactly a 2 byte sequence, and that the byte following is the second in the sequence. The byte following uses its top 2 bits (0b10) to mark that it is a continuing part of a sequence.

The UTF-8 standard also has 3 and 4 byte sequences, so 16 bit and 21 bit code points can be accommodated. A byte in a UTF-8 stream is read as follows,

- If the high order bit is 0, then it is a 1 byte ASCII code point.
- If the top two bits of the byte are 0b10, then this is an additional byte for a multi-byte character.
- If the top three bits of the byte are 0b110, then this is the first of a two byte sequence whose bits collectively define an 11 bit code point.
- If the top four bits of the byte are 0b1110, then this is the first of a three byte sequence whose bits collectively define a 16 bit code point.
- If the top five bits of the byte are 0b11110, then this is the first of a four byte sequence whose bits collectively define a 21 bit code point.





In [170]:
%%file utf-8-convert.c

#include<stdio.h>
#include<stdlib.h>

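/* Encode a 16 bit value as a 3 byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
   The value is unsigned so that codes of 0x8000 and above stay positive. */
char * short_to_utf8(char * out, unsigned short u) {
    out[2] = 0x80 | (u & 0x3f) ;   /* low 6 bits */
    u >>= 6 ;
    out[1] = 0x80 | (u & 0x3f) ;   /* middle 6 bits */
    u >>= 6 ;
    out[0] = 0xe0 | (u & 0x0f) ;   /* top 4 bits */
    return out ;
}


/* Decode a 3 byte sequence back into the 16 bit value. */
unsigned short utf8_to_short(char * u) {
    unsigned short r ;
    r = u[2] & 0x3f ;
    r |= (u[1] & 0x3f)<<6 ;
    r |= (u[0] & 0x0f)<<12 ;
    return r ;
}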

int check_utf8(char * utf_buf) {

    if (! (utf_buf[0]&0x80) ) 
        return 1 ;
    
    if ((utf_buf[0]&0xe0) == 0xc0) {
        if ((utf_buf[1]&0xc0) == 0x80) return 2 ;
    }
    
    if ((utf_buf[0]&0xf0)==0xe0) {
        if (
            ((utf_buf[1]&0xc0)==0x80) 
            && ((utf_buf[2]&0xc0)==0x80)
        ) return 3 ;
    }

     if ((utf_buf[0]&0xf8)==0xf0) {
        if (
            ((utf_buf[1]&0xc0)==0x80) 
            && ((utf_buf[2]&0xc0)==0x80)
            && ((utf_buf[3]&0xc0)==0x80)
        ) return 4 ;
    }
    return 0 ;
}

int main(int argc, char * argv[]) {
    unsigned short r ;
    char utf_buf[3] ;
    unsigned short r_r ;
    if (argc<2) {
        printf("usage: %s code-point\n", argv[0]) ;
        return 1 ;
    }
    r = (unsigned short) atoi(argv[1]) ;
    short_to_utf8( utf_buf, r) ;
    r_r = utf8_to_short( utf_buf ) ;
    printf("0x%x: 0x[%x %x %x] 0x%x \n", r, 
           (unsigned char) utf_buf[0], (unsigned char) utf_buf[1], (unsigned char) utf_buf[2], 
           r_r ) ;
    if (check_utf8(utf_buf)==3) printf("\tEncoded with 3 bytes.\n") ;
    return 0 ; 
}

Overwriting utf-8-convert.c


In [171]:
%%bash
S=utf-8-convert
cc -o $S $S.c
./$S 8192
./$S 8720
rm $S

0x2000: 0x[e2 80 80] 0x2000 
	Encoded with 3 bytes.
0x2210: 0x[e2 88 90] 0x2210 
	Encoded with 3 bytes.


### Unicode