## Bits and Bytes


_burton rosenberg, 29 june 2023_


### Table of contents.

1. <a href="#intro">Introduction</a>
1. <a href=#bits>Ways to represent a bit</a>
1. <a href="#intrepr">Representations of integers</a>
1. <a href="#intmem">The memory layout of integers</a>



### <a name=intro>Introduction</a>


The first consideration is how integers are stored on a computer.

The basic element of storage on a computer, as well as the basic element of information, is the _bit_. A bit can be in one of two states, which for the moment we will call 0 and 1. Bits are assembled into an ordered collection of several bits, and each 0-1 combination can denote a uniquely identifiable symbol. It is now all but universal that the fundamental storage element of a computer is the _byte_, an assembly of 8 bits. The 8 bits give 256 different combinations, which can be set in correspondence with the integers 0 through 255.

<div style="float:right;margin:2em;">
<img width="512" src="../images/TCPL-1ed-bytesize.png">
</div>

The use of binary was basically practical. The bit state would be based on some physical phenomenon, and so there would be noise. A binary indicator can be made very tolerant of noise by having a threshold in the reading of the phenomenon, so that above the threshold is a 1, and below is a 0. The amount of noise would not matter as long as the noise amplitude kept the intended signal on the proper side of the threshold.
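As a small illustration of the threshold idea, here is a Python sketch. The signal levels (0.0 for a 0, 1.0 for a 1), the threshold of 0.5, and the names `read_bit` and `transmit` are all assumptions chosen for the example, not values from any particular hardware.

```python
# A minimal sketch of reading a noisy binary signal against a threshold.
# Assumed for illustration: a 0 is sent as 0.0, a 1 as 1.0, threshold 0.5.

THRESHOLD = 0.5

def read_bit(level, threshold=THRESHOLD):
    """Return 1 if the measured level is above the threshold, else 0."""
    return 1 if level > threshold else 0

def transmit(bits, noise):
    """Add one noise value to each ideal level and read the bits back."""
    ideal = [float(b) for b in bits]
    return [read_bit(v + e) for v, e in zip(ideal, noise)]

sent = [1, 0, 1, 1, 0]
small_noise = [0.2, -0.3, -0.4, 0.1, 0.45]   # every amplitude below 0.5
large_noise = [0.2, -0.3, -0.6, 0.1, 0.7]    # two amplitudes exceed 0.5

print(transmit(sent, small_noise))   # identical to sent
print(transmit(sent, large_noise))   # the third and fifth bits are misread
```

As long as each noise value is smaller than the distance to the threshold, the bits come back exactly as sent.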

Furthermore, the correspondence between the bit pattern and an integer can be made completely natural through the binary representation of the integer value. Recall that a number $n$ can be represented as a collection of zero-one values $b_i$ by,

$$
n = \sum_i b_i \, 2^i
$$

In the case of the byte, the bits are numbered 0 through 7, and bit number $i$ supplies the $b_i$ that appears in this equation.

Note that a byte is an integer that has exactly 8 $b_i$. We will follow C language syntax (a compiler extension for many years, and standard since C23) and indicate a binary number by prefixing a 0-1 string with the indicator `0b`. The number 5 in binary is `0b101`. However, when thinking about a byte it is best to visualize this written as `0b00000101`, accounting for all 8 bits in the byte.
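The formula can be tried out directly. What follows is a minimal Python sketch; the helper names `byte_to_bits` and `bits_to_byte` are made up for this example.

```python
# A minimal sketch of the formula n = sum of b_i * 2^i for one byte.
# The helper names are made up for this example.

def byte_to_bits(n):
    """Return the list [b_0, b_1, ..., b_7] for an integer 0 <= n <= 255."""
    assert 0 <= n <= 255, 'a byte holds only 0 through 255'
    return [(n >> i) & 1 for i in range(8)]

def bits_to_byte(bits):
    """Evaluate the sum of b_i * 2^i, where bits[i] is b_i."""
    return sum(b * 2**i for i, b in enumerate(bits))

bits = byte_to_bits(5)
print(bits)                    # b_0 first: [1, 0, 1, 0, 0, 0, 0, 0]
print(bits_to_byte(bits))      # 5
print(format(5, '#010b'))      # 0b00000101, written with b_7 first
```

Note that the list holds $b_0$ first, while the written form `0b00000101` puts $b_7$ first, the same way decimal numbers put the most significant digit on the left.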

Already we see how tedious it is to write numbers in binary, although they suit the hardware just fine. Instead, when trying to think about an integer in terms of which bits are 1 and which are 0, the number is better written in base 16, known as _hexadecimal_. It is pretty important for a computer scientist to know hexadecimal, so I hope to give you some exercises to practice this number notation.

Hexadecimal is base 16, hence there must be 16 "digits" for each place, and these are,

$$
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f
$$

We will follow C language syntax and indicate a hexadecimal number by prefixing the string of these hex-its with the indicator `0x`. The number 5 in hexadecimal is `0x5`. However, when thinking about a byte it is best to write this as `0x05`, so all 8 bits are accounted for. Each hex-it stands for exactly four bits, so a byte is always two hex-its.
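To see why hexadecimal lines up so well with bytes, here is a minimal Python sketch that splits a byte into its upper and lower four bits and writes one hex digit for each. The function name `byte_to_hex` is made up for this example.

```python
# A minimal sketch relating hex digits to groups of four bits.
# The function name byte_to_hex is made up for this example.

HEX_DIGITS = '0123456789abcdef'

def byte_to_hex(n):
    """Write a byte as 0x followed by two hex digits, one per 4 bits."""
    assert 0 <= n <= 255, 'a byte holds only 0 through 255'
    high = n >> 4        # upper four bits, b_7 .. b_4
    low = n & 0b1111     # lower four bits, b_3 .. b_0
    return '0x' + HEX_DIGITS[high] + HEX_DIGITS[low]

# print each value in decimal, binary and hexadecimal
for n in (5, 10, 165, 255):
    print(n, format(n, '#010b'), byte_to_hex(n))
```

Reading the output for 165, the binary `1010 0101` splits into the two groups `a` and `5`, giving `0xa5`.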


### <a name=bits>Ways to represent a bit</a>

<div style="float:right;margin:2em;">
<a title="Billie Grace Ward from New York, USA, CC BY 2.0 &lt;https://creativecommons.org/licenses/by/2.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Paper_Tape_Drive_(31437412070).jpg"><img width="334" alt="Paper Tape Drive (31437412070)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Paper_Tape_Drive_%2831437412070%29.jpg/512px-Paper_Tape_Drive_%2831437412070%29.jpg"></a>
</div>


#### Punched tape

A bit can be represented in any manner in which there can be a distinction between two states. Here we see an early data storage strategy with a paper tape. Each column across the tape was a 5 bit byte, with locations marked off in fifths from edge to edge; a hole was a 1 and no hole was a 0. The 5 bit byte was supported by an early code called the [Baudot code](https://en.wikipedia.org/wiki/Baudot_code), invented by Emile Baudot in the 1870s. Its 32 different combinations barely fit enough symbols to be useful. While the Baudot code is obsolete, the word _baud_ is still with us, and it refers to the number of symbols per second in a communication channel.

The code for this punched tape might rather be the Murray code, 1901, which modified the Baudot code to minimize the average number of holes punched for a typical message.

Both these codes use a system of _shift_, where a shift character switches the remaining characters between a _letter_ context and a _figure_ context. This idea is still used today, for example in C language, where the letter `t` is either itself or, when preceded by the "shift" character, the backslash, the tab symbol `\t`.
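The shift idea can be sketched with a toy decoder. The tables below are invented for this example and are not the real Baudot or Murray assignments; the names `LTRS`, `FIGS` and `decode` are likewise just illustrative.

```python
# A toy sketch of a shift code. The tables below are invented for this
# example and are NOT the real Baudot or Murray assignments.

LTRS, FIGS = 'LTRS', 'FIGS'          # the two shift symbols

LETTERS = {1: 'A', 2: 'B', 3: 'C'}   # meaning of a code in letter context
FIGURES = {1: '1', 2: '2', 3: '3'}   # meaning of the same code in figure context

def decode(symbols):
    """Decode a sequence of codes, starting in the letter context."""
    table = LETTERS
    out = []
    for s in symbols:
        if s == LTRS:
            table = LETTERS      # a shift prints nothing, it changes context
        elif s == FIGS:
            table = FIGURES
        else:
            out.append(table[s])
    return ''.join(out)

print(decode([1, 2, FIGS, 1, 2, LTRS, 3]))   # AB12C
```

The same codes 1 and 2 are read as `A` and `B` or as `1` and `2`, depending only on which shift symbol came last.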


#### SR Latch

A bit can be stored by using hardware that can implement simple logic circuits. Consider the equations with input variables $S$ and $R$, and output variables $Q$ and $Q'$.

<div style="float:right;margin:2em;">
<a title="Goodphy, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:SR_NOR_Latch_How_to_Work_Ver1_Dong-Gyu_Jang_20200309.png"><img width="334" alt="SR NOR Latch How to Work Ver1 Dong-Gyu Jang 20200309" src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/SR_NOR_Latch_How_to_Work_Ver1_Dong-Gyu_Jang_20200309.png/256px-SR_NOR_Latch_How_to_Work_Ver1_Dong-Gyu_Jang_20200309.png"></a>
</div>

\begin{eqnarray}
Q &=& \lnot\, (R \lor Q')\\
Q' &=& \lnot\, (S \lor Q)\\
\end{eqnarray}

- When $S$ and $R$ are both zero, then $Q = \lnot\,Q'$, for either $Q=0$ or $Q=1$. This is the _hold_ state.
- When $S=1$ and $R=0$, then the unique solution to the equations is $Q=1$ and $Q'=0$. This signal is called _set_.
- When $S=0$ and $R=1$, then the unique solution to the equations is $Q=0$ and $Q'=1$. This signal is called _reset_.
- The case where $S$ and $R$ are both 1 is not allowed in practice.
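The behavior listed above can be checked with a small simulation. This is a minimal Python sketch: re-evaluating the two equations in a loop until the outputs stop changing is an assumption of the simulation, not a model of real gate timing, and the names `nor` and `latch_step` are made up for the example.

```python
# A minimal sketch of the SR latch equations
#     Q  = not (R or Q')
#     Q' = not (S or Q)
# The update order and the settle loop are assumptions of this simulation,
# not a model of real gate timing.

def nor(a, b):
    return 0 if (a or b) else 1

def latch_step(s, r, q, q_bar):
    """Re-evaluate both equations until the outputs stop changing."""
    for _ in range(4):                 # a few passes are enough to settle
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar

q, q_bar = 0, 1                           # start with a stored 0
q, q_bar = latch_step(1, 0, q, q_bar)     # set
print(q, q_bar)                           # 1 0
q, q_bar = latch_step(0, 0, q, q_bar)     # hold keeps the 1
print(q, q_bar)                           # 1 0
q, q_bar = latch_step(0, 1, q, q_bar)     # reset
print(q, q_bar)                           # 0 1
q, q_bar = latch_step(0, 0, q, q_bar)     # hold keeps the 0
print(q, q_bar)                           # 0 1
```

The two hold steps are the point: with the same inputs $S=0$, $R=0$ the latch gives back whichever value was last set or reset, which is what makes it a one bit memory.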

Such logic circuits were possible with early electronics, and now can be built with either transistors or Op Amps. When electronic circuits were miniaturized, it became possible to store tens or hundreds of bits using SR latches.




#### Parity check

#### (7,4) Hamming

#### Base 64 

In [None]:
#
# a python program to enumerate all the bit sequences on i bits.
# it uses recursion to create the list for i-1 bits, then prefixes
# each pattern with one more bit.
#

def ennumerate_zero_one_patterns(i):

    def ennumerate_zero_one_patterns_aux(i):
        # base case: the two patterns on a single bit
        if i==1:
            return ['0','1']
        l = ennumerate_zero_one_patterns_aux(i-1)
        r = l[:]
        # k indexes the patterns, so that it does not shadow the bit count i
        for k in range(len(l)):
            r[k] = '1'+l[k]
            l[k] = '0'+l[k]
        return l+r

    assert i>0, 'input must be greater than zero'
    return ennumerate_zero_one_patterns_aux(i)

print(ennumerate_zero_one_patterns(3))


###  <a name=intrepr>Representations of integers</a>


The bit patterns can also be associated with non-negative integers by the formula,

$$
\mathcal{N}(b_l, b_{l-1}, \ldots, b_0) = \sum_i 2^i b_i
$$

That is, write $n$ in binary, and make a sequence out of the bits in the representation.



In [None]:
%%file string-to-int.c

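#include <stdio.h>
#include <string.h>

/* convert a string of 0's and 1's, most significant bit first, to an integer */
int main(int argc, char * argv[]){
    int i ;
    int sum = 0 ;
    int two_to_the_i = 1 ;
    char * s ;

    /* without an argument, argv[1] is not a string we can read */
    if (argc<2) {
        fprintf(stderr, "usage: %s bitstring\n", argv[0]) ;
        return 1 ;
    }
    s = argv[1] ;
    printf("%s\t", s) ;

    /* scan from the rightmost character, which is bit 0 */
    for (i=(int)strlen(s);i>0;i--){
        if (s[i-1]=='1') {
            sum = sum + two_to_the_i ;
        }
        two_to_the_i = 2 * two_to_the_i ;
    }

    printf("%d\n", sum) ;
    return 0 ;
}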

In [None]:
!cc -o string-to-int string-to-int.c
int_representations = ennumerate_zero_one_patterns(3)
for a_representation in int_representations:
    !./string-to-int {a_representation}
!rm string-to-int

#### The int and long int datatypes



We have shown that the computer can represent integers in binary, and have discussed so far only bytes. Since bytes have only 256 bit patterns, they can only store a small range of integers. So far we have shown how it can store the integers 0 through 255. There are two deficiencies,

- We must be able to store much larger integers.
- We must be able to represent both positive and negative integers.

C Language has two kinds of integer data types, _signed_ and _unsigned_. The type _unsigned char_ is one byte and the various bit patterns are used to represent the integers 0 through 255, using the obvious binary representation.

We set aside for now the representation of negative numbers, and address the need to represent a much larger range of positive numbers.

To store larger numbers the computer will use more bytes, and will collect them so that they have consecutive addresses in the RAM. This way, the location of the integer remains a single address. The number of bytes is known because the reference has a type, and the type determines the number of bytes.
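To make the idea of one integer spread over several bytes concrete, here is a minimal Python sketch. It assumes the byte at the lowest address holds the least significant 8 bits; the text above does not fix an order, and real machines differ on this. The helper names are made up for the example.

```python
# A minimal sketch of an integer spread over consecutive bytes.
# Assumption: the byte at the lowest address holds the least significant
# 8 bits. The text above does not fix an order; real machines differ.

def to_bytes_low_first(n, count):
    """Split n into `count` byte values, least significant first."""
    assert 0 <= n < 256**count, 'n does not fit in that many bytes'
    return [(n >> (8 * k)) & 0xff for k in range(count)]

def from_bytes_low_first(byte_values):
    """Recombine: n = sum of byte_k * 256^k."""
    return sum(b * 256**k for k, b in enumerate(byte_values))

n = 100000
parts = to_bytes_low_first(n, 4)
print(parts)                          # [160, 134, 1, 0]
print(from_bytes_low_first(parts))    # 100000
print(256**4 - 1)                     # largest value in four bytes: 4294967295
```

Each byte plays the role of one digit in base 256, just as each bit is one digit in base 2.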

<div style="float:right;margin:2em;">
<img width="512" src="../images/TCPL-1ed-bytesize.png">
</div>

It is a fact that C Language did not lay down the law about the number of bytes for each integer datatype, except that a char is one byte, and "larger" data types should have at least as many bytes as smaller ones. However, 32 bits is now the usual size of the standard integer, with type names `int` and `unsigned int`. The image is from TCPL first edition, where they give the number of bits in the various integer and byte types of computers of that time.

There are then two variants of `int`, the `short int` and the `long int`. The actual number of bytes is not defined in the C Language, except that a short int cannot be longer than an int, and a long int cannot be shorter than an int. Let's say, as the typical case, that a short is 16 bits and a long is 64 bits. Beware though, this will depend on the computer and the compiler.

The builtin operator `sizeof` gives the number of bytes of the object mentioned as its argument. The argument can be a data type or a variable. Although `sizeof` looks like a function call, it is not. If it were a function call, we would have to wait until the program ran before the value of `sizeof` was known. Instead, it is already known at compile time.
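The size rules can also be checked from Python, which is convenient inside a notebook. This is a minimal sketch using the standard `ctypes` module, whose `sizeof` reports the C type sizes for the platform Python was built on. The dictionary name `sizes` is made up for the example, and the printed numbers will depend on the machine.

```python
# A minimal sketch using Python's ctypes module, which exposes the C
# type sizes for the platform Python was built on.
import ctypes

sizes = {
    'char':  ctypes.sizeof(ctypes.c_char),
    'short': ctypes.sizeof(ctypes.c_short),
    'int':   ctypes.sizeof(ctypes.c_int),
    'long':  ctypes.sizeof(ctypes.c_long),
}

for name, size in sizes.items():
    print(name, size)

# The only relations the text promises, whatever the machine:
assert sizes['char'] == 1
assert sizes['short'] <= sizes['int'] <= sizes['long']
```

The two assertions are the portable part. The individual numbers are not.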


In [None]:
%%file sizeof-wow.c

#include <stdio.h>

/* sizeof yields a size_t, which printf prints with %zu */
int main(int argc, char * argv[]) {
    printf("type:\tbytes\n") ;
    printf("char:\t%zu\n", sizeof(char)) ;
    printf("short:\t%zu\n", sizeof(short int)) ;
    printf("int:\t%zu\n",  sizeof(int)) ;
    printf("long:\t%zu\n",  sizeof(long int)) ;
    return 0 ;
}

In [None]:
%%bash
cc -o sizeof-wow sizeof-wow.c
./sizeof-wow
rm sizeof-wow

### <a name=intmem>The memory layout of integers</a>

We will learn something about computer architectures and something about the C programming language together.

We have described how an integer is stored in a computer using multiple bytes, and for the convenience of the hardware those bytes will be in consecutive locations in the memory. They will also be at memory locations whose index, considered as an integer, is a multiple of the data type size. We will demonstrate this, but with a little hackery.

We have also said that the memory unit consists of an array of bytes, each with an index, in fact, an integer. In many C language situations, we can actualize this. 

Given a memory item, say the integer `int i`, as a 32-bit integer it occupies four addresses in memory, say $m, m+1, m+2, m+3$. The notation `&i` gives a _pointer_ to `i`, which is an abstract memory reference, which in this case would be of type _pointer to an int_, or in C notation `int *` (said "int-star").

It is a grave error to confuse a memory pointer with an integer, the "location" of the byte or the starting location of a run of bytes. But we will do just that, by coercing the pointer to an _unsigned long int_. We need it to be unsigned, as there are no negative indexed locations in memory, and long, as most computers now are said to be 64-bit machines, meaning that their potential memory space is $2^{64}$ locations. No computer today actualizes all of this, but each actualizes some amount of that space.
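The same hackery can be tried from Python. In this minimal sketch, `ctypes.addressof` plays the role of the C cast `(unsigned long) &i`, and `from_address` goes the other way, treating a plain integer as the location of an `int`. The variable names are chosen to match the text; the printed address will differ from run to run.

```python
# A minimal sketch: looking at an address as a plain integer.
# ctypes.addressof plays the role of the C cast (unsigned long) &i.
import ctypes

i = ctypes.c_int(42)
m = ctypes.addressof(i)          # the address of i, as a Python integer

print(m)                         # some large number, different on each run
print(hex(m))                    # addresses are usually written in hex
print(m < 2**64)                 # the address fits in the 64-bit space

# Going the other way: reinterpret the integer m as the location of an int.
print(ctypes.c_int.from_address(m).value)   # 42
```

Reading through a made-up integer address would crash the program or return garbage, which is one reason pointers and integers are kept apart.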


#### C arrays

To look at the memory for a single integer type data item, be it short, int or long, we will consider an _array_ of integers. This is a sequence of several data items identified with the name of the array and an integer indicating whether we are considering the zeroth, first, second, third, etc, item along the array of items. C lays these out sequentially in memory so that the $i$-th element is easy to find from the location of the zero-th element and the size (number of bytes) for each element.

It packs these in tightly. So if we define a two element array of int, the integer address of the zero-th element and that of the first element should be separated by the `sizeof` of the element. 
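The address arithmetic can be sketched in Python before trying it in C. This minimal example builds a C array through `ctypes`; the array contents and the names `IntArray4`, `base` and `size` are arbitrary choices for the illustration.

```python
# A minimal sketch of array layout, using ctypes to build a C array.
# The array contents are arbitrary values chosen for the example.
import ctypes

IntArray4 = ctypes.c_int * 4          # the C type  int[4]
a = IntArray4(10, 20, 30, 40)

size = ctypes.sizeof(ctypes.c_int)    # bytes per element
base = ctypes.addressof(a)            # address of element zero

# Tight packing: the whole array is exactly 4 elements' worth of bytes.
print(ctypes.sizeof(a) == 4 * size)   # True

# Element k lives at base + k * size; read it back from that address.
for k in range(4):
    address = base + k * size
    print(k, address, ctypes.c_int.from_address(address).value)
```

The values read back through the computed addresses are 10, 20, 30 and 40, so the element index and the byte address are tied together by nothing more than a multiplication and an addition.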

In [None]:
%%file sizeof-ints.c

#include<stdio.h>
int main(int argc, char * argv[]) {
    short s[2] ;
    int i[2] ;
    long l[2] ;
    
    printf("s[0] @ %lu\ns[1] @ %lu\n", (unsigned long) &s[0], (unsigned long) &s[1] ) ;
    printf("i[0] @ %lu\ni[1] @ %lu\n", (unsigned long) &i[0], (unsigned long) &i[1] ) ;
    printf("l[0] @ %lu\nl[1] @ %lu\n", (unsigned long) &l[0], (unsigned long) &l[1] ) ;

    return 0 ;
}


In [None]:

%%bash
cc -o sizeof-ints sizeof-ints.c
./sizeof-ints
rm sizeof-ints