## Bits, Bytes and Integers


_burton rosenberg, 23 june 2023_


For the purposes of this section, a computer consists of three elements,

- A memory device, called the RAM
- A control and computation device, called the CPU
- A connection between the RAM and CPU called the bus.

The CPU sends a request over the bus to the RAM to read or write a particular data item at a particular address. In a simplified model, the RAM is a collection of cells, each capable of storing a _byte_, and each cell has an integer index. The index is important in several ways,

- It locates the byte.
- Consecutive indices are useful in creating data items that are multiple bytes.
- Certain access patterns do calculations on the indices to find data items.

The CPU calls out through the bus to the RAM to retrieve or store bytes. (In reality, multiple bytes are retrieved or stored in a single data movement, sometimes even full _cache lines_. Fetching a cache line is like asking for a glass of water and having the waiter bring a glass and a bucket of water, on the assumption that you will eventually want more water and that the cost of bringing water is more about making the trip than about how much is brought in one trip.)

<pre>

data read:

    +-----+                 +-----+
    |     |   address --&gt;   |     |
    | CPU | ===== BUS ===== | RAM |
    |     |    &lt;-- data     |     | 
    +-----+                 +-----+    
    

data write:

    +-----+   address --&gt;   +-----+
    |     |     data  --&gt;   |     |
    | CPU | ===== BUS ===== | RAM |
    |     |                 |     | 
    +-----+                 +-----+    
</pre>
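The read and write requests in the diagram can be imitated with a toy model. This is only a sketch under simple assumptions: the names `ram`, `RAM_SIZE`, `read_byte` and `write_byte` are made up for the illustration, and a real bus moves more than one byte at a time.

```python
# A toy model of the RAM: a fixed number of one-byte cells, indexed from 0.
RAM_SIZE = 16                # hypothetical size, chosen only for illustration
ram = bytearray(RAM_SIZE)    # every cell starts out holding 0

def write_byte(address, value):
    """Model of a data write: the CPU sends an address and a data byte."""
    ram[address] = value     # a bytearray only accepts values 0 through 255

def read_byte(address):
    """Model of a data read: the CPU sends an address, the RAM returns a byte."""
    return ram[address]

write_byte(3, 200)
print(read_byte(3))   # 200, the byte just stored
print(read_byte(4))   # 0, a cell that was never written
```

The integer passed as `address` plays the role of the cell index described above.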


### What is a byte?

<div style="float:right;margin:2em;">
<a title="Billie Grace Ward from New York, USA, CC BY 2.0 &lt;https://creativecommons.org/licenses/by/2.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Paper_Tape_Drive_(31437412070).jpg"><img width="512" alt="Paper Tape Drive (31437412070)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Paper_Tape_Drive_%2831437412070%29.jpg/512px-Paper_Tape_Drive_%2831437412070%29.jpg"></a>
</div>


The fundamental data item in a computer is a _bit_. A bit is any sort of physical phenomenon that can distinguish two states, called _zero_ and _one_. In some memories it is whether a particular small region of a silicon wafer holds a charge or not; for "spinning rust" hard drives it is whether a small section of magnetizable material is magnetized with north polarity or south polarity. An ancient data storage device, the punched tape, recorded the zero/one as the presence or absence of a hole at a precise location on a paper tape.

A byte is a collection of 8 bits, such that each bit has its proper place, as in the 0th bit, the 1st bit, up to the 7th bit. Note that computers like to count from 0. So for 8 bits, the largest location is called the 7th location.
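The idea that each bit has its own numbered place can be made concrete with a shift and a mask. The sketch below assumes the common convention that bit 0 is the least significant bit; the helper name `get_bit` is hypothetical.

```python
def get_bit(byte, k):
    """Return bit k of a byte, assuming bit 0 is the least significant bit."""
    assert 0 <= byte <= 255, 'a byte holds values 0 through 255'
    assert 0 <= k <= 7, 'bit positions run from 0 through 7'
    return (byte >> k) & 1   # move bit k to position 0, then mask the rest off

value = 0b10010110
# list the bits from position 0 up to position 7
print([get_bit(value, k) for k in range(8)])   # [0, 1, 1, 0, 1, 0, 0, 1]
```

Note that the printed list runs from the 0th bit to the 7th, which is the reverse of the order in which the binary literal is written.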

While other byte sizes were tried, the history of computing somehow settled on 8 bits to form the most basic data item, the byte. C Language adopted this and gave the name `char` to this basic building block of storage. The word char is short for character, because the byte was the appropriate data unit to represent characters.


At its most basic, a byte signifies one of 256 different possibilities for the values of its 8 bits. We can calculate this value 256 in the following manner.

<div style="border:thin solid green;margin:2em;padding:2em;">
    <b>How many different possibilities are there for i bits?</b> 
    
The way to calculate 256 is to create a formula for how many different possibilities there are for $i$ bits, and then evaluate it with $i$ equal to 8. For $i=1$, there are two possibilities. For each additional bit, the number of possibilities doubles. 

For instance, for $i=2$ we have,

`['00','01','10','11']`. 

Adding a third bit, we write down the list twice, with a 0 or 1 in front to distinguish the first writing from the second, and get the 8 possibilities,

`['000','001','010','011','100','101','110','111']`.

The conclusion is the very important formula, there are $2^i$ different combinations on $i$ bits.

</div>

In [1]:
#
# a python program to enumerate all the bit sequences on i bits.
# it uses recursion to create a list for i-1 bits, then adds one more 
# bit.
#

def ennumerate_zero_one_patterns(i):
    
    def ennumerate_zero_one_patterns_aux(i):
        if i==1:
            return ['0','1']
        l = ennumerate_zero_one_patterns_aux(i-1)
        r = l[:]
        # prefix the first copy with '0' and the second copy with '1'
        for j in range(len(l)):
            r[j] = '1'+l[j]
            l[j] = '0'+l[j]
        return l+r
    
    assert i>0, 'input must be greater than zero'
    return ennumerate_zero_one_patterns_aux(i)
    
print(ennumerate_zero_one_patterns(3))

['000', '001', '010', '011', '100', '101', '110', '111']
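The count of $2^i$ patterns can be cross-checked independently of the recursion. The following sketch uses `itertools.product` from the standard library to list every pattern and count them; the function name `count_patterns` is made up for this check.

```python
from itertools import product

def count_patterns(i):
    """Count the distinct bit patterns on i bits by listing them all."""
    return len(list(product('01', repeat=i)))

# compare the brute-force count with the formula 2**i for 1 through 8 bits
for i in range(1, 9):
    print(i, count_patterns(i), 2**i)
```

For $i=8$ both columns give 256, the number of values a byte can take.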



### Representations of integers


The bit patterns can also be associated with non-negative integers by the formula,

$$
\mathcal{N}(b_l, b_{l-1}, \ldots, b_0) = \sum_{i=0}^{l} 2^i b_i
$$

That is, to represent an integer $n$, write $n$ in binary and take the binary digits, in order, as the bit sequence $b_l, b_{l-1}, \ldots, b_0$.
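As a small illustration of both directions of this correspondence, here is a sketch that writes an integer in binary by repeated division by two, and evaluates the sum in the formula to go back. The names `to_bits` and `from_bits` are hypothetical helpers, not part of any library.

```python
def to_bits(n, width):
    """Write the non-negative integer n as a string of `width` bits."""
    assert 0 <= n < 2**width, 'n does not fit in width bits'
    bits = ''
    for _ in range(width):
        bits = str(n % 2) + bits   # the remainder is the next bit, b_0 first
        n = n // 2
    return bits

def from_bits(bits):
    """Evaluate the sum of 2^i * b_i, where b_0 is the rightmost character."""
    return sum(2**i for i, b in enumerate(reversed(bits)) if b == '1')

print(to_bits(6, 3))      # 110
print(from_bits('110'))   # 6, since 4*1 + 2*1 + 1*0 = 6
```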



In [2]:
%%file string-to-int.c

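#include<stdio.h>
#include<string.h>

int main(int argc, char * argv[]){
    int i ;
    int sum = 0 ;
    int two_to_the_i = 1 ;
    char * s ;

    /* the bit string is required as the first argument */
    if (argc<2) {
        fprintf(stderr, "usage: %s bitstring\n", argv[0]) ;
        return 1 ;
    }
    s = argv[1] ;
    printf("%s\t", s) ;
    
    /* scan from the rightmost character, which is bit 0 */
    for (i=strlen(s);i>0;i--){
        if (s[i-1]=='1') {
            sum = sum + two_to_the_i ;
        }
        two_to_the_i = 2 * two_to_the_i ;
    }
    
    printf("%d\n", sum) ;
    return 0 ;
}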

Writing string-to-int.c


In [3]:
!cc -o string-to-int string-to-int.c
int_representations = ennumerate_zero_one_patterns(3)
for a_representation in int_representations:
    !./string-to-int {a_representation}
!rm string-to-int

000	0
001	1
010	2
011	3
100	4
101	5
110	6
111	7


#### The int and long int datatypes



We have shown that the computer can represent integers in binary, and have discussed so far only bytes. Since bytes have only 256 bit patterns, they can only store a small range of integers. So far we have shown how it can store the integers 0 through 255. There are two deficiencies,

- We must be able to store much larger integers.
- We must be able to represent both positive and negative integers.

C Language has two kinds of integer data types, _signed_ and _unsigned_. The type _unsigned char_ is one byte and the various bit patterns are used to represent the integers 0 through 255, using the obvious binary representation. 

We set aside for now the representation of negative numbers, and first address the need to represent a much larger range of positive numbers.

To store larger numbers the computer will use more bytes, and will collect them so that they have consecutive addresses in the RAM. This way, the location of the integer remains a single address. The number of bytes is known because the reference has a type that includes the number of bytes. 
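To see how quickly the range grows with more bytes, the $2^i$ formula can be applied with $i$ equal to 8 times the number of bytes. This is a short sketch for unsigned values only; the function name `max_unsigned` is made up for the illustration.

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes of 8 bits each."""
    return 2**(8 * num_bytes) - 1   # 2**bits patterns, counting from zero

for n in (1, 2, 4, 8):
    print(n, max_unsigned(n))
# 1 255
# 2 65535
# 4 4294967295
# 8 18446744073709551615
```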

<div style="float:right;margin:2em;">
<img width="512" src="../images/TCPL-1ed-bytesize.png">
</div>

C Language does not lay down the law about the number of bytes for each integer datatype, except that a char is one byte, and "larger" data types must have at least as many bytes as smaller ones. However, on most current computers the standard integer is 32 bits, with type names `int` and `unsigned int`. The image is from the first edition of _The C Programming Language_ (TCPL), which gives the number of bits in the various integer and byte types on computers of that time.

There are also two variants of `int`, the `short int` and the `long int`. The actual number of bytes is not defined by the C Language, except that a short int cannot be longer than an int, and a long int cannot be shorter than an int. Let us take as typical that a short is 16 bits and a long is 64 bits. Beware though, this will depend on the computer and the compiler.

The builtin operator `sizeof` gives the number of bytes of the object mentioned as its argument. The argument can be a data type or a variable. Although `sizeof` looks like a function call, it is not. If it were a function call, we would have to wait until the program ran before the value of `sizeof` was known. Instead, it is already known at compile time.


In [4]:
%%file sizeof-wow.c

#include<stdio.h>

int main(int argc, char * argv[]) {
    printf("type:\tbytes\n") ;
    /* sizeof yields a size_t, whose printf format is %zu */
    printf("char:\t%zu\n", sizeof(char)) ;
    printf("short:\t%zu\n", sizeof(short int)) ;
    printf("int:\t%zu\n",  sizeof(int)) ;
    printf("long:\t%zu\n",  sizeof(long int)) ;
    return 0 ;
}

Writing sizeof-wow.c


In [5]:
%%bash
cc -o sizeof-wow sizeof-wow.c
./sizeof-wow
rm sizeof-wow

type:	bytes
char:	1
short:	2
int:	4
long:	8
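The same sizes can also be queried from Python, without compiling anything, through the standard `ctypes` module, whose types `c_char`, `c_short`, `c_int` and `c_long` mirror the C types of the platform's C compiler. This is only a cross-check; the numbers printed depend on the machine, so none are claimed here.

```python
import ctypes

# sizes, in bytes, of the platform's C integer types as seen by ctypes
sizes = {
    'char':  ctypes.sizeof(ctypes.c_char),
    'short': ctypes.sizeof(ctypes.c_short),
    'int':   ctypes.sizeof(ctypes.c_int),
    'long':  ctypes.sizeof(ctypes.c_long),
}

print('type:\tbytes')
for name, size in sizes.items():
    print(f'{name}:\t{size}')

# the ordering rule holds whatever the actual sizes are
print(sizes['short'] <= sizes['int'] <= sizes['long'])   # True
```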


### The memory layout of integers

We will learn something about computer architectures and something about the C programming language together.

We have described how an integer is stored in a computer using multiple bytes, and for the convenience of the hardware those bytes will be in consecutive locations in the memory. The starting location, when the index is considered as an integer, will also be a multiple of the data type size. We will demonstrate this, but with a little hackery.

We have also said that the memory unit consists of an array of bytes, each with an index that is, in fact, an integer. In many situations, the C language lets us make this concrete. 

Given a memory item, say the integer `int i`, as a 32-bit integer it occupies four addresses in memory, say $m, m+1, m+2, m+3$. The notation `&i` gives a _pointer_ to `i`, which is an abstract memory reference, which in this case would be of type _pointer to an int_, or in C notation `int *` (said "int-star").

It is a grave error to confuse a memory pointer with an integer, that is, with the "location" of the byte or the starting location of the bytes. But we will do just that, by coercing the pointer to an _unsigned long int_. We need it to be unsigned, as there are no negative indexed locations in memory, and long, as most computers now are said to be 64-bit machines, meaning that their potential memory space is $2^{64}$ locations. No computer today actualizes all of this, but each actualizes some portion of that space.
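The same hackery can be tried from Python as a sketch. The standard `ctypes` module can create a C `int` and report its address as a plain Python integer through `ctypes.addressof`. The variable names are made up, and the address printed will differ from run to run.

```python
import ctypes

i = ctypes.c_int(42)              # a C int living somewhere in memory
address = ctypes.addressof(i)     # its starting location, as a plain integer
size = ctypes.sizeof(i)           # how many consecutive bytes it occupies

print(address, size)
# the bytes of i sit at address, address+1, ..., address+size-1
print(list(range(address, address + size)))
# on common platforms the starting address is a multiple of the size
print(address % size)
```

On typical machines the last line prints 0, which is the alignment property described above.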


#### C arrays

To look at the memory for a single integer type data item, be it short, int or long, we will consider an _array_ of integers. This is a sequence of several data items identified with the name of the array and an integer indicating whether we are considering the zeroth, first, second, third, etc, item along the array of items. C lays these out sequentially in memory so that the $i$-th element is easy to find from the location of the zero-th element and the size (number of bytes) for each element.

It packs these in tightly. So if we define a two element array of int, the integer address of the zero-th element and that of the first element should differ by exactly the `sizeof` of the element. 

In [6]:
%%file sizeof-ints.c

#include<stdio.h>
int main(int argc, char * argv[]) {
    short s[2] ;
    int i[2] ;
    long l[2] ;
    
    printf("s[0] @ %lu\ns[1] @ %lu\n", (unsigned long) &s[0], (unsigned long) &s[1] ) ;
    printf("i[0] @ %lu\ni[1] @ %lu\n", (unsigned long) &i[0], (unsigned long) &i[1] ) ;
    printf("l[0] @ %lu\nl[1] @ %lu\n", (unsigned long) &l[0], (unsigned long) &l[1] ) ;

    return 0 ;
}


Writing sizeof-ints.c


In [7]:

%%bash
cc -o sizeof-ints sizeof-ints.c
./sizeof-ints
rm sizeof-ints

s[0] @ 13089518988
s[1] @ 13089518990
i[0] @ 13089519024
i[1] @ 13089519028
l[0] @ 13089519008
l[1] @ 13089519016
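In this output, consecutive elements are 2, 4 and 8 bytes apart for short, int and long respectively. The rule "address of element $k$ equals the address of element 0 plus $k$ times the element size" can be sketched from Python with `ctypes`, under the assumption that a `ctypes` array is laid out like a C array. The names `IntArray2`, `a`, `base` and `size` are made up for the illustration.

```python
import ctypes

IntArray2 = ctypes.c_int * 2          # the type "array of two ints"
a = IntArray2(10, 20)

base = ctypes.addressof(a)            # integer address of element 0
size = ctypes.sizeof(ctypes.c_int)    # bytes per element

for k in range(2):
    address_k = base + k * size       # where element k should live
    # read an int directly from the computed address
    value_k = ctypes.c_int.from_address(address_k).value
    print(k, address_k, value_k)
```

Reading back 10 and 20 from the computed addresses shows that the elements are packed with no gaps between them.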
