The project is based on Yonatan Zilpa's exercise. A brief explanation can be found here. Most of what follows is copied directly from that site, with some differences:
- Hexadecimal (base 16) numeric system is used instead of octal.
- .entry MAIN needed to be defined explicitly.
- No relative addressing.
- The generated object code file uses a different format (it includes the entries and the externals).
Documentation can be found here.
A working virtual machine was created for this project in order to run the assembled programs. The machine can be found here: tvm.
Our computer architecture consists of a Central Processing Unit (CPU), registers and Random Access Memory (RAM), part of which is used as a stack. Each memory word is 16 bits in size. Arithmetic is carried out in two's complement. The machine handles only integers (positive or negative); it does not handle real numbers.
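
As a rough illustration of the 16-bit two's complement representation described above, the following Python sketch converts between signed integers and machine words. The helper names `to_word` and `from_word` are hypothetical and are not part of tas or tvm.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1   # 0xFFFF
SIGN_BIT = 1 << (WORD_BITS - 1)    # 0x8000


def to_word(value: int) -> int:
    """Encode a signed integer as a 16-bit two's complement word."""
    if not -SIGN_BIT <= value < SIGN_BIT:
        raise ValueError(f"{value} does not fit in {WORD_BITS} bits")
    return value & WORD_MASK


def from_word(word: int) -> int:
    """Decode a 16-bit word back into a signed integer."""
    word &= WORD_MASK
    # A set sign bit means the word represents a negative number.
    return word - (1 << WORD_BITS) if word & SIGN_BIT else word
```

With these helpers, `to_word(-1)` gives `0xFFFF` and `from_word(0xFFFF)` gives `-1`.
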
Our computer machine includes the following list of registers:
- Eight general registers (r0, r1, r2, r3, r4, r5, r6, r7)
- One Program Counter register (pc).
- One Stack Pointer register (sp).
- One Status register (psw - Program Status Word) which has two flags: carry flag and zero flag.
All registers are 16 bits in size. The first two bits of the PSW register are the C and Z flags, respectively. Characters are coded in ASCII.
The size of memory is 2000 words (each word is 16 bits in size).
The stack is at the end of main memory: it starts at memory address 1999 (07cf hex, in words) and grows downwards. The size of the stack is 16 words.
On startup all registers have a value of zero, including the flags. The contents of the memory are also zero.
In our computer machine, an instruction is a word (16 bits in size) that carries information about the operator and operands. Although an instruction is a string of 16 bits, it can be divided into fields. The following table shows these fields. Bit positions are given in decimal.
Field | Operation | Source Addressing Mode | Source Register | Destination Addressing Mode | Destination Register |
---|---|---|---|---|---|
Bits | 15-12 | 11-9 | 8-6 | 5-3 | 2-0 |
The following table maps operator's name to its corresponding instruction code (opcode).
Operator | Opcode |
---|---|
mov | 0 |
cmp | 1 |
add | 2 |
sub | 3 |
mul | 4 |
div | 5 |
lea | 6 |
inc | 7 |
dec | 8 |
jnz | 9 |
jnc | a |
shl | b |
prn | c |
jsr | d |
rts | e |
hlt | f |
All operators are written in lower case letters; the meaning of each operator is specified later.
- Bits 9-11: This field holds the addressing mode of the source operand. Depending on the value of this field, the instruction may refer to an additional word (first additional word).
- Bits 6-8: This field holds the register of the source operand. The field maps its numeric value n to register rn.
  Notice: If the addressing mode of the source operand does not require a source register, then the source register field is not in use. In such a case the numeric value of the field (bits 6-8) is zero.
- Bits 3-5: This field holds the addressing mode of the destination operand. Depending on the value of this field, the instruction may refer to an additional word (second additional word).
- Bits 0-2: This field holds the register of the destination operand. The field maps its numeric value n to register rn.
  Notice: If the addressing mode of the destination operand does not require a destination register, then the destination register field is not in use. In such a case the numeric value of the field (bits 0-2) is zero.
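
The field layout above can be sketched as a pair of packing and unpacking helpers. This is a minimal Python sketch, not the code used by tas; the function names `encode` and `decode` and the dictionary keys are assumptions made for illustration.

```python
def encode(opcode, src_mode=0, src_reg=0, dst_mode=0, dst_reg=0):
    """Pack the five fields into one 16-bit instruction word."""
    # (value, width in bits), listed from the most significant field down.
    fields = ((opcode, 4), (src_mode, 3), (src_reg, 3), (dst_mode, 3), (dst_reg, 3))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Split a 16-bit instruction word back into its fields."""
    return {
        "opcode": (word >> 12) & 0xF,    # bits 15-12
        "src_mode": (word >> 9) & 0x7,   # bits 11-9
        "src_reg": (word >> 6) & 0x7,    # bits 8-6
        "dst_mode": (word >> 3) & 0x7,   # bits 5-3
        "dst_reg": word & 0x7,           # bits 2-0
    }
```

For example, `encode(0, 1, 0, 3, 1)` gives `0x0219`, which is the machine code listed for `mov LEN, r1` in the translation example later in this document.
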
There are five addressing modes in our assembly language. Some of these modes require additional information, i.e. an additional word. The following table describes each addressing mode.
Field Value | Name | Register Field | Additional Word | Operand | Way of Writing | Example |
---|---|---|---|---|---|---|
0 | Instant addressing | zero (not in use) | yes | The additional word holds the numeric value of the operand. | The operand is a number preceded by the '#' sign. | mov #-1,r2 |
1 | Direct addressing | zero (not in use) | yes | The additional word contains a memory address. The operand is the value stored at this address. | The operand is a label, either already declared or expected to be declared later in the file. | mov x,r2 |
2 | Indirect addressing | zero (not in use) | yes | The additional word contains a memory address. The value stored at this address is itself a memory address, and the value stored at that second address is the operand. | Indirect addressing is indicated by the '@' sign placed just before the label. The label is declared in the same way as in the direct addressing mode. | mov @x,r2 |
3 | Direct register addressing | n (0-7) | no | Register rn contains the value of the operand. | The operand is a legal register name. | mov r1,r2 |
4 | Indirect register addressing | n (0-7) | no | Register rn contains a memory address. This memory address contains the operand. | The operand is a legal register name preceded by the '@' sign. | mov @r1,r2 |
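
The "Way of Writing" column suggests a simple way to recognise the addressing mode from the operand text. The Python sketch below is one possible reading of that table; the function name `addressing_mode` and its return shape are hypothetical, and it does not check that labels or numbers are well formed.

```python
import re

REGISTER = re.compile(r"r[0-7]")


def addressing_mode(operand: str):
    """Return (mode, register, needs_additional_word) for an operand string."""
    text = operand.strip()
    if text.startswith("#"):
        return 0, 0, True                    # instant: '#' followed by a number
    if text.startswith("@"):
        inner = text[1:]
        if REGISTER.fullmatch(inner):
            return 4, int(inner[1]), False   # indirect register: '@rn'
        return 2, 0, True                    # indirect: '@' followed by a label
    if REGISTER.fullmatch(text):
        return 3, int(text[1]), False        # direct register: 'rn'
    return 1, 0, True                        # direct: a plain label
```

Applied to the examples in the table, `addressing_mode("@r1")` returns `(4, 1, False)` and `addressing_mode("x")` returns `(1, 0, True)`.
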
Machine instructions may be classified into three classes, according to the number of operands in each instruction.
The first class contains all machine instructions that take two operands. Any machine instruction in this class uses one of the following operators:
mov, cmp, add, sub, mul, div, lea, shl
The following table provides further explanation on the operational aspects of these operators:
Numeric Code | Operator | Description | Example | Example Description |
---|---|---|---|---|
0 | mov | Copies the value of the source operand (the first operand) to the destination operand (the second operand). | mov A, r1 | Copy the value of A to register r1. |
1 | cmp | Compares two operands. The cmp operator subtracts the destination operand from the source operand without saving the result, then updates the zero flag Z in the status register PSW. | cmp A, r1 | If the values of A and r1 are equal, the zero flag Z in the status register PSW is turned on. Otherwise the zero flag is turned off. |
2 | add | The destination operand is assigned the value of the source operand plus the value of the destination operand. | add A, r0 | Register r0 gets the sum of r0 and A. |
3 | sub | The destination operand is assigned the value of the destination operand minus the value of the source operand. | sub #3, r1 | Register r1 is assigned the value of r1 minus 3. |
4 | mul | The destination operand is assigned the value of the source operand times the value of the destination operand. | mul A, r2 | Register r2 is assigned A times r2. |
5 | div | The destination operand is assigned the value of the destination operand divided by the source operand. | div A, r2 | Register r2 is assigned r2/A. |
6 | lea | Acronym for 'load effective address'. Loads the memory address marked by the label in the first operand into the destination operand. | lea ABC, r1 | The memory address of label ABC is assigned to register r1. |
b | shl | Shifts the bits of the source operand to the left. The number of shifts is determined by the value of the destination operand. | shl r1, #1 | Register r1 is shifted 1 bit to the left. |
The second class contains all machine instructions that take one operand. These instructions have no source operand, so bits 6-11 are meaningless (their value is zero). Any machine instruction in this class uses one of the following operators:
inc, dec, jnz, jnc, prn, jsr
The following table provides further explanation on the operational aspects of these operators:
Numeric Code | Operator | Description | Example | Example Description |
---|---|---|---|---|
7 | inc | The operand is increased by one. | inc r2 | Register r2 is assigned r2 plus 1. |
8 | dec | The operand is decreased by one. | dec r2 | Register r2 is assigned r2 minus 1. |
9 | jnz | Acronym: jump if not zero. The Program Counter register PC is assigned the operand if the Z flag in the Program Status Word register PSW is not set. | jnz LINE | If the Z flag (in the PSW register) is 0, the PC register is assigned LINE. |
a | jnc | Acronym: jump if not carry. The Program Counter register PC is assigned the operand if the C flag in the Program Status Word register PSW is not set. | jnc LINE | If the C flag (in the PSW register) is 0, the PC register is assigned LINE. |
c | prn | Prints the ASCII equivalent of the operand to the standard output file (stdout). | prn r1 | The ASCII character equivalent to the value stored in r1 is printed to standard output. |
d | jsr | Calls a subroutine: pushes the PC register onto the run-time stack and assigns the operand to the PC register. | jsr FUNC | stack[SP] = PC; SP = SP-1; PC = FUNC |
The third class contains all machine instructions that take no operands. In such cases bits 0-11 are meaningless (their value is zero). Any machine instruction in this class uses one of the following operators:
rts, hlt
The following table provides further explanation on the operational aspects of these operators:
Numeric Code | Operator | Description | Example | Example Description |
---|---|---|---|---|
e | rts | Pops a value from the run-time stack and moves this value to the Program Counter register. | rts | SP = SP+1; PC = stack[SP] |
f | hlt | Halts the program. | hlt | The program halts. |
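
The stack behaviour of jsr and rts given in the tables above can be sketched in Python as follows. Several points are assumptions rather than facts about tvm: the state is modelled as a plain dictionary, SP is treated as an absolute memory address that the caller has already set to 1999, the PC value being pushed is taken to be the return address, and the overflow and underflow checks are only there to honour the 16-word stack size.

```python
STACK_TOP = 1999    # first stack slot, at the end of the 2000-word memory
STACK_SIZE = 16     # the stack may hold at most 16 words


def jsr(state, target):
    """stack[SP] = PC; SP = SP - 1; PC = target."""
    if state["sp"] <= STACK_TOP - STACK_SIZE:
        raise OverflowError("stack overflow")
    state["memory"][state["sp"]] = state["pc"]
    state["sp"] -= 1
    state["pc"] = target


def rts(state):
    """SP = SP + 1; PC = stack[SP]."""
    if state["sp"] >= STACK_TOP:
        raise IndexError("stack underflow")
    state["sp"] += 1
    state["pc"] = state["memory"][state["sp"]]
```

A call followed by a return restores both PC and SP, so nested subroutine calls work as long as no more than 16 return addresses are pending.
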
The following table contains information on legal addressing mode for the source and destination operands.
Operator | Legal Addressing Modes for the Source Operand | Legal Addressing Modes for the Destination Operand |
---|---|---|
mov | 0,1,2,3,4 | 1,2,3,4 |
cmp | 0,1,2,3,4 | 0,1,2,3,4 |
add | 0,1,2,3,4 | 1,2,3,4 |
sub | 0,1,2,3,4 | 1,2,3,4 |
mul | 0,1,2,3,4 | 1,2,3,4 |
div | 0,1,2,3,4 | 1,2,3,4 |
lea | 1 | 1,2,3,4 |
inc | No source operand | 1,2,3,4 |
dec | No source operand | 1,2,3,4 |
jnz | No source operand | 1,2,4 |
jnc | No source operand | 1,2,4 |
shl | 1,2,3,4 | 0,1,2,3,4 |
prn | No source operand | 0,1,2,3,4 |
jsr | No source operand | 1,2,4 |
rts | No source operand | No destination operand |
hlt | No source operand | No destination operand |
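
An assembler can check operands against the table above with a simple lookup. The following Python sketch transcribes the table into a dictionary; the names `LEGAL_MODES` and `is_legal` are hypothetical, and `None` is used here to mean "no operand".

```python
ALL = {0, 1, 2, 3, 4}
NO_INSTANT = {1, 2, 3, 4}
JUMP = {1, 2, 4}

# operator -> (legal source modes, legal destination modes); None = no operand
LEGAL_MODES = {
    "mov": (ALL, NO_INSTANT), "cmp": (ALL, ALL),
    "add": (ALL, NO_INSTANT), "sub": (ALL, NO_INSTANT),
    "mul": (ALL, NO_INSTANT), "div": (ALL, NO_INSTANT),
    "lea": ({1}, NO_INSTANT), "shl": (NO_INSTANT, ALL),
    "inc": (None, NO_INSTANT), "dec": (None, NO_INSTANT),
    "jnz": (None, JUMP), "jnc": (None, JUMP), "jsr": (None, JUMP),
    "prn": (None, ALL),
    "rts": (None, None), "hlt": (None, None),
}


def is_legal(operator, src_mode=None, dst_mode=None):
    """Check the addressing modes of an operation against the table."""
    def check(mode, allowed):
        if allowed is None:          # operand must be absent
            return mode is None
        return mode in allowed       # operand must be present and legal

    legal_src, legal_dst = LEGAL_MODES[operator]
    return check(src_mode, legal_src) and check(dst_mode, legal_dst)
```

For instance, `is_legal("mov", 0, 3)` is true, while `is_legal("mov", 3, 0)` is false because an instant value cannot be a destination of mov.
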
The following table contains information on the flags modified by the instructions.
Operator | Zero Flag Modified | Carry Flag Modified |
---|---|---|
mov | No | No |
cmp | Yes | No |
add | Yes | Yes |
sub | Yes | Yes |
mul | Yes | Yes |
div | No | No |
lea | No | No |
inc | Yes | No |
dec | Yes | No |
jnz | No | No |
jnc | No | No |
shl | Yes | Yes |
prn | No | No |
jsr | No | No |
rts | No | No |
hlt | No | No |
Our assembly language consists of statements separated by the newline character '\n'. A source file is therefore a sequence of lines, each holding one statement. The language has four types of statements, described in the following table.
Type of statement | General Explanation |
---|---|
Empty Statement | A line with this kind of statement may contain only whitespace: tab characters '\t' or space characters ' '. |
Comment Statement | The first character in a line with this statement is the semicolon ';' character. This line is completely ignored by the assembler. |
Directive Statement | This statement is a directive to the assembler program. It does not generate a machine instruction. |
Operation Statement | This statement generates a machine instruction to be executed by the CPU. The statement represents the machine instruction in symbolic form. |
A directive statement has the following form. It may optionally start with a label, which must follow certain syntax rules (described later). With or without a label, the statement must include a directive name preceded by a dot '.' character. No whitespace is allowed between the '.' character and the directive name. If the directive has a label, at least one whitespace character separates the label from the '.' character. The directive parameters follow the directive name on the same line, separated from it by whitespace (the number of parameters depends on the type of the directive). There are four types of directive:
- .data
The parameters of '.data' are a list of legal numbers separated by comma ',' characters. For example:
.data +7,-57 ,17 , 9
Notice that any number of whitespace characters may appear between the numbers and the commas. However, a comma must always separate two numeric values.
The '.data' directive statement directs the assembler to allocate space in its data image, where the numeric parameters are stored. It also directs the assembler to advance the data counter by the number of parameters. If the '.data' directive has a label, the label is assigned the value of the data counter (before it was advanced) and is inserted into the symbol table. This way we can refer to a place in the data image using the label name. For instance, if we write
XYZ: .data +7,-57,17,9
mov XYZ, r1
then register r1 is assigned the value +7. If we go on to write
lea XYZ, r1
then r1 is assigned the address (in the data image) that stores the +7 value.
- .string
The '.string' directive statement takes a single legal string as its parameter. Its meaning is similar to that of the '.data' directive statement. The ASCII characters that make up the string are coded to their numeric ASCII values and inserted into the data image in order. A zero value is inserted at the end to mark the end of the string. The data counter is increased by the length of the string plus one. If the line includes a label, the label points to the memory location that stores the ASCII code of the first character of the string, in the same way as for the '.data' directive. For instance, the directive statement
ABC: .string "abcdef"
allocates an array of characters of length 7, starting from the address stored in the ABC label. This "array" is initialized to the ASCII values of the characters 'a', 'b', 'c', 'd', 'e', 'f' in order, followed by a terminating zero value.
- .entry
The '.entry' directive statement takes exactly one parameter. This parameter is a label name defined in the same file in which the '.entry' directive appears. The purpose of the '.entry' directive statement is to handle cases where a label defined in an assembly source file A needs to be referred to by other assembly source files B, C, D, etc. In this case the '.entry' directive statement, written in file A, takes the label name as its single parameter. For instance, if an assembly source file A contains the following lines
.entry HELLO
HELLO: add #1, r1
then other assembly source files may refer to the HELLO label. Notice that a label at the beginning of the '.entry' directive is meaningless.
- .extern
The '.extern' directive statement takes one parameter: the name of a label defined in another assembly source file. The purpose of this directive statement is to declare that the label is defined in another source file and that this assembly source file (the one that contains the '.extern' directive statement) uses it. The value of the label, as defined in its own source file, is matched with the operation instructions that use it as an argument at link time. For example:
.extern HELLO
Notice that a label at the beginning of the '.extern' directive is meaningless.
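
The bookkeeping that '.data' and '.string' perform on the data image can be sketched as follows. This is a minimal Python sketch under stated assumptions: the data image is a plain list whose length plays the role of the data counter, the symbol table is a dictionary, and the function names `add_data` and `add_string` are hypothetical.

```python
def add_data(image, symbols, label, values):
    """Handle a '.data' directive: store the numbers and advance the data counter."""
    if label is not None:
        # The label gets the data counter value before it is advanced.
        symbols[label] = len(image)
    # Each number is stored as a 16-bit two's complement word.
    image.extend(value & 0xFFFF for value in values)


def add_string(image, symbols, label, text):
    """Handle a '.string' directive: store ASCII codes plus a terminating zero."""
    add_data(image, symbols, label, [ord(ch) for ch in text] + [0])
```

Processing `ABC: .string "abcdef"` and then `XYZ: .data +7,-57,17,9` on an empty image leaves ABC at data address 0 and XYZ at data address 7, with 11 words in the image.
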
An operation statement is composed of the following:
- Optional label.
- Operation name.
- Operands (the number of operands may be 0, 1 or 2 depending on the operation).
The length of a statement (of any type) cannot exceed 80 characters. The operation name is written in lower case letters and must be one of the 16 operations mentioned above. After the operation name, separated from it by whitespace, come zero, one or two operands. When there are two operands, they are separated by a comma ',' character. As mentioned before, whitespace may appear between the comma and the operands. Operation statements have the following form:
Label | Operation | Operands (source, destination) |
---|---|---|
HELLO: | add | r7, B |
JUMP: | jnc | XYZ |
END: | hlt | |
Every label must begin with an upper or lower case letter; the rest of the label may contain letters or digits. The length of the label cannot exceed 30 characters. A label definition ends with a colon ':' character. The colon is not part of the label name; it only marks the end of the label. The label must begin in the first column of the line. A label name cannot have more than one definition. The following labels are written correctly:
hEllo:
x:
He78940:
A label name cannot be the same as a register or operation name. A label derives its value from its context. A label written at the beginning of a '.data' or '.string' directive gets the current value of the data counter. A label written at the beginning of an operation statement gets the current value of the instruction counter.
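
The label rules above translate naturally into a small validator. The Python sketch below is one possible reading of them; the name `is_valid_label` is hypothetical, and treating pc, sp and psw as reserved register names alongside r0-r7 is an assumption based on the register list given earlier.

```python
import re

# A letter followed by up to 29 letters or digits (30 characters in total).
LABEL = re.compile(r"[A-Za-z][A-Za-z0-9]{0,29}")

# Assumption: every register named in this document is reserved.
REGISTERS = {f"r{n}" for n in range(8)} | {"pc", "sp", "psw"}
OPERATIONS = {
    "mov", "cmp", "add", "sub", "mul", "div", "lea", "inc",
    "dec", "jnz", "jnc", "shl", "prn", "jsr", "rts", "hlt",
}


def is_valid_label(name: str) -> bool:
    """Check a label name (without the trailing colon) against the syntax rules."""
    return (
        LABEL.fullmatch(name) is not None
        and name not in REGISTERS
        and name not in OPERATIONS
    )
```

The three sample labels `hEllo`, `x` and `He78940` all pass this check, while `r1`, `mov` and `1abc` do not.
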
A number is a string of decimal digits (0-9) that may optionally be preceded by a '-' or '+' sign. The number takes its value from the decimal representation given by the string of digits. For instance, the numbers
76, -5, +123
can be accepted as numbers. As mentioned, we do not handle rational or real numbers, only integers.
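
The number syntax, together with the comma-separated parameter list of '.data', can be sketched with a regular expression. This is only an illustrative Python sketch; `parse_number` and `parse_data_list` are hypothetical names, and no 16-bit range check is performed here.

```python
import re

# Optional sign followed by one or more decimal digits.
NUMBER = re.compile(r"[+-]?[0-9]+")


def parse_number(text: str) -> int:
    """Parse one number, rejecting anything that is not a signed integer."""
    if NUMBER.fullmatch(text) is None:
        raise ValueError(f"illegal number: {text!r}")
    return int(text)


def parse_data_list(params: str) -> list:
    """Parse the parameter list of a '.data' directive.

    Whitespace around commas is allowed; an empty item (for example two
    commas in a row) is rejected because it is not a legal number.
    """
    return [parse_number(item.strip()) for item in params.split(",")]
```

For example, `parse_data_list("+7,-57 ,17 , 9")` returns `[7, -57, 17, 9]`, while `parse_number("1.5")` raises `ValueError`.
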
A string is a sequence of visible ASCII characters surrounded by double quotation marks. The quotation marks are not part of the string. The string
"Hello World"
is an example of a legal string.
When the assembler starts to translate code, it has two major tasks. The first is to identify and translate the operation codes, and the second is to determine addresses for all data and variables that appear in the source file(s). For instance, when the assembler reads the following code:
.entry MAIN
MAIN: mov LEN, r1
lea STR, r2
LOOP: prn @r2
inc r2
sub #1, r1
jnz LOOP
END: hlt
STR: .string "abcdef"
LEN: .data 6
it has to replace the operation names mov, lea, prn, inc, sub, jnz, hlt with their equivalent binary codes. In addition, the assembler has to replace the symbols STR, LEN, MAIN, LOOP, END with the addresses that have been allocated for them. Assuming that the code in example I has been translated by the assembler and stored (operations and directives) in a memory block that starts from address 0000, this translation can be described as follows:
| Label | Address | Command | Operand(s) | Machine Code |
|---|---|---|---|---|
| | | .entry | MAIN | |
| MAIN: | 0000 | mov | LEN, r1 | 0219 |
| | 0001 | | | 0012 |
| | 0002 | lea | STR, r2 | 621a |
| | 0003 | | | 000b |
| LOOP: | 0004 | prn | @r2 | c022 |
| | 0005 | inc | r2 | 701a |
| | 0006 | sub | #1, r1 | 3019 |
| | 0007 | | | 0001 |
| | 0008 | jnz | LOOP | 9008 |
| | 0009 | | | 0004 |
| END: | 000a | hlt | | f000 |
| STR: | 000b | .string | "abcdef" | 0061 |
| | 000c | | | 0062 |
| | 000d | | | 0063 |
| | 000e | | | 0064 |
| | 000f | | | 0065 |
| | 0010 | | | 0066 |
| | 0011 | | | 0000 |
| LEN: | 0012 | .data | 6 | 0006 |
If the assembler maintains a table of all the operation names and their corresponding binary codes, then all operation names can be converted easily: whenever the assembler reads an operation name, it looks up the equivalent binary code in the table. To carry out the same conversion for the addresses of symbols, the assembler has to build a similar table. For instance, in example I, before reading the source file(s) the assembler has no way of knowing that the LOOP symbol relates to address 0004. Thus, for all symbols defined by the programmer, the assembler has to accomplish two separate tasks. The first is to build a table of all symbols and their numeric values, and the second is to replace every symbol that appears in the source file(s) with its numeric value. These two tasks are achieved by performing two separate scans (passes) over the source file(s). In the first pass the assembler builds the symbol table, which maps each symbol to an address. In the second pass the assembler translates the source file(s) into binary machine code. Notice that both passes are done by the assembler during translation (at assembly time), before the linking process. After translation, the program may be linked and loaded into memory for execution.
In the first pass, each instruction is substituted with its appropriate code and the symbol table is built. The rest of the code is left untouched. The code is assumed to be loaded at address zero. After applying the first pass to example I, we get the following result:
The table of symbols:
Name | Value | Image |
---|---|---|
MAIN | 0000 | instruction |
LOOP | 0004 | instruction |
END | 000a | instruction |
STR | 0000 | data |
LEN | 0007 | data |
List of entries:
Name | Value |
---|---|
MAIN | ???? |
Data image:
Address | Value |
---|---|
0000 | 0061 |
0001 | 0062 |
0002 | 0063 |
0003 | 0064 |
0004 | 0065 |
0005 | 0066 |
0006 | 0000 |
0007 | 0006 |
Instruction image:
Address | Value |
---|---|
0000 | 0219 |
0001 | ???? |
0002 | 621a |
0003 | ???? |
0004 | c022 |
0005 | 701a |
0006 | 3019 |
0007 | ???? |
0008 | 9008 |
0009 | ???? |
000a | f000 |
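
The symbol-table part of the first pass can be sketched in a few lines of Python. This is a simplified illustration, not the tas implementation: it only counts words (it does not emit machine code), it accepts only full-line comments, it performs no error checking, and the names `first_pass` and `extra_words` are hypothetical. It relies on the addressing-mode table, where operands in modes 0, 1 and 2 take one additional word and register operands take none.

```python
import re

LABEL_DEF = re.compile(r"([A-Za-z][A-Za-z0-9]*):\s*(.*)")
REGISTER_OPERAND = re.compile(r"@?r[0-7]")   # modes 3 and 4


def extra_words(operand):
    """Register operands fit in the first word; all others need one more word."""
    return 0 if REGISTER_OPERAND.fullmatch(operand) else 1


def first_pass(lines):
    """Return (symbols, instruction count, data count) for a list of source lines."""
    symbols = {}     # name -> (value, image)
    ic = dc = 0      # instruction counter and data counter
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue                          # empty or comment statement
        label = None
        match = LABEL_DEF.match(line)
        if match:
            label, line = match.group(1), match.group(2)
        parts = line.split(None, 1)
        name = parts[0]
        args = parts[1].strip() if len(parts) > 1 else ""
        if name in (".entry", ".extern"):
            continue                          # a label here is meaningless
        if name in (".data", ".string"):
            if label:
                symbols[label] = (dc, "data")
            # '.string' stores its characters plus a terminating zero.
            dc += len(args[1:-1]) + 1 if name == ".string" else len(args.split(","))
        else:
            if label:
                symbols[label] = (ic, "instruction")
            operands = [op.strip() for op in args.split(",")] if args else []
            ic += 1 + sum(extra_words(op) for op in operands)
    return symbols, ic, dc
```

Run on the source of example I, this sketch places MAIN, LOOP and END at instruction addresses 0000, 0004 and 000a, and STR and LEN at data addresses 0000 and 0007, matching the symbol table above.
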
Applying the second pass on the code of example I yields the following final results:
Name | Value | Image |
---|---|---|
MAIN | 0000 | object code |
LOOP | 0004 | object code |
END | 000a | object code |
STR | 000b | object code |
LEN | 0012 | object code |
List of entries:
Name | Value |
---|---|
MAIN | 0000 |
Object code:
Address | Machine Word |
---|---|
0000 | 0219 |
0001 | 0012 |
0002 | 621a |
0003 | 000b |
0004 | c022 |
0005 | 701a |
0006 | 3019 |
0007 | 0001 |
0008 | 9008 |
0009 | 0004 |
000a | f000 |
000b | 0061 |
000c | 0062 |
000d | 0063 |
000e | 0064 |
000f | 0065 |
0010 | 0066 |
0011 | 0000 |
0012 | 0006 |
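
The step from the first-pass tables to the final object code can be sketched as follows. In this Python sketch, the unresolved words of the instruction image are represented by the label name they refer to; that representation, like the function name `second_pass`, is an assumption made for illustration. Data symbols are shifted by the length of the instruction image because the data image is placed directly after the code.

```python
def second_pass(symbols, instruction_image, data_image):
    """Resolve symbols and concatenate the instruction and data images."""
    code_length = len(instruction_image)
    # Data symbols move up by the number of instruction words.
    final = {
        name: value + code_length if image == "data" else value
        for name, (value, image) in symbols.items()
    }
    # Replace each pending label reference with its final address.
    code = [final[word] if isinstance(word, str) else word
            for word in instruction_image]
    return final, code + data_image
```

Applied to the first-pass results of example I, STR moves from 0000 to 000b and LEN from 0007 to 0012, which agrees with the final symbol table above.
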
When the assembler is done, it has generated object code, which is then passed to a linker program. The purpose of the linker program is:
- To allocate a place in memory for the program (allocation).
- To link the object files into one executable file (linking).
- To change addresses according to the loading place (relocation).
- To physically load the code into memory.
After the linker is done, the program can be loaded into memory and is ready to run. How the linker works is not discussed further here.
The object file written by the assembler describes the machine's memory. The first instruction is placed at memory address 0, the second instruction at memory address 1, 2 or 3 (depending on the length of the first instruction), and so forth until the last instruction has been translated. The memory addresses after the last translated instruction contain the data built by the '.data' and '.string' directives, in the order in which they appear in the source file (the first directive occupies the first free memory address, in rising order).
The object file is composed out of lines of text and contains 3 sections: code, entries, externals.
The code section starts with '.cbegin' and ends with '.cend'. The first line contains (in hex) the length of the code and the length of the data, both in memory words. These two numbers must be separated by whitespace. Each of the following lines gives the content of one memory address (in hex), starting from memory address 0. In addition, each memory address occupied by an instruction (not data) carries extra information for the linker, which is one of the following three characters: 'e', 'a' or 'r'. The character 'a' indicates that the content of the memory address is absolute and does not depend on where the file is loaded (the assembler assumes it starts from memory address 0). The character 'r' indicates that the content is relocatable and must have the appropriate offset added to it, according to where the file is loaded. The offset is the memory address at which the first instruction of the program is loaded. The character 'e' indicates that the content depends on an external variable; the linker is responsible for inserting the appropriate value.
The entries section starts with '.lbegin' and ends with '.lend'. The entries section is composed out of lines of text. Each line contains the entry name and value, as it was computed for this file.
The externals section starts with '.ebegin' and ends with '.eend'. The externals section is composed of lines of text. Each line contains the name and memory address of an external variable.
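
To make the role of the 'a' and 'r' markers concrete, the following Python sketch reads the code section of an object file and applies a load offset to the relocatable words. It is an illustration of the format described above, not the actual linker or tvm loader; the function name `load_code_section` is hypothetical, and words marked 'e' are left unchanged because resolving externals is outside the scope of this sketch.

```python
def load_code_section(text, offset=0):
    """Return the memory words of the code section, relocated by `offset`."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    start, end = lines.index(".cbegin"), lines.index(".cend")
    # First line of the section: code length and data length, both in hex.
    code_length, data_length = (int(field, 16) for field in lines[start + 1].split())
    words = []
    for line in lines[start + 2:end]:
        fields = line.split()            # address, value, optional marker
        value = int(fields[1], 16)
        marker = fields[2] if len(fields) > 2 else None
        if marker == "r":
            # Relocatable words hold addresses, so they move with the program.
            value = (value + offset) & 0xFFFF
        words.append(value)
    if len(words) != code_length + data_length:
        raise ValueError("section length does not match its header")
    return words
```

Loading the test.oc file shown later in this document with `offset=0x100` would, for instance, turn the word at address 0001 from 0012 into 0112, while the absolute word 0219 at address 0000 stays unchanged.
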
The binary file contains the object code in binary (non-text) format. It cannot be created if the source code contains .extern directives.
Prints the string "abcdef".
test.as
; test.as
; Prints the string "abcdef".
.entry MAIN ; file contains the definition of MAIN
MAIN: mov LEN, r1 ; move LEN(=6) to r1
lea STR, r2 ; load the address of STR to r2
LOOP: prn @r2 ; print the character at the memory location that r2 holds
inc r2 ; r2 = r2 + 1
sub #1, r1 ; r1 = r1 - 1
jnz LOOP ; jump to LOOP if the zero flag is not set (sub sets it)
END: hlt ; end of the program
STR: .string "abcdef" ; string to print
LEN: .data 6 ; length of the string
test.oc
.cbegin
b 8
0000 0219 a
0001 0012 r
0002 621a a
0003 000b r
0004 c022 a
0005 701a a
0006 3019 a
0007 0001 a
0008 9008 a
0009 0004 r
000a f000 a
000b 0061
000c 0062
000d 0063
000e 0064
000f 0065
0010 0066
0011 0000
0012 0006
.cend
.lbegin
MAIN 0000
.lend
.ebegin
.eend
tas <options> source-file
where the options are:
-l : prints debugging lists after each pass
-n : creates NO output files
-b : creates binary output file
-h : shows this text
Windows
cd tas
mkdir build
cd build
cmake ..
tas.sln
Linux
cd tas
mkdir build
cd build
cmake ..
make