# Tokenizing notebook

This notebook demonstrates the tokenizing utility functions
in two main sections. First, we show the functions that extract
information from a list of tokens. Once this is done, we show
how to change the content in a reliable way.

In [1]:
from ideas import token_utils

## Getting information
We start with a very simple example, where we have a repeated token, `a`.

In [2]:
source = "a = a"
tokens = token_utils.tokenize(source)
for token in tokens:
    print(token)

type=1 (NAME)  string='a'  start=(1, 0)  end=(1, 1)  line='a = a'
type=53 (OP)  string='='  start=(1, 2)  end=(1, 3)  line='a = a'
type=1 (NAME)  string='a'  start=(1, 4)  end=(1, 5)  line='a = a'
type=4 (NEWLINE)  string=''  start=(1, 5)  end=(1, 6)  line=''
type=0 (ENDMARKER)  string=''  start=(2, 0)  end=(2, 0)  line=''


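Notice how the `NEWLINE` token here, in spite of its name, does not correspond to `\n`: the source string has no trailing newline character, so the `string` attribute of that token is empty.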
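
The same behaviour can be observed without `token_utils`. The following is a minimal sketch that uses only the standard library `tokenize` module, not the `ideas` API; the helper name `newline_strings` is made up for this illustration. It compares a source without a trailing newline to one that has it.

```python
import io
import tokenize


def newline_strings(source):
    """Return the `string` attribute of every NEWLINE token in `source`."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.NEWLINE
    ]


# Hypothetical helper, applied to the example above with and without "\n".
without_newline = newline_strings("a = a")
with_newline = newline_strings("a = a\n")
print(repr(without_newline), repr(with_newline))
```

In both cases a single `NEWLINE` token ends the logical line; only its `string` differs.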

### Comparing tokens

Tokens are considered equal if they have the same `string` attribute. Given this notion of equality, we make things even simpler by allowing a token to be compared directly to a string, as shown below.

In [3]:
print(tokens[0] == tokens[2])
print(tokens[0] == tokens[2].string)
print(tokens[0] == 'a')  #  <--  Our normal choice

True
True
True
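
To make this notion of equality concrete, here is a minimal sketch of how such a comparison could be written. The `Token` class below is a hypothetical stand-in that models only the `string` attribute; it is not the class used by `token_utils`, whose implementation may differ.

```python
class Token:
    """Hypothetical stand-in for a token; only `string` is modeled."""

    def __init__(self, string):
        self.string = string

    def __eq__(self, other):
        # Accept a plain string as well as another token-like object.
        if isinstance(other, str):
            return self.string == other
        if hasattr(other, "string"):
            return self.string == other.string
        return NotImplemented

    def __repr__(self):
        return f"Token({self.string!r})"


first, last = Token("a"), Token("a")
print(first == last, first == last.string, first == "a")
```

One consequence of this design choice: a class that defines `__eq__` without `__hash__` is unhashable in Python, so instances of this sketch cannot be used as dictionary keys or set members.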


### Printing tokens by line of code
If all we want is to tokenize a source and print the result, or to print an existing list of tokens, we can use `print_tokens` to do it in a single instruction, with the added benefit of separating tokens from different lines of code.

In [4]:
source = """
if True:
    pass
"""
token_utils.print_tokens(source)

type=56 (NL)  string='\n'  start=(1, 0)  end=(1, 1)  line='\n'

type=1 (NAME)  string='if'  start=(2, 0)  end=(2, 2)  line='if True:\n'
type=1 (NAME)  string='True'  start=(2, 3)  end=(2, 7)  line='if True:\n'
type=53 (OP)  string=':'  start=(2, 7)  end=(2, 8)  line='if True:\n'
type=4 (NEWLINE)  string='\n'  start=(2, 8)  end=(2, 9)  line='if True:\n'

type=5 (INDENT)  string='    '  start=(3, 0)  end=(3, 4)  line='    pass\n'
type=1 (NAME)  string='pass'  start=(3, 4)  end=(3, 8)  line='    pass\n'
type=4 (NEWLINE)  string='\n'  start=(3, 8)  end=(3, 9)  line='    pass\n'

type=6 (DEDENT)  string=''  start=(4, 0)  end=(4, 0)  line=''
type=0 (ENDMARKER)  string=''  start=(4, 0)  end=(4, 0)  line=''
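
The grouping shown above can be approximated with the standard library alone. This is a sketch under an assumption: tokens are grouped by the row on which they start, which is not necessarily how `print_tokens` is implemented. The helper name `tokens_by_row` is made up for this illustration.

```python
import io
import itertools
import tokenize


def tokens_by_row(source):
    """Group tokens by the row on which each token starts."""
    readline = io.StringIO(source).readline
    tokens = tokenize.generate_tokens(readline)
    # Tokens arrive in source order, so consecutive tokens that share
    # a start row can be grouped directly.
    return [
        list(group)
        for _, group in itertools.groupby(tokens, key=lambda tok: tok.start[0])
    ]


source = "\nif True:\n    pass\n"
rows = tokens_by_row(source)
for row in rows:
    print([tokenize.tok_name[tok.type] for tok in row])
```

A token that spans several rows, such as a triple-quoted string, is placed in the group of the row where it starts.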



### Getting tokens by line of code
Once a source is broken down into tokens, it might be difficult to find some particular tokens of interest if we print the entire content. Instead, using `get_lines`, we can tokenize by line of code, and just focus on a few lines of interest.

In [16]:
source = """
if True:
    if False:
        pass
    else:
        a = 1 # a comment
print('ok')
"""
lines = token_utils.get_lines(source)
for line in lines[4:6]:
    for token in line:
        print(token)
    print()

type=6 (DEDENT)  string=''  start=(5, 4)  end=(5, 4)  line='    else:\n'
type=1 (NAME)  string='else'  start=(5, 4)  end=(5, 8)  line='    else:\n'
type=53 (OP)  string=':'  start=(5, 8)  end=(5, 9)  line='    else:\n'
type=4 (NEWLINE)  string='\n'  start=(5, 9)  end=(5, 10)  line='    else:\n'

type=5 (INDENT)  string='        '  start=(6, 0)  end=(6, 8)  line='        a = 1 # a comment\n'
type=1 (NAME)  string='a'  start=(6, 8)  end=(6, 9)  line='        a = 1 # a comment\n'
type=53 (OP)  string='='  start=(6, 10)  end=(6, 11)  line='        a = 1 # a comment\n'
type=2 (NUMBER)  string='1'  start=(6, 12)  end=(6, 13)  line='        a = 1 # a comment\n'
type=55 (COMMENT)  string='# a comment'  start=(6, 14)  end=(6, 25)  line='        a = 1 # a comment\n'
type=4 (NEWLINE)  string='\n'  start=(6, 25)  end=(6, 26)  line='        a = 1 # a comment\n'



### Getting particular tokens
Let's focus on the sixth line.

In [7]:
line = lines[5]
print(token_utils.untokenize(line))

        a = 1 # a comment



Ignoring the indentation, the first token is `a`; ignoring newline indicators and comments, the last token is `1`. We can get at these tokens using some utility functions.

In [12]:
print("The first useful token is:\n   ", token_utils.get_first(line))
print("The index of the first token is: ", token_utils.get_first_index(line))
print()
print("The last useful token on that line is:\n  ", token_utils.get_last(line))
print("Its index is", token_utils.get_last_index(line))

The first useful token is:
    type=1 (NAME)  string='a'  start=(6, 8)  end=(6, 9)  line='        a = 1 # a comment\n'
The index of the first token is:  1

The last useful token on that line is:
   type=2 (NUMBER)  string='1'  start=(6, 12)  end=(6, 13)  line='        a = 1 # a comment\n'
Its index is 2


Note that these four functions, `get_first`, `get_first_index`, `get_last` and `get_last_index`, exclude end-of-line comments by default. This can be changed by setting the optional parameter `exclude_comment` to `False`.

In [14]:
print(token_utils.get_last(line, exclude_comment=False))

type=55 (COMMENT)  string='# a comment'  start=(6, 14)  end=(6, 25)  line='        a = 1 # a comment\n'
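
The idea of a "useful" token can be sketched with the standard library `tokenize` module. The code below is an illustration only: the helper names `is_useful`, `first_useful` and `last_useful` are hypothetical, and the set of token types treated as non-useful is an assumption rather than the definition used by `token_utils`.

```python
import io
import tokenize

# Token types assumed to carry no "useful" content (an assumption).
SPACE_TYPES = {
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.ENDMARKER,
}


def is_useful(tok, exclude_comment=True):
    """Return True unless `tok` is layout-only or an excluded comment."""
    if tok.type in SPACE_TYPES:
        return False
    return not (exclude_comment and tok.type == tokenize.COMMENT)


def first_useful(tokens, exclude_comment=True):
    """Return the first useful token, or None if there is none."""
    return next((t for t in tokens if is_useful(t, exclude_comment)), None)


def last_useful(tokens, exclude_comment=True):
    """Return the last useful token by scanning the list backwards."""
    return first_useful(reversed(tokens), exclude_comment)


source = "if True:\n        a = 1 # a comment\n"
all_tokens = tokenize.generate_tokens(io.StringIO(source).readline)
line = [tok for tok in all_tokens if tok.start[0] == 2]  # tokens of row 2

print(first_useful(line).string)
print(last_useful(line).string)
print(last_useful(line, exclude_comment=False).string)
```

Scanning the reversed list reuses the same predicate for both ends of the line, so the comment handling lives in one place.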


### Getting the indentation of a line
This particular line starts with an `INDENT` token. We can get the indentation of that line, either by printing the length of the `INDENT` token string, or by looking at the `start_col` attribute of the first "useful" token. The attribute `start_col` is part of the two-tuple `start = (start_row, start_col)`. 

In [9]:
print(len(line[0].string))
first = token_utils.get_first(line)
print(first.start_col)

8
8


In general, **the second method is more reliable**. For example, if we look at the tokens of the previous line (line 5, index 4), we can see that its first token is a `DEDENT` whose string is empty, so the length of that string tells us nothing about the line's indentation. Furthermore, a given line may start with multiple `DEDENT` tokens. However, once again, the `start_col` attribute of the first "useful" token gives us this value.

In [15]:
for token in lines[4]:
    print(token)
print("-" * 50)
    
print(token_utils.untokenize(lines[4]))
first = token_utils.get_first(lines[4])
print("indentation = ", first.start_col)

type=6 (DEDENT)  string=''  start=(5, 4)  end=(5, 4)  line='    else:\n'
type=1 (NAME)  string='else'  start=(5, 4)  end=(5, 8)  line='    else:\n'
type=53 (OP)  string=':'  start=(5, 8)  end=(5, 9)  line='    else:\n'
type=4 (NEWLINE)  string='\n'  start=(5, 9)  end=(5, 10)  line='    else:\n'
--------------------------------------------------
    else:

indentation =  4
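
As a cross-check that does not depend on `token_utils`, the sketch below computes the indentation of every line from the start column of its first useful token, using only the standard library. The helper names `indentation_by_row` and `leading_dedents` are hypothetical, and the source is modeled on the example used in this section.

```python
import io
import tokenize

SOURCE = """
if True:
    if False:
        pass
    else:
        a = 1 # a comment
print('ok')
"""

# Token types skipped when looking for the first useful token (an assumption).
SKIPPED = {
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def indentation_by_row(source):
    """Map each row to the start column of its first useful token."""
    result = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        row, col = tok.start
        result.setdefault(row, col)  # keep only the first token of a row
    return result


def leading_dedents(source, row):
    """Count the DEDENT tokens that start on the given row."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(
        1 for tok in tokens
        if tok.type == tokenize.DEDENT and tok.start[0] == row
    )


indents = indentation_by_row(SOURCE)
print(indents)
print(leading_dedents(SOURCE, 7))
```

The `print('ok')` line closes two blocks at once, so it is preceded by more than one `DEDENT` token, each with an empty string, while the start column of `print` still gives the indentation directly.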
