# the `teeth.array` module: explore and tokenize text

## intro

text is deep. its size is epic. its rules are murky. it summons things from across timespace. it has to be chaotic, so everything can fit.

even just splitting text into tokens is hard. there are rules, but those almost always have exceptions, and the exceptions tend to multiply. it's helpful to experiment and adjust, in small steps.

these data structures are intended as a tool to do just that.

## *technical detail

this notebook is bundled with source; we need to make sure the cells below can import from it:

In [None]:
import sys
import os

package_path = os.path.abspath( '..' )

if package_path not in sys.path:
    sys.path.append( package_path )

## getting started

let's try this out on a novel borrowed from the public domain:

In [None]:
raw_text_path = 'moby_dick.txt'

with open( raw_text_path, 'r' ) as f:
    raw_text = f.read()

In [None]:
raw_text[ 0 : 27 ]

the `str` returned from the file read is the starting point for our new data structure. let's use it to create a new instance:

In [None]:
from teeth.array import TextStrata

t = TextStrata( raw_text )

initially, a `TextStrata` exposes indices and slices just like the underlying string. the smallest tokens are characters, and slices are just subsequences of these:

In [None]:
t[ 0 : 27 ]

a common next step would be to tokenize the text into words. if we can define which strings are not words, `TextStrata` will construct a split of the string that distinguishes words from separators.

here's a first attempt:

In [None]:
from teeth.array import split

def not_a_word( x ):
    return x in ' \n'

with split( not_a_word, t ) as words:
    print( words[ 0 : 7 ] )

- the `split` expression takes a predicate as its first argument and an instance of `TextStrata` as its second.
- any token for which the predicate returns true will be identified as a delimiter.
- indexing and slicing work much like they do on a normal `list`, both before and after the split.
- within the scope of the split, ordinal indices identify tokens induced by the predicate.
- the split is scoped to the `with` statement; the underlying value of `t` does not change.
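
to make the idea of a predicate-driven split concrete, here is a minimal sketch in plain python. it is not how `teeth.array` is implemented; `split_by_predicate` is a hypothetical helper, and it assumes that consecutive delimiters are grouped into a single token and that delimiters are kept as tokens rather than dropped:

```python
from itertools import groupby

def split_by_predicate( items, is_delimiter ):
    # hypothetical helper, not part of teeth.array:
    # group consecutive items into alternating runs of
    # delimiters and non-delimiters, keeping both kinds of run
    return [ ''.join( run ) for _, run in groupby( items, key=is_delimiter ) ]

def not_a_word( x ):
    return x in ' \n'

sample = 'Call me Ishmael. Some years ago'
tokens = split_by_predicate( sample, not_a_word )

print( tokens[ 0 : 7 ] )
```

in this sketch `'Ishmael.'` keeps its trailing period, since `'.'` is not yet treated as a delimiter.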

that split is not quite clean. the whitespace has been separated, but not the punctuation. that's easy to fix! all we have to do is adjust the predicate:

In [None]:
def not_a_word( x ):
    return x in ' \n;,.!?'

with split( not_a_word, t ) as words:
    print( words[ 0 : 7 ] )

note again that the underlying contents of `t` don't change outside the scope of the `with`:

In [None]:
print( t[ 0 : 7 ] )

this behavior is quite useful when doing exploratory work in a notebook, since it prevents accidentally changing the underlying data when cells are executed multiple times, or out of order.
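
one way to get this kind of scoping is a context manager that builds a temporary view and never touches the original. the sketch below uses a plain `str` and a hypothetical `scoped_split` function; it illustrates the pattern only, and makes no claim about how `split` works internally:

```python
from contextlib import contextmanager
from itertools import groupby

@contextmanager
def scoped_split( is_delimiter, text ):
    # hypothetical stand-in for a scoped split:
    # build a fresh list of tokens; `text` itself is never modified
    tokens = [ ''.join( run ) for _, run in groupby( text, key=is_delimiter ) ]
    yield tokens

text = 'Call me Ishmael.'

with scoped_split( lambda c: c in ' \n', text ) as words:
    inside = words[ 0 : 3 ]   # token-level slice, valid inside the block

outside = text[ 0 : 3 ]       # character-level slice, unchanged by the split

print( inside )
print( outside )
```

re-running a cell like this any number of times gives the same result, because nothing outside the `with` block is reassigned.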

## persistent tokenization

splitting the same text over and over is computationally expensive: each split requires at least `O(n)` time in the length of the text. once the right split has been worked out, it would be helpful to make it persistent. the `TextStrata` object exposes a method for doing so:

In [None]:
t.split_where( not_a_word )

now, subsequent slices into `t` will reference tokens generated by the split, instead of the underlying characters:

In [None]:
print( t[ 0 : 7 ] )

In [None]:
print( t[ 111198 : 111256 ] )

`t` is also iterable by token:

In [None]:
token = iter( t )
for _ in range( 7 ):
    print( next( token ) )
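
as a rough illustration of what a persistent split buys us, here is a toy container that regroups its tokens once and then serves slices and iteration from the stored result. `TokenStore` is a hypothetical name, and the grouping rule (runs of delimiters become single tokens) is an assumption, not a description of `TextStrata`:

```python
from itertools import groupby

class TokenStore:
    # hypothetical stand-in for a container with a persistent split

    def __init__( self, text ):
        self._tokens = list( text )   # start at character level

    def split_where( self, is_delimiter ):
        # regroup the current tokens once and keep the result
        self._tokens = [
            ''.join( run )
            for _, run in groupby( self._tokens, key=is_delimiter )
        ]

    def __getitem__( self, index ):
        return self._tokens[ index ]

    def __iter__( self ):
        return iter( self._tokens )

store = TokenStore( 'Call me Ishmael.' )
before = store[ 0 : 4 ]                       # characters

store.split_where( lambda c: c in ' \n' )
after = store[ 0 : 3 ]                        # word-level tokens

first_two = [ tok for _, tok in zip( range( 2 ), store ) ]

print( before, after, first_two )
```

after `split_where` has run, slicing and iteration cost only as much as reading the stored tokens; the `O(n)` pass over the text is paid once.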

## layered token splits

in many nlp tasks, we'll be interested in higher-order tokens, i.e. tokens composed of other tokens. for example, documents are often represented as sequences of sentences, which are in turn sequences of words.

the interface exposed by `TextStrata` generalizes seamlessly to these tasks. all we have to do is define a predicate for the new split. let's try it out:

In [None]:
def sentence_ends( x ):
    return 0 < len( x ) and x[ 0 ][ 0 ] in '.?!'

with split( sentence_ends, t ) as sentences:
    for ix in range( 10050, 10060 ):
        print( sentences[ ix ] )
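
the same two-step idea can be sketched on a short string with nothing but `itertools.groupby`. this is an illustration under assumptions, not the library's behavior: word tokens are plain strings here, delimiter runs are grouped, and `is_separator` is a hypothetical name for the word-level predicate:

```python
from itertools import groupby

def is_separator( ch ):
    return ch in ' \n;,.!?'

def sentence_ends( x ):
    # x is a word-level token; x[ 0 ][ 0 ] is its first character
    return 0 < len( x ) and x[ 0 ][ 0 ] in '.?!'

text = 'Call me Ishmael. Some years ago, never mind how long.'

# first layer: characters grouped into words and separators
words = [ ''.join( run ) for _, run in groupby( text, key=is_separator ) ]

# second layer: word-level tokens grouped into sentences and sentence ends
sentences = [ list( run ) for _, run in groupby( words, key=sentence_ends ) ]

print( sentences[ 0 ] )
print( sentences[ 1 ] )
```

note that `', '` starts with a comma, so it stays inside the second sentence, while `'. '` is picked out as a sentence boundary.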

the underlying layers are still accessible, using the `layer` function:

In [None]:
from teeth.array import layer

with layer( 0, t ) as char_level:
    print( char_level[ 0 : 27 ] )
    
with layer( 1, t ) as word_level:
    print( word_level[ 0 : 7 ] )

layers are counted upward; `layer( 0, t )` exposes the underlying iterable of characters, while `layer( 1, t )` exposes the word-level split added by `split_where`.
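
one simple way to picture this numbering is a list of layers, where each persistent split appends a new grouping of the layer beneath it. the sketch below is a toy model; `LayeredText` and `layer_view` are hypothetical names, and each token above layer 0 is stored as a list of the tokens under it:

```python
from contextlib import contextmanager
from itertools import groupby

class LayeredText:
    # hypothetical toy model of stacked token layers

    def __init__( self, text ):
        self.layers = [ list( text ) ]   # layer 0: characters

    def split_where( self, is_delimiter ):
        # group the topmost layer and push the result as a new layer
        top = self.layers[ -1 ]
        self.layers.append(
            [ list( run ) for _, run in groupby( top, key=is_delimiter ) ]
        )

@contextmanager
def layer_view( n, layered ):
    # expose layer n for the duration of the block
    yield layered.layers[ n ]

doc = LayeredText( 'Call me' )
doc.split_where( lambda c: c in ' \n' )

with layer_view( 0, doc ) as chars:
    first_chars = chars[ 0 : 4 ]

with layer_view( 1, doc ) as words:
    first_words = [ ''.join( w ) for w in words[ 0 : 3 ] ]

print( first_chars )
print( first_words )
```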