# the `teeth.array` module: explore and tokenize text

## intro

text is deep. its size is epic. its rules are murky. it summons things from across timespace. it has to be chaotic, so everything can fit.

even just splitting text into tokens is hard. there are rules, but they almost always have exceptions, and the exceptions tend to multiply. it helps to experiment and adjust in small steps.

these data structures are intended as a tool to do just that.

## technical detail

this notebook is bundled with source; we need to make sure the cells below can import from it:

In [None]:
import sys
import os

package_path = os.path.abspath( '..' )

if package_path not in sys.path:
    sys.path.append( package_path )

## getting started

let's try this out on a novel borrowed from the public domain:

In [None]:
raw_text_path = 'moby_dick.txt'

with open( raw_text_path, 'r' ) as f:
    raw_text = f.read()

In [None]:
raw_text[ 0 : 27 ]

the `str` returned from the file read is the starting point for our new data structure. let's use it to create a new instance:

In [None]:
from teeth.array import TextStrata

t = TextStrata( raw_text )

initially, a `TextStrata` exposes indices and slices just like the underlying string. the smallest tokens are characters, and slices are just subsequences of these:

In [None]:
t[ 0 : 27 ]

a common next step would be to tokenize the text into words. if we can define which strings are not words, `TextStrata` will construct a split of the string that distinguishes words from separators.

here's a first attempt:

In [None]:
from teeth.array import split
from teeth.pattern import matches

def not_a_word( x ):
    return x in ' \n'

with split( not_a_word, t ) as words:
    print( words[ 0 : 7 ] )

- the `matches` function takes a regular expression and returns a bool-valued function that returns true if a string matches. it is imported here but not used yet; `not_a_word` is a hand-written predicate of the same shape.
- the `split` expression takes the predicate as its first argument and uses it to separate values in the `TextStrata` argument into tokens and delimiters.
- the split is scoped to the `with` statement; the underlying value of `t` does not change.
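
to make the idea concrete, here is a minimal sketch in plain python of what a predicate-driven split might look like. it is not the `teeth` implementation: `matches_sketch` and `split_sketch` are hypothetical names, the sample sentence is made up for the demo, and the choice to require a match on the whole string (rather than any part of it) is an assumption.

```python
import re
from itertools import groupby

def matches_sketch(pattern):
    """Hypothetical stand-in for `matches`: turn a regex into a bool-valued predicate."""
    compiled = re.compile(pattern)
    # assumption: the whole string has to match, not just a part of it
    return lambda s: compiled.fullmatch(s) is not None

def split_sketch(is_delimiter, text):
    """Group consecutive characters into (is_delimiter, run) pairs."""
    return [(flag, ''.join(run)) for flag, run in groupby(text, key=is_delimiter)]

is_space = matches_sketch(r'\s')
pieces = split_sketch(is_space, 'Call me Ishmael.')

# keep only the runs that the predicate did not flag as delimiters
tokens = [run for flag, run in pieces if not flag]
print(tokens)
```

on this sample the last token is `Ishmael.` with its full stop still attached, because the predicate only knows about whitespace.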

that split is not quite clean. the whitespace has been separated, but not the punctuation. that's easy to fix! all we have to do is adjust the predicate:

In [None]:
def not_a_word( x ):
    return x in ' \n;,.!?'

with split( not_a_word, t ) as words:
    print( words[ 0 : 7 ] )
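
listing separators by hand gets tedious as the exceptions pile up. one option, sketched here with the standard library only, is to build the separator set from `string.whitespace` and `string.punctuation`. the names `SEPARATORS` and `not_a_word_sketch` are hypothetical and not part of `teeth`.

```python
import string

# every ASCII whitespace and punctuation character counts as a separator
SEPARATORS = frozenset(string.whitespace + string.punctuation)

def not_a_word_sketch(x):
    """True when x is exactly one separator character."""
    return x in SEPARATORS

print(not_a_word_sketch(','), not_a_word_sketch('a'))
```

set membership is also stricter than `x in ' \n;,.!?'`: the `str` version is a substring test, so it is true for the empty string and for multi-character runs like `',.'`. the trade-off is that this set treats `'` and `-` as separators too, which would split words like `don't`.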