This is the metapy tutorial on analyzers, tokenizers, and filters. Before starting, you should read the following two MeTA tutorials:
- [MeTA System Overview](https://meta-toolkit.org/overview-tutorial.html). Everything on this page is relevant to metapy except for the *Unit tests* section (you can't run them in Python).
- [Analyzers, Tokenizers, and Filters](https://meta-toolkit.org/analyzers-filters-tutorial.html). Everything on this page is relevant except for the *Extending MeTA With Your Own Filters* section.

Let's get started!

First, let's create a document to play with.

In [1]:
import metapy
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")

We can make our own filter chain and run it on the document's content. Let's start with a simple example of only using `ICUTokenizer`.

In [2]:
tok = metapy.analyzers.ICUTokenizer()
tok.set_content(doc.content())
[t for t in tok]

['<s>',
 'I',
 'said',
 'that',
 'I',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>']

Notice that the sentence boundary markers (`<s>` and `</s>`) are inserted at the beginning and end of each sentence. Iterating over a tokenizer or filter yields the tokens in document order.

Next, use `LowercaseFilter` to convert each token to lowercase. We pass `tok` (which is an `ICUTokenizer`) to the `LowercaseFilter` constructor. Wrapping one object in another this way lets us chain an arbitrary number of filters together, with a tokenizer at the start.

In [3]:
tok = metapy.analyzers.ICUTokenizer()
tok = metapy.analyzers.LowercaseFilter(tok)
tok.set_content(doc.content())
[t for t in tok]

['<s>',
 'i',
 'said',
 'that',
 'i',
 "can't",
 'believe',
 'that',
 'it',
 'only',
 'costs',
 '$',
 '19.95',
 '!',
 '</s>']
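The wrapping pattern can be illustrated without metapy. The following is a minimal pure-Python sketch, not metapy's implementation: `SimpleTokenizer` and `SimpleLowercaseFilter` are hypothetical stand-ins, the regular expression only approximates ICU tokenization, and the whole input is assumed to be a single sentence.

```python
import re


class SimpleTokenizer:
    """Hypothetical stand-in for a tokenizer (not part of metapy)."""

    def __init__(self):
        self._content = ""

    def set_content(self, content):
        self._content = content

    def __iter__(self):
        # Assumption: the whole content is one sentence, so a single
        # pair of sentence markers is emitted.
        yield "<s>"
        # Words (keeping inner apostrophes and periods) or single symbols.
        yield from re.findall(r"\w+(?:[.']\w+)*|[^\w\s]", self._content)
        yield "</s>"


class SimpleLowercaseFilter:
    """Hypothetical filter that wraps any token source and lowercases it."""

    def __init__(self, source):
        self._source = source

    def set_content(self, content):
        # Content is handed down the chain to the tokenizer at the start.
        self._source.set_content(content)

    def __iter__(self):
        for token in self._source:
            yield token.lower()


chain = SimpleLowercaseFilter(SimpleTokenizer())
chain.set_content("I said that I can't believe that it only costs $19.95!")
tokens = list(chain)
print(tokens)
```

Because each filter only needs something iterable that accepts `set_content`, further filters could wrap `chain` in the same way.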

Just like in MeTA, metapy's filter chain can be created from a config file. Create the following file, called `config.toml`. It performs the same tokenization and filtering as above (`ICUTokenizer -> LowercaseFilter`), then aggregates the token counts using an *n*-gram word analyzer.

```toml
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = [{type = "icu-tokenizer"}, {type = "lowercase"}]
```

Now, you can load this config file to create a unigram word analyzer. The analyzer uses the specified tokenizer/filter chain and analyzer type to convert a document into a dictionary of features and their counts.

In [4]:
ana = metapy.analyzers.load('config.toml')
ana.analyze(doc)

{"can't": 1,
 'believe': 1,
 'that': 2,
 'i': 2,
 '</s>': 1,
 '19.95': 1,
 'only': 1,
 'said': 1,
 '!': 1,
 '<s>': 1,
 '$': 1,
 'costs': 1,
 'it': 1}

The tokens *i* and *that* each have a count of 2, while every other token has a count of 1. These features can then be passed to other parts of metapy, such as ranking functions or indexers.
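The feature counts can also be inspected with plain Python. The sketch below copies the printed output into a literal dictionary so that it runs without metapy; the names `features`, `markers`, `words`, and `repeated` are illustrative only, and treating the result of `analyze` as a regular `dict` is an assumption based on how it is printed above.

```python
# Counts copied from the unigram output above.
features = {
    "can't": 1, "believe": 1, "that": 2, "i": 2, "</s>": 1,
    "19.95": 1, "only": 1, "said": 1, "!": 1, "<s>": 1,
    "$": 1, "costs": 1, "it": 1,
}

# Drop the sentence boundary markers before looking at the words.
markers = {"<s>", "</s>"}
words = {tok: n for tok, n in features.items() if tok not in markers}

# Tokens that occur more than once, in alphabetical order.
repeated = sorted(tok for tok, n in words.items() if n > 1)

# The counts sum to the number of tokens the filter chain emitted.
total = sum(features.values())

print(repeated, total)
```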

We can also manually specify the analyzer instead of loading it from the config file:

In [5]:
ana = metapy.analyzers.NGramWordAnalyzer(1, tok)
ana.analyze(doc)

{"can't": 1,
 'believe': 1,
 'that': 2,
 'i': 2,
 '</s>': 1,
 '19.95': 1,
 'only': 1,
 'said': 1,
 '!': 1,
 '<s>': 1,
 '$': 1,
 'costs': 1,
 'it': 1}

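Changing the first argument from 1 to 3 produces trigram features instead. Each key is then a tuple of three consecutive tokens:

In [6]: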
ana = metapy.analyzers.NGramWordAnalyzer(3, tok)
ana.analyze(doc)

{('$', '19.95', '!'): 1,
 ('19.95', '!', '</s>'): 1,
 ('costs', '$', '19.95'): 1,
 ('it', 'only', 'costs'): 1,
 ("can't", 'believe', 'that'): 1,
 ('believe', 'that', 'it'): 1,
 ('i', "can't", 'believe'): 1,
 ('said', 'that', 'i'): 1,
 ('that', 'it', 'only'): 1,
 ('only', 'costs', '$'): 1,
 ('<s>', 'i', 'said'): 1,
 ('i', 'said', 'that'): 1,
 ('that', 'i', "can't"): 1}
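To make the counting concrete, here is a minimal pure-Python sketch of sliding-window *n*-gram counting over the lowercased token list shown earlier. It is not how metapy implements `NGramWordAnalyzer`; `count_ngrams` is a hypothetical helper written for illustration.

```python
from collections import Counter

# Lowercased tokens, copied from the filter chain output above.
tokens = ["<s>", "i", "said", "that", "i", "can't", "believe", "that",
          "it", "only", "costs", "$", "19.95", "!", "</s>"]


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the list yields each window as a tuple.
    windows = zip(*(tokens[i:] for i in range(n)))
    if n == 1:
        # Use plain strings as keys for unigrams, mirroring the output above.
        return Counter(window[0] for window in windows)
    return Counter(windows)


unigrams = count_ngrams(tokens, 1)
trigrams = count_ngrams(tokens, 3)

print(unigrams["that"], unigrams["i"])
print(len(trigrams))
```

A list of 15 tokens contains 15 - 3 + 1 = 13 trigram windows, which matches the 13 entries in the output above.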

Usually, metapy applications create and call analyzers based on a config file, so you won't have to create your own manually. However, building one by hand can still be useful if you are performing your own analysis outside of MeTA.