# NLP Preprocessing

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.text import * 
from fastai import *

The `text.transform` module contains the functions that handle, behind the scenes, the two main tasks in preparing texts for the models: tokenization and numericalization.

The first one consists of splitting the raw texts into tokens (which can be words, punctuation marks...). The most basic way to do this would be to split on spaces, but it's possible to be more subtle; for instance, contractions like "isn't" or "don't" should be split into \["is","n't"\] or \["do","n't"\].
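To make the difference concrete, here is a minimal sketch that compares splitting on spaces with a slightly more careful regex-based split. The `tokenize` function and its pattern are hypothetical illustrations, not the tokenizer this module uses.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   n't            the contraction suffix as its own token
#   \w+?(?=n't)    the shortest run of word characters followed by n't
#   \w+            any other word
#   [^\w\s]        a single punctuation mark
TOKEN_RE = re.compile(r"n't|\w+?(?=n't)|\w+|[^\w\s]")


def tokenize(text):
    """Split text into words, punctuation marks and n't suffixes."""
    return TOKEN_RE.findall(text)


sentence = "It isn't hard, don't worry!"
print(sentence.split())    # punctuation stays glued to the words
print(tokenize(sentence))  # contractions and punctuation are separated
```

Splitting on spaces leaves tokens such as `"hard,"` and `"isn't"` intact, while the regex version separates the comma and the `n't` suffix.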

The second one is easier, as it just consists of assigning a unique id to each token and mapping each token to its id.
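As a rough illustration of that idea, the sketch below assigns ids in order of first appearance. The helper name `build_ids` is hypothetical, and the ordering is an assumption made for the example only.

```python
def build_ids(tokens):
    """Give each distinct token the next free integer id."""
    token_to_id = {}
    for tok in tokens:
        if tok not in token_to_id:
            token_to_id[tok] = len(token_to_id)
    return token_to_id


tokens = ["do", "n't", "worry", ",", "do", "n't"]
token_to_id = build_ids(tokens)
ids = [token_to_id[t] for t in tokens]
print(ids)
```

Repeated tokens map to the same id, so the id sequence has the same length as the token sequence.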

## Tokenization

This step is actually divided into two phases: first, we apply a list of `rules` to the raw texts as preprocessing, then we use the tokenizer to split them into lists of tokens. Combining those `rules`, the `tok_func` and the `lang` to process the texts is the role of the [`Tokenizer`](/text.transform.html#Tokenizer) class.

In [None]:
show_doc(Tokenizer, doc_string=False)

### <a id=Tokenizer></a><em>class</em> `Tokenizer`
`Tokenizer`(<code>tok_func</code>:<code>Callable</code>=`<class 'fastai.text.transform.SpacyTokenizer'>`, <code>lang</code>:<code>str</code>=`'en'`, <code>rules</code>:<code>Collection</code>[<code>Callable</code>[<code>str</code>, <code>str</code>]]=`None`, <code>special_cases</code>:<code>Collection</code>[<code>str</code>]=`None`, <code>n_cpus</code>:<code>int</code>=`None`)
<a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L82">[source]</a>

This class processes texts by applying the `rules` to them, then tokenizing them with `tok_func(lang)`. `special_cases` is a list of tokens passed to the tokenizer as special cases, and `n_cpus` is the number of CPUs to use for multiprocessing (by default, half the CPUs available). We don't pass a tokenizer instance directly because of multiprocessing: each process needs to instantiate a tokenizer of its own.
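The flow described above (rules first, then `tok_func(lang)`, with the tokenizer built inside the worker) can be sketched in plain Python. Everything here is a hypothetical stand-in: `WhitespaceTokenizer`, `run_pipeline` and the two rules are invented for the example and are not part of the library.

```python
from typing import List


class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer; `lang` is stored but unused."""

    def __init__(self, lang: str):
        self.lang = lang

    def tokenizer(self, t: str) -> List[str]:
        return t.split()


def strip_edges(t: str) -> str:
    return t.strip()


def lowercase(t: str) -> str:
    return t.lower()


def run_pipeline(texts, tok_func, lang, rules):
    # The tokenizer is built here, inside the function a worker would run,
    # instead of being passed in as a ready-made object.
    tok = tok_func(lang)
    results = []
    for t in texts:
        for rule in rules:  # rules run in list order, before tokenization
            t = rule(t)
        results.append(tok.tokenizer(t))
    return results


batch = run_pipeline(["  Hello World ", "FOO bar"], WhitespaceTokenizer, "en",
                     [strip_edges, lowercase])
print(batch)
```

Passing the class and the language, rather than an instance, means each worker can call `tok_func(lang)` itself and no tokenizer object has to be shared between processes.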

In [None]:
show_doc(Tokenizer.process_text)

#### <a id=process_text></a>`process_text`
`process_text`(<code>t</code>:<code>str</code>, <code>tok</code>:[<code>BaseTokenizer</code>](/text.transform.html#BaseTokenizer)) -> <code>List</code>[<code>str</code>]


Process one text `t` with tokenizer `tok`. <a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L96">[source]</a>

In [None]:
show_doc(Tokenizer.process_all)

#### <a id=process_all></a>`process_all`
`process_all`(<code>texts</code>:<code>Collection</code>[<code>str</code>]) -> <code>List</code>[<code>List</code>[<code>str</code>]]


Process a list of `texts`. <a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L107">[source]</a>

### Global Variable Definitions:

`default_rules = [fixup, replace_rep, replace_wrep, deal_caps, spec_add_spaces, rm_useless_spaces, sub_br]` <div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L78">[source]</a></div>

`default_spec_tok = [BOS, FLD, UNK, PAD]` <div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L79">[source]</a></div>

In [None]:
show_doc(BaseTokenizer)

### <a id=BaseTokenizer></a><em>class</em> `BaseTokenizer`(<code>lang</code>:<code>str</code>)<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L11">[source]</a></div>


Basic class for a tokenizer function.

[`BaseTokenizer`](/text.transform.html#BaseTokenizer)

In [None]:
show_doc(BaseTokenizer.add_special_cases)

#### <a id=add_special_cases></a>`add_special_cases`(<code>toks</code>:<code>Collection</code>[<code>str</code>])<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L17">[source]</a></div>

`BaseTokenizer.add_special_cases`

In [None]:
show_doc(BaseTokenizer.tokenizer)

#### <a id=tokenizer></a>`tokenizer`(<code>t</code>:<code>Doc</code>) -> <code>List</code>[<code>str</code>]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L16">[source]</a></div>

`BaseTokenizer.tokenizer`
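A custom tokenizer following the same shape (a constructor taking `lang`, a `tokenizer` method and an `add_special_cases` method) might look like the sketch below. `SimpleTokenizer` is a hypothetical standalone class that mirrors the documented signatures; it does not subclass the real `BaseTokenizer`, and `<unk>` is only an example marker.

```python
import re
from typing import Collection, List


class SimpleTokenizer:
    """Hypothetical regex tokenizer with the same three members as the base class."""

    def __init__(self, lang: str):
        self.lang = lang
        self.special: List[str] = []

    def add_special_cases(self, toks: Collection[str]) -> None:
        self.special.extend(toks)

    def tokenizer(self, t: str) -> List[str]:
        # Special cases are tried first (longest first) so they survive whole.
        parts = [re.escape(s) for s in sorted(self.special, key=len, reverse=True)]
        parts += [r"\w+", r"[^\w\s]"]
        return re.findall("|".join(parts), t)


tok = SimpleTokenizer("en")
before = tok.tokenizer("hello <unk>!")
tok.add_special_cases(["<unk>"])
after = tok.tokenizer("hello <unk>!")
print(before, after)
```

Without the special case, the marker is broken into `<`, `unk` and `>`; once registered, it is kept as one token.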

In [None]:
show_doc(SpacyTokenizer)

### <a id=SpacyTokenizer></a><em>class</em> `SpacyTokenizer`(<code>lang</code>:<code>str</code>) :: Inherits ([<code>BaseTokenizer</code>](fastai.text.transform.html#BaseTokenizer))<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L20">[source]</a></div>


Little wrapper around a <code>spacy</code> tokenizer

[`SpacyTokenizer`](/text.transform.html#SpacyTokenizer)

In [None]:
show_doc(SpacyTokenizer.add_special_cases)

#### <a id=add_special_cases></a>`add_special_cases`(<code>toks</code>:<code>Collection</code>[<code>str</code>])<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L29">[source]</a></div>

`SpacyTokenizer.add_special_cases`

In [None]:
show_doc(SpacyTokenizer.tokenizer)

#### <a id=tokenizer></a>`tokenizer`(<code>t</code>:<code>Doc</code>) -> <code>List</code>[<code>str</code>]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L26">[source]</a></div>

`SpacyTokenizer.tokenizer`

In [None]:
show_doc(Tokenizer.proc_text)

#### <a id=proc_text></a>`proc_text`
(<code>t</code>:<code>str</code>, <code>tok</code>:[<code>BaseTokenizer</code>](fastai.text.transform.html#BaseTokenizer)) -> <code>List</code>[<code>str</code>]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L95">[source]</a></div>


Processes one text

`Tokenizer.proc_text`

In [None]:
show_doc(Tokenizer.process_all)

#### <a id=process_all></a>`process_all`
(<code>texts</code>:<code>Collection</code>[<code>str</code>]) -> <code>List</code>[<code>List</code>[<code>str</code>]]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L106">[source]</a></div>


Processes a list of texts in several processes

`Tokenizer.process_all`

In [None]:
show_doc(Tokenizer.process_all_1)

#### <a id=process_all_1></a>`process_all_1`
(<code>texts</code>:<code>Collection</code>[<code>str</code>]) -> <code>List</code>[<code>List</code>[<code>str</code>]]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L100">[source]</a></div>


Processes a list of texts in one process

`Tokenizer.process_all_1`

In [None]:
show_doc(Vocab)

### <a id=Vocab></a><em>class</em> `Vocab`(<code>path</code>:<code>Union</code>[<code>Path</code>, <code>str</code>])<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L112">[source]</a></div>


Contains the correspondence between numbers and tokens and numericalizes

[`Vocab`](/text.transform.html#Vocab)

In [None]:
show_doc(Vocab.create)

#### <a id=create></a>`create`
(<code>path</code>:<code>Union</code>[<code>Path</code>, <code>str</code>], <code>tokens</code>:<code>Collection</code>[<code>Collection</code>[<code>str</code>]], <code>max_vocab</code>:<code>int</code>, <code>min_freq</code>:<code>int</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L127">[source]</a></div>


Create a vocabulary from a set of tokens.

`Vocab.create`
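The sketch below shows one plausible reading of the `max_vocab` and `min_freq` parameters: keep at most `max_vocab` of the most frequent tokens and drop those seen fewer than `min_freq` times. This interpretation is an assumption based on the parameter names; the helper `build_itos` is hypothetical and ignores `path` and any special tokens.

```python
from collections import Counter


def build_itos(tokens, max_vocab, min_freq):
    """Return the kept tokens, most frequent first (list index = id)."""
    freq = Counter(tok for doc in tokens for tok in doc)
    return [tok for tok, count in freq.most_common(max_vocab) if count >= min_freq]


docs = [["a", "b", "a"], ["a", "c", "b"]]
print(build_itos(docs, max_vocab=10, min_freq=1))
print(build_itos(docs, max_vocab=10, min_freq=2))
print(build_itos(docs, max_vocab=2, min_freq=1))
```

Both limits shrink the vocabulary: `min_freq=2` removes the token seen only once, and `max_vocab=2` keeps just the two most frequent tokens.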

In [None]:
show_doc(Vocab.numericalize)

#### <a id=numericalize></a>`numericalize`
(<code>t</code>:<code>Collection</code>[<code>str</code>]) -> <code>List</code>[<code>int</code>]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L119">[source]</a></div>


Converts a list of tokens to their ids

`Vocab.numericalize`

In [None]:
show_doc(Vocab.textify)

#### <a id=textify></a>`textify`
(<code>nums</code>:<code>Collection</code>[<code>int</code>]) -> <code>List</code>[<code>str</code>]<div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L123">[source]</a></div>


Converts a list of ids to their tokens

`Vocab.textify`
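The two directions can be sketched with a list and a dictionary. The names `to_ids` and `to_tokens`, the `<unk>` marker and the choice to send unknown tokens to id 0 are all assumptions made for this example, not a description of the class above.

```python
itos = ["<unk>", "hello", "world"]            # id -> token
stoi = {tok: i for i, tok in enumerate(itos)}  # token -> id


def to_ids(tokens):
    # Tokens missing from the vocabulary fall back to id 0 (the unknown marker).
    return [stoi.get(t, 0) for t in tokens]


def to_tokens(nums):
    return [itos[i] for i in nums]


ids = to_ids(["hello", "there", "world"])
print(ids, to_tokens(ids))
```

The round trip is lossless only for tokens that are in the vocabulary; anything else comes back as the unknown marker.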

In [None]:
show_doc(deal_caps)

#### <a id=deal_caps></a>`deal_caps`(<code>t</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L62">[source]</a></div>


Replace words in all caps

[`deal_caps`](/text.transform.html#deal_caps)
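One way such a rule could work is to lowercase each all-caps word and put a marker in front of it, so the information that it was capitalized is not lost. This is a sketch of the idea only: the `<up>` marker, the `mark_caps` name and the "longer than one character" threshold are hypothetical.

```python
import re

UP = "<up>"  # hypothetical marker token


def mark_caps(t):
    """Lowercase all-caps words and prefix them with the UP marker."""
    out = []
    # Alternate runs of word characters and non-word characters.
    for piece in re.findall(r"\w+|\W+", t):
        if piece.isupper() and len(piece) > 1:
            out += [UP, " ", piece.lower()]
        else:
            out.append(piece)
    return "".join(out)


print(mark_caps("I LOVE this"))
```

Single capital letters such as "I" are left alone here so that ordinary sentence text is not flooded with markers.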

In [None]:
show_doc(fixup)

#### <a id=fixup></a>`fixup`(<code>x</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L69">[source]</a></div>


List of replacements from html strings

[`fixup`](/text.transform.html#fixup)
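A rule of this kind can be sketched as an HTML unescape followed by a short list of literal replacements for broken leftovers. The `clean_html` name and the entries in `REPLACEMENTS` are hypothetical examples and are not the list used by the library.

```python
import html

# Hypothetical leftovers: entities that lost their leading "&".
REPLACEMENTS = [("#39;", "'"), ("quot;", '"')]


def clean_html(t):
    """Decode well-formed HTML entities, then patch known broken ones."""
    t = html.unescape(t)
    for old, new in REPLACEMENTS:
        t = t.replace(old, new)
    return t


print(clean_html("fish &amp; chips, it#39;s good"))
```

Unescaping runs first so that well-formed entities such as `&quot;` are decoded as a whole before the literal replacements look for their truncated forms.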

In [None]:
show_doc(replace_rep)

#### <a id=replace_rep></a>`replace_rep`(<code>t</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L46">[source]</a></div>


Replace repetitions at the character level

[`replace_rep`](/text.transform.html#replace_rep)
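As an illustration, a run of the same character can be replaced by a marker, the run length and the character itself. The `<rep>` marker, the `mark_char_reps` name and the threshold of four repetitions are assumptions for this sketch.

```python
import re

REP = "<rep>"  # hypothetical marker token
RE_REP = re.compile(r"(\S)\1{3,}")  # one non-space character, four or more times


def mark_char_reps(t):
    def _sub(m):
        # m.group(0) is the whole run, m.group(1) the repeated character.
        return f" {REP} {len(m.group(0))} {m.group(1)} "
    return RE_REP.sub(_sub, t)


print(mark_char_reps("soooo good"))
print(mark_char_reps("too good"))
```

The model then sees one marker plus a count instead of many slightly different spellings such as "sooo", "soooo" and "sooooo".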

In [None]:
show_doc(replace_wrep)

#### <a id=replace_wrep></a>`replace_wrep`(<code>t</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L54">[source]</a></div>


Replace word repetitions

[`replace_wrep`](/text.transform.html#replace_wrep)
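The same idea applies at the word level: a word repeated several times in a row becomes a marker, a count and the word. The `<wrep>` marker and the `mark_word_reps` name are hypothetical, and this sketch treats any consecutive repetition (two or more) as a run.

```python
import re

WREP = "<wrep>"  # hypothetical marker token
# A word followed by one or more whitespace-separated copies of itself.
RE_WREP = re.compile(r"\b(\w+)(?:\s+\1\b)+")


def mark_word_reps(t):
    def _sub(m):
        count = len(m.group(0).split())
        return f"{WREP} {count} {m.group(1)}"
    return RE_WREP.sub(_sub, t)


print(mark_word_reps("very very very good"))
```

The trailing `\b` in the pattern prevents partial matches, so "the then" is not treated as a repetition of "the".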

In [None]:
show_doc(rm_useless_spaces)

#### <a id=rm_useless_spaces></a>`rm_useless_spaces`(<code>t</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L42">[source]</a></div>


Remove multiple spaces

[`rm_useless_spaces`](/text.transform.html#rm_useless_spaces)
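A rule like this is typically a one-line regular expression. The sketch below, with the hypothetical name `collapse_spaces`, shows the idea; whether the library also touches tabs or newlines is not stated here, so only plain spaces are handled.

```python
import re


def collapse_spaces(t):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", t)


print(collapse_spaces("a   b  c"))
```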

In [None]:
show_doc(spec_add_spaces)

#### <a id=spec_add_spaces></a>`spec_add_spaces`(<code>t</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L38">[source]</a></div>


Add spaces between special characters

[`spec_add_spaces`](/text.transform.html#spec_add_spaces)
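The sketch below pads a couple of special characters with spaces so that they become separate tokens. Which characters count as special is an assumption (here `/` and `#`), and `pad_special_chars` is a hypothetical name.

```python
import re


def pad_special_chars(t):
    """Surround each / or # with spaces."""
    return re.sub(r"([/#])", r" \1 ", t)


padded = pad_special_chars("and/or #1")
print(padded)
```

Padding can leave double spaces behind, which is presumably why `rm_useless_spaces` comes after `spec_add_spaces` in `default_rules`.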

In [None]:
show_doc(sub_br)

#### <a id=sub_br></a>`sub_br`(<code>t</code>:<code>str</code>) -> <code>str</code><div style="text-align: right"><a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/text/transform.py#L33">[source]</a></div>


Replaces the `<br />` by a newline (`\n`)

[`sub_br`](/text.transform.html#sub_br)
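A minimal sketch of such a rule, assuming the replacement is a newline character and that variants like `<br>` or `<BR/>` should be caught as well. The `br_to_newline` name and the exact pattern are hypothetical.

```python
import re

RE_BR = re.compile(r"<\s*br\s*/?>", re.IGNORECASE)


def br_to_newline(t):
    """Turn HTML line-break tags into newline characters."""
    return RE_BR.sub("\n", t)


print(br_to_newline("line one<br />line two<BR>end"))
```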