# NLP datasets

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.text import * 


This module contains the [`TextDataset`](/text.data.html#TextDataset) class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in [`text.transform`](/text.transform.html#text.transform). It also contains all the functions to quickly get a [`TextDataBunch`](/text.data.html#TextDataBunch) ready.

## Quickly assemble your data

To make the most of the fastai library, get your data into one of the following formats and use the matching factory method of one of the [`TextDataBunch`](/text.data.html#TextDataBunch) classes:
- raw text files in `train`, `valid` and `test` folders, organized in an ImageNet style,
- a csv where some column(s) give the label(s) and the following one gives the associated text,
- a dataframe structured the same way,
- tokens and labels arrays,
- ids, vocabulary (correspondence from id to word) and labels.

If you are assembling the data for a language model, set all your labels to 0 so that the data still fits those formats. The first time you create a [`DataBunch`](/basic_data.html#DataBunch) with one of those functions, your data will be preprocessed automatically. You can save it, so that loading it the next time is almost instantaneous.
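
As a rough illustration of the csv format listed above, the sketch below writes and reads back a tiny file with the label in the first column and the text in the second. The rows and column names are made up for the example; only the column order (label first, text second) follows the formats described here.

```python
import csv
import io

# Hypothetical rows: label first, text second.
rows = [
    ("negative", "I walked out before the end."),
    ("positive", "A well-made film with great acting."),
]

# Write the rows to an in-memory csv with a header line.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["label", "text"])
writer.writerows(rows)

# Read the csv back to check the layout.
buffer.seek(0)
parsed = list(csv.reader(buffer))
header, records = parsed[0], parsed[1:]
labels = [record[0] for record in records]
texts = [record[1] for record in records]
print(header, labels)
```

For a language model, the label column would simply hold 0 on every row.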

Below are the classes that help assemble the raw data into a [`DataBunch`](/basic_data.html#DataBunch) suitable for NLP.

In [None]:
show_doc(TextLMDataBunch, title_level=3)

<h3 id="TextLMDataBunch"><code>class</code> <code>TextLMDataBunch</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L196" class="source_link">[source]</a></h3>

> <code>TextLMDataBunch</code>(<b>`train_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), <b>`valid_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), <b>`fix_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=<b><i>`None`</i></b>, <b>`test_dl`</b>:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=<b><i>`None`</i></b>, <b>`device`</b>:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=<b><i>`None`</i></b>, <b>`tfms`</b>:`Optional`\[`Collection`\[`Callable`\]\]=<b><i>`None`</i></b>, <b>`path`</b>:`PathOrStr`=<b><i>`'.'`</i></b>, <b>`collate_fn`</b>:`Callable`=<b><i>`'data_collate'`</i></b>, <b>`no_check`</b>:`bool`=<b><i>`False`</i></b>) :: [`TextDataBunch`](/text.data.html#TextDataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) suitable for training a language model.  

All the texts in the [`datasets`](/datasets.html#datasets) are concatenated and the labels are ignored. Instead, the target is the next word in the sentence.
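
The idea of "the target is the next word" can be sketched in plain Python. This is only an illustration with made-up token ids, not the library's implementation: the texts are concatenated into one stream, and the target is the same stream shifted by one position.

```python
# Hypothetical numericalized texts (lists of token ids); labels are ignored.
texts = [[2, 10, 11, 12], [2, 20, 21], [2, 30, 31, 32, 33]]

# Concatenate every text into one long stream of ids.
stream = [tok for text in texts for tok in text]

# The input is the stream without its last id; the target is the same
# stream shifted one position to the left (the next token each time).
inputs = stream[:-1]
targets = stream[1:]

for x, y in list(zip(inputs, targets))[:4]:
    print(f"input {x:>2} -> target {y:>2}")
```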

In [None]:
show_doc(TextLMDataBunch.create)

<h4 id="TextLMDataBunch.create"><code>create</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L198" class="source_link">[source]</a></h4>

> <code>create</code>(<b>`train_ds`</b>, <b>`valid_ds`</b>, <b>`test_ds`</b>=<b><i>`None`</i></b>, <b>`path`</b>:`PathOrStr`=<b><i>`'.'`</i></b>, <b>`no_check`</b>:`bool`=<b><i>`False`</i></b>, <b>`kwargs`</b>) → [`DataBunch`](/basic_data.html#DataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) in `path` from the `datasets` for language modelling.  

In [None]:
show_doc(TextClasDataBunch, title_level=3)

<h3 id="TextClasDataBunch"><code>class</code> <code>TextClasDataBunch</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L205" class="source_link">[source]</a></h3>

> <code>TextClasDataBunch</code>(<b>`train_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), <b>`valid_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), <b>`fix_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=<b><i>`None`</i></b>, <b>`test_dl`</b>:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=<b><i>`None`</i></b>, <b>`device`</b>:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=<b><i>`None`</i></b>, <b>`tfms`</b>:`Optional`\[`Collection`\[`Callable`\]\]=<b><i>`None`</i></b>, <b>`path`</b>:`PathOrStr`=<b><i>`'.'`</i></b>, <b>`collate_fn`</b>:`Callable`=<b><i>`'data_collate'`</i></b>, <b>`no_check`</b>:`bool`=<b><i>`False`</i></b>) :: [`TextDataBunch`](/text.data.html#TextDataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) suitable for training an RNN classifier.  

In [None]:
show_doc(TextClasDataBunch.create)

<h4 id="TextClasDataBunch.create"><code>create</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L207" class="source_link">[source]</a></h4>

> <code>create</code>(<b>`train_ds`</b>, <b>`valid_ds`</b>, <b>`test_ds`</b>=<b><i>`None`</i></b>, <b>`path`</b>:`PathOrStr`=<b><i>`'.'`</i></b>, <b>`bs`</b>=<b><i>`64`</i></b>, <b>`pad_idx`</b>=<b><i>`1`</i></b>, <b>`pad_first`</b>=<b><i>`True`</i></b>, <b>`no_check`</b>:`bool`=<b><i>`False`</i></b>, <b>`kwargs`</b>) → [`DataBunch`](/basic_data.html#DataBunch)

Function that transforms the `datasets` into a [`DataBunch`](/basic_data.html#DataBunch) for classification.  

All the texts are grouped by length (with a bit of randomness for the training set), then padded so that the samples in a batch all have the same length.

In [None]:
show_doc(TextDataBunch, title_level=3)

<h3 id="TextDataBunch"><code>class</code> <code>TextDataBunch</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L104" class="source_link">[source]</a></h3>

> <code>TextDataBunch</code>(<b>`train_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), <b>`valid_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), <b>`fix_dl`</b>:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=<b><i>`None`</i></b>, <b>`test_dl`</b>:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=<b><i>`None`</i></b>, <b>`device`</b>:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=<b><i>`None`</i></b>, <b>`tfms`</b>:`Optional`\[`Collection`\[`Callable`\]\]=<b><i>`None`</i></b>, <b>`path`</b>:`PathOrStr`=<b><i>`'.'`</i></b>, <b>`collate_fn`</b>:`Callable`=<b><i>`'data_collate'`</i></b>, <b>`no_check`</b>:`bool`=<b><i>`False`</i></b>) :: [`DataBunch`](/basic_data.html#DataBunch)

General class to get a [`DataBunch`](/basic_data.html#DataBunch) for NLP. Subclassed by [`TextLMDataBunch`](/text.data.html#TextLMDataBunch) and [`TextClasDataBunch`](/text.data.html#TextClasDataBunch).  

In [None]:
jekyll_warn("This class can only work directly if all the texts have the same length.")

<div markdown="span" class="alert alert-danger" role="alert"><i class="fa fa-danger-circle"></i> <b>Warning: </b>This class can only work directly if all the texts have the same length.</div>

### Factory methods (TextDataBunch)

All those classes have the following factory methods.

In [None]:
show_doc(TextDataBunch.from_folder)

<h4 id="TextDataBunch.from_folder"><code>from_folder</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L183" class="source_link">[source]</a></h4>

> <code>from_folder</code>(<b>`path`</b>:`PathOrStr`, <b>`train`</b>:`str`=<b><i>`'train'`</i></b>, <b>`valid`</b>:`str`=<b><i>`'valid'`</i></b>, <b>`test`</b>:`Optional`\[`str`\]=<b><i>`None`</i></b>, <b>`classes`</b>:`ArgStar`=<b><i>`None`</i></b>, <b>`tokenizer`</b>:[`Tokenizer`](/text.transform.html#Tokenizer)=<b><i>`None`</i></b>, <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`kwargs`</b>)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) from text files in folders.  

The folders are scanned in `path`, which should contain a <code>train</code> folder, a `valid` folder and maybe a `test` folder. Text files in the <code>train</code> and `valid` folders should be placed in subdirectories according to their classes (not applicable for a language model). `tokenizer` will be used to parse those texts into tokens.

You can pass a specific `vocab` for the numericalization step (if you are building a classifier from a language model you fine-tuned for instance). `kwargs` will be split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections on language model data and classifier data).
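
The folder layout described above can be sketched as follows. The class names (`pos`, `neg`), file names and texts are hypothetical; the sketch only builds the directory tree in a temporary folder and reads the label back from each file's parent folder, which is the convention the layout relies on.

```python
import tempfile
from pathlib import Path

# Hypothetical class names and file contents, for illustration only.
samples = {
    "train": {"pos": ["Loved it."], "neg": ["Hated it."]},
    "valid": {"pos": ["Great fun."], "neg": ["Too long."]},
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create root/<split>/<class>/<n>.txt for every sample.
    for split, classes in samples.items():
        for cls, texts in classes.items():
            folder = root / split / cls
            folder.mkdir(parents=True)
            for n, text in enumerate(texts):
                (folder / f"{n}.txt").write_text(text, encoding="utf-8")

    # Recover (split, label, text) triples: the label is the parent folder name.
    found = sorted(
        (f.parent.parent.name, f.parent.name, f.read_text(encoding="utf-8"))
        for f in root.glob("*/*/*.txt")
    )

for split, label, text in found:
    print(split, label, text)
```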

In [None]:
show_doc(TextDataBunch.from_csv)

<h4 id="TextDataBunch.from_csv"><code>from_csv</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L170" class="source_link">[source]</a></h4>

> <code>from_csv</code>(<b>`path`</b>:`PathOrStr`, <b>`csv_name`</b>, <b>`valid_pct`</b>:`float`=<b><i>`0.2`</i></b>, <b>`test`</b>:`Optional`\[`str`\]=<b><i>`None`</i></b>, <b>`tokenizer`</b>:[`Tokenizer`](/text.transform.html#Tokenizer)=<b><i>`None`</i></b>, <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`classes`</b>:`StrList`=<b><i>`None`</i></b>, <b>`header`</b>=<b><i>`'infer'`</i></b>, <b>`text_cols`</b>:`IntsOrStrs`=<b><i>`1`</i></b>, <b>`label_cols`</b>:`IntsOrStrs`=<b><i>`0`</i></b>, <b>`label_delim`</b>:`str`=<b><i>`None`</i></b>, <b>`kwargs`</b>) → [`DataBunch`](/basic_data.html#DataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) from texts in csv files.  

This method will look for `csv_name` in `path`, and maybe a `test` csv file, both opened with `header`. You can specify `text_cols` and `label_cols`. If there are several `text_cols`, the texts will be concatenated together with an optional field token. If there are several `label_cols`, the labels will be assumed to be one-hot encoded and `classes` will default to `label_cols` (you can ignore that argument for a language model). `tokenizer` will be used to parse those texts into tokens.

You can pass a specific `vocab` for the numericalization step (if you are building a classifier from a language model you fine-tuned for instance). `kwargs` will be split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections on language model data and classifier data).

In [None]:
show_doc(TextDataBunch.from_df)

<h4 id="TextDataBunch.from_df"><code>from_df</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L156" class="source_link">[source]</a></h4>

> <code>from_df</code>(<b>`path`</b>:`PathOrStr`, <b>`train_df`</b>:`DataFrame`, <b>`valid_df`</b>:`DataFrame`, <b>`test_df`</b>:`OptDataFrame`=<b><i>`None`</i></b>, <b>`tokenizer`</b>:[`Tokenizer`](/text.transform.html#Tokenizer)=<b><i>`None`</i></b>, <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`classes`</b>:`StrList`=<b><i>`None`</i></b>, <b>`text_cols`</b>:`IntsOrStrs`=<b><i>`1`</i></b>, <b>`label_cols`</b>:`IntsOrStrs`=<b><i>`0`</i></b>, <b>`label_delim`</b>:`str`=<b><i>`None`</i></b>, <b>`kwargs`</b>) → [`DataBunch`](/basic_data.html#DataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) from DataFrames.  

This method will use `train_df`, `valid_df` and maybe `test_df` to build the [`TextDataBunch`](/text.data.html#TextDataBunch) in `path`. You can specify `text_cols` and `label_cols`. If there are several `text_cols`, the texts will be concatenated together with an optional field token. If there are several `label_cols`, the labels will be assumed to be one-hot encoded and `classes` will default to `label_cols` (you can ignore that argument for a language model). `tokenizer` will be used to parse those texts into tokens.

You can pass a specific `vocab` for the numericalization step (if you are building a classifier from a language model you fine-tuned for instance). `kwargs` will be split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections on language model data and classifier data).

In [None]:
show_doc(TextDataBunch.from_tokens)

<h4 id="TextDataBunch.from_tokens"><code>from_tokens</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L143" class="source_link">[source]</a></h4>

> <code>from_tokens</code>(<b>`path`</b>:`PathOrStr`, <b>`trn_tok`</b>:`Tokens`, <b>`trn_lbls`</b>:`Collection`\[`Union`\[`int`, `float`\]\], <b>`val_tok`</b>:`Tokens`, <b>`val_lbls`</b>:`Collection`\[`Union`\[`int`, `float`\]\], <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`tst_tok`</b>:`Tokens`=<b><i>`None`</i></b>, <b>`classes`</b>:`ArgStar`=<b><i>`None`</i></b>, <b>`kwargs`</b>) → [`DataBunch`](/basic_data.html#DataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) from tokens and labels.  

This function will create a [`DataBunch`](/basic_data.html#DataBunch) from `trn_tok`, `trn_lbls`, `val_tok`, `val_lbls` and maybe `tst_tok`.

You can pass a specific `vocab` for the numericalization step (if you are building a classifier from a language model you fine-tuned for instance). `kwargs` will be split between the [`TextDataset`](/text.data.html#TextDataset) function and the class initialization. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels`, `tok_suff` and `lbl_suff` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections on language model data and classifier data).

In [None]:
show_doc(TextDataBunch.from_ids)

<h4 id="TextDataBunch.from_ids"><code>from_ids</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L119" class="source_link">[source]</a></h4>

> <code>from_ids</code>(<b>`path`</b>:`PathOrStr`, <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab), <b>`train_ids`</b>:`Collection`\[`Collection`\[`int`\]\], <b>`valid_ids`</b>:`Collection`\[`Collection`\[`int`\]\], <b>`test_ids`</b>:`Collection`\[`Collection`\[`int`\]\]=<b><i>`None`</i></b>, <b>`train_lbls`</b>:`Collection`\[`Union`\[`int`, `float`\]\]=<b><i>`None`</i></b>, <b>`valid_lbls`</b>:`Collection`\[`Union`\[`int`, `float`\]\]=<b><i>`None`</i></b>, <b>`classes`</b>:`ArgStar`=<b><i>`None`</i></b>, <b>`processor`</b>:[`PreProcessor`](/data_block.html#PreProcessor)=<b><i>`None`</i></b>, <b>`kwargs`</b>) → [`DataBunch`](/basic_data.html#DataBunch)

Create a [`TextDataBunch`](/text.data.html#TextDataBunch) from ids, labels and a `vocab`.  

Texts are already preprocessed into `train_ids`, `train_lbls`, `valid_ids`, `valid_lbls` and maybe `test_ids`. You can specify the corresponding `classes` if applicable. You must specify a `path` and the `vocab` so that the [`RNNLearner`](/text.learner.html#RNNLearner) class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.

### Load and save

To avoid losing time preprocessing the text data more than once, you should save/load your [`TextDataBunch`](/text.data.html#TextDataBunch) using these methods.

In [None]:
show_doc(TextDataBunch.load)

<h4 id="TextDataBunch.load"><code>load</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L132" class="source_link">[source]</a></h4>

> <code>load</code>(<b>`path`</b>:`PathOrStr`, <b>`cache_name`</b>:`PathOrStr`=<b><i>`'tmp'`</i></b>, <b>`processor`</b>:[`PreProcessor`](/data_block.html#PreProcessor)=<b><i>`None`</i></b>, <b>`kwargs`</b>)

Load a [`TextDataBunch`](/text.data.html#TextDataBunch) from `path/cache_name`. `kwargs` are passed to the dataloader creation.  

In [None]:
show_doc(TextDataBunch.save)

<h4 id="TextDataBunch.save"><code>save</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L107" class="source_link">[source]</a></h4>

> <code>save</code>(<b>`cache_name`</b>:`PathOrStr`=<b><i>`'tmp'`</i></b>)

Save the [`DataBunch`](/basic_data.html#DataBunch) in `self.path/cache_name` folder.  

### Example

Untar the IMDB sample dataset if not already done:

In [None]:
path = untar_data(URLs.IMDB_SAMPLE)
path

PosixPath('/home/ubuntu/.fastai/data/imdb_sample')

Since it comes in the form of csv files, we will use the corresponding `from_csv` method. Here is an overview of what your file should look like:

In [None]:
pd.read_csv(path/'texts.csv').head()

| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |


And here is a simple way of creating your [`DataBunch`](/basic_data.html#DataBunch) for language modelling or classification.

In [None]:
data_lm = TextLMDataBunch.from_csv(Path(path), 'texts.csv')
data_clas = TextClasDataBunch.from_csv(Path(path), 'texts.csv')

## The TextList input classes

Behind the scenes, the previous functions will create a training, validation and maybe test [`TextList`](/text.data.html#TextList) that will be tokenized and numericalized (if needed) using [`PreProcessor`](/data_block.html#PreProcessor).

In [None]:
show_doc(Text, title_level=3)

<h3 id="Text"><code>class</code> <code>Text</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L226" class="source_link">[source]</a></h3>

> <code>Text</code>(<b>`ids`</b>, <b>`text`</b>) :: [`ItemBase`](/core.html#ItemBase)

Basic item for <code>text</code> data in numericalized `ids`.  

In [None]:
show_doc(TextList, title_level=3)

<h3 id="TextList"><code>class</code> <code>TextList</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L261" class="source_link">[source]</a></h3>

> <code>TextList</code>(<b>`items`</b>:`Iterator`\[`T_co`\], <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`pad_idx`</b>:`int`=<b><i>`1`</i></b>, <b>`kwargs`</b>) :: [`ItemList`](/data_block.html#ItemList)

Basic [`ItemList`](/data_block.html#ItemList) for text data.  

`vocab` contains the correspondence between ids and tokens, `pad_idx` is the id used for padding. You can pass a custom `processor` in the `kwargs` to change the defaults for tokenization or numericalization. It should have the following form:

In [None]:
processor = [TokenizeProcessor(tokenizer=SpacyTokenizer('en')), NumericalizeProcessor(max_vocab=30000)]

See below for all the arguments those processors can take.

In [None]:
show_doc(TextList.label_for_lm)

<h4 id="TextList.label_for_lm"><code>label_for_lm</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L277" class="source_link">[source]</a></h4>

> <code>label_for_lm</code>(<b>`kwargs`</b>)

A special labelling method for language models.  

In [None]:
show_doc(TextList.from_folder)

<h4 id="TextList.from_folder"><code>from_folder</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L286" class="source_link">[source]</a></h4>

> <code>from_folder</code>(<b>`path`</b>:`PathOrStr`=<b><i>`'.'`</i></b>, <b>`extensions`</b>:`StrList`=<b><i>`{'.txt'}`</i></b>, <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`processor`</b>:[`PreProcessor`](/data_block.html#PreProcessor)=<b><i>`None`</i></b>, <b>`kwargs`</b>) → `TextList`

Get the list of files in `path` that have a text suffix. `recurse` determines if we search subfolders.  

In [None]:
show_doc(TextList.show_xys)

<h4 id="TextList.show_xys"><code>show_xys</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L293" class="source_link">[source]</a></h4>

> <code>show_xys</code>(<b>`xs`</b>, <b>`ys`</b>, <b>`max_len`</b>:`int`=<b><i>`70`</i></b>)

Show the `xs` (inputs) and `ys` (targets). `max_len` is the maximum number of tokens displayed.  

In [None]:
show_doc(TextList.show_xyzs)

<h4 id="TextList.show_xyzs"><code>show_xyzs</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L302" class="source_link">[source]</a></h4>

> <code>show_xyzs</code>(<b>`xs`</b>, <b>`ys`</b>, <b>`zs`</b>, <b>`max_len`</b>:`int`=<b><i>`70`</i></b>)

Show `xs` (inputs), `ys` (targets) and `zs` (predictions). `max_len` is the maximum number of tokens displayed.  

In [None]:
show_doc(OpenFileProcessor, title_level=3)

<h3 id="OpenFileProcessor"><code>class</code> <code>OpenFileProcessor</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L256" class="source_link">[source]</a></h3>

> <code>OpenFileProcessor</code>(<b>`ds`</b>:`Collection`\[`T_co`\]=<b><i>`None`</i></b>) :: [`PreProcessor`](/data_block.html#PreProcessor)

[`PreProcessor`](/data_block.html#PreProcessor) that opens the filenames and reads the texts.  

In [None]:
show_doc(open_text)

<h4 id="open_text"><code>open_text</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L222" class="source_link">[source]</a></h4>

> <code>open_text</code>(<b>`fn`</b>:`PathOrStr`, <b>`enc`</b>=<b><i>`'utf-8'`</i></b>)

Read the text in `fn`.  

In [None]:
show_doc(TokenizeProcessor, title_level=3)

<h3 id="TokenizeProcessor"><code>class</code> <code>TokenizeProcessor</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L231" class="source_link">[source]</a></h3>

> <code>TokenizeProcessor</code>(<b>`ds`</b>:[`ItemList`](/data_block.html#ItemList)=<b><i>`None`</i></b>, <b>`tokenizer`</b>:[`Tokenizer`](/text.transform.html#Tokenizer)=<b><i>`None`</i></b>, <b>`chunksize`</b>:`int`=<b><i>`10000`</i></b>, <b>`mark_fields`</b>:`bool`=<b><i>`False`</i></b>) :: [`PreProcessor`](/data_block.html#PreProcessor)

[`PreProcessor`](/data_block.html#PreProcessor) that tokenizes the texts in `ds`.  

`tokenizer` is used on chunks of `chunksize` texts. If `mark_fields=True`, field tokens are added between the parts of each text (which happens when the texts are read from several columns of a dataframe). See more about tokenizers in the [transform documentation](/text.transform.html).
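
The two ideas above (marking fields and tokenizing chunk by chunk) can be sketched in plain Python. Everything here is an assumption made for illustration: the field marker name `xxfld`, the helper names, and the use of `str.split` as a stand-in for a real tokenizer. It is not the library's code.

```python
from typing import Iterator, List, Sequence

FLD = "xxfld"  # assumed name for the field marker token


def join_fields(columns: Sequence[str], mark_fields: bool = False) -> str:
    """Join the text columns of one row, optionally marking each field."""
    if not mark_fields:
        return " ".join(columns)
    return " ".join(f"{FLD} {i} {col}" for i, col in enumerate(columns, start=1))


def chunks(items: List[str], chunksize: int) -> Iterator[List[str]]:
    """Yield successive slices of at most `chunksize` items."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


def tokenize_all(texts: List[str], chunksize: int = 10000) -> List[List[str]]:
    """Tokenize chunk by chunk; str.split stands in for a real tokenizer."""
    tokens: List[List[str]] = []
    for chunk in chunks(texts, chunksize):
        tokens.extend(text.split() for text in chunk)
    return tokens


# Hypothetical rows with two text columns each.
rows = [("A title", "the body"), ("Another title", "more text")]
texts = [join_fields(row, mark_fields=True) for row in rows]
tokenized = tokenize_all(texts, chunksize=1)
print(tokenized[0])
```

Working in chunks keeps the amount of text handed to the tokenizer at once bounded, which matters on large corpora.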

In [None]:
show_doc(NumericalizeProcessor, title_level=3)

<h3 id="NumericalizeProcessor"><code>class</code> <code>NumericalizeProcessor</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L244" class="source_link">[source]</a></h3>

> <code>NumericalizeProcessor</code>(<b>`ds`</b>:[`ItemList`](/data_block.html#ItemList)=<b><i>`None`</i></b>, <b>`vocab`</b>:[`Vocab`](/text.transform.html#Vocab)=<b><i>`None`</i></b>, <b>`max_vocab`</b>:`int`=<b><i>`60000`</i></b>, <b>`min_freq`</b>:`int`=<b><i>`2`</i></b>) :: [`PreProcessor`](/data_block.html#PreProcessor)

[`PreProcessor`](/data_block.html#PreProcessor) that numericalizes the tokens in `ds`.  

Uses `vocab` for this (if not `None`); otherwise creates one from the tokens with `max_vocab` and `min_freq`.
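
To make the roles of `max_vocab` and `min_freq` concrete, here is a simplified sketch of building a vocabulary and numericalizing with it. The helper names are hypothetical and the special-token handling is an assumption (unknown token at index 0, padding token at index 1, matching the padding index used on this page); the real processor may differ in its details.

```python
from collections import Counter
from typing import Dict, List

UNK = "xxunk"  # unknown-word token, as it appears in the outputs on this page
PAD = "xxpad"  # assumed name for the padding token


def build_vocab(tokens: List[List[str]], max_vocab: int = 60000, min_freq: int = 2) -> List[str]:
    """Keep at most `max_vocab` tokens, each seen at least `min_freq` times."""
    freq = Counter(tok for text in tokens for tok in text)
    itos = [UNK, PAD]
    seen = set(itos)
    for tok, count in freq.most_common(max_vocab):
        if count >= min_freq and tok not in seen:
            itos.append(tok)
            seen.add(tok)
    return itos


def numericalize(text: List[str], stoi: Dict[str, int]) -> List[int]:
    """Map tokens to ids; anything outside the vocabulary becomes UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in text]


tokens = [["the", "film", "was", "good"], ["the", "film", "was", "bad"]]
itos = build_vocab(tokens, max_vocab=100, min_freq=2)
stoi = {tok: i for i, tok in enumerate(itos)}
ids = numericalize(["the", "film", "was", "great"], stoi)
print(itos, ids)
```

Words that appear only once ("good", "bad") fall below `min_freq`, so they are left out of the vocabulary and would be mapped to the unknown token.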

## Language Model data

A language model is trained to guess the next word in a stream of words. We don't feed it the different texts separately but concatenate them all into one big array. To create the batches, we split this array into `bs` chunks of continuous text. Note that, in all NLP tasks, we don't follow the common convention of making sequence length the first dimension: batch size is the first dimension and sequence length is the second. Here you can read the chunks of text in lines.

In [None]:
path = untar_data(URLs.IMDB_SAMPLE)
data = TextLMDataBunch.from_csv(path, 'texts.csv')
x,y = next(iter(data.train_dl))
example = x[:15,:15].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts

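| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xxbos | ... | the | first | film | i | had | to | walk | out | on | . | xxmaj | and | it |
| 1 | and | you | 'll | agree | . | " | xxmaj | eyes | xxmaj | wide | xxmaj | shut | " | when | released |
| 2 | the | love | xxunk | of | xxmaj | alex | and | xxmaj | kiki | , | and | xxmaj | kiki | and | her |
| 3 | you | and | so | could | someone | who | had | never | seen | a | movie | before | . | xxmaj | it |
| 4 | for | their | liking | . | 7 | / | 10 | xxbos | xxmaj | predictable | , | told | a | thousand | times |
| 5 | wrote | something | as | boring | and | utterly | ridiculous | as | this | i | would | be | laughed | at | and |
| 6 | the | xxmaj | beatles | ; | it | has | very | little | plot | , | in | fact | , | and | takes |
| 7 | that | he | 's | found | love | again | in | xxmaj | xxunk | . | `\n\n` | xxmaj | as | for | those |
| 8 | are | good | . | xxmaj | however | , | the | xxunk | gang | rape | scene | is | the | most | appalling |
| 9 | xxmaj | fantastic | and | i | thought | xxmaj | michelle | xxmaj | xxunk | did | a | good | job | in | the |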
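
The way the concatenated array is cut into `bs` rows of continuous text can be sketched with plain lists. This is a minimal illustration with made-up ids and a hypothetical `batchify` helper, assuming leftover ids that don't fill a full row are dropped.

```python
from typing import List


def batchify(stream: List[int], bs: int) -> List[List[int]]:
    """Split one long stream of ids into `bs` rows of continuous text."""
    n = len(stream) // bs          # length of each row
    trimmed = stream[:n * bs]      # drop the ids that don't fill a full row
    return [trimmed[i * n:(i + 1) * n] for i in range(bs)]


stream = list(range(10))           # stand-in for the concatenated token ids
rows = batchify(stream, bs=3)

# A batch takes the same window of positions from every row, so its shape is
# (batch size, sequence length): batch first, sequence second.
seq_len = 2
batch = [row[0:seq_len] for row in rows]
print(rows)
print(batch)
```

Each row continues where it left off in the next batch, which is what lets the model carry its hidden state from one batch to the next.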


In [None]:
jekyll_warn("If you are used to another convention, beware! fastai always uses batch as a first dimension, even in NLP.")

<div markdown="span" class="alert alert-danger" role="alert"><i class="fa fa-danger-circle"></i> <b>Warning: </b>If you are used to another convention, beware! fastai always uses batch as a first dimension, even in NLP.</div>

Then, as suggested in [this article](https://arxiv.org/abs/1708.02182) from Stephen Merity et al., we don't use a fixed `bptt` through the different batches but slightly change it from batch to batch.

In [None]:
iter_dl = iter(data.train_dl)
for _ in range(5):
    x,y = next(iter_dl)
    print(x.size())

torch.Size([64, 74])
torch.Size([64, 60])
torch.Size([64, 72])
torch.Size([64, 69])
torch.Size([64, 68])


This is all done internally when we use [`TextLMDataBunch`](/text.data.html#TextLMDataBunch), by creating [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) using the following class:

In [None]:
show_doc(LanguageModelLoader)

<h2 id="LanguageModelLoader"><code>class</code> <code>LanguageModelLoader</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L14" class="source_link">[source]</a></h2>

> <code>LanguageModelLoader</code>(<b>`dataset`</b>:[`LabelList`](/data_block.html#LabelList), <b>`lengths`</b>:`Collection`\[`int`\]=<b><i>`None`</i></b>, <b>`bs`</b>:`int`=<b><i>`64`</i></b>, <b>`bptt`</b>:`int`=<b><i>`70`</i></b>, <b>`backwards`</b>:`bool`=<b><i>`False`</i></b>, <b>`shuffle`</b>:`bool`=<b><i>`False`</i></b>, <b>`drop_last`</b>:`bool`=<b><i>`False`</i></b>, <b>`max_len`</b>:`int`=<b><i>`25`</i></b>, <b>`p_bptt`</b>:`int`=<b><i>`0.95`</i></b>)

Create a dataloader with bptt slightly changing.  

Takes the texts from `dataset` that have certain `lengths` (if this argument isn't passed, `lengths` are computed at initialization). It yields batches with a batch size of `bs` and a sequence length approximately equal to `bptt` but changing at every batch. If `backwards=True`, it reverses the original text. If `shuffle=True`, the texts are shuffled at the start of each epoch, before going through them. `max_len` is the maximum amount we add to `bptt` (to avoid out-of-memory errors). With probability `1 - p_bptt`, `bptt` is divided by 2 for that batch.
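
A varying sequence length of this kind can be sketched as below. This follows the scheme described by Merity et al. rather than reproducing the loader's code, and it rests on assumptions: the full `bptt` is kept with probability `p_bptt` and halved otherwise, the length is then drawn from a normal distribution with a standard deviation of 5, and it is never allowed below 5 or above `bptt + max_len`. The function name is hypothetical.

```python
import random


def sample_seq_len(bptt: int = 70, max_len: int = 25, p_bptt: float = 0.95, rng=random) -> int:
    """Draw a sequence length close to `bptt` (assumed scheme, see lead-in)."""
    # Keep the full bptt most of the time, halve it otherwise.
    base = bptt if rng.random() < p_bptt else bptt / 2
    # Jitter around the base length; never go below 5 tokens.
    seq_len = max(5, int(rng.gauss(base, 5)))
    # Cap the length so a batch can't grow without bound.
    return min(seq_len, bptt + max_len)


rng = random.Random(0)
lengths = [sample_seq_len(rng=rng) for _ in range(5)]
print(lengths)
```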

## Classifier data

When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since the texts don't all have the same length, we can't easily collate them into batches. To help with this we use two techniques:
- padding: each text in a batch is padded with the `PAD` token so that all of them reach the same length,
- sorting the texts (ish): to avoid batching a very long text with a very short one (which would then need a lot of `PAD` tokens), we group the texts by length. For the training set, we still add some randomness to avoid showing the same batches at every epoch.

Here is an example of a batch with padding (the padding index is 1, and the padding is applied before the sentences start).

In [None]:
path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path, 'texts.csv')
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[-10:,:20]

tensor([[   1,    1,    1,    1,    1,    1,    1,    1,    2,    4,  404,  101,
         3263,   10, 4111,   66,   75,   23,  337,   66],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    2,    4, 4914,
            4, 1635,   22, 1098,  709,   23, 1418,  882],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    2,   21,    4,
         4392,   21,   15,   43,   13,    8,  144, 2031],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    2,    4,  135,
          340,   23, 5865,   94,   36,    0,   88,  340],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    2,   18,   40,
          142,  130,   10,  130, 2493,   13,    4,    8],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    2,    4,
            8,  102,   80,   18,  155,  259,   20,   29],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    2,    4,
           20,   24,   12,  387,   31,    9,    4,   54],
        [   1,    1,    1, 

This is all done internally when we use [`TextClasDataBunch`](/text.data.html#TextClasDataBunch), by using the following classes:

In [None]:
show_doc(SortSampler)

<h2 id="SortSampler"><code>class</code> <code>SortSampler</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L60" class="source_link">[source]</a></h2>

> <code>SortSampler</code>(<b>`data_source`</b>:`NPArrayList`, <b>`key`</b>:`KeyFunc`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)

Go through the text data by order of length.  

This PyTorch [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) is used for the validation and (if applicable) the test set.
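
Going through texts by order of length amounts to sorting their indices with the length as key. A minimal sketch with made-up texts, assuming the longest text comes first:

```python
# Hypothetical numericalized texts of different lengths.
texts = [[2, 5, 6], [2, 7], [2, 8, 9, 10, 11], [2, 3, 4, 5]]

# Indices of the texts, longest first (descending order is an assumption).
order = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
print(order)
```

Consecutive indices then point to texts of similar length, so each batch needs little padding.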

In [None]:
show_doc(SortishSampler)

<h2 id="SortishSampler"><code>class</code> <code>SortishSampler</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L68" class="source_link">[source]</a></h2>

> <code>SortishSampler</code>(<b>`data_source`</b>:`NPArrayList`, <b>`key`</b>:`KeyFunc`, <b>`bs`</b>:`int`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)

Go through the text data by order of length with a bit of randomness.  

This PyTorch [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) is generally used for the training set.
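
One simple way to get "sorted, with a bit of randomness" is sketched below. It is a simplified illustration, not the library's implementation: indices are shuffled, sorted by length inside large chunks, cut into batches, and the batches are shuffled. The function name and the `chunk_mult` parameter are hypothetical.

```python
import random
from typing import List


def sortish_indices(lengths: List[int], bs: int, rng: random.Random, chunk_mult: int = 50) -> List[int]:
    """Return indices roughly ordered by length, with randomness between batches."""
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    # Sort by decreasing length inside each large chunk of shuffled indices.
    chunk = bs * chunk_mult
    sorted_idxs: List[int] = []
    for start in range(0, len(idxs), chunk):
        piece = idxs[start:start + chunk]
        sorted_idxs.extend(sorted(piece, key=lambda i: lengths[i], reverse=True))
    # Cut into batches of similar lengths, then shuffle the order of the batches.
    batches = [sorted_idxs[i:i + bs] for i in range(0, len(sorted_idxs), bs)]
    rng.shuffle(batches)
    return [i for batch in batches for i in batch]


lengths = [3, 9, 1, 7, 5, 2, 8]
order = sortish_indices(lengths, bs=2, rng=random.Random(0))
print(order)
```

Every index is visited exactly once per pass, and texts that end up in the same batch have similar lengths, but the order of the batches changes from one pass to the next.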

In [None]:
show_doc(pad_collate)

<h4 id="pad_collate"><code>pad_collate</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L89" class="source_link">[source]</a></h4>

> <code>pad_collate</code>(<b>`samples`</b>:`BatchSamples`, <b>`pad_idx`</b>:`int`=<b><i>`1`</i></b>, <b>`pad_first`</b>:`bool`=<b><i>`True`</i></b>) → `Tuple`\[`LongTensor`, `LongTensor`\]

Function that collects samples and adds padding.  

This will collate the `samples` in batches while adding padding with `pad_idx`. If `pad_first=True`, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end.
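
The padding behaviour can be sketched with plain lists instead of tensors. The helper below is a hypothetical stand-in (named `pad_collate_sketch` to keep it distinct from the library function); it only shows how `pad_idx` and `pad_first` shape the result.

```python
from typing import List, Sequence, Tuple


def pad_collate_sketch(
    samples: Sequence[Tuple[Sequence[int], int]],
    pad_idx: int = 1,
    pad_first: bool = True,
) -> Tuple[List[List[int]], List[int]]:
    """Pad every sample to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids, _ in samples)
    xs: List[List[int]] = []
    ys: List[int] = []
    for ids, label in samples:
        padding = [pad_idx] * (max_len - len(ids))
        # Padding goes before the text if pad_first, after it otherwise.
        xs.append(padding + list(ids) if pad_first else list(ids) + padding)
        ys.append(label)
    return xs, ys


# Hypothetical (ids, label) samples of different lengths.
samples = [([2, 4, 404, 101], 0), ([2, 21], 1)]
xs_first, ys = pad_collate_sketch(samples)
xs_last, _ = pad_collate_sketch(samples, pad_first=False)
print(xs_first)
print(xs_last)
```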

## Undocumented Methods - Methods moved below this line will intentionally be hidden

In [None]:
show_doc(TextList.new)

<h4 id="ItemList.new"><code>new</code><a href="https://github.com/fastai/fastai/blob/master/fastai/data_block.py#L86" class="source_link">[source]</a></h4>

> <code>new</code>(<b>`items`</b>:`Iterator`\[`T_co`\], <b>`processor`</b>:[`PreProcessor`](/data_block.html#PreProcessor)=<b><i>`None`</i></b>, <b>`kwargs`</b>) → `ItemList`

Create a new [`ItemList`](/data_block.html#ItemList) from `items`, keeping the same attributes.  

In [None]:
show_doc(TextList.get)

<h4 id="TextList.get"><code>get</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L273" class="source_link">[source]</a></h4>

> <code>get</code>(<b>`i`</b>)

Subclass if you want to customize how to create item `i` from `self.items`.  

In [None]:
show_doc(TokenizeProcessor.process_one)

<h4 id="TokenizeProcessor.process_one"><code>process_one</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L236" class="source_link">[source]</a></h4>

> <code>process_one</code>(<b>`item`</b>)

In [None]:
show_doc(TokenizeProcessor.process)

<h4 id="TokenizeProcessor.process"><code>process</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L237" class="source_link">[source]</a></h4>

> <code>process</code>(<b>`ds`</b>)

In [None]:
show_doc(OpenFileProcessor.process_one)

<h4 id="OpenFileProcessor.process_one"><code>process_one</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L258" class="source_link">[source]</a></h4>

> <code>process_one</code>(<b>`item`</b>)

In [None]:
show_doc(NumericalizeProcessor.process)

<h4 id="NumericalizeProcessor.process"><code>process</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L251" class="source_link">[source]</a></h4>

> <code>process</code>(<b>`ds`</b>)

In [None]:
show_doc(NumericalizeProcessor.process_one)

<h4 id="NumericalizeProcessor.process_one"><code>process_one</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L250" class="source_link">[source]</a></h4>

> <code>process_one</code>(<b>`item`</b>)

In [None]:
show_doc(TextList.reconstruct)

<h4 id="TextList.reconstruct"><code>reconstruct</code><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L282" class="source_link">[source]</a></h4>

> <code>reconstruct</code>(<b>`t`</b>:`Tensor`)

Reconstruct one of the underlying items from its data `t`.  
