# NLP datasets

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.text import * 
from fastai.docs import *
from fastai import *

This module contains the [`TextDataset`](/text.data.html#TextDataset) class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in [`text.transform`](/text.transform.html#text.transform). It also defines a few helper functions to quickly get a [`DataBunch`](/data.html#DataBunch) ready.

## Quickly assemble your data

To make the most of the fastai library, you should get your data in one of the following formats and use the corresponding `text_data` function:
- raw text files in folders train, valid, test in an ImageNet style,
- a csv (with no index or header) where the first column(s) give the label(s) and the following one the associated text,
- tokens and labels arrays already saved,
- ids, vocabulary (correspondence from id to word) and labels already saved.

If you are assembling the data for a language model, set every label to 0 so that the data still fits those formats. The first time you create a [`DataBunch`](/data.html#DataBunch) with one of those functions, your data is preprocessed automatically and the result is saved, so the next call is almost instantaneous.
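
As a minimal sketch of the csv layout described above (label in the first column, text in the second, no header and no index), the snippet below writes a tiny file with the standard library only. The file name `train.csv` matches the default `train` argument; the two rows and the temporary folder are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows: (label, text). For a language model, every label would be 0.
rows = [(0, "A dull and predictable film."),
        (1, "A moving, well-acted story.")]

path = Path(tempfile.mkdtemp())          # stand-in for your data folder
with open(path / "train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)               # no header row, no index column

# Read the file back to confirm the layout: label first, text second.
with open(path / "train.csv", newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f))
```

Texts that contain commas are quoted by the `csv` writer, so they still come back as a single column.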

### text_data functions

All those functions will require a `data_func` argument that explains how to assemble the datasets into a [`DataBunch`](/data.html#DataBunch). It can be one of the following:
- [`standard_data`](/text.data.html#standard_data): the datasets are directly used to create a [`DataBunch`](/data.html#DataBunch),
- [`lm_data`](/text.data.html#lm_data): the datasets are assembled to create a [`DataBunch`](/data.html#DataBunch) suitable for a language model,
- [`classifier_data`](/text.data.html#classifier_data): the datasets are assembled to create a [`DataBunch`](/data.html#DataBunch) suitable for an NLP classifier.

In [None]:
show_doc(text_data_from_folder, doc_string=False)

#### <a id=text_data_from_folder></a>`text_data_from_folder`
> `text_data_from_folder`(`path`:`PathOrStr`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `train`:`str`=`'train'`, `valid`:`str`=`'valid'`, `test`:`Optional`\[`str`\]=`None`, `shuffle`:`bool`=`True`, `data_func`:`Callable`\[`Collection`\[[`DatasetBase`](/data.html#DatasetBase)\], `PathOrStr`, `KWArgs`, [`DataBunch`](/data.html#DataBunch)\]=`'standard_data'`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `kwargs`)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L315">[source]</a>

This function will create a [`DataBunch`](/data.html#DataBunch) from texts placed in `path` in [`train`](/train.html#train), `valid` and maybe `test` folders. Text files in the [`train`](/train.html#train) and `valid` folders should be placed in subdirectories according to their classes (always the same one for a language model), while the ones for the `test` folder should all be placed there directly. `tokenizer` will be used to parse those texts into tokens. If `shuffle` is True, the texts found are shuffled.
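
To make the expected folder layout concrete, here is a small sketch that builds it in a temporary directory. The class names `neg` and `pos` and the file names are hypothetical; only the `train`/`valid`/`test` structure comes from the description above.

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())   # stand-in for `path`

# train/ and valid/ hold one subdirectory per class (hypothetical names here).
for split in ("train", "valid"):
    for cls in ("neg", "pos"):
        folder = root / split / cls
        folder.mkdir(parents=True)
        (folder / "0.txt").write_text(f"a {cls} review", encoding="utf-8")

# test/ holds its text files directly, with no class subdirectories.
(root / "test").mkdir()
(root / "test" / "0.txt").write_text("an unlabelled review", encoding="utf-8")

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))
```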

You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). The `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the `get_data` function. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data).
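
One way such a split of keyword arguments could work is sketched below. This is not the library's code: `split_kwargs`, `make_dataset` and `make_data` are hypothetical names, and the defaults are copied from the signatures shown in this page.

```python
import inspect

def split_kwargs(kwargs, func):
    "Split `kwargs` into those `func` accepts by name and the rest."
    accepted = inspect.signature(func).parameters
    for_func = {k: v for k, v in kwargs.items() if k in accepted}
    rest = {k: v for k, v in kwargs.items() if k not in accepted}
    return for_func, rest

# Hypothetical stand-ins for the dataset constructor and the data function.
def make_dataset(max_vocab=60000, min_freq=2, n_labels=1): ...
def make_data(bs=64, bptt=70, pad_idx=1): ...

ds_kwargs, data_kwargs = split_kwargs({"max_vocab": 30000, "bs": 32}, make_dataset)
```

Here `max_vocab` ends up with the dataset arguments and `bs` is left over for the function that builds the dataloaders.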

In [None]:
show_doc(text_data_from_csv, doc_string=False)

#### <a id=text_data_from_csv></a>`text_data_from_csv`
> `text_data_from_csv`(`path`:`PathOrStr`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `train`:`str`=`'train'`, `valid`:`str`=`'valid'`, `test`:`Optional`\[`str`\]=`None`, `data_func`:`Callable`\[`Collection`\[[`DatasetBase`](/data.html#DatasetBase)\], `PathOrStr`, `KWArgs`, [`DataBunch`](/data.html#DataBunch)\]=`'standard_data'`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L304">[source]</a>

This function will create a [`DataBunch`](/data.html#DataBunch) from texts placed in `path` in [`train`](/train.html#train).csv, `valid`.csv and maybe `test`.csv files. These csv files should have no header or index, and the label(s) should be in the first column(s) (be sure to adjust the parameter `n_labels` if you have more than one). `tokenizer` will be used to parse those texts into tokens.

You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). The `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the `get_data` function. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data).

In [None]:
show_doc(text_data_from_tokens, doc_string=False)

#### <a id=text_data_from_tokens></a>`text_data_from_tokens`
> `text_data_from_tokens`(`path`:`PathOrStr`, `train`:`str`=`'train'`, `valid`:`str`=`'valid'`, `test`:`Optional`\[`str`\]=`None`, `data_func`:`Callable`\[`Collection`\[[`DatasetBase`](/data.html#DatasetBase)\], `PathOrStr`, `KWArgs`, [`DataBunch`](/data.html#DataBunch)\]=`'standard_data'`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L294">[source]</a>

This function will create a [`DataBunch`](/data.html#DataBunch) from texts already tokenized placed in `path` in files named `f{train}{tok_suff}`.npy, `f{train}{lbl_suff}`.npy, `f{valid}{tok_suff}`.npy, `f{valid}{lbl_suff}`.npy and maybe `f{test}{tok_suff}`.npy. If no label file exists, labels will default to all zeros. `tok_suff` and `lbl_suff` are '\_tok' and '\_lbl' respectively.
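
The naming scheme above can be sketched as follows. `token_files` and `labels_or_zeros` are hypothetical helpers written for this illustration, not functions of the library; the default suffixes are the ones given in the text.

```python
from pathlib import Path

def token_files(path, name="train", tok_suff="_tok", lbl_suff="_lbl"):
    "Return the token and label file paths expected for the set `name`."
    path = Path(path)
    return path / f"{name}{tok_suff}.npy", path / f"{name}{lbl_suff}.npy"

def labels_or_zeros(labels, n_texts):
    "Use `labels` when provided, else one zero label per text."
    return list(labels) if labels is not None else [0] * n_texts

tok_file, lbl_file = token_files("data", "valid")
```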

You can pass a specific `vocab` for the numericalization step (for instance, if you are building a classifier from a language model you fine-tuned). The `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the `get_data` function. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels`, `tok_suff` and `lbl_suff` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data).

In [None]:
show_doc(text_data_from_ids, doc_string=False)

#### <a id=text_data_from_ids></a>`text_data_from_ids`
> `text_data_from_ids`(`path`:`PathOrStr`, `train`:`str`=`'train'`, `valid`:`str`=`'valid'`, `test`:`Optional`\[`str`\]=`None`, `data_func`:`Callable`\[`Collection`\[[`DatasetBase`](/data.html#DatasetBase)\], `PathOrStr`, `KWArgs`, [`DataBunch`](/data.html#DataBunch)\]=`'standard_data'`, `itos`:`str`=`'itos.pkl'`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L284">[source]</a>

This function will create a [`DataBunch`](/data.html#DataBunch) from texts already numericalized, placed in `path` in files named `f{train}{id_suff}`.npy, `f{train}{lbl_suff}`.npy, `f{valid}{id_suff}`.npy, `f{valid}{lbl_suff}`.npy and maybe `f{test}{id_suff}`.npy. If no label file exists, labels will default to all zeros. `id_suff` and `lbl_suff` are '\_ids' and '\_lbl' respectively. The `itos` file should contain the correspondence from ids to words.

The `kwargs` are split between the [`TextDataset`](/text.data.html#TextDataset) function and the `get_data` function. There you can specify parameters such as `max_vocab`, `chunksize`, `min_freq`, `n_labels`, `id_suff` and `lbl_suff` (see the [`TextDataset`](/text.data.html#TextDataset) documentation) or `bs`, `bptt` and `pad_idx` (see the sections LM data and classifier data).
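
As an illustration of what an id-to-word correspondence looks like, the sketch below pickles a small list and uses it to turn ids back into text. It assumes the `itos` file is a pickled Python list indexed by id; the vocabulary itself is made up.

```python
import pickle
import tempfile
from pathlib import Path

itos = ["xxunk", "xxpad", "the", "movie", "was", "great"]   # made-up vocabulary

path = Path(tempfile.mkdtemp())
with open(path / "itos.pkl", "wb") as f:
    pickle.dump(itos, f)                 # id -> word, by list position

with open(path / "itos.pkl", "rb") as f:
    loaded_itos = pickle.load(f)

ids = [2, 3, 4, 5]
text = " ".join(loaded_itos[i] for i in ids)
```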

### Example

Untar the IMDB sample dataset if not already done:

In [None]:
untar_data(IMDB_PATH)
IMDB_PATH

PosixPath('../data/imdb_sample')

Since it comes in the form of csv files, we will use the corresponding `text_data` function. Here is an overview of what your file should look like:

In [None]:
pd.read_csv(IMDB_PATH/'train.csv', header=None).head()

Unnamed: 0,0,1
0,0,Un-bleeping-believable! Meg Ryan doesn't even ...
1,1,This is a extremely well-made film. The acting...
2,0,Every once in a long while a movie will come a...
3,1,Name just says it all. I watched this movie wi...
4,0,This movie succeeds at being one of the most u...


And here is a simple way of creating your [`DataBunch`](/data.html#DataBunch).

In [None]:
data_lm = text_data_from_csv(Path(IMDB_PATH), data_func=lm_data)

Tokenizing valid.



Numericalizing valid.


## The TextDataset class

Behind the scenes, the previous functions will create a training, validation and maybe test [`TextDataset`](/text.data.html#TextDataset) which is the class responsible for collecting and preprocessing the data.

In [None]:
show_doc(TextDataset, doc_string=False)

## <a id=TextDataset></a>`class` `TextDataset`
> `TextDataset`(`path`:`PathOrStr`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `vocab`:[`Vocab`](/text.transform.html#Vocab)=`None`, `max_vocab`:`int`=`60000`, `chunksize`:`int`=`10000`, `name`:`str`=`'train'`, `min_freq`:`int`=`2`, `n_labels`:`int`=`1`, `create_mtd`:[`TextMtd`](/text.data.html#TextMtd)=`<TextMtd.CSV: 1>`, `classes`:`Classes`=`None`)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L16">[source]</a>

This class shouldn't be initialized directly, as it relies on internal files being put in a 'tmp' folder of `path`. `tokenizer` and `vocab` will be used to tokenize and numericalize the texts (if needed). `max_vocab` and `min_freq` are passed on when the vocabulary is created (if needed). `chunksize` is the size of the chunks preprocessed when loading the data from csv or folders. `name` is the name of the set, which will be used to name the temporary files. `n_labels` is the number of labels if creating the data from a csv file. `classes` is the correspondence between labels and classes. `create_mtd` is an internal flag that tells the [`TextDataset`](/text.data.html#TextDataset) how it was created. It can be:
- `CSV` if it was created from texts or csv
- `TOK` if it was created from tokens (which means the [`TextDataset`](/text.data.html#TextDataset) will always skip the tokenization)
- `IDS` if it was created from ids (which means the [`TextDataset`](/text.data.html#TextDataset) will always skip the tokenization and the numericalization)
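
To give an idea of the role of `max_vocab` and `min_freq`, here is a rough sketch of a frequency-based vocabulary. It is an illustration under assumptions, not the library's implementation: `build_itos` and `numericalize` are hypothetical helpers, and the special tokens and their positions are assumed.

```python
from collections import Counter

def build_itos(tokenized_texts, max_vocab=60000, min_freq=2, specials=("xxunk", "xxpad")):
    "Keep at most `max_vocab` tokens seen at least `min_freq` times, after `specials`."
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    return list(specials) + kept

def numericalize(tokens, itos, unk_idx=0):
    "Map tokens to ids, sending unseen tokens to `unk_idx`."
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, unk_idx) for tok in tokens]

texts = [["the", "film", "was", "good"], ["the", "film", "was", "bad"]]
itos = build_itos(texts)
ids = numericalize(["the", "film", "was", "awful"], itos)
```

With `min_freq=2`, the words seen only once ("good", "bad") are left out of the vocabulary, and any word outside it is mapped to the unknown token.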

### Factory methods

Instead of using the [`TextDataset`](/text.data.html#TextDataset) init method, one of the following factory methods should be used:

In [None]:
show_doc(TextDataset.from_folder, doc_string=False)

#### <a id=from_folder></a>`from_folder`
> `from_folder`(`folder`:`PathOrStr`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `name`:`str`=`'train'`, `classes`:`Classes`=`None`, `shuffle`:`bool`=`True`, `kwargs`) -> `TextDataset`
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L164">[source]</a>

Creates a [`TextDataset`](/text.data.html#TextDataset) named `name` by scanning the subfolders in `folder` and using `tokenizer`. If `classes` are passed, only the subfolders named accordingly are checked. If `shuffle` is True, the data will be shuffled. Any additional `kwargs` are passed to the init method of [`TextDataset`](/text.data.html#TextDataset). 

In [None]:
show_doc(TextDataset.from_one_folder, doc_string=False)

#### <a id=from_one_folder></a>`from_one_folder`
> `from_one_folder`(`folder`:`PathOrStr`, `classes`:`Classes`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `name`:`str`=`'train'`, `shuffle`:`bool`=`True`, `kwargs`) -> `TextDataset`
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L143">[source]</a>

Creates a [`TextDataset`](/text.data.html#TextDataset) named `name` by scanning the text files in `folder` and using `tokenizer`. All files are labelled `classes[0]` so this is typically used for the test set. If `shuffle` is True, the data will be shuffled. Any additional `kwargs` are passed to the init method of [`TextDataset`](/text.data.html#TextDataset). 

In [None]:
show_doc(TextDataset.from_csv, doc_string=False)

#### <a id=from_csv></a>`from_csv`
> `from_csv`(`folder`:`PathOrStr`, `tokenizer`:[`Tokenizer`](/text.transform.html#Tokenizer)=`None`, `name`:`str`=`'train'`, `kwargs`) -> `TextDataset`
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L134">[source]</a>

Creates a [`TextDataset`](/text.data.html#TextDataset) named `name` with the texts in `name`.csv using `tokenizer`. Any additional `kwargs` are passed to the init method of [`TextDataset`](/text.data.html#TextDataset). 

In [None]:
show_doc(TextDataset.from_tokens, doc_string=False)

#### <a id=from_tokens></a>`from_tokens`
> `from_tokens`(`folder`:`PathOrStr`, `name`:`str`=`'train'`, `tok_suff`:`str`=`'_tok'`, `lbl_suff`:`str`=`'_lbl'`, `kwargs`) -> `TextDataset`
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L122">[source]</a>

Creates a [`TextDataset`](/text.data.html#TextDataset) named `name` from tokens and labels saved in `f{name}{tok_suff}.npy` and `f{name}{lbl_suff}.npy` respectively. Any additional `kwargs` are passed to the init method of [`TextDataset`](/text.data.html#TextDataset). 

In [None]:
show_doc(TextDataset.from_ids, doc_string=False)

#### <a id=from_ids></a>`from_ids`
> `from_ids`(`folder`:`PathOrStr`, `name`:`str`=`'train'`, `id_suff`:`str`=`'_ids'`, `lbl_suff`:`str`=`'_lbl'`, `itos`:`str`=`'itos.pkl'`, `kwargs`) -> `TextDataset`
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L110">[source]</a>

Creates a [`TextDataset`](/text.data.html#TextDataset) named `name` from ids, labels and dictionary saved in `f{name}{id_suff}.npy`, `f{name}{lbl_suff}.npy` and `itos` respectively. Any additional `kwargs` are passed to the init method of [`TextDataset`](/text.data.html#TextDataset). 

### Preprocessing

The internal preprocessing is done by the two following methods:

In [None]:
show_doc(TextDataset.tokenize)

#### <a id=tokenize></a>`tokenize`
> `tokenize`()


Tokenize the texts in the csv file. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L68">[source]</a>

In [None]:
show_doc(TextDataset.numericalize)

#### <a id=numericalize></a>`numericalize`
> `numericalize`()


Numericalize the tokens in the token file. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L87">[source]</a>

Internally, the [`TextDataset`](/text.data.html#TextDataset) will create a 'tmp' folder in which it will copy or save the following files:
- `name`.csv (if created from folders or csv)
- `name`\_tok.npy and `name`\_lbl.npy (created by [`TextDataset.tokenize`](/text.data.html#tokenize) from the last step or copied if created from tokens)
- `name`\_ids.npy, `name`\_lbl.npy and `itos` (created by [`TextDataset.numericalize`](/text.data.html#numericalize) from the last step or copied if created from ids)

Then, when you invoke the [`TextDataset`](/text.data.html#TextDataset) again, it will look for those temporary files and check their consistency before using them, to avoid redoing the tokenization or the numericalization. If you feel those files have been corrupted in any way, the following method will clear them from the 'tmp' subfolder:

In [None]:
show_doc(TextDataset.clear)

#### <a id=clear></a>`clear`
> `clear`()


Remove all temporary files. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L95">[source]</a>

### Internal methods

In [None]:
show_doc(TextDataset.check_ids)

#### <a id=check_ids></a>`check_ids`
> `check_ids`() -> `bool`


Check if a new numericalization is needed. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L48">[source]</a>

In [None]:
show_doc(TextDataset.check_toks)

#### <a id=check_toks></a>`check_toks`
> `check_toks`() -> `bool`


Check if a new tokenization is needed. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L60">[source]</a>

In [None]:
show_doc(TextDataset.general_check)

#### <a id=general_check></a>`general_check`
> `general_check`(`pre_files`:`Collection`\[`PathOrStr`\], `post_files`:`Collection`\[`PathOrStr`\])


Check that post_files exist and were modified after all the prefiles. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L41">[source]</a>
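
A freshness check of this kind can be sketched with file modification times. `is_up_to_date` is a hypothetical helper written for this page; it assumes `pre_files` is non-empty and that all of them exist.

```python
import os
import tempfile
from pathlib import Path

def is_up_to_date(pre_files, post_files):
    "True if every post file exists and was modified after every pre file."
    if not all(Path(f).exists() for f in post_files):
        return False
    newest_pre = max(os.path.getmtime(f) for f in pre_files)
    return all(os.path.getmtime(f) > newest_pre for f in post_files)

tmp = Path(tempfile.mkdtemp())
csv_file, tok_file = tmp / "train.csv", tmp / "train_tok.npy"
csv_file.write_text("0,some text\n")
fresh_before = is_up_to_date([csv_file], [tok_file])   # token file missing
tok_file.write_bytes(b"")
os.utime(csv_file, (1000, 1000))    # pretend the csv is older...
os.utime(tok_file, (2000, 2000))    # ...than the token file
fresh_after = is_up_to_date([csv_file], [tok_file])
os.utime(csv_file, (3000, 3000))    # the csv was edited afterwards
fresh_stale = is_up_to_date([csv_file], [tok_file])
```

When the check fails, the corresponding preprocessing step has to be run again.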

## Language Model data

A language model is trained to guess what the next word is inside a flow of words. We don't feed it the different texts separately but concatenate them all together in a big array. To create the batches, we split this array into `bs` chunks of continuous text. Note that in all NLP tasks, we use the PyTorch convention of sequence length being the first dimension (and batch size being the second one), so we transpose that array so that we can read the chunks of text in columns. Here is an example of a batch from our IMDB sample dataset.

In [None]:
data = get_imdb()
x,y = next(iter(data.train_dl))
example = x[:20,:10].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,\n,for,the,could,wells,are,big,.,for,my
1,xxbos,all,xxunk,have,',:,an,after,him,xxunk
2,xxfld,of,beer,ran,original,stop,incompetent,shooting,in,shot
3,1,five,xxunk,away,narrative,and,as,carradine,future,so
4,un,minutes,could,from,,xxunk,he,she,.,far
5,-,while,have,that,as,what,seemed,beats,\n,up
6,xxunk,the,been,thing,has,you,in,a,xxbos,my
7,-,protagonist,the,",",been,want,the,xxunk,xxfld,xxunk
8,believable,is,xxunk,but,noted,.,dream,xxunk,1,","
9,!,xxunk,for,anyway,.,use,.,before,after,i


Then, as suggested in [this article](https://arxiv.org/abs/1708.02182) from Stephen Merity et al., we don't use a fixed `bptt` through the different batches but slightly change it from batch to batch.

In [None]:
iter_dl = iter(data.train_dl)
for _ in range(5):
    x,y = next(iter_dl)
    print(x.size())

torch.Size([73, 64])
torch.Size([68, 64])
torch.Size([76, 64])
torch.Size([76, 64])
torch.Size([66, 64])
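
The varying sequence length can be sketched as below. The constants (a 5% chance of halving `bptt`, a standard deviation of 5, a minimum length of 5) are assumptions based on the scheme in the referenced article, not values read from this library, and `sequence_lengths` is a hypothetical helper.

```python
import random

def sequence_lengths(n_rows, bptt=70, seed=0):
    "Yield (start, seq_len) pairs covering `n_rows` rows with a jittered length."
    rng = random.Random(seed)
    i = 0
    while i < n_rows - 1:                       # the last row only serves as a target
        base = bptt if rng.random() < 0.95 else bptt / 2
        seq_len = max(5, int(rng.gauss(base, 5)))
        seq_len = min(seq_len, n_rows - 1 - i)  # do not run past the end
        yield i, seq_len
        i += seq_len

pairs = list(sequence_lengths(1000))
```

Each pair gives the row where a batch starts and how many rows it spans; only the final batch can be shorter than the minimum because it is clipped at the end of the data.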


This is all done internally when we use the following inside of the `text_data` functions.

In [None]:
show_doc(lm_data, doc_string=False)

#### <a id=lm_data></a>`lm_data`
> `lm_data`(`datasets`:`Collection`\[[`TextDataset`](/text.data.html#TextDataset)\], `path`:`PathOrStr`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L267">[source]</a>

Create a [`DataBunch`](/data.html#DataBunch) from the `datasets` for language modelling. Internally calls the next class, that takes the kwargs to define the dataloaders.

In [None]:
show_doc(LanguageModelLoader, doc_string=False)

## <a id=LanguageModelLoader></a>`class` `LanguageModelLoader`
> `LanguageModelLoader`(`dataset`:[`TextDataset`](/text.data.html#TextDataset), `bs`:`int`=`64`, `bptt`:`int`=`70`, `backwards`:`bool`=`False`)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L189">[source]</a>

Takes the texts from `dataset` and concatenates them all, then creates a big array with `bs` columns (transposed from the data source so that we read the texts in the columns). Yields batches with a sequence length approximately equal to `bptt` but changing at every batch. If `backwards` is True, reverses the original text.

In [None]:
show_doc(LanguageModelLoader.batchify, doc_string=False)

#### <a id=batchify></a>`batchify`
> `batchify`(`data`:`ndarray`) -> `LongTensor`
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L211">[source]</a>

Called at initialization to create the big array of text ids from the [`data`](/data.html#data) array.
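
The reshaping step can be illustrated with plain Python lists. This is a sketch of the idea only: the real method works on arrays and returns a `LongTensor`, and dropping the leftover ids at the end is an assumption.

```python
def batchify(ids, bs):
    "Arrange a flat list of ids as rows of length `bs`; each column is one contiguous chunk."
    n = len(ids) // bs                       # rows per chunk; the remainder is dropped
    chunks = [ids[k * n:(k + 1) * n] for k in range(bs)]
    return [[chunks[k][r] for k in range(bs)] for r in range(n)]   # transpose

stream = list(range(10))        # stand-in for all texts concatenated
data = batchify(stream, bs=3)
```

Reading the first column of `data` from top to bottom gives the first chunk of the stream in its original order.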

In [None]:
show_doc(LanguageModelLoader.get_batch)

#### <a id=get_batch></a>`get_batch`
> `get_batch`(`i`:`int`, `seq_len`:`int`) -> `Tuple`\[`LongTensor`, `LongTensor`\]


Create a batch at `i` of a given `seq_len`. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L218">[source]</a>
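
For a language model, the targets are the inputs shifted by one position. Below is a list-based sketch of that idea; flattening the targets and clipping `seq_len` near the end of the data are assumptions about the behaviour, not statements about the library's code.

```python
def get_batch(data, i, seq_len):
    "Return inputs `data[i:i+seq_len]` and targets shifted one row down, flattened."
    seq_len = min(seq_len, len(data) - 1 - i)     # leave one row for the last target
    x = data[i:i + seq_len]
    y = [tok for row in data[i + 1:i + 1 + seq_len] for tok in row]
    return x, y

# 4 rows (sequence positions) by 2 columns (batch items), made-up ids.
data = [[0, 10], [1, 11], [2, 12], [3, 13]]
x, y = get_batch(data, i=0, seq_len=2)
```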

## Classifier data

When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:
- padding: each text is padded with the `PAD` token so that all the texts picked for a batch have the same size
- sorting the texts (ish): to avoid putting a very long text together with a very short one (which would then need a lot of `PAD` tokens), we group the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.

Here is an example of batch with padding (the padding index is 1, and the padding is applied before the sentences start).

In [None]:
data = get_imdb(classifier=True)
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[:20,-10:]

tensor([[   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1],
        [  42,   42,   42,   42,    1,    1,    1,    1,    1,    1],
        [  43,   43,   43,   43,   42,   42,   42,    1,    1,    1],
        [  44,   44,   44,   44,   43,   43,   43,   42,    1,    1],
        [  39,   39,   39,   39,   44,   44,   44,   43,   42,   42],
        [  19,    0,   12,  224,   39,   39,   39,   44,   43,   43],
        [  15,  772,  307, 2118,   12,  111, 5128,   39,   44,   44],
        [2753,    5, 1809, 1119,  661, 1224,   53,   14,   39,   39],
        [5972,    0,  891,    3,  792,    4,  315,   25,   14,   14],
        [  15,    0,    4,   19,    6,   24,  344,    9,   22,   17],
        [   9,  249,   12,   12,  218,    2,  303,   53,   17,    6],
        [   6,   64,

All of this is done behind the scenes when calling the following in a `text_data` function.

In [None]:
show_doc(classifier_data, doc_string=False)

#### <a id=classifier_data></a>`classifier_data`
> `classifier_data`(`datasets`:`Collection`\[[`TextDataset`](/text.data.html#TextDataset)\], `path`:`PathOrStr`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L272">[source]</a>

Create a [`DataBunch`](/data.html#DataBunch) in `path` from the `datasets` for text classification. The kwargs are passed to the next classes and can contain: the batch size `bs`, the `bptt`, the padding index `pad_idx` and whether or not to pad at the beginning with `pad_first`.

In [None]:
show_doc(SortSampler, doc_string=False)

## <a id=SortSampler></a>`class` `SortSampler`
> `SortSampler`(`data_source`:`NPArrayList`, `key`:`KeyFunc`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L223">[source]</a>

PyTorch [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) to batchify the `data_source` by order of length of the texts. Used for the validation and (if applicable) the test set. 
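
Ordering by length amounts to sorting indices with a key function. In this sketch, `sort_sampler` is a hypothetical function and going from the longest text to the shortest is an assumption.

```python
def sort_sampler(data_source, key):
    "Indices of `data_source` ordered from longest to shortest according to `key`."
    return sorted(range(len(data_source)), key=key, reverse=True)

texts = [[5, 6], [7, 8, 9, 10], [11], [12, 13, 14]]   # made-up id sequences
order = sort_sampler(texts, key=lambda i: len(texts[i]))
```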

In [None]:
show_doc(SortishSampler, doc_string=False)

## <a id=SortishSampler></a>`class` `SortishSampler`
> `SortishSampler`(`data_source`:`NPArrayList`, `key`:`KeyFunc`, `bs`:`int`) :: [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L232">[source]</a>

PyTorch [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) to batchify with size `bs` the `data_source` by order of length of the texts with a bit of randomness. Used for the training set.
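
One possible way to get "sorted with a bit of randomness" is sketched below: shuffle everything, sort by length inside large chunks, then shuffle the resulting batches. The chunk size of `bs * 50` and the function `sortish_indices` are assumptions made for the example; the library's sampler may differ in its details.

```python
import random

def sortish_indices(lengths, bs, seed=0):
    "Shuffle indices, sort by length inside large chunks, then shuffle the batches."
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    chunk = bs * 50                                   # assumed chunk size
    chunks = [idxs[i:i + chunk] for i in range(0, len(idxs), chunk)]
    sorted_idxs = [i for c in chunks
                   for i in sorted(c, key=lambda j: lengths[j], reverse=True)]
    batches = [sorted_idxs[i:i + bs] for i in range(0, len(sorted_idxs), bs)]
    rng.shuffle(batches)
    return [i for b in batches for i in b]

length_rng = random.Random(1)
lengths = [length_rng.randint(5, 200) for _ in range(1000)]   # made-up text lengths
order = sortish_indices(lengths, bs=8)
```

Every index appears exactly once, texts of similar length tend to share a batch, and a different seed gives a different order of batches.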

In [None]:
show_doc(pad_collate, doc_string=False)

#### <a id=pad_collate></a>`pad_collate`
> `pad_collate`(`samples`:`BatchSamples`, `pad_idx`:`int`=`1`, `pad_first`:`bool`=`True`) -> `Tuple`\[`LongTensor`, `LongTensor`\]
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L253">[source]</a>

Function used by the pytorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) to collate the `samples` in batches while adding padding with `pad_idx`. If `pad_first` is True, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end.
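
The padding logic can be sketched with plain lists. This version keeps one row per text and returns lists, whereas the real function returns tensors; the sample ids and labels are made up.

```python
def pad_collate(samples, pad_idx=1, pad_first=True):
    "Pad every (ids, label) sample to the longest length; return (padded ids, labels)."
    max_len = max(len(ids) for ids, _ in samples)
    padded = []
    for ids, _ in samples:
        padding = [pad_idx] * (max_len - len(ids))
        padded.append(padding + list(ids) if pad_first else list(ids) + padding)
    labels = [label for _, label in samples]
    return padded, labels

samples = [([42, 43, 44], 0), ([42, 43, 44, 39, 19], 1)]   # made-up ids and labels
x, y = pad_collate(samples)
```

With `pad_first=False`, the same call would place the two padding ids after `44` instead of before `42`.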

## Undocumented Methods - Methods moved below this line will intentionally be hidden

`TextMtd = IntEnum('TextMtd', 'CSV TOK IDS')` <div style="text-align: right"><a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L8">[source]</a></div>

In [None]:
show_doc(standard_data)

#### <a id=standard_data></a>`standard_data`
> `standard_data`(`datasets`:`Collection`\[[`DatasetBase`](/data.html#DatasetBase)\], `path`:`PathOrStr`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)


Simply create a [`DataBunch`](/data.html#DataBunch) from the `datasets`. <a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L263">[source]</a>

## New Methods - Please document or move to the undocumented section

In [None]:
show_doc(read_classes)

#### <a id=read_classes></a>`read_classes`
> `read_classes`(`fname`)
<a href="https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L12">[source]</a>