TorchText has many canonical datasets included for classification, language modelling, sequence tagging, etc. However, frequently you'll be wanting to use your own datasets. Luckily, TorchText has functions to help you to this.

Recall in the first notebook number #1
- defined the `Field`s
- loaded the dataset
- created the splits

As a reminder, the code is shown below:

```python
TEXT = data.Field()
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

train_data, valid_data = train_data.split()
```

**There are three data formats TorchText can read: `json`, `tsv` (tab separated values) and`csv` (comma separated values).**


In [1]:
import pandas as pd
from torchtext import data
from torchtext import datasets
import torch

In [2]:
df_train = pd.read_csv('./data/imdb_train.csv')
df_train.head()

df_test = pd.read_csv('./data/imdb_test.csv')
df_train.head()

df_val = pd.read_csv('./data/imdb_val.csv')
df_test.head()

Unnamed: 0,id,sentiment,review
0,8169_4,0,"The movie starts with a pair of campers, a man..."
1,7830_10,1,"In \Die Nibelungen: Siegfried\"", Siegfried was..."
2,3719_9,1,Just caught it at the Toronto International Fi...
3,4402_1,0,Usually I love Lesbian movies even when they a...
4,11134_9,1,"Acidic, unremitting, and beautiful, John Schle..."


In [3]:
len(df_train), len(df_test), len(df_val)

(15000, 5000, 5000)

## First we define our fields and labels

In [4]:
TEXT = data.Field(tokenize = 'spacy', batch_first=True)
LABEL = data.LabelField(sequential=False, dtype = torch.float, batch_first=True,use_vocab=False)

We now use a list of tuples, where each element is also a tuple. The first element of these inner tuples will become the batch object's attribute name, second element is the `Field` name.

The tuples have to be in the same order that they are within the `tsv` data. Due to this, when skipping a column of data a tuple of `None`s needs to be used

In [5]:
#           id              sentiment        review
fields = [(None, None),('sentiment',LABEL),('review',TEXT)]

If your data has a header, which ours does, it must be skipped by passing `skip_header = True`. If not, TorchText will think the header is an example. By default, `skip_header` will be `False`.

In [6]:
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = './data',
                                        train = 'imdb_train.csv',
                                        validation = 'imdb_val.csv',
                                        test = 'imdb_test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

In [7]:
TEXT.build_vocab(train_data,max_size = 25000)

In [8]:
TEXT.vocab.freqs.most_common(10)

[('the', 173110),
 (',', 163792),
 ('.', 140517),
 ('and', 93819),
 ('a', 93475),
 ('of', 86009),
 ('to', 80173),
 ('is', 65575),
 ('in', 52610),
 ('I', 45963)]

In [9]:
print(vars(train_data[0]))

{'sentiment': '1', 'review': ['So', 'fortunate', 'were', 'we', 'to', 'see', 'this', 'fantastic', 'film', 'at', 'the', 'Palm', 'Springs', 'International', 'Film', 'festival', '.', 'Upon', 'entering', 'the', 'theater', 'we', 'were', 'handed', 'a', 'small', 'opinion', 'card', 'that', 'would', 'be', 'used', 'for', 'our', 'personal', 'rating', 'of', 'the', 'film', '.', 'Looking', 'at', 'the', 'card', 'I', 'turned', 'to', 'my', 'wife', 'and', 'said', ',', '\\How', 'many', 'movies', 'in', 'your', 'life', 'do', 'you', 'think', 'you', 'can', 'rate', 'as', 'superb', '?', 'Only', 'about', '5', 'for', 'me.\\', '"', 'But', 'then', 'watching', 'the', 'interaction', 'between', 'Peter', 'Falk', 'and', 'Paul', 'Reiser', 'while', 'viewing', 'the', 'spectacular', 'scenery', 'in', 'the', 'film', "'s", 'setting', 'of', 'New', 'York', 'state', ',', 'I', 'slowly', 'starting', 'bumping', 'the', 'movie', 'up', 'a', 'category', 'at', 'a', 'time', '.', 'Certainly', 'it', 'was', 'good', 'but', 'the', 'totally', '

## Iterators 


We build the vocab and create the iterators.

By default, the train data is shuffled each epoch, but the validation/test data is sorted. However, TorchText doesn't know what to use to sort our data and it would throw an error if we don't tell it. 

There are two ways to handle this, you can either tell the iterator not to sort the validation/test data by passing `sort = False`, or you can tell it how to sort the data by passing a `sort_key`. A sort key is a function that returns a key on which to sort the data on. For example, `lambda x: x.s` will sort the examples by their `s` attribute, i.e their quote. Ideally, you want to use a sort key as the `BucketIterator` will then be able to sort your examples and then minimize the amount of padding within each batch.

We can then iterate over our iterator to get batches of data. Note how by default TorchText has the batch dimension second but we added `batch_first=True`

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 4

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort_key = lambda x: x.sentiment, #sort by s attribute (sentiment)
    batch_size=BATCH_SIZE,
    device=device)



In [11]:
data = next(iter(train_iterator))

In [12]:
data


[torchtext.data.batch.Batch of size 4]
	[.sentiment]:[torch.cuda.FloatTensor of size 4 (GPU 0)]
	[.review]:[torch.cuda.LongTensor of size 4x172 (GPU 0)]

In [13]:
for batch in train_iterator:
    print("batch.sentiment.size:")
    print(batch.sentiment.size())
    print("batch.review:")
    print(batch.review.size())

    break

batch.sentiment.size:
torch.Size([4])
batch.review:
torch.Size([4, 416])
