<a href="https://colab.research.google.com/github/aleksgeorgi/NLP_with_Pytorch/blob/main/03_01b_PP_Build_Train_Val_Test_Vocab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install torchtext==0.10.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.10.0
  Downloading torchtext-0.10.0-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1+cu113
    Uninstalling torch-1.12.1+cu113:
      Successfully uninstalled torch-1.12.1+cu113
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.13.1
    Uninstalling torchtext-0.13.1:
      Successfully uninstalled torchtext-0.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1

In [1]:
import torch
from torchtext.legacy import data, datasets
import random

In [2]:
seed = 966
torch.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


**Fields**

[Check documentation](https://pytorch.org/text/_modules/torchtext/data/field.html)

In [6]:
# define fields: TEXT tokenizes with spaCy and lowercases tokens; LABEL handles the class labels
TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.LabelField()
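
Conceptually, a text field describes how a raw string becomes a list of tokens. The snippet below is a rough sketch of that idea only, not torchtext's implementation: `preprocess` is a hypothetical helper standing in for `TEXT`, and whitespace splitting stands in for the spaCy tokenizer (an assumption that is workable here because TREC questions already separate punctuation with spaces).

```python
def preprocess(text, lower=True):
    """Hypothetical stand-in for a text field: tokenize, then lowercase."""
    # Whitespace splitting is an assumption; the notebook uses spaCy instead.
    tokens = text.split()
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return tokens


tokens = preprocess("Who discovered electricity ?")
print(tokens)
```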



The following is a dataset built into torchtext:

**Text REtrieval Conference (TREC) Question Classification Dataset**

*Data Examples and Six Categories:*

| Text | Label | Category |
| --- | --- | --- |
| CNN is the abbreviation for what ? | ABBR | ABBREVIATION |
| What is the date of Boxing Day ? | NUM | NUMERIC |
| Who discovered electricity ? | HUM | HUMAN |
| What 's the colored part of the eye called ? | ENTY | ENTITY |
| Why do horseshoes bring luck ? | DESC | DESCRIPTION |
| What is California 's capital ? | LOC | LOCATION |

In [8]:
train, test = datasets.TREC.splits(TEXT, LABEL)
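# Hold out a validation set from the training data (default split_ratio is 0.7).
# random.seed() returns None, so seed first and pass the generator state explicitly
# to make the split reproducible.
random.seed(seed)
train, val = train.split(random_state=random.getstate())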
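
To see why a reproducible split needs an actual generator state, here is a minimal sketch in plain Python. `split_examples` is a hypothetical helper, not part of torchtext; it only mimics the idea of shuffling with a given state and cutting at a ratio. Note that `random.seed(...)` returns `None`, so its return value cannot serve as a state.

```python
import random


def split_examples(examples, split_ratio=0.7, random_state=None):
    """Hypothetical helper: shuffle a copy of examples, then cut at split_ratio."""
    rng = random.Random()
    if random_state is not None:
        # Restore the exact generator state so the shuffle is repeatable.
        rng.setstate(random_state)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


seed = 966
seed_result = random.seed(seed)  # seed() reseeds the global generator and returns None
state = random.getstate()        # capture the seeded state explicitly

first = split_examples(range(10), random_state=state)
second = split_examples(range(10), random_state=state)
print(seed_result, first == second)
```

Passing the same captured state twice yields the same two partitions, which is the property the validation split relies on.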

In [9]:
vars(train[-1])  # inspect the last example in the training data

{'text': ['how', 'do', 'you', 'say', '2', 'in', 'latin', '?'], 'label': 'ENTY'}

In [12]:
# build vocab
# min_freq=2 keeps only words that appear at least twice in the training data,
# which reduces the number of unique words in the TEXT vocabulary
TEXT.build_vocab(train, min_freq=2)
LABEL.build_vocab(train)
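
The effect of a minimum frequency can be sketched with a plain `Counter`. This is an illustration under assumptions rather than torchtext's own code: `build_vocab` here is a hypothetical function, the toy `corpus` is made up, and the `<unk>` and `<pad>` special tokens are placed first by choice.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_freq=1, specials=("<unk>", "<pad>")):
    """Hypothetical helper: index tokens that occur at least min_freq times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials)
    # Most frequent first; ties broken alphabetically for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


# Made-up toy corpus, already tokenized and lowercased.
corpus = [
    ["what", "is", "the", "capital", "?"],
    ["who", "is", "the", "author", "?"],
]
stoi, itos = build_vocab(corpus, min_freq=2)
print(itos)

# Tokens below the threshold fall back to the <unk> index.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["what", "is", "the", "?"]]
print(ids)
```

Words seen only once ("what", "capital", "who", "author") are dropped from the vocabulary and map to `<unk>` when converted to indices.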

In [13]:
print(LABEL.vocab.stoi)

defaultdict(None, {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})


In [14]:
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))

Vocabulary size of TEXT: 2643
Vocabulary size of LABEL: 6


In [15]:
# construct an iterator for each split:

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, val, test),
    batch_size = 64,
    sort_key=lambda x: len(x.text), 
    device=device
)



But why do we need iterators? <br>
`BucketIterator.splits` turns the training, validation, and test datasets (the first argument) into batches. <br>
`batch_size = 64` means each batch holds 64 examples. <br>
`sort_key` orders examples by sentence length, so texts of similar length are batched together and less padding is needed. <br>
Finally, `device` places each batch on the GPU when one is available (otherwise the CPU), which speeds up training. <br>
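
The padding benefit of grouping by length can be checked with a small sketch. This is a simplification under assumptions: the sentence lengths are randomly generated, the helpers `make_batches` and `padding_needed` are hypothetical, and a full sort is used where the bucket iterator only groups examples of similar length.

```python
import random


def make_batches(lengths, batch_size):
    """Chunk a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_needed(batches):
    """Count pad tokens if every batch is padded to its longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)


rng = random.Random(966)
lengths = [rng.randint(3, 30) for _ in range(256)]  # made-up sentence lengths

unsorted_pad = padding_needed(make_batches(lengths, 64))
bucketed_pad = padding_needed(make_batches(sorted(lengths), 64))
print(unsorted_pad, bucketed_pad)
```

Batching length-sorted examples never needs more padding than batching them in arbitrary order, which is why a length-based `sort_key` helps.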

That's all for preprocessing the text dataset with PyTorch.