# Classification using PyTorch and Torchtext

- We train a bidirectional LSTM to classify news articles
- The training data is a labeled news dataset

- Torchtext helps us numericalize and load the data
- Torchtext is backed by PyTorch, so the two work well together
- Torchtext is not meant to replace spaCy; spaCy is still the better general-purpose NLP library

- PyTorch helps us build the neural network

In [1]:
import torch, torchdata, torchtext
from torch import nn

import time

#options for getting a GPU:
#1. puffer - it's outdated....
#2. spend some money - about 300 baht gets you Colab Pro

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#reproducibility 
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

cpu


In [2]:
torch.__version__

'1.13.0'

In [3]:
torchtext.__version__

'0.14.0'

In [4]:
torchdata.__version__

'0.5.1'

In [5]:
# torch.cuda.get_device_name(0)

## 1. Load the dataset

We make our life easy by using a ready-to-use dataset from torchtext.

- In your assignment, I will ask you to use Penn Treebank instead.

In [6]:
#if you are using puffer
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

from torchtext.datasets import AG_NEWS
train, test = AG_NEWS()

In [7]:
train  #a datapipe object from torchdata: it streams samples one at a time (via yield)

ShardingFilterIterDataPipe
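
Because `train` is a streaming object rather than an in-memory list, it can be iterated but not indexed or measured directly, which is why the cells below wrap it in `next(iter(...))` or `list(iter(...))`. The sketch below imitates that behaviour in plain Python. `StreamingPairs` and its sample texts are hypothetical stand-ins, not torchdata's implementation.

```python
class StreamingPairs:
    """Hypothetical stand-in for an iterable datapipe: it yields samples
    one at a time and deliberately defines no __len__ or __getitem__."""

    def __init__(self, samples):
        self._samples = samples

    def __iter__(self):
        # A generator: each sample is produced only when it is asked for.
        for label, text in self._samples:
            yield label, text


# Made-up (label, text) pairs in the same shape that AG_NEWS yields.
pipe = StreamingPairs([
    (3, "Stocks rally on earnings"),
    (2, "Home team wins the final"),
    (4, "New chip doubles battery life"),
])

first = next(iter(pipe))         # peek at the first sample only
count = sum(1 for _ in pipe)     # counting needs a full pass over the stream
has_len = hasattr(pipe, "__len__")

print(first, count, has_len)
```

Each call to `iter(pipe)` restarts the stream from the beginning, so peeking at the first sample does not consume it for later passes.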

## 2. EDA - exploratory data analysis
- check the most common words
- look at a few random samples to see what the text looks like, so that we can design a suitable neural network
- visualize some statistics
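
A minimal sketch of the first and third checks, using only the standard library. The four samples are made up for illustration, and whitespace splitting stands in for the real tokenizer; the label names follow the AG_NEWS mapping used in this notebook.

```python
from collections import Counter

LABEL_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Made-up samples in the (label, text) shape that the dataset yields.
samples = [
    (3, "Oil prices rise as markets open"),
    (2, "Champions win the cup final"),
    (3, "Markets slide as oil prices fall"),
    (4, "New phone chip cuts power use"),
]

# Class balance: how many samples per label.
label_counts = Counter(LABEL_NAMES[label] for label, _ in samples)

# Common words: lowercase and split on whitespace (a simplification).
word_counts = Counter(
    word for _, text in samples for word in text.lower().split()
)

# Length statistics help decide how much padding a batch will need.
lengths = [len(text.split()) for _, text in samples]
avg_length = sum(lengths) / len(lengths)

print(label_counts.most_common())
print(word_counts.most_common(4))
print(avg_length)
```

On the real data, the same counters can be filled by looping over `train` once.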

In [8]:
next(iter(train))  #train is a stream, so peek at its first sample
# ("World", "Sports", "Business", "Sci/Tech")
#  1,        2,        3,          4

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [9]:
list(iter(train))[100]  #a stream cannot be indexed, so turn it into a list first
# ("World", "Sports", "Business", "Sci/Tech")
#  1,        2,        3,          4

(4,
 'Comets, Asteroids and Planets around a Nearby Star (SPACE.com) SPACE.com - A nearby star thought to harbor comets and asteroids now appears to be home to planets, too. The presumed worlds are smaller than Jupiter and could be as tiny as Pluto, new observations suggest.')

In [10]:
set([y for y, x in list(iter(train))])

{1, 2, 3, 4}

In [11]:
train_size = len(list(iter(train)))
train_size

120000

In [12]:
train

ShardingFilterIterDataPipe

In [13]:
#I'm going to cheat a little and not use all the data....otherwise my fans will work too hard
too_much, train, valid = train.random_split(total_length=train_size, 
                                            weights = {"too_much": 0.7, 
                                                       "smaller_train": 0.2,
                                                       "valid": 0.1},
                                            seed = SEED)

In [14]:
train_size = len(list(iter(train)))
val_size   = len(list(iter(valid)))
test_size  = len(list(iter(test)))

In [15]:
train_size, val_size, test_size

(24000, 12000, 7600)
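
The split sizes follow directly from the weights: 0.2 × 120000 = 24000 training samples and 0.1 × 120000 = 12000 validation samples, with the remaining 84000 set aside. The sketch below reproduces that arithmetic with a seeded shuffle in plain Python. `split_indices` is a hypothetical helper written for illustration, not how torchdata implements `random_split`.

```python
import random


def split_indices(total_length, weights, seed):
    """Shuffle the indices once, then cut them into consecutive chunks
    whose sizes are proportional to the weights."""
    indices = list(range(total_length))
    random.Random(seed).shuffle(indices)

    total_weight = sum(weights.values())
    names = list(weights)
    splits, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = total_length  # the last split takes whatever remains
        else:
            end = start + round(total_length * weights[name] / total_weight)
        splits[name] = indices[start:end]
        start = end
    return splits


splits = split_indices(
    total_length=120000,
    weights={"too_much": 0.7, "smaller_train": 0.2, "valid": 0.1},
    seed=1234,
)
sizes = {name: len(idx) for name, idx in splits.items()}
print(sizes)
```

Fixing the seed makes the same samples land in the same split on every run, which matters here because the vocabulary is later built from the training split only.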

## 3. Preprocessing

In [16]:
## 3.1 Tokenizing

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

#check whether the tokenizer works.....
# tokens    = tokenizer("Chaky likes deep learning very much and wants his \
                    #   student to be number 1 in Asia....")
# tokens



In [17]:
next(iter(train))

(3,
 'Safety Net (Forbes.com) Forbes.com - After earning a PH.D. in Sociology, Danny Bazil Riley started to work as the general manager at a commercial real estate firm at an annual base salary of  #36;70,000. Soon after, a financial planner stopped by his desk to drop off brochures about insurance benefits available through his employer. But, at 32, "buying insurance was the furthest thing from my mind," says Riley.')

In [18]:
## 3.2 Numericalization

from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(data_iter):  #data_iter, e.g., train
    for _, text in data_iter:
        yield tokenizer(text)
        
vocab = build_vocab_from_iterator(yield_tokens(train), specials=['<unk>', '<pad>',
                                                                 '<bos>', '<eos>'])

In [19]:
vocab.set_default_index(vocab["<unk>"]) #if a word is not in the vocab, map it to the id of <unk>

In [20]:
vocab(['Chaky', 'wants', 'his', 'student', 'to', 'be', 'number', '1', '.'])

[0, 944, 38, 3956, 8, 43, 498, 109, 6]

In [21]:
id2word = vocab.get_itos()

In [22]:
id2word[0]

'<unk>'

In [23]:
vocab(['<pad>', '<bos>', '<eos>'])

[1, 2, 3]

In [24]:
len(vocab)  #about 52k unique tokens, including the 4 special tokens

52686
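
To make the numericalization step concrete, here is a small plain-Python sketch of what a vocabulary does: special tokens get the first ids, every other token gets the next free id, and unknown tokens fall back to a default index. `TinyVocab` is a hypothetical class for illustration only; its ordering rule (most frequent first, ties broken alphabetically) is an assumption of this sketch, and the two-sentence corpus is made up.

```python
from collections import Counter


class TinyVocab:
    """Hypothetical, simplified stand-in for a torchtext vocab."""

    def __init__(self, token_lists, specials):
        counts = Counter(tok for toks in token_lists for tok in toks)
        # Specials come first, then tokens by descending frequency.
        ordered = sorted(counts, key=lambda t: (-counts[t], t))
        self.itos = list(specials) + [t for t in ordered if t not in specials]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index = None

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            raise KeyError(token)
        return self.default_index  # unknown token

    def __call__(self, tokens):
        return [self[tok] for tok in tokens]

    def __len__(self):
        return len(self.itos)


corpus = [["he", "likes", "deep", "learning"], ["he", "likes", "sushi"]]
tiny = TinyVocab(corpus, specials=["<unk>", "<pad>", "<bos>", "<eos>"])
tiny.set_default_index(tiny["<unk>"])

ids = tiny(["Chaky", "likes", "sushi"])   # "Chaky" was never seen
special_ids = tiny(["<pad>", "<bos>", "<eos>"])
print(ids, special_ids, len(tiny))
```

As in the cells above, the unseen word maps to id 0 (`<unk>`), and the remaining special tokens take ids 1 to 3.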

## 4. FastText embedding

We will insert these pretrained embeddings into the neural network later.

In [25]:
from torchtext.vocab import FastText
fast_vectors = FastText(language='simple')

In [26]:
fast_embedding = fast_vectors.get_vecs_by_tokens(vocab.get_itos()).to(device)

In [27]:
fast_embedding.shape #(vocab size, 300) == (52k, 300)

torch.Size([52686, 300])

In [28]:
#look up the FastText embedding of word id 100
fast_embedding[100][:10] #first 10 of the 300 dimensions for word id 100

tensor([-0.0935,  0.0915,  0.2640,  0.0387,  0.0843,  0.3809, -0.1776,  0.1745,
        -0.0362, -0.0278])
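
One common way to get a pretrained matrix like `fast_embedding` into a network is `nn.Embedding.from_pretrained`. The sketch below is an assumption about how this could be done, not necessarily how the model in this notebook does it. A small random matrix stands in for the real FastText vectors, and the choice to freeze the weights is illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, emb_dim, pad_ix = 10, 300, 1       # hypothetical sizes
pretrained = torch.randn(vocab_size, emb_dim)  # stand-in for fast_embedding

# freeze=True keeps the pretrained vectors fixed during training;
# padding_idx marks the <pad> row so it never receives gradient updates.
embedding = nn.Embedding.from_pretrained(
    pretrained, freeze=True, padding_idx=pad_ix
)

token_ids = torch.tensor([[4, 5, 1], [6, 1, 1]])  # (batch, seq len), padded
vectors = embedding(token_ids)                    # (batch, seq len, emb_dim)

print(vectors.shape)
```

Row `i` of the matrix must correspond to word id `i` in the vocab, which is why `fast_embedding` is built from `vocab.get_itos()` in that exact order.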

## 5. Preparing dataloader

Optional: you can either write your own batch loader or use the PyTorch `DataLoader`, as we do here.

In [29]:
text_pipeline  = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1  #1, 2, 3, 4 ---> 0, 1, 2, 3

In [30]:
#why padding?
#
#sequences in the same batch must have the same length, e.g., batch size = 2
#
#"chaky eat sushi" ==> "chaky", "eat", "sushi" ==> 0, 22, 11
#"chaky sleep"     ==> "chaky", "sleep"        ==> 0, 99, 1
#
#the shorter sequence is filled with the <pad> id (1) up to the longest sequence in the batch

In [31]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence #making each batch same length

pad_ix = vocab['<pad>']

#DataLoader calls this function on every batch of raw (label, text) samples
def collate_batch(batch):
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        length_list.append(processed_text.size(0)) #true length before padding, so padding can be ignored later
        
    return torch.tensor(label_list, dtype=torch.int64), \
        pad_sequence(text_list, padding_value=pad_ix, batch_first=True), \
        torch.tensor(length_list, dtype=torch.int64)
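
To see what `pad_sequence` does inside `collate_batch`, the sketch below pads the two id sequences from the "chaky eat sushi" example (0, 22, 11 and 0, 99). The ids are illustrative rather than real vocab ids, and `pad_ix = 1` matches the `<pad>` id found earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_ix = 1  # id of <pad>

# Two sequences of different length, as they look before batching.
seqs = [torch.tensor([0, 22, 11]), torch.tensor([0, 99])]

# Record the true lengths before padding hides them.
lengths = torch.tensor([s.size(0) for s in seqs])

# Pads every sequence to the longest one in this batch.
padded = pad_sequence(seqs, padding_value=pad_ix, batch_first=True)

print(padded)
print(lengths)
```

Padding goes only as far as the longest sequence in the current batch, so different batches can have different widths.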

In [32]:
batch_size = 64

train_loader = DataLoader(train, batch_size = batch_size,
                          shuffle=True, collate_fn=collate_batch)

#no need to shuffle the validation and test data
val_loader   = DataLoader(valid, batch_size = batch_size,
                          shuffle=False, collate_fn=collate_batch)

test_loader  = DataLoader(test, batch_size = batch_size,
                          shuffle=False, collate_fn=collate_batch)


In [33]:
for label, text, length in train_loader:
    break

label, text, length  #why do we need length? so that we can ignore the padding later

(tensor([3, 1, 2, 2, 2, 3, 2, 2, 0, 0, 3, 3, 0, 1, 3, 3, 0, 3, 2, 0, 3, 2, 0, 3,
         3, 0, 3, 2, 2, 3, 1, 2, 3, 3, 1, 0, 3, 2, 1, 3, 2, 0, 2, 1, 0, 2, 2, 1,
         0, 2, 1, 2, 1, 0, 1, 1, 1, 3, 0, 0, 3, 0, 1, 3]),
 tensor([[  446,    16,    19,  ...,     1,     1,     1],
         [16120,     9,  3652,  ...,     1,     1,     1],
         [14229,  1282,   322,  ...,     1,     1,     1],
         ...,
         [ 3125,  2475,   613,  ...,     1,     1,     1],
         [ 1904,  2821,    32,  ...,     1,     1,     1],
         [  446,  7514,  7551,  ...,     1,     1,     1]]),
 tensor([56, 74, 50, 38, 77, 53, 30, 51, 52, 55, 63, 35, 25, 56, 50, 16, 33, 38,
         44, 59, 48, 69, 53, 28, 44, 57, 38, 51, 45, 71, 44, 49, 42, 50, 88, 43,
         38, 33, 79, 43, 49, 49, 39, 32, 51, 41, 64, 35, 49, 44, 44, 55, 49, 27,
         57, 28, 40, 43, 51, 58, 43, 35, 55, 76]))
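
The lengths are typically used with `pack_padded_sequence`, which tells the LSTM where each sequence really ends so that it skips the padded positions. The sketch below shows the idea on a tiny hand-made batch. The layer sizes, the token ids, and the way the two directions are concatenated are assumptions for illustration, since the actual model is not defined in this section.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

pad_ix, emb_dim, hid_dim = 1, 8, 16  # hypothetical sizes

# A padded batch in the same layout that collate_batch returns.
text = torch.tensor([[5, 6, 7, 8],
                     [9, 10, pad_ix, pad_ix]])
length = torch.tensor([4, 2])

embedding = nn.Embedding(20, emb_dim, padding_idx=pad_ix)
lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)

# Lengths must live on the CPU; enforce_sorted=False accepts any order.
packed = pack_padded_sequence(
    embedding(text), length.cpu(), batch_first=True, enforce_sorted=False
)
_, (hidden, _) = lstm(packed)  # hidden: (2 directions, batch, hid_dim)

# Join the final forward and backward states into one vector per sample.
sentence_vec = torch.cat([hidden[-2], hidden[-1]], dim=1)

print(hidden.shape, sentence_vec.shape)
```

Without packing, the forward direction would keep reading `<pad>` tokens after the real text ends, and the backward direction would start from them, so the final hidden states would depend on how much padding a batch happened to need.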