<h1>Building a Dataset Class for NLP Text Applications</h1>








<h1><span style='color:yellow'>NLP applications require several preprocessing steps on the raw text before it can be fed to a model.</span></h1>

<h3><span style='color:yellow'>Torchtext: A powerful library for text preprocessing.</span></h3>
<h3><span style='color:yellow'>Its capabilities include:</span></h3>

<ul style='font-size: 1.2em;'>
    <li>File loading</li>
    <li>Tokenization</li>
    <li>Vocabulary building</li>
    <li>Numericalization/Indexing</li>
    <li>Word embedding</li>
    <li>Batching</li>
    <li>Embedding lookup: Mapping sentences to fixed-dimension word vectors</li>
</ul>



<h1><span style='color:yellow'>Text Preprocessing Pipeline:</span></h1>

<ul style='font-size: 1.2em;'>
    <li>Tokenization: Split a sentence into a sequence of tokens, such as ["Hello", "world", "."].</li>
    <li>Vocabulary: Assign each unique token an integer index, for example, {"Hello": 0, "world": 1, ...}.</li>
    <li>Numericalization: Replace each token in the sequence with its vocabulary index to build the feature vector, like [0, 1, ...].</li>
    <li>Embedding Lookup: Map each index to a d-dimensional embedding vector that represents the word.</li>
    <li>The d-dimensional vectors can be sourced from pretrained embeddings such as GloVe or FastText; the indices produced by numericalization, like [5, 1, ...], select the matching rows of the embedding table.</li>
</ul>
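
The steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`tokenize`, `build_vocab`, `numericalize`), the regex tokenizer, and the randomly initialized embedding table are hypothetical stand-ins, not torchtext's implementation.

```python
import random
import re

def tokenize(sentence):
    # Split into word tokens and single punctuation marks (illustrative rule only).
    return re.findall(r"\w+|[^\w\s]", sentence)

def build_vocab(token_lists):
    # Assign indices in order of first appearance (hypothetical scheme).
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def numericalize(tokens, vocab):
    # Replace each token with its vocabulary index.
    return [vocab[token] for token in tokens]

tokens = tokenize("Hello world.")
vocab = build_vocab([tokens])
indices = numericalize(tokens, vocab)

# Embedding lookup: one d-dimensional row per vocabulary entry.
# Random values stand in for pretrained vectors such as GloVe.
d = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(d)] for _ in vocab]
vectors = [embedding_table[i] for i in indices]

print(tokens)                         # ['Hello', 'world', '.']
print(indices)                        # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

Each sentence therefore ends up as a list of d-dimensional vectors, one per token.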




<h1><span style='color:yellow'>This tutorial applies to any JSON, CSV, or TSV (tab-separated values) file.</span></h1>
<h1><span style='color:yellow'>Ensure the data is located in the dataset directory and separated into train and test files.</span></h1>
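
As a rough illustration of the expected layout, the sketch below writes two tiny files in the JSON Lines style (one JSON object per line), using the `quote` and `score` keys that this tutorial reads. The temporary directory, the helper `write_json_lines`, and the placeholder rows are assumptions made for the demonstration, not the tutorial's actual data.

```python
import json
import os
import tempfile

def write_json_lines(path, rows):
    # Write one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Made-up example rows; real data would come from your own corpus.
train_rows = [
    {"quote": "Do not pray for an easy life, pray for the strength to endure a difficult one.", "score": 1},
    {"quote": "This is a placeholder sentence with a negative label.", "score": 0},
]
test_rows = [{"quote": "Another placeholder sentence.", "score": 1}]

data_dir = tempfile.mkdtemp()  # stand-in for the dataset directory
write_json_lines(os.path.join(data_dir, "train.json"), train_rows)
write_json_lines(os.path.join(data_dir, "test.json"), test_rows)

# Read the training file back, one record per line.
with open(os.path.join(data_dir, "train.json"), encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), loaded[0]["score"])
```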



In [30]:
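# Field, TabularDataset, and BucketIterator belong to torchtext's legacy API:
# torchtext 0.9-0.11 expose them under torchtext.legacy.data, and 0.12+ removed them.
import torchtext
from torchtext.data import Field, TabularDataset, BucketIterator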

In [31]:
# Initialize a Field object from torchtext.data, which defines how the DATA (text) should be processed

tokenize = lambda x: x.split()  # whitespace tokenizer

quote = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)

# sequential=True: the data is a sequence (like a sentence), not a single value (like a label)
# use_vocab=True: each unique token is mapped to a unique integer through a vocabulary, a standard preprocessing step for text
# tokenize=tokenize: the function used to split each string into tokens
# lower=True: converts all text to lowercase



In [32]:
# Initialize a Field object from torchtext.data, which defines how the SCORES (labels) should be processed
score = Field(sequential=False, use_vocab=False)
# This is a sentiment analysis example, so each label is a single value and "sequential" is False; the labels are already integers, so no vocabulary is needed (use_vocab=False).
# For tasks whose targets are sequences, such as translation, "sequential" must be True.


In [33]:
# Map each key in the data file to an (attribute name, Field) pair: 'quote' becomes the attribute q and 'score' becomes the attribute s.
fields = {'quote': ('q', quote), 'score': ('s', score)}
fields

{'quote': ('q', <torchtext.data.field.Field at 0x7f1b08487130>),
 'score': ('s', <torchtext.data.field.Field at 0x7f1b08487b20>)}

In [34]:
# Load the train and test splits; with format='json', each line of a file is parsed as one JSON object
data_path='./datastes/text data/'
train_data, test_data=TabularDataset.splits(
    path=data_path,
    train='train.json',
    test='test.json',
    format='json',
    fields=fields
)

# Print a sample of train_data
print(train_data[1].__dict__.keys())
print('')
print(train_data[1].__dict__.values())
print('')
# Printing the length of train_data 
print(f' The length of the training data is {len(train_data)}')

dict_keys(['q', 's'])

dict_values([['do', 'not', 'pray', 'for', 'an', 'easy', 'life,', 'pray', 'for', 'the', 'strength', 'to', 'endure', 'a', 'difficult', 'one.'], 1])

 The length of the training data is 6




In [35]:
# Building a vocabulary on training data
quote.build_vocab(
    train_data,
    max_size=10000,  # max_size sets a limit on the number of tokens in the vocabulary
    min_freq=2,  # min_freq sets a minimum frequency threshold for a token to be included in the vocabulary
)
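
To make the effect of `max_size` and `min_freq` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. The function `build_vocab_sketch` is hypothetical and its tie-breaking rule is an assumption; it mimics the general idea (special tokens first, then the most frequent tokens that pass the threshold) rather than reproducing torchtext's code. Placing `<pad>` at index 1 matches the padding value of 1 used in this tutorial.

```python
from collections import Counter

def build_vocab_sketch(token_lists, max_size=None, min_freq=1, specials=("<unk>", "<pad>")):
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by descending frequency, then alphabetically (tie-breaking is an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)  # index -> token, special tokens come first
    for token, freq in ordered:
        if freq < min_freq:
            break  # every remaining token is rarer still
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # the limit on regular tokens has been reached
        itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}  # token -> index
    return itos, stoi

corpus = [["pray", "for", "strength"], ["pray", "for", "peace"], ["pray"]]
itos, stoi = build_vocab_sketch(corpus, max_size=10000, min_freq=2)
print(itos)  # ['<unk>', '<pad>', 'pray', 'for']

# Tokens that did not make it into the vocabulary fall back to the <unk> index.
print([stoi.get(t, stoi["<unk>"]) for t in ["pray", "for", "peace"]])  # [2, 3, 0]
```

With `min_freq=2`, the tokens "strength" and "peace" appear only once and are therefore dropped.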


In [39]:
# Construct the iterators, which handle batching and padding
# Each batch from BucketIterator is padded so that all numericalized sentences in it have the same length; the padding token <pad> has index 1 in the vocabulary.

train_iterator,test_iterator=BucketIterator.splits(
    (train_data,test_data),
    batch_size=2,
    device='cpu')
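
The padding step can be illustrated without torchtext. The sketch below pads a batch of index sequences to the length of the longest one using pad index 1, then transposes the result so that rows correspond to token positions and columns to sentences, that is, a (sequence length, batch size) layout. The helper `pad_batch` and the index values are hypothetical.

```python
PAD_INDEX = 1  # index of the padding token, as in this tutorial

def pad_batch(sequences, pad_index=PAD_INDEX):
    # Extend every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

batch = [[33, 19, 24, 14], [27, 29]]
padded = pad_batch(batch)

# Transpose from (batch size, sequence length) to (sequence length, batch size).
time_major = [list(column) for column in zip(*padded)]

print(padded)      # [[33, 19, 24, 14], [27, 29, 1, 1]]
print(time_major)  # [[33, 27], [19, 29], [24, 1], [14, 1]]
```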




In [40]:
# Each batch exposes .q with shape (sequence length, batch size) and .s with shape (batch size,)
for batch in train_iterator:
    print(batch)

    



[torchtext.data.batch.Batch of size 2]
	[.q]:[torch.LongTensor of size 16x2]
	[.s]:[torch.LongTensor of size 2]

[torchtext.data.batch.Batch of size 2]
	[.q]:[torch.LongTensor of size 14x2]
	[.s]:[torch.LongTensor of size 2]

[torchtext.data.batch.Batch of size 2]
	[.q]:[torch.LongTensor of size 16x2]
	[.s]:[torch.LongTensor of size 2]




In [38]:
for batch in train_iterator:
    print(batch.q)
    print(batch.s)

tensor([[33, 27],
        [19, 29],
        [24,  7],
        [14, 26],
        [15, 18],
        [34,  2],
        [32, 25],
        [31,  1],
        [16,  1],
        [20,  1],
        [22,  1],
        [12,  1],
        [ 5,  1],
        [ 8,  1]])
tensor([1, 0])
tensor([[10, 10],
        [21, 21],
        [ 4,  4],
        [ 3,  3],
        [ 6,  6],
        [11, 11],
        [17, 17],
        [ 4,  4],
        [ 3,  3],
        [30, 30],
        [28, 28],
        [ 5,  5],
        [13, 13],
        [ 2,  2],
        [ 9,  9],
        [23, 23]])
tensor([1, 1])
tensor([[33, 27],
        [19, 29],
        [24,  7],
        [14, 26],
        [15, 18],
        [34,  2],
        [32, 25],
        [31,  1],
        [16,  1],
        [20,  1],
        [22,  1],
        [12,  1],
        [ 5,  1],
        [ 8,  1]])
tensor([1, 0])




In [42]:
# So far, tokenization used a lambda function that splits each sentence on whitespace.
# For a more robust approach, we redo the tokenization with the spaCy library.

In [49]:
# Repeat the above implementation using the spaCy library
from torchtext.data import Field, TabularDataset, BucketIterator
import spacy
# Install the English model first with: !python -m spacy download en_core_web_sm

spacy_en_model=spacy.load('en_core_web_sm')


In [50]:
# Define a spaCy-based tokenizer that returns the text of each token
def tokenize(text):
    return [token.text for token in spacy_en_model.tokenizer(text)]
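
To see why a dedicated tokenizer matters, compare whitespace splitting with a tokenizer that separates punctuation. The regular expression below is only a rough stand-in chosen so the example runs without spaCy installed; spaCy's real tokenizer applies its own, more elaborate rules, and the names `whitespace_tokenize` and `punctuation_aware_tokenize` are hypothetical.

```python
import re

def whitespace_tokenize(text):
    # Same behaviour as the lambda tokenizer: split on whitespace only.
    return text.split()

def punctuation_aware_tokenize(text):
    # Words stay whole; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "do not pray for an easy life, pray for the strength"
simple_tokens = whitespace_tokenize(sentence)
split_tokens = punctuation_aware_tokenize(sentence)

print(simple_tokens)  # contains 'life,' as a single token
print(split_tokens)   # contains 'life' and ',' as separate tokens
```

With whitespace splitting, "life," and "life" would become two different vocabulary entries, which is usually not what we want.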

In [57]:
quote=Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)
score=Field(sequential=False, use_vocab=False)

fields={'quote':('q',quote), 'score':('s',score)}

train_data, test_data=TabularDataset.splits(
    path=data_path,
    train='train.json',
    test='test.json',
    format='json',
    fields=fields
)

quote.build_vocab(
    train_data,
    max_size=10000,
    min_freq=2,
    vectors='glove.6B.100d'  # attach pretrained 100-dimensional GloVe vectors; the first use triggers a download of roughly 1 GB
)

train_iterator,test_iterator=BucketIterator.splits(
    (train_data,test_data),
    batch_size=2,
    device='cpu')
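
Once a vocabulary carries pretrained vectors, a common next step is to copy them into an embedding layer. The sketch below assumes PyTorch is installed and uses a small random matrix as a stand-in for `quote.vocab.vectors`, so it runs without downloading GloVe; the vocabulary size and the index values are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 37, 100  # made-up vocabulary size; glove.6B.100d vectors have 100 dimensions
pretrained = torch.randn(vocab_size, embedding_dim)  # stand-in for quote.vocab.vectors

# freeze=True keeps the pretrained weights fixed during training;
# padding_idx=1 marks the <pad> row as padding.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=1)

# A batch shaped (sequence length, batch size), like batch.q in this tutorial.
batch_q = torch.tensor([[14, 29], [25, 31], [7, 1]])
embedded = embedding(batch_q)
print(embedded.shape)  # torch.Size([3, 2, 100])
```

Every index in the batch is replaced by its 100-dimensional vector, so the result gains one extra dimension.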
    




In [58]:
for batch in train_iterator:
    print(batch.q)
    print(batch.s)

tensor([[14, 29],
        [25, 31],
        [ 7,  3],
        [ 5, 11],
        [10, 28],
        [15, 22],
        [21,  4],
        [ 3, 27],
        [ 7,  9],
        [ 5,  1],
        [32,  1],
        [30,  1],
        [ 8,  1],
        [17,  1],
        [ 4,  1],
        [13,  1],
        [ 6,  1],
        [ 2,  1]])
tensor([1, 0])
tensor([[35, 29],
        [23, 31],
        [26,  3],
        [18, 11],
        [19, 28],
        [36, 22],
        [34,  4],
        [ 2, 27],
        [33,  9],
        [20,  1],
        [24,  1],
        [ 6,  1],
        [16,  1],
        [ 8,  1],
        [12,  1],
        [ 2,  1]])
tensor([1, 0])
tensor([[35, 14],
        [23, 25],
        [26,  7],
        [18,  5],
        [19, 10],
        [36, 15],
        [34, 21],
        [ 2,  3],
        [33,  7],
        [20,  5],
        [24, 32],
        [ 6, 30],
        [16,  8],
        [ 8, 17],
        [12,  4],
        [ 2, 13],
        [ 1,  6],
        [ 1,  2]])
tensor([1, 1])




In [62]:
# We can repeat the above implementation for CSV and TSV files. The only differences are the file names and the format parameter passed to TabularDataset.splits().

train_data, test_data=TabularDataset.splits(
    path=data_path,
    train='train.csv',
    test='test.csv',
    format='csv',
    fields=fields)
    
    
# or, for TSV files:

train_data, test_data=TabularDataset.splits(
    path=data_path,
    train='train.tsv',
    test='test.tsv',
    format='tsv',
    fields=fields)
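
When `fields` is a dictionary, as it is here, the CSV or TSV file should have a header row whose column names match the dictionary keys (`quote` and `score`). The sketch below writes such files with Python's `csv` module; the temporary directory, the helper `write_table`, and the placeholder rows are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up example rows with the same columns as the tutorial's data.
rows = [
    {"quote": "A placeholder sentence, with a comma.", "score": 1},
    {"quote": "Another placeholder sentence.", "score": 0},
]

def write_table(path, rows, delimiter):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["quote", "score"], delimiter=delimiter)
        writer.writeheader()  # header row with the column names
        writer.writerows(rows)

data_dir = tempfile.mkdtemp()  # stand-in for the dataset directory
csv_path = os.path.join(data_dir, "train.csv")
tsv_path = os.path.join(data_dir, "train.tsv")
write_table(csv_path, rows, delimiter=",")
write_table(tsv_path, rows, delimiter="\t")

# Read the CSV file back to check that the quoted comma survived.
with open(csv_path, newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
print(csv_rows[0]["quote"])
```

The `csv` module quotes any field that contains the delimiter, so commas inside a quote do not break the column layout.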



# We will revisit this implementation to build a model, such as an RNN-LSTM, in upcoming tutorials when we cover RNNs