## **Embeddings**
The main objective of this notebook is to reduce the dimensionality of the sparse one-hot encoded word vectors. The goals of using word embeddings are:


1. Finding the meaning of words based on their proximity to other words. This is done by taking two word vectors and analyzing how often the words are used together. The higher the frequency, the stronger the correlation and relationship between the words.
2. Training the word embedding to capture these relationships between words in a given number of dimensions is how we reduce the word representation to a low-dimensional one.
3. Embedding vectors serve as numeric representations of words and are used as input to other neural network layers.
4. The embedding vectors together form the stored lookup table for the words in the vocabulary.
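
As a minimal sketch of points 3 and 4, the snippet below contrasts a sparse one-hot representation with a lookup in `torch.nn.Embedding`. The vocabulary size (10) and embedding dimension (4) are made-up values chosen for illustration, not the ones used later in this notebook.

```python
import torch

vocab_size = 10   # assumed toy vocabulary size
embed_dim = 4     # assumed embedding dimension, much smaller than vocab_size

token_ids = torch.tensor([2, 7, 2])  # three word indices; index 2 appears twice

# Sparse representation: one row of length vocab_size per word, a single 1 per row
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size)

# Dense representation: the embedding layer is a trainable lookup table
# of shape (vocab_size, embed_dim); indexing it returns one row per word
embedding = torch.nn.Embedding(vocab_size, embed_dim)
dense = embedding(token_ids)

print(one_hot.shape)  # torch.Size([3, 10])
print(dense.shape)    # torch.Size([3, 4])
# The same index always maps to the same row of the table
print(torch.equal(dense[0], dense[2]))  # True
```

The rows of `embedding.weight` start out random and only become meaningful once the layer is trained as part of a model.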




In [None]:
!wget -q https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/torchnlp.py

In [None]:
!pip install torch==1.11.0
!pip install torchtext==0.12.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install torchinfo
!pip install torchdata==0.3.0
import torch
import torchtext
from torchtext.data import get_tokenizer
import numpy as np
from torchnlp import *
#from torchinfo import summary
train_dataset, test_dataset, classes, vocab = load_dataset()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Loading dataset...
Building vocab...


TypeError: ignored

### **Dealing with variable sequence size**

When working with text, sequences or sentences will have different lengths. This is a problem when training the word embedding network, because all sequences in a minibatch must have the same length before they can be stacked into a single tensor. For consistency and better training performance, we apply padding. This can be done by applying `torch.nn.functional.pad` to a tokenized dataset: it appends zeros to the end of each shorter vector until it matches the length of the longest sequence in the minibatch.
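
To make the effect of padding concrete, here is a small sketch on two already-encoded sequences. The token indices are made up for illustration and do not come from the notebook's vocabulary.

```python
import torch
import torch.nn.functional as F

# Two already-encoded sequences of different lengths (made-up token indices)
short = torch.tensor([5, 8, 3])
long = torch.tensor([4, 9, 1, 7, 2])

max_len = max(len(short), len(long))

# pad takes (left, right) amounts for the last dimension;
# nothing is added on the left and zeros fill the right
short_padded = F.pad(short, (0, max_len - len(short)), mode='constant', value=0)
long_padded = F.pad(long, (0, max_len - len(long)), mode='constant', value=0)

# Equal lengths make it possible to stack the sequences into one tensor
batch = torch.stack([short_padded, long_padded])
print(batch)
# tensor([[5, 8, 3, 0, 0],
#         [4, 9, 1, 7, 2]])
```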


In [None]:
def padify(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label, 
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v])
    )
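
A function like `padify` is typically passed to a `DataLoader` as its `collate_fn`, so that padding is applied per minibatch. The sketch below shows that wiring on toy data. `toy_vocab`, `toy_dataset`, and this simplified `encode` are hypothetical stand-ins so that the example runs on its own; they are not the helpers from `torchnlp.py`.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical stand-ins: a tiny vocabulary and a whitespace-based encode().
# Index 0 doubles as the unknown-word index and the padding value here.
toy_vocab = {'<unk>': 0, 'stocks': 1, 'rise': 2, 'team': 3, 'wins': 4, 'the': 5, 'final': 6}

def encode(text):
    return [toy_vocab.get(word, 0) for word in text.lower().split()]

# (label, text) pairs with 1-based labels, mirroring the layout padify expects
toy_dataset = [(3, 'Stocks rise'), (2, 'The team wins the final')]

def padify(b):
    # Same logic as the notebook's padify, repeated so the sketch is self-contained
    v = [encode(x[1]) for x in b]
    max_len = max(map(len, v))
    return (
        torch.LongTensor([t[0] - 1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t), (0, max_len - len(t)),
                                             mode='constant', value=0) for t in v]),
    )

# collate_fn is called on each list of samples to build one minibatch
loader = DataLoader(toy_dataset, batch_size=2, collate_fn=padify, shuffle=False)
labels, features = next(iter(loader))
print(labels)    # tensor([2, 1])
print(features)
# tensor([[1, 2, 0, 0, 0],
#         [5, 3, 4, 5, 6]])
```

Because the maximum length is computed per minibatch, each batch is only padded to the length of its own longest sequence rather than the longest sequence in the whole dataset.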

Let's use the first two sentences as an example to view the difference in text lengths and the effect of padding.

In [None]:
first_sentence = train_dataset[0][1]
second_sentence = train_dataset[1][1]

f_tokens = encode(first_sentence)
s_tokens = encode(second_sentence)

print(f'First Sentence in dataset:\n{first_sentence}')
# Report the number of tokens rather than the number of characters in the raw string
print("Length (tokens):", len(f_tokens))
print(f'\nSecond Sentence in dataset:\n{second_sentence}')
print("Length (tokens):", len(s_tokens))

NameError: ignored

In [None]:
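# stoi is not a class attribute in torchtext 0.12; use get_stoi() on the vocab instance (the 'basic_english' tokenizer is an assumption)
[vocab.get_stoi().get(token) for token in get_tokenizer('basic_english')(train_dataset[0][1])]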