# New Era Embeddings

#### Embeddings are vectors (1D arrays of numbers) that represent semantic properties

Word embeddings are vital in NLP because they capture the relationship between words. Unless a model learns the relationship between words, it cannot perform more complex NLP tasks, such as text classification, well.

A one-hot encoding matrix has one dimension per vocabulary word and is sparse (mostly zeros), so it suffers from the curse of dimensionality: because the matrix is both large and sparse, parameter estimation is harder and we need a lot of data to train a model that generalizes well. One-hot vectors are also static and don’t capture the meanings of words.
Today richer contextual embeddings exist, and they generally outperform context-free embeddings like Word2Vec and GloVe.
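
To make the contrast concrete, here is a minimal sketch with a made-up five-word vocabulary. The words, the 3-dimensional dense vectors, and the helper `one_hot` are illustrative assumptions, not values from a trained model.

```python
# Toy vocabulary, chosen only for illustration.
vocab = ["king", "queen", "bank", "river", "cloud"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with one slot per vocabulary word, all zeros except one."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical dense embeddings: their size (3) does not grow with the vocabulary.
dense = {
    "king": [0.8, 0.1, 0.3],
    "queen": [0.7, 0.2, 0.3],
}

print(one_hot("king"))                         # [1, 0, 0, 0, 0]
print(len(one_hot("king")), len(dense["king"]))  # 5 3
# Any two different one-hot vectors are orthogonal, so they carry no similarity signal.
print(dot(one_hot("king"), one_hot("queen")))  # 0
```

With a realistic vocabulary of tens of thousands of words, each one-hot vector would have that many entries, while the dense vectors would keep a fixed size such as 100 or 300.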

Word embeddings trained by Word2Vec, GloVe, and fastText store **semantic information learned from the contexts each word appears in**, in a much lower-dimensional space than one-hot encoding. Words such as “queen” and “king” have vectors that are close together in that space, implying a semantic relationship or similarity between the two.
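
"Closer together" is usually measured with cosine similarity. The sketch below uses hand-written 4-dimensional vectors as stand-ins; they are assumptions for illustration and do not come from Word2Vec, GloVe, or fastText.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not taken from any trained model.
vectors = {
    "king": [0.9, 0.8, 0.1, 0.2],
    "queen": [0.8, 0.9, 0.2, 0.1],
    "cloud": [0.1, 0.0, 0.9, 0.7],
}

sim_royal = cosine_similarity(vectors["king"], vectors["queen"])
sim_other = cosine_similarity(vectors["king"], vectors["cloud"])
print(f"king ~ queen: {sim_royal:.3f}")
print(f"king ~ cloud: {sim_other:.3f}")
```

With real pretrained vectors the same function applies unchanged; only the lookup table would be replaced by the model's vectors.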

In 2013, pretrained word embeddings became popular with the rise of Word2Vec, the first of the major word embedding algorithms. 

Despite its successes, Word2Vec has shortcomings:

1. It relies on a relatively small window-based model to learn the word embedding for a particular word. It does not consider the word in the context of the entire document. 

2. It does not consider subword information, which means that it cannot efficiently learn, for example, how a noun and an adjective that are derived from the same subword are related. For instance, “intelligent” and “intelligence” share the subword “intelligen” and are related as a result, sharing similar semantic information.

3. Word2Vec cannot handle Out of Vocabulary (OOV) words; it can only vectorize words that it has seen in training.

4. Word2Vec cannot disambiguate the context-specific semantic properties of words. For example, with Word2Vec, the word “bank” has the same word vector regardless of whether it appears in the financial setting or in the river setting.

5. fastText addresses the subword and OOV limitations by building word vectors from character n-grams; its only major shortcoming is its inability to produce multiple vectors for each word depending on the context.
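
Items 2 and 3 both come from treating each word as an indivisible token. The following is a rough sketch of the character n-gram idea behind fastText, assuming n-gram lengths 3 to 6 and `<`/`>` as word-boundary markers. The function name `char_ngrams` is made up here, and the real library additionally hashes n-grams into buckets and learns a vector for each.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Collect the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


a = char_ngrams("intelligent")
b = char_ngrams("intelligence")
shared = a & b

print(len(a), len(b), len(shared))
# A few of the longest n-grams the two words have in common.
print(sorted(shared, key=lambda g: (-len(g), g))[:5])
```

Because a word vector can be assembled from its n-gram vectors, an unseen word still receives a representation as long as some of its n-grams were seen in training.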

As good as they are, word embeddings trained by **Word2Vec, GloVe, and fastText** are not context-aware. 

Large pretrained language models such as ELMo (built on bidirectional LSTMs) and BERT (built on the Transformer architecture), which came onto the scene starting in 2018, changed this: they introduce context-aware word representations.



In [None]:
!python -m spacy download en_core_web_sm 

In [2]:
import torch.nn.functional as F
import torch.nn as nn
from torch import optim
import torch

from torchtext.legacy import data

In [3]:
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [4]:
import pandas as pd
import numpy as np
from simpletransformers.classification import ClassificationModel

### Bootstrapping Labels Using a Zero-Shot Classifier

In [9]:
from transformers import pipeline

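# device=0 selects the first GPU; -1 falls back to the CPU when no GPU is available
classifier = pipeline("zero-shot-classification", device=0 if torch.cuda.is_available() else -1)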
candidate_labels=["Defect", "Software Research", "Software Upgrade", "Testing", "Software Development", "Software Maintenance", "Software Deployment", "Software Enhancement", "Software Integration", "New Feature", "Innovation", "R&D", "Continue Operation", "Other", "Technical Debt"]

In [10]:
def predict_0_shot_label(summary):
    """Return the highest-scoring candidate label for one issue summary."""
    result = classifier(summary, candidate_labels)
    # The pipeline returns the candidate labels sorted by score, best first
    return result['labels'][0]

In [11]:
def translate_label(label_text):
    if label_text in ['New Feature', 'Innovation', 'R&D', 'Software Research', 'Software Development', 'Software Enhancement']:
        return 2
    elif label_text in ['Testing', 'Continue Operation', 'Technical Debt', 'Software Maintenance', 'Software Upgrade', 'Software Deployment', 'Software Integration']:
        return 0
    elif label_text in ['Defect']:
        return 1   
    else:
        return 3
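
The same mapping can also be written as a lookup table, which keeps the label groups in one place and makes the fallback explicit. This is only a sketch of an alternative layout: the names `LABEL_GROUPS`, `FALLBACK_CLASS`, `LABEL_TO_CLASS`, and `translate_label_lookup` are made up here, while the class ids mirror `translate_label`.

```python
# Class id -> zero-shot labels that map to it (ids copied from translate_label).
LABEL_GROUPS = {
    2: ["New Feature", "Innovation", "R&D", "Software Research",
        "Software Development", "Software Enhancement"],
    0: ["Testing", "Continue Operation", "Technical Debt", "Software Maintenance",
        "Software Upgrade", "Software Deployment", "Software Integration"],
    1: ["Defect"],
}
FALLBACK_CLASS = 3  # anything not listed above, including "Other"

# Invert the groups into a flat label -> class id dictionary.
LABEL_TO_CLASS = {
    label: class_id
    for class_id, labels in LABEL_GROUPS.items()
    for label in labels
}


def translate_label_lookup(label_text):
    """Dictionary-based equivalent of translate_label."""
    return LABEL_TO_CLASS.get(label_text, FALLBACK_CLASS)


print(translate_label_lookup("Defect"))      # 1
print(translate_label_lookup("Innovation"))  # 2
print(translate_label_lookup("Other"))       # 3
```

With pandas, `Series.map(LABEL_TO_CLASS).fillna(FALLBACK_CLASS).astype(int)` should give the same result without calling a Python function per row.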

In [12]:
IssuesDF['zero_shot_label'] = IssuesDF['summary'].apply(predict_0_shot_label)

In [13]:
IssuesDF['label'] = IssuesDF['zero_shot_label'].apply(translate_label)

In [37]:
IssuesDF.head(10)

| Unnamed: 0 | issue_id | summary | label |
|---|---|---|---|
| 0 | 53674 | Sub-task Grow number of contributors from 37 t... | 3 |
| 1 | 53611 | Task Own all aspects of customer success funct... | 0 |
| 2 | 53609 | Task Understand software development/testing p... | 2 |
| 3 | 53494 | Task Grow skills in troubleshooting customer's... | 2 |
| 4 | 53260 | Story Grow Kubernetes Knowledge | 2 |
| 5 | 53254 | Sub-task Define new policies and keep adding n... | 0 |
| 6 | 53076 | Story Grow Knowledge & Skills | 0 |
| 7 | 51923 | Task Verify cluster creation with add-on EKS, ... | 0 |
| 8 | 47754 | Task Implement log collection and aggregation ... | 2 |
| 9 | 53817 | Task PSP deprecation/Kyverno blog post | 1 |


In [15]:
IssuesDF = IssuesDF.drop('zero_shot_label', axis=1)

In [16]:
from torchtext.legacy.data import Field, Dataset, Example

#### We need a Dataset wrapper around the DataFrame to feed the PyTorch model

In [17]:
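class DataFrameDataset(Dataset):
    """Wrap a pandas DataFrame as a torchtext (legacy) Dataset."""

    def __init__(self, df: pd.DataFrame, fields: list):
        # Select columns by field name so that extra DataFrame columns
        # cannot shift values into the wrong field
        columns = [name for name, _ in fields]
        examples = [
            Example.fromlist(list(row), fields)
            for row in df[columns].itertuples(index=False)
        ]
        super().__init__(examples, fields)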

#### Set up fields

In [18]:
ISSUE = data.Field()
SUMMARY = data.Field(
    sequential=True,
    tokenize='basic_english',
    lower=True,
    include_lengths=True,  # batches yield (token_ids, lengths) pairs
    batch_first=False,     # tensors are shaped (seq_len, batch)
)
LABEL = data.LabelField()

#### Wrap the DataFrame in a Dataset and split it into train and test sets

In [19]:
train, test = DataFrameDataset(
    df=IssuesDF, 
    fields=(
        ('issue_id', ISSUE),
        ('summary', SUMMARY),
        ('label', LABEL)
    )
).split()

#### Build the vocabulary

In [20]:
ISSUE.build_vocab(train)
# Other pretrained options: 'glove.6B.100d', 'glove.42B.300d'
SUMMARY.build_vocab(train, vectors='fasttext.simple.300d')
LABEL.build_vocab(train)
EmbeddingSize = 300  # must match the dimension of the chosen vectors

In [21]:
SUMMARY.vocab.stoi['cloud']

107
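
The number returned above is the row index of "cloud" in the vocabulary, which is also the row of its vector in the embedding matrix. The sketch below imitates that token-to-index mapping with the standard library only. It assumes special tokens come first and the remaining tokens are ordered by frequency; the function names and the toy texts are made up, and torchtext's own implementation may differ in details.

```python
from collections import Counter


def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by frequency (high to low), breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


# Toy corpus, already tokenized and lowercased.
texts = [
    ["deploy", "cloud", "cluster"],
    ["upgrade", "cloud", "cluster"],
    ["cloud", "logs"],
]
stoi, itos = build_vocab(texts)


def numericalize(tokens):
    """Replace each token by its id; unknown tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


print(stoi["cloud"])                          # 2
print(numericalize(["cloud", "kubernetes"]))  # [2, 0]
```

The `1` values that pad the end of shorter sequences in a batch correspond to the `<pad>` entry at index 1.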

In [None]:
print(vars(test[100]))

Dataset downloaded, tokenized, and vectorized.

#### Make iterator for splits

In [23]:
train_iter, test_iter = data.BucketIterator.splits(
    (train, test),
    batch_sizes=(128, 1024),
    device=dev,
    sort=False,
    repeat=False,
)

In [None]:
vars(train_iter.dataset[11])

**We'll now build a model to process the Jira data word vectors.
We will try different word embeddings and see their effect on performance on a simple dataset.**

### Model

In [25]:
class RNN_classifier(torch.nn.Module):

    def __init__(self, embedding_size=EmbeddingSize, hidden_size=512, num_layers=3):
        super().__init__()

        # Set up an embedding layer with the right dimensions, and copy the weights
        # from the pretrained vectors loaded into the vocabulary (fastText here)
        vocab = SUMMARY.vocab
        self.embed = torch.nn.Embedding(len(vocab), embedding_size)
        self.embed.weight.data.copy_(vocab.vectors)

        # Set up a standard PyTorch RNN with the right
        # dimensions and a variable number of layers
        self.rnn = torch.nn.RNN(embedding_size, hidden_size, num_layers)

        # Add a two-layer classification head with the right dimensions
        # The final layer outputs a single logit per example
        self.classificationLayer1 = torch.nn.Linear(hidden_size, 10)
        self.classificationLayer2 = torch.nn.Linear(10, 1)

    def forward(self, input, lengths=None):
        # input: (seq_len, batch) token ids; lengths: unpadded length of each sequence
        embed_input = self.embed(input)
        # Pack so the RNN skips padding positions; lengths must be a CPU tensor
        packed_emb = nn.utils.rnn.pack_padded_sequence(embed_input, lengths, batch_first=False, enforce_sorted=False)

        output, hidden = self.rnn(packed_emb)
        # hidden: (num_layers, batch, hidden_size); keep the last layer's final state
        hidden = hidden[-1]
        x = self.classificationLayer1(hidden)
        x = self.classificationLayer2(x)

        logits = x.view(-1)
        return logits

In [26]:
model = RNN_classifier(hidden_size=256, num_layers=1)
model.to(dev)

RNN_classifier(
  (embed): Embedding(3450, 300)
  (rnn): RNN(300, 256)
  (classificationLayer1): Linear(in_features=256, out_features=10, bias=True)
  (classificationLayer2): Linear(in_features=10, out_features=1, bias=True)
)

In [27]:
for batch in train_iter:
    #print(batch)
    x, x_len = batch.summary
    x = x.to(dev)
    x_len = torch.as_tensor(x_len, dtype=torch.int64, device='cpu')
    print(x)
    pred = model(x,x_len)
    print(pred.shape)
    break

tensor([[   8,   14,   22,  ...,   22,   14,   40],
        [  17,   16,  125,  ..., 1593,   16, 1004],
        [  46,  860,   50,  ..., 2476,   27, 1666],
        ...,
        [   1,    1,    1,  ...,    1, 1021,    1],
        [   1,    1,    1,  ...,    1,   32,    1],
        [   1,    1,    1,  ...,    1,    6,    1]], device='cuda:0')
torch.Size([128])


In [28]:
loss_func = F.binary_cross_entropy_with_logits
opt = optim.Adam(model.parameters(), lr=1e-4)
epochs = 6
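
A single logit with `binary_cross_entropy_with_logits` models a two-class problem, whereas `translate_label` assigns four class ids (0 to 3). If the four classes are meant to be kept distinct, one common alternative is a head with one logit per class trained with cross-entropy. The sketch below is an assumption about how that could look, not the setup used in this notebook; `MultiClassHead` and the random tensors are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4  # translate_label returns 0, 1, 2 or 3


class MultiClassHead(nn.Module):
    """Hypothetical classification head that emits one logit per class."""

    def __init__(self, hidden_size=256, num_classes=NUM_CLASSES):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, 10)
        self.fc2 = nn.Linear(10, num_classes)

    def forward(self, hidden):
        return self.fc2(torch.relu(self.fc1(hidden)))


torch.manual_seed(0)
head = MultiClassHead()
hidden = torch.randn(8, 256)                    # stand-in for the final RNN hidden state
targets = torch.randint(0, NUM_CLASSES, (8,))   # integer class ids (int64)

logits = head(hidden)                           # shape: (batch, num_classes)
loss = nn.CrossEntropyLoss()(logits, targets)   # expects raw logits and class ids
predicted = logits.argmax(dim=1)                # most likely class per example
accuracy = (predicted == targets).float().mean()
print(logits.shape, loss.item(), accuracy.item())
```

In this variant the metric would compare `argmax` predictions with the integer labels instead of thresholding a sigmoid at 0.5.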

In [29]:
def get_metrics(model, test_data):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_idx, batch_data in enumerate(test_data):
            text, text_lengths = batch_data.summary
            text = text.to(dev)
            text_lengths = torch.as_tensor(text_lengths, dtype=torch.int64, device='cpu') 
            logits = model(text, text_lengths)
            predicted_labels = (torch.sigmoid(logits) > 0.5).long()
            total += batch_data.label.size(0)
            correct += (predicted_labels == batch_data.label.long()).sum()
        return correct.float()/total

**After changing the embedding vectors used to build the vocabulary from glove.42B.300d to fasttext.simple.300d, classification accuracy rises from 56% to 79%, a gain of 23 percentage points (a 41% relative improvement).**

In [30]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(epochs)):
    model.train()
    for batch in tqdm(train_iter):
        x, x_lengths = batch.summary
        x = x.to(dev)
        # pack_padded_sequence expects the lengths as a CPU int64 tensor
        x_lengths = torch.as_tensor(x_lengths, dtype=torch.int64, device='cpu')
        pred = model(x, x_lengths)

        actual = batch.label.float()
        loss = loss_func(pred, actual)

        loss.backward()
        opt.step()
        opt.zero_grad()

    # Raise the learning rate after the sixth epoch (index 5); with epochs = 6
    # this is the final epoch, so the new rate is never used
    if epoch == 5:
        for g in opt.param_groups:
            g['lr'] = 3e-3

    print("Accuracy: " + str(get_metrics(model, test_iter).cpu().numpy()))

Accuracy: 0.35535428
Accuracy: 0.7842302
Accuracy: 0.7927544
Accuracy: 0.79381996
Accuracy: 0.78902507
Accuracy: 0.79222167


### Summary

Embeddings have historically been generated with algorithms like Word2Vec, but with the advent of transfer learning, reusing a pretrained model's weights gives you its embeddings as well, with no extra training effort.

Transformers use a feed-forward encoder-decoder architecture with attention. Because they process all positions of a sequence in parallel rather than step by step, training parallelizes well; this breakthrough led to the advent of very large pretrained language models and their embeddings.

With ELMo, it became possible to generate different word representations for the same word, such as “bank,” depending on the context it appeared in (financial bank versus river bank).

Moreover, these word representations are character-based, like fastText word embeddings, which allows ELMo-based models to handle OOV tokens that weren’t seen during training. Adding ELMo’s contextualized word representations to existing NLP systems improved on the state of the art across the benchmark tasks on which it was evaluated.

### References

[1] Data processing utilities for NLP, [Torchtext](https://pytorch.org/text/stable/index.html)


[2] Applied Natural Language Processing in the Enterprise.

    