<a href="https://colab.research.google.com/github/adamzki99/nlp-zlatan/blob/feature%2Fall-MiniLM-L6-v2-implementation/nlp_zlatan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connect to Google Drive

This notebook is designed to be used together with Google Colab. We start by connecting the notebook to our personal Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Be careful to check the you have the same filepath for the dataset in your drive

In [2]:
%cd /content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia

/content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia


# all-MiniLM-L6-v2

This implementation is based on the all-MiniLM-L6-v2 model which is available from [Huggingface](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

The all-MiniLM-L6-v2 is a sentence-transformers model. It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search. This is a later model compared to the one showed in one of the tutorials, but is used more or less in the same way.


We have selected to pick a BERT model as we wanted to explore the posibility of creating a "vector database". The use-case is as follows:

From a natural user input, we want to retrive the correct Wikipedia passage. So that the input from the user is as small as possible. 

The reduction of data input comes from the exlusion of topics etc..  

## Data extraction

The dataset used is very nested, hard to navigate and just difficult to wrap ones head around. So it is recommended to see this [resource](https://parl.ai/projects/wizard_of_wikipedia/) to get a better undersatnding.

In [3]:
import json

with open('data.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

Datatype: <class 'list'>


Use the following keys to double check with the [resource](https://parl.ai/projects/wizard_of_wikipedia/) that you have loaded in the right dataset.

In [4]:
data[0]['dialog'][0].keys()

dict_keys(['speaker', 'text', 'checked_sentence', 'checked_passage', 'retrieved_passages', 'retrieved_topics'])

We split up the data into a 80/20 split. We use 80% of the original dataset to perform fine-tuing of the all-MiniLM-L6-v2, the rest is then used for validating the reslut.

We aim to have the model being able to search the vector-space with new input and still being able to find the correct Wiki-passage.

It is important to note that this is a best case scenario as the input is generated from text that is present in the Wiki-passage.

In [5]:
import pandas as pd

data_extract_train = {
    "chosen_topic": [],
    "speaker_passage": [],
    "checked_sentence": [],
    "chosen_topic_passage": []
}

data_extract_test = {
    "chosen_topic": [],
    "speaker_passage": [],
    "checked_sentence": [],
    "chosen_topic_passage": []
}

for i, conversation in enumerate(data):

  for j, dialog in enumerate(conversation['dialog']):    

    if "Wizard" in dialog['speaker']:

      checked_sentence = list(dialog['checked_sentence'].values())

      if "no_passages_used" not in checked_sentence:

        if j % 4 == 0:

          data_extract_test['chosen_topic'].append(conversation['chosen_topic'])
          data_extract_test['speaker_passage'].append(dialog['text'])
          data_extract_test['checked_sentence'].append(checked_sentence)
          data_extract_test['chosen_topic_passage'].append(conversation['chosen_topic_passage'])

        else:
      
          data_extract_train['chosen_topic'].append(conversation['chosen_topic'])
          data_extract_train['speaker_passage'].append(dialog['text'])
          data_extract_train['checked_sentence'].append(checked_sentence)
          data_extract_train['chosen_topic_passage'].append(conversation['chosen_topic_passage'])

extract_train_df = pd.DataFrame(data_extract_train)

extract_test_df = pd.DataFrame(data_extract_test)

extract_test_df

Unnamed: 0,chosen_topic,speaker_passage,checked_sentence,chosen_topic_passage
0,Science fiction,I think science fiction is an amazing genre fo...,[Science fiction (often shortened to SF or sci...,[Science fiction (often shortened to SF or sci...
1,Science fiction,"It's not quite sci-fi, but my favorite version...",[The central premise for these stories oftenti...,[Science fiction (often shortened to SF or sci...
2,Romance (love),I don't know how to be romantic. I have troubl...,[Romance is the expressive and pleasurable fee...,[Romance is the expressive and pleasurable fee...
3,Romance (love),For sure. Romantic love is relative but usuall...,"[Romantic love is a relative term, but general...",[Romance is the expressive and pleasurable fee...
4,Romance (love),Good point. Romance is associated with perfect...,"[This feeling is associated with, but does not...",[Romance is the expressive and pleasurable fee...
...,...,...,...,...
25983,Kendrick Lamar,Kendrick Lamar is great! He is a rapper,"[Kendrick Lamar Duckworth (born June 17, 1987)...","[Kendrick Lamar Duckworth (born June 17, 1987)..."
25984,Kendrick Lamar,"It is not, but he is very acclaimed in the gen...","[His critically acclaimed third album ""To Pimp...","[Kendrick Lamar Duckworth (born June 17, 1987)..."
25985,Skiing,I knew skiing was a winter sport but I never t...,"[Skiing can be a means of transport, a recreat...","[Skiing can be a means of transport, a recreat..."
25986,Skiing,It seems that it may also have been practiced ...,[Although modern skiing has evolved from begin...,"[Skiing can be a means of transport, a recreat..."


### Reducing the size

The size of the dataset is too big for the amount of available VRAM on the GPU. Therefor we need to reduce the size of the extracted dataset

In [6]:
testing_size = len(extract_test_df.index)*0.01
testing_size = int(testing_size)

training_size = len(extract_train_df.index)*0.01
training_size = int(training_size)

print("Ratio %:", testing_size/training_size * 100)

Ratio %: 37.755102040816325


In [7]:
extract_test_df = extract_test_df.loc[:testing_size]
extract_test_df

Unnamed: 0,chosen_topic,speaker_passage,checked_sentence,chosen_topic_passage
0,Science fiction,I think science fiction is an amazing genre fo...,[Science fiction (often shortened to SF or sci...,[Science fiction (often shortened to SF or sci...
1,Science fiction,"It's not quite sci-fi, but my favorite version...",[The central premise for these stories oftenti...,[Science fiction (often shortened to SF or sci...
2,Romance (love),I don't know how to be romantic. I have troubl...,[Romance is the expressive and pleasurable fee...,[Romance is the expressive and pleasurable fee...
3,Romance (love),For sure. Romantic love is relative but usuall...,"[Romantic love is a relative term, but general...",[Romance is the expressive and pleasurable fee...
4,Romance (love),Good point. Romance is associated with perfect...,"[This feeling is associated with, but does not...",[Romance is the expressive and pleasurable fee...
...,...,...,...,...
255,The New York Times,I work for the New York Times! It's a newspaper.,[The New York Times (sometimes abbreviated as ...,[The New York Times (sometimes abbreviated as ...
256,The New York Times,"It was founded in 1851, and it's based in New ...","[Founded in 1851, the paper has won 122 Pulitz...",[The New York Times (sometimes abbreviated as ...
257,Tokyo,I recently moved from Los Angeles to Tokyo! It...,"[Tokyo (, ), officially Tokyo Metropolis, is t...","[Tokyo (, ), officially Tokyo Metropolis, is t..."
258,Tokyo,It's been the capital city of Japan since 1868...,[It officially became the capital after Empero...,"[Tokyo (, ), officially Tokyo Metropolis, is t..."


In [8]:
extract_train_df = extract_train_df.loc[:training_size]
extract_train_df

Unnamed: 0,chosen_topic,speaker_passage,checked_sentence,chosen_topic_passage
0,Science fiction,Awesome! I really love how sci-fi storytellers...,[Science fiction films have often been used to...,[Science fiction (often shortened to SF or sci...
1,Science fiction,If you really want a look at the potential neg...,[Science fiction often explores the potential ...,[Science fiction (often shortened to SF or sci...
2,Internet access,No I could not! I couldn't imagine living when...,"[Internet access was once rare, but has grown ...",[Internet access is the ability of individuals...
3,Internet access,"It used to be restricted, but around 1995, the...",[Use by a wider audience only came in 1995 whe...,[Internet access is the ability of individuals...
4,Internet access,"Yes, it was developed from a government funded...","[The Internet developed from the ARPANET, whic...",[Internet access is the ability of individuals...
...,...,...,...,...
682,Husky,"Nope, although that makes sense. They are oft...",[Huskies are used in sled dog racing.],[Husky is a general name for a sled-type of do...
683,Trumpet,The trumpet group contains instruments with th...,[The trumpet group contains the instruments wi...,[A trumpet is a brass instrument commonly used...
684,Trumpet,It is an interesting concept to think about be...,"[Trumpets are used in art music styles, for in...",[A trumpet is a brass instrument commonly used...
685,Parenting,"Yes, but I would rather my legal guardian be m...",[The most common caretaker in parenting is the...,[Parenting or child rearing is the process of ...


## Finetuning the model

As we can see from the pevious output, the dataset is very huge. Too big to fit on our GPU V-RAM. 

In [9]:
%pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m31.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m24.3 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m20.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [10]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')


Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

In order to fine-tune the model we need to construct sentence pairs. These pairs consists of a "user-input" and a sentence from the Wiki-passage

As stated earlier, the "user-input" is a genereated human-like input. The input from is genereated from the same sentence as the Wiki-passage which it is matched with. 

We note that this isn't the best case senario, as it can be interperted as the dataset is traning it self and creates a circle dependence. But we see it as being a "optimal" scenario instead.

In [11]:
def data_division(dataframe, sample_size:int):

  selected_sentences = []
  selected_conversation_topics = []

  for c, row in dataframe.sample(sample_size).iterrows():
    
    selected_conversation_topics.append(row['chosen_topic'])

    for resp in row['checked_sentence']:
      pair = (row['speaker_passage'], resp)
      selected_sentences.append(pair)

  return selected_sentences, selected_conversation_topics

In [12]:
selected_sentences_training, conversation_topics_traning = data_division(extract_train_df, len(extract_train_df.index))

Back to training the model...

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(device)

# Move model to GPU
model.to(device)

cpu


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
    

## Using mean-pooling

Becasue of the variable-length of the input we need to transform the input into a fixed-length representation so we can pass it to our model for traning.

The process involves taking the average of all the token embeddings in the sequence. More or less, this is achieved by summing up the embeddings and dividing the sum by the total number of tokens in the sequence.

Note that mean-pooling does not consider the positional information or the relative importance of individual tokens within the sequence. In order to combat this we make use of a attetion mask inorder to highlight some importance in the embedding.

The **mean_pooling** function performs mean pooling on token embeddings while considering an attention mask for correct averaging. It takes *model_output* and *attention_mask* as inputs.

The function first extracts the token embeddings from *model_output*. It then expands the attention mask to match the dimensions of the token embeddings. The expanded mask is used to mask out the embeddings that should be ignored.

Next, the masked token embeddings are summed along the second dimension (axis 1) to obtain the sum of the embeddings for each token. The attention mask is also summed along the second dimension and clamped to avoid division by zero.

Finally, the masked token embeddings are divided by the clamped attention mask sum to compute the mean pooling. The resulting mean-pooled embeddings are returned.

In [14]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

The embedding is performed in the same way as desrcibes in the documentation for the all-MiniLM-L6-v2 model. One step that has been left out is the normalizaiton of the embedding. 

The normalization whould provide a list of benefits, such as: improved training and stability, reducing dimensionality, alignment of embedding spaces.

The reason that we whould like to perform normalization is to have similarity of meaurements when evaluating the performance of the model.

In [15]:
def perform_embedding(documents:list, device, model):

  encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')

  encoded_documents.to(device)
  with torch.no_grad():
      model_output_documents = model(**encoded_documents)

  # Perform pooling
  embedding = mean_pooling(model_output_documents, encoded_documents['attention_mask'])

  # Normalize embedding
  #embedding = F.normalize(model_output_message, p=2, dim=1)

  return embedding

In [16]:
sentence_embeddings = perform_embedding(documents = selected_sentences_training, device = device, model = model)

KeyboardInterrupt: ignored

## Visualizing Cluster with Hypertools

In order to get a better understanding of the dataset we will use Hypertools to transform the very high dimensional space into a something we can understand. 

We will generate a 3-dimensional wiev of the embeddings and color in correspondence to k-mean clusters. The amount of clusters corresponds to the amount of topics that is included in the dataset.

In [None]:
%pip install hypertools

In [None]:
import hypertools as hyp

n_clusters = len(set(conversation_topics_traning))

print("Number of clusters:", n_clusters)

hyp.plot(sentence_embeddings.cpu().detach().numpy(), '.', n_clusters = n_clusters)

## Model usage

In [None]:
%pip install hnswlib

In order to search the embedding vector space we perform **k-nearest neighbors** (KNN) query. This works by creating a index using hnswlib, this is to improve the efficiency of the search.

We then perform the embedding process on the query, which in our case is the *speaker_passage*, and calculates the absolute distance to the *k* closest elements in the index.

In [None]:
import hnswlib

# Create the HNSW index
index = hnswlib.Index(space='l2', dim=sentence_embeddings.shape[1])
index.init_index(max_elements=len(sentence_embeddings), ef_construction=200, M=16)

# Add sentence embeddings to the index
index.add_items(sentence_embeddings.cpu().numpy())

In [None]:
# Perform a similarity search
def search_embeddings(query:str, k, device, model):

  query_embedding = perform_embedding(documents=query, device=device, model=model)

  indexes, distances = index.knn_query(query_embedding.cpu(), k=k)

  return indexes[0], distances, query_embedding

Get a random speaker passage from the training dataset just to verify that we can use the model

In [None]:
random_message = list(extract_train_df.sample(1).to_dict()['speaker_passage'].values())[0]
message = [random_message]

message

In [None]:
indexes, distances, query_embedding = search_embeddings(query=message, k=10, device=device, model=model)

print(indexes)

In [None]:
query_subset = []

for i, ind in enumerate(indexes):
  print("Distance:", distances[0][i], "\t", selected_sentences_training[ind][1])
  query_subset.append(selected_sentences_training[ind])

## Looking at the result

Hypertools has it's limitations, so in order to check the how the results look in comparison to the query embedding we will use *matplotlib.pyplot*.

Note: Having a 2-dimensional representation of such a high dimensional vectorspace that the embeddings are isn't optimal. But it is better than nothing 😉.





In [None]:
selected_sentences_embedding = perform_embedding(documents=query_subset, device=device, model=model)

In [None]:
import matplotlib.pyplot as plt

plt.scatter(sentence_embeddings.cpu()[:,0] , sentence_embeddings.cpu()[:,1], c = '#a9a9a9')
plt.scatter(selected_sentences_embedding.cpu()[:,0] , selected_sentences_embedding.cpu()[:,1], c = '#4363d8')
plt.scatter(query_embedding.cpu()[:,0] , query_embedding.cpu()[:,1], color = '#ffe119')
plt.show()

## Testing the model

Now we have extracted the data, finetuned the model, and proved that it works once. Now we will have to prove that it works for more cases. 

Earlier we set aside 20% of the original data for testing. Becasue we are aming to create something that is working as a vector database, we want to have absolute accuracy and we are not interested in similarity. This is by we have a one-to-one comparison and not a BLEU-evaluation or similar.

In [None]:
# Get traning data
selected_sentences_testing, _ = data_division(extract_test_df, len(extract_test_df.index))

In [None]:
score = 0

for _, sentence_pair in enumerate(selected_sentences_testing):

  indexes, distances, query_embedding = search_embeddings(query=sentence_pair[0], k=1, device=device, model=model)

  results = []
  for _, i in enumerate(indexes):
    results.append(selected_sentences_testing[i])
  

  if sentence_pair in results:
    score += 1
  
print("Accuracy:", score/len(selected_sentences_testing)*100, "%")

## Finding the correct Wikipedia passage

To wrap it up, we want to find the correct Wiki-passage. This will be perfomed by just finding the passage in the bigger Wiki-passage and presenting it to the user.

In [None]:
def find_article(checked_sentence:str, data_extract):

  for passage in data_extract['chosen_topic_passage']:

    extracted_passage = ""

    for line in passage:
      extracted_passage = extracted_passage + " " + line

    if extracted_passage.find(checked_sentence) == 1:
      
      return extracted_passage

In [31]:
print("Sentence found:", query_subset[0][1])

# Here the whole data_extract_train is passed in, so it is a lot of uncessesary searing. 
complete_wiki_passge = find_article(checked_sentence=query_subset[0][1], data_extract=data_extract_train)

print("Wiki-passage:", complete_wiki_passge)

Addiction Services, a division of the Nova Scotia Department of Health Promotion and Protection, aims to assist all individuals in achieving a safe and healthy lifestyle.


NameError: ignored