<a href="https://colab.research.google.com/github/adamzki99/nlp-zlatan/blob/feature%2Fclustering_verification/nlp_zlatan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia

/content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia


# Retrieval-based chatbots

This approach is more or less the same as the one shown in Tutorial_08.

## Data extraction

In [None]:
import json

with open('train.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

In [None]:
# just for looking at the raw dataset
data[0]

In [None]:
# This dataframe is never used, but it is useful for looking at the dataset

import pandas as pd

df = pd.DataFrame(data)
df

Now we extract the data we need from the dataset. We want to produce a set of pairs, each holding an apprentice message and the wizard reply that follows it; these pairs are then used to train the model.

This limits the model, as it won't have any "memory"/context from the complete conversation. But the aim is for it to act as a "smart vector database" and retrieve similar enough passages.

In [None]:
data_pairs = []

for dialogue in data:

  turns = dialogue['dialog']

  # Only keep dialogues that the Wizard opens
  if not turns or 'Wizard' not in turns[0]['speaker']:
    continue

  chosen_topic = dialogue['chosen_topic']

  # The persona acts as the "message" for the Wizard's opening turn
  query = chosen_topic + " " + dialogue['persona']

  for i, prompt in enumerate(turns):

    if i % 2 == 0:
      # Wizard turn: pair it with the most recent message.
      # Pairing inside the dialogue keeps messages and responses aligned
      # even when a dialogue ends on an Apprentice turn.
      data_pairs.append(
          {'message': query, 'response': chosen_topic + " " + prompt['text']}
          )
    else:
      # Apprentice turn: becomes the message for the next Wizard turn
      query = chosen_topic + " " + prompt['text']
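
To make the intended pairing concrete, here is a minimal sketch on a hand-written toy dialogue. The helper name `build_pairs`, the toy texts and the speaker labels (`0_Wizard`, `1_Apprentice`) are assumptions made for illustration only. Unlike the cell above, the sketch decides by the `speaker` field rather than by turn position.

```python
def build_pairs(dialogue):
    """Pair each Wizard turn with the text that preceded it (hypothetical helper)."""
    topic = dialogue['chosen_topic']
    # The persona stands in as the "message" for the Wizard's opening turn
    query = topic + " " + dialogue['persona']
    pairs = []
    for turn in dialogue['dialog']:
        text = topic + " " + turn['text']
        if 'Wizard' in turn['speaker']:
            pairs.append({'message': query, 'response': text})
        else:
            # Apprentice turn: becomes the message for the next Wizard turn
            query = text
    return pairs

# Toy data, invented for this example
toy_dialogue = {
    'chosen_topic': 'Chess',
    'persona': 'i like board games.',
    'dialog': [
        {'speaker': '0_Wizard', 'text': 'Chess is a two-player strategy game.'},
        {'speaker': '1_Apprentice', 'text': 'How old is it?'},
        {'speaker': '0_Wizard', 'text': 'It has been played for centuries.'},
    ],
}

toy_pairs = build_pairs(toy_dialogue)
for p in toy_pairs:
    print(p)
```

Each resulting pair holds one message and the Wizard reply that directly followed it, which is the unit the retrieval step works with.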

## Model training

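Now we are able to train the model. Note that no weights are updated here: we load pretrained models and encode every message in the training pairs, so that they can be searched later.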

In [None]:
%pip install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [None]:
corpus_embeddings = semb_model.encode([sample['message'] for sample in data_pairs], convert_to_tensor=True, show_progress_bar=True, device='cuda')

## Model usage

In [None]:
%pip install hnswlib

In [None]:
import os
import hnswlib

# Create empty index
hnswlib_index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = "./emp_dialogue_hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    hnswlib_index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    hnswlib_index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    hnswlib_index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    hnswlib_index.save_index(index_path)
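
The index above answers approximate nearest-neighbour queries under cosine distance (one minus cosine similarity). As a rough mental model, the sketch below does the same search exactly, by brute force, on hand-written 2-D vectors. The function names and toy vectors are assumptions for illustration, and this is not how HNSW works internally; it only shows what kind of result a query returns.

```python
import math

def cosine_distance(a, b):
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_knn(query, vectors, k):
    """Return (ids, distances) of the k vectors closest to the query."""
    # Score every vector, then sort by ascending distance
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [i for _, i in top], [d for d, _ in top]

# Toy "embeddings", invented for this example
toy_vectors = [
    [1.0, 0.0],   # id 0
    [0.0, 1.0],   # id 1
    [1.0, 1.0],   # id 2
]

ids, distances = brute_force_knn([1.0, 0.1], toy_vectors, k=2)
print(ids, distances)
```

Brute force compares the query with every stored vector, which gets slow for a large corpus; an approximate index trades a little accuracy for much faster lookups.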

In [None]:
import numpy as np

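def get_response(message, mes_resp_pairs, index, re_ranking_model=None, top_k=32):
    # Embed the incoming message with the sentence embedding model
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()

    # Retrieve the top_k stored messages closest to the query
    corpus_ids, _ = index.knn_query(message_embedding, k=top_k)
    candidates = [mes_resp_pairs[idx]['response'] for idx in corpus_ids[0]]

    # Without a re-ranking model, return the first retrieved candidate
    if re_ranking_model is None:
        return candidates[0]

    # Score every (message, response) pair with the re-ranking model
    # that was passed in, and return the highest-scoring response
    cross_scores = re_ranking_model.predict([(message, c) for c in candidates])

    return candidates[int(np.argmax(cross_scores))]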
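
The function above follows a retrieve-then-re-rank pattern: a fast first stage shortlists candidates by comparing the query with the stored messages, and a slower second stage scores each shortlisted response against the query. The sketch below mimics that flow with toy word-overlap scorers in place of the two neural models. All names and texts in it are assumptions for illustration.

```python
def word_overlap(a, b):
    """Toy first-stage score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def jaccard(a, b):
    """Toy second-stage score: shared words divided by all distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieve_and_rerank(message, pairs, top_k=2):
    # Stage 1: shortlist the top_k pairs whose stored message best matches the query
    shortlist = sorted(
        pairs, key=lambda p: word_overlap(message, p['message']), reverse=True
    )[:top_k]
    # Stage 2: re-score each shortlisted response against the query, keep the best
    best = max(shortlist, key=lambda p: jaccard(message, p['response']))
    return best['response']

# Toy data, invented for this example
toy_pairs = [
    {'message': 'i love science fiction books',
     'response': 'science fiction often explores space travel'},
    {'message': 'i love cooking pasta',
     'response': 'pasta is a staple of italian cuisine'},
    {'message': 'i adore science fiction movies',
     'response': 'i love science fiction too'},
]

answer = retrieve_and_rerank('i love science fiction', toy_pairs, top_k=2)
print(answer)
```

With `top_k=1` the second stage has nothing to choose from and the first-stage winner is returned, so the re-ranker only matters when the shortlist holds more than one candidate.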

In [None]:
chatbot_response = get_response(
    "I'm a huge fan of science fiction myself!", data_pairs, hnswlib_index, re_ranking_model=xenc_model
)
chatbot_response

## Testing the model

Testing the model by loading in the **test_random_split.json** file.

### Data extraction

Before we can run the tests, we need to do some data extraction. The strategy is to find exchanges between a wizard and an apprentice, and use them to test the accuracy/precision of the model.

What we expect is that the model produces a response similar to the one used in the conversation. Note that this does not satisfy the "correct passage" requirement.

In [None]:
with open('test_random_split.json', 'r') as file:
    json_data = file.read()
    test = json.loads(json_data)

print('Datatype:', type(test))

In [None]:
test_extract = []

for i, conversation in enumerate(test):

  test_extract.append("new_conv_" + str(i))

  for j, dialog in enumerate(conversation['dialog']):

    if "Wizard" in dialog['speaker']:

      if j == 0:
        continue

      test_extract.append({'wizard':dialog['text']})

    if "Apprentice" in dialog['speaker']:
      test_extract.append({'apprentice':dialog['text']})

test_extract[:10]

The data is still quite "dirty", so the next cell performs the cumbersome clean-up to get a list of dictionaries, where each dictionary contains the apprentice/wizard pair that will be used for testing.

In [None]:
pair = []

test_pairs = []

for text in test_extract:

  # A string marks the start of a new conversation:
  # drop any unpaired turn so it cannot leak into the next conversation
  if isinstance(text, str):
    pair = []
    continue

  pair.append(text)

  if len(pair) == 2:

    entry = {'apprentice': "", 'wizard': ""}

    # Copy each turn into its slot, whichever order the two arrived in
    for e in pair:
      entry.update(e)

    test_pairs.append(entry)
    pair = []

test_pairs[:5]

In [None]:
import random

# Pick a random test pair (bounded by the number of pairs available)
rand_int = random.randrange(len(test_pairs))

chatbot_response = get_response(
    test_pairs[rand_int]['apprentice'], data_pairs, hnswlib_index, re_ranking_model=xenc_model
)

print(test_pairs[rand_int]['apprentice'])
print(test_pairs[rand_int]['wizard'])
print(chatbot_response)

Now we should be able to do some testing. Here we use two approaches: a naive one where we look for exact matches, and one where we compute BLEU scores.

The naive approach is useful for the assignment requirement that specifies finding the "correct passage".

The BLEU score measures the n-gram overlap between the retrieved response and the reference. It might not provide much (if any) useful information to us, as we are not doing a sentence-to-sentence transformation.
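
To give a feel for the two measures, the sketch below computes exact-match accuracy and a clipped unigram precision on two hand-written sentence pairs. The clipped unigram precision is only one ingredient of BLEU (which also combines higher-order n-grams and a brevity penalty), so this is a simplified illustration and not a replacement for `sentence_bleu`. The function names and toy sentences are assumptions.

```python
from collections import Counter

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

def clipped_unigram_precision(candidate, reference):
    """Share of candidate words that also occur in the reference, with clipped counts."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # A word is credited at most as many times as it appears in the reference
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

# Toy data, invented for this example
references = ['chess is an old game', 'it has two players']
predictions = ['chess is an old game', 'chess has two sides']

print(exact_match_accuracy(predictions, references))
print(clipped_unigram_precision(predictions[1], references[1]))
```

The second prediction shares half of its words with its reference yet counts as a plain miss for exact matching, which is why the two measures can tell quite different stories for a retrieval model.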

In [None]:
from nltk.translate.bleu_score import sentence_bleu

correct_responses = 0

bleu_scores = []

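for entry in test_pairs:
  chatbot_response = get_response(
      entry['apprentice'], data_pairs, hnswlib_index, re_ranking_model=xenc_model
  )

  # Naive accuracy: exact string match with the wizard's actual reply.
  # Note: responses in data_pairs carry the chosen topic as a prefix.
  if chatbot_response == entry['wizard']:
    correct_responses += 1

  # BLEU score calculation: the reference is the wizard's actual reply,
  # the candidate is the response retrieved by the chatbot
  reference = [entry['wizard'].split()]
  candidate = chatbot_response.split()
  bleu_scores.append(sentence_bleu(reference, candidate))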

accuracy = correct_responses / len(test_pairs)

print("Test accuracy (%):", accuracy * 100)
print("Average BLEU-score:", sum(bleu_scores) / len(bleu_scores))