<a href="https://colab.research.google.com/github/adamzki99/nlp-zlatan/blob/feature%2Fdoc2vec_approach/nlp_zlatan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia

# Reading the dataset

In [None]:
import json

with open('data.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

# Data exploration


In [None]:
len(data)

In [None]:
data[:5]

In [None]:
data[0].keys()

In [None]:
data[0]['chosen_topic_passage']

In [None]:
data[0].keys()

In [None]:
print(data[0]['persona'])
print(data[0]['chosen_topic'])

Dictionary keys of Wizard

In [None]:
data[0]['dialog'][0].keys()

Dictionary keys of Apprentice

In [None]:
data[0]['dialog'][1].keys()

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['text'])

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['retrieved_topics'])

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['retrieved_passages'])

## Exploring unique types

Exploring how many unique "chosen_topic"s, "persona"s and "wizard_eval"s there are in the dataset

In [None]:
topics = []
personas = []
wizardEvals = []

for entry in data:

  topics.append(entry['chosen_topic'])
  personas.append(entry['persona'])
  wizardEvals.append(entry['wizard_eval'])

# Keep only the unique items in each list
topics = list(set(topics))
personas = list(set(personas))
wizardEvals = list(set(wizardEvals))

print("topic:", len(topics), "persona:", len(personas), "wizard_eval:", len(wizardEvals))

Why are there more than 5 different "wizard_eval"s? The paper only mentions a rating from 1-5. What are the other 2?

In [None]:
# Print each distinct rating (indexing the list by the rating itself would be wrong)
for entry in sorted(wizardEvals):
  print(entry)
# What's up with -1 and 0? The paper only mentions ratings from 1 to 5

How often does each rating occur in "wizard_eval"s? Visualize all the different instances in a histogram

In [None]:
import matplotlib.pyplot as plt
import numpy as np

wEval = []

for entry in data:
    wEval.append(entry['wizard_eval'])

plt.hist(wEval, bins=2*len(set(wEval))) #the number of bins can probably be improved to look nicer
plt.yscale('log')
plt.show()
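
The exact number of dialogues per rating can also be tabulated directly, which sidesteps the choice of bin count. A minimal sketch, assuming a small made-up list `toy_evals` in place of the real `wEval` values:

```python
from collections import Counter

# Hypothetical ratings standing in for the real `wEval` list
toy_evals = [5, 4, 5, -1, 0, 3, 5, 4]

# Count how often each rating occurs and print them in ascending order
rating_counts = Counter(toy_evals)
for rating in sorted(rating_counts):
    print(rating, ":", rating_counts[rating])
```

With the real data, `Counter(wEval)` should give the same counts that the histogram bars represent.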

In [None]:
# What is a topic?

topics[:10]

In [None]:
# What is a persona?

personas[:10]

## Open question 1

Maybe there is some relation between topics and personas that we could cluster in order to get some further insight?

## Trying to cluster (Farid)

### Data preprocessing

Preprocess the data before clustering.

Combine chosen_topic and chosen_topic_passage (basically the Wiki article) so that they can be clustered afterwards.

In [None]:
topics = [f"{sample['chosen_topic']}\n\n" + "\n".join([f"{passage}" for passage in sample['chosen_topic_passage']]) for sample in data]

In [None]:
print(topics[10]) # The 'chosen_topic' is repeated at the beginning of the article anyway, so there is no real need to prepend it

### Vectorization of topics using TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english')

Fitting the vectorizer to the data

In [None]:
vectorizer.fit(topics)

Size of Vocabulary

In [None]:
vocab = vectorizer.get_feature_names_out()

print(f"Length of vocabulary: {len(vocab)}")

Random sampling from Vocabulary

In [None]:
import random

sorted(random.sample(vocab.tolist(),20))

Vectorization of topics

In [None]:
vector_topics = vectorizer.transform(topics)

TF-IDF values of first topic

In [None]:
sorted([(vocab[j], vector_topics[0, j]) for j in vector_topics[0].nonzero()[1]], key=lambda x: -x[1])

### Minibatch k-means

In [None]:
from sklearn.cluster import MiniBatchKMeans

#### Elbow method to find number of clusters k

Generate the performance measure (the inertia) for each k in the range. Decrease the upper bound of k to around 50 to run faster.

In [None]:
performance = [MiniBatchKMeans(n_clusters=k, batch_size=500, random_state=2307).fit(vector_topics).inertia_ for k in range(1,100)]

Use some standard code to plot the performance measure against the value k

In [None]:
plt.figure()
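# Plot against the actual k values (k starts at 1, not at list index 0)
plt.plot(range(1, len(performance) + 1), performance)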
plt.ylabel('Within-cluster sum-of-squares')
plt.xlabel('k')
plt.show()

According to tutorial 4: "In theory it should always increase since the more cluster centroids there are, the more flexibility the model has for describing datapoints (assigning them to clusters)"

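So something is probably wrong. Note, however, that the within-cluster sum of squares (`inertia_`) is expected to decrease as k grows, so the quoted statement presumably refers to a measure where higher is better. The curve can also be noisy because MiniBatchKMeans fits on random mini-batches.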
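
To make the quantity on the y-axis concrete: it is the sum of squared distances from each point to the centroid of its cluster. A minimal sketch on made-up 1-D data, where the helper `wcss` is purely illustrative and not part of scikit-learn, compares one cluster with two:

```python
def wcss(points, labels):
    """Within-cluster sum of squares for 1-D points with given cluster labels."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroid = sum(members) / len(members)
        total += sum((p - centroid) ** 2 for p in members)
    return total

# Hypothetical 1-D data forming two obvious groups
points = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]

one_cluster = wcss(points, [0, 0, 0, 0, 0, 0])   # k = 1
two_clusters = wcss(points, [0, 0, 0, 1, 1, 1])  # k = 2

print(one_cluster, two_clusters)
```

On this toy data the value drops sharply once each group gets its own centroid, which is the kind of bend the elbow method looks for.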

# Doc2Vec Approach (Farid)

## Import necessary tools

In [None]:
!pip install --upgrade gensim

In [None]:
import gensim

Install BLAS to reduce computation time (this obviously doesn't work yet, so we still need to find out how to fix it)

In [None]:
import scipy.linalg
from scipy.linalg import blas

In [None]:
!pip install numpy pyblas

## Import Data

In [None]:
import json

with open('data.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

## Quick look at the Data

In [None]:
# This dataframe is never used, but it is useful for looking at the dataset

import pandas as pd

df = pd.DataFrame(data)
df

## Preprocessing the Data

We first have to decide which data we want to use to train the model, i.e. what goal we are trying to achieve.
As we want to retrieve the correct passage for each turn, we should probably train the model on the given passages and then try to retrieve the chosen passage given a sentence from the dialogue.

In [None]:
data[0]["chosen_topic_passage"]

So we want to take all the sentences from each "chosen_topic_passage" and separately use those as the training data

### First Try (not the correct format)

In [None]:
passages = [f" ".join([f"{passage}" for passage in sample['chosen_topic_passage']]) for sample in data]

In [None]:
passages[0]

In [None]:
def preprocess(data,tokens_only=False):
  for i, line in enumerate(data):
    tokens = gensim.utils.simple_preprocess(line)
    if tokens_only:
      yield tokens
    else:
      # For training data, add tags
      yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

Preprocess training data

In [None]:
train_corpus = list(preprocess(passages))

In [None]:
train_corpus[0]

This is not what I want -> here all sentences are joined together, but they need to be separate!

### Second Try (seems to work as it should)

In [None]:
passages = [[passage for passage in sample['chosen_topic_passage']] for sample in data]

In [None]:
passages[0]

In [None]:
passages[0][0]

Now we have a nested list of lists -> let's unfold that list in a way that the nested entries of those lists are their own entries

In [None]:
sentences = []
for i in passages:
  for entry in i:
    sentences.append(entry)
    
sentences[1]
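
The same one-level flattening can be written with `itertools.chain.from_iterable`. A minimal sketch, assuming a tiny made-up nested list `toy_passages` in place of the real `passages`:

```python
from itertools import chain

# Hypothetical nested list standing in for `passages`
toy_passages = [["Sentence A1.", "Sentence A2."], ["Sentence B1."]]

# Flatten one level: every inner sentence becomes its own entry
toy_sentences = list(chain.from_iterable(toy_passages))
print(toy_sentences)
```

The order of the sentences is preserved, so the integer tags assigned later by `preprocess` still follow the order of the passages.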

Let's define a function for preprocessing our data

In [None]:
def preprocess(data,tokens_only=False):
  for i, line in enumerate(data):
    tokens = gensim.utils.simple_preprocess(line)
    if tokens_only:
      yield tokens
    else:
      # For training data, add tags
      yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

Preprocess training data

In [None]:
train_corpus = list(preprocess(sentences))

In [None]:
train_corpus[0]

# Retrieval-based chatbots

This approach is more or less the same as the one shown in Tutorial_08.

## Data extraction

In [None]:
import json

with open('train.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

In [None]:
# just for looking at the raw dataset
data[0]

In [None]:
# This dataframe is never used, but it is useful for looking at the dataset

import pandas as pd

df = pd.DataFrame(data)
df

Now we do some data extraction from the dataset. We want to produce a set of message/response pairs, where each message comes from the apprentice and each response from the wizard. These pairs are what the model later retrieves from.

This limits the model, as it won't have any "memory"/context from the complete conversation. But the aim is for it to act as a "smart vector database" and retrieve passages that are similar enough.

In [None]:
user_query = []
wizard_responses = []

chosen_topic = ""

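for dialogue in data:

  # Only use dialogues that are opened by the Wizard
  if 'Wizard' not in dialogue['dialog'][0]['speaker']:
      continue

  chosen_topic = dialogue['chosen_topic']

  # The opening Wizard turn is paired with the topic and persona
  user_query.append(chosen_topic + " " + dialogue['persona'])

  for i, prompt in enumerate(dialogue['dialog']):

    # Turns alternate: even indices are the Wizard, odd ones the Apprentice
    if i % 2 == 0:
      wizard_responses.append(chosen_topic + " " + prompt['text'])
    else:
      user_query.append(chosen_topic + " " + prompt['text'])

  # Drop a trailing Apprentice turn without a reply so both lists stay aligned
  del user_query[len(wizard_responses):]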

data_pairs = []

for i, _ in enumerate(wizard_responses):

  data_pairs.append(
      {'message': user_query[i], 'response': wizard_responses[i]}
      )
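
The pairing idea can also be sketched per dialogue, so that every wizard turn is stored together with the message it answers. This is only an illustration: `build_pairs` is a hypothetical helper, the toy dialogue is made up, and the speaker labels are assumed to contain "Wizard" or "Apprentice".

```python
def build_pairs(dialogue):
    """Pair each Wizard turn with the message it answers (hypothetical helper)."""
    topic = dialogue['chosen_topic']
    # The opening Wizard turn is treated as a reply to topic + persona
    message = topic + " " + dialogue['persona']
    pairs = []
    for turn in dialogue['dialog']:
        text = topic + " " + turn['text']
        if 'Wizard' in turn['speaker']:
            pairs.append({'message': message, 'response': text})
        else:
            # An Apprentice turn becomes the message for the next Wizard turn
            message = text
    return pairs

# Toy dialogue using the same keys as the dataset; the content is made up
toy_dialogue = {
    'chosen_topic': 'Gardening',
    'persona': 'i like to garden.',
    'dialog': [
        {'speaker': '0_Wizard', 'text': 'Gardening is relaxing.'},
        {'speaker': '1_Apprentice', 'text': 'What do you grow?'},
        {'speaker': '0_Wizard', 'text': 'Mostly tomatoes.'},
        {'speaker': '1_Apprentice', 'text': 'Nice!'},
    ],
}

toy_pairs = build_pairs(toy_dialogue)
for pair in toy_pairs:
    print(pair)
```

Building the pairs inside the dialogue loop means a final apprentice turn without a reply simply produces no pair, so messages and responses cannot drift apart across dialogues.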

## Model training

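Now we are able to set up the model. The pretrained bi-encoder and cross-encoder are used as they are; the only computation here is embedding the messages.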

In [None]:
%pip install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [None]:
corpus_embeddings = semb_model.encode([sample['message'] for sample in data_pairs], convert_to_tensor=True, show_progress_bar=True, device='cuda')

## Model usage

In [None]:
%pip install hnswlib

In [None]:
import os
import hnswlib

# Create empty index
hnswlib_index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = "./emp_dialogue_hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    hnswlib_index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    hnswlib_index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    hnswlib_index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    hnswlib_index.save_index(index_path)

In [None]:
import numpy as np

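def get_response(message, mes_resp_pairs, index, re_ranking_model=None, top_k=32):
    # Embed the query with the bi-encoder and fetch the top_k nearest stored messages
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()
    corpus_ids, _ = index.knn_query(message_embedding, k=top_k)

    # Without a re-ranker, return the response paired with the nearest message
    if re_ranking_model is None:
        return mes_resp_pairs[corpus_ids[0][0]]['response']

    # Re-rank the candidate responses with the given cross-encoder (not the global one)
    model_inputs = [(message, mes_resp_pairs[idx]['response']) for idx in corpus_ids[0]]
    cross_scores = re_ranking_model.predict(model_inputs)

    # Highest cross-encoder score wins
    idx = np.argsort(-cross_scores)[0]

    return mes_resp_pairs[corpus_ids[0][idx]]['response']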

In [None]:
chatbot_response = get_response(
    "I'm a huge fan of science fiction myself!", data_pairs, hnswlib_index, re_ranking_model=xenc_model
)
chatbot_response

## Testing the model

Testing the model by loading in the **test_random_split.json** file.

### Data extraction

Before we can perform the testing, we need to do some data extraction. The strategy is to take conversations between a wizard and an apprentice and use them to test the accuracy/precision of the model.

What we expect is that the model produces a response that is similar to the one that was used in the conversation. Note that this does not satisfy the "correct passage" requirement.

In [None]:
with open('test_random_split.json', 'r') as file:
    json_data = file.read()
    test = json.loads(json_data)

print('Datatype:', type(test))

In [None]:
test_extract = []

for i, conversation in enumerate(test):

  test_extract.append("new_conv_" + str(i))

  for j, dialog in enumerate(conversation['dialog']):

    if "Wizard" in dialog['speaker']:

      if j == 0:
        continue

      test_extract.append({'wizard':dialog['text']})

    if "Apprentice" in dialog['speaker']:
      test_extract.append({'apprentice':dialog['text']})

test_extract[:10]

The data is still quite "dirty". So we will perform the cumbersome clean-up in the next cell to get a list of dictionaries, where each dictionary contains an apprentice/wizard pair that will be used for testing.

In [None]:
pair = []

test_pairs = []

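for i, text in enumerate(test_extract):

  # A marker string starts a new conversation: drop any unmatched turn
  # so that pairs never span two conversations
  if isinstance(text, str):
    pair = []
    continue

  pair.append(text)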

  if len(pair) == 2:
    
    entry = {'apprentice':"", 'wizard': ""}

    for _, e in enumerate(pair):

      if 'apprentice' in e.keys():
        entry['apprentice'] = e['apprentice']

      if 'wizard' in e.keys():
        entry['wizard'] = e['wizard']


    test_pairs.append(entry)
    pair = []

test_pairs[:5]

In [None]:
import random

# Pick a random test pair (bounded by the actual number of pairs)
rand_int = random.randrange(len(test_pairs))

chatbot_response = get_response(
      test_pairs[rand_int]['apprentice'], data_pairs, hnswlib_index, re_ranking_model=xenc_model
  )

print(test_pairs[rand_int]['apprentice'])
print(test_pairs[rand_int]['wizard'])
print(chatbot_response)

Now we should be able to do some testing. Here we use two approaches: a naive one where we look for exact matches, and one where we compute BLEU scores.

The naive approach is useful for the assignment requirement, which specifies finding the "correct passage".

The BLEU score measures the n-gram precision of a candidate against a reference. It might not provide much (if any) useful information to us, as we are not doing a sentence-to-sentence transformation.

In [None]:
from nltk.translate.bleu_score import sentence_bleu

correct_responses = 0

bleu_scores = []

for _, entry in enumerate(test_pairs):
  chatbot_response = get_response(
      entry['apprentice'], data_pairs, hnswlib_index, re_ranking_model=xenc_model
  )

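  # Naive accuracy: exact string match with the wizard's actual reply.
  # Note: the stored responses are prefixed with the chosen topic, so the
  # prefix has to be stripped before this comparison can ever succeed.
  if chatbot_response == entry['wizard']:
    correct_responses += 1

  # BLEU score of the retrieved response against the wizard's actual reply
  # (the reference is the expected answer, not the apprentice's prompt)
  reference = [entry['wizard'].split()]
  candidate = chatbot_response.split()
  bleu_scores.append(sentence_bleu(reference, candidate))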

accuracy = correct_responses / len(test_pairs)

print("Test accuracy (%):", accuracy * 100)
print("Average BLEU-score:", sum(bleu_scores) / len(bleu_scores))
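
To give a feeling for what BLEU rewards, the sketch below computes the clipped unigram precision, which is one ingredient of the score. It is deliberately not full BLEU (no higher-order n-grams, no brevity penalty) and not NLTK's implementation; the function name and the example sentences are illustrative.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Modified (clipped) unigram precision, the basic ingredient of BLEU."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Each candidate word is credited at most as often as it appears in the reference
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat"
candidate = "the the the the the the the"
precision = clipped_unigram_precision(candidate, reference)
print(precision)
```

Because "the" occurs only twice in the reference, only two of the seven candidate words are credited. A retrieved response that shares few words with the wizard's reply will therefore score low even if it is a sensible answer.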

# Retrieval-based response chatbot (Not accurate title)

This implementation aims to create a retrieval-based response chatbot that provides the correct answer to a given passage. This is done by taking all the correct answers, generating embeddings for them, and then performing a "search" in the resulting vector space to find the passage that most closely matches the given passage.

## Data extraction

Extract user prompts and wizard responses from the list of dialogues and store them in separate lists. Only dialogues opened by the wizard are used, and turns are assigned by their position in the dialogue.

We also prepend some extra information, such as the chosen topic, to each string in order to increase the precision of the search later. This is a valid approach and can be seen as simply adding more context to the passage.

In [None]:
#sets of documents
user_query = []
wizard_responses = []

chosen_topic = ""

for dialogue in data:

  if 'Wizard' not in dialogue['dialog'][0]['speaker']:
      continue

  chosen_topic = dialogue['chosen_topic']

  user_query.append(chosen_topic + " " + dialogue['persona'])

  for i, prompt in enumerate(dialogue['dialog']):

    if i % 2 == 0:
      wizard_responses.append(chosen_topic + " " + prompt['text'])
    else:
      user_query.append(chosen_topic + " " + prompt['text'])

## Document vectorization

In [None]:
# TfidfVectorizer 
# CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

In [None]:
# instantiate the vectorizer object
#countvectorizer = CountVectorizer(analyzer= 'word', stop_words='english')
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words= 'english')

In [None]:
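# convert the documents into matrices
#count_wm = countvectorizer.fit_transform(train)
# Fit once on both corpora so queries and responses share one vocabulary
# (a second fit_transform would overwrite the vocabulary of the first)
tfidfvectorizer.fit(user_query + wizard_responses)
query_wm = tfidfvectorizer.transform(user_query)
response_wm = tfidfvectorizer.transform(wizard_responses)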

In [None]:
# retrieve the terms found in the corpora
# (with the same parameters, CountVectorizer and TfidfVectorizer return the same get_feature_names_out() output)
# The vectorizer holds a single vocabulary, so both names refer to the same terms
query_tokens = tfidfvectorizer.get_feature_names_out()
responce_tokens = query_tokens

## Verification

Some output in order to quickly verify the embeddings

In [None]:
responce_vectors = tfidfvectorizer.transform(wizard_responses)
query_vectors = tfidfvectorizer.transform(user_query)

print('responce_vectors:\n', responce_vectors)

print('query_vectors:\n', query_vectors)

In [None]:
sorted([(query_tokens[j], query_vectors[0, j]) for j in query_vectors[0].nonzero()[1]], key=lambda x: -x[1])

## Search the vector space

Here we calculate the closest neighbor to the embedding of the query, and hopefully that is the "correct" passage we are looking for.

In [None]:
import numpy as np

query = 'Gardening: i like to garden.'

query_vec = tfidfvectorizer.transform([query])

# TF-IDF rows are L2-normalised by default, so the dot product is the cosine similarity
scores = (responce_vectors @ query_vec.T).toarray().ravel()
index = np.argmax(scores)
print(wizard_responses[index])
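
The nearest-neighbour search itself can be illustrated without any library. The sketch below uses made-up 3-dimensional vectors in place of real TF-IDF vectors; `cosine`, `toy_docs`, `toy_vectors` and `toy_query` are all hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-dimensional document vectors and their texts
toy_docs = ["about gardening", "about football", "about cooking"]
toy_vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.1]

# Score every document against the query and keep the best one
scores = [cosine(toy_query, vec) for vec in toy_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(toy_docs[best])
```

This brute-force scan is linear in the number of documents, which is fine for a quick check but is the reason an approximate index such as hnswlib is used in the earlier section.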