<a href="https://colab.research.google.com/github/adamzki99/nlp-zlatan/blob/feature%2Fdoc2vec_approach/nlp_zlatan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connect to Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia

/content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia


# Reading the dataset

In [3]:
import json

with open('data.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

Datatype: <class 'list'>


# Data exploration


In [None]:
len(data)

In [None]:
data[:5]

In [None]:
data[0].keys()

In [None]:
data[0]['chosen_topic_passage']

In [None]:
data[0].keys()

In [None]:
print(data[0]['persona'])
print(data[0]['chosen_topic'])

Dictionary keys of Wizard

In [None]:
data[0]['dialog'][0].keys()

Dictionary keys of Apprentice

In [None]:
data[0]['dialog'][1].keys()

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['text'])

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['retrieved_topics'])

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['retrieved_passages'])

## Exploring uniqe types

Exploring how many uniqe "chosen_topic"s, "persona"s and "wizard_eval"s there are in the dataset

In [None]:
topics = []
personas = []
wizardEvals = []

for entry in data:

  topics.append(entry['chosen_topic'])
  personas.append(entry['persona'])
  wizardEvals.append(entry['wizard_eval'])

# Making the list containing only uniqe items
topics = list(set(topics))
personas = list(set(personas))
wizardEvals = list(set(wizardEvals))

print("topic:", len(topics), "persona:", len(personas), "wizard_eval:", len(wizardEvals))

Why are there more than 5 different "wizard_eval"s? The paper only mentions a rating from 1-5. What are the other 2?

In [None]:
for entry in wizardEvals:
  print(wizardEvals[entry] )
#what's up with -1 and 0? In paper only ratings from 1 to 5 are mentioned

How often does each rating occur in "wizard_eval"s? Visualize all the different instances in a histogram

In [None]:
import matplotlib.pyplot as plt
import numpy as np

wEval = []

for entry in data:
    wEval.append(entry['wizard_eval'])

plt.hist(wEval, bins=2*len(set(wEval))) #the number of bins can probably be improved to look nicer
plt.yscale('log')
plt.show()

In [None]:
# What is a topic?

topics[:10]

In [None]:
# What is a persona?

personas[:10]

## Open question 1

Maybe there is some relation between topics and personas that we might be able to cluster in order to get som further insight?

##Trying to cluster (Farid)

###Data preprocessing

Preprocess data before clustering 

Combining chosen_topic and chosen_topic_passage (basically the Wiki article) to try to cluster them afterwards 

In [None]:
topics = [f"{sample['chosen_topic']}\n\n" + "\n".join([f"{passage}" for passage in sample['chosen_topic_passage']]) for sample in data]

In [None]:
print(topics[10]) #The 'chosen_topic' is repepated at the beginning of the article anyway, so no need in repeating it tbh

###Vectorization of topics using TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english')

Fitting the vectorizer to the data

In [None]:
vectorizer.fit(topics)

Size of Vocabulary

In [None]:
vocab = vectorizer.get_feature_names_out()

print(f"Length of vocabulary: {len(vocab)}")

Random sampling from Vocabulary

In [None]:
import random

sorted(random.sample(vocab.tolist(),20))

Vectorization of topics

In [None]:
vector_topics = vectorizer.transform(topics)

TF-IDF values of first topic

In [None]:
sorted([(vocab[j], vector_topics[0, j]) for j in vector_topics[0].nonzero()[1]], key=lambda x: -x[1])

###Minibatch k-means

In [None]:
from sklearn.cluster import MiniBatchKMeans

####Elbow method to find number of clusters k

Generate the performance evaluation measure values across the range of k values -> Decrease k to around 50 to run faster

In [None]:
performance = [MiniBatchKMeans(n_clusters=k, batch_size=500, random_state=2307).fit(vector_topics).inertia_ for k in range(1,100)]

Use some standard code to plot the performance measure against the value k

In [None]:
plt.figure()
plt.plot(performance)
plt.ylabel('Within-cluster sum-of-squares')
plt.xlabel('k')
plt.show()

According to tutorial 4: "In theory it should always increase since the more cluster centroids there are, the more flexibility the model has for describing datapoints (assigning them to clusters)"

So something is probably wrong

#Doc2Vec Approach (Farid)

##Import necessary tools

In [6]:
%pip install --upgrade gensim
import gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Install blas to reduce computation time

In [None]:
#import scipy.linalg
#from scipy.linalg import blas
#from scipy.linalg.blas import cblas
#%pip install numpy pyblas

##Import Data

In [None]:
import json

with open('data.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

##Quick look at the Data

In [7]:
# This dataframe is never used, but it is useful for looking at the dataset

import pandas as pd

df = pd.DataFrame(data)
df

Unnamed: 0,chosen_topic,persona,wizard_eval,dialog,chosen_topic_passage
0,Science fiction,i enjoy movies about aliens invading the earth.,5,"[{'speaker': '0_Wizard', 'text': 'I think scie...",[Science fiction (often shortened to SF or sci...
1,Internet access,i have high speed internet.,5,"[{'speaker': '0_Apprentice', 'text': 'Can you ...",[Internet access is the ability of individuals...
2,Pharmacist,i am a pharmacist.,5,"[{'speaker': '0_Apprentice', 'text': 'I am jus...","[Pharmacists, also known as chemists (Commonwe..."
3,Homebrewing,i brew my own beer.,5,"[{'speaker': '0_Apprentice', 'text': 'I have h...",[Homebrewing is the brewing of beer on a small...
4,Red hair,i have red hair.,5,"[{'speaker': '0_Apprentice', 'text': 'red hair...",[Red hair (or ginger hair) occurs naturally in...
...,...,...,...,...,...
22306,Green,my favorite color is green.,4,"[{'speaker': '0_Wizard', 'text': 'So, you know...",[Green is the color between blue and yellow on...
22307,Motivation,i have trouble getting motivated.,2,"[{'speaker': '0_Wizard', 'text': 'I swear, it ...",[Motivation is the reason for people's actions...
22308,List of national parks of the United States,i like to visit national parks.,5,"[{'speaker': '0_Apprentice', 'text': 'I've bee...",[The United States has 59 protected areas know...
22309,Kendrick Lamar,i listen to rap.,3,"[{'speaker': '0_Wizard', 'text': 'Kendrick Lam...","[Kendrick Lamar Duckworth (born June 17, 1987)..."


##Preparing the Data

###Preparing the training set

We first have to decide which Data we want to use to train the model aka what goal are we trying to achieve.
As we want to retrieve the correct passage for each turn we should probably train the model on the passages given and then try to retrieve the chosen passage given a sentence from the dialogue

In [8]:
import pandas as pd

pd.DataFrame(data[0]["chosen_topic_passage"])

Unnamed: 0,0
0,Science fiction (often shortened to SF or sci-...
1,Science fiction often explores the potential c...
2,"It usually avoids the supernatural, unlike the..."
3,"Historically, science-fiction stories have had..."
4,"Science fiction is difficult to define, as it ..."
5,"Hugo Gernsback, who suggested the term ""scient..."
6,They supply knowledge... in a very palatable f...


So we want to take all the sentences from each "chosen_topic_passage" and separately use those as the training data

In [9]:
passages = [[passage for passage in sample['chosen_topic_passage']] for sample in data]
pd.DataFrame(passages[:2])

Unnamed: 0,0,1,2,3,4,5,6
0,Science fiction (often shortened to SF or sci-...,Science fiction often explores the potential c...,"It usually avoids the supernatural, unlike the...","Historically, science-fiction stories have had...","Science fiction is difficult to define, as it ...","Hugo Gernsback, who suggested the term ""scient...",They supply knowledge... in a very palatable f...
1,Internet access is the ability of individuals ...,"Various technologies, at a wide range of speed...","Internet access was once rare, but has grown r...","In 1995, only percent of the world's populatio...","By the first decade of the 21st century, many ...","The Internet developed from the ARPANET, which...",Use by a wider audience only came in 1995 when...


Now we have a nested list of lists -> let's unfold that list in a way that the nested entries of those lists are their own entries

In [10]:
sentences = []
for i in passages:
  for entry in i:
    sentences.append(entry)
    
pd.DataFrame(sentences[:10])

Unnamed: 0,0
0,Science fiction (often shortened to SF or sci-...
1,Science fiction often explores the potential c...
2,"It usually avoids the supernatural, unlike the..."
3,"Historically, science-fiction stories have had..."
4,"Science fiction is difficult to define, as it ..."
5,"Hugo Gernsback, who suggested the term ""scient..."
6,They supply knowledge... in a very palatable f...
7,Internet access is the ability of individuals ...
8,"Various technologies, at a wide range of speed..."
9,"Internet access was once rare, but has grown r..."


Let's check our dataset for duplicates

In [11]:
print(f"Dataset with duplicates: {len(sentences)}")

#Let's turn the list into a dictionary and then back into a list to eliminate duplicates
unique_sentences = list(dict.fromkeys(sentences))
print(f"Cleaned up Dataset: {len(unique_sentences)}")

Dataset with duplicates: 210354
Cleaned up Dataset: 12702


*We reduced our dataset to 6% of the original one!*

###Preparing the test set

For the test set we need all the sentences created by the wizard which are based on sentences from Wikipedia articles aka the training set so we can then test the similarity between those sentences and the training set.
This way we want to be able to recover the sentence that was used to craft a response given by the wizard.
We should also save the actual used sentence in some dictionary linking the response and the used sentence to be able to evaluate the model

Let's take a look at the structure of the dialogue using pandas

In [12]:
import pandas as pd
df_dialog = pd.DataFrame(data[0]['dialog'][:2])
df_dialog

Unnamed: 0,speaker,text,checked_sentence,checked_passage,retrieved_passages,retrieved_topics
0,0_Wizard,I think science fiction is an amazing genre fo...,{'chosen_Science_fiction_0': 'Science fiction ...,{'chosen_topic_0_Science_fiction': 'Science fi...,[{'Hyperspace (science fiction)': ['Hyperspace...,"[Hyperspace (science fiction), Science fiction..."
1,1_Apprentice,I'm a huge fan of science fiction myself!,,,[{'Science fiction': ['Science fiction (often ...,"[Science fiction, History of science fiction, ..."


In [13]:
def get_value_from_dict(dictionary):
    for _, value in dictionary.items():
            return value

In [14]:
print(get_value_from_dict(data[0]['dialog'][0]['checked_sentence']))

Science fiction (often shortened to SF or sci-fi) is a genre of speculative fiction, typically dealing with imaginative concepts such as futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life.


In [15]:
#Create dictionary with responses and chosen sentences and list with just responses
response_sentence_pairs = {}
wizard_resps = []

for dialogue in data:
  for entry in dialogue['dialog']:
    if not 'Wizard' in entry['speaker']: #the apprentice doesn't have any responses based on sentences from training set
      continue

    if 'no_passages_used' in entry['checked_sentence']:
      continue

    extracted_text = get_value_from_dict(entry['checked_sentence'])

    response_sentence_pairs.update({entry['text']:extracted_text})
    wizard_resps.append(entry['text'])

Let's check our new dictionary

In [16]:
dict_items = response_sentence_pairs.items()
print(list(dict_items)[:2])

[("I think science fiction is an amazing genre for anything. Future science, technology, time travel, FTL travel, they're all such interesting concepts.", 'Science fiction (often shortened to SF or sci-fi) is a genre of speculative fiction, typically dealing with imaginative concepts such as futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life.'), ('Awesome! I really love how sci-fi storytellers focus on political/social/philosophical issues that would still be around even in the future. Makes them relatable.', 'Science fiction films have often been used to focus on political or social issues, and to explore philosophical issues like the human condition.')]


Check out the list

In [17]:
wizard_resps[:2]

["I think science fiction is an amazing genre for anything. Future science, technology, time travel, FTL travel, they're all such interesting concepts.",
 'Awesome! I really love how sci-fi storytellers focus on political/social/philosophical issues that would still be around even in the future. Makes them relatable.']

*Great, now we have a list with all the responses given by the wizard and a dictionary linking all the responses to the original source sentences.*

Do we have any duplicates?

In [18]:
print(f"Dataset with duplicates: {len(wizard_resps)}")

#Eliminating a few duplicates
unique_resps = list(dict.fromkeys(wizard_resps))
print(f"Cleaned up Dataset: {len(unique_resps)}")

Dataset with duplicates: 94664
Cleaned up Dataset: 94585


It seems so, but just a few. How come we have around ten times more responses, than source sentences?

###Preprocess the Data

Let's define a function for preprocessing our data 

-> *Sadly simple_preprocess removes numbers which would be very useful for retrieval of very specific data*


-> *Also consider taking out stopwords*

In [19]:
#import nltk
#from nltk.corpus import stopwords

def preprocess(data,tokens_only=False):
  for i, line in enumerate(data):
    tokens = gensim.utils.simple_preprocess(line, min_len=2, max_len=20)
    if tokens_only:
      yield tokens
    else:
      # For training data, add tags
      yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [20]:
train_corpus = list(preprocess(unique_sentences))
test_corpus = list(preprocess(wizard_resps,tokens_only=True))

Unnamed: 0,words,tags
0,"[science, fiction, often, shortened, to, sf, o...",[0]
1,"[science, fiction, often, explores, the, poten...",[1]
2,"[it, usually, avoids, the, supernatural, unlik...",[2]
3,"[historically, science, fiction, stories, have...",[3]
4,"[science, fiction, is, difficult, to, define, ...",[4]
5,"[hugo, gernsback, who, suggested, the, term, s...",[5]
6,"[they, supply, knowledge, in, very, palatable,...",[6]
7,"[internet, access, is, the, ability, of, indiv...",[7]
8,"[various, technologies, at, wide, range, of, s...",[8]
9,"[internet, access, was, once, rare, but, has, ...",[9]


####Visualization of structure of train_corpus and test_corpus

In [24]:
pd.DataFrame(train_corpus[:10])

Unnamed: 0,words,tags
0,"[science, fiction, often, shortened, to, sf, o...",[0]
1,"[science, fiction, often, explores, the, poten...",[1]
2,"[it, usually, avoids, the, supernatural, unlik...",[2]
3,"[historically, science, fiction, stories, have...",[3]
4,"[science, fiction, is, difficult, to, define, ...",[4]
5,"[hugo, gernsback, who, suggested, the, term, s...",[5]
6,"[they, supply, knowledge, in, very, palatable,...",[6]
7,"[internet, access, is, the, ability, of, indiv...",[7]
8,"[various, technologies, at, wide, range, of, s...",[8]
9,"[internet, access, was, once, rare, but, has, ...",[9]


In [61]:
pd.DataFrame(test_corpus[:10])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
0,think,science,fiction,is,an,amazing,genre,for,anything,future,...,such,interesting,concepts,,,,,,,
1,awesome,really,love,how,sci,fi,storytellers,focus,on,political,...,in,the,future,makes,them,relatable,,,,
2,it,not,quite,sci,fi,but,my,favorite,version,of,...,of,azkaban,breaks,zero,logical,rules,,,,
3,if,you,really,want,look,at,the,potential,negative,consequences,...,the,tv,show,fringe,incredibly,well,written,,,
4,no,could,not,couldn,imagine,living,when,internet,access,was,...,,,,,,,,,,
5,it,used,to,be,restricted,but,around,the,restricted,were,...,,,,,,,,,,
6,yes,it,was,developed,from,government,funded,projects,to,help,...,am,so,glad,they,expanded,it,,,,
7,what,is,your,favorite,thing,to,do,with,internet,access,...,to,use,my,email,and,browse,the,world,wide,web
8,yes,perform,administrative,duties,as,pharmacy,technician,,,,...,,,,,,,,,,
9,yes,work,directly,with,lot,of,patients,,,,...,,,,,,,,,,


In [26]:
print(f"{train_corpus[0]}\n{test_corpus[0]}")

TaggedDocument<['science', 'fiction', 'often', 'shortened', 'to', 'sf', 'or', 'sci', 'fi', 'is', 'genre', 'of', 'speculative', 'fiction', 'typically', 'dealing', 'with', 'imaginative', 'concepts', 'such', 'as', 'futuristic', 'science', 'and', 'technology', 'space', 'travel', 'time', 'travel', 'faster', 'than', 'light', 'travel', 'parallel', 'universes', 'and', 'extraterrestrial', 'life'], [0]>
['think', 'science', 'fiction', 'is', 'an', 'amazing', 'genre', 'for', 'anything', 'future', 'science', 'technology', 'time', 'travel', 'ftl', 'travel', 'they', 're', 'all', 'such', 'interesting', 'concepts']


##Training the model

We instantiate a Doc2Vec model with a vector size of 50 dimensions and iterate over the training corpus 40 times

If evaluation with test set is bad, maybe try to decrease min_count to 0, so unique words are not lost

In [40]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=80)

Build a vocabulary

In [41]:
model.build_vocab(train_corpus)

Essentially, the vocabulary is a list (accessible via model.wv.index_to_key) of all of the unique words extracted from the training corpus. Additional attributes for each word are available using the model.wv.get_vecattr() method, For example, to see how many times test appeared in the training corpus:

In [42]:
print(f"Word 'obama' appeared {model.wv.get_vecattr('obama', 'count')} times in the training corpus.")

Word 'obama' appeared 2 times in the training corpus.


Train the model on the corpus (Took 2 minutes with 80 epochs with cleaned up dataset)


In [43]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

##Assessing the model

To assess our new model, we’ll first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then returning the rank of the document based on self-similarity

-> This took 8 minutes to execute with the cleaned up dataset

In [33]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

Let’s count how each document ranks with respect to the training corpus

In [34]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 12511, 1: 64, 3: 20, 2: 14, 4: 11, 11: 6, 8: 5, 5: 4, 6: 3, 22: 2, 70: 2, 9: 2, 55: 2, 21: 2, 57: 2, 17: 2, 10: 2, 37: 2, 15: 2, 23: 2, 7: 2, 64: 2, 30: 2, 12: 2, 79: 2, 76: 2, 361: 1, 13: 1, 38: 1, 36: 1, 18: 1, 47: 1, 2625: 1, 172: 1, 33: 1, 58: 1, 27: 1, 44: 1, 7771: 1, 14: 1, 51: 1, 16: 1, 121: 1, 69: 1, 40: 1, 43: 1, 28: 1, 31: 1, 41: 1, 45: 1, 32: 1, 119: 1, 350: 1, 26: 1, 9429: 1, 46: 1})


Looking at an example

In [35]:
# Pick a random document from the corpus and infer a vector from the model
import random
#random.seed(23)
doc_id = random.randint(0, len(train_corpus) - 1)
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (10724): «this engine was initially installed in chevrolet and gmc trucks and has been an option since then in pickups vans and medium duty trucks»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>:

MOST (12701, 0.8942609429359436): «in that sense organizing can also be defined as to place different objects in logical arrangement for better searching»

SECOND-MOST (986, 0.7492536306381226): «the see also section of this article provides links to more specific information about various schools and techniques of horse training»

MEDIAN (9678, 0.2351899892091751): «reruns shown on hln were initially retitled mystery detectives before settling with the main title of the show in»

LEAST (10603, -0.572883129119873): «she chose to have the collection made in australia using australian jersey fabric»



Notice above that the most similar document (usually the same text) is has a similarity score approaching 1.0. However, the similarity score for the second-ranked documents should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself.


We can run the next cell repeatedly to see a sampling other target-document comparisons.

In [36]:
# Pick a random document from the corpus and infer a vector from the model
import random
#random.seed(23)
doc_id = random.randint(0, len(train_corpus) - 1)
#doc_id = 2
# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (12328): «additionally the relationship may be confused by opposing outside influences»

Similar Document (4400, 0.5700653791427612): «legends are perceived as real fairy tales may merge into legends where the narrative is perceived both by teller and hearers as being grounded in historical truth»



This doesn't really look good. Probably the sentences are too short and thus it doesn't work that well.

##Testing

In [52]:
import random
# Pick a random document from the test corpus and infer a vector from the model
#random.seed()
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
#sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
sims = model.dv.most_similar([inferred_vector], topn=10)

# Compare and print the 10 most similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
for index in range(len(sims)):
#for index in range(10):
    print(f"{index+1}. {sims[index]}: «{' '.join(train_corpus[sims[index][0]].words)}»")
print('\n')
#for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
#    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

print(f"Untokenized Wizard response: {wizard_resps[doc_id]}\n")
print(f"Original source sentence: {response_sentence_pairs[wizard_resps[doc_id]]}")

Test Document (32362): «extreme couponing for as much savings and money as possible seems like such wasted and stingy effort»

1. (7458, 0.5544727444648743): «neville symington suggested that such severely critical inner object is especially noticeable in narcissism»
2. (12047, 0.5477999448776245): «domestic workers particularly those low in the hierarchy such as maids and footmen were expected to remain unmarried while in service and even highest ranking workers such as butlers could be dismissed for marrying»
3. (2156, 0.5222700834274292): «this is the case of such noted designers as sid meier john romero chris sawyer and will wright»
4. (1603, 0.514151394367218): «risk factors include certain infections during pregnancy such as rubella as well as valproic acid alcohol or cocaine use during pregnancy»
5. (2078, 0.5109472870826721): «the particles can also be biological in origin such as mollusc shells or coralline algae»
6. (12222, 0.506793200969696): «activated seeking behavior such

Manually check the source sentence similarity against the wizard response

In [53]:
# import required libraries
import numpy as np
from numpy.linalg import norm

#Create vector for source sentence
tokenized_sentence = gensim.utils.simple_preprocess(response_sentence_pairs[wizard_resps[doc_id]], min_len=2, max_len=20)
inferred_source_vector = model.infer_vector(tokenized_sentence)
A = inferred_source_vector
B = inferred_vector
print(f"Tokenized original sentence: {tokenized_sentence} \n")
print(f"Tokenized wizard response: {test_corpus[doc_id]} \n")


print(f"The Wizard response:{wizard_resps[doc_id]} \n{B}")
print(f"The original sentence:{response_sentence_pairs[wizard_resps[doc_id]]} \n{A}")
 
# compute cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))
print(f"Cosine Similarity: {cosine}")

print(f"Model Similarity: {model.dv.n_similarity(A,B)}")

Tokenized original sentence: ['extreme', 'couponing', 'is', 'an', 'activity', 'that', 'combines', 'shopping', 'skills', 'with', 'couponing', 'in', 'an', 'attempt', 'to', 'save', 'as', 'much', 'money', 'as', 'possible', 'while', 'accumulating', 'the', 'most', 'groceries'] 

Tokenized wizard response: ['extreme', 'couponing', 'for', 'as', 'much', 'savings', 'and', 'money', 'as', 'possible', 'seems', 'like', 'such', 'wasted', 'and', 'stingy', 'effort'] 

The Wizard response:extreme couponing for as much savings and money as possible seems like such a wasted and stingy effort 
[-4.6868110e-01  5.5742109e-01 -7.8521156e-01  2.0770140e+00
 -3.2915199e-01  7.1073091e-01 -2.8970959e-03  8.7902844e-01
  1.9988912e+00 -5.6717718e-01  1.3788596e+00  1.7180147e+00
  5.2272666e-01  1.0208213e+00  2.2772864e-01 -1.6899743e+00
  1.4760162e+00  9.0900075e-01 -3.7409415e+00 -5.5118662e-01
 -3.2177702e-01  2.6115018e-01  8.4905827e-01  1.6216643e+00
 -1.3090852e+00  3.1399670e-01 -3.2465330e-01 -1.35869

Hmmmm something still seems to be wrong with the model. In this case the manually inferred vector should be the most similar document, but it doesn't come up with the model

Also it would be great if we could tokenize it without losing the numbers

##Evaluation

Evaluating the percentage running through 20% of test set maybe (Takes 2 minutes)

In [75]:
#Create test subset
test_subset_corpus = test_corpus

#counter that keeps track how often the right source sentence was in the top 10
counter = 0

for i in range(int(len(test_corpus)*0.20)):
  doc_id = random.randint(0, len(test_subset_corpus) - 1)
  inferred_vector = model.infer_vector(test_subset_corpus[doc_id])
  sims = model.dv.most_similar([inferred_vector], topn=10)
  tokenized_sentence = gensim.utils.simple_preprocess(response_sentence_pairs[wizard_resps[doc_id]], min_len=2, max_len=20)
  if tokenized_sentence in sims:
    counter += 1
  test_subset_corpus.pop(doc_id)

print(f"{counter/int(len(test_corpus)*0.20)}")


TypeError: ignored

In [74]:
#Create test subset
test_indices = {random.randint(0, len(test_corpus) - 1) for i in range(int(len(test_corpus)*0.25))}

#counter that keeps track how often the right source sentence was in the top 10
counter = 0

for doc_id in test_indices:
  inferred_vector = model.infer_vector(test_corpus[doc_id])
  sims = model.dv.most_similar([inferred_vector], topn=10)
  tokenized_sentence = gensim.utils.simple_preprocess(response_sentence_pairs[wizard_resps[doc_id]], min_len=2, max_len=20)
  if tokenized_sentence in sims:
    counter += 1

print(f"{counter/len(test_indices)}")


TypeError: ignored

In [69]:
test_indices = {random.randint(0, len(test_subset) - 1) for i in range(int(len(test_corpus)*0.2))}
print(f"{len(test_indices)}\n{int(len(test_corpus)*0.2)}")

13722
15146


In [73]:
print(f"{list(test_indices)[:10]}")

[5, 7, 32776, 32775, 65546, 65547, 32780, 32782, 32783, 65553]


# Retrieval-based chatbots

This approach is more or less the same as showed during Tutorial_08.

## Data extraction

In [None]:
import json

with open('train.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

In [None]:
# just for looking at the raw dataset
data[0]

In [None]:
# This dataframe is never used, but it is useful for looking at the dataset

import pandas as pd

df = pd.DataFrame(data)
df

Now we do some data extraction from the dataset. We want to produce a set were we have the dialog with a apprentice and wizard, these are then used to fine train the model. 

This limits the model, as it won't have any "memory"/context from the complete conversation. But the aim is for it to be acting as a "smart vector-database" and retrive similar enough passages. 

In [None]:
user_query = []
wizard_responses = []

chosen_topic = ""

for dialogue in data:

  if not 'Wizard' in dialogue['dialog'][0]['speaker']:
      continue

  chosen_topic = dialogue['chosen_topic']

  user_query.append(chosen_topic + " " + dialogue['persona'])

  for i, prompt in enumerate(dialogue['dialog']):

    if i % 2 == 0:
      wizard_responses.append(chosen_topic + " " + prompt['text'])
    else:
      user_query.append(chosen_topic + " " + prompt['text'])

data_pairs = []

for i, _ in enumerate(wizard_responses):

  data_pairs.append(
      {'message': user_query[i], 'response': wizard_responses[i]}
      )

## Model training

Now we are able to train the model

In [None]:
%pip install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [None]:
corpus_embeddings = semb_model.encode([sample['message'] for sample in data_pairs], convert_to_tensor=True, show_progress_bar=True, device='cuda')

## Model usage

In [None]:
%pip install hnswlib

In [None]:
import os
import hnswlib

# Create empty index
hnswlib_index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = "./emp_dialogue_hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    hnswlib_index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    hnswlib_index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    hnswlib_index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    hnswlib_index.save_index(index_path)

In [None]:
import numpy as np

def get_response(message, mes_resp_pairs, index, re_ranking_model=None, top_k=32):
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()

    corpus_ids, _ = index.knn_query(message_embedding, k=top_k)

    model_inputs = [(message, mes_resp_pairs[idx]['response']) for idx in corpus_ids[0]]
    cross_scores = xenc_model.predict(model_inputs)

    idx = np.argsort(-cross_scores)[0]

    return mes_resp_pairs[corpus_ids[0][idx]]['response']

In [None]:
chatbot_response = get_response(
    "I'm a huge fan of science fiction myself!", data_pairs, hnswlib_index, re_ranking_model=xenc_model
)
chatbot_response

## Testing the model

Testing the model by loading in the **test_random_split.json** file.

### Data extraction

Before we can perform the testing, we need to perform some data extraction. The strategy is to find a conversation between a wizard and a apprentice, and use that to test the accuracy/precision of the model.

What we expect is that the model produces a responce that is similar to the one that was used in the conversation. Note that this does not satisfy the "correct passage" requirement.

In [None]:
with open('test_random_split.json', 'r') as file:
    json_data = file.read()
    test = json.loads(json_data)

print('Datatype:', type(test))

In [None]:
test_extract = []

for i, conversation in enumerate(test):

  test_extract.append("new_conv_" + str(i))

  for j, dialog in enumerate(conversation['dialog']):

    if "Wizard" in dialog['speaker']:

      if j == 0:
        continue

      test_extract.append({'wizard':dialog['text']})

    if "Apprentice" in dialog['speaker']:
      test_extract.append({'apprentice':dialog['text']})

test_extract[:10]

The data is still quite "dirty". So we will perform the cumbersome clean up in the next cell to get a list of directories, were the directories contians the matches/pairs that will be used for testing.

In [None]:
pair = []

test_pairs = []

for i, text in enumerate(test_extract):

  if "new_conv_" in text:
    continue

  pair.append(text)

  if len(pair) == 2:
    
    entry = {'apprentice':"", 'wizard': ""}

    for _, e in enumerate(pair):

      if 'apprentice' in e.keys():
        entry['apprentice'] = e['apprentice']

      if 'wizard' in e.keys():
        entry['wizard'] = e['wizard']


    test_pairs.append(entry)
    pair = []

test_pairs[:5]

In [None]:
import random

rand_int = random.randrange(0,500)

chatbot_response = get_response(
      test_pairs[rand_int]['apprentice'], data_pairs, hnswlib_index, re_ranking_model=xenc_model
  )

print(test_pairs[rand_int]['apprentice'])
print(test_pairs[rand_int]['wizard'])
print(chatbot_response)

Now we should be able to do some testing. Here we use two approaches, a naive one were we are looking at the exact matches, and one were we are doing BLEU-scoring

The naive approach is useful for the assignment requirement were it is specified to find the "correct passage". 

The BLEU-score is a score to see how close the precision is. It might not provide that much (if any) useful informaiton to us, as we are not doing a sentence-to-sentence transformation.

In [None]:
from nltk.translate.bleu_score import sentence_bleu

correct_responses = 0

bleu_scores = []

for _, entry in enumerate(test_pairs):
  chatbot_response = get_response(
      entry['apprentice'], data_pairs, hnswlib_index, re_ranking_model=xenc_model
  )

  # Naive accuracy
  if chatbot_response == entry['wizard']:
    correct_responses += 1
  
  # BLEU score calculation

  reference = [entry['apprentice'].split()]
  candidate = chatbot_response.split()
  bleu_scores.append(sentence_bleu(reference, candidate))

accuracy = correct_responses / len(test_pairs)

print("Test accuracy (%):", accuracy * 100)
print("Average BLEU-score:", sum(bleu_scores) / len(bleu_scores))

# Retrieval-based response chatbot (Not accurate title)

This implementation aims to create a retrieval-based responce chatbot to provide the correct awnser to a given passage. This is done by taking all the correct awnsers, generating embeddings with them and then performing a "search" in the created vector space to find the passage that has the closest match with the given passage

## Data extraction

Extracts user prompts and wizard responses from a list of dialogues and stores them in separate lists based on the condition that the dialogue speaker is the wizard and the order in which they appear in the dialogue.

Here we also concatenate the strings with some extra information, like the chosen topic, in order to increase the precision of the search later. This is a valid approach and can be seen as that we are just adding more context to the passage.

In [None]:
#sets of documents
user_query = []
wizard_responses = []

chosen_topic = ""

for dialogue in data:

  if not 'Wizard' in dialogue['dialog'][0]['speaker']:
      continue

  chosen_topic = dialogue['chosen_topic']

  user_query.append(chosen_topic + " " + dialogue['persona'])

  for i, prompt in enumerate(dialogue['dialog']):

    if i % 2 == 0:
      wizard_responses.append(chosen_topic + " " + prompt['text'])
    else:
      user_query.append(chosen_topic + " " + prompt['text'])

## Document vectorization

In [None]:
# TfidfVectorizer 
# CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

In [None]:
# instantiate the vectorizer object
#countvectorizer = CountVectorizer(analyzer= 'word', stop_words='english')
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words= 'english')

In [None]:
# convert th documents into a matrix
#count_wm = countvectorizer.fit_transform(train)
query_wm = tfidfvectorizer.fit_transform(user_query)
response_wm = tfidfvectorizer.fit_transform(wizard_responses)

In [None]:
# retrieve the terms found in the corpora
# if we take same parameters on both Classes(CountVectorizer and TfidfVectorizer) , it will give same output of get_feature_names() methods)
query_tokens = tfidfvectorizer.get_feature_names_out(query_wm)
responce_tokens = tfidfvectorizer.get_feature_names_out(response_wm)

## Verification

Some output in order to quickly verify the embeddings

In [None]:
responce_vectors = tfidfvectorizer.transform(wizard_responses)
query_vectors = tfidfvectorizer.transform(user_query)

print('responce_vectors:\n', responce_vectors)

print('query_vectors:\n', query_vectors)

In [None]:
sorted([(query_tokens[j], query_vectors[0, j]) for j in query_vectors[0].nonzero()[1]], key=lambda x: -x[1])

## Search the vector space

Here we calculate the closest neighbor to the embedding of the query, and hopefully that is the "correct" passage we are looking for.

In [None]:
import numpy as np

query = 'Gardening: i like to garden.'

query_vec = tfidfvectorizer.transform([query])[0]

index = np.argmax([query_vec.multiply(vector_documents[i]).sum() for i in range(len(train))])
print(train[index])