<img src="ask questions to LLMs.jpg" width=70%>

## 1. Read PDF files with research papers using Llama-Index

In [1]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

In [2]:
import os
import time
import pandas as pd

<b>1.1 Set up the embedding model</b>

In [3]:
# import any embedding model on HF hub (https://huggingface.co/spaces/mteb/leaderboard)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Settings.embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large") # alternative model

Settings.llm = None
Settings.chunk_size = 128
Settings.chunk_overlap = 25

LLM is explicitly disabled. Using MockLLM.
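
The `chunk_size` and `chunk_overlap` settings above control how text is cut into pieces before it is embedded. As a rough illustration only, a plain sliding window over whitespace-separated words could look like the sketch below. Llama-Index's own splitter counts tokens and prefers sentence boundaries, so its chunks will differ; the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(words, chunk_size=128, chunk_overlap=25):
    """Split a list of words into overlapping windows (simplified illustration)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(300)]
chunks = sliding_chunks(words)
print([len(c) for c in chunks])
```

Each window shares its first `chunk_overlap` words with the end of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.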


<b>1.2 Read files one by one from the directory</b>

In [4]:
reader = SimpleDirectoryReader(input_dir="papers", recursive=True)
reader.__dict__

{'fs': <fsspec.implementations.local.LocalFileSystem at 0x16d562fce20>,
 'errors': 'ignore',
 'encoding': 'utf-8',
 'exclude': None,
 'recursive': True,
 'exclude_hidden': True,
 'required_exts': None,
 'num_files_limit': None,
 'raise_on_error': False,
 'input_dir': WindowsPath('papers'),
 'input_files': [WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/attention is all you need 1706.03762v7.pdf'),
  WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf')],
 'file_extractor': {},
 'file_metadata': <llama_index.core.readers.file.base._DefaultFileMetadataFunc at 0x16d49700f10>,
 'filename_as_id': False}

<b>1.3 Store Text chunks in a dict to trace chunk-paper mapping</b>

In [5]:
pdf_files = reader.input_files
all_docs = []
doc_dict = {}
for i,docs in enumerate(reader.iter_data()):
    pdf_name = os.path.basename(pdf_files[i])
    print(f'Extract from file {pdf_name}')
    if pdf_name in doc_dict:
        doc_dict[pdf_name] += [docs]
    else:
        doc_dict[pdf_name] = [docs]
    all_docs.extend(docs)

Extract from file attention is all you need 1706.03762v7.pdf
Extract from file BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf


<b>1.4 Inspect Llama-Index document class</b>

In [6]:
doc_dict.keys()

dict_keys(['attention is all you need 1706.03762v7.pdf', 'BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf'])

<b>1.5 Get total number of collected chunks</b>

In [7]:
len(all_docs)

31

<b>1.6 Drop cover-page chunks and flatten the stored Llama-Index docs</b>

In [8]:
articles = list(doc_dict.keys())

for key in articles:
    clean_docs = []
    for doc in doc_dict[key][0]:
        if 'See discussions, stats' in doc.text:
            continue
        clean_docs.append(doc)
    doc_dict[key] = clean_docs
    

<b>1.7 Get chunk distribution across papers</b>

In [9]:
for key in doc_dict:
    num_chunks = len(doc_dict[key])
    print(f'{key} >>> {num_chunks} chunks')

attention is all you need 1706.03762v7.pdf >>> 15 chunks
BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf >>> 16 chunks


<b>1.8 Collect all chunks for each paper in one text</b>

In [10]:
articles = list(doc_dict.keys())
papers = {} 

for paper_name in articles:
    papers[paper_name] = ''
    for doc in doc_dict[paper_name]:
        papers[paper_name] += ' ' + doc.text

In [11]:
for paper_name in articles:
    papers[paper_name] = papers[paper_name].replace('.\n','. ')
    papers[paper_name] = papers[paper_name].replace('\n',' ')    
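
The two `replace` calls flatten line breaks. PDF text often also contains words hyphenated across lines; the BERT chunks retrieved later in this notebook show cases such as `outper-\nforming`. A possible extension, offered as a sketch and not part of the original pipeline, rejoins such words before flattening. The helper name `flatten_pdf_text` is hypothetical, and the rule will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re

def flatten_pdf_text(text):
    """Rejoin words split by a hyphen at a line end, then collapse whitespace."""
    # "outper-\nforming" -> "outperforming" (assumes the hyphen is only a line-break artifact)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # any remaining run of whitespace, including newlines, becomes a single space
    return re.sub(r"\s+", " ", text).strip()


sample = "outper-\nforming many task-specific architectures.\nBERT advances the state of the art"
print(flatten_pdf_text(sample))
```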

In [12]:
# papers[articles[0]]  # uncomment to inspect the flattened text of the first paper

## 2. QA: Ask questions about the papers

<b>2.1 Roberta-base-squad2-distilled model</b>

In [13]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import torch

In [14]:
dfs = {}

In [15]:
model_name = "deepset/roberta-base-squad2-distilled"

# a) Get predictions
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nlp = pipeline('question-answering', 
               model=model_name, 
               tokenizer=model_name,
               device = device
              )


In [16]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is attention?'
    QA_input = {
        'question': question,
        'context': papers[paper]
    }
    res = nlp(QA_input)
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question + model_name] = pd.DataFrame({'question':len(articles)*[question],
                                           'model name':len(articles)*[model_name],
                                           'paper':articles,
                                           'answer': answer_list,
                                           'score':score_list
                             })
dfs[question + model_name]


Unnamed: 0,question,model name,paper,answer,score
0,what is attention?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,mapping a query and a set of key-value pairs t...,0.356291
1,what is attention?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,all you need,0.802495


In [17]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what types of attention are defined?'
    QA_input = {
        'question': question,
        'context': papers[paper]
    }
    res = nlp(QA_input)
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question + model_name] = pd.DataFrame({'question':len(articles)*[question],
                              'model name':len(articles)*[model_name],
                              'paper':articles,
                              'answer': answer_list,
                              'score':score_list
                             })
dfs[question + model_name]



Unnamed: 0,question,model name,paper,answer,score
0,what types of attention are defined?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,Structured attention networks,0.67969
1,what types of attention are defined?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,bidirectional self-attention,0.005291


In [18]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is multi-head attention?'
    QA_input = {
        'question': question,
        'context': papers[paper]
    }
    res = nlp(QA_input)
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question + model_name] = pd.DataFrame({'question':len(articles)*[question],
                              'model name':len(articles)*[model_name],
                              'paper':articles,
                              'answer': answer_list,
                              'score':score_list
                             })
dfs[question + model_name]



Unnamed: 0,question,model name,paper,answer,score
0,what is multi-head attention?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,self-attention,0.623303
1,what is multi-head attention?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,multi-layered context,1.6e-05
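
The three cells above repeat the same loop with a different question each time. One way to avoid the repetition is a small helper that accepts any question-answering callable. This is a sketch: `ask_all_papers` and `fake_qa` are hypothetical names, and `fake_qa` only stands in for the `nlp` pipeline so the example runs without downloading a model.

```python
def ask_all_papers(qa_fn, question, papers, model_name):
    """Run one question against every paper and collect one row per paper."""
    rows = []
    for paper_name, context in papers.items():
        res = qa_fn(question=question, context=context)
        rows.append({
            'question': question,
            'model name': model_name,
            'paper': paper_name,
            'answer': res['answer'],
            'score': res['score'],
        })
    return rows


def fake_qa(question, context):
    """Stand-in for a real pipeline: returns the first word of the context."""
    return {'answer': context.split()[0], 'score': 0.0}


demo_papers = {'paper_a.pdf': 'Attention maps a query to an output.',
               'paper_b.pdf': 'BERT is a language model.'}
rows = ask_all_papers(fake_qa, 'what is attention?', demo_papers, 'fake')
print(rows)
```

With the real pipeline, the returned list of rows could be passed straight to `pd.DataFrame(rows)` and stored in `dfs` under the same key scheme used above.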


<b>2.2 HuggingFace example pipeline (default model)</b>

In [67]:
qa_model = pipeline("question-answering", 
                    device=device)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [68]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is attention?'
    res = qa_model(question = question, context = papers[paper])
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question+'_nomodel'] = pd.DataFrame({'question':len(articles)*[question],
                                         'model name': len(articles)*['default'],
                                         'paper':articles,
                                         'answer': answer_list,
                                         'score':score_list
                                        })
dfs[question+'_nomodel']

Unnamed: 0,question,model name,paper,answer,score
0,what is attention?,default,attention is all you need 1706.03762v7.pdf,multi-head,0.716957
1,what is attention?,default,BERT pre_training of deep bidirectional transf...,few parameters need to be learned from scratch,0.765959


In [69]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what types of attention are defined?'
    res = qa_model(question = question, context = papers[paper])
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question+'_nomodel'] = pd.DataFrame({'question':len(articles)*[question],
                                         'model name': len(articles)*['default'],
                                         'paper':articles,
                                         'answer': answer_list,
                                         'score':score_list
                                        })
dfs[question+'_nomodel']

Unnamed: 0,question,model name,paper,answer,score
0,what types of attention are defined?,default,attention is all you need 1706.03762v7.pdf,active memory,0.71572
1,what types of attention are defined?,default,BERT pre_training of deep bidirectional transf...,deeper self-attention,0.486192


In [70]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is multi-head attention?'
    res = qa_model(question = question, context = papers[paper])
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question+'_nomodel'] = pd.DataFrame({'question':len(articles)*[question],
                                         'model name': len(articles)*['default'],
                                         'paper':articles,
                                         'answer': answer_list,
                                         'score':score_list
                                        })
dfs[question+'_nomodel']

Unnamed: 0,question,model name,paper,answer,score
0,what is multi-head attention?,default,attention is all you need 1706.03762v7.pdf,scaled dot-product attention,0.8024
1,what is multi-head attention?,default,BERT pre_training of deep bidirectional transf...,natural language understanding,0.47042


<b>2.3 Summarize QA results in one DataFrame</b>

In [71]:
qa_keys = list(dfs.keys())

merged_df = pd.concat([dfs[key] for key in qa_keys])

In [72]:
merged_df

Unnamed: 0,question,model name,paper,answer,score
0,what is attention?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,mapping a query and a set of key-value pairs t...,0.356291
1,what is attention?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,all you need,0.802495
0,what types of attention are defined?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,Structured attention networks,0.67969
1,what types of attention are defined?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,bidirectional self-attention,0.005291
0,what is multi-head attention?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,self-attention,0.623303
1,what is multi-head attention?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,multi-layered context,1.6e-05
0,what is attention?,default,attention is all you need 1706.03762v7.pdf,multi-head,0.716957
1,what is attention?,default,BERT pre_training of deep bidirectional transf...,few parameters need to be learned from scratch,0.765959
0,what types of attention are defined?,default,attention is all you need 1706.03762v7.pdf,active memory,0.71572
1,what types of attention are defined?,default,BERT pre_training of deep bidirectional transf...,deeper self-attention,0.486192


In [73]:
merged_df = merged_df.reset_index()
cols = list(merged_df.columns)
print(cols)
cols.remove('index')
merged_df = merged_df.groupby(['question','model name'])[cols].apply(lambda x: x)

['index', 'question', 'model name', 'paper', 'answer', 'score']


In [74]:
merged_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,question,model name,paper,answer,score
question,model name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
what is attention?,deepset/roberta-base-squad2-distilled,0,what is attention?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,mapping a query and a set of key-value pairs t...,0.356291
what is attention?,deepset/roberta-base-squad2-distilled,1,what is attention?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,all you need,0.802495
what is attention?,default,6,what is attention?,default,attention is all you need 1706.03762v7.pdf,multi-head,0.716957
what is attention?,default,7,what is attention?,default,BERT pre_training of deep bidirectional transf...,few parameters need to be learned from scratch,0.765959
what is multi-head attention?,deepset/roberta-base-squad2-distilled,4,what is multi-head attention?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,self-attention,0.623303
what is multi-head attention?,deepset/roberta-base-squad2-distilled,5,what is multi-head attention?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,multi-layered context,1.6e-05
what is multi-head attention?,default,10,what is multi-head attention?,default,attention is all you need 1706.03762v7.pdf,scaled dot-product attention,0.8024
what is multi-head attention?,default,11,what is multi-head attention?,default,BERT pre_training of deep bidirectional transf...,natural language understanding,0.47042
what types of attention are defined?,deepset/roberta-base-squad2-distilled,2,what types of attention are defined?,deepset/roberta-base-squad2-distilled,attention is all you need 1706.03762v7.pdf,Structured attention networks,0.67969
what types of attention are defined?,deepset/roberta-base-squad2-distilled,3,what types of attention are defined?,deepset/roberta-base-squad2-distilled,BERT pre_training of deep bidirectional transf...,bidirectional self-attention,0.005291


## 3. Semantic Vector Search: Hands-On

In [75]:
from sentence_transformers import SentenceTransformer

In [76]:
model_name = 'Snowflake/snowflake-arctic-embed-l-v2.0'
model = SentenceTransformer(model_name)

In [77]:
queries = ['what is attention?', 'what is multi-head attention?']
documents = [] 
for rec in all_docs:
    text_chunk = rec.text.replace('.\n','. ')
    text_chunk = text_chunk.replace('\n',' ')
    documents.append(text_chunk)

In [78]:
query_embeddings = model.encode(queries, prompt_name="query") 
document_embeddings = model.encode(documents)

In [79]:
scores = model.similarity(query_embeddings, document_embeddings)

In [80]:
scores.numpy()

array([[ 0.3195346 ,  0.24990332,  0.28661358,  0.3212555 ,  0.26088548,
         0.13312666,  0.13690537, -0.01953743,  0.07883323,  0.21618257,
         0.15051648,  0.13195562,  0.31160092,  0.28166437,  0.24416181,
         0.0802324 ,  0.05307864,  0.11930332,  0.04431819,  0.11695188,
         0.07378677,  0.0697437 ,  0.02865066,  0.08885486,  0.0631457 ,
         0.12509821,  0.09325863,  0.03556471,  0.08303171,  0.09524073,
         0.02277639],
       [ 0.37941754,  0.411205  ,  0.39959958,  0.4731409 ,  0.55398375,
         0.25181007,  0.31452248,  0.15422355,  0.28774974,  0.36324242,
         0.26917607,  0.25555977,  0.35022897,  0.3023172 ,  0.30858317,
         0.23788661,  0.21420331,  0.2624512 ,  0.20118485,  0.24324313,
         0.17298877,  0.18839133,  0.2460844 ,  0.29325908,  0.21080701,
         0.25936761,  0.29565617,  0.19223814,  0.19146955,  0.2046683 ,
         0.21918511]], dtype=float32)
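
In Sentence Transformers, `model.similarity` uses cosine similarity unless the model is configured with a different function, which is why the scores above fall between -1 and 1. A minimal pure-Python sketch of that computation, using made-up three-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(queries, documents):
    """One row per query, one column per document, like the array above."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]


toy_queries = [[1.0, 0.0, 0.0]]
toy_documents = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [-1.0, 0.0, 0.0]]
print(similarity_matrix(toy_queries, toy_documents))
```

Vector length does not matter, only direction: the first toy document points the same way as the query, the second is orthogonal to it, and the third points the opposite way.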

In [108]:
# Output the results
thres = 0.25

query_score_list = []

for query, query_scores in zip(queries, scores):
    # keep the original chunk index so each hit maps back to the right Llama-Index document
    ranked = sorted(enumerate(query_scores), key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for k, score in ranked:
        score_val = score.numpy()
        if score_val >= thres:
            print(score_val, documents[k][:100])
            query_score_list.append({'query': query,
                                     'doc': documents[k],
                                     'llama_index_doc': all_docs[k],
                                     'score': score_val
                                    })

Query: what is attention?
0.3212555 Scaled Dot-Product Attention  Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention. (r
0.3195346 Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and
0.31160092 Attention Visualizations Input-Input Layer5 It is in this spirit that a majority of American governm
0.28661358 Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture us
0.28166437 Input-Input Layer5 The Law will never be perfect , but its application should be just - this is what
0.26088548 output values. These are concatenated and once again projected, resulting in the final values, as de
Query: what is multi-head attention?
0.55398375 output values. These are concatenated and once again projected, resulting in the final values, as de
0.4731409 Scaled Dot-Product Attention  Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention. (r
0.411205 1 Introduction Recurrent neural net

In [82]:
ind = 7
doc = query_score_list[ind]['doc']
q = query_score_list[ind]['query']

doc_sentence = doc.split('. ')


In [83]:
q

'what is multi-head attention?'

In [84]:
doc_sentence

['Scaled Dot-Product Attention  Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention',
 '(right) Multi-Head Attention consists of several attention layers running in parallel',
 'of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key',
 '3.2.1 Scaled Dot-Product Attention We call our particular attention "Scaled Dot-Product Attention" (Figure 2)',
 'The input consists of queries and keys of dimension dk, and values of dimension dv',
 'We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values',
 'In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q',
 'The keys and values are also packed together into matrices K and V ',
 'We compute the matrix of outputs as: Attention(Q, K, V) = softmax(QKT √dk )V (1) The two most commonly used attention functions are

In [85]:
doc_query_embeddings = model.encode(q, prompt_name="query") 
doc_sentence_embeddings = model.encode(doc_sentence)

In [86]:
sentence_scores = model.similarity(doc_query_embeddings, 
                                   doc_sentence_embeddings)

In [88]:
import matplotlib
import matplotlib.colors
from IPython.display import display, Markdown

def color_map_color(value,
                    cmap_name='PuBu',
                    #cmap_name='Wistia',
                    vmin=0,
                    vmax=1):
    # map a value in [vmin, vmax] to a hex color from the chosen colormap
    norm = matplotlib.colors.Normalize(vmin=vmin, vmax=vmax)
    cmap = matplotlib.colormaps[cmap_name]  # registry lookup; cm.get_cmap is deprecated
    rgb = cmap(norm(abs(value)))[:3]  # the colormap returns RGBA; keep only RGB
    color = matplotlib.colors.rgb2hex(rgb)
    return color

In [89]:
color_map_color(0.51)


'#6fa7ce'

In [90]:
html_template = "<span style='background-color:{}'>{}</span>"
html_template.format(color_map_color(0.51),'text')


"<span style='background-color:#6fa7ce'>text</span>"

In [91]:
display(Markdown(html_template.format(color_map_color(0.51),'text')))


<span style='background-color:#6fa7ce'>text</span>

In [92]:
html_template = "<span style='background-color:{};opacity:0.8;'>{}</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:{:.2f}</sup>"
html_output = []
for query, query_scores in zip([q], sentence_scores):  # q is a single string, so wrap it in a list
    doc_score_pairs = list(zip(doc_sentence, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    #print("Query:", q)
    for document, score in doc_score_pairs:
        score_val = score.numpy()
        html_output.append(html_template.format(color_map_color(1-score_val),document,score_val))


In [93]:
display(Markdown(' '.join(html_output)))

<span style='background-color:#c4cbe3;opacity:0.8;'>(right) Multi-Head Attention consists of several attention layers running in parallel</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.71</sup> <span style='background-color:#79abd0;opacity:0.8;'>3.2.2 Multi-Head Attention Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.51</sup> <span style='background-color:#78abd0;opacity:0.8;'>Scaled Dot-Product Attention  Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.51</sup> <span style='background-color:#509ac6;opacity:0.8;'>In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.43</sup> <span style='background-color:#4295c3;opacity:0.8;'>We compute the matrix of outputs as: Attention(Q, K, V) = softmax(QKT √dk )V (1) The two most commonly used attention functions are additive attention [2], and dot-product (multi- plicative) attention</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.40</sup> <span style='background-color:#3f93c2;opacity:0.8;'>Additive attention computes the compatibility function using a feed-forward network with a single hidden layer</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.39</sup> <span style='background-color:#2a88bc;opacity:0.8;'>While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk [3]</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.35</sup> <span 
style='background-color:#2987bc;opacity:0.8;'>While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.34</sup> <span style='background-color:#2383ba;opacity:0.8;'>On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 4To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.32</sup> <span style='background-color:#2081b9;opacity:0.8;'>3.2.1 Scaled Dot-Product Attention We call our particular attention "Scaled Dot-Product Attention" (Figure 2)</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.32</sup> <span style='background-color:#2081b9;opacity:0.8;'>Dot-product attention is identical to our algorithm, except for the scaling factor of 1√dk </span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.32</sup> <span style='background-color:#045382;opacity:0.8;'>To counteract this effect, we scale the dot products by 1√dk </span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.10</sup> <span style='background-color:#045280;opacity:0.8;'>We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients 4</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.10</sup> <span style='background-color:#034c78;opacity:0.8;'>4</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.08</sup> <span style='background-color:#034c78;opacity:0.8;'>The input consists of queries and keys of dimension dk, and values of dimension dv</span><sup 
style='font-size:10px;font-weight:bold;color:red;'>Score:0.07</sup> <span style='background-color:#034a74;opacity:0.8;'>of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.07</sup> <span style='background-color:#034973;opacity:0.8;'>We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.06</sup> <span style='background-color:#03456c;opacity:0.8;'>Then their dot product, q · k = Pdk i=1 qiki, has mean 0 and variance dk</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.05</sup> <span style='background-color:#034369;opacity:0.8;'>The keys and values are also packed together into matrices K and V </span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.04</sup>
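
Sentences extracted from PDFs can contain characters such as `<` or `&`, which would be read as markup once placed inside the HTML template. A defensive variant, sketched here on the assumption that the color has already been computed, escapes the text first. `highlight_span` is a hypothetical helper.

```python
import html

def highlight_span(text, color, score):
    """Wrap escaped text in a colored span followed by its score."""
    template = ("<span style='background-color:{};opacity:0.8;'>{}</span>"
                "<sup style='font-size:10px;font-weight:bold;color:red;'>Score:{:.2f}</sup>")
    return template.format(color, html.escape(text), score)


print(highlight_span('softmax(QK^T) < 1 & stable', '#6fa7ce', 0.51))
```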

## 4. Semantic Vector Search: Vector DB

In [130]:
output_df = {}

In [134]:
pd.set_option('display.max_colwidth', 500)

In [135]:
# store docs into vector DB
index = VectorStoreIndex.from_documents(all_docs)

In [142]:
# set number of docs to retrieve
top_k = 7

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)
     

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.3)],
)
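
Conceptually, the engine above first keeps the `top_k` most similar chunks and then drops any whose score falls below `similarity_cutoff`, so a query can return fewer than `top_k` results. The plain-Python sketch below mimics that two-step filter on made-up scores; `retrieve_with_cutoff` is a hypothetical name and not part of the Llama-Index API.

```python
def retrieve_with_cutoff(scored_chunks, top_k=7, similarity_cutoff=0.3):
    """Keep the top_k highest-scoring chunks, then drop those below the cutoff."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    top = ranked[:top_k]
    return [(chunk, score) for chunk, score in top if score >= similarity_cutoff]


toy = [('a', 0.81), ('b', 0.12), ('c', 0.45), ('d', 0.29), ('e', 0.75)]
print(retrieve_with_cutoff(toy, top_k=3))
print(retrieve_with_cutoff(toy, top_k=5))
```

Both calls return the same three chunks: raising `top_k` from 3 to 5 adds only candidates that the cutoff then removes.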

<b>4.1 To BERT or not to BERT? That is the question</b>

In [143]:
# query documents
query = "What is BERT?"
response = query_engine.query(query)

In [144]:
response.__dict__.keys()

dict_keys(['response', 'source_nodes', 'metadata'])

In [145]:
response.source_nodes[0].node.metadata['file_name']

'BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf'

In [147]:
# reformat response
cols = ['question','scores','source','text']
df_dict = {col:[] for col in cols}

for source_node in response.source_nodes:  # may hold fewer than top_k nodes after the similarity cutoff
    df_dict['scores'].append(source_node.score)
    df_dict['source'].append(source_node.node.metadata['file_name'])
    df_dict['text'].append(source_node.text)

df_dict['question'] = [query]*len(df_dict['scores'])
output_df[query] = pd.DataFrame(df_dict)

In [148]:
output_df[query]

Unnamed: 0,question,scores,source,text
0,What is BERT?,0.811123,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks."
1,What is BERT?,0.751915,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1."
2,What is BERT?,0.751781,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 . TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP].
3,What is BERT?,0.740903,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 .
4,What is BERT?,0.738021,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Input/Output Representations To make BERT\nhandle a variety of down-stream tasks, our input\nrepresentation is able to unambiguously represent\nboth a single sentence and a pair of sentences\n(e.g., ⟨Question, Answer ⟩) in one token sequence."
5,What is BERT?,0.733836,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Each downstream task has sep-\narate ﬁne-tuned models, even though they are ini-\ntialized with the same pre-trained parameters. The\nquestion-answering example in Figure 1 will serve\nas a running example for this section.\nA distinctive feature of BERT is its uniﬁed ar-\nchitecture across different tasks."
6,What is BERT?,0.731738,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok \n1\n [SEP]. Tok \nN\nTok \n1 .


<b>4.2 Pay Attention</b>

In [149]:
# query documents
query = "What is multi-head attention?"
response = query_engine.query(query)

In [150]:
# reformat response
cols = ['question','scores','source','text']
df_dict = {col:[] for col in cols}

for source_node in response.source_nodes:  # may hold fewer than top_k nodes after the similarity cutoff
    df_dict['scores'].append(source_node.score)
    df_dict['source'].append(source_node.node.metadata['file_name'])
    df_dict['text'].append(source_node.text)

df_dict['question'] = [query]*len(df_dict['scores'])
output_df[query] = pd.DataFrame(df_dict)

In [151]:
output_df[query]

Unnamed: 0,question,scores,source,text
0,What is multi-head attention?,0.80071,attention is all you need 1706.03762v7.pdf,"Scaled Dot-Product Attention\n Multi-Head Attention\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several\nattention layers running in parallel.\nof the values, where the weight assigned to each value is computed by a compatibility function of the\nquery with the corresponding key."
1,What is multi-head attention?,0.784158,attention is all you need 1706.03762v7.pdf,"3.2.3 Applications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n• In ""encoder-decoder attention"" layers, the queries come from the previous decoder layer,\nand the memory keys and values come from the output of the encoder. This allows every\nposition in the decoder to attend over all positions in the input sequence."
2,What is multi-head attention?,0.77205,attention is all you need 1706.03762v7.pdf,"output values. These are concatenated and once again projected, resulting in the final values, as\ndepicted in Figure 2.\nMulti-head attention allows the model to jointly attend to information from different representation\nsubspaces at different positions. With a single attention head, averaging inhibits this."
3,What is multi-head attention?,0.758912,attention is all you need 1706.03762v7.pdf,"With a single attention head, averaging inhibits this.\nMultiHead(Q, K, V) = Concat(head1, ...,headh)WO\nwhere headi = Attention(QWQ\ni , KWK\ni , V WV\ni )\nWhere the projections are parameter matricesWQ\ni ∈ Rdmodel×dk , WK\ni ∈ Rdmodel×dk , WV\ni ∈ Rdmodel×dv\nand WO ∈ Rhdv×dmodel ."
4,What is multi-head attention?,0.740496,attention is all you need 1706.03762v7.pdf,"Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [11] around each of\nthe two sub-layers, followed by layer normalization [ 1]."
5,What is multi-head attention?,0.738943,attention is all you need 1706.03762v7.pdf,"To counteract this effect, we scale the dot products by 1√dk\n.\n3.2.2 Multi-Head Attention\nInstead of performing a single attention function with dmodel-dimensional keys, values and queries,\nwe found it beneficial to linearly project the queries, keys and values h times with different, learned\nlinear projections to dk, dk and dv dimensions, respectively."
6,What is multi-head attention?,0.738749,attention is all you need 1706.03762v7.pdf,"In this work we employ h = 8 parallel attention layers, or heads. For each of these we use\ndk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost\nis similar to that of single-head attention with full dimensionality."
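
The reformatting cell is repeated for every question, so it may be worth wrapping it in a small helper. The sketch below is one possible shape, not part of the original notebook: `nodes_to_frame` and `fake_node` are hypothetical names, and `fake_node` only mimics the three attributes the notebook reads from each retrieved node (`score`, `text`, `node.metadata['file_name']`) so the example runs without an index.

```python
from types import SimpleNamespace

import pandas as pd


def nodes_to_frame(query, source_nodes, top_k):
    """Build one row per retrieved chunk: question, score, source file, text."""
    rows = [
        {
            "question": query,
            "scores": n.score,
            "source": n.node.metadata["file_name"],
            "text": n.text,
        }
        for n in source_nodes[:top_k]  # slicing tolerates fewer than top_k hits
    ]
    return pd.DataFrame(rows, columns=["question", "scores", "source", "text"])


def fake_node(score, file_name, text):
    # Hypothetical stand-in exposing only the attributes used above
    return SimpleNamespace(
        score=score,
        text=text,
        node=SimpleNamespace(metadata={"file_name": file_name}),
    )


demo_nodes = [
    fake_node(0.80, "paper_a.pdf", "first chunk"),
    fake_node(0.78, "paper_a.pdf", "second chunk"),
]
demo_df = nodes_to_frame("What is multi-head attention?", demo_nodes, top_k=7)
print(demo_df)
```

With the real objects from this notebook, the call would presumably look like `output_df[query] = nodes_to_frame(query, response.source_nodes, top_k)`.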


<b>4.3 Who let Transformer out?</b>

In [153]:
# query documents
query = "What is Transformer?"
response = query_engine.query(query)

In [154]:
# reformat response: one row per retrieved chunk
cols = ['question','scores','source','text']
df_dict = {col:[] for col in cols}

# slice instead of indexing with range(top_k): fewer than top_k nodes may come back
for node in response.source_nodes[:top_k]:
    df_dict['scores'].append(node.score)
    df_dict['source'].append(node.node.metadata['file_name'])
    df_dict['text'].append(node.text)

df_dict['question'] = [query]*len(df_dict['scores'])
output_df[query] = pd.DataFrame(df_dict)

In [155]:
output_df[query]

Unnamed: 0,question,scores,source,text
0,What is Transformer?,0.696613,attention is all you need 1706.03762v7.pdf,"6.2 Model Variations\nTo evaluate the importance of different components of the Transformer, we varied our base model\nin different ways, measuring the change in performance on English-to-German translation on the\n5We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.\n8"
1,What is Transformer?,0.695771,attention is all you need 1706.03762v7.pdf,The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.
2,What is Transformer?,0.679655,attention is all you need 1706.03762v7.pdf,"Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers."
3,What is Transformer?,0.674317,attention is all you need 1706.03762v7.pdf,"End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-\naligned recurrence and have been shown to perform well on simple-language question answering and\nlanguage modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution."
4,What is Transformer?,0.668299,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"For example, the largest Transformer explored in\nVaswani et al. (2017) is (L=6, H=1024, A=16)\nwith 100M parameters for the encoder, and the\nlargest Transformer we have found in the literature\nis (L=64, H=512, A=2) with 235M parameters\n(Al-Rfou et al., 2018)."
5,What is Transformer?,0.662332,attention is all you need 1706.03762v7.pdf,"7 Conclusion\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention.\nFor translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers."
6,What is Transformer?,0.657973,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"former is often referred to as a “Transformer encoder” while\nthe left-context-only version is referred to as a “Transformer\ndecoder” since it can be used for text generation.\nIn order to train a deep bidirectional representa-\ntion, we simply mask some percentage of the input\ntokens at random, and then predict those masked\ntokens."


In [157]:
merged_df = pd.concat([output_df[key] for key in output_df])
merged_df

Unnamed: 0,question,scores,source,text
0,What is Transformer?,0.696613,attention is all you need 1706.03762v7.pdf,"6.2 Model Variations\nTo evaluate the importance of different components of the Transformer, we varied our base model\nin different ways, measuring the change in performance on English-to-German translation on the\n5We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.\n8"
1,What is Transformer?,0.695771,attention is all you need 1706.03762v7.pdf,The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.
2,What is Transformer?,0.679655,attention is all you need 1706.03762v7.pdf,"Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers."
3,What is Transformer?,0.674317,attention is all you need 1706.03762v7.pdf,"End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-\naligned recurrence and have been shown to perform well on simple-language question answering and\nlanguage modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution."
4,What is Transformer?,0.668299,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"For example, the largest Transformer explored in\nVaswani et al. (2017) is (L=6, H=1024, A=16)\nwith 100M parameters for the encoder, and the\nlargest Transformer we have found in the literature\nis (L=64, H=512, A=2) with 235M parameters\n(Al-Rfou et al., 2018)."
5,What is Transformer?,0.662332,attention is all you need 1706.03762v7.pdf,"7 Conclusion\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention.\nFor translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers."
6,What is Transformer?,0.657973,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"former is often referred to as a “Transformer encoder” while\nthe left-context-only version is referred to as a “Transformer\ndecoder” since it can be used for text generation.\nIn order to train a deep bidirectional representa-\ntion, we simply mask some percentage of the input\ntokens at random, and then predict those masked\ntokens."
0,What is BERT?,0.811123,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks."
1,What is BERT?,0.751915,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1."
2,What is BERT?,0.751781,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 . TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP].
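
Note that the concatenated frame keeps each query's own 0-based row labels, so the index repeats. Two common ways to handle this are sketched below on toy data (the frames, file names and the `query`/`rank` level names are made up for illustration): renumber the rows with `ignore_index=True`, or pass the dict itself so its keys become an outer index level.

```python
import pandas as pd

# Toy stand-in for the notebook's output_df dict (values are illustrative only)
demo_output = {
    "What is BERT?": pd.DataFrame(
        {"scores": [0.81, 0.75], "source": ["bert.pdf", "bert.pdf"]}
    ),
    "What is Transformer?": pd.DataFrame(
        {"scores": [0.70, 0.67], "source": ["attention.pdf", "bert.pdf"]}
    ),
}

# Option 1: discard the per-query labels and renumber rows 0..n-1
flat = pd.concat(demo_output.values(), ignore_index=True)

# Option 2: keep the per-query rank and add the dict key as an outer level
keyed = pd.concat(demo_output).rename_axis(["query", "rank"])

print(flat)
print(keyed.loc["What is BERT?"])  # rows retrieved for a single question
```

The second form makes it easy to pull out one question's rows by label without filtering on the `question` column.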


In [158]:
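# move the repeated per-query row labels into an 'index' column and renumber rows
merged_df = merged_df.reset_index()
cols = list(merged_df.columns)
print(cols)
cols.remove('index')  # the old per-query rank is not needed in the grouped view
# identity apply: regroups the rows so they display per (question, source)
merged_df = merged_df.groupby(['question','source'])[cols].apply(lambda x: x)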

['index', 'question', 'scores', 'source', 'text']
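
The identity `apply` above is used only to arrange the rows under a (question, source) index. How `groupby(...).apply` attaches group keys has varied between pandas releases, so an explicit index may be a more predictable way to get a similar layout. The following is a sketch on toy data, not the notebook's own method; the rows and file names are invented.

```python
import pandas as pd

# Toy rows in the same column layout as merged_df (values are illustrative only)
demo = pd.DataFrame(
    {
        "question": ["What is Transformer?", "What is Transformer?", "What is BERT?"],
        "scores": [0.70, 0.67, 0.81],
        "source": ["attention.pdf", "bert.pdf", "bert.pdf"],
        "text": ["chunk a", "chunk b", "chunk c"],
    }
)

grouped = (
    # append question/source to the existing row index, keeping them as columns too
    demo.set_index(["question", "source"], append=True, drop=False)
    # put the original row position last: (question, source, row)
    .reorder_levels([1, 2, 0])
    # sort by question, then source, then original row position
    .sort_index()
)
print(grouped)
```

Because `drop=False` is used, `question` and `source` remain available as ordinary columns, matching the layout of the grouped frame shown in the notebook.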


In [159]:
merged_df

question,source,,question,scores,source,text
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,7,What is BERT?,0.811123,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,8,What is BERT?,0.751915,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,9,What is BERT?,0.751781,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 . TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP].
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,10,What is BERT?,0.740903,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 .
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,11,What is BERT?,0.738021,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Input/Output Representations To make BERT\nhandle a variety of down-stream tasks, our input\nrepresentation is able to unambiguously represent\nboth a single sentence and a pair of sentences\n(e.g., ⟨Question, Answer ⟩) in one token sequence."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,12,What is BERT?,0.733836,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Each downstream task has sep-\narate ﬁne-tuned models, even though they are ini-\ntialized with the same pre-trained parameters. The\nquestion-answering example in Figure 1 will serve\nas a running example for this section.\nA distinctive feature of BERT is its uniﬁed ar-\nchitecture across different tasks."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,13,What is BERT?,0.731738,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok \n1\n [SEP]. Tok \nN\nTok \n1 .
What is Transformer?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,4,What is Transformer?,0.668299,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"For example, the largest Transformer explored in\nVaswani et al. (2017) is (L=6, H=1024, A=16)\nwith 100M parameters for the encoder, and the\nlargest Transformer we have found in the literature\nis (L=64, H=512, A=2) with 235M parameters\n(Al-Rfou et al., 2018)."
What is Transformer?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,6,What is Transformer?,0.657973,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"former is often referred to as a “Transformer encoder” while\nthe left-context-only version is referred to as a “Transformer\ndecoder” since it can be used for text generation.\nIn order to train a deep bidirectional representa-\ntion, we simply mask some percentage of the input\ntokens at random, and then predict those masked\ntokens."
What is Transformer?,attention is all you need 1706.03762v7.pdf,0,What is Transformer?,0.696613,attention is all you need 1706.03762v7.pdf,"6.2 Model Variations\nTo evaluate the importance of different components of the Transformer, we varied our base model\nin different ways, measuring the change in performance on English-to-German translation on the\n5We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.\n8"
