<img src="ask questions to LLMs what.jpg" width=70%>

## How to Ask Questions to LLMs

<h3>Code walkthrough video: <a href="https://www.youtube.com/111">How to Ask Questions to LLMs</a></h3>

<a href="https://arxiv.org/pdf/1706.03762">Attention is all you need</a>

<img src="attention.jpg">

<a href="https://arxiv.org/pdf/1810.04805">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a>

<img src="bert.jpg">

## 1. Read PDF files with research papers using Llama-Index

In [78]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

<img src="LlamaIndex_schema.jpg">

In [79]:
import os
import time
import pandas as pd

<b>1.1 Set up the model for embeddings</b>

In [80]:
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None
Settings.chunk_size = 128
Settings.chunk_overlap = 25

LLM is explicitly disabled. Using MockLLM.
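
The `chunk_size` and `chunk_overlap` settings control how text is cut into pieces before embedding. As a rough illustration of the sliding-window idea only: the sketch below counts whitespace-separated words, whereas LlamaIndex's own splitter is more sophisticated, and `chunk_words` is a hypothetical helper, not a library function.

```python
def chunk_words(text, chunk_size=128, chunk_overlap=25):
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 300 numbered words (an assumption for illustration)
sample = " ".join(f"w{i}" for i in range(300))
chunks = chunk_words(sample)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk repeats the last 25 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.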


<b>1.2 Read files one by one from the directory</b>

In [99]:
reader = SimpleDirectoryReader(input_dir="papers", recursive=True)
reader.__dict__

{'fs': <fsspec.implementations.local.LocalFileSystem at 0x25938c1fd60>,
 'errors': 'ignore',
 'encoding': 'utf-8',
 'exclude': None,
 'recursive': True,
 'exclude_hidden': True,
 'required_exts': None,
 'num_files_limit': None,
 'raise_on_error': False,
 'input_dir': WindowsPath('papers'),
 'input_files': [WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/attention is all you need 1706.03762v7.pdf'),
  WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf')],
 'file_extractor': {},
 'file_metadata': <llama_index.core.readers.file.base._DefaultFileMetadataFunc at 0x25a26555e50>,
 'filename_as_id': False}

In [100]:
reader.__dict__['input_files']

[WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/attention is all you need 1706.03762v7.pdf'),
 WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf')]

In [101]:
reader.input_files

[WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/attention is all you need 1706.03762v7.pdf'),
 WindowsPath('C:/Users/18623/Desktop/PhiAi/Jupyter/papers/BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf')]

<b>1.3 Store Text chunks in a dict to trace chunk-paper mapping</b>

In [7]:
# iter_data() yields one list of Documents per input file
pdf_files = reader.input_files
all_docs = []   # flat list of every Document across all files
doc_dict = {}   # file name -> list of per-file Document lists
for i, docs in enumerate(reader.iter_data()):
    pdf_name = os.path.basename(pdf_files[i])
    print(f'Extract from file {pdf_name}')
    doc_dict.setdefault(pdf_name, []).append(docs)
    all_docs.extend(docs)

Extract from file attention is all you need 1706.03762v7.pdf
Extract from file BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf
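
Each loaded document also records its source in `metadata['file_name']`, so the chunk-to-paper mapping can be built from the documents themselves rather than from the loop position. A minimal sketch under that assumption; plain dicts stand in for LlamaIndex `Document` objects (which expose `.metadata` as an attribute), and `group_by_file` is a hypothetical helper.

```python
from collections import defaultdict

def group_by_file(docs):
    """Map each file name to the list of records that came from it."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc["metadata"]["file_name"]].append(doc)
    return dict(grouped)


# Toy records standing in for page-level Documents (made-up data)
toy_docs = [
    {"metadata": {"file_name": "a.pdf", "page_label": "1"}, "text": "first page"},
    {"metadata": {"file_name": "a.pdf", "page_label": "2"}, "text": "second page"},
    {"metadata": {"file_name": "b.pdf", "page_label": "1"}, "text": "other paper"},
]
grouped = group_by_file(toy_docs)
for name, records in grouped.items():
    print(name, len(records))
```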


<b>1.4 Inspect Llama-Index document class</b>

In [8]:
doc_dict.keys()

dict_keys(['attention is all you need 1706.03762v7.pdf', 'BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf'])

In [102]:
#doc_dict

In [107]:
len(doc_dict['attention is all you need 1706.03762v7.pdf'][0])

15

In [108]:
len(doc_dict['BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf'][0])

16

In [105]:
doc = doc_dict['attention is all you need 1706.03762v7.pdf']

In [106]:
doc[0][0].__dict__

{'id_': '757dfa3d-eac9-419b-b818-8d82813df6c6',
 'embedding': None,
 'metadata': {'page_label': '1',
  'file_name': 'attention is all you need 1706.03762v7.pdf',
  'file_path': 'C:\\Users\\18623\\Desktop\\PhiAi\\Jupyter\\papers\\attention is all you need 1706.03762v7.pdf',
  'file_type': 'application/pdf',
  'file_size': 2215244,
  'creation_date': '2024-12-11',
  'last_modified_date': '2024-12-11'},
 'excluded_embed_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'excluded_llm_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'relationships': {},
 'text': 'Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogl

In [26]:
doc[0][0].__dict__['id_']

'757dfa3d-eac9-419b-b818-8d82813df6c6'

In [25]:
doc[0][0].id_

'757dfa3d-eac9-419b-b818-8d82813df6c6'

<b>1.5 Get total number of collected chunks</b>

In [27]:
len(all_docs)

31

<b>1.6 Get chunk distribution across papers</b>

In [28]:
for key in doc_dict:
    # doc_dict[key] is a list of per-file Document lists, so count the Documents inside
    num_chunks = sum(len(docs) for docs in doc_dict[key])
    print(f'{key} >>> {num_chunks} chunks')

attention is all you need 1706.03762v7.pdf >>> 15 chunks
BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf >>> 16 chunks


<b>1.7 Collect all chunks for each paper in one text</b>

In [31]:
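articles = list(doc_dict.keys())
papers = {}  # file name -> text used later as QA context

for paper_name in articles:
    papers[paper_name] = ''
    for doc in doc_dict[paper_name]:
        # doc is the list of Documents for this file;
        # doc[0] is its first Document, so only that text is collected
        papers[paper_name] += ' ' + doc[0].text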

In [32]:
for paper_name in articles:
    papers[paper_name] = papers[paper_name].replace('.\n','. ')
    papers[paper_name] = papers[paper_name].replace('\n',' ')    
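
The two `replace` calls flatten line breaks left by PDF extraction. A slightly more general sketch with regular expressions is shown here; it also collapses repeated whitespace and rejoins words hyphenated at a line end. `normalize_pdf_text` is a hypothetical helper, and the hyphen rule is a heuristic that can wrongly merge genuine hyphenated compounds.

```python
import re

def normalize_pdf_text(text):
    """Flatten PDF-extracted text into a single line."""
    # Rejoin words split across lines, e.g. "transfor-\nmation" -> "transformation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace remaining line breaks and runs of whitespace with one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Made-up sample that mimics extraction artifacts
raw = "Attention Is All\nYou Need.\nWe use a linear transfor-\nmation and   softmax."
cleaned = normalize_pdf_text(raw)
print(cleaned)
```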

In [33]:
papers[articles[0]]

' Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗ † University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗ ‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. E

## 2. QA: Ask questions about the papers

<b>2.1 roberta-base-squad2-distilled model</b>

In [34]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import torch

In [35]:
dfs = {}

In [36]:
model_name = "deepset/roberta-base-squad2-distilled"

# a) Get predictions
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nlp = pipeline('question-answering', 
               model=model_name, 
               tokenizer=model_name,
               device = device
              )


In [37]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is attention?'
    QA_input = {
        'question': question,
        'context': papers[paper]
    }
    res = nlp(QA_input)
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question + model_name] = pd.DataFrame({'question':len(articles)*[question],
                                           'model name':len(articles)*[model_name],
                                           'paper':articles,
                                           'answer': answer_list,
                                           'score':score_list
                             })
dfs[question + model_name]



| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what is attention? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | All You Need | 0.038977 |
| 1 | what is attention? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transf... | conceptually simple and empirically powerful | 0.000194 |


In [38]:
pd.set_option('display.max_colwidth', 500)

In [39]:
dfs[question + model_name]

| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what is attention? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | All You Need | 0.038977 |
| 1 | what is attention? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | conceptually simple and empirically powerful | 0.000194 |


In [40]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what types of attention are defined?'
    QA_input = {
        'question': question,
        'context': papers[paper]
    }
    res = nlp(QA_input)
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question + model_name] = pd.DataFrame({'question':len(articles)*[question],
                              'model name':len(articles)*[model_name],
                              'paper':articles,
                              'answer': answer_list,
                              'score':score_list
                             })
dfs[question + model_name]



| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what types of attention are defined? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | multi-head attention | 5e-06 |
| 1 | what types of attention are defined? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | self-attention layers | 2.4e-05 |


In [41]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is multi-head attention?'
    QA_input = {
        'question': question,
        'context': papers[paper]
    }
    res = nlp(QA_input)
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question + model_name] = pd.DataFrame({'question':len(articles)*[question],
                              'model name':len(articles)*[model_name],
                              'paper':articles,
                              'answer': answer_list,
                              'score':score_list
                             })
dfs[question + model_name]



| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what is multi-head attention? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | parameter-free position representation | 0.003145 |
| 1 | what is multi-head attention? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | MultiNLI accuracy | 3e-06 |
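
The cells above repeat the same loop for every question. One way to factor it out is sketched below, assuming `qa_fn` is any callable that accepts `question=` and `context=` keywords and returns a dict with `'answer'` and `'score'` keys. `ask_all_papers` and `fake_qa` are hypothetical names, and `fake_qa` is only a stand-in so the sketch runs without downloading a model.

```python
def ask_all_papers(question, papers, qa_fn, model_name):
    """Run one question against every paper and return one row per paper."""
    rows = []
    for paper_name, context in papers.items():
        res = qa_fn(question=question, context=context)
        rows.append({
            "question": question,
            "model name": model_name,
            "paper": paper_name,
            "answer": res["answer"],
            "score": res["score"],
        })
    # Highest-confidence answer first
    return sorted(rows, key=lambda row: row["score"], reverse=True)


def fake_qa(question, context):
    """Stand-in for a real QA pipeline: returns the first word and a dummy score."""
    return {"answer": context.split()[0], "score": len(context) / 100}


# Made-up contexts for illustration
toy_papers = {
    "a.pdf": "Attention maps queries to outputs",
    "b.pdf": "BERT is bidirectional and deep",
}
rows = ask_all_papers("what is attention?", toy_papers, fake_qa, "fake")
for row in rows:
    print(row["paper"], row["answer"], row["score"])
```

The returned list of dicts can be passed directly to `pd.DataFrame(rows)` to get the same table layout as above.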


<b>2.2 HuggingFace exemplar pipeline</b>

In [42]:
qa_model = pipeline("question-answering", 
                    device=device)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [43]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is attention?'
    res = qa_model(question = question, context = papers[paper])
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question+'_nomodel'] = pd.DataFrame({'question':len(articles)*[question],
                                         'model name': len(articles)*['default'],
                                         'paper':articles,
                                         'answer': answer_list,
                                         'score':score_list
                                        })
dfs[question+'_nomodel']

| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what is attention? | default | attention is all you need 1706.03762v7.pdf | multi-head | 0.680392 |
| 1 | what is attention? | default | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | improving many natural language | 0.133738 |


In [44]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what types of attention are defined?'
    res = qa_model(question = question, context = papers[paper])
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question+'_nomodel'] = pd.DataFrame({'question':len(articles)*[question],
                                         'model name': len(articles)*['default'],
                                         'paper':articles,
                                         'answer': answer_list,
                                         'score':score_list
                                        })
dfs[question+'_nomodel']

| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what types of attention are defined? | default | attention is all you need 1706.03762v7.pdf | multi-head | 0.289631 |
| 1 | what types of attention are defined? | default | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | deep bidirectional representations | 0.003118 |


In [45]:
answer_list = []
score_list = []
for paper in articles:
    question = 'what is multi-head attention?'
    res = qa_model(question = question, context = papers[paper])
    #print(f'from paper {paper} \n we learn that answer to {question} is \n {res} \n')
    answer_list.append(res['answer'])
    score_list.append(res['score'])

dfs[question+'_nomodel'] = pd.DataFrame({'question':len(articles)*[question],
                                         'model name': len(articles)*['default'],
                                         'paper':articles,
                                         'answer': answer_list,
                                         'score':score_list
                                        })
dfs[question+'_nomodel']

| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what is multi-head attention? | default | attention is all you need 1706.03762v7.pdf | scaled dot-product attention | 0.8024 |
| 1 | what is multi-head attention? | default | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | Language | 1.2e-05 |


<b>2.3 Summarize QA results in one DataFrame</b>

In [46]:
qa_keys = list(dfs.keys())

merged_df = pd.concat([dfs[key] for key in qa_keys])

In [47]:
merged_df

| | question | model name | paper | answer | score |
|---|---|---|---|---|---|
| 0 | what is attention? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | All You Need | 0.038977 |
| 1 | what is attention? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | conceptually simple and empirically powerful | 0.000194 |
| 0 | what types of attention are defined? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | multi-head attention | 5e-06 |
| 1 | what types of attention are defined? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | self-attention layers | 2.4e-05 |
| 0 | what is multi-head attention? | deepset/roberta-base-squad2-distilled | attention is all you need 1706.03762v7.pdf | parameter-free position representation | 0.003145 |
| 1 | what is multi-head attention? | deepset/roberta-base-squad2-distilled | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | MultiNLI accuracy | 3e-06 |
| 0 | what is attention? | default | attention is all you need 1706.03762v7.pdf | multi-head | 0.680392 |
| 1 | what is attention? | default | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | improving many natural language | 0.133738 |
| 0 | what types of attention are defined? | default | attention is all you need 1706.03762v7.pdf | multi-head | 0.289631 |
| 1 | what types of attention are defined? | default | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | deep bidirectional representations | 0.003118 |


In [48]:
merged_df = merged_df.reset_index()
cols = ['paper', 'answer', 'score']
print(cols)
#cols.remove('index')
merged_df = merged_df.groupby(['question','model name'])[cols].apply(lambda x: x)

['paper', 'answer', 'score']


In [49]:
merged_df

| question | model name | | paper | answer | score |
|---|---|---|---|---|---|
| what is attention? | deepset/roberta-base-squad2-distilled | 0 | attention is all you need 1706.03762v7.pdf | All You Need | 0.038977 |
| what is attention? | deepset/roberta-base-squad2-distilled | 1 | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | conceptually simple and empirically powerful | 0.000194 |
| what is attention? | default | 6 | attention is all you need 1706.03762v7.pdf | multi-head | 0.680392 |
| what is attention? | default | 7 | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | improving many natural language | 0.133738 |
| what is multi-head attention? | deepset/roberta-base-squad2-distilled | 4 | attention is all you need 1706.03762v7.pdf | parameter-free position representation | 0.003145 |
| what is multi-head attention? | deepset/roberta-base-squad2-distilled | 5 | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | MultiNLI accuracy | 3e-06 |
| what is multi-head attention? | default | 10 | attention is all you need 1706.03762v7.pdf | scaled dot-product attention | 0.8024 |
| what is multi-head attention? | default | 11 | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | Language | 1.2e-05 |
| what types of attention are defined? | deepset/roberta-base-squad2-distilled | 2 | attention is all you need 1706.03762v7.pdf | multi-head attention | 5e-06 |
| what types of attention are defined? | deepset/roberta-base-squad2-distilled | 3 | BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf | self-attention layers | 2.4e-05 |


## 3. Semantic Vector Search: Hands-On

In [50]:
from sentence_transformers import SentenceTransformer

In [51]:
model_name = 'all-mpnet-base-v2'
model = SentenceTransformer(model_name)

<b>3.1 Semantic Search Across Text Chunks</b>

In [52]:
queries = ['what is attention?', 'what is multi-head attention?']
documents = [] 
for rec in all_docs:
    text_chunk = rec.text.replace('.\n','. ')
    text_chunk = text_chunk.replace('\n',' ')
    documents.append(text_chunk)

In [109]:
documents[0]

'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗ † University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗ ‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Ex

In [53]:
query_embeddings = model.encode(queries) 
document_embeddings = model.encode(documents)

In [54]:
query_embeddings.shape

(2, 768)

In [2]:
#query_embeddings[0]

In [55]:
document_embeddings.shape

(31, 768)

In [1]:
#document_embeddings[0]

In [56]:
scores = model.similarity(query_embeddings, document_embeddings)

In [57]:
scores.numpy()

array([[0.27904153, 0.2639539 , 0.33054262, 0.34323582, 0.36463457,
        0.2585139 , 0.2755901 , 0.12498504, 0.32768473, 0.1439856 ,
        0.16710263, 0.2609303 , 0.4622292 , 0.40838757, 0.31974906,
        0.20027614, 0.16921651, 0.17691593, 0.18058467, 0.22321635,
        0.15124284, 0.11138519, 0.17050765, 0.19434658, 0.1487274 ,
        0.14039241, 0.2860044 , 0.20682164, 0.19277057, 0.1614857 ,
        0.10827111],
       [0.33957446, 0.24378414, 0.4095444 , 0.37700048, 0.45007756,
        0.23550165, 0.28207815, 0.2192395 , 0.37530297, 0.18709067,
        0.18889803, 0.2374867 , 0.5049366 , 0.43961325, 0.44466537,
        0.20438865, 0.19514653, 0.2123058 , 0.16955382, 0.20691344,
        0.22477685, 0.21884929, 0.25725156, 0.18618935, 0.18130736,
        0.17228976, 0.35454687, 0.28306875, 0.18855003, 0.17762521,
        0.1183356 ]], dtype=float32)
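
`model.similarity` returns one row per query and one column per document; in recent sentence-transformers releases the default metric is cosine similarity. A dependency-free sketch of that computation follows. The 2-d vectors are made up, and `cosine_similarity` and `similarity_matrix` are hypothetical helpers, not library functions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(queries, docs):
    """Row i, column j holds the similarity of query i to document j."""
    return [[cosine_similarity(q, d) for d in docs] for q in queries]


toy_queries = [[1.0, 0.0], [1.0, 1.0]]
toy_docs = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
sim = similarity_matrix(toy_queries, toy_docs)
for row in sim:
    print([round(x, 3) for x in row])
```

Because cosine similarity ignores vector length, `[2.0, 0.0]` scores the same against the first query as `[1.0, 0.0]` would.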

In [58]:
# Output the results
thres = 0.25

query_score_list = []

for query, query_scores in zip(queries, scores):
    # Sort (chunk index, score) pairs so each hit maps back to its own Document
    ranked = sorted(enumerate(query_scores), key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for k, score in ranked:
        score_val = score.numpy()
        if score_val >= thres:
            print(score_val, documents[k][:100])
            query_score_list.append({'query': query,
                                     'doc': documents[k],
                                     'llama_index_doc': all_docs[k],
                                     'score': score_val
                                    })

Query: what is attention?
0.4622292 Attention Visualizations Input-Input Layer5 It is in this spirit that a majority of American governm
0.40838757 Input-Input Layer5 The Law will never be perfect , but its application should be just - this is what
0.36463457 output values. These are concatenated and once again projected, resulting in the final values, as de
0.34323582 Scaled Dot-Product Attention  Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention. (r
0.33054262 Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture us
0.32768473 Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the b
0.31974906 Input-Input Layer5 The Law will never be perfect , but its application should be just - this is what
0.2860044 for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: An- alyzi
0.27904153 Provided proper attribution is provided, Google hereby grants permiss

<b>3.2 Semantic Search Inside Text Chunks</b>

In [59]:
ind = 2
doc = query_score_list[ind]['doc']
q = query_score_list[ind]['query']

doc_sentence = doc.split('. ')

In [60]:
q

'what is attention?'

In [61]:
doc_sentence

['output values',
 'These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2',
 'Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions',
 'With a single attention head, averaging inhibits this',
 'MultiHead(Q, K, V) = Concat(head1, ...,headh)WO where headi = Attention(QWQ i , KWK i , V WV i ) Where the projections are parameter matricesWQ i ∈ Rdmodel×dk , WK i ∈ Rdmodel×dk , WV i ∈ Rdmodel×dv and WO ∈ Rhdv×dmodel ',
 'In this work we employ h = 8 parallel attention layers, or heads',
 'For each of these we use dk = dv = dmodel/h = 64',
 'Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality',
 '3.2.3 Applications of Attention in our Model The Transformer uses multi-head attention in three different ways: • In "encoder-decoder attention" layers, the queries come from the previ
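
Splitting on `'. '` drops the final period of each sentence and cuts at every period followed by a space. A regex-based alternative is sketched here; it splits only where sentence-ending punctuation is followed by whitespace and an uppercase letter. This is still a heuristic (it will miss sentences starting with a digit or bullet), and `split_sentences` is a hypothetical helper.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Return the sentences in text, keeping their closing punctuation."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


# Made-up sample in the style of the paper text
sample = ("These are concatenated, as depicted in Figure 2. "
          "Multi-head attention allows the model to attend to different subspaces, "
          "i.e. several views at once. "
          "With a single attention head, averaging inhibits this.")
sentences = split_sentences(sample)
for s in sentences:
    print(repr(s))
```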

In [62]:
doc_query_embeddings = model.encode(q) 
doc_sentence_embeddings = model.encode(doc_sentence)

In [63]:
sentence_scores = model.similarity(doc_query_embeddings, 
                                   doc_sentence_embeddings)

In [64]:
sentence_scores

tensor([[ 0.0960,  0.0322,  0.4273,  0.4587,  0.2416,  0.4568,  0.0357,  0.4336,
          0.4957,  0.2036,  0.3647,  0.3706,  0.3717,  0.1514,  0.3282,  0.0843,
          0.2573,  0.1392,  0.3269,  0.1512, -0.0143,  0.1179,  0.1034,  0.1406,
          0.1311,  0.0234,  0.0249,  0.1270]])

In [112]:
import matplotlib
import matplotlib.colors
from IPython.display import display, Markdown

def color_map_color(value,
                    cmap_name='PuBu',
                    #cmap_name='Wistia',
                    vmin=0,
                    vmax=1):
    """Map a score to a hex color on the chosen matplotlib colormap."""
    norm = matplotlib.colors.Normalize(vmin=vmin, vmax=vmax)
    # The colormap registry replaces the deprecated cm.get_cmap (matplotlib >= 3.5)
    cmap = matplotlib.colormaps[cmap_name]
    rgb = cmap(norm(abs(value)))[:3]  # cmap returns RGBA; keep only RGB
    color = matplotlib.colors.rgb2hex(rgb)
    return color

In [113]:
html_template = "<span style='background-color:{}'>{}</span>"
html_template.format(color_map_color(0.51),'text')



"<span style='background-color:#6fa7ce'>text</span>"

In [114]:
display(Markdown(html_template.format(color_map_color(0.51),'text')))



<span style='background-color:#6fa7ce'>text</span>
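
To make the normalize-then-look-up step concrete, here is a dependency-free sketch that blends linearly between two endpoint colors. It is an approximation for illustration only: real matplotlib colormaps such as PuBu interpolate through several color stops, the two endpoints below are assumptions, and `score_to_hex` is a hypothetical helper.

```python
def score_to_hex(value, vmin=0.0, vmax=1.0, low=(255, 255, 255), high=(2, 56, 88)):
    """Linearly blend two RGB colors according to where value falls in [vmin, vmax]."""
    t = (abs(value) - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


html_span = "<span style='background-color:{}'>{}</span>"
print(html_span.format(score_to_hex(0.51), "text"))
```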

In [115]:
html_template = "<span style='background-color:{};opacity:0.8;'>{}</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:{:.2f}</sup>"
html_output = []
# q is a single query string, so sentence_scores has exactly one row
query_scores = sentence_scores[0]
doc_score_pairs = sorted(zip(doc_sentence, query_scores), key=lambda x: x[1], reverse=True)
for sentence, score in doc_score_pairs:
    score_val = float(score)
    # Higher scores get a lighter background because the color uses 1 - score
    html_output.append(html_template.format(color_map_color(1 - score_val), sentence, score_val))



In [116]:
display(Markdown(' '.join(html_output)))

<span style='background-color:#71a8ce;opacity:0.8;'>3.2.3 Applications of Attention in our Model The Transformer uses multi-head attention in three different ways: • In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.50</sup> <span style='background-color:#60a1ca;opacity:0.8;'>With a single attention head, averaging inhibits this</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.46</sup> <span style='background-color:#5ea0ca;opacity:0.8;'>In this work we employ h = 8 parallel attention layers, or heads</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.46</sup> <span style='background-color:#529bc7;opacity:0.8;'>Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.43</sup> <span style='background-color:#509ac6;opacity:0.8;'>Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.43</sup> <span style='background-color:#358fc0;opacity:0.8;'>In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.37</sup> <span style='background-color:#348ebf;opacity:0.8;'>• The encoder contains self-attention layers</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.37</sup> <span style='background-color:#328dbf;opacity:0.8;'>This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9]</span><sup 
style='font-size:10px;font-weight:bold;color:red;'>Score:0.36</sup> <span style='background-color:#2484ba;opacity:0.8;'>• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.33</sup> <span style='background-color:#2383ba;opacity:0.8;'>3.3 Position-wise Feed-Forward Networks In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.33</sup> <span style='background-color:#0771b1;opacity:0.8;'>We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.26</sup> <span style='background-color:#056ead;opacity:0.8;'>MultiHead(Q, K, V) = Concat(head1, ...,headh)WO where headi = Attention(QWQ i , KWK i , V WV i ) Where the projections are parameter matricesWQ i ∈ Rdmodel×dk , WK i ∈ Rdmodel×dk , WV i ∈ Rdmodel×dv and WO ∈ Rhdv×dmodel </span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.24</sup> <span style='background-color:#0568a3;opacity:0.8;'>This allows every position in the decoder to attend over all positions in the input sequence</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.20</sup> <span style='background-color:#045e94;opacity:0.8;'>Each position in the encoder can attend to all positions in the previous layer of the encoder</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.15</sup> <span style='background-color:#045e94;opacity:0.8;'>This consists of two linear transformations with a ReLU activation in between</span><sup 
style='font-size:10px;font-weight:bold;color:red;'>Score:0.15</sup> <span style='background-color:#045c90;opacity:0.8;'>3.4 Embeddings and Softmax Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.14</sup> <span style='background-color:#045c90;opacity:0.8;'>See Figure 2</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.14</sup> <span style='background-color:#045b8e;opacity:0.8;'>We also use the usual learned linear transfor- mation and softmax function to convert the decoder output to predicted next-token probabilities</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.13</sup> <span style='background-color:#045a8d;opacity:0.8;'>5</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.13</sup> <span style='background-color:#04588a;opacity:0.8;'>Another way of describing this is as two convolutions with kernel size 1</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.12</sup> <span style='background-color:#045483;opacity:0.8;'>The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.10</sup> <span style='background-color:#045280;opacity:0.8;'>output values</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.10</sup> <span style='background-color:#034e7b;opacity:0.8;'>We need to prevent leftward information flow in the decoder to preserve the auto-regressive property</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.08</sup> <span style='background-color:#034267;opacity:0.8;'>For each of these we use dk = dv = dmodel/h = 64</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.04</sup> <span style='background-color:#034165;opacity:0.8;'>These are 
concatenated and once again projected, resulting in the final values, as depicted in Figure 2</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.03</sup> <span style='background-color:#023e62;opacity:0.8;'>In the embedding layers, we multiply those weights by √dmodel</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.02</sup> <span style='background-color:#023d60;opacity:0.8;'>In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:0.02</sup> <span style='background-color:#023858;opacity:0.8;'>FFN(x) = max(0, xW1 + b1)W2 + b2 (2) While the linear transformations are the same across different positions, they use different parameters from layer to layer</span><sup style='font-size:10px;font-weight:bold;color:red;'>Score:-0.01</sup>

## 4. Semantic Vector Search: Vector DB

In [82]:
output_df = {}

In [83]:
pd.set_option('display.max_colwidth', 500)

In [84]:
# store docs into vector DB
index = VectorStoreIndex.from_documents(all_docs)
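
What the index does at query time can be sketched in a few lines, assuming the store ranks chunks by cosine similarity between the query embedding and each chunk embedding. The function names and the 3-d vectors below are made up for illustration and are not part of Llama-Index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_search(query_vec, chunk_vecs, k):
    """Return (chunk_id, score) for the k chunks most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-d "embeddings"; bge-small-en-v1.5 produces 384-d vectors.
toy_chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.6, 0.8, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.0, 0.0]

hits = top_k_search(toy_query, toy_chunks, k=2)
print(hits)
```

The chunk pointing in the same direction as the query ranks first, and the orthogonal chunk is left out of the top 2.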

In [117]:
all_docs[0]

Document(id_='757dfa3d-eac9-419b-b818-8d82813df6c6', embedding=None, metadata={'page_label': '1', 'file_name': 'attention is all you need 1706.03762v7.pdf', 'file_path': 'C:\\Users\\18623\\Desktop\\PhiAi\\Jupyter\\papers\\attention is all you need 1706.03762v7.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-12-11', 'last_modified_date': '2024-12-11'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Rese

In [85]:
# set number of docs to retrieve
top_k = 7

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)
     

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.3)],
)
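
The combined effect of `similarity_top_k` and `similarity_cutoff` can be pictured as a two-step filter. This is a minimal sketch with made-up scores and a hypothetical `retrieve` function, assuming chunks scoring below the cutoff are dropped after the top-k selection.

```python
def retrieve(scored_chunks, top_k, similarity_cutoff):
    """Keep the top_k highest-scoring chunks, then drop those below the cutoff."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:top_k]
    return [(cid, score) for cid, score in ranked if score >= similarity_cutoff]

# Made-up (chunk_id, similarity) pairs for illustration only.
toy_scores = [("c1", 0.81), ("c2", 0.75), ("c3", 0.42), ("c4", 0.28), ("c5", 0.10)]

kept = retrieve(toy_scores, top_k=4, similarity_cutoff=0.3)
print(kept)
```

Here four chunks pass the top-k step but only three survive the cutoff, so a query can return fewer than `top_k` source nodes.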

<b>4.1 To BERT or not to BERT? That is the question</b>

In [118]:
# query documents
query = "What is BERT?"
response = query_engine.query(query)

In [119]:
response.__dict__.keys()

dict_keys(['response', 'source_nodes', 'metadata'])

In [120]:
response.source_nodes[0].node.metadata['file_name']

'BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf'

In [121]:
# reformat response
cols = ['question','scores','paper','text']
df_dict = {col:[] for col in cols}

# iterate over the nodes actually returned: the similarity cutoff can leave fewer than top_k
for source_node in response.source_nodes:
    df_dict['scores'].append(source_node.score)
    df_dict['paper'].append(source_node.node.metadata['file_name'])
    df_dict['text'].append(source_node.text)

df_dict['question'] = [query]*len(df_dict['scores'])
output_df[query] = pd.DataFrame(df_dict)
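
Since the same reformatting is needed for every query, it could be factored into a small helper. The sketch below is hypothetical: `nodes_to_rows`, `ToyNode`, and `ToyNodeWithScore` are stand-ins invented here so the example runs without Llama-Index, and they mirror only the `score`, `node.text`, and `node.metadata` attributes used above.

```python
from dataclasses import dataclass, field

@dataclass
class ToyNode:
    """Stand-in for a retrieved chunk: its text plus file metadata."""
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class ToyNodeWithScore:
    """Stand-in for a scored retrieval result."""
    node: ToyNode
    score: float

def nodes_to_rows(query, source_nodes):
    """Turn scored nodes into one dict per row, using the notebook's column names."""
    rows = []
    for item in source_nodes:
        rows.append({
            "question": query,
            "scores": item.score,
            "paper": item.node.metadata.get("file_name", ""),
            "text": item.node.text,
        })
    return rows

toy_nodes = [
    ToyNodeWithScore(ToyNode("first chunk", {"file_name": "paper_a.pdf"}), 0.8),
    ToyNodeWithScore(ToyNode("second chunk", {"file_name": "paper_b.pdf"}), 0.7),
]
rows = nodes_to_rows("toy question", toy_nodes)
print(rows[0]["paper"], len(rows))
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)` to obtain a frame with the same four columns.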

In [122]:
output_df[query]

,question,scores,paper,text
0,What is BERT?,0.811123,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks."
1,What is BERT?,0.751915,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1."
2,What is BERT?,0.751781,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 . TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP].
3,What is BERT?,0.740903,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 .
4,What is BERT?,0.738021,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Input/Output Representations To make BERT\nhandle a variety of down-stream tasks, our input\nrepresentation is able to unambiguously represent\nboth a single sentence and a pair of sentences\n(e.g., ⟨Question, Answer ⟩) in one token sequence."
5,What is BERT?,0.733836,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Each downstream task has sep-\narate ﬁne-tuned models, even though they are ini-\ntialized with the same pre-trained parameters. The\nquestion-answering example in Figure 1 will serve\nas a running example for this section.\nA distinctive feature of BERT is its uniﬁed ar-\nchitecture across different tasks."
6,What is BERT?,0.731738,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok \n1\n [SEP]. Tok \nN\nTok \n1 .


<b>4.2 Pay Attention</b>

In [90]:
# query documents
query = "What is multi-head attention?"
response = query_engine.query(query)

In [91]:
# reformat response
cols = ['question','scores','paper','text']
df_dict = {col:[] for col in cols}

# iterate over the nodes actually returned: the similarity cutoff can leave fewer than top_k
for source_node in response.source_nodes:
    df_dict['scores'].append(source_node.score)
    df_dict['paper'].append(source_node.node.metadata['file_name'])
    df_dict['text'].append(source_node.text)

df_dict['question'] = [query]*len(df_dict['scores'])
output_df[query] = pd.DataFrame(df_dict)

In [92]:
output_df[query]

,question,scores,paper,text
0,What is multi-head attention?,0.80071,attention is all you need 1706.03762v7.pdf,"Scaled Dot-Product Attention\n Multi-Head Attention\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several\nattention layers running in parallel.\nof the values, where the weight assigned to each value is computed by a compatibility function of the\nquery with the corresponding key."
1,What is multi-head attention?,0.784158,attention is all you need 1706.03762v7.pdf,"3.2.3 Applications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n• In ""encoder-decoder attention"" layers, the queries come from the previous decoder layer,\nand the memory keys and values come from the output of the encoder. This allows every\nposition in the decoder to attend over all positions in the input sequence."
2,What is multi-head attention?,0.77205,attention is all you need 1706.03762v7.pdf,"output values. These are concatenated and once again projected, resulting in the final values, as\ndepicted in Figure 2.\nMulti-head attention allows the model to jointly attend to information from different representation\nsubspaces at different positions. With a single attention head, averaging inhibits this."
3,What is multi-head attention?,0.758912,attention is all you need 1706.03762v7.pdf,"With a single attention head, averaging inhibits this.\nMultiHead(Q, K, V) = Concat(head1, ...,headh)WO\nwhere headi = Attention(QWQ\ni , KWK\ni , V WV\ni )\nWhere the projections are parameter matricesWQ\ni ∈ Rdmodel×dk , WK\ni ∈ Rdmodel×dk , WV\ni ∈ Rdmodel×dv\nand WO ∈ Rhdv×dmodel ."
4,What is multi-head attention?,0.740496,attention is all you need 1706.03762v7.pdf,"Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [11] around each of\nthe two sub-layers, followed by layer normalization [ 1]."
5,What is multi-head attention?,0.738943,attention is all you need 1706.03762v7.pdf,"To counteract this effect, we scale the dot products by 1√dk\n.\n3.2.2 Multi-Head Attention\nInstead of performing a single attention function with dmodel-dimensional keys, values and queries,\nwe found it beneficial to linearly project the queries, keys and values h times with different, learned\nlinear projections to dk, dk and dv dimensions, respectively."
6,What is multi-head attention?,0.738749,attention is all you need 1706.03762v7.pdf,"In this work we employ h = 8 parallel attention layers, or heads. For each of these we use\ndk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost\nis similar to that of single-head attention with full dimensionality."


<b>4.3 Who let the Transformer out?</b>

In [93]:
# query documents
query = "What is Transformer?"
response = query_engine.query(query)

In [94]:
# reformat response
cols = ['question','scores','paper','text']
df_dict = {col:[] for col in cols}

# iterate over the nodes actually returned: the similarity cutoff can leave fewer than top_k
for source_node in response.source_nodes:
    df_dict['scores'].append(source_node.score)
    df_dict['paper'].append(source_node.node.metadata['file_name'])
    df_dict['text'].append(source_node.text)

df_dict['question'] = [query]*len(df_dict['scores'])
output_df[query] = pd.DataFrame(df_dict)

In [95]:
output_df[query]

,question,scores,paper,text
0,What is Transformer?,0.696613,attention is all you need 1706.03762v7.pdf,"6.2 Model Variations\nTo evaluate the importance of different components of the Transformer, we varied our base model\nin different ways, measuring the change in performance on English-to-German translation on the\n5We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.\n8"
1,What is Transformer?,0.695771,attention is all you need 1706.03762v7.pdf,The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.
2,What is Transformer?,0.679655,attention is all you need 1706.03762v7.pdf,"Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers."
3,What is Transformer?,0.674317,attention is all you need 1706.03762v7.pdf,"End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-\naligned recurrence and have been shown to perform well on simple-language question answering and\nlanguage modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution."
4,What is Transformer?,0.668299,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"For example, the largest Transformer explored in\nVaswani et al. (2017) is (L=6, H=1024, A=16)\nwith 100M parameters for the encoder, and the\nlargest Transformer we have found in the literature\nis (L=64, H=512, A=2) with 235M parameters\n(Al-Rfou et al., 2018)."
5,What is Transformer?,0.662332,attention is all you need 1706.03762v7.pdf,"7 Conclusion\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention.\nFor translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers."
6,What is Transformer?,0.657973,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"former is often referred to as a “Transformer encoder” while\nthe left-context-only version is referred to as a “Transformer\ndecoder” since it can be used for text generation.\nIn order to train a deep bidirectional representa-\ntion, we simply mask some percentage of the input\ntokens at random, and then predict those masked\ntokens."


In [96]:
merged_dfs = pd.concat([output_df[key] for key in output_df])
merged_dfs

,question,scores,paper,text
0,What is BERT?,0.811123,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks."
1,What is BERT?,0.751915,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"BERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1."
2,What is BERT?,0.751781,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 . TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP].
3,What is BERT?,0.740903,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 .
4,What is BERT?,0.738021,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Input/Output Representations To make BERT\nhandle a variety of down-stream tasks, our input\nrepresentation is able to unambiguously represent\nboth a single sentence and a pair of sentences\n(e.g., ⟨Question, Answer ⟩) in one token sequence."
5,What is BERT?,0.733836,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,"Each downstream task has sep-\narate ﬁne-tuned models, even though they are ini-\ntialized with the same pre-trained parameters. The\nquestion-answering example in Figure 1 will serve\nas a running example for this section.\nA distinctive feature of BERT is its uniﬁed ar-\nchitecture across different tasks."
6,What is BERT?,0.731738,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok \n1\n [SEP]. Tok \nN\nTok \n1 .
0,What is multi-head attention?,0.80071,attention is all you need 1706.03762v7.pdf,"Scaled Dot-Product Attention\n Multi-Head Attention\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several\nattention layers running in parallel.\nof the values, where the weight assigned to each value is computed by a compatibility function of the\nquery with the corresponding key."
1,What is multi-head attention?,0.784158,attention is all you need 1706.03762v7.pdf,"3.2.3 Applications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n• In ""encoder-decoder attention"" layers, the queries come from the previous decoder layer,\nand the memory keys and values come from the output of the encoder. This allows every\nposition in the decoder to attend over all positions in the input sequence."
2,What is multi-head attention?,0.77205,attention is all you need 1706.03762v7.pdf,"output values. These are concatenated and once again projected, resulting in the final values, as\ndepicted in Figure 2.\nMulti-head attention allows the model to jointly attend to information from different representation\nsubspaces at different positions. With a single attention head, averaging inhibits this."


In [97]:
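# replace the repeating per-query index with a running row number
merged_dfs = merged_dfs.reset_index()
cols = ['scores','text']
print(cols)
# the identity apply keeps every row; the result is indexed by (question, paper, row number)
merged_dfs = merged_dfs.groupby(['question','paper'])[cols].apply(lambda x: x)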

['scores', 'text']
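
The grouping above arranges the retrieved chunks by question and then by source paper. The same idea can be sketched without pandas. The function name and the toy rows below are made up for illustration.

```python
from collections import defaultdict

def group_by_question_and_paper(rows):
    """Collect (score, text) pairs under each (question, paper) key."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["question"], row["paper"])].append((row["scores"], row["text"]))
    return dict(groups)

# Made-up rows using the notebook's column names.
toy_rows = [
    {"question": "q1", "paper": "a.pdf", "scores": 0.9, "text": "t1"},
    {"question": "q1", "paper": "a.pdf", "scores": 0.8, "text": "t2"},
    {"question": "q2", "paper": "a.pdf", "scores": 0.7, "text": "t3"},
    {"question": "q2", "paper": "b.pdf", "scores": 0.6, "text": "t4"},
]

grouped = group_by_question_and_paper(toy_rows)
for key, chunks in grouped.items():
    print(key, len(chunks))
```

Counting the entries per key shows at a glance which paper contributed most of the context for each question.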


In [98]:
merged_dfs

question,paper,,scores,text
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,0,0.811123,"BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,1,0.751915,"BERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,2,0.751781,BERT BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 . TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP].
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,3,0.740903,TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok 1 [SEP]. Tok N Tok 1 .
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,4,0.738021,"Input/Output Representations To make BERT\nhandle a variety of down-stream tasks, our input\nrepresentation is able to unambiguously represent\nboth a single sentence and a pair of sentences\n(e.g., ⟨Question, Answer ⟩) in one token sequence."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,5,0.733836,"Each downstream task has sep-\narate ﬁne-tuned models, even though they are ini-\ntialized with the same pre-trained parameters. The\nquestion-answering example in Figure 1 will serve\nas a running example for this section.\nA distinctive feature of BERT is its uniﬁed ar-\nchitecture across different tasks."
What is BERT?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,6,0.731738,BERT\nE[CLS] E1 E[SEP]. EN E1’ . EM’\nC\n T1\n T[SEP].\n TN\n T1’ .\n TM’\n[CLS] Tok \n1\n [SEP]. Tok \nN\nTok \n1 .
What is Transformer?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,18,0.668299,"For example, the largest Transformer explored in\nVaswani et al. (2017) is (L=6, H=1024, A=16)\nwith 100M parameters for the encoder, and the\nlargest Transformer we have found in the literature\nis (L=64, H=512, A=2) with 235M parameters\n(Al-Rfou et al., 2018)."
What is Transformer?,BERT pre_training of deep bidirectional transformers for language understanding 1810.04805v2.pdf,20,0.657973,"former is often referred to as a “Transformer encoder” while\nthe left-context-only version is referred to as a “Transformer\ndecoder” since it can be used for text generation.\nIn order to train a deep bidirectional representa-\ntion, we simply mask some percentage of the input\ntokens at random, and then predict those masked\ntokens."
What is Transformer?,attention is all you need 1706.03762v7.pdf,14,0.696613,"6.2 Model Variations\nTo evaluate the importance of different components of the Transformer, we varied our base model\nin different ways, measuring the change in performance on English-to-German translation on the\n5We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.\n8"
