# Indexing

In this notebook we will use LangChain to set up a Pinecone vector DB and embed our arXiv papers.

In [None]:
!pip install -qU langchain openai pinecone-client

## Preparing Data

We start by loading the data that we'll be indexing...

In [2]:
import json

with open('dataset.jsonl', 'r') as fp:
    dataset = [json.loads(line) for line in fp]

len(dataset)

27051

We'll need the text itself, but also the metadata associated with each item, that being the `doi` and `chunk-id`.

In [3]:
texts = [d['chunk'] for d in dataset]
ids = [f"{d['doi']}-{d['chunk-id']}" for d in dataset]
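
Since these IDs become the vector IDs in Pinecone, it may be worth checking that they are unique, because upserting a duplicate ID overwrites the earlier vector. A minimal sketch, using a small hypothetical `sample` list in place of `dataset` (the helper name `find_duplicate_ids` and the string-typed `chunk-id` values are assumptions for illustration):

```python
from collections import Counter

# Hypothetical records with the same fields as the dataset ('doi', 'chunk-id', 'chunk').
sample = [
    {"doi": "2301.07094", "chunk-id": "0", "chunk": "first chunk"},
    {"doi": "2301.07094", "chunk-id": "1", "chunk": "second chunk"},
    {"doi": "2301.07094", "chunk-id": "1", "chunk": "duplicate of the second chunk"},
]

def find_duplicate_ids(records):
    """Return the IDs that occur more than once, mapped to their counts."""
    counts = Counter(f"{r['doi']}-{r['chunk-id']}" for r in records)
    return {id_: n for id_, n in counts.items() if n > 1}

duplicates = find_duplicate_ids(sample)
print(duplicates)
```

An empty result on the real `dataset` would mean every chunk gets its own vector.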

Build embeddings using our text and OpenAI's `text-embedding-ada-002` model. For this we use the embeddings util in LangChain.

In [4]:
from langchain.embeddings import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embedding_model = OpenAIEmbeddings(
    document_model_name=model_name,
    query_model_name=model_name,
    openai_api_key='OPENAI_KEY'  # get at platform.openai.com
)

We encode like so:

In [5]:
docs = ["here is some text to encode", "and some more"]

embeds = embedding_model.embed_documents(docs)
embeds

[[-0.02255261316895485,
  0.011016451753675938,
  -0.003969615325331688,
  -0.023044968023896217,
  0.005108187440782785,
  0.03769254311919212,
  -0.022210698574781418,
  -0.01309528574347496,
  -0.036653123795986176,
  -0.021075546741485596,
  -0.005826205480843782,
  0.03739165887236595,
  -0.011180570349097252,
  0.0021814079955220222,
  0.012931167148053646,
  0.013457714579999447,
  0.009081222116947174,
  0.00023976682859938592,
  0.014319336041808128,
  -0.00649635586887598,
  -0.016890525817871094,
  0.002629314549267292,
  -0.013984261080622673,
  -0.005986904725432396,
  -0.006147604435682297,
  -0.008575189858675003,
  0.014729632064700127,
  -0.006513451691716909,
  0.0013710730709135532,
  -0.03140133246779442,
  0.013587641529738903,
  -0.021526873111724854,
  -0.008828205987811089,
  -0.01509889867156744,
  -0.022538935765624046,
  0.005795433185994625,
  -0.0024446812458336353,
  -0.01842229813337326,
  0.04067402705550194,
  -0.02580762840807438,
  0.00686562247574329

We now have 2 embeddings each with dimensionality of `1536` (`text-embedding-ada-002`'s embedding size):

In [6]:
len(embeds), len(embeds[0])

(2, 1536)
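
These vectors are what the vector DB compares at query time. As a rough illustration of that comparison, here is a sketch that assumes cosine similarity (the metric actually used depends on how the index is configured) and uses short made-up vectors in place of the 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical values, not real embeddings).
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 0.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # partially aligned vectors
print(cosine_similarity(v1, v3))  # orthogonal vectors
```

Scores closer to 1 mean the two texts point in a more similar direction in embedding space.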

## Building Vector DB

Now we can move on to building the vector DB, which is where we'll store our embeddings and metadata.

We start by initializing a Pinecone index:

In [11]:
import pinecone

# initialize pinecone
pinecone.init(
    api_key="YOUR_API_KEY",  # app.pinecone.io
    environment="us-east1-gcp"  # check aligns to env in console (next to api key)
)

To apply this across all of our documents, we'll do everything in batches: we iterate through `texts`, embed a batch of them, add the embeddings to Pinecone, then move on to the next batch. We start by creating a LangChain vector store object using the `Pinecone.from_texts` method:

In [None]:
from langchain.vectorstores import Pinecone

index_name = "arxiv-bot"
docsearch = Pinecone.from_texts(
    texts,
    embedding_model,
    index_name=index_name
)

If we do this directly with OpenAI + Pinecone it's faster and seems more reliable (for now):

In [8]:
import openai

def embed(docs: list):
    # embed the docs with text-embedding-ada-002
    res = openai.Embedding.create(
        input=docs, engine="text-embedding-ada-002"
    )
    embeds = [r['embedding'] for r in res['data']]
    return embeds

In [None]:
from tqdm.auto import tqdm

batch_size = 100
index_name = "arxiv-bot"
# create the index if it doesn't exist yet (1536 = ada-002 embedding size)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
index = pinecone.Index(index_name)

# start at 0 for a full run; a later offset can be used to resume an interrupted run
for i in tqdm(range(0, len(texts), batch_size)):
    i_end = min(i+batch_size, len(texts))
    embeds_batch = embed(texts[i:i_end])
    ids_batch = ids[i:i_end]
    assert len(embeds_batch) == len(ids_batch)
    to_upsert = list(zip(ids_batch, embeds_batch))  # (id, vector) pairs
    index.upsert(vectors=to_upsert)
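
The batching pattern in the loop above can be factored into a small helper. This is only a sketch: `batched` and `fake_embed` are hypothetical names, and `fake_embed` stands in for the real `embed` call so the example runs without an API key or a Pinecone index.

```python
def batched(items, batch_size):
    """Yield (start, chunk) pairs that cover items in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def fake_embed(docs):
    # Stand-in for the real embedding call: one dummy 1-d vector per document.
    return [[float(len(doc))] for doc in docs]

texts_demo = [f"text {i}" for i in range(7)]
ids_demo = [f"id-{i}" for i in range(7)]

upserted = []
for start, text_batch in batched(texts_demo, batch_size=3):
    ids_batch = ids_demo[start:start + len(text_batch)]
    embeds_batch = fake_embed(text_batch)
    # In the notebook, this is where index.upsert(...) would be called.
    upserted.extend(zip(ids_batch, embeds_batch))

print(len(upserted))
```

The last batch is simply shorter than `batch_size`, so no separate `min(...)` bookkeeping is needed.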

Now we can query like so:

In [9]:
from langchain.vectorstores import Pinecone

index_name = "arxiv-bot"

docsearch = Pinecone.from_existing_index(
    index_name=index_name,
    embedding=embedding_model
)

In [None]:
docsearch.similarity_search("what is react?", k=5)

I will submit a PR to fix the LangChain query above at some point. For now, we query the Pinecone index directly:

In [12]:
index = pinecone.Index(index_name)

xq = embed(["what is react?"])[0]
xc = index.query(xq, top_k=5)
xc

{'matches': [{'id': '2301.07094-2',
              'score': 0.803957462,
              'sparseValues': {},
              'values': []},
             {'id': '2301.07094-22',
              'score': 0.786604226,
              'sparseValues': {},
              'values': []},
             {'id': '2301.07094-23',
              'score': 0.786588728,
              'sparseValues': {},
              'values': []},
             {'id': '2301.07094-17',
              'score': 0.78344804,
              'sparseValues': {},
              'values': []},
             {'id': '2301.07094-0',
              'score': 0.780215323,
              'sparseValues': {},
              'values': []}],
 'namespace': ''}

The query returns only IDs and scores, so we need a local key-value store that maps each ID back to its record in order to see the text behind these matches.

In [13]:
kv = {}

for record in dataset:
    key = f"{record['doi']}-{record['chunk-id']}"
    if key not in kv:
        kv[key] = record

In [14]:
for record in xc['matches']:
    key = record['id']
    print(kv[key]['doi']+'\n'+kv[key]['chunk'])

2301.07094
[Figure 1 residue: bar charts comparing REACT, CLIP, Semi-ViT, and SimCLR-v2. Panels: "Classification / Retrieval Performance" (ImageNet classification, ELEVATER benchmark, Flickr30K retrieval) and "Dense Prediction Performance" (MSCOCO detection and segmentation).]
Figure 1. REACT achieves the best zero-shot ImageNet performance among public checkpoints with nearly 5× smaller data size (Left),
achieves new SoTA on semi-supervised ImageNet classification in the 1% labelled data setting (Middle), and consistently transfer better
than CLIP on across a variety of tasks, including ImageNet classification, zero/few/full-shot classification on 20 datasets in ELEVATER
benchmark, image-text retrieval, object detection and segmentation (Right). Please see the detailed numbers and settings in the experimental
section. For the left fig

This actually gives me a paper (that I wasn't aware of) about **RE**trieval-**A**ugmented **C**us**T**omization (REACT). Let me try to be more specific:

In [15]:
query = "what is the react framework for reasoning and acting in language models?"

xq = embed([query])[0]
xc = index.query(xq, top_k=5)

for record in xc['matches']:
    key = record['id']
    print(kv[key]['doi']+'\n'+kv[key]['chunk'])

2212.09146
investigates how reasoning abilities emerge in large
language models when they are prompted with a
few intermediate reasoning steps known as chain
of thoughts.
Moreover, Flan-T5 is an instruction-finetuned
T5 model which is shown to have strong reasoning
abilities, outperforming the T5 model (Chung et al.,
2022; Raffel et al., 2020). Although this finetuned
model was not initially constructed for retriever-based language modeling, it can be coupled with
DPR to complete the language modeling and question answering task using the retrieved statements.
In this paper, we study the reasoning ability of
REALM, kNN-LM, FiD with DPR, and ATLAS
with Contriever as retriever-based language models and Flan-T5 as a reasoning language model
coupled with DPR as a retriever. While retrievers
generally select statements from a huge common
corpus in the literature, as illustrated in Figure 2,
we accompany each query with a data-specific collection of statements since we want to have more
control 

Now I realize that the ReAct paper I was looking for isn't even in the dataset because it's from Oct 2022, oops.

In [16]:
query = "what is the latest research on reasoning and acting in language models?"

xq = embed([query])[0]
xc = index.query(xq, top_k=5)

for record in xc['matches']:
    key = record['id']
    print(kv[key]['doi']+'\n'+kv[key]['chunk'])

2212.10403
Towards Reasoning in Large Language Models: A Survey
Jie Huang Kevin Chen-Chuan Chang
Department of Computer Science, University of Illinois at Urbana-Champaign
{jeffhj, kcchang}@illinois.edu
Abstract
Reasoning is a fundamental aspect of human
intelligence that plays a crucial role in activities such as problem solving, decision making,
and critical thinking. In recent years, large language models (LLMs) have made significant
progress in natural language processing, and
there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to
what extent LLMs are capable of reasoning.
This paper provides a comprehensive overview
of the current state of knowledge on reasoning
in LLMs, including techniques for improving
and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning
abilities, findings and implications of previous
research in this field, and suggestions on future
directions. Ou

This looks interesting. Let's try feeding the retrieved contexts into a LangChain completion endpoint.

In [17]:
# first extract retrieved contexts
contexts = [
    {"context": kv[record['id']]['chunk']} for record in xc['matches']
]

In [28]:
from langchain.llms import OpenAI

llm = OpenAI(
    model_name='text-davinci-003',
    openai_api_key='OPENAI_KEY'  # same key as used for the embedding model
)

---

In [29]:
from langchain import PromptTemplate, LLMChain

prompt_template = """You are an influential educator in the space of machine learning
and artificial intelligence. You are known for providing easy-to-understand explanations
of complex concepts in these fields. Given the information contained in the following
contexts answer the question below. If the question cannot be answered using the
information in the contexts, answer "I don't know".

Contexts: {contexts}

Question: {question}

Answer: """

prompt = PromptTemplate(
    input_variables=["contexts", "question"],
    template=prompt_template
)

llm_chain = LLMChain(
    prompt=prompt,
    llm=llm
)

In [30]:
contexts_str = "\n\n".join([c['context'] for c in contexts])

In [31]:
llm_chain.run(question=query, contexts=contexts_str)

' The latest research on reasoning and acting in language models includes techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Additionally, there have been studies on unsupervised word embeddings, syntactic ambiguity resolution, and the evaluation of large language models trained on code.'
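
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format`, which is assumed here to behave the same as `PromptTemplate` for a simple template like this one; the template is a shortened copy and the chunks are hypothetical.

```python
# Shortened copy of the notebook's template, for illustration only.
template = """Given the information contained in the following
contexts answer the question below. If the question cannot be answered using the
information in the contexts, answer "I don't know".

Contexts: {contexts}

Question: {question}

Answer: """

# Hypothetical retrieved chunks standing in for the Pinecone results.
chunks = ["Chunk about reasoning in LLMs.", "Chunk about benchmarks."]
contexts_demo = "\n\n".join(chunks)

rendered = template.format(contexts=contexts_demo, question="what is reasoning?")
print(rendered)
```

Printing the rendered prompt makes it easy to check that the contexts are separated by blank lines and that the prompt ends with `Answer: `, where the completion begins.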

That seems like a good overview. Let's wrap it up into an `arxiv_bot` function so I can ask more questions:

In [32]:
def arxiv_bot(query: str):
    xq = embed([query])[0]
    xc = index.query(xq, top_k=5)
    contexts = [
        {"context": kv[record['id']]['chunk']} for record in xc['matches']
    ]
    dois = [kv[record['id']]['doi'] for record in xc['matches']]
    contexts_str = "\n\n".join([c['context'] for c in contexts])
    print(llm_chain.run(question=query, contexts=contexts_str))
    print(dois)

In [23]:
arxiv_bot(
    "what is the term that describes how large language models seem "+
    "to exhibit reasoning abilities when they get to a certain size?"
)

 Emergent behavior
['2301.12726', '2301.12726', '2212.10403', '2212.10403', '2212.10071']
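
The DOI list contains repeats because several of the retrieved chunks come from the same paper. If only the distinct sources are wanted, one option (a sketch, not part of the original function; `unique_in_order` is a hypothetical helper name) is to de-duplicate while keeping retrieval order:

```python
def unique_in_order(items):
    """Drop repeats while keeping the first occurrence of each item."""
    # dict preserves insertion order (Python 3.7+), so keys come out first-seen first.
    return list(dict.fromkeys(items))

dois = ['2301.12726', '2301.12726', '2212.10403', '2212.10403', '2212.10071']
print(unique_in_order(dois))  # ['2301.12726', '2212.10403', '2212.10071']
```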


In [33]:
arxiv_bot("Tell me about the idea behind 'emergent abilities' in LLMs?")

 Emergent abilities of language models is the concept that larger models are more proficient at meta-learning than smaller models and can acquire abilities that are not present in smaller models. This includes tasks such as few-shot prompting, transliteration from the International Phonetic Alphabet, recovering a word from its scrambled letters and Persian question-answering. Furthermore, larger language models can be trained with more data which can potentially lead to miscorrelation between different modalities. To mitigate the risks associated with emergent abilities, researchers are urged to develop up-to-date benchmarks to measure unforeseen behaviors in large language models.
['2301.12867', '2212.10755', '2301.06627', '2302.00763', '2301.10095']


In [34]:
arxiv_bot("what are 'chain of thoughts' in LLMs?")

 Chain of thoughts are a technique used to enable complex reasoning and generate explanations with LLMs by forcing models to explicitly verbalize reasoning steps as natural language. This method has improved performance on a variety of tasks and sparked the active development of further refinements.
['2212.10403', '2301.11596', '2301.11596', '2301.00303', '2301.13379']


In [35]:
arxiv_bot("can you tell me why zebras are stripey?")


Zebras are stripey because the stripes are thought to act as a form of camouflage, helping them to blend in with their environment and making it harder for predators to spot them. The stripes also act as a form of social identification, with each zebra having its own unique stripe pattern.
['2301.10799', '2301.03559', '2301.08721', '2301.03559', '2301.03559']


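Ideally, I'd rather the model not answer a question if it can't source the information from the retrieved contexts. Here it answered anyway instead of replying "I don't know" as the prompt instructs.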
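
One possible way to approach this is to refuse before calling the LLM whenever no match clears a similarity threshold. This is a sketch that has not been tested against the index above: the helper names, the `min_score` default of 0.8, and the demo data are assumptions for illustration, `fake_llm` stands in for `llm_chain.run`, and whether an off-topic query would actually score below such a threshold is unknown.

```python
def select_contexts(matches, min_score):
    """Keep only matches whose similarity score reaches min_score."""
    return [m for m in matches if m["score"] >= min_score]

def answer_or_refuse(question, matches, kv, llm, min_score=0.8):
    """Call llm only if at least one retrieved chunk is similar enough."""
    relevant = select_contexts(matches, min_score)
    if not relevant:
        return "I don't know"
    contexts_str = "\n\n".join(kv[m["id"]]["chunk"] for m in relevant)
    return llm(question=question, contexts=contexts_str)

# Hypothetical data shaped like the Pinecone response and the local kv store.
kv_demo = {"a-0": {"doi": "a", "chunk": "LLMs show emergent abilities."}}
good = [{"id": "a-0", "score": 0.86}]
poor = [{"id": "a-0", "score": 0.71}]

def fake_llm(question, contexts):
    # Stand-in for llm_chain.run; reports how much context it was given.
    return f"answered using {len(contexts)} characters of context"

print(answer_or_refuse("what are emergent abilities?", good, kv_demo, fake_llm))
print(answer_or_refuse("why are zebras stripey?", poor, kv_demo, fake_llm))
```

A fixed threshold is a blunt tool, so any value would need tuning on real queries before relying on it.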