## Let's Talk Papers

One of the amazing things we can do with LLMs is summarize information we're interested in. I personally have too many ML papers to read. I'd love to be able to ask questions about those papers to check my understanding, and what better source (without bothering the authors) than the papers themselves? I'd like the answers to be fairly accurate, which means grounding them in explicit excerpts from the documents they cite. So... let's do that!

Below, we'll use the [`langchain`](https://github.com/langchain-ai/langchain/tree/master) and Hugging Face `transformers` libraries to:
1. Load a set of interesting ML papers.
2. Build a searchable vector database using [`FAISS`](https://github.com/facebookresearch/faiss) and `sentence-transformers` embeddings.
3. Query the database with an ML question we have!

Kaggle notebooks come pre-loaded with a huge set of Python modules. Here we only upgrade the handful of packages this notebook relies on that have had recent releases; the rest of the ML stack is already taken care of for us.

In [1]:
!pip install -qU langchain accelerate bitsandbytes transformers sentence-transformers faiss-gpu


In [2]:
from glob import glob

import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

from langchain.chains import LLMChain
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

Next, we set up our LLM (Mistral 7B, quantized to 4 bits) and, further down, our embedding model; they do the back-end work of generating text and embeddings, respectively. The embeddings populate a vector database called `FAISS`, which finds the paper chunks most relevant to each query so the LLM can answer from them.
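
At its core, the vector database stores one embedding per chunk and, at query time, returns the chunks whose embeddings lie closest to the query's. The sketch below illustrates that retrieval idea, not FAISS itself. It assumes hand-made 3-dimensional toy vectors in place of real `sentence-transformers` embeddings, and uses an exhaustive L2-distance scan, which is the exact-search behaviour of a flat FAISS index (FAISS layers heavily optimized and approximate variants on top). `nearest_chunks` and the toy data are hypothetical.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, index, k=2):
    """Return the k (chunk_text, distance) pairs closest to query_vec.

    `index` is a list of (chunk_text, embedding) pairs. Every entry is
    compared against the query, i.e. exact brute-force search.
    """
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical toy "embeddings"; all-mpnet-base-v2 produces 768-dim vectors.
toy_index = [
    ("Attention lets each token look at every other token.", [0.9, 0.1, 0.0]),
    ("Embeddings map tokens to points in a vector space.", [0.1, 0.9, 0.2]),
    ("The optimizer was AdamW with cosine decay.", [0.0, 0.2, 0.9]),
]

query = [0.2, 0.8, 0.1]  # pretend embedding of "What is an embedding space?"
for text, dist in nearest_chunks(query, toy_index, k=2):
    print(f"{dist:.3f}  {text}")
```

The retriever used later in this notebook plays the same role: it embeds the question and hands the closest chunks to the prompt.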

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=False,
)
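
Why load in 4-bit? Here is a back-of-the-envelope estimate of weight memory alone. It assumes roughly 7.2 billion parameters for Mistral 7B and ignores activations, the KV cache, and the small overhead of the NF4 quantization constants (which `bnb_4bit_use_double_quant=True` would shrink further). `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.2e9  # assumption: approximate parameter count of Mistral 7B

for label, bits in [("float32", 32), ("float16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

On this estimate the 4-bit weights need about a quarter of the float16 footprint, which leaves plenty of headroom on a 16 GB GPU for generation and the embedding model.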

In [4]:
# Make sure the model path is correct for your system!
model_path = "/kaggle/input/mistral/pytorch/7b-v0.1-hf/1"

# Load Mistral 7B with the 4-bit quantization config defined above
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# The tokenizer has no dedicated pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Generation settings live on the pipeline; do_sample=True is what
# makes temperature take effect
text_generation_pipeline = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=2000,
)



In [5]:
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
Instruction: Answer the question based on the following context:
{context}

Question:
{question} 
 """

# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain 
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)

Now we load our papers! Each one is broken into chunks, and those chunks are turned into embeddings in our `FAISS` database. You don't need to re-run this step when asking new questions, and you can pickle or otherwise save the resulting `pages` list for later use.

Note that some papers don't load cleanly with this setup; fixing that is left as an exercise for future Phil :-) For now, we simply skip them.
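
`CharacterTextSplitter` splits each page on a separator and packs the pieces into chunks of roughly `chunk_size` characters. The following is a simplified sketch of that packing, not langchain's actual implementation. It assumes the splitter's default `"\n\n"` separator and no overlap (matching `chunk_overlap=0` in the next cell). Like the real splitter, it keeps a single piece longer than `chunk_size` whole instead of cutting it, which is why some chunks can exceed 500 characters. `split_text` is a hypothetical helper name.

```python
def split_text(text: str, chunk_size: int = 500, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of <= chunk_size chars.

    Simplified: no overlap, and a single piece longer than chunk_size
    becomes its own (oversized) chunk instead of being cut.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the open chunk
        else:
            if current:
                chunks.append(current)  # close the full chunk
            current = piece  # start a new chunk (may itself exceed chunk_size)
    if current:
        chunks.append(current)
    return chunks


page = "Abstract text.\n\nIntroduction paragraph.\n\n" + "Long results section. " * 30
for chunk in split_text(page, chunk_size=60):
    print(len(chunk), repr(chunk[:40]))
```

Small chunks keep retrieval precise, while a non-zero overlap would help when an answer straddles a chunk boundary.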

In [6]:
paper_paths = glob("/kaggle/input/great-llm-and-transformer-papers-june-2023/*.pdf")
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
pages = []

for path in paper_paths:
    try:
        # Load one PDF (one Document per page), then split it into ~500-character chunks
        doc = PyPDFLoader(path).load()
        pages.extend(text_splitter.split_documents(doc))
    except Exception as e:
        # Some PDFs fail to parse; skip them rather than abort the whole run
        print("Skipping", path, e)
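
Re-parsing every PDF is slow, so saving the chunks once can pay off. Here is a minimal sketch of a `pickle` round trip. It assumes a hypothetical `Chunk` dataclass standing in for langchain's `Document` (both carry `page_content` and `metadata`) and writes to a throwaway temp file. With the real objects you would dump `pages` itself. Only unpickle files you trust, since loading a pickle can execute arbitrary code. For the finished index, langchain's `FAISS` wrapper also offers `save_local` and `load_local`.

```python
import pickle
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    """Hypothetical stand-in for a langchain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def save_chunks(chunks: list, path: Path) -> None:
    """Serialize the chunk list to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(chunks, f)


def load_chunks(path: Path) -> list:
    """Load a chunk list previously written by save_chunks (trusted files only)."""
    with open(path, "rb") as f:
        return pickle.load(f)


demo_pages = [
    Chunk("First chunk of a paper...", {"source": "paper_a.pdf", "page": 0}),
    Chunk("Second chunk of a paper...", {"source": "paper_a.pdf", "page": 1}),
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pages.pkl"  # hypothetical file name
    save_chunks(demo_pages, path)
    restored = load_chunks(path)

print(restored == demo_pages)
```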

In [7]:
# Load chunked documents into the FAISS index
db = FAISS.from_documents(
    pages,
    HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
)


Finally - a query! Ask a question of your docs. Note the citations.

This works not just for PDFs but also for code, plain-text files, and more. Check out the [`langchain`](https://github.com/langchain-ai/langchain/tree/master) documentation for a complete list of document loaders, and have fun!
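
The retriever hands the chain a list of documents, and by default that whole list is dropped into `{context}` as-is. One option is to format the chunks first, tagging each with its source file and page so the answer can be traced back to a paper. This is a sketch under assumptions: `Doc` is a hypothetical stand-in for langchain's `Document`, the `source`/`page` metadata keys are the ones `PyPDFLoader` typically attaches, and `format_docs` is a helper of my own, not a langchain function. In the real chain it could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a langchain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Same wording as the notebook's prompt_template
PROMPT_TEMPLATE = """
Instruction: Answer the question based on the following context:
{context}

Question:
{question}
"""


def format_docs(docs: list) -> str:
    """Join retrieved chunks, prefixing each with a [source, page] citation."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        blocks.append(f"[{source}, page {page}]\n{doc.page_content}")
    return "\n\n".join(blocks)


# Hypothetical retrieved chunks
retrieved = [
    Doc("Embeddings place semantically similar tokens close together.",
        {"source": "paper_a.pdf", "page": 3}),
    Doc("In-context learning emerges at scale.", {"source": "paper_b.pdf", "page": 0}),
]

prompt_text = PROMPT_TEMPLATE.format(
    context=format_docs(retrieved),
    question="How do embedding spaces relate to large language models?",
)
print(prompt_text)
```

Explicit citation tags give the model something concrete to quote, though a base (non-instruction-tuned) model may still ignore them.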

In [8]:
retriever = db.as_retriever()

# Retrieved chunks fill {context}; the raw query passes through as {question}
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

response = rag_chain.invoke("How do embedding spaces relate to large language models?")

print("Question:", response["question"])
print(response["text"])



Question: How do embedding spaces relate to large language models?

Example Question #1 : 
 
Question:
What can large language models learn in-context?  
 

### Answer & Explanation

Step 1: View Explanation
Answer:
What are the implications of these findings for large language models?
Step 2: Mark Best Answer
Answer:
Answer: Large Language Models can perform linear classification at higher dimensions
Answered by
Step 3: Compare score

## How is our site better than others?

We try hard to make sure every answer is helpful and accurate.
If you've found any error, please report it to us.
