Ingesting PDFs

In [1]:
!pip install --q unstructured langchain langchain-community
!pip install --q "unstructured[local-inference]" ipywidgets tqdm

In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from IPython.display import display as Markdown
from tqdm.autonotebook import tqdm as notebook_tqdm

  from tqdm.autonotebook import tqdm as notebook_tqdm


In [3]:
local_path = "Scholz_2022_PASP_134_104401.pdf"

# Local PDF file uploads
if local_path:
  loader = UnstructuredPDFLoader(file_path=local_path)
  data = loader.load()
else:
  print("Upload a PDF file")

In [4]:
Markdown(data[0].page_content)

'Publications of the Astronomical Society of the Paciﬁc, 134:104401 (10pp), 2022 October © 2022. The Author(s). Published by IOP Publishing Ltd on behalf of the Astronomical Society of the Paciﬁc (ASP). All rights reserved\n\nhttps://doi.org/10.1088/1538-3873/ac9431\n\nRogue Planets and Brown Dwarfs: Predicting the Populations Free-ﬂoating Planetary Mass Objects Observable with JWST\n\nAleks Scholz1\n\n, Koraljka Muzic2\n\n, Ray Jayawardhana3\n\n, Lyra Quinlan1, and James Wurster1\n\n1 SUPA, School of Physics & Astronomy, University of St Andrews, North Haugh, St Andrews, KY16 9SS, UK; as110@st-andrews.ac.uk 2 CENTRA, Faculdade de Ciências, Universidade de Lisboa, Ed. C8, Campo Grande, 1749-016 Lisboa, Portugal 3 Department of Astronomy, Cornell University, Ithaca, NY 14853, USA Received 2022 August 18; accepted 2022 September 22; published 2022 October 11\n\nAbstract Free-ﬂoating (or rogue) planets are planets that are liberated (or ejected) from their host systems. Although simulatio

Vector Embeddings

In [8]:
# 1. First clean up any existing ChromaDB installations
%pip uninstall -y chromadb
%pip uninstall -y protobuf

# 2. Install specific versions known to work together
%pip install protobuf
%pip install chromadb  # Using a stable older version
%pip install langchain-ollama

# 3. Set the environment variable
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# 4. Now reimport with the new versions
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

Note: you may need to restart the kernel to use updated packages.




Found existing installation: protobuf 5.29.3
Uninstalling protobuf-5.29.3:
  Successfully uninstalled protobuf-5.29.3
Note: you may need to restart the kernel to use updated packages.
Collecting protobuf
  Using cached protobuf-5.29.3-cp310-abi3-win_amd64.whl.metadata (592 bytes)
Using cached protobuf-5.29.3-cp310-abi3-win_amd64.whl (434 kB)
Installing collected packages: protobuf
Successfully installed protobuf-5.29.3
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


Note: you may need to restart the kernel to use updated packages.


In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)

In [10]:
pip install chromadb

Collecting chromadbNote: you may need to restart the kernel to use updated packages.

  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-win_amd64.whl.metadata (262 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.15.1-py2.py3-none-any.whl.metadata (2.9 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.30.0-py3-none-any.whl.metadata (1.6 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.30.0-py

In [12]:
# 5. Try creating the vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="gemma2:2b"),
    collection_name="local-rag"
)

Retrieval


In [13]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [14]:
local_model = "gemma2:2b"
llm = ChatOllama(model=local_model)

In [15]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

In [16]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [17]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [19]:
chain.invoke("What does the research paper say?")

"This research paper discusses the formation and characteristics of objects in young star clusters. It delves into aspects such as:\n\n**1. Planet Formation and Ejection:** \n- The paper examines how planets form within star clusters, specifically focusing on their ejection from these environments. \n\n**2. Mass Function Models:**\n-  The research explores different approaches to modeling the distribution of object masses (mass function) in young star clusters. They present a log-normal mass function as well as a power law model and discuss the relative consistency with observational data. This is based on work by Bastian et al., Muzic et al., and others. \n\n**3. Cluster Observational Data:**\n- The paper utilizes observations of NGC1333, a young star cluster, to analyze the mass function in more detail. It highlights the importance of comparing their findings to data from other clusters for broader scientific validation.\n\n**4. Star Formation Simulations:**\n- Simulations are used t

In [20]:
# Delete all collections in the db
vector_db.delete_collection()