# semantic_experiments

> Just some experiments with "RAG" indexing and semantic search using common toolkits like langchain, llamaindex, P   

In [None]:
#| default_exp semantic_experiments

# Experiments on using Proposition Chunking and RAG indexing for semantic search
This notebook is based on the "Agentic Text Splitting" method from the [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) notebook. That method follows the [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/pdf/2312.06648.pdf) paper and has a prompt implementation in LangChain hub. Unfortunately, the LangChain hub is API walled, so I copied the prompt template into this notebook.

We need to handle PDFs as a combination of text, image and tabular data for the purposes of chunking and indexing. The LangChain blog has some interesting experiments on [Benchmarking RAG on tables](https://blog.langchain.dev/benchmarking-rag-on-tables/) and [Multi-modal RAG on slide decks](https://blog.langchain.dev/multi-modal-rag-template/), plus a set of notebooks, [Multi-modal eval: GPT-4 w/ multi-modal embeddings and multi-vector retriever](https://langchain-ai.github.io/langchain-benchmarks/notebooks/retrieval/multi_modal_benchmarking/multi_modal_eval.html?ref=blog.langchain.dev). The [LangChain Benchmarks](https://langchain-ai.github.io/langchain-benchmarks/notebooks/retrieval/intro.html) pages give a deeper background on benchmarking.

It also appears that no single PDF library handles text, tables and images proficiently. Rather, [pypdf2](https://pypi.org/project/PyPDF2/) is used for text extraction, [camelot](https://camelot-py.readthedocs.io/en/master/) for table extraction and [pypdfium2](https://pypi.org/project/pypdfium2/) for image extraction; see [Extracting Text from PDF Files with Python: A Comprehensive Guide](https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517). The tutorial [How to Use PDF Loaders in Langchain](https://linuxhint.com/use-pdf-loader-langchain/) gives a good overview of the PDF extraction process, and the LangChain [PDF Document Loaders Documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) gives details.

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

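# Load the PDF; load_and_split() already applies LangChain's default text splitter
loader = PyPDFLoader("data/DoD_Data_Strategy.pdf")
pages = loader.load_and_split()

# Split again into chunks of roughly 1000 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)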
docs = text_splitter.split_documents(pages)
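
To make the `chunk_size` parameter more concrete, here is a minimal pure-Python sketch of separator-based chunking. It is an illustration under simplifying assumptions, not LangChain's actual implementation: the function name `split_text` and the `"\n\n"` default separator are chosen for this example, overlap is not modelled (matching `chunk_overlap=0` above), and a single piece longer than `chunk_size` is kept whole.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Split on `separator`, then greedily pack pieces into chunks of at most `chunk_size` characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is a single oversized piece
            # that is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            # The chunk is full: emit it and start a new one with this piece.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
small_chunks = split_text(sample, chunk_size=10)
```

With a `chunk_size` of 10, the first two pieces fit together (4 + 2 + 4 characters) and the third starts a new chunk.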

In [None]:

docs

[Document(page_content='Executive Summary : DoD Data Strategy  \nUnleashing Data to Advance the National Defense Strategy  \n \nBLUF :  The DoD Data Strategy supports the National Defense Strategy and Digital \nModernization by providing the overarching vision , focus areas , guiding principles , \nessential capabilities , and goals  necessary to transform the Department into a data -centric \nenterprise. Succes s cannot be taken for granted… it is the responsibility  of all DoD leaders \nto treat data as a weapon system and manage, secure, and use data for operational effect.    \nVision : DoD is a data -centric organization that uses data at speed and scale for operational \nadvantage and increased efficiency.  \nFocus Areas :  The strategy  emphasi zes the need to work closely with users in the \noperational community , particularly the  warfighter. Initial areas of focus include:   \n- Joint All Domain Operation s – using data for advantage on the battlefield  \n- Senior Leader Dec

In [None]:
import os
from typing import List, Optional

from langchain.chains import create_extraction_chain, create_extraction_chain_pydantic
from langchain.output_parsers.openai_tools import JsonOutputToolsParser
from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnableLambda


In [None]:
PromptTemplate = ChatPromptTemplate.from_messages(
    [
                (
                    "system",
                    """
                    Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of context.
                        1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.
                        2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
                        3. De-contextualize the proposition by adding necessary modifier to nouns or entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the entities they refer to.
                        4. Present the results as a list of strings, formatted in JSON.

                    Example:

                        Input: Title: Ēostre. Section: Theories and interpretations, Connection to Easter Hares. Content: The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in 1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
                        origin of the colored eggs hidden there for children. Alternatively, there is a European tradition that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and both occur on grassland and are first seen in the spring. In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe. German immigrants then exported the custom to Britain and America where it evolved into the Easter Bunny."
                        Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in 1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about the possible explanation for the connection between hares and the tradition during Easter", "Hares
                        were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation for the origin of the colored eggs hidden in gardens for children.", 
                        "There is a European tradition that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both hares and lapwing’s nests occur on grassland and are first seen in the spring.", 
                        "In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to Britain and America.", 
                        "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in Britain and America."]
                    """,
                ),
                ("user", "Decompose the following:\n{input}\n"),
            ]
)
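
The prompt asks the model to return its propositions as a JSON list of strings. As a rough sketch of how such a reply could be turned into a Python list without another LLM call, the hypothetical helper below (`parse_propositions` is not part of LangChain or of the original notebook, which uses an extraction chain instead) pulls the outermost `[...]` out of the reply and validates it. It assumes the reply contains exactly one JSON list.

```python
import json
from typing import List


def parse_propositions(reply: str) -> List[str]:
    """Extract a JSON list of strings from a model reply (hypothetical helper)."""
    # Tolerate text around the list, such as "Output:" or markdown code fences.
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in reply")
    items = json.loads(reply[start : end + 1])
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("expected a JSON list of strings")
    # Drop empty entries and surrounding whitespace.
    return [i.strip() for i in items if i.strip()]


example_reply = 'Output: ["Richard Sermon was a scholar.", "Hares were frequently seen in gardens in spring."]'
example_props = parse_propositions(example_reply)
```

A plain `json.loads` on the whole reply would fail whenever the model adds any text around the list, which is why the sketch slices first.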

In [None]:
# Read the API key from the environment; 'YourKey' is only a placeholder fallback
llm = ChatOpenAI(model='gpt-4-1106-preview', openai_api_key=os.getenv("OPENAI_API_KEY", 'YourKey'))


In [None]:
# use it in a runnable
runnable = PromptTemplate | llm

In [None]:
# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]
    
# Extraction
extraction_chain = create_extraction_chain_pydantic(pydantic_schema=Sentences, llm=llm)
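
One way to tie these pieces together is a small driver that runs a decomposition step over every chunk and flattens the results. The sketch below assumes a hypothetical `decompose` callable that maps one chunk of text to a list of propositions. In this notebook that callable would presumably wrap `runnable` and `extraction_chain`, and the chunks would come from `[d.page_content for d in docs]`. A stand-in function is used here so that the example runs offline.

```python
from typing import Callable, Iterable, List


def propositions_for_chunks(
    chunks: Iterable[str],
    decompose: Callable[[str], List[str]],
) -> List[str]:
    """Run `decompose` on every chunk and flatten the results into one list."""
    propositions: List[str] = []
    for chunk in chunks:
        if not chunk.strip():
            continue  # skip empty chunks rather than spend a model call on them
        propositions.extend(decompose(chunk))
    return propositions


# Stand-in for the real LLM call (assumption): one "proposition" per sentence.
def fake_decompose(text: str) -> List[str]:
    return [s.strip() + "." for s in text.split(".") if s.strip()]


all_props = propositions_for_chunks(["A is B. C is D.", "  ", "E is F."], fake_decompose)
```

Passing the decomposition step in as a parameter keeps the loop testable without an API key.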

In [None]:
#| hide
#import nbdev; nbdev.nbdev_export()