# Chat With Your Data

## Persist Data to Vector Stores

## Install libraries

In [None]:
pip install openai

In [None]:
pip install python-dotenv

In [None]:
pip install langchain

In [None]:
pip install langchain-openai

In [None]:
pip install pypdf

In [None]:
pip install faiss-cpu

In [None]:
pip install langchainhub

In [None]:
pip install langchain-community

## Load OpenAI API Key to use OpenAI's embedding model

In [2]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [4]:
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
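
If the key is missing, the lookup above fails with a bare `KeyError`. A minimal sketch of a more descriptive check follows; `require_env` is a hypothetical helper written for this illustration, not part of python-dotenv or LangChain, and the demo uses a throwaway variable so no real key is needed.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise a descriptive error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Demonstration with a placeholder variable instead of a real API key.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook this could stand in for the direct `os.environ[...]` lookup, for example `require_env("OPENAI_API_KEY")`.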

## Load documents

In [3]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('big-book-of-data-engineering.pdf')
pages = loader.load()

## Chunk documents

In [4]:
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Split the loaded pages into chunks of at most 1,000 characters, with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(pages)
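
`CharacterTextSplitter` splits on a single separator (a blank line by default) and then merges the pieces back together up to `chunk_size`. The sketch below is a simplified plain-Python illustration of that idea, not LangChain's implementation; `split_on_separator` is a hypothetical name, and overlap handling is left out. It also shows why a page with no blank lines can come back as one chunk longer than `chunk_size`.

```python
def split_on_separator(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Split on a separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece fits, or it is a lone oversized piece that cannot be split further.
            current = candidate
        else:
            # Adding the piece would overflow: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_on_separator("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
# → ['aaaa\n\nbbbb', 'cccc']

# Text without any separator stays in one piece, even past chunk_size.
print([len(c) for c in split_on_separator("x" * 25, chunk_size=10)])
# → [25]
```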

## Generate embeddings and store in vector database
### FAISS vector database

In [5]:
from langchain_community.vectorstores import FAISS
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small")
# Embed the chunks and load them into the vector store
vectordb = FAISS.from_documents(documents, embeddings)

In [6]:
print(vectordb.index.ntotal)

2


In [8]:
documents[1]

Document(metadata={'source': 'big-book-of-data-engineering.pdf', 'page': 1}, page_content='Challenges of data engineering in the AI era\nAs previously mentioned, data engineering is key to ensuring reliable data for \nAI initiatives. Data engineers who build and maintain ETL pipelines and the \ndata infrastructure that underpins analytics and AI workloads face specific \nchallenges in this fast-moving landscape. \n ■ Handling real-time data: From mobile applications to sensor data on \nfactory floors, more and more data is created and streamed in real \ntime and requires low-latency processing so it can be used in real-time \ndecision-making.\n ■ Scaling data pipelines reliably:  With data coming in large quantities \nand often in real time, scaling the compute infrastructure that runs \ndata pipelines is challenging, especially when trying to keep costs low \nand performance high. Running data pipelines reliably, monitoring data \npipelines and troubleshooting when failures occur are 

## Persist Data in your Vector Store

In [9]:
vectordb.save_local("faiss2b_index")

## Load Vector Store

In [10]:
new_db = FAISS.load_local("faiss2b_index", embeddings, allow_dangerous_deserialization=True)
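
With the default index name, `save_local` writes two files into the target folder: `index.faiss` (the FAISS index) and `index.pkl` (the pickled docstore and ID mapping). The pickle file is why `load_local` requires `allow_dangerous_deserialization=True`, so only load indexes you created or trust. A minimal sketch for checking that both files exist before loading follows; `index_files_present` is a hypothetical helper, and the demo creates empty placeholder files in a temporary directory rather than a real index.

```python
import tempfile
from pathlib import Path


def index_files_present(folder: str, index_name: str = "index") -> bool:
    """Return True if both files of a saved LangChain FAISS index exist in folder."""
    folder_path = Path(folder)
    return all(
        (folder_path / f"{index_name}{ext}").is_file() for ext in (".faiss", ".pkl")
    )


# Demonstration with empty placeholder files (not a real index).
with tempfile.TemporaryDirectory() as tmp:
    before = index_files_present(tmp)
    for ext in (".faiss", ".pkl"):
        (Path(tmp) / f"index{ext}").touch()
    after = index_files_present(tmp)

print(before, after)
# → False True
```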

## Prompt a model with no knowledge of big-book-of-data-engineering

In [5]:
from langchain_openai import ChatOpenAI

# Initialize the LLM we'll use: OpenAI GPT-3.5 Turbo
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo-0125")

In [7]:
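import textwrap

def print_output(docs, type: int = 1):
    """Pretty-print retrieved documents (type=1) or a chat model response (type=2)."""
    match type:
        case 1:
            # docs is an iterable of Document objects
            for doc in docs:
                print('The metadata is: {}'.format(doc.metadata))
                for t in textwrap.wrap(doc.page_content, width=100):
                    print(t)
        case 2:
            # docs is a single chat message; wrap its text content
            # print('The metadata is: {}'.format(docs.response_metadata))
            for t in textwrap.wrap(docs.content, width=100):
                print(t)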

In [8]:
# Prompt the model without any additional knowledge from the Big Book of Data Engineering
response = llm.invoke("What is data engineering?")
print_output(response, 2)

Data engineering is a field within data science that involves the design, creation, and maintenance
of systems for collecting, storing, and analyzing data. Data engineers are responsible for
developing the architecture and infrastructure necessary to support large-scale data processing and
analytics tasks. This includes building data pipelines, data warehouses, and databases, as well as
implementing data processing algorithms and workflows. Data engineering is crucial for organizations
looking to make data-driven decisions and derive valuable insights from their data.


## Load database from disk

In [9]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


db = FAISS.load_local("../02_02_assignment/faiss2b_index", 
                      OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small"), 
                      allow_dangerous_deserialization=True)

## Configure retriever
### Use the similarity search capabilities of a vector store to facilitate retrieval

In [10]:
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 6})
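
Conceptually, a similarity retriever embeds the query and returns the `k` stored vectors closest to it. The sketch below shows that idea as a brute-force search in plain Python, assuming Euclidean (L2) distance and tiny made-up 2-D vectors; `top_k` and `toy_vectors` are hypothetical names, and FAISS performs the same kind of search far more efficiently.

```python
import math


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors closest to the query by Euclidean distance."""
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()  # smallest distance first; ties broken by index
    return [i for _, i in distances[:k]]


# Made-up 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
print(top_k([1.0, 1.0], toy_vectors, k=2))
# → [1, 2]
```

With `k=6`, as configured above, the retriever passes the six closest chunks to the next step.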

## Implement a chain
### Chain together multiple calls in a logical sequence

In [13]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")



In [17]:
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

In [18]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Combine multiple steps in a single chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()  # convert the chat message to a string
)
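
To make the data flow of the chain easier to follow, here is a rough plain-Python sketch of the same sequence: retrieve, format, fill the prompt, call the model. Every function is a stand-in (`fake_retriever`, `join_chunks`, `build_prompt`, `fake_llm`, and `run_chain` are hypothetical names), no API is called, and the template is a shortened version of the hub prompt shown earlier.

```python
TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns fixed chunks regardless of the question."""
    return ["chunk one", "chunk two"]


def join_chunks(chunks: list[str]) -> str:
    """Same role as format_docs: join chunk texts with blank lines."""
    return "\n\n".join(chunks)


def build_prompt(question: str) -> str:
    """Fill the template; the question passes through unchanged."""
    inputs = {"context": join_chunks(fake_retriever(question)), "question": question}
    return TEMPLATE.format(**inputs)


def fake_llm(prompt_text: str) -> str:
    """Stand-in for the chat model: reports the prompt size instead of generating text."""
    return f"(model saw {len(prompt_text)} characters)"


def run_chain(question: str) -> str:
    """Run the steps in order, as the piped chain does."""
    return fake_llm(build_prompt(question))


print(build_prompt("What is data engineering?"))
print(run_chain("What is data engineering?"))
```

The dictionary step mirrors the first element of `rag_chain`: the question feeds the retriever to produce `context` and is also forwarded as `question`.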

## Send LLM's response to the user

In [19]:
for chunk in rag_chain.stream("What is data engineering?"):
    print(chunk, end="", flush=True)

Data engineering is the practice of processing raw data from a source for downstream use in analytics, business intelligence, or machine learning. It involves three main parts: data ingestion, transformation, and orchestration. Challenges in data engineering include handling real-time data, scaling data pipelines, ensuring data quality, and addressing governance and security concerns.