# Chat With Your Data

## Persist Data to Vector Stores

## Install libraries

In [None]:
pip install openai

In [None]:
pip install python-dotenv

In [None]:
pip install langchain

In [None]:
pip install langchain-openai

In [None]:
pip install pypdf

In [None]:
pip install faiss-cpu

In [None]:
pip install langchainhub

In [None]:
pip install langchain-community

## Load OpenAI API Key to use OpenAI's embedding model

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [2]:
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
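
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a fail-fast lookup with a clearer message is shown below. `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration only; they are not part of `python-dotenv` or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a readable error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Demo with a hypothetical variable name so no real key is touched.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same helper could be called with `'OPENAI_API_KEY'` after `load_dotenv` has run.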

## Load documents

In [3]:
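# PyPDFLoader lives in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import PyPDFLoader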

loader = PyPDFLoader('big-book-of-data-engineering.pdf')
pages = loader.load()

## Chunk documents

In [4]:
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Split the loaded pages into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(pages)
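
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window splitting. It is not how `CharacterTextSplitter` works internally: that class splits on a separator (by default a blank line) and merges the pieces up to `chunk_size`, so text without the separator can come back as a single chunk longer than `chunk_size`. `split_fixed` is a hypothetical helper.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.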

## Generate embeddings and store in vector database
### FAISS vector database

In [5]:
from langchain_community.vectorstores import FAISS
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small")
# Embed the chunks and load them into the vector store
vectordb = FAISS.from_documents(documents, embeddings)

In [6]:
print(vectordb.index.ntotal)

2


In [8]:
documents[1]

Document(metadata={'source': 'big-book-of-data-engineering.pdf', 'page': 1}, page_content='Challenges of data engineering in the AI era\nAs previously mentioned, data engineering is key to ensuring reliable data for \nAI initiatives. Data engineers who build and maintain ETL pipelines and the \ndata infrastructure that underpins analytics and AI workloads face specific \nchallenges in this fast-moving landscape. \n ■ Handling real-time data: From mobile applications to sensor data on \nfactory floors, more and more data is created and streamed in real \ntime and requires low-latency processing so it can be used in real-time \ndecision-making.\n ■ Scaling data pipelines reliably:  With data coming in large quantities \nand often in real time, scaling the compute infrastructure that runs \ndata pipelines is challenging, especially when trying to keep costs low \nand performance high. Running data pipelines reliably, monitoring data \npipelines and troubleshooting when failures occur are 

## Persist Data in your Vector Store

In [9]:
vectordb.save_local("faiss2b_index")

## Load Vector Store

In [10]:
new_db = FAISS.load_local("faiss2b_index", embeddings, allow_dangerous_deserialization=True)
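
Part of the saved index is stored with `pickle`, which is why `load_local` requires `allow_dangerous_deserialization=True`; only enable it for index folders you created yourself. The sketch below checks that a folder looks complete before loading it. The file names `index.faiss` and `index.pkl` are an assumption based on the default `index_name`, and `missing_index_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def missing_index_files(folder: str, index_name: str = "index") -> list[str]:
    """List the expected index files that are absent from the folder."""
    # Assumed layout: one binary FAISS file plus one pickle with the docstore.
    expected = [f"{index_name}.faiss", f"{index_name}.pkl"]
    return [name for name in expected if not (Path(folder) / name).is_file()]


# Demo on a temporary folder that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.faiss").touch()
    print(missing_index_files(tmp))  # ['index.pkl']
```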

## Prompt a model with no knowledge of big-book-of-data-engineering

In [6]:
from langchain_openai import ChatOpenAI

# Initialize the LLM we'll use: OpenAI GPT-3.5 Turbo
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo-0125")

In [7]:
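def print_output(docs, kind: int = 1):
    """Print retrieved documents (kind=1) or a single chat model response (kind=2), wrapped at 100 columns."""
    import textwrap
    match kind:
        case 1:
            # docs is a list of Document objects
            for doc in docs:
                print('The metadata is: {}'.format(doc.metadata))
                for t in textwrap.wrap(doc.page_content, width=100):
                    print(t)
        case 2:
            # docs is a single chat message with a .content attribute
            # print('The metadata is: {}'.format(docs.response_metadata))
            for t in textwrap.wrap(docs.content, width=100):
                print(t)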

In [8]:
# Prompt the model with no additional knowledge of the Big Book of Data Engineering
docs = llm.invoke("What is data engineering?")
print_output(docs, 2)

Data engineering is a field within data science that involves the design, creation, and maintenance
of systems for collecting, storing, and analyzing data. Data engineers are responsible for
developing the architecture and infrastructure necessary to support large-scale data processing and
analytics tasks. This includes building data pipelines, data warehouses, and databases, as well as
implementing data processing algorithms and workflows. Data engineering is crucial for organizations
looking to make data-driven decisions and derive valuable insights from their data.


## Load database from disk

In [3]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


db = FAISS.load_local("../02_02_assignment/faiss2b_index", 
                      OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small"), 
                      allow_dangerous_deserialization=True)

## Configure retriever
### Use the similarity search capabilities of a vector store to facilitate retrieval

In [4]:
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 6})
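
Conceptually, a similarity retriever embeds the query and returns the `k` stored vectors nearest to it. The toy sketch below uses Euclidean distance over hand-written two-dimensional vectors; the metric and index type actually used by the store depend on its configuration, and `top_k` is a hypothetical helper.

```python
import math


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors closest to the query (Euclidean distance)."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [5.0, 5.0]]

print(top_k([1.0, 0.0], toy_vectors, k=2))  # [1, 2]
# Asking for more neighbors than there are vectors returns all of them.
print(top_k([1.0, 0.0], toy_vectors, k=6))  # [1, 2, 0, 3]
```

The second call shows that `k` is an upper bound: a store holding fewer than `k` vectors cannot return more results than it contains.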

## Preserve conversation history

In [8]:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

system_prompt = """Given the chat history and a recent user question \
generate a new standalone question \
that can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed or otherwise return it as is."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

retriever_with_history = create_history_aware_retriever(
    llm, retriever, prompt
)
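
The control flow behind a history-aware retriever can be sketched in plain Python: with an empty history the question goes straight to the retriever, otherwise it is first rewritten into a standalone question. The stand-in functions below are hypothetical and replace the LLM and the vector store so the sketch runs offline.

```python
from typing import Callable


def history_aware_retrieve(
    question: str,
    chat_history: list[str],
    reformulate: Callable[[str, list[str]], str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve with the raw question, or with a rewritten one when history exists."""
    if not chat_history:
        return retrieve(question)
    standalone = reformulate(question, chat_history)
    return retrieve(standalone)


# Stand-ins for the LLM rewrite step and the vector store lookup.
def fake_reformulate(question: str, chat_history: list[str]) -> str:
    return question.replace("its", "data engineering's")


def fake_retrieve(query: str) -> list[str]:
    return [f"doc matching: {query}"]


print(history_aware_retrieve("What is data engineering?", [], fake_reformulate, fake_retrieve))
history = ["What is data engineering?", "It is the practice of preparing data."]
print(history_aware_retrieve("What are its best practices?", history, fake_reformulate, fake_retrieve))
```

The rewrite step matters because a follow-up such as "What are its best practices?" carries no searchable subject on its own.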

## Perform question answering with chat history

In [9]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.

{context}"""

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(retriever_with_history, question_answer_chain)
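
"Stuffing" means concatenating every retrieved chunk into the `{context}` slot of the prompt. A minimal sketch of that step follows; the blank-line separator is an assumption, and `stuff_context` is a hypothetical helper rather than the library's implementation.

```python
def stuff_context(system_template: str, chunks: list[str], separator: str = "\n\n") -> str:
    """Join the retrieved chunks and substitute them into the {context} placeholder."""
    return system_template.format(context=separator.join(chunks))


template = "Answer from the context below.\n\n{context}"
print(stuff_context(template, ["Chunk A.", "Chunk B."]))
```

Because all chunks land in a single prompt, the retriever's `k` and the chunk size together determine how much of the model's context window is consumed.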

In [10]:
from langchain_core.messages import AIMessage, HumanMessage

chat_history = []

question = "What is data engineering?"

ai_msg_1 = rag_chain.invoke({"input": question, "chat_history": chat_history})

# Store the answer as an AIMessage so it is not mistaken for a user turn
chat_history.extend([HumanMessage(content=question), AIMessage(content=ai_msg_1["answer"])])

print(ai_msg_1["answer"])

Data engineering is the practice of processing raw data from a source to store and organize it for various uses such as data analytics, business intelligence, or machine learning model training. It involves preparing data to extract value from it effectively. Data engineering consists of three main parts: ingest, transform, and orchestrate, which are essential for success in data and AI initiatives.


In [11]:
second_question = "What are its best practices?"

ai_msg_2 = rag_chain.invoke({"input": second_question, "chat_history": chat_history})
print(ai_msg_2["answer"])

The best practices in data engineering include building reliable data pipelines that can efficiently ingest and stream large amounts of data while ensuring high data quality. Data transformation is crucial for filtering, standardizing, cleaning, and aggregating data for usable storage. Orchestration involves scheduling, monitoring, and controlling the data pipeline to handle failures effectively.


In [12]:
third_question = "What are its main challenges?"
ai_msg_3 = rag_chain.invoke({"input": third_question, "chat_history": chat_history})
print(ai_msg_3["answer"])

The main challenges of data engineering in the AI era include handling real-time data for low-latency processing, scaling data pipelines reliably while keeping costs low, ensuring high data quality for training quality models, and addressing data governance and security concerns. Data engineers face responsibilities such as running data pipelines reliably, monitoring them, troubleshooting failures, and securing and governing data across multiple systems. Choosing the right data platform is crucial to navigate these challenges successfully in the age of AI.


In [13]:
fourth_question = "Do you think its data governance is important?"
ai_msg_4 = rag_chain.invoke({"input": fourth_question, "chat_history": chat_history})
print(ai_msg_4["answer"])

Yes, data governance is crucial as organizations face challenges with data spread across multiple systems and increasing numbers of internal teams accessing it. Securing and governing data is also a regulatory concern, especially in highly regulated industries, emphasizing the importance of data governance to ensure data quality, security, and compliance.


In [14]:
fifth_question = "How do you implement data governance?"
ai_msg_5 = rag_chain.invoke({"input": fifth_question, "chat_history": chat_history})
print(ai_msg_5["answer"])

Data governance can be implemented by defining data policies, standards, and procedures, establishing roles and responsibilities for data management, ensuring data quality and security, and monitoring compliance with regulations. Organizations can use data governance tools and technologies to enforce policies, manage metadata, and track data lineage. It is crucial to involve stakeholders across different teams in the organization to ensure successful data governance implementation.
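
In the cells above, `chat_history` is extended only after the first question, so later follow-ups see just that first exchange. A small sketch of a helper that records every turn is shown below. `ask` and `fake_invoke` are hypothetical: `fake_invoke` stands in for `rag_chain.invoke` so the example runs offline, and the `(role, content)` tuples stand in for the `HumanMessage` and `AIMessage` objects a real chain would typically receive.

```python
from typing import Callable


def ask(chain_invoke: Callable[[dict], dict], question: str, chat_history: list) -> str:
    """Send one question with the history so far, then append both turns to the history."""
    result = chain_invoke({"input": question, "chat_history": list(chat_history)})
    answer = result["answer"]
    chat_history.append(("human", question))
    chat_history.append(("ai", answer))
    return answer


# Stand-in for rag_chain.invoke: numbers its answer by how many turns it was given.
def fake_invoke(payload: dict) -> dict:
    turn = len(payload["chat_history"]) // 2 + 1
    return {"answer": f"answer #{turn} to: {payload['input']}"}


history: list = []
for q in ["What is data engineering?", "What are its best practices?"]:
    print(ask(fake_invoke, q, history))

print(len(history))  # 4
```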
