<a href="https://colab.research.google.com/github/gh-annamalai/rag-chatbot/blob/main/Langchain_With_Gemini_And_Build_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
! pip install -q --upgrade google-generativeai langchain-google-genai chromadb pypdf


In [6]:
from IPython.display import display
from IPython.display import Markdown
import textwrap


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))
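
The `predicate=lambda _: True` argument in `to_markdown` matters because `textwrap.indent` leaves whitespace-only lines unprefixed by default, which would end a Markdown blockquote at every blank line. A minimal standard-library sketch of the difference (the sample string is made up for illustration):

```python
import textwrap

sample = "First point\n\nSecond point"  # hypothetical model output

# Default behaviour: the blank line gets no prefix.
default_quote = textwrap.indent(sample, "> ")

# With the always-true predicate, every line gets the prefix,
# so the blockquote stays unbroken across blank lines.
full_quote = textwrap.indent(sample, "> ", predicate=lambda _: True)

print(repr(default_quote))
print(repr(full_quote))
```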

In [7]:
import google.generativeai as genai
from google.colab import userdata

In [8]:
import os
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

### Text Generation

In [9]:
model = genai.GenerativeModel(model_name = "gemini-pro")
model

genai.GenerativeModel(
    model_name='models/gemini-pro',
    generation_config={},
    safety_settings={},
    tools=None,
    system_instruction=None,
    cached_content=None
)

In [10]:
response = model.generate_content("What are the usecases of LLMs?")

In [11]:
to_markdown(response.text)

> **Content Creation and Improvement**
> 
> * **Text generation:** Creating compelling and informative articles, stories, marketing copy, and more.
> * **Content summarization:** Condensing long-form content into concise overviews.
> * **Translation:** Translating text between languages.
> * **Code generation:** Automating coding tasks and generating code from natural language descriptions.
> * **Music and art generation:** Composing original music and generating unique artwork.
> 
> **Search and Information Retrieval**
> 
> * **Question answering:** Providing answers to natural language queries from large text databases.
> * **Document search:** Searching for relevant documents based on keywords or semantic similarities.
> * **Information extraction:** Extracting specific facts and insights from unstructured text.
> * **Knowledge base creation:** Automating the population of knowledge bases with relevant data.
> 
> **Automation and Efficiency**
> 
> * **Customer service automation:** Resolving customer queries and providing support through automated chatbots.
> * **Data analysis:** Automating data processing, analysis, and report generation.
> * **Virtual assistants:** Assisting users with tasks such as scheduling appointments, sending emails, and providing information.
> * **Process optimization:** Identifying and automating repetitive or time-consuming tasks in business processes.
> 
> **Education and Research**
> 
> * **Personalized learning:** Creating interactive educational materials tailored to individual learners.
> * **Academic writing assistance:** Enhancing writing skills and automating tasks like proofreading and grammar checking.
> * **Research support:** Literature search, data analysis, and hypothesis generation.
> * **Collaboration and idea exchange:** Facilitating discussions, brainstorming sessions, and knowledge sharing.
> 
> **Other Usecases**
> 
> * **Gaming and entertainment:** Creating immersive game experiences, developing dialogue for characters, and generating storylines.
> * **Healthcare:** Supporting diagnosis, treatment planning, and patient communication.
> * **E-commerce:** Improving product descriptions, personalizing recommendations, and automating customer engagement.
> * **Finance:** Analyzing market data, generating financial reports, and automating trading strategies.
> * **Cybersecurity:** Detecting and mitigating cyber threats, identifying vulnerabilities, and assisting in incident response.

### Use LangChain to Access Gemini API

In [12]:
from langchain_google_genai import ChatGoogleGenerativeAI


In [13]:
llm = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key=GOOGLE_API_KEY)

In [14]:
result = llm.invoke("What are the usecases of LLMs?")


In [15]:
to_markdown(result.content)

> **Content Creation**
> 
> * **Text generation:** Generating articles, stories, marketing copy, social media posts
> * **Code generation:** Assisting in coding tasks, writing bug-free code
> * **Translation:** Translating text between different languages
> * **Chatbot responses:** Automating customer interactions with personalized responses
> 
> **Research and Analysis**
> 
> * **Summarization:** Creating concise summaries of documents, articles, or research papers
> * **Fact checking:** Verifying the accuracy of information
> * **Data analysis:** Extracting insights from large datasets
> * **Market research:** Analyzing customer feedback and identifying industry trends
> 
> **Education and Learning**
> 
> * **Personalized learning:** Adapting educational content to individual students' needs
> * **Virtual assistants:** Providing students with real-time support and guidance
> * **Language learning:** Assisting in vocabulary development and grammar practice
> * **Historical research:** Analyzing historical texts and uncovering new insights
> 
> **Entertainment**
> 
> * **Personalized recommendations:** Suggesting movies, music, or books based on user preferences
> * **Game development:** Creating interactive and immersive game experiences
> * **Virtual assistants:** Engaging with users in virtual worlds or games
> * **Storytelling:** Generating interactive or branching storylines for immersive entertainment
> 
> **Business and Productivity**
> 
> * **Email composition:** Writing emails with improved grammar and style
> * **Meeting summarization:** Generating concise summaries of meetings or calls
> * **Document analysis:** Extracting key information from contracts, legal documents, or reports
> * **Customer service:** Automating customer interactions and resolving inquiries
> 
> **Healthcare**
> 
> * **Medical diagnosis:** Assisting doctors in diagnosing diseases based on symptoms
> * **Treatment planning:** Providing personalized treatment recommendations
> * **Drug discovery:** Identifying potential drug candidates for further research
> * **Patient monitoring:** Tracking health data and providing early warnings of potential issues
> 
> **Other**
> 
> * **Social impact:** Addressing problems such as fake news, hate speech, and misinformation
> * **Environmental monitoring:** Analyzing data to track pollution levels or predict weather patterns
> * **Financial analysis:** Forecasting stock prices or identifying investment opportunities
> * **Scientific research:** Automating data analysis and hypothesis testing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Chat with Documents using RAG (Retrieval-Augmented Generation)

In [None]:
import PIL.Image

img = PIL.Image.open('/content/rag.png')
img

In [None]:
!sudo apt -y -qq install tesseract-ocr libtesseract-dev

!sudo apt-get -y -qq install poppler-utils libxml2-dev libxslt1-dev antiword unrtf pstotext flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig

!pip install langchain

In [None]:
import urllib
import warnings
from pathlib import Path as p
from pprint import pprint

import pandas as pd
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA



warnings.filterwarnings("ignore")
# Restart the Python kernel if the langchain imports fail.

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI


In [None]:
model = ChatGoogleGenerativeAI(
    model="gemini-pro",
    google_api_key=GOOGLE_API_KEY,
    temperature=0.2,
    convert_system_message_to_human=True,
)


### Extract text from the PDF

In [None]:
pdf_loader = PyPDFLoader("/content/attention_is_all_you_need.pdf")
pages = pdf_loader.load_and_split()
print(pages[3].page_content)


In [None]:
len(pages)

### RAG Pipeline: Embedding + Gemini (LLM)

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
# Use `page` as the loop variable so it does not shadow the `Path as p` import above.
context = "\n\n".join(str(page.page_content) for page in pages)
texts = text_splitter.split_text(context)
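
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of overlapping chunks using a plain fixed-width sliding window. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on a hierarchy of separators first); the function name `sliding_window_chunks` and the toy text are assumptions made for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.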

In [None]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY)

In [None]:
vector_index = Chroma.from_texts(texts, embeddings).as_retriever(search_kwargs={"k":5})
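
The retriever above returns the `k` stored chunks whose embeddings are closest to the query embedding. The sketch below shows that idea in pure Python with cosine similarity on toy 2-D vectors. The metric is an assumption for illustration (the distance Chroma actually uses depends on how the collection is configured), and `cosine_similarity`, `top_k`, and the vectors are hypothetical names and values, not part of Chroma or LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [idx for idx, _ in scored[:k]]


# Toy stand-ins for chunk embeddings and a query embedding.
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.1]
print(top_k(toy_query, toy_doc_vecs, k=2))
# [0, 2]
```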


In [None]:
qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vector_index,
    return_source_documents=True

)

In [None]:
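question = "Describe the multi-head attention layer in detail."
# Use invoke(); calling the chain object directly is deprecated in recent LangChain releases.
result = qa_chain.invoke({"query": question})
result["result"]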

In [None]:
Markdown(result["result"])

In [None]:
result["source_documents"]

In [None]:
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Build the chain with the custom prompt
qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vector_index,
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
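
Conceptually, the chain fills `{context}` with the retrieved chunks and `{question}` with the user's query before sending the prompt to the model. A minimal sketch of that substitution with plain `str.format` follows. It assumes the chunks are joined with a blank line between them, and the shortened template, chunk texts, and question are made up for illustration.

```python
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

# Hypothetical retrieved chunks standing in for real retriever output.
toy_retrieved = ["Chunk A text.", "Chunk B text."]
toy_question = "What is in chunk A?"

# Assumption: retrieved chunks are concatenated with a blank line between them.
filled_prompt = toy_template.format(
    context="\n\n".join(toy_retrieved),
    question=toy_question,
)
print(filled_prompt)
```

Printing the filled prompt like this is a handy way to check what the model actually sees when an answer looks off.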


In [None]:
question = "Describe the multi-head attention layer in detail."
result = qa_chain.invoke({"query": question})
result["result"]

In [None]:
Markdown(result["result"])

In [None]:
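# A question the paper does not cover, to check that the prompt makes the model admit it does not know.
question = "Describe random forest."
result = qa_chain.invoke({"query": question})
Markdown(result["result"])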