#### Getting Started With LangChain And OpenAI

- Get set up with LangChain, LangSmith, and LangServe
- Use the most basic and common components of LangChain: prompt templates, models, and output parsers.
- Build a simple application with LangChain
- Trace your application with LangSmith
- Serve your application with LangServe

### RAG - Retrieval Augmented Generation
1. Data Sources: PDF, JSON, URLs, Images => Data Ingestion
2. Data Transformation: Splitting large documents into text chunks
3. Embedding: Converting text chunks to vectors
4. Storage: Storing the vectors in a vector store database
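
The four steps above can be sketched end to end in plain Python. This is only a toy illustration: the word-count "embedding", the dot-product score, the in-memory list used as a store, and the names `embed`, `dot`, `store`, and `search` are all assumptions made for this sketch, not LangChain APIs.

```python
from collections import Counter

def embed(text):
    """Toy stand-in for step 3: a word-count vector instead of a learned embedding."""
    return Counter(text.lower().replace(".", "").split())

def dot(a, b):
    """Score two word-count vectors by their dot product (higher means more shared words)."""
    return sum(count * b[word] for word, count in a.items())

# Step 1: the "data source" is an in-memory string rather than a PDF or URL
source = "Cats purr when happy. Dogs bark at strangers. Fish swim in water."

# Step 2: split the text into chunks (here, one chunk per sentence)
chunks = [s.strip() for s in source.split(".") if s.strip()]

# Steps 3 and 4: embed each chunk and keep (chunk, vector) pairs in a list acting as the store
store = [(chunk, embed(chunk)) for chunk in chunks]

def search(query, k=1):
    """Return the k chunks whose vectors score highest against the query vector."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(search("why do dogs bark"))
```

A real pipeline replaces each stand-in with a proper component: a document loader, a text splitter, an embedding model, and a vector database.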


### Vector Database
1. FAISS
2. ChromaDB
3. AstraDB

## Retrieval Chain
A retrieval chain is an interface responsible for querying the vector store database.
## Data Ingestion With Documents Loaders
- Loading a data set from a specific source.
- https://python.langchain.com/v0.2/docs/integrations/document_loaders/
### Document loaders
- DocumentLoaders load data into the standard LangChain Document format.
- Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method.
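
To make the "standard Document format" concrete, here is a minimal sketch of the idea in plain Python. `SimpleDocument` and `InMemoryLoader` are hypothetical stand-ins written for this example. They mimic the shape of a document (text plus metadata) and the shared `.load()` convention, but they are not LangChain classes.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: the text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class InMemoryLoader:
    """Hypothetical loader: source-specific arguments go to the constructor,
    and load() always returns a list of documents."""
    def __init__(self, name, text):
        self.name = name
        self.text = text

    def load(self):
        return [SimpleDocument(page_content=self.text, metadata={"source": self.name})]

loaded = InMemoryLoader("speech.txt", "Hello world").load()
print(loaded[0].page_content, loaded[0].metadata)
```

Because every loader returns the same kind of object, downstream steps such as splitting and embedding do not need to know where the text came from.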

In [None]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt')
text = loader.load()

In [None]:
text

In [None]:
## Reading from the PDF File

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('attension.pdf')
doc = loader.load()
doc

### Text Splitting from Documents (Huge Text)


#### How to recursively split text by characters
This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the most strongly semantically related pieces of text.
- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
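
The "try each separator in order" idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function name `recursive_split` is made up for this example, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator, then recurse with finer separators on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

sample = "aaa bbb\n\nccc ddd eee\n\nfff"
print(recursive_split(sample, chunk_size=8))
```

The first paragraph fits as a whole, while the second is too long and is only then broken at spaces, which is the "keep paragraphs together as long as possible" behavior described above.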

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 50)
final_document = text_spliter.split_documents(doc)
final_document

In [None]:
speech = ""
with open("speech.txt") as f:
    speech = f.read()
# open() + read() returns a plain Python string
print("Type of the speech when read with open() =>", type(speech))

from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt')
text = loader.load()
# TextLoader.load() returns a list of Document objects
print("Type of the speech when loaded with TextLoader() =>", type(text))


In [None]:
new_text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
# create_documents() takes a list of raw strings and returns a list of Document chunks
new_text = new_text_splitter.create_documents([speech])
print("Type of the result of create_documents() =>", type(new_text))
new_text[1]

#### How to split by character - Character Text Splitter
This is the simplest method. It splits based on a given character sequence, which defaults to "\n\n". Chunk length is measured by number of characters.

1. How the text is split: by a single character separator.
2. How the chunk size is measured: by number of characters.
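
A minimal sketch of single-separator splitting in plain Python follows. The function name `split_by_separator` is an assumption for this example, and overlap is omitted. One consequence of having only a single separator is visible here: a piece longer than `chunk_size` cannot be broken down further, so it is kept whole.

```python
def split_by_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator and merge neighbouring pieces while they fit in chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merge fits, or this is a lone piece (kept whole even if oversized)
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "one\n\ntwo\n\nthis paragraph is long"
print(split_by_separator(sample, chunk_size=10))
```

If a hard size limit matters, a splitter with fallback separators is usually the safer choice.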

In [None]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt')
docs = loader.load()
docs

In [None]:
from langchain_text_splitters import CharacterTextSplitter
text_spliter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
text_spliter.split_documents(docs)

### How to split by HTML Header
HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structure. It can be used with other text splitters as part of a chunking pipeline.
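
To illustrate what "adds metadata for each header relevant to a chunk" means, here is a rough sketch built on the standard library's `html.parser`. The class `HeaderChunker` is hypothetical and much simpler than the real splitter: it assumes a flat HTML fragment and only tracks which headers are currently in scope.

```python
from html.parser import HTMLParser

class HeaderChunker(HTMLParser):
    """Collect text chunks and tag each one with the headers currently in scope."""
    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}       # header tag -> text of the most recent header
        self.in_header = None  # tag of the header being parsed, if any
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header drops same-level and deeper headers ("h1" < "h2" < "h3")
            self.active = {t: v for t, v in self.active.items() if t < self.in_header}
            self.active[self.in_header] = text
        else:
            metadata = {self.labels[t]: v for t, v in self.active.items()}
            self.chunks.append({"content": text, "metadata": metadata})

chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed("<h1>Intro</h1><p>First.</p><h2>Details</h2><p>Second.</p><h1>Next</h1><p>Third.</p>")
for chunk in chunker.chunks:
    print(chunk)
```

Each paragraph carries the headers above it, so a later retrieval step can tell which section a chunk came from.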

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on  = [
    ("h1","Header 1"),
    ("h2","Header 2"),
    ("h3","Header 3")
]
html_string = '''<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>'''
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter
url = "https://plato.stanford.edu/entries/goedel/"
headers_to_split_on  = [
    ("h1","Header 1"),
    ("h2","Header 2"),
    ("h3","Header 3"),
    ("h4","Header 4")
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

#### How to split JSON Data
This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between a min_chunk_size and the max_chunk_size.

If a value is not nested JSON but rather a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this with a recursive text splitter on those chunks. There is an optional pre-processing step to split lists, by first converting them to JSON (dict) and then splitting them as such.

 - How the text is split: by JSON value.
 - How the chunk size is measured: by number of characters.
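
The depth-first traversal can be sketched as follows. This is a simplified approximation written for illustration: `split_json` and `set_nested` are hypothetical helpers, there is no minimum chunk size, and sizes are estimated with `json.dumps`. It does reproduce two behaviors described above: nested objects stay whole when they fit, and a large string value is never split.

```python
import json

def set_nested(target, path, value):
    """Store value in target under the given key path, creating intermediate dicts."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value

def split_json(data, max_chunk_size, path=(), chunks=None):
    """Walk the dict depth first and pack values into chunks of roughly max_chunk_size characters."""
    if chunks is None:
        chunks = [{}]
    for key, value in data.items():
        new_path = path + (key,)
        if isinstance(value, dict) and len(json.dumps(value)) > max_chunk_size:
            # Nested object too large to keep whole: descend into it
            split_json(value, max_chunk_size, new_path, chunks)
            continue
        size_now = len(json.dumps(chunks[-1]))
        size_add = len(json.dumps({key: value}))
        if chunks[-1] and size_now + size_add > max_chunk_size:
            chunks.append({})  # current chunk is full, start a new one
        set_nested(chunks[-1], new_path, value)
    return chunks

data = {"a": {"x": "1111", "y": "2222"}, "b": "3333"}
print(split_json(data, max_chunk_size=20))
print(split_json(data, max_chunk_size=100))
```

With the small limit, the nested object under "a" is broken apart but each chunk keeps its full key path. With the large limit, everything fits in one chunk.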

In [None]:
import json
import requests

data = requests.get("https://jsonplaceholder.typicode.com/todos").json()
type(data)
json_data = {}
for index in range(0,len(data)):
    json_data[index]=data[index]
json_data

In [None]:
from langchain_text_splitters import RecursiveJsonSplitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data)


In [None]:
for chunk in json_chunks[:3]:
    print(chunk)

In [None]:
## The splitter can also output documents
docs = json_splitter.create_documents(texts = [json_data])
for doc in docs[:3]:
    print(doc)

## Embedding Techniques
##### Converting text into vectors
Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text. Embeddings are useful for search, clustering, recommendations, anomaly detection, and classification tasks. 
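
One common way to measure relatedness between two embedding vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration. Real embedding models return vectors with hundreds or thousands of dimensions, and the values here are assumptions, not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings chosen by hand for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

Texts with related meaning should end up with vectors pointing in similar directions, which is what makes embeddings usable for search and clustering.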

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()  # Load all the environment variables from a .env file (requires the python-dotenv package)

In [None]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

In [None]:
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(model="text-embedding-3-large")
embedding

In [None]:
text = "This is a tutorial on OPENAI embedding"
query_result = embedding.embed_query(text)
query_result

### End to end: from splitting text to creating vectors and storing them in the vector database

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 50)
final_document = text_spliter.split_documents(doc)
final_document

In [None]:
## Convert the chunks to vectors
## Vector embedding and vector store DB

from langchain_community.vectorstores import Chroma
db = Chroma.from_documents(final_document,embedding)
db

In [None]:
# Search the database for a sentence taken from the speech text
query = "his dream of becoming a fighter pilot"
retrieved_results = db.similarity_search(query)
retrieved_results

## Using open-source models
### 1. OLLAMA
- Download Ollama from https://ollama.com/download and install it on your system.
- Start Ollama on your system.
- Download any model you like.
- For example, use this command:
-  ollama run llama3.1
- Ollama supports embedding models, making it possible to build retrieval augmented generation (RAG) applications that combine text prompts with existing documents or other data.

In [None]:
from langchain_community.embeddings import OllamaEmbeddings

In [None]:
embedding = OllamaEmbeddings(model="gemma:2b")  # by default it uses llama2

In [None]:
embedding

In [None]:
r1 = embedding.embed_documents(
    [
        "Alpha is the first letter of Greek Alphabet",
        "Beta is the second letter of Greek alphabet"
    ]
)

In [None]:
r1

In [None]:
embedding.embed_query("What is the second letter of Greek alphabet")

### Other Ollama Embedding models
- https://ollama.com/blog/embedding-models

In [None]:
embedding = OllamaEmbeddings(model="mxbai-embed-large")
text = "This is a test document."
query_result = embedding.embed_query(text)

## Embedding Techniques using HuggingFace
### Sentence Transformers on Hugging Face
Hugging Face sentence-transformers is a Python framework for state-of-the-art sentence, text, and image embeddings. One of its embedding models is used in the HuggingFaceEmbeddings class. We have also added an alias for SentenceTransformerEmbeddings for users who are more familiar with directly using that package.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [None]:
text = "This is a test document."
query_result = embedding.embed_query(text)
query_result

In [None]:
# dimension of the query result
len(query_result)

In [None]:
# Giving a list of texts to embed
doc_result = embedding.embed_documents([text,"This is not a test document."])

In [None]:
# Result of 1st text embedding 
doc_result[0]

In [None]:
# Result of 2nd text embedding 
doc_result[1]

## FAISS
Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

In [None]:
loader = TextLoader("speech.txt")
documents = loader.load()
text_spliter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 30)
docs = text_spliter.split_documents(documents)

In [None]:
docs

In [None]:
embeddings = OllamaEmbeddings()
db = FAISS.from_documents(docs,embeddings)

In [None]:
db

In [None]:
### querying
query = " He narrowly missed achieving his dream of becoming a fighter pilot."
docs = db.similarity_search(query)

In [None]:
docs[0].page_content

### As a Retriever
We can also convert the vector store into a Retriever class. This allows us to easily use it in other LangChain methods, which largely work with retrievers.

In [None]:
retriever = db.as_retriever()
docs = retriever.invoke(query)

### Similarity Search With Score
There are some FAISS-specific methods. One of them is similarity_search_with_score, which returns not only the documents but also the distance score of the query to them. The returned distance score is L2 distance. Therefore, a lower score is better.
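
Why a lower score is better becomes clear from a brute-force sketch of distance-based ranking. The vectors, the list used as a store, and the names `l2_distance` and `search_with_score` are assumptions made for this example. FAISS itself relies on optimized index structures rather than a plain loop like this.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_with_score(query_vec, store, k=2):
    """Return the k closest (text, distance) pairs, smallest distance first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical two-dimensional embeddings chosen by hand
store = [
    ("pilot", [1.0, 0.0]),
    ("teacher", [0.0, 1.0]),
    ("aviator", [0.9, 0.1]),
]

for text, score in search_with_score([1.0, 0.0], store):
    print(text, round(score, 4))
```

The closest entries come first because the distance measures how far apart the vectors are, in contrast to a similarity measure where a higher value is better.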

In [None]:
docs_and_score = db.similarity_search_with_score(query)
docs_and_score

In [None]:
embedding_vector = embeddings.embed_query(query)
embedding_vector

In [None]:
docs_score =  db.similarity_search_by_vector(embedding_vector)
docs_score

In [None]:
### Saving
db.save_local("faiss_index")

In [None]:
## Loading
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)  # use the same embeddings the index was built with
docs = new_db.similarity_search(query)

In [6]:
docs


### Chroma DB
Chroma is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

In [None]:
## Building a sample vectordb
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
loader = TextLoader("speech.txt")
data = loader.load()
data

In [None]:
#Split
text_spliter = RecursiveCharacterTextSplitter(chunk_size = 500,chunk_overlap = 0)
splits = text_spliter.split_documents(data)

In [None]:
embeddings = OllamaEmbeddings()
vectordb = Chroma.from_documents(documents=splits,embedding=embeddings)
vectordb

In [None]:
### querying
query = " He narrowly missed achieving his dream of becoming a fighter pilot."
docs = vectordb.similarity_search(query)

In [None]:
docs

In [None]:
docs[0].page_content

In [None]:
#Saving to the disk
vectorDB = Chroma.from_documents(documents=splits,embedding=embeddings,persist_directory = "./chroma_db")

In [None]:
# Load from the disk
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)  # use the same embeddings the store was built with


In [None]:
docs = db2.similarity_search(query)
docs[0].page_content