#### Getting Started With LangChain And OpenAI

- Get set up with LangChain, LangSmith, and LangServe
- Use the most basic and common components of LangChain: prompt templates, models, and output parsers.
- Build a simple application with LangChain
- Trace your application with LangSmith
- Serve your application with LangServe

### RAG - Retrieval Augmented Generation
1. Data Sources: PDF, JSON, URLs, images => data ingestion
2. Data Transformation: splitting large documents into text chunks
3. Embedding: converting text chunks to vectors
4. Store the vectors in a vector store database
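
The four steps above can be sketched in plain Python. This is only an illustration of the data flow, not how LangChain implements it: the helper names `chunk_text` and `embed` are hypothetical, the "embedding" is a toy word-count vector rather than a learned model, and the "vector store" is an ordinary list.

```python
import re
from collections import Counter

def chunk_text(text, chunk_size=40):
    """Step 2: cut raw text into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunk, vocabulary):
    """Step 3: toy embedding with one word-count dimension per vocabulary word."""
    counts = Counter(re.findall(r"[a-z]+", chunk.lower()))
    return [float(counts[word]) for word in vocabulary]

# Step 1: the "data source" here is just an in-memory string.
raw_text = (
    "LangChain loads documents. Documents are split into chunks. "
    "Chunks become vectors."
)
chunks = chunk_text(raw_text)
vocabulary = sorted(set(re.findall(r"[a-z]+", raw_text.lower())))

# Step 4: the "vector store" is a plain list of (vector, chunk) pairs.
vector_store = [(embed(chunk, vocabulary), chunk) for chunk in chunks]
print(len(vector_store), "chunks stored, each with", len(vocabulary), "dimensions")
```

A real pipeline would replace each piece with a document loader, a text splitter, an embedding model, and a vector database such as the ones listed in the next section.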


### Vector Database
1. FAISS
2. ChromaDB
3. Astra DB

## Retrieval Chain
A retrieval chain is an interface responsible for querying the vector store database.
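
As a rough sketch of what "querying the vector store" means, the snippet below ranks stored vectors by cosine similarity to a query vector and returns the top matches. The function names `cosine_similarity` and `retrieve`, the three-dimensional vectors, and the list-based store are all assumptions made for illustration; real vector stores use learned embeddings and optimized indexes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vector, vector_store, k=2):
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        vector_store,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

# Hypothetical store: (vector, text) pairs with made-up 3-dimensional vectors.
vector_store = [
    ([1.0, 0.0, 0.0], "chunk about loaders"),
    ([0.0, 1.0, 0.0], "chunk about splitters"),
    ([0.0, 0.9, 0.1], "chunk about chunk overlap"),
]
top = retrieve([0.0, 1.0, 0.0], vector_store, k=2)
print(top)
```
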
## Data Ingestion With Documents Loaders
- Loading a data set from a specific source.
- https://python.langchain.com/v0.2/docs/integrations/document_loaders/
### Document loaders
- DocumentLoaders load data into the standard LangChain Document format.
- Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method.
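
To make the "standard Document format" concrete, here is a simplified stand-in written without LangChain. `SimpleDocument` and `SimpleTextLoader` are hypothetical names; they only mimic the shape of a LangChain Document (a `page_content` string plus a `metadata` dict) and the shared `.load` entry point that returns a list of documents.

```python
import os
import tempfile
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Stand-in for a document: the text plus a dict describing its origin."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class SimpleTextLoader:
    """Stand-in loader: constructor takes source-specific parameters."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self):
        """Read the whole file and wrap it in a one-element list of documents."""
        with open(self.file_path, encoding=self.encoding) as f:
            content = f.read()
        return [SimpleDocument(content, {"source": self.file_path})]

# Write a small sample file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speech.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Four score and seven years ago")
    docs = SimpleTextLoader(path).load()

print(type(docs).__name__, len(docs), docs[0].page_content)
```

The point to notice is that `.load` returns a list, even for a single file, which is why the cells that follow index into or iterate over the result.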

In [None]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt')
text = loader.load()

In [None]:
text

In [None]:
## Reading from the PDF File

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('attension.pdf')
doc = loader.load()
doc

### Text Splitting from Documents (Huge Text)


#### How to recursively split text by characters
This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the most strongly semantically related pieces of text.
- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
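
The idea of "try each separator in order until the chunks are small enough" can be sketched as a short recursive function. This is a simplified illustration, not LangChain's implementation: the name `recursive_split` is hypothetical, and the sketch ignores `chunk_overlap` entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        # Try to grow the current chunk with the next piece.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
pieces = recursive_split(sample, chunk_size=30)
for p in pieces:
    print(len(p), repr(p))
```

In this sample the first and third paragraphs fit within 30 characters and stay whole, while the second paragraph is too long and gets split at word boundaries instead.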

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_document = text_spliter.split_documents(doc)
final_document

In [None]:
speech = ""
with open("speech.txt") as f:
    speech = f.read()
print("type of speech when read with open() =>", type(speech))

from langchain_community.document_loaders import TextLoader
loader=TextLoader('speech.txt')
text = loader.load()
print("type of text when loaded with TextLoader() =>", type(text))


In [None]:
new_text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
new_text =  new_text_splitter.create_documents([speech])
print("type of new_text returned by create_documents() =>", type(new_text))
new_text[1]

#### How to split by character - Character Text Splitter
This is the simplest method. It splits based on a given character sequence, which defaults to "\n\n". Chunk length is measured by number of characters.

1. How the text is split: by a single character separator.
2. How the chunk size is measured: by number of characters.
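
A minimal sketch of single-separator splitting, assuming a hypothetical helper `split_by_separator` and ignoring `chunk_overlap`. Because there is only one separator, a piece that is longer than `chunk_size` cannot be broken down further, so this sketch emits it whole; check the library documentation for how `CharacterTextSplitter` itself treats that case.

```python
def split_by_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator, then merge neighbours while they fit."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece may exceed chunk_size; there is no finer separator.
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "alpha\n\nbeta\n\n" + "x" * 30 + "\n\ngamma"
chunks = split_by_separator(text, chunk_size=12)
for c in chunks:
    print(len(c), repr(c))
```

Here "alpha" and "beta" are merged because together they fit in 12 characters, while the 30-character run stays intact and ends up larger than the requested chunk size.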

In [None]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt')
docs = loader.load()
docs

In [None]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(docs)

#### How to split by HTML Header
HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structure. It can be used with other text splitters as part of a chunking pipeline.
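
To show what "metadata for each relevant header" looks like, the sketch below uses only the standard library's `html.parser`. The class `HeaderChunker` is hypothetical and far simpler than HTMLHeaderTextSplitter: it assumes each header contains a single run of text and it does not filter out elements such as `<title>` or `<script>`.

```python
from html.parser import HTMLParser

class HeaderChunker(HTMLParser):
    """Hypothetical helper: tags each text block with the headers above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.tags = [tag for tag, _ in headers_to_split_on]  # outermost first
        self.labels = dict(headers_to_split_on)              # "h1" -> "Header 1"
        self.open_header = None  # header tag currently being read, if any
        self.metadata = {}       # headers in effect at the current position
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_header is None:
            # Body text: record it with a copy of the active headers.
            self.chunks.append({"content": text, "metadata": dict(self.metadata)})
            return
        # New header: drop this level and every deeper level, then set it.
        level = self.tags.index(self.open_header)
        for tag in self.tags[level:]:
            self.metadata.pop(self.labels[tag], None)
        self.metadata[self.labels[self.open_header]] = text

html = (
    "<h1>Intro</h1><p>Welcome.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h1>Usage</h1><p>Run it.</p>"
)
chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(html)
chunker.close()
chunks = chunker.chunks
for chunk in chunks:
    print(chunk)
```

Each paragraph carries the headers it sits under, and starting a new `h1` clears the previous `h2`, which is the context-preserving behavior described above.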

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on  = [
    ("h1","Header 1"),
    ("h2","Header 2"),
    ("h3","Header 3")
]
html_string = '''<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>'''
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter
url = "https://plato.stanford.edu/entries/goedel/"
headers_to_split_on  = [
    ("h1","Header 1"),
    ("h2","Header 2"),
    ("h3","Header 3"),
    ("h4","Header 4")
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

#### How to split JSON Data
