### SIMPLE RAG PIPELINE

##### Data Ingestion/Extraction
Famous books: summary texts and details. Ref: https://www.sayebrand.com/blogs/stories/25famousbooks

In [3]:
# Data loading: load a single text file
from langchain_community.document_loaders import TextLoader
# Set the encoding explicitly so curly apostrophes in the files decode correctly
loader = TextLoader("./books/THE BIBLE.txt", encoding="utf-8")  # TODO: loop over all the texts
text = loader.load()
text

[Document(metadata={'source': './books/THE BIBLE.txt'}, page_content='GENRE\n\nReligious Text.\n\nREAD IF\n\nYou enjoy philosophy and symbolism texts.\n\nFUN FACT\n\nOver 100 million copies of the Bible are sold each year.\n\nYes, The Bible is a book and is one of the most successful of all time and also one of the go-to’s on deep morality. The Bible is a collection of religious texts or scriptures that have become sacred to Christians, Jews, Samaritans, Rastafari and other religious groups.\n\nThe influence the Bible has had on Western culture is immeasurable. For thousands of years this book has inspired the greatest writers, artists, musicians, religious leaders, painters of our time. Love it or hate it, the bible has been one of the most pivotal books of all time in the western world.')]
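
The comment in the cell above mentions looping over the texts. A minimal sketch of such a loop, using only the standard library and assuming the summaries are UTF-8 `.txt` files under one folder. `load_text_files` is a hypothetical helper, not a LangChain function, and it returns plain dicts rather than `Document` objects:

```python
from pathlib import Path


def load_text_files(folder, encoding="utf-8"):
    """Read every .txt file under `folder` into a list of dicts.

    Hypothetical helper: mirrors the shape of a loaded document
    (source path plus page content) without depending on LangChain.
    """
    records = []
    # sorted() keeps the order stable across runs and platforms
    for path in sorted(Path(folder).glob("**/*.txt")):
        records.append({
            "source": str(path),
            "page_content": path.read_text(encoding=encoding),
        })
    return records
```

Calling `load_text_files("./books")` would then give one record per book file, which is roughly what the directory loader further down does in a single call.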

In [None]:
# Loading data from a PDF document
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("path_to_PDF")
pdfs = loader.load()
pdfs

In [None]:
# Loading data from html pages of a website using Web based loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

# web_paths expects a sequence of URLs, so a single URL needs a trailing comma to form a tuple
loader = WebBaseLoader(web_paths=("url_here",), bs_kwargs=dict(parse_only=bs4.SoupStrainer(
    class_=("post-title", "post-content", "post-header")  # CSS classes of the HTML elements to extract
)))

html_text = loader.load()
html_text
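
To make the class-based filtering above more concrete, here is a rough standard-library sketch of the idea: keep only the text that sits inside elements carrying one of the target classes. `ClassTextExtractor` and the sample HTML are illustrative assumptions; this is not how `bs4.SoupStrainer` is implemented.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect text found inside elements whose class matches a target."""

    def __init__(self, target_classes):
        super().__init__()
        self.target_classes = set(target_classes)
        self.depth = 0   # greater than 0 while inside a matching element
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a matching element, keep counting for its children
        if self.depth or self.target_classes.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


# Made-up sample page for illustration
sample_html = (
    '<h1 class="post-title">Famous Books</h1>'
    '<div class="sidebar">Ads</div>'
    '<div class="post-content"><p>1984 by <b>George Orwell</b></p></div>'
)
parser = ClassTextExtractor(["post-title", "post-content", "post-header"])
parser.feed(sample_html)
print(parser.parts)  # ['Famous Books', '1984 by', 'George Orwell']
```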

In [10]:
# Load every text file in the books directory.
# Note: pass loader_cls explicitly; otherwise DirectoryLoader falls back to its default loader, which needs extra dependencies
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('./books/', glob="**/*.txt", loader_cls=TextLoader, loader_kwargs={"encoding": "utf-8"})
docs = loader.load()
docs

[Document(metadata={'source': 'books\\1984.txt'}, page_content='GENRE\n\nDystopian social science fiction.\n\nREAD IF\n\nYou love reading about dystopian worlds, possible futures and if you’re a fan of the series Black Mirror.\n\nFUN FACT\n\nOrwell modeled the character of Julia on his second wife, Sonia Brownell.\n\n1984 is a dystopian social science fiction novel and is written by English novelist George Orwell. The story takes place in an imagined future (the year 1984) when much of the world has fallen victim to totalitarianism, mass surveillance, manipulation of the past and propaganda.\n\nThe superstate is called Oceania and is ruled by the Party who employ Thought Police, whose job it is to persecute individuality and independent thinking. Winston Smith is the protagonist of the novel and although he is a responsible and reliable worker in the system, he dreams of rebellion. When his colleague, Julia, and him begin a forbidden relationship, he begins to remember what life was 

##### Data Transformation

In [20]:
# Split the loaded documents into chunks; the chunks are later embedded for the vector DB and retrieved as context for the LLM

# Using a text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
documents = text_splitter.split_documents(docs)
print(documents[0].page_content)

GENRE

Dystopian social science fiction.

READ IF

You love reading about dystopian worlds, possible futures and if you’re a fan of the series Black Mirror.

FUN FACT

Orwell modeled the character of Julia on his second wife, Sonia Brownell.
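
To show what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch: fixed-size character windows that share an overlap. `split_with_overlap` is a hypothetical helper; unlike `RecursiveCharacterTextSplitter`, it does not try to break on paragraph or sentence boundaries first.

```python
def split_with_overlap(text, chunk_size=300, chunk_overlap=50):
    """Split `text` into fixed-size character windows that overlap.

    Simplified illustration only: it ignores paragraph and sentence
    boundaries, which the recursive splitter tries to respect.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 700 characters of dummy text, split with the same settings as the cell above
sample = "".join(str(i % 10) for i in range(700))
chunks = split_with_overlap(sample, chunk_size=300, chunk_overlap=50)
print([len(c) for c in chunks])  # [300, 300, 200]
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.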


##### Data Loading

In [21]:
# # Creating vector embeddings and the vector DB (vector store)
# from langchain_community.embeddings import LocalAIEmbeddings
# from langchain_community.vectorstores import Chroma

# # from_documents expects an embeddings instance (configured for your LocalAI endpoint), not the class itself
# database = Chroma.from_documents(documents, LocalAIEmbeddings())
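
Since the cell above is still commented out, here is a toy sketch of what a vector store does with the chunks: turn each chunk into a vector, then rank chunks by similarity to a query vector. The word-count "embedding" and the `ToyVectorStore` class are stand-ins assumed for illustration only; a real pipeline would use an embedding model and a store such as Chroma.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def similarity_search(self, query, k=1):
        """Return the k chunks most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore([
    "1984 is a dystopian social science fiction novel",
    "The Bible is a collection of religious texts",
])
print(store.similarity_search("dystopian novel", k=1))
# ['1984 is a dystopian social science fiction novel']
```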