# Chapter 2 - RAG Part I: Indexing Your Data

## Converting Your Documents into Text

LangChain provides document loaders that handle the parsing logic and enable you to “load” data from various sources into a ```Document``` class that consists of text and associated metadata.

For example, consider a simple _.txt_ file. You can import LangChain's ```TextLoader``` class to extract its text, like this:

In [5]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./test.txt")
loader.load()

[Document(metadata={'source': './test.txt'}, page_content='Hello there!\n')]
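To make the shape of that result concrete, here is a minimal standard-library sketch of what a text loader does conceptually: read a file and wrap its contents, together with the source path, in a record. `SimpleDocument` and `load_text` are hypothetical names used only for illustration. They are not LangChain's classes, and the real `TextLoader` may handle cases this sketch ignores.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document record: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text(path: str) -> List[SimpleDocument]:
    """Read one text file and return it as a single-item list of documents."""
    text = Path(path).read_text(encoding="utf-8")
    # Record where the text came from so later steps can trace it to its source.
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

Returning a list, even for a single file, mirrors the output shown above and lets downstream steps treat one file and many files the same way.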

Aside from _.txt_ files, LangChain provides document loaders for other popular file types, including _.csv_, _.json_, and _Markdown_, alongside integrations with popular platforms such as Slack and Notion.

For example, you can use ```WebBaseLoader``` to load HTML from web URLs and parse it to text.



In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.langchain.com/")
loader.load()
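The "parse it to text" step can be illustrated without any network access. The sketch below uses only Python's standard `html.parser` module to pull visible text out of an HTML string. `TextExtractor` and `html_to_text` are hypothetical names, and this is an assumption-laden illustration of the idea, not a reproduction of how `WebBaseLoader` itself parses pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real loader also has to fetch the page and cope with malformed markup, which is why a dedicated loader is usually the better choice in practice.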

## Splitting Your Text into Chunks


In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("./rime.txt")
doc = loader.load()


splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)

splitted_doc = splitter.split_documents(doc)
print(len(splitted_doc), splitted_doc)


63 [Document(metadata={'source': './rime.txt'}, page_content='THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n\nHow a Ship having passed the Line was driven by Storms to the cold Country towards the South Pole; and how from thence she made her course to the tropical Latitude of the Great Pacific Ocean; and of the strange things that befell; and in what manner the Ancyent Marinere came back to his own Country.\n\nI.'), Document(metadata={'source': './rime.txt'}, page_content='I.\n\n     It is an ancyent Marinere,\n       And he stoppeth one of three:\n     "By thy long grey beard and thy glittering eye\n       "Now wherefore stoppest me?\n\n     "The Bridegroom\'s doors are open\'d wide\n       "And I am next of kin;\n     "The Guests are met, the Feast is set,--\n       "May\'st hear the merry din.--\n\n     But still he holds the wedding-guest--\n       There was a Ship, quoth he--\n     "Nay, if thou\'st got a laughsome tale,\n       "Marinere! come with me."'), Docum

In the preceding code, the documents created by the document loader are split into chunks of up to 500 characters each, with up to 100 characters of overlap between consecutive chunks to preserve context. The result is again a list of documents (63 in this run), each at most 500 characters long, split along the natural divisions of written text: paragraphs first, then new lines, and finally words. Using the structure of the text in this way keeps each chunk a coherent, readable snippet.

## Generating Text Embeddings

Here’s an example of embedding the chunks from the previous step using a Hugging Face embedding model. **Note that this embedding runs locally**:

In [14]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"  # embedding model to use
model_kwargs = {"device": "cpu"}  # run the model on the CPU
encode_kwargs = {"normalize_embeddings": False}  # keep the raw, unnormalized vectors
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

hf.embed_documents([chunk.page_content for chunk in splitted_doc])

[[-0.007732471916824579,
  0.027026023715734482,
  -0.014299447648227215,
  -0.01903754658997059,
  -0.0038336957804858685,
  0.0036903382278978825,
  0.027617407962679863,
  -0.013661875389516354,
  -0.04140247777104378,
  -0.04102339223027229,
  0.05177145078778267,
  0.014999454841017723,
  0.0223831906914711,
  -0.05624186992645264,
  0.0640692412853241,
  -0.11622095108032227,
  -0.003943960648030043,
  -0.008856697008013725,
  -0.03668621554970741,
  0.01955009065568447,
  -0.07486226409673691,
  -0.02675645612180233,
  -0.01661897636950016,
  0.0031305106822401285,
  0.0772733986377716,
  -0.01388746127486229,
  -0.013467401266098022,
  -0.03364459052681923,
  -0.05003274232149124,
  -0.027447935193777084,
  0.05407832935452461,
  -0.011991282925009727,
  0.057737208902835846,
  0.03226063400506973,
  1.6024956721594208e-06,
  0.00725559052079916,
  -0.005395084153860807,
  0.021874811500310898,
  0.0373159721493721,
  -0.009270073845982552,
  0.05791440233588219,
  0.0084217553