Docu: RAG
langchain4j committed Mar 28, 2024
1 parent 2f417ce commit 57cd62c
Showing 1 changed file with 61 additions and 53 deletions: docs/docs/tutorials/7-rag.md

by [Siva](https://www.sivalabs.in/).

An LLM's knowledge is limited to the data it has been trained on.
If you want to make an LLM aware of domain-specific knowledge or proprietary data, you can:
- Use RAG, which we will cover in this section
- Fine-tune the LLM with your data
- [Combine both RAG and fine-tuning](https://gorilla.cs.berkeley.edu/blogs/9_raft.html)

## What is RAG?
Simply put, RAG is the way to find and inject relevant pieces of information from your data
into the prompt before sending it to the LLM.
This way, the LLM will get (hopefully) relevant information and will be able to reply using this information,
which should reduce the probability of hallucinations.

## Easy RAG
LangChain4j has an "Easy RAG" feature that makes it as easy as possible to get started with RAG.
You don't have to learn about embeddings, choose a vector store, find the right embedding model,
figure out how to parse and split documents, etc.
Just point to your document(s), and LangChain4j will do its magic.

If you need a customizable RAG, skip to the [next section](/tutorials/rag#rag-apis).

If you are using Quarkus, there is an even easier way to do Easy RAG.
Please read the [Quarkus documentation](https://docs.quarkiverse.io/quarkus-langchain4j/dev/easy-rag.html).

:::note
The quality of such "Easy RAG" will, of course, by definition, be lower than that of a tailored solution.
The quality of such "Easy RAG" will, of course, be lower than that of a tailored RAG setup.
However, this is the easiest way to start learning about RAG and/or make a proof of concept.
Later, you will be able to transition smoothly from Easy RAG to more advanced RAG,
adjusting and customizing more and more aspects.
:::

1. Import the `langchain4j-easy-rag` dependency:
```xml
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <version>...</version> <!-- use the latest available version -->
</dependency>
```

2. Let's load your documents:
```java
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/langchain4j/documents");
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/langchain4j/documentation");
```
This will load all files from the specified directory.

<details>
<summary>What is happening under the hood?</summary>

Under the hood, the loaded files are parsed using an `ApacheTikaDocumentParser`,
provided by the `langchain4j-easy-rag` dependency through SPI.

If you want to load documents from all subdirectories, you can use the `loadDocumentsRecursively` method:
```java
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively("/home/langchain4j/documents");
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively("/home/langchain4j/documentation");
```
Additionally, you can filter documents by using a glob or regex:
```java
PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:*.pdf");
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/langchain4j/documentation", pathMatcher);
```

:::note
If you use the `loadDocumentsRecursively` method, you might want to use a double asterisk (instead of a single one)
in the glob: `glob:**.pdf`.
:::
</details>

3. Now, we need to preprocess and store documents in a specialized embedding store, also known as a vector database.
This is necessary to quickly find relevant pieces of information on the fly when a user asks a question.
We can use any of our 15+ [supported embedding stores](/category/embedding-stores),
but for simplicity, we will use an in-memory one:
```java
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
```

Now, let's ingest our documents into the store:
```java
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
```

<details>
<summary>What is happening under the hood?</summary>

1. The `EmbeddingStoreIngestor` loads a `DocumentSplitter` from the `langchain4j-easy-rag` dependency through SPI.
Each `Document` is split into smaller pieces (`TextSegment`s) each consisting of no more than 300 tokens
and with a 30-token overlap.

2. The `EmbeddingStoreIngestor` loads an `EmbeddingModel` from the `langchain4j-easy-rag` dependency through SPI.
Each `TextSegment` is converted into an `Embedding` using the `EmbeddingModel`.

:::note
We have chosen [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the default embedding model for Easy RAG.
It has achieved an impressive score on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard),
and its quantized version occupies only 24 megabytes of space.
Therefore, we can easily load it into memory and run it in the same process using [ONNX Runtime](https://onnxruntime.ai/).

Yes, that's right, you can convert text into embeddings entirely offline, without any external services.
LangChain4j offers 5 popular embedding models that can be used this way.
:::
3. All `TextSegment`-`Embedding` pairs are stored in the `EmbeddingStore`.
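
For reference, these automatic steps roughly correspond to configuring an `EmbeddingStoreIngestor` explicitly,
which you can do later when you outgrow the Easy RAG defaults
(a sketch; `embeddingModel` stands for whichever `EmbeddingModel` implementation you choose,
and `documents`/`embeddingStore` are the variables from the earlier snippets):
```java
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(300, 30)) // max segment size and overlap
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .build();

ingestor.ingest(documents);
```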
</details>

4. The last step is to create an [AI Service](/tutorials/ai-services) that will serve as our API to the LLM:
```java
interface Assistant {

    String chat(String userMessage);
}

ChatLanguageModel chatLanguageModel = OpenAiChatModel.withApiKey(System.getenv("OPENAI_API_KEY")); // assumes the API key is provided via an environment variable

Assistant assistant = AiServices.builder(Assistant.class)
.chatLanguageModel(chatLanguageModel)
.chatMemory(MessageWindowChatMemory.withMaxMessages(10))
.contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
.build();
```
Here, we configure the `Assistant` to use an OpenAI LLM to answer user questions,
remember the 10 latest messages in the conversation,
and retrieve relevant content from an `EmbeddingStore` that contains our documents.

5. And now we are ready to chat with it!
```java
String answer = assistant.chat("How to do RAG with LangChain4j?");
String answer = assistant.chat("How to do Easy RAG with LangChain4j?");
```

## RAG APIs
LangChain4j offers a rich set of APIs to make it easy for you to build custom RAG pipelines,
ranging from very simple ones to very advanced ones. In this section, we will cover the main domain classes and APIs.

### Document
The `Document` class represents an entire document, such as a single PDF file or a web page.
At the moment, the `Document` can only represent textual information,
but future updates will enable it to support images and tables as well.
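
For illustration, here is a minimal sketch of creating a `Document` manually from a `String`;
the text and the metadata key/value used here are made up for the example:
```java
// create a Document from plain text and attach some arbitrary metadata
Document document = Document.from(
        "LangChain4j is a library for building LLM-powered Java applications.",
        Metadata.from("source", "example")); // "source" and "example" are arbitrary values

String text = document.text();           // the textual content
Metadata metadata = document.metadata(); // the attached metadata
```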

### Document Loader
You can create a `Document` from a `String`, but a simpler method is to use one of our document loaders included in the library:
- `FileSystemDocumentLoader` from the `langchain4j` module
- `UrlDocumentLoader` from the `langchain4j` module
- `AmazonS3DocumentLoader` from the `langchain4j-document-loader-amazon-s3` module
- `AzureBlobStorageDocumentLoader` from the `langchain4j-document-loader-azure-storage-blob` module
- `GitHubDocumentLoader` from the `langchain4j-document-loader-github` module
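
For example, loading a single file from disk could look like this
(a minimal sketch; the file path is made up, and it assumes a `loadDocument` overload that accepts a path,
analogous to the directory-loading examples above):
```java
// load one file; the parser is discovered automatically (see the Document Parser section below)
Document document = FileSystemDocumentLoader.loadDocument("/home/langchain4j/documentation/intro.txt");
```
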
### Document Parser
`Document`s can represent files in various formats, such as PDF, DOC, TXT, etc.
To parse each of these formats, there's a `DocumentParser` interface with several implementations included in the library:
- `TextDocumentParser` from the `langchain4j` module, which can parse files in plain text format (e.g. TXT, HTML, MD, etc.)
- `ApachePdfBoxDocumentParser` from the `langchain4j-document-parser-apache-pdfbox` module, which can parse PDF files
- `ApachePoiDocumentParser` from the `langchain4j-document-parser-apache-poi` module, which can parse MS Office file formats
(e.g. DOC, DOCX, PPT, PPTX, XLS, XLSX, etc.)
- `ApacheTikaDocumentParser` from the `langchain4j-document-parser-apache-tika` module, which can auto-detect and parse nearly all existing file formats

If you do not specify a `DocumentParser` explicitly, document loaders can discover one through SPI
(for example, the parser provided by the `langchain4j-easy-rag` dependency).
If no `DocumentParser`s are found through SPI, a `TextDocumentParser` is used as a fallback.
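
For example, explicitly parsing a PDF file could look like this
(a minimal sketch; the file path is made up, and it assumes the `langchain4j-document-parser-apache-pdfbox` dependency is on the classpath):
```java
// parse a single PDF file with an explicitly chosen DocumentParser
Document document = FileSystemDocumentLoader.loadDocument(
        "/home/langchain4j/documentation/manual.pdf",
        new ApachePdfBoxDocumentParser());
```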


### Document Transformer
`DocumentTransformer` implementations can perform a variety of document transformations,
such as cleaning, filtering, and enriching documents.

Currently, the only implementation provided out-of-the-box is `HtmlTextExtractor` in the `langchain4j` module,
which can extract desired text content and metadata from an HTML document.

You can implement your own `DocumentTransformer` and plug it into the LangChain4j RAG pipeline.
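
As an illustration, a custom transformer could look roughly like this
(a hypothetical sketch that normalizes whitespace, relying only on the `transform` method of the `DocumentTransformer` interface):
```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;

// a hypothetical transformer that collapses runs of whitespace into single spaces
class WhitespaceNormalizingTransformer implements DocumentTransformer {

    @Override
    public Document transform(Document document) {
        String cleanedText = document.text().replaceAll("\\s+", " ").trim();
        return Document.from(cleanedText, document.metadata()); // keep the original metadata
    }
}
```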

### Text Segment
Once your `Document`s are loaded, it is time to split (chunk) them into smaller segments (pieces).
LangChain4j's domain model includes a `TextSegment` class that represents a segment of a `Document`.

There are several reasons to include only relevant segments
instead of the entire knowledge base in the prompt:
- The more information you provide in the prompt, the more you pay
- Irrelevant information in the prompt might confuse or distract the LLM and increase the chance of hallucinations

We can address these concerns by splitting a knowledge base into smaller, more digestible segments.
How big should those segments be? That is a good question. As always, it depends.

There are currently 2 widely used approaches:
1. Each document (e.g., a PDF file, a web page, etc.) is atomic and indivisible.
During retrieval in the RAG pipeline, the N most relevant documents are retrieved and injected into the prompt.
You will most probably need to use a long-context LLM in this case since documents can be quite long.
This approach is suitable if retrieving complete documents is important,
such as when you can't afford to miss some details.
   - Pros: No context is lost.
   - Cons:
     - More tokens are consumed.
     - Sometimes, documents can contain multiple sections/topics, and not all of them are relevant to the query.
     - Vector search quality suffers because complete documents of various sizes are compressed into a single, fixed-length vector.

2. Documents are split into smaller segments, such as chapters, paragraphs, or sometimes even sentences.
During retrieval in the RAG pipeline, the N most relevant segments are retrieved and injected into the prompt.
A common technique to mitigate the loss of context is to retrieve a window of neighboring segments,
providing the LLM with additional information before and after the retrieved segment.

### Document Splitter
LangChain4j has a `DocumentSplitter` interface with several out-of-the-box implementations:
- `DocumentByParagraphSplitter`
- `DocumentByLineSplitter`
- `DocumentBySentenceSplitter`
- `DocumentByWordSplitter`
- `DocumentByCharacterSplitter`
- `DocumentByRegexSplitter`
- `DocumentSplitters.recursive()` (recommended)

They all work as follows:
1. You instantiate a `DocumentSplitter`, specifying the desired size of `TextSegment`s and,
optionally, an overlap in characters or tokens.
2. You call the `split(Document)` or `splitAll(List<Document>)` methods of the `DocumentSplitter`.
3. The `DocumentSplitter` splits the given `Document`s into smaller units,
the nature of which varies with the splitter. For instance, `DocumentByParagraphSplitter` divides
a document into paragraphs (defined by two or more consecutive newline characters),
while `DocumentBySentenceSplitter` uses the OpenNLP library's sentence detector to split
a document into sentences, and so on.
4. The `DocumentSplitter` then combines these smaller units (paragraphs, sentences, words, etc.) into `TextSegment`s,
attempting to include as many units as possible in a single `TextSegment` without exceeding the limit set in step 1.
If some of the units are still too large to fit into a `TextSegment`, it calls a sub-splitter.
This is another `DocumentSplitter`, capable of splitting the oversized units into more granular ones.
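
For instance, splitting documents and inspecting the resulting segments could look like this
(a minimal sketch that reuses the `documents` list loaded earlier; 300 and 30 are example values for the maximum segment size and overlap):
```java
// the recursive splitter tries coarse units (paragraphs) first and falls back to more granular sub-splitters
DocumentSplitter splitter = DocumentSplitters.recursive(300, 30);
List<TextSegment> segments = splitter.splitAll(documents);
```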

### Text Segment Transformer
More details are coming soon.
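
In the meantime, a `TextSegmentTransformer` can be implemented much like a `DocumentTransformer`
(a hypothetical sketch, assuming the interface exposes a `transform(TextSegment)` method
and that segments carry a `file_name` metadata entry; both are assumptions for this example):
```java
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.data.segment.TextSegmentTransformer;

// a hypothetical transformer that prepends the source file name to each segment,
// which can give the embedding model and the LLM more context about where a segment comes from
class FileNamePrependingTransformer implements TextSegmentTransformer {

    @Override
    public TextSegment transform(TextSegment segment) {
        String fileName = segment.metadata().get("file_name"); // "file_name" is an assumed metadata key
        return TextSegment.from(fileName + "\n" + segment.text(), segment.metadata());
    }
}
```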
