# Document Loaders

- manually hardcoding documents with up-to-date factual data and passing them to the model does not scale
- instead, we should load data dynamically from a variety of sources

There are two broad types of loaders:
 - file loaders (extract content from a variety of file formats on the local filesystem [CSV, XML, JSON, PDF, etc.])
 - web loaders (extract content from the web through a variety of platforms and tools [GitHub, Playwright, Puppeteer, etc.])

They all live under the `@langchain/community/document_loaders` path.
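
To make the idea of a loader concrete, here is a minimal sketch of what one does conceptually: it turns raw source data into document objects of the shape `{ pageContent, metadata }`. The function `loadCsvAsDocuments` and the sample data are hypothetical and exist only for illustration; this is not how the LangChain loaders are implemented, and the naive comma split does not handle quoted fields.

```javascript
// Toy illustration only, NOT the LangChain implementation.
// Turns CSV text into Document-like objects: { pageContent, metadata }.
function loadCsvAsDocuments(csvText, source) {
    var lines = csvText.trim().split("\n");
    var headers = lines[0].split(",");

    // One document per data row; each row becomes "header: value" lines.
    return lines.slice(1).map(function (line, index) {
        var values = line.split(","); // naive split, no quoted-field support
        var pageContent = headers
            .map(function (header, i) { return header + ": " + values[i]; })
            .join("\n");
        return {
            pageContent: pageContent,
            metadata: { source: source, line: index + 1 }
        };
    });
}

// Made-up sample data.
var sampleCsv = "term,meaning\nLCEL,LangChain Expression Language\nRAG,Retrieval Augmented Generation";
var sampleDocs = loadCsvAsDocuments(sampleCsv, "sample.csv");
console.log(sampleDocs);
```

Whatever the source is, the rest of the pipeline only sees an array of such documents, which is why file loaders and web loaders are interchangeable from the chain's point of view.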

We will use a web scraper, namely `CheerioWebBaseLoader` from `@langchain/community/document_loaders/web/cheerio`, to fetch the [LCEL](https://python.langchain.com/v0.1/docs/expression_language/) page, extract its content, and convert it into usable documents.

`CheerioWebBaseLoader` depends on the `cheerio` package from `npm`.

```shell
yarn add cheerio
```

## Import dependencies

In [None]:
var Bedrock = require('@langchain/community/llms/bedrock').Bedrock;
var ChatPromptTemplate = require('@langchain/core/prompts').ChatPromptTemplate;
var createStuffDocumentsChain = require("langchain/chains/combine_documents").createStuffDocumentsChain;

var CheerioWebBaseLoader = require("@langchain/community/document_loaders/web/cheerio").CheerioWebBaseLoader;

## Instantiate the `model` client

In [None]:
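var model = new Bedrock({
    // Bedrock model identifier
    model: 'amazon.titan-text-express-v1',
    temperature: 1,
    // upper bound on generated tokens (sent to Titan as maxTokenCount)
    maxTokens: 512,
    // provider-specific generation settings are passed through modelKwargs
    modelKwargs: { topP: 0.9 },
    verbose: true
});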

## Create prompt with context placeholder

In [None]:
var promptContexDynamic = ChatPromptTemplate.fromTemplate(`
    Answer the user question.
    Context: {context}
    Question: {input}
`);

## Create a chain

- we create a chain from the model and the prompt with its `context` and `input` placeholders

In [None]:
var chainContextDynamic;

createStuffDocumentsChain({
    llm: model,
    prompt: promptContexDynamic,
}).then((chain) => chainContextDynamic = chain);
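
As a rough mental model of what a "stuff documents" chain does before calling the model, the sketch below joins the text of every document and substitutes it into the prompt placeholders. The helper `stuffDocuments` is hypothetical and heavily simplified; the real chain also handles prompt formatting and output parsing.

```javascript
// Simplified sketch of the "stuff" strategy; not the library's actual code.
function stuffDocuments(template, docs, input) {
    // Join the text of every document into one context string.
    var values = {
        context: docs.map(function (doc) { return doc.pageContent; }).join("\n\n"),
        input: input
    };
    // Fill {context} and {input} in a single pass, so placeholder-like
    // text inside the documents themselves is never substituted.
    return template.replace(/\{(context|input)\}/g, function (match, key) {
        return values[key];
    });
}

var exampleTemplate = "Answer the user question.\nContext: {context}\nQuestion: {input}";
var examplePrompt = stuffDocuments(
    exampleTemplate,
    [{ pageContent: "First document." }, { pageContent: "Second document." }],
    "What is LCEL ?"
);
console.log(examplePrompt);
```

The key point is that every document ends up in the prompt in full, so the prompt grows with the amount of loaded content.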

## Configure input

In [None]:
var input = "What is LCEL ?";

## Instantiate loader

In [None]:
var loader = new CheerioWebBaseLoader("https://python.langchain.com/v0.1/docs/expression_language/");

## Load documents

- scrapes the web URL and loads all of its content

In [None]:
var docs;

loader.load().then((data) => docs = data);

We can see what the web document loader has obtained by printing the docs

In [None]:
console.log(docs);

## Invoke LLM

- given the input, we invoke the LLM and get its answer about LCEL, now backed by extra context containing information about LCEL
- the context is passed in the `invoke` call because the chain is a `documents`-capable chain
- the documents are now loaded dynamically from the web, nothing is hardcoded

In [None]:
chainContextDynamic.invoke({
    input,
    context: docs
}).then((response) => console.log(response));

## Pricing problem

- pricing is based on the number of tokens
- we are sending a lot of tokens, more than are needed

In [None]:
console.log(docs[0].pageContent.length);

- 4100+ characters for a very simple and basic page
- all of them are fed into the model query
- we are charged for the tokens we send and receive
- a lot of wasted $$$
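
To get a feel for the numbers, here is a back-of-the-envelope sketch that converts a character count into an approximate token count and input cost. It assumes roughly 4 characters per token, which is only a common rule of thumb for English text; real tokenization is model-specific. The helpers `estimateTokens` and `estimateInputCost` and the `pricePer1kTokens` parameter are hypothetical; check the provider's pricing page for actual rates.

```javascript
// Assumption: ~4 characters per token (rough rule of thumb, model-specific in reality).
var CHARS_PER_TOKEN = 4;

// Approximate number of tokens in a piece of text.
function estimateTokens(text) {
    return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Approximate input cost, given a price per 1000 input tokens.
function estimateInputCost(text, pricePer1kTokens) {
    return (estimateTokens(text) / 1000) * pricePer1kTokens;
}

// Stand-in for a scraped page of 4100 characters.
var samplePage = "x".repeat(4100);
console.log(estimateTokens(samplePage)); // 1025
```

Under this assumption a 4100-character page costs about a thousand input tokens on every single question, regardless of how much of the page is relevant.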

## Split docs

- instead of having one big string with all of the web page data, let's fragment it
- not all of the web page contains relevant information
- split it into chunks; only some chunks have relevant data
- send only those chunks as context to the model
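
The last point can be sketched with a deliberately naive relevance filter that keeps only the chunks mentioning a keyword from the question. The helper `selectRelevantChunks` and the example chunks are made up for illustration; keyword matching is just a stand-in here, and a real pipeline would more likely rely on semantic search than on exact substrings.

```javascript
// Naive relevance filter: keep chunks that mention at least one keyword.
function selectRelevantChunks(chunks, keywords) {
    var lowered = keywords.map(function (keyword) { return keyword.toLowerCase(); });
    return chunks.filter(function (chunk) {
        var text = chunk.pageContent.toLowerCase();
        return lowered.some(function (keyword) { return text.includes(keyword); });
    });
}

// Made-up chunks standing in for pieces of a scraped page.
var exampleChunks = [
    { pageContent: "LCEL is a declarative way to compose chains.", metadata: {} },
    { pageContent: "Skip to main content", metadata: {} },
    { pageContent: "Chains built with LCEL support streaming.", metadata: {} }
];

var relevantChunks = selectRelevantChunks(exampleChunks, ["lcel"]);
console.log(relevantChunks.length); // 2
```

Navigation text such as "Skip to main content" is dropped, so fewer tokens reach the model.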

Let's import `RecursiveCharacterTextSplitter`

In [None]:
var RecursiveCharacterTextSplitter = require("langchain/text_splitter").RecursiveCharacterTextSplitter;

## `RecursiveCharacterTextSplitter`

- `chunkSize`: the maximum size of each chunk, in characters
- `chunkOverlap`: how many characters consecutive chunks share, so information that straddles a chunk boundary is not lost

In [None]:
var splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 300,
    chunkOverlap: 30
});
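
To see how the two parameters interact, the sketch below uses a simplified fixed-window splitter. The function `splitWithOverlap` is hypothetical and is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs, lines, and spaces before falling back to raw characters. Small values are used so the overlap is easy to spot.

```javascript
// Simplified fixed-window splitter, for illustration only.
function splitWithOverlap(text, chunkSize, chunkOverlap) {
    if (chunkOverlap >= chunkSize) {
        throw new Error("chunkOverlap must be smaller than chunkSize");
    }
    var chunks = [];
    // Each window starts (chunkSize - chunkOverlap) characters after the previous one.
    var step = chunkSize - chunkOverlap;
    for (var start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        // Stop once a window has reached the end of the text.
        if (start + chunkSize >= text.length) break;
    }
    return chunks;
}

var sampleText = "abcdefghijklmnopqrstuvwxyz";
var sampleChunks = splitWithOverlap(sampleText, 10, 3);
console.log(sampleChunks);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk repeats the last 3 characters of the previous one (`hij`, `opq`, `vwx`). Note that the chunks add up to 35 characters while the input has only 26: overlapping makes the total slightly larger, not smaller.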

Let's split the documents

In [None]:
var splitDocs;

splitter.splitDocuments(docs).then((data) => {
    splitDocs = data;
    console.log(splitDocs);
});

## Invoke LLM

- we send the same dynamically loaded documents, but now they are split into chunks
- the total number of characters/tokens is still roughly the same ...

In [None]:
chainContextDynamic.invoke({
    input,
    context: splitDocs
}).then((response) => console.log(response));