In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [8]:
from collections import defaultdict

In [2]:
from llama_index import SimpleDirectoryReader

## Data Loading

### Reading Files
[Ref](http://localhost:8000/module_guides/loading/simpledirectoryreader.html#supported-file-types).
`SimpleDirectoryReader` is meant for prototyping, not for production use. Even so, it can read many file types, as listed in the link above. Here are some useful ones from that link:
  * .csv
  * .docx
  * .epub
  * .ipynb
  * .jpeg, .jpg
  * .md
  * .mp3, .mp4
  * .pdf
  * .png
  * .ppt, .pptm, .pptx

For some reason `SimpleDirectoryReader` does not read JSON. For that, check out [JSONLoader](https://llamahub.ai/l/readers/llama-index-readers-json).

In [3]:
docs = SimpleDirectoryReader("./data", recursive=True).load_data(show_progress=True)
len(docs)

Loading files: 100%|██████████| 5/5 [00:00<00:00, 10.37file/s]


6

In [9]:
docsmap = defaultdict(list)
for doc in docs:
    docsmap[doc.metadata["file_name"]].append(doc)

### Documents
All data loaders, even the third-party ones in Llama-Hub, have a `load_data()` method that returns a list of `Document` objects, as we have seen above. For the most part, a `Document` object looks like this:

```
Document(
  id_='c33dc7aa-e994-48a4-8dcc-e76c163a2ca3', 
  embedding=None, 
  metadata={
      'file_path': 'data/txts/paul_graham_essay.txt', 
      'file_name': 'paul_graham_essay.txt', 
      'file_type': 'text/plain', 
      'file_size': 75042, 
      'creation_date': '2024-04-10', 
      'last_modified_date': '2024-03-21', 
      'last_accessed_date': '2024-04-10'
  }, 
  excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
  excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
  relationships={}, 
  text='...', 
  start_char_idx=None, 
  end_char_idx=None, 
  text_template='{metadata_str}\n\n{content}', 
  metadata_template='{key}: {value}', 
  metadata_seperator='\n'
)
```

In the call in the preceding cell, we got one more document than the number of files in our data directory. This is because when a file has multiple pages, each page gets its own `Document` object, along with `page_label` metadata that tells us the page number. In the example above, we loaded a "single page" text file and image file, so each of those resulted in a single object. The `conda-cheatsheet.pdf` is a 2-page document, so each of its pages got its own object.
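
The gap between the file count and the document count can be checked from the metadata alone. The sketch below uses plain dicts as stand-ins for `doc.metadata`; the entries are sample values modelled on a few of the files in this notebook, not the full data directory.

```python
from collections import Counter

# Stand-ins for `doc.metadata` of loaded documents (sample values only).
metadatas = [
    {"file_name": "ozymandias.txt"},
    {"file_name": "python.png"},
    {"file_name": "conda-cheatsheet.pdf", "page_label": "1"},
    {"file_name": "conda-cheatsheet.pdf", "page_label": "2"},
]

# Count how many Document objects each file produced.
docs_per_file = Counter(meta["file_name"] for meta in metadatas)

# Files that were split into more than one Document.
multi_doc_files = {name: n for name, n in docs_per_file.items() if n > 1}

print(f"{len(metadatas)} documents from {len(docs_per_file)} files")
print(multi_doc_files)
```

With the real data, the same count can be taken over `doc.metadata` for each `doc` in `docs`.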

The `metadata` property already has a lot of useful information by default. There is a way to add more to it when loading the doc.

It is possible to create `Document` objects directly as well. A nice convenience method on `Document` is the `.example()` method that will generate some sample text for quick prototyping.

In [10]:
docsmap["ozymandias.txt"][0]

Document(id_='18982618-cbda-4b22-a4ca-2866b8e7bef0', embedding=None, metadata={'file_path': 'data/txts/ozymandias.txt', 'file_name': 'ozymandias.txt', 'file_type': 'text/plain', 'file_size': 637, 'creation_date': '2024-04-13', 'last_modified_date': '2024-04-10', 'last_accessed_date': '2024-04-13'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . . Near them, on the sand,\nHalf sunk a shattered visage lies, whose frown,\nAnd wrinkled lip, and sneer of cold command,\nTell that its sculptor well those passions read\nWhich yet survive, stamped on these lifeless things,\nThe hand that mocked them, and the heart that fed;\nAnd on the pedestal, thes

#### Reading PDFs
I'll need to install the `pypdf` package to read PDFs.

Even though `SimpleDirectoryReader` can read simple PDFs, it is not very good with even slightly stylized PDFs, as can be seen when reading the table in the conda cheat sheet.

```
Document(
  id_='e999ff9c-d654-45da-80fc-d0baf2d5edd3', 
  embedding=None, 
  metadata={
    'page_label': '1', 
    'file_name': 'conda-cheatsheet.pdf', 
    'file_path': 'data/pdfs/conda-cheatsheet.pdf', 
    'file_type': 'application/pdf', 
    'file_size': 299120, 
    'creation_date': '2024-04-10', 
    'last_modified_date': '2024-04-10', 
    'last_accessed_date': '2024-04-10'
  }, 
  excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
  excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
  relationships={}, 
  text='...',
  start_char_idx=None, 
  end_char_idx=None, 
  text_template='{metadata_str}\n\n{content}', 
  metadata_template='{key}: {value}', 
  metadata_seperator='\n'
)
```

In [12]:
print(docsmap["conda-cheatsheet.pdf"][0].text)

CONDA CHEAT SHEET
Command line package and environment manager
Learn to use conda in 30 minutes at bit.ly/tryconda TIP: Anaconda Navigator is a graphical interface to use conda.  
Double-click the Navigator icon on your desktop or in a Terminal or at 
the Anaconda prompt, type anaconda-navigator
CONTINUED ON BACK →conda info
conda update conda
conda install PACKAGENAME  
spyder 
conda update PACKAGENAME
COMMANDNAME --help  
conda install --helpConda basicsVerify conda is installed, check version numberUpdate conda to the current version
Install a package included in Anaconda
Run a package after install, example Spyder*Update any installed programCommand line help
 
*Must be installed and have a deployable command,  
usually PACKAGENAME
conda create --name py35 python=3.5 
WINDOWS:    activate py35  
LINUX, macOS: source activate py35conda env list
 
conda create --clone py35 --name py35-2
conda listconda list --revisionsconda install --revision 2
conda list --explicit > bio-env.txt
con

In [13]:
print(docsmap["conda-cheatsheet.pdf"][1].text)

conda create --name py34 python=3.4
Windows:   activate py34
Linux, macOS:  source activate py34
Windows:  where python
Linux, macOS: which -a pythonpython --versionInstalling and updating packages  
Install a new package (Jupyter Notebook)  
in the active environment
Run an installed package (Jupyter Notebook)
Install a new package (toolz) in a different environment 
(bio-env)  
Update a package in the current environmentInstall a package (boltons) from a speciﬁc channel 
(conda-forge)
Install a package directly from PyPI into the current active 
environment using pip 
Remove one or more packages (toolz, boltons)  
from a speciﬁc environment (bio-env)
Specifying version numbers
Ways to specify a package version number for use with conda create or conda install commands, and in meta.yaml ﬁles.
Constraint type Specification Result
Fuzzy numpy=1.11 1.11.0, 1.11.1, 1.11.2, 1.11.18 etc.
Exact numpy==1.11 1.11.0
Greater than or equal to "numpy>=1.11" 1.11.0 or higher
OR "numpy=1.11.1|1.11.3

In [17]:
docsmap["conda-cheatsheet.pdf"][0].metadata

{'page_label': '1',
 'file_name': 'conda-cheatsheet.pdf',
 'file_path': 'data/pdfs/conda-cheatsheet.pdf',
 'file_type': 'application/pdf',
 'file_size': 299120,
 'creation_date': '2024-04-13',
 'last_modified_date': '2024-04-10',
 'last_accessed_date': '2024-04-13'}

In [18]:
docsmap["conda-cheatsheet.pdf"][1].metadata

{'page_label': '2',
 'file_name': 'conda-cheatsheet.pdf',
 'file_path': 'data/pdfs/conda-cheatsheet.pdf',
 'file_type': 'application/pdf',
 'file_size': 299120,
 'creation_date': '2024-04-13',
 'last_modified_date': '2024-04-10',
 'last_accessed_date': '2024-04-13'}

#### Reading Images
I'll need to install the `Pillow` package to read images.

Even though `SimpleDirectoryReader` supports reading images, the `ImageDocument` object it produces is not very useful. I'd have expected the binary content of the image to be part of the object, just like the text content is, but I don't see it. Here is what the `ImageDocument` looks like:

```
ImageDocument(
    id_='cab49e8c-c19f-45ad-ba51-dfed099a3156', 
    embedding=None, 
    metadata={
        'file_path': 'data/imgs/python.png', 
        'file_name': 'python.png', 
        'file_type': 'image/png', 
        'file_size': 8758, 
        'creation_date': '2024-04-10', 
        'last_modified_date': '2023-11-02', 
        'last_accessed_date': '2024-04-10'
    }, 
    excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'],
    excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
    relationships={}, 
    text='', 
    start_char_idx=None, 
    end_char_idx=None, 
    text_template='{metadata_str}\n\n{content}', 
    metadata_template='{key}: {value}', 
    metadata_seperator='\n', 
    image=None, 
    image_path='data/imgs/python.png', 
    image_url=None, 
    image_mimetype=None, 
    text_embedding=None
)
```
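
Since `image` comes back as `None`, the bytes have to be read from `image_path` separately. Here is a minimal sketch using only the standard library. `read_image_b64` is a hypothetical helper of mine, and whether the resulting base64 string is what the `image` field expects is an assumption I have not verified.

```python
import base64
import tempfile
from pathlib import Path


def read_image_b64(image_path: str) -> str:
    """Read the file at image_path and return its bytes as a base64 string."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("ascii")


# Demo on a throwaway file so the snippet runs anywhere. In this notebook the
# argument would be docsmap["python.png"][0].image_path instead.
with tempfile.TemporaryDirectory() as tmp:
    fake_image = Path(tmp) / "fake.png"
    fake_image.write_bytes(b"\x89PNG\r\n\x1a\n")  # just the PNG signature bytes
    encoded = read_image_b64(str(fake_image))

print(encoded)
```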

In [19]:
docsmap["python.png"][0]

ImageDocument(id_='0307419d-da34-435c-9094-b277c0987937', embedding=None, metadata={'file_path': 'data/imgs/python.png', 'file_name': 'python.png', 'file_type': 'image/png', 'file_size': 8758, 'creation_date': '2024-04-10', 'last_modified_date': '2023-11-02', 'last_accessed_date': '2024-04-11'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', image=None, image_path='data/imgs/python.png', image_url=None, image_mimetype=None, text_embedding=None)

#### Customizing Documents
When creating a `Document` directly, we lose all the automatic metadata; it has to be set manually. Metadata is mostly used by the embedding model and the LLM. `Document.get_content()` uses `text_template`, `metadata_template`, and `metadata_seperator` (spelled that way in the library) to format the metadata together with the actual content. We can control which metadata fields each of these models sees via `excluded_embed_metadata_keys` and `excluded_llm_metadata_keys`.

Documents and their metadata can be customized pretty extensively as shown in the [documentation](http://localhost:8000/module_guides/loading/documents_and_nodes/usage_documents.html#customizing-documents). Below are some useful examples of customizing and using the metadata.

In [20]:
from llama_index import Document
from llama_index.schema import MetadataMode

In [21]:
with open("./data/txts/ozymandias.txt") as fin:
    doc = Document(text=fin.read())
doc

Document(id_='54057212-c168-4fc4-a266-b4a6c232f30f', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . . Near them, on the sand,\nHalf sunk a shattered visage lies, whose frown,\nAnd wrinkled lip, and sneer of cold command,\nTell that its sculptor well those passions read\nWhich yet survive, stamped on these lifeless things,\nThe hand that mocked them, and the heart that fed;\nAnd on the pedestal, these words appear:\nMy name is Ozymandias, King of Kings;\nLook on my Works, ye Mighty, and despair!\nNothing beside remains. Round the decay\nOf that colossal Wreck, boundless and bare\nThe lone and level sands stretch far away.”\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

In [22]:
doc.metadata["filepath"] = "data/txts/ozymandias.txt"
doc.metadata["file_name"] = "ozymandias.txt"
doc.metadata["file_type"] = "text/plain"

In [23]:
doc

Document(id_='54057212-c168-4fc4-a266-b4a6c232f30f', embedding=None, metadata={'filepath': 'data/txts/ozymandias.txt', 'file_name': 'ozymandias.txt', 'file_type': 'text/plain'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . . Near them, on the sand,\nHalf sunk a shattered visage lies, whose frown,\nAnd wrinkled lip, and sneer of cold command,\nTell that its sculptor well those passions read\nWhich yet survive, stamped on these lifeless things,\nThe hand that mocked them, and the heart that fed;\nAnd on the pedestal, these words appear:\nMy name is Ozymandias, King of Kings;\nLook on my Works, ye Mighty, and despair!\nNothing beside remains. Round the decay\nOf that colossal Wreck, boundless and bare\nThe lone and level sands stretch far away.”\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', m

In [24]:
Document.example()

Document(id_='0132bd28-e985-416e-aad2-d3a482aab93a', embedding=None, metadata={'filename': 'README.md', 'category': 'codebase'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='\nContext\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.\nProvides an advanced retrieval/query interface over your data:\nFeed in any LLM input prompt, get back retrieved co

In [28]:
list(docsmap["ozymandias.txt"][0].metadata.keys())

['file_path',
 'file_name',
 'file_type',
 'file_size',
 'creation_date',
 'last_modified_date',
 'last_accessed_date']

In [29]:
list(docsmap["ozymandias.txt"][0].excluded_embed_metadata_keys)

['file_name',
 'file_type',
 'file_size',
 'creation_date',
 'last_modified_date',
 'last_accessed_date']

In [30]:
list(docsmap["ozymandias.txt"][0].excluded_llm_metadata_keys)

['file_name',
 'file_type',
 'file_size',
 'creation_date',
 'last_modified_date',
 'last_accessed_date']

In [31]:
print(docsmap["ozymandias.txt"][0].text_template)
print("-" * 100)
print(docsmap["ozymandias.txt"][0].metadata_template)
print("-" * 100)
print(docsmap["ozymandias.txt"][0].metadata_seperator)  # If there are multiple metadata keys as part of full content.
print("-" * 100, end="")

{metadata_str}

{content}
----------------------------------------------------------------------------------------------------
{key}: {value}
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------

In [32]:
print(docsmap["ozymandias.txt"][0].get_content(metadata_mode=MetadataMode.LLM))

file_path: data/txts/ozymandias.txt

I met a traveller from an antique land,
Who said—“Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away.”


In [33]:
print(docsmap["ozymandias.txt"][0].get_content(metadata_mode=MetadataMode.EMBED))

file_path: data/txts/ozymandias.txt

I met a traveller from an antique land,
Who said—“Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away.”
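
Both modes print the same thing because `file_path` is the only metadata key that appears in neither exclusion list. The sketch below mimics that behaviour with plain Python. It is my reading of the outputs above, not the library's actual implementation, and `render_content` is a hypothetical name.

```python
def render_content(
    text: str,
    metadata: dict,
    excluded_keys: list,
    text_template: str = "{metadata_str}\n\n{content}",
    metadata_template: str = "{key}: {value}",
    metadata_seperator: str = "\n",  # (sic) mirrors the library's field name
) -> str:
    """Format the non-excluded metadata and the text using the templates."""
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    metadata_str = metadata_seperator.join(
        metadata_template.format(key=k, value=v) for k, v in visible.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)


# Sample values modelled on the ozymandias.txt document above.
metadata = {
    "file_path": "data/txts/ozymandias.txt",
    "file_name": "ozymandias.txt",
    "file_type": "text/plain",
}
excluded = ["file_name", "file_type"]

rendered = render_content("I met a traveller from an antique land,", metadata, excluded)
print(rendered)
```

Passing a different `excluded_keys` list for the embedding model and for the LLM is what would make the two views diverge.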


### Nodes
Typically a single document (or a single page of a multi-page document) is represented by a single `Document` object. But this is not necessarily how I might want to save the document to the index. A typical use case is to break the document into multiple chunks and then index the chunks individually. This makes the retrieval more precise. A document chunk is called a **Node** in Llama-Index. The type of a document chunk, aka Node, is actually `TextNode`. Confusingly enough, there is also a `Node` class in this class family.

![doc-class](./doc-class.png)

But for the most part, when the documentation is referring to a "Node" it is referring to `TextNode`.
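
To make the idea of chunking concrete, here is a naive splitter that packs whole sentences into chunks under a size budget. This is not `SentenceSplitter`'s algorithm; `chunk_sentences` and `max_chars` are hypothetical names used only for illustration.

```python
import re


def chunk_sentences(text: str, max_chars: int = 80) -> list:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Documents can be long. Retrieval works on chunks. "
    "Each chunk is indexed on its own. Smaller chunks give more precise hits."
)
chunks = chunk_sentences(text, max_chars=50)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```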

In [34]:
from llama_index.node_parser import SentenceSplitter

In [35]:
mddocs = SimpleDirectoryReader("./data/mds/").load_data()
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(mddocs)
print(f"Type of document is {type(mddocs[0])}, type of node is a {type(nodes[0])}")
print(f"{len(mddocs)} documents were split into {len(nodes)} nodes")

Type of document is <class 'llama_index.schema.Document'>, type of node is a <class 'llama_index.schema.TextNode'>
1 documents were split into 1 nodes


In [None]:
nodes[0]

In [38]:
nodes = parser.get_nodes_from_documents([docsmap["sunset.md"][0]])
len(nodes)

1

In [40]:
from llama_index.node_parser.file import SimpleFileNodeParser

In [41]:
parser = SimpleFileNodeParser()
nodes = parser.get_nodes_from_documents([docsmap["sunset.md"][0]])
len(nodes)

KeyError: 'extension'

In [43]:
from llama_index.readers.file.flat_reader import FlatReader
from pathlib import Path

In [50]:
md_docs = FlatReader().load_data(Path("./data/mds/books.md"))
len(md_docs)

1

In [51]:
md_docs[0].metadata

{'filename': 'books.md', 'extension': '.md'}
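
The `KeyError: 'extension'` earlier suggests that `SimpleFileNodeParser` looks up `metadata["extension"]`, a key that `FlatReader` sets and `SimpleDirectoryReader` does not. I have not confirmed this in the library source. Under that assumption, the missing key could be derived from `file_name`; the sketch below does so on a plain dict standing in for `doc.metadata`, and `with_extension` is a hypothetical helper.

```python
from pathlib import Path


def with_extension(metadata: dict) -> dict:
    """Return a copy of metadata with an 'extension' key derived from 'file_name'."""
    enriched = dict(metadata)  # copy so the original metadata is left untouched
    enriched["extension"] = Path(metadata["file_name"]).suffix
    return enriched


meta = {"file_name": "sunset.md"}
enriched_meta = with_extension(meta)
print(enriched_meta)
```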

In [52]:
nodes = parser.get_nodes_from_documents(md_docs)
len(nodes)

15

In [53]:
nodes[0]

TextNode(id_='1e939504-e9e5-4100-83c8-a345682d64fe', embedding=None, metadata={'Header 1': 'Books', 'filename': 'books.md', 'extension': '.md'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='633e0cc9-602d-4150-9380-9deb46cf4aee', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'books.md', 'extension': '.md'}, hash='54f6dbc1566612a87ec0f9d2266d7502d2de2e5ad8e4bae9856c1c35190713ea'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='57763c71-15b2-490e-8433-9a981296b8ed', node_type=<ObjectType.TEXT: '1'>, metadata={'Header 1': 'Books', 'Header 2': 'Business', 'filename': 'books.md', 'extension': '.md'}, hash='229bcfd44b3585334ce8c3fa2b6c8a3c9e398d6cdaa60130274df4d751fe8d11')}, text='Books\n\n==Highlighted== books are books I want to read next.', start_char_idx=2, end_char_idx=61, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

In [54]:
for i, node in enumerate(nodes):
    print(f"Chunk {i} - ")
    print(node.text)

Chunk 0 - 
Books

==Highlighted== books are books I want to read next.
Chunk 1 - 
Business

[FREE - The Future of a Radical Price](https://www.amazon.com/FREE-Future-Radical-Price/dp/B0055PK366/)

[Seeing What's Next](https://www.amazon.com/Seeing-Whats-Next-Theories-Innovation/dp/1591391857/)

[Measure What Matters](https://www.amazon.com/Measure-What-Matters-Google-Foundation/dp/0525536221/)

[Crossing the Chasm](https://www.amazon.com/Crossing-Chasm-3rd-Disruptive-Mainstream/dp/0062356852/)

==[Hooked: How to Build Habit-Forming Products](https://www.amazon.com/Hooked-How-Build-Habit-Forming-Products/dp/0241184835/)==

[Reinventing Organizations](https://www.amazon.com/Reinventing-Organizations-Frederic-Laloux/dp/2960133501/)

[The Truth About New Rules of Business Writing](https://www.amazon.com/Truth-About-Rules-Business-Writing-ebook/dp/B0031PXEGS/)

[The Staff Engineer's Path](https://www.amazon.com/Staff-Engineers-Path-Individual-Contributors/dp/1098118731/)
Chunk 2 - 
Popular 

## Misc Notes
`SimpleDirectoryReader` reads all the files in the given directory, recursively if specified, and runs each file through a format-specific reader, which may split a single file into multiple chunks. The resulting chunks are still `Document` objects. E.g., for PDFs, it will chunk the document per page, with each page getting its own `Document` object. With Markdown files it will chunk based on headers (I have verified H1 and H2, don't know about the rest).

`FlatReader`, on the other hand, will simply read the file as is, without any format-specific parsing. The resulting `Document` object then needs to be passed through another, more specific node parser. Using the `SimpleFileNodeParser` will result in behavior similar to `SimpleDirectoryReader`. The difference is that `SimpleDirectoryReader` will use `MarkdownReader` to read Markdown files but `SimpleFileNodeParser` will use `MarkdownNodeParser`. The node parsers are better because they save the header information in the metadata, something that the readers do not.
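
To illustrate what "saving the header information in the metadata" means, here is a standard-library sketch that splits Markdown at header lines and records the header trail per chunk, using the same `Header 1` / `Header 2` key style seen in the `TextNode` output earlier. It imitates the observed behaviour only and is not `MarkdownNodeParser`'s implementation. `split_markdown_by_headers` is a hypothetical name, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_headers(markdown: str) -> list:
    """Split markdown at header lines; each chunk records its header trail."""
    chunks = []
    trail = {}  # header level -> title currently in effect
    body = []   # lines of the chunk being built

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(trail.items())}
            chunks.append({"text": text, "metadata": metadata})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the previous chunk under the old header trail
            level, title = len(match.group(1)), match.group(2).strip()
            # A new header replaces any header at the same or a deeper level.
            trail = {lvl: t for lvl, t in trail.items() if lvl < level}
            trail[level] = title
            body = [title]
        else:
            body.append(line)
    flush()
    return chunks


sample = (
    "# Books\n\n==Highlighted== books are books I want to read next.\n\n"
    "## Business\n\nCrossing the Chasm\n"
)
chunks = split_markdown_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], repr(chunk["text"]))
```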