In [1]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Loading

Reading documents is a crucial part of a RAG-based LLM chatbot (or any chatbot that needs context to generate an answer).  
So you need to be comfortable loading documents in a wide variety of formats.  

No small talk now, let's load it!
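
As a rough map of this notebook, the loaders used in the sections below can be summarized as a lookup from file extension to loader class name. This is only a sketch: `LOADER_BY_SUFFIX` and `loader_name_for` are hypothetical helpers, not part of LangChain, and the mapping returns class names as strings so it runs without any extra installs.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> name of the loader class
# used in this notebook (the classes live in langchain_community.document_loaders).
LOADER_BY_SUFFIX = {
    ".md": "UnstructuredMarkdownLoader",
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".pdf": "PyMuPDFLoader",
}


def loader_name_for(path):
    """Return the loader class name for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively, so ".PDF" works too
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No loader registered for {suffix!r}") from None


print(loader_name_for("documents/markdown_file.md"))
```

Keeping the choice in one table makes it easy to see which formats are covered and to fail loudly on the ones that are not.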

## Markdown format

In [2]:
markdown_path = "documents/markdown_file.md"
loader = UnstructuredMarkdownLoader(markdown_path)

In [3]:
docs = loader.load()

In [4]:
len(docs)

1

In [27]:
docs

[Document(page_content='Markdown Loader\n\nHere is an example of a file in .md format.\n\nThis type of file is mostly used in programming on daily basis,\n\nso if you need to use it as a data source, be sure you choose the right Loader and proper Chunking strategies.\n\nAdvantages\n\nits format is mostly easy to extract\n\nthere is one good trick you have to learn, which is to convert your pdf file to markdown file and then load it from the markdown. This way can save your life in the future.', metadata={'source': 'documents/markdown_file.md'})]

In [28]:
print(docs[0].page_content)

Markdown Loader

Here is an example of a file in .md format.

This type of file is mostly used in programming on daily basis,

so if you need to use it as a data source, be sure you choose the right Loader and proper Chunking strategies.

Advantages

its format is mostly easy to extract

there is one good trick you have to learn, which is to convert your pdf file to markdown file and then load it from the markdown. This way can save your life in the future.


## Text format

In [32]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader(file_path='./documents/text_file.txt')
docs = loader.load()

In [33]:
docs

[Document(page_content='Here is an example of a file in .txt format. This type of file is mostly used in note taking or casual drafts, so if you need to use it as a data source, be sure you choose the right Loader and proper Chunking strategies.\n', metadata={'source': './documents/text_file.txt'})]

In [34]:
print(docs[0].page_content)

Here is an example of a file in .txt format. This type of file is mostly used in note taking or casual drafts, so if you need to use it as a data source, be sure you choose the right Loader and proper Chunking strategies.



## CSV format

In [29]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./documents/csv_file.csv')
docs = loader.load()

In [30]:
docs

[Document(page_content='item_no: 1\nformat: excel\ndescription: this is an example of data in excel format', metadata={'source': './documents/csv_file.csv', 'row': 0}),
 Document(page_content='item_no: 2\nformat: csv\ndescription: this is an example of data in csv format', metadata={'source': './documents/csv_file.csv', 'row': 1})]

In [31]:
print(docs[0].page_content)

item_no: 1
format: excel
description: this is an example of data in excel format
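
The output above shows one Document per CSV row, with each column rendered as a `key: value` line and the row index stored in the metadata. The sketch below imitates that mapping with the standard library only. It is not `CSVLoader`'s actual implementation, `rows_as_documents` is a hypothetical name, and the sample CSV text is an assumption reconstructed from the output above.

```python
import csv
import io


def rows_as_documents(csv_text, source):
    """Turn each CSV row into a dict shaped like a loaded Document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per column, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


# Assumed file contents, reconstructed from the loader output.
sample_csv = (
    "item_no,format,description\n"
    "1,excel,this is an example of data in excel format\n"
    "2,csv,this is an example of data in csv format\n"
)

sketch_docs = rows_as_documents(sample_csv, "./documents/csv_file.csv")
print(sketch_docs[0]["page_content"])
```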


## Excel format  
Excel files are harder to load cleanly, so you may need a workaround such as:
- read the Excel file, convert it to CSV, and load the CSV from there
- find a library that can handle tables or irregular templates
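
The first workaround could look roughly like the sketch below. It assumes pandas (plus openpyxl for `.xlsx` files) is installed and that the data sits in the first sheet as a regular table. `excel_to_csv` and the output path are hypothetical names chosen for illustration.

```python
def excel_to_csv(xlsx_path, csv_path, sheet_name=0):
    """Convert one worksheet to a CSV file and return the CSV path."""
    # Imported lazily so that defining the helper does not require pandas.
    import pandas as pd

    # sheet_name=0 reads the first sheet; the first row is used as the header.
    frame = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    # index=False keeps the pandas row index out of the CSV.
    frame.to_csv(csv_path, index=False)
    return csv_path


# Example usage (paths are assumptions):
# csv_path = excel_to_csv("./documents/excel_file.xlsx", "./documents/excel_as_csv.csv")
# docs = CSVLoader(file_path=csv_path).load()
```

Going through CSV gives one Document per row with named columns, which is usually easier to chunk than the flattened text shown in the sections below.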

### Without mode="elements"

In [54]:
from langchain_community.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader(file_path='./documents/excel_file.xlsx')
docs = loader.load()
docs

[Document(page_content='\n\n\nitem_no\nformat\ndescription\n\n\n1\nexcel\nthis is an example of data in excel format\n\n\n2\ncsv\nthis is an example of data in csv format\n\n\n', metadata={'source': './documents/excel_file.xlsx'})]

In [55]:
print(docs[0].page_content)




item_no
format
description


1
excel
this is an example of data in excel format


2
csv
this is an example of data in csv format





### With mode="elements"

In [56]:
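# mode="elements" keeps each parsed element as its own Document and adds
# richer metadata (for example 'text_as_html' and 'page_name').
loader = UnstructuredExcelLoader(file_path='./documents/excel_file.xlsx', mode="elements")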
docs = loader.load()
docs

[Document(page_content='\n\n\nitem_no\nformat\ndescription\n\n\n1\nexcel\nthis is an example of data in excel format\n\n\n2\ncsv\nthis is an example of data in csv format\n\n\n', metadata={'source': './documents/excel_file.xlsx', 'file_directory': './documents', 'filename': 'excel_file.xlsx', 'last_modified': '2024-06-01T13:10:49', 'page_name': 'Sheet1', 'page_number': 1, 'text_as_html': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>item_no</td>\n      <td>format</td>\n      <td>description</td>\n    </tr>\n    <tr>\n      <td>1</td>\n      <td>excel</td>\n      <td>this is an example of data in excel format</td>\n    </tr>\n    <tr>\n      <td>2</td>\n      <td>csv</td>\n      <td>this is an example of data in csv format</td>\n    </tr>\n  </tbody>\n</table>', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'category': 'Table'})]

In [57]:
print(docs[0].page_content)




item_no
format
description


1
excel
this is an example of data in excel format


2
csv
this is an example of data in csv format
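
The printed text is the same in both modes, but with `mode="elements"` the table structure survives in `metadata['text_as_html']`. A minimal sketch of recovering the rows from that HTML with the standard library follows. `TableRows` and `table_rows` are hypothetical helpers, and the sketch assumes a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being parsed
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Whitespace between tags arrives here too; only keep text inside cells.
        if self._in_cell:
            self._row[-1] += data


def table_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Example usage with the Document loaded above:
# rows = table_rows(docs[0].metadata["text_as_html"])
```

The first row holds the column names, so the remaining rows can be zipped with it to rebuild `key: value` records like the ones the CSV loader produces.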





## PDF format

In [66]:
from langchain_community.document_loaders.pdf import PyMuPDFLoader

loader = PyMuPDFLoader(file_path="documents/pdf_file.pdf")
docs = loader.load()
docs

[Document(page_content='The PDF Document Loader\nHere is an example of a file in .pdf format. This type of file is mostly used in contract, agreement,\nlegal documents, so if you need to use it as a data source, be sure you choose the right Loader and proper\nChunking strategies.\nRemember\nThe pdf file is tricky because it can come in different formats as the following examples:\nOne Column\nThis is an example of one column pdf document. You can see in this part that it stretches the text from the\nleft to the right as one long single piece.\nTwo Columns\nThis is an example of one column pdf document.\nYou can see in this part that it doesn’t stretch the\ntext from the left to the right as one long single\npiece. Instead, it breaks into the new columns\nConclusion\nUsing pdf files is a challenging task, you have to debug often. Also learn their limitations, and leverage\nthem wisely, so you don’t end up fixing a bug until you die!!\nNote:\n-\nText pdf is feasible, but Image pdf is ter

In [67]:
print(docs[0].page_content)

The PDF Document Loader
Here is an example of a file in .pdf format. This type of file is mostly used in contract, agreement,
legal documents, so if you need to use it as a data source, be sure you choose the right Loader and proper
Chunking strategies.
Remember
The pdf file is tricky because it can come in different formats as the following examples:
One Column
This is an example of one column pdf document. You can see in this part that it stretches the text from the
left to the right as one long single piece.
Two Columns
This is an example of one column pdf document.
You can see in this part that it doesn’t stretch the
text from the left to the right as one long single
piece. Instead, it breaks into the new columns
Conclusion
Using pdf files is a challenging task, you have to debug often. Also learn their limitations, and leverage
them wisely, so you don’t end up fixing a bug until you die!!
Note:
-
Text pdf is feasible, but Image pdf is terrible
-
Handwritten in the pdf file can bit

## HTML format

### Common way to load HTML

In [68]:
from langchain_community.document_loaders import UnstructuredURLLoader

urls = [
    "https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/",
]

loader = UnstructuredURLLoader(urls=urls)
docs = loader.load()
docs

[Document(page_content='\n\nComponents\n\nDocument loaders\n\nURL\n\nURL\n\nThis example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.\n\nUnstructured URL Loader\u200b\n\nYou have to install the unstructured library:\n\n!pip install\n\nU unstructured\n\nfrom\n\nlangchain_community\n\ndocument_loaders\n\nimport\n\nUnstructuredURLLoader\n\nAPI Reference:\n\nUnstructuredURLLoader\n\nurls\n\n"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"\n\n"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"\n\nPass in ssl_verify=False with headers=headers to get past ssl_verification error.\n\nloader\n\nUnstructuredURLLoader\n\nurls\n\nurls\n\ndata\n\nloader\n\nload\n\nSelenium URL Loader\u200b\n\nThis covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.\n\nUsing Selenium allows us to load pages that require Java

In [69]:
print(docs[0].page_content)



Components

Document loaders

URL

URL

This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.

Unstructured URL Loader​

You have to install the unstructured library:

!pip install

U unstructured

from

langchain_community

document_loaders

import

UnstructuredURLLoader

API Reference:

UnstructuredURLLoader

urls

"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"

"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"

Pass in ssl_verify=False with headers=headers to get past ssl_verification error.

loader

UnstructuredURLLoader

urls

urls

data

loader

load

Selenium URL Loader​

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium allows us to load pages that require JavaScript to render.

To use the SeleniumURLLoader, you have to install selenium and unstructured

### If the site needs JavaScript to render

In [72]:
from langchain_community.document_loaders import SeleniumURLLoader

loader = SeleniumURLLoader(urls=urls)
docs = loader.load()
docs

[Document(page_content='\n\nComponents\n\nDocument loaders\n\nURL\n\nURL\n\nThis example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.\n\nUnstructured URL Loader\u200b\n\nYou have to install the unstructured library:\n\n!pip install\n\nU unstructured\n\nfrom\n\nlangchain_community\n\ndocument_loaders\n\nimport\n\nUnstructuredURLLoader\n\nAPI Reference:\n\nUnstructuredURLLoader\n\nurls\n\n"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"\n\n"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"\n\nPass in ssl_verify=False with headers=headers to get past ssl_verification error.\n\nloader\n\nUnstructuredURLLoader\n\nurls\n\nurls\n\ndata\n\nloader\n\nload\n\nSelenium URL Loader\u200b\n\nThis covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.\n\nUsing Selenium allows us to load pages that require Java

In [73]:
print(docs[0].page_content)



Components

Document loaders

URL

URL

This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.

Unstructured URL Loader​

You have to install the unstructured library:

!pip install

U unstructured

from

langchain_community

document_loaders

import

UnstructuredURLLoader

API Reference:

UnstructuredURLLoader

urls

"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"

"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"

Pass in ssl_verify=False with headers=headers to get past ssl_verification error.

loader

UnstructuredURLLoader

urls

urls

data

loader

load

Selenium URL Loader​

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium allows us to load pages that require JavaScript to render.

To use the SeleniumURLLoader, you have to install selenium and unstructured

# Playground

In [6]:
from langchain_community.document_loaders import SeleniumURLLoader
urls = [
    'https://www.reddit.com/r/Expeditions/comments/1b84ow3/pug_is_too_way_imbalanced/'
]

loader = SeleniumURLLoader(urls=urls)
docs = loader.load()
docs

[Document(page_content="whoa there, pardner!\n\nYour request has been blocked due to a network policy.\n\nTry logging in or creating an account here to get back to browsing.\n\nIf you're running a script or application, please register or sign in with your developer credentials here. Additionally make sure your User-Agent is not empty and is something unique and descriptive and try again. if you're supplying an alternate User-Agent string,\ntry changing back to default as that can sometimes result in a block.\n\nYou can read Reddit's Terms of Service here.\n\nif you think that we've incorrectly blocked you or you would like to discuss\neasier ways to get the data you want, please file a ticket here.\n\nwhen contacting us, please include your ip address which is: 184.22.33.218 and reddit account", metadata={'source': 'https://www.reddit.com/r/Expeditions/comments/1b84ow3/pug_is_too_way_imbalanced/', 'title': 'Blocked', 'description': 'No description found.', 'language': 'No language fou

In [7]:
len(docs)

1

In [8]:
print(docs[0].page_content)

whoa there, pardner!

Your request has been blocked due to a network policy.

Try logging in or creating an account here to get back to browsing.

If you're running a script or application, please register or sign in with your developer credentials here. Additionally make sure your User-Agent is not empty and is something unique and descriptive and try again. if you're supplying an alternate User-Agent string,
try changing back to default as that can sometimes result in a block.

You can read Reddit's Terms of Service here.

if you think that we've incorrectly blocked you or you would like to discuss
easier ways to get the data you want, please file a ticket here.

when contacting us, please include your ip address which is: 184.22.33.218 and reddit account
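
Note that `loader.load()` still returned one Document here, even though the site blocked the request, so a block page can slip into the index unnoticed. One way to guard against that is a small check before indexing. The sketch below is a heuristic under stated assumptions: the marker phrases and the `'Blocked'` title are taken from the response above, `looks_blocked` is a hypothetical helper, and `Doc` is a stand-in for a LangChain Document so the example runs on its own.

```python
from dataclasses import dataclass, field

# Phrases copied from the blocked response above; other sites will differ.
BLOCK_MARKERS = (
    "your request has been blocked",
    "whoa there, pardner",
)


@dataclass
class Doc:
    """Stand-in with the same two attributes as a loaded Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def looks_blocked(doc):
    """Return True if the document looks like a block page, not real content."""
    title = str(doc.metadata.get("title", "")).lower()
    text = doc.page_content.lower()
    return title == "blocked" or any(marker in text for marker in BLOCK_MARKERS)


blocked = Doc(
    "whoa there, pardner!\n\nYour request has been blocked due to a network policy.",
    {"title": "Blocked"},
)
print(looks_blocked(blocked))
```

Filtering with something like `[d for d in docs if not looks_blocked(d)]` keeps such pages out of the chunks that are later embedded.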
