# Document Loader using LangChain
An overview of the different document loaders available in LangChain:
https://python.langchain.com/docs/integrations/document_loaders/
Topics covered: loading different file types (text, PDF, Markdown, JSON, CSV, and DOCX), loading content directly from websites, and handling unstructured files.
LangChain uses document loaders to gather information from sources such as websites, files, and databases and to prepare it for further use. Document loaders act as connectors: they pull in data and convert it into a format LangChain can work with.
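
To make the connector idea concrete, here is a minimal standard-library sketch of what a loader does conceptually: it reads a source and wraps the text, plus some metadata, in a document object. `SimpleDocument` and `SimpleTextLoader` are hypothetical names used only for illustration. They are not LangChain classes, although they mirror the `page_content` / `metadata` shape that LangChain documents expose.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader: reads one text file into a one-element list."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = str(file_path)
        self.encoding = encoding

    def load(self):
        # Read the whole file and record where the text came from.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        return [SimpleDocument(page_content=text, metadata={"source": self.file_path})]


# Write a small sample file so the sketch runs on its own.
sample_path = Path(tempfile.mkdtemp()) / "sample.txt"
sample_path.write_text(
    "1. Code of Conduct\n\nIntegrity, respect, accountability.", encoding="utf-8"
)

docs = SimpleTextLoader(sample_path).load()
print(len(docs))
print(docs[0].metadata)
print(docs[0].page_content.splitlines()[0])
```

The real loaders follow the same pattern: configure the loader with a source, then call `load()` to get a list of documents.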


## Loading documents from different sources
- Plain text files: use the TextLoader class in LangChain. Loaders expose a load method that reads data from the configured source and returns it as documents.
- PDF files: the PyPDFLoader class in LangChain is the go-to tool. You can also use PyMuPDFLoader, the fastest PDF parsing option in LangChain. The key difference between the two is that PyMuPDFLoader includes more comprehensive metadata.
  - Common problems when loading PDFs: preserving structure, images, tables, and so on.
- Markdown files: LangChain offers the UnstructuredMarkdownLoader class.
- JSON files: the JSONLoader class is designed specifically for JSON. It uses a jq schema to select the parts of the JSON file you need.
- CSV files: the CSVLoader class converts each row of data into a separate document object. To load all the data into a single document object instead, use UnstructuredCSVLoader. This is useful when analyzing the data as a table rather than as individual entries.
- Web pages: you can use BeautifulSoup directly or LangChain's WebBaseLoader, which is designed to extract all text from HTML web pages and convert it into a document format suitable for downstream processing. The loader captures only the text of the page, leaving out HTML tags and links.
- Unknown or varied file formats: for projects that need this flexibility, use the UnstructuredFileLoader class. It is a general-purpose loader that supports many file types, including text files, PowerPoint presentations, HTML pages, PDFs, images, and more.
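
The difference between the two CSV strategies described above (one document per row versus one document for the whole table) can be sketched with the standard library alone. This is an illustration of the idea, not LangChain's implementation: `SimpleDocument`, `load_rows_as_documents`, `load_table_as_document`, and the sample data are all hypothetical, and the `column: value` formatting is an assumption chosen for readability.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up sample data for illustration.
CSV_TEXT = "name,role\nAda,engineer\nGrace,admiral\n"


def load_rows_as_documents(csv_text, source="sample.csv"):
    """Row-per-document strategy: each CSV row becomes its own document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(SimpleDocument(content, {"source": source, "row": i}))
    return docs


def load_table_as_document(csv_text, source="sample.csv"):
    """Whole-table strategy: the entire file becomes a single document."""
    return [SimpleDocument(csv_text.strip(), {"source": source})]


row_docs = load_rows_as_documents(CSV_TEXT)
table_docs = load_table_as_document(CSV_TEXT)

print(len(row_docs))    # two rows, so two documents
print(row_docs[0].page_content)
print(len(table_docs))  # one document for the whole table
```

Row-level documents suit retrieval over individual records, while a single table document keeps the rows together for questions about the data as a whole.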


Text-splitter strategies

In [None]:
# import necessary libraries

from pprint import pprint
import json
from pathlib import Path
import nltk
import urllib.request
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.document_loaders import JSONLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import UnstructuredFileLoader

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

USER_AGENT environment variable not set, consider setting it to identify your requests.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\jayne\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\jayne\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [None]:
# loading txt files
# first download sample file

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt"
filename = 'new-Policies.txt'
urllib.request.urlretrieve(url, filename)


('new-Policies.txt', <http.client.HTTPMessage at 0x293ceacb130>)

In [12]:
# using the TextLoader class to load the file
loader = TextLoader(filename)
loader

<langchain_community.document_loaders.text.TextLoader at 0x293ceb3fb60>

In [None]:
# use the load method to load the data as documents
data = loader.load()

# load() returns a list of Document objects, each with metadata and page_content
data

[Document(metadata={'source': 'new-Policies.txt'}, page_content="1. Code of Conduct\n\nOur Code of Conduct establishes the core values and ethical standards that all members of our organization must adhere to. We are committed to fostering a workplace characterized by integrity, respect, and accountability.\n\nIntegrity: We commit to the highest ethical standards by being honest and transparent in all our dealings, whether with colleagues, clients, or the community. We protect sensitive information and avoid conflicts of interest.\n\nRespect: We value diversity and every individual's contribution. Discrimination, harassment, or any form of disrespect is not tolerated. We promote an inclusive environment where differences are respected, and everyone is treated with dignity.\n\nAccountability: We are responsible for our actions and decisions, complying with all relevant laws and regulations. We aim for continuous improvement and report any breaches of this code, supporting investigations