#### Document Loaders

Document loaders ingest data from various sources and formats. They transform external data into a consistent Document format, which can then be used for tasks such as question answering and summarization.


![alt text](<2.png>)
**A Document contains text (page_content) and metadata**
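To make the two fields concrete, here is a minimal sketch of a Document-like object. `SimpleDocument` is a hypothetical stand-in written for illustration, not the library's own class; only the field names `page_content` and `metadata` come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# The sample text and metadata values are made up for this example.
doc = SimpleDocument(
    page_content="Welcome to CS229, the machine learning class.",
    metadata={"source": "lecture01.pdf", "page": 0},
)

print(doc.page_content[:7])    # first characters of the text
print(doc.metadata["source"])  # where the text came from
```

Keeping the text and its provenance together means later steps can always trace an answer back to its source.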

Which loader to use depends on the type of data you're working with. The following are commonly used:

TextLoader: A workhorse. Used for almost any text-based file or data source.

PyPDFLoader: Very common for loading reports, documents, and publications in PDF format. Often paired with a text splitter to break documents into smaller chunks for processing.

WebBaseLoader / UnstructuredURLLoader: Used for loading data from web pages such as blogs, articles, and other sites. UnstructuredURLLoader is especially popular.

CSVLoader: Useful for loading tabular data stored in CSV files.

JSONLoader: Used when your data is in JSON format.

UnstructuredMarkdownLoader: Useful for processing Markdown documentation, wikis, and websites.

DirectoryLoader: Used to organize and apply other loaders to files within a directory. Frequently paired with other loaders.

Docx2txtLoader / UnstructuredWordDocumentLoader: Used to load Microsoft Word documents.
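The idea of a directory loader that hands each file to a format-specific loader can be sketched with the standard library alone. This is an illustration under stated assumptions, not how the library implements it: `load_text`, `load_csv`, `load_directory`, and the `LOADERS` registry are hypothetical names, and a Document is modelled as a plain dict.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# A Document is modelled here as a plain dict with page_content and metadata.
Doc = Dict[str, object]


def load_text(path: Path) -> List[Doc]:
    """The whole file becomes a single document."""
    return [{"page_content": path.read_text(encoding="utf-8"),
             "metadata": {"source": str(path)}}]


def load_csv(path: Path) -> List[Doc]:
    """Each CSV row becomes its own document, one 'column: value' per line."""
    docs: List[Doc] = []
    with path.open(newline="", encoding="utf-8") as handle:
        for row_index, row in enumerate(csv.DictReader(handle)):
            text = "\n".join(f"{key}: {value}" for key, value in row.items())
            docs.append({"page_content": text,
                         "metadata": {"source": str(path), "row": row_index}})
    return docs


# Hypothetical registry: file extension -> loader function.
LOADERS: Dict[str, Callable[[Path], List[Doc]]] = {
    ".txt": load_text,
    ".md": load_text,
    ".csv": load_csv,
}


def load_directory(directory: Path) -> List[Doc]:
    """Apply the matching loader to every supported file, in sorted order."""
    docs: List[Doc] = []
    for path in sorted(directory.rglob("*")):
        loader = LOADERS.get(path.suffix.lower())
        if path.is_file() and loader is not None:
            docs.extend(loader(path))
    return docs


# Build a throwaway directory with made-up files to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.txt").write_text("hello loaders", encoding="utf-8")
    (root / "people.csv").write_text("name,role\nAda,engineer\nLin,editor\n",
                                     encoding="utf-8")
    demo_docs = load_directory(root)

print(len(demo_docs))  # one document for the text file plus one per CSV row
```

Every loader returns the same shape whatever the input format, so the rest of a pipeline does not need to know what kind of file the text came from.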
    

In [2]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv
load_dotenv()  # read environment variables (such as the OpenAI key) from a local .env file

# openai.api_key = os.environ['OPENAI_KEY']

### PDFs

In [3]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf")
pages = loader.load()



In [4]:
len(pages)
# The PDF has 22 pages; each page is loaded as a separate Document.

22

In [6]:
page = pages[0]
print(page)

page_content='MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the most exciting field of all the computer 
sciences. So I'm actually always excited about teaching this class. Sometimes I actually 
think that machine learning is not only the most exciting thing in computer science, but 
the most exciting thing in all of human endeavor, so maybe a little bias there.  
I also want to introduce the TAs, who are all graduate students doing research in or 
related to the machine learning and all aspects of machine learning. Paul Baumstarck 
works in ma

In [7]:
print(page.page_content[0:500])
# page_content holds the extracted text of the page.

MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the 


In [9]:
print(page.metadata)
# metadata describes where the text came from (source, page number, PDF properties).

{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 0, 'page_label': '1'}
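The per-page metadata shown above (`source`, `total_pages`, `page`, `page_label`) can be imitated with a small sketch. `pages_to_documents` is a hypothetical helper, and deriving `page_label` as the index plus one is an assumption made for this example; a real PDF may carry its own labels.

```python
from typing import Dict, List


def pages_to_documents(page_texts: List[str], source: str) -> List[Dict[str, object]]:
    """Wrap each page's text in a dict that carries page-level metadata."""
    total = len(page_texts)
    return [
        {
            "page_content": text,
            "metadata": {
                "source": source,
                "total_pages": total,
                "page": index,                 # zero-based index
                "page_label": str(index + 1),  # assumed human-readable label
            },
        }
        for index, text in enumerate(page_texts)
    ]


# Made-up page texts and file name, used only to exercise the helper.
sample = pages_to_documents(["first page", "second page", "third page"], "example.pdf")
print(sample[0]["metadata"])
```

Because every page keeps its own index, a later step can cite the exact page a passage came from.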


### YouTube Loader

In [None]:
# ! pip install yt_dlp
# ! pip install pydub

In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(
    # YoutubeAudioLoader([url], save_dir),  # download the audio from YouTube
    FileSystemBlobLoader(save_dir, glob="*.m4a"),  # read audio files already saved locally
    OpenAIWhisperParser(),  # transcribe the audio to text
)
docs = loader.load()

In [None]:
docs[0].page_content[0:500]
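The "fetch locally" option amounts to collecting the files in `save_dir` that match a glob pattern. A rough standard-library sketch of that selection step follows; `find_audio_files` is a hypothetical helper and says nothing about how the library does it internally.

```python
import tempfile
from pathlib import Path
from typing import List


def find_audio_files(save_dir: Path, pattern: str = "*.m4a") -> List[Path]:
    """Return files directly inside save_dir whose names match pattern."""
    return sorted(p for p in save_dir.glob(pattern) if p.is_file())


# Build a throwaway directory with made-up files to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    audio_dir = Path(tmp)
    (audio_dir / "lecture.m4a").write_bytes(b"")
    (audio_dir / "notes.txt").write_text("not audio", encoding="utf-8")
    found = [p.name for p in find_audio_files(audio_dir)]

print(found)
```

Only the matching audio file is picked up, so unrelated files in the same directory are never sent to the transcription step.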

### Loading from a URL with WebBaseLoader

In [27]:
from langchain.document_loaders import WebBaseLoader
 

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")

In [33]:
web_docs = loader.load()

In [41]:
print(web_docs[0].metadata)
# For a web page, metadata holds the source URL, title, description, and language.

{'source': 'https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md', 'title': 'handbook/titles-for-programmers.md at master · basecamp/handbook · GitHub', 'description': 'Basecamp Employee Handbook. Contribute to basecamp/handbook development by creating an account on GitHub.', 'language': 'en'}


In [39]:
print(web_docs[0].page_content[:2000])

handbook/titles-for-programmers.md at master · basecamp/handbook · GitHub
Skip to content
Navigation Menu
Toggle navigation
Sign in
Product
GitHub Copilot
Write better code with AI
Security
Find and fix vulnerabilities
Actions
Automate any workflow
Codespaces
Instant dev environments
Issues
Plan and track work
Code Review
Manage code changes
Discussions
Collaborate outside of code
Code Search
Find more, search less
Explore
All features
Documentation
GitHub Skills
Blog
Solutions
By company size
En
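Text loaded from a web page can carry navigation labels and many whitespace-only lines in `page_content`. One possible post-processing step, sketched below, strips each line and drops the empty ones. `collapse_whitespace` is a hypothetical helper, and the sample string is a made-up fragment in the style of the output above.

```python
def collapse_whitespace(text: str) -> str:
    """Strip each line and drop empty ones, so only visible text remains."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "\n\n\nSkip to content\n\n\n   \nNavigation Menu\n\n        Sign in\n      \n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

This does not remove the navigation text itself. Filtering that out would need HTML-aware parsing, or a source URL that serves the raw file.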