# Document Loading

To build an application that lets you 
chat with your data, you first have to load 
that data into a format you can work with. That's 
where LangChain document loaders come into play. We 
have over 80 different types of document loaders, and 
in this lesson we'll cover a few of the 
most important ones and get you comfortable with 
the concept in general. 

Let's jump in! 

Document loaders deal with the specifics of accessing and 
converting data from a variety of different formats and 
sources into a standardized format. We may want to load 
data from many different places, like websites, 
databases, or YouTube, and that data can 
come in different formats, like PDF, HTML, or JSON. 

__And so the whole purpose of document loaders 
is to take this variety of data sources 
and load them into a standard document object, 
which consists of content and associated metadata.__
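To make the shape of that standard object concrete, here is a minimal sketch in plain Python. `SimpleDocument` and `load_text` are hypothetical names invented for this illustration, not LangChain classes. They only mirror the two parts described above: the content and a metadata dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(text: str, source: str) -> SimpleDocument:
    """Toy 'loader': wrap raw text and record where it came from."""
    return SimpleDocument(page_content=text, metadata={"source": source})


# The file name below is made up for the example.
example_doc = load_text("Welcome to CS229.", source="lecture01.txt")
print(example_doc.page_content)
print(example_doc.metadata)
```

Whatever the original source looks like, downstream code only needs to know about these two fields.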

There are a lot of different types of 
document loaders in LangChain, and we won't 
have time to cover them all, but here is a rough 
categorization of the 80 plus that we have. Many 
deal with loading __unstructured data__, like 
text files, from public data sources like 
YouTube, Twitter, and Hacker News. Even more 
deal with loading unstructured data from 
the proprietary data sources that you or your company 
might have, like Figma or Notion. 

Document loaders can also be used to load __structured data__: 
data that's in a tabular format and may just have 
some text in one of its cells or rows that you 
still want to do question answering or semantic 
search over. The sources here 
include things like Airbyte, Stripe, and Airtable. 
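As a rough illustration of the structured case, the sketch below turns each row of a small CSV into a document-like dictionary, keeping the free-text cell as the content and the remaining columns as metadata. The function `rows_to_documents`, the sample data, and the file name `tickets.csv` are all made up for this example. The dedicated loaders for sources such as Airtable or Stripe may organize their output differently.

```python
import csv
import io


def rows_to_documents(csv_text, text_column, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # Every column except the text column becomes metadata.
        metadata = {key: value for key, value in row.items() if key != text_column}
        metadata["source"] = source
        metadata["row"] = row_number
        documents.append({"page_content": row[text_column], "metadata": metadata})
    return documents


# Made-up sample table with one free-text column ("notes").
sample_csv = (
    "ticket_id,status,notes\n"
    "101,open,Customer cannot reset password\n"
    "102,closed,Refund issued after duplicate charge\n"
)
row_docs = rows_to_documents(sample_csv, text_column="notes", source="tickets.csv")
print(row_docs[0]["page_content"])
```

Keeping the other columns as metadata means a later search over the text can still be tied back to its row.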

![Image](immagini/05_document.png)

![Image](immagini/06_document.png)

## Note to students.
During periods of high load you may find the notebook unresponsive. It may appear to execute a cell and update the completion number in brackets [#] at the left of the cell, even though the cell has not actually executed. This is particularly obvious with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.). 

![Image](immagini/07_document.png)

In [None]:
#! pip install langchain

In [None]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## PDFs

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription so words and sentences are sometimes split unexpectedly.

In [None]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
#! pip install pypdf 

In [None]:
# import the relevant document loader from LangChain
from langchain.document_loaders import PyPDFLoader

# point the loader at the PDF in the local workspace
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")

# load the document with the load function (returns one Document per page)
pages = loader.load()

Let's have a look at what we just loaded:

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [None]:
# list of documents
len(pages)

*OUTPUT*

22

There are 22 different pages in this PDF. 
Each one is its own unique document. 

In [None]:
# take a look at the first document (page_content)

page = pages[0]

## .page_content

In [None]:
# print out the first few hundred characters

print(page.page_content[0:500])

*OUTPUT*

```
MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i
```

## .metadata

The first thing a document consists of is its page content, which is the text of the page. This can be a bit long, which is why we printed only the first 500 characters above. 
The other piece of information that's really important is the metadata associated with each document. This can be accessed through the `metadata` attribute. 

In [None]:
page.metadata

*OUTPUT*

```
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
```

You can see here that there are two different pieces. 
- One is the source information. This is the PDF, the name of the file that we loaded it from. 
- The other is the page field. This corresponds to the page of the PDF that it was loaded from. 
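One practical use of these two fields is pointing a reader back to where a piece of text came from. The sketch below assumes a metadata dictionary shaped like the output above. `format_citation` is a hypothetical helper written for this example, not part of LangChain. Because the first page is reported as `'page': 0`, the helper adds 1 for display.

```python
def format_citation(metadata):
    """Build a human-readable citation from a loader-style metadata dict."""
    # The first page has 'page': 0 in the output above, so the index is
    # zero-based; add 1 to get the page number a reader would expect.
    page_number = metadata["page"] + 1
    return f"{metadata['source']}, page {page_number}"


first_page_metadata = {
    "source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf",
    "page": 0,
}
citation = format_citation(first_page_metadata)
print(citation)
```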


## YouTube

We're going to import a few different things here. The key parts are the YouTube audio loader, which loads an audio file from a YouTube video. The other key part is the OpenAI Whisper parser. This will use OpenAI's Whisper model, a speech-to-text model, to convert the YouTube audio into a text format that we can work with.


In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
# ! pip install yt_dlp
# ! pip install pydub

**Note**: This can take several minutes to complete.

We can now specify a URL and a directory in which to save the audio files, and then create the generic loader as a combination of the YouTube audio loader and the OpenAI Whisper parser. Then we can call `loader.load()` to load the documents corresponding to this YouTube video. 


In [None]:
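# YouTube video to transcribe
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
# directory where the downloaded audio files are saved
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),  # downloads the audio
    OpenAIWhisperParser(),  # transcribes the audio to text
)
docs = loader.load()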

![Image](immagini/09_document.png)
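The pairing used here, one component that fetches raw data and another that parses it into documents, can be sketched without any LangChain imports. Everything below (`ToyBlob`, `fake_transcribe`, `ToyGenericLoader`) is a hypothetical stand-in made up for illustration. It is not how LangChain implements `GenericLoader`, and the "transcription" is just a UTF-8 decode.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ToyBlob:
    """Raw bytes plus a label for where they came from (hypothetical)."""
    data: bytes
    source: str


def fake_transcribe(blob: ToyBlob) -> List[Dict]:
    """Stand-in for a speech-to-text parser: it only decodes the bytes."""
    return [{
        "page_content": blob.data.decode("utf-8"),
        "metadata": {"source": blob.source},
    }]


class ToyGenericLoader:
    """Pairs a source of blobs with a parser that turns blobs into documents."""

    def __init__(self, blobs: Iterable[ToyBlob],
                 parser: Callable[[ToyBlob], List[Dict]]):
        self.blobs = blobs
        self.parser = parser

    def load(self) -> List[Dict]:
        documents = []
        for blob in self.blobs:
            # Each blob is handed to the parser, which may return
            # one or more documents.
            documents.extend(self.parser(blob))
        return documents


toy_loader = ToyGenericLoader(
    [ToyBlob(b"Welcome to CS229.", "clip-1")],
    fake_transcribe,
)
toy_docs = toy_loader.load()
print(toy_docs[0]["page_content"])
```

Separating the two roles means the same parser could be reused with a different source of audio, or the same source with a different parser.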

In [None]:
docs[0].page_content[0:500]

![Image](immagini/10_document.png)

## URLs

Next, we'll go over how to load content from URLs on the Internet. There's a lot of really awesome educational content on the Internet, and wouldn't it be cool if you could just chat with it? 

We're going to enable that by importing the web-based loader (`WebBaseLoader`) from LangChain. Then we can choose any URL, our favorite URL. Here, we're going to choose a markdown file from this GitHub page and create a loader for it. Next, we can call `loader.load()` and take a look at the content of the page. 

Here, you'll notice there's a lot of white space, followed by some initial text, and then some more text. This is a good example of why you actually need to do some post-processing on the information to get it into a workable format.  


In [None]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [None]:
docs = loader.load()

In [None]:
print(docs[0].page_content[:500])

![Image](immagini/11_document.png)
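As one possible form of the post-processing mentioned above, the sketch below collapses stray whitespace with regular expressions. `tidy_whitespace` is a hypothetical helper, and `raw_sample` is a made-up string that imitates a whitespace-heavy web page. It is not the actual output of the loader.

```python
import re


def tidy_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and blank lines in scraped text."""
    # Runs of spaces or tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a newline.
    text = re.sub(r" *\n *", "\n", text)
    # Three or more newlines become exactly one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Made-up sample imitating a page with lots of whitespace.
raw_sample = "\n\n\n\n   Skip to content  \n\n\n\nPage   title\n\t\tFirst paragraph of text.\n"
cleaned_sample = tidy_whitespace(raw_sample)
print(cleaned_sample)
```

This only normalizes whitespace. Removing navigation text such as menus or footers would need page-specific handling.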

## Notion

Follow steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

* Duplicate the page into your own Notion space and export as `Markdown / CSV`.
* Unzip it and save it as a folder that contains the markdown file for the Notion page.
 

![Image](immagini/08_document.png)

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

![Image](immagini/12_document.png)

In [None]:
docs[0].metadata

Finally, we'll cover how to load data from Notion. Notion is a really popular store of both personal and company data, and a lot of people have created chatbots talking to their Notion databases. In your notebook, you'll see instructions on how to export data from your Notion database into a format through which we can load it into LangChain. Once we have it in that format, we can use the Notion directory loader to load that data and get documents that we can work with. If we take a look at the content here, we can see that it's in markdown format, and this Notion document is from Blendle's Employee Handbook. 

I'm sure a lot of people listening have used Notion and have some Notion databases that they would like to chat with, and so this is a great opportunity to go export that data, bring it in here, and start working with it in this format. 

That's it for document loading. Here, we've covered how to load data from a variety of sources and get it into a standardized document interface. 

However, these documents are still rather large, and so in the next section, we're going to go over how to split them up into smaller chunks. 

This is relevant and important because when you're doing retrieval augmented generation, you need to retrieve only the pieces of content that are most relevant. You don't want to select the whole documents that we've loaded here, but rather only the paragraph or few sentences that are most topical to what you're talking about. This is also an even better opportunity to think about which sources of data we don't currently have loaders for but that you might still want to explore. 

![Image](immagini/13_document.png)