# Document Loading

The steps in this notebook include:
- **Use LangChain document loaders** to load PDFs, YouTube audio, web pages, and Notion exports.

## Contents
1. [Installation](#installation)
2. [PDFs](#pdf)
3. [YouTube](#youtube)
4. [URLs](#url)
5. [Notion](#notion)

**Source:** https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/2/document-loading

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful when we want to ask questions about specific documents (e.g., our PDFs or a set of videos).
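As a rough illustration of the retrieve-then-generate flow, the sketch below ranks a few in-memory strings by word overlap with a question and pastes the best match into a prompt. The scoring rule, the sample texts, and the names `tokenize`, `retrieve`, and `build_prompt` are assumptions made for this example. Real RAG systems typically rely on embeddings and a vector store, and the resulting prompt would then be sent to an LLM.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved context ahead of the question (toy prompt format)."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Hypothetical sample texts standing in for an external dataset
sample_texts = [
    "The lecture covers logistics of the machine learning class.",
    "Notion exports produce a zip file of markdown pages.",
    "YouTube audio can be transcribed.",
]
question = "What does the machine learning lecture cover?"
top = retrieve(question, sample_texts)
prompt = build_prompt(question, top)
print(prompt)
```

The loaders in the rest of this notebook supply the documents that a retrieval step like `retrieve` would search over.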

![overview.png](./images/overview.png)

# **Installation** <a name="installation"></a>

In [1]:
!pip install -U langchain openai python-dotenv



**Load OpenAI API key**

In [3]:
import os
import openai
import sys

sys.path.append('../..')

# Load the key from a local .env file instead of hard-coding it in the notebook
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

# Raises KeyError if OPENAI_API_KEY is not defined in the environment or .env file
openai.api_key = os.environ['OPENAI_API_KEY']

# **PDFs** <a name="pdf"></a>

We load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf). These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [4]:
!pip install -U pypdf 



In [5]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.  
- A `Document` contains text (`page_content`) and `metadata`.
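To make the shape of a page concrete, here is a minimal stand-in for a `Document`, written as a plain dataclass. It is not LangChain's class: `SimpleDocument` and `describe` are hypothetical names used only for illustration, and only the two fields named above are modeled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two fields a Document exposes."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def describe(doc: SimpleDocument, preview_chars: int = 40) -> str:
    """Summarize a document as 'source (page N): first characters'."""
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "?")
    preview = doc.page_content[:preview_chars].replace("\n", " ")
    return f"{source} (page {page}): {preview}"


example = SimpleDocument(
    page_content="MachineLearning-Lecture01\nInstructor (Andrew Ng): Okay.",
    metadata={"source": "data/MachineLearning-Lecture01.pdf", "page": 0},
)
print(describe(example))
```

Keeping the text and its provenance together is what lets later steps cite the file and page that a passage came from.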

In [6]:
len(pages)

22

In [7]:
page = pages[0]
page

Document(page_content='MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we\'ll start to  talk a bit about machine learning.  \nBy way of introduction, my name\'s  Andrew Ng and I\'ll be instru ctor for this class. And so \nI personally work in machine learning, and I\' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I\'m actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e l

In [8]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [9]:
page.metadata

{'source': 'data/MachineLearning-Lecture01.pdf', 'page': 0}

# **YouTube** <a name="youtube"></a>

In [10]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [11]:
!pip install -U yt_dlp
!pip install -U youtube-dl
!pip install pydub
!pip install ffmpeg 
!pip install ffprobe
!pip install ffmpeg-python
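# ffmpeg is a system binary; apt-get exists only on Debian/Ubuntu (on macOS, e.g. `brew install ffmpeg`)
!apt-get install ffmpeg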

/bin/sh: apt-get: command not found


**Note**: This can take several minutes to complete.

- **Generic Document Loader (`GenericLoader`):** combines an arbitrary blob loader with a blob parser. Parameters:
    - `blob_loader`: a blob loader that knows how to yield blobs
    - `blob_parser`: a blob parser that knows how to parse blobs into documents

- **`OpenAIWhisperParser()`:** transcribes and parses audio files. Transcription uses the OpenAI Whisper model, an open-source automatic speech recognition (ASR) model.
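The idea of pairing a blob loader with a blob parser can be sketched in plain Python. Everything below (`Blob`, `Doc`, `in_memory_blob_loader`, `utf8_parser`, `SketchLoader`) is a hypothetical simplification written for this example and is not LangChain's implementation. It only shows why the two roles are kept separate: one side yields raw bytes, the other turns them into documents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Blob:
    """Hypothetical raw payload plus a record of where it came from."""
    data: bytes
    source: str


@dataclass
class Doc:
    """Hypothetical parsed document."""
    page_content: str
    metadata: Dict[str, str]


def in_memory_blob_loader(items: Dict[str, bytes]) -> Iterator[Blob]:
    """Yield one blob per (source, bytes) pair."""
    for source, data in items.items():
        yield Blob(data=data, source=source)


def utf8_parser(blob: Blob) -> Iterator[Doc]:
    """Parse a blob into documents by decoding it as UTF-8 text."""
    yield Doc(page_content=blob.data.decode("utf-8"), metadata={"source": blob.source})


class SketchLoader:
    """Combine any blob source with any blob parser."""

    def __init__(self, blobs: Iterable[Blob], parser: Callable[[Blob], Iterator[Doc]]):
        self.blobs = blobs
        self.parser = parser

    def load(self) -> List[Doc]:
        """Run every blob through the parser and collect the documents."""
        docs: List[Doc] = []
        for blob in self.blobs:
            docs.extend(self.parser(blob))
        return docs


sketch_loader = SketchLoader(
    in_memory_blob_loader({"a.txt": b"hello", "b.txt": b"world"}),
    utf8_parser,
)
sketch_docs = sketch_loader.load()
print([d.page_content for d in sketch_docs])
```

In the YouTube cell, the audio downloader plays the loader role and the Whisper transcriber plays the parser role, so either side could be swapped without touching the other.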

In [12]:
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="./"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
#docs = loader.load()

In [13]:
#docs[0].page_content[0:500]

# **URLs** <a name="url"></a>

In [14]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://openai.com/blog/new-models-and-developer-products-announced-at-devday")

In [15]:
docs = loader.load()
docs

[Document(page_content='\n\n\nNew models and developer products announced at DevDay\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3APIOverviewData privacyPricingDocsChatGPTOverviewEnterpriseTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3APIOverviewData privacyPricingDocsChatGPTOverviewEnterpriseTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit BlogNew models and developer products announced at DevDayGPT-4 Turbo with 128K context and lower prices, the new Assistants API, GPT-4 Turbo with Vision, DALL·E 3 API, and more.November 6, 2023AuthorsOpenAI Announcements,\xa0ProductToday, we shared dozens of new additions and improvements, and reduced pricing across many parts of our platform. These inc

In [16]:
print(docs[0].page_content[:1500])




New models and developer products announced at DevDay













CloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3APIOverviewData privacyPricingDocsChatGPTOverviewEnterpriseTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3APIOverviewData privacyPricingDocsChatGPTOverviewEnterpriseTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit BlogNew models and developer products announced at DevDayGPT-4 Turbo with 128K context and lower prices, the new Assistants API, GPT-4 Turbo with Vision, DALL·E 3 API, and more.November 6, 2023AuthorsOpenAI Announcements, ProductToday, we shared dozens of new additions and improvements, and reduced pricing across many parts of our platform. These include:New GPT-4 Turbo model that is more capa

# **Notion** <a name="notion"></a>

Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) to export a Notion site, for example [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

* Duplicate the page into your own Notion space and export as `Markdown / CSV`.
* Unzip it and save it as a folder that contains the markdown file for the Notion page.
 

<img src="images/image_notion.png" width=300 />

- When exporting, make sure to select the _Markdown & CSV format_ option.

This produces a `.zip` file. Move the `.zip` file into this repository and run the following command to unzip it:

In [17]:
!unzip -o "data/27261042-ae4b-4a74-b2d7-6154d7246eb4_Export-0d6d611c-c647-4296-b0f6-4c989f8c5d0d.zip" -d "data/Notion_DB"

Archive:  data/27261042-ae4b-4a74-b2d7-6154d7246eb4_Export-0d6d611c-c647-4296-b0f6-4c989f8c5d0d.zip
  inflating: data/Notion_DB/MLOps tools 2023 830865f5e014447eb4c8c2cf5dbb7367.md  


> **`unzip` command options**
>
> `-o`: overwrite existing files without prompting.  
> `-d exdir`: an optional directory into which to extract files.
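Where the `unzip` binary is not available, the same extraction can be done with Python's standard `zipfile` module. The sketch below is one way to do it, not part of the original notebook: `extract_export` is a hypothetical helper name, and the demo builds a throwaway archive in a temporary directory rather than touching the real Notion export. Like `-o`, `extractall` overwrites files that already exist, and the target directory plays the role of `-d`.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_export(zip_path: Path, target_dir: Path) -> List[str]:
    """Extract every member of zip_path into target_dir and return the member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)  # existing files are overwritten
        return archive.namelist()


# Self-contained demo using a throwaway archive
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    demo_zip = tmp_path / "export.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("page.md", "# Demo page\n")
    extracted = extract_export(demo_zip, tmp_path / "Notion_DB")
    demo_text = (tmp_path / "Notion_DB" / "page.md").read_text()

print(extracted)
```

For the export used here, the call would take the path of the downloaded `.zip` file and `Path("data/Notion_DB")` as the target.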

In [18]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("data/Notion_DB")
docs = loader.load()

In [19]:
print(docs[0].page_content[0:200])

# MLOps tools 2023

[https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1](https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-


In [20]:
docs[0].metadata

{'source': 'data/Notion_DB/MLOps tools 2023 830865f5e014447eb4c8c2cf5dbb7367.md'}