# Document Loading

In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ["OPENAI_API_KEY"]

## PDF

In [2]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()
len(pages)

22

In [3]:
page = pages[0]
print("Metadata:\n", page.metadata)
print()
print("Content:\n", page.page_content[:500])

Metadata:
 {'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0}

Content:
 MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the 


## YouTube

In [4]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=XUFLq6dKQok"
save_dir = "docs/youtube"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser(),
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=XUFLq6dKQok
[youtube] XUFLq6dKQok: Downloading webpage
[youtube] XUFLq6dKQok: Downloading ios player API JSON
[youtube] XUFLq6dKQok: Downloading mweb player API JSON
[youtube] XUFLq6dKQok: Downloading m3u8 information
[info] XUFLq6dKQok: Downloading 1 format(s): 140
[download] docs\youtube\FORMATION DEEP LEARNING COMPLETE (2021).m4a has already been downloaded
[download] 100% of   28.63MiB
[ExtractAudio] Not converting audio docs\youtube\FORMATION DEEP LEARNING COMPLETE (2021).m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 2!


In [5]:
from pprint import pprint
pprint(docs[0].page_content[0:500])

('Ceci est un réseau de neurones artificiels, un des algorithmes '
 "d'intelligence artificielle les plus sophistiqués au monde. A l'origine "
 'inspirée du fonctionnement des neurones biologiques, cet algorithme est '
 "capable d'apprendre à réaliser n'importe quelle tâche. Conduire une voiture, "
 'jouer aux échecs, entretenir une conversation, ou encore reconnaître et '
 'classer des images telles que ces chiffres que vous voyez en ce moment à '
 "l'écran. Dans cette série de vidéos, je vais vous montrer comment créer")


## URL

In [6]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/bryantchakote/bryantchakote/blob/main/README.md")
docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [7]:
import re
import json

page = docs[0]
metadata = page.metadata
page_content = page.page_content
page_content = re.sub(r"\s{2,}", "\n", page_content)  # collapse each run of 2+ whitespace characters into one newline

print("Metadata:\n" + json.dumps(metadata, indent=4))
print()
print("Content:\n" + page_content[-504:-233])

Metadata:
{
    "source": "https://github.com/bryantchakote/bryantchakote/blob/main/README.md",
    "title": "bryantchakote/README.md at main \u00b7 bryantchakote/bryantchakote \u00b7 GitHub",
    "description": "Contribute to bryantchakote/bryantchakote development by creating an account on GitHub.",
    "language": "en"
}

Content:
Hello, I'm Bryan Tchakote 😁
A Student with a deep passion for Data Science & AI
🏋🏾‍♂️ So...
🫰🏾I’m currently learning a bit of everything 🤷🏾‍♂️
🐒Ask me about Machine & Deep Learning models, Computer Vision, and whatever else...
🌜Fun Fact: (Just because I saw it somewhere)
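
The `re.sub(r"\s{2,}", "\n", ...)` call in the cell above does more than trim spaces: it replaces every run of two or more whitespace characters (spaces, tabs, newlines) with a single newline, while single spaces between words are left alone. A minimal sketch of that behavior follows; the `collapse_whitespace` helper and the sample string are hypothetical and only stand in for the scraped page text.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of two or more whitespace characters with one newline."""
    return re.sub(r"\s{2,}", "\n", text)


# Hypothetical sample standing in for the raw text of a scraped page.
sample = "Title\n\n\n   First line  \t second part\nLast line"
print(collapse_whitespace(sample))
```

A lone newline or a lone space is a run of length one, so it does not match and is kept as-is.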


## Notion

In [8]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/notion")
docs = loader.load()
len(docs)

13

In [9]:
from IPython.display import display, Markdown

page = docs[9]
print("Metadata:\n", page.metadata)
print()
print("Content:\n")
display(Markdown(page.page_content))

Metadata:
 {'source': 'docs\\notion\\Paper 7 LayoutLMv3 Pre-training for Document AI wi d273f2f8a44245479c115d71f0b1cddb.md'}

Content:



# Paper 7: LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

---

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 10 pages. [https://doi.org/10.1145/3503161.3548112](https://doi.org/10.1145/3503161.3548112)

---

# Abstract

- Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
- Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.

# Introduction

- A pre-trained Document AI model can parse layout and extract key information for various documents such as scanned forms and academic papers.
- Comparisons with existing works (e.g., DocFormer [2] and SelfDoc [31])
    - On image embedding: our LayoutLMv3 uses linear patches to reduce the computational bottleneck of CNNs and eliminate the need for region supervision in training object detectors.
    - On pre-training objectives on image modality: our LayoutLMv3 learns to reconstruct discrete image tokens of masked patches instead of raw pixels or region features to capture high-level layout structures rather than noisy details.
- To overcome the discrepancy in pre-training objectives of text and image modalities and facilitate multimodal representation learning, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking objectives MLM (Masked Language Modeling) and MIM (Masked Image Modeling).
- Inspired by DALL-E [43] and BEiT [3], we obtain the target image tokens from latent codes of a discrete VAE. For documents, each text word corresponds to an image patch. To learn this cross-modal alignment, we propose a Word-Patch Alignment (WPA) objective to predict whether the corresponding image patch of a text word is masked.
- Inspired by ViT [11] and ViLT [22], LayoutLMv3 directly leverages raw image patches from document images without complex pre-processing steps such as page object detection.
- LayoutLMv3 jointly learns image, text and multimodal representations in a Transformer model with unified MLM, MIM and WPA objectives. This makes LayoutLMv3 the first multimodal pre-trained Document AI model without CNNs for image embeddings, which significantly saves parameters and gets rid of region annotations.
- The simple unified architecture and objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.

# LayoutLMv3

## Model Architecture

- LayoutLMv3 applies a unified text-image multimodal Transformer to learn cross-modal representations. The Transformer has a multi-layer architecture and each layer mainly consists of multi-head self-attention and position-wise fully connected feed-forward networks [49]. The input to the Transformer is a concatenation of text embedding $Y = y_{1:L}$ and image embedding $X = x_{1:M}$ sequences, where $L$ and $M$ are the sequence lengths for text and image, respectively. Through the Transformer, the last layer outputs text-and-image contextual representations.
- Text embedding is a combination of word embeddings and position embeddings. We pre-processed document images with an off-the-shelf OCR toolkit to obtain textual content and corresponding 2D position information. We initialize the word embeddings with a word embedding matrix from a pre-trained model RoBERTa. The position embeddings include 1D position and 2D layout position embeddings, where the 1D position refers to the index of tokens within the text sequence, and the 2D layout position refers to the bounding box coordinates of the text sequence.
- Image Embedding: Inspired by ViT [11] and ViLT [22], we represent document images with linear projection features of image patches before feeding them into the multimodal Transformer. […] We insert semantic 1D relative position and spatial 2D relative position as bias terms in self-attention networks for text and image modalities following LayoutLMv2 [56].
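
The image-embedding step in the last bullet (cut the page image into patches, then apply a linear projection to each flattened patch) can be sketched in plain Python. This is an illustration under stated assumptions, not the model's implementation: the 4 x 4 grayscale image, the patch size of 2, and the fixed weight matrix are all hypothetical, whereas the real model uses color images and learned weights.

```python
from typing import List


def split_into_patches(image: List[List[float]], patch: int) -> List[List[float]]:
    """Cut an H x W image into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened to a vector.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[r][c]
                 for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches


def linear_projection(vector: List[float],
                      weights: List[List[float]],
                      bias: List[float]) -> List[float]:
    """Map a flattened patch to an embedding: one dot product per output dimension."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# Hypothetical 4 x 4 image whose pixel value encodes its position.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)

# Hypothetical 2 x 4 projection matrix (a trained model would learn this).
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.5, 0.0]
embeddings = [linear_projection(p, weights, bias) for p in patches]
print(len(patches), embeddings[0])
```

No convolutional backbone or object detector is involved at any point, which is the property the notes highlight.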

## Pre-training Objectives

Objective 1: Masked Language Modeling (MLM). We mask 30% of text tokens with a span masking strategy with span lengths drawn from a Poisson distribution (𝜆 = 3) [21, 27].
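
A rough sketch of span masking as described for Objective 1: span lengths are drawn from a Poisson distribution with λ = 3 and spans are placed until about 30% of the tokens are masked. The notes give only the ratio and the distribution, so the way spans are positioned, the handling of overlaps, and the minimum span length of 1 are assumptions of this sketch, and the function names are hypothetical.

```python
import math
import random
from typing import List


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one sample from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def span_mask(num_tokens: int, mask_ratio: float = 0.3,
              lam: float = 3.0, seed: int = 0) -> List[bool]:
    """Return a boolean mask where contiguous spans cover mask_ratio of the tokens."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)
    mask = [False] * num_tokens
    budget = round(num_tokens * mask_ratio)
    masked = 0
    while masked < budget:
        # Assumption: a span covers at least one token and never exceeds
        # the remaining budget.
        length = max(1, sample_poisson(lam, rng))
        length = min(length, budget - masked)
        start = rng.randrange(num_tokens - length + 1)
        for i in range(start, start + length):
            if not mask[i]:  # overlapping spans only count once
                mask[i] = True
                masked += 1
    return mask


mask = span_mask(100)
print(sum(mask), "of", len(mask), "tokens masked")
```

Because each span is capped at the remaining budget, the loop ends with exactly `round(num_tokens * mask_ratio)` masked positions.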

Objective 2: Masked Image Modeling (MIM). The MIM objective is symmetric to the MLM objective: we randomly mask about 40% of the image tokens with the blockwise masking strategy [3].
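
To make the idea of blockwise masking more concrete, here is a simplified sketch: rectangular blocks of patches are masked on a grid until roughly 40% of the patches are covered. It is not the exact procedure from [3]; the 14 x 14 grid, the `max_block` limit, and the uniform sampling of block shapes are assumptions chosen for illustration, and `blockwise_mask` is a hypothetical name.

```python
import random
from typing import List


def blockwise_mask(rows: int, cols: int, mask_ratio: float = 0.4,
                   max_block: int = 4, seed: int = 0) -> List[List[bool]]:
    """Mask rectangular blocks of patches until mask_ratio of the grid is covered."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    if max_block < 1:
        raise ValueError("max_block must be at least 1")
    rng = random.Random(seed)
    grid = [[False] * cols for _ in range(rows)]
    target = round(rows * cols * mask_ratio)
    masked = 0
    while masked < target:
        # Assumption: block height and width are drawn uniformly.
        height = rng.randint(1, min(max_block, rows))
        width = rng.randint(1, min(max_block, cols))
        top = rng.randrange(rows - height + 1)
        left = rng.randrange(cols - width + 1)
        for r in range(top, top + height):
            for c in range(left, left + width):
                # The final block is truncated once the target is reached.
                if not grid[r][c] and masked < target:
                    grid[r][c] = True
                    masked += 1
    return grid


grid = blockwise_mask(14, 14)  # hypothetical 14 x 14 patch grid
print(sum(map(sum, grid)), "of", 14 * 14, "patches masked")
```

Masking neighboring patches together removes whole regions of the page, so a masked patch cannot be trivially recovered from its immediate neighbors.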

Objective 3: Word-Patch Alignment (WPA). For documents, each text word corresponds to an image patch. As we randomly mask text and image tokens with MLM and MIM respectively, there is no explicit alignment learning between text and image modalities. We thus propose a WPA objective to learn a fine-grained alignment between text words and image patches. The WPA objective is to predict whether the corresponding image patches of a text word are masked.
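
The WPA target can be sketched as a per-word binary label derived from the two masks. The notes only say the objective predicts whether a word's image patches are masked; two details below are assumptions of this sketch: a word counts as "unaligned" as soon as any of its patches is masked, and words whose own text token is masked are excluded (label `None`). The `wpa_labels` name and the toy inputs are hypothetical.

```python
from typing import List, Optional


def wpa_labels(word_to_patches: List[List[int]],
               text_masked: List[bool],
               patch_masked: List[bool]) -> List[Optional[int]]:
    """Build one WPA label per word.

    1    -> aligned: none of the word's image patches is masked
    0    -> unaligned: at least one of its patches is masked
    None -> excluded: the word itself is masked in the text (assumption)
    """
    labels: List[Optional[int]] = []
    for patches, word_is_masked in zip(word_to_patches, text_masked):
        if word_is_masked:
            labels.append(None)
        elif any(patch_masked[p] for p in patches):
            labels.append(0)
        else:
            labels.append(1)
    return labels


# Toy example: three words, four image patches.
word_to_patches = [[0], [1, 2], [3]]       # patches overlapped by each word's box
text_masked = [False, False, True]         # produced by the MLM masking step
patch_masked = [False, False, True, False] # produced by the MIM masking step
print(wpa_labels(word_to_patches, text_masked, patch_masked))
```

In practice the word-to-patch mapping would come from the OCR bounding boxes mentioned under Model Architecture.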

# Experiments

## Fine-tuning on Multimodal Tasks

- Task 1: Form and Receipt Understanding. Form and receipt understanding tasks require extracting and structuring forms and receipts’ textual content. The tasks are a sequence labeling problem aiming to tag each word with a label. We predict the label of the last hidden state of each text token with a linear layer and an MLP classifier for form and receipt understanding tasks, respectively.
- Task 2: Document Image Classification. The document image classification task aims to predict the category of document images. We feed the output hidden state of the special classification token ([CLS]) into an MLP classifier to predict the class labels.