## Document Loading Process for RAG

### Importing the required packages

In [7]:
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

# PDF
from langchain_community.document_loaders import PyPDFLoader

# YouTube
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser

# Web
from langchain_community.document_loaders import WebBaseLoader

### OpenAI API key

In [8]:
sys.path.append('../..')
_ = load_dotenv(find_dotenv())  # read variables from the nearest .env file
api_key = os.environ['OPENAI_API_KEY']  # raises KeyError if the key is not set
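
If the key is missing, `os.environ[...]` fails with a bare `KeyError`. A minimal sketch of a friendlier check follows. The helper name `require_env` is hypothetical and is not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error message; adapt it to your own setup.
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

After `load_dotenv(find_dotenv())` has run, calling `require_env("OPENAI_API_KEY")` would then either return the key or stop with a readable message.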

### Loading a PDF Document

In [9]:
loader = PyPDFLoader("./docs/pdf_files/G-CNN.pdf")
pages = loader.load()

Check the number of pages (by default the loader returns one `Document` per PDF page):

In [10]:
len(pages)

10

Inspecting the first page:

In [11]:
page_01 = pages[0]
print(page_01)

page_content='Group Equivariant Convolutional Networks
Taco S. Cohen T.S.COHEN @UVA.NL
University of Amsterdam
Max Welling M.WELLING @UVA.NL
University of Amsterdam
University of California Irvine
Canadian Institute for Advanced Research
Abstract
We introduce Group equivariant Convolutional
Neural Networks (G-CNNs), a natural general-
ization of convolutional neural networks that re-
duces sample complexity by exploiting symme-
tries. G-CNNs use G-convolutions, a new type of
layer that enjoys a substantially higher degree of
weight sharing than regular convolution layers.
G-convolutions increase the expressive capacity
of the network without increasing the number of
parameters. Group convolution layers are easy
to use and can be implemented with negligible
computational overhead for discrete groups gen-
erated by translations, reﬂections and rotations.
G-CNNs achieve state of the art results on CI-
FAR10 and rotated MNIST.
1. Introduction
Deep convolutional neural networks (CNNs, convn

Accessing the page content:

In [12]:
abstract = page_01.page_content[223:923]
print(abstract)


Abstract
We introduce Group equivariant Convolutional
Neural Networks (G-CNNs), a natural general-
ization of convolutional neural networks that re-
duces sample complexity by exploiting symme-
tries. G-CNNs use G-convolutions, a new type of
layer that enjoys a substantially higher degree of
weight sharing than regular convolution layers.
G-convolutions increase the expressive capacity
of the network without increasing the number of
parameters. Group convolution layers are easy
to use and can be implemented with negligible
computational overhead for discrete groups gen-
erated by translations, reﬂections and rotations.
G-CNNs achieve state of the art results on CI-
FAR10 and rotated MNIST.
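
The slice `[223:923]` depends on character offsets that are specific to this PDF. As a rough alternative, the abstract can be located by its surrounding headings and the PDF line wrapping undone. The sketch below uses only the standard library. The helper names `extract_section` and `join_wrapped_lines` are hypothetical, and `sample` is a shortened stand-in for `page_01.page_content`.

```python
import re


def extract_section(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text between two markers; raises ValueError if one is missing."""
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end].strip()


def join_wrapped_lines(text: str) -> str:
    """Rejoin words split by a hyphen at a line break, then flatten newlines."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s*\n\s*", " ", text).strip()


# Shortened stand-in for page_01.page_content (assumption for illustration).
sample = (
    "University of Amsterdam\n"
    "Abstract\n"
    "We introduce Group equivariant Convolutional\n"
    "Neural Networks (G-CNNs), a natural general-\n"
    "ization of convolutional neural networks that re-\n"
    "duces sample complexity by exploiting symme-\n"
    "tries.\n"
    "1. Introduction\n"
    "Deep convolutional neural networks"
)

abstract_text = join_wrapped_lines(
    extract_section(sample, "Abstract\n", "1. Introduction")
)
print(abstract_text)
```

Note that the hyphen rule also merges compounds that were legitimately hyphenated at a line break, so the result may need a manual check.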



Accessing the metadata of a page:

In [13]:
page_01.metadata

{'producer': 'pdfTeX-1.40.14',
 'creator': 'TeX',
 'creationdate': '2016-05-27T01:32:03+02:00',
 'moddate': '2016-05-27T01:32:03+02:00',
 'trapped': '/False',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1',
 'source': './docs/pdf_files/G-CNN.pdf',
 'total_pages': 10,
 'page': 0,
 'page_label': '1'}
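
In this metadata, `page` is zero-based while `page_label` holds the printed page number. A small sketch of how these fields could be turned into a citation label for retrieved chunks follows. The function name `describe_page` is hypothetical, and the example dictionary copies a subset of the metadata shown above.

```python
def describe_page(metadata: dict) -> str:
    """Build a short citation label from PyPDFLoader-style page metadata."""
    source = metadata.get("source", "unknown source")
    # 'page' is zero-based; prefer the printed 'page_label' when it is present.
    label = metadata.get("page_label") or str(metadata.get("page", 0) + 1)
    total = metadata.get("total_pages")
    suffix = f" of {total}" if total is not None else ""
    return f"{source}, page {label}{suffix}"


example = {
    "source": "./docs/pdf_files/G-CNN.pdf",
    "total_pages": 10,
    "page": 0,
    "page_label": "1",
}
print(describe_page(example))
```

With a loaded document, the same call would be `describe_page(page_01.metadata)`.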

### Loading from YouTube

In [17]:
url = "https://www.youtube.com/watch?v=jwnez8HdN7E&t=10s"
save_dir = "docs/youtube/"

loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

video = loader.load()
print(video[0].page_content[0:1000])

[youtube] Extracting URL: https://www.youtube.com/watch?v=jwnez8HdN7E&t=10s
[youtube] jwnez8HdN7E: Downloading webpage
[youtube] jwnez8HdN7E: Downloading android sdkless player API JSON
[youtube] jwnez8HdN7E: Downloading web safari player API JSON
[youtube] jwnez8HdN7E: Downloading m3u8 information
[info] jwnez8HdN7E: Downloading 1 format(s): 140-5
[download] docs/youtube//Microsoft’s new chip looks like science fiction….m4a has already been downloaded
[download] 100% of    4.00MiB
[ExtractAudio] Not converting audio docs/youtube//Microsoft’s new chip looks like science fiction….m4a; file is already in target format m4a
Transcribing part 1!
Out of nowhere, Microsoft just announced an impossible new quantum computing chip named Mirona 1. But it's not your average quantum chip. They claim to have created an entirely new state of matter. So now we have solid, liquid, gas, plasma, and the new kid on the block, the topo-computer, or topological supercomputer. It is entirely a new state of matter. If it turns out not to be your typical Microsoft BS, and that's a big if, it could be a breakthrough on par with the transistor. The humble transistor allowed computers to scale up to millions of bits, and the topo-computer could be the technology that allows us to scale up to millions of qub
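
The download log identifies the video by its ID (`jwnez8HdN7E`), and the `&t=10s` part of the URL is only a playback offset. A minimal sketch for normalizing such URLs before passing them to a loader follows. The function name `youtube_video_id` is hypothetical, and it handles only `youtube.com/watch?v=...` links, not short `youtu.be` links.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> str:
    """Return the value of the 'v' query parameter from a watch URL."""
    params = parse_qs(urlparse(url).query)
    if "v" not in params:
        raise ValueError(f"no 'v' parameter in {url!r}")
    return params["v"][0]


video_id = youtube_video_id("https://www.youtube.com/watch?v=jwnez8HdN7E&t=10s")
clean_url = f"https://www.youtube.com/watch?v={video_id}"
print(clean_url)
```

Keeping only the ID makes it easier to detect duplicate videos when several URLs point to the same content.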

### Loading from a URL

In [16]:
loader = WebBaseLoader("https://www.cit.tum.de/cit/studium/studiengaenge/bachelor-elektrotechnik-informationstechnik/")
docs_url = loader.load()

print(docs_url[0].page_content[0:1000])






Bachelor Elektrotechnik und Informationstechnik - TUM - TUM School of Computation, Information and Technology
Zum Inhalt springen
de
en
Google Suche
Menü
TUM School of Computation, Information and Technology
Technische Universität München
Startseite
Studium
Vor dem Studium
Schulprogramme
Schnupperstudium Elektrotechnik Informationstechnik
Schnupperstudium Informatik
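
The raw `page_content` of a web page usually carries many whitespace-only lines between navigation labels, which wastes space once the text is split and embedded. A minimal cleanup sketch using only the standard library follows. The function name `compact_text` is hypothetical, and `raw` is a short stand-in for `docs_url[0].page_content`.

```python
def compact_text(text: str) -> str:
    """Strip each line and drop the empty ones, keeping one item per line."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Short stand-in for docs_url[0].page_content (assumption for illustration).
raw = "\n\n\n\t\t\tZum Inhalt springen\n\t\t\n\n\nde\n\n\n\nen\n\n   Google Suche\n   \n"
print(compact_text(raw))
```

This only removes whitespace. Navigation entries such as menu labels remain in the text and would need separate filtering if they are not wanted.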
                        



                       