# Document Loading

### Data Connection
Many LLM applications require user-specific data that is not part of the model's training set. LangChain gives you the building blocks to load, transform, store and query your data via:

- Document loaders: Load documents from many different sources
- Document transformers: Split documents, drop redundant documents, and more
- Text embedding models: Take unstructured text and turn it into a list of floating point numbers
- Vector stores: Store and search over embedded data
- Retrievers: Query your data

![Data Connection](https://python.langchain.com/assets/images/data_connection-c42d68c3d092b85f50d08d4cc171fc25.jpg)
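
To make the five building blocks concrete, here is a toy, standard-library-only sketch of the same flow: load, split, embed, store, retrieve. It is not LangChain code. The `Document` dataclass, the `ToyVectorStore` class, and the bag-of-words "embedding" are all assumptions made up for illustration; a real embedding model returns dense floating point vectors.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Toy stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load(texts):
    """Loader step: wrap raw strings in Document objects."""
    return [Document(text, {"source": f"doc-{i}"}) for i, text in enumerate(texts)]


def split(docs, chunk_size=8):
    """Transformer step: split each document into chunks of at most chunk_size words."""
    chunks = []
    for doc in docs:
        words = doc.page_content.split()
        for start in range(0, len(words), chunk_size):
            chunk_text = " ".join(words[start:start + chunk_size])
            chunks.append(Document(chunk_text, dict(doc.metadata)))
    return chunks


def embed(text):
    """Embedding step (toy): word counts instead of a dense float vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Vector store step: keep (vector, document) pairs and search by similarity."""

    def __init__(self, docs):
        self.entries = [(embed(d.page_content), d) for d in docs]

    def search(self, query, k=1):
        """Retriever step: return the k documents most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


# Made-up sample texts, only for the demonstration.
texts = [
    "Neural networks learn layered representations from data.",
    "A vector store indexes embeddings for similarity search.",
]
store = ToyVectorStore(split(load(texts)))
best = store.search("how does similarity search over embeddings work", k=1)[0]
print(best.metadata["source"], "->", best.page_content)
```

The rest of this notebook covers only the first step, document loaders, using LangChain's own classes.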

### Install LangChain

In [None]:
! pip install langchain

### Set Environment Variables

In [None]:
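import os
import sys

import openai
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

def set_env_var(name, value):
    """Set an environment variable for the current process."""
    os.environ[name] = value

# Read the local .env file first: by default load_dotenv does not
# override variables that are already set in the environment.
_ = load_dotenv(find_dotenv())

# Fall back to a placeholder only if no key was found; replace it with your real key.
if "OPENAI_API_KEY" not in os.environ:
    set_env_var("OPENAI_API_KEY", "OPENAI_API_KEY")

openai.api_key = os.environ['OPENAI_API_KEY']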

# PDF

In [None]:
! pip install pypdf

In [None]:
from langchain.document_loaders import PyPDFLoader

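# Path to a local PDF file
loader = PyPDFLoader("Book -- Deep Learning with TensorFlow and Keras.pdf")
# load_and_split() loads the PDF and passes the result through LangChain's
# default text splitter, so a long page may yield more than one Document
pages = loader.load_and_split()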

In [None]:
len(pages)

In [None]:
page = pages[0]

In [None]:
print(page.page_content[0:500])

In [None]:
page.metadata

# CSV (exported from Excel)

In [None]:
from langchain.document_loaders.csv_loader import CSVLoader

# CSVLoader reads a CSV file and returns a list of Document objects
loader = CSVLoader(file_path='file_example_XLSX_5000.csv')
data = loader.load()

In [None]:
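# Print the number of loaded documents and the first one instead of the whole list
print(len(data))
print(data[0])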
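
As far as I know, `CSVLoader` produces one document per CSV row, with the row rendered as `column: value` lines and the file path and row index stored in the metadata. The sketch below imitates that behaviour with the standard library only, so the shape of the result is easier to see. The function name `rows_to_documents`, the plain dicts used in place of `Document` objects, and the sample rows are assumptions for illustration, not LangChain's implementation.

```python
import csv
import io


def rows_to_documents(csv_text, source="example.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text block
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Made-up sample data, only for the demonstration.
sample_csv = "First Name,Country,Age\nDulce,United States,32\nMara,Great Britain,25\n"
sample_docs = rows_to_documents(sample_csv)
print(len(sample_docs))
print(sample_docs[0]["page_content"])
```

Because every row becomes its own document, a large spreadsheet yields a large list, which is worth keeping in mind before printing or embedding all of it.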

# YouTube

In [None]:
from langchain.document_loaders import YoutubeLoader

In [None]:
!pip install youtube-transcript-api
!pip install pytube

In [None]:
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=9Mjcs1R0tfA", add_video_info=True
)

In [None]:
# Fetch the transcript once and keep the result for the following cells
docs = loader.load()
docs

In [None]:
print(docs[0].page_content[:500])

# URL

In [None]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Machine_learning")

In [None]:
docs = loader.load()

In [None]:
print(docs[0].page_content[:500])
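
`WebBaseLoader` downloads the page and relies on BeautifulSoup to reduce the HTML to text. To illustrate what "HTML to text" means, here is a rough standard-library sketch that works on an in-memory string and needs no network access. The class `TextExtractor`, the helper `html_to_text`, and the sample markup are hypothetical names and data chosen for this example; the sketch is not how the loader itself is implemented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample markup, only for the demonstration.
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Machine learning</h1><p>A field of <b>study</b>.</p></body></html>"
)
print(html_to_text(sample_html))
```

Text extracted this way still contains navigation labels and other page chrome, which is one reason loaded web pages usually go through a transformer step before embedding.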