# Data Loading
Data loading in LangChain is the process of reading data from sources such as files, databases, and web pages into `Document` objects, which are part of the LangChain library. Each `Document` holds the text content of the data (`page_content`) and metadata such as the source URL or title. The documents are then passed to downstream steps such as text analysis or summarization.

Source: https://python.langchain.com/docs/integrations/document_loaders

In [3]:
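# TextLoader now lives in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import TextLoader

loader = TextLoader("test.txt")
documents = loader.load()  # load() returns a list of Document objects
print(documents)
print(documents[0])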


[Document(metadata={'source': 'test.txt'}, page_content='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce augue ex, vulputate ac nibh in, venenatis imperdiet ante. Nulla ut accumsan nibh. Sed enim mi, eleifend sit amet tellus et, tempor congue risus. In consequat, purus congue mollis posuere, urna ex scelerisque lorem, sit amet commodo purus arcu ut lacus. Nullam sed porttitor velit, sit amet placerat mauris. Sed et convallis dui, nec ornare ante. Vestibulum dolor justo, commodo vel sapien ut, ultricies tristique magna. Pellentesque tortor sem, commodo nec finibus at, tincidunt a ipsum. Praesent turpis sem, tempor ut ex non, gravida commodo nunc. Duis leo orci, laoreet eu ultrices nec, auctor at magna.\n\nInterdum et malesuada fames ac ante ipsum primis in faucibus. Nam ultrices sed mi ut maximus. Praesent tincidunt semper mauris. Aenean varius neque nulla. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Sed ac bibendum odio. Cras ip

In [2]:
%pip install -qU langchain_community beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.espn.com/")
documents = loader.load()
documents[0]

USER_AGENT environment variable not set, consider setting it to identify your requests.


Document(metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}, page_content="\n\n\n\n\n\n\n\n\nESPN - Serving Sports Fans. Anytime. Anywhere.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        Skip to main content\n    \n\n        Skip to navigation\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<\n\n>\n\n\n\n\n\n\n\n\n\nMenuESPN\n\n\n\n\n\nscores\n\n\n\nNFLNBAMLBWNBASoccerGolfNHLMore SportsBoxingNCAACricketF1GamingHorseLLWSMMANASCARNLLNBA G LeagueNBA Summer LeagueNCAAFNCAAMNCAAWNWSLOlympicsPLLProfessional WrestlingRacingRN BBRN FBRugbySports BettingTennisTGLUFLX GamesEditionsFantasyWatchESPN BETESPN+\n\n\n\n\n\n\n\n\n\n\n

In [5]:
# checking out the metadata
documents[0].metadata

{'source': 'https://www.espn.com/',
 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.',
 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.',
 'language': 'en'}

In [6]:
# load multiple pages
loader = WebBaseLoader(["https://www.espn.com/", "https://python.langchain.com/docs/integrations/document_loaders/web_base/"])
documents = loader.load()
documents[1].page_content

'\n\n\n\n\nWebBaseLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\n\n\nSkip to main contentOur Building Ambient Agents with LangGraph course is now available on LangChain Academy!IntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpenAIMoreProvidersAbsoAcreomActiveloop Deep LakeADS4GPTsAerospikeAgentQLAI21 LabsAimAINetworkAirbyteAirtableAlchemyAleph AlphaAlibaba CloudAnalyticDBAnnoyAnthropicAnyscaleApache Software FoundationApache DorisApifyAppleArangoDBArceeArcGISArgillaArizeArthurArxivAscendAskNewsAssemblyAIAstra DBAtlasAwaDBAWSAZLyricsAzure AIBAAIBagelBagelDBBaichuanBaiduBananaBasetenBeamBeautiful SoupBibTeXBiliBiliBittensorBlackboardbookend.aiBoxBrave SearchBreebs (Open Knowledge)Bright DataBrowserbaseBrowserlessByteDanceCassandraCerebrasCerebriumAIChaindeskChromaClarifaiClearMLClickHouseClickUpCloudflareClovaCnosDBCogneeCogniSwitchCohereCollege ConfidentialCometConfid

### Loading CSV files into Document objects

In [9]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="Air_Quality.csv", content_columns=["Name"])
documents = loader.load()
print(len(documents))
documents[0].page_content

18025


'Name: Boiler Emissions- Total SO2 Emissions'
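
The output above suggests that each CSV row becomes one document whose `page_content` is built only from the selected columns, as `column: value` lines. The sketch below approximates that mapping with the standard library so the behavior is easier to see. It is not `CSVLoader`'s actual implementation: `rows_to_records` is a hypothetical helper, the in-memory sample and its `id` and `Unit` columns are made up for illustration, and plain dictionaries stand in for `Document` objects.

```python
import csv
import io

# Hypothetical sample data; the real Air_Quality.csv has different columns and many more rows.
sample_csv = io.StringIO(
    "id,Name,Unit\n"
    "1,Boiler Emissions- Total SO2 Emissions,example unit\n"
    "2,Example indicator,example unit\n"
)


def rows_to_records(handle, content_columns, source):
    """Approximate one-record-per-row loading (a sketch, not CSVLoader's code)."""
    records = []
    for i, row in enumerate(csv.DictReader(handle)):
        # Only the selected columns contribute to the text, one "column: value" line each.
        content = "\n".join(f"{col}: {row[col]}" for col in content_columns)
        # Keep track of where each record came from.
        records.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return records


records = rows_to_records(sample_csv, content_columns=["Name"], source="sample.csv")
print(len(records))
print(records[0]["page_content"])
```

Restricting the text to one column keeps each document short, but the values of the other columns are then absent from `page_content`.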

### Loading PDFs

In [10]:
%pip install pypdf

# Load Moby Dick; PyPDFLoader returns one Document per page
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("moby-dick.pdf")
documents = loader.load()

print("number of pages: ", len(documents))

# Print the first 500 characters of the page at index 7 (the eighth page),
# which is where Chapter I begins in this PDF
print("First 500 characters of page index 7:")
print(documents[7].page_content[:500])


Note: you may need to restart the kernel to use updated packages.
number of pages:  468
First 500 characters of page index 7:
CHAPTER I.
LOOMINGS
Call me Ishmael. Some years ago—never mind how long precisely—having little
or no money in my purse, and nothing particular to interest me on shore, I thought
I would sail about a little and see the watery part of the world. It is a way I have
of driving oﬀ the spleen, and regulating the circulation. Whenever I ﬁnd myself
growing grim about the mouth; whenever it is a damp, drizzly November in my
soul; whenever I ﬁnd myself involuntarily pausing before coﬃn warehouses, and
br


### Cleaning up the data after loading
If you inspect the `page_content` loaded from a website or a file, you will see that it contains a lot of unwanted characters, such as long runs of newlines, tabs, and spaces.
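
For plain-text noise of this kind, the standard library is often enough. The sketch below makes an assumption about what "cleaning" should mean here: collapse runs of spaces and tabs, trim each line, and keep at most one blank line between blocks of text. `normalize_whitespace` is a hypothetical helper, not part of LangChain, and the sample string only imitates the web output shown earlier.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse noisy whitespace in loaded text (hypothetical helper, not a LangChain API)."""
    # Normalize Windows and old-Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces, tabs, and other non-newline whitespace into one space.
    text = re.sub(r"[^\S\n]+", " ", text)
    # Strip the spaces left at the start or end of each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between blocks of text.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "\n\n\n\n   Skip to main content\n    \n\n\t\tSkip to navigation\n\n\n\nMenuESPN\n"
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```

The same function could be applied to the `page_content` of each loaded document. It does not remove navigation text or markup, which is where the HTML-specific approach in the following cells helps.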

In [11]:
%pip install --upgrade --quiet html2text

Note: you may need to restart the kernel to use updated packages.


In [12]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

loader = AsyncHtmlLoader(["https://www.espn.com/"])
documents = loader.load()
html2text = Html2TextTransformer()
documents2 = html2text.transform_documents(documents)
print(documents2[0].page_content)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|##########| 1/1 [00:01<00:00,  1.30s/it]

Skip to main content  Skip to navigation

<

>

Menu

## ESPN

  *   *   *   * scores

  * NFL
  * NBA
  * MLB
  * WNBA
  * Soccer
  * Golf
  * NHL
  * More Sports

    * Boxing
    * NCAA
    * Cricket
    * F1
    * Gaming
    * Horse
    * LLWS
    * MMA
    * NASCAR
    * NLL
    * NBA G League
    * NBA Summer League
    * NCAAF
    * NCAAM
    * NCAAW
    * NWSL
    * Olympics
    * PLL
    * Professional Wrestling
    * Racing
    * RN BB
    * RN FB
    * Rugby
    * Sports Betting
    * Tennis
    * TGL
    * UFL
    * X Games

  * Editions
  * Fantasy
  * Watch
  * ESPN BET
  * ESPN+

##

  * Subscribe Now
  * UFC 318: Holloway vs. Poirier 3 (PPV)

  * NFL FLAG

  * NBA Summer League

  * The Ultimate Fighter

  * Little League Softball

## Quick Links

  * NBA Free Agency

  * NBA Summer League

  * WNBA Season Schedule

  * The Open

  * Where To Watch

  * Today's Top Odds

  * ESPN Radio: Listen Live

## Favorites

Manage Favorites

## Customize ESPN

Create AccountLog In


