# Introduction

This example shows how to use LangChain document loaders to load and summarize the following kinds of resources:
1. URL
2. PowerPoint
3. ReadTheDocs site
4. PDF

In [67]:
import os
from langchain.document_loaders import UnstructuredURLLoader, UnstructuredPowerPointLoader, ReadTheDocsLoader, PyPDFLoader
from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.callbacks import get_openai_callback
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [68]:
def summarize_docs(docs, doc_url):
    """Split the loaded documents into chunks, summarize them, and print token usage."""
    print(f'You have {len(docs)} document(s) in your {doc_url} data')
    # Only the length of the first document is reported here.
    print(f'There are {len(docs[0].page_content)} characters in your document')

    # Split into chunks of at most 1000 characters, with no overlap between chunks.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    split_docs = text_splitter.split_documents(docs)

    print(f'You have {len(split_docs)} split document(s)')

    # The API key is read from the environment; a missing key raises KeyError.
    OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
    llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name="text-davinci-003")
    # map_reduce: summarize each chunk, then combine the partial summaries.
    chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=False)

    # The callback records token usage and cost for the calls made inside the block.
    with get_openai_callback() as cb:
        response = chain.run(input_documents=split_docs)
        print(f"Total Tokens: {cb.total_tokens}")
        print(f"Prompt Tokens: {cb.prompt_tokens}")
        print(f"Completion Tokens: {cb.completion_tokens}")
        print(f"Successful Requests: {cb.successful_requests}")
        print(f"Total Cost (USD): ${cb.total_cost}")

    return response
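
The `map_reduce` chain type used in `summarize_docs` summarizes each chunk on its own and then summarizes the combined partial summaries. The sketch below illustrates that flow in plain Python. `fake_summarize` and `map_reduce_summary` are hypothetical names, and `fake_summarize` is only a stand-in for an LLM call (it keeps the first sentence); none of this is LangChain's implementation.

```python
from typing import Callable, List


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    first_sentence = text.strip().split(". ")[0]
    return first_sentence.rstrip(".") + "."


def map_reduce_summary(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = " ".join(partial_summaries)
    return summarize(combined)  # reduce step


chunks = [
    "First chunk main point. Extra detail in the first chunk.",
    "Second chunk main point. Extra detail in the second chunk.",
]
print(map_reduce_summary(chunks, fake_summarize))
```

With a real model, the reduce step would merge the partial summaries rather than discard all but the first, but the two-stage shape is the same.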

1. Load a web page by its URL and get its content summarized.

In [69]:
url = "https://edition.cnn.com/2023/04/13/business/delta-earnings/index.html"
summarize_docs(UnstructuredURLLoader(urls = [url]).load(), url)

You have 1 document(s) in your https://edition.cnn.com/2023/04/13/business/delta-earnings/index.html data
There are 2780 characters in your document
You have 4 split document(s)
Total Tokens: 1416
Prompt Tokens: 980
Completion Tokens: 436
Successful Requests: 2
Total Cost (USD): $0.02832


' Delta Airlines reported record advanced bookings for the summer, indicating a recovery from pandemic-related losses. Despite a one-time charge of $864 million related to a four-year labor deal with pilots, the company reported a net profit when excluding special items. Revenue was up 45% from a year earlier and 14% from the same period in 2019. Additionally, a passenger was taken into custody after opening a door of a Boeing 737 and deploying an emergency exit slide at Los Angeles International Airport. Delta Airlines is expecting earnings per share of between $2 and $2.25, and between $5 and $6 for the full year. Other major US airlines are likely to face rising labor costs due to upcoming negotiations with a majority of their employees.'
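
The 2780-character page above produced 4 chunks rather than the 3 that a fixed 1000-character cut would give. One plausible reason is that `RecursiveCharacterTextSplitter` prefers to break on separators such as blank lines, so chunks often end up shorter than `chunk_size`. The sketch below is a simplified illustration of that idea under stated assumptions, not the library's algorithm: `pack_paragraphs` is a hypothetical helper that greedily packs whole paragraphs into chunks and does not split a paragraph that is itself longer than `chunk_size`.

```python
from typing import List


def pack_paragraphs(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters.

    Simplification: a single paragraph longer than chunk_size is kept as one
    oversized chunk instead of being split further.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Four 600-character paragraphs: no two fit together in 1000 characters.
text = "\n\n".join(["x" * 600] * 4)
chunks = pack_paragraphs(text)
print(len(text), len(chunks))
```

Here the text is 2406 characters long, so a fixed-size cut would need 3 chunks, while packing on paragraph boundaries yields 4.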

2. Load a PowerPoint file and get its content summarized.

In [70]:

!wget "https://github.com/tomw1808/truffle_eth_class2/blob/master/s08/Web3-intro.pptx?raw=true" -O Web3-intro.pptx

--2023-04-13 23:38:31--  https://github.com/tomw1808/truffle_eth_class2/blob/master/s08/Web3-intro.pptx?raw=true
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/tomw1808/truffle_eth_class2/raw/master/s08/Web3-intro.pptx [following]
--2023-04-13 23:38:31--  https://github.com/tomw1808/truffle_eth_class2/raw/master/s08/Web3-intro.pptx
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tomw1808/truffle_eth_class2/master/s08/Web3-intro.pptx [following]
--2023-04-13 23:38:31--  https://raw.githubusercontent.com/tomw1808/truffle_eth_class2/master/s08/Web3-intro.pptx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercon

In [71]:
loader = UnstructuredPowerPointLoader("Web3-intro.pptx")
response = summarize_docs(loader.load(), "Web3-intro.pptx")
print(response)

You have 1 document(s) in your Web3-intro.pptx data
There are 864 characters in your document
You have 1 split document(s)
Total Tokens: 531
Prompt Tokens: 408
Completion Tokens: 123
Successful Requests: 2
Total Cost (USD): $0.01062
 Web3 is a Javascript library that enables users to interact with the blockchain via the json-RPC interface. It connects the browser to the blockchain via port 8545 and provides practical examples such as connecting to the Ethereum Wiki and getting the balance of an account.
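
The costs reported by `get_openai_callback` in the two runs above are consistent with a flat rate of $0.02 per 1,000 tokens (1416 tokens cost $0.02832 and 531 tokens cost $0.01062). A minimal sketch of that arithmetic follows. `PRICE_PER_1K_TOKENS` and `estimate_cost` are illustrative names, and the rate is inferred from the outputs above rather than taken from a price list.

```python
PRICE_PER_1K_TOKENS = 0.02  # USD; assumption inferred from the outputs above


def estimate_cost(total_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Return the estimated cost in USD for a given total token count."""
    return total_tokens / 1000 * price_per_1k


for tokens in (1416, 531):
    print(f"{tokens} tokens -> ${estimate_cost(tokens):.5f}")
```

An estimate like this can help decide how much of a large document to summarize before sending any requests.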


3. Load a ReadTheDocs project and get its content summarized.

In [72]:
!wget -r -A.html -P langchain "https://langchain.readthedocs.io/en/latest/"

--2023-04-13 23:42:04--  https://langchain.readthedocs.io/en/latest/
Resolving langchain.readthedocs.io (langchain.readthedocs.io)... 2606:4700::6811:2152, 2606:4700::6811:2052, 104.17.32.82, ...
Connecting to langchain.readthedocs.io (langchain.readthedocs.io)|2606:4700::6811:2152|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://python.langchain.com/en/latest/ [following]
--2023-04-13 23:42:04--  https://python.langchain.com/en/latest/
Resolving python.langchain.com (python.langchain.com)... 2606:4700::6811:2052, 2606:4700::6811:2152, 104.17.32.82, ...
Connecting to python.langchain.com (python.langchain.com)|2606:4700::6811:2052|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘langchain/langchain.readthedocs.io/en/latest/index.html’

langchain.readthedo     [ <=>                ]  78.21K  --.-KB/s    in 0.05s   

2023-04-13 23:42:04 (1.62 MB/s) - ‘langchain/langchain.readthedocs.io/en/l

In [73]:
loader = ReadTheDocsLoader("langchain")
summarize_docs(loader.load(), "langchain")

You have 1 document(s) in your langchain data
There are 5350 characters in your document
You have 6 split document(s)
Total Tokens: 2123
Prompt Tokens: 1644
Completion Tokens: 479
Successful Requests: 2
Total Cost (USD): $0.04246


' LangChain is a framework for developing applications powered by language models. It provides modules for models, prompts, memory, indexes, and chains, as well as resources such as the LangChainHub, a glossary, a gallery, deployments, and tracing guides. ModelLaboratory is a platform that makes it easy to experiment with different prompts, models, and chains. There is a Discord to discuss LangChain, and production support is available with a dedicated Slack channel. The Quickstart Guide provides information on getting started, modules, use cases, reference docs, LangChain Ecosystem, and additional resources.'
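
`ReadTheDocsLoader` walks the download directory it is given, so the single document reported above suggests that `wget` saved only `index.html`. A likely cause is the redirect to `python.langchain.com`: a recursive `wget` does not follow links to other hosts by default. A quick way to check how many pages were actually downloaded, sketched with the standard library only; `count_html_files` is a hypothetical helper, and `langchain` is the directory name used in the cell above.

```python
from pathlib import Path


def count_html_files(root: str) -> int:
    """Count .html files anywhere under root (0 if root is not a directory)."""
    root_path = Path(root)
    if not root_path.is_dir():
        return 0
    return sum(1 for p in root_path.rglob("*.html") if p.is_file())


print(f"HTML files under 'langchain': {count_html_files('langchain')}")
```

If the count is 1, the summary covers only the landing page rather than the whole documentation site.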

4. Download a PDF file by its URL and get its first 10 pages summarized.

In [74]:
!wget "https://ir.tesla.com/_flysystem/s3/sec/000095017023001409/tsla-20221231-gen.pdf" -O tsla-20221231-gen.pdf

--2023-04-13 23:45:16--  https://ir.tesla.com/_flysystem/s3/sec/000095017023001409/tsla-20221231-gen.pdf
Resolving ir.tesla.com (ir.tesla.com)... 2a02:26f0:9b00:39d::700, 2a02:26f0:9b00:393::700, 92.122.160.52
Connecting to ir.tesla.com (ir.tesla.com)|2a02:26f0:9b00:39d::700|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘tsla-20221231-gen.pdf’

tsla-20221231-gen.p     [  <=>               ]   1.57M  5.48MB/s    in 0.3s    

2023-04-13 23:45:17 (5.48 MB/s) - ‘tsla-20221231-gen.pdf’ saved [1650825]



In [76]:
loader = PyPDFLoader("tsla-20221231-gen.pdf")
pages = loader.load_and_split()
summarize_docs(pages[:10], "tsla-20221231-gen.pdf")

You have 10 document(s) in your tsla-20221231-gen.pdf data
There are 3793 characters in your document
You have 30 split document(s)
Total Tokens: 14889
Prompt Tokens: 12541
Completion Tokens: 2348
Successful Requests: 2
Total Cost (USD): $0.29778


" Tesla, Inc. has released its annual report on Form 10-K for the year ended December 31, 2022. The report includes information on the company's business, risk factors, unresolved staff comments, properties, legal proceedings, mine safety disclosures, market for the company's common equity, management's discussion and analysis of financial condition and results of operations, quantitative and qualitative disclosures about market risk, financial statements and supplementary data, changes in and disagreements with accountants on accounting and financial disclosure, controls and procedures, other information, and disclosure regarding foreign jurisdictions that prevent inspections. Tesla designs, develops, manufactures, sells, and leases high-performance electric vehicles and energy generation and storage systems, and offers related services. They offer leasing and loan financing arrangements for vehicles in North America, Europe, and Asia, and provide resale value guarantees or buyback gu