<a href="https://colab.research.google.com/github/dhorvath/AI-Stuff/blob/main/TECH16_LLM_Lecture2_HW_David_Horvath.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [None]:
# Installs (one command; langchain-openai was previously listed twice)
!pip install -U openai langchain langchain-openai langchain-community pypdf



In [None]:
# More setup
import os
from openai import OpenAI
from google.colab import userdata
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader  # document loaders now live in langchain_community
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

# API
open_ai_key = userdata.get('open_ai_key')
client = OpenAI(api_key=open_ai_key)

# Environment variable
os.environ["USER_AGENT"] = 'TECH16 Colab'

# AI setup
llm = ChatOpenAI(temperature=0.1, model_name="gpt-4-turbo-preview", api_key=open_ai_key)

# Load document


In [None]:
# Get a pdf
!wget https://www.sf.gov/sites/default/files/2024-05/CSF_Proposed_Budget_Book_June_2024_r8.pdf
loader = PyPDFLoader("CSF_Proposed_Budget_Book_June_2024_r8.pdf")
pages = loader.load_and_split()

--2024-08-02 18:47:11--  https://www.sf.gov/sites/default/files/2024-05/CSF_Proposed_Budget_Book_June_2024_r8.pdf
Resolving www.sf.gov (www.sf.gov)... 3.163.189.32, 3.163.189.111, 3.163.189.119, ...
Connecting to www.sf.gov (www.sf.gov)|3.163.189.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8666653 (8.3M) [application/pdf]
Saving to: ‘CSF_Proposed_Budget_Book_June_2024_r8.pdf.2’


2024-08-02 18:47:11 (113 MB/s) - ‘CSF_Proposed_Budget_Book_June_2024_r8.pdf.2’ saved [8666653/8666653]



# Stuff method


In [None]:
# Simple summary
chain = load_summarize_chain(llm, chain_type="stuff")

res = chain.invoke(pages[8:200])
stuffSummary = res["output_text"]

print("Summary using 'stuff' method:")
print(stuffSummary)

# Limited to pages[8:200] (roughly the first half) because the PDF is 362 pages long

Summary using 'stuff' method:
The proposed Fiscal Year (FY) 2024-25 budget for various San Francisco departments reflects the city's commitment to public safety, environmental sustainability, economic vitality, and equitable access to services. The budgets for departments like the Department of Emergency Management, the Environment Department, and the Office of Economic and Workforce Development highlight investments in critical areas such as emergency response, climate action, and economic recovery. These budgets aim to improve the quality of life for all San Franciscans by ensuring efficient government services, fostering a robust economy, and advancing climate protection.
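
The 'stuff' method puts every page into a single prompt, which is why the page range had to be limited. A minimal sketch of a pre-check, assuming roughly 4 characters per token (a rule of thumb, not an exact tokenizer count) and a 128,000-token context limit (an assumption about the model; check its documentation). The helper names `estimate_tokens` and `fits_in_context` are hypothetical and not part of LangChain.

```python
# Rough pre-check before using the 'stuff' method: do the pages fit in one prompt?
# Assumption: about 4 characters per token (rule of thumb, not an exact count).
CHARS_PER_TOKEN = 4


def estimate_tokens(texts):
    """Return an approximate token count for a list of page texts."""
    total_chars = sum(len(t) for t in texts)
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(texts, context_limit, reserve_for_output=1000):
    """True if the estimated prompt size still leaves room for the reply."""
    return estimate_tokens(texts) + reserve_for_output <= context_limit


# Stand-in data; in this notebook it could be
# [p.page_content for p in pages[8:200]]
sample_pages = ["x" * 2000] * 192

print(estimate_tokens(sample_pages))
# context_limit=128_000 is an assumed value, not read from the API
print(fits_in_context(sample_pages, context_limit=128_000))
```

If the check fails, either narrow the page slice or switch to a chunk-by-chunk method such as map-reduce.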


In [None]:
# Bulleted summary

# Define prompt
prompt_template = """Write a concise summary in a maximum of 5 bullets of the following text enclosed within three backticks:
```{text}```
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocuments Chain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")
res = stuff_chain.invoke(pages[8:200])

# 07/20/24: tried to summarize all the pages of the document; limited to pages[8:200], as in the simple summary

print("Bulleted summary using 'stuff' method:")
print(res["output_text"])

Bulleted summary using 'stuff' method:
- The proposed FY 2024-25 budget for various San Francisco departments reflects adjustments based on operational needs, strategic initiatives, and compliance with state and federal laws. 
- Key focuses include enhancing public safety, expanding access to early childhood education, improving emergency response times, promoting environmental sustainability, and supporting economic recovery and workforce development.
- Investments are being made in infrastructure, technology upgrades, and programs aimed at addressing climate change, housing affordability, and workforce equity.
- The budgets also account for the implementation of new state laws, securing state and federal grants for climate projects, and continuing efforts to streamline permitting processes and improve citywide services.
- Efforts to promote racial and gender equity, inclusion, and diversity are integrated across departments, with initiatives to support victims of crime, expand childc

# MapReduce method



In [None]:
# Map-reduce summary: summarize each chunk separately, then combine the partial summaries
chain = load_summarize_chain(llm, chain_type="map_reduce")

res = chain.invoke(pages[8:200])
mapReduceSummary = res["output_text"]

print("Summary using 'map-reduce' method:")
print(mapReduceSummary)

Summary using 'map-reduce' method:
The Mayor of San Francisco has proposed budgets for fiscal years 2024-25 and 2025-26, with totals of $15.9 billion and $15.5 billion respectively, to address a budget shortfall while prioritizing workforce enhancement, public safety, and economic vitality. Despite facing a General Fund deficit, the city aims to balance the budget through expense reductions, fee increases, and one-time revenue sources. Key investments include competitive employee compensation, public safety improvements, support for families, homelessness reduction, and economic revitalization efforts focusing on business, tourism, arts, and culture. The budget plans also feature expansions in childcare subsidies, increased shelter capacity, and housing solutions for the homeless, alongside maintaining essential services. Funding adjustments are outlined for various city departments and initiatives, such as the Adult Probation Department, San Francisco Airport, and the Arts Commission,
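
For intuition, here is a simplified sketch of the map-reduce idea in plain Python. It is not LangChain's implementation, which differs in detail. `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is only a stand-in for an LLM call so that the example runs without an API key.

```python
def map_reduce_summarize(chunks, summarize, batch_size=4):
    """Summarize each chunk (map), then merge summaries in batches (reduce)."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so the reduce step shrinks")
    # Map step: one summary per chunk
    summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: keep merging groups of summaries until one remains
    while len(summaries) > 1:
        merged = []
        for i in range(0, len(summaries), batch_size):
            group = " ".join(summaries[i:i + batch_size])
            merged.append(summarize(group))
        summaries = merged
    return summaries[0] if summaries else ""


def first_sentence(text):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


chunks = [
    "A budget was proposed. Details follow.",
    "Safety is funded. More text.",
]
print(map_reduce_summarize(chunks, first_sentence))
```

Because each chunk is summarized on its own, no single call needs to hold the whole document, at the cost of more calls overall.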

# Comparison

In [38]:
differenceResponse = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "The user will give two messages, each representing a document. Refer to the first document as 'stuff summary', and refer to the second as 'map reduce summary'. Compare the documents for important differences and describe these differences, with an emphasis on writing style, level of detail, and key takeaways."},
        {"role": "user", "content": stuffSummary},
        {"role": "user", "content": mapReduceSummary}
    ]
)
print()
print("The difference between the 'stuff' summary and the 'map-reduce' summary")
print(differenceResponse.choices[0].message.content)


The difference between the 'stuff' summary and the 'map-reduce' summary
### Comparison between the 'Stuff Summary' and 'Map Reduce Summary'

#### Writing Style:
- The **Stuff Summary** utilizes a general and more abstract approach, focusing on the overarching goals and areas of investment without delving into specific figures or detailed plans. It emphasizes the city’s commitment to various sectors but lacks the granularity seen in the second document.
- The **Map Reduce Summary**, on the other hand, adopts a detailed and data-driven approach, providing specific budget figures, fiscal strategies, and departmental investments. It also delves into the planning process behind the budget, offering a comprehensive overview that includes stakeholder involvement and strategic objectives.

#### Level of Detail:
- **Stuff Summary** presents a broad overview without specific financial details or explicit mention of fiscal challenges. It outlines investment areas but stops short of detailing how
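
The comparison cell passes the two summaries as two consecutive user messages and relies on their order. One alternative, sketched below under the assumption that explicit labels make the mapping less ambiguous, is to put both documents into a single labelled user message. `build_comparison_messages` is a hypothetical helper, not part of the OpenAI client.

```python
def build_comparison_messages(first_doc, second_doc,
                              first_label="stuff summary",
                              second_label="map reduce summary"):
    """Build a chat `messages` list with both documents explicitly labelled."""
    system = (
        "The user will give two labelled documents. Compare them for "
        "important differences, with an emphasis on writing style, "
        "level of detail, and key takeaways."
    )
    # Each document is introduced by its label in square brackets
    user = (
        f"[{first_label}]\n{first_doc}\n\n"
        f"[{second_label}]\n{second_doc}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_comparison_messages("First text", "Second text")
print(messages[1]["content"])
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`, with `stuffSummary` and `mapReduceSummary` as the two documents.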

# Extra Credit - Let's summarize a YouTube video!

In [42]:
# Installs

!pip install --upgrade --quiet youtube-transcript-api
!pip install --upgrade --quiet pytube

from langchain_community.document_loaders import YoutubeLoader


In [41]:
# Load a video

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=4TMPXK9tw5U", add_video_info=False
)
loader.load()

[Document(metadata={'source': '4TMPXK9tw5U'}, page_content="Transcriber: Norhan Eliwa\nReviewer: Sadegh Vahdati Nia By providing space for constant evolution, we can all transform how we view ourselves\nand the world around us. Bear with me, everybody. I'm going to\nstart off today on a little bit of a heavy note, but I promise\nthings will lighten up. The dots on this screen represent\nan adult life in months, assuming a life expectancy of 90. So if you're 18 years old right now, this is an optimistic estimate of\nthe months that you have left. Take a second to take that in. Probably not as many as you would expect. And I’m sorry to say that\nit does get worse because about a third of that time\nis going to be spent sleeping. On average, 126 of those months will\ngo to school in your career. About 18 will be spent driving,\n36 cooking and eating, 36 doing chores and errands and\nabout 27 in the bathroom and taking care of personal hygiene. So that leaves you with 334 months, optimisti

In [47]:
transcript = loader.load()

# Define prompt
prompt_template = """Write a concise summary in a maximum of 5 bullets of the following transcript enclosed within three backticks:
```{text}```
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocuments Chain
stuff_chain_youtube = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")
res = stuff_chain_youtube.invoke(transcript)

print("Bulleted YouTube transcript summary using 'stuff' method:")
print(res["output_text"])

Bulleted YouTube transcript summary using 'stuff' method:
- The speaker begins by visualizing an adult's life expectancy in months, highlighting the limited amount of free time available after accounting for essential activities like sleeping, schooling, and chores, leaving only 334 months for personal pursuits.
- They emphasize the importance of how we choose to spend our free time, as it shapes our future selves, urging the audience to consider what activities are truly worth investing this time in.
- The speaker points out the alarming statistic that the average 18-year-old is expected to spend 93% of their remaining free time on screens, not including educational purposes, and discusses the negative impacts of excessive screen time on mental health, cognition, and perception of self-worth.
- Social media platforms are criticized for their business models, which prioritize profit over user well-being by monetizing user attention and data, leading to an overconsumption of screen time

In [43]:
# Get transcripts as timestamped chunks

from langchain_community.document_loaders.youtube import TranscriptFormat

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=4TMPXK9tw5U",
    add_video_info=True,
    transcript_format=TranscriptFormat.CHUNKS,
    chunk_size_seconds=30,
)
print("\n\n".join(map(repr, loader.load())))

Document(metadata={'source': 'https://www.youtube.com/watch?v=4TMPXK9tw5U&t=0s', 'title': 'The Battle for Your Time: Exposing the Costs of Social Media | Dino Ambrosi | TEDxLagunaBlancaSchool', 'description': 'Unknown', 'view_count': 2066268, 'thumbnail_url': 'https://i.ytimg.com/vi/4TMPXK9tw5U/hq720.jpg', 'publish_date': '2023-03-06 00:00:00', 'length': 691, 'author': 'TEDx Talks', 'start_seconds': 0, 'start_timestamp': '00:00:00'}, page_content="Transcriber: Norhan Eliwa\nReviewer: Sadegh Vahdati Nia By providing space for constant evolution, we can all transform how we view ourselves\nand the world around us. Bear with me, everybody. I'm going to\nstart off today on a little bit of a heavy note, but I promise\nthings will lighten up. The dots on this screen represent\nan adult life in months, assuming a life expectancy of 90.")

Document(metadata={'source': 'https://www.youtube.com/watch?v=4TMPXK9tw5U&t=30s', 'title': 'The Battle for Your Time: Exposing the Costs of Social Media | D
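
To illustrate what chunking a transcript by time means, here is a small sketch in plain Python. It is an illustration only and may not match how `YoutubeLoader` groups entries internally. The input shape (dicts with `text` and `start` in seconds) and the helper names `format_timestamp` and `chunk_transcript` are assumptions.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"


def chunk_transcript(entries, chunk_size_seconds=30):
    """Group transcript entries into fixed-length time windows."""
    windows = {}
    for entry in entries:
        # Window index: 0 for 0-29 s, 1 for 30-59 s, and so on
        idx = int(entry["start"] // chunk_size_seconds)
        windows.setdefault(idx, []).append(entry["text"])
    return [
        {
            "start_seconds": idx * chunk_size_seconds,
            "start_timestamp": format_timestamp(idx * chunk_size_seconds),
            "text": " ".join(texts),
        }
        for idx, texts in sorted(windows.items())
    ]


# Made-up entries for illustration, not taken from the video
entries = [
    {"text": "a", "start": 0.0},
    {"text": "b", "start": 12.5},
    {"text": "c", "start": 31.0},
]
for chunk in chunk_transcript(entries):
    print(chunk)
```

Timestamped chunks like these make it possible to link each part of a summary back to a position in the video.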