# LangChain Workshop 2: Data Loaders and Output Parsers

In this notebook, we'll explore:
1. Loading data from interesting sources (YouTube, news, web pages)
2. Using output parsers to get structured responses
3. Building practical applications that combine both

## Setup: Loading Our API Keys Securely 🔐

First, let's load our environment variables.

In real life, **NEVER** put API keys directly in your code! For this Kaggle workshop only, you can set your API key in the `KAGGLE_BACKUP` variable below. When running locally, use a `.env` file with the following content:
```.env
OPENAI_API_KEY="your-key"
```

In [1]:
# Only if running this on Kaggle
KAGGLE_BACKUP = "sk-..."  # Replace with your OpenAI key for Kaggle only

In [2]:
try:
    import requests
    import langchain
    import langchain_openai
    import langchain_community
    from dotenv import load_dotenv
    import bs4
    import langchain_yt_dlp
    import feedparser
    import tqdm
    from newspaper import Article
except ImportError as e:
    !pip install requests python-dotenv langchain-openai langchain langchain-community beautifulsoup4 feedparser langchain-yt-dlp newspaper3k listparser lxml-html-clean tqdm
    import requests
    import langchain
    import langchain_openai
    import langchain_community
    from dotenv import load_dotenv
    import bs4
    import langchain_yt_dlp
    import feedparser
    from newspaper import Article
    import tqdm

In [3]:
import os
from langchain_openai import ChatOpenAI

load_dotenv()

# Load API key with Kaggle backup
api_key = os.getenv("OPENAI_API_KEY", KAGGLE_BACKUP)
if api_key:
    print(f"✅ API key loaded successfully: {api_key[:12]}...")
else:
    print("❌ No API key found. Make sure you have a .env file with OPENAI_API_KEY=")

llm = ChatOpenAI(model="gpt-5-nano", temperature=0.3, api_key=api_key)

✅ API key loaded successfully: sk-proj-xTrb...
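
One caveat with the cell above: because `KAGGLE_BACKUP` defaults to the placeholder `"sk-..."`, `api_key` is always truthy and the "No API key found" branch never runs. The sketch below is one possible way to handle that. `resolve_api_key` and `mask_key` are hypothetical helper names, not part of LangChain or the OpenAI SDK, and the shorter masked prefix is only a suggestion.

```python
import os

PLACEHOLDER = "sk-..."  # the unedited KAGGLE_BACKUP value from the workshop cell


def resolve_api_key(backup=PLACEHOLDER, env_var="OPENAI_API_KEY"):
    """Return the key from the environment, else the backup, else None.

    The untouched placeholder counts as "no key", so a missing key is
    reported here instead of surfacing later as an authentication error.
    """
    key = os.getenv(env_var) or backup
    if not key or key == PLACEHOLDER:
        return None
    return key


def mask_key(key, visible=3):
    """Show only a short prefix so the key does not land in saved output."""
    return key[:visible] + "..."


key = resolve_api_key()
if key:
    print(f"API key loaded: {mask_key(key)}")
else:
    print("No API key found. Set OPENAI_API_KEY in .env or edit KAGGLE_BACKUP.")
```

In the notebook, `resolve_api_key(backup=KAGGLE_BACKUP)` would take the place of the `os.getenv(...)` call.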


## Part 1: Data Loaders - Getting Data from the Wild

Loading data from websites

In [5]:
from langchain_community.document_loaders import WebBaseLoader

# Load a Wikipedia page about something random 
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Special:Random")
docs = loader.load()

# Clean up page content to remove extra whitespace
for doc in docs:
    doc.page_content = " ".join(doc.page_content.split())

print(f"Loaded {len(docs)} documents")
print(f"Content length: {len(docs[0].page_content)} characters")
print("\nStart of article:")
print(docs[0].page_content[500:3000] + "...")

Loaded 1 documents
Content length: 6828 characters

Start of article:
able of contents Frystown, Pennsylvania 13 languages ÿ™€Üÿ±⁄©ÿ¨ŸáCebuanoEspa√±olŸÅÿßÿ±ÿ≥€åFran√ßais⁄Ø€åŸÑ⁄©€åItalianoLadin–ù–æ—Ö—á–∏–π–Ω–°—Ä–ø—Å–∫–∏ / srpskiSrpskohrvatski / —Å—Ä–ø—Å–∫–æ—Ö—Ä–≤–∞—Ç—Å–∫–∏–¢–∞—Ç–∞—Ä—á–∞ / tatar√ßa–£–∫—Ä–∞—ó–Ω—Å—å–∫–∞ Edit links ArticleTalk English ReadEditView history Tools Tools move to sidebar hide Actions ReadEditView history General What links hereRelated changesUpload filePermanent linkPage informationCite this pageGet shortened URLDownload QR code Print/export Download as PDFPrintable version In other projects Wikimedia CommonsWikidata item Appearance move to sidebar hide Coordinates: 40¬∞26‚Ä≤59‚Ä≥N 76¬∞20‚Ä≤03‚Ä≥WÔªø / Ôªø40.44972¬∞N 76.33417¬∞WÔªø / 40.44972; -76.33417 From Wikipedia, the free encyclopedia Unincorporated community in Pennsylvania, US Census-designated place in Pennsylvania, United StatesFrystown, PennsylvaniaCensus-designated placeFrystownShow map of Pennsylvan
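
The output above begins with Wikipedia's navigation chrome (language list, toolbar labels) rather than the article text, which is why the cell slices from character 500. A rough alternative is to cut at a known marker, as sketched below. `strip_leading_chrome` is a hypothetical helper, and the marker string is an assumption taken from this one sample output; other sites would need a different marker or proper HTML filtering.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())


def strip_leading_chrome(text, marker="From Wikipedia, the free encyclopedia"):
    """Drop everything up to and including `marker`, if it is present.

    If the marker is missing, the text is returned unchanged, so the
    helper is safe to call on pages that are not from Wikipedia.
    """
    _before, found, after = text.partition(marker)
    return after.lstrip() if found else text


# Made-up sample imitating the shape of the loaded page content.
raw = "Tools   move to sidebar\n hide From Wikipedia, the free encyclopedia  Frystown is a community."
print(strip_leading_chrome(normalize_whitespace(raw)))
```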

In [6]:
# Ask a question about the loaded content
from langchain_core.prompts import ChatPromptTemplate

# Use only part of the article to keep token usage low.
summary_prompt = ChatPromptTemplate.from_template(
    "Based on this content, answer in 1-4 sentences: {question}\n\nContent: {content}"
)

question = "What is this article about?"
response = llm.invoke(summary_prompt.format(
    question=question,
    content=docs[0].page_content[500:3000]
))

print(f"Q: {question}")
print(f"A: {response.content}")

Q: What is this article about?
A: This article is about Frystown, Pennsylvania, an unincorporated community and census-designated place in Bethel Township, Berks County. It provides geographic and demographic details (location coordinates, area of about 1.22 sq mi, population 355 in 2020, ZIP codes 17067 and 19507) and information on local administration (Tulpehocken Area School District) and nearby features like Little Swatara Creek and Interstate 78.
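
The `[500:3000]` slice above cuts wherever the character count lands, often in the middle of a word. A minimal sketch of truncating on a word boundary follows. `truncate_words` is a hypothetical helper, and a character budget is only a rough proxy for the model's token count.

```python
def truncate_words(text, max_chars):
    """Return at most `max_chars` characters, ending on a whole word."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the character right after the cut is not whitespace, the last
    # word in `cut` is partial, so back up to the previous space.
    if not text[max_chars].isspace():
        head, sep, _tail = cut.rpartition(" ")
        cut = head if sep else cut
    return cut.rstrip()


print(truncate_words("alpha beta gamma", 8))
```

A single word longer than the budget is still cut mid-word, since there is no earlier space to fall back to.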


Loading data from YouTube videos

In [7]:
# YouTube loader backed by yt-dlp (the langchain-yt-dlp package installed above)
from langchain_yt_dlp.youtube_loader import YoutubeLoaderDL

# Load any YouTube video
youtube_loader = YoutubeLoaderDL.from_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",  # Replace with any video URL
    add_video_info=True
)

try:
    youtube_docs = youtube_loader.load()

    for doc in youtube_docs:
        doc.metadata['description'] = " ".join(doc.metadata['description'].split())

    print(f"YouTube description loaded: {len(youtube_docs[0].metadata['description'])} characters")
    print("\nFirst 300 characters:")
    print(youtube_docs[0].metadata['description'][:300] + "...")
    print(youtube_docs[0].metadata)
except Exception as e:
    print(f"YouTube loading failed (this is common): {e}")
    print("We'll use web content instead for the exercises")

YouTube description loaded: 2349 characters

First 300 characters:
The official video for “Never Gonna Give You Up” by Rick Astley. Never: The Autobiography 📚 OUT NOW! Follow this link to get your copy and listen to Rick’s ‘Never’ playlist ❤️ #RickAstleyNever https://linktr.ee/rickastleynever “Never Gonna Give You Up” was a global smash on its release in July 1987,...
{'source': 'dQw4w9WgXcQ', 'title': 'Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)', 'description': 'The official video for “Never Gonna Give You Up” by Rick Astley. Never: The Autobiography 📚 OUT NOW! Follow this link to get your copy and listen to Rick’s ‘Never’ playlist ❤️ #RickAstleyNever https://linktr.ee/rickastleynever “Never Gonna Give You Up” was a global smash on its release in July 1987, topping the charts in 25 countries including Rick’s native UK and the US Billboard Hot 100. It also won the Brit Award for Best single in 1988. Stock Aitken and

In [8]:
# Ask a question if YouTube loading succeeded
if 'youtube_docs' in locals() and youtube_docs:
    question = "Summarize the main topic of this video in one sentence."
    response = llm.invoke(summary_prompt.format(
        question=question,
        content=youtube_docs[0].metadata['description']
    ))
    print(f"Q: {question}")
    print(f"A: {response.content}")
else:
    print("Skipping YouTube Q&A since loading failed")

Q: Summarize the main topic of this video in one sentence.
A: This is the official music video for Rick Astley's 1987 hit "Never Gonna Give You Up," highlighting its release, chart-topping success, production by Stock Aitken and Waterman, and its enduring popularity.
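
The loading cell indexes `doc.metadata['description']` directly, which raises `KeyError` if a video's metadata lacks that field and `AttributeError` if the value is `None`. A defensive variant might look like the sketch below. `describe_video` is a hypothetical helper, and falling back to `title` is an assumption about what is useful, not something the loader does.

```python
def describe_video(metadata, fallback_keys=("title",)):
    """Return a whitespace-normalized description, or a fallback field.

    Handles a missing key as well as a value of None or an empty string.
    """
    for key in ("description", *fallback_keys):
        value = metadata.get(key)
        if value:
            return " ".join(str(value).split())
    return ""


# Made-up metadata: the description is missing, so the title is used.
print(describe_video({"title": "Demo  video", "description": None}))
```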


Loading data from news articles

In [9]:
# RSS/News loader
from langchain_community.document_loaders import RSSFeedLoader

# Load recent news
rss_loader = RSSFeedLoader(
    urls=["https://feeds.bbci.co.uk/news/business/rss.xml?edition=int"],
    show_progress_bar=True
)

try:
    news_docs = rss_loader.load()
    for doc in news_docs:
        doc.page_content = " ".join(doc.page_content.split())

    print(f"Loaded {len(news_docs)} news articles")
    print("\nFirst article title:", news_docs[0].metadata.get('title', 'No title'))
    print("Summary:", news_docs[0].page_content[:200] + "...")
except Exception as e:
    print(f"RSS loading failed: {e}")

50it [00:16,  2.98it/s]

Loaded 50 news articles

First article title: Don't force drivers to use parking apps, RAC says
Summary: Don't force drivers to use parking apps, RAC says The RAC said paying for parking with an app should not be the only option for drivers The RAC welcomed the NPP but said more local authorities and par...
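
Feed and page loaders depend on the network, so a single failure is not always final. The sketch below retries a loader a few times with a growing delay. `load_with_retry` is a hypothetical helper, and `flaky_load` only simulates a loader's `.load` method so the example runs offline.

```python
import time


def load_with_retry(load_fn, attempts=3, base_delay=1.0):
    """Call `load_fn` until it succeeds or `attempts` is exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries and
    re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return load_fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)


# Stand-in for a real loader's .load: fails twice, then returns documents.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network error")
    return ["doc-1", "doc-2"]


sample_docs = load_with_retry(flaky_load, attempts=3, base_delay=0.01)
print(f"Loaded {len(sample_docs)} documents after {calls['count']} attempts")
```

With the loader from the cell above, the call would be `load_with_retry(rss_loader.load)`.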





In [10]:
# Ask about the news articles if loading succeeded
if 'news_docs' in locals() and news_docs:
    # Combine first 3 article titles and summaries, truncated
    news_sample = "\n".join([
        f"{doc.metadata.get('title', 'No title')}: {doc.page_content[:500]}" 
        for doc in news_docs[:3]
    ])
    
    question = "What are the main themes in these articles? Use examples from the articles to justify your response."
    response = llm.invoke(summary_prompt.format(
        question=question,
        content=news_sample
    ))
    print(f"Q: {question}")
    print(f"A: {response.content}")
else:
    print("Skipping RSS Q&A since loading failed")

Q: What are the main themes in these articles? Use examples from the articles to justify your response.
A: The main themes are policy decisions and their real-world effects across technology, taxation, and international finance. In parking, the RAC warns against forcing drivers to use parking apps and notes issues like poor phone signal and the app not recognizing car parks, even as it welcomes the National Parking Platform. In housing, the debate over abolishing stamp duty shows how tax policy could reshape the housing market and political calculations around it. Internationally, the US move to implement a controversial rescue plan for Argentina illustrates how policy interventions are used to address currency crises.


## Part 2: Output Parsers - Getting Structured Responses

In [11]:
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import List

In [14]:
# Structured data with Pydantic
class MovieReview(BaseModel):
    title: str = Field(description="Movie title")
    rating: int = Field(description="Rating from 1-10")
    pros: List[str] = Field(description="List of positive aspects")
    cons: List[str] = Field(description="List of negative aspects")
    recommended: bool = Field(description="Whether you'd recommend it")

parser = PydanticOutputParser(pydantic_object=MovieReview)

prompt = PromptTemplate(
    template="Write a review for the movie '{movie}'.\n{format_instructions}",
    input_variables=["movie"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

instructions = prompt.format(movie="The Matrix but everyone is a rubber duck")
print("What the LLM received", instructions)
response = llm.invoke(instructions)
parsed_review = parser.parse(response.content)

print("Structured review:")
print(f"Title: {parsed_review.title}")
print(f"Rating: {parsed_review.rating}/10")
print(f"Pros: {parsed_review.pros}")
print(f"Cons: {parsed_review.cons}")
print(f"Recommended: {parsed_review.recommended}")

What the LLM received Write a review for the movie 'The Matrix but everyone is a rubber duck'.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"title": {"description": "Movie title", "title": "Title", "type": "string"}, "rating": {"description": "Rating from 1-10", "title": "Rating", "type": "integer"}, "pros": {"description": "List of positive aspects", "items": {"type": "string"}, "title": "Pros", "type": "array"}, "cons": {"description": "List of negative aspects", "items": {"type": "string"}, "title": "Cons", "type": "array"}, "recommended": {"description": "Whether you'
```

## Exercise 1: Bizarre News Summarizer

Create a system that loads weird Wikipedia pages and generates structured summaries with conspiracy-theory-style interpretations.

In [None]:
from pydantic import BaseModel, Field
from typing import List

# TODO: Define your conspiracy theory summary structure
class ConspiracySummary(BaseModel):
    # YOUR CODE HERE: Add fields for:
    # - title: str
    # - real_summary: str (actual factual summary)
    # - conspiracy_theory: str (humorous conspiracy interpretation)
    # - evidence_points: List[str] ("evidence" for the conspiracy)
    # - danger_level: int (1-10 scale)
    pass

# TODO: Create parser and prompt template

def analyze_weird_topic(wikipedia_url):
    """Load a Wikipedia page and generate a conspiracy analysis"""
    # YOUR CODE HERE:
    # 1. Load the webpage
    # 2. Use your structured parser to analyze it
    # 3. Return the parsed result
    pass

# Test with these weird Wikipedia topics:
weird_topics = [
    "https://en.wikipedia.org/wiki/Cargo_cult",
    "https://en.wikipedia.org/wiki/Kentucky_meat_shower",
    "https://en.wikipedia.org/wiki/Dancing_plague_of_1518"
]

# analyze_weird_topic(weird_topics[0])

### Solution:

In [15]:
class ConspiracySummary(BaseModel):
    title: str = Field(description="Topic title")
    real_summary: str = Field(description="Factual 2-sentence summary")
    conspiracy_theory: str = Field(description="Humorous conspiracy interpretation")
    evidence_points: List[str] = Field(description="List of 'evidence' supporting the conspiracy")
    danger_level: int = Field(description="Danger level from 1-10")

conspiracy_parser = PydanticOutputParser(pydantic_object=ConspiracySummary)

conspiracy_prompt = PromptTemplate(
    template="""
    Analyze this content and create both a factual summary and a humorous conspiracy theory interpretation:
    
    {content}
    
    {format_instructions}
    
    Make the conspiracy theory silly but creative. Include fake "evidence" points.
    """,
    input_variables=["content"],
    partial_variables={"format_instructions": conspiracy_parser.get_format_instructions()}
)

def analyze_weird_topic(wikipedia_url):
    loader = WebBaseLoader(wikipedia_url)
    docs = loader.load()
    
    # Use first 2000 characters to avoid token limits
    content = docs[0].page_content[:2000]
    
    response = llm.invoke(conspiracy_prompt.format(content=content))
    return conspiracy_parser.parse(response.content)

# Test it
result = analyze_weird_topic("https://en.wikipedia.org/wiki/Rubber_duck_debugging")
print(f"Title: {result.title}")
print(f"Real Summary: {result.real_summary}")
print(f"Conspiracy: {result.conspiracy_theory}")
print(f"Evidence: {result.evidence_points}")
print(f"Danger Level: {result.danger_level}/10")

Title: Rubber duck debugging
Real Summary: Rubber duck debugging is a software engineering debugging method in which a programmer explains their code, step by step, to a rubber duck to reveal mistakes and misunderstandings. The technique externalizes the programmer's thought process, helping identify logical errors that might be hidden during silent review.
Conspiracy: A silly, fictional conspiracy claims that rubber ducks are secretly coordinating debugging across the software industry. According to the theory, a hidden guild of 'Quack Debuggers' uses squeaks encoded in test outputs and desk duck positions to guide code toward more robust designs, with the Pragmatic Programmer anecdote acting as recruitment lore for this covert order. Every time a coder explains code to a duck, tiny algorithmic nudges are allegedly seeded into the project, while ducks communicate via a secret bathtime dashboard only accessible to those who own a rubber duck.
Evidence: ['Ducks on developer desks are ro
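
The 1-10 scale for `danger_level` (and for `rating` in the movie example) is stated only in the field description, so nothing rejects a reply of 11. Pydantic's `ge`/`le` constraints can enforce the range when the reply is parsed. The sketch below uses a trimmed-down model, `BoundedSummary`, a hypothetical name for illustration, and hand-written values standing in for a parsed LLM reply.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedSummary(BaseModel):
    title: str = Field(description="Topic title")
    # ge/le make the 1-10 range a validation rule, not just a hint to the LLM.
    danger_level: int = Field(ge=1, le=10, description="Danger level from 1-10")


ok = BoundedSummary(title="Rubber duck debugging", danger_level=3)
print(ok.danger_level)

try:
    BoundedSummary(title="Rubber duck debugging", danger_level=11)
except ValidationError as exc:
    print(f"Rejected out-of-range value: {len(exc.errors())} error(s)")
```

The bounds should also show up in the model's JSON schema, so the format instructions sent to the LLM would mention them, though the model can still ignore them.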

## Summary

**Data Loaders**: Get content from web pages, YouTube, RSS feeds, and more.

**Output Parsers**: Transform unstructured AI responses into structured data you can actually use in applications.
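
To make that concrete without calling a model, here is a rough stdlib-only sketch of what a JSON output parser has to do: find the JSON object in a reply that may include prose or a Markdown code fence, then decode it. `extract_json` is a hypothetical helper, not LangChain's implementation, and the reply string is made up for the example.

```python
import json


def extract_json(reply):
    """Decode the first JSON object found in `reply`.

    Tries each "{" in turn and lets json.JSONDecoder.raw_decode parse
    from there, which skips surrounding prose and code-fence markers.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


fence = "`" * 3  # built at runtime to avoid a literal fence inside this block
reply = "Sure! " + fence + 'json\n{"title": "The Matrix", "rating": 8}\n' + fence
print(extract_json(reply)["rating"])
```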

**Key Patterns**:
- Use Pydantic models to define your desired output structure
- Include format instructions in your prompts
- Handle loading errors gracefully
- Truncate content to avoid token limits

Next up: Chat models, agents, and vectorstores!