# L3: Custom Components - News Summarizer


In [1]:
import warnings
from helper import load_env

warnings.filterwarnings('ignore')
load_env()

In [2]:
import requests

from typing import List

from haystack import Document, Pipeline, component
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument

<p style="background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> 💻 &nbsp; <b>Access <code>requirements.txt</code> and <code>helper.py</code> files:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>. For more help, please see the <em>"Appendix - Tips and Help"</em> Lesson.</p>

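## Custom Component Requirements

A custom component is a class decorated with `@component` that defines a `run()` method. `run()` must return a dictionary whose keys match the outputs declared with `@component.output_types`.

#### Build a Custom Component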


In [3]:
@component
class Greeter:

    @component.output_types(greeting=str)
    def run(self, user_name: str):
        return {"greeting": f"Hello {user_name}"}

#### Run the Component

In [4]:
greeter = Greeter()

greeter.run(user_name="Tuana")

{'greeting': 'Hello Tuana'}

#### Add the Component to a Pipeline

In [5]:
greeter = Greeter()
template = """ You will be given the beginning of a dialogue. 
Create a short play script using this as the start of the play.
Start of dialogue: {{ dialogue }}
Full script: 
"""
prompt = PromptBuilder(template=template)
llm = OpenAIGenerator()

dialogue_builder = Pipeline()
dialogue_builder.add_component("greeter", greeter)
dialogue_builder.add_component("prompt", prompt)
dialogue_builder.add_component("llm", llm)

dialogue_builder.connect("greeter.greeting", "prompt.dialogue")
dialogue_builder.connect("prompt", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fd7b2c09390>
🚅 Components
  - greeter: Greeter
  - prompt: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - greeter.greeting -> prompt.dialogue (str)
  - prompt.prompt -> llm.prompt (str)
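To make the two connections above concrete, here is a toy sketch in plain Python. It is not how Haystack works internally: `greeter_run`, `prompt_run`, and `TEMPLATE` are hypothetical stand-ins, and `str.format` replaces the Jinja rendering that `PromptBuilder` performs. It only illustrates that each component returns a dictionary keyed by output name, and that a connection passes one of those values to a named input of the next component.

```python
# Toy illustration only: plain functions standing in for pipeline components.

def greeter_run(user_name: str) -> dict:
    # Same contract as Greeter.run(): a dict keyed by the output name
    return {"greeting": f"Hello {user_name}"}


TEMPLATE = (
    "You will be given the beginning of a dialogue.\n"
    "Create a short play script using this as the start of the play.\n"
    "Start of dialogue: {dialogue}\n"
    "Full script:\n"
)


def prompt_run(dialogue: str) -> dict:
    # Stand-in for PromptBuilder: fill the template variable,
    # then return the result under the "prompt" output name
    return {"prompt": TEMPLATE.format(dialogue=dialogue)}


greeter_out = greeter_run(user_name="Tuana")
# Plays the role of connect("greeter.greeting", "prompt.dialogue")
prompt_out = prompt_run(dialogue=greeter_out["greeting"])
print(prompt_out["prompt"])
```

The string in `prompt_out["prompt"]` corresponds to what the second connection, `prompt.prompt -> llm.prompt`, would hand to the generator.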

In [None]:
dialogue_builder.show()

In [6]:
dialogue = dialogue_builder.run({"greeter": {"user_name": "Tuana"}})

print(dialogue["llm"]["replies"][0])

Characters:
SARAH - A young woman in her 20s
TIANA - Sarah's best friend

(As the lights come up, SARAH is sitting on a park bench, looking down at her phone. TIANA walks up to her.)

TIANA: Hello Tuana.

SARAH: (looks up, surprised) Tiana? What are you doing here? I thought you were out of town.

TIANA: I was, but I had to come back early. (sits down next to Sarah) What's been going on with you? You seem really distracted.

SARAH: (sighs) I've just been going through a lot lately. Work has been overwhelming, I'm having trouble with my boyfriend, and I just feel like everything is falling apart.

TIANA: I'm so sorry, Sarah. I wish you had told me sooner. You know I'm here for you, right?

SARAH: I know, Tiana. And I appreciate that more than you know. I just feel like I'm drowning in all this stress and I don't know how to cope.

TIANA: Well, how about this? Let's take a break from all the chaos and go grab some coffee. We can sit and chat and just take a moment to breathe. How does th

## Build a Hacker News Summarizer

> **Note:** Your results will differ from those shown in the recording, because the application fetches whatever the current top posts on Hacker News are at the time you run it.

In [7]:
trending_list = requests.get(
    url="https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty"
)
post = requests.get(
    url=f"https://hacker-news.firebaseio.com/v0/item/{trending_list.json()[0]}.json?print=pretty"
)

print(post.json())

{'by': 'brig90', 'descendants': 54, 'id': 44022353, 'kids': [44023494, 44024081, 44022718, 44024059, 44023668, 44022679, 44023233, 44022579, 44022597, 44023944, 44023829, 44023658, 44023866, 44022560, 44023644, 44022686, 44022868, 44022550], 'score': 202, 'text': 'I built this project as a way to learn more about NLP by applying it to something weird and unsolved.<p>The Voynich Manuscript is a 15th-century book written in an unknown script. No one’s been able to translate it, and many think it’s a hoax, a cipher, or a constructed language. I wasn’t trying to decode it — I just wanted to see: does it behave like a structured language?<p>I stripped a handful of common suffix-like endings (aiin, dy, etc.) to isolate what looked like root forms. I know that’s a strong assumption — I call it out directly in the repo — but it helped clarify the clustering. From there, I used SBERT embeddings and KMeans to group similar roots, inferred POS-like roles based on position and frequency, and built

In [10]:
@component
class HackernewsNewestFetcher:
    def __init__(self):
        fetcher = LinkContentFetcher()
        converter = HTMLToDocument()

        html_conversion_pipeline = Pipeline()
        html_conversion_pipeline.add_component("fetcher", fetcher)
        html_conversion_pipeline.add_component("converter", converter)

        html_conversion_pipeline.connect("fetcher", "converter")
        self.html_pipeline = html_conversion_pipeline
        
    @component.output_types(articles=List[Document])
    def run(self, top_k: int):
        articles = []
        trending_list = requests.get(
            url="https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty"
        )
        for post_id in trending_list.json()[0:top_k]:
            # Parse the item once instead of calling .json() repeatedly
            post = requests.get(
                url=f"https://hacker-news.firebaseio.com/v0/item/{post_id}.json?print=pretty"
            ).json()
            if "url" in post:
                # Link post: download the linked page and convert it to a Document
                try:
                    article = self.html_pipeline.run(
                        {"fetcher": {"urls": [post["url"]]}}
                    )
                    articles.append(article["converter"]["documents"][0])
                except Exception:
                    print(f"Can't download {post['url']}, skipped")
            elif "text" in post:
                # Text-only post: use the post body as the Document content
                try:
                    articles.append(Document(content=post["text"], meta={"title": post["title"]}))
                except Exception:
                    print(f"Can't process post {post_id}, skipped")
        return {"articles": articles}
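The branching in `run()` can be tried without any network calls. The sketch below is a standalone illustration, not part of the component: `classify_post` is a hypothetical helper, and the sample items are made up rather than real API responses. It mirrors the order of the checks above, so an item with both `url` and `text` is treated as a link post.

```python
from typing import Optional


def classify_post(item: Optional[dict]) -> str:
    """Return 'link', 'text', or 'skip' for a Hacker News item dict."""
    if not isinstance(item, dict):
        # Defensive: anything that is not a JSON object is skipped
        return "skip"
    if "url" in item:
        # Checked first, as in the component's run() method
        return "link"
    if "text" in item:
        return "text"
    return "skip"


# Hypothetical sample items, hand-written for illustration
samples = [
    {"id": 1, "title": "Show HN: a project", "url": "https://example.com/project", "text": "I built this..."},
    {"id": 2, "title": "Ask HN: a question", "text": "What do you use for..."},
    {"id": 3, "title": "An item with neither field"},
    None,
]

labels = [classify_post(s) for s in samples]
print(labels)
```

Items labelled `skip` correspond to the case where `run()` adds nothing to `articles`, which is why the component can return fewer than `top_k` documents.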

In [11]:
fetcher = HackernewsNewestFetcher()
results = fetcher.run(top_k=3)

print(results['articles'])

[Document(id=917409cb3d5e8930aa4ac860c20c96e4650140018a7cc9f3c7a168d9ac209a23, content: 'This started as a personal challenge to figure out what modern NLP could tell us about the Voynich M...', meta: {'content_type': 'text/html', 'url': 'https://github.com/brianmg/voynich-nlp-analysis'}), Document(id=3491a2d399c559c9f59f8daecb3bd14763ee767652ffdc140ccd27ad5fc38834, content: 'Spaced repetition recap
Mastering any subject is built on a foundation of knowledge: knowledge of fa...', meta: {'content_type': 'text/html', 'url': 'https://domenic.me/fsrs/'}), Document(id=e9d7ab13b83e59581de2351d24351fa5dd086ebfbdc194f26011bd231d3329c9, content: 'Ditching Obsidian and building my own
Amber Williams
May 5, 2025 · 8 mins
"You can’t really know whe...', meta: {'content_type': 'text/html', 'url': 'https://amberwilliams.io/blogs/building-my-own-pkms'})]


In [12]:
prompt_template = """  
You will be provided a few of the top posts in HackerNews.  
For each post, provide a brief summary if possible.
  
Posts:  
{% for article in articles %}
  Post:\n
  {{ article.content }}
{% endfor %}  
"""

In [13]:
prompt_builder = PromptBuilder(template=prompt_template)
fetcher = HackernewsNewestFetcher()
llm = OpenAIGenerator()

summarizer_pipeline = Pipeline()
summarizer_pipeline.add_component("fetcher", fetcher)
summarizer_pipeline.add_component("prompt", prompt_builder)
summarizer_pipeline.add_component("llm", llm)

summarizer_pipeline.connect("fetcher.articles", "prompt.articles")
summarizer_pipeline.connect("prompt", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fd7b2c5de10>
🚅 Components
  - fetcher: HackernewsNewestFetcher
  - prompt: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - fetcher.articles -> prompt.articles (List[Document])
  - prompt.prompt -> llm.prompt (str)

In [None]:
summarizer_pipeline.show()

In [14]:
summaries = summarizer_pipeline.run({"fetcher": {"top_k": 3}})

print(summaries["llm"]["replies"][0])

Post 1: 
The post discusses a project to analyze the Voynich Manuscript using modern NLP techniques to understand its structure without attempting a translation. The project involved clustering of words, deriving a lexicon hypothesis, and examining cluster similarities to real-world languages. The project aimed to model the manuscript's structure using computational linguistics and offers insights into the organization of the mysterious manuscript.

Post 2: 
The post explores the concept of spaced repetition as a method to enhance learning and knowledge retention. It delves into the development of a new scheduling algorithm, FSRS, that has improved spaced repetition systems by optimizing review intervals based on the probability of recall. The impact of FSRS on efficient learning is discussed, along with comparisons to other existing spaced repetition algorithms.

Post 3: 
In this post, the author shares their journey of ditching commercial note-taking apps like Obsidian and instead, b

In [15]:
prompt_template = """  
You will be provided a few of the top posts in HackerNews, followed by their URL.  
For each post, provide a brief summary followed by the URL the full post can be found at.  
  
Posts:  
{% for article in articles %}  
  {{ article.content }}
  URL: {{ article.meta["url"] }}
{% endfor %}  
"""

prompt_builder = PromptBuilder(template=prompt_template)
fetcher = HackernewsNewestFetcher()
llm = OpenAIGenerator()

summarizer_pipeline = Pipeline()
summarizer_pipeline.add_component("fetcher", fetcher)
summarizer_pipeline.add_component("prompt", prompt_builder)
summarizer_pipeline.add_component("llm", llm)

summarizer_pipeline.connect("fetcher.articles", "prompt.articles")
summarizer_pipeline.connect("prompt", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fd7b0728f50>
🚅 Components
  - fetcher: HackernewsNewestFetcher
  - prompt: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - fetcher.articles -> prompt.articles (List[Document])
  - prompt.prompt -> llm.prompt (str)
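One detail of this template is worth noting: documents that the fetcher builds from text-only posts carry a `title` in `meta` but no `url`. The sketch below is a plain-Python stand-in for the template's `for` loop, written under the assumption that a missing `url` should simply render as an empty string. `render_posts` and the sample articles are hypothetical, and dictionaries are used in place of `Document` objects.

```python
from typing import Dict, List


def render_posts(articles: List[Dict]) -> str:
    """Plain-Python stand-in for the {% for %} loop in the prompt template."""
    parts = ["Posts:"]
    for article in articles:
        parts.append(article["content"])
        # Text-only posts have no "url" in meta, so fall back to an empty string
        parts.append(f"URL: {article['meta'].get('url', '')}")
    return "\n".join(parts)


# Hypothetical articles, not real fetcher output
articles = [
    {"content": "An article about spaced repetition.", "meta": {"url": "https://example.com/a"}},
    {"content": "A text-only post.", "meta": {"title": "Ask HN: example"}},
]

rendered = render_posts(articles)
print(rendered)
```

If the summaries should always end with a link, one option is to store a URL in `meta` for text-only posts as well when constructing the `Document`.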

In [16]:
summaries = summarizer_pipeline.run({"fetcher": {"top_k": 2}})

print(summaries["llm"]["replies"][0])

Summary: This post discusses the use of spaced repetition systems in learning and how a new scheduling algorithm known as FSRS has revolutionized the efficiency and effectiveness of these systems. The FSRS algorithm focuses on predicting when the probability of recalling information drops to 90% and uses machine learning to optimize scheduling intervals. The post explains how FSRS works, its parameters, and its practical applications in tools like Anki for language learning. A comparison is made with other popular language learning platforms like WaniKani and Bunpro, highlighting the superior performance of FSRS in retaining knowledge.

URL: https://domenic.me/fsrs/  


### Extra resources! 

Learn more about the Haystack integrations:

* [deepset-ai GitHub repo](https://github.com/deepset-ai/haystack-integrations)
* [haystack.deepset.ai/integrations](https://haystack.deepset.ai/integrations)