# Scraping a website with a Python script

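First, I try the simplest possible option: scrape the website, then summarise the text with Hugging Face Transformers alone (no LangChain), using BART by Facebook (`facebook/bart-large-cnn`), a small model optimised for summarisation. A second small model, `google/flan-t5-base`, writes the title from the summary.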

In [1]:
from bs4 import BeautifulSoup
import requests
from transformers import pipeline

# Step 1: Scrape Text from a Website
def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract only paragraph (<p>) text to avoid menus, headers and footers
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    text = ' '.join(paragraphs)

    return text  # Return only extracted paragraph text

# Step 2: Generate Summary and Title using Hugging Face Transformers
def summarize_and_title(url):
    text = scrape_website(url)
    
    # Initialize Hugging Face pipelines
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", framework="pt")
    title_generator = pipeline("text2text-generation", model="google/flan-t5-base", framework="pt")
    
    # Generate summary
    summary = summarizer(text, max_length=150, min_length=50, do_sample=False)[0]['summary_text']

    # Generate title (using the summary as input)
    title_prompt = f"Write a catchy and concise title for the following text:\n{summary}"
    title = title_generator(title_prompt, max_new_tokens=20, num_return_sequences=1)[0]['generated_text']
    
    return summary, title.strip().capitalize()

# Example Usage
url = "https://www.wix.com/encyclopedia/definition/artificial-intelligence"
summary, title = summarize_and_title(url)
print(f"Title: {title}\nSummary: {summary}")

Device set to use mps:0
Device set to use mps:0


Title: Artificial intelligence: what it is and how it works
Summary: Artificial intelligence is a branch of computer science that develops machine systems capable of demonstrating behaviors linked to human intelligence. The primary benefit of using AI is that these systems can potentially complete tasks better and more efficiently than humans. In order to fully understand what AI is and how it works, one must take into account the current state of AI.


There is a problem: the output closely resembles the first paragraph of the article. A likely reason is that BART accepts at most 1024 input tokens, so a longer text is truncated and only the initial part is summarised. In addition, the first paragraph often contains the most important ideas of the article, and BART tends to reuse wording close to the original, so the output is explainable. The next section tries to fix that.
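
As a rough illustration of the truncation effect, the sketch below estimates what share of a text fits within a 1024-token limit. It approximates tokens by whitespace-separated words, which is an assumption: BART's real tokenizer splits text into subword tokens, so real counts are usually higher and the real share lower. The name `visible_share` is hypothetical.

```python
def visible_share(text: str, token_limit: int = 1024) -> float:
    """Estimate the fraction of `text` that fits within `token_limit`.

    Tokens are approximated by whitespace-separated words (an assumption;
    a subword tokenizer would usually report more tokens than this).
    """
    n_tokens = len(text.split())
    if n_tokens == 0:
        return 1.0
    return min(1.0, token_limit / n_tokens)


article = "word " * 4096  # stand-in for a long scraped article
print(f"{visible_share(article):.0%} of the text reaches the model")
```

With an article four times longer than the limit, only the first quarter would be summarised, which is consistent with a summary that mirrors the opening paragraphs.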

#### Truncating and chunking the text:

The idea is to cap the scraped text at 4000 characters, break it into smaller chunks, summarise each chunk, and then summarise the combined chunk summaries.
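
The two-stage idea can be sketched independently of any model. In this minimal sketch, `first_sentence` is a toy stand-in for the real summarizer and `summarize_in_chunks` is a hypothetical helper name; only the control flow (summarise each chunk, then summarise the joined summaries) mirrors the approach described above.

```python
import textwrap
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Toy stand-in for a summarizer: keep only the first sentence."""
    head = text.split(". ")[0].strip()
    return head if head.endswith(".") else head + "."


def summarize_in_chunks(text: str, summarize: Callable[[str], str], width: int = 500) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    chunks: List[str] = textwrap.wrap(text, width=width)  # width is in characters
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]
    if len(partial) == 1:
        return partial[0]  # nothing to combine
    return summarize(" ".join(partial))


sample = "Alpha is first. Beta is second. " * 40
print(summarize_in_chunks(sample, first_sentence))
```

Passing the summarizer in as a function keeps the chunking logic testable without downloading a model.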

In [2]:
from bs4 import BeautifulSoup
import requests
from transformers import pipeline
import textwrap  

# Step 1: Scrape Text from a Website
def scrape_website(url, max_chars=4000):
    """Extracts paragraph text from a webpage and truncates it to a reasonable length."""
    response = requests.get(url)
    
    if response.status_code != 200:
        raise Exception(f"Failed to fetch webpage. Status code: {response.status_code}")
    
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all paragraph texts and join into a single string
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    text = ' '.join(paragraphs)

    # Truncate text if necessary
    return text[:max_chars] if len(text) > max_chars else text


# Step 2: Summarize Text with Chunking
def chunk_text(text, max_tokens=500):
    """Splits long text into smaller chunks to fit model constraints."""
    # textwrap.wrap measures width in characters, so max_tokens is really a character budget
    return textwrap.wrap(text, width=max_tokens)

def summarize_large_text(text):
    """Summarizes long text in chunks and then summarizes the combined result."""
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", framework="pt")
    
    chunks = chunk_text(text)
    
    # Generate a summary for each chunk
    summaries = [summarizer(chunk, max_length=150, min_length=50, do_sample=False)[0]['summary_text'] for chunk in chunks]

    # If there are multiple summaries, summarize them again
    if len(summaries) > 1:
        final_summary = summarizer(" ".join(summaries), max_length=150, min_length=50, do_sample=False)[0]['summary_text']
    else:
        final_summary = summaries[0]

    return final_summary


# Step 3: Generate Title Based on Summary
def generate_title(summary):
    """Creates a catchy and concise title from the summary."""
    title_generator = pipeline("text2text-generation", model="google/flan-t5-base", framework="pt")
    
    title_prompt = f"Write a catchy and concise title for the following text:\n{summary}"
    title = title_generator(title_prompt, max_new_tokens=20, num_return_sequences=1)[0]['generated_text']
    
    return title.strip().capitalize()


# Step 4: Main Function
def summarize_and_title(url):
    """Extracts text from a URL, summarizes it, and generates a title."""
    try:
        text = scrape_website(url)
        summary = summarize_large_text(text)
        title = generate_title(summary)

        return title, summary
    except Exception as e:
        return f"Error: {e}", ""


# Example Usage
if __name__ == "__main__":
    url = "https://www.wix.com/encyclopedia/definition/artificial-intelligence"
    title, summary = summarize_and_title(url)
    print(f"Title: {title}\nSummary: {summary}")

Device set to use mps:0
Your max_length is set to 150, but your input_length is only 88. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
Your max_length is set to 150, but your input_length is only 96. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 150, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)
Your max_length is set to 150, but your input_length is only 99. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('

Title: Artificial intelligence: what it is, what it does, and why
Summary: Artificial intelligence is a branch of computer science that develops machine systems capable of demonstrating behaviors linked to human intelligence. The purpose of AI is to improve the systems we already use by automating tasks to make them more efficient. An AI system needs to be built based on three main cognitive skills: Learning, Reasoning and Prediction.


This looks like a better summary. The warnings appear because I chunk by characters while the model counts tokens: a 500-character chunk is only about 100 tokens, which is less than the fixed `max_length` of 150. Character-based chunking also ignores sentence boundaries, so chunks can be split at weird places and end mid-sentence. Although this is annoying, the summary is still good, so I will ignore the warnings for now.
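
To see the mismatch concretely, the short sketch below wraps a made-up sample text the same way and prints the size of each chunk. The word count is used only as a rough lower bound for the token count, which is an assumption; the model's tokenizer would give the exact figure.

```python
import textwrap

# Made-up sample text, standing in for the scraped article
text = "Artificial intelligence systems learn from data. " * 30
chunks = textwrap.wrap(text, width=500)

for chunk in chunks:
    # len(chunk) counts characters; the word count is a rough proxy for tokens
    print(len(chunk), "characters,", len(chunk.split()), "words")
```

Every chunk stays under 500 characters but holds far fewer than 150 words, so a `max_length` of 150 asks for a summary longer than the input.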

#### Making the output longer: 

However, the whole summary is only three sentences long, so it is a good idea to make the output a little longer.

In [4]:
from bs4 import BeautifulSoup
import requests
from transformers import pipeline
import textwrap  

# Step 1: Scrape Text from a Website
def scrape_website(url, max_chars=4000):
    """Extracts paragraph text from a webpage and truncates it to a reasonable length."""
    response = requests.get(url)
    
    if response.status_code != 200:
        raise Exception(f"Failed to fetch webpage. Status code: {response.status_code}")
    
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all paragraph texts and join into a single string
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    text = ' '.join(paragraphs)

    # Truncate text if necessary
    return text[:max_chars] if len(text) > max_chars else text


# Step 2: Summarize Text with Longer Output
def chunk_text(text, max_tokens=500):
    """Splits long text into smaller chunks to fit model constraints."""
    # textwrap.wrap measures width in characters, so max_tokens is really a character budget
    return textwrap.wrap(text, width=max_tokens)

def summarize_large_text(text):
    """Summarizes long text in chunks and then summarizes the combined result."""
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", framework="pt")
    
    chunks = chunk_text(text)
    
    summaries = []
    for chunk in chunks:
        input_length = len(chunk.split())  # Approximate token count
        max_length = max(100, int(input_length * 1.2))  # Increase max_length to ~1.2x input
        summary = summarizer(chunk, max_length=max_length, min_length=50, do_sample=False)[0]['summary_text']
        summaries.append(summary)

    # If there are multiple summaries, summarize them again
    if len(summaries) > 1:
        combined_text = " ".join(summaries)
        input_length = len(combined_text.split())
        final_max_length = max(100, int(input_length * 1.2))  # Increase final summary length
        final_summary = summarizer(combined_text, max_length=final_max_length, min_length=80, do_sample=False)[0]['summary_text']
    else:
        final_summary = summaries[0]

    return final_summary


# Step 3: Generate Title Based on Longer Summary
def generate_title(summary):
    """Creates a catchy and concise title from the summary."""
    title_generator = pipeline("text2text-generation", model="google/flan-t5-base", framework="pt")
    
    title_prompt = f"Write a catchy and concise title for the following text:\n{summary}"
    title = title_generator(title_prompt, max_new_tokens=20, num_return_sequences=1)[0]['generated_text']
    
    return title.strip().capitalize()


# Step 4: Main Function
def summarize_and_title(url):
    """Extracts text from a URL, summarizes it with longer output, and generates a title."""
    try:
        text = scrape_website(url)
        summary = summarize_large_text(text)
        title = generate_title(summary)

        return title, summary
    except Exception as e:
        return f"Error: {e}", ""


# Example Usage
if __name__ == "__main__":
    url = "https://www.wix.com/encyclopedia/definition/artificial-intelligence"
    title, summary = summarize_and_title(url)
    print(f"Title: {title}\nSummary: {summary}")

Device set to use mps:0
Your max_length is set to 100, but your input_length is only 88. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
Your max_length is set to 100, but your input_length is only 96. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 103, but your input_length is only 99. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=49)
Your max_length is set to 100, but your input_length is only 5. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('..

Title: Artificial intelligence
Summary: Artificial intelligence is a branch of computer science that develops machine systems capable of demonstrating behaviors linked to human intelligence. AI programs use data collected from different interactions to improve the way they mimic humans in order to perform tasks such as learning, planning, knowledge representation, perception and problem-solving. Technology is used for a wide range of applications, including inweb development, chatbots for customer service, product recommendations based on user’s habits, speech recognition, and even tobuild a website from scratch.
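
The fused words in this summary ("inweb development", "tobuild") probably come from the scraper rather than from the model. With `strip=True` and no separator, Beautiful Soup's `get_text` strips each text node and joins the nodes with an empty string, so the space before an inline tag such as a link is lost. The HTML snippet below is made up to reproduce the effect; passing `" "` as the separator is one possible fix.

```python
from bs4 import BeautifulSoup

# Made-up paragraph with an inline link, similar to what an article page contains
html = "<p>It is used in <a href='#'>web development</a> and more.</p>"
p = BeautifulSoup(html, "html.parser").p

fused = p.get_text(strip=True)        # text nodes stripped, joined with ""
spaced = p.get_text(" ", strip=True)  # text nodes stripped, joined with " "
print(fused)
print(spaced)
```

The trade-off is that a separator can leave a stray space before punctuation that directly follows an inline tag.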


#### Dynamically adjusting the max_length: 

Now I will tackle the warnings by adjusting `max_length` dynamically, based on the length of each chunk.

In [5]:
from bs4 import BeautifulSoup
import requests
from transformers import pipeline
import textwrap  

# Step 1: Scrape Text from a Website
def scrape_website(url, max_chars=4000):
    """Extracts paragraph text from a webpage and truncates it to a reasonable length."""
    response = requests.get(url)
    
    if response.status_code != 200:
        raise Exception(f"Failed to fetch webpage. Status code: {response.status_code}")
    
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all paragraph texts and join into a single string
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    text = ' '.join(paragraphs)

    # Truncate text if necessary
    return text[:max_chars] if len(text) > max_chars else text


# Step 2: Summarize Text with Safe Length Adjustment
def chunk_text(text, max_tokens=500):
    """Splits long text into smaller chunks to fit model constraints."""
    # textwrap.wrap measures width in characters, so max_tokens is really a character budget
    return textwrap.wrap(text, width=max_tokens)

def safe_max_length(input_length, factor=1.2):
    """Returns a max_length that stays below the input length, with a floor of 30."""
    max_length = int(input_length * factor)

    # Cap below the input length to avoid the warning; for factor >= 1 this cap always applies
    max_length = min(max_length, input_length - 1)
    # Floor of 30 tokens for a usable summary; inputs shorter than 30 still trigger the warning
    max_length = max(30, max_length)

    return max_length

def summarize_large_text(text):
    """Summarizes long text in chunks and then summarizes the combined result."""
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", framework="pt")
    
    chunks = chunk_text(text)
    
    summaries = []
    for chunk in chunks:
        input_length = len(chunk.split())  # Approximate token count
        max_length = safe_max_length(input_length, factor=1.2)  # Dynamic & safe length
        summary = summarizer(chunk, max_length=max_length, min_length=30, do_sample=False)[0]['summary_text']
        summaries.append(summary)

    # If there are multiple summaries, summarize them again
    if len(summaries) > 1:
        combined_text = " ".join(summaries)
        input_length = len(combined_text.split())
        final_max_length = safe_max_length(input_length, factor=1.2)
        final_summary = summarizer(combined_text, max_length=final_max_length, min_length=50, do_sample=False)[0]['summary_text']
    else:
        final_summary = summaries[0]

    return final_summary


# Step 3: Generate Title Based on Longer Summary
def generate_title(summary):
    """Creates a catchy and concise title from the summary."""
    title_generator = pipeline("text2text-generation", model="google/flan-t5-base", framework="pt")
    
    title_prompt = f"Write a catchy and concise title for the following text:\n{summary}"
    title = title_generator(title_prompt, max_new_tokens=20, num_return_sequences=1)[0]['generated_text']
    
    return title.strip().capitalize()


# Step 4: Main Function
def summarize_and_title(url):
    """Extracts text from a URL, summarizes it with longer output, and generates a title."""
    try:
        text = scrape_website(url)
        summary = summarize_large_text(text)
        title = generate_title(summary)

        return title, summary
    except Exception as e:
        return f"Error: {e}", ""


# Example Usage
if __name__ == "__main__":
    url = "https://www.wix.com/encyclopedia/definition/artificial-intelligence"
    title, summary = summarize_and_title(url)
    print(f"Title: {title}\nSummary: {summary}")

Device set to use mps:0
Your max_length is set to 30, but your input_length is only 5. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)
Device set to use mps:0


Title: Artificial intelligence: what it means
Summary: Artificial intelligence is a branch of computer science that develops machine systems capable of demonstrating behaviors linked to human intelligence. The primary benefit of using AI is that these systems can potentially complete tasks better and more efficiently than humans. There are four main types of AI, according to a professor at Michigan State University. This categorization spans from the way we’re used to interacting with AI today, to the more “sci-fi” view of how AI might function in the future.


This leaves only one warning, which is expected: the last chunk holds whatever text remains (here only 5 tokens), so it is shorter than the 30-token floor set in `safe_max_length`.
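
A small worked example of the clamping shows where the remaining warning comes from. `safe_max_length` repeats the logic from the cell above; `capped_max_length` is a hypothetical variant, not part of the original code, that never asks for more tokens than the input contains.

```python
def safe_max_length(input_length, factor=1.2):
    """Same clamping as in the notebook cell: cap below the input, floor of 30."""
    max_length = int(input_length * factor)
    max_length = min(max_length, input_length - 1)
    return max(30, max_length)


def capped_max_length(input_length, floor=30):
    """Hypothetical variant: aim for half the input, but never exceed it."""
    target = max(floor, input_length // 2)
    return max(1, min(input_length - 1, target))


for n in (5, 40, 99):
    print(n, safe_max_length(n), capped_max_length(n))
```

For a 5-token chunk the original function returns 30, which is what triggers the warning, while the variant returns 4. If such a variant were used, `min_length` would need to shrink as well, or very short trailing chunks could simply be merged into the previous chunk.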

#### Trying another chunking method

I will also try a chunking method that respects the structure of the text: it splits at sentence boundaries with NLTK's `sent_tokenize`, so no chunk ends mid-sentence. Since these models work with tokens, cutting by tokens would make the most sense, but here the chunk size is still approximated by character count.
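
For comparison, here is a minimal sketch of sentence packing against a token budget instead of a character budget. The names `pack_sentences` and `count_words` are hypothetical, the regex sentence splitter is deliberately naive, and the default counter uses whitespace-separated words as a stand-in. In practice `count_tokens` could wrap the model's own tokenizer.

```python
import re
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(text.split())


def pack_sentences(text: str, max_tokens: int = 100,
                   count_tokens: Callable[[str], int] = count_words) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_tokens`.

    A single sentence longer than the budget becomes its own chunk.
    """
    # Naive split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush only when the chunk already holds something
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(pack_sentences("One two three. Four five six. Seven eight.", max_tokens=6))
```

Keeping the counter pluggable means the same packing loop works with characters, words, or real tokens.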

In [2]:
from bs4 import BeautifulSoup
import requests
from transformers import pipeline
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize


# Step 1: Scrape Text from a Website
def scrape_website(url, max_chars=4000):
    """Extracts paragraph text from a webpage and truncates it to a reasonable length."""
    response = requests.get(url)
    
    if response.status_code != 200:
        raise Exception(f"Failed to fetch webpage. Status code: {response.status_code}")
    
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all paragraph texts and join into a single string
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    text = ' '.join(paragraphs)

    # Truncate text if necessary
    return text[:max_chars] if len(text) > max_chars else text


# Step 2: Split Long Text into Smaller Chunks
def chunk_text(text, max_tokens=500):
    """Splits text into chunks that fit within the model's constraints using sentence boundaries."""
    sentences = sent_tokenize(text)  # Tokenize into sentences
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence)  # Approximate tokens by character length

        # Flush only a non-empty chunk, so an over-long first sentence does not create an empty chunk
        if current_chunk and current_length + sentence_length > max_tokens:
            # Store the current chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0

        current_chunk.append(sentence)
        current_length += sentence_length

    # Add any remaining text as a chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


# Step 3: Summarize Text in Chunks
def summarize_large_text(text):
    """Summarizes long text in chunks and then summarizes the combined result."""
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", framework="pt")
    
    chunks = chunk_text(text)
    
    # Generate a summary for each chunk
    summaries = [
        summarizer(chunk, max_length=150, min_length=50, do_sample=False)[0]['summary_text']
        for chunk in chunks
    ]

    # If multiple chunks, summarize again
    if len(summaries) > 1:
        final_summary = summarizer(" ".join(summaries), max_length=150, min_length=50, do_sample=False)[0]['summary_text']
    else:
        final_summary = summaries[0]

    return final_summary


# Step 4: Generate a Title from the Summary
def generate_title(summary):
    """Creates a catchy and concise title from the summary."""
    title_generator = pipeline("text2text-generation", model="google/flan-t5-base", framework="pt")
    
    title_prompt = f"Write a catchy and concise title for the following text:\n{summary}"
    title = title_generator(title_prompt, max_new_tokens=20, num_return_sequences=1)[0]['generated_text']
    
    return title.strip().capitalize()


# Step 5: Main Function
def summarize_and_title(url):
    """Extracts text from a URL, summarizes it, and generates a title."""
    try:
        text = scrape_website(url)
        summary = summarize_large_text(text)
        title = generate_title(summary)

        return title, summary
    except Exception as e:
        return f"Error: {e}", ""


# Example Usage
if __name__ == "__main__":
    url = "https://www.wix.com/encyclopedia/definition/artificial-intelligence"
    title, summary = summarize_and_title(url)
    print(f"Title: {title}\nSummary: {summary}")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
Device set to use mps:0
Your max_length is set to 150, but your input_length is only 86. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=43)
Your max_length is set to 150, but your input_length is only 76. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=38)
Your max_length is set to 150, but your input_length is only 70. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_len

Title: Artificial intelligence: what it means for humans
Summary: Artificial intelligence is a branch of computer science that develops machine systems capable of demonstrating behaviors linked to human intelligence. The primary benefit of using AI is that these systems can potentially complete tasks better and more efficiently than humans. Chatbots for customer service, product recommendations based on a user’s habits, speech recognition and even tobuild a website from scratch are all uses of AI.


The summary is OK, but the warnings persist: this cell goes back to a fixed `max_length` of 150, while each chunk is capped at 500 characters, which is only about 70 to 90 tokens. The text also contains many short sentences such as headings and bullet points, so chunks can end up well below the specified length. So for this text, dynamically adjusting `max_length` works best. However, all the summaries are OK as long as there is chunking, irrespective of the method.