# Scraping Websites
## Extracts data from a Website
Alternatives to the `newspaper` library used below:
<ul>
  <li><a target="_blank" href="https://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a></li>
  <li><a target="_blank" href="https://scrapy.org/">Scrapy</a></li>    
  <li><a target="_blank" href="https://www.selenium.dev/">Selenium</a> Web application test framework</li>    
  <li><a target="_blank" href="https://playwright.dev/">Playwright</a> Web application test framework</li>
</ul>

In [3]:
from bs4 import BeautifulSoup
from llama_index.core.schema import Document
import logging as log
import newspaper
import os

In [4]:
FORMAT_STRING = "%(module)s.%(funcName)s():%(lineno)d %(asctime)s\n[%(levelname)-5s] %(message)s\n"
log.basicConfig(level=log.INFO, format=FORMAT_STRING)

In [5]:
urls = [
    "https://docs.llamaindex.ai/en/stable/understanding",
    "https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms/",
    "https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/",
    "https://docs.llamaindex.ai/en/stable/understanding/querying/querying/"
]

In [6]:
pages_content = []

In [7]:
for url in urls:
    try:
        article = newspaper.Article(url)
        log.info(f'url:{url}')
        article.download() # Retrieves the content.
        article.parse() # Extracts the content
        log.info(f'Title:{article.title}')
        if len(article.text) > 0:
            pages_content.append({ "url": url, "title": article.title, "text": article.text })
            log.info(f'Text:\n{article.text}')
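    except Exception as exc:
        # Log the failure instead of silently skipping the URL.
        log.warning(f'Skipping {url}: {exc}')
        continue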
	
print(pages_content[0])

1493940982.<module>():4 2025-03-24 23:14:38,280
[INFO ] url:https://docs.llamaindex.ai/en/stable/understanding

1493940982.<module>():7 2025-03-24 23:14:40,841
[INFO ] Title:Building an LLM Application

1493940982.<module>():10 2025-03-24 23:14:40,842
[INFO ] Text:
Building an LLM application#

Welcome to Understanding LlamaIndex. This is a series of short, bite-sized tutorials on every stage of building an agentic LLM application to get you acquainted with how to use LlamaIndex before diving into more advanced and subtle strategies. If you're an experienced programmer new to LlamaIndex, this is the place to start.

Key steps in building an agentic LLM application#

Tip You might want to read our high-level concepts if these terms are unfamiliar.

This tutorial has three main parts: Building a RAG pipeline, Building an agent, and Building Workflows, with some smaller sections before and after. Here's what to expect:

Ready to dive in? Head to using LLMs.

1493940982.<module>():4 2025-0

{'url': 'https://docs.llamaindex.ai/en/stable/understanding', 'title': 'Building an LLM Application', 'text': "Building an LLM application#\n\nWelcome to Understanding LlamaIndex. This is a series of short, bite-sized tutorials on every stage of building an agentic LLM application to get you acquainted with how to use LlamaIndex before diving into more advanced and subtle strategies. If you're an experienced programmer new to LlamaIndex, this is the place to start.\n\nKey steps in building an agentic LLM application#\n\nTip You might want to read our high-level concepts if these terms are unfamiliar.\n\nThis tutorial has three main parts: Building a RAG pipeline, Building an agent, and Building Workflows, with some smaller sections before and after. Here's what to expect:\n\nReady to dive in? Head to using LLMs."}
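The loop above skips any URL that fails, so if every download fails, `pages_content[0]` raises an `IndexError` with no hint of the cause. The following is a minimal sketch of one way to keep track of failures. The names `scrape_all` and `fake_fetch` are hypothetical and not part of `newspaper`; `fake_fetch` only simulates the download and parse step so that the example runs without network access.

```python
from typing import Callable


def scrape_all(urls: list[str], fetch: Callable[[str], dict]) -> tuple[list[dict], dict[str, str]]:
    """Fetch every URL and return (pages, failures) instead of silently skipping errors."""
    pages, failures = [], {}
    for url in urls:
        try:
            page = fetch(url)
        except Exception as exc:  # keep going, but remember why this URL failed
            failures[url] = repr(exc)
            continue
        if page.get("text"):  # ignore pages with no extracted text
            pages.append({"url": url, **page})
    return pages, failures


# Hypothetical stand-in for the download/parse step, used only for this demo.
def fake_fetch(url: str) -> dict:
    if "broken" in url:
        raise ConnectionError("simulated network failure")
    return {"title": "Demo page", "text": "" if "empty" in url else "Some body text"}


pages, failures = scrape_all(
    ["https://example.com/ok", "https://example.com/broken", "https://example.com/empty"],
    fake_fetch,
)
print(len(pages), list(failures))
```

In the notebook, `fetch` could wrap the `article.download()` and `article.parse()` calls and return the title and text.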


## Transforms the extracted text into a LlamaIndex document

In [8]:
documents = [Document(text=row['text'], metadata={"title": row['title'], "url": row['url']}) for row in pages_content]

## Data Cleaning
### Keyword filtering
Removes documents that contain none of the keywords in a given set. Matching is case-insensitive and substring-based.

In [9]:
def keyword_filter(documents, keywords):
    filtered_docs = []
    for doc in documents:
        if any(keyword.lower() in doc.text.lower() for keyword in keywords):
            filtered_docs.append(doc)
    return filtered_docs

ai_keywords = ["artificial intelligence", "machine learning", "neural networks", "deep learning", "LlamaIndex", "llm", "rag"]
filtered_documents = keyword_filter(documents, ai_keywords)
len(filtered_documents)

4
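Because the filter uses substring matching, short keywords such as `rag` also match inside unrelated words like "storage". The sketch below shows one possible stricter variant based on regular-expression word boundaries. The helper names `compile_keywords` and `matches_keywords` are illustrative assumptions, and the example works on plain strings rather than `Document` objects.

```python
import re


def compile_keywords(keywords):
    """Build one case-insensitive pattern matching any keyword as a whole word or phrase."""
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def matches_keywords(text, pattern):
    """Return True if at least one keyword occurs in the text as a whole word."""
    return pattern.search(text) is not None


pattern = compile_keywords(["llm", "rag", "machine learning"])
print(matches_keywords("Cloud storage pricing", pattern))    # "rag" inside "storage" is not a match
print(matches_keywords("Building a RAG pipeline", pattern))  # whole-word match, case-insensitive
```

The trade-off is that inflected forms no longer match: "LLMs" is not matched by the keyword `llm` unless the plural is listed as well.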

### Remove boilerplate content and clean the HTML

In [10]:
def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Extract text from remaining tags
    text = soup.get_text()
    
    # Remove extra whitespace
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    
    return text
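The text extracted earlier appears to be plain text already, so on this data the visible effect of `clean_html` comes mainly from its whitespace step. The following sketch isolates that step so it can be tried without Beautiful Soup. The function name `normalize_whitespace` and the sample string are made up for illustration.

```python
def normalize_whitespace(text: str) -> str:
    """Strip each line, split on double spaces, and drop empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Title#  \n\n\nFirst column  Second column\n   \nLast line"
print(normalize_whitespace(sample))
```

Blank lines disappear and phrases separated by two spaces end up on separate lines, which matches the compact output printed later in the notebook.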

In [11]:
cleaned_documents = [Document(text=clean_html(doc.text), metadata=doc.metadata) for doc in documents]

In [12]:
for doc in cleaned_documents:
    print(f"Title:{doc.metadata['title']}")

Title:Building an LLM Application
Title:Using LLMs
Title:Indexing & Embedding
Title:LlamaIndex


### Truncating long documents
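Truncates documents that exceed a given number of tokens. Here a token is simply a whitespace-separated word, not a token produced by a model tokenizer. Truncated text is rejoined with single spaces, so its original line breaks are lost.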

In [13]:
def truncate_document(doc, max_tokens=1000):
    tokens = doc.text.split()
    if len(tokens) > max_tokens:
        truncated_text = ' '.join(tokens[:max_tokens])
        return Document(text=truncated_text, metadata=doc.metadata)
    return doc

truncated_documents = [truncate_document(doc) for doc in cleaned_documents]
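`truncate_document` rejoins the kept words with single spaces, so a truncated document loses its line breaks. If the layout matters, one option is to cut the original string at the end of the last kept word. This is a sketch on plain strings; `truncate_preserving_layout` is a hypothetical helper, not part of LlamaIndex.

```python
import re


def truncate_preserving_layout(text: str, max_tokens: int = 1000) -> str:
    """Cut text after max_tokens whitespace-separated words, keeping the original whitespace."""
    end = 0  # index just past the last word that is kept
    for count, match in enumerate(re.finditer(r"\S+", text), start=1):
        if count == max_tokens:
            end = match.end()
        elif count > max_tokens:
            return text[:end]  # one word too many: cut at the recorded position
    return text  # max_tokens words or fewer: nothing to cut


sample = "one two\nthree four\nfive"
print(repr(truncate_preserving_layout(sample, max_tokens=3)))
```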

In [14]:
for doc in truncated_documents:
    print(f"Title:{doc.metadata['title']}")
    print(f'{doc.text}\n')

Title:Building an LLM Application
Building an LLM application#
Welcome to Understanding LlamaIndex. This is a series of short, bite-sized tutorials on every stage of building an agentic LLM application to get you acquainted with how to use LlamaIndex before diving into more advanced and subtle strategies. If you're an experienced programmer new to LlamaIndex, this is the place to start.
Key steps in building an agentic LLM application#
Tip You might want to read our high-level concepts if these terms are unfamiliar.
This tutorial has three main parts: Building a RAG pipeline, Building an agent, and Building Workflows, with some smaller sections before and after. Here's what to expect:
Ready to dive in? Head to using LLMs.

Title:Using LLMs
Using LLMs#
Tip For a list of our supported LLMs and a comparison of their functionality, check out our LLM module guide.
One of the first steps when building an LLM-based application is which LLM to use; they have different strengths and price point

## Create the RAG pipeline

In [15]:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.llms.gemini import Gemini

In [44]:
llm = OpenAI(model="gpt-4o-mini", temperature=0)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)

In [47]:
Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter
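To give an intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window sketch over a list of words. It is not how `SentenceSplitter` is implemented: the real splitter counts tokenizer tokens and tries to respect sentence boundaries. The function `sliding_chunks` is a hypothetical illustration only.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap items with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in sliding_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts with the last word of the previous one, so a sentence that straddles a boundary still appears with some context in the next chunk.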

In [48]:
from llama_index.core import VectorStoreIndex

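# Note: this indexes the raw `documents`; pass `truncated_documents` instead to index the cleaned text.
index = VectorStoreIndex.from_documents(documents)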
query_engine = index.as_query_engine()

_client._send_single_request():1025 2025-03-24 22:58:43,390
[INFO ] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"



In [49]:
res = query_engine.query("What is a query engine?")

print(res.response)

# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata['title'])
    print("URL\t", src.metadata['url'])
    print("Score\t", src.score)
    print("-_"*20)

_client._send_single_request():1025 2025-03-24 22:59:11,638
[INFO ] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"

_client._send_single_request():1025 2025-03-24 22:59:14,300
[INFO ] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



A query engine is a component that facilitates the process of querying by allowing users to interact with an index to retrieve relevant information. It enables the execution of prompts, which can range from simple questions to complex instructions, and is essential for synthesizing responses based on the retrieved data. The query engine can be created from an index and is responsible for managing the stages of querying, including retrieval, postprocessing, and response synthesis.
Node ID	 7a14df7e-8132-4322-8dd7-9b225de2968f
Title	 LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Score	 0.5302384018558375
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 31c6b250-6f78-4ab1-82b3-ae3ef0872934
Title	 LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Score	 0.4344452085265439
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
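Both retrieved nodes come from the same page, so a per-page summary can be easier to read than a per-node listing. The sketch below keeps the highest score for each URL. `best_score_per_url` is a hypothetical helper, and the sample pairs are copied from the output above.

```python
def best_score_per_url(sources):
    """sources: iterable of (url, score) pairs. Returns {url: highest score seen}."""
    best = {}
    for url, score in sources:
        if score is not None and score > best.get(url, float("-inf")):
            best[url] = score
    return best


retrieved = [
    ("https://docs.llamaindex.ai/en/stable/understanding/querying/querying/", 0.5302384018558375),
    ("https://docs.llamaindex.ai/en/stable/understanding/querying/querying/", 0.4344452085265439),
]
print(best_score_per_url(retrieved))
```

With the real response, the pairs could be built as `(src.metadata['url'], src.score) for src in res.source_nodes`.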


### Example using Gemini

In [29]:
llm = Gemini(model="models/gemini-2.0-flash")
embed_model = GeminiEmbedding(model="models/embedding-001")
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)
Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()



In [26]:
res = query_engine.query("What is a query engine?")

print(res.response)

# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata['title'])
    print("URL\t", src.metadata['url'])
    print("Score\t", src.score)
    print("-_"*20)

A QueryEngine is the basis of all querying. You can obtain a QueryEngine by having your index create one for you.

Node ID	 de14ac15-7fb4-4228-a7f9-fc9023cf203c
Title	 LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Score	 0.694896492999527
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 c265c99e-39e3-4f9e-85fe-9e73f2ddf8a7
Title	 LlamaIndex
URL	 https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Score	 0.6829592665283133
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_


# Another Example 