# Scrapegraph

## Introduction

ScrapeGraphAI is a Python library for web scraping. It uses large language models (LLMs) and direct graph logic to build scraping pipelines for websites, documents, and XML files. You specify the information you want to extract, and ScrapeGraphAI handles the rest, which makes web scraping accessible to users of all levels.

## Installation

To get started with ScrapeGraphAI, you can easily install it via pip. The reference page for ScrapeGraphAI is available on [PyPI](https://pypi.org/project/scrapegraphai/).

```sh
pip install scrapegraphai
```

Additionally, for JavaScript-based scraping, you will need to install Playwright:

```sh
playwright install
```

## Main scraping components 

### Nodes


Nodes are the fundamental building blocks of ScrapeGraphAI's scraping pipelines. Each node represents a specific task within the pipeline, such as fetching data from a URL, extracting content using a pattern, or processing the extracted data. Nodes can be configured to perform a variety of actions, making them highly versatile and adaptable to different scraping scenarios.
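
To make the idea concrete, here is a minimal sketch of a node as a unit that reads from and writes to a shared state dictionary. The classes `SketchNode` and `WordCountNode` are hypothetical and exist only for illustration; they are not ScrapeGraphAI's own classes, whose constructors and configuration differ.

```python
from abc import ABC, abstractmethod


class SketchNode(ABC):
    """Hypothetical stand-in for a pipeline node (not the library's BaseNode)."""

    def __init__(self, name: str, input_key: str, output_key: str):
        self.name = name
        self.input_key = input_key
        self.output_key = output_key

    @abstractmethod
    def execute(self, state: dict) -> dict:
        """Read from the state, perform one task, and return the updated state."""


class WordCountNode(SketchNode):
    """Toy task: count the words stored under the input key."""

    def execute(self, state: dict) -> dict:
        text = state[self.input_key]
        state[self.output_key] = len(text.split())
        return state


state = {"doc": "nodes pass data through a shared state"}
node = WordCountNode("count_words", input_key="doc", output_key="word_count")
state = node.execute(state)
print(state["word_count"])
```

Keeping each node to a single task and a single state update is what makes nodes easy to swap or reorder in a pipeline.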

#### Types of Nodes

Here are the types of nodes and their functionalities:

- **BaseNode Module**: Provides an abstract base class for nodes in a graph-based workflow, designed to perform specific actions when executed.

- **ConditionalNode Module**: Determines the next step in the graph's execution flow based on the presence and content of a specified key in the graph's state.

- **FetchNode Module**: Responsible for fetching the HTML content of a specified URL or loading various types of documents and updating the graph's state with this content.

- **GenerateAnswerCSVNode Module**: Generates an answer using a language model (LLM) based on the user's input and the content extracted from a webpage, and constructs a prompt for the LLM.

- **GenerateAnswerNode Module**: Similar to GenerateAnswerCSVNode, it generates an answer using a large language model (LLM) based on the user's input and the content extracted from a webpage.

- **GenerateAnswerOmniNode Module**: Generates an answer using a large language model (LLM) based on the user's input and the content extracted from a webpage, similar to GenerateAnswerNode.

- **GenerateAnswerPDFNode Module**: Generates an answer using a language model (LLM) based on the user's input and the content extracted from a webpage, similar to GenerateAnswerNode.

- **GenerateScraperNode Module**: Generates a Python script for scraping a website using the specified library, based on the user's prompt and the scraped content.

- **GetProbableTagsNode Module**: Utilizes a language model to identify probable HTML tags within a document that are likely to contain information relevant to a user's query.

- **ImageToTextNode Module**: Retrieves images from a list of URLs and returns a description of the images using an image-to-text model.

- **MergeAnswersNode Module**: Merges the answers from multiple graph instances into a single answer.

- **ParseNode Module**: Parses HTML content from a document and splits it into chunks for further processing.

- **RAGNode Module**: Compresses input tokens and stores the document in a vector database for retrieval, storing relevant chunks in the state.

- **RobotsNode Module**: Checks if a website is scrapeable based on the robots.txt file, using a language model to determine if scraping is allowed.

- **SearchInternetNode Module**: Generates a search query based on the user's input, searches the internet for relevant information, and updates the state with the generated answer.

- **SearchLinkNode Module**: Filters out relevant links in the webpage content based on the user prompt, ideal to use after the FetchNode.

- **TextToSpeechNode Module**: Converts text to speech using the specified text-to-speech model.
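
As an illustration of the kind of check a node such as `RobotsNode` performs, the sketch below tests a URL against robots.txt rules. Note the substitution: according to the list above, `RobotsNode` uses a language model to decide, whereas this sketch uses Python's standard `urllib.robotparser` instead. The helper `is_scrapable` and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def is_scrapable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow fetching the URL."""
    parser = RobotFileParser()
    # Parse the rules from a string, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_scrapable(ROBOTS_TXT, "https://example.com/index.html"))
print(is_scrapable(ROBOTS_TXT, "https://example.com/private/data.html"))
```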

### Graphs

Graphs in ScrapeGraphAI represent the overall structure and flow of a scraping pipeline. They consist of interconnected nodes, each performing a distinct task and passing data to the next node in the sequence. The graph-based approach allows for flexible and modular design, enabling users to easily modify and extend their scraping pipelines.

#### Components of a Graph

1. **Nodes**: As described earlier, nodes are the individual tasks within the graph. Each node is defined by its type and specific configuration settings.
   
2. **Edges**: Edges define the connections between nodes, determining the order of execution and the flow of data through the pipeline. Edges ensure that data is passed correctly from one node to the next, maintaining the integrity of the pipeline.
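
The relationship between nodes and edges can be sketched with plain functions and a dictionary. This is a toy executor under simplifying assumptions (one outgoing edge per node, a hard-coded page instead of a real fetch); the names `NODES`, `EDGES`, and `run_graph` are hypothetical and do not reflect how ScrapeGraphAI executes graphs internally.

```python
import re
from typing import Callable, Dict

State = Dict[str, str]


def fetch(state: State) -> State:
    # Stand-in for a real fetch: the "page" is hard-coded.
    state["doc"] = "<p>Hello graph</p>"
    return state


def parse(state: State) -> State:
    # Strip HTML tags with a simple regex (sufficient for this toy input).
    state["text"] = re.sub(r"<[^>]+>", "", state["doc"])
    return state


def answer(state: State) -> State:
    # Stand-in for an LLM call: just upper-case the parsed text.
    state["answer"] = state["text"].upper()
    return state


NODES: Dict[str, Callable[[State], State]] = {
    "fetch": fetch,
    "parse": parse,
    "answer": answer,
}
# Each edge maps a node to the node that runs after it.
EDGES: Dict[str, str] = {"fetch": "parse", "parse": "answer"}


def run_graph(entry_point: str, state: State) -> State:
    """Run nodes in edge order, passing the state from one node to the next."""
    current = entry_point
    while current is not None:
        state = NODES[current](state)
        current = EDGES.get(current)
    return state


final_state = run_graph("fetch", {})
print(final_state["answer"])
```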

#### Types of Graphs

Here are the types of graphs and their functionalities:


- **AbstractGraph Module**: Provides a scaffolding class for creating a graph representation and executing it.

- **BaseGraph Module**: Provides a class for managing and executing a graph composed of interconnected nodes.

- **CSVScraperGraph Module**: Defines a class for creating and executing a graph that automates the process of extracting information from web pages using a natural language model.

- **JSONScraperGraph Module**: Defines a class for creating and executing a graph that automates the process of extracting information from JSON files using a natural language model.

- **OmniScraperGraph Module**: Defines a class for creating and executing a graph that automates the process of extracting information from web pages using a natural language model.

- **OmniSearchGraph Module**: Defines a class for creating and executing a graph that searches the internet for answers to a given prompt, combining web scraping and internet searching.

- **PDFScraperGraph Module**: Defines a class for creating and executing a graph that extracts information from PDF files using a natural language model to interpret and answer prompts.

- **ScriptCreatorGraph Module**: Defines a class for creating and executing a graph that generates web scraping scripts.

- **SearchGraph Module**: Defines a class for creating and executing a graph that searches the internet for answers to a given prompt.

- **SmartScraperGraph Module**: Defines a class for creating and executing a graph that automates the process of extracting information from web pages using a natural language model to interpret and answer prompts.

- **SpeechGraph Module**: Defines a class for creating and executing a graph that scrapes the web, provides an answer to a given prompt, and generates an audio file.

- **XMLScraperGraph Module**: Defines a class for creating and executing a graph that extracts information from XML files using a natural language model to interpret and answer prompts.

#### Designing a Scraping Graph

When designing a scraping graph, consider the following best practices:

1. **Define Clear Objectives**: Start by clearly defining the objectives of your scraping task. Identify the data you need to extract and the sources from which it will be fetched.
   
2. **Modular Approach**: Break down the scraping task into smaller, manageable nodes. This modular approach allows for easy debugging, maintenance, and scalability.

3. **Optimize Data Flow**: Arrange nodes in a logical sequence to optimize the flow of data. Ensure that each node performs its task efficiently and passes the data correctly to the next node.

**Example Graph Configuration:**

```json
{
    "text": [
        {
            "nodes": [
                {
                    "node_name": "SearchInternetNode",
                    "node_type": "node"
                },
                {
                    "node_name": "FetchNode",
                    "node_type": "node"
                },
                {
                    "node_name": "RAGNode",
                    "node_type": "node"
                },
                {
                    "node_name": "ParseNode",
                    "node_type": "node"
                }
            ],
            "edges": [
                {
                    "from": "SearchInternetNode",
                    "to": [
                        "FetchNode"
                    ]
                },
                {
                    "from": "FetchNode",
                    "to": [
                        "RAGNode"
                    ]
                },
                {
                    "from": "RAGNode",
                    "to": [
                        "ParseNode"
                    ]
                }
            ],
            "entry_point": "SearchInternetNode"
        }
    ]
}
```

In this example, the graph consists of four nodes connected by edges in a sequence:

**Nodes**

1. **SearchInternetNode**: This node is responsible for generating a search query based on the user's input and searching the internet for relevant information. It acts as the starting point of the graph, initiating the process by generating the necessary search queries.

2. **FetchNode**: After the search query is generated, the FetchNode retrieves the HTML content of the specified URLs found from the internet search. This node fetches the actual web content that will be processed further.

3. **RAGNode**: Following the fetch operation, the RAGNode compresses the input tokens and stores the document in a vector database for retrieval. It ensures that relevant chunks of data are stored in the state for efficient processing and retrieval.

4. **ParseNode**: Finally, the ParseNode parses the HTML content from the document and splits it into chunks for further processing. This node extracts and organizes the relevant data into manageable pieces for the next steps in the pipeline.

**Edges**

The edges define the flow of data between the nodes:

- The graph starts at the **SearchInternetNode**.
- The output of the **SearchInternetNode** is passed to the **FetchNode**.
- The **FetchNode** then sends its fetched content to the **RAGNode**.
- Finally, the **RAGNode** processes the data and passes it to the **ParseNode**.

**Entry Point**

The entry point of this graph is the **SearchInternetNode**, indicating that the graph's execution begins with generating the search query and proceeds through the sequence of nodes to parse the fetched and processed data.
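
The walk from the entry point through the edges can be reproduced with a few lines of Python. This is only a sketch: it uses the inner graph object from the configuration above (without the outer `"text"` wrapper), assumes a linear pipeline by following the first target of each edge, and the function name `execution_order` is hypothetical.

```python
import json
from typing import List

GRAPH_JSON = """
{
  "nodes": [
    {"node_name": "SearchInternetNode", "node_type": "node"},
    {"node_name": "FetchNode", "node_type": "node"},
    {"node_name": "RAGNode", "node_type": "node"},
    {"node_name": "ParseNode", "node_type": "node"}
  ],
  "edges": [
    {"from": "SearchInternetNode", "to": ["FetchNode"]},
    {"from": "FetchNode", "to": ["RAGNode"]},
    {"from": "RAGNode", "to": ["ParseNode"]}
  ],
  "entry_point": "SearchInternetNode"
}
"""


def execution_order(graph: dict) -> List[str]:
    """Follow the first outgoing edge of each node, starting at the entry point."""
    next_nodes = {edge["from"]: edge["to"] for edge in graph["edges"]}
    order: List[str] = []
    current = graph["entry_point"]
    # Stop at a node with no outgoing edge, or if a cycle would repeat a node.
    while current is not None and current not in order:
        order.append(current)
        targets = next_nodes.get(current, [])
        current = targets[0] if targets else None
    return order


graph = json.loads(GRAPH_JSON)
print(execution_order(graph))
```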

## Examples from Scrapegraph 

The examples below are taken directly from the ScrapeGraphAI GitHub repository:

https://github.com/VinciGit00/Scrapegraph-ai/tree/ff4ccb94a125193efcf6a3c71781faf50d0464c3/examples

### Create the environment

```python
%%capture
! pip install scrapegraphai --upgrade
! apt install chromium-chromedriver
! pip install nest_asyncio
! pip install playwright
! playwright install
```

```python
import nest_asyncio
nest_asyncio.apply()
```

```python
import os
from dotenv import load_dotenv

# Load the OpenAI API key from the environment (or a .env file)
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")
```
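
Because `os.getenv` returns `None` when a variable is missing, a misconfigured key would only surface later as an API error. One possible safeguard is to fail early, as in the sketch below. The helper `require_env` and the variable `EXAMPLE_API_KEY` are hypothetical and are not part of ScrapeGraphAI or the original examples.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demonstration with a dummy variable so the sketch runs without a real key.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook, the same helper could be called with `"OPENAI_API_KEY"` after `load_dotenv()`.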

### Creating a custom graph

This example currently raises an error in `RobotsNode`. The issue has been reported to the ScrapeGraphAI contributors on Discord.

```python
"""
Example of custom graph using existing nodes
"""

from langchain_openai import OpenAIEmbeddings
from scrapegraphai.models import OpenAI
from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, RAGNode, GenerateAnswerNode, RobotsNode

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "openai_api_key": openai_key,
        "model_name": "gpt-3.5-turbo",
    },
    "verbose": True,
}

# ************************************************
# Define the graph nodes
# ************************************************

llm_model = OpenAI(graph_config["llm"])
embedder = OpenAIEmbeddings(api_key=llm_model.openai_api_key)

# define the nodes for the graph
robot_node = RobotsNode(
    input="url",
    output=["is_scrapable"],
    node_config={
        "llm_model": llm_model,
        "force_scraping": True,
        "verbose": True,
    }
)

fetch_node = FetchNode(
    input="url | local_dir",
    output=["doc", "link_urls", "img_urls"],
    node_config={
        "verbose": True,
        "headless": True,
    }
)
parse_node = ParseNode(
    input="doc",
    output=["parsed_doc"],
    node_config={
        "chunk_size": 4096,
        "verbose": True,
    }
)
rag_node = RAGNode(
    input="user_prompt & (parsed_doc | doc)",
    output=["relevant_chunks"],
    node_config={
        "llm_model": llm_model,
        "embedder_model": embedder,
        "verbose": True,
    }
)
generate_answer_node = GenerateAnswerNode(
    input="user_prompt & (relevant_chunks | parsed_doc | doc)",
    output=["answer"],
    node_config={
        "llm_model": llm_model,
        "verbose": True,
    }
)

# ************************************************
# Create the graph by defining the connections
# ************************************************

graph = BaseGraph(
    nodes=[
        robot_node,
        fetch_node,
        parse_node,
        rag_node,
        generate_answer_node,
    ],
    edges=[
        (robot_node, fetch_node),
        (fetch_node, parse_node),
        (parse_node, rag_node),
        (rag_node, generate_answer_node)
    ],
    entry_point=robot_node
)

# ************************************************
# Execute the graph
# ************************************************

result, execution_info = graph.execute({
    "user_prompt": "Describe the content",
    "url": "https://example.com/"
})

# get the answer from the result
result = result.get("answer", "No answer found.")
print(result)
```


### Navigate the links of a URL

DeepScraper is a scraping pipeline that automates the process of extracting information from web pages using a natural language model to interpret and answer prompts.

Unlike SmartScraper, DeepScraper can navigate to the links within the input webpage to fulfil the task in the prompt.

*This graph is still a work in progress and may produce errors during a run.*

```python
import os
from scrapegraphai.graphs import DeepScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4",
    },
    "verbose": True,
    "max_depth": 1
}

# ************************************************
# Create the DeepScraperGraph instance and run it
# ************************************************

deep_scraper_graph = DeepScraperGraph(
    prompt="List me all the job titles and detailed job description.",
    # also accepts a string with the already downloaded HTML code
    source="https://www.google.com/about/careers/applications/jobs/results/?location=Bangalore%20India",
    config=graph_config
)

result = deep_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = deep_scraper_graph.get_execution_info()
print(f"Relevant_links: {deep_scraper_graph.get_state('relevant_links')}")
print(prettify_exec_info(graph_exec_info))
```
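
The idea of following links up to a configured depth can be sketched with a breadth-first walk over an in-memory link map. This is not how `DeepScraperGraph` is implemented: the link map, the function `crawl`, and the reading of `max_depth` as the number of link hops from the start page are all assumptions made for the example.

```python
from collections import deque
from typing import Dict, List

# Hypothetical site structure: each page maps to the links found on it.
LINKS: Dict[str, List[str]] = {
    "/jobs": ["/jobs/1", "/jobs/2"],
    "/jobs/1": ["/jobs/1/apply"],
    "/jobs/2": [],
    "/jobs/1/apply": [],
}


def crawl(start: str, links: Dict[str, List[str]], max_depth: int) -> List[str]:
    """Return pages reachable from start within max_depth link hops."""
    visited = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            # Do not follow links found on pages at the depth limit.
            continue
        for target in links.get(page, []):
            if target not in visited:
                visited.append(target)
                queue.append((target, depth + 1))
    return visited


print(crawl("/jobs", LINKS, max_depth=1))
```

Under this assumed reading, raising the depth limit to 2 would also reach `/jobs/1/apply`.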

### Scraping CSV files

The `CSVScraperGraph` is distinct from other graph types in that it is specifically designed to scrape and process data from CSV files or directories containing CSV files. This smart scraper uses a natural language model to interpret prompts and extract relevant information from the CSV data. The `source` parameter can be a single CSV file path or a directory containing multiple CSV files, thus allowing flexible and comprehensive data extraction from CSV formats.

```python
import os
import pandas as pd
from scrapegraphai.graphs import CSVScraperGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info

# ************************************************
# Read the CSV file
# ************************************************

FILE_NAME = "inputs/username.csv"
curr_dir = os.path.dirname(os.path.realpath("__file__"))
file_path = os.path.join(curr_dir, FILE_NAME)

text = pd.read_csv(file_path)

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}

# ************************************************
# Create the CSVScraperGraph instance and run it
# ************************************************

csv_scraper_graph = CSVScraperGraph(
    prompt="List me all the last names",
    source=str(text),  # Pass the content of the file, not the file object
    config=graph_config
)

result = csv_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = csv_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

# Save to json or csv
convert_to_csv(result, "result")
convert_to_json(result, "result", os.getcwd())
```


### Scraping JSON files

The `JSONScraperGraph` is specialized for extracting data from JSON files, differentiating it from other graph types by focusing on structured JSON data rather than web pages or CSV files. The `source` parameter can be either a path to a single JSON file or a directory containing multiple JSON files. This flexibility allows the graph to process and scrape data from various JSON sources effectively.

```python
"""
Basic example of scraping pipeline using JSONScraperGraph from JSON documents
"""

import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import JSONScraperGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info

# ************************************************
# Read the JSON file
# ************************************************

FILE_NAME = "inputs/example.json"
curr_dir = os.path.dirname(os.path.realpath("__file__"))
file_path = os.path.join(curr_dir, FILE_NAME)

with open(file_path, 'r', encoding="utf-8") as file:
    text = file.read()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}

# ************************************************
# Create the JSONScraperGraph instance and run it
# ************************************************

json_scraper_graph = JSONScraperGraph(
    prompt="List me all the channel titles, title and descriptions of the youtube videos",
    source=text,  # Pass the content of the file, not the file object
    config=graph_config
)

result = json_scraper_graph.run()
# Print the JSON data in a pretty format
pretty_json = json.dumps(result, indent=4)
print(pretty_json)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = json_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

# Save to json or csv
convert_to_csv(result, "result")
convert_to_json(result, "result")
```



### Scraping XML files

The `XMLScraperGraph` extracts information from XML files using a natural language model to interpret and answer prompts.

```python
"""
Basic example of scraping pipeline using XMLScraperGraph from XML documents
"""

import os
import json  # needed for json.dumps below
from scrapegraphai.graphs import XMLScraperGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info

# ************************************************
# Read the XML file
# ************************************************

FILE_NAME = "inputs/books.xml"
curr_dir = os.path.dirname(os.path.realpath("__file__"))
file_path = os.path.join(curr_dir, FILE_NAME)

with open(file_path, 'r', encoding="utf-8") as file:
    text = file.read()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "verbose": False,
}

# ************************************************
# Create the XMLScraperGraph instance and run it
# ************************************************

xml_scraper_graph = XMLScraperGraph(
    prompt="List me all the authors, title and genres of the books",
    source=text,  # Pass the content of the file, not the file object
    config=graph_config
)

result = xml_scraper_graph.run()
# Print the JSON data in a pretty format
pretty_json = json.dumps(result, indent=4)
print(pretty_json)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = xml_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

# Save to json or csv
convert_to_csv(result, "result")
convert_to_json(result, "result")
```



Output (truncated):

```json
{
    "books": [
        {
            "author": "Gambardella, Matthew",
            "title": "XML Developer's Guide",
            "genre": "Computer"
        },
        {
            "author": "Ralls, Kim",
            "title": "Midnight Rain",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "Maeve Ascendant",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "Oberon's Legacy",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "The Sundered Grail",
            "genre": "Fantasy"
        },
        {
            "author": "Randall, Cynthia",
            "title": "Lover Birds",
            "genre": "Romance"
        },
        {
            "author": "Thurman, Paula",
            "title": "Splish Splash",
            "genre": "Romance"
        },
        {
            "author": "Knorr, Stefan",
            "
```

### Scraping URLs

The `OmniScraperGraph` differs from other graph types in its ability to extract and integrate diverse data types, including text and images, from web pages or local directories, using a natural language model to process and respond to prompts. The `source` parameter can be either a URL starting with "http" for web pages or a local directory path, so the graph can handle both kinds of input.

```python
"""
Basic example of scraping pipeline using OmniScraper
"""

import os, json
from scrapegraphai.graphs import OmniScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
    "verbose": True,
    "headless": True,
    "max_images": 5
}

# ************************************************
# Create the OmniScraperGraph instance and run it
# ************************************************

omni_scraper_graph = OmniScraperGraph(
    prompt="List me all the projects with their titles and image links and descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = omni_scraper_graph.run()
print(json.dumps(result, indent=2))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = omni_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```


### Scraping PDF documents

The `PDFScraperGraph` is distinct from other graph types as it is specifically designed to extract information from PDF files using a natural language model to interpret and answer prompts. The `source` parameter can be a path to a single PDF file or a directory containing multiple PDF files. This flexibility allows it to handle various PDF inputs effectively.

```python
import os, json
import PyPDF4
from scrapegraphai.graphs import PDFScraperGraph


# ************************************************
# Read the PDF file
# ************************************************
FILE_NAME = "inputs/dante-01-inferno.pdf"
curr_dir = os.path.dirname(os.path.realpath("__file__"))
file_path = os.path.join(curr_dir, FILE_NAME)

# Open the PDF file
with open(file_path, "rb") as file:
    pdf_reader = PyPDF4.PdfFileReader(file)
    text = ""

    # Extract text from each page
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        text += page.extractText()

# ************************************************
# Define the configuration for the graph
# ************************************************
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo"
    },
    "verbose": True
}

# Only the first 10,000 characters of the extracted text are passed as the source
pdf_scraper_graph = PDFScraperGraph(
    prompt="Summarize the text and find the main topics",
    source=text[:10_000],
    config=graph_config,
)
result = pdf_scraper_graph.run()

print(json.dumps(result, indent=4))
```


### Scraping text files

The `SmartScraperGraph` is distinct from other graph types as it is designed to automate the extraction of information from web pages using a natural language model to interpret and answer prompts. This graph stands out for its ability to handle both web and local directory sources, as indicated by the `source` parameter. The `source` parameter can either be a URL for online content or a local directory path, allowing it to flexibly manage various data inputs.

```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
import json  # needed for json.dumps below
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Read the text file
# ************************************************

FILE_NAME = "inputs/plain_html_example.txt"
curr_dir = os.path.dirname(os.path.realpath("__file__"))
file_path = os.path.join(curr_dir, FILE_NAME)

# The text could also come from an HTTP request instead of a local file
with open(file_path, 'r', encoding="utf-8") as file:
    text = file.read()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    source=text,
    config=graph_config
)

result = smart_scraper_graph.run()
# Print the JSON data in a pretty format
pretty_json = json.dumps(result, indent=4)
print(pretty_json)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

### SmartScraperGraph vs. OmniScraperGraph

The `SmartScraperGraph` and `OmniScraperGraph` are both designed to automate information extraction using natural language models, but they have different capabilities and specific use cases. Here are the key differences:

**SmartScraperGraph**:
1. **Primary Focus**: The `SmartScraperGraph` is primarily designed for extracting textual information from web pages or local directories.
2. **Nodes and Workflow**:
   - Uses `FetchNode` to retrieve documents and URLs.
   - Uses `ParseNode` to process and parse the document content.
   - Uses `RAGNode` to extract relevant chunks of information.
   - Uses `GenerateAnswerNode` to generate final answers from the parsed document and relevant chunks.
3. **Image Handling**: While it can fetch image URLs, it does not explicitly process image content or convert images to text.

**OmniScraperGraph**:
1. **Enhanced Capabilities**: The `OmniScraperGraph` has a broader range of capabilities, including handling both textual and visual data. It is designed to integrate and process images as well as text from various sources.
2. **Nodes and Workflow**:
   - Uses `FetchNode` to retrieve documents, links, and image URLs.
   - Uses `ParseNode` to process and parse the document content.
   - Uses `ImageToTextNode` to convert image URLs to textual descriptions.
   - Uses `RAGNode` to extract relevant chunks of information.
   - Uses `GenerateAnswerOmniNode` to generate final answers that integrate both textual content and image descriptions.
3. **Image Handling**: Explicitly includes `ImageToTextNode` for processing image URLs and converting them to text, which is then used in generating comprehensive answers.
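
The difference between the two workflows can be made explicit by comparing the node sequences listed above. The sketch below only manipulates the node names as strings; the variable names are illustrative and nothing here calls into ScrapeGraphAI.

```python
# Node sequences as described in the comparison above.
SMART_NODES = ["FetchNode", "ParseNode", "RAGNode", "GenerateAnswerNode"]
OMNI_NODES = [
    "FetchNode",
    "ParseNode",
    "ImageToTextNode",
    "RAGNode",
    "GenerateAnswerOmniNode",
]

# Keep list order so the result reads in pipeline order.
shared = [name for name in SMART_NODES if name in OMNI_NODES]
only_in_smart = [name for name in SMART_NODES if name not in OMNI_NODES]
only_in_omni = [name for name in OMNI_NODES if name not in SMART_NODES]

print("shared:", shared)
print("only in SmartScraperGraph:", only_in_smart)
print("only in OmniScraperGraph:", only_in_omni)
```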

### Script creator graph

The `ScriptCreatorGraph` defines a scraping pipeline that automates the generation of web scraping scripts using a natural language model. It takes a prompt and source (URL or local directory) and processes the content through nodes that fetch and parse the document. The `library` parameter specifies the web scraping library to be used. This parameter ensures the generated script is compatible with the desired web scraping framework. The `GenerateScraperNode` then creates a script tailored to the input prompt, making it a versatile tool for generating web scraping scripts with different libraries.

```python
"""
Basic example of scraping pipeline using ScriptCreatorGraph
"""

from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "library": "beautifulsoup"
}

# ************************************************
# Create the ScriptCreatorGraph instance and run it
# ************************************************

script_creator_graph = ScriptCreatorGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = script_creator_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = script_creator_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```


### Script generator schema graph

This variant also uses `ScriptCreatorGraph` to generate a script, but it is given a predefined Pydantic schema so that the output follows that structure.

```python
"""
Basic example of scraping pipeline using ScriptCreatorGraph
"""

import os
from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info

from pydantic import BaseModel, Field
from typing import List

# ************************************************
# Define the schema for the graph
# ************************************************

class AttractionSchema(BaseModel):
    name: str
    description: str
    image_url: str


# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
    "library": "beautifulsoup",
    "verbose": True,
}


# ************************************************
# Create the ScriptCreatorGraph instance and run it
# ************************************************

script_creator_graph = ScriptCreatorGraph(
    prompt="List me the all attractions in Chioggia.",
    source="https://en.wikipedia.org/wiki/Chioggia",
    config=graph_config,
    schema=AttractionSchema
)
result = script_creator_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = script_creator_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

### Search graph with schema

The `SearchGraph` module searches the internet and extracts information relevant to a user prompt. It uses the language model to turn the prompt into a search query, so it can retrieve up-to-date information. Each resulting URL is then scraped with a `SmartScraperGraph`, and the information from the individual sources is combined into a single, refined answer. The module can be customized through configuration parameters (such as `max_results`) and an output schema, making it adaptable to different use cases and requirements.
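
The idea of combining several per-source results into one answer can be sketched in plain Python. In the library this step is driven by the language model; the hypothetical `merge_by_name` helper below swaps in a simple rule instead (group by name, join the descriptions), and the sample data is made up.

```python
from typing import Dict, List

Dish = Dict[str, str]


def merge_by_name(per_source_results: List[List[Dish]]) -> List[Dish]:
    """Merge dish lists from several sources, grouping entries by name."""
    merged: Dict[str, Dict[str, object]] = {}
    for dishes in per_source_results:
        for dish in dishes:
            # Case-insensitive key so "Bigoli" and "bigoli" end up together.
            key = dish["name"].strip().lower()
            if key not in merged:
                merged[key] = {"name": dish["name"].strip(), "descriptions": []}
            description = dish["description"].strip()
            if description not in merged[key]["descriptions"]:
                merged[key]["descriptions"].append(description)
    return [
        {"name": entry["name"], "description": ", ".join(entry["descriptions"])}
        for entry in merged.values()
    ]


# Made-up results, as if scraped from two different pages.
source_a = [{"name": "Bigoli in salsa", "description": "Pasta with onion and anchovy sauce"}]
source_b = [
    {"name": "bigoli in salsa", "description": "Thick spaghetti in a savoury sauce"},
    {"name": "Seafood risotto", "description": "Risotto with seafood"},
]

combined = {"dishes": merge_by_name([source_a, source_b])}
print(combined)
```

A rule-based merge like this cannot rephrase or reconcile overlapping descriptions, which is the part a language model is better suited to.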

In [25]:
"""
Example of Search Graph
"""

import json

from scrapegraphai.graphs import SearchGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info

from pydantic import BaseModel, Field
from typing import List

# ************************************************
# Define the output schema for the graph
# ************************************************

class Dish(BaseModel):
    name: str = Field(description="The name of the dish")
    description: str = Field(description="The description of the dish")

class Dishes(BaseModel):
    dishes: List[Dish]

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "max_results": 4,
    "verbose": True,
}

# ************************************************
# Create the SearchGraph instance and run it
# ************************************************

search_graph = SearchGraph(
    prompt="List me Chioggia's famous dishes",
    config=graph_config,
    schema=Dishes
)

result = search_graph.run()
# Print the JSON data in a pretty format
pretty_json = json.dumps(result, indent=4)
print(pretty_json)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = search_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

# Save to json and csv
convert_to_csv(result, "result")
convert_to_json(result, "result")

--- Executing SearchInternet Node ---
Search Query: Chioggia famous dishes
--- Executing GraphIterator Node with batchsize 16 ---
processing graph instances:   0%|          | 0/4 [00:00<?, ?it/s]--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.visitchioggia.com/en/taste/chioggian-cuisine/) ---
--- (Fetching HTML from: https://www.visitchioggia.com/en/taste/) ---
--- (Fetching HTML from: https://www.tasteatlas.com/chioggia) ---
--- (Fetching HTML from: https://www.sottomarina.net/gastronomia_uk.htm) ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
--- Executing Parse Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (updated chunks metadata) ---
--- Executing Parse Node ---
--- E

{
    "dishes": [
        {
            "name": "Sardines in Sa\u00f2re",
            "description": "Sardines fried and left in a 'carpione' marinade using Chioggian white onions, A traditional dish of Chioggia made with sardines"
        },
        {
            "name": "Bigoli in salsa",
            "description": "Delicious bigoli pasta in sauce, A pasta dish with a sauce made from onions, anchovies, and sardines, spaghetti with a sauce of garlic, olive oil, onion, parsley and anchovy fillets."
        },
        {
            "name": "Stewed cuttlefish",
            "description": "Cuttlefish cooked in ink or stewed, Cuttlefish cooked in a stew with tomatoes and other ingredients"
        },
        {
            "name": "Moeche frite",
            "description": "Crispy molting crabs, Soft-shell crabs fried in batter, crabs fried in abundant oil"
        },
        {
            "name": "Seafood risotto",
            "description": "Risotto with seafood ingredients, A risotto dis

### Smartscraper graph

The `SmartScraperGraph` module automates the extraction of information from web pages, using a large language model to interpret the page content and respond to a prompt.

It works on a single source that you provide rather than searching the internet, and it is the graph that `SearchGraph` relies on to scrape each URL it finds. The module can be customized through configuration parameters and an optional schema, which makes it adaptable to various use cases and requirements.

The `SmartScraperGraph` initializes with a prompt, a source (URL or local directory), a configuration dictionary, and an optional schema for the output format. It uses nodes such as `FetchNode` to retrieve documents, `ParseNode` to process and parse the document, `RAGNode` to extract relevant information, and `GenerateAnswerNode` to generate the final answer formatted according to the optional schema. The workflow of these nodes ensures a seamless data flow from fetching to generating the final answer. 
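
The data flow described above, where each node reads from and writes to a shared state, can be sketched as follows. This is not the library's implementation: the four functions are simplified, hypothetical stand-ins for `FetchNode`, `ParseNode`, `RAGNode`, and `GenerateAnswerNode`, and the HTML is hard-coded so the sketch runs offline.

```python
import re
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def fetch(state: State) -> State:
    # Stand-in for FetchNode: a real node would download state["source"].
    state["document"] = "<html><body><h1>Demo</h1><p>First project</p></body></html>"
    return state


def parse(state: State) -> State:
    # Stand-in for ParseNode: strip the tags and split the text into words.
    text = re.sub(r"<[^>]+>", " ", state["document"])
    state["parsed"] = text.split()
    return state


def select_relevant(state: State) -> State:
    # Stand-in for RAGNode: keep only words that also occur in the prompt.
    wanted = {word.lower() for word in state["prompt"].split()}
    state["relevant"] = [w for w in state["parsed"] if w.lower() in wanted]
    return state


def generate_answer(state: State) -> State:
    # Stand-in for GenerateAnswerNode: package the relevant pieces as the answer.
    state["answer"] = {"matches": state["relevant"]}
    return state


def run_pipeline(nodes: List[Callable[[State], State]], state: State) -> State:
    # Run the nodes in order; each one adds its output to the shared state.
    for node in nodes:
        state = node(state)
    return state


final_state = run_pipeline(
    [fetch, parse, select_relevant, generate_answer],
    {"prompt": "first project", "source": "https://example.com"},
)
print(final_state["answer"])
```

Keeping every intermediate result in the state makes it easy to inspect what each step produced when an answer looks wrong.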

In [26]:
""" 
Basic example of scraping pipeline using SmartScraper
"""

import json
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))


--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|██████████| 1/1 [00:03<00:00,  3.17s/it]

{
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms"
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple and double pendulum"
        },
        {
            "title": "Multi Agents HAED",
            "description": "University project which focuses on simulating a multi-agent system to perform environment mapping. Agents, equipped with sensors, explore and record their surroundings, considering uncertainties in their readings."
        },
        {
            "title": "Wireless ESC for Modular Drones",
            "description": "Modular drone architecture proposal and proof of concept. The project received maximum grade."
        }
    ]
}
        node_name  total_tokens  prompt_tokens  completion_tokens  \
0           Fetch             0   




### Generate audio from the scraped data

The `SpeechGraph` module automates the process of web scraping, generating answers to user prompts, and converting those answers into audio files. This pipeline integrates several nodes to create a comprehensive workflow, allowing it to fetch, parse, interpret, and transform web content into spoken word.

Initially, the `SpeechGraph` is set up with a prompt, a source (URL or local directory), a configuration dictionary, and an optional schema for the output format. The workflow involves multiple nodes: the `FetchNode` retrieves the document and associated links and images from the source, the `ParseNode` processes the fetched document into a parsed version, and the `RAGNode` extracts relevant chunks of information from the parsed document. The `GenerateAnswerNode` then creates a final answer based on the user prompt and relevant information. Lastly, the `TextToSpeechNode` converts the generated answer into an audio file using a text-to-speech model.

In [None]:
""" 
Basic example of scraping pipeline using SpeechGraph
"""

import os
from scrapegraphai.graphs import SpeechGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define audio output path
# ************************************************

FILE_NAME = "website_summary.mp3"
# Notebooks have no __file__, so save the audio in the current working directory
curr_dir = os.getcwd()
output_path = os.path.join(curr_dir, FILE_NAME)

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
        "temperature": 0.7,
    },
    "tts_model": {
        "api_key": openai_key,
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": output_path,
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = speech_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))


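### Generate a graph from a prompt

`GraphBuilder` takes a natural-language prompt and returns a description of a graph, listing its nodes and the edges between them.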

In [38]:
import json

from scrapegraphai.builders import GraphBuilder

# Define the graph configuration
config = {
  "llm": {
      "api_key": openai_key,
      "model": "gpt-3.5-turbo",
      "temperature": 0.7,
  }
}

# Create the builder and generate the graph description
graph_builder = GraphBuilder(
  user_prompt="Create a graph that extracts dynamic content from a given web site",
  config=config,
)
result = graph_builder.build_graph()
# Print the JSON data in a pretty format
pretty_json = json.dumps(result, indent=4)
print(pretty_json)

{
    "input": "Create a graph that extracts dynamic content from a given web site",
    "text": [
        {
            "nodes": [
                {
                    "node_name": "SearchInternetNode",
                    "node_type": "node"
                },
                {
                    "node_name": "FetchNode",
                    "node_type": "node"
                },
                {
                    "node_name": "RAGNode",
                    "node_type": "node"
                },
                {
                    "node_name": "ParseNode",
                    "node_type": "node"
                }
            ],
            "edges": [
                {
                    "from": "SearchInternetNode",
                    "to": [
                        "FetchNode"
                    ]
                },
                {
                    "from": "FetchNode",
                    "to": [
                        "RAGNode"
                    ]
                