# Scraping Tips and Tricks

## Manage your Webdrivers with `webdriver-manager`

When using `Selenium` with Python, you probably did the following:

- Download the ChromeDriver binary
- Unzip it
- Set the path to the driver

This is annoying:

- The path to the driver can change
- You have to manage the browser drivers for each OS
- You have to check whether new driver versions have been released

Instead of doing this manually, use `webdriver-manager`.

It makes managing binaries for different browsers easy.

`webdriver-manager` downloads binaries automatically for you.

So you don't have to go through the pain of doing it manually.


To use it in your project, see the example below. It's straightforward and saves you time and energy, especially when you integrate `Selenium` into your CI/CD pipeline.

By default, `webdriver-manager` installs the latest version.

But you can also define a specific version of the driver.

In [None]:
!pip install webdriver-manager

In [None]:
# Old way
from selenium import webdriver
driver = webdriver.Chrome('path/to/driver.exe')

# New way with webdriver-manager and Selenium 4
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# New way with webdriver-manager and Selenium 3
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

# Use a specific driver version (Selenium 4 style; `executable_path` is deprecated in Selenium 4)
driver = webdriver.Chrome(service=Service(ChromeDriverManager("<your_version>").install()))
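
The Selenium 3 and Selenium 4 variants above differ only in how the driver path is handed to `webdriver.Chrome`. If one script has to run under both, a small helper could pick the call style from the installed version. This is a minimal sketch: `uses_service_api` and `make_chrome_driver` are hypothetical names, not part of either library, and the version check assumes `selenium.__version__` starts with the major version number.

```python
def uses_service_api(selenium_version: str) -> bool:
    """Return True when the version string belongs to Selenium 4 or newer."""
    major = int(selenium_version.split(".")[0])
    return major >= 4


def make_chrome_driver():
    """Start Chrome with a driver binary fetched by webdriver-manager."""
    # Imported lazily so the version check above works without Selenium installed.
    import selenium
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    driver_path = ChromeDriverManager().install()
    if uses_service_api(selenium.__version__):
        # Selenium 4 wraps the driver path in a Service object.
        from selenium.webdriver.chrome.service import Service
        return webdriver.Chrome(service=Service(driver_path))
    # Selenium 3 accepts the driver path as the first positional argument.
    return webdriver.Chrome(driver_path)
```

Calling `make_chrome_driver()` downloads the driver on first use and opens a browser window, so remember to call `driver.quit()` when you are done.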

## Speed up your Scraping by Disabling Image Loading

Do you want to speed up your web scraper?

Disable image loading!

Most scrapers only need the HTML, not the pictures.

If the browser still downloads every image, you waste a lot of bandwidth and page loads take longer.

To disable image loading in Selenium, you only have to set one option (like below).

This will save you time and money.

In [None]:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()

chrome_options.add_argument('--blink-settings=imagesEnabled=false')

# Selenium 4 expects `options=`; the older `chrome_options=` keyword is deprecated
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.instagram.com/")
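
How much time this saves depends on the page, so it is worth measuring. The sketch below loads a page once with and once without images and reports the elapsed time. `build_chrome_arguments` and `time_page_load` are hypothetical helper names, and the sketch assumes Selenium 4 with a Chrome driver it can locate on its own.

```python
import time


def build_chrome_arguments(disable_images: bool) -> list:
    """Return the Chrome command-line switches for one test run."""
    arguments = []
    if disable_images:
        # Same switch as in the cell above.
        arguments.append("--blink-settings=imagesEnabled=false")
    return arguments


def time_page_load(url: str, disable_images: bool) -> float:
    """Load `url` once and return the elapsed wall-clock seconds."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(disable_images):
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        start = time.perf_counter()
        driver.get(url)
        return time.perf_counter() - start
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()
```

For example, compare `time_page_load(url, disable_images=True)` with `time_page_load(url, disable_images=False)` for an image-heavy page. A single run is noisy, so repeat the measurement a few times before drawing conclusions.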

## AI-Powered Web Scraper with `scrapegraph-ai`

Do you want to let AI scrape your website?

Use `scrapegraph-ai`.

This library uses LLMs and direct graph logic to scrape websites. You only describe the information you need.

See below, where we give it a prompt and a URL.

It also supports a multi-page scraper that extracts information from the top n results of a search engine.

In [None]:
!pip install scrapegraphai

In [None]:
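import os

from scrapegraphai.graphs import SmartScraperGraph

# Read the OpenAI API key from the environment instead of hard-coding it
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")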

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
        "temperature": 0,
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

# Run the graph and print the extracted data
result = smart_scraper_graph.run()
print(result)

'''
Output
{
  "projects": [
    {
      "title": "Rotary Pendulum RL",
      "description": "Open Source project aimed at controlling ..."
    },
    {
      "title": "DQN Implementation from scratch",
      "description": "Developed a Deep Q-Network algorithm to train a ..."
    },
    {
      "title": "Multi Agents HAED",
      "description": "University project which focuses ...."
    },
    {
      "title": "Wireless ESC for Modular Drones",
      "description": "Modular drone architecture ..."
    }
  ]
}
'''
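
LLM output is not guaranteed to follow the expected shape, so it can help to check each entry before using it. The sketch below assumes the scraper returns a dictionary shaped like the output above. The sample data is copied from that output (with its truncated descriptions), and `project_titles` is a hypothetical helper name.

```python
# Sample result copied from the output shown above (descriptions are truncated there).
result = {
    "projects": [
        {"title": "Rotary Pendulum RL",
         "description": "Open Source project aimed at controlling ..."},
        {"title": "DQN Implementation from scratch",
         "description": "Developed a Deep Q-Network algorithm to train a ..."},
        {"title": "Multi Agents HAED",
         "description": "University project which focuses ...."},
        {"title": "Wireless ESC for Modular Drones",
         "description": "Modular drone architecture ..."},
    ]
}


def project_titles(result: dict) -> list:
    """Return the titles of all entries that have a title and a description."""
    titles = []
    for project in result.get("projects", []):
        # Skip anything that is not a dict with both expected keys.
        if isinstance(project, dict) and "title" in project and "description" in project:
            titles.append(project["title"])
    return titles


print(project_titles(result))
```

Entries that are missing a key are silently skipped here. Depending on your use case, raising an error or logging them may be the better choice.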

## Crawl and Scrape Any Website with LLMs and `crawl4ai`

Scraping websites is hard.

Fortunately, LLM-powered scraping & crawling is now possible.

**Crawl4AI** is a Python tool to scrape and crawl data from any website with LLMs, allowing structured data extraction and markdown generation.

In [None]:
!pip install crawl4ai
# crawl4ai-setup is a command that ships with the package, not a separate pip package
!crawl4ai-setup

In [None]:
import asyncio
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

url = r"https://openai.com/api/pricing/"

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="groq/llama-3.1-70b-versatile",
                api_token=os.getenv("GROQ_API_KEY"),
                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
                instruction="From the crawled content, extract all mentioned model names along with their "
                "fees for input and output tokens. Make sure not to miss anything in the entire content. "
                "One extracted model JSON format should look like this: "
                '{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}',
            ),
        )

        with open(".data/data.json", "w", encoding="utf-8") as f:
            f.write(result.extracted_content)

asyncio.run(main())
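
The extracted fees arrive as strings, so comparing models needs one more parsing step. The sketch below assumes the extracted content is a JSON array of objects with the three schema fields, in the format given in the instruction above. `parse_fee` and `load_fees` are hypothetical helper names, and the sample string is the example object from that instruction, not real scraped data.

```python
import json
import re


def parse_fee(fee: str) -> float:
    """Extract the first number from a fee string such as 'US$10.00 / 1M tokens'."""
    match = re.search(r"\d+(?:\.\d+)?", fee)
    if match is None:
        raise ValueError(f"No number found in fee string: {fee!r}")
    return float(match.group())


def load_fees(extracted_content: str) -> dict:
    """Map each model name to its (input_fee, output_fee) as floats."""
    fees = {}
    for item in json.loads(extracted_content):
        fees[item["model_name"]] = (
            parse_fee(item["input_fee"]),
            parse_fee(item["output_fee"]),
        )
    return fees


# Example object taken from the extraction instruction above.
sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}]'
)
print(load_fees(sample))  # {'GPT-4': (10.0, 30.0)}
```

The regular expression takes the first number it finds and does not handle thousands separators, so adapt it if the page formats prices differently.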