In [1]:
%load_ext autoreload
%autoreload 2

## How to scrape any website with ScraperAI

### Before we start, install the package

In [2]:
# ! pip install scraperai

In [70]:
import os
import json

import pandas as pd
from dotenv import find_dotenv, load_dotenv
from tqdm import tqdm

from scraperai.parsers.models import WebpageFields, Pagination, WebpageType
from scraperai import ParserAI
from scraperai.crawlers import SeleniumCrawler

### Step 1. Init crawler

First, we need to initialize a web crawler that fetches pages from the web.

In this tutorial we use `SeleniumCrawler`, which is built on the Selenium WebDriver. By default it creates a new Chrome session.

To use another browser, pass your own webdriver (local or remote) to `SeleniumCrawler`:
```
crawler = SeleniumCrawler(driver=your_own_webdriver)
```

If you want to use Playwright or another service, you can create your own crawler implementation based on `BaseCrawler`.
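As a rough illustration of what a custom crawler could look like, here is a minimal sketch that downloads raw HTML with the standard library instead of a browser. The interface of `BaseCrawler` is not shown in this tutorial, so the class below is hypothetical: it does not subclass `BaseCrawler` and only mirrors the `get` method and `page_source` attribute used in the following steps.

```python
from urllib.request import urlopen


class StaticHtmlCrawler:
    """Hypothetical crawler that downloads raw HTML without a browser.

    Assumption: it only mirrors the `get` / `page_source` members used in
    this tutorial; the real `BaseCrawler` may require more methods.
    """

    def __init__(self, timeout: float = 10.0) -> None:
        self.timeout = timeout
        self.page_source = ""

    def get(self, url: str) -> None:
        # Download the page and decode it with the declared charset (UTF-8 by default).
        with urlopen(url, timeout=self.timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            self.page_source = response.read().decode(charset, errors="replace")
```

A crawler like this cannot take screenshots or run JavaScript, so pages that load content while scrolling still need a real browser.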

In [65]:
crawler = SeleniumCrawler()

### Step 2. Init ParserAI

By default, we use the latest OpenAI GPT-4 model. Place your API key in the `.env` file; if you don't have a key, you can get one [here](https://platform.openai.com/api-keys).
You can also use another AI model. To do so, create your own implementations of the `BaseLM`, `BaseJsonLM` and `BaseVision` classes.

In [71]:
env_file = find_dotenv()
if env_file:
    load_dotenv(env_file)  # load variables from the .env file that was found
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key is None:
    openai_api_key = input('Please, enter your OpenAI API key: ')

parser = ParserAI(openai_api_key=openai_api_key)

This notebook contains two experiments:
1. [List of YCombinator companies](https://www.ycombinator.com/companies/)
2. [List of commits in the repository](https://github.com/scraperai/scraperai/commits/main/)

#### Experiment 1. List of YCombinator companies

### Step 3. Open the website page
The main goal of the following steps is to semi-automatically detect all the XPaths needed for scraping. Later, if you have several similar sites, you will be able to run batch scraping.

In [30]:
url = 'https://www.ycombinator.com/companies' # Enter the URL of the website

In [31]:
# Open page in the browser
crawler.get(url)

### Step 3.1. Detect page type
We divide webpages into 4 categories:
- **Catalog**: consists of similar-looking repeating elements. It can be a list of products, articles, companies, table rows, etc.;
- **Details**: contains the main information about one product;
- **Captcha**: the page shows an anti-scraping CAPTCHA;
- **Other**: everything else; we don't support this webpage type yet.

By default, we use a screenshot of the page and the GPT-4 Vision model to determine the type. There is also a fallback algorithm for cases where you cannot take a screenshot of the page or do not have access to Vision models.

If you know the type of the page, you can set it manually.

In [32]:
page_type = parser.detect_page_type(
    page_source=crawler.page_source,
    screenshot=crawler.get_screenshot_as_base64()
)
# You can set type manually
# page_type = WebpageType.CATALOG
page_type

<WebpageType.CATALOG: 'catalog'>

OpenAI tokens are spent on each action. You can check the total amount spent so far with:

In [33]:
parser.total_cost  # in USD

1.0501500000000001
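Since every detection step costs money, it can help to check the running total against a budget between steps. The helper below is a minimal sketch of such a guard; `remaining_budget` and `BudgetExceededError` are hypothetical names that are not part of ScraperAI, and the function only expects a plain number such as the value of `parser.total_cost`.

```python
class BudgetExceededError(RuntimeError):
    """Raised when the accumulated API cost passes the allowed budget."""


def remaining_budget(total_cost: float, budget_usd: float) -> float:
    """Return the budget left in USD, or raise if it is already used up."""
    remaining = budget_usd - total_cost
    if remaining < 0:
        raise BudgetExceededError(
            f"Spent ${total_cost:.2f}, which is over the ${budget_usd:.2f} budget"
        )
    return remaining
```

For example, calling `remaining_budget(parser.total_cost, budget_usd=5.0)` after each step would stop the notebook as soon as the limit is passed.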

### Step 3.2. Detect pagination
**It is used only for type `catalog`.**

Pagination detection needs the whole page source.
There are 3 types of pagination: `xpath`, `scroll`, and `url_param` (not implemented yet).

In [34]:
pagination = parser.detect_pagination(crawler.page_source)
pagination

Wrong response: Loading more...,div


Pagination(type='scroll', xpath=None, url_param=None, url_param_first_value=1)

If detection fails or returns a wrong result, you can set the pagination manually.

In [13]:
# Scroll type
p1 = Pagination(type='scroll')
# XPATH
p2 = Pagination(type='xpath', xpath='//some-xpath')
# URL param
p3 = Pagination(type='url_param', url_param='page')
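To make the `url_param` type more concrete: this kind of pagination boils down to changing one query parameter in the catalog URL. The sketch below shows one way to build such URLs with the standard library. It is only an illustration of the idea, not ScraperAI code, and `page_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, url_param: str, page: int) -> str:
    """Return base_url with the pagination parameter set to the given page."""
    parts = urlsplit(base_url)
    # Keep all other query parameters and drop any old value of url_param.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != url_param
    ]
    query.append((url_param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example with a made-up catalog URL
print(page_url("https://example.com/items?sort=new", "page", 3))
```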

### Step 3.3. Detect catalog items
**It is used only for type `catalog`.**
You should correctly choose the item block, the URL and ...
AI isn't perfect, so you can pass an extra prompt to help it understand what you want, or set the xpath manually.

In [35]:
catalog_item = parser.detect_catalog_item(page_source=crawler.page_source, website_url=url, extra_prompt=None)
catalog_item

<card_xpath=//a[contains(@class,'_company_fj1ly_339')]
 url_xpath=//a[contains(@class,'_company_fj1ly_339')]/@href
 urls_on_page=['https://www.ycombinator.com/companies/airbnb', 'https://www.ycombinator.com/companies/instacart', 'https://www.ycombinator.com/companies/doordash', 'https://www.ycombinator.com/companies/coinbase', 'https://www.ycombinator.com/companies/dropbox', 'https://www.ycombinator.com/companies/gitlab', 'https://www.ycombinator.com/companies/ginkgo-bioworks', 'https://www.ycombinator.com/companies/pagerduty', 'https://www.ycombinator.com/companies/amplitude', 'https://www.ycombinator.com/companies/matterport', 'https://www.ycombinator.com/companies/weave', 'https://www.ycombinator.com/companies/notable-labs', 'https://www.ycombinator.com/companies/presto', 'https://www.ycombinator.com/companies/rigetti-computing', 'https://www.ycombinator.com/companies/pardes-bio', 'https://www.ycombinator.com/companies/embark-trucks', 'https://www.ycombinator.com/companies/momentus'
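The relation between the card xpath and the collected URLs can be sketched on a small, made-up HTML fragment: the card xpath selects every repeating element, and each card's link is resolved against the website URL. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so it matches the class exactly instead of using `contains()`. The markup and the `card_urls` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Made-up markup: two catalog cards and one unrelated link
HTML = """
<div>
  <a class="company" href="/companies/airbnb"><span>Airbnb</span></a>
  <a class="company" href="/companies/dropbox"><span>Dropbox</span></a>
  <a class="footer" href="/about">About</a>
</div>
"""


def card_urls(markup: str, card_xpath: str, base_url: str) -> list[str]:
    """Return the absolute URL of every card matched by card_xpath."""
    root = ET.fromstring(markup)
    return [urljoin(base_url, card.get("href")) for card in root.findall(card_xpath)]


print(card_urls(HTML, ".//a[@class='company']", "https://www.ycombinator.com/companies"))
```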

You can highlight the detected elements in the browser using Selenium:

In [36]:
if catalog_item is not None:
    crawler.highlight_by_xpath(catalog_item.card_xpath, '#8981D7', 5)
    crawler.highlight_by_xpath(catalog_item.url_xpath, '#5499D1', 3)

### Step 3.4. Detect data fields in a catalog item

We distinguish two types of data fields on an HTML page.

The first type is a static field, which has no field name on the page. Its value can be a single value or an array. Examples: product name or price.

The second type is a dynamic field, where both the field name and the value appear on the page. These fields usually look like tables:
param1: value1
param2: value2
etc.
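The difference can be sketched on a made-up product card. A static field needs one xpath and a name chosen by us, while a dynamic section needs two xpaths (labels and values) whose results are paired up. The markup and helper names below are assumptions for illustration and do not reflect ScraperAI internals.

```python
import xml.etree.ElementTree as ET

# Made-up product card with one static field and one table of dynamic fields
CARD = """
<div class="product">
  <h2 class="name">Kettle</h2>
  <table class="specs">
    <tr><td class="label">Color</td><td class="value">Red</td></tr>
    <tr><td class="label">Volume</td><td class="value">1.7 l</td></tr>
  </table>
</div>
"""


def texts(root: ET.Element, xpath: str) -> list[str]:
    """Return the stripped text of every element matched by xpath."""
    return [(element.text or "").strip() for element in root.findall(xpath)]


def extract_row(markup: str) -> dict[str, str]:
    root = ET.fromstring(markup)
    # Static field: the page shows only the value, we supply the name.
    row = {"Name": texts(root, ".//h2[@class='name']")[0]}
    # Dynamic fields: names and values both come from the page.
    labels = texts(root, ".//td[@class='label']")
    values = texts(root, ".//td[@class='value']")
    row.update(zip(labels, values))
    return row


print(extract_row(CARD))
```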

In [37]:
# Helper function to print detected fields
def _print_fields(fields: WebpageFields):
    print(f'Static fields ({len(fields.static_fields)}):')

    data = [{'name': f.field_name, 'xpath': f.field_xpath, 'value': f.first_value} for f in fields.static_fields]
    df = pd.DataFrame(data)
    print(df.to_markdown(tablefmt='plain', index=True))

    print()
    print(f'Dynamic fields ({len(fields.dynamic_fields)}):')
    if len(fields.dynamic_fields) == 0:
        print('Not found')
        return
    index = len(fields.static_fields)
    for field in fields.dynamic_fields:
        print(f' {index}  Section {field.section_name}\n'
                   f'    Labels xpath: {field.name_xpath}\n'
                   f'    Values xpath: {field.value_xpath}\n'
                   f'    Value: {field.first_values}')
        index += 1

In [38]:
fields = parser.extract_fields(html_snippet=catalog_item.html_snippet)
_print_fields(fields)

Static fields (4):
    name          xpath                                                                        value
 0  Company Name  //span[contains(@class, '_coName_fj1ly_454')]                                Airbnb
 1  Location      //span[contains(@class, '_coLocation_fj1ly_470')]                            San Francisco, CA, USA
 2  Description   //span[contains(@class, '_coDescription_fj1ly_479')]                         Book accommodations around the world.
 3  Tags          //a[contains(@class, '_tagLink_fj1ly_1026')]/span[contains(@class, 'pill')]  ['W09', 'Consumer', 'Travel, Leisure and Tourism']

Dynamic fields (0):
Not found


You can highlight detected fields:

In [40]:
# Method to highlight fields
def highlight_fields(fields: WebpageFields):
    colors = ['#539878', '#5499D1', '#549B9A', '#5982A3', '#5A5499', '#68D5A2', '#75DDDC', '#8981D7', '#98D1FF',
              '#98FFCF', '#9D5A5A', '#A05789', '#AAFFFE', '#C6C1FF', '#CD7CB3', '#D17A79', '#FAB4E4', '#FFB1B0']
    for index, field in enumerate(fields.static_fields):
        crawler.highlight_by_xpath(field.field_xpath, colors[index % len(colors)], border=4)
    for index, field in enumerate(fields.dynamic_fields):
        color = colors[index % len(colors)]
        crawler.highlight_by_xpath(field.value_xpath, color, border=3)
        crawler.highlight_by_xpath(field.name_xpath, color, border=3)

In [41]:
highlight_fields(fields)

### Step 3.5. Scrape data

We are almost there!

First of all, let's set some limits for simplicity:

In [43]:
max_pages = 5  # How many catalog pages we should iterate over
max_rows = 200  # How many rows to scrape before stopping

Now we need to pass the scraping recipe from the previous steps to our crawler and ask it to iterate over the catalog cards.
It will handle pagination and data extraction automatically.

In [46]:
rows = []
data_iterator = crawler.iter_data_from_catalog_pages(
    start_url=url,
    pagination=pagination,
    catalog_item_xpath=catalog_item.card_xpath,
    fields=fields,
    max_pages=max_pages,
    max_rows=max_rows
)
with tqdm(total=max_rows) as pbar:
    for data_list in data_iterator:
        rows += data_list
        pbar.update(len(data_list))

 30%|███       | 60/200 [00:00<00:00, 935.51it/s]
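The two limits interact in a simple way: iteration stops when either the page limit or the row limit is reached, or when the pages run out, so the progress bar can finish below `max_rows`. The generator below is a minimal sketch of that logic over in-memory batches. It is an assumption about the general behaviour, not the actual implementation of `iter_data_from_catalog_pages`, and `iter_batches` is a hypothetical name.

```python
from typing import Iterable, Iterator


def iter_batches(
    pages: Iterable[list[dict]], max_pages: int, max_rows: int
) -> Iterator[list[dict]]:
    """Yield one batch of rows per page until a limit is hit."""
    yielded = 0
    for page_number, batch in enumerate(pages, start=1):
        if page_number > max_pages or yielded >= max_rows:
            break
        # Trim the last batch so the total never exceeds max_rows.
        batch = batch[: max_rows - yielded]
        yielded += len(batch)
        yield batch


# Ten fake pages with 20 rows each
fake_pages = [[{"id": page * 20 + i} for i in range(20)] for page in range(10)]
collected = [row for batch in iter_batches(fake_pages, max_pages=5, max_rows=200) for row in batch]
print(len(collected))
```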


#### Congratulations! We got the final data!

In [47]:
rows

[{'Company Name': 'Airbnb',
  'Location': 'San Francisco, CA, USA',
  'Description': 'Book accommodations around the world.',
  'Tags': ['W09', 'Consumer', 'Travel, Leisure and Tourism']},
 {'Company Name': 'Instacart',
  'Location': 'San Francisco, CA, USA',
  'Description': 'Marketplace for grocery delivery and pickup',
  'Tags': ['S12', 'Consumer', 'Food and Beverage']},
 {'Company Name': 'DoorDash',
  'Location': 'San Francisco, CA, USA',
  'Description': 'Restaurant delivery.',
  'Tags': ['S13', 'Consumer', 'Food and Beverage']},
 {'Company Name': 'Coinbase',
  'Location': 'San Francisco, CA, USA',
  'Description': 'Buy, sell, and manage cryptocurrencies.',
  'Tags': ['S12', 'Fintech', 'Banking and Exchange']},
 {'Company Name': 'Dropbox',
  'Location': 'San Francisco, CA, USA',
  'Description': 'Backup and share files in the cloud.',
  'Tags': ['S07', 'B2B', 'Productivity']},
 {'Company Name': 'GitLab',
  'Location': 'San Francisco, CA, USA',
  'Description': 'A complete DevOps p

You can export the data in common formats, for example JSON or CSV:

In [None]:
# Export as json
with open('yc.json', 'w+') as f:
    json.dump(rows, f, indent=4)

# Export as CSV via a pandas DataFrame
df = pd.DataFrame(rows)
df.to_csv('yc.csv')
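Note that `df.to_csv` writes list values such as `Tags` in their Python string form (`['W09', 'Consumer', ...]`). If a flat text column is preferred, the lists can be joined before writing. The sketch below does this with the standard `csv` module; `rows_to_csv` and the `"; "` separator are assumptions chosen for illustration.

```python
import csv
import io


def rows_to_csv(rows: list[dict], list_separator: str = "; ") -> str:
    """Serialize rows to CSV text, joining list values into one cell."""
    # Collect column names in first-seen order across all rows.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            key: list_separator.join(map(str, value)) if isinstance(value, list) else value
            for key, value in row.items()
        })
    return buffer.getvalue()


sample = [{"Company Name": "Airbnb", "Tags": ["W09", "Consumer"]}]
print(rows_to_csv(sample))
```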

### Step 4. Parse nested detail page

ScraperAI can also extract data from nested detail pages, i.e. the pages that catalog items link to.

In [48]:
# Open first nested page
crawler.get(catalog_item.urls_on_page[0])

### Step 4.1. Extract fields

First, we use the `summarize_details_page_as_valid_html` method to find the relevant parts of the page.
For example, a list of similar products is not a relevant part of a details page.

Then we use `parser.extract_fields` as before to get the fields from the HTML snippet.

In [49]:
html_snippet = parser.summarize_details_page_as_valid_html(
    page_source=crawler.page_source,
    screenshot=crawler.get_screenshot_as_base64()
)
fields = parser.extract_fields(html_snippet)
_print_fields(fields)

Static fields (14):
    name                xpath                                                                  value
 0  Company Name        //h1[contains(@class,'font-extralight')]                               Airbnb
 1  Tagline             //div[contains(@class,'text-xl')]                                      Book accommodations around the world.
 2  Batch               //a[contains(@href,'/companies?batch=W09')]/div/span
 3  Status              //div[contains(text(),'Public')]                                       Public
 4  Industries          //a[contains(@href,'/companies/industry')]/div                         ['marketplace', 'travel']
 5  Location            //a[contains(@href,'/companies/location/san-francisco-bay-area')]/div  San Francisco
 6  Website             //a[contains(@href,'http://airbnb.com')]                               http://airbnb.com
 7  Description         //section//p[contains(@class,'whitespace-pre-line')]                   ['Founded in August of 2008

In [50]:
# Let's highlight the fields
highlight_fields(fields)

### Step 4.2. Scrape data

First, let's set some limits for simplicity:

In [63]:
max_pages = 5  # How many catalog pages we should iterate over
max_rows = 20  # How many rows to scrape before stopping

Now, let's collect the URLs of the nested pages:

In [60]:
urls: set[str] = set()

urls_iterator = crawler.iter_urls_to_nested_pages(
    start_url=url,
    pagination=pagination,
    url_xpath=catalog_item.url_xpath,
    max_pages=max_pages
)
with tqdm(total=max_pages) as pbar:
    for url_list in urls_iterator:
        urls.update(url_list)
        pbar.update(1)
print(f'Collected {len(urls)} urls to nested pages')

 40%|████      | 2/5 [00:22<00:33, 11.17s/it]

Collected 200 urls to nested pages
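A `set` removes duplicates but does not keep the catalog order, so the first `max_rows` nested pages that get scraped are not necessarily the first companies on the page. If the order matters, duplicates can be dropped while preserving first-seen order. A minimal sketch, with `unique_in_order` as a hypothetical helper:

```python
from typing import Iterable


def unique_in_order(batches: Iterable[list[str]]) -> list[str]:
    """Flatten batches of URLs and drop duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(url for batch in batches for url in batch))


print(unique_in_order([["/a", "/b"], ["/b", "/c"], ["/a"]]))
```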





Now, let's collect data from nested pages:

In [66]:
rows = []
data_iterator = crawler.iter_data_from_nested_pages(
    urls=urls,
    fields=fields,
    max_rows=max_rows
)
with tqdm(data_iterator, total=max_rows) as pbar:
    for row in pbar:
        rows.append(row)
rows

100%|██████████| 20/20 [01:02<00:00,  3.14s/it]


[{'Company Name': 'Athelas',
  'Tagline': 'Digital tools for healthcare providers',
  'Batch': None,
  'Status': None,
  'Industries': ['health-tech', 'medical-devices', 'biotech', 'healthcare'],
  'Location': 'Mountain View',
  'Website': None,
  'Description': "At Athelas, we're bringing simple, life-changing health care products to people around the globe.\r\n\r\nThe future of healthcare is at the home - we are a team of technologists building the next generation of medical products at the intersection of hardware and software. We won’t stop until we’ve brought the world class tools of a hospital to your home. \r\n\r\nAthelas Remote Patient Monitoring (RPM) allows healthcare providers to monitor patient vitals like blood pressure, weight, and blood glucose without the patient ever having to enter a clinic, improving patient health and engagement, and reducing hospitalizations. We do this all through a beautifully integrated suite of devices and software tools that provides access to
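Some fields come back as `None` here (`Batch`, `Status`, `Website`). A likely cause is that the XPaths were detected on a single details page and contain values specific to it, such as `batch=W09` or `http://airbnb.com`. A quick way to spot such fields is to measure how often each one is filled. The helper below is a small sketch for that check; `fill_rates` is a hypothetical name.

```python
from collections import Counter


def fill_rates(rows: list[dict]) -> dict[str, float]:
    """Return, for each field, the share of rows with a non-empty value."""
    filled = Counter()
    for row in rows:
        for key, value in row.items():
            # None, empty strings and empty lists count as missing.
            filled[key] += value not in (None, "", [])
    return {key: filled[key] / len(rows) for key in filled}


sample_rows = [
    {"Company Name": "Athelas", "Batch": None, "Industries": ["health-tech"]},
    {"Company Name": "Airbnb", "Batch": "W09", "Industries": []},
]
print(fill_rates(sample_rows))
```

Fields with a low fill rate are good candidates for a manually corrected xpath or an extra prompt.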

#### Experiment 2. [List of commits in the repository](https://github.com/scraperai/scraperai/commits/main/)

In [68]:
# Define url
url = 'https://github.com/scraperai/scraperai/commits/main/'

In [69]:
# Open url
crawler.get(url)

In [14]:
# Set the page type manually instead of detecting it
page_type = WebpageType.CATALOG
page_type

<WebpageType.CATALOG: 'catalog'>

In [72]:
# Detect pagination
pagination = parser.detect_pagination(crawler.page_source)
pagination

Pagination(type='xpath', xpath="//a[text()='Next']", url_param=None, url_param_first_value=1)
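With `xpath` pagination, the crawler follows the element matched by the xpath (here the "Next" link) until it disappears or the page limit is reached. The sketch below simulates this on three made-up in-memory pages. In a real browser the crawler would interact with the element instead of reading `href` from a dictionary, so `PAGES` and `walk_pages` are purely illustrative assumptions. Because `ElementTree` does not support `text()`, the sketch uses the equivalent `[.='Next']` predicate.

```python
import xml.etree.ElementTree as ET

# Made-up pages keyed by URL; the last one has no "Next" link
PAGES = {
    "/commits?page=1": "<div><li>c1</li><li>c2</li><a href='/commits?page=2'>Next</a></div>",
    "/commits?page=2": "<div><li>c3</li><a href='/commits?page=3'>Next</a></div>",
    "/commits?page=3": "<div><li>c4</li></div>",
}


def walk_pages(start_url: str, next_xpath: str, max_pages: int) -> list[str]:
    """Follow the element matched by next_xpath until it is missing."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        root = ET.fromstring(PAGES[url])
        link = root.find(next_xpath)
        url = link.get("href") if link is not None else None
    return visited


print(walk_pages("/commits?page=1", ".//a[.='Next']", max_pages=5))
```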

In [16]:
# Detect catalog item
catalog_item = parser.detect_catalog_item(
    page_source=crawler.page_source,
    website_url=url,
    extra_prompt='This page contains a list of commits. Each commit row is a catalog item')
catalog_item

<
card_xpath=//li[contains(@class, 'Box-sc-g0xbh4-0 gUACHT listviewitem')]
url_xpath=//li[contains(@class, 'Box-sc-g0xbh4-0 gUACHT listviewitem')]//a[contains(@class, 'color-fg-default')]/@href
urls_on_page=['https://github.com/scraperai/scraperai/commit/2e49f28f4900175d082c8667f934be5b238250dd', 'https://github.com/scraperai/scraperai/commit/a943f77ad02f04c7a64b34c73fddb8bdd0e9cbb7', 'https://github.com/scraperai/scraperai/commit/2530dbd8481e270e48b6d38b9c00a85984e9d249', '...']
>

In [18]:
crawler.highlight_by_xpath(catalog_item.card_xpath, '#8981D7', 5)
crawler.highlight_by_xpath(catalog_item.url_xpath, '#5499D1', 3)

In [22]:
fields = parser.extract_fields(html_snippet=catalog_item.html_snippet)
_print_fields(fields)

Static fields (3):
    name           xpath                                                                            value
 0  Commit Title   //h4[@class='Heading__StyledHeading-sc-1c1dgg0-0 kdbvcH markdown-title']/span/a  env example
 1  Commit Author  //a[@class='Link__StyledLink-sc-14289xe-0 iTDQyF']                               rrr2rrr
 2  Commit Hash    //div[@class='d-flex']/span/a/span[@class='Button-label color-fg-muted']

Dynamic fields (0):
Not found


In [24]:
highlight_fields(fields)

In [28]:
max_pages = 2  # How many catalog pages we should iterate over
max_rows = 100  # When to stop scraping
rows = []
data_iterator = crawler.iter_data_from_catalog_pages(
    start_url=url,
    pagination=pagination,
    catalog_item_xpath=catalog_item.card_xpath,
    fields=fields,
    max_pages=max_pages,
    max_rows=max_rows
)
with tqdm(total=max_rows) as pbar:
    for data_list in data_iterator:
        rows += data_list
        pbar.update(len(data_list))

rows

 53%|█████▎    | 53/100 [00:10<00:09,  4.91it/s]


[{'Commit Title': 'env example',
  'Commit Author': 'rrr2rrr',
  'Commit Hash': None},
 {'Commit Title': 'Add elements highlight',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Rewrite CLI as MVC app',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Improve prompt',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Fix model',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Improve chat queries',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Change invoke signature',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Add base agent with retry strategy',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Move Pagination to models',
  'Commit Author': 'iakov-kaiumov',
  'Commit Hash': None},
 {'Commit Title': 'Change csv separator',
  'Commit Author': 'iakov-kaiumov',
  'Commit 

In [29]:
len(rows)

53