[web scraper] verified source #262

Closed
rudolfix opened this issue Sep 13, 2023 · 6 comments · Fixed by #332
Labels: verified source (dlt source with tests and demos)

Comments

@rudolfix
Contributor

rudolfix commented Sep 13, 2023

Quick source info

  • Name of the source: [e.g. web_scraper]
  • What is the data source: any website

Current Status

  • I plan to write it

What source does/will do

The idea is to base the source on scrapy. In theory, scrapy can be used with dlt directly, because the scraped data can be obtained as a generator; in practice, however, scraping is typically wrapped in an opaque process from which there is no way to get the data out. scrapy is a framework of its own, so we can also fit dlt into scrapy (i.e. as an export option).

We must investigate whether to use scrapy or to switch to Beautiful Soup and write our own spider.
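For comparison, here is a minimal, hedged sketch of the Beautiful Soup alternative: a hand-rolled spider exposed directly as a dlt resource. The requests/bs4 dependencies, the function name, the selectors, and the quotes.toscrape.com start URL (the same demo site used in the POC further down) are illustrative assumptions, not a decided design.

from urllib.parse import urljoin

import dlt
import requests
from bs4 import BeautifulSoup


@dlt.resource(name="quotes")
def crawl_quotes(start_url: str = "https://quotes.toscrape.com/page/1/"):
    # follow "next page" links and yield one row per quote as it is scraped
    url = start_url
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for quote in soup.select("div.quote"):
            yield {
                "text": quote.select_one("span.text").get_text(),
                "author": quote.select_one("small.author").get_text(),
                "tags": [tag.get_text() for tag in quote.select("div.tags a.tag")],
            }
        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link else None

This keeps everything in a single generator that dlt can consume directly, at the cost of the concurrency and scheduling scrapy provides.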

The requirements [for scrapy]

  • implement a dlt.resource that, when given a scrapy Spider, yields all the data that the spider yields
  • ideally, data is yielded as it appears, not collected until scraping is over and then re-yielded from e.g. a csv file
  • ideally, data is scraped concurrently (as I believe scrapy already does)
  • data should be paginated to speed up processing (see e.g. the sketch after this list)
  • return page metadata together with the items (e.g. page url, last modification time, etc.)
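One reading of the pagination requirement is that the resource should yield items in batches (pages) rather than one by one, so downstream processing works on chunks. A minimal sketch of such a helper; the paginate name and default page size are assumptions, not part of the issue.

from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def paginate(items: Iterable[Dict[str, Any]], page_size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    # group individual scraped items into fixed-size pages so the consumer
    # (e.g. a dlt resource) can yield and process them in batches
    it = iter(items)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page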

Test account / test data

Looks like we'll have plenty of websites to test against

Additional context

Please provide one demo where we scrape PDFs and parse them in a transformer; a sketch of the idea follows below.
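A hedged sketch of that PDF demo idea: a resource that lists PDF files, piped into a dlt transformer that parses them. The pypdf dependency, the local ./pdfs folder, and all names are assumptions for illustration only.

from pathlib import Path

import dlt
from pypdf import PdfReader  # assumed PDF parsing dependency


@dlt.resource(name="pdf_files")
def list_pdfs(folder: str = "./pdfs"):
    # yield one row per PDF file found in the (assumed) local folder
    for path in Path(folder).glob("*.pdf"):
        yield {"path": str(path)}


@dlt.transformer(data_from=list_pdfs, name="pdf_pages")
def parse_pdf(file_item):
    # parse each incoming file row and yield one row per page of text
    reader = PdfReader(file_item["path"])
    for page_no, page in enumerate(reader.pages):
        yield {
            "path": file_item["path"],
            "page": page_no,
            "text": page.extract_text(),
        }

Running pipeline.run(list_pdfs | parse_pdf) would then load one row per parsed page.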

@rudolfix rudolfix added the verified source dlt source with tests and demos label Sep 13, 2023
@sultaniman
Collaborator

I did some small research on this topic and I think we can use:

  1. a custom item exporter, or
  2. a custom item pipeline which basically runs dlt
  3. and feeds it JSON values, or uses dlt to create load packages and then finally starts the export process to the destination (a sketch of the item-pipeline option follows below).
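A hedged sketch of the item-pipeline option: a Scrapy item pipeline that buffers items and hands them to dlt when the spider closes. The class name, the duckdb destination, and the buffer-everything-in-memory approach are assumptions for illustration, not a decided design.

import dlt


class DltItemPipeline:
    # Scrapy calls these hooks during a crawl: open_spider/close_spider once,
    # process_item for every scraped item

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # buffer the item in memory; a real implementation would likely stream
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # once the crawl is finished, run a dlt pipeline over the buffered items
        pipeline = dlt.pipeline(pipeline_name="scrapy_items", destination="duckdb")
        pipeline.run(self.items, table_name=spider.name, write_disposition="append")

The pipeline would be enabled via the ITEM_PIPELINES setting of the Scrapy project (the module path in that setting is project-specific).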

@sultaniman
Collaborator

@rudolfix @burnash wdyt?

@burnash
Collaborator

burnash commented Jan 9, 2024

@sultaniman good points. Could you please do a proof of concept of this?

@sultaniman
Collaborator

I will do it once I check out this one dlt-hub/dlt#811

@sultaniman
Collaborator

sultaniman commented Jan 10, 2024

So I created a draft prototype where:

  1. the pipeline runs in a separate thread,
  2. communication between the spider and the pipeline goes via an in-memory queue,
  3. the crawler process is started manually,
  4. the pipeline is started manually in a separate thread.
flowchart LR
    queue[[queue]]
    pipeline[[dlt pipeline]]
    exit{{scraping done}}
    save([exit & save data])
    nodata{scraping done?}
    spider-- push results -->queue
    spider-- no more data -->exit
    queue-->pipeline
    pipeline-->nodata
    nodata-- NO -->queue
    nodata-- DONE -->save
    exit-. no data .->queue

Pipeline and scaffolding

from queue import Queue
import threading
import dlt
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider

result_queue = Queue(maxsize=1000)


class SpiderResultHandler(threading.Thread):
    def __init__(self, queue: Queue):
        super().__init__(daemon=True)
        self.result_queue = queue

    def run(self):
        @dlt.resource(name="quotes")
        def get_results():
            # keep pulling items from queue
            # until we get "done" in message
            while True:
                result = self.result_queue.get()
                if "done" in result:
                    break
                yield result

        pipeline = dlt.pipeline(
            pipeline_name="issue_262",
            destination="postgres",
        )

        load_info = pipeline.run(
            get_results,
            table_name="fam_quotes",
            write_disposition="replace",
        )

        print(load_info)

process = CrawlerProcess()

# the spider pushes scraped items into the shared queue while the handler
# thread consumes them and feeds them into the dlt pipeline
process.crawl(QuotesSpider, queue=result_queue)
handler = SpiderResultHandler(queue=result_queue)
handler.start()
process.start()
handler.join()
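One detail worth noting about this sketch: the handler thread blocks on result_queue.get() with no timeout, so if the spider fails before pushing the {"done": True} sentinel, handler.join() would never return; adding a timeout to get() (or to join()) would be one way to harden this in the verified source.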

Spider source

from queue import Queue
from typing import Any
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]
    custom_settings = {"LOG_LEVEL": "INFO"}

    def __init__(
        self,
        name: str | None = None,
        queue: Queue | None = None,
        **kwargs: Any,
    ):
        super().__init__(name, **kwargs)
        self.queue = queue

    def parse(self, response):
        for quote in response.css("div.quote"):
            data = {
                "headers": dict(response.headers.to_unicode_dict()),
                "quote": {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                },
            }
 
            # here we push result to queue
            self.queue.put(data)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        else:
            # finally if there are no more results then send "done"
            self.queue.put({"done": True})

@sultaniman sultaniman self-assigned this Jan 10, 2024
@burnash
Collaborator

burnash commented Jan 10, 2024

@sultaniman thanks for the POC, this looks great. Please go ahead and make a verified source from this POC. As you mentioned, you'd need to devise a nice way to wrap it into a source definition that hides some of the complexity while giving enough ways to configure the source. Please take a look at the other verified sources in this repo for inspiration. Please submit a draft PR and we'll iterate on the source interface.

@sultaniman sultaniman linked a pull request Feb 1, 2024 that will close this issue