[web scraper] verified source #262

Closed
rudolfix opened this issue Sep 13, 2023 · 6 comments · Fixed by #332
Labels: verified source (dlt source with tests and demos)

Comments

@rudolfix
Contributor

rudolfix commented Sep 13, 2023

Quick source info

  • Name of the source: [e.g. web_scraper]
  • What is the data source: any website

Current Status

  • I plan to write it

What source does/will do

The idea is to base the source on scrapy. In theory, scrapy can be used with dlt directly, because the scraped data can be obtained as a generator; in practice, however, scraping is typically wrapped in an opaque process from which there is no way to get the data out. scrapy is a framework of its own, so we can also fit dlt into scrapy (i.e. as an export option).

We must investigate whether to use scrapy or to switch to Beautiful Soup and write our own spider.
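For comparison, here is a minimal, hedged sketch of the Beautiful Soup alternative: a hand-rolled spider exposed directly as a dlt resource. The requests/bs4 dependencies, the function name, the selectors, and the quotes.toscrape.com start URL (the same demo site used in the POC further down) are illustrative assumptions, not a decided design.

from urllib.parse import urljoin

import dlt
import requests
from bs4 import BeautifulSoup


@dlt.resource(name="quotes")
def crawl_quotes(start_url: str = "https://quotes.toscrape.com/page/1/"):
    # follow "next page" links and yield one row per quote as it is scraped
    url = start_url
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for quote in soup.select("div.quote"):
            yield {
                "text": quote.select_one("span.text").get_text(),
                "author": quote.select_one("small.author").get_text(),
                "tags": [tag.get_text() for tag in quote.select("div.tags a.tag")],
            }
        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link else None

This keeps everything in a single generator that dlt can consume directly, at the cost of the concurrency and scheduling scrapy provides.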

The requirements [for scrapy]

  • implement a dlt.resource that, when given a scrapy Spider, yields all the data that the spider yields
  • ideally, data is yielded as it appears, not collected until scraping is over and then re-yielded from e.g. a csv file
  • ideally, data is scraped concurrently (as I believe scrapy already does)
  • data should be paginated to speed up processing (see e.g. the sketch after this list)
  • return page metadata together with the items (e.g. page url, last modification time, etc.)
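One reading of the pagination requirement is that the resource should yield items in batches (pages) rather than one by one, so downstream processing works on chunks. A minimal sketch of such a helper; the paginate name and default page size are assumptions, not part of the issue.

from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def paginate(items: Iterable[Dict[str, Any]], page_size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    # group individual scraped items into fixed-size pages so the consumer
    # (e.g. a dlt resource) can yield and process them in batches
    it = iter(items)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page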

Test account / test data

Looks like we'll have plenty of websites to test against

Additional context

Please provide one demo where we scrape PDFs and parse them in a transformer; a sketch of the idea follows below.
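A hedged sketch of that PDF demo idea: a resource that lists PDF files, piped into a dlt transformer that parses them. The pypdf dependency, the local ./pdfs folder, and all names are assumptions for illustration only.

from pathlib import Path

import dlt
from pypdf import PdfReader  # assumed PDF parsing dependency


@dlt.resource(name="pdf_files")
def list_pdfs(folder: str = "./pdfs"):
    # yield one row per PDF file found in the (assumed) local folder
    for path in Path(folder).glob("*.pdf"):
        yield {"path": str(path)}


@dlt.transformer(data_from=list_pdfs, name="pdf_pages")
def parse_pdf(file_item):
    # parse each incoming file row and yield one row per page of text
    reader = PdfReader(file_item["path"])
    for page_no, page in enumerate(reader.pages):
        yield {
            "path": file_item["path"],
            "page": page_no,
            "text": page.extract_text(),
        }

Running pipeline.run(list_pdfs | parse_pdf) would then load one row per parsed page.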

@rudolfix rudolfix added the verified source dlt source with tests and demos label Sep 13, 2023
@sultaniman
Collaborator

I did some small research on this topic and I think we can use:

  1. a custom item exporter, or
  2. a custom item pipeline which basically runs dlt
  3. and feeds it JSON values, or uses dlt to create load packages and then finally starts the export process to the destination (a sketch of the item-pipeline option follows below).
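A hedged sketch of the item-pipeline option: a Scrapy item pipeline that buffers items and hands them to dlt when the spider closes. The class name, the duckdb destination, and the buffer-everything-in-memory approach are assumptions for illustration, not a decided design.

import dlt


class DltItemPipeline:
    # Scrapy calls these hooks during a crawl: open_spider/close_spider once,
    # process_item for every scraped item

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # buffer the item in memory; a real implementation would likely stream
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # once the crawl is finished, run a dlt pipeline over the buffered items
        pipeline = dlt.pipeline(pipeline_name="scrapy_items", destination="duckdb")
        pipeline.run(self.items, table_name=spider.name, write_disposition="append")

The pipeline would be enabled via the ITEM_PIPELINES setting of the Scrapy project (the module path in that setting is project-specific).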

@sultaniman
Collaborator

@rudolfix @burnash wdyt?

@burnash
Collaborator

burnash commented Jan 9, 2024

@sultaniman good points. Could you please do a proof of concept of this?

@sultaniman
Collaborator

I will do it once I check out this one dlt-hub/dlt#811

@sultaniman
Collaborator

sultaniman commented Jan 10, 2024

So I created a draft prototype where:

  1. the pipeline runs in a separate thread,
  2. communication between the spider and the pipeline goes via an in-memory queue,
  3. the crawler process is started manually,
  4. the pipeline is started manually in a separate thread.
flowchart LR
    queue[[queue]]
    pipeline[[dlt pipeline]]
    exit{{scraping done}}
    save([exit & save data])
    nodata{scraping done?}
    spider-- push results -->queue
    spider-- no more data -->exit
    queue-->pipeline
    pipeline-->nodata
    nodata-- NO -->queue
    nodata-- DONE -->save
    exit-. no data .->queue

Pipeline and scaffolding

from queue import Queue
import threading
import dlt
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider

result_queue = Queue(maxsize=1000)


class SpiderResultHandler(threading.Thread):
    def __init__(self, queue: Queue):
        super().__init__(daemon=True)
        self.result_queue = queue

    def run(self):
        @dlt.resource(name="quotes")
        def get_results():
            # keep pulling items from queue
            # until we get "done" in message
            while True:
                result = self.result_queue.get()
                if "done" in result:
                    break
                yield result

        pipeline = dlt.pipeline(
            pipeline_name="issue_262",
            destination="postgres",
        )

        load_info = pipeline.run(
            get_results,
            table_name="fam_quotes",
            write_disposition="replace",
        )

        print(load_info)

process = CrawlerProcess()

# the spider pushes scraped items into the shared queue while the handler
# thread consumes them and feeds them into the dlt pipeline
process.crawl(QuotesSpider, queue=result_queue)
handler = SpiderResultHandler(queue=result_queue)
handler.start()
process.start()
handler.join()
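One detail worth noting about this sketch: the handler thread blocks on result_queue.get() with no timeout, so if the spider fails before pushing the {"done": True} sentinel, handler.join() would never return; adding a timeout to get() (or to join()) would be one way to harden this in the verified source.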

Spider source

from queue import Queue
from typing import Any
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]
    custom_settings = {"LOG_LEVEL": "INFO"}

    def __init__(
        self,
        name: str | None = None,
        queue: Queue | None = None,
        **kwargs: Any,
    ):
        super().__init__(name, **kwargs)
        self.queue = queue

    def parse(self, response):
        for quote in response.css("div.quote"):
            data = {
                "headers": dict(response.headers.to_unicode_dict()),
                "quote": {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                },
            }
 
            # here we push result to queue
            self.queue.put(data)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        else:
            # finally if there are no more results then send "done"
            self.queue.put({"done": True})

@sultaniman sultaniman self-assigned this Jan 10, 2024
@burnash
Collaborator

burnash commented Jan 10, 2024

@sultaniman thanks for the POC, this looks great. Please go ahead and make a verified source from this POC. As you mentioned, you'd need to devise a nice way to wrap it into a source definition that hides some of the complexity while giving enough ways to configure the source. Please take a look at the other verified sources in this repo for inspiration. Please submit a draft PR and we'll iterate on the source interface.

@sultaniman sultaniman linked a pull request Feb 1, 2024 that will close this issue