
[WIP] Implement Scrapy source #320

Closed · wants to merge 55 commits into from

Conversation

@sultaniman (Contributor) commented Jan 12, 2024

Hey,

NOTE: This PR is still WIP

This PR implements a Scrapy source which sets up communication between threads via a queue, see the image below

[Image: scrapy]

In spider.py we also have our own base spider class DLTSpiderBase, which provides very basic logic; as a reference you can see our generic DLTSpider, which is used only when callbacks are specified.

NOTE:

  1. If a custom spider is specified, it takes priority and the callbacks are ignored,
  2. If no spider is provided, the source expects the two callbacks on_result and on_next_page.

The on_result callback must be a generator because we want to progressively yield results to the dlt pipeline:

from typing import Dict, Generator

from scrapy.http import Response


def parse(response: Response) -> Generator[Dict, None, None]:
    for quote in response.css("div.quote"):
        yield {
            "quote": {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            },
        }

on_next_page should return the next page URL, or None to indicate the end of the crawling process and stop the pipeline:

def next_page(response: Response) -> Optional[str]:
    return response.css("li.next a::attr(href)").get()

TODO

  • Write README and manual,
  • Tests

@burnash (Collaborator) left a comment:

Great progress, see my comments above.

@sultaniman (Contributor, Author):

@burnash Thanks for the comments and guidelines, I will address your concerns.

@rudolfix (Contributor) left a comment:

@sultaniman good job! it was not easy to put those complicated things together :) please look at my comments

  • better use a resource, not a dlt source
  • I have a few edge cases for the queue and I do not understand how the spider really works - happy to discuss that
  • IMO we need a better thread runner for dlt and scrapy

My other idea is to feed dlt via a Scrapy item pipeline: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
What is good about it is that (IMO) you can use standard Scrapy spiders and the Scrapy CLI to run scraping.
You'd still (most probably) use a queue and a thread to send items to dlt, but the architecture should be way, way easier, so please take a look. A sketch of the idea follows.
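A minimal sketch of that item-pipeline idea (the class name QueueItemPipeline, the module-level queue and the sentinel value are illustrative assumptions, not part of this PR):

from queue import Queue

result_queue: Queue = Queue(maxsize=1000)


class QueueItemPipeline:
    """Scrapy item pipeline that forwards every scraped item to a queue consumed by dlt."""

    def process_item(self, item, spider):
        result_queue.put(item)  # the dlt thread reads items from this queue
        return item

    def close_spider(self, spider):
        result_queue.put(None)  # sentinel: no more items will arrive

The pipeline would be enabled through Scrapy's ITEM_PIPELINES setting, so standard spiders and `scrapy crawl` keep working unchanged.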


### 🛡️ Custom spider vs callbacks

`build_scrapy_source` accepts callbacks and a custom spider implementation. If a custom spider is provided, it will be used and the callbacks are skipped; if you instead resort to custom callbacks, we will use our generic `spider.DLTSpider`, which takes care of calling them.
Contributor:

If possible, link to the original Scrapy docs where writing a spider is explained. We should aim to minimize what the user needs to learn in order to use this source. You could probably mention that you write the spider in the usual way, just derive it from our base class.
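For illustration, such a derived spider might look like this (a sketch only; the import path and the exact responsibilities of DLTSpiderBase are assumptions):

from scrapy.http import Response

from .spider import DLTSpiderBase  # import path is an assumption


class QuotesSpider(DLTSpiderBase):
    # written like any regular Scrapy spider, just derived from our base class
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response: Response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}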

The scraping source allows you to scrape content from the web and uses [Scrapy](https://doc.scrapy.org/en/latest/)
to enable this capability.

## 🧠 How it works?
Contributor:

IMO such technical details should come at the end. People interested in scrapers are mostly data scientists, and they are interested in usage, not in internals (IMO :))

Contributor (Author):

Fair, I will move this kind of information to the very bottom or a separate markdown file.


__all__ = ["build_scrapy_source"]

logger = logging.getLogger(__file__)
Contributor:

please use logger from dlt.common
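A sketch of the suggested change, assuming the shared logger that dlt.common exposes:

from dlt.common import logger

logger.info("Starting scraping pipeline")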

Contributor (Author):

Will change it.

yield get_scraping_results(queue=queue)


def build_scrapy_source(
Contributor:

I'd use a little different structure here:

  1. do not use decorators to create the source and resource
  2. do not use a source at all; create a dlt.resource dynamically here and return it (see the sketch below)
  3. thanks to the apply_hints method, users are able to modify such resources as they wish, i.e. create dynamic table names etc.
  4. please check the with_config decorator and pass all configuration using it and specs: https://dlthub.com/docs/general-usage/credentials/config_specs#writing-custom-specs
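A rough sketch of points 1–3 (the signature of build_scrapy_source and the sentinel handling are assumptions, not the final implementation):

import dlt


def build_scrapy_source(queue, name="scrapy"):
    def get_scraping_results():
        while True:
            item = queue.get()
            if item is None:  # sentinel: scrapy has finished
                break
            yield item

    # create the resource dynamically instead of decorating a module-level function
    resource = dlt.resource(get_scraping_results, name=name)
    # callers can still reshape it, e.g. resource.apply_hints(table_name="quotes")
    return resource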

"""Builder which configures scrapy and Dlt pipeline to run in and communicate using queue"""
if not spider and not (on_next_page and on_result):
logger.error("Please provide spider or lifecycle hooks")
raise RuntimeError(
Contributor:

I think raising the exception is enough.

# If next page is available
# Then we create next request
# Else we stop spider because no pages left
next_page = self.on_next_page(response)
Contributor:

I have a feeling you reimplement scrapy internals here

# Else we stop spider because no pages left
next_page = self.on_next_page(response)
if next_page is not None:
    next_page = response.urljoin(next_page)
Contributor:

hmmmm why? this is how they do it in their spiders:

for next_page in response.css('a.next'):
    yield response.follow(next_page, self.parse)



OnNextPage = Callable[[Response], Optional[str]]
OnResult = Callable[[Response], Generator[Any, None, None]]
Contributor:

Iterator[Any] == Generator[Any, None, None]
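i.e. the alias could use the shorter annotation (a sketch assuming the module's existing imports):

from typing import Any, Callable, Iterator

from scrapy.http import Response

OnResult = Callable[[Response], Iterator[Any]]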


d = runner.join()
d.addBoth(lambda _: reactor.stop()) # type: ignore[attr-defined]
reactor.run() # type: ignore[attr-defined]
Contributor:

Here you should:

  1. wait for scrapy to finish (all spiders stopped yielding),
  2. make sure that all messages got consumed from the queue,
  3. close the queue at the very end (see the sketch below).
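A minimal sketch of that ordering only, using the runner and queue already in scope; it assumes the queue offers join()/task_done() semantics and a close() method, which is an assumption about this PR's queue wrapper:

from twisted.internet import reactor

d = runner.join()  # fires once all spiders have finished yielding


def shutdown(_result):
    queue.join()   # wait until dlt has consumed every message put on the queue
    queue.close()  # close the queue at the very end
    reactor.stop()


d.addBoth(shutdown)
reactor.run()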

) -> None:
"""Convenience method which handles the order of starting of pipeline and scrapy"""
logger.info("Starting scraping pipeline")
pipeline_thread_runner = threading.Thread(target=pipeline_runner)
Contributor:

We need a better "host" for dlt. At the very least it should close the queue if an exception occurred, or if dlt exited and the queue is not yet closed. If possible, the dlt exception should be re-raised in the main thread.
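A minimal sketch of such a host (PipelineHost is a hypothetical name, and queue.close() assumes the PR's queue wrapper exposes that method):

import threading


class PipelineHost:
    """Runs the dlt pipeline in a thread, closes the queue on exit and re-raises errors."""

    def __init__(self, pipeline_runner, queue):
        self._runner = pipeline_runner
        self._queue = queue
        self._exc = None
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        try:
            self._runner()
        except BaseException as exc:  # capture so it can be re-raised in the main thread
            self._exc = exc
        finally:
            self._queue.close()  # stop the scrapy side from feeding a dead consumer

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()
        if self._exc is not None:
            raise self._exc  # re-raise the dlt exception in the main thread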

@sultaniman (Contributor, Author):

Closing this in favor of #332

@sultaniman sultaniman closed this Jan 25, 2024
@sultaniman sultaniman deleted the source/scrapy branch January 25, 2024 20:36