Running the Crawl in the same process multiple times #1970
Unanswered
germanattanasio asked this question in Q&A
Replies: 1 comment
-
@B4nan mentioned that crawlee 3.4.0 does this automatically already.
-
Hi, first of all, this library is amazing. Thank you! I'm writing this Q&A because I've spent some time looking at the source code, issues, and existing discussions, and I think there is an opportunity to help the community behind this project understand how the crawling works.
Crawlee: 3.4.0
Problem
I consume a Kafka message and run a Playwright crawler based on the configuration from that message. Each page that is found is then published into another topic.
Steps to reproduce
Based on the following function, I wrote a simple test that tries to call crawlPage multiple times. When running the code, onDocument is called ~11 times on the first run, and the output includes the expected documents. So far so good.
When I try a second time, I call crawlPage again, which creates a new PlaywrightCrawler. I would expect the state between the two instances of PlaywrightCrawler not to be shared by default. The test fails and says that onDocumentSecond has not been called.
Workaround
Digging through the issues, I found that someone was calling requestQueue.drop() to clean up the already-visited pages, so I'm now doing the following.
Question
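A sketch of that workaround, assuming the queue is opened explicitly and dropped after the run (the wrapper name and callback are illustrative):

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Illustrative sketch of the workaround: open the request queue
// explicitly, run the crawl, then drop the queue so the next run
// does not see the previous run's "already handled" requests.
async function crawlPage(
  startUrl: string,
  onDocument: (url: string) => Promise<void>,
): Promise<void> {
  const requestQueue = await RequestQueue.open();
  const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
      await onDocument(request.url);
      await enqueueLinks();
    },
  });
  await crawler.run([startUrl]);
  // Delete the queue's storage so its state is not reused.
  await requestQueue.drop();
}
```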
Is requestQueue.drop() the right thing to do here? Can I call the crawlPage() method multiple times in parallel?
Related issues/discussions
I also tried Configuration.getGlobalConfig().getStorageClient()?.purge?.(), but I get a file-not-found exception.
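An alternative to purging is to avoid sharing state in the first place by giving each run its own named queue. The helper below is a hypothetical illustration (only the name generation itself is shown; passing the name to Crawlee is noted in the comment):

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical helper: generate a unique queue name per run, so each
// crawl can open its own isolated queue instead of the shared default,
// e.g. `await RequestQueue.open(uniqueQueueName())` in crawlee.
function uniqueQueueName(prefix = 'crawl'): string {
  return `${prefix}-${randomUUID()}`;
}
```

Each name is unique, so two concurrent crawlPage() calls would never touch the same queue, which also addresses the parallelism question.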