In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.
We should implement a periodic job (cron or AWS Lambda) that runs the meta scraper and checks for updates, additions, and deletions since the last run.
A quick-and-dirty alternative to a periodic sync job would be to recreate the entire Elasticsearch index each time we crawl. This works, but collecting user feedback gets tricky because we lose the document_id when the list of scrapers is updated.
Workflow
Execute the meta crawler.
Search ES to check whether the crawled question/answer pairs for a given scraper are already present. The ES query can be filtered by the link field.
Existing questions in ES that are no longer present (or have changed) in the newly crawled link are marked as outdated in ES.
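The comparison step in this workflow could look roughly like the sketch below. It is only an illustration: the `FaqPair` type and the `diff_faqs` function are hypothetical names, and it assumes the existing, non-outdated documents for one link have already been fetched from ES into a dict keyed by document_id. A changed answer is treated as "old pair outdated, new pair added".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaqPair:
    """Hypothetical container for one crawled question/answer pair."""
    question: str
    answer: str


def diff_faqs(existing, crawled):
    """Compare what ES holds for one link with a fresh crawl of that link.

    existing: dict mapping document_id -> FaqPair (current docs for the link)
    crawled:  list of FaqPair from the new crawl
    Returns (to_add, to_mark_outdated).
    """
    crawled_set = set(crawled)
    existing_set = set(existing.values())

    # Docs whose exact question/answer pair no longer appears in the crawl
    # (deleted or changed) get marked as outdated rather than removed.
    to_mark_outdated = sorted(
        doc_id for doc_id, pair in existing.items() if pair not in crawled_set
    )

    # Pairs that are new or changed need a fresh document.
    # dict.fromkeys drops duplicates while keeping crawl order.
    to_add = [pair for pair in dict.fromkeys(crawled) if pair not in existing_set]

    return to_add, to_mark_outdated
```

Unchanged pairs fall into neither list, so their document_id (and any user feedback attached to it) stays untouched.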
Other details
Currently, the document_id field in ES is populated with incrementing numbers. It could be changed to a UUID to make things simpler to implement.
The API queries should be changed to exclude outdated documents.
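As a rough sketch of these two details, the snippet below generates a UUID-based document_id and builds a search body that skips outdated documents. The function names are hypothetical, and it assumes the outdated marker is stored as a boolean field called `outdated` and that questions live in a field called `question`; the actual mapping may differ.

```python
import uuid


def new_document_id() -> str:
    """Return a random UUID string to use as document_id (assumed scheme)."""
    return str(uuid.uuid4())


def build_search_query(user_question: str) -> dict:
    """Build an ES query body that matches on the question text and
    excludes documents flagged as outdated.

    The field names "question" and "outdated" are assumptions.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"question": user_question}}],
                # Documents without the field are still returned; only
                # those explicitly marked outdated are filtered out.
                "must_not": [{"term": {"outdated": True}}],
            }
        }
    }
```

Because the ID no longer depends on insertion order, adding or removing scrapers does not shift the IDs of documents that are already indexed.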