This repository has been archived by the owner on Dec 5, 2022. It is now read-only.

Support data source pagination #1

Closed
jzohrab opened this issue Apr 16, 2020 · 9 comments

Comments

@jzohrab
Contributor

jzohrab commented Apr 16, 2020

Description

In coronadatascraper, PR covidatlas/coronadatascraper#835 provides support for ArcGIS data pagination. Some JSON result sets are too big to return in a single response, so the requests will need to manage that. Presumably, similar to the GitHub API, ArcGIS provides a "nextResultSet" token or similar in the response, and clients can then requery with that token.

We'd need to manage that for both crawls and scrapes. Presumably this could be managed with lambdas, but the cache file naming convention will need to be page-aware, and loading from the cache will need to return all of the page files.

Describe the solution you'd like

One possibility: include page number, indexed from zero, after the cache key (or name), e.g., <datetime>-<name>-<page>-<sha>.<ext>.gz. If there is only one page (which will be true in most cases), 'page' would be 0 and there won't be any other data sets, and the thing passed to scrape would just be the content.
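A minimal sketch of the naming idea above, not the project's actual cache code. The helper names `cacheFileName` and `contentForScrape` and the shape of their arguments are hypothetical. Returning an array of page contents when there is more than one page is also an assumption, since the proposal only states what happens for a single page.

```javascript
// Hypothetical helper: builds a page-aware cache file name in the proposed
// <datetime>-<name>-<page>-<sha>.<ext>.gz layout. Pages are indexed from zero.
function cacheFileName ({ datetime, name, page = 0, sha, ext }) {
  if (!Number.isInteger(page) || page < 0) {
    throw new Error(`page must be a non-negative integer, got ${page}`)
  }
  return `${datetime}-${name}-${page}-${sha}.${ext}.gz`
}

// Hypothetical helper: decides what gets handed to scrape.
// `pages` is an array of { page, content } objects loaded from the cache.
function contentForScrape (pages) {
  // Copy before sorting so the caller's array is not mutated.
  const ordered = [...pages].sort((a, b) => a.page - b.page)
  // Single page (the common case): pass the content itself.
  if (ordered.length === 1) return ordered[0].content
  // Assumption: with several pages, pass the contents in page order.
  return ordered.map(p => p.content)
}
```

With this shape, existing single-page scrapers would keep receiving plain content, and only paginated sources would need to handle an array.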

@jzohrab jzohrab added the enhancement New feature or request label Apr 16, 2020
@ryanblock ryanblock self-assigned this Apr 16, 2020
@camjc
Contributor

camjc commented May 13, 2020

JP is a good example: we hit the 10,000 limit there.

@camjc
Contributor

camjc commented May 13, 2020

https://services8.arcgis.com/JdxivnCyd1rvJTrY/arcgis/rest/services/v2_covid19_list_csv/FeatureServer/0/query lets you query the dataset from a UI. Here are the main settings to get the JSON we use:

[Screenshot: Screen Shot 2020-05-14 at 07 14 20-fullpage]

@camjc
Contributor

camjc commented May 13, 2020

Hoping someone can advise on how the pagination works. I don't know how it does.

@ryanblock
Contributor

Agreed. I know nothing about this system. I could really use:

  1. A broken source that needs this
  2. Clear instructions on how it's broken (e.g., steps to repro)
  3. If available, any ideas on how to unbreak things and get the source humming

@jzohrab
Contributor Author

jzohrab commented May 14, 2020 via email

@jzohrab
Contributor Author

jzohrab commented May 17, 2020

Actually relevant: the PA scraper currently fetches paginated data and handles the pagination itself, e.g.:

$ yarn start --location PA
...
1000 records from "... url ...&resultOffset=0&resultRecordCount=50000&f=json
...
✏️  coronadatascraper-cache/2020-5-17/55c884a3dc1f7fe60c5bb08af5371500.json written
1000 records from "... url ... &resultOffset=1000&resultRecordCount=50000&f=json
...
✏️  coronadatascraper-cache/2020-5-17/de03720e752a8e6478e066e3cb308ee2.json written
1000 records from "... url ... &resultOffset=2000&resultRecordCount=50000&f=json
...

Ref: src/shared/scrapers/PA/
Method: async function TEMPfetchArcGISJSON(obj, featureURL, date) {
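The offset-based loop visible in the log above can be sketched roughly as follows. This is an illustration under stated assumptions, not the body of TEMPfetchArcGISJSON: `fetchAllPages` and the injected `fetchPage` function are hypothetical names, the response is assumed to carry its records in a `features` array, and `baseUrl` is assumed to already contain a query string. Only the `resultOffset`, `resultRecordCount` and `f=json` parameters come from the log.

```javascript
// Hypothetical sketch: collect every record by advancing resultOffset.
// fetchPage is an injected function (url) => Promise<parsed JSON>, so the
// loop can be exercised without network access.
async function fetchAllPages (baseUrl, fetchPage, recordCount = 50000) {
  const features = []
  let offset = 0
  while (true) {
    // Assumption: baseUrl already has a query string, so '&' is the right joiner.
    const url = `${baseUrl}&resultOffset=${offset}&resultRecordCount=${recordCount}&f=json`
    const page = await fetchPage(url)
    // Assumption: records arrive in a `features` array.
    const batch = page.features || []
    // An empty page means there is nothing left to read.
    if (batch.length === 0) break
    features.push(...batch)
    // Advance by what the server actually returned, which may be fewer
    // records than requested (1000 per request in the log above).
    offset += batch.length
  }
  return features
}
```

Stopping on the first empty page costs one extra request, but it avoids guessing the server's per-request cap.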

@jzohrab jzohrab self-assigned this May 28, 2020
@jzohrab
Contributor Author

jzohrab commented May 28, 2020

WIP PR will soon close this: https://github.com/covidatlas/li/pull/193/files

@jzohrab
Contributor Author

jzohrab commented May 31, 2020

New PR: #218.

@jzohrab
Contributor Author

jzohrab commented Jun 12, 2020

PR merged 🎉

@jzohrab jzohrab closed this as completed Jun 12, 2020