app/requestor: parse a response using a variable configuration. #7

Open
3 tasks
Tracked by #5
Ziinc opened this issue May 15, 2022 · 0 comments
Ziinc commented May 15, 2022

Config - Parsing

One way to view the parsing config is as a set of modules, each of which extracts text. The extracted text can then either be converted into new requests or passed on as parsed items. Right now, I can think of a few extraction modules:

  • XPath extraction
  • CSS selector extraction
  • regex extraction (for starters)
  • JSON extraction
  • glob extraction

For example, extract a list of text values and convert all of them into requests or items.
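The module idea above can be sketched roughly as follows, in Python for illustration. The `Extractor` dataclass, its `method` and `pattern` fields, and the `extract` function are hypothetical names, not an existing API. Only regex and JSON are covered here because they need nothing outside the standard library.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Extractor:
    """Hypothetical config entry: which method to use and its pattern."""
    method: str   # "regex" or "json" in this sketch
    pattern: str  # regex with one capture group, or a dotted JSON key path


def extract(extractor: Extractor, body: str) -> list[str]:
    """Return every piece of text the extractor finds in the response body."""
    if extractor.method == "regex":
        return re.findall(extractor.pattern, body)
    if extractor.method == "json":
        value = json.loads(body)
        for key in extractor.pattern.split("."):
            value = value[key]
        # Normalise a single value into a one-element list.
        values = value if isinstance(value, list) else [value]
        return [str(v) for v in values]
    raise ValueError(f"unsupported extraction method: {extractor.method}")


html = '<a href="/page/1">1</a> <a href="/page/2">2</a>'
links = extract(Extractor("regex", r'href="([^"]+)"'), html)

payload = '{"data": {"ids": [10, 20]}}'
ids = extract(Extractor("json", "data.ids"), payload)
```

Every method returns a flat list of strings, so the caller can decide afterwards whether the text becomes requests or items.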

However, what if we want to extract a list of items (objects)? An example is a list of products (search results).
One way to model this is with nested extraction rules.
For example, use a CSS selector to select all <li> elements, then use further CSS selectors inside each element to query for the title, URL, and description, resulting in a list of objects.
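A rough sketch of nested extraction rules, assuming regexes stand in for the CSS selectors so the example stays dependency-free. The `RULE` layout and `extract_items` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical nested rule: one fragment pattern, then one pattern per attribute.
RULE = {
    "fragment": r"<li>(.*?)</li>",
    "attributes": {
        "title": r"<h2>(.*?)</h2>",
        "url": r'href="([^"]+)"',
        "description": r"<p>(.*?)</p>",
    },
}


def extract_items(rule: dict, body: str) -> list[dict]:
    """Select fragments first, then run each attribute pattern inside a fragment."""
    items = []
    for fragment in re.findall(rule["fragment"], body, flags=re.DOTALL):
        item = {}
        for key, pattern in rule["attributes"].items():
            match = re.search(pattern, fragment, flags=re.DOTALL)
            # A missing attribute becomes None instead of dropping the item.
            item[key] = match.group(1) if match else None
        items.append(item)
    return items


page = """
<ul>
  <li><h2>Red mug</h2><a href="/p/1">view</a><p>Ceramic, 300 ml</p></li>
  <li><h2>Blue mug</h2><a href="/p/2">view</a></li>
</ul>
"""
products = extract_items(RULE, page)
```

Here `products` holds one dict per <li>, and the second product has `description` set to `None` because its fragment has no <p>.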

It should also be possible to combine multiple selectors and merge their results into one list of items. For example, what if the search results are split into two sections that require two different selectors? Or what if a selector returns nothing in certain page states? Combining selectors allows for more parsing flexibility.
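Merging several selectors could look something like this. `extract_merged` is a hypothetical helper, regexes again stand in for selectors, and dropping duplicates across selectors is an assumption of this sketch rather than something the design requires.

```python
import re


def extract_merged(patterns: list[str], body: str) -> list[str]:
    """Run every pattern and concatenate the results in first-seen order."""
    merged = []
    for pattern in patterns:
        # A pattern that matches nothing simply contributes an empty list.
        for text in re.findall(pattern, body):
            if text not in merged:  # assumption: skip duplicates across selectors
                merged.append(text)
    return merged


page = (
    '<div class="top"><a href="/a">A</a></div>'
    '<div class="rest"><a href="/b">B</a></div>'
)
patterns = [
    r'class="top"><a href="([^"]+)"',
    r'class="rest"><a href="([^"]+)"',
    r'class="sponsored"><a href="([^"]+)"',  # absent in this page state
]
results = extract_merged(patterns, page)
```

The third pattern finds nothing on this page, and the merged list still contains the results of the other two.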

And what if we want to select different types of items that are present on each page? Then we would need multiple different sets of extraction rules, one for each type, and tag each parsed item with the corresponding type.
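One possible shape for multiple rule sets, sketched under the assumption that the config is keyed by item type. `RULE_SETS`, `extract_typed_items`, and the `item_type` key placement are hypothetical.

```python
import re

# Hypothetical config: one set of extraction rules per item type.
RULE_SETS = {
    "product": {
        "fragment": r'<li class="product">(.*?)</li>',
        "attributes": {"title": r"<h2>(.*?)</h2>"},
    },
    "review": {
        "fragment": r'<li class="review">(.*?)</li>',
        "attributes": {"text": r"<span>(.*?)</span>"},
    },
}


def extract_typed_items(rule_sets: dict, body: str) -> list[dict]:
    """Apply each rule set and tag every parsed item with its type."""
    items = []
    for item_type, rule in rule_sets.items():
        for fragment in re.findall(rule["fragment"], body, flags=re.DOTALL):
            item = {"item_type": item_type}
            for key, pattern in rule["attributes"].items():
                match = re.search(pattern, fragment, flags=re.DOTALL)
                item[key] = match.group(1) if match else None
            items.append(item)
    return items


page = (
    '<li class="product"><h2>Red mug</h2></li>'
    '<li class="review"><span>Great</span></li>'
)
items = extract_typed_items(RULE_SETS, page)
```

Downstream code can then route items by `item_type` without knowing which rules produced them.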

  • %Extractor{} that defines the extraction method
  • Item extraction - a list of fragment extractors, each with a nested list of attribute extractors. Each attribute has an extractor and an attr key. Limit nesting to one level for now (list extractors -> attribute extractors). Tag each item with an item_type.
  • Request extraction - a list of extractors whose extracted text is converted into URLs.
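The request extraction task could be sketched along these lines. `extract_request_urls` is a hypothetical helper. Resolving relative links against the response URL and skipping duplicate URLs are assumptions of this sketch.

```python
import re
from urllib.parse import urljoin


def extract_request_urls(patterns: list[str], body: str, base_url: str) -> list[str]:
    """Convert every extracted text into an absolute URL for a new request."""
    urls = []
    for pattern in patterns:
        for text in re.findall(pattern, body):
            # Relative links are resolved against the URL of the response.
            url = urljoin(base_url, text.strip())
            if url not in urls:  # assumption: do not queue the same URL twice
                urls.append(url)
    return urls


page = (
    '<a class="next" href="/search?page=2">Next</a> '
    '<a href="https://example.com/about">About</a>'
)
urls = extract_request_urls(
    [r'href="([^"]+)"'], page, "https://example.com/search?page=1"
)
```

Absolute links pass through `urljoin` unchanged, so the same extractor list can handle both relative and absolute hrefs.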

Ziinc changed the title from "parse the response using a variable configuration." to "app/requestor: parse a response using a variable configuration." on May 15, 2022
Ziinc added this to Backlog in Dev on May 23, 2022