app/requestor: parse a response using a variable configuration. #7

Ziinc · 2022-05-15T13:49:40Z

Config - Parsing

One way of viewing the parsing config is as a set of modules, each extracting text. This text can then either be converted into new requests or passed on as parsed items. Right now, I can think of a few extraction methods:

xpath extraction
css selector extraction
regex extraction (for starters)
json extraction
glob extraction

For example, extract a list of text values, then convert each of them into a new request or pass it on as a parsed item.
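To make the "modules, each extracting text" idea concrete, here is a rough Python sketch (the project itself may well be written in something else). It only covers the regex and json methods because those need nothing beyond the standard library. The names `extract`, `extract_regex`, `extract_json`, and `EXTRACTORS` are made up for illustration, and the dotted-path syntax for json is an assumption, not something decided in this issue.

```python
import json
import re


def extract_regex(pattern: str, text: str) -> list[str]:
    """Return every match of the pattern (its first group, if it has one)."""
    return [m.group(1) if m.groups() else m.group(0)
            for m in re.finditer(pattern, text)]


def extract_json(path: str, text: str) -> list[str]:
    """Walk a dotted key path through a JSON document and return the values as text."""
    node = json.loads(text)
    for key in path.split("."):
        node = node[key]
    values = node if isinstance(node, list) else [node]
    return [str(v) for v in values]


# Hypothetical registry: method name -> extraction function.
EXTRACTORS = {"regex": extract_regex, "json": extract_json}


def extract(method: str, rule: str, text: str) -> list[str]:
    """Dispatch to the extraction module named by the config."""
    return EXTRACTORS[method](rule, text)
```

Every module returns a flat list of strings, so whatever comes next (request conversion or item building) does not need to know which method produced the text.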

However, what if we want to extract a list of items (objects)? An example is a list of products (search results).
One way to model it is to use nested extraction rules.
For example, use a css selector to select all <li> elements, then use css selectors within each element to query for the title, url, and description, resulting in a list of objects.
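A possible shape for nested rules, sketched in Python. To stay within the standard library this uses the limited XPath subset of `xml.etree.ElementTree` in place of css selectors, and it assumes the page fragment is well-formed markup. The `RULE` layout and the `extract_items` function are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = """
<ul>
  <li><a href="/p/1">Red mug</a><p>Holds 300 ml</p></li>
  <li><a href="/p/2">Blue mug</a><p>Holds 450 ml</p></li>
</ul>
"""

# Hypothetical rule shape: one fragment query, then a (query, attribute) pair per field.
# An attribute of None means "take the element's text".
RULE = {
    "fragment": ".//li",
    "fields": {
        "title": ("a", None),
        "url": ("a", "href"),
        "description": ("p", None),
    },
}


def extract_items(page: str, rule: dict) -> list[dict]:
    """Select each fragment, then run the field queries inside that fragment."""
    root = ET.fromstring(page.strip())
    items = []
    for fragment in root.findall(rule["fragment"]):
        item = {}
        for key, (query, attribute) in rule["fields"].items():
            node = fragment.find(query)
            if node is None:
                item[key] = None  # field missing in this fragment
            elif attribute is None:
                item[key] = node.text
            else:
                item[key] = node.get(attribute)
        items.append(item)
    return items
```

Calling `extract_items(PAGE, RULE)` gives one dict per `<li>`, each with `title`, `url`, and `description` keys.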

It should also be possible to combine multiple selectors and merge their results into one list of items. For example, what if the search results are split into 2 and require two different selectors? Or what if each selector returns empty on certain page states? This allows for more parsing flexibility.
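Merging could be as simple as running every selector and concatenating whatever comes back, so a selector that matches nothing on a given page state just contributes nothing. A minimal Python sketch, using regex patterns as stand-ins for selectors; `extract_all` and the `data-top` / `data-more` attributes are invented for the example.

```python
import re


def extract_all(patterns: list[str], text: str) -> list[str]:
    """Run every pattern in order and concatenate the results.

    Each pattern is assumed to contain exactly one capture group,
    so re.findall returns a flat list of strings.
    """
    merged = []
    for pattern in patterns:
        merged.extend(re.findall(pattern, text))
    return merged


# Two hypothetical layouts for the same kind of search result.
PATTERNS = [r'data-top="([^"]+)"', r'data-more="([^"]+)"']
```

Results keep the order of the selector list, which is one possible choice; ordering by position in the page would need the match offsets as well.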

And what if we want to select different types of items that are present on each page? Then we would need multiple different sets of extraction rules, one for each type, and tag each parsed item with the corresponding type.
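One way the "one rule set per type" idea might look, again as a hedged Python sketch with regex standing in for real selectors. The `RULE_SETS` structure, the `value` key, and the `parse_items` function are assumptions made for illustration only.

```python
import re

# Hypothetical rule sets, one per item type found on the page.
RULE_SETS = [
    {"item_type": "product", "pattern": r'<h2 class="product">([^<]+)</h2>'},
    {"item_type": "review", "pattern": r'<p class="review">([^<]+)</p>'},
]


def parse_items(page: str, rule_sets: list[dict]) -> list[dict]:
    """Run each rule set and tag every parsed item with that set's item_type."""
    items = []
    for rule in rule_sets:
        for value in re.findall(rule["pattern"], page):
            items.append({"item_type": rule["item_type"], "value": value})
    return items
```

Downstream code can then route or store items by `item_type` without knowing which rules produced them.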

%Extractor{} - a struct that defines the extraction method.
Item extraction - a list of fragment extractors, each with a nested list of attribute extractors. Each attribute has an extractor and an attr key. Limit nesting to 1 level for now: list extractors -> attribute extractors. Tag each item with an item_type.
Request extraction - a list of extractors, where the extracted text is converted into urls.
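For the request extraction part, a rough Python sketch of how extracted text might become urls. The `Extractor` dataclass is only a loose stand-in for `%Extractor{}`; its `method` and `rule` fields, the `run` method, and the `extract_requests` function are all assumptions, and only the regex method is handled here. Relative links are resolved against the page url with the standard `urllib.parse.urljoin`.

```python
import re
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class Extractor:
    """Hypothetical stand-in for %Extractor{}: a method name plus its rule."""
    method: str  # only "regex" is handled in this sketch
    rule: str    # for regex: a pattern with exactly one capture group

    def run(self, text: str) -> list[str]:
        if self.method != "regex":
            raise NotImplementedError(self.method)
        return re.findall(self.rule, text)


def extract_requests(extractors: list[Extractor], page: str, base_url: str) -> list[str]:
    """Convert every extracted text value into an absolute url."""
    urls = []
    for extractor in extractors:
        for value in extractor.run(page):
            urls.append(urljoin(base_url, value))
    return urls
```

Keeping url resolution outside the extractor means the same `Extractor` value could be reused for item attributes, where the raw text is wanted as-is.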

Ziinc mentioned this issue May 15, 2022

app/requestor: Can receive a Request from Web and begin crawling #5

Closed

7 tasks

Ziinc changed the title ~~parse the response using a variable configuration.~~ app/requestor: parse a response using a variable configuration. May 15, 2022

Ziinc added this to Backlog in Dev May 23, 2022
