How do I remove certain tags/nodes before selecting text? #26

naveenann · 2019-07-12T02:36:38Z

Sometimes a few tags need to be removed before selecting an element's text. For example:

use scraper::{Html, Selector};

fn main() {
    let selector = Selector::parse("body").unwrap();
    let html = r#"
    <!DOCTYPE html>
    <body>
    Hello World
    <script type="application/json" data-selector="settings-json">
    {"test":"json"}
    </script>
    </body>
    "#;
    let document = Html::parse_document(html);
    let body = document.select(&selector).next().unwrap();
    // text() yields every descendant text node, including the script's contents.
    let text = body.text().collect::<Vec<_>>();
    println!("{:?}", text);
}

Output

["\n Hello World\n ", "\n {\"test\":\"json\"}\n ", "\n \n"]

The output includes the contents of the script tag. Is there any way to remove it?

causal-agent · 2019-10-03T18:58:10Z

The Text iterator is quite a simple wrapper around the Traverse iterator: https://github.com/programble/scraper/blob/master/src/element_ref/mod.rs#L107-L126. I think it should be possible to write a loop over Traverse directly that skips script elements. That's all I can think of, since selectors can't select text nodes.

Boscop · 2020-04-12T21:08:38Z

@causal-agent Is it possible to exclude all children of the node for traversal / getting its text?

dpetrishin · 2020-04-28T21:06:19Z

@Boscop have the same issue right now...:-)

dpetrishin · 2020-04-28T21:36:56Z

@Boscop, For markup like this:

<a href="" class="pagination__link button button--text">
    <span class="sr-only">Page</span>
    333
</a>

I wrote the following code to extract 333:

let value = extractor.extract_inner_html(&selectors);
let index = value.rfind('\n').unwrap();
let slice = &value[index..];

I could also use the last '>' symbol as the base.
I hope this is helpful...:)

P.S. I'd also like to have such feature in library itself though...

dpetrishin · 2020-04-29T09:50:00Z

@causal-agent Hi! Could you suggest a way to implement an inner_text method, like the one in jQuery for example, using html5ever? I can probably help with the implementation... It's just that html5ever is an absolute dark forest for me...

alexlee85 · 2021-07-10T04:23:09Z

Just use ElementRef to get the child text node.

let only_self_text = doc
    .select(&Selector::parse("button > span").unwrap())
    .next()
    .and_then(|item| item.first_child())
    .and_then(|t| t.value().as_text())
    .map(|t| t.text.to_string())
    .unwrap_or_default();

cfvescovo closed this as completed Nov 15, 2021
