How do I remove certain tags/nodes before selecting text? #26

Closed
naveenann opened this issue Jul 12, 2019 · 6 comments

Comments

@naveenann

Sometimes a few tags need to be removed before extracting the text of an element. For example:

use scraper::{Html, Selector};

fn main() {
    let selector = Selector::parse("body").unwrap();
    let html = r#"
    <!DOCTYPE html>
   <body>
   Hello World
   <script type="application/json" data-selector="settings-json">
   {"test":"json"}
   </script>
   </body>
"#;
    let document = Html::parse_document(html);
    let body = document.select(&selector).next().unwrap();
    // text() yields every descendant text node, including the script contents.
    let text = body.text().collect::<Vec<_>>();
    println!("{:?}", text);
}

Output

["\n Hello World\n ", "\n {\"test\":\"json\"}\n ", "\n \n"]

The output includes the contents of the script tag. Is there any way to remove it?

@causal-agent
Owner

The Text iterator is quite a simple wrapper around the Traverse iterator: https://github.com/programble/scraper/blob/master/src/element_ref/mod.rs#L107-L126. I think it should be possible to write a loop over Traverse directly that skips script elements. That's all I can think of, since selectors can't select text nodes.

@Boscop

Boscop commented Apr 12, 2020

@causal-agent Is it possible to exclude all of a node's children when traversing it or getting its text?

@dpetrishin

@Boscop I have the same issue right now... :-)

@dpetrishin

dpetrishin commented Apr 28, 2020

@Boscop, for markup like this:

<a href="" class="pagination__link button button--text">
    <span class="sr-only">Page</span>
    333
</a>

I wrote the following code to extract 333:

let value = extractor.extract_inner_html(&selectors);
let index = value.rfind('\n').unwrap();
let slice = &value[index..];

I could also use the last '>' symbol as the base.
I hope it will be helpful... :)

P.S. I'd also like to have such a feature in the library itself, though...
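A hedged sketch of the "last '>' as base" variant mentioned above, using only the standard library. The function name `trailing_text` is invented for this example, and the input string is assumed to be the inner HTML of the link as plain text. One reason to prefer '>' over '\n' is that, if the inner HTML ends with a newline, `rfind('\n')` finds that final newline and the resulting slice contains no digits.

```rust
/// Hypothetical helper: returns the trimmed text that follows the last `>`
/// in an inner-HTML string, or the whole trimmed string if there is no tag.
fn trailing_text(inner_html: &str) -> &str {
    match inner_html.rfind('>') {
        // '>' is a single byte, so `index + 1` is always a valid boundary.
        Some(index) => inner_html[index + 1..].trim(),
        None => inner_html.trim(),
    }
}

fn main() {
    let inner_html = "\n    <span class=\"sr-only\">Page</span>\n    333\n";
    println!("{:?}", trailing_text(inner_html));
}
```

This only works when the wanted text comes after the last child element, and any HTML entities in the text stay escaped, so it is a workaround and not a general replacement for walking the node tree.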

@dpetrishin

@causal-agent Hi! Could you suggest how an inner_text method, like the one in jQuery for example, could be implemented using html5ever? I can probably help with the implementation... It's just that html5ever is an absolute dark forest to me...

@alexlee85

Just use ElementRef to get the child text node.

let only_self_text = doc
    .select(&Selector::parse("button > span").unwrap())
    .next()
    .and_then(|item| item.first_child())
    .and_then(|t| t.value().as_text())
    .map(|t| t.text.to_string())
    .unwrap_or_default();
