Selector doesn't work with newline after #76

Closed
David-OConnor opened this issue Apr 25, 2022 · 8 comments

@David-OConnor

document.select with a selector built by Selector::parse does not match an element when there's a newline directly after the tag name.

Code:

let a_sel = scraper::Selector::parse("a").unwrap();
for el in document.select(&a_sel) {
    //...
}

HTML example that triggers this:

<a
                            href="...")"

When printing these affected elements:

Element(<a\n href="\\\"/...

Other elements matched by the query, of the form Element(<a href="\\\"/..., don't trigger this problem. I'd be happy with a workaround in the meantime.

David-OConnor changed the title from "Selectors" to "Selector doesn't work with newline after" on Apr 25, 2022
@adumbidiot

I can't get that html to even parse. Are you sure that's what you used to trigger the issue?

@David-OConnor
Author

David-OConnor commented Apr 25, 2022

That's a minimal example. I'm not certain that's the cause, but it appears to be what separates the tags it finds from the ones it ignores.

Example link it finds:

<a href="https://github.com">

Example link it doesn't find:

<a
    href="https://github.com">
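
One way to narrow this down independently of the parser might be to count the opening `<a` tags in the raw HTML and compare that number with how many elements the selector yields. The following is a minimal sketch using only the standard library; `count_anchor_open_tags` is a hypothetical helper, not part of scraper, and it is a naive byte scan that does not skip comments or script bodies.

```rust
/// Hypothetical diagnostic helper (not part of the scraper crate).
/// Counts opening `<a` tags in raw HTML. The tag name may be followed by
/// any ASCII whitespace (space, tab, newline, ...), `>` or `/`.
/// Naive scan: comments and <script> bodies are not skipped.
fn count_anchor_open_tags(html: &str) -> usize {
    let bytes = html.as_bytes();
    let mut count = 0;
    for i in 0..bytes.len() {
        if bytes[i] != b'<' {
            continue;
        }
        // The tag name must be exactly `a` (case-insensitive).
        let name_ok = matches!(bytes.get(i + 1).copied(), Some(b'a' | b'A'));
        // The next byte must end the tag name, so `<abbr` is not counted.
        let end_ok = match bytes.get(i + 2).copied() {
            Some(b) => b.is_ascii_whitespace() || b == b'>' || b == b'/',
            None => false,
        };
        if name_ok && end_ok {
            count += 1;
        }
    }
    count
}

fn main() {
    // The two shapes from the examples above: same line vs. newline after `<a`.
    let same_line = "<a href=\"https://github.com\">";
    let with_newline = "<a\n    href=\"https://github.com\">";
    println!("same line:    {}", count_anchor_open_tags(same_line));
    println!("with newline: {}", count_anchor_open_tags(with_newline));
}
```

If the count from the raw string is higher than the number of elements returned by `document.select(&a_sel)`, the difference is on the parsing or selecting side; if the two agree, the HTML being parsed is probably not the HTML that was expected.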

@adumbidiot

That seems to work.

main.rs:

fn main() {
    let html = r#"<a
href="https://github.com">"#;

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Output:

Raw HTML: "<a\nhref=\"https://github.com\">"
<a href="https://github.com"></a>

@David-OConnor
Author

Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with.

@David-OConnor
Author

Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:

https://www.anyleaf.org/blog

It will correctly pull the links at the header and footer of the page, but none of the articles linked in the middle will show up using the 'a' selector.

@adumbidiot

I can't reproduce that.
main.rs:

fn main() {
    let url = "https://www.anyleaf.org/blog";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Cargo.toml:

[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"

[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"

Output:

Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n\n    <sc
ript type=\"module\">\n        document.documentElement.classList.remove('no-js');\n        document.documentElement.classList.add('js');\n    </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n    <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n    <meta property=\"og:locale\" content=\"en_US\">\n    <meta property=\"og:type\" content=\"website\">\n    <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n    <meta property=\"og:url\" content=\"https://www.anyleaf.org\">\n\n    \n    <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n    \n    \n    <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n    \n    <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
   \n    <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <meta property=\"og:title\" content=\
"\">\n    <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n    <div id=\"menu\">\n        <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n        <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n        <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n        <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
      <a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n        <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n        <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n        <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n        <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
 <a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n        <a class=\"menu-item\" href=\"mailto:anyleaf@anyleaf.org\"><h3 class=\"m
enu-header\">Contact</h3></a>\n    </div>\n</div>\n\n\n\n\n    <div class=\"home-body\">\n        <div style=\"text-align: center;\">\n        <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n        </div>\n\n        <h1>AnyLeaf Blog</h1>\n\n        <h2>Misc:</h2>\n        <ul>\
n            <li style=\"margin-bottom: 40px;\">\n                <a\n                        href=\"/filter-design\"\n                        style=\"font-size
: 1.5em;\"\n                >Digital filter design and response\n                </a>\n            </li>\n        </ul>\n\n        <h2>Articles:</h2>\n        <
ul>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n                            style=\"font-size: 1.5em\">\n                        Parts you need for a quadcopter in 2022\n
  </a> - Feb. 24, 2022, 7:46 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n
                 href=\"/blog/writing-embedded-firmware-using-rust\"\n                            style=\"font-size: 1.5em\">\n                        Writing e
mbedded firmware using Rust\n                    </a> - Sept. 25, 2021, 5:45 p.m.\n                </li>\n            \n                <li style=\"margin-botto
m: 40px;\">\n                    <a\n                            href=\"/blog/measuring-ph-on-raspberry-pi\"\n                            style=\"font-size: 1.5
em\">\n                        Measuring pH on Raspberry Pi\n                    </a> - Feb. 6, 2021, 9:47 a.m.\n                </li>\n            \n
      <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/the-essence-of-embedded-computers\"\n
             style=\"font-size: 1.5em\">\n                        The essence of embedded computers\n                    </a> - Sept. 6, 2020, 7:09 p.m.\n
          </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n                            style=\"font-size: 1.5em\">\n                        Electrical Conductivity (EC) for Hydroponi
cs\n                    </a> - Aug. 22, 2020, 4 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n
    <a\n                            href=\"/blog/project:-building-an-automatic-ph-doser\"\n                            style=\"font-size: 1.5em\">\n
             Project: Building an automatic pH doser\n                    </a> - July 21, 2020, 7:33 p.m.\n                </li>\n            \n
<li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/ph-measurement-for-hydroponics\"\n
    style=\"font-size: 1.5em\">\n                        pH Measurement for Hydroponics\n                    </a> - July 19, 2020, 3:43 p.m.\n                </
li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/how-to-calibrate-ph-sen
sors\"\n                            style=\"font-size: 1.5em\">\n                        How to Calibrate pH Sensors\n                    </a> - July 17, 2020,
1:23 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"
/blog/temperature-sensors:-a-comparison\"\n                            style=\"font-size: 1.5em\">\n                        Temperature sensors: A comparison\n
                   </a> - July 15, 2020, 6:42 p.m.\n                </li>\n            \n        </ul>\n    </div>\n\n\n\n<div id=\"footer\">\n    <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n    <div style=\"margin-bottom: 30px\">\n        <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n        <a class=fineprint href=\"/terms\">Terms and conditions</a>\n    </div>\n    <div style=\"display: flex; flex-directio
n: column\">\n        <h5 class=\"fineprint\">\n            All AnyLeaf products comply with the\n            <a href=\"https://en.wikipedia.org/wiki/Restrictio
n_of_Hazardous_Substances_Directive\">\n                Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n        <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n    </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="mailto:anyleaf@anyleaf.org" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
                </a>
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
                        Parts you need for a quadcopter in 2022
                    </a>
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
                        Writing embedded firmware using Rust
                    </a>
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
                        Measuring pH on Raspberry Pi
                    </a>
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
                        The essence of embedded computers
                    </a>
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
                        Electrical Conductivity (EC) for Hydroponics
                    </a>
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
                        Project: Building an automatic pH doser
                    </a>
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
                        pH Measurement for Hydroponics
                    </a>
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
                        How to Calibrate pH Sensors
                    </a>
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
                        Temperature sensors: A comparison
                    </a>
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="https://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive">
                Restriction of Hazardous Substances (RoHS) Directive</a>

@David-OConnor
Author

Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.

@teymour-aldridge
Collaborator

I've also added a test for this (#82) so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.
