Selector doesn't work with newline after #76

David-OConnor · 2022-04-25T02:23:29Z

document.select with Selector::parse is not working when there's a newline directly after the tag.

Code:

let a_sel = scraper::Selector::parse("a").unwrap();
for el in document.select(&a_sel) {
    //...
}

HTML example that triggers this:

<a
                            href="...")"

When printing these affected elements:

Element(<a\n href="\\\"/...

Other elements in the query that are of the form Element(<a href="\\\"/... don't trigger this problem. Happy for a workaround in the meantime.

adumbidiot · 2022-04-25T02:50:46Z

I can't get that html to even parse. Are you sure that's what you used to trigger the issue?

David-OConnor · 2022-04-25T12:59:05Z

That's a minimal example. I'm not certain that's the issue, but it appears to be what separates the tags it finds from the ones it ignores.

Example link it finds:

<a href="https://github.com">

Example link it doesn't find:

<a
    href="https://github.com">

adumbidiot · 2022-04-27T02:48:52Z

That seems to work.

main.rs:

fn main() {
    let html = r#"<a
href="https://github.com">"#;

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Output:

Raw HTML: "<a\nhref=\"https://github.com\">"
<a href="https://github.com"></a>

David-OConnor · 2022-04-27T13:41:02Z

Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with

David-OConnor · 2022-05-14T00:05:35Z

Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:

https://www.anyleaf.org/blog

It will correctly pull the links at the header and footer of the page, but none of the articles linked in the middle will show up using the 'a' selector.

adumbidiot · 2022-05-14T02:47:35Z

I can't reproduce that.
main.rs:

fn main() {
    let url = "https://www.anyleaf.org/blog";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Cargo.toml:

[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"

[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"

Output:

Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n\n    <sc
ript type=\"module\">\n        document.documentElement.classList.remove('no-js');\n        document.documentElement.classList.add('js');\n    </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n    <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n    <meta property=\"og:locale\" content=\"en_US\">\n    <meta property=\"og:type\" content=\"website\">\n    <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n    <meta property=\"og:url\" content=\"https://www.anyleaf.org\">\n\n    \n    <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n    \n    \n    <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n    \n    <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
   \n    <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <meta property=\"og:title\" content=\
"\">\n    <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n    <div id=\"menu\">\n        <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n        <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n        <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n        <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
      <a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n        <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n        <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n        <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n        <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
 <a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n        <a class=\"menu-item\" href=\"mailto:anyleaf@anyleaf.org\"><h3 class=\"m
enu-header\">Contact</h3></a>\n    </div>\n</div>\n\n\n\n\n    <div class=\"home-body\">\n        <div style=\"text-align: center;\">\n        <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n        </div>\n\n        <h1>AnyLeaf Blog</h1>\n\n        <h2>Misc:</h2>\n        <ul>\
n            <li style=\"margin-bottom: 40px;\">\n                <a\n                        href=\"/filter-design\"\n                        style=\"font-size
: 1.5em;\"\n                >Digital filter design and response\n                </a>\n            </li>\n        </ul>\n\n        <h2>Articles:</h2>\n        <
ul>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n                            style=\"font-size: 1.5em\">\n                        Parts you need for a quadcopter in 2022\n
  </a> - Feb. 24, 2022, 7:46 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n
                 href=\"/blog/writing-embedded-firmware-using-rust\"\n                            style=\"font-size: 1.5em\">\n                        Writing e
mbedded firmware using Rust\n                    </a> - Sept. 25, 2021, 5:45 p.m.\n                </li>\n            \n                <li style=\"margin-botto
m: 40px;\">\n                    <a\n                            href=\"/blog/measuring-ph-on-raspberry-pi\"\n                            style=\"font-size: 1.5
em\">\n                        Measuring pH on Raspberry Pi\n                    </a> - Feb. 6, 2021, 9:47 a.m.\n                </li>\n            \n
      <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/the-essence-of-embedded-computers\"\n
             style=\"font-size: 1.5em\">\n                        The essence of embedded computers\n                    </a> - Sept. 6, 2020, 7:09 p.m.\n
          </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n                            style=\"font-size: 1.5em\">\n                        Electrical Conductivity (EC) for Hydroponi
cs\n                    </a> - Aug. 22, 2020, 4 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n
    <a\n                            href=\"/blog/project:-building-an-automatic-ph-doser\"\n                            style=\"font-size: 1.5em\">\n
             Project: Building an automatic pH doser\n                    </a> - July 21, 2020, 7:33 p.m.\n                </li>\n            \n
<li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/ph-measurement-for-hydroponics\"\n
    style=\"font-size: 1.5em\">\n                        pH Measurement for Hydroponics\n                    </a> - July 19, 2020, 3:43 p.m.\n                </
li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/how-to-calibrate-ph-sen
sors\"\n                            style=\"font-size: 1.5em\">\n                        How to Calibrate pH Sensors\n                    </a> - July 17, 2020,
1:23 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"
/blog/temperature-sensors:-a-comparison\"\n                            style=\"font-size: 1.5em\">\n                        Temperature sensors: A comparison\n
                   </a> - July 15, 2020, 6:42 p.m.\n                </li>\n            \n        </ul>\n    </div>\n\n\n\n<div id=\"footer\">\n    <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n    <div style=\"margin-bottom: 30px\">\n        <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n        <a class=fineprint href=\"/terms\">Terms and conditions</a>\n    </div>\n    <div style=\"display: flex; flex-directio
n: column\">\n        <h5 class=\"fineprint\">\n            All AnyLeaf products comply with the\n            <a href=\"https://en.wikipedia.org/wiki/Restrictio
n_of_Hazardous_Substances_Directive\">\n                Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n        <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n    </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="mailto:anyleaf@anyleaf.org" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
                </a>
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
                        Parts you need for a quadcopter in 2022
                    </a>
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
                        Writing embedded firmware using Rust
                    </a>
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
                        Measuring pH on Raspberry Pi
                    </a>
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
                        The essence of embedded computers
                    </a>
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
                        Electrical Conductivity (EC) for Hydroponics
                    </a>
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
                        Project: Building an automatic pH doser
                    </a>
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
                        pH Measurement for Hydroponics
                    </a>
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
                        How to Calibrate pH Sensors
                    </a>
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
                        Temperature sensors: A comparison
                    </a>
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="https://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive">
                Restriction of Hazardous Substances (RoHS) Directive</a>

David-OConnor · 2022-05-14T03:15:35Z

Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.
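
One way to narrow down a disconnect like this is to compare the number of elements the 'a' selector returns against a rough count of `<a` start tags in the raw HTML string. Below is a minimal, std-only sketch of such a cross-check. `count_anchor_start_tags` is a hypothetical helper and not part of scraper; it is a naive byte scan that assumes the page has no `<a` sequences inside comments or scripts, so treat its result as an approximation.

```rust
/// Hypothetical helper (not part of scraper): counts `<a` start tags in raw
/// HTML. The tag name may be followed by any ASCII whitespace (space, tab,
/// newline), by `>`, or by `/`, so `<a\n    href=...>` is counted as well.
fn count_anchor_start_tags(html: &str) -> usize {
    let bytes = html.as_bytes();
    let mut count = 0;
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'<' && bytes[i + 1].eq_ignore_ascii_case(&b'a') {
            // The byte after the tag name tells `<a ...>` apart from
            // longer names such as `<abbr>` or `<article>`.
            match bytes.get(i + 2) {
                Some(b) if b.is_ascii_whitespace() || *b == b'>' || *b == b'/' => count += 1,
                _ => {}
            }
        }
        i += 1;
    }
    count
}

fn main() {
    // One link on a single line, one with a newline after the tag name,
    // and one unrelated element whose name merely starts with "a".
    let html = "<a href=\"https://github.com\">x</a>\n<a\n    href=\"https://github.com\">y</a>\n<abbr>z</abbr>";
    println!("anchor start tags: {}", count_anchor_start_tags(html));
}
```

If this count and the number of elements yielded by document.select(&a_sel) differ for the same string, the difference is probably in parsing; if they agree, the HTML being parsed likely differs from what is expected.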

teymour-aldridge · 2022-07-23T21:10:36Z

I've also added a test for this (#82) so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.

David-OConnor changed the title ~~Selectors~~ Selector doesn't work with newline after Apr 25, 2022

teymour-aldridge added C-bug and removed C-bug labels Jul 23, 2022

teymour-aldridge closed this as completed Jul 23, 2022
