
Performance issues #62

Closed
omarmhaimdat opened this issue May 15, 2021 · 4 comments

Comments

omarmhaimdat commented May 15, 2021

I am trying to parse a large number of HTML documents, and I have noticed that parsing accounts for most of the running time, around 97% of the program. Is there any way to speed up the parsing process?

To give you some perspective, the average parsing time is around 9 ms per document.

Code example

use scraper::Html;
use serde_json::Value;
use std::collections::HashMap;
use std::fs;
use std::time::Instant;

fn main() {
    let now = Instant::now();
    // I have 10_000 HTML documents
    let paths = fs::read_dir("../data").unwrap();
    let mut reports: Vec<HashMap<String, Value>> = Vec::new();
    for path in paths {
        let data = fs::read_to_string(path.unwrap().path()).expect("Unable to read file");
        // This line took 97% of the running time
        let document = Html::parse_document(&data);
    }
    println!("The program took {}s", now.elapsed().as_secs());
}
@teymour-aldridge
Collaborator

I don't think so, unless this is something that scraper is doing wrong (rather than html5ever). scraper is just a set of convenience wrappers over https://github.com/servo/html5ever, so if you try that example with html5ever directly, do you see an increase in speed?

Apart from this, you might find that using a data-parallelism library such as rayon could speed up your program drastically.


let4be commented Jun 1, 2021

Time spent per page is completely within the bounds I'd expect.

I'm currently working on a toy project, implementing a broad web crawler in Rust. Here is one curious stat (run on an i9-10900K; each CPU will show slightly different numbers):
example

What you need to do is use multiple threads, either via rayon or directly with std::thread.

@nathaniel-daniel

If you only care about parsing and if the data you extract is simple enough, you might have better luck with lol-html.

@teymour-aldridge
Collaborator

What @nathaniel-daniel said. Also, it's not unexpected that parsing would take 97% of the running time of your program, given that your program is primarily parsing (with a little IO). As for the 9 ms figure, that seems fine. How big are your documents?
