
Performance issues #62

Closed
omarmhaimdat opened this issue May 15, 2021 · 4 comments

Comments

omarmhaimdat commented May 15, 2021

I am trying to parse a large number of HTML documents, and I have noticed that parsing accounts for most of the running time, around 97% of the program. Is there any way to speed up the parsing process?

To give you some perspective, the average parsing time is around 9 ms per document.

Code example

use scraper::Html;
use serde_json::Value;
use std::collections::HashMap;
use std::fs;
use std::time::Instant;

fn main() {
    let now = Instant::now();
    // I have 10_000 HTML documents
    let paths = fs::read_dir("../data").unwrap();
    let mut reports: Vec<HashMap<String, Value>> = Vec::new();
    for path in paths {
        let data = fs::read_to_string(path.unwrap().path()).expect("Unable to read file");
        // This line took 97% of the running time
        let document = Html::parse_document(&data);
    }
    println!("The program took {}s", now.elapsed().as_secs());
}
@teymour-aldridge
Collaborator

I don't think so, unless this is something that scraper is doing wrong (rather than html5ever). scraper is just a set of convenience wrappers over https://github.com/servo/html5ever, so if you try that example with html5ever directly, do you see an increase in speed?

Apart from this, you might find that using a data-parallelism library such as rayon could speed up your program drastically.


let4be commented Jun 1, 2021

Time spent per page is completely within the bounds I'd expect.

I'm currently working on a toy project, implementing a broad web crawler in Rust. Here is one curious stat (run on an i9-10900K; each CPU will show slightly different numbers):
example

What you need to do is use multiple threads, either via rayon or directly with std::thread.

@nathaniel-daniel

If you only care about parsing and if the data you extract is simple enough, you might have better luck with lol-html.

@teymour-aldridge
Collaborator

What @nathaniel-daniel said. Also, it's not unexpected that parsing would take 97% of the running time of your program, given that your program is primarily parsing (with a little IO). As for the 9 ms figure, that seems fine. How big are your documents?
