selector accidentally edited html #57

Closed
GopherJ opened this issue Feb 6, 2021 · 5 comments

GopherJ commented Feb 6, 2021

I'm writing a robot to fetch cn.etherscan.com's token data.

On their site, the transfers section shows 939,005, but the following code gives me something different:

    let transfers_selector = Selector::parse(
        ".card .card-body #ContentPlaceHolder1_trNoOfTxns #totaltxns",
    )
    .unwrap();

    // `fragment` (the parsed page) and `overview_selector` are defined earlier and not shown here.
    if let Some(overview) =
        fragment.select(&overview_selector).next()
    {
        dbg!(&overview
            .select(&transfers_selector)
            .next()
            .unwrap()
            .html());
    }

GopherJ commented Feb 6, 2021

You can see that 939,005 has been changed to "-".

@demurgos

Does the raw HTML (before parsing it) contain - or 939,005? It may be that the value you want is not set by the server but defined by JS on the client side. In that case, scraper can't do much more.
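
A rough way to run that check without a browser is to treat the response body as plain text. The sketch below is only an illustration under stated assumptions: `element_text_by_id`, `classify`, and `ValueSource` are hypothetical names, and the lookup is naive string scanning rather than real HTML parsing, so it suits a quick diagnostic and nothing more.

```rust
/// Where an expected value shows up in a raw HTTP response body.
#[derive(Debug, PartialEq)]
enum ValueSource {
    /// The element with the given id already contains the value.
    InElement,
    /// The value is in the response, but not in that element (e.g. in a script).
    ElsewhereInResponse,
    /// The value is not in this response at all (e.g. loaded by a later request).
    NotInResponse,
}

/// Naive lookup of the text that follows the first tag carrying `id="<id>"`.
/// Hypothetical helper: plain string scanning, not an HTML parser.
fn element_text_by_id<'a>(raw_html: &'a str, id: &str) -> Option<&'a str> {
    let marker = format!("id=\"{}\"", id);
    let attr_pos = raw_html.find(&marker)?;
    let after_attr = &raw_html[attr_pos + marker.len()..];
    // Skip to the end of the opening tag, then read up to the next tag.
    let tag_end = after_attr.find('>')?;
    let body = &after_attr[tag_end + 1..];
    let next_tag = body.find('<')?;
    Some(body[..next_tag].trim())
}

/// Classifies where `expected` lives relative to the element with id `id`.
fn classify(raw_html: &str, id: &str, expected: &str) -> ValueSource {
    if element_text_by_id(raw_html, id) == Some(expected) {
        ValueSource::InElement
    } else if raw_html.contains(expected) {
        ValueSource::ElsewhereInResponse
    } else {
        ValueSource::NotInResponse
    }
}
```

A result of `ElsewhereInResponse` or `NotInResponse` means the selector is matching the element correctly and the number is being filled in on the client side.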

GopherJ commented Feb 13, 2021

@demurgos Hi, the raw HTML doesn't contain - but it does contain 939,005.

> It may be that the value you want is not set by the server but defined by JS on the client side. In that case, scraper can't do much more.

Yes, it's possible. I haven't checked it further.

nathaniel-daniel commented Apr 1, 2021

@GopherJ The client-side JS fills in that element after the page loads, so you can't read the value from the static HTML with scraper. The following example:

fn main() {
    let response =
        ureq::get("https://cn.etherscan.com/token/0xB8c77482e45F1F44dE1745F52C74426C631bDD52")
            .call()
            .expect("invalid http response");
    let response_text = response.into_string().expect("failed to get response text");
    let fragment = scraper::Html::parse_document(&response_text);

    let overview_selector = scraper::Selector::parse("#ContentPlaceHolder1_divSummary").expect("invalid overview selector");

    let transfers_selector =
        scraper::Selector::parse(".card .card-body #ContentPlaceHolder1_trNoOfTxns #totaltxns")
            .expect("invalid transfers selector");

    if let Some(overview) = fragment.select(&overview_selector).next() {
        dbg!(&overview.select(&transfers_selector).next().unwrap().html());
    }
}

yields:

[src\main.rs:16] &overview.select(&transfers_selector).next().unwrap().html() = "<span id=\"totaltxns\">-</span>"

Here's a version that does what you want by extracting the needed variables from the page's JS with regexes.

main.rs:

fn main() {
    let agent = ureq::agent();

    let response1 = agent
        .get("https://cn.etherscan.com/token/0xB8c77482e45F1F44dE1745F52C74426C631bDD52")
        .call()
        .expect("invalid http response");
    let response_text1 = response1
        .into_string()
        .expect("failed to get response1 text");
    let fragment1 = scraper::Html::parse_document(&response_text1);

    let script_selector = scraper::Selector::parse("script").expect("invalid script selector");

    let mode_regex = regex::Regex::new(r"window\.mode = '([^']*)';").expect("invalid mode regex");
    let contract_address_regex = regex::Regex::new(r"var litreadContractAddress = '([^']*)';")
        .expect("invalid contract address regex");
    let address_regex =
        regex::Regex::new(r"var litAddress = '([^']*)';").expect("invalid address regex");
    let sid_regex = regex::Regex::new(r"var sid = '([^']*)';").expect("invalid sid regex");

    let script1 = fragment1
        .select(&script_selector)
        .find_map(|script| {
            let text = script.text().next()?;
            if mode_regex.is_match(text) {
                Some(text)
            } else {
                None
            }
        })
        .expect("missing script");

    let mode_captures = mode_regex.captures(script1).expect("missing mode captures");
    let mode = mode_captures.get(1).expect("missing mode").as_str();

    let contract_address_captures = contract_address_regex
        .captures(script1)
        .expect("missing contract address");
    let contract_address = contract_address_captures
        .get(1)
        .expect("missing contract address")
        .as_str();

    let address_captures = address_regex.captures(script1).expect("missing address");
    let address = address_captures.get(1).expect("missing address").as_str();

    let sid = fragment1
        .select(&script_selector)
        .find_map(|script| {
            let text = script.text().next()?;
            let captures = sid_regex.captures(text)?;
            Some(captures.get(1)?.as_str())
        })
        .expect("missing sid");

    let url = format!(
        "https://cn.etherscan.com/token/generic-tokentxns2?m={}&contractAddress={}&a={}&sid={}&p=1",
        mode, contract_address, address, sid
    );

    let response2 = agent.get(&url).call().expect("invalid http response");
    let response_text2 = response2
        .into_string()
        .expect("failed to get response2 text");

    let fragment2 = scraper::Html::parse_document(&response_text2);
    let txns_regex = regex::Regex::new(r"var totaltxns = '([^']*)';").expect("invalid txns regex");
    let total_txns_str = fragment2
        .select(&script_selector)
        .find_map(|script| {
            let text = script.text().next()?;
            let captures = txns_regex.captures(text)?;
            Some(captures.get(1)?.as_str())
        })
        .expect("missing txns");

    dbg!(total_txns_str);
}

Cargo.toml:

[package]
name = "scraper-issue-57"
version = "0.0.0"
authors = [ "nathaniel daniel <nathaniel.daniel12@gmail.com>" ]
edition = "2018"

[dependencies]
regex = "1.4.5"
scraper = "0.12.0"
ureq = { version = "2.1.0", features = [ "cookies" ] }

which yields:

[src\main.rs:78] total_txns_str = "942,202"
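
If a number is needed rather than a string, the captured value still has to be parsed. The following is a minimal sketch using only the standard library. `extract_js_string_var` and `parse_grouped_count` are hypothetical helper names, and the code assumes the script keeps the `var name = 'value';` shape and that commas are thousands separators, as in the output above.

```rust
/// Returns the single-quoted value assigned in `var <name> = '<value>'`.
/// Hypothetical helper: assumes exactly this spacing and single quotes.
fn extract_js_string_var<'a>(script: &'a str, name: &str) -> Option<&'a str> {
    let prefix = format!("var {} = '", name);
    let start = script.find(&prefix)? + prefix.len();
    let rest = &script[start..];
    // Stop at the first closing quote so later statements are not captured.
    let end = rest.find('\'')?;
    Some(&rest[..end])
}

/// Parses a count such as "942,202" by dropping the commas first.
/// Returns None for placeholders like "-" or any other non-numeric text.
fn parse_grouped_count(text: &str) -> Option<u64> {
    let digits: String = text.chars().filter(|c| *c != ',').collect();
    digits.parse().ok()
}
```

Returning `Option` from both helpers lets the caller tell the "-" placeholder apart from a real count instead of panicking on it.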

GopherJ commented Apr 2, 2021

Yes, I agree, it's probably this.

GopherJ closed this as completed Apr 2, 2021