selector accidentally edited html #57
you can see
Does the raw HTML (before parsing it) contain
@demurgos Hi, the raw HTML doesn't contain
Yes, it's possible; I haven't checked it further.
@GopherJ The client js mutates that variable so you can't access it with
fn main() {
    let response =
        ureq::get("https://cn.etherscan.com/token/0xB8c77482e45F1F44dE1745F52C74426C631bDD52")
            .call()
            .expect("invalid http response");
    let response_text = response.into_string().expect("failed to get response text");
    let fragment = scraper::Html::parse_document(&response_text);
    let overview_selector = scraper::Selector::parse("#ContentPlaceHolder1_divSummary").expect("invalid overview selector");
    let transfers_selector =
        scraper::Selector::parse(".card .card-body #ContentPlaceHolder1_trNoOfTxns #totaltxns")
            .expect("invalid transfers selector");
    if let Some(overview) = fragment.select(&overview_selector).next() {
        dbg!(&overview.select(&transfers_selector).next().unwrap().html());
    }
}
yields
Here's a version that does what you want by extracting the needed variables from js with regexes.
fn main() {
    let agent = ureq::agent();
    let response1 = agent
        .get("https://cn.etherscan.com/token/0xB8c77482e45F1F44dE1745F52C74426C631bDD52")
        .call()
        .expect("invalid http response");
    let response_text1 = response1
        .into_string()
        .expect("failed to get response1 text");
    let fragment1 = scraper::Html::parse_document(&response_text1);
    let script_selector = scraper::Selector::parse("script").expect("invalid script selector");
    let mode_regex = regex::Regex::new(r"window\.mode = '(.*)';").expect("invalid mode regex");
    let contract_address_regex = regex::Regex::new(r"var litreadContractAddress = '(.*)';")
        .expect("invalid contract address regex");
    let address_regex =
        regex::Regex::new(r"var litAddress = '(.*)';").expect("invalid address regex");
    let sid_regex = regex::Regex::new(r"var sid = '(.*)';").expect("invalid sid regex");
    let script1 = fragment1
        .select(&script_selector)
        .find_map(|script| {
            let text = script.text().next()?;
            if mode_regex.is_match(text) {
                Some(text)
            } else {
                None
            }
        })
        .expect("missing script");
    let mode_captures = mode_regex.captures(script1).expect("missing mode captures");
    let mode = mode_captures.get(1).expect("missing mode").as_str();
    let contract_address_captures = contract_address_regex
        .captures(script1)
        .expect("missing contract address");
    let contract_address = contract_address_captures
        .get(1)
        .expect("missing contract address")
        .as_str();
    let address_captures = address_regex.captures(script1).expect("missing address");
    let address = address_captures.get(1).expect("missing address").as_str();
    let sid = fragment1
        .select(&script_selector)
        .find_map(|script| {
            let text = script.text().next()?;
            let captures = sid_regex.captures(text)?;
            Some(captures.get(1)?.as_str())
        })
        .expect("missing sid");
    let url = format!(
        "https://cn.etherscan.com/token/generic-tokentxns2?m={}&contractAddress={}&a={}&sid={}&p=1",
        mode, contract_address, address, sid
    );
    let response2 = agent.get(&url).call().expect("invalid http response");
    let response_text2 = response2
        .into_string()
        .expect("failed to get response2 text");
    let fragment2 = scraper::Html::parse_document(&response_text2);
    let txns_regex = regex::Regex::new(r"var totaltxns = '(.*)';").expect("invalid txns regex");
    let total_txns_str = fragment2
        .select(&script_selector)
        .find_map(|script| {
            let text = script.text().next()?;
            let captures = txns_regex.captures(text)?;
            Some(captures.get(1)?.as_str())
        })
        .expect("missing txns");
    dbg!(total_txns_str);
}
[package]
name = "scraper-issue-57"
version = "0.0.0"
authors = [ "nathaniel daniel <nathaniel.daniel12@gmail.com>" ]
edition = "2018"

[dependencies]
regex = "1.4.5"
scraper = "0.12.0"
ureq = { version = "2.1.0", features = [ "cookies" ] }
which yields:
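One caveat on the regex approach: a pattern like `'(.*)';` is greedy, so if a script line ever held more than one quoted assignment, the capture would run to the last `';` on that line. Here is a minimal std-only sketch that stops at the first closing quote instead. `extract_js_string` is a hypothetical helper (not part of scraper or regex), the sample script text is made up, and it assumes the value contains no escaped quotes.

```rust
/// Returns the single-quoted string assigned to `name` in `script`,
/// e.g. `var sid = 'abc';` gives `Some("abc")`.
/// Hypothetical helper for illustration; assumes no escaped quotes in the value.
fn extract_js_string<'a>(script: &'a str, name: &str) -> Option<&'a str> {
    // Build the text expected right before the value, e.g. "var sid = '".
    let prefix = format!("{} = '", name);
    let start = script.find(prefix.as_str())? + prefix.len();
    let rest = &script[start..];
    // Stop at the first closing quote rather than the last one on the line.
    let end = rest.find('\'')?;
    Some(&rest[..end])
}

fn main() {
    // Made-up script text; the real page's script may look different.
    let script = "var sid = 'abc123'; var other = 'xyz';";
    println!("{:?}", extract_js_string(script, "var sid"));
    println!("{:?}", extract_js_string(script, "var missing"));
}
```

With the regex crate, writing the capture group as `'([^']*)'` should have the same effect while keeping the rest of the code above unchanged.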
Yes, I agree it's probably this
I'm writing a robot to fetch cn.etherscan.com's token data.
On their site, the transfers section shows 939,005, while the following code gives me something different: