[Feat] Generalized scraper for all labels (SeleniumJS) #13
Function to check if label exists for addr in allLabels
WOW, this is an epic contribution @brianleect 🙏 Is this ready for PR review? I know we've been chatting over in Discord about the importance of splitting labels into separate files. If all labels are in one file, the bundle cannot be split properly, which leads to massive bundle sizes. Thanks again! Excited to join forces here 🎉
package.json
Outdated
@@ -43,5 +43,8 @@
  },
  "lint-staged": {
    "*": "prettier -u --write"
  },
  "dependencies": {
    "selenium-webdriver": "^4.4.0"
We should make this a dev dependency, since end users do not need it installed when they run `npm install`.
Shifted `selenium-webdriver` to `devDependencies` in both package.json and package-lock.json. Should be fixed by the latest commit.
Noticed a bug. Some labels are apparently empty. Not sure if it's caused by scraping too quickly?
Initially started from index 4, thinking it would skip the header, but that assumption was wrong.
Fixed the empty labels. There also seems to be an issue where label scraping returns inconsistent results: one full run ended with ~370 labels, while a second run managed to reach 400 labels in total. Might need to test whether we get a consistent number of labels back from labelcloud; if we do, the issue might be elsewhere.
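One way to check whether two runs returned the same labels is to compare the label names from each run as sets. This is only a rough sketch: `diffRuns` is a hypothetical helper, not something in this PR, and it assumes each run can be reduced to an array of label names.

```javascript
// Hypothetical helper (not part of this PR): compares the label names
// collected by two scrape runs and reports what each run is missing.
function diffRuns(firstRun, secondRun) {
  const first = new Set(firstRun);
  const second = new Set(secondRun);
  return {
    // Labels seen in the first run but not in the second.
    onlyInFirst: [...first].filter((label) => !second.has(label)),
    // Labels seen in the second run but not in the first.
    onlyInSecond: [...second].filter((label) => !first.has(label)),
    // True only when both runs contain exactly the same label names.
    consistent:
      first.size === second.size &&
      [...first].every((label) => second.has(label)),
  };
}
```

If `consistent` stays true across runs while the saved totals still differ, the mismatch is more likely in the per-label scraping than in the labelcloud listing.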
Thanks for the comments on all this @brianleect 🙏 I'll take a look soon. I appreciate the patience, I was offline a lot for EthMexico where I competed 🙌
We've got a big refactor underway already which replaces the need for SeleniumJS. Thank you for this issue @brianleect, we've decided on a different path that's working well for now! 🙏
Flow
- `node scrape-all` for all labels or `node scrape-all labelName` for single label retrieval
- `etherscan`
- `labelcloud`
- `label.json` in `src/mainnet/all-json` which are filtered out
- `ignore_list` labels which are hardcoded in for being `too large` (100k+ labels) or `bugged` (no values)
- `filteredLabels` and save each label to `src/mainnet/all-json` as `${label}.json`
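
The filtering and saving steps of this flow could look roughly like the sketch below. It is not the PR's actual code: the function names, the `IGNORE_LIST` entries, and the shape of the saved data are all assumptions made for illustration, and the Selenium scraping itself is left out.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Placeholder entries (assumption): the real ignore list is hardcoded in the
// PR's script for labels that are too large or bugged.
const IGNORE_LIST = ["example-too-large-label", "example-bugged-label"];

// Returns the names of labels that already have a `${label}.json` file
// in outputDir (e.g. src/mainnet/all-json).
function existingLabels(outputDir) {
  if (!fs.existsSync(outputDir)) return [];
  return fs
    .readdirSync(outputDir)
    .filter((file) => file.endsWith(".json"))
    .map((file) => path.basename(file, ".json"));
}

// Drops labels that were already scraped or that appear in the ignore list.
function filterLabels(allLabels, alreadyScraped, ignoreList = IGNORE_LIST) {
  const skip = new Set([...alreadyScraped, ...ignoreList]);
  return allLabels.filter((label) => !skip.has(label));
}

// Writes one label's scraped entries to `${label}.json` inside outputDir.
function saveLabel(outputDir, label, entries) {
  fs.mkdirSync(outputDir, { recursive: true });
  const file = path.join(outputDir, `${label}.json`);
  fs.writeFileSync(file, JSON.stringify(entries, null, 2));
}
```

Skipping labels that already have a file means an interrupted run can be resumed without rescraping everything, at the cost of never refreshing a label unless its file is deleted first.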