[Feat] Generalized scraper for all labels (SeleniumJS) #13

Closed
brianleect wants to merge 13 commits


@brianleect brianleect commented Aug 14, 2022

Flow

  1. In a terminal, run node scrape-all to scrape all labels, or node scrape-all labelName to retrieve a single label
  2. Log in to etherscan
  3. Extract all labels from labelcloud
  4. Check src/mainnet/all-json for existing ${label}.json files; labels that already have one are filtered out
  5. Check the hardcoded ignore_list; labels on it are skipped for being too large (100k+ labels) or bugged (no values)
  6. Loop through filteredLabels and save each label to src/mainnet/all-json as ${label}.json
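
Steps 4 to 6 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `filterLabels` and `saveLabel` and the shape of the saved rows are assumptions, and the Selenium login and scraping steps are left out.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the labels that still need scraping.
function filterLabels(allLabels, outputDir, ignoreList) {
  // Step 4: labels that already have a `${label}.json` file on disk.
  const existing = new Set(
    fs.existsSync(outputDir)
      ? fs
          .readdirSync(outputDir)
          .filter((file) => file.endsWith('.json'))
          .map((file) => path.basename(file, '.json'))
      : []
  );
  // Step 5: labels that are hardcoded to be skipped.
  const ignored = new Set(ignoreList);
  return allLabels.filter(
    (label) => !existing.has(label) && !ignored.has(label)
  );
}

// Hypothetical helper for step 6: writes one label's rows as `${label}.json`.
function saveLabel(outputDir, label, rows) {
  fs.mkdirSync(outputDir, { recursive: true });
  fs.writeFileSync(
    path.join(outputDir, `${label}.json`),
    JSON.stringify(rows, null, 2)
  );
}
```

Because already-saved labels are filtered out, an interrupted run can be restarted without re-scraping labels that were written to disk earlier.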

@dawsbot
Owner

dawsbot commented Aug 16, 2022

WOW, this is an epic contribution @brianleect 🙏

Is this ready for PR review? I know we've been chatting over in Discord about the importance of splitting labels into separate files. If all labels are in one file, the bundle cannot be split properly, which results in massive bundle sizes.

Thanks again! Excited to join forces here 🎉

package.json Outdated
@@ -43,5 +43,8 @@
},
"lint-staged": {
"*": "prettier -u --write"
},
"dependencies": {
"selenium-webdriver": "^4.4.0"
Owner

We should make this a devDependency, since end users do not need it installed when they run npm install.

Author

Moved selenium-webdriver to devDependencies in both package.json and package-lock.json. This should be fixed by the latest commit.

@brianleect
Author

Noticed a bug: some labels are apparently empty. Not sure if it's caused by scraping too quickly.

@brianleect
Author

image

Wrote a quick script to check. Apparently 186 labels are impacted. I'll see whether re-running the scraper or introducing a delay fixes the problem.
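
The script itself is not shown in this thread. A minimal sketch of such a check might look like the following, assuming each label file holds a JSON array or object and that "empty" means it has no entries; the function name `findEmptyLabels` is hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: lists labels whose JSON file contains no entries.
// Assumption: each file is a JSON array or object.
function findEmptyLabels(dir) {
  return fs
    .readdirSync(dir)
    .filter((file) => file.endsWith('.json'))
    .filter((file) => {
      const data = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf8'));
      // Count array items, or keys for an object (null is treated as empty).
      const size = Array.isArray(data)
        ? data.length
        : Object.keys(data ?? {}).length;
      return size === 0;
    })
    .map((file) => path.basename(file, '.json'))
    .sort();
}
```

Calling `findEmptyLabels('src/mainnet/all-json')` would return the names of the labels to re-scrape.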

@brianleect
Author

Fixed the empty labels. There also seems to be an odd issue where the set of labels scraped is inconsistent between runs: on one occasion a full run ended with ~370 labels, and a second run brought the total up to 400 labels.

Might need to test whether we are getting a consistent number of labels back from labelcloud; if so, the issue might be elsewhere.
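
One way to run that test is to record the label list returned by labelcloud on two runs and compare them. The sketch below assumes each run yields a plain array of label names; `diffLabelRuns` is a hypothetical helper, not part of this PR.

```javascript
// Hypothetical helper: reports which labels appear in only one of two runs.
function diffLabelRuns(firstRun, secondRun) {
  const first = new Set(firstRun);
  const second = new Set(secondRun);
  return {
    onlyInFirst: [...first].filter((label) => !second.has(label)).sort(),
    onlyInSecond: [...second].filter((label) => !first.has(label)).sort(),
  };
}
```

If both arrays come back empty across several runs, labelcloud is returning a consistent set and the discrepancy is more likely in the per-label scraping or saving step.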

@dawsbot
Owner

dawsbot commented Aug 24, 2022

Thanks for the comments on all this @brianleect 🙏

I'll take a look soon. I appreciate the patience; I was offline a lot for EthMexico, where I competed 🙌

@dawsbot
Owner

dawsbot commented Apr 13, 2024

We've got a big refactor underway already which replaces the need for SeleniumJS. Thank you for this issue @brianleect, we've decided on a different path that's working well for now! 🙏

@dawsbot dawsbot closed this Apr 13, 2024

2 participants