[Feat] Generalized scraper for all labels (SeleniumJS) #13

brianleect · 2022-08-14T02:21:27Z

Flow

In a terminal, run node scrape-all to scrape all labels, or node scrape-all labelName to retrieve a single label
Log in to Etherscan
Extract all labels from labelcloud
Filter out labels that already have a label.json in src/mainnet/all-json
Filter out labels on the hardcoded ignore_list, which covers labels that are too large (100k+ labels) or bugged (no values)
Loop through filteredLabels and save each label to src/mainnet/all-json as ${label}.json
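
The selection and filtering steps above could look roughly like the sketch below. This is only an illustration, not the code from this PR: selectLabels, filterLabels, and the entries in IGNORE_LIST are hypothetical names, and the login and scraping steps (which need Selenium) are left out.

```javascript
const path = require('path');

// Hypothetical ignore list: labels that are too large (100k+ entries)
// or that are known to come back with no values.
const IGNORE_LIST = ['example-huge-label', 'example-bugged-label'];

// Pick the labels to scrape: a single label if one was passed on the
// command line (`node scrape-all labelName`), otherwise all of them.
function selectLabels(allLabels, argv) {
  const single = argv[2];
  return single ? allLabels.filter((l) => l === single) : allLabels;
}

// Drop labels that already have a saved `${label}.json` file
// or that appear on the ignore list.
function filterLabels(labels, existingFiles, ignoreList = IGNORE_LIST) {
  const done = new Set(existingFiles.map((f) => path.basename(f, '.json')));
  const ignored = new Set(ignoreList);
  return labels.filter((l) => !done.has(l) && !ignored.has(l));
}
```

In a real run, existingFiles would presumably come from something like fs.readdirSync('src/mainnet/all-json') and argv from process.argv.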

Adds a function to check whether a label exists for an address in allLabels
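
A minimal sketch of such a lookup follows. The shape of the combined data is an assumption (an object mapping each label name to an object keyed by address), since the PR does not show it; the class and method names simply mirror the commit title and may not match the actual code.

```javascript
// Sketch only. Assumed input shape (not confirmed by the PR):
//   { labelName: { address: nameTag, ... }, ... }
class AllLabels {
  constructor(labelsByName) {
    // Index from lower-cased address to the set of labels that contain it.
    this.index = new Map();
    for (const [label, addresses] of Object.entries(labelsByName)) {
      for (const addr of Object.keys(addresses)) {
        const key = addr.toLowerCase();
        if (!this.index.has(key)) this.index.set(key, new Set());
        this.index.get(key).add(label);
      }
    }
  }

  // True if `addr` appears under `label`. Addresses are compared
  // case-insensitively, since hex addresses may be checksummed or not.
  hasLabel(addr, label) {
    const labels = this.index.get(addr.toLowerCase());
    return labels !== undefined && labels.has(label);
  }
}
```

Building the index once keeps each hasLabel call to a map lookup, at the cost of holding every address in memory.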

dawsbot · 2022-08-16T19:07:34Z

WOW, this is an epic contribution @brianleect 🙏

Is this ready for PR review? I know we've been chatting over in Discord about the importance of splitting labels into separate files. If all labels are in one file, the bundle cannot be split properly, which leads to massive bundle sizes.

Thanks again! Excited to join forces here 🎉

dawsbot · 2022-08-17T15:02:35Z

package.json

@@ -43,5 +43,8 @@
  },
  "lint-staged": {
    "*": "prettier -u --write"
+  },
+  "dependencies": {
+    "selenium-webdriver": "^4.4.0"


We should make this a dev dep since the end user does not need this installed when they are doing an npm install.


brianleect · 2022-08-17 (edited)

Moved selenium-webdriver to devDependencies in both package.json and package-lock.json. This should be fixed by the latest commit.

brianleect · 2022-08-22T05:02:46Z

Noticed a bug: some labels are apparently empty. Not sure if it's caused by scraping too quickly?

brianleect · 2022-08-22T05:06:36Z

Wrote a quick script to check. Apparently 186 labels are impacted. I'll see whether re-running the scraper or introducing a delay fixes the problem.
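
For reference, a check like this could be sketched as below. It is not the script from this PR: the function names are hypothetical, and it assumes each label file is a JSON object or array that counts as empty when it has no entries.

```javascript
const fs = require('fs');
const path = require('path');

// Treat parsed label data as empty if it is null/undefined,
// an empty array, or an object with no keys.
function isEmptyLabel(data) {
  if (data === null || data === undefined) return true;
  if (Array.isArray(data)) return data.length === 0;
  if (typeof data === 'object') return Object.keys(data).length === 0;
  return false;
}

// Return the names of the labels in `dir` whose JSON file is empty.
function findEmptyLabels(dir) {
  return fs
    .readdirSync(dir)
    .filter((f) => f.endsWith('.json'))
    .filter((f) => isEmptyLabel(JSON.parse(fs.readFileSync(path.join(dir, f), 'utf8'))))
    .map((f) => path.basename(f, '.json'))
    .sort();
}
```

Calling findEmptyLabels('src/mainnet/all-json') would list the labels to re-scrape; deleting those files first would let the scraper's existing-file filter pick them up again.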

Initially started from index 4, thinking that would skip the header, but that assumption was wrong

brianleect · 2022-08-22T13:58:15Z

Fixed the empty labels. There also seems to be an odd issue where the set of labels scraped is inconsistent between runs: one full run ended with ~370 labels, while a second run managed to scrape up to 400 labels in total.

We might need to test whether we get a consistent number of labels back from labelcloud; if we do, the issue is likely elsewhere.
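
One way to run that test is to save the list of label names from two separate labelcloud scrapes and diff them. The sketch below assumes each run yields a plain array of label names; diffLabelRuns is a hypothetical helper, not part of this PR.

```javascript
// Compare the label names returned by two scrapes of labelcloud.
function diffLabelRuns(firstRun, secondRun) {
  const a = new Set(firstRun);
  const b = new Set(secondRun);
  return {
    sameCount: a.size === b.size,
    // Labels seen in one run but missing from the other.
    onlyInFirst: [...a].filter((l) => !b.has(l)).sort(),
    onlyInSecond: [...b].filter((l) => !a.has(l)).sort(),
  };
}
```

If both difference lists come back empty across several runs, labelcloud itself is returning consistent results and the missing labels are more likely lost in a later step.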

dawsbot · 2022-08-24T16:46:51Z

Thanks for the comments on all this @brianleect 🙏

I'll take a look soon. I appreciate the patience; I was offline a lot for EthMexico, where I competed 🙌

dawsbot · 2024-04-13T20:19:03Z

We've got a big refactor underway already which replaces the need for SeleniumJS. Thank you for this issue @brianleect, we've decided on a different path that's working well for now! 🙏

brianleect and others added 7 commits August 12, 2022 23:17

[Feat] SeleniumWebdriver + Labelcloud scrape

62d7188

[Feat] Fuly scraper w partial JSON data

a0607a5

[Feat] Ignore 100k+ labels + bug + existing

1a7c7ae

[Feat] Single label option via cmd line argument

03e539d

[QOL] Consolidated address into single json w dump

27a581e

[Feat] allLabel class w hasLabel search

7415639

Function to check if label exist for addr in allLabels

Update readme w scraping instructions & allLabel

7f089c7

[Revert] Delete combinedLabel folder + readme

8da7b56

dawsbot requested changes Aug 17, 2022


[Fix] selenium-webdriver to devDependencies

6dd9751

brianleect added 4 commits August 22, 2022 14:14

[Bug] First label skipped

a6c6c2d

Initially started from index 4 thinking it wd skip header but is false

Fixed label data

fd52e64

script to check n remove empty labels

9506ab8

combined json for all

37d2de6

dawsbot closed this Apr 13, 2024

