cdump/evmole-datasets

EVMole Datasets

This repository contains smart contract datasets used for EVMole benchmarks. The datasets include large Solidity contracts, randomly selected contracts, and Vyper contracts from the Ethereum mainnet.

Dataset Construction Process

  1. Clone the source repository containing verified Ethereum smart contracts:
$ git clone https://github.com/tintinweb/smart-contract-sanctuary.git
  2. Locate all Solidity contracts and record their sizes:
$ cd smart-contract-sanctuary/ethereum/contracts/mainnet/

# (contract_size_in_bytes) (contract_file_path)
$ find ./ -name "*.sol" -printf "%s %p\n" > all.txt
  3. Extract the addresses of the roughly 1200 largest contracts by file size:
$ cat all.txt | sort -rn | head -n 1200 | cut -d'/' -f3 | cut -d'_' -f1 > top.txt
  4. Select roughly 55,000 unique contract addresses at random:
$ cat all.txt | cut -d'/' -f3 | cut -d'_' -f1 | sort -u | shuf | head -n 55000 > random.txt
  5. Collect the addresses of all Vyper contracts:
$ find ./ -type f -name '*.vy' | cut -d'/' -f3 | cut -d'_' -f1 > vyper.txt
  6. Download contract code and ABI (using scripts/etherscan):
$ poetry run python3 download.py --etherscan-api-key=CHANGE_ME --addrs-list=top.txt --out-dir=datasets/largest1k --limit=1000 --code-regexp='^0x(?!73).'
$ poetry run python3 download.py --etherscan-api-key=CHANGE_ME --addrs-list=random.txt --out-dir=datasets/random50k --limit=50000 --code-regexp='^0x(?!73).'
$ poetry run python3 download.py --etherscan-api-key=CHANGE_ME --addrs-list=vyper.txt --out-dir=datasets/vyper --code-regexp='^0x(?!73).'
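
For readers who want to see what the address-selection pipelines compute, here is a minimal Python sketch of the same logic. It is not part of this repository: the function names are hypothetical, and it assumes every line of all.txt has the shape "<size> ./<dir>/<address>_<name>.sol" that the find command above produces. The sample lines use made-up, shortened addresses.

```python
import random


def extract_address(line: str) -> str:
    """Mirror `cut -d'/' -f3 | cut -d'_' -f1` on one line of all.txt."""
    # Assumed line shape: "<size> ./<dir>/<address>_<name>.sol"
    filename = line.split("/")[2]
    return filename.split("_")[0]


def largest(lines, n):
    """Addresses of the n largest files, biggest first (sort -rn | head -n)."""
    by_size = sorted(lines, key=lambda l: int(l.split(" ", 1)[0]), reverse=True)
    return [extract_address(l) for l in by_size[:n]]


def random_sample(lines, n, seed=None):
    """Up to n distinct addresses in random order (sort -u | shuf | head -n)."""
    unique = sorted({extract_address(l) for l in lines})
    random.Random(seed).shuffle(unique)
    return unique[:n]


if __name__ == "__main__":
    # Hypothetical sample data, not real contract addresses.
    demo = [
        "120 ./aa/aa01_Token.sol",
        "900 ./bb/bb02_Vault.sol",
        "450 ./aa/aa03_Pool.sol",
    ]
    print(largest(demo, 2))
    print(random_sample(demo, 2, seed=0))
```

The lists request more addresses than the final dataset sizes (1200 versus 1000, 55,000 versus 50,000); the --limit option and the --code-regexp filter in the download step then decide which of them end up in the dataset.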

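The --code-regexp='^0x(?!73).' parameter filters contracts by their deployed bytecode:

  1. It skips contracts with empty code ("code": "0x"), which are self-destructed contracts, because the pattern requires at least one character after the 0x prefix.
  2. It excludes contracts whose code starts with 0x73 (the PUSH20 opcode), via the negative lookahead (?!73).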

Note about excluded contracts: Compiled Solidity libraries begin with the PUSH20 opcode for call protection. These are currently excluded because non-storage structs are referred to by their fully qualified name, which is not yet supported by our reference Etherscan extractor (providers/etherscan). This limitation may be addressed in future updates.
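
The behaviour of the filter can be checked in isolation. The sketch below applies the same pattern with Python's re module. How download.py applies the pattern internally is an assumption here (re.search is used, which is equivalent to re.match for a pattern anchored with ^), and the bytecode strings are shortened, made-up samples.

```python
import re

# Same pattern as passed to --code-regexp above.
CODE_FILTER = re.compile(r"^0x(?!73).")


def keep(code: str) -> bool:
    """Return True if the bytecode string passes the filter."""
    return CODE_FILTER.search(code) is not None


if __name__ == "__main__":
    print(keep("0x"))             # False: nothing after the prefix
    print(keep("0x73" + "00" * 20))  # False: starts with PUSH20 (0x73)
    print(keep("0x6080604052"))   # True: ordinary contract prologue
```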

  7. Build the storage dataset (/mnt/sourcify/sources must contain the downloaded contracts):
$ cd scripts/sourcify
$ npm install
$ mkdir -p out/

# Process contracts in parallel (adjust -P16 for your CPU cores)
$ find /mnt/sourcify/sources/contracts/full_match/1 -mindepth 1 -maxdepth 1 -type d | shuf | head -n 4000 | xargs -n1 -P16 node index.mjs

# Select ~3000 unique contracts
$ md5sum out/* | sort | uniq -w 32 | shuf | head -n 3000 | awk '{print $2}' | xargs -I{} cp {} storage3k/
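
The deduplication in the last command keeps one file per distinct MD5 digest and then takes a random subset. A rough Python equivalent is sketched below as an illustration only: pick_unique is a hypothetical helper, not a script from this repository, and unlike the shell command it creates the destination directory if it is missing.

```python
import hashlib
import random
import shutil
from pathlib import Path


def pick_unique(src_dir, dst_dir, limit, seed=None):
    """Copy up to `limit` files with distinct contents from src_dir to dst_dir."""
    first_by_digest = {}
    # Sorted iteration keeps the first path per digest, similar to
    # `md5sum | sort | uniq -w 32`, which compares only the 32 hex digits.
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        first_by_digest.setdefault(digest, path)

    chosen = list(first_by_digest.values())
    random.Random(seed).shuffle(chosen)  # shuf
    chosen = chosen[:limit]              # head -n

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy(path, dst / path.name)
    return chosen
```

For example, pick_unique("out", "storage3k", 3000) would correspond to the command above. MD5 is used here only to spot identical files, not for any security purpose.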
