# 2. Extracting Token Transfers
In this exercise, we will be extracting token transfers from a single token contract.
We will obtain these transfers from the event/log data, using web3py.

In a second step, we will analyze a case where entities likely participated multiple times in an airdrop.

As in the previous exercise, please use your own endpoint URL. You can create a free account with [Moralis.io](https://admin.moralis.io/register) and get an endpoint URL from them. You may be able to use this notebook's endpoint, but you may run into rate limits if other participants are using it at the same time, and it may stop working entirely at a later point in time.

In [1]:
endpoint = "https://speedy-nodes-nyc.moralis.io/03c966587b022c980f59136b/eth/mainnet/archive"

In [2]:
from web3 import Web3
w3 = Web3(Web3.HTTPProvider(endpoint))
w3.isConnected()

True

# 2.1 Defining the token contract address and extracting logs
We can now specify a token contract address. For illustration purposes, we'll use the Bionic token contract address. (Feel free to change it!)

Using the Moralis.io endpoint, we then extract logs in a certain block range, but in batches of 2000 blocks. This is the maximum range that the Moralis.io endpoint allows. If you run your own node, you could use larger intervals.
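
The batching itself can be sketched without any network access. The helper below is a minimal illustration, and `block_batches` is a hypothetical name, not part of web3py. It relies on the fact that `eth_getLogs` treats both `fromBlock` and `toBlock` as inclusive, so consecutive batches should not share a boundary block.

```python
def block_batches(start, end, size):
    """Yield inclusive (from_block, to_block) pairs covering [start, end) without overlap."""
    for from_block in range(start, end, size):
        # Both bounds are inclusive in eth_getLogs, hence the "- 1".
        to_block = min(from_block + size, end) - 1
        yield from_block, to_block


batches = list(block_batches(6_000_000, 6_300_000, 2_000))
print(len(batches))   # 150
print(batches[0])     # (6000000, 6001999)
print(batches[-1])    # (6298000, 6299999)
```

Each pair can then be passed to the log query as `hex(from_block)` and `hex(to_block)`.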

The result is a list of event logs in JSON format that have not been parsed yet. Transfer amounts, for example, are still hex-encoded.
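
To make "not parsed yet" concrete, here is a minimal sketch of what decoding a single raw Transfer log involves. The log is made up for illustration and uses JSON-RPC style hex strings; web3py may hand back byte-like objects instead of plain strings, so this is not a drop-in replacement for the ABI-based parsing used later. The helper names `as_topic` and `decode_transfer` are hypothetical.

```python
# keccak256("Transfer(address,address,uint256)"), the well-known ERC-20 Transfer topic
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def as_topic(address):
    """Left-pad a 20-byte address to the 32-byte form used in log topics."""
    return "0x" + address[2:].lower().rjust(64, "0")


# Hypothetical raw log in the shape of an eth_getLogs JSON result.
raw_log = {
    "topics": [
        TRANSFER_TOPIC,
        as_topic("0xdD2A5B646bb936CbC279CBE462E31eab2C309452"),
        as_topic("0x24cdEABeD51cCacD3199b410236Cd7E24ca1d313"),
    ],
    # The non-indexed uint256 amount is a 32-byte big-endian hex value.
    "data": "0x" + format(2500000000000, "064x"),
}


def decode_transfer(log):
    """Decode an ERC-20 Transfer log given as hex strings."""
    topic0, from_topic, to_topic = log["topics"]
    if topic0 != TRANSFER_TOPIC:
        raise ValueError("not a Transfer event")
    return {
        "from": "0x" + from_topic[-40:],  # last 20 bytes of the topic
        "to": "0x" + to_topic[-40:],
        "value": int(log["data"], 16),
    }


decoded = decode_transfer(raw_log)
print(decoded["value"])  # 2500000000000
```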

In [3]:
tokenContractAddress = "0xef51c9377feb29856e61625caf9390bd0b67ea18" # Bionic token contract address

In [4]:
from tqdm.notebook import tqdm
import itertools

logLists = []
blockStart = 6000000
blockEnd = blockStart + 300000
blockRange = 2000 # Moralis only supports a maximum of 2000

# get_logs treats fromBlock and toBlock as inclusive, so each batch ends one block
# before the next one starts. Otherwise boundary blocks would be queried twice.
transferTopic = Web3.keccak(text='Transfer(address,address,uint256)').hex()
for blockNumber in tqdm(range(blockStart, blockEnd, blockRange)):
    logs = w3.eth.get_logs({"fromBlock": hex(blockNumber),
                            "toBlock": hex(min(blockEnd, blockNumber + blockRange) - 1),
                            "address": Web3.toChecksumAddress(tokenContractAddress),
                            "topics": [transferTopic]})
    logLists.append(logs)
logs = list(itertools.chain.from_iterable(logLists))


# 2.2 Parsing JSON logs with the Transfer event ABI
Smart contracts written in high-level languages like Solidity are compiled to EVM bytecode, but it is still very useful to access their functionality through function and event names. This is where the Application Binary Interface (ABI) comes in. It describes a contract's function and event signatures, which allows names to be translated to bytecode entry points. EVM smart contract developers usually generate this ABI for their code. [You can obtain such ABIs from Etherscan (at the bottom of the page)](https://etherscan.io/address/0xef51c9377feb29856e61625caf9390bd0b67ea18#code), where developers frequently upload them. If you want to learn more, [QuickNode has a good article on ABIs](https://www.quicknode.com/guides/solidity/what-is-an-abi).
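
As a small illustration of how an ABI entry relates to the signature string `Transfer(address,address,uint256)` used in the log filter, the sketch below derives that string from the event's ABI entry. `event_signature` is a hypothetical helper and only handles elementary types such as `address` and `uint256`, not tuples. It also separates the inputs by their `indexed` flag: indexed inputs end up in a log's topics, the others in its data field.

```python
import json

abi_entry = json.loads("""{ "anonymous": false, "inputs": [
    {"indexed": true,  "name": "from",  "type": "address"},
    {"indexed": true,  "name": "to",    "type": "address"},
    {"indexed": false, "name": "value", "type": "uint256"}
  ], "name": "Transfer", "type": "event" }""")


def event_signature(entry):
    """Build the canonical signature: event name plus comma-separated input types."""
    types = ",".join(inp["type"] for inp in entry["inputs"])
    return f'{entry["name"]}({types})'


signature = event_signature(abi_entry)
indexed = [inp["name"] for inp in abi_entry["inputs"] if inp["indexed"]]
in_data = [inp["name"] for inp in abi_entry["inputs"] if not inp["indexed"]]

print(signature)  # Transfer(address,address,uint256)
print(indexed)    # ['from', 'to']
print(in_data)    # ['value']
```

The keccak256 hash of this signature string is the first topic of every matching log, which is what the log filter matches against.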

Using the ABI of a transfer event lets us easily parse event logs that conform to the token transfer signature.
During parsing, we create a custom, flat dictionary format because web3py's internal structure is nested and immutable.

In [5]:
import json
# Reduced ERC-20 ABI, only Transfer event
ABIstring = """[ { "anonymous": false, "inputs": [
            {   "indexed": true,
                "name": "from",
                "type": "address"
            },
            {   "indexed": true,
                "name": "to",
                "type": "address"
            },
            {   "indexed": false,
                "name": "value",
                "type": "uint256"
            } ], "name": "Transfer", "type": "event" } ]"""
anonERC20contract = w3.eth.contract(abi=json.loads(ABIstring))
transferEventType = anonERC20contract.events.Transfer
transferEventABI = transferEventType._get_event_abi()

In [6]:
# With the ABI, we can now parse the transfer event logs and create our custom, unnested format.
from web3._utils.events import get_event_data
transferEvents = []
for log in logs:
    event = dict(get_event_data(w3.codec, transferEventABI, log))
    event["transactionHash"] = event["transactionHash"].hex()
    del event["blockHash"]
    # Lift the decoded arguments (from, to, value) to the top level
    for k, v in event["args"].items():
        event[k] = v
    del event["args"]
    transferEvents.append(event)

# 2.3 Creating a pandas dataframe
At this point we can transform the transfer event JSON list into a pandas dataframe, allowing for all sorts of analysis.

In [7]:
import pandas as pd
tokenTransfersDF = pd.DataFrame(transferEvents)
tokenTransfersDF

Unnamed: 0,event,logIndex,transactionIndex,transactionHash,address,blockNumber,from,to,value
0,Transfer,44,95,0x4f3f96adef9ca0444209b4844dea4049f97a74045d45...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6009316,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,0x24cdEABeD51cCacD3199b410236Cd7E24ca1d313,2500000000000
1,Transfer,17,68,0xaa9369523e8aeff77c457bcc2bfbdc6f17fc0be16794...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6010109,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,0xFe0AedEe5FaC7037b7DEc21F05a58c954df5A0F0,2500000000000
2,Transfer,20,29,0xfec0cb27cc608809998b6fb6ab35e5540b38de5676de...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6010118,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,0x817F0fc8760Eac5c497697329b50f632c453a4eB,2500000000000
3,Transfer,46,72,0x7534c86ba66f6636c94ebf9cc46db22b77d9c70c9bc3...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6010123,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,0x289BC64F25Af2785669DA79126a7B26A76cA956e,2500000000000
4,Transfer,44,85,0x2fb9ba27f354dfa73a3055f559bef749748265f400e4...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6010127,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,0xC802506207A588cF85191B14693743d3777aDd8B,2500000000000
...,...,...,...,...,...,...,...,...,...
16925,Transfer,28,23,0x90f2ff8d20256d699e692f796ad49a51364250d6fa79...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6299508,0x4E33d8c2CA8c74FaD0905E42C7f3317019774f50,0x274F3c32C90517975e29Dfc209a23f315c1e5Fc7,115266668157343
16926,Transfer,11,18,0x04797f21607e80a5472867c7ea98cb4aa9021615fdff...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6299640,0xBb02AB3275D79007c0acAf381d36F6Bd78abeCbd,0x274F3c32C90517975e29Dfc209a23f315c1e5Fc7,2500000000000
16927,Transfer,20,50,0xfd4a1623f1f069c0525d95939c6903da33c66776e8bb...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6299689,0x233F0dd15867FabC5195deb251eB68432FcafD98,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208,2500000000000
16928,Transfer,14,36,0x39894fc4393d779789c6597b48e6f1e0f9590d2f22bb...,0xEf51c9377FeB29856E61625cAf9390bD0B67eA18,6299823,0x1138BCa344eEec9c8e319Dbb54e05FCe726D1e3B,0x274F3c32C90517975e29Dfc209a23f315c1e5Fc7,2500000000000


# 2.4 Create a token transfer graph with networkx
We can also transform the dataframe into a graph object with networkx. This is useful for computing graph properties and testing graph algorithms.

In [8]:
import networkx as nx
G = nx.from_pandas_edgelist(df=tokenTransfersDF,
                            source="from", target="to",
                            edge_attr="value",
                            create_using=nx.DiGraph)

nodeMeasures = pd.DataFrame(dict(
    indegree = dict(G.in_degree),
    outdegree = dict(G.out_degree),
    indegree_centrality = nx.in_degree_centrality(G)
))
# Show node measures for nodes with an in-degree or out-degree of at least 10
nodeMeasures[(nodeMeasures[["indegree", "outdegree"]] >= 10).any(axis=1)]

Unnamed: 0,indegree,outdegree,indegree_centrality
0xdD2A5B646bb936CbC279CBE462E31eab2C309452,2,9828,0.000195
0x97126cbde15c4582cB5A76dEB1eDD1577C279F69,25,2,0.002443
0x79Af9E6fb38D10D6B99901faa25472D8f53ca121,13,1,0.001271
0xC128e3E58d9b126D2c0fA0E52FA33ea14aa6B83a,31,2,0.003030
0x39c9a982CfE5adB77c4B855309Dc87c12d9db4cC,29,1,0.002834
...,...,...,...
0xF21329C8A24a19388e8eb91A7c710D6fa10dEDE6,694,2,0.067826
0xcbd2A3423ea7095e2C4b650eD4539cc2936045dA,11,1,0.001075
0xcaF30aEf51f28FdAb9e0c6dD77299b67DFCa823B,39,2,0.003812
0x948E0e69De1d1ee9B129cDC3e61a36e7B68408d2,11,1,0.001075


# 2.5 Example airdrop multiparticipation study
Suppose we already know that there has been an airdrop for this particular token, where users were able to sign up to receive the same amount of tokens for free.

If we want to find out which address has airdropped the tokens, we can simply group by sender address and value, and count the number of transfers:

In [9]:
senders = tokenTransfersDF.groupby(['from','value']).agg({'to':'count'}).sort_values("to", ascending=False)
senders.reset_index(inplace=True)
senders.head()

Unnamed: 0,from,value,to
0,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,2500000000000,9954
1,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,100000000000,27
2,0xdD2A5B646bb936CbC279CBE462E31eab2C309452,1000000000000000,10
3,0x8d12A197cB00D4747a1fe03395095ce2A5CC6819,2500000000000,9
4,0x3c8bB860d09c8E50a4E85331F66E418d071Bf080,35000000000000,4


## 2.5.1 Who is receiving tokens of the same value multiple times?
We do the same in reverse: we group by receiving address and value, and count the number of transfers:

In [10]:
receivers = tokenTransfersDF.groupby(['to','value']).agg({'from':'count'}).sort_values("from", ascending=False)
receivers.reset_index(inplace=True)
receivers.head()

Unnamed: 0,to,value,from
0,0xE6c2d451936dFCA11fa968426b93E70F8A135221,2500000000000,1072
1,0xF21329C8A24a19388e8eb91A7c710D6fa10dEDE6,2500000000000,688
2,0xC4fd7c26aE028BF42F60FA6a0eF5c32990fCdA9f,2500000000000,445
3,0x715D4B5180fd31Ab9ABaf63646c6AFdb3Ce37a3C,2500000000000,411
4,0xa02E73A0564874Cd17B82669E72daE170AcF0371,2500000000000,315
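
Note that this count is a number of transfers, not a number of distinct senders. For the multiparticipation question the distinction matters: many transfers from one sender look different from single transfers by many senders. The following plain-Python sketch on made-up data shows both counts side by side; in pandas, a `nunique` aggregation on the `from` column would give the distinct-sender count.

```python
from collections import defaultdict

# Made-up (from, to, value) transfers for illustration.
transfers = [
    ("a1", "X", 25), ("a2", "X", 25), ("a3", "X", 25),  # three different senders
    ("b1", "Y", 25), ("b1", "Y", 25), ("b1", "Y", 25),  # one sender, three transfers
]

transfer_count = defaultdict(int)
distinct_senders = defaultdict(set)
for sender, receiver, value in transfers:
    transfer_count[(receiver, value)] += 1
    distinct_senders[(receiver, value)].add(sender)

for key in sorted(transfer_count):
    print(key, transfer_count[key], len(distinct_senders[key]))
# ('X', 25) 3 3
# ('Y', 25) 3 1
```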


## 2.5.2 Which groups of accounts likely belong to the same entity?
If a single address has received the same amount of tokens through multiple intermediaries multiple times, we can assume that the receiving address and its intermediaries likely belong to the same entity/user, and this user has participated multiple times in the airdrop.

*CAUTION: We assume the receiver is not a decentralized exchange or other type of service. In a real scenario these should be excluded, but we omit this issue for reduced complexity here.*

To identify these groups, we construct a two-hop network from the distributor, excluding the distributor itself:

In [11]:
# only consider edges with the same value as in the airdrop
airdropNode = senders.at[0, "from"] # sender with the most outgoing transfers of the same value (airdrop distributor)
airdropValue = senders.at[0, "value"] # the airdrop amount that the distributor has given away multiple times

# only consider the subgraph of these airdrop value transfers
edges = tokenTransfersDF[(tokenTransfersDF.value == airdropValue) &
                         (tokenTransfersDF.value != 0)]

# construct a graph from these airdrop value transfers
G_airdropValue = nx.from_pandas_edgelist(edges, source="from", target="to", create_using=nx.DiGraph)

# construct a subgraph starting from the airdrop distributor, going two hops, excluding the distributor itself
G_ego = nx.ego_graph(G_airdropValue, airdropNode, radius=2, center=False)

airdropCollections = nx.to_pandas_edgelist(G_ego, source="intermediary")
airdropCollections

Unnamed: 0,intermediary,target
0,0xcc1B7569442F89b18727506203B72f9511692512,0xEEBE77393079799B4582d307D708B35351c8b904
1,0x2e797D9dd31424eb93C5f042D46E76525903E1Fc,0x97126cbde15c4582cB5A76dEB1eDD1577C279F69
2,0xDC89F11b5796A6Eb1200Dc6bAe59324E2A3a5871,0xEEBE77393079799B4582d307D708B35351c8b904
3,0xC3eC43d632E2F2D83FbfBc34aC985B4472E7b080,0xEEBE77393079799B4582d307D708B35351c8b904
4,0x6D620E3b3aFe5585F5FDdEC87b0F91f42F03F05A,0xEEBE77393079799B4582d307D708B35351c8b904
...,...,...
5865,0x16F4aB9806914486AAE34fFe9d282a847890E915,0x0310613F7cBBfAC9e5c8d487c60A1b8f8738ba4D
5866,0x2F55D051248D963233f487D9B61437d7e5c3CC0b,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
5867,0x43158eE6Ef651742b274D18bB8C124cd4000e64D,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
5868,0xB44D57A251DA397bcBE33CFbbEf567E9fEd151fC,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
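
For readers who want to see what the two-hop construction amounts to, here is a minimal plain-Python sketch on made-up edges. It follows outgoing edges from the center for at most two hops, drops the center, and keeps every edge whose endpoints both remain. `two_hop_edges` is a hypothetical helper written for this illustration, not a networkx function.

```python
from collections import deque

# Made-up directed edges; "D" plays the role of the distributor.
edges = [("D", "a"), ("D", "b"), ("a", "X"), ("b", "X"), ("X", "Y")]


def two_hop_edges(edges, center, radius=2):
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    # Breadth-first search along outgoing edges, up to `radius` hops.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the radius
        for nxt in successors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    reachable = set(distance) - {center}
    # Induced subgraph: keep edges whose endpoints are both reachable.
    return sorted((s, d) for s, d in set(edges) if s in reachable and d in reachable)


print(two_hop_edges(edges, "D"))  # [('a', 'X'), ('b', 'X')]
```

Here `Y` is three hops away from `D` and is therefore left out, while `a` and `b` appear as intermediaries of the common target `X`.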


## 2.5.3 Restricting groups to those that have collected at least n airdrops (e.g. n=10)
To be a little more conservative with the groups that we identify, let's select only those where a sufficient number of airdrops have been aggregated into a single address. For ease of use, let's pick 10. In a real scenario, a more advanced mechanism should be introduced to identify a good threshold.

In [12]:
minAggregations = 10
minAggDF = airdropCollections[airdropCollections.groupby(
    ['target'])['intermediary'].transform('count') >= minAggregations]
minAggDF

Unnamed: 0,intermediary,target
1,0x2e797D9dd31424eb93C5f042D46E76525903E1Fc,0x97126cbde15c4582cB5A76dEB1eDD1577C279F69
6,0x97126cbde15c4582cB5A76dEB1eDD1577C279F69,0x8d12A197cB00D4747a1fe03395095ce2A5CC6819
7,0x8b20369BbC0f8562d1C40a73a137B519cBa72674,0x97126cbde15c4582cB5A76dEB1eDD1577C279F69
8,0x858aC1DF0F5E96Fa0e94bD44A1cD601A339a6E3d,0x8d12A197cB00D4747a1fe03395095ce2A5CC6819
9,0xB808Ea87da53fd1A9B064B5fbE647a727c7376C6,0x97126cbde15c4582cB5A76dEB1eDD1577C279F69
...,...,...
5864,0x9e7f737de4E686533E2bF9Ef4F26b68a7c75E754,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
5866,0x2F55D051248D963233f487D9B61437d7e5c3CC0b,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
5867,0x43158eE6Ef651742b274D18bB8C124cd4000e64D,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
5868,0xB44D57A251DA397bcBE33CFbbEf567E9fEd151fC,0x2a0c0DBEcC7E4D658f48E01e3fA353F44050c208
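
The filter keeps a row when its target appears in at least `minAggregations` rows overall. A plain-Python sketch of the same idea on made-up pairs is shown below, followed by a simple sweep over thresholds. The sweep is only one naive way to see how sensitive the result is to the choice of n, not the more advanced mechanism mentioned above. `filter_by_group_size` is a hypothetical helper.

```python
from collections import Counter

# Made-up (intermediary, target) pairs.
pairs = [("a1", "X"), ("a2", "X"), ("a3", "X"), ("b1", "Y")]


def filter_by_group_size(pairs, min_aggregations):
    """Keep pairs whose target collects from at least `min_aggregations` rows."""
    per_target = Counter(target for _, target in pairs)
    return [(i, t) for i, t in pairs if per_target[t] >= min_aggregations]


print(filter_by_group_size(pairs, 3))  # [('a1', 'X'), ('a2', 'X'), ('a3', 'X')]

# How many targets survive for different thresholds?
for n in (1, 2, 4):
    kept = filter_by_group_size(pairs, n)
    print(n, len({t for _, t in kept}))
# 1 2
# 2 1
# 4 0
```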


## 2.5.4 Extracting entities
We can now determine the groups of intermediary and target addresses and number them. The assumption is that each group is likely controlled by a single entity. Since the graph doesn't contain the distributor node, it consists of several disconnected components, which we can extract and enumerate.

This yields a mapping of address to entity id.

In [13]:
# Create an undirected graph of intermediary and target addresses
G_minAgg = nx.from_pandas_edgelist(minAggDF, 'intermediary', 'target')
# Extract connected components. This works because the distributor node is not part of the graph.
componentAddresses=list(nx.connected_components(G_minAgg))
# Enumerate each component, so that each entity has a number
componentAddressesEnum=[dict.fromkeys(y,x) for x, y in enumerate(componentAddresses)]
# Extract the addresses of each component and assign them their component number
addressToEntityDict={k: v for d in componentAddressesEnum for k, v in d.items()}
mapping = pd.DataFrame(list(addressToEntityDict.items()), columns=['address', 'entity'])
mapping

Unnamed: 0,address,entity
0,0x395c778f78C31EC84e8490FcA48e60A93BE190b9,0
1,0x5463C2A6bFb588918C77Ab82D9d0D4b394Bba47C,0
2,0xD44f53Abc542ABD60E391c659b948Df4aa3A135a,0
3,0x2792baa81416Fa264B0eB656FB11b9a655a17FD9,0
4,0x5A62df5AF866ed26Af427871E027b9F1694927E5,0
...,...,...
5442,0xf56Dbe2B0599BCA275b46237E812B0adf66bfb09,40
5443,0xd51dD97EDEC0A57855083E1aCD04633894Cd849a,40
5444,0x65D8C3EFccAdf8faF635DAD8876729cCE8804847,40
5445,0x03f2538C312176B0DC420f0ab567BA3133e573d9,40


This shows that there are likely at least 41 entities that have used a total of 5446 addresses to participate multiple (at least 10) times in the airdrop.
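
For readers curious about what extracting connected components involves, the following is a minimal union-find sketch in plain Python on made-up pairs. It is an illustration of the idea, not how networkx implements `connected_components`, and the local `connected_components` function here is a hypothetical stand-in.

```python
def connected_components(pairs):
    """Group nodes of an undirected edge list into connected components."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Made-up (intermediary, target) pairs; "X" also forwards to "Z".
pairs = [("a1", "X"), ("a2", "X"), ("b1", "Y"), ("b2", "Y"), ("X", "Z")]
components = connected_components(pairs)
address_to_entity = {addr: i for i, group in enumerate(components) for addr in group}

print(len(components))                                    # 2
print(address_to_entity["a1"] == address_to_entity["Z"])  # True
```

Because `X` forwards to `Z`, the addresses `a1`, `a2`, `X` and `Z` end up in the same entity, which mirrors how a target of one group can act as an intermediary for another.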