# `1_ingest.ipynb`

This notebook generates analysis dataframes and graphs for the networks of the following projects:

- Bored Ape Yacht Club
- Coolcats
- Cryptoadz
- Cyberkongz
- Hashmasks
- Mutant Ape Yacht Club
- Meebits
- Mekaverse
- Sneaky Vampire Syndicate

References to these projects and their respective smart contracts can be found in the accompanying report.
The notebook first generates **general analysis** dataframes from each project's CSV file in the `./data/collated` directory. It persists these dataframes at `./memory/<project>/full.npy` before creating the graph objects from them.

Thus, if you want to change the graph models (as we needed to many times over the course of this project), you can do so without regenerating the dataframes themselves. To rebuild only the graph objects and skip the dataframes, set the `GENERATE_DATAFRAMES` variable in the config section to `False`.

The graphs are then generated from each dataframe as ten successive snapshots of the network. Each snapshot includes everything in the snapshots that came before it, so together they capture the evolution of the network over time. Each snapshot is stored at `./memory/<project>/snapshots/` along with a summary dataframe. A `GENERATE_GRAPHS` config variable controls this step in the same way that `GENERATE_DATAFRAMES` controls the dataframes. Note that if both variables are set to `False`, the notebook regenerates neither the dataframes nor the graph snapshots.
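
The cumulative snapshot idea can be illustrated with a minimal sketch. The toy transfer log and the name `df_toy` are made up for illustration, and three buckets are used instead of ten; the real implementation is `build_snapshots` further down.

```python
import networkx as nx
import pandas as pd

# Toy transfer log (hypothetical data, not taken from the project CSV files).
df_toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=6, freq="D"),
    "from_address": ["a", "b", "a", "c", "d", "b"],
    "to_address": ["b", "c", "c", "d", "a", "e"],
})

# Split the rows into 3 equally sized time buckets (the notebook uses 10).
df_toy["bucket"] = pd.qcut(df_toy["date"], 3, labels=False)

snapshots = []
graph = nx.MultiDiGraph()
for bucket in sorted(df_toy["bucket"].unique()):
    rows = df_toy[df_toy["bucket"] == bucket]
    graph.add_edges_from(zip(rows["from_address"], rows["to_address"]))
    # Store a copy so that later buckets do not alter earlier snapshots.
    snapshots.append(graph.copy())

edge_counts = [g.number_of_edges() for g in snapshots]
print(edge_counts)
```

Because each snapshot extends the previous one, the edge counts can only grow from one bucket to the next.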

You can also set the `TEST_LIMIT` config variable, which is helpful for debugging. It limits the analysis to the first `TEST_LIMIT` rows of the CSV files. Note that a successful run overwrites the persisted dataframe, so stash your changes or save a copy if you want to be able to restore it.

The notebook is written in a straightforward, functional style to minimize bloat and to be easy for the reader to follow. Very few of the cells actually run any processes; most are simple function declarations.

If you downloaded this project from the GitHub repository, the necessary dataframes and graph snapshots are already generated for you. You can then use them in the `2_analysis.ipynb` notebook.

# Setup

The notebook has one network dependency: the Coinbase API, which provides the ETH-USD conversion rate. To re-run the notebook successfully, set the `COINBASE_KEY` and `COINBASE_SECRET` environment variables, which the cell below loads from a `.env` file. Although we generated persistent representations of the rate data, we used this API to fill in any gaps. The API is not required if all you want to do is generate the graphs. In that case, simply comment out this portion and set `GENERATE_DATAFRAMES` to `False`.

In [1]:
import os
import numpy as np
import pandas as pd
import networkx as nx

import arrow
from tqdm import tqdm

from dotenv import load_dotenv
from coinbase.wallet.client import Client

load_dotenv('.env')
client = Client(os.environ['COINBASE_KEY'], os.environ['COINBASE_SECRET'])
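
A missing variable makes `os.environ[...]` raise a bare `KeyError`. The following is a small sketch of a friendlier check; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced here, not part of the notebook.

```python
import os

# Variable names read by the setup cell above.
REQUIRED_VARS = ("COINBASE_KEY", "COINBASE_SECRET")

def missing_env_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Running this after `load_dotenv('.env')` would report which credentials still need to be added to the `.env` file.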

# Config

In [2]:
TEST_LIMIT = None  # Set to None for a production run, or to an integer to process only the first N rows while testing
GENERATE_DATAFRAMES = False  # Can be set to True to re-generate dataframes from scratch
GENERATE_GRAPHS = False  # Can be set to True to re-generate graph objects from scratch

projects = [
    'bayc',
    'coolcats',
    'cryptoadz',
    'cyberkongz',
    'hashmasks',
    'mayc',
    'meebits',
    'mekaverse',
    'svs'
]

# Generate Analysis Dataframes

### Generate base dataframe

In [3]:
def create_base_data(project):
    PATH_TO_DATA = './data/collated/' + project + '.csv'  # Change if needed
    column_names = ["row", "tx_hash", "token_address", "from_address", "to_address", "token_id", "blk_number", "blk_timestamp", "eth_value"]
    
    df = pd.read_csv(PATH_TO_DATA, delimiter=',', skiprows=1, names=column_names)
    
    df["from_address"] = df.from_address.apply(lambda x: x.strip())
    df["to_address"] = df.to_address.apply(lambda x: x.strip())
    
    return df
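
To see what `skiprows=1` together with `names=` does, here is a sketch on an in-memory file. The header and the single data row are invented for illustration; only the column names come from the function above. The `.str.strip()` call is a vectorized equivalent of the `apply(lambda x: x.strip())` lines.

```python
import io
import pandas as pd

# Hypothetical one-row file; its header is made up and is discarded by skiprows=1.
raw = io.StringIO(
    ",hash,address,from,to,id,block,timestamp,value\n"
    "0,0xabc,0xtoken, 0xfrom , 0xto ,1,100,2021-05-01 00:00:00,0.08\n"
)
column_names = ["row", "tx_hash", "token_address", "from_address", "to_address",
                "token_id", "blk_number", "blk_timestamp", "eth_value"]

# Skip the file's own header line and apply our column names instead.
df_demo = pd.read_csv(raw, skiprows=1, names=column_names)

# Remove the padding around the addresses.
for col in ("from_address", "to_address"):
    df_demo[col] = df_demo[col].str.strip()

print(df_demo[["from_address", "to_address", "token_id"]])
```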

### Lookup account data

In [4]:
def get_transaction_data(project):
    PATH_TO_DATA = f"./data/balances/{project}.csv"
    return pd.read_csv(PATH_TO_DATA)

errors = []

def lookup_account_value(df, block, account):
    value = 0
    df = df.infer_objects()
    
    if account == '0x0000000000000000000000000000000000000000':
        return value
    
    try:
        df_blocked = df[(df['block'] == block) & (df['address'] == account)]
        value = df_blocked['eth_value'].head(1).iat[0]
    except Exception as e:
        errors.append((block, account))
    return value
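
`lookup_account_value` filters the whole balances dataframe on every call. If that becomes slow, one possible alternative is to build a dictionary once and look up `(block, address)` pairs in it. This is only a sketch: `build_balance_index` and `lookup_balance` are hypothetical names, and unlike the function above it does not record misses in `errors`.

```python
import pandas as pd

ZERO_ADDRESS = "0x" + "0" * 40

def build_balance_index(df_balances):
    """Map (block, address) to the first eth_value seen for that pair."""
    first_rows = df_balances.drop_duplicates(subset=["block", "address"], keep="first")
    keys = zip(first_rows["block"], first_rows["address"])
    return dict(zip(keys, first_rows["eth_value"]))

def lookup_balance(index, block, account, default=0):
    """Return the stored balance, or `default` for the zero address and for misses."""
    if account == ZERO_ADDRESS:
        return default
    return index.get((block, account), default)

# Toy balances table using the column names the notebook expects.
df_bal = pd.DataFrame({
    "block": [100, 100, 101],
    "address": ["0xa", "0xa", "0xb"],
    "eth_value": [1.5, 9.9, 2.0],
})
balance_index = build_balance_index(df_bal)
print(lookup_balance(balance_index, 100, "0xa"))
```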

### Generate ETH/USD lookup and persist

In [5]:
def build_eth_to_usd_lookup():
    """The result is what one ETH is worth in USD"""
    column_names = ["date", "eth_to_usd"]
    rows = []  # Collected in a list because DataFrame.append was removed in pandas 2.0
    
    for project in projects:
        df_transactions = get_transaction_data(project)
        
        df_transactions['eth_value'] = df_transactions['eth_value'].apply(pd.to_numeric, errors='coerce').fillna(0)
        df_transactions['usd_value'] = df_transactions['usd_value'].apply(pd.to_numeric, errors='coerce').fillna(0)
        
        df_transactions = df_transactions.astype({
            'eth_value': 'float64',
            'usd_value': 'float64'
        })
        
        # Keep the first row with a non-zero ETH value for each date
        df_transactions = df_transactions[df_transactions['eth_value'] != 0].groupby('date', as_index=False).first()
    
        for index, row in tqdm(df_transactions.iterrows(), total=df_transactions.shape[0]):
            rows.append({
                'date': row['date'],
                'eth_to_usd': row['usd_value'] / row['eth_value'],
            })
        
    # Several projects can cover the same date; keep the first rate found for each date
    df_eth_to_usd = pd.DataFrame(rows, columns=column_names).groupby('date', as_index=False).first()
    print(df_eth_to_usd)
    
    np.save("./memory/eth_to_usd.npy", df_eth_to_usd)
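
The same rate table can also be computed without iterating over rows. The sketch below assumes balance dataframes with `date`, `eth_value` and `usd_value` columns, as read by `get_transaction_data`; `rates_from_balances` is a hypothetical name, and the two toy frames are invented.

```python
import pandas as pd

def rates_from_balances(frames):
    """Build a date -> ETH/USD table from an iterable of balance dataframes."""
    parts = []
    for df in frames:
        eth = pd.to_numeric(df["eth_value"], errors="coerce").fillna(0)
        usd = pd.to_numeric(df["usd_value"], errors="coerce").fillna(0)
        nonzero = eth != 0  # rows with no ETH value cannot yield a rate
        parts.append(pd.DataFrame({
            "date": df.loc[nonzero, "date"],
            "eth_to_usd": usd[nonzero] / eth[nonzero],
        }))
    combined = pd.concat(parts, ignore_index=True)
    # Keep the first rate found for each date, in the order the frames were given.
    return combined.groupby("date", as_index=False).first()

frame_a = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "eth_value": ["2", "1", "0"],
    "usd_value": ["2000", "1500", "0"],
})
frame_b = pd.DataFrame({"date": ["2021-01-02"], "eth_value": [4.0], "usd_value": [6000.0]})
df_rates_demo = rates_from_balances([frame_a, frame_b])
print(df_rates_demo)
```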

In [6]:
build_eth_to_usd_lookup()

100%|███████████████████████████████████████| 214/214 [00:00<00:00, 1079.57it/s]
100%|███████████████████████████████████████| 144/144 [00:00<00:00, 1108.78it/s]
100%|█████████████████████████████████████████| 84/84 [00:00<00:00, 1066.64it/s]
100%|███████████████████████████████████████| 228/228 [00:00<00:00, 1045.52it/s]
100%|███████████████████████████████████████| 277/277 [00:00<00:00, 1099.85it/s]
100%|█████████████████████████████████████████| 94/94 [00:00<00:00, 1098.39it/s]
100%|███████████████████████████████████████| 212/212 [00:00<00:00, 1106.74it/s]
100%|█████████████████████████████████████████| 51/51 [00:00<00:00, 1073.41it/s]
100%|█████████████████████████████████████████| 85/85 [00:00<00:00, 1084.75it/s]

           date  eth_to_usd
0    2021-01-28     1240.62
1    2021-01-29     1333.61
2    2021-01-30     1380.04
3    2021-01-31     1380.00
4    2021-02-01     1313.95
..          ...         ...
304  2021-11-28     4098.53
305  2021-11-29     4298.38
306  2021-11-30     4449.42
307  2021-12-01     4636.43
308  2021-12-02     4586.87

[309 rows x 2 columns]





### Helper functions to get eth_to_usd

In [7]:
np_data = np.load('./memory/eth_to_usd.npy', allow_pickle=True)
df_eth_to_usd = pd.DataFrame(data=np_data, columns=['date', 'eth_to_usd'])

def get_eth_to_usd(date):
    # This is when you miss static types.. 
    date = date.strftime("%Y-%m-%d")
    rate = df_eth_to_usd.loc[df_eth_to_usd['date'] == date].eth_to_usd.values[0]
    return rate

# Convert ETH value to USD at specified date
def get_usd_value(date, eth_value):
    if eth_value == 0:
        return eth_value
    try:
        rate = get_eth_to_usd(date)
        return rate * eth_value
    except IndexError:
        print("Date not in values: " + str(date))
        return float(client.get_spot_price(currency_pair='ETH-USD', date=date)['amount']) * eth_value
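
The lookup-then-fallback logic can be tested without network access if the fallback is passed in as a function. In this sketch `make_usd_converter` and `fallback_rate` are hypothetical names; in the notebook the fallback role is played by the Coinbase client call above.

```python
import datetime as dt
import pandas as pd

def make_usd_converter(df_rates, fallback_rate):
    """Return a function that converts an ETH amount to USD on a given date.

    `fallback_rate` is an assumed callable taking a 'YYYY-MM-DD' string and
    returning a float; it is only used for dates missing from `df_rates`.
    """
    rates = dict(zip(df_rates["date"], df_rates["eth_to_usd"]))

    def to_usd(date, eth_value):
        if eth_value == 0:
            return 0
        key = date.strftime("%Y-%m-%d")
        rate = rates.get(key)
        if rate is None:
            rate = fallback_rate(key)
        return rate * eth_value

    return to_usd

# Toy rate table and a stub fallback that records which dates it was asked for.
requested = []
def stub_fallback(day):
    requested.append(day)
    return 2000.0

df_rates_toy = pd.DataFrame({"date": ["2021-01-28"], "eth_to_usd": [1000.0]})
to_usd = make_usd_converter(df_rates_toy, stub_fallback)
print(to_usd(dt.datetime(2021, 1, 28), 2.0))
```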

### Build time-based analysis dataframes

In [8]:
def create_timed_data(df, df_transactions):
    ZERO_ADDRESS = '0x0000000000000000000000000000000000000000'
    column_names = [
        "date", 
        "days_since_mint", 
        "from_address", 
        "to_address", 
        "token_id", 
        "blk_number", 
        "eth_value",
        "usd_value",
        "from_value",
        "to_value",
        "from_value_usd",
        "to_value_usd"
    ]
    
    if TEST_LIMIT:
        df = df.head(TEST_LIMIT)
    
    rows = []  # Collected in a list because DataFrame.append was removed in pandas 2.0
    mint_date = None  # Date of the first row, which is treated as the mint
    
    # The total is taken after TEST_LIMIT is applied so that the progress bar is accurate
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        blk_timestamp = row['blk_timestamp']
        date = arrow.get(blk_timestamp).datetime

        from_address = str(row['from_address'])
        to_address = str(row['to_address'])
        token_id = row['token_id']
        blk_number = row['blk_number']
        eth_value = row['eth_value']
        usd_value = get_usd_value(date, eth_value)
        
        if mint_date is None:
            mint_date = date
        days_since_mint = (date - mint_date).days
            
        from_value = lookup_account_value(df_transactions, blk_number, from_address)
        to_value = lookup_account_value(df_transactions, blk_number, to_address)
        
        from_value_usd = get_usd_value(date, from_value)
        to_value_usd = get_usd_value(date, to_value)
            
        rows.append({
            'date': date,
            'days_since_mint': days_since_mint,
            'from_address': from_address,
            'to_address': to_address,
            'token_id': token_id, 
            'blk_number': blk_number,
            'eth_value': eth_value,
            'usd_value': usd_value,
            'from_value': from_value,
            'to_value': to_value,
            'from_value_usd': from_value_usd,
            'to_value_usd': to_value_usd,
        })
    
    df_time = pd.DataFrame(rows, columns=column_names).infer_objects()
    return df_time
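
The `days_since_mint` column counts whole days elapsed since the first row. A vectorized sketch of the same arithmetic follows, assuming the timestamps are in a format that `pd.to_datetime` can parse (the notebook itself parses them with `arrow`); the three timestamps are invented.

```python
import pandas as pd

df_dates = pd.DataFrame({
    "blk_timestamp": [
        "2021-04-30 12:00:00",  # first row, treated as the mint
        "2021-05-01 11:59:59",  # one second short of a full day
        "2021-05-03 12:00:00",
    ]
})

dates = pd.to_datetime(df_dates["blk_timestamp"], utc=True)
# Whole days elapsed since the first row; partial days are truncated.
df_dates["days_since_mint"] = (dates - dates.iloc[0]).dt.days

print(df_dates)
```

Note that a transfer 23 hours and 59 minutes after the mint still counts as day 0, which matches the behavior of `(date - mint_date).days`.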

### Driver code

In [9]:
if GENERATE_DATAFRAMES:
    for project in projects:
        df_transactions = get_transaction_data(project)
        df_time = create_timed_data(create_base_data(project), df_transactions)
    
        np.save(f"./memory/{project}/full.npy", df_time)
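
`np.save` stores a dataframe as a plain NumPy array, so the column names are not written to disk. That is why the graph driver code re-supplies `column_names` when loading. A small round-trip sketch on a temporary file, with invented data:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "token_id": [1, 2],
    "eth_value": [0.08, 1.5],
    "to_address": ["0xa", "0xb"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "full.npy")
    np.save(path, df_small)                 # mixed columns become a 2-D object array
    raw = np.load(path, allow_pickle=True)  # object arrays need allow_pickle=True

# Column names must be supplied again; infer_objects restores numeric dtypes.
restored = pd.DataFrame(raw, columns=list(df_small.columns)).infer_objects()
print(restored.dtypes)
```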

# Generate Graph Snapshots

### Build graph objects from time base dataframe

In [10]:
def build_graph_from_timed(df_time, old_graph=None):    
    # Building a network per block
    # we will use a weighted and directed graph.
    graph = old_graph if old_graph is not None else nx.MultiDiGraph()

    # loop over the pandas dataframe.
    for index, row in tqdm(df_time.iterrows(), total=df_time.shape[0]):
        # read the values from the dataframe.
        # token_id  blk_timestamp eth_value 
        date = row['date']
        from_address = row['from_address']
        to_address = row['to_address']
        token_id = row['token_id']
        blk_number = row['blk_number']
        eth_value = row['eth_value']
        usd_value = row['usd_value']
        from_value = row['from_value']
        to_value = row['to_value']
        from_value_usd = row['from_value_usd']
        to_value_usd = row['to_value_usd']
        
        # make sure both addresses are in the graph.
        if from_address not in graph:
            graph.add_node(from_address)
        if to_address not in graph:
            graph.add_node(to_address)

        # set the attributes on this node.
        nx.set_node_attributes(graph, {from_address: from_value, to_address: to_value}, 'eth_value')
        nx.set_node_attributes(graph, {from_address: from_value_usd, to_address: to_value_usd}, 'usd_value')

        # keep track of how many trades a wallet has done.
        trades = nx.get_node_attributes(graph, "trades")
        if from_address in trades:
            nx.set_node_attributes(graph, {from_address:trades[from_address] + 1}, 'trades')
        else:
            nx.set_node_attributes(graph, {from_address:1}, 'trades')
        if to_address in trades:
            nx.set_node_attributes(graph, {to_address:trades[to_address] + 1}, 'trades')
        else:
            nx.set_node_attributes(graph, {to_address:1}, 'trades')

        # add an edge for the transaction. # Note changed to usd_value
        graph.add_edge(from_address, to_address, weight=usd_value, token_id=token_id) # keep track of token id by adding it to the edge.
        
    return graph
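
A toy example may help show why a `MultiDiGraph` is used: repeated transfers between the same two wallets are kept as separate parallel edges. The addresses and values below are invented, and the per-node counter is written through `graph.nodes[...]` as a shorter alternative to `nx.set_node_attributes`.

```python
import networkx as nx

graph = nx.MultiDiGraph()

# (from_address, to_address, usd_value, token_id); hypothetical transfers.
transfers = [
    ("0xa", "0xb", 100.0, 1),
    ("0xa", "0xb", 250.0, 1),
    ("0xb", "0xc", 300.0, 2),
]

for sender, receiver, usd_value, token_id in transfers:
    # add_edge creates missing nodes and keeps parallel edges in a multigraph.
    graph.add_edge(sender, receiver, weight=usd_value, token_id=token_id)
    for node in (sender, receiver):
        graph.nodes[node]["trades"] = graph.nodes[node].get("trades", 0) + 1

print(graph.number_of_edges("0xa", "0xb"))
print(dict(graph.nodes(data="trades")))
```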

### Build time-based snapshots

In [11]:
def build_snapshots(df_time):
    res = []
    column_names = [
        "time_bucket", 
        "time_bucket_label",
        "number_of_nodes",
        "degree",
        "density",
        "reciprocity", 
        "assortativity", 
        "assortativity_base", 
        "assortativity_out_out", 
        "assortativity_in_in", 
        "assortativity_in_out",
        "centrality_degree",
        "centrality_closeness", 
    ]
    
    rows = []  # Collected in a list because DataFrame.append was removed in pandas 2.0
    
    df_time['date_quantile'], bins = pd.qcut(df_time['date'], 10, labels=False, retbins=True)
    time_buckets = np.unique(df_time["date_quantile"].to_numpy())
    
    for i, (time_bucket, label) in enumerate(zip(time_buckets, bins)):
        graph_selection = df_time[(df_time['date_quantile'] == time_bucket)]
        
        # Copy the previous snapshot: build_graph_from_timed adds to the graph it receives,
        # so without a copy every entry in `res` would refer to the same, final graph
        old_graph = res[i - 1].copy() if i != 0 else None
        
        graph_snapshot = build_graph_from_timed(graph_selection, old_graph=old_graph)
        degree = list(graph_snapshot.degree())  # Materialized as a list because .degree() returns a *VIEW*
        
        res.append(graph_snapshot)
        rows.append({
            "time_bucket": time_bucket,
            "time_bucket_label": label,
            "number_of_nodes": graph_snapshot.number_of_nodes(),
            "degree": degree,
            "density": nx.density(graph_snapshot),
            "reciprocity": nx.reciprocity(graph_snapshot),
            "assortativity": nx.degree_assortativity_coefficient(graph_snapshot),
            "assortativity_base": nx.degree_pearson_correlation_coefficient(graph_snapshot.to_undirected(), weight='weight'),
            "assortativity_out_out": nx.degree_pearson_correlation_coefficient(graph_snapshot, x='out', y='out', weight='weight'),
            "assortativity_in_in": nx.degree_pearson_correlation_coefficient(graph_snapshot, x='in', y='in', weight='weight'),
            "assortativity_in_out": nx.degree_pearson_correlation_coefficient(graph_snapshot, x='in', y='out', weight='weight'),
            "centrality_degree": nx.degree_centrality(graph_snapshot),
            "centrality_closeness": nx.closeness_centrality(graph_snapshot),
        })
        
    df_snapshots = pd.DataFrame(rows, columns=column_names)
    return (df_snapshots.sort_values(by=['time_bucket']), res)

### Driver code

In [12]:
if GENERATE_GRAPHS:
    for project in projects:
        column_names = [
            "date", 
            "days_since_mint", 
            "from_address", 
            "to_address", 
            "token_id", 
            "blk_number", 
            "eth_value",
            "usd_value",
            "from_value", 
            "to_value",
            "from_value_usd",
            "to_value_usd"
        ]

        np_data = np.load(f"./memory/{project}/full.npy", allow_pickle=True)
        df_time = pd.DataFrame(data=np_data, columns=column_names)

        df_snapshot_summary, g_snapshots = build_snapshots(df_time)

        for i, snapshot in enumerate(g_snapshots):
            nx.write_gml(snapshot, f"./memory/{project}/snapshots/{i}.gml")
            print("Successfully wrote snapshot")

        np.save(f"./memory/{project}/snapshots/summary.npy", df_snapshot_summary)
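
To use a stored snapshot later, it can be read back with `nx.read_gml`. The sketch below round-trips a tiny invented graph through a temporary file rather than touching `./memory`; it only illustrates that direction, parallel edges, and attributes survive the GML format.

```python
import os
import tempfile

import networkx as nx

original = nx.MultiDiGraph()
original.add_edge("0xa", "0xb", weight=100.0)
original.add_edge("0xa", "0xb", weight=250.0)  # second transfer between the same wallets
original.nodes["0xa"]["trades"] = 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0.gml")
    nx.write_gml(original, path)
    loaded = nx.read_gml(path)

print(loaded.number_of_edges(), loaded.nodes["0xa"])
```

The summary file is a NumPy object array, so it needs `np.load(..., allow_pickle=True)` and the snapshot column names to become a dataframe again.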