# Constructing the Social Network
We can construct a directed graph $G = (V,E)$ where $V$ is the set of Ethereum addresses that have completed at least one transaction. Each address can carry a label, which may be null when nothing is known about the address. Then, $E$ is the set of arcs where $(n_i, n_j, TrS_{ij}) \in E$ if there is at least one transaction from $n_i$ to $n_j$. Note that $TrS_{ij}$ is a set of triplets $(tr_{ij_k}, \tau_{ij_k}, v_{ij_k})$ where $tr_{ij_k}$ is the $k^{th}$ transaction from $n_i$ to $n_j$, $\tau_{ij_k}$ is its timestamp, and $v_{ij_k}$ is the amount of Wei transferred.
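
As a minimal sketch of this construction, assume transactions arrive as `(sender, receiver, tx_hash, timestamp, value_wei)` tuples. That layout, the helper name `build_graph`, and the sample addresses and hashes are all made up for illustration. Each arc is stored under the ordered pair $(n_i, n_j)$, and its value is the list $TrS_{ij}$:

```python
from collections import defaultdict

# Hypothetical sample transactions: (sender, receiver, tx_hash, timestamp, value_wei)
transactions = [
    ("0xA", "0xB", "tx1", 1700000000, 5),
    ("0xA", "0xB", "tx2", 1700000600, 7),
    ("0xB", "0xC", "tx3", 1700001200, 2),
]

def build_graph(transactions):
    """Return (V, E): V is the set of addresses, E maps (n_i, n_j) to TrS_ij."""
    V = set()
    E = defaultdict(list)
    for sender, receiver, tx_hash, timestamp, value in transactions:
        V.update((sender, receiver))
        # Append the triplet (tr, tau, v) to the arc from sender to receiver
        E[(sender, receiver)].append((tx_hash, timestamp, value))
    return V, dict(E)

V, E = build_graph(transactions)
print(sorted(V))
print(E[("0xA", "0xB")])
```

Keying on the ordered pair keeps the arc from $n_i$ to $n_j$ distinct from the arc from $n_j$ to $n_i$.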

Then, we can select a few factors from Social Network Analysis to characterize each address:
1. In degree
2. Out degree
3. In transaction
4. Out transaction
5. In value
6. Out value
7. Clustering coefficient
8. PageRank
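
The first six factors can be computed directly from a transaction list, without a graph library. The sketch below assumes that degree counts distinct counterparties while transaction counts every transfer, which is one reading of the list above. The helper name `address_features` and the sample data are hypothetical, and the clustering coefficient and PageRank are left out.

```python
from collections import defaultdict

# Hypothetical sample transactions: (sender, receiver, value_wei)
transactions = [
    ("0xA", "0xB", 5),
    ("0xA", "0xB", 7),
    ("0xB", "0xC", 2),
    ("0xC", "0xA", 1),
]

def address_features(transactions):
    """Compute factors 1-6 for every address that appears in a transaction."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    feats = defaultdict(lambda: {"in_tx": 0, "out_tx": 0, "in_value": 0, "out_value": 0})
    for sender, receiver, value in transactions:
        out_nbrs[sender].add(receiver)
        in_nbrs[receiver].add(sender)
        feats[sender]["out_tx"] += 1
        feats[sender]["out_value"] += value
        feats[receiver]["in_tx"] += 1
        feats[receiver]["in_value"] += value
    for addr, f in feats.items():
        # Degree = number of distinct counterparties (an assumption of this sketch)
        f["in_degree"] = len(in_nbrs[addr])
        f["out_degree"] = len(out_nbrs[addr])
    return dict(feats)

features = address_features(transactions)
print(features["0xA"])
```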

In [54]:
import numpy as np
import numpy.linalg as npla
import os
import pandas as pd

# Eros Distance (Extended Frobenius Norm)

Let $A$ and $B$ be two multivariate time series of size $m_A \times n$ and $m_B \times n$ respectively, where $m_A, m_B$ are the numbers of observations and $n$ is the number of factors. Construct the $n \times n$ covariance matrices and denote these as $M_A, M_B$. Then, apply SVD to these matrices to obtain the right eigenvector matrices $V_A, V_B$. Let $V_A = [a_1, \dots, a_n]$ and $V_B = [b_1, \dots, b_n]$, where $a_i, b_i$ are orthonormal column vectors of size $n$. Then, for a weight vector $w$,
$$\text{Eros} (A,B,w) = \sum_{i=1}^n w_i | \langle a_i, b_i \rangle | = \sum_{i=1}^n w_i | \cos \theta_i |$$
where $\theta_i$ is the angle between $a_i$ and $b_i$.
Note that, despite the name, Eros behaves as a similarity: provided the weights are nonnegative and sum to $1$, the value lies on the interval $[0,1]$, where $1$ is the most similar.
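
To make the range concrete, here is a small self-contained check on synthetic data. The function name `eros`, the random series, and the weights are illustrative assumptions, not the data used later. A series compared with itself should give a value of 1 when the weights sum to 1, and any other pair should land in $[0,1]$.

```python
import numpy as np

def eros(A, B, w):
    """Eros value between two (observations x factors) series, per the formula above."""
    # np.linalg.svd returns V transposed, so the rows of Vh are the right eigenvectors
    _, _, Vh_A = np.linalg.svd(np.cov(A, rowvar=False))
    _, _, Vh_B = np.linalg.svd(np.cov(B, rowvar=False))
    # Rows are unit-norm, so each row-wise inner product is already cos(theta_i)
    return float(np.sum(w * np.abs(np.sum(Vh_A * Vh_B, axis=1))))

rng = np.random.default_rng(0)          # synthetic data, for illustration only
A = rng.normal(size=(50, 3)) * [5.0, 2.0, 0.5]
B = rng.normal(size=(40, 3)) * [0.5, 2.0, 5.0]
w = np.array([0.5, 0.3, 0.2])           # arbitrary weights summing to 1

print(eros(A, A, w))   # a series compared with itself
print(eros(A, B, w))   # two different series
```

The two series may have different numbers of observations, since only their $n \times n$ covariance matrices are compared.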

In [107]:
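def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (npla.norm(a) * npla.norm(b))

def Eros(A, B, weights):
    # Number of factors (columns)
    n = np.shape(A)[1]

    # n x n covariance matrices; rows of A and B are observations
    cov_A = np.cov(A.T)
    cov_B = np.cov(B.T)

    # npla.svd returns V transposed, so row i of Vh is the i-th right eigenvector
    U_A, S_A, Vh_A = npla.svd(cov_A)
    U_B, S_B, Vh_B = npla.svd(cov_B)

    result = 0
    for i in range(n):
        result += weights[i] * np.abs(cosine_similarity(Vh_A[i], Vh_B[i]))

    return result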

I grabbed some data about a subset of the exchange addresses to construct a basic multivariate time series for each one. I have not yet constructed a network, so the features that depend on it have been omitted for now. Hence, I just need to load these values into matrices.

In [82]:
data_dir = 'Code-Files/time_series_data'
arrays = []
for file in os.listdir(data_dir):
    f = os.path.join(data_dir, file)
    # dtype=np.float64 already yields a numeric array (rows = observations,
    # columns = factors), so no per-element decoding is needed
    arr = np.genfromtxt(f, delimiter=',', dtype=np.float64, skip_header=1)
    arrays.append(arr)

We need to calculate the optimal weights for the system, since weights derived only from the eigenvalues do not necessarily capture the traits of interest in our system. I will be using a practice dataset for this training, which contains a bunch of addresses I do not want to include in my data. To start, I focused on exchanges since I already had a very basic list of some known exchanges. Furthermore, rather than testing every possible weight vector, I am going to sample random weight vectors and keep the best one found (for now, at least).
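
For comparison, an eigenvalue-based weighting of the kind set aside above could look like the sketch below. It assumes one particular variant: normalize each series' covariance eigenvalues, average them across series, and renormalize. The function name `eigenvalue_weights` and the synthetic series are hypothetical stand-ins, not the exchange data.

```python
import numpy as np

def eigenvalue_weights(series_list):
    """Average each series' normalized covariance eigenvalues, then renormalize."""
    rows = []
    for X in series_list:
        # Singular values of a covariance matrix are its eigenvalues, in descending order
        s = np.linalg.svd(np.cov(X, rowvar=False), compute_uv=False)
        rows.append(s / s.sum())
    w = np.mean(rows, axis=0)
    return w / w.sum()

rng = np.random.default_rng(1)  # synthetic stand-in data
series_list = [rng.normal(size=(30, 6)) for _ in range(4)]
w = eigenvalue_weights(series_list)
print(w, w.sum())
```

Weights built this way always favor the leading components, which is why a search over arbitrary weight vectors can reach settings this scheme cannot.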

In [83]:
weights = []

for i in range(100000):
    r = np.random.randint(100,size=(6))
    r = r / np.sum(r)
    weights.append(r)

def find_weights(test_data, weights):
    # Smallest average pairwise Eros value seen so far
    min_diff = np.inf
    optimal_weight = []

    for w in weights:
        avg_diff = 0
        val = 0
        for sample_1 in test_data:
            for sample_2 in test_data:
                # Skip identical series; array_equal is safe when shapes differ
                if np.array_equal(sample_1, sample_2):
                    continue
                d = Eros(sample_1, sample_2, w)

                val += 1
                avg_diff += d

        if avg_diff != 0 and val != 0:
            avg_diff = avg_diff / val
            # Keep the weight vector with the lowest average so far
            if avg_diff < min_diff:
                min_diff = avg_diff
                optimal_weight = w
                print(optimal_weight)

    return optimal_weight

opt = find_weights(arrays, weights)
print(opt)

[0.11231884 0.22826087 0.15217391 0.1884058  0.14492754 0.17391304]
[0.11842105 0.14802632 0.24013158 0.0625     0.14144737 0.28947368]
[0.30147059 0.11029412 0.3125     0.02573529 0.10294118 0.14705882]
[0.05625  0.278125 0.2875   0.       0.134375 0.24375 ]
[0.25514403 0.06995885 0.3744856  0.01646091 0.09465021 0.18930041]
[0.23834197 0.07253886 0.37305699 0.01036269 0.07253886 0.23316062]
[0.2375  0.075   0.425   0.      0.03125 0.23125]
[0.02830189 0.         0.67924528 0.03773585 0.11320755 0.14150943]


KeyboardInterrupt: 

# Results

Now, I have determined an optimal weight for classifying an address as an exchange address (well, sort of). This can then be used to find the Eros distance between a new address and a known exchange address. Then, I can define some threshold value for classification.
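
A thresholding step of this kind might look like the sketch below. The function name `classify` and the default threshold of 0.9 are placeholders chosen for illustration; a real threshold would have to be tuned on labeled data.

```python
def classify(eros_to_known_exchanges, threshold=0.9):
    """Flag an address as exchange-like if its best Eros value against any
    known exchange meets the threshold (0.9 is a placeholder, not a tuned value)."""
    best = max(eros_to_known_exchanges)
    return best >= threshold

# Hypothetical Eros values of two new addresses against three known exchanges
print(classify([0.42, 0.95, 0.61]))
print(classify([0.10, 0.20, 0.15]))
```

Taking the maximum treats a close match to any single known exchange as sufficient; averaging over the known exchanges would be a stricter alternative.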

In this case, matrices $A$ and $B$ are exchanges classified as part of the same class, while $C$ is just a random address I pulled from Etherscan (it has had very different behavior). The two exchanges score close to $1$ and the exchange paired with the random address scores far lower, so the distance calculation gives us the results we would want.

In [109]:
weights = np.array([0.02830189, 0, 0.67924528, 0.03773585, 0.11320755, 0.14150943])

def load_series(address):
    # Load one address's time series (rows = observations, columns = factors)
    f = 'Code-Files/time_series_data/series_' + address + '.csv'
    return np.genfromtxt(f, delimiter=',', dtype=np.float64, skip_header=1)

A = load_series('0xd433138d12beB9929FF6fd583DC83663eea6Aaa5')
B = load_series('0x9B99CcA871Be05119B2012fd4474731dd653FEBe')
C = load_series('0x4838B106FCe9647Bdf1E7877BF73cE8B0BAD5f97')

print("Eros between two exchanges:", Eros(A,B,weights))
print("Eros between exchange and random address:", Eros(A,C,weights))

Eros between two exchanges: 0.9997584716687287
Eros between exchange and random address: 0.11320504923754138
