# Address Filtering Demo 2
Given a sample transaction dataset, I construct a matrix of basic input/output information for each address. I filter these matrices down to a training set of identified addresses, then build matching matrices for addresses whose affiliation is unknown. With both in hand, I can combine several component-extraction techniques and distance functions to measure how closely related the addresses are.

In [11]:
import os
import pandas as pd
import numpy as np
from analysis import *
from extract_sender_receiver import *

### Datasets
The transaction data was pulled from an online source, but it can also be generated with `ethereum-etl`. The list of interesting addresses comes from another project, where it is simply a set of addresses that researchers wanted to omit from their analysis (just an example of what I could work with).

### Constructing Matrices
For each address in the transaction dataset, I extract the following columns: `block_number`, `out_tx`, `in_tx`, `out_value`, `in_value`, `unique_receivers`, `unique_senders`.

In [3]:
path = 'Data/eth_transactions.csv'
interesting_addresses = 'Data/layerzero_sybils.csv'
output_folder = 'Matrices2'
construct_matrix(path, interesting_addresses, output_folder)

Number of interesting senders: 800
Number of interesting receivers: 136
Number of blocks: 635
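
The body of `construct_matrix` is not shown in this notebook, so here is a minimal sketch of how the per-address columns listed above could be derived with pandas. The helper name `address_matrix` is hypothetical, and the input column names (`block_number`, `from_address`, `to_address`, `value`) are an assumption about how the transaction CSV is laid out.

```python
import pandas as pd

FEATURES = ["out_tx", "in_tx", "out_value", "in_value",
            "unique_receivers", "unique_senders"]

def address_matrix(tx: pd.DataFrame, address: str) -> pd.DataFrame:
    """Per-block activity of one address (hypothetical helper, not the notebook's code)."""
    # Transactions sent by the address, summarized per block.
    sent = tx[tx["from_address"] == address].groupby("block_number").agg(
        out_tx=("value", "count"),
        out_value=("value", "sum"),
        unique_receivers=("to_address", "nunique"),
    )
    # Transactions received by the address, summarized per block.
    received = tx[tx["to_address"] == address].groupby("block_number").agg(
        in_tx=("value", "count"),
        in_value=("value", "sum"),
        unique_senders=("from_address", "nunique"),
    )
    # Outer join keeps blocks that have only sends or only receives; gaps become 0.
    merged = sent.join(received, how="outer").fillna(0)
    return merged[FEATURES].reset_index()

# Tiny made-up example, for illustration only.
tx = pd.DataFrame({
    "block_number": [1, 1, 2, 2],
    "from_address": ["a", "a", "b", "a"],
    "to_address":   ["b", "c", "a", "b"],
    "value":        [10, 5, 3, 2],
})
m = address_matrix(tx, "a")
print(m)
```

One practical caveat: values denominated in wei can exceed the 64-bit integer range, so a real dataset may need them converted to floats or to ether before summing.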


### Defining Class
To compare an address against a class, I do not want to compare it with every single matrix in that class, so I define a single matrix that represents the class.

In [25]:
matrix_folder_path = 'Matrices2'
dim_red = diffusion_map
func = np.mean
output_file = 'UserSpectra.csv'

centroids = define_spectra(matrix_folder_path, dim_red, func)
np.savetxt(output_file, centroids, delimiter=",")

print('Spectra for the class have been constructed and saved to', output_file)

Spectra for the class have been constructed and saved to UserSpectra.csv
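
The internals of `define_spectra` are not shown here, so the following is only a sketch of the general idea: reduce each matrix to its component axes, then aggregate those element-wise into one representative matrix. It assumes rows are observations (blocks) and columns are features, and it uses plain PCA via SVD as a stand-in for `diffusion_map`. The names `principal_axes` and `class_representative` are hypothetical.

```python
import numpy as np

def principal_axes(X: np.ndarray) -> np.ndarray:
    """Principal axes of X (one axis per row), computed by SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    # full_matrices=True always yields a (features x features) matrix of axes.
    _, _, vt = np.linalg.svd(Xc, full_matrices=True)
    # Axes are only defined up to sign; make the largest-magnitude entry positive
    # so that averaging across matrices does not cancel out flipped axes.
    idx = np.argmax(np.abs(vt), axis=1)
    signs = np.sign(vt[np.arange(vt.shape[0]), idx])
    signs[signs == 0] = 1.0
    return vt * signs[:, None]

def class_representative(matrices, reduce=principal_axes, agg=np.mean) -> np.ndarray:
    """Aggregate the reduced form of every matrix element-wise (assumed behavior)."""
    spectra = np.stack([reduce(m) for m in matrices])
    return agg(spectra, axis=0)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
B = rng.normal(size=(15, 3))
rep = class_representative([A, B])
print(rep.shape)
```

Note that an element-wise mean of orthonormal axes is generally not orthonormal itself, which is one reason a real implementation might aggregate differently.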


### Eros Distance
Compute the weights for the class, based on the chosen component-extraction technique and similarity metric.

In [26]:
matrix_folder_path = 'Matrices2'

output_file = 'weights_diffusion_map.csv'

matrices = []

# Load every per-address CSV in the folder.
for file in os.listdir(matrix_folder_path):
    # dtype=np.float64 already yields floats, so no byte decoding is needed.
    A = np.genfromtxt(os.path.join(matrix_folder_path, file), delimiter=',', dtype=np.float64, skip_header=1)
    matrices.append(A.T)  # transpose so that each row is one feature

eig_mat, eig_vec_mat = build_eig_mat(matrices, dim_red)
weights = compute_weight_raw(eig_mat)
np.savetxt(output_file, weights, delimiter=",")

print('Weights have been calculated and saved to', output_file)
print(weights)

Weights have been calculated and saved to weights_diffusion_map.csv
[2.0000e-01 2.0000e-01 2.0000e-01 2.0000e-01 2.0000e-01 3.0203e-17 1.1725e-17]
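
As I understand the raw weighting scheme used by Eros, the eigenvalues of every matrix are collected into one array, aggregated per component across matrices, and normalized to sum to 1. The sketch below illustrates that idea; `raw_eros_weights` is a hypothetical name, and the actual `compute_weight_raw` from `analysis` may differ in its details.

```python
import numpy as np

def raw_eros_weights(eigenvalues: np.ndarray, agg=np.mean) -> np.ndarray:
    """eigenvalues has shape (n_components, n_matrices): one column per matrix."""
    # Aggregate each component's eigenvalue across all matrices (a row-wise reduction).
    w = agg(eigenvalues, axis=1)
    # Normalize so that the weights sum to 1.
    return w / w.sum()

# Made-up eigenvalues for two matrices with three components each.
S = np.array([[4.0, 2.0],
              [1.0, 1.0],
              [0.0, 0.0]])
w = raw_eros_weights(S)
print(w)
```

In the output above, the first five components share the weight equally (0.2 each) and the last two are zero up to floating-point error, so only five components contribute to the distances that follow.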


In [28]:
# with diffusion map
# Each CSV holds one address's matrix; transpose so that each row is one feature.
# dtype=np.float64 already yields floats, so no byte decoding is needed.
compare_0 = np.genfromtxt('Matrices/0x0b2443fdca5faa860738000ece90122a0702c5bb.csv', delimiter=',', dtype=np.float64, skip_header=1).T

compare_1 = np.genfromtxt('CompareMatrices/0x0a3cc7cb8c66e5a033352154afa918156b8fc2d2.csv', delimiter=',', dtype=np.float64, skip_header=1).T

compare_2 = np.genfromtxt('transaction_data.csv', delimiter=',', dtype=np.float64, skip_header=1).T

dist_0 = Eros(centroids, compare_0, weights, cosine_similarity, dim_red)
dist_1 = Eros(centroids, compare_1, weights, cosine_similarity, dim_red)
dist_2 = Eros(centroids, compare_2, weights, cosine_similarity, dim_red)

dist_0_euc = Eros(centroids, compare_0, weights, euclidean_distance, dim_red)
dist_1_euc = Eros(centroids, compare_1, weights, euclidean_distance, dim_red)
dist_2_euc = Eros(centroids, compare_2, weights, euclidean_distance, dim_red)

dist_0_mse = Eros(centroids, compare_0, weights, mean_squared_error, dim_red)
dist_1_mse = Eros(centroids, compare_1, weights, mean_squared_error, dim_red)
dist_2_mse = Eros(centroids, compare_2, weights, mean_squared_error, dim_red)

print('Cosine Similarity')
print(dist_0)
print(dist_1)
print(dist_2)
print('Euclidean Distance')
print(dist_0_euc)
print(dist_1_euc)
print(dist_2_euc)
print('Mean Squared Error')
print(dist_0_mse)
print(dist_1_mse)
print(dist_2_mse)

Cosine Similarity
0.04117051889075911
0.06675015209637022
0.025933414590600386
Euclidean Distance
0.22721842406620538
0.1680156087309959
0.1955513908572939
Mean Squared Error
0.05223673519620126
0.032220804804103644
0.03861846805644338


In [24]:
# with pca
# This cell (In [24]) ran before In [25] set dim_red = diffusion_map, so dim_red,
# centroids, and weights here are presumably the ones from the earlier PCA run.
# Each CSV holds one address's matrix; transpose so that each row is one feature.
compare_0 = np.genfromtxt('Matrices/0x0b2443fdca5faa860738000ece90122a0702c5bb.csv', delimiter=',', dtype=np.float64, skip_header=1).T

compare_1 = np.genfromtxt('CompareMatrices/0x0a3cc7cb8c66e5a033352154afa918156b8fc2d2.csv', delimiter=',', dtype=np.float64, skip_header=1).T

compare_2 = np.genfromtxt('transaction_data.csv', delimiter=',', dtype=np.float64, skip_header=1).T

dist_0 = Eros(centroids, compare_0, weights, cosine_similarity, dim_red)
dist_1 = Eros(centroids, compare_1, weights, cosine_similarity, dim_red)
dist_2 = Eros(centroids, compare_2, weights, cosine_similarity, dim_red)

dist_0_euc = Eros(centroids, compare_0, weights, euclidean_distance, dim_red)
dist_1_euc = Eros(centroids, compare_1, weights, euclidean_distance, dim_red)
dist_2_euc = Eros(centroids, compare_2, weights, euclidean_distance, dim_red)

dist_0_mse = Eros(centroids, compare_0, weights, mean_squared_error, dim_red)
dist_1_mse = Eros(centroids, compare_1, weights, mean_squared_error, dim_red)
dist_2_mse = Eros(centroids, compare_2, weights, mean_squared_error, dim_red)

print('Cosine Similarity')
print(dist_0)
print(dist_1)
print(dist_2)
print('Euclidean Distance')
print(dist_0_euc)
print(dist_1_euc)
print(dist_2_euc)
print('Mean Squared Error')
print(dist_0_mse)
print(dist_1_mse)
print(dist_2_mse)

Cosine Similarity
0.06007258842537396
0.054566869224389296
0.011856841357632387
Euclidean Distance
0.16084287394486116
0.22823370824842418
0.19993197410007288
Mean Squared Error
0.026431250252048864
0.05283468547750698
0.04011301156009313
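
For reference, the original Eros measure is a weighted sum of absolute inner products between corresponding component axes of two matrices. The sketch below shows that inner-product form only. The `Eros` function called in the cells above also accepts a similarity function and a reduction method, so its internals may differ, and `eros_similarity` is a hypothetical name.

```python
import numpy as np

def eros_similarity(V_a: np.ndarray, V_b: np.ndarray, weights: np.ndarray) -> float:
    """V_a and V_b hold unit-length component axes, one axis per row."""
    # Absolute inner product of the i-th axis of A with the i-th axis of B.
    # The absolute value makes the measure insensitive to axis sign flips.
    dots = np.abs(np.sum(V_a * V_b, axis=1))
    # Weighted sum; with weights summing to 1 the result lies in [0, 1].
    return float(np.dot(weights, dots))

V = np.eye(3)
w = np.array([0.5, 0.3, 0.2])
swapped = V[[1, 0, 2]]  # first two axes exchanged

print(eros_similarity(V, V, w))
print(eros_similarity(V, swapped, w))
```

When reading the printed results, keep in mind that the metrics point in opposite directions: for cosine similarity a larger value means more similar, while for Euclidean distance and mean squared error a smaller value means closer. The three groups of numbers are therefore not directly comparable with one another.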


### Improvements
- Finish the tools for running the analysis (CLI, visualization functions, etc.)
- Clustering coefficient
- Potentially try running it on a larger dataset