# DBSCAN

The overall complexity of DBSCAN is O(n²) in the worst case.

Motivation for **density-based clustering** (https://towardsdatascience.com/a-practical-guide-to-dbscan-method-d4ec5ab2bc99):

Two popular families of clustering methods are partitioning and hierarchical methods.
- **Partitioning methods** split the dataset into k groups (clusters), where k is the main input. The best-known partitioning method is K-means. Partitioning methods have significant drawbacks: you must know beforehand how many groups to split the data into (the k value), K-means performs poorly on non-convex or non-spherical clusters, and it is sensitive to noisy data.
- **Hierarchical methods** build a tree-shaped representation of the data (a dendrogram). In contrast to K-means, you don't need to decide the number of clusters in advance, but these methods have serious drawbacks of their own: their high computational complexity makes them unsuitable for big datasets, and you need to choose the criterion for merging clusters (the linkage), which affects the clustering results.

As we can see, the main disadvantages of partitioning and hierarchical methods are poor handling of noise and poor results when the clusters have a non-spherical shape.
**The DBSCAN clustering method is able to represent clusters of arbitrary shape and to handle noise.**
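
The contrast with K-means can be illustrated on a small synthetic dataset. The sketch below is not part of the original notebook: it assumes scikit-learn's `make_moons` toy data, and the values `eps=0.2` and `min_samples=5` were picked by hand for this data only. It compares both methods against the known grouping using the adjusted Rand index.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a standard non-convex toy dataset.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-means needs k up front and splits the plane with a straight boundary.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs no k; eps=0.2 is an assumed value chosen for this toy data.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On data like this, DBSCAN should follow the curved shapes while K-means cuts across them, but the gap depends on the chosen `eps`.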

**DBSCAN**

The main idea of DBSCAN is to separate groups of points by their density.

DBSCAN groups points with a dense neighborhood into clusters: a point is considered crowded if it has many neighboring points close to it. DBSCAN finds these crowded points and places them and their neighbors in a cluster.
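
The grouping step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `dbscan_sketch` is hypothetical, it computes all pairwise distances (so it only suits tiny inputs), and it counts a point as part of its own neighborhood, as scikit-learn does.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Label each row of X with a cluster id, or -1 for noise."""
    n = len(X)
    # All pairwise Euclidean distances: O(n^2) time and memory.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A "crowded" (core) point has at least min_samples points within eps.
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue  # border points join a cluster but do not expand it
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster_id
                    stack.append(k)
        cluster_id += 1
    return labels

# Two tight groups of four points each, plus one isolated point.
X_demo = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, 0.0],
])
demo_labels = dbscan_sketch(X_demo, eps=0.3, min_samples=3)
print(demo_labels)
```

Points that are neither crowded themselves nor within `eps` of a crowded point keep the label -1, which is how noise shows up in the results.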

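ε (the `eps` parameter in scikit-learn) is the radius of the neighborhood used to decide whether a point is crowded. In general, ε should be chosen as small as possible: if it is too large, distinct clusters merge into one, while if it is too small, most points are labeled as noise.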
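
One common heuristic for picking ε, which the original notebook does not use, is to look at the sorted distance from each point to its `min_samples`-th nearest neighbor and choose a value near the "elbow" of that curve. A minimal sketch, assuming toy data from `make_blobs` as a stand-in for a scaled feature matrix:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Toy data standing in for a scaled feature matrix (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

min_samples = 5  # scikit-learn's default for DBSCAN

# Querying the training points returns each point as its own nearest
# neighbor (distance 0), which matches how DBSCAN counts a point as
# part of its own neighborhood.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from every point to its min_samples-th neighbor, ascending.
k_distances = np.sort(distances[:, -1])

# Plotting k_distances and reading off the elbow gives a candidate eps.
print(k_distances[[0, len(k_distances) // 2, -1]])
```

The elbow is read off by eye, so the result is a starting point for ε and not a definitive value.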

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import pyplot
from utils import create_date, drop_smart_contract, clean_up_row, drop_missing_data
from data_preparation import train_data_loader, data_pre_processing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

import networkx as nx
import operator

## Load data

In [2]:
data = train_data_loader()

  eth_20170702 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170702.csv")
  eth_20170707 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170707.csv")
  eth_20170708 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170708.csv")
  eth_20170709 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170709.csv")
  eth_20170710 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170710.csv")
  eth_20170714 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170714.csv")
  eth_20170716 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170716.csv")
  eth_20170717 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170717.csv")
  eth_20170718 = pd.read_csv("https://s3.eu-central-2.wasabisys.com/ethblockchain/eth_trans_20170718.csv")
  eth_20170719 = pd.read_csv("https:/

## Data preprocessing

In [3]:
df = data_pre_processing(data)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_drop.drop(columns=['check_hash'], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_no_missing_date.drop(['receipt_status', 'max_fee_per_gas', 'max_priority_fee_per_gas', 'transaction_type','receipt_contract_address'],axis=1,inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_nodup['block_timestamp_str'] = data_nodup['block_timestamp'].astype(str)
A value is trying to be set 

In [4]:
!nvidia-smi

Sat Jul 15 12:21:38 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Quadro RTX 8000                 Off| 00000000:3B:00.0 Off |                  Off |
|  0%   48C    P8               26W / 260W|     10MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                 Off| 00000000:5E:00.0 Off |  

In [5]:
# Drop target features (comments, flags)
train_data = df.drop(['comment_from_address_darklst','date_mal_trans_from', 'mal_trans_from', 'comment_to_address_darklst',
                    'date_mal_trans_to', 'mal_trans_to','transaction_flag1', 'transaction_flag2','attack_descr', 'attack_date'],axis=1)


# Keep only numeric features
train_data = train_data[['nonce', 'transaction_index',
       'value', 'gas', 'gas_price', 'receipt_cumulative_gas_used',
       'receipt_gas_used', 'block_number','receipt_effective_gas_price', 
       'dates', 'gas_price_unit','value_div_gas', 'from_address_count', 'to_address_count',
       'block_count', 'degree_centrality_from', 'degree_centrality_to',
       'in_degree_adr_to', 'out_degree_adr_to', 'in_degree_adr_from','out_degree_adr_from','transaction_flag']]

In [6]:
Xtrain = train_data.drop(['transaction_flag'],axis=1)
Xtrain.set_index('dates', inplace=True)
ytrain = train_data[['transaction_flag','dates']]
ytrain.set_index('dates', inplace=True)

In [7]:
Xtrain.head()

dates,nonce,transaction_index,value,gas,gas_price,receipt_cumulative_gas_used,receipt_gas_used,block_number,receipt_effective_gas_price,gas_price_unit,value_div_gas,from_address_count,to_address_count,block_count,degree_centrality_from,degree_centrality_to,in_degree_adr_to,out_degree_adr_to,in_degree_adr_from,out_degree_adr_from
2017-07-31,5,1,0.0,53759.0,60000000000,74758.0,53758.0,4098983.0,60000000000.0,1116092.0,0.0,3,4858,171,2e-06,0.003502,4858,0,0,3
2017-07-31,2596723,172,5.45982e+17,50000.0,4000000000,5307522.0,21000.0,4101279.0,4000000000.0,80000.0,10919640000000.0,357662,32,180,0.257829,4.6e-05,32,32,6,357662
2017-07-31,2591374,15,2.077429e+17,50000.0,4000000000,336000.0,21000.0,4099488.0,4000000000.0,80000.0,4154859000000.0,357662,43,93,0.257829,6.2e-05,43,43,6,357662
2017-07-31,2591369,10,2.178482e+17,50000.0,4000000000,231000.0,21000.0,4099488.0,4000000000.0,80000.0,4356964000000.0,357662,56,93,0.257829,5.1e-05,56,15,6,357662
2017-07-31,2591373,14,5.256675e+16,50000.0,4000000000,315000.0,21000.0,4099488.0,4000000000.0,80000.0,1051335000000.0,357662,60,93,0.257829,4.4e-05,60,1,6,357662


In [8]:
ytrain.head()

dates,transaction_flag
2017-07-31,0
2017-07-31,0
2017-07-31,0
2017-07-31,0
2017-07-31,0


In [9]:
ytrain['transaction_flag'].value_counts()

transaction_flag
0    7457565
1       3290
Name: count, dtype: int64

## Apply the StandardScaler
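
The notebook imports `StandardScaler`, but the cells shown here pass `Xtrain_subset` to DBSCAN unscaled. Because DBSCAN measures Euclidean distance against a single `eps`, columns with very large values (such as `gas_price`) can dominate the neighborhood test. The sketch below shows what a scaling step might look like. The frame `X_demo` is a hypothetical stand-in with made-up values, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for Xtrain_subset: two columns on very different scales.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "gas_price": rng.uniform(1e9, 6e10, size=1000),
    "transaction_index": rng.integers(0, 200, size=1000),
})

# Standardize each column to zero mean and unit variance,
# keeping the column names and the index of the input frame.
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_demo), columns=X_demo.columns, index=X_demo.index
)

print(X_scaled.describe().loc[["mean", "std"]])
```

In the notebook itself the same pattern would presumably be applied to `Xtrain_subset` before calling `DBSCAN().fit(...)`. A different `eps` may then be needed, since all distances change.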

## Fit the DBSCAN

In [10]:
Xtrain_subset = Xtrain.iloc[0:100000]

In [11]:
clustering = DBSCAN().fit(Xtrain_subset)

In [13]:
clustering.labels_

array([ -1,  -1,  -1, ..., 599,  -1,  -1])

In [20]:
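# fit_predict returns the label array directly, so store it under its own name
# instead of overwriting the fitted `clustering` estimator
labels_array = DBSCAN().fit_predict(Xtrain_subset)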

In [19]:
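# Wrap the fitted labels in a Series so the clusters can be counted
labels = pd.Series(clustering.labels_)
pd.unique(labels)        # distinct labels; -1 marks noise
len(pd.unique(labels))   # number of distinct labels, noise included
labels.value_counts()    # points per label, largest first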

-1      80835
 31      2637
 116     1730
 590      878
 30       525
        ...  
 394        5
 581        5
 184        5
 60         5
 600        5
Name: count, Length: 686, dtype: int64
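
In the output above, the 686 distinct labels include -1, so DBSCAN found 685 clusters, and 80835 of the 100000 points (about 81%) were labeled as noise. A small helper makes that summary explicit. The function name `summarize_labels` and the demo array are illustrative and not part of the notebook.

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, fraction of points labeled as noise)."""
    labels = np.asarray(labels)
    noise_mask = labels == -1
    # Count distinct cluster ids, leaving out the noise label -1.
    n_clusters = len(np.unique(labels[~noise_mask]))
    return n_clusters, noise_mask.mean()

# Small made-up label array: two clusters (0 and 1) plus three noise points.
demo_labels = [-1, 0, 0, 1, -1, 1, 1, -1]
n_clusters, noise_fraction = summarize_labels(demo_labels)
print(n_clusters, noise_fraction)
```

Applied to the notebook's result, the call would be `summarize_labels(clustering.labels_)`, assuming `clustering` is the fitted estimator.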

In [15]:
clustering.get_params()

{'algorithm': 'auto',
 'eps': 0.5,
 'leaf_size': 30,
 'metric': 'euclidean',
 'metric_params': None,
 'min_samples': 5,
 'n_jobs': None,
 'p': None}

In [12]:
!nvidia-smi

Sat Jul 15 12:22:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Quadro RTX 8000                 Off| 00000000:3B:00.0 Off |                  Off |
|  0%   47C    P8               27W / 260W|     10MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                 Off| 00000000:5E:00.0 Off |  