# Explore Tor Relay Emails and Crypto Addresses

This notebook downloads the current list of Tor relays and bridges from Onionoo, then extracts email addresses and cryptocurrency payment addresses from each relay's contact field.

The goal is to determine how many Tor relays are contactable, and how many publish crypto donation details.

In [9]:
import pandas as pd
import requests
import re

## Step 1: Download the database

In [10]:
j = requests.get('https://onionoo.torproject.org/details?limit=100000').json()
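
Onionoo also accepts query parameters that narrow the response. As a rough sketch, the URL below asks only for relay documents and only for the fields this notebook uses later. The `type`, `fields`, and `limit` parameters belong to the Onionoo protocol; the helper name `buildDetailsUrl` and the chosen field list are assumptions made for illustration.

```python
from urllib.parse import urlencode

ONIONOO_DETAILS = "https://onionoo.torproject.org/details"

def buildDetailsUrl(fields, limit=100000):
    """Build a details URL restricted to relays and to the given fields (hypothetical helper)."""
    params = {
        "type": "relay",             # skip bridge documents
        "fields": ",".join(fields),  # only return these fields per relay
        "limit": limit,
    }
    # safe="," keeps the commas in the field list readable instead of %2C
    return f"{ONIONOO_DETAILS}?{urlencode(params, safe=',')}"

url = buildDetailsUrl(["nickname", "fingerprint", "contact", "flags"])
print(url)
# The download itself would then be: j = requests.get(url, timeout=60).json()
```

With `type=relay` the `bridges` list in the response should come back empty, which is acceptable here because the bridge records are not used.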

## Step 2: Parse the database

In [11]:
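relays = pd.DataFrame(j['relays'])
bridges = pd.DataFrame(j['bridges']) # not used for now; bridge records do not contain a contact field

# Relays without a contact line come back as NaN. Fill with '' first so the
# regexes below always receive a str (astype(str) alone would yield 'nan').
relays['contact'] = relays['contact'].fillna('').astype(str)

# Flag exit relays so the later tables can be split by exit / non-exit
relays['is_exit'] = relays['flags'].apply(lambda val: 'Exit' in val)

relays # show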

| | nickname | fingerprint | or_addresses | last_seen | last_changed_address_or_port | first_seen | running | flags | country | country_name | ... | measured | exit_addresses | overload_general_timestamp | alleged_family | exit_policy_v6_summary | indirect_family | unverified_host_names | dir_address | unreachable_or_addresses | is_exit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | seele | 000A10D43011EA4928A35F610405F92B4433B4DC | [104.53.221.159:9001] | 2022-07-22 21:00:00 | 2022-01-04 00:00:00 | 2014-08-23 06:00:00 | True | [Fast, HSDir, Running, Stable, V2Dir, Valid] | us | United States of America | ... | True | | | | | | | | | False |
| 1 | CalyxInstitute14 | 0011BD2485AD45D984EC4159C88FC066E5E3300E | [162.247.74.201:443] | 2022-07-22 21:00:00 | 2022-01-27 04:00:00 | 2014-12-23 17:00:00 | True | [Exit, Fast, Guard, Running, Stable, V2Dir, Va... | us | United States of America | ... | True | [162.247.74.201] | | | | | | | | True |
| 2 | earthpig | 00152DFAB972A3F8B08648E14A7B098CC29483E9 | [41.216.189.100:9001] | 2022-07-22 21:00:00 | 2022-02-18 14:00:00 | 2022-01-04 02:00:00 | True | [Fast, Running, Stable, V2Dir, Valid] | de | Germany | ... | True | | 1.658477e+12 | [0FDA760D03D6E2C848C52DF1FAEF47F65250D107, 4FD... | | | | | | False |
| 3 | skylarkRelay | 00240ECB2B535AA4C1E1874D744DFA6AF2E5E941 | [95.111.230.178:443] | 2022-07-22 21:00:00 | 2020-08-28 11:00:00 | 2020-08-28 11:00:00 | True | [Fast, Guard, HSDir, Running, Stable, V2Dir, V... | de | Germany | ... | True | | | | | | | | | False |
| 4 | StarAppsMobley | 00283B5564E3072DCDDAB31D6EF622DD49BF524F | [195.15.242.99:9001, [2001:1600:10:100::201]:9... | 2022-07-22 21:00:00 | 2021-11-28 19:00:00 | 2021-10-01 17:00:00 | True | [Fast, Guard, Running, Stable, V2Dir, Valid] | ch | Switzerland | ... | True | | | | | | | | | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7329 | DarkfireCrypt | FFF3031D13AAD58EE849AD4B738051DE1E51E7D8 | [31.7.175.1:9090] | 2022-07-22 21:00:00 | 2022-06-19 17:00:00 | 2022-04-14 00:00:00 | True | [Fast, HSDir, Running, Stable, V2Dir, Valid] | bg | Bulgaria | ... | True | | 1.658498e+12 | | | | [31.7.175.1.via.itvnet.net] | | | False |
| 7330 | DoctorWho | FFF599954C3821A28620E95C08CBDC6245E9DDAA | [195.154.200.68:9001] | 2022-07-22 21:00:00 | 2022-01-07 23:00:00 | 2022-01-07 23:00:00 | True | [Fast, Guard, HSDir, Running, Stable, V2Dir, V... | fr | France | ... | True | | 1.658491e+12 | | | | | | | False |
| 7331 | ddetor2 | FFF78C44BA6E6B6F7525095BBE14EF7CBEB89744 | [144.76.75.137:9001, [2a01:4f8:191:9388::2]:9001] | 2022-07-22 21:00:00 | 2017-10-30 15:00:00 | 2014-04-24 21:00:00 | True | [Fast, HSDir, Running, Stable, V2Dir, Valid] | de | Germany | ... | True | | | [89FF4A0818D8BE9F8B71338D96A6B854F77F9FEC] | | | | 144.76.75.137:9030 | | False |
| 7332 | viennaOnTheWalk | FFFA91F18663F8CCE8E725A2493F85B390B8D337 | [81.169.159.28:29001, [2a01:238:423d:b500:d308... | 2022-07-22 21:00:00 | 2022-05-21 09:00:00 | 2022-05-21 09:00:00 | True | [Fast, Running, V2Dir, Valid] | de | Germany | ... | True | | | | | | | | | False |


We now have a DataFrame of relays (`relays`). Its columns follow the Onionoo details document format (see https://metrics.torproject.org/onionoo.html#examples); a single record looks like this:

```
nickname                                                         CalyxInstitute14
fingerprint                              0011BD2485AD45D984EC4159C88FC066E5E3300E
or_addresses                                                 [162.247.74.201:443]
last_seen                                                     2021-10-11 06:00:00
last_changed_address_or_port                                  2018-06-28 04:00:00
first_seen                                                    2014-12-23 17:00:00
running                                                                      True
flags                           [Exit, Fast, Guard, HSDir, Running, Stable, V2...
country                                                                        us
country_name                                             United States of America
as                                                                         AS4224
consensus_weight                                                            50000
last_restarted                                                2021-05-23 17:03:43
bandwidth_rate                                                         1073741824
bandwidth_burst                                                        1073741824
observed_bandwidth                                                       36433858
advertised_bandwidth                                                     36433858
exit_policy                     [reject 0.0.0.0/8:*, reject 169.254.0.0/16:*, ...
exit_policy_summary             {'accept': ['20-23', '43', '53', '79-81', '88'...
contact                         Nicholas Merrill <nick AT calyx dot com> BTC -...
platform                                                     Tor 0.4.5.7 on Linux
version                                                                   0.4.5.7
version_status                                                        recommended
effective_family                [0011BD2485AD45D984EC4159C88FC066E5E3300E, 0B5...
consensus_weight_fraction                                                 0.00038
guard_probability                                                             0.0
middle_probability                                                            0.0
exit_probability                                                         0.001154
recommended_version                                                          True
measured                                                                     True
exit_addresses                                                   [162.247.74.201]
dir_address                                                     162.247.74.201:80
verified_host_names                        [kunstler.tor-exit.calyxinstitute.org]
as_name                                                                       NaN
alleged_family                                                                NaN
indirect_family                                                               NaN
exit_policy_v6_summary                                                        NaN
overload_general_timestamp                                                    NaN
hibernating                                                                   NaN
unverified_host_names                                                         NaN
unreachable_or_addresses                                                      NaN
```

## Step 3: See how many normal emails are present

In [12]:
# Permissive pattern for plainly written addresses; alternative: https://gist.github.com/dideler/5219706
emailRegex = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")

# De-duplicate the matches per relay, then count them
relays['found_normal_emails'] = relays['contact'].apply(lambda v: list(set(emailRegex.findall(v))))
relays['found_normal_emails_count'] = relays['found_normal_emails'].apply(len)

# Look at what we found
relays[['nickname', 'found_normal_emails', 'found_normal_emails_count']]

# Print stats on basic finding of emails
#relays[['found_normal_emails_count', 'nickname']].groupby(['found_normal_emails_count'], as_index=False).count()

pd.pivot_table(relays, index='found_normal_emails_count', values='nickname', columns=['is_exit'], aggfunc='count')

| found_normal_emails_count | is_exit = False | is_exit = True |
| --- | --- | --- |
| 0 | 3967 | 1418 |
| 1 | 1578 | 279 |
| 2 | 49 | 43 |
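
To make the behaviour of `emailRegex` concrete, here is a small standalone check. The sample strings are illustrative: they are assembled from values that appear elsewhere in this notebook and are not necessarily the relays' exact contact lines.

```python
import re

# Same pattern as in the cell above
emailRegex = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")

samples = [
    "abuse@for-privacy.net, admin@for-privacy.net",  # two plain addresses
    "Nicholas Merrill <nick AT calyx dot com>",      # obfuscated: no "@" present
    "your@e-mail",                                   # no dot after the domain part
]

# sorted() only makes the printed order deterministic
results = [sorted(set(emailRegex.findall(contact))) for contact in samples]
for contact, found in zip(samples, results):
    print(f"{contact!r} -> {found}")
```

Only the first string yields matches. The obfuscated form in the second string is what Step 4 tries to decode.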


## Step 4: Try to do fancy parsing on the emails
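Many operators obfuscate their address to deter spam harvesters (for example `nick AT calyx dot com`), so the contact string has to be decoded before the regex from Step 3 can match it.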

In [13]:
from multiprocesspandas import applyparallel # assumed origin of Series.apply_parallel; the original cell never imports it

def doEmailReplaces(contact:str, static_data:list=[re]) -> str:
    """ Replaces common anti-spam substitutions ("at", "dot", ...) with the real characters. Very slow for now. """
    re = static_data[0] # the module is passed in explicitly for apply_parallel

    # Tokens wrapped in 1-4 non-alphanumeric characters, e.g. "(at)" or "[dot]"
    contact = re.sub(r'\s*[^a-zA-Z\d]{1,4}(at|a|ascii64|@t|@|att|a@t|4t|aatt|atatata)[^a-zA-Z\d]{1,4}\s*', '@', contact, flags=re.IGNORECASE)
    contact = re.sub(r'\s*[^a-zA-Z\d]{1,4}(dot|d|ascii46|d0t|dawt|doot|blot)[^a-zA-Z\d]{1,4}\s*', '.', contact, flags=re.IGNORECASE)
    contact = re.sub(r'\s*[^a-zA-Z\d]{1,4}(dash|minus|hyphen)[^a-zA-Z\d]{1,4}\s*', '-', contact, flags=re.IGNORECASE)
    # Bare words separated only by whitespace, e.g. "nick AT calyx dot com"
    contact = re.sub(r'\s+(at|ascii64|@t|@|att|a@t|4t|atatata|with)\s+', '@', contact, flags=re.IGNORECASE)
    contact = re.sub(r'\s+(dot|dt|ascii46|d0t|dawt|blot|dottydotdot)\s+', '.', contact, flags=re.IGNORECASE)
    contact = re.sub(r'\s+(dash|minus|hyphen)\s+', '-', contact, flags=re.IGNORECASE)
    # Literal special cases
    contact = contact.replace('[]', '@') # per https://nusenu.github.io/ContactInfo-Information-Sharing-Specification/
    contact = contact.replace('[.]', '.')
    contact = contact.replace('dotcom', '.com')
    contact = contact.replace('google mail', 'gmail.com')
    contact = contact.replace(' host: ', '@')
    contact = contact.replace('protonmail', '@protonmail.com ').replace('@@', '@')
    return contact

relays['replaced_contact'] = relays['contact'].apply_parallel(doEmailReplaces, num_processes=2, static_data=[re])

relays['found_emails'] = relays['replaced_contact'].apply(lambda v: list(set(emailRegex.findall(v))))
relays['found_emails_count'] = relays['found_emails'].apply(lambda v: len(v))

pd.pivot_table(relays, index='found_emails_count', values='nickname', columns=['is_exit'], aggfunc='count')

| found_emails_count | is_exit = False | is_exit = True |
| --- | --- | --- |
| 0 | 2049 | 178 |
| 1 | 3369 | 1361 |
| 2 | 163 | 195 |
| 3 | 13 | 6 |
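
The decoding idea can be seen in isolation with a much smaller rule set. The sketch below is a simplified subset, not the full `doEmailReplaces`: it keeps only one "at" rule and one "dot" rule, and the names `RULES` and `deobfuscate` are made up for this illustration.

```python
import re

# Same email pattern as in Step 3
emailRegex = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")

# Simplified subset: a token surrounded by 1-4 non-alphanumeric characters
RULES = [
    (re.compile(r"[^a-zA-Z\d]{1,4}at[^a-zA-Z\d]{1,4}", re.IGNORECASE), "@"),
    (re.compile(r"[^a-zA-Z\d]{1,4}dot[^a-zA-Z\d]{1,4}", re.IGNORECASE), "."),
]

def deobfuscate(contact: str) -> str:
    """Apply each substitution rule in order (hypothetical helper)."""
    for pattern, replacement in RULES:
        contact = pattern.sub(replacement, contact)
    return contact

decoded = deobfuscate("Nicholas Merrill <nick AT calyx dot com>")
print(decoded)                      # Nicholas Merrill <nick@calyx.com>
print(emailRegex.findall(decoded))  # ['nick@calyx.com']

# A contact that names only a website gains a dot but still has no "@"
decoded_site = deobfuscate("Nicholas Johnson <see nicksphere dot ch>")
print(decoded_site)                      # Nicholas Johnson <see nicksphere.ch>
print(emailRegex.findall(decoded_site))  # []
```

Rule order matters: the "at" rules run before the "dot" rules, and every rule rewrites the whole string, so an early rule can consume delimiters that a later rule would have needed.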


## Step 5: Look at what email patterns we're still missing

In [14]:

# Take a reproducible sample, keep the relays where no email was recovered,
# and drop empty or very short contact strings
remainingContacts = relays.sample(300, random_state=1113)
remainingContacts = remainingContacts[remainingContacts['found_emails_count'] == 0][['contact', 'replaced_contact']]
remainingContacts = remainingContacts[remainingContacts['contact'].apply(lambda i: not pd.isna(i) and len(i) > 5)]

remainingContacts

| | contact | replaced_contact |
| --- | --- | --- |
| 4253 | `ididntedittheconfig` | `ididntedittheconfig` |
| 134 | `abuse(at_sign)zwiebelmett(dot)xyz` | `abuse@sign)zwiebelmett.xyz` |
| 3257 | `Contact (at)SilSte on Twitter [tor-relay.co]` | `Contact@SilSte on Twitter [tor-relay.co]` |
| 3430 | `Izy <tor lain cx>` | `Izy <tor lain cx>` |
| 4636 | `tor -----symbol for email---- zigh in et ...` | `tor -----symbol for email---- zigh in et ...` |
| 5561 | `This relay is named after: https://youtu.be/EW...` | `This relay is named after: https://youtu.be/EW...` |
| 5069 | `Nicholas Johnson <see nicksphere dot ch>` | `Nicholas Johnson <see nicksphere.ch>` |
| 4130 | `robgjansen 0xF6264AB29F8AEDAA` | `robgjansen 0xF6264AB29F8AEDAA` |
| 6616 | `tor <email sign thingy> bluuurgh.com` | `tor <email sign thingy> bluuurgh.com` |
| 1083 | `your@e-mail` | `your@e-mail` |
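
Row 134 shows a pattern that is close to working: the generic "at" rule consumes `(at_` and leaves `sign)` behind. One possible fix, sketched here as an assumption and not as part of the notebook's code, is a dedicated rule for spelled-out tokens such as `(at_sign)` that runs before the generic rules. `EXTRA_RULES` and `applyExtraRules` are hypothetical names.

```python
import re

# Hypothetical extra rules for fully spelled-out, parenthesised tokens
EXTRA_RULES = [
    (re.compile(r"\(\s*at[ _-]?sign\s*\)", re.IGNORECASE), "@"),
    (re.compile(r"\(\s*dot\s*\)", re.IGNORECASE), "."),
]

def applyExtraRules(contact: str) -> str:
    """Rewrite "(at_sign)" and "(dot)" tokens; meant to run before the generic rules."""
    for pattern, replacement in EXTRA_RULES:
        contact = pattern.sub(replacement, contact)
    return contact

fixed = applyExtraRules("abuse(at_sign)zwiebelmett(dot)xyz")
print(fixed)  # abuse@zwiebelmett.xyz
```

Other rows in the sample, such as `ididntedittheconfig` or `your@e-mail`, contain no real address, so no rule can recover one from them.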


## Step 6: Check for crypto addresses

In [15]:
cryptoRegex = {
    "btc": "[13][a-km-zA-HJ-NP-Z1-9]{25,34}",
    "bch": "((bitcoincash|bchreg|bchtest):)?(q|p)[a-z0-9]{41}", # TODO confirm that this works right with .group()
    "eth": "0x[a-fA-F0-9]{40}",
    "ltc": "[LM3][a-km-zA-HJ-NP-Z1-9]{26,33}",
    "doge": "D{1}[5-9A-HJ-NP-U]{1}[1-9A-HJ-NP-Za-km-z]{32}",
    "dash": "X[1-9A-HJ-NP-Za-km-z]{33}",
    "xmr": "4[0-9AB][1-9A-HJ-NP-Za-km-z]{93}",
    "neo": "A[0-9a-zA-Z]{33}",
    "xrp": "r[0-9a-zA-Z]{33}"
} # Source: https://github.com/k4m4/cryptaddress-validator/blob/master/index.js
# Note: the patterns are unanchored, so .search() can also match inside a longer token
cryptoRegex = {crypto: re.compile(pattern) for crypto, pattern in cryptoRegex.items()}

def whatCryptoSubstrings(contact:str) -> list:
    """ Get a list of what crypto addresses are present. Never returns two of the same crypto (only takes the first addr). """
    cryptosPresent = []
    for crypto in cryptoRegex.keys():
        if searchResult := cryptoRegex[crypto].search(contact):
            cryptosPresent.append({'crypto': crypto, 'address': searchResult.group()})
    return cryptosPresent

relays['found_cryptos'] = relays['contact'].apply(whatCryptoSubstrings)
relays['found_cryptos_count'] = relays['found_cryptos'].apply(lambda v: len(v)) # number of different cryptos

relays_cryptos = relays.explode('found_cryptos', ignore_index=True)

# expand dict to cols (crypto and address)
relays_cryptos = pd.concat([relays_cryptos.drop(columns=['found_cryptos']), relays_cryptos['found_cryptos'].apply(pd.Series)], axis=1)

# make a table with the number of addresses
#relays_cryptos[['crypto', 'found_cryptos_count']].groupby('crypto').count().sort_values('found_cryptos_count', ascending=False)
pd.pivot_table(relays_cryptos, index='crypto', values='nickname', columns=['is_exit'], aggfunc='count').sort_values(True, ascending=False)


| crypto | is_exit = False | is_exit = True |
| --- | --- | --- |
| btc | 212 | 374 |
| neo | 155 | 323 |
| ltc | 86 | 317 |
| xrp | 21 | 181 |
| xmr | 11 | 180 |
| doge | 14 | 32 |
| dash | 14 | 6 |
| bch | 12 | 4 |
| eth | 32 | 2 |


In [16]:
relays_cryptos[['nickname', 'found_emails', 'found_emails_count', 'crypto', 'address']][pd.notna(relays_cryptos['crypto'])]
#relays_cryptos[['nickname', 'found_emails', 'found_emails_count', 'crypto', 'address']][relays_cryptos['crypto'] == 'bch']

| | nickname | found_emails | found_emails_count | crypto | address |
| --- | --- | --- | --- | --- | --- |
| 1 | CalyxInstitute14 | [nick@calyx.com] | 1 | btc | 14wntQ8cBdnhUVfYmDjXz6PbpSSX8nCtkr |
| 5 | ForPrivacyNET | [abuse@for-privacy.net, admin@for-privacy.net] | 2 | neo | A2AAD5A0DEF92D9DFE5442A58226EF5149 |
| 13 | right2 | [] | 0 | btc | 1AKfiFWajSckVrArTVh21KkdPuegordE3E |
| 15 | Quintex13 | [john@quintex.com] | 1 | btc | 1Ab93nZzqBhhoTetUfzjXxpNjUoEQ84Jrt |
| 16 | Quintex13 | [john@quintex.com] | 1 | ltc | 3nZzqBhhoTetUfzjXxpNjUoEQ84Jrt |
| ... | ... | ... | ... | ... | ... |
| 8424 | Quetzalcoatl | [Quetzalcoatl_relays@protonmail.com] | 1 | xrp | r8cFG8T5F7fiCziV1fS21KKsbkBQmZNk5V |
| 8425 | ForPrivacyNET | [abuse@for-privacy.net, admin@for-privacy.net] | 2 | neo | A2AAD5A0DEF92D9DFE5442A58226EF5149 |
| 8427 | Hydra1 | [abuse-node49@posteo.de] | 1 | neo | A4166E368C1792C0B06D6E5222665A01C3 |
| 8433 | 1jcbcaulbxid | [admin@arviceblot.com] | 1 | btc | 3f8327468277442dd4765d8494b5f453a64 |
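
Some of these hits look like false positives. The `ltc` address reported for Quintex13 is the tail of that relay's `btc` address, and the `neo` hits shown consist only of hex digits, so they may be fragments of key fingerprints. A possible mitigation, sketched under the assumption that a real address is never directly adjacent to another letter or digit, is to wrap each pattern in lookarounds. The helper name `anchored` and the sample contact string are made up for this illustration.

```python
import re

BTC_PATTERN = "[13][a-km-zA-HJ-NP-Z1-9]{25,34}"   # as in cryptoRegex
LTC_PATTERN = "[LM3][a-km-zA-HJ-NP-Z1-9]{26,33}"  # as in cryptoRegex

def anchored(pattern: str) -> re.Pattern:
    """Reject matches that sit inside a longer alphanumeric token (hypothetical helper)."""
    return re.compile(r"(?<![A-Za-z0-9])" + pattern + r"(?![A-Za-z0-9])")

# Constructed sample: the btc address from the Quintex13 row
contact = "BTC 1Ab93nZzqBhhoTetUfzjXxpNjUoEQ84Jrt"

loose_ltc = re.compile(LTC_PATTERN).search(contact)
strict_ltc = anchored(LTC_PATTERN).search(contact)
strict_btc = anchored(BTC_PATTERN).search(contact)

print(loose_ltc.group())   # 3nZzqBhhoTetUfzjXxpNjUoEQ84Jrt (substring of the btc address)
print(strict_ltc)          # None
print(strict_btc.group())  # 1Ab93nZzqBhhoTetUfzjXxpNjUoEQ84Jrt
```

Anchoring would lower the counts in the pivot table above, so the two variants are not directly comparable.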


## Step 7: Look up the balances of these addresses

* Check whether each address has received any donations.
* Possibly search for each address online to see whether a central distributor of funds exists.

These are possible next steps and are not implemented in this notebook.
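
Before any balance lookup, it may help to collapse the hits to unique addresses: each address then needs only one query, and an address shared by several relays hints at a common operator or fund. The sketch below only does that grouping and calls no balance API. The rows are copied from the Step 6 output, and `relaysPerAddress` is a hypothetical helper.

```python
from collections import defaultdict

# (nickname, crypto, address) rows copied from the Step 6 output
rows = [
    ("CalyxInstitute14", "btc", "14wntQ8cBdnhUVfYmDjXz6PbpSSX8nCtkr"),
    ("ForPrivacyNET", "neo", "A2AAD5A0DEF92D9DFE5442A58226EF5149"),
    ("Quintex13", "btc", "1Ab93nZzqBhhoTetUfzjXxpNjUoEQ84Jrt"),
    ("ForPrivacyNET", "neo", "A2AAD5A0DEF92D9DFE5442A58226EF5149"),
]

def relaysPerAddress(rows):
    """Count how many relay rows advertise each (crypto, address) pair."""
    counts = defaultdict(int)
    for _nickname, crypto, address in rows:
        counts[(crypto, address)] += 1
    return dict(counts)

counts = relaysPerAddress(rows)
# Most widely shared addresses first
for (crypto, address), n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(crypto, address, n)
```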