# Relay Churn Rate

The goal of this project is to build some intuition about the current churn rates within the Tor network.

The Tor network is an open network, which means that anybody with access to a networked computer with a semi-static IP address can set it up to participate as a relay in the network.

We currently don't have a good overview of how long relays stay in the Tor network or how many new nodes join the network each month.

Before we can dive into this, we need to be clear about some definitions. When we talk about "Relay Churn Rate" here, we are interested in how many nodes leave the network. We look at the network on a monthly basis, so a relay that has participated in the network for even a single day is considered to have been active during that month.

We are interested in figuring out:

1. How many never before seen (new) relays are joining the Tor network each month?
2. How many relays have left the network in the given month (but were active last month)?
3. How many relays are returning from having been away from the network for an entire month?
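
These three questions map onto plain set operations. The following is a minimal sketch on made-up data: the single-letter strings stand in for relay fingerprints, and `classify_month` is a hypothetical helper used only for illustration.

```python
def classify_month(current, previous, seen_before):
    """Split one month's relays into new, returning, and lost sets.

    current:     relays active in this month
    previous:    relays active in the month before
    seen_before: every relay seen in any earlier month
    """
    # New: never seen in any earlier month.
    new = current - seen_before
    # Returning: known from an earlier month, but absent last month.
    returning = (current - previous) & seen_before
    # Lost: active last month, absent this month.
    lost = previous - current
    return new, returning, lost


# Toy history: three months of made-up relay identifiers.
months = [{"A", "B", "C"}, {"A", "D"}, {"A", "B", "E"}]

seen = set()
previous = set()
summary = []
for current in months:
    new, returning, lost = classify_month(current, previous, seen)
    summary.append((sorted(new), sorted(returning), sorted(lost)))
    seen |= current
    previous = current

for row in summary:
    print(row)
```

In the third toy month, `B` counts as returning because it was active in the first month, missing in the second, and back in the third.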

Additionally, we are interested in learning about the following:

1. How many unique relays has the Tor network seen in total during its lifetime?
2. What are the lifetime properties of a relay? For example, what is the average lifetime of a relay?
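
One possible way to approach these two questions is to record, for each relay, the first and last month it was seen and the number of months in which it was active. The sketch below assumes the data is available as a mapping from a `(year, month)` tuple to a set of relay fingerprints, uses made-up fingerprints, and treats "lifetime" as the count of active months, which is only one of several reasonable definitions. `relay_lifetimes` is a hypothetical helper.

```python
from statistics import mean


def relay_lifetimes(months):
    """Return {fingerprint: (first_seen, last_seen, active_months)}."""
    stats = {}
    # Visit the months in chronological order so first/last are correct.
    for key in sorted(months):
        for fingerprint in months[key]:
            if fingerprint in stats:
                first, _, count = stats[fingerprint]
                stats[fingerprint] = (first, key, count + 1)
            else:
                stats[fingerprint] = (key, key, 1)
    return stats


# Made-up data: three months, three relays.
months = {
    (2020, 1): {b"relay-a", b"relay-b"},
    (2020, 2): {b"relay-a"},
    (2020, 3): {b"relay-a", b"relay-c"},
}

lifetimes = relay_lifetimes(months)
total_unique_relays = len(lifetimes)
average_active_months = mean(count for _, _, count in lifetimes.values())
print(total_unique_relays, average_active_months)
```

Counting active months ignores gaps: a relay seen in January and December of the same year has two active months but a much longer span between first and last sighting, so both numbers are kept.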

## Implementation

The first thing we have to do is to build up the data structure from the historical data in the archives. We use the Stem library to parse the historical data into a Python data structure that we can work with.

In [1]:
from stem.descriptor import parse_file
from concurrent.futures import ThreadPoolExecutor

import os
import binascii
import logging

DATA_DIR = "/home/user/stem-collector-data"

logging.basicConfig(format='%(asctime)s %(message)s', level=logging.DEBUG)

Next we build our data structure that we are going to use for the computations later. The resulting structure will be a map of a `(year, month)` tuple to a set of relay fingerprints.

This part is going to read around 120 GB of data, so it takes around 6 hours on my laptop. The pickled result is around 63 MB of data.

In [2]:
def parse_one_consensus(path):
    result = set()
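    # File names look like "consensuses-2007-10.tar". str.strip() removes
    # character sets rather than prefixes, so split the name explicitly.
    _, year, month = os.path.splitext(path)[0].split("-")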
    
    year = int(year)
    month = int(month)
    
    logging.info("Parsing: {}".format(path))
    
    for relay in parse_file("{}/{}".format(DATA_DIR, path)):
        fingerprint = binascii.unhexlify(relay.fingerprint)
        
        if "Running" not in relay.flags:
            continue
            
        result.add(fingerprint)
    
    return (year, month, result) 

files = sorted(os.listdir(DATA_DIR))
result = {}

with ThreadPoolExecutor(max_workers = 4) as e:
    for year, month, relays in e.map(parse_one_consensus, files):
        result[year, month] = relays

2020-05-07 01:06:55,213 Parsing: consensuses-2007-10.tar
2020-05-07 01:06:55,218 Parsing: consensuses-2007-11.tar
2020-05-07 01:06:55,235 Parsing: consensuses-2007-12.tar
2020-05-07 01:06:55,289 Parsing: consensuses-2008-01.tar
2020-05-07 01:08:03,826 Parsing: consensuses-2008-02.tar
2020-05-07 01:14:00,814 Parsing: consensuses-2008-03.tar
2020-05-07 01:15:27,014 Parsing: consensuses-2008-04.tar
2020-05-07 01:16:13,097 Parsing: consensuses-2008-05.tar
2020-05-07 01:16:40,300 Parsing: consensuses-2008-06.tar
2020-05-07 01:22:28,139 Parsing: consensuses-2008-07.tar
2020-05-07 01:23:19,282 Parsing: consensuses-2008-08.tar
2020-05-07 01:23:43,578 Parsing: consensuses-2008-09.tar
2020-05-07 01:25:38,198 Parsing: consensuses-2008-10.tar
2020-05-07 01:28:50,984 Parsing: consensuses-2008-11.tar
2020-05-07 01:29:53,821 Parsing: consensuses-2008-12.tar
2020-05-07 01:30:15,981 Parsing: consensuses-2009-01.tar
2020-05-07 01:30:18,932 Parsing: consensuses-2009-02.tar
2020-05-07 01:33:29,895 Parsing

2020-05-07 10:59:20,614 Parsing: consensuses-2019-10.tar
2020-05-07 11:16:40,124 Parsing: consensuses-2019-11.tar
2020-05-07 11:21:08,716 Parsing: consensuses-2019-12.tar
2020-05-07 11:23:29,778 Parsing: consensuses-2020-01.tar
2020-05-07 11:23:54,851 Parsing: consensuses-2020-02.tar
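
Because the parsing step takes hours, it is worth storing the result once and reloading it for later analysis. Below is a minimal sketch using the standard `pickle` module. The file name `relays.pickle` and the helpers `save_result` and `load_result` are assumptions for illustration; the step that actually saved the data is not shown here.

```python
import os
import pickle
import tempfile


def save_result(result, path):
    """Write the (year, month) -> fingerprint-set mapping to disk."""
    with open(path, "wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_result(path):
    """Read a mapping previously written by save_result()."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Tiny stand-in for the real data: two made-up 20-byte fingerprints.
example = {(2007, 10): {bytes(20), bytes([1]) * 20}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "relays.pickle")  # hypothetical file name
    save_result(example, path)
    restored = load_result(path)

print(restored == example)
```

Only load pickle files that you produced yourself, since unpickling untrusted data can execute arbitrary code.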


In [7]:
result

{(2007,
  10): {b")z\xd3\x93\xf5\xd4G\xb5\xb6\xa6\xa9\xc7\x92\x99\xe3\x88m'\x9b`", b'\x9b\x10\xc5\xdc|j\xaf\xb4\xd1s\xa7\xd8S}\x8f\xb5\x16\xd8+\xe6', ...

We convert the data we have collected into a pandas DataFrame using various set operations:

In [94]:
import datetime

import pandas as pd

pd.set_option("display.max_rows", 500)

all_seen_relays = set()

last_month = set()

result = {
    "date": [],
    "relay_count_sum": [],
    "relay_count": [],
    "relay_new_count": [],
    "relay_return_count": [],
    "relay_lost_count": [],
}

# `data` maps a (year, month) tuple to a dict of fingerprint sets, one per kind.
for key in sorted(data.keys()):
    for kind in sorted(data[key].keys()):
        current_month = data[key][kind]
        year, month = key
    
        number_of_relays_seen = len(current_month)
        relay_new_count = len(current_month - all_seen_relays)
        
        # A returning relay is a relay we have seen before, but not last month.
        relay_return_count = len((current_month - last_month) & all_seen_relays)
        
        # A lost relay is one that was there last month, but not in current month.
        relay_lost_count = len(last_month - current_month)
        
        all_seen_relays.update(current_month)
        total_number_of_relays_seen = len(all_seen_relays)
    
        result["date"].append(datetime.datetime(year, month, 1))
        result["relay_count"].append(number_of_relays_seen)
        result["relay_count_sum"].append(total_number_of_relays_seen)
        result["relay_new_count"].append(relay_new_count)
        result["relay_return_count"].append(relay_return_count)
        result["relay_lost_count"].append(relay_lost_count)

        last_month = current_month
    
frame = pd.DataFrame.from_dict(result)
frame

| | date | relay_count_sum | relay_count | relay_new_count | relay_return_count | relay_lost_count |
|---|---|---|---|---|---|---|
| 0 | 2007-10-01 | 3239 | 3239 | 3239 | 0 | 0 |
| 1 | 2007-11-01 | 9273 | 8487 | 6034 | 0 | 786 |
| 2 | 2007-12-01 | 14650 | 8866 | 5377 | 37 | 5035 |
| 3 | 2008-01-01 | 20364 | 9462 | 5714 | 195 | 5313 |
| 4 | 2008-02-01 | 24337 | 7803 | 3973 | 243 | 5875 |
| 5 | 2008-03-01 | 28066 | 7571 | 3729 | 344 | 4305 |
| 6 | 2008-04-01 | 31171 | 6790 | 3105 | 315 | 4201 |
| 7 | 2008-05-01 | 34645 | 7087 | 3474 | 329 | 3506 |
| 8 | 2008-06-01 | 37934 | 6680 | 3289 | 340 | 4036 |
| 9 | 2008-07-01 | 41209 | 6699 | 3275 | 357 | 3613 |


The columns above have the following meanings:

- *date*: The first day of the given month.
- *relay_count_sum*: The cumulative number of unique relays seen in the consensus up to and including the given month.
- *relay_count*: The number of unique relays seen in the consensus during the given month.
- *relay_new_count*: The number of relays seen in the given month that had never been seen before.
- *relay_return_count*: The number of relays that had been seen before, were absent last month, and have now returned.
- *relay_lost_count*: The number of relays that were seen last month but are gone in the current month.
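
These definitions imply two consistency checks: each month's relay count equals last month's count minus the lost relays plus the new and returning ones, and the cumulative count grows only by the new relays. The sketch below verifies both on the first four rows of the table and then derives a monthly churn rate. Defining the churn rate as the lost relays divided by the previous month's relay count is an assumption made here for illustration, not a definition taken from the analysis itself.

```python
# Rows copied from the table above:
# (month, relay_count_sum, relay_count, new, returned, lost)
rows = [
    ("2007-10", 3239, 3239, 3239, 0, 0),
    ("2007-11", 9273, 8487, 6034, 0, 786),
    ("2007-12", 14650, 8866, 5377, 37, 5035),
    ("2008-01", 20364, 9462, 5714, 195, 5313),
]

churn = {}
for prev, cur in zip(rows, rows[1:]):
    _, prev_sum, prev_count, _, _, _ = prev
    month, cur_sum, cur_count, new, returned, lost = cur

    # Every relay active this month either stayed, is new, or returned.
    assert cur_count == prev_count - lost + new + returned
    # The running total grows only by never-before-seen relays.
    assert cur_sum == prev_sum + new

    # Assumed definition: share of last month's relays that are now gone.
    churn[month] = lost / prev_count

for month, rate in churn.items():
    print(f"{month}: {rate:.1%}")
```

The first month has no churn value because there is no earlier month to compare against.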