# Relay Churn Rate

The goal of this project is to build some intuition about the current relay churn rates within the Tor network.

The Tor network is an open network, which means that anybody with access to a networked computer with a semi-static IP address can set it up to participate as a relay in the network.

We currently don't have a good overview of how long relays stay in the Tor network and how many new nodes join the network each month.

Before we can dive into this, we need to be clear about some definitions. When we talk about "Relay Churn Rate" here, we are interested in seeing how many nodes leave the network. We look at the network on a monthly basis, so a relay that has participated in the network for even a single day is considered to have been active during that month.

We are interested in figuring out:

1. How many never-before-seen (new) relays join the Tor network each month?
2. How many relays left the network in a given month (but were active the month before)?
3. How many relays return after having been away from the network for at least an entire month?
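Each of these three questions can be phrased as a set operation on the monthly sets of relay fingerprints. The following is a minimal sketch on made-up data; the names `months`, `seen`, `previous` and `stats` and the single-letter fingerprints are purely illustrative and do not come from the real archives.

```python
# Toy monthly snapshots; single letters stand in for relay fingerprints.
months = [
    {"A", "B", "C"},   # month 1
    {"A", "D"},        # month 2: B and C leave, D is new
    {"A", "B", "E"},   # month 3: D leaves, B returns, E is new
]

seen = set()       # every relay observed in any earlier month
previous = set()   # relays that were active in the month before
stats = []

for current in months:
    new = current - seen                     # never seen before
    lost = previous - current                # active last month, gone now
    returned = (current - previous) & seen   # seen earlier, absent last month
    stats.append((sorted(new), sorted(lost), sorted(returned)))

    # Update the running state only after the three values are computed.
    seen |= current
    previous = current

for month_number, (new, lost, returned) in enumerate(stats, start=1):
    print(month_number, "new:", new, "lost:", lost, "returned:", returned)
```

Note that `seen` must not include the current month when the counts are computed, otherwise no relay would ever be counted as new.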

Additionally, we are interested in learning about the following:

1. How many unique relays has the Tor network seen in total during its lifetime?
2. What are the lifetime properties of a relay? For example, what is the average lifetime of a relay?
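One possible way to approach the lifetime question is to count the number of months in which each fingerprint appears. The sketch below assumes a dictionary that maps `(year, month)` tuples to sets of fingerprints; the helper name `months_active` and the toy data are hypothetical, and "lifetime" is taken here to mean the number of months in which a relay was seen, which is only one of several reasonable definitions.

```python
from collections import Counter
from statistics import mean, median


def months_active(monthly_relays):
    """Count, per fingerprint, the number of months in which it was seen."""
    counts = Counter()
    for fingerprints in monthly_relays.values():
        # Each fingerprint appears at most once per monthly set, so this
        # adds exactly one month per relay per month seen.
        counts.update(fingerprints)
    return counts


# Toy data: fingerprints are shortened to single letters.
example = {
    (2007, 10): {"A", "B", "C"},
    (2007, 11): {"A", "D"},
    (2007, 12): {"A", "B", "E"},
}

lifetimes = months_active(example)
print("unique relays:", len(lifetimes))
print("mean months active:", mean(lifetimes.values()))
print("median months active:", median(lifetimes.values()))
```

This definition ignores gaps: a relay seen in October and December counts as two months, not three. Measuring the span between first and last sighting instead would answer a slightly different question.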

## Implementation

The first thing we have to do is to build up the data structure from the historical data in the archives. We use the Stem library to parse the historical data into a Python data structure that we can work with.

In [1]:
from stem.descriptor import parse_file
from concurrent.futures import ThreadPoolExecutor

import os
import binascii
import logging

DATA_DIR = "/home/user/stem-collector-data"

logging.basicConfig(format='%(asctime)s %(message)s', level=logging.DEBUG)

Next we build our data structure that we are going to use for the computations later. The resulting structure will be a map of a `(year, month)` tuple to a set of relay fingerprints.

This part is going to read around 120 GB of data, so it takes around 6 hours on my laptop. The pickled result is around 63 MB of data.

In [2]:
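def parse_one_consensus(path):
    """Return (year, month, relays) for one monthly consensus archive."""
    result = {}

    # File names look like "consensuses-2007-10.tar". str.strip() removes a
    # set of characters rather than a prefix or suffix, so we split the name
    # into its parts instead.
    name, _extension = os.path.splitext(os.path.basename(path))
    _prefix, year, month = name.split("-")

    year = int(year)
    month = int(month)

    logging.info("Parsing: {}".format(path))

    for relay in parse_file(os.path.join(DATA_DIR, path)):
        # Only count relays that the consensus marks as running.
        if "Running" not in relay.flags:
            continue

        # Store the fingerprint as 20 raw bytes rather than 40 hex characters.
        fingerprint = binascii.unhexlify(relay.fingerprint)

        result[fingerprint] = {
            "nickname": relay.nickname,
        }

    return (year, month, result)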

files = sorted(os.listdir(DATA_DIR))
result = {}
metadata = {}

for year, month, relays in map(parse_one_consensus, files[0:4]):
    result[year, month] = set(relays.keys())
    metadata[year, month] = relays

2020-05-10 21:40:11,750 Parsing: consensuses-2007-10.tar
2020-05-10 21:40:26,927 Parsing: consensuses-2007-11.tar
2020-05-10 21:42:04,988 Parsing: consensuses-2007-12.tar
2020-05-10 21:44:13,879 Parsing: consensuses-2008-01.tar


We store the computed data on disk since parsing the files with stem takes forever.

In [3]:
import pickle

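# Use context managers so the files are flushed and closed after writing.
with open("result.pickle", "wb") as result_file:
    pickle.dump(result, result_file)

with open("metadata.pickle", "wb") as metadata_file:
    pickle.dump(metadata, metadata_file)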
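A later session can then load the pickled data back instead of parsing the archives again. The following is a minimal sketch of the round trip; it uses a temporary directory and a toy dictionary (both assumptions made for illustration) so that it runs without the real data.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real (year, month) -> fingerprint-set mapping.
toy_result = {
    (2007, 10): {b"\x01", b"\x02"},
    (2007, 11): {b"\x02"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.pickle")

    # Write the mapping to disk.
    with open(path, "wb") as handle:
        pickle.dump(toy_result, handle)

    # Read it back, as a later session would.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_result)
```

Only unpickle files that you created yourself, because loading a pickle can execute arbitrary code.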

We convert the data we have collected into a pandas DataFrame using various set operations:

In [4]:
import datetime
import pandas as pd

pd.set_option("display.max_rows", 500)

all_seen_relays = set()

last_month = set()

output = {
    "date": [],
    "relay_count_sum": [],
    "relay_count": [],
    "relay_new_count": [],
    "relay_loss_count": [],
    "relay_return_count": [],
}

for key in sorted(result.keys()):
    current_month = result[key]
    year, month = key
    
    # Stats.
    relay_count = len(current_month)
    relay_new_count = len(current_month - all_seen_relays)
    relay_loss_count = len(last_month - current_month)
    
    relay_return_count = len((current_month - last_month) & all_seen_relays)
    
    # The sum of relays.
    all_seen_relays.update(current_month)
    relay_count_sum = len(all_seen_relays)
        
    output["date"].append(datetime.datetime(year, month, 1))
    output["relay_count"].append(relay_count)
    output["relay_count_sum"].append(relay_count_sum)
    output["relay_new_count"].append(relay_new_count)
    output["relay_loss_count"].append(relay_loss_count)
    output["relay_return_count"].append(relay_return_count)

    last_month = current_month
    
pd.DataFrame.from_dict(output)

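|   | date | relay_count_sum | relay_count | relay_new_count | relay_loss_count | relay_return_count |
|---|------|-----------------|-------------|-----------------|------------------|--------------------|
| 0 | 2007-10-01 | 3239 | 3239 | 3239 | 0 | 0 |
| 1 | 2007-11-01 | 9273 | 8487 | 6034 | 786 | 0 |
| 2 | 2007-12-01 | 14650 | 8866 | 5377 | 5035 | 37 |
| 3 | 2008-01-01 | 20364 | 9462 | 5714 | 5313 | 195 |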


The columns above have the following meanings:

- *date*: The first day of the given month.
- *relay_count_sum*: The cumulative number of unique relays we have seen in the consensus up to and including the given month.
- *relay_count*: The number of unique relays we have seen in the consensus in the given month.
- *relay_new_count*: The number of relays in the given month that we have never seen before.
- *relay_return_count*: The number of relays that we have seen before, that were absent last month, and that have now returned.
- *relay_loss_count*: The number of relays that we saw last month, but that are gone in the current month.
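The counts can be turned into a rate by dividing the losses by the size of the previous month's relay set. This is one possible definition of a monthly churn rate, not one the analysis above fixes. The sketch below uses the four rows from the table above; the names `rows` and `churn` are illustrative. It also checks that each month's relay count equals last month's count minus the losses plus the new and returning relays.

```python
# Monthly figures copied from the table above:
# (month, relay_count, relay_new_count, relay_loss_count, relay_return_count)
rows = [
    ("2007-10", 3239, 3239, 0, 0),
    ("2007-11", 8487, 6034, 786, 0),
    ("2007-12", 8866, 5377, 5035, 37),
    ("2008-01", 9462, 5714, 5313, 195),
]

churn = {}

# Pair every month with the month before it.
for previous, current in zip(rows, rows[1:]):
    previous_count = previous[1]
    month, count, new, loss, returned = current

    # Consistency check: the columns must add up to the current count.
    assert count == previous_count - loss + new + returned

    # Assumed definition: share of last month's relays that are gone now.
    churn[month] = loss / previous_count

for month, rate in churn.items():
    print("{}: {:.1%} of last month's relays left".format(month, rate))
```

The first month has no churn rate because there is no earlier month to compare against.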