# Relay Churn Rate

This notebook explores the churn rate of relays in the Tor network.

First, we import Stem's `parse_file` function, which lets us easily parse the various documents stored by CollecTor.

In [44]:
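from stem.descriptor import parse_file

import binascii
import datetime  # used for progress timestamps and the DataFrame dates
import os

import pandas as pd  # used to build the result DataFrame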

We define a number of constants that we will use throughout the notebook:

In [45]:
DATA_DIR = "/home/user/stem-collector-data"

Next, we build the data structure used in the later computations. The result maps a `(year, month)` tuple to a dictionary whose `"relays"` entry is the set of fingerprints of relays seen running in that month.
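
The `(year, month)` key is derived from each archive's filename. The following is a minimal sketch, assuming filenames of the form `consensus-YYYY-MM.tar` (inferred from the string handling in the loop below, not confirmed by the data itself). `parse_year_month` is a hypothetical helper name and not part of Stem. One thing to keep in mind is that `str.strip` treats its argument as a set of characters rather than a literal prefix or suffix, so splitting on the separators is the more predictable way to extract the fields.

```python
def parse_year_month(filename):
    """Return (year, month) as integers from a name like 'consensus-2007-10.tar'.

    The filename layout is an assumption made for this sketch.
    """
    # Keep only the part before the first dot, e.g. 'consensus-2007-10'.
    base = filename.split(".")[0]
    # The last two dash-separated fields are the year and the month.
    year, month = base.split("-")[-2:]
    return int(year), int(month)


key = parse_year_month("consensus-2007-10.tar")
print(key)

# str.strip() removes characters, not substrings. With a hypothetical
# '.tar.xz' name the suffix survives, because 'z' is not in the set '.tar'.
leftover = "consensus-2007-10.tar.xz".strip("consensus-").strip(".tar")
print(leftover)
```

Because the helper returns integers, the resulting keys sort chronologically when passed through `sorted()`.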

This step reads around 120 GB of data and takes about 6 hours on my laptop. The pickled result is around 63 MB.

In [70]:
data = {}

for file in sorted(os.listdir(DATA_DIR)):
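    # Take the last two dash-separated fields of the name before the first dot;
    # str.strip() removes a set of characters, not a literal prefix or suffix.
    year, month = file.split(".")[0].split("-")[-2:]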
    year = int(year)
    month = int(month)
    print("{}: Reading {}/{}".format(datetime.datetime.now(), year, month))
    relays = set()
    
    for relay in parse_file("{}/{}".format(DATA_DIR, file)):
        fingerprint = relay.fingerprint

        if "Running" not in relay.flags:
            continue
        
        relays.add(fingerprint)
        
    data[(year, month)] = {
        "relays": relays,
    }

2020-02-08 00:32:16.867441: Reading 2007/10
2020-02-08 00:32:28.174948: Reading 2007/11
2020-02-08 00:33:41.539700: Reading 2007/12
2020-02-08 00:35:33.150632: Reading 2008/1
2020-02-08 00:37:55.426752: Reading 2008/2
2020-02-08 00:40:06.913897: Reading 2008/3
2020-02-08 00:42:02.437945: Reading 2008/4
2020-02-08 00:43:40.791636: Reading 2008/5
2020-02-08 00:45:29.378677: Reading 2008/6
2020-02-08 00:47:22.598613: Reading 2008/7
2020-02-08 00:49:15.655111: Reading 2008/8
2020-02-08 00:50:55.059184: Reading 2008/9
2020-02-08 00:52:20.135707: Reading 2008/10
2020-02-08 00:53:37.363678: Reading 2008/11
2020-02-08 00:54:43.682661: Reading 2008/12
2020-02-08 00:55:51.939733: Reading 2009/1
2020-02-08 00:57:02.572578: Reading 2009/2
2020-02-08 00:58:07.567234: Reading 2009/3
2020-02-08 00:59:27.046225: Reading 2009/4
2020-02-08 01:00:52.746923: Reading 2009/5
2020-02-08 01:02:24.660058: Reading 2009/6
2020-02-08 01:04:04.171033: Reading 2009/7
2020-02-08 01:05:57.002750: Reading 2009/8
...
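
Since the parsing step is slow, it is worth caching `data` on disk once it has been built. The following is a minimal sketch of a pickle round trip. It uses a small stand-in for `data`, and the cache path is a hypothetical choice: the notebook does not say where its pickled result is stored.

```python
import os
import pickle
import tempfile

# Toy stand-in with the same shape as the real `data` mapping:
# (year, month) -> {"relays": set of fingerprints}.
data = {
    (2007, 10): {"relays": {"A" * 40, "B" * 40}},
    (2007, 11): {"relays": {"B" * 40, "C" * 40}},
}

# Hypothetical cache location, chosen only for this example.
cache_path = os.path.join(tempfile.gettempdir(), "relay-churn-data.pickle")

# Write the structure once after the long parsing run.
with open(cache_path, "wb") as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Later sessions can load it instead of re-reading the archives.
with open(cache_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == data)
```

Tuples and sets survive pickling unchanged, so the restored object can be used by the later cells in place of a freshly built `data`. Pickle files should only be loaded from a trusted source.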

We convert the data we have collected into a pandas DataFrame using various set operations:

In [93]:
all_seen_relays = set()

last_month = set()

result = {
    "date": [],
    "relay_count_sum": [],
    "relay_count": [],
    "relay_new_count": [],
    "relay_return_count": [],
    "relay_lost_count": [],
}

for key in sorted(data.keys()):
    for kind in sorted(data[key].keys()):
        current_month = data[key][kind]
        year, month = key
    
        number_of_relays_seen = len(current_month)
        relay_new_count = len(current_month - all_seen_relays)
        
        # A returning relay is a relay we have seen before, but not last month.
        relay_return_count = len((current_month - last_month) & all_seen_relays)
        
        # A lost relay is one that was there last month, but not in current month.
        relay_lost_count = len(last_month - current_month)
        
        all_seen_relays.update(current_month)
        total_number_of_relays_seen = len(all_seen_relays)
    
        result["date"].append(datetime.datetime(year, month, 1))
        result["relay_count"].append(number_of_relays_seen)
        result["relay_count_sum"].append(total_number_of_relays_seen)
        result["relay_new_count"].append(relay_new_count)
        result["relay_return_count"].append(relay_return_count)
        result["relay_lost_count"].append(relay_lost_count)

        last_month = current_month
    
frame = pd.DataFrame.from_dict(result)
frame

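|     | date       | relay_count_sum | relay_count | relay_new_count | relay_return_count | relay_lost_count |
|-----|------------|-----------------|-------------|-----------------|--------------------|------------------|
| 0   | 2007-10-01 | 3239            | 3239        | 3239            | 0                  | 0                |
| 1   | 2007-11-01 | 9273            | 8487        | 6034            | 0                  | 786              |
| 2   | 2007-12-01 | 14650           | 8866        | 5377            | 37                 | 5035             |
| 3   | 2008-01-01 | 20364           | 9462        | 5714            | 195                | 5313             |
| 4   | 2008-02-01 | 24337           | 7803        | 3973            | 243                | 5875             |
| ... | ...        | ...             | ...         | ...             | ...                | ...              |
| 144 | 2019-10-01 | 504985          | 10271       | 1912            | 334                | 1998             |
| 145 | 2019-11-01 | 506766          | 9705        | 1781            | 290                | 2637             |
| 146 | 2019-12-01 | 508676          | 9991        | 1910            | 355                | 1979             |
| 147 | 2020-01-01 | 510592          | 10085       | 1916            | 309                | 2131             |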


The columns above have the following meanings:

- *date*: The first day of the given month.
- *relay_count_sum*: The cumulative number of unique relays we have seen in the consensus up to and including the given month.
- *relay_count*: The number of unique relays we have seen in the consensus in the given month.
- *relay_new_count*: The number of relays in the given month that we have never seen before.
- *relay_return_count*: The number of relays that we have seen before, that were absent last month, and that have now returned.
- *relay_lost_count*: The number of relays that we saw last month but that are gone in the given month.
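
To make these definitions concrete, here is a small sketch that applies the same set operations to three made-up months of single-letter "fingerprints". `classify_month` is a hypothetical helper written for this illustration. It mirrors the expressions used in the loop above but is not part of the notebook's code.

```python
def classify_month(current, last_month, all_seen):
    """Count new, returning and lost relays for one month using set operations."""
    return {
        "relay_count": len(current),
        # Never seen in any earlier month.
        "relay_new_count": len(current - all_seen),
        # Seen in some earlier month, but not in the previous one.
        "relay_return_count": len((current - last_month) & all_seen),
        # Present in the previous month, absent now.
        "relay_lost_count": len(last_month - current),
    }


# Made-up data: A disappears in month 2 and returns in month 3.
months = [
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"A", "C", "E"},
]

all_seen = set()
last_month = set()
rows = []

for current in months:
    row = classify_month(current, last_month, all_seen)
    all_seen |= current
    row["relay_count_sum"] = len(all_seen)
    rows.append(row)
    last_month = current

for row in rows:
    print(row)

# Each month's count equals the previous count plus arrivals minus losses.
for previous, row in zip(rows, rows[1:]):
    expected = (previous["relay_count"] + row["relay_new_count"]
                + row["relay_return_count"] - row["relay_lost_count"])
    assert row["relay_count"] == expected
```

The final loop checks a bookkeeping identity that follows from the definitions: `relay_count` for a month equals the previous month's `relay_count` plus `relay_new_count` plus `relay_return_count` minus `relay_lost_count`. The rows shown in the table above satisfy it as well, for example 3239 + 6034 + 0 - 786 = 8487 for 2007-11.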