# How much data is accessible over the BitTorrent network?

In [13]:
library(readr)
library(dplyr)
library(plotly)
options(warn=0)

We use a dataset assembled by [Fabio Hecht, Thomas Bocek, and David Hausheer](http://www.csg.uzh.ch/publications/data/piratebay/) at the University of Zürich to do a back-of-the-envelope calculation of how much data could potentially be stored securely on a P2P network. The dataset is an imperfect choice, but it does highlight an approximate threshold for data security in P2P networks.

For reference, the total size of the Internet Archive stood at 15 PB as of October 2016.

In [14]:
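# Download the zipped dataset to a temporary file, read the CSV inside it, then clean up
temp <- tempfile()
download.file("http://www.csg.uzh.ch/csg/dam/jcr:00000000-6205-a81d-ffff-ffff8a32c836/20081205_thepiratebay.zip", temp, mode = "wb")  # binary mode keeps the zip intact on Windows
tpb <- read_csv(unz(temp, "20081205_thepiratebay.csv"))
unlink(temp)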

Parsed with column specification:
cols(
  idtorrent = col_integer(),
  category = col_integer(),
  size = col_double(),
  seeders = col_integer(),
  leechers = col_integer()
)


In [16]:
p <- tpb %>%
  group_by(seeders) %>%
  summarise(sum = sum(size) / 1024^5) %>%   # total size per seeder count, in PB
  arrange(desc(seeders)) %>%                # most-seeded torrents first
  mutate(cumsum = cumsum(sum)) %>%          # running total: data with at least this many seeders
  plot_ly(x = ~seeders, y = ~cumsum, type = "bar") %>%
  layout(title = "Total data seeded, by number of seeders - 2008-2012",
         xaxis = list(title = "Number of seeders", type = "log"),
         yaxis = list(title = "Total data available (PB)"))

embed_notebook(p)

The graph above shows the cumulative amount of data mirrored on the BitTorrent network, by number of seeders: each bar gives the total size of all torrents with at least that many seeders. The more seeders a torrent has, the more confident we can be that its data will remain accessible over a long period.
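
The plotting code sorts by descending seeder count before taking the cumulative sum, so each value is the total for torrents with at least that many seeders. Here is a minimal sketch of that step on made-up numbers, written in base R rather than dplyr. The `toy` data frame and its values are hypothetical and not drawn from the dataset.

```r
# Hypothetical per-seeder-count totals (in PB); illustrative values only
toy <- data.frame(
  seeders = c(1, 10, 100),
  pb      = c(5, 2, 1)
)

# Sort with the most-seeded rows first, then accumulate
toy <- toy[order(-toy$seeders), ]
toy$cumulative_pb <- cumsum(toy$pb)

# Each row now holds the data available with at least `seeders` seeders:
# 100 seeders -> 1, 10 seeders -> 3, 1 seeder -> 8
print(toy)
```

Sorting in ascending order instead would give the data available with at most that many seeders, which is not the quantity of interest here.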

This graph highlights the difficulty of storing large amounts of data for long periods on a P2P network. If we assume that data is safe once it has at least 10 seeders, then the BitTorrent network hosts about 0.1 PB of data safely. If we raise the threshold to 100 seeders, only 6.2 TB of data is secured on the network.
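
Threshold figures of this kind can also be read off directly rather than from the chart. The sketch below defines a hypothetical helper, `data_at_threshold`, which is not part of the original notebook, and runs it on toy data. It assumes `size` is in bytes, as the `1024^5` divisor in the plotting code implies.

```r
# Hypothetical helper: total size, in PB, of torrents with at least `min_seeders` seeders.
# Assumes a data frame with `size` (bytes) and `seeders` columns, like `tpb`.
data_at_threshold <- function(df, min_seeders) {
  sum(df$size[df$seeders >= min_seeders], na.rm = TRUE) / 1024^5
}

# Toy data, not taken from the dataset: three torrents of 1, 2 and 4 PB
toy_torrents <- data.frame(
  size    = c(1, 2, 4) * 1024^5,
  seeders = c(5, 10, 100)
)

data_at_threshold(toy_torrents, 10)   # 2 + 4 = 6
data_at_threshold(toy_torrents, 100)  # 4
```

With the real data loaded, `data_at_threshold(tpb, 10)` and `data_at_threshold(tpb, 100)` can be compared against the figures quoted above.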