# Binder Analytics
Analysing launches of this project in [MyBinder](https://mybinder.org).

In [26]:
%matplotlib inline
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed, wait

from ipypb import irange, track
from tqdm.auto import tqdm, trange

In [28]:
import ipypb
from IPython.display import ProgressBar

## Get the source data
### Download the archive index

The `index.jsonl` file lists every date for which an event archive is available. Each line has the following fields:

- date :: The UTC date the event archive covers.
- name :: The name of the file containing the events. This is a relative path; since the `index.jsonl` file came from https://archive.analytics.mybinder.org, that is the base URL it resolves against. For example, when name is events-2018-11-05.jsonl, the full URL to the file is https://archive.analytics.mybinder.org/events-2018-11-05.jsonl.
- count :: The total number of events in the file.
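
As a small illustration of the fields above, the sketch below parses one index line and resolves its `name` against the base URL. The sample line, the `BASE_URL` constant, and the `archive_url` helper are assumptions made for this example; the exact JSON layout of the real index may differ.

```python
import json
from urllib.parse import urljoin

# Base URL the index was downloaded from (trailing slash so urljoin appends the name).
BASE_URL = "https://archive.analytics.mybinder.org/"

# Hypothetical index line, shaped after the three fields described above.
sample_line = '{"name": "events-2018-11-05.jsonl", "date": "2018-11-05", "count": 100}'


def archive_url(name: str) -> str:
    """Resolve a relative archive name from the index against the base URL."""
    return urljoin(BASE_URL, name)


entry = json.loads(sample_line)
full_url = archive_url(entry["name"])
print(full_url)
```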


In [3]:
start_date = pd.Timestamp("2021-08-08")

In [4]:
%time index = pd.read_json("https://archive.analytics.mybinder.org/index.jsonl", lines=True)

CPU times: user 30.8 ms, sys: 4.51 ms, total: 35.3 ms
Wall time: 397 ms


Only keep index entries from our `start_date` to the present day, and reindex so it starts at 0.

In [5]:
index = index[index.date >= start_date].reset_index(drop=True)
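
To make the filter-then-reindex step concrete, here is a minimal sketch on a made-up frame. The `toy` data and the `cutoff` name are assumptions for illustration only; they are not taken from the real index.

```python
import pandas as pd

# Hypothetical stand-in for the index: four consecutive days.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-08-06", "2021-08-07", "2021-08-08", "2021-08-09"]),
        "count": [10, 20, 30, 40],
    }
)
cutoff = pd.Timestamp("2021-08-08")

# The boolean mask keeps rows on or after the cutoff; reset_index(drop=True)
# discards the old labels (2, 3) and renumbers the rows from 0.
kept = toy[toy["date"] >= cutoff].reset_index(drop=True)
print(kept)
```

Without `drop=True`, the old labels would be kept as an extra `index` column rather than discarded.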

In [6]:
pd.set_option("display.max_rows", 4)
index

| | name | date | count |
|---|---|---|---|
| 0 | events-2021-08-08.jsonl | 2021-08-08 | 13331 |
| 1 | events-2021-08-09.jsonl | 2021-08-09 | 23388 |
| ... | ... | ... | ... |
| 10 | events-2021-08-18.jsonl | 2021-08-18 | 25364 |
| 11 | events-2021-08-19.jsonl | 2021-08-19 | 24305 |


### Download the event archives

Get event archives for all the days since the first version of this repository was created. The progress display should work as follows:

1. The main progress bar has a total of `len(index)` and advances by one for each completed archive.
1. Each archive gets its own sub-progress bar, added as its download starts and kept in date order.
1. Each sub-progress bar has the archive's `count` as its total and advances by the length of each chunk read.
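
The bookkeeping behind this list can be sketched with the standard library alone. The `Progress` class below is a hypothetical stand-in for a real progress bar, and the archive labels and counts are made up; the point is only how `submit()`, `add_done_callback()`, and per-chunk updates fit together.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class Progress:
    """Thread-safe counter standing in for a progress bar (hypothetical helper)."""

    def __init__(self, total, desc):
        self.total = total
        self.desc = desc
        self.n = 0
        self._lock = threading.Lock()

    def update(self, step=1):
        with self._lock:
            self.n += step


# Made-up archives: (label, number of events).
archives = [("day-1", 1000), ("day-2", 600), ("day-3", 250)]
CHUNK = 250

main_bar = Progress(len(archives), "Archives")
# Creating the sub-bars up front, in index order, keeps them ordered
# regardless of which worker thread starts first.
sub_bars = [Progress(count, label) for label, count in archives]


def process(bar):
    """Simulate reading an archive chunk by chunk, advancing its sub-bar."""
    remaining = bar.total
    while remaining > 0:
        step = min(CHUNK, remaining)
        bar.update(step)
        remaining -= step
    return bar.desc


with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns one Future per task, which can carry a completion callback.
    futures = [executor.submit(process, bar) for bar in sub_bars]
    for future in futures:
        future.add_done_callback(lambda fut: main_bar.update())

done, not_done = wait(futures)
print(f"{main_bar.n}/{main_bar.total} archives completed")
```

Note that `executor.map()` yields results rather than `Future` objects, so callbacks can only be attached when tasks are scheduled with `submit()`.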

In [7]:
frames = []

In [29]:
pb = ProgressBar(12)

In [38]:
pb.display()

In [37]:
pb._progress = 1

In [22]:
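def get_events(archive):
    """Download one day's archive in chunks, appending each chunk to `frames`."""
    desc = str(archive["date"].date())  # label the bar with the archive's date
    total = archive["count"]  # number of events expected in this archive
    url = f"https://archive.analytics.mybinder.org/{archive['name']}"
    with tqdm(total=total, desc=desc) as pbar:
        # Read 250 lines at a time so the bar can advance once per chunk.
        with pd.read_json(url, lines=True, chunksize=250) as reader:
            for chunk in reader:
                frames.append(chunk)
                pbar.update(len(chunk))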

In [23]:
with ThreadPoolExecutor(max_workers=5) as executor:
    pbar = trange(len(index), desc="Archives")
    # submit() returns a Future per archive; executor.map() would yield results instead.
    # iterrows() yields (label, row) pairs, so pass only the row to get_events.
    futures = [executor.submit(get_events, row) for _, row in index.iterrows()]
    for f in futures:
        # Advance the main bar as each archive finishes.
        f.add_done_callback(lambda fut: pbar.update(1))

results = wait(futures)
pbar.close()
print(f"{len(results.done)} archives completed")

In [10]:
df = pd.concat(frames)

### Massage the data

Limit to records that are from my GitHub repositories, and reset the index.

In [None]:
df = df[
    (df["provider"] == "GitHub") & (df["spec"].str.startswith("philipsd6"))
].reset_index(drop=True)
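
The combined mask can be checked on a few made-up rows. The `events` frame below is an assumption for illustration; only the column names `provider` and `spec` and the `philipsd6` prefix come from the notebook.

```python
import pandas as pd

# Hypothetical events: only the first row is a GitHub launch of a philipsd6 repo.
events = pd.DataFrame(
    {
        "provider": ["GitHub", "GitHub", "Gist", "GitLab"],
        "spec": [
            "philipsd6/example/main",
            "someone/else/main",
            "philipsd6/abc123",
            "philipsd6/example/main",
        ],
    }
)

# Both conditions must hold; each comparison yields a boolean Series,
# and & combines them element-wise.
mask = (events["provider"] == "GitHub") & events["spec"].str.startswith("philipsd6")
mine = events[mask].reset_index(drop=True)
print(mine)
```

A prefix of `"philipsd6/"` (with the trailing slash) would be stricter, since the bare prefix also matches any other owner whose name merely starts with the same characters.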

Does it look right?

In [None]:
df.sample(3)

Split the spec out into the repo/ref/commit. Often the ref is the same as the commit.

In [None]:
df["commit"] = df["ref"]
df[["repo", "ref"]] = df["spec"].str.rsplit("/", n=1, expand=True)
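
The same right-hand split can be seen on a single string with plain Python. The `split_spec` helper and the example specs are hypothetical; they only illustrate how splitting once from the right separates the ref from the repository path.

```python
def split_spec(spec: str) -> tuple[str, str]:
    """Split a 'owner/repo/ref' spec at its last slash into (repo, ref)."""
    repo, ref = spec.rsplit("/", 1)
    return repo, ref


simple = split_spec("philipsd6/example-repo/main")
# A ref that itself contains a slash is cut at the LAST slash only,
# so part of the ref ends up on the repo side.
nested = split_spec("philipsd6/example-repo/feature/x")
print(simple, nested)
```

Splitting from the right keeps the owner and repository name together, but it assumes the ref contains no slash, as the second call shows.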

Drop unneeded columns and reindex in a nicer order.

In [None]:
df = df.drop(columns=["schema", "version", "provider", "spec", "status"]).reindex(
    columns=["timestamp", "build_token", "origin", "repo", "ref", "commit"]
)

Does it look better?

In [None]:
df.sample(3)

## Analyze the data
### Total Launches

In [None]:
df.shape[0]

### Launches per day

In [None]:
daily = df.set_index("timestamp").resample("D").count()

In [None]:
daily["repo"].plot()
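
As a rough check of what the daily resample produces, the sketch below counts three made-up launches spread over three calendar days. The `launches` frame and its repo name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical launches: two on 8 August, none on 9 August, one on 10 August.
launches = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2021-08-08 01:00", "2021-08-08 05:00", "2021-08-10 12:00"]
        ),
        "repo": ["philipsd6/example"] * 3,
    }
)

# resample("D") needs a datetime index; count() then tallies the non-null
# values of every column within each calendar day.
per_day = launches.set_index("timestamp").resample("D").count()
counts = per_day["repo"].tolist()
print(per_day)
```

Days without any launch still appear in the result with a count of 0, so the plotted line drops to zero instead of skipping the gap.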